A computer-implemented method and an apparatus are provided for neural network reinforcement learning. The method includes obtaining, by a processor, an action and observation sequence. The method further includes inputting, by the processor, each of a plurality of time frames of the action and observation sequence sequentially into a plurality of input nodes of a neural network. The method also includes updating, by the processor, a plurality of parameters of the neural network by using the neural network to approximate an action-value function of the action and observation sequence.
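The following minimal Python sketch shows how the three steps might fit together. It is an illustration under stated assumptions and not the claimed implementation. It assumes a hypothetical single-layer network with one output node per action. Its input nodes hold a bias plus a sliding window of the most recent time frames, where each frame is a one-hot encoded action concatenated with an observation vector. The parameters are updated with a standard Q-learning temporal-difference rule. The class `SequenceQNetwork`, its methods, and the toy environment in the demo are all hypothetical.

```python
import random
from collections import deque


class SequenceQNetwork:
    """Hypothetical single-layer network approximating Q(history, action).

    Input nodes: one bias node plus the last `window` time frames, where each
    frame is [one-hot action | observation]. One output node per action.
    """

    def __init__(self, n_actions, obs_dim, window, lr=0.05, gamma=0.9):
        self.n_actions = n_actions
        self.obs_dim = obs_dim
        self.lr = lr
        self.gamma = gamma
        self.frame_size = n_actions + obs_dim
        self.n_inputs = 1 + window * self.frame_size  # +1 for the bias node
        # weights[a][i]: connection from input node i to the output node of action a
        self.weights = [[0.0] * self.n_inputs for _ in range(n_actions)]
        # FIFO of time frames; the oldest frame drops out when a new one arrives
        self.frames = deque(
            [[0.0] * self.frame_size for _ in range(window)], maxlen=window
        )

    def push_frame(self, action, observation):
        """Feed one time frame (action, observation) into the input window."""
        if len(observation) != self.obs_dim:
            raise ValueError("observation has the wrong dimension")
        one_hot = [1.0 if a == action else 0.0 for a in range(self.n_actions)]
        self.frames.append(one_hot + [float(v) for v in observation])

    def input_nodes(self):
        """Current input-node values: bias, then frames from oldest to newest."""
        return [1.0] + [v for frame in self.frames for v in frame]

    def q_values(self, x):
        """Approximate action values, one per output node."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]

    def update(self, x, action, reward, x_next, done=False):
        """One Q-learning step on the weights of the chosen action's output node."""
        target = reward if done else reward + self.gamma * max(self.q_values(x_next))
        td_error = target - self.q_values(x)[action]
        row = self.weights[action]
        for i, xi in enumerate(x):
            row[i] += self.lr * td_error * xi
        return td_error


def greedy(q):
    return max(range(len(q)), key=q.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    net = SequenceQNetwork(n_actions=2, obs_dim=1, window=3, lr=0.05, gamma=0.5)
    x = net.input_nodes()
    for _ in range(2000):
        q = net.q_values(x)
        # Epsilon-greedy action selection (epsilon = 0.1)
        action = rng.randrange(2) if rng.random() < 0.1 else greedy(q)
        # Hypothetical toy environment: action 1 pays 1, observation echoes the action
        reward, observation = (1.0 if action == 1 else 0.0), [float(action)]
        net.push_frame(action, observation)
        x_next = net.input_nodes()
        net.update(x, action, reward, x_next)
        x = x_next
    print("Q-values at final history:", net.q_values(x))
```

The sliding window is only one reading of "sequentially" inputting time frames. A recurrent or otherwise dynamic network that consumes one frame per step would be an equally plausible design, and a multi-layer network could replace the single linear layer without changing the update logic.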