Patent attributes
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training an action selection neural network. One of the methods includes receiving an observation characterizing a current state of an environment; determining a target network output for the observation by performing a look-ahead search of possible future states of the environment starting from the current state until the environment reaches a possible future state that satisfies one or more termination criteria, wherein the look-ahead search is guided by the action selection neural network in accordance with current values of its network parameters; selecting an action to be performed by an agent in response to the observation using the target network output generated by performing the look-ahead search; and storing, in an exploration history data store, the target network output in association with the observation for use in updating the current values of the network parameters.
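The steps recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation. Everything in it is an assumption made for illustration: the toy counter environment (`ACTIONS`, `GOAL`, `step`, `is_terminal`, `terminal_value`), the stand-in `network` function, and the simplified root-level search with sampled rollouts, which takes the place of whatever look ahead search an actual embodiment would use.

```python
import math
import random

# Hypothetical toy environment (not from the source): a counter starts at 0,
# each action adds 1 or 2, and the episode ends once the counter reaches GOAL.
ACTIONS = (1, 2)
GOAL = 5

def step(state, action):
    """Toy transition function."""
    return state + action

def is_terminal(state):
    """Termination criterion: the counter has reached or passed GOAL."""
    return state >= GOAL

def terminal_value(state):
    """+1 for landing exactly on GOAL, -1 for overshooting."""
    return 1.0 if state == GOAL else -1.0

def network(state, params):
    """Stand-in for the action selection neural network. `params` is a
    hypothetical dict of per-action weights; missing entries default to 1.0,
    which gives a uniform distribution over actions."""
    weights = [params.get(a, 1.0) for a in ACTIONS]
    total = sum(weights)
    return [w / total for w in weights]

def look_ahead_search(root, params, rng, num_simulations=200, c_puct=1.0):
    """Simplified search (an assumption): the root action is chosen by a
    prior-weighted upper-confidence rule, then a rollout guided by the
    network runs until a terminal state is reached. Returns the normalized
    root visit counts as the target network output."""
    priors = network(root, params)
    visits = [0] * len(ACTIONS)
    value_sum = [0.0] * len(ACTIONS)
    for sim in range(num_simulations):
        def score(i):
            q = value_sum[i] / visits[i] if visits[i] else 0.0
            bonus = c_puct * priors[i] * math.sqrt(sim + 1) / (1 + visits[i])
            return q + bonus
        i = max(range(len(ACTIONS)), key=score)
        state = step(root, ACTIONS[i])
        while not is_terminal(state):
            probs = network(state, params)
            state = step(state, rng.choices(ACTIONS, weights=probs)[0])
        visits[i] += 1
        value_sum[i] += terminal_value(state)
    return [n / num_simulations for n in visits]

def act_and_record(observation, params, history, rng):
    """Run the search, select an action from the target output, and store
    the (observation, target) pair for later parameter updates."""
    target = look_ahead_search(observation, params, rng)
    best = max(range(len(ACTIONS)), key=lambda i: target[i])
    history.append((observation, target))
    return ACTIONS[best]

if __name__ == "__main__":
    exploration_history = []  # plays the role of the exploration history data store
    chosen = act_and_record(0, {}, exploration_history, random.Random(0))
    print(chosen, exploration_history[0])
```

In this sketch the stored target is the distribution of root visit counts, so a later training step could fit the network output to it for the same observation. That training step is outside what the abstract describes and is not shown.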