Patent attributes
Provided is an information processing apparatus including: a reward estimator generating unit using action history data, which includes state data expressing a state of an agent, action data expressing an agent's action, and a reward value expressing a reward of the action, as learning data to generate, through machine learning, a reward estimator estimating the reward value from inputted state data and action data; an action selecting unit preferentially selecting an action not included in the action history data but with a high estimated reward value; and an action history adding unit causing the agent to perform the selected action and adding to the action history data the state data and action data for the action and the action's reward value in association with each other. The reward estimator is regenerated when a set of state data, action data, and the reward value is added to the action history data.