Patent attributes
A reinforcement learning device includes a processor that determines a first action on a control target by using a basic controller that defines an action on the control target depending on a state of the control target. The processor performs a first reinforcement learning within a first action range around the first action in order to acquire a first policy for determining an action on the control target depending on a state of the control target. The first action range is smaller than a limit action range for the control target. The processor determines a second action on the control target by using the first policy. The processor updates the first policy to a second policy by performing a second reinforcement learning within a second action range around the second action. The second action range is smaller than the limit action range.
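The staged procedure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the one-dimensional toy plant in `step`, the proportional `basic_controller`, the discrete set `OFFSETS` standing in for the action range, and tabular Q-learning as the reinforcement learning method. The abstract does not say which learning algorithm or control target is used. The sketch shows only the structure: each stage explores within a small range around the action proposed by the current controller or policy, and that range is smaller than the limit action range.

```python
import random

LIMIT = 1.0                              # assumed limit action range: [-LIMIT, LIMIT]
OFFSETS = (-0.2, -0.1, 0.0, 0.1, 0.2)    # assumed action range around the base action
N_BINS = 9                               # state discretization for the Q-table


def basic_controller(x):
    """Hypothetical basic controller: weak proportional feedback toward 0."""
    return -0.3 * x


def step(x, a):
    """Toy plant (an assumption): the action shifts the state; reward is -|x|."""
    x_next = max(-2.0, min(2.0, x + a))
    return x_next, -abs(x_next)


def bin_of(x):
    """Map a state in [-2, 2] to a Q-table row index."""
    return min(N_BINS - 1, int((x + 2.0) / 4.0 * N_BINS))


def clip(a):
    """Keep an action inside the limit action range."""
    return max(-LIMIT, min(LIMIT, a))


def learn_around(base_policy, episodes=300, horizon=20,
                 eps=0.2, lr=0.2, gamma=0.9, seed=0):
    """Q-learning over offsets added to base_policy's action.

    Returns the improved policy and the (state, action) pairs tried,
    so the caller can check that exploration stayed in range.
    """
    rng = random.Random(seed)
    q = [[0.0] * len(OFFSETS) for _ in range(N_BINS)]
    tried = []
    for _ in range(episodes):
        x = rng.uniform(-2.0, 2.0)
        for _ in range(horizon):
            s = bin_of(x)
            if rng.random() < eps:       # explore within the action range
                i = rng.randrange(len(OFFSETS))
            else:                        # exploit the current estimate
                i = max(range(len(OFFSETS)), key=lambda j: q[s][j])
            a = clip(base_policy(x) + OFFSETS[i])
            tried.append((x, a))
            x, r = step(x, a)
            q[s][i] += lr * (r + gamma * max(q[bin_of(x)]) - q[s][i])

    def new_policy(x):
        """Base action plus the best learned offset for this state."""
        s = bin_of(x)
        i = max(range(len(OFFSETS)), key=lambda j: q[s][j])
        return clip(base_policy(x) + OFFSETS[i])

    return new_policy, tried


# First reinforcement learning: explore around the basic controller's action.
policy1, tried1 = learn_around(basic_controller, seed=0)
# Second reinforcement learning: explore around the first policy's action.
policy2, tried2 = learn_around(policy1, seed=1)
```

In this sketch every explored action differs from the current base action by at most 0.2, so a single stage can never jump to an arbitrary action in the limit range. Repeating the stage lets the policy drift further from the basic controller while each individual step stays small.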