Reinforcement learning (RL) is a subset of machine learning in which an AI-driven system, or agent, learns through trial and error using positive and negative feedback. RL rewards desired behaviors and punishes undesired ones: it assigns positive values to desired actions to encourage the agent to take them, and negative values to undesired actions to discourage them. This trains the agent to seek the maximum long-term reward through trial and error and to converge on an optimal solution for a given scenario. RL is a powerful approach to machine learning because it allows an agent to learn, without supervision, the actions that lead to eventual success in an unseen environment.
Optimal behavior is learned through feedback from the reward function as the agent interacts with its environment. Without supervision, the agent must independently discover the sequence of actions that maximizes the reward, a discovery process akin to a trial-and-error search. To ensure the maximum long-term reward, the quality of an action shouldn't be measured by its immediate reward alone but should also account for delayed rewards. RL is sometimes described in relation to how children learn as they explore the world around them and discover how to interact with it to achieve a goal.
RL has found use as a way of directing otherwise unsupervised machine learning through positive and negative reinforcement. It is one of several approaches to training machine learning systems, enabling agents to learn from their environment and optimize their behavior. Use cases include gaming, resource management, personalized recommendations, and robotics. Despite its benefits, RL can be difficult to deploy because it relies on exploring an environment: if the environment changes frequently, it is hard for the agent to consistently take the best actions. RL can also require significant time and computing resources, whereas supervised learning can deliver faster results if enough properly structured data is available.
RL has similarities to other forms of machine learning. As in supervised learning, developers must define the reward function and the goal to be achieved, so RL requires more explicit programming than unsupervised learning. But once these parameters are defined, the agent operates on its own, making it more self-directed than supervised learning. This has led some to describe RL as a branch of semisupervised learning.
The main components of an RL system include the following:
- Agent or learner
- Environment the agent is interacting with
- Policy the agent follows when taking actions
- Reward signal the agent receives upon taking actions
The agent explores the environment to achieve a goal. RL rests on the hypothesis that every goal can be defined as the maximization of an expected cumulative reward. The agent learns to sense and perturb the environment through its actions and to find the actions that yield the maximal reward, as defined by the reward signal. This signal captures the immediate benefit of the new state as well as the cumulative reward expected to be collected from that state onward. The formal framework for RL borrows from the theory of Markov decision processes (MDPs). RL algorithms attempt to find the policy that maximizes the average value the agent can extract from every state of the system.
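As a minimal sketch of this agent-environment loop, the Python snippet below pairs an agent with a toy five-cell corridor environment. The environment, the random action choice, and all names are invented for illustration and are not taken from any particular RL library; a learning agent would replace the random choice with its policy.

```python
import random

class ToyEnvironment:
    """A tiny episodic environment: the agent walks a line of five cells
    and receives a reward of +1 only when it reaches the rightmost cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        reward = 1.0 if self.state == self.length - 1 else 0.0
        done = self.state == self.length - 1
        return self.state, reward, done

env = ToyEnvironment()
state, done, episode_return = env.reset(), False, 0.0
for _ in range(1000):                        # cap the episode length
    action = random.choice([0, 1])           # a learning agent would follow its policy here
    state, reward, done = env.step(action)   # the environment returns the reward signal
    episode_return += reward
    if done:
        break
print("cumulative reward for this episode:", episode_return)
```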
There are three main types of RL implementations:
- Model-based
- Value-based
- Policy-based
At a higher level, RL algorithms can be divided into model-based and model-free approaches. Model-based RL creates a virtual model of the environment, and the agent learns to perform actions within it based on the defined constraints of the model. Model-free algorithms do not build an explicit model of the environment or, more specifically, of the MDP. Instead, they behave like a trial-and-error procedure, running experiments in the environment through their actions and deriving the optimal policy directly from those interactions. Model-free algorithms are either value-based (maximizing an estimated value function) or policy-based (applying a policy, or deterministic strategy, to maximize cumulative reward).
Model-based RL algorithms build a model of the environment. To do this, the agent samples states, takes actions, and observes the rewards. For every state and possible action, it then predicts the expected reward and the expected next state. Calculating the expected reward is a regression problem, and predicting the expected next state is a density estimation problem. Given a model of the environment, the agent can plan its actions without directly interacting with the environment.
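The sketch below illustrates this idea under simplifying assumptions: it samples transitions from a hypothetical five-state corridor task, stores them as an empirical model, and then plans with value iteration using the model alone. The task, sample size, and hyperparameters are invented for illustration.

```python
import random
from collections import defaultdict

# Hypothetical corridor task: 5 states, actions 0 (left) / 1 (right), +1 for reaching state 4.
def true_step(s, a):
    s2 = max(0, min(4, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

n_states, n_actions, gamma = 5, 2, 0.9

# 1. Sample transitions from the environment to build an empirical model of dynamics and rewards.
samples = defaultdict(list)
for _ in range(5000):
    s, a = random.randrange(n_states), random.randrange(n_actions)
    samples[(s, a)].append(true_step(s, a))

# 2. Plan with value iteration on the learned model, never touching the real environment again.
V = [0.0] * n_states
for _ in range(200):
    V = [max(sum(r + gamma * V[s2] * (not done) for s2, r, done in samples[(s, a)])
             / len(samples[(s, a)])
             for a in range(n_actions))
         for s in range(n_states)]
print("planned state values:", [round(v, 2) for v in V])
```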
Value-based algorithms find the optimal policy for selecting actions by accurately estimating the value function of every state, using the recursive relationship described by the Bellman equation. The agent interacts with the environment and samples trajectories of states and rewards. Given enough trajectories, the value function of the MDP can be estimated. Once the value function is known, the optimal policy follows by acting to maximize it at every state in the environment. Popular value-based algorithms are state–action–reward–state–action (SARSA) and Q-learning.
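The following is a minimal sketch of tabular Q-learning on the same kind of hypothetical corridor task; the environment, exploration scheme, and hyperparameters are illustrative assumptions rather than a definitive implementation.

```python
import random

# Hypothetical corridor task: 5 states, actions 0 (left) / 1 (right), +1 for reaching state 4.
def step(s, a):
    s2 = max(0, min(4, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.2       # learning rate, discount factor, exploration rate
Q = [[0.0] * n_actions for _ in range(n_states)]

for episode in range(2000):
    s = random.randrange(n_states - 1)      # start in a random non-terminal state so all states are visited
    for _ in range(50):                     # cap the episode length
        # Epsilon-greedy action selection: mostly exploit the current value estimates.
        a = random.randrange(n_actions) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a').
        Q[s][a] += alpha * (r + (0.0 if done else gamma * max(Q[s2])) - Q[s][a])
        s = s2
        if done:
            break

print("greedy action in each state:", [row.index(max(row)) for row in Q])
```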
Policy-based algorithms directly estimate the optimal policy without modeling the value function. They parametrize the policy with learnable weights, turning the learning problem into an explicit optimization problem. As with value-based algorithms, the agent samples trajectories of states and rewards, but this information is used to improve the policy directly by maximizing the average value function across all states. Policy-based approaches can suffer from high variance, which manifests as instability during training. Value-based approaches, although more stable, are not well suited to modeling continuous action spaces. Popular policy-based RL algorithms include Monte Carlo policy gradient (REINFORCE) and deterministic policy gradient (DPG). A popular algorithm that combines the two approaches is the actor-critic algorithm, in which both the policy (the actor) and the value function (the critic) are parametrized, enabling effective use of training data with stable convergence.
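Below is a minimal sketch of REINFORCE with a softmax policy on the same hypothetical corridor task; the parametrization and hyperparameters are illustrative assumptions.

```python
import math, random

# Hypothetical corridor task: 5 states, actions 0 (left) / 1 (right), +1 for reaching state 4.
def step(s, a):
    s2 = max(0, min(4, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

n_states, n_actions, gamma, lr = 5, 2, 0.9, 0.1
theta = [[0.0] * n_actions for _ in range(n_states)]    # learnable policy weights

def policy(s):
    """Softmax over the action preferences for state s."""
    exps = [math.exp(t) for t in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

for episode in range(3000):
    # 1. Sample one trajectory by following the current stochastic policy.
    s, trajectory = 0, []
    for _ in range(50):
        a = random.choices(range(n_actions), weights=policy(s))[0]
        s2, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s2
        if done:
            break
    # 2. Compute the return G at each step and push up the log-probability of the action taken.
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G
        probs = policy(s)
        for b in range(n_actions):
            grad = (1.0 if b == a else 0.0) - probs[b]   # d log pi(a|s) / d theta[s][b]
            theta[s][b] += lr * G * grad

print("preferred action per state:", [p.index(max(p)) for p in map(policy, range(n_states))])
```

In practice, a baseline is usually subtracted from the return to reduce the variance mentioned above.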
Reinforcement learning from human feedback (RLHF) is a related machine learning (ML) technique that incorporates human feedback into the reward function to make self-learning more efficient and to improve model performance by aligning it more closely with human goals. RLHF is now widely used in generative AI applications, including large language models.
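As a hedged sketch of the core idea rather than a full RLHF pipeline, the snippet below fits a tiny linear reward model to pairwise human preference labels by gradient ascent on a logistic (Bradley-Terry style) objective, which is the form of preference modeling commonly described for RLHF; the feature vectors, data, and model are invented for illustration.

```python
import math

# Toy pairwise-preference data: each item is (features of the human-preferred response,
# features of the rejected response). Features and labels are invented for illustration.
preference_pairs = [
    ([1.0, 0.2], [0.1, 0.9]),
    ([0.8, 0.1], [0.3, 0.7]),
    ([0.9, 0.3], [0.2, 0.8]),
]

w = [0.0, 0.0]            # weights of a linear reward model r(x) = w . x
lr = 0.5

def reward(x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Maximize log sigmoid(r(preferred) - r(rejected)) by gradient ascent,
# i.e., make the model score human-preferred responses higher than rejected ones.
for _ in range(200):
    for chosen, rejected in preference_pairs:
        margin = reward(chosen) - reward(rejected)
        grad_scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))   # equals sigmoid(-margin)
        for i in range(len(w)):
            w[i] += lr * grad_scale * (chosen[i] - rejected[i])

print("learned reward weights:", [round(wi, 2) for wi in w])
print("scores preferred higher?", all(reward(c) > reward(r) for c, r in preference_pairs))
```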
RL is the outcome of two main research threads that developed largely independently: learning by trial and error, which originated in the psychology of animal learning, and the problem of optimal control and its solution using dynamic programming and value functions. A third, less distinct thread that contributed to the development of RL concerned temporal-difference methods. These threads came together in the 1980s to produce the modern field of reinforcement learning.
The phrase optimal control came into use in the late 1950s to describe the problem of designing a controller to minimize a measure of a dynamic system's behavior over time. An approach was developed in the mid-1950s by Richard Bellman and others by extending a nineteenth-century theory of Hamilton and Jacobi. It uses a value function, or optimal return function, to define what is now called the Bellman equation. The class of methods for solving optimal control problems this way came to be known as dynamic programming. Bellman introduced the stochastic version of the optimal control problem, known as Markovian decision processes (MDPs), in 1957, and Ronald Howard devised the policy iteration method for MDPs in 1960. Dynamic programming is generally considered the only feasible way of solving general stochastic optimal control problems and has been extensively developed since the late 1950s.
The term reinforcement in the context of animal learning came into use in the 1927 English translation of Pavlov's monograph on conditioned reflexes, where it referred to the strengthening of a pattern of behavior as a result of an animal receiving a stimulus, or reinforcer. The idea of implementing trial-and-error learning in a computer, rather than studying it in animals, first appeared in a 1948 report by Alan Turing, who wrote:
When a configuration is reached for which the action is undetermined, a random choice for the missing data is made and the appropriate entry is made in the description, tentatively, and is applied. When a pain stimulus occurs all tentative entries are cancelled, and when a pleasure stimulus occurs they are all made permanent.
Trial-and-error learning was carried into the machine learning analog by Minsky, who in 1954 postulated the use of SNARCs (Stochastic Neural-Analogue Reinforcement Calculators). Computational trial-and-error processes were generalized to pattern recognition by Clark and Farley in their 1954 and 1955 papers. In the early 1960s, this work was adapted to supervised learning by Rosenblatt and by Widrow and Hoff, who used error information to update connection weights.
Building on ideas from Harry Klopf, Richard Sutton developed links to animal learning theory in the late 1970s and 1980s and explored rules by which learning is driven by changes in temporally successive predictions. Sutton's work inspired a large amount of research in RL. The disparate threads were united in 1989 when Chris Watkins developed Q-learning, an RL algorithm that learns the value of the best action available in each state.
Tesauro applied these concepts in 1992 to the development of TD-Gammon, a program that achieved "Master Level" play in backgammon. Researchers also applied such techniques to chess and Go: IBM's Deep Blue defeated the reigning world chess champion in 1997, and Google's AlphaGo beat a Go world champion in 2016. Deep Blue relied on a parallelized tree-based search methodology that is not feasible in Go because there are far too many possible moves. Instead, AlphaGo combined Monte Carlo simulations, Monte Carlo tree search, Bayesian optimization, and learning from the games of expert human players.