🍓Strawberry, o1, and Self-Play Reinforcement Learning
A recent paper from China's Tsinghua University and Peking University provides a great overview of self-play RL, which is emerging as the next paradigm for LLMs and generative AI.
On Thursday, OpenAI released a new series of AI models called o1, also known as the highly anticipated “Strawberry” model, which has been generating buzz over the past year. The o1 series, which currently includes o1-preview and o1-mini, is designed to significantly improve reasoning capabilities and tackle complex problems. The results are jaw-dropping:
Achieved 83% accuracy on International Mathematics Olympiad qualifying exams, compared to GPT-4o’s 13%;
Ranked in the 89th percentile on Codeforces programming competitions;
Demonstrated PhD-level performance on challenging tasks in physics, chemistry, and biology.
The secret of the model’s enhanced reasoning ability is reinforcement learning, a machine learning technique that employs rewards and penalties to teach the model how to solve problems independently. With reinforcement learning, the model is trained to generate a long internal chain of thought (not predefined by human engineers) before providing an answer.
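OpenAI has not published how o1 is actually trained, so the following is only a toy sketch of the general idea: sample a way of reasoning about a problem, reward it when the final answer checks out, and update the policy so successful reasoning paths become more likely. The strategies, task, and reward values below are invented for illustration; a real system would reinforce generated chains of thought, not three hand-written rules.

```python
import math
import random

# Toy illustration only: OpenAI has not published o1's training recipe.
# The "model" here chooses among a few canned reasoning strategies for a
# simple addition task, and a reward (did the final answer come out right?)
# nudges the policy toward strategies whose reasoning ends in correct answers.
STRATEGIES = {
    "guess":        lambda a, b: random.randint(0, 18),  # answer with no reasoning
    "add_directly": lambda a, b: a + b,                  # sound reasoning path
    "double_first": lambda a, b: 2 * a,                  # flawed reasoning path
}
logits = {name: 0.0 for name in STRATEGIES}              # policy over strategies
learning_rate = 0.1

def softmax(logit_dict):
    exps = {n: math.exp(v) for n, v in logit_dict.items()}
    total = sum(exps.values())
    return {n: e / total for n, e in exps.items()}

for step in range(2000):
    a, b = random.randint(0, 9), random.randint(0, 9)
    probs = softmax(logits)
    choice = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
    reward = 1.0 if STRATEGIES[choice](a, b) == a + b else -0.1

    # REINFORCE update for a softmax policy:
    # d log pi(choice) / d logit_n = 1[n == choice] - pi(n)
    for n in logits:
        grad = (1.0 if n == choice else 0.0) - probs[n]
        logits[n] += learning_rate * reward * grad

print(softmax(logits))  # probability mass concentrates on "add_directly"
```

Run long enough, the policy learns to prefer the reasoning path that reliably produces correct answers, which is the core feedback loop the o1 announcement describes, just at a vastly larger scale.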
“First time proper RL over the space of language! This brings up memories of early days of computers getting better at Go via self play.” - Boris Power, Head of Applied Research at OpenAI.
The release of the o1 series suggests that reinforcement learning, specifically self-play RL, could be the next major paradigm for LLMs and generative AI. This shift comes at a time when many are concerned about the slowing progress of LLM pre-training, as seen with the yet-to-be-released GPT-5.
OpenAI isn’t the only contender in this race. Google DeepMind’s AlphaProof, which combines language models with reinforcement learning to solve complex math problems, and Anthropic’s new Claude model are also likely to leverage self-play RL.
But what exactly is self-play RL? In this article, I will reference a recent paper, A Survey on Self-play Methods in Reinforcement Learning, from Tsinghua University, Peking University, and others, which provides a comprehensive overview of self-play methods in RL and introduces a unified framework for understanding these techniques.
What is Self-Play in Reinforcement Learning?
Self-play has emerged as a transformative concept in reinforcement learning (RL), enabling AI agents to train and refine their strategies without the need for human opponents or external data. By allowing an agent to compete against itself or its past versions, self-play opens up new possibilities for solving complex problems and achieving superhuman performance in domains like board games, video games, and real-time strategy games.
In traditional reinforcement learning, an agent learns by interacting with an environment and receiving feedback through rewards or penalties. The objective is to maximize the cumulative reward over time, which typically involves optimizing a policy that maps states to actions. This approach is modeled using a Markov Decision Process (MDP), which provides a mathematical framework for decision-making in uncertain environments.
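To make that single-agent loop concrete, here is a minimal tabular Q-learning sketch on a made-up five-state corridor (the environment, rewards, and hyperparameters are illustrative, not taken from the survey): the agent observes a state, picks an action, collects a reward, and nudges its value estimates toward the discounted return.

```python
import random

# Minimal single-agent RL sketch: tabular Q-learning on a hypothetical
# 5-state corridor. The agent starts at state 0, pays -0.01 per step,
# and earns +1 for reaching the terminal state 4.
N_STATES, ACTIONS = 5, (-1, +1)           # move left / move right
alpha, gamma, epsilon = 0.1, 0.95, 0.1    # learning rate, discount, exploration

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else -0.01
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# Greedy policy after training: every non-terminal state should point right (+1).
print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)})
```

The key point is that the opponent here is the environment itself, which never changes; self-play replaces that fixed environment with an adversary that learns too.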
Self-play takes this concept a step further by introducing a multi-agent setting where an agent learns by playing against itself or its historical versions. In this setup, the agent continually faces a dynamic opponent that evolves alongside it. This mechanism allows for more stable learning and avoids the pitfalls of static environments where the agent could easily overfit to a fixed opponent.
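As a toy illustration of playing against one's historical versions (not an example from the survey), the sketch below runs a fictitious-self-play loop in Rock-Paper-Scissors: at each step the agent best-responds to the empirical mix of its own past moves. Because the game is non-transitive, no single move stays dominant, and the move frequencies drift toward the uniform mix.

```python
from collections import Counter

# Toy self-play sketch: in Rock-Paper-Scissors the agent best-responds to the
# empirical distribution of its own past moves, i.e. it plays against
# "historical versions" of itself. One pseudo-count per move keeps the
# history non-empty at the start.
MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

history = Counter({m: 1 for m in MOVES})  # the agent's own past moves

def best_response(opponent_counts):
    """Play the move that beats the opponent's most frequent move."""
    likely = max(opponent_counts, key=opponent_counts.get)
    return next(m for m in MOVES if BEATS[m] == likely)

for step in range(10_000):
    move = best_response(history)  # the "opponent" is the agent's history
    history[move] += 1

total = sum(history.values())
print({m: round(history[m] / total, 3) for m in MOVES})  # roughly 1/3 each
```

The play itself keeps cycling through rock, paper, and scissors, but the long-run frequencies settle near the uniform strategy, which is exactly the kind of moving-target dynamic that makes self-play harder, and more interesting, than training against a fixed opponent.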
Why is Self-Play Important?
The importance of self-play in RL lies in its ability to generate diverse and challenging scenarios for the agent to learn from. Traditional RL methods often struggle with environments that are highly non-stationary or involve multiple interacting agents with conflicting goals. Self-play addresses these issues by providing a framework where the learning agent encounters a wide range of strategies and behaviors, thereby promoting more generalized and robust policies.
The authors categorize self-play into four main types, each with its own strengths and applications:
Traditional Self-Play Algorithms: These algorithms involve an agent repeatedly playing against its latest version, continuously refining its strategy based on past outcomes. Vanilla self-play and Fictitious Self-Play (FSP) fall under this category. While effective, these methods can sometimes lead to convergence on cyclic or suboptimal strategies, especially in non-transitive games like Rock-Paper-Scissors, where no single strategy is dominant.
Policy-Space Response Oracles (PSRO) Series: The PSRO framework expands on traditional self-play by employing a meta-strategy that governs how opponent policies are selected from a pool of existing strategies. By incorporating game-theoretic concepts such as Nash equilibrium and correlated equilibrium, PSRO algorithms diversify the learning experience and reduce the risk of overfitting to a narrow set of opponents. This series has been particularly successful in strategic games like poker, where balancing exploitation and exploration is crucial.
Ongoing-Training-Based Series: Unlike the PSRO series, which adds new strategies iteratively, ongoing-training-based methods involve continuously updating all strategies in parallel. This approach is exemplified by the FTW (For The Win) agent, which was trained to play Quake III Arena using a population-based approach. By maintaining a constantly evolving set of policies, these methods ensure that agents remain adaptable to changing environments and can handle scenarios with multiple, simultaneous players.
Regret-Minimization-Based Series: Regret minimization focuses on minimizing the difference between the rewards an agent could have achieved and the rewards it actually received over multiple rounds of play. This approach is well-suited for environments where deception and strategic planning are critical, such as in card games like poker. By emphasizing long-term gains over short-term wins, regret-minimization-based methods encourage more sophisticated strategies and greater resilience against opponent exploitation; a minimal regret-matching sketch follows below.
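None of these series is tied to one code base, but regret matching, the core update behind counterfactual regret minimization used in poker agents, is compact enough to sketch. In the toy Rock-Paper-Scissors trainer below (an illustration written for this article, not code from the survey), two copies of the learner play each other, track how much better every alternative action would have done, and then play in proportion to accumulated positive regret; their average strategies approach the uniform Nash equilibrium.

```python
import random

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(a, b):
    """+1 if move a beats move b, -1 if it loses, 0 on a tie."""
    return 0 if a == b else (1 if BEATS[a] == b else -1)

def strategy(regrets):
    """Play in proportion to positive accumulated regret (uniform if none)."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1 / 3] * 3

def train(iterations=50_000):
    regrets = [[0.0] * 3, [0.0] * 3]        # one regret table per player
    strategy_sums = [[0.0] * 3, [0.0] * 3]  # running sum of strategies
    for _ in range(iterations):
        strats = [strategy(regrets[0]), strategy(regrets[1])]
        moves = [random.choices(range(3), weights=s)[0] for s in strats]
        for p in range(2):
            opp = MOVES[moves[1 - p]]
            actual = payoff(MOVES[moves[p]], opp)
            for a in range(3):
                # Regret = what action a would have earned minus what we got.
                regrets[p][a] += payoff(MOVES[a], opp) - actual
                strategy_sums[p][a] += strats[p][a]
    avg = [s / sum(strategy_sums[0]) for s in strategy_sums[0]]
    return [round(x, 3) for x in avg]

print(train())  # roughly [0.333, 0.333, 0.333]
```

In Rock-Paper-Scissors the result is simply the uniform mix, but the same regret bookkeeping, applied to every decision point of a game tree, is what allowed counterfactual-regret-based agents to reach superhuman play in poker.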
Applications of Self-Play: From Games to Real-World Scenarios
Self-play has been pivotal in some of the most significant breakthroughs in AI. The most famous example is AlphaGo, developed by DeepMind, which used self-play to master the game of Go—an ancient board game known for its vast search space and strategic depth. AlphaGo Zero, its successor, took this concept further by learning from scratch, without any human game data, and surpassing its predecessor in just a few days of training.
AlphaZero extended these ideas to multiple games, including chess and shogi, demonstrating that self-play can be generalized across different domains. In each case, the AI was able to discover novel strategies that had never been seen before, suggesting that self-play can unlock new levels of creativity and insight in problem-solving.
Beyond games, self-play has potential applications in more complex, real-world problems. For instance, in autonomous driving, self-play could be used to simulate various traffic scenarios, allowing vehicles to learn robust decision-making strategies without real-world risks. In financial modeling, self-play could help design algorithms that adapt to ever-changing market dynamics, outperforming static models that rely on historical data.
Challenges and Future Directions in Self-Play
While self-play has proven effective in many scenarios, it also comes with challenges. One of the primary issues is the computational expense. Training AI agents through self-play requires substantial computational resources, particularly when the state space is large or the environment is highly complex. This makes it difficult to apply self-play to many real-world tasks without simplifying the problem or optimizing the algorithms.
Another challenge is the risk of converging to suboptimal equilibria. In some cases, agents trained through self-play can become overly specialized to certain types of opponents, making them less effective against others. Researchers are exploring ways to mitigate this by diversifying training strategies and incorporating techniques such as opponent modeling and multi-agent cooperation.
Looking ahead, the future of self-play in RL will involve integrating it with other AI systems, such as LLMs and multi-modal models (like OpenAI’s o1). By combining self-play with LLMs, for example, AI could learn not just optimal strategies but also the reasoning and decision-making processes behind those strategies. This could be transformative in fields like negotiation, diplomacy, and even healthcare, where complex human-like decision-making is essential.
A Few Last Words: Back in 2016, Yann LeCun introduced his now-famous cake analogy for machine learning: “If intelligence is a cake, the bulk of it is unsupervised learning, the icing is supervised learning, and the cherry on top is reinforcement learning (RL).”
A year later, Pieter Abbeel, then a UC Berkeley professor and former OpenAI researcher, heated up the dessert debate with his version: a cake with lots of cherries, meaning lots of RL.
Fast forward seven years, and it seems like the cakes are getting loaded with even more cherries!