Robust and Diverse Multi-Agent Learning via Rational Policy Gradient

Niklas Lauffer 1, Ameesh Shah 1, Micah Carroll 1, Sanjit Seshia 1, Stuart Russell 1, Michael Dennis 2

1UC Berkeley     2Google DeepMind

NeurIPS 2025

Adversarial optimization algorithms that explicitly search for flaws in agents' policies have been successfully applied to finding robust and diverse policies in multi-agent settings. However, the success of adversarial optimization has been largely limited to zero-sum settings because its naive application in cooperative settings leads to a critical failure mode: agents are irrationally incentivized to self-sabotage, blocking the completion of tasks and halting further learning. To address this, we introduce Rationality-preserving Policy Optimization (RPO), a formalism for adversarial optimization that avoids self-sabotage by ensuring agents remain rational—that is, their policies are optimal with respect to some possible partner policy. To solve RPO, we develop Rational Policy Gradient (RPG), which trains agents to maximize their own reward in a modified version of the original game in which we use opponent shaping techniques to optimize the adversarial objective. RPG enables us to extend a variety of existing adversarial optimization algorithms that, no longer subject to the limitations of self-sabotage, can find adversarial examples, improve robustness and adaptability, and learn diverse policies. We empirically validate that our approach achieves strong performance in several popular cooperative and general-sum environments.

[Interactive cross-play grids: click any square to load a video of the corresponding policy pair.]

The interactive cross-play grids above show test-time robustness across different partners. Each row and column represents a single seed trained with self-play (SP) using low or high entropy regularization, adversarial diversity (AD), or adversarial diversity with rational policy gradient (AD-RPG) (ours). The color of each square is the average reward achieved when the policies from the two associated seeds are paired with one another. AD-RPG consistently generalizes better across different partners. Click on any square in the grid to see a sampled rollout of the pair of policies.
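As a concrete sketch of how such a grid can be computed (our own illustration, not the paper's evaluation code), assume a hypothetical evaluate(policy_a, policy_b) helper that rolls out one episode of the pair and returns its reward:

```python
# Minimal sketch of building a cross-play grid; `evaluate` is a hypothetical
# helper that rolls out one episode of the two policies and returns its reward.
import numpy as np

def cross_play_grid(policies, evaluate, episodes=32):
    """grid[i, j] = average reward when seed i is paired with seed j."""
    n = len(policies)
    grid = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            grid[i, j] = np.mean([evaluate(policies[i], policies[j])
                                  for _ in range(episodes)])
    return grid  # the diagonal holds each seed's self-play score
```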

Rationality-preserving Policy Optimization

In order to reap the benefits of adversarial optimization without incurring self-sabotaging behavior, we establish a new paradigm for adversarial optimization called Rationality-preserving Policy Optimization (RPO). We formalize RPO as an adversarial optimization problem that requires the policy to be optimal with respect to at least one policy that the other agent(s) might play. This can be thought of as requiring the agent to be rational: i.e., the agent must be utility-maximizing for some choice of teammates. RPO formalizes the rationality constraint as the following optimization problem:

$$ \begin{aligned} \max_{\pi_i} \ &O_i(\pi_1, \dots,\pi_m) \\ \text{subject to} \ &\exists \pi_{-i}' \text{ s.t. } \pi_i \in \text{BR}(\pi_{-i}'). \end{aligned} $$
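Here $O_i$ is the adversarial objective for agent $i$, $\pi_{-i}'$ ranges over possible policies of the other agents, and $\text{BR}(\pi_{-i}')$ is the set of best responses to them. To make the constraint concrete, the following toy check (our own illustration, not the paper's setup, and restricted to deterministic policies in a one-shot common-payoff matrix game) asks whether an action is a best response to at least one partner action; a self-sabotaging action that is optimal against no partner fails the check:

```python
# Toy illustration of the RPO rationality constraint: in a one-shot
# common-payoff matrix game, a deterministic policy (action) is rational iff
# it is a best response to at least one partner action. For simplicity we
# only check deterministic partner policies.
import numpy as np

R = np.array([[3.0, 0.0,  0.0],
              [0.0, 2.0,  0.0],
              [0.0, 0.0, -1.0]])  # shared payoff; rows = agent i, cols = partner

def is_rational(a_i, payoff):
    # Does some partner action make a_i payoff-maximizing for agent i?
    return any(payoff[a_i, a_j] >= payoff[:, a_j].max()
               for a_j in range(payoff.shape[1]))

print([is_rational(a, R) for a in range(3)])  # [True, True, False]
# Action 2 is never optimal against any partner (pure self-sabotage), so the
# rationality constraint rules it out even though it minimizes shared reward.
```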

The rationality constraint imposed by RPO is difficult to directly integrate into a single optimization objective. To solve RPO, we introduce a novel approach called rational policy gradient (RPG), which provides a gradient-based method for ensuring rational learning while optimizing an adversarial objective. RPG introduces a new set of agents called manipulators, one for each of the agents in the original optimization problem (which we call base agents). In RPG, the base agents train only to maximize their own reward in a copy of the game (called its manipulator environment) with their teammates replaced by their manipulator counterparts -- this ensures that the base agents are solely learning to be rational. Each manipulator uses opponent shaping to manipulate the base agents' learning and guide them towards policies that optimize the adversarial objective (e.g., achieving low reward with one another in the original base environment). The manipulators are discarded after training, and the trained base agents constitute the solution to the RPO version of the adversarial objective -- whether that be related to robustness, diversity, or some other objective. RPG allows us to use adversarial optimization in cooperative settings to find adversarial examples, robustify behavior, and discover diverse policies:
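Below is a minimal, self-contained sketch of this idea on a 2x2 coordination game. It is our own illustration, not the paper's implementation: the single inner gradient step, the AP-RPG-style objective (minimize a fixed victim's reward), and all hyperparameters are assumptions made for the sake of the example.

```python
# Minimal sketch of the rational policy gradient idea on a 2x2 cooperative
# matrix game (illustrative only, not the authors' implementation).
import torch

R = torch.tensor([[1.0, 0.0],   # shared payoff: matching actions gives
                  [0.0, 1.0]])  # reward 1, mismatching gives 0

def expected_reward(logits_row, logits_col):
    """Expected shared reward when both players sample independently."""
    p, q = torch.softmax(logits_row, -1), torch.softmax(logits_col, -1)
    return p @ R @ q

victim = torch.tensor([4.0, 0.0])                  # fixed victim: favors action 0
theta_base = torch.zeros(2, requires_grad=True)    # base (adversary) policy logits
theta_manip = torch.zeros(2, requires_grad=True)   # manipulator policy logits
inner_lr, outer_lr = 1.0, 0.1

for step in range(500):
    # Inner step: the base agent only maximizes reward against its manipulator,
    # so it stays rational (a best response to *some* partner).
    inner_obj = expected_reward(theta_base, theta_manip)
    grad_base = torch.autograd.grad(inner_obj, theta_base, create_graph=True)[0]
    theta_base_next = theta_base + inner_lr * grad_base

    # Outer step (opponent shaping): the manipulator differentiates through the
    # base agent's update so that the *updated* base agent scores poorly against
    # the fixed victim in the original game (an AP-RPG-style objective).
    outer_obj = expected_reward(theta_base_next, victim)
    grad_manip = torch.autograd.grad(outer_obj, theta_manip)[0]
    with torch.no_grad():
        theta_manip -= outer_lr * grad_manip         # descend: minimize victim reward
        theta_base += inner_lr * grad_base.detach()  # commit the rational update

# The base agent ends up coordinating on action 1 with its manipulator, which
# the victim (favoring action 0) cannot adapt to, loosely analogous to the
# counterclockwise adversary in the Overcooked example below.
```

Because the base agent only ever maximizes reward against its manipulator, it converges to a best response to some partner; the manipulator's shaping gradient only steers which best response it ends up in.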

Figure 1: Rational policy gradient (RPG) allows finding rational adversarial examples, robustifying behavior, and discovering diverse policies.

Our main algorithm for training robust agents using RPG is Adversarial Diversity - Rational Policy Gradient (AD-RPG), an RPG version of the adversarial diversity algorithm. AD-RPG trains a population of agents that simultaneously maximize their score with a partner while finding strategies that other agents in the population perform poorly against, ultimately improving the robustness of the entire population. We compare AD-RPG against classical adversarial diversity (AD) as well as self-play (SP) with both low (0.01) and high (0.05) entropy coefficients. Figure 2 shows the average in-population cross-play reward for each algorithm: the average of the cross-play scores among the five seeds trained with that algorithm. AD-RPG outperforms all baselines across the Overcooked environments and performs comparably or better in Hanabi.
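For reference, the in-population cross-play number reported for each algorithm can be computed from a cross-play grid like the one sketched earlier; we assume here that the self-play diagonal is excluded, which is the usual convention:

```python
# Sketch of the in-population cross-play metric shown in Figure 2, assuming
# `grid` is the n x n cross-play matrix over one algorithm's seeds and that
# the self-play diagonal is excluded (our assumption).
import numpy as np

def in_population_cross_play(grid):
    off_diagonal = ~np.eye(grid.shape[0], dtype=bool)
    return grid[off_diagonal].mean()
```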

Figure 2: In-population cross-play rewards for different algorithms across environments. Points represent pairs of seeds, bar charts represent means, and error bars represent 95% confidence intervals.

Rational Adversarial Attacks

RPG can also be used to create rational adversarial attacks that avoid self-sabotage. We tested the ability of AP-RPG to discover weaknesses in policies on the Cramped Room Overcooked layout. First, we trained a policy using self-play (SP) (visualized in Figure 3), then used both AP and AP-RPG to search for adversarial attacks against it. As visualized in Figure 4, AP (red agent) finds an irrational policy that sabotages the game by blocking the plate dispenser; this policy achieves zero reward but does not reveal any meaningful weakness in the SP policy (blue agent). As visualized in Figure 5, AP-RPG (red agent) finds a rational policy that also achieves zero reward against the SP policy but identifies a meaningful weakness. Instead of simply sabotaging the game like AP, AP-RPG discovers that the victim (blue agent) assumes the agents will move around each other clockwise. The manipulator in AP-RPG incentivizes the adversary (red agent) to instead move counterclockwise, a perfectly rational strategy that happens to be incompatible with the victim.

Figure 3: A policy trained with self-play that learns to move clockwise around its partner.

Figure 4: An ordinary adversarial example (red agent) acts irrationally, simply blocking the plate dispenser and preventing progress.


Figure 5: A rational adversarial example (red agent) found by AP-RPG learns that the victim (blue) can't adapt to counterclockwise movement.

Evaluating Robustness Against Rational Adversarial Attacks

To explore the effectiveness of AT-RPG, PAIRED-RPG, AP-RPG, and PAIRED-A-RPG, we use a modified version of the STORM environment. Table 1 shows the performance of various fixed victims against different types of adversarial attacks. The "Victim" and "Training" columns indicate the algorithm used to train the victim and the reward it achieved during training, respectively. The "AP", "PAIRED-A-RPG", and "AP-RPG" columns show the reward each victim achieves against the corresponding adversarial attack.

                          -- Adversarial Attack Type --
Victim        Training    AP      PAIRED-A-RPG    AP-RPG
PAIRED        0.13        0.00    0.50            0.42
PAIRED-RPG    0.93        0.00    0.84            0.85
AT            0.00        0.00    0.00            0.00
AT-RPG        0.65        0.00    0.72            0.88
AD            0.00        0.00    0.00            0.00
AD-RPG        0.98        0.00    0.25            0.96
Self-play     0.98        0.00    0.16            0.96

Table 1: The average reward that policies trained by various algorithms ("Victim" column) achieve against different adversarial attack types. The "Training" column shows the reward achieved during training, and the remaining columns show the reward against each adversarial attack type. The AP attack trivially drives every victim to zero reward because it self-sabotages.

As expected, every victim achieves zero reward against the AP attack, since the adversary simply learns to self-sabotage the game by collecting no coins. Likewise, the non-RPG variants of the algorithms (AT, PAIRED, AD) fail during training due to self-sabotage. PAIRED-A-RPG and AP-RPG are both able to find weaknesses in fixed policies, and neither is susceptible to sabotage. PAIRED-RPG and AT-RPG training leads to more robust policies, as indicated by high scores against both PAIRED-A-RPG and AP-RPG attacks.

Citation

@article{lauffer2025rpg,
  title={Robust and Diverse Multi-Agent Learning via Rational Policy Gradient},
  author={Lauffer, Niklas and Shah, Ameesh and Carroll, Micah and Seshia, Sanjit and Russell, Stuart and Dennis, Michael},
  journal={Advances in Neural Information Processing Systems},
  year={2025}
}