Interpretability | Natasha Jaques

Interpreting whether multi-agent reinforcement learning (MARL) agents have successfully learned to coordinate with each other, versus finding some other way to exploit the reward function, is a longstanding problem. We develop a novel interpretability method for MARL based on concept bottlenecks, which enables detecting which agents are truly coordinating, which environments require coordination, and identifying lazy agents.