Abstract

This paper studies the problem of zero-shot generalization in reinforcement learning and introduces an algorithm that trains an ensemble of reward-maximizing agents alongside a maximum-entropy agent. At test time, either the ensemble agrees on an action, in which case we generalize well, or we take exploratory actions guided by the maximum-entropy agent, driving the system toward a novel part of the state space where the ensemble may agree again. Combined with an invariance-based approach, this method achieves new state-of-the-art results on ProcGen.
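For illustration only, the test-time decision rule summarized above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the `select_action` function, the `act` policy interface, and the agreement threshold are all assumptions introduced here.

```python
def select_action(state, reward_agents, entropy_agent, agree_threshold=1.0):
    """Hypothetical test-time rule (not the paper's code): act with the
    ensemble when it agrees, otherwise fall back to the maximum-entropy
    agent to explore toward states where the ensemble may agree again.

    reward_agents   -- list of reward-maximizing policies, each with .act(state)
    entropy_agent   -- maximum-entropy policy, used for exploratory actions
    agree_threshold -- fraction of ensemble members that must choose the
                       same action (1.0 = unanimous agreement)
    """
    votes = [agent.act(state) for agent in reward_agents]
    # Find the most common action and how many ensemble members chose it.
    action, count = max(((a, votes.count(a)) for a in set(votes)),
                        key=lambda pair: pair[1])
    if count / len(votes) >= agree_threshold:
        # Ensemble agreement: follow the agreed-upon action.
        return action
    # Disagreement: take an exploratory action from the maximum-entropy agent.
    return entropy_agent.act(state)
```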