Reinforcement Learning with Augmented Data

Michael Laskin*, Kimin Lee*, Adam Stooke, Lerrel Pinto, Pieter Abbeel, Aravind Srinivas

* Equal contribution

UC Berkeley, BAIR

[Github Code] [Paper]

Overview

We perform the first extensive study of image augmentation in the RL setting and show, surprisingly, that simple RL algorithms with augmented data achieve state-of-the-art results for data-efficiency on DeepMind control and test-time generalization on OpenAI ProcGen environments.

Abstract

Learning from visual observations is a fundamental yet challenging problem in reinforcement learning (RL). Although algorithmic advancements combined with convolutional neural networks have proved to be a recipe for success, current methods are still lacking on two fronts: (a) sample efficiency of learning and (b) generalization to new environments. To this end, we present RAD: Reinforcement Learning with Augmented Data, a simple plug-and-play module that can enhance any RL algorithm. We show that data augmentations such as random crop, color jitter, patch cutout, and random convolutions can enable simple RL algorithms to match and even outperform complex state-of-the-art methods across common benchmarks in terms of data-efficiency, generalization, and wall-clock speed. We find that data diversity alone can make agents focus on meaningful information from high-dimensional observations without any changes to the reinforcement learning method. On the DeepMind Control Suite, we show that RAD is state-of-the-art in terms of data-efficiency and performance across 15 environments. We further demonstrate that RAD can significantly improve the test-time generalization on several OpenAI ProcGen benchmarks. Finally, our customized data augmentation modules enable faster wall-clock speed compared to competing RL techniques.

Method

In our paper, we show how data augmentation improves the performance and generalization abilities of standard RL algorithms, both on-policy and off-policy. We combine data augmentation with (i) Soft Actor-Critic (SAC) for solving tasks on DeepMind control and (ii) PPO for ProcGen environments. Our method does not change the underlying RL pipeline; it only augments the data fed to it.
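The core idea fits in a few lines: sample a batch, augment the observations, and run the unchanged RL update. The sketch below is illustrative only, assuming an off-policy agent; the crop size, batch shapes, and the replay_buffer.sample / agent.update interfaces are stand-ins rather than the exact API of the released code.

import numpy as np

def random_crop(imgs, out_size=84):
    # Randomly crop a batch of (N, C, H, W) images to (N, C, out_size, out_size).
    n, c, h, w = imgs.shape
    tops = np.random.randint(0, h - out_size + 1, size=n)
    lefts = np.random.randint(0, w - out_size + 1, size=n)
    return np.stack([img[:, t:t + out_size, l:l + out_size]
                     for img, t, l in zip(imgs, tops, lefts)])

def rad_update(agent, replay_buffer, batch_size=128):
    # One RAD training step: augment the observations, then call the unchanged RL update.
    obs, action, reward, next_obs, done = replay_buffer.sample(batch_size)
    obs, next_obs = random_crop(obs), random_crop(next_obs)  # the only change to the pipeline
    agent.update(obs, action, reward, next_obs, done)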

Results

(1) RAD is the state-of-the-art algorithm on the majority (5 out of 6) of the extensively benchmarked environments on both the DMControl100k and DMControl500k benchmarks, matching or outperforming CURL, Dreamer, PlaNet, SLAC, SAC+AE, and Pixel SAC.

(2) RAD quickly matches state-based performance on the majority (11 out of 15) of DeepMind control environments. It also performs comparably to or better than CURL, the prior state-of-the-art algorithm for data efficiency.


(3) Random crop, on its own, has the highest impact on final performance relative to all other augmentations on DeepMind control. We ablate pair-wise permutations of six common augmentations and measure performance on the walker-walk task at 500k environment steps. While all data augmentations help, random crop alone results in the highest performance.
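For concreteness, the ablation can be read as the loop sketched below; the two stand-in augmentations and the training step are illustrative assumptions, not the exact set or code used in the paper.

import itertools
import numpy as np

def identity(imgs):
    return imgs

def grayscale(imgs):
    # Average over the channel dimension of (N, C, H, W) images, keeping the shape.
    return np.repeat(imgs.mean(axis=1, keepdims=True), imgs.shape[1], axis=1)

# Stand-in set; the paper ablates six common augmentations.
augmentations = {"identity": identity, "grayscale": grayscale}

# Every ordered pair, including an augmentation composed with itself.
for (name_a, aug_a), (name_b, aug_b) in itertools.product(augmentations.items(), repeat=2):
    augment = lambda imgs, a=aug_a, b=aug_b: b(a(imgs))
    # Train an agent with `augment` applied to each sampled batch and record the
    # walker-walk score at 500k environment steps.
    print(f"pair: {name_a} -> {name_b}")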


(4) RAD achieves state-of-the-art test-time generalization on ProcGen environments such as BigFish and StarPilot. For example, RAD with random crop achieves a 55.8% relative gain over pixel-based PPO on the BigFish environment.

(5) RAD trained with 100 training levels outperforms pixel-based PPO trained with 200 training levels on both the BigFish and StarPilot environments. This shows that data augmentation can be more effective for learning generalizable representations than simply increasing the number of training environments.


(6) Random crop fails in environments that require structural generalization (e.g., adapting to a new map layout), such as Jumper and CoinRun. However, color-based augmentations such as random convolution and color jitter still improve test-time performance on environments like CoinRun.


Why is random crop so effective?

To understand why random crop works so well on DeepMind control, we inspect spatial attention maps across the convolutional encoder for policies learned with each of the augmentations, including the no-augmentation baseline. We notice that random crop helps the encoder localize the agent more accurately and reliably than the other augmentations on DeepMind control. In particular, while policies learned with other augmentations also attend to distractors (e.g., stars in the background) or fail to capture the agent state, the policy learned with random crop crisply and reliably extracts the agent from the frame. This suggests that spatial observation jittering helps the base agent develop contingency awareness.
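As a rough sketch of how such spatial attention maps can be computed from the convolutional encoder, the snippet below follows a common recipe (sum of squared activations over channels, normalized and upsampled to the input frame); the exact procedure behind our figures may differ in its details.

import torch
import torch.nn.functional as F

def spatial_attention(conv_features, frame_hw=(84, 84)):
    # conv_features: (N, C, H, W) activations from an intermediate encoder layer.
    attn = conv_features.pow(2).sum(dim=1, keepdim=True)               # (N, 1, H, W)
    attn = attn / (attn.flatten(1).max(dim=1).values.view(-1, 1, 1, 1) + 1e-8)
    attn = F.interpolate(attn, size=frame_hw, mode="bilinear", align_corners=False)
    return attn.squeeze(1)                                             # (N, frame_h, frame_w)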


BibTeX

@unpublished{laskin_lee2020rad,
  title={Reinforcement Learning with Augmented Data},
  author={Laskin, Michael and Lee, Kimin and Stooke, Adam and Pinto, Lerrel and Abbeel, Pieter and Srinivas, Aravind},
  note={arXiv:2004.14990}
}