



Posted by Winnie Xu, Student Researcher, and Kuang-Huei Lee, Software Engineer, Google Research, Brain Team

Current deep reinforcement learning (RL) methods can train specialist artificial agents that excel at decision making on individual tasks in specific environments, such as Go or StarCraft. However, little progress has been made in extending these results to generalist agents that can perform many different tasks, let alone across different environments with potentially different embodiments.

Looking across recent advances in natural language processing, vision, and generative models (such as PaLM, Imagen, and Flamingo), we see that breakthroughs in building general-purpose models are often achieved by scaling up Transformer-based models and training them on large and semantically diverse datasets. It is natural to wonder: can a similar strategy be used to build generalist agents for sequential decision making? Can such models also adapt rapidly to new tasks, as PaLM and Flamingo do?

As a first step toward answering these questions, our recent paper, Multi-Game Decision Transformers, explores how to build a generalist agent that plays many video games simultaneously. Our model trains an agent that can play 41 Atari games simultaneously at close-to-human performance and that can be quickly adapted to new games via fine-tuning. This approach significantly improves upon the few existing alternatives for learning multi-game agents, such as temporal difference (TD) learning and behavioral cloning (BC).

A Multi-Game Decision Transformer (MGDT) can play multiple games at a desired level of competency after training on a range of trajectories that span all levels of expertise.

Don’t Optimize for Return, Just Ask for Optimality

In reinforcement learning, reward refers to the incentive signals that are relevant to completing a task, and return refers to the cumulative reward over the course of interactions between an agent and its surrounding environment. Traditional deep reinforcement learning agents (DQN, SimPLe, Dreamer, etc.) are trained to optimize their decisions to achieve the optimal return. At every time step, an agent observes the environment (some also consider past interactions) and decides which action to take to achieve a higher return magnitude in future interactions.
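To make the distinction between reward and return concrete, here is a minimal sketch that computes the return still to be collected from each time step onward (often called the return-to-go). The function name, the `discount` parameter, and the reward values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def returns_to_go(rewards, discount=1.0):
    """Sum of (optionally discounted) rewards from each step to the end of the episode."""
    rewards = np.asarray(rewards, dtype=np.float64)
    rtg = np.zeros_like(rewards)
    running = 0.0
    # Walk backwards so each step reuses the sum already computed for the next step.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + discount * running
        rtg[t] = running
    return rtg

# Hypothetical per-step rewards from one short episode.
episode_rewards = [0.0, 1.0, 0.0, 2.0]
rtg = returns_to_go(episode_rewards)
print(rtg)
```

The first entry of `rtg` is the return of the whole episode; later entries shrink as rewards are collected.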

In this work, we use Decision Transformers as the backbone approach for training an RL agent. A Decision Transformer is a sequence model that predicts future actions by considering past interactions between an agent and its surrounding environment and, most importantly, a desired return to be achieved in future interactions. Instead of learning a policy to achieve a high return magnitude as in traditional reinforcement learning, a Decision Transformer maps diverse experiences, ranging from expert level to beginner level, to their corresponding return magnitudes during training. Training an agent on a range of experiences (from beginner to expert level) exposes the model to a wider variety of gameplay, which in turn helps it extract useful rules of gameplay that allow it to succeed under any circumstance. As a result, during inference, the Decision Transformer can achieve any return value in the range it has seen during training, including the optimal return.

But how do we know whether a return is both optimal and stably achievable in a given environment? Earlier applications of Decision Transformers relied on a customized definition of the desired return for each individual task. This required manually defining a plausible and informative range of scalar values that are appropriately interpretable signals for each specific game. To address this issue, we instead model a distribution of return magnitudes based on past interactions with the environment during training. At inference time, we simply add an optimality bias that increases the probability of generating actions associated with higher returns.
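One simple way to picture an optimality bias is as a tilt applied to the predicted return distribution before sampling. The sketch below assumes returns are discretized into bins and that the bias is added to the logits in proportion to each bin's normalized value. This post does not give the exact form used by the authors, so the exponential tilt, the function name, and the `kappa` strength parameter are assumptions made for illustration.

```python
import numpy as np

def biased_return_distribution(return_logits, bin_values, kappa=10.0):
    """Shift probability mass toward higher-return bins (assumed exponential tilt)."""
    logits = np.asarray(return_logits, dtype=np.float64)
    values = np.asarray(bin_values, dtype=np.float64)
    span = values.max() - values.min()
    # Normalize bin values to [0, 1] so kappa has the same meaning in every game.
    normalized = (values - values.min()) / span if span > 0 else np.zeros_like(values)
    biased = logits + kappa * normalized
    biased = biased - biased.max()  # subtract the max for numerical stability
    probs = np.exp(biased)
    return probs / probs.sum()

# Hypothetical model output: equal belief over four return bins.
bin_values = np.array([0.0, 10.0, 20.0, 30.0])
uniform_logits = np.zeros(4)
probs = biased_return_distribution(uniform_logits, bin_values, kappa=3.0)

# Sample a target return to condition the next action on.
rng = np.random.default_rng(0)
target_return = rng.choice(bin_values, p=probs)
```

With `kappa=0` the model's own distribution is left unchanged; larger values concentrate sampling on the highest returns the model considers plausible.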

To capture the spatial-temporal patterns of agent-environment interactions more comprehensively, we also modified the Decision Transformer architecture to consider image patches instead of a global image representation. Patches allow the model to focus on local dynamics, which helps it model game-specific information in more detail.

Together, these pieces form the backbone of the Multi-Game Decision Transformer.

Each observation image is divided into a set of M patches of pixels, denoted O. Return R, action a, and reward r follow these image patches in each causal input sequence. The Decision Transformer is trained to predict the next input (except for the image patches) to establish causality.
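The input layout described in the caption can be sketched as follows: an observation is cut into non-overlapping patches, and each time step contributes the patch tokens followed by return, action, and reward tokens. The function names, the token labels, and the tiny 4x4 frame are hypothetical and only illustrate the ordering; real inputs would be embedded vectors rather than strings.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split a 2D image into non-overlapping square patches, flattened row by row."""
    height, width = image.shape
    assert height % patch_size == 0 and width % patch_size == 0
    rows, cols = height // patch_size, width // patch_size
    # Group pixels by (patch row, patch column), then flatten each patch.
    patches = image.reshape(rows, patch_size, cols, patch_size).swapaxes(1, 2)
    return patches.reshape(rows * cols, patch_size * patch_size)

def timestep_layout(num_patches):
    """Token order within one time step, plus a mask of tokens the model is trained to predict."""
    tokens = [f"o_{i}" for i in range(1, num_patches + 1)] + ["R", "a", "r"]
    is_target = [False] * num_patches + [True, True, True]  # image patches are not predicted
    return tokens, is_target

frame = np.arange(16).reshape(4, 4)  # hypothetical 4x4 grayscale frame
patches = image_to_patches(frame, patch_size=2)
tokens, is_target = timestep_layout(len(patches))
print(tokens)
```

A full input sequence is the concatenation of these per-step layouts over consecutive time steps.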

Training a Multi-Game Decision Transformer to Play 41 Games at Once

We train one Decision Transformer agent on a large (~10B) and broad set of gameplay experiences from 41 Atari games. In our experiments, this agent, which we call the Multi-Game Decision Transformer (MGDT), outperforms existing reinforcement learning and behavioral cloning methods by almost a factor of two when learning to play 41 games simultaneously, and it performs close to human level (100% in the following figure corresponds to the level of human gameplay). These results hold when comparing training methods both in settings where a policy must be learned from static datasets (offline) and in settings where new data can be gathered from interacting with the environment (online).

Each bar is a combined score across 41 games, where 100% indicates human-level performance. The blue bars are from models trained on 41 games simultaneously, whereas the gray bars are from 41 specialist agents. Multi-Game Decision Transformer achieves human-level performance comparable to that of the specialist agents and significantly better than that of other multi-game agents.
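For readers unfamiliar with scores expressed relative to human play, the sketch below shows the human-normalized score commonly used in Atari benchmarks, where 0% matches a random policy and 100% matches a human player. The post does not spell out how the per-game scores are combined into one bar, so no aggregation is shown, and all numbers are made-up placeholders.

```python
import numpy as np

def human_normalized_score(agent, random, human):
    """Per-game score rescaled so that random play is 0% and human play is 100%."""
    agent, random, human = (np.asarray(x, dtype=np.float64) for x in (agent, random, human))
    return 100.0 * (agent - random) / (human - random)

# Made-up raw scores for two hypothetical games.
agent_scores = [30.0, 500.0]
random_scores = [10.0, 100.0]
human_scores = [50.0, 900.0]
normalized = human_normalized_score(agent_scores, random_scores, human_scores)
print(normalized)
```

Normalizing per game keeps titles with very large raw scores from dominating a combined number.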

This result indicates that Decision Transformers are well suited for multi-task, multi-environment, and multi-embodiment agents.

The concurrent work “Generalist Agent” shows a similar result, demonstrating that large transformer-based sequence models can memorize expert behaviors very well across many more environments. Their work and ours also have nicely complementary findings: they show that it is possible to train across a wide range of environments beyond Atari games, while we show that it is possible and useful to train across a wide range of experiences.

In addition to the performance shown above, we found empirically that MGDT trained on a wide variety of experiences outperforms MGDT trained only on expert-level demonstrations or trained simply to clone demonstration behaviors.

Scaling Up Multi-Game Model Size to Achieve Better Performance

Scale has been the main driving force behind many recent machine learning breakthroughs, and it is usually achieved by increasing the number of parameters in a transformer-based model. Our observations on Multi-Game Decision Transformers are similar: performance increases predictably with larger model size. In particular, performance does not appear to have hit a ceiling yet, and compared with other learning systems, the performance gains are more pronounced as model size increases.

Performance of the Multi-Game Decision Transformer (shown by the blue line) increases predictably with larger model size, whereas that of other models does not.

Pre-trained Multi-Game Decision Transformers Are Fast Learners

Another benefit of MGDT is that it can learn to play a new game from very few gameplay demonstrations (which need not all be expert level). In that sense, MGDT can be considered a pre-trained model that can be fine-tuned rapidly on a small amount of new gameplay data. Compared with other popular pre-training methods, it shows consistent advantages in obtaining higher scores.

Multi-Game Decision Transformer pre-training (DT pre-training, shown in light blue) demonstrates a consistent advantage over other popular methods in adapting to new tasks.

Where Is the Agent Looking?

In addition to the quantitative evaluation, it is insightful (and fun) to visualize the agent's behavior. By probing the attention heads, we find that the MGDT model consistently places weight in its field of view on areas of the observed images that contain meaningful game entities. We visualize the model's attention as it predicts the next action in various games and find that it consistently attends to entities such as the agent's on-screen avatar, the agent's free movement space, non-agent objects, and key environment features. For example, in an interactive setting, having an accurate world model requires knowing how and when to focus on known objects (e.g., currently present obstacles) as well as anticipating and/or planning for future unknowns (e.g., negative space). This diverse allocation of attention to the many key components of each environment ultimately improves performance.

Here we can see the amount of weight the model places on each key asset of the game scene. Brighter red indicates more emphasis on that patch of pixels.
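A heatmap of this kind can be produced by spreading each patch's attention weight over the pixels that patch covers. The sketch below assumes one scalar weight per patch, laid out row by row on a rectangular grid. The function name and the example weights are hypothetical, and the authors' actual visualization procedure may differ.

```python
import numpy as np

def patch_attention_to_heatmap(patch_weights, grid_shape, patch_size):
    """Expand per-patch weights into a pixel-resolution map scaled to [0, 1]."""
    rows, cols = grid_shape
    grid = np.asarray(patch_weights, dtype=np.float64).reshape(rows, cols)
    peak = grid.max()
    if peak > 0:
        grid = grid / peak  # the most-attended patch maps to full intensity
    # The Kronecker product with a block of ones repeats each weight over its patch.
    return np.kron(grid, np.ones((patch_size, patch_size)))

# Hypothetical attention weights for a 2x2 grid of 2x2-pixel patches.
weights = [0.1, 0.4, 0.2, 0.3]
heatmap = patch_attention_to_heatmap(weights, grid_shape=(2, 2), patch_size=2)
print(heatmap.shape)
```

The resulting array can be overlaid on the game frame as a red channel, so the most-attended patches appear brightest.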

The Future of Large-Scale Generalist Agents

This work is an important step in demonstrating the possibility of training general-purpose agents across many environments, embodiments, and behavior styles. We have shown the benefit of increased scale on performance and the potential for further scaling. These findings seem to point to a generalization narrative similar to that in other domains such as vision and language. We look forward to exploring the great potential of scaling data and learning from diverse experiences.

We look forward to future research toward developing performant agents for multi-environment and multi-embodiment settings. Code and model checkpoints will soon be accessible here.

Acknowledgments

We would like to thank all the remaining authors of the paper, including Igor Mordatch, Ofir Nachum, Mengjiao Yang, Lisa Lee, Daniel Freeman, Sergio Guadarrama, Ian Fischer, Eric Jang, and Henryk Michalewski.

