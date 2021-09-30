



Posted by Rishabh Agarwal, Research Associate, Google Research, Brain Team

Reinforcement learning (RL) is a sequential decision-making paradigm for training intelligent agents to tackle complex tasks such as robotic locomotion, playing video games, flying stratospheric balloons, and designing hardware chips. RL agents have shown promising results in a variety of activities, but it is difficult to transfer their capabilities to new tasks, even when those tasks are semantically equivalent. For example, consider a jumping task in which an agent, learning from image observations, needs to jump over an obstacle. Deep RL agents trained on a few of these tasks with varying obstacle positions struggle to jump successfully when the obstacle is at a previously unseen location.

Jumping task: The agent (white block), learning from pixels, needs to jump over an obstacle (gray square). The challenge is to generalize to unseen obstacle positions and floor heights in test tasks using only a small number of training tasks. In a given task, the agent needs to time the jump precisely, at a specific distance from the obstacle; otherwise it will eventually hit the obstacle.

In “Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning”, presented as a spotlight at ICLR 2021, we incorporate the inherent sequential structure of RL into the representation learning process to improve generalization to unseen tasks. This is orthogonal to the predominant approaches before this work, which were typically adapted from supervised learning and therefore largely ignore this sequential aspect. Our approach exploits the fact that an agent operating in tasks with similar underlying mechanics exhibits at least short sequences of behavior that are similar across these tasks.

Prior work on generalization was typically adapted from supervised learning and revolved around enhancing the learning process. These approaches rarely exploit properties of the sequential aspect, such as similarity in actions across temporal observations.

Our approach trains the agent to learn a representation in which states are close together when the agent’s optimal behavior in these states and in future states is similar. This notion of proximity, which we call behavioral similarity, generalizes to observations across different tasks. To measure behavioral similarity between states across tasks (e.g., distinct obstacle positions in the jumping task), we introduce the policy similarity metric (PSM), a theoretically motivated state-similarity metric inspired by bisimulation. For example, the following image shows that the agent’s future actions in two visually different states are the same, so these states are similar according to PSM.

Understanding behavioral similarity. The agent (blue icon) needs to obtain the reward while keeping its distance from danger. Although the initial states are visually different, they are similar in terms of their optimal behavior in the current state and in the future states that follow it. The policy similarity metric (PSM) assigns high similarity to such behaviorally similar states and low similarity to dissimilar states.
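To make the idea of comparing current and future optimal behavior concrete, here is a minimal sketch of a PSM-style computation, not the authors' implementation. It assumes deterministic dynamics and a deterministic optimal policy, so the metric between two states reduces to a local action mismatch plus the discounted metric between their successor states. The function name `policy_similarity_metric`, the action encoding (0 for "move right", 1 for "jump"), the discount `gamma`, and the choice to treat everything past the end of a trajectory as a zero-distance terminal state are all illustrative assumptions.

```python
def policy_similarity_metric(actions_x, actions_y, gamma=0.99):
    """Pairwise PSM-style distances between states on two optimal trajectories.

    Assumption: deterministic dynamics and deterministic optimal policies, so
    d(x_i, y_j) = mismatch(a_x[i], a_y[j]) + gamma * d(x_{i+1}, y_{j+1}),
    where mismatch is 0.0 for equal actions and 1.0 otherwise.
    """
    n, m = len(actions_x), len(actions_y)
    # Extra row and column of zeros: hypothetical absorbing terminal state.
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Fill backwards so that the successor pair is always available.
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            mismatch = 0.0 if actions_x[i] == actions_y[j] else 1.0
            d[i][j] = mismatch + gamma * d[i + 1][j + 1]
    return [row[:m] for row in d[:n]]


# Two hypothetical jumping tasks: the obstacle is farther away in task A.
task_a = [0, 0, 0, 1, 0, 0]  # jump at step 3
task_b = [0, 1, 0, 0]        # jump at step 1
psm = policy_similarity_metric(task_a, task_b, gamma=0.5)

# States one step before the jump in each task behave identically from here on.
print(psm[2][0])  # 0.0
# The two initial states are at different distances from the jump.
print(psm[0][0])  # 0.625
```

In this toy example, states that are the same number of steps away from the jump get a distance of zero even though they come from different tasks, which is the kind of behavioral similarity the metric is meant to capture.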

To enhance generalization, our approach learns state embeddings, which correspond to neural-network-based representations of task states, that bring behaviorally similar states together (such as the two states in the figure above) while pushing behaviorally dissimilar states apart. To do so, we present contrastive metric embeddings (CMEs), which harness the benefits of contrastive learning to learn representations based on a state-similarity metric. We instantiate contrastive embeddings with the policy similarity metric (PSM) to learn policy similarity embeddings (PSEs). PSEs assign similar representations to states with similar behavior in both those states and future states, such as the two initial states shown in the image above.
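The pull-together, push-apart idea can be sketched as a metric-weighted contrastive loss. The code below is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the use of cosine similarity, the `beta` and `temperature` parameters, and the specific soft weighting (the nearest state under the metric is the positive, and negatives are weighted by how dissimilar they are) are choices made here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_metric_loss(emb_x, emb_y, metric, beta=1.0, temperature=0.5):
    """Average contrastive loss over anchor states y (illustrative sketch).

    emb_x, emb_y: lists of embedding vectors for states from two tasks.
    metric[i][j]: state-similarity metric between x_i and y_j (e.g. a PSM).
    """
    indices = range(len(emb_x))
    total = 0.0
    for j, z_y in enumerate(emb_y):
        column = [metric[i][j] for i in indices]
        # Positive: the state in the other task closest to y under the metric.
        pos = min(indices, key=column.__getitem__)
        # Map distances to similarities in (0, 1]; 1 means identical behavior.
        sim = [math.exp(-dist / beta) for dist in column]
        logits = [cosine(z_x, z_y) / temperature for z_x in emb_x]
        positive = sim[pos] * math.exp(logits[pos])
        # Negatives count more the less behaviorally similar they are to y.
        negatives = sum(
            (1.0 - sim[i]) * math.exp(logits[i]) for i in indices if i != pos
        )
        total += -math.log(positive / (positive + negatives))
    return total / len(emb_y)


# Hypothetical check: x0 matches y0 and x1 matches y1 under the metric.
metric = [[0.0, 1.0], [1.0, 0.0]]
emb_x = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_metric_loss(emb_x, [[1.0, 0.0], [0.0, 1.0]], metric)
swapped = contrastive_metric_loss(emb_x, [[0.0, 1.0], [1.0, 0.0]], metric)
print(aligned < swapped)  # True
```

The loss is lower when embeddings of behaviorally similar states point in the same direction, which is the property the embedding is trained to have.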

As the results below show, PSEs considerably improve generalization on the jumping task from pixels described above and outperform prior methods.

Method                      “Wide”         “Narrow”       “Random”
Regularization              17.2 (2.2)     10.2 (4.6)     9.3 (5.4)
PSE                         33.6 (10.0)    9.3 (5.3)      37.7 (10.4)
Data augmentation           50.7 (24.2)    33.7 (11.8)    71.3 (15.6)
Data aug. + bisimulation    41.4 (17.6)    17.4 (6.7)     33.4 (15.6)
Data aug. + PSE             87.0 (10.1)    52.4 (5.8)     83.4 (10.1)

Jumping task results: Percentage (%) of test tasks solved by different methods, without and with data augmentation, for each grid configuration. The “wide”, “narrow”, and “random” grids are the configurations shown in the figure below and contain 18 training tasks and 268 test tasks. We report average performance across 100 runs with different random initializations, with the standard deviation in parentheses.

Jumping task grid configurations: Visualization of the average performance of PSEs with data augmentation across different configurations. For each grid configuration, the floor height varies along the y-axis (11 heights) and the obstacle position varies along the x-axis (26 locations). The red letter T indicates a training task. Beige tiles are tasks that PSEs, in conjunction with data augmentation, solve; black tiles are tasks that remain unsolved.
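As a quick sanity check on the task counts quoted in the captions above, the grid dimensions and the train/test split are consistent with each other. The variable names below are illustrative only.

```python
# Grid dimensions and training-task count as stated in the captions.
heights = 11
obstacle_positions = 26
training_tasks = 18

total_tasks = heights * obstacle_positions
test_tasks = total_tasks - training_tasks

print(total_tasks)  # 286
print(test_tasks)   # 268
```

So each configuration trains on 18 of the 286 grid cells and evaluates on the remaining 268.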

We also visualize the representations learned by PSEs and by baseline methods by projecting them onto 2D points with UMAP, a popular visualization technique for high-dimensional data. As the visualization shows, PSEs, unlike prior methods, cluster behaviorally similar states together and keep dissimilar states apart. Furthermore, PSEs partition the states into two sets: (1) all states before the jump and (2) states where actions do not affect the outcome (states after the jump).

Visualizing learned representations. (a) Optimal trajectories on the jumping task (visualized as colored blocks) with varying obstacle positions. Points with the same number label correspond to the same distance of the agent from the obstacle, which is the underlying optimal invariant feature across the various jumping tasks. (b-d) We visualize the hidden representations using UMAP, where the color of each point indicates the task of the corresponding observation. (b) PSEs capture the correct invariant feature, as can be seen from points with the same number label being clustered together. That is, after the jump action (numbered block 2), all other actions (non-numbered blocks) are similar, as shown by the overlapping curve. In contrast to PSEs, baselines including (c) l2-loss embeddings (instead of a contrastive loss) and (d) reward-based bisimulation metrics do not place behaviorally similar states with similar number labels together. The poor generalization of (c, d) is likely due to states with similar optimal behavior ending up with distant embeddings.

Conclusion

Overall, this work shows the benefits of exploiting the inherent structure of RL to learn effective representations. Specifically, it advances generalization in RL through two contributions: the policy similarity metric and contrastive metric embeddings. PSEs combine these two ideas to enhance generalization. Exciting avenues for future work include finding better ways to define behavioral similarity and leveraging this structure for representation learning.

Acknowledgments

This is joint work with Pablo Samuel Castro, Marlos C. Machado, and Marc G. Bellemare. We would also like to thank David Ha, Ankit Anand, Alex Irpan, Rico Jonschkowski, Richard Song, Ofir Nachum, Dale Schuurmans, Aleksandra Faust, and Divya Ghosh for their insightful comments on this work.

