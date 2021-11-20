



Posted by Pete Florence, Research Scientist, and Corey Lynch, Research Engineer, Robotics at Google

Despite significant advances in robot learning over the past few years, some policies for robotic agents still struggle to choose actions decisively when trying to imitate precise or complex behaviors. Consider a task in which a robot slides a block across a table and tries to place it precisely in a slot. There are many possible ways to solve this task, each of which requires precise movements and corrections. The robot must commit to just one of these options, but it must also be able to change its plan each time the block slides farther than expected. This may seem easy, but that is often not the case for modern learning-based robots, which frequently learn behaviors that expert observers describe as indecisive or imprecise.

Example of a baseline explicit behavior cloning model struggling on a task where the robot needs to slide a block across a table and then precisely insert it into a fixture.

To encourage robots to be more decisive, researchers often use a discretized action space, which forces the robot to choose option A or option B without oscillating between the options. Discretization, for example, is a key element of the recent Transporter Networks architecture and is inherent in many notable achievements by game-playing agents such as AlphaGo, AlphaStar, and OpenAI’s Dota bot. However, discretization has its own limitations. For robots that operate in a spatially continuous real world, it has at least two drawbacks: (i) it limits precision, and (ii) it triggers the curse of dimensionality, because discretizing along many different dimensions can dramatically increase memory and compute requirements. Relatedly, in 3D computer vision, much recent progress has been powered by continuous rather than discretized representations.
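The two drawbacks can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the original work; the function names and the choice of 100 bins per dimension are illustrative assumptions.

```python
def num_discrete_actions(bins_per_dim, action_dims):
    """Number of cells in a full grid over the action space."""
    return bins_per_dim ** action_dims


def bin_width(low, high, bins_per_dim):
    """Best-case resolution along one discretized dimension."""
    return (high - low) / bins_per_dim


# With 100 bins per dimension, resolution stays fixed at 1% of the range,
# while the number of grid cells grows exponentially with the dimension.
for dims in (1, 2, 6):
    print(dims, num_discrete_actions(100, dims), bin_width(0.0, 1.0, 100))
```

Finer bins improve precision only linearly, while the grid size grows exponentially with the number of action dimensions.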

Today we announce the open source release of our implementation of Implicit Behavioral Cloning (Implicit BC), a new and simple approach to imitation learning that was presented at CoRL 2021 last week, with the goal of learning decisive policies without the drawbacks of discretization. Implicit BC achieves strong results on both simulated benchmark tasks and real-world robotic tasks that demand precise and decisive behavior. This includes achieving state-of-the-art (SOTA) results on the human-expert tasks from D4RL, our team's recent benchmark for offline reinforcement learning. On six of these seven tasks, Implicit BC outperforms Conservative Q Learning, the best previous method for offline RL. Interestingly, Implicit BC achieves these results without requiring any reward information. This means it can use relatively simple supervised learning rather than more complex reinforcement learning.

Implicit Behavioral Cloning

Our approach is a type of behavioral cloning, which is arguably the simplest way for robots to learn new skills from demonstrations. In behavioral cloning, an agent learns to mimic an expert's behavior using standard supervised learning. Traditionally, behavioral cloning involves training an explicit neural network (shown below, left) that takes in observations and outputs expert actions.

The key idea behind Implicit BC is instead to train a neural network that takes in both observations and actions and outputs a single number that is low for expert actions and high for non-expert actions (below, right), turning behavioral cloning into an energy-based modeling problem. After training, the Implicit BC policy generates an action by finding the action input that has the lowest score for a given observation.
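To illustrate the inference step, here is a minimal sketch that approximates the argmin by scoring uniformly sampled candidate actions. It is not the released implementation: `toy_energy` is a hypothetical stand-in for a trained energy network, and `implicit_policy` and its parameters are names assumed for this example.

```python
import numpy as np


def toy_energy(observation, actions):
    """Hypothetical stand-in for a trained energy network E(o, a).

    The energy is lowest where the action equals the negated observation,
    so the "expert" action for observation o is -o in this toy setup.
    """
    return np.sum((actions + observation) ** 2, axis=-1)


def implicit_policy(observation, energy_fn, num_samples=4096,
                    low=-1.0, high=1.0, seed=0):
    """Approximate argmin_a E(o, a) by scoring uniformly sampled actions."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(
        low, high, size=(num_samples, observation.shape[-1]))
    energies = energy_fn(observation, candidates)  # shape: (num_samples,)
    return candidates[np.argmin(energies)]


obs = np.array([0.3, -0.5])
action = implicit_policy(obs, toy_energy)
print(action)  # close to -obs for this toy energy
```

Sampling-based search like this is simple but scales poorly as the number of action dimensions grows, since the candidates cover the space ever more sparsely.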

A depiction of the difference between an explicit (left) policy and an implicit (right) policy. In the implicit policy, “argmin” means the action that, when paired with a particular observation, minimizes the value of the energy function.

To train Implicit BC models, we use an InfoNCE loss, which trains the network to output low energy for expert actions in the dataset and high energy for all other actions (see below). It is interesting to note that this idea of using a model that takes in both observations and actions is common in reinforcement learning, but not in supervised policy learning.
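As a rough illustration of this objective, the sketch below computes an InfoNCE-style loss from energies that are assumed to be already computed for one expert action and several sampled counter-example actions per observation. The function name, argument names, and shapes are assumptions made for this example, not the released API.

```python
import numpy as np


def info_nce_loss(expert_energy, negative_energies):
    """InfoNCE-style loss for one batch.

    expert_energy:     shape (batch,), energy of the expert action.
    negative_energies: shape (batch, num_negatives), energies of sampled
                       counter-example actions.
    """
    # Treat negative energy as a logit; the expert action is class 0.
    logits = -np.concatenate(
        [expert_energy[:, None], negative_energies], axis=1)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()


# Low energy on the expert action gives a small loss, and vice versa.
good = info_nce_loss(np.zeros(4), np.full((4, 8), 5.0))
bad = info_nce_loss(np.full(4, 5.0), np.zeros((4, 8)))
print(good, bad)
```

The loss is a softmax cross-entropy over negated energies, so minimizing it pushes the expert action's energy down relative to the counter-examples.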

Animation showing how an implicit model fits a discontinuity: in this case, the implicit model is trained to fit a step (Heaviside) function. Left: 2D plot fitting the black (X) training points; colors represent energy values (blue is low, brown is high). Center: 3D plot of the energy model during training. Right: Training loss curve.

Once trained, we find that implicit models are particularly good at precisely modeling the discontinuities (above) on which previous explicit models struggle (as in the first figure in this post). The result is policies that are newly capable of switching decisively between different behaviors.

But why do traditional explicit models struggle? Most modern neural networks use continuous activation functions. For example, TensorFlow, JAX, and PyTorch all ship with only continuous activation functions. When fitting discontinuous data, explicit networks built from these activation functions cannot represent discontinuities, so they must draw continuous curves between the data points. A key aspect of implicit models is that they gain the ability to represent sharp discontinuities, even though the network itself is composed only of continuous layers.
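A small toy example shows how taking the argmin of a smooth function can yield a discontinuous one. The energy below is a hand-written polynomial chosen for illustration, not a trained network, and the names `energy` and `implicit_fn` are assumptions for this sketch.

```python
import numpy as np


def energy(x, y):
    """Smooth (polynomial) toy energy: a double well in y, tilted by x."""
    return (y ** 2 - 1.0) ** 2 - x * y


def implicit_fn(x, y_grid=np.linspace(-2.0, 2.0, 4001)):
    """y*(x) = argmin_y E(x, y), found by brute-force search on a grid."""
    return y_grid[np.argmin(energy(x, y_grid))]


# The energy is continuous in both x and y, yet the minimizer jumps
# from about -1 to about +1 as x crosses zero.
left = implicit_fn(-0.01)
right = implicit_fn(0.01)
print(left, right)
```

The jump occurs because the lower of the two wells changes sides at x = 0; no individual layer or term needs to be discontinuous for the argmin to switch abruptly.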

We also establish theoretical foundations for this aspect, specifically a notion of universal approximation. This characterizes the class of functions that implicit neural networks can represent, which helps justify and guide future research.

Examples of fitting discontinuous functions with implicit models (top) compared to explicit models (bottom). The red-highlighted insets show that the implicit models represent the discontinuities (a) and (b), while the explicit models must draw continuous lines (c) and (d) between the discontinuities.

One challenge faced by our initial attempts at this approach was “high action dimensionality”, which means that the robot must decide how to coordinate many motors all at the same time. To scale to high action dimensionality, we use either autoregressive models or Langevin dynamics.
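To give a sense of the Langevin dynamics option, the sketch below runs noisy gradient descent on a toy 30-dimensional quadratic energy with an analytic gradient. It is an illustrative assumption rather than the released code: with a learned model, the gradient of the energy with respect to the action would come from automatic differentiation, and the step size, temperature, and function names here are hypothetical.

```python
import numpy as np


def langevin_minimize(grad_fn, init_action, num_steps=300,
                      step_size=0.1, temperature=1e-4, seed=0):
    """Noisy gradient descent on the energy (Langevin dynamics)."""
    rng = np.random.default_rng(seed)
    action = np.array(init_action, dtype=float)
    for _ in range(num_steps):
        noise = rng.standard_normal(action.shape)
        action = (action - step_size * grad_fn(action)
                  + np.sqrt(2.0 * step_size * temperature) * noise)
    return action


# Toy 30-dimensional quadratic energy E(a) = 0.5 * ||a - target||^2,
# whose gradient with respect to the action is (a - target).
target = np.linspace(-1.0, 1.0, 30)


def grad_energy(a):
    return a - target


sample = langevin_minimize(grad_energy, np.zeros(30))
print(np.linalg.norm(sample - target))  # small residual set by the noise
```

Unlike uniform sampling, each step here uses gradient information, so the cost per step does not depend on covering the whole action space with candidates.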

Highlights

In our experiments, we found that Implicit BC does particularly well in the real world, including performing an order of magnitude (10x) better than a baseline explicit BC model on the slide-then-insert task, which requires 1mm precision. On this task, the implicit model makes several consecutive precise adjustments (below) before sliding the block into place. The task demands multiple elements of decisiveness: many different solutions are possible because of the symmetry of the block and the arbitrary ordering of push maneuvers, and the robot must discontinuously decide when the block has been pushed far “enough” before switching to slide it in a different direction. This is in contrast to the indecisiveness often associated with continuously controlled robots.

Example task of sliding a block across a table and precisely inserting it into a slot. These are autonomous behaviors of our Implicit BC policies, using only images (from the camera shown) as input.

A diverse set of different strategies for accomplishing this task. These are autonomous behaviors of our Implicit BC policies, using only images as input.

In another challenging task, the robot needs to sort blocks by color. This presents a large number of possible solutions, because the sorting can be done in any order. On this task, explicit models are usually indecisive, while implicit models perform considerably better.

Comparison of implicit (left) and explicit (right) BC models on a challenging continuous multi-item sorting task. (4x speed)

In our tests, implicit BC models also show robust reactive behavior even when we try to interfere with the robot, despite the model never having seen human hands.

Robust behavior of the implicit BC model despite interference with the robot.

Overall, we find that Implicit BC policies achieve strong results compared to state-of-the-art offline reinforcement learning methods across several different task domains. These results include challenging tasks that have a small number of demonstrations (around 19), high-dimensional image-based observations, and/or high-dimensional action spaces of up to 30 dimensions, which is a large number of actuators to have on a robot.

Implicit BC policy learning results compared to baselines across several domains.

Conclusion

Despite its limitations, behavioral cloning with supervised learning remains one of the simplest ways for robots to learn from examples of human behavior. As we showed here, replacing explicit policies with implicit policies when doing behavioral cloning allows robots to overcome the “struggle of decisiveness” and imitate much more complex and precise behaviors. Although the focus of our results here was on robot learning, the ability of implicit functions to model sharp discontinuities and multimodal labels may be of broader interest in other application domains of machine learning as well.

Acknowledgments

Pete and Corey summarized research performed together with other co-authors: Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. The authors would also like to thank Vikas Sindwhani for advice on the direction of the project; Steve Xu, Robert Baruch, and Arnab Bose for robot software infrastructure; Jake Varley and Alexa Greenberg for ML infrastructure; and Kamyar Ghasemipour, Jon Barron, Eric Jang, Stephen Tu, Sumet Singh, Jean-Jacques Slotine, Anirudha Majumdar, and Vincent Vanhoucke for helpful feedback and discussions.

