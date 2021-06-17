



Yifeng Jiang, Research Intern, and Jie Tan, Research Scientist, Google Robotics

Simulation enables rapid prototyping in a variety of engineering disciplines with minimal human effort. In robotics, physics simulation provides a safe and inexpensive virtual playground for robots to acquire physical skills through techniques such as deep reinforcement learning (DRL). However, because the hand-crafted physics in a simulator does not exactly match the real world, control policies trained entirely in simulation can fail when tested on real hardware. This challenge is known as the sim-to-real gap, or the domain adaptation problem. The sim-to-real gap in perception-based tasks (such as grasping) has been addressed using RL-CycleGAN and RetinaGAN, but a gap caused by the dynamics of robotic systems remains. This prompts us to ask: can we learn a more accurate physics simulator from a handful of real robot trajectories? If so, such an improved simulator could be used to refine the robot controller with standard DRL training so that it succeeds in the real world.

In our ICRA 2021 publication, “SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning”, we propose to treat the physics simulator as a learnable component that is trained by DRL with a special reward function. The reward penalizes discrepancies between the trajectories (that is, the movement of the robot over time) generated in simulation and a small number of trajectories collected on the real robot. We use generative adversarial networks (GANs) to provide such a reward, and we formulate a hybrid simulator that combines learnable neural networks with analytical physics equations to balance model expressiveness and physical correctness. On robotic locomotion tasks, our method outperforms multiple strong baselines, including domain randomization.

Learnable Hybrid Simulator

A traditional physics simulator is a program that solves differential equations to simulate the movement and interaction of objects in a virtual world. This requires building different physical models to represent different environments. For example, if a robot walks on a mattress, the deformation of the mattress needs to be taken into account (for example, using the finite element method). However, because of the variety of scenarios that robots can encounter in the real world, such environment-specific modeling is tedious (or even impossible), which is why it is useful to take a machine learning-based approach instead. A simulator can be learned entirely from data, but if the training data does not cover a wide enough variety of situations, the learned simulator may violate the laws of physics (that is, deviate from real-world dynamics) when it has to simulate situations for which it was not trained. As a result, a robot trained in such a limited simulator is more likely to fail in the real world.

To overcome this complication, we build a hybrid simulator that combines learnable neural networks with physics equations. Specifically, we take the simulator parameters that are often manually defined, namely contact parameters (such as friction and restitution coefficients) and motor parameters (such as motor gains), and replace them with a learnable simulation parameter function, because the unmodeled details of contact and motor dynamics are major causes of the sim-to-real gap. Unlike traditional simulators, where these parameters are treated as constants, in the hybrid simulator they are state-dependent: they can change according to the state of the robot. For example, motors can become weaker at higher speeds. Such typically unmodeled physical phenomena can be captured by the state-dependent simulation parameter functions. In addition, contact and motor parameters are usually difficult to identify and can change due to wear and tear, but our hybrid simulator can learn them automatically from data. For example, rather than manually specifying the parameters of the robot’s foot against every surface it might touch, the simulation learns these parameters from the training data.
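The split between a learnable parameter function and fixed physics equations can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SimGAN implementation: `param_fn`, `physics_step`, `hybrid_step`, and the `weights` dictionary are hypothetical names, a small closed-form function stands in for the neural network, and a 1D body with viscous friction stands in for the full robot and contact model.

```python
import math

def param_fn(state, weights):
    """Hypothetical state-dependent parameter function.

    Stands in for the learnable neural network: it maps the robot state
    to simulation parameters instead of using fixed constants.
    """
    _, velocity = state
    speed = abs(velocity)
    # Motor gain shrinks as speed grows (motors can weaken at high speed).
    motor_gain = weights["gain0"] / (1.0 + weights["gain_decay"] * speed)
    # A sigmoid keeps the (viscous) friction coefficient inside (0, 1).
    friction = 1.0 / (1.0 + math.exp(-weights["friction_bias"]))
    return {"motor_gain": motor_gain, "friction": friction}

def physics_step(state, target_velocity, params, dt=0.01, mass=1.0):
    """Analytical part: Newton's second law with semi-implicit Euler."""
    position, velocity = state
    motor_force = params["motor_gain"] * (target_velocity - velocity)
    friction_force = -params["friction"] * velocity
    acceleration = (motor_force + friction_force) / mass
    velocity = velocity + acceleration * dt
    position = position + velocity * dt
    return (position, velocity)

def hybrid_step(state, target_velocity, weights):
    """One hybrid step: learned parameters first, then physics equations."""
    params = param_fn(state, weights)
    return physics_step(state, target_velocity, params)

# Assumed example weights; in the article these would be learned from data.
weights = {"gain0": 5.0, "gain_decay": 0.5, "friction_bias": 0.0}
state = (0.0, 0.0)
for _ in range(100):
    state = hybrid_step(state, target_velocity=1.0, weights=weights)
print(state)
```

Only `param_fn` would be trained in this sketch. Whatever parameters it outputs, `physics_step` still integrates the same equations of motion.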

Comparison of a conventional simulator and our hybrid simulator.

The rest of the hybrid simulator consists of physics equations that ensure that the simulation follows basic laws of physics such as energy conservation, bringing it closer to the real world and reducing the gap between simulation and reality.

In the previous mattress example, a learnable hybrid simulator can mimic the contact force from the mattress. Since the learned contact parameters are state-dependent, the simulator can adjust the contact force based on the distance and speed of the robot’s foot against the mattress, mimicking the effects of deformable surface stiffness and damping. As a result, there is no need to analytically devise a model dedicated to deformable surfaces.
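For reference, the stiffness and damping effect that the learned contact parameters are described as mimicking can be written as a simple spring-damper contact force. This is only an illustrative sketch of that hand-written kind of model, which the hybrid simulator is meant to make unnecessary. The function name, the surface placed at height 0, and the stiffness and damping values are all assumptions.

```python
def contact_force(height, vertical_velocity, stiffness=2000.0, damping=50.0):
    """Normal force on a foot from a soft surface located at height 0.

    The force depends on how far the foot has sunk into the surface
    (penetration) and how fast it is moving (vertical_velocity).
    """
    penetration = -height
    if penetration <= 0.0:
        return 0.0  # foot is above the surface: no contact
    force = stiffness * penetration - damping * vertical_velocity
    return max(force, 0.0)  # the surface can push but never pull

print(contact_force(0.01, -1.0))    # foot above the surface
print(contact_force(-0.01, 0.0))    # foot resting 1 cm into the surface
print(contact_force(-0.01, -0.5))   # foot still sinking: larger force
```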

Using GANs for Simulator Learning

Successfully learning the simulation parameter functions described above results in a hybrid simulator that can generate trajectories similar to those collected on the real robot. The key to this learning is defining a metric of similarity between trajectories. GANs were initially designed to generate synthetic images that share the same distribution, or “style”, as a small number of real images, but they can also be used to generate synthetic trajectories that are indistinguishable from real ones. A GAN has two main parts: a generator that learns to generate new instances, and a discriminator that evaluates how similar the new instances are to the training data. In this case, the learnable hybrid simulator serves as the GAN generator, and the GAN discriminator provides the similarity score.

The GAN discriminator provides a similarity metric that compares the movement of a simulated robot with that of a real robot.

Fitting the parameters of a simulation model to data collected in the real world is a process called system identification (SysID), which is common practice in many engineering disciplines. For example, the stiffness parameter of a deformable surface can be identified by measuring the displacement of the surface under various pressures. This process is typically manual and tedious, but using GANs can make it much more efficient. For example, SysID often requires a hand-crafted metric for the discrepancy between simulated and real trajectories. With GANs, such a metric is learned automatically by the discriminator. In addition, to calculate the discrepancy metric, traditional SysID requires each simulated trajectory to be paired with a corresponding real trajectory generated using the same control policy. Because the GAN discriminator takes only one trajectory as input and calculates the likelihood that it was collected in the real world, this one-to-one pairing is not needed.
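The unpaired scoring described above can be illustrated with a toy discriminator. This is a sketch under stated assumptions rather than the network used in SimGAN: a logistic regression on two hand-picked trajectory features stands in for the neural discriminator, and `rollout`, `features`, and the damping values 0.8 ("real") and 0.2 ("simulated") are hypothetical.

```python
import math
import random

def rollout(damping, v0, steps=20, dt=0.1):
    """Toy dynamics: a velocity that decays under a damping coefficient."""
    velocities, v = [], v0
    for _ in range(steps):
        v -= damping * v * dt
        velocities.append(v)
    return velocities

def features(trajectory):
    """Summarize one trajectory as [bias, mean velocity, final velocity]."""
    mean_v = sum(trajectory) / len(trajectory)
    return [1.0, mean_v, trajectory[-1]]

def discriminator(trajectory, w):
    """Estimated probability that a single trajectory came from the real robot."""
    z = sum(wi * xi for wi, xi in zip(w, features(trajectory)))
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(0)
# Unpaired data: real and simulated rollouts start from unrelated initial speeds.
real = [rollout(0.8, rng.uniform(0.9, 1.1)) for _ in range(50)]
sim = [rollout(0.2, rng.uniform(0.9, 1.1)) for _ in range(50)]
data = [(t, 1.0) for t in real] + [(t, 0.0) for t in sim]

# Stochastic gradient ascent on the log-likelihood (logistic regression).
w = [0.0, 0.0, 0.0]
for _ in range(500):
    rng.shuffle(data)
    for trajectory, label in data:
        error = label - discriminator(trajectory, w)
        w = [wi + 0.1 * error * xi for wi, xi in zip(w, features(trajectory))]

score_real = discriminator(rollout(0.8, 1.0), w)
score_sim = discriminator(rollout(0.2, 1.0), w)
print(score_real, score_sim)
```

The discriminator scores each trajectory on its own, so no simulated trajectory has to be matched with a real one generated under the same control policy.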

Using Reinforcement Learning (RL) to Learn the Simulator and Refine the Policy

Putting everything together, we formulate simulation learning as an RL problem. A neural network learns the state-dependent contact and motor parameters from a small number of real-world trajectories. The neural network is optimized to minimize the error between the simulated and real trajectories. Note that it is important to minimize this error over an extended period of time: a simulation that accurately predicts the more distant future leads to a better control policy. RL is well suited to this because it optimizes the reward accumulated over time rather than just a single-step reward.
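A small numerical sketch shows why the error is accumulated over a long horizon. Note the substitutions: a hand-written position error stands in for the discriminator reward, and an exhaustive search over a few candidate values stands in for the RL optimizer. The toy dynamics, `TRUE_DAMPING`, and the candidate list are assumptions made for illustration.

```python
def rollout(damping, steps, v0=1.0, dt=0.1):
    """Toy dynamics: positions of a sliding body whose velocity decays."""
    positions, x, v = [], 0.0, v0
    for _ in range(steps):
        v -= damping * v * dt
        x += v * dt
        positions.append(x)
    return positions

def accumulated_reward(sim, real):
    """Sum of per-step rewards; each reward is the negative position error."""
    return sum(-abs(s - r) for s, r in zip(sim, real))

TRUE_DAMPING = 0.8  # plays the role of the real robot (assumed value)
real = rollout(TRUE_DAMPING, steps=100)

# A slightly wrong simulator looks almost perfect after a single step...
candidate = 0.6
one_step = accumulated_reward(rollout(candidate, 1), real[:1])
# ...but its error compounds over a long rollout.
long_run = accumulated_reward(rollout(candidate, 100), real)
print(one_step, long_run)

# Stand-in for the optimizer: keep the candidate with the best long-horizon reward.
candidates = [0.2, 0.4, 0.6, 0.8, 1.0]
best = max(candidates, key=lambda c: accumulated_reward(rollout(c, 100), real))
print(best)
```

The single-step reward barely separates the wrong candidate from the correct one, while the accumulated reward does so clearly, which is the motivation for optimizing over whole trajectories.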

After the hybrid simulator has been learned and has become more accurate, we use RL again to refine the robot’s control policy within the simulation (for example, walking across a surface, as shown below).

Following the arrows clockwise: (upper left) record a small number of the robot’s failed attempts in the target domain (for example, a real-world proxy in which the leg shown in red is modified to be much heavier than in the source domain); (upper right) learn the hybrid simulator to match the trajectories collected in the target domain; (lower right) refine the control policy in this learned simulator; (lower left) test the refined controller directly in the target domain.

Evaluation

Because access to real robots was limited during 2020, we created a second, different simulation (the target domain) as a proxy for the real world. The changes in dynamics between the source and target domains are large enough to approximate different sim-to-real gaps (for example, making one leg heavier, or walking on a deformable surface instead of a hard floor). We evaluated whether our hybrid simulator, with no knowledge of these changes, could learn to match the dynamics of the target domain, and whether the policy refined in this learned simulator could be successfully deployed in the target domain.

The qualitative results below show that simulation learning with less than 10 minutes of data collected in the target domain (where the floor is deformable) can generate a refined policy that performs much better, for two robots with different morphologies and dynamics.

Performance comparison of the initial and refined policies in the target domain (deformable floor) for the hopper and the quadruped robot.

The quantitative results below show that SimGAN outperforms multiple state-of-the-art baselines, including domain randomization (DR) and direct fine-tuning in the target domain (FT).

Comparison of policy performance using different sim-to-real transfer methods in three different target domains for the quadruped robot: locomotion on a deformable surface, with weakened motors, and with a heavier body.

Conclusion

The sim-to-real gap is one of the key bottlenecks that prevent robots from harnessing the power of reinforcement learning. We tackle this challenge by learning a simulator that can more faithfully model real-world dynamics while using only a small amount of real-world data. The control policy refined in this simulator can be successfully deployed. To achieve this, we augment a classical physics simulator with learnable components and train this hybrid simulator using adversarial reinforcement learning. So far we have tested its application to locomotion tasks, and we hope to build on this general framework by applying it to other robot learning tasks, such as navigation and manipulation.

