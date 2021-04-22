



Posted by John D. Co-Reyes, Research Intern, and Yingjie Miao, Senior Software Engineer, Google Research

A long-term, overarching goal of reinforcement learning (RL) research is to design a single general-purpose learning algorithm that can solve a wide variety of problems. This goal is daunting, however, because the taxonomy of RL algorithms is very large and designing a new RL algorithm requires extensive tuning and validation. One possible solution is to devise a meta-learning method that automatically designs new RL algorithms that generalize to a wide range of tasks.

In recent years, AutoML has been very successful in automating the design of machine learning components such as neural network architectures and model update rules. One example is Neural Architecture Search (NAS), which has been used to develop better neural network architectures for image classification, as well as efficient architectures that run on phones and hardware accelerators. Beyond NAS, AutoML-Zero shows that it is even possible to learn an entire algorithm from scratch using basic mathematical operations. A common theme of these approaches is that the neural network architecture or the entire algorithm is represented as a graph, and a separate algorithm optimizes that graph for a particular objective.

These earlier approaches were designed for supervised learning, where the overall algorithm is simpler. RL, however, has more algorithmic components that are potential targets for design automation (e.g., neural network architectures for agent networks, strategies for sampling from the replay buffer, and the overall formulation of the loss function), and it is not always clear what the best model update procedure is for integrating these components. Previous efforts to automate the discovery of RL algorithms have focused primarily on model update rules. These approaches learn the optimizer or the RL update procedure itself, and typically represent the update rule with a neural network such as an RNN or CNN, which can be efficiently optimized with gradient-based methods. However, the learned rules are neither interpretable nor generalizable, because the learned weights are opaque and domain-specific.

In our paper "Evolving Reinforcement Learning Algorithms", accepted at ICLR 2021, we show that it is possible to learn new, analytically interpretable and generalizable RL algorithms by using a graph representation and applying optimization techniques from the AutoML community. In particular, we represent the loss function, which is used to optimize an agent's parameters over its experience, as a computational graph, and use Regularized Evolution to evolve a population of computational graphs over a set of simple training environments. This yields increasingly better RL algorithms, and the discovered algorithms generalize to more complex environments, even those with visual observations such as Atari games.

RL Algorithm as a Computational Graph

Inspired by NAS, which searches over the space of graphs that represent neural network architectures, we meta-learn RL algorithms by representing the loss function of an RL algorithm as a computational graph. Here the loss function is a directed acyclic graph whose nodes represent inputs, operators, parameters, and outputs. For example, in the computational graph for DQN, the input nodes contain data from the replay buffer, the operator nodes include neural network operators and basic mathematical operators, and the output node represents the loss, which is minimized by gradient descent.
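To make the graph view concrete, here is a minimal sketch of the DQN loss written as a small directed acyclic graph and evaluated on a toy batch. The `Node` class, the `evaluate` function, and the input names are hypothetical and chosen only for illustration; they are not the search language used in the paper, and terminal-state handling is left out for brevity.

```python
class Node:
    """A graph node: either a named input (op is a str) or an operator over child nodes."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = tuple(inputs)


def evaluate(node, feed):
    """Recursively evaluate a node; `feed` maps input names to values."""
    if isinstance(node.op, str):
        return feed[node.op]
    return node.op(*(evaluate(child, feed) for child in node.inputs))


def build_dqn_loss():
    # Input nodes: data that would come from the replay buffer and the networks.
    q_sa = Node("q_sa")        # Q(s_t, a_t) for each transition
    reward = Node("reward")    # r_t for each transition
    gamma = Node("gamma")      # discount factor (scalar)
    q_next = Node("q_next")    # target-network Q(s_{t+1}, .) per transition

    # Operator nodes: basic math composed into the squared Bellman error.
    max_q = Node(lambda rows: [max(row) for row in rows], [q_next])
    target = Node(
        lambda r, g, m: [ri + g * mi for ri, mi in zip(r, m)],
        [reward, gamma, max_q],
    )
    delta = Node(lambda q, y: [qi - yi for qi, yi in zip(q, y)], [q_sa, target])
    squared = Node(lambda d: [di * di for di in d], [delta])

    # Output node: mean squared Bellman error over the batch.
    return Node(lambda s: sum(s) / len(s), [squared])


batch = {
    "q_sa": [1.0, 2.0],
    "reward": [0.0, 1.0],
    "gamma": 0.5,
    "q_next": [[1.0, 2.0], [0.0, 4.0]],
}
loss = evaluate(build_dqn_loss(), batch)
```

Because the loss is just a graph of nodes, a search procedure can swap an operator or rewire an input to propose a different algorithm without changing the evaluation code.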

This representation has several advantages. It is expressive enough to define existing algorithms as well as new, undiscovered ones. It is also interpretable: the graph can be analyzed in the same way as a human-designed RL algorithm, which makes it easier to interpret than approaches that use black-box function approximators for the entire RL update procedure. Once researchers understand why a learned algorithm performs well, they can modify its internal components to improve it, or transfer the useful components to other problems. Finally, the representation supports general algorithms that can solve a wide variety of problems.

Example computational graph for DQN, which computes the squared Bellman error.

We implemented this representation using the PyGlove library, which conveniently turns the graph into a search space that can be optimized with regularized evolution.

Evolving RL Algorithms

We use an evolution-based approach to optimize the RL algorithms of interest. First, we initialize a population of training agents with randomized graphs. This population of agents is trained in parallel over a set of training environments. Agents first train on a hurdle environment, an easy environment such as CartPole, which is intended to quickly weed out poorly performing programs.

If an agent cannot solve the hurdle environment, training stops early with a score of zero. Otherwise, training proceeds to more difficult environments (e.g., Lunar Lander and simple MiniGrid environments). The algorithm's performance is evaluated and used to update the population, in which the more promising algorithms are mutated further. To reduce the search space, we use a functional equivalence checker, which skips newly proposed algorithms that are functionally identical to previously examined ones. This loop continues as new mutated candidate algorithms are trained and evaluated. At the end of training, we select the best algorithm and evaluate its performance on a set of unseen test environments.
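The loop described above can be sketched in a few dozen lines. This is a toy illustration under stated assumptions, not the authors' implementation: the "program" is a tuple of three integers rather than a loss graph, `hurdle_score` and `full_score` are made-up stand-ins for training in real environments, and a functionally equivalent candidate reuses a cached score here (dropping it outright is another option). The population update follows the usual regularized (aging) evolution pattern: mutate the winner of a random tournament, then remove the oldest member.

```python
import collections
import random

PROBES = [-1.0, 0.0, 2.0]  # fixed inputs used to fingerprint a program's behavior


def run(program, x):
    # Toy "algorithm": the tuple (a, b, c) is read as a*x + b*x + c, so distinct
    # tuples such as (1, 2, 0) and (2, 1, 0) can be functionally identical.
    a, b, c = program
    return a * x + b * x + c


def fingerprint(program):
    # Programs with equal outputs on all probes are treated as equivalent.
    return tuple(run(program, x) for x in PROBES)


def hurdle_score(program):
    # Stand-in for a cheap hurdle environment: pass only if the slope is positive.
    return 1.0 if run(program, 1.0) > run(program, 0.0) else 0.0


def full_score(program):
    # Stand-in for the harder environments; peaks at slope 3 and offset 1.
    a, b, c = program
    return 1.0 / (1.0 + abs(a + b - 3) + abs(c - 1))


def mutate(program, rng):
    child = list(program)
    child[rng.randrange(3)] += rng.choice([-1, 1])
    return tuple(child)


def evolve(seed_program, population_size=20, tournament_size=5, steps=300, seed=0):
    rng = random.Random(seed)
    cache = {}  # fingerprint -> score, used for the functional equivalence check

    def evaluate(program):
        key = fingerprint(program)
        if key in cache:  # equivalent to an earlier candidate: skip training
            return cache[key]
        # Failing the hurdle ends evaluation early with a score of zero.
        score = full_score(program) if hurdle_score(program) > 0 else 0.0
        cache[key] = score
        return score

    seed_member = (seed_program, evaluate(seed_program))
    population = collections.deque(seed_member for _ in range(population_size))
    best = seed_member
    for _ in range(steps):
        contenders = rng.sample(list(population), tournament_size)
        parent = max(contenders, key=lambda member: member[1])
        child = mutate(parent[0], rng)
        member = (child, evaluate(child))
        population.append(member)
        population.popleft()  # aging: the oldest member is removed
        if member[1] > best[1]:
            best = member
    return best, len(cache)


best_member, distinct_behaviors = evolve(seed_program=(1, 0, 0))
```

Removing the oldest member rather than the worst one keeps the population turning over, so a candidate that scored well by luck does not persist indefinitely.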

We used a population of about 300 agents for the experiments and observed good candidate loss functions evolving after 20,000 to 50,000 mutations, which required about three days of training. We were able to train on CPUs because the training environments were simple, which kept the computational and energy cost of training under control. To further control the cost of training, we seeded the initial population with human-designed RL algorithms such as DQN.

Overview of the meta-learning method. A newly proposed algorithm must first perform well on a hurdle environment before it is trained on a set of harder environments. Algorithm performance is used to update the population, in which better-performing algorithms are mutated into new algorithms. At the end of training, the best-performing algorithm is evaluated on the test environments.

Learned Algorithms

We focus on two discovered algorithms that show good generalization performance. The first is DQNReg, which builds on DQN by adding a weighted penalty on the Q-values to the usual squared Bellman error. The second learned loss function, DQNClipped, is more complex, but its dominating term has a simple form: the maximum of the Q-value and the squared Bellman error (modulo a constant). Both algorithms can be viewed as ways of regularizing the Q-values. While DQNReg adds a soft constraint, DQNClipped can be interpreted as a kind of constrained optimization that minimizes the Q-values if they become too large. We show that this learned constraint kicks in during the early stage of training, when overestimating the Q-values is a potential problem. Once the constraint is satisfied, the loss instead minimizes the original squared Bellman error.
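As a rough per-transition sketch of the losses described above: the default `weight` is an illustrative value rather than one taken from this post, and `dqn_clipped_dominant_term` covers only the dominating term of DQNClipped, with the placement of the constant being an assumption. The function names are hypothetical.

```python
def bellman_error(q_sa, reward, gamma, max_q_next):
    # delta = Q(s_t, a_t) - (r_t + gamma * max_a Q_target(s_{t+1}, a))
    return q_sa - (reward + gamma * max_q_next)


def dqn_loss(q_sa, reward, gamma, max_q_next):
    # Standard DQN: the squared Bellman error.
    return bellman_error(q_sa, reward, gamma, max_q_next) ** 2


def dqn_reg_loss(q_sa, reward, gamma, max_q_next, weight=0.1):
    # DQNReg: squared Bellman error plus a weighted penalty on the Q-value.
    # `weight` is an illustrative default (assumption).
    delta = bellman_error(q_sa, reward, gamma, max_q_next)
    return weight * q_sa + delta ** 2


def dqn_clipped_dominant_term(q_sa, reward, gamma, max_q_next, const=0.0):
    # Dominating term only: max of the Q-value and the squared Bellman error,
    # up to a constant. Where `const` enters is an assumption.
    delta = bellman_error(q_sa, reward, gamma, max_q_next)
    return max(q_sa, delta ** 2 + const)


# A transition whose Bellman error is zero: the target is 1.0 + 0.5 * 2.0 = 2.0.
zero_error = dict(q_sa=2.0, reward=1.0, gamma=0.5, max_q_next=2.0)
plain = dqn_loss(**zero_error)
regularized = dqn_reg_loss(**zero_error)
clipped = dqn_clipped_dominant_term(**zero_error)
```

On this transition the plain DQN loss is zero, while both learned variants still return a positive value that grows with the Q-value, which is the sense in which they push against large Q-values.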

In-depth analysis shows that while baselines like DQN commonly overestimate Q-values, the learned algorithms address this issue in different ways. DQNReg underestimates the Q-values, while DQNClipped behaves similarly to double DQN in that it slowly approaches the ground truth without overestimating it.

It is worth pointing out that these two algorithms emerge consistently when evolution is seeded with DQN. When learning from scratch, the method rediscovers the TD algorithm. For completeness, we release a dataset of the top 1000 performing algorithms discovered during evolution, so that curious readers can further investigate the properties of these learned loss functions.

Overestimated values are a general problem in value-based RL. Our method learns algorithms that have found a way to regularize the Q-values and thereby reduce overestimation.

Learned Algorithm Generalization Performance

In RL, generalization usually refers to a trained policy generalizing across tasks. In this work, however, we are interested in algorithmic generalization performance, meaning how well an algorithm works across a set of environments. On a set of classical control environments, the learned algorithms match the baselines on the dense reward tasks (CartPole, Acrobot, LunarLander) and outperform DQN on the sparser reward task, MountainCar.

Performance of the learned algorithms versus baselines on classical control environments.

On a set of sparse-reward MiniGrid environments that test a variety of different tasks, DQNReg significantly outperforms the baselines in both training and test environments, in terms of sample efficiency and final performance. In fact, the effect is even more pronounced in the test environments, which vary in size, configuration, and the presence of new obstacles such as lava.

Training environment performance versus training steps, measured by episode return over 10 training seeds. DQNReg matches or exceeds the baselines in sample efficiency and final performance, and significantly exceeds the baselines on unseen test environments.

We visualize the performance of standard DDQN and the learned algorithm DQNReg on several MiniGrid environments. The starting location, wall configuration, and object configuration of these environments are randomized at each reset, so the agent must generalize rather than simply memorize the environment. While DDQN often struggles to learn any meaningful behavior, DQNReg efficiently learns the optimal behavior.

We see improved performance on image-based Atari games as well, even though the training environments were not image-based. This suggests that meta-training on a set of cheap but diverse training environments with a generalizable algorithm representation may yield algorithms that generalize well beyond those environments.

Env          DQN       DDQN      PPO       DQNReg
Asteroid     1364.5    734.7     2097.5    2390.4
Bowling      50.4      68.1      40.1      80.5
Boxing       88.0      91.6      94.6      100.0
RoadRunner   39544.0   44127.0   35466.0   65516.0

Performance of the learned algorithm, DQNReg, against baselines on several Atari games. Performance is evaluated over 200 test episodes every 1 million steps.

Conclusion

In this post, we described learning new, interpretable RL algorithms by representing their loss functions as computational graphs and evolving a population of agents over this representation. The computational graph formulation allows researchers both to build on human-designed algorithms and to study the learned algorithms with the same mathematical toolset used for existing algorithms. We analyzed a few of the learned algorithms and can interpret them as a form of entropy regularization that prevents value overestimation. These learned algorithms can outperform the baselines and generalize to unseen environments. The top-performing algorithms are available for further analytical study.

We hope that future work will extend to a wider variety of RL settings, such as actor-critic algorithms and offline RL. Furthermore, we hope that this work will lead to machine-assisted algorithm development, in which computational meta-learning helps researchers find new directions to pursue and incorporate learned algorithms into their own work.

Acknowledgments

We thank our co-authors Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, and Aleksandra Faust. We also thank Luke Metz for helpful early discussions and feedback on the paper, Hanjun Dai for early discussions on related research ideas, Xingyou Song and Krzysztof Choromanski for supporting the infrastructure, and Kevin Wu and Jongwook Choi for helping us choose the environments. Finally, we thank Tom Small for designing the animation for this post.
