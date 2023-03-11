



Posted by Danny Driess (Student Researcher) and Pete Florence (Research Scientist, Google Robotics)

Recent years have seen great progress across machine learning, from models that can explain jokes or answer visual questions in a variety of languages to models that can generate images from text descriptions. Such innovation has been made possible by the increased availability of large datasets and by new advances that enable training models on these data. Scaling robotics models has met with some success, but the lack of datasets at a scale comparable to large text corpora and image datasets has left robotics lagging behind other domains.

Today, we introduce PaLM-E, a new generalist robotics model that overcomes these problems by transferring knowledge from varied visual and language domains to a robotics system. We started with a powerful large language model, PaLM, and “embodied” it (the “E” in PaLM-E) by complementing it with sensor data from the robotic agent. This is the key difference from previous efforts to bring large language models to robotics: rather than relying on textual input alone, PaLM-E trains the language model to directly ingest raw streams of robot sensor data. The resulting model not only enables highly effective robot learning, but is also a state-of-the-art general-purpose visual-language model, while maintaining excellent language-only task capabilities.

Embodied Language Model and Visual Language Generalist

On the one hand, PaLM-E was primarily developed as a model for robotics, and it solves a variety of tasks on multiple types of robots and for multiple modalities (images, robot states, and neural scene representations). At the same time, PaLM-E is a generally capable vision-and-language model. It can perform visual tasks such as describing images, detecting objects, and classifying scenes, and is also proficient at language tasks such as quoting poetry, solving math equations, and generating code.

PaLM-E combines our most recent large language model, PaLM, with one of our most advanced vision models, ViT-22B. The largest instantiation of this approach, built on PaLM-540B, is called PaLM-E-562B. It sets a new state of the art on the visual-language OK-VQA benchmark without task-specific fine-tuning, while retaining essentially the same general language performance as PaLM-540B.

How does PaLM-E work?

Technically, PaLM-E works by injecting observations into a pre-trained language model. This is achieved by converting sensor data, such as images, into representations through a procedure comparable to how words in natural language are processed by language models.

Language models rely on a mechanism to represent text mathematically in a way that neural networks can process. This is achieved by first splitting the text into so-called tokens that encode (sub)words. Each token is associated with a high-dimensional vector of numbers, the token embedding. The language model can then apply mathematical operations (such as matrix multiplication) to the resulting sequence of vectors to predict the next most likely word token. By feeding the newly predicted word back into the input, the language model can iteratively generate longer and longer text.
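The steps just described (tokenize, look up embeddings, predict the next token, feed it back in) can be sketched in a few lines. This is a toy illustration only, not PaLM: the vocabulary, the random weights, the mean-pooling "model", and the function names `tokenize`, `next_token`, and `generate` are all assumptions made for the example.

```python
import random

# Toy vocabulary; a real model uses tens of thousands of (sub)word tokens.
VOCAB = ["<eos>", "push", "the", "blue", "cube", "red", "block"]
EMBED_DIM = 8

rng = random.Random(0)  # fixed seed so the sketch is deterministic
# One vector per token: the token embedding table (random here, learned in practice).
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]
# Toy output projection from embedding space to one score (logit) per token.
OUTPUT_WEIGHTS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]


def tokenize(text):
    """Split text on whitespace and map each piece to its vocabulary index."""
    return [VOCAB.index(piece) for piece in text.lower().split()]


def next_token(token_ids):
    """Predict the next token id from the mean of the input embeddings.

    Mean-pooling plus one matrix multiplication stands in for the
    transformer layers of a real language model.
    """
    vectors = [EMBEDDINGS[i] for i in token_ids]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    logits = [sum(w * x for w, x in zip(row, mean)) for row in OUTPUT_WEIGHTS]
    return max(range(len(VOCAB)), key=lambda i: logits[i])  # greedy choice


def generate(prompt, max_new_tokens=5):
    """Autoregressive loop: append each predicted token to the input."""
    token_ids = tokenize(prompt)
    for _ in range(max_new_tokens):
        new_id = next_token(token_ids)
        token_ids.append(new_id)
        if VOCAB[new_id] == "<eos>":
            break
    return [VOCAB[i] for i in token_ids]


print(generate("push the blue"))
```

Because the weights are random, the generated continuation is meaningless; the point is only the shape of the loop, in which every new token becomes part of the input for the next prediction.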

Inputs to PaLM-E are text and other modalities (images, robot states, scene embeddings, etc.) in arbitrary order, which we call “multimodal sentences”. For example, an input might be a question that interleaves text with two images, each image occupying its own position in the sentence. The output is text generated autoregressively by PaLM-E, which could be an answer to a question or a sequence of decisions in text form.

PaLM-E model architecture. We show how PaLM-E incorporates different modalities (states and images) and addresses tasks through multimodal language modeling.

The idea of PaLM-E is to train encoders that convert a variety of inputs into the same space as the natural word token embeddings. These continuous inputs are mapped into something that resembles “words” (although they do not necessarily form a discrete set). Since both the word and image embeddings now have the same dimensionality, they can be fed into the language model.
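A minimal sketch of this idea follows, assuming a single linear projection from image features into the token-embedding space and one vector per image. Both are simplifications chosen for the example; the dimensions, the random weights, and the names `project_image` and `embed_multimodal_sentence` are hypothetical and are not taken from the PaLM-E implementation.

```python
import random

EMBED_DIM = 4       # dimensionality of the language model's token embeddings (toy value)
IMAGE_FEAT_DIM = 6  # dimensionality of the vision encoder's output (toy value)

rng = random.Random(0)
# Word embeddings for a tiny vocabulary (random here, learned in practice).
TOKEN_EMBEDDINGS = {
    word: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for word in ["what", "is", "in", "?"]
}
# Projection matrix (EMBED_DIM x IMAGE_FEAT_DIM) that maps image features
# into the same space as the word embeddings.
PROJECTION = [
    [rng.uniform(-1, 1) for _ in range(IMAGE_FEAT_DIM)] for _ in range(EMBED_DIM)
]


def project_image(features):
    """Map one image feature vector into the token-embedding space."""
    return [sum(w * x for w, x in zip(row, features)) for row in PROJECTION]


def embed_multimodal_sentence(items):
    """Turn a mixed list of words (str) and image features (list of floats)
    into one sequence of vectors that all share EMBED_DIM."""
    sequence = []
    for item in items:
        if isinstance(item, str):
            sequence.append(TOKEN_EMBEDDINGS[item])
        else:
            sequence.append(project_image(item))
    return sequence


image_features = [0.1] * IMAGE_FEAT_DIM  # stand-in for a vision encoder's output
sentence = ["what", "is", "in", image_features, "?"]
vectors = embed_multimodal_sentence(sentence)
print(len(vectors), [len(v) for v in vectors])
```

After this step the language model no longer needs to know which positions came from text and which from an image: every position is a vector of the same size.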

We initialize PaLM-E for training with pre-trained models for both the language (PaLM) and vision components (Vision Transformer, a.k.a. ViT). All parameters of the model can be updated during training.

Transferring knowledge from large-scale training to robots

PaLM-E offers a new paradigm for training a generalist model. This is achieved by framing robot tasks and vision-language tasks together through a common representation: taking images and text as input, and outputting text. A key result is that PaLM-E attains significant positive knowledge transfer from both the vision and language domains, improving the effectiveness of robot learning.

Positive transfer of knowledge from general vision-language tasks results in more effective robot learning. This is demonstrated for three different robot embodiments and domains.

The results show that PaLM-E can address a large set of robotics, vision, and language tasks simultaneously without performance degradation compared to training individual models on individual tasks. Moreover, the visual-language data actually significantly improves performance on the robot tasks. This transfer enables PaLM-E to learn robotics tasks efficiently in terms of the number of examples it requires to solve a task.

Results

We evaluate PaLM-E on three robotic environments, two of which involve real robots, as well as on general vision-language tasks such as visual question answering (VQA), image captioning, and general language tasks. When PaLM-E is tasked with making decisions on a robot, we pair it with a low-level language-to-action policy that translates text into low-level robot actions.
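The division of labor between a high-level model that emits text and a low-level policy that acts on it can be sketched as a simple control loop. Everything below is a hypothetical stand-in: `plan_next_step` is a hand-written rule table in place of the model, the observation is a dictionary instead of camera images, and `low_level_policy` flips flags instead of moving a robot. The step strings and the `disturbance_at` parameter are likewise invented for illustration.

```python
def plan_next_step(instruction, observation, history):
    """Hypothetical stand-in for the high-level model: given the instruction,
    the current observation, and past steps, return the next step as text."""
    if not observation["drawer_open"]:
        return "open the drawer"
    if not observation["holding_chips"]:
        return "pick up the bag of chips"
    if not observation["delivered"]:
        return "bring the bag of chips to the user"
    return "done"


def low_level_policy(step, world):
    """Hypothetical language-to-action policy: executes one text step by
    changing the simulated world state."""
    if step == "open the drawer":
        world["drawer_open"] = True
    elif step == "pick up the bag of chips":
        world["holding_chips"] = True
    elif step == "bring the bag of chips to the user":
        world["delivered"] = True


def run_episode(instruction, world, max_steps=10, disturbance_at=None):
    """Closed loop: observe, plan one step as text, execute, repeat."""
    history = []
    for t in range(max_steps):
        if t == disturbance_at:
            # Simulated disturbance: the bag is taken out of the robot's gripper.
            world["holding_chips"] = False
        observation = dict(world)  # fresh observation before every decision
        step = plan_next_step(instruction, observation, history)
        if step == "done":
            break
        low_level_policy(step, world)
        history.append(step)
    return history


world = {"drawer_open": False, "holding_chips": False, "delivered": False}
print(run_episode("bring me the bag of chips", world, disturbance_at=2))
```

Because the planner is queried again after every executed step, a disturbance shows up in the next observation and the plan adapts, here by repeating the pick-up step.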

In the first example below, a person asks a mobile robot to bring them a bag of chips. To complete the task successfully, PaLM-E produces a plan to find the drawer and open it, and then updates its plan as it executes the task, responding to changes in the world. In the second example, the robot is asked to grab a green block. Even though that robot has never seen the block before, PaLM-E still generates a step-by-step plan that generalizes beyond the robot’s training data.

PaLM-E controls a mobile robot operating in a kitchen environment. Left: The task is to get a bag of chips. PaLM-E shows robustness against adversarial disturbances, such as putting the bag of chips back into the drawer. Right: The final steps of executing a plan to retrieve a previously unseen block (green star). This capability is facilitated by transfer learning from the vision and language models.

In the second environment below, the same PaLM-E model solves very long-horizon, precise tasks, such as “sort the blocks by colors into corners”, on a different type of robot. It looks directly at the images and produces a sequence of shorter actions expressed in text, for example, “Push the blue cube to the bottom right corner” or “Push the blue triangle there too”. These are long-horizon tasks that were out of scope for autonomous completion, even in our own most recent models. We also demonstrate the ability to generalize to new tasks not seen during training (zero-shot generalization), such as pushing red blocks to a coffee cup.

PaLM-E controlling a tabletop robot to successfully complete long-horizon tasks.

The third robot environment is inspired by the field of task and motion planning (TAMP), which studies combinatorially challenging planning tasks (rearranging objects) that confront the robot with a very large number of possible action sequences. We show that with a modest amount of training data from an expert TAMP planner, PaLM-E is not only able to solve these tasks, but also leverages visual and language knowledge transfer to solve them more effectively.

PaLM-E produces plans for a task and motion planning environment.

As a visual-language generalist, PaLM-E is a competitive model, even compared with the best vision-language-only models, including Flamingo and PaLI. In particular, PaLM-E-562B achieves the highest number ever reported on the challenging OK-VQA dataset, which requires not only visual understanding but also external knowledge of the world. Moreover, this result is achieved with a generalist model, without fine-tuning specifically on that task.

PaLM-E exhibits capabilities like visual chain-of-thought reasoning, in which the model breaks down its answering process into smaller steps, an ability that has so far only been demonstrated in the language-only domain. The model also demonstrates the ability to perform inference on multiple images, although it was trained on only single-image prompts. The image of the New York Knicks and Boston Celtics is under the terms CC-by-2.0 and was posted to Flickr by kowarski. The image of Kobe Bryant is in the public domain. The other images were taken by us.

Conclusion

PaLM-E pushes the boundaries of how generally capable models can be trained to simultaneously address vision, language, and robotics, while also being able to transfer knowledge from vision and language to the robotics domain. The paper explores additional topics in more detail, such as how PaLM-E leverages neural scene representations, and the extent to which PaLM-E, with greater model scale, experiences less catastrophic forgetting of its language capabilities.

PaLM-E not only provides a path toward building more capable robots that benefit from other data sources, but might also be a key enabler for a wide range of other applications that use multimodal learning, including the ability to unify tasks that have so far seemed separate.

Acknowledgments

This work was done in collaboration across several teams at Google, including the Robotics at Google team and the Brain team, and with TU Berlin. Co-authors: Igor Mordatch, Andy Zeng, Aakanksha Chowdhery, Klaus Greff, Mehdi S. M. Sajjadi, Daniel Duckworth, Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Fei Xia, Brian Ichter, Karol Hausman, Tianhe Yu, Quan Vuong, Yevgen Chebotar, Wenlong Huang, Pierre Sermanet, Sergey Levine, Vincent Vanhoucke, and Marc Toussaint. Danny is a PhD student at TU Berlin, advised by Marc Toussaint. We would also like to thank several other colleagues for their advice and help, including Xi Chen, Etienne Pot, Sebastian Goodman, Maria Attarian, Ted Xiao, Keerthana Gopalakrishnan, Kehang Han, Henryk Michalewski, Neil Houlsby, Basil Mustafa, Justin Gilmer, Yonghui Wu, Erica Moreira, Victor Gomes, Tom Duerig, Mario Lucic, Henning Meyer, and Kendra Byrne.

Source: https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal-language.html
