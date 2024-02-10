



Lately, there has been a lot of talk about AI agents that take commands on a smartphone and carry out the required taps and swipes themselves. This idea of building an AI agent is very reminiscent of the “new Google Assistant” that Google announced in 2019 alongside the Pixel 4.

At I/O 2019, Google demonstrated this next-generation Assistant for the first time. The premise was that on-device speech processing would make tapping to interact with your phone “almost feel slow.”

Google demonstrated simple commands to open and control apps, but the more complex idea was “how the device's built-in Assistant could coordinate tasks across apps.” The example involved receiving a text, replying by voice, searching for a photo, and sending it. These “action” and “multitasking” capabilities were joined by a natural-language “compose” feature in Gmail.

This next-generation Assistant lets you instantly control your phone with your voice, multitask between apps, and complete complex actions, all with near-zero latency.

The new Assistant launched on the Pixel 4 later that year and has been available on all subsequent Google devices.

“Take a selfie.” Then say, “Share this with Ryan.” In a chat thread, say “Reply, I'm on my way.” “Search for yoga classes on YouTube.” Then say, “Share this with Mom.” “Show me emails from Michelle in Gmail.” With the Google Photos app open, say “Show me photos of New York.” Then say, “The ones in Central Park.” With a recipe site open in Chrome, say “Search for chocolate brownies with nuts.” With a travel app open, say “Hotels in Paris.”

This is the basic idea behind AI agents. During Alphabet's earnings call last month, Sundar Pichai was asked about the impact generative AI will have on Assistant. He said it will allow Google Assistant to “act more like an agent over time” and to go beyond answers by following through for users.

According to The Information this week, OpenAI is working on a ChatGPT agent:

“These types of requests cause agents to perform actions like clicks, cursor movements, and text input that humans do when interacting with various apps, according to people familiar with the effort.”

Then there's Rabbit, which has a Large Action Model (LAM) trained to interact with existing mobile and desktop interfaces to complete set tasks.

The version of Google Assistant introduced in 2019 felt very pre-programmed, requiring users to stick to specific phrases rather than speaking naturally and having the actions identified automatically. At the time, Google said the Assistant “works seamlessly with many apps” and that it would “continue to improve app integration.” As far as we know, that didn't actually happen, and some of the features Google showed off no longer work because the apps have changed. A true agent would be able to adapt rather than rely on set conditions.

Last year, Google Research published a study titled “Enabling Conversational Interaction with Mobile UI using Large Language Models,” so it's easy to see how LLMs could improve this.

Google Research demonstrated how its approach can “quickly understand the purpose of a mobile UI.”

Interestingly, the researchers observed that when an LLM creates a summary, it uses prior knowledge to infer information that is not visible in the UI. In one example, the LLM inferred that the tube stations belong to the London Underground system, even though the input UI did not include this information.

The approach can also answer questions about the content displayed in the UI and control the UI in response to natural-language instructions.

A Gemini AI agent for Android devices would be a natural evolution of Google's initial effort, as an all-encompassing assistant that offers a new way to use your phone. Meanwhile, some capabilities, such as transcribing a message reply and saying “send,” are still available through Assistant voice typing in Gboard.

Indeed, the earlier effort looks like a case of Google having the idea too early, before the necessary technology existed. At this point, Google would be wise to prioritize this work so it can lead rather than catch up.

