Date: 2023-05-18



Until now, it has been difficult for AI to control smartphone interfaces. But Google researchers seem to have found a way.

To improve voice-based interaction with mobile user interfaces, researchers at Google Research have investigated the use of large language models (LLMs). Current intelligent assistants on mobile devices are limited in conversational interaction because they cannot answer questions about specific information shown on the screen.

The researchers have developed a set of techniques for applying LLMs to mobile user interfaces, including algorithms that convert user interfaces to text. These techniques enable developers to rapidly prototype and test new voice-based interactions. LLMs are well suited to in-context few-shot learning, in which the model is prompted with a few examples from the target task.
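
To make the idea of converting a user interface to text more concrete, here is a minimal Python sketch. It is not the researchers' algorithm: the nested-dictionary screen format, the widget class names, the `TAG_FOR_CLASS` mapping and the `ui_to_text` function are all assumptions made for illustration.

```python
from html import escape
from itertools import count

# Assumed mapping from (hypothetical) widget class names to HTML-like tags.
TAG_FOR_CLASS = {
    "TextView": "p",
    "Button": "button",
    "EditText": "input",
    "ImageView": "img",
}


def ui_to_text(node, ids=None):
    """Flatten a nested UI tree into HTML-like text, depth first."""
    if ids is None:
        ids = count()  # numeric ids are assigned in traversal order
    parts = []
    tag = TAG_FOR_CLASS.get(node.get("class"))
    if tag is not None:  # containers without a mapped tag are skipped
        label = escape(node.get("text", ""))
        parts.append(f"<{tag} id={next(ids)}>{label}</{tag}>")
    for child in node.get("children", []):
        parts.append(ui_to_text(child, ids))
    return "\n".join(p for p in parts if p)


# Hypothetical sign-in screen used only as sample input.
screen = {
    "class": "LinearLayout",
    "children": [
        {"class": "TextView", "text": "Sign in"},
        {"class": "EditText", "text": "Email"},
        {"class": "Button", "text": "Next"},
    ],
}

print(ui_to_text(screen))
```

Giving every element a numeric id means a model can later refer to a specific element by that id instead of describing it in words.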

Large language models as a smartphone interface

The researchers studied four key tasks in extensive experiments. According to them, the results show that LLMs are competitive on these tasks while requiring only two examples per task.
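
Prompting with two examples per task could look roughly like the following Python sketch, shown here for a summarization-style task. The prompt layout, the `Screen:` and `Summary:` labels, the sample screens and the `build_few_shot_prompt` helper are assumptions for illustration and not the prompts used in the study.

```python
def build_few_shot_prompt(preamble, exemplars, query_screen):
    """Concatenate solved exemplars, then the unsolved query screen."""
    blocks = [preamble]
    for screen_text, answer in exemplars:
        blocks.append(f"Screen:\n{screen_text}\nSummary: {answer}")
    # The final block is left open so the model completes the summary.
    blocks.append(f"Screen:\n{query_screen}\nSummary:")
    return "\n\n".join(blocks)


# Two hypothetical exemplars, each a (screen text, expected output) pair.
EXEMPLARS = [
    ("<p id=0>Inbox</p>\n<button id=1>Compose</button>",
     "An email inbox with a compose button."),
    ("<p id=0>Alarm</p>\n<button id=1>Add alarm</button>",
     "A clock app screen for adding alarms."),
]

prompt = build_few_shot_prompt(
    "Summarize each screen in one sentence.",
    EXEMPLARS,
    "<p id=0>Sign in</p>\n<input id=1>Email</input>",
)
print(prompt)
```

The resulting string would be sent to a language model as-is. No model weights are updated, which is what makes this approach attractive for quick prototyping.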

1. Generating on-screen questions: When presented with a mobile user interface (UI), the language model can generate relevant questions about the UI elements that require user input. The study found that the model generated questions with near-perfect grammar (4.98/5) and with a relevance of 92.8 percent to the input fields shown on the screen.

2. Screen summarization: LLMs can effectively summarize the main features of a mobile UI. They produce more accurate summaries than the previously introduced Screen2Words model and can also infer information not directly displayed in the UI.

3. Answering on-screen questions: When presented with a mobile UI and an open-ended question that requires information about the UI, LLMs can provide the correct answer. The study shows that LLMs can answer questions such as “What’s the headline?” and that they performed significantly better than the DistilBERT QA baseline model.

4. Mapping instructions to UI actions: Given a mobile UI and a natural language instruction to control it, the model can predict the ID of the object on which the requested action should be performed. For example, when given the instruction “open Gmail,” the model correctly identified the Gmail icon on the home screen.
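
For the fourth task, the model's reply has to be turned back into a concrete UI element. The Python sketch below shows one way this step might work, assuming the reply contains a numeric element ID. The `parse_element_id` and `resolve_action` helpers, the element table and the canned reply are hypothetical, and no real model is called.

```python
import re


def parse_element_id(model_output):
    """Return the first integer in the model's reply, or None if there is none."""
    match = re.search(r"\d+", model_output)
    return int(match.group()) if match else None


def resolve_action(model_output, elements):
    """Look up the element the model pointed at; None if the id is unknown."""
    return elements.get(parse_element_id(model_output))


# Hypothetical home screen: element id -> description.
home_screen = {0: "Chrome icon", 1: "Gmail icon", 2: "Maps icon"}

# Canned reply standing in for a model's answer to "open Gmail".
reply = "id=1"
print(resolve_action(reply, home_screen))
```

Returning `None` for a missing or unknown ID lets the caller fall back to asking the user again rather than tapping the wrong element.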

The Google researchers conclude that LLMs can simplify the prototyping of new voice-based interactions on mobile UIs. This lets designers, developers and researchers explore new ideas before investing in the development of new datasets and models.

