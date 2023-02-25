



Posted by Yang Li (Research Scientist), Gang Li (Software Engineer, Google Research)

Computational understanding of user interfaces (UIs) is a key step towards achieving intelligent UI behaviors. Previously, we explored various UI modeling tasks, including widget captioning, screen summarization, and command grounding, which address diverse interaction scenarios such as automation and accessibility. We also demonstrated how machine learning can help user experience practitioners improve UI quality by diagnosing tappability confusion and providing insights for improving UI design. These works, along with those developed by others in the field, show how deep neural networks can potentially transform end-user experiences and interaction design practice.

With these successes in addressing individual UI tasks, a natural question is whether we can obtain a foundational understanding of UIs that benefits specific UI tasks. As a first attempt at answering this question, we developed a multi-task model that addresses a range of UI tasks simultaneously. Although that work made some progress, a few challenges remain. Previous UI models rely heavily on UI view hierarchies (i.e., the structure or metadata of a mobile UI screen, like the Document Object Model of a web page), which let a model directly acquire detailed information about the UI objects on the screen (e.g., their types, text content, and positions). This metadata has given previous models an advantage over their vision-only counterparts. However, view hierarchies are not always accessible, and they are often corrupted by missing object descriptions or misaligned structure information. As a result, despite the short-term gains from using view hierarchies, relying on them may ultimately hamper model performance and applicability. In addition, previous models had to deal with heterogeneous information across datasets and UI tasks, which often resulted in complex model architectures that were difficult to scale or generalize across tasks.

In “Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus”, accepted for publication at ICLR 2023, we present a vision-only approach that aims to achieve general UI understanding entirely from raw pixels. We introduce a unified approach to representing diverse UI tasks, whose information can be universally expressed through two core modalities: vision and language. The vision modality captures what a person would see on a UI screen, and the language modality can be natural language or any token sequence related to the task. We demonstrate that Spotlight substantially improves accuracy on a range of UI tasks, including widget captioning, screen summarization, command grounding, and tappability prediction.

Spotlight Model

The Spotlight model input is a tuple of three items: the screenshot, the region of interest on the screen, and the text description of the task. The output is a text description of, or response about, the region of interest. This simple input and output representation makes the model expressive enough to capture a wide variety of UI tasks and allows for scalable model architectures. The design also supports a spectrum of learning strategies and setups, from task-specific fine-tuning to multi-task learning to few-shot learning. As shown in the model diagram, Spotlight leverages existing architecture building blocks such as ViT and T5, which are pre-trained in the high-resourced, general vision-language domain. This lets us build on the success of these general-domain models.
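The input tuple described above can be sketched as a small data structure. This is only an illustration, not the authors' code: the class name `SpotlightInput`, the helper `make_input`, the normalized `[0, 1]` coordinate convention, and the task strings are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom)


@dataclass(frozen=True)
class SpotlightInput:
    """Hypothetical container for one model input tuple."""
    screenshot: bytes  # raw screenshot pixels (placeholder bytes here)
    region: BBox       # region of interest; assumed normalized to [0, 1]
    task: str          # text description of the task


def make_input(screenshot: bytes, region: BBox, task: str) -> SpotlightInput:
    """Validate the bounding box before building the input tuple."""
    left, top, right, bottom = region
    if not (0.0 <= left <= right <= 1.0 and 0.0 <= top <= bottom <= 1.0):
        raise ValueError(f"invalid bounding box: {region}")
    return SpotlightInput(screenshot, (left, top, right, bottom), task)


# A region covering the whole screen, as in screen summarization.
FULL_SCREEN: BBox = (0.0, 0.0, 1.0, 1.0)

# The same structure serves different tasks; only region and task text change.
caption_input = make_input(b"<pixels>", (0.05, 0.40, 0.15, 0.46), "caption this widget")
summary_input = make_input(b"<pixels>", FULL_SCREEN, "summarize this screen")
```

Because every task shares this shape and the output is always text, one model can in principle be fine-tuned per task or trained on several tasks at once without changing its interface.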

Because UI tasks often concern a specific object or area on the screen, the model needs to be able to focus on that object or area of interest. We therefore introduce a Focus Region Extractor to the vision-language model, which enables the model to concentrate on the region in light of the screen context.

In particular, we design a Region Summarizer that acquires a latent representation of a screen region based on ViT encodings, using attention queries generated from the bounding box of the region (see the paper for details). Specifically, each coordinate of the bounding box (a scalar value, i.e., left, top, right, or bottom), shown as a yellow box on the screenshot, is first embedded via a multilayer perceptron (MLP) as a collection of dense vectors, and then fed to a Transformer model along with its coordinate-type embedding. The dense vectors and their corresponding coordinate-type embeddings are color-coded to indicate their affiliation with each coordinate value. The coordinate queries then attend to the screen encodings output by ViT via cross-attention, and the final attention output of the Transformer is used as the region representation for downstream decoding by T5.
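The flow described above (coordinate embedding, coordinate-type embedding, cross-attention over the screen encodings) can be sketched in plain Python. This is a toy illustration under strong assumptions, not the paper's implementation: weights are random rather than learned, the MLP has a single hidden layer, the type embedding is added to the coordinate embedding, there is one attention head with no learned projections, and the Transformer's other layers are omitted. All names here are hypothetical.

```python
import math
import random

D = 8  # toy embedding size (assumption)


def rand_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class RegionSummarizer:
    """Toy sketch: bounding-box queries cross-attend to screen encodings."""

    def __init__(self, dim=D, hidden=16, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.w1 = rand_matrix(hidden, 1, rng)     # MLP layer 1: scalar -> hidden
        self.w2 = rand_matrix(dim, hidden, rng)   # MLP layer 2: hidden -> dim
        self.type_emb = rand_matrix(4, dim, rng)  # left, top, right, bottom

    def embed_coordinate(self, value, kind):
        """Embed one scalar coordinate and add its coordinate-type embedding."""
        hidden = [max(0.0, h) for h in matvec(self.w1, [value])]  # ReLU
        dense = matvec(self.w2, hidden)
        return [a + b for a, b in zip(dense, self.type_emb[kind])]

    def __call__(self, bbox, screen_encoding):
        """Return (region representation, attention weights per query)."""
        queries = [self.embed_coordinate(v, k) for k, v in enumerate(bbox)]
        scale = math.sqrt(self.dim)
        outputs, weights = [], []
        for query in queries:
            scores = [sum(a * b for a, b in zip(query, patch)) / scale
                      for patch in screen_encoding]
            attn = softmax(scores)  # distribution over screen patches
            weights.append(attn)
            outputs.append([sum(w * patch[j] for w, patch in zip(attn, screen_encoding))
                            for j in range(self.dim)])
        return outputs, weights


# Six random vectors stand in for ViT patch encodings of a screenshot.
screen = rand_matrix(6, D, random.Random(1))
summarizer = RegionSummarizer()
region_repr, attn = summarizer((0.05, 0.40, 0.15, 0.46), screen)
```

In this sketch `region_repr` holds one vector per bounding-box coordinate; in the described model, the corresponding output would be passed to the T5 decoder.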

A target region on the screen is summarized by using its bounding box to query the screen encodings from ViT via attention mechanisms.

Results

We pre-train the Spotlight model using two unlabeled datasets (an internal dataset based on the C4 corpus and an internal mobile dataset) containing 2.5 million mobile UI screens and 80 million web pages. We then fine-tune the pre-trained model separately for each of the four downstream tasks (captioning, summarization, grounding, and tappability). For the widget captioning and screen summarization tasks, we report CIDEr scores, which measure how similar a model's text description is to a set of references created by human raters. For command grounding, we report accuracy, which measures the percentage of times the model successfully locates the target object in response to a user command. For tappability prediction, we report F1 scores, which measure the model's ability to tell tappable objects from non-tappable ones.
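To make the grounding and tappability metrics concrete, here is a minimal sketch of accuracy and F1 computed from predictions and labels. The function names and the toy data are assumptions for illustration only. The values are on a 0 to 1 scale, whereas the reported results appear to be on a 0 to 100 scale. CIDEr is omitted because it requires n-gram statistics over reference captions.

```python
def accuracy(predicted, target):
    """Fraction of examples where the prediction matches the target."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not target:
        return 0.0
    hits = sum(1 for p, t in zip(predicted, target) if p == t)
    return hits / len(target)


def f1_score(predicted, target, positive=True):
    """Harmonic mean of precision and recall for the positive class."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    pairs = list(zip(predicted, target))
    tp = sum(1 for p, t in pairs if p == positive and t == positive)
    fp = sum(1 for p, t in pairs if p == positive and t != positive)
    fn = sum(1 for p, t in pairs if p != positive and t == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Command grounding: did the model pick the right object for each command?
grounding_acc = accuracy(
    ["btn_ok", "btn_cancel", "menu", "search"],
    ["btn_ok", "btn_back", "menu", "search"],
)  # 3 of 4 correct -> 0.75

# Tappability: True means the object is predicted (or labeled) tappable.
tap_f1 = f1_score(
    [True, True, False, True, False],
    [True, False, False, True, True],
)  # precision 2/3, recall 2/3 -> F1 of 2/3
```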

In this experiment, we compare Spotlight with several benchmark models. Widget Caption uses the view hierarchy and the image of each UI object to generate a text description of the object. Similarly, Screen2Words uses the view hierarchy and the screenshot, as well as auxiliary features (e.g., the app description), to generate a summary of the screen. In the same vein, VUT combines screenshots and view hierarchies to perform multiple tasks. Finally, the original Tappability model leverages object metadata from the view hierarchy and the screenshot to predict object tappability. Taperception, a follow-up model to Tappability, uses a vision-only tappability prediction approach. We examine two Spotlight model variants that differ in the size of the ViT building block: B/16 and L/16. Spotlight substantially exceeds the state of the art across the four UI modeling tasks.

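| Model | Captioning (CIDEr) | Summarization (CIDEr) | Grounding (accuracy) | Tappability (F1) |
|---|---|---|---|---|
| Baseline: Widget Caption | 97 | – | – | – |
| Baseline: Screen2Words | – | 61.3 | – | – |
| Baseline: VUT | 99.3 | 65.6 | 82.1 | – |
| Baseline: Taperception | – | – | – | 85.5 |
| Baseline: Tappability | – | – | – | 87.9 |
| Spotlight B/16 | 136.6 | 103.5 | 95.7 | 86.9 |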

We then pursue a more challenging setup in which the model learns multiple tasks simultaneously, because a multi-task model can substantially reduce the model footprint. As shown in the table below, the model still performs competitively.

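| Model | Captioning (CIDEr) | Summarization (CIDEr) | Grounding (accuracy) | Tappability (F1) |
|---|---|---|---|---|
| VUT multi-task | 99.3 | 65.1 | 80.8 | – |
| Spotlight B/16 | 140 | 102.7 | 90.8 | 89.4 |
| Spotlight L/16 | 141.3 | 99.2 | 94.2 | 89.5 |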

To understand how the Region Summarizer enables Spotlight to focus on a target region and relevant areas on the screen, we analyze the attention weights (which indicate where the model's attention falls on the screenshot) for both the widget captioning and screen summarization tasks. In the figure below, for the widget captioning task, the model predicts “select Chelsea team” for the checkbox on the left, which is highlighted with a red bounding box. From the attention heatmap on the right (which shows the distribution of attention weights), we can see that the model learns to attend not only to the target region of the checkbox, but also to the text “Chelsea” on the far left, in order to generate the caption. In the screen summarization example, the model predicts “a page that displays a learning app tutorial” given the screenshot on the left. In this example, the target region is the entire screen, and the model learns to attend to the important parts of the screen for summarization.
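One simple way to turn attention weights into a heatmap like the one described is to average the weights over the queries and lay them out on the patch grid. The sketch below assumes exactly that; the post does not specify how the heatmaps were produced, so the averaging step, the peak normalization, and the function name `attention_heatmap` are all assumptions for illustration.

```python
def attention_heatmap(attn_rows, grid_height, grid_width):
    """Average attention over queries and reshape it to the patch grid.

    attn_rows: one list of weights per query, each of length
    grid_height * grid_width (one weight per screenshot patch).
    Returns a grid_height x grid_width nested list scaled so the peak is 1.0.
    """
    num_patches = grid_height * grid_width
    if not attn_rows or any(len(row) != num_patches for row in attn_rows):
        raise ValueError("each attention row must have one weight per patch")
    mean = [sum(row[i] for row in attn_rows) / len(attn_rows)
            for i in range(num_patches)]
    peak = max(mean)
    scaled = [m / peak if peak > 0 else 0.0 for m in mean]
    return [scaled[r * grid_width:(r + 1) * grid_width]
            for r in range(grid_height)]


# Two queries attending over a toy 2 x 2 patch grid.
heatmap = attention_heatmap(
    [[0.1, 0.1, 0.6, 0.2],
     [0.3, 0.1, 0.4, 0.2]],
    grid_height=2,
    grid_width=2,
)
# The bottom-left patch receives the most attention in this toy input.
```

Overlaying such a grid on the screenshot would highlight which patches contribute most, for example a checkbox and its neighboring text label.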

For the widget captioning task, the attention heatmap shows the model attending to the checkbox (i.e., the target object) and the text label to its left when generating a caption for the object. The red bounding box in the figure is for illustration purposes only. For the screen summarization task, where the target region encompasses the entire screen, the attention heatmap shows the model attending to various locations on the screen that contribute to generating the summary.

Conclusion

We demonstrate that Spotlight outperforms previous methods that use both screenshots and view hierarchies as input, and that it establishes state-of-the-art results on multiple representative UI tasks. These tasks range from accessibility and automation to interaction design and evaluation. Our vision-only approach to mobile UI understanding alleviates the need for view hierarchies, allows the architecture to scale easily, and benefits from the success of large vision-language models pre-trained for the general domain. Compared to recent large vision-language model efforts such as Flamingo and PaLI, Spotlight is relatively small, and our experiments show a trend that larger models yield better performance. Spotlight can be easily applied to more UI tasks and can potentially advance many interaction and user experience tasks.

Acknowledgments

We thank Mandar Joshi and Tao Li for their help in processing the web pre-training dataset, and Chin-Yi Cheng and Forrest Huang for their feedback in proofreading the paper. Thanks to Tom Small for his help in creating the animated figures in this post.

