Date: 2021-06-09



Posted by Diego Martin Arroyo, Software Engineer, and Federico Tombari, Research Scientist, Google Research

Information in a written document is conveyed not only by the meaning of the words it contains, but also by the overall layout of the document. Layouts are often used to direct the order in which the reader parses a document for better understanding (e.g., with columns and paragraphs), to provide helpful summaries (e.g., with titles), or for aesthetic purposes (e.g., when displaying advertisements).

While these design rules are easy to follow, they are difficult to define explicitly without quickly needing to include exceptions or running into ambiguous cases. This makes document design difficult to automate: a system with a hard-coded set of production rules will either be overly simplistic, and thus incapable of producing original layouts (causing a lack of layout diversity in synthesized data), or too complex, with a large set of rules and their accompanying exceptions. In an attempt to solve this problem, machine learning (ML) techniques for synthesizing document layouts have been proposed. However, most ML-based solutions for automatic document design either do not scale to a large number of layout components or rely on additional information for training, such as the relationships between the different components of a document.

In "Variational Transformer Networks for Layout Generation", presented at CVPR 2021, we create a document layout generation system that scales to an arbitrary number of elements and does not require any additional information to capture the relationships between design elements. We use self-attention layers as building blocks of a variational autoencoder (VAE), which models document layout design rules as a distribution rather than as a set of predetermined heuristics, and thereby increases the diversity of the generated layouts. The resulting Variational Transformer Network (VTN) model is able to extract meaningful relationships between layout elements (paragraphs, tables, images, etc.) and produces realistic synthetic documents (e.g., with better alignment and margins). We show the effectiveness of this combination across different domains, such as scientific papers, UI layouts, and even furniture arrangements.

VAE for Layout Generation

The ultimate goal of this system is to infer the design rules for a given type of layout from a collection of examples. If one considers these design rules to be the distribution underlying the data, probabilistic models can be used to discover them. We propose doing this with a VAE (widely used for tasks such as image generation and anomaly detection), an autoencoder architecture that consists of two distinct subparts, the encoder and the decoder. The encoder learns to compress the input to fewer dimensions, retaining only the information needed to reconstruct the input, while the decoder learns to undo this operation. The compressed representation (also called the bottleneck) can be forced to behave like a known distribution (e.g., a standard Gaussian). Feeding samples from this prior distribution to the decoder segment of the network then produces outputs similar to the training data.
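To make the role of the bottleneck concrete, here is a minimal sketch of the two ways a latent vector can be obtained in a VAE: from the encoder's output during training (via the reparameterization trick) and from the prior when generating new samples. It assumes a diagonal Gaussian posterior and a standard normal prior; the function names `reparameterize` and `sample_prior` are hypothetical and are not taken from the paper.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension.

    mu and log_var stand in for the encoder's outputs (assumed here to
    parameterize a diagonal Gaussian).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def sample_prior(dim, rng):
    """Draw a latent vector from the assumed standard normal prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

rng = random.Random(0)
# Training time: the latent is sampled around the encoder's mean.
# A very negative log-variance gives a near-deterministic sample.
z_train = reparameterize([0.5, -1.0], [-20.0, -20.0], rng)
# Generation time: no input is needed, only a draw from the prior.
z_new = sample_prior(4, rng)
```

At generation time only `sample_prior` and the decoder are needed, which is why regularizing the bottleneck toward a known distribution matters.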

Another advantage of the VAE formulation is that it is agnostic to the type of operations used to implement the encoder and decoder segments. We therefore use self-attention layers (typically found in Transformer architectures) to automatically capture the influence that each layout element has on the others.

Transformers use self-attention layers to model long, sequenced relationships. They are often applied to natural language understanding tasks such as translation and summarization, and also beyond the language domain, in object detection and document layout understanding tasks. The self-attention operation relates every element in a sequence to every other element and determines how they influence each other. This property is well suited to modeling relationships between the different elements of a layout without the need for explicit annotations.
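The following sketch illustrates the self-attention operation on a short sequence. It is a simplified, single-head scaled dot-product attention in which queries, keys, and values are the inputs themselves; real Transformer layers apply learned linear projections and use multiple heads, which are omitted here as a simplifying assumption.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Single-head scaled dot-product self-attention with identity projections.

    seq: list of n vectors, each of dimension d.
    Returns (outputs, weights); weights[i][j] is how much element i
    attends to element j.
    """
    d = len(seq[0])
    outputs, weights = [], []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * seq[j][c] for j in range(len(seq)))
                        for c in range(d)])
    return outputs, weights

# Three toy "layout elements" embedded in two dimensions.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(seq)
```

Each row of `weights` is a distribution over all elements in the sequence, so every element's output depends on every other element without any explicit annotation of their relationships.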

To synthesize new samples from these relationships, some approaches to layout generation [e.g., 1], and to other domains as well [e.g., 2, 3], rely on greedy search algorithms such as beam search, nucleus sampling, or top-k sampling. Because these strategies are often based on exploration rules that tend to favor the most likely outcome at every step, the diversity of the generated samples is not guaranteed. By combining self-attention with the VAE's probabilistic techniques, however, the model can directly learn a distribution from which new elements can be drawn.
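As an illustration of why such strategies can limit diversity, here is a minimal sketch of top-k sampling over a toy distribution. The function name `top_k_sample` and the probabilities are made up for the example; the point is only that options outside the k most probable are never produced, however many samples are drawn.

```python
import random

def top_k_sample(probs, k, rng):
    """Sample an index from the k most probable entries, renormalized."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    r = rng.random() * total
    acc = 0.0
    for i in top:
        acc += probs[i]
        if r < acc:
            return i
    return top[-1]  # guard against floating-point edge cases

rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]   # toy next-element distribution
samples = {top_k_sample(probs, 2, rng) for _ in range(200)}
# Indices 2 and 3 can never appear in `samples`, even though together
# they account for 20% of the probability mass.
```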

Modeling the Variational Bottleneck

The bottleneck of a VAE is commonly modeled as a vector representing the input. Because self-attention layers are a sequence-to-sequence architecture, i.e., a sequence of n input elements is mapped onto n output elements, the standard VAE formulation is difficult to apply. Inspired by BERT, we add an auxiliary token at the beginning of the sequence and treat it as the autoencoder bottleneck vector z. During training, the vector associated with this token is the only information passed to the decoder, so the encoder needs to learn how to compress the entire document information into this vector. The decoder then learns to infer the number of elements in the document, as well as the position of each element in the input sequence, from this vector alone. This strategy allows us to regularize the bottleneck using standard techniques such as the KL divergence.
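For reference, the KL regularizer mentioned above has a simple closed form when the bottleneck posterior is a diagonal Gaussian and the prior is a standard normal. Both are assumptions made for this sketch (the paper also studies other priors), and the function name is hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    Closed form per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log(sigma^2)).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

# The penalty is zero when the posterior already equals the prior ...
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# ... and grows as the mean moves away from the origin.
kl_shifted = kl_to_standard_normal([1.0], [0.0])
```

Because the whole document is summarized in a single vector z, this penalty is applied to one vector per document rather than to every element of the sequence.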

Decoding

To synthesize documents with varying numbers of elements, the network needs to model sequences of arbitrary length, which is not trivial. While self-attention lets the encoder adapt automatically to any number of elements, the decoder segment does not know the number of elements in advance. We overcome this issue by decoding sequences in an autoregressive way: at every step, the decoder produces an element that is concatenated to the previously decoded elements (starting with the bottleneck vector z as input), until a special stop element is produced.
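The decoding loop described above can be sketched as follows. The decoder network is replaced by a toy stand-in (`toy_step`) that emits three boxes and then the stop element; `STOP`, `decode`, `toy_step`, and the `max_len` safety limit are all hypothetical names introduced for illustration.

```python
STOP = None  # hypothetical marker for the special stop element

def decode(z, step_fn, max_len=50):
    """Autoregressively decode layout elements from the bottleneck vector z.

    step_fn receives everything decoded so far (with z in first position)
    and returns the next element, or STOP to end the sequence.
    """
    sequence = [z]                      # decoding starts from z alone
    while len(sequence) - 1 < max_len:  # safety limit, not part of the method
        element = step_fn(sequence)
        if element is STOP:
            break
        sequence.append(element)        # concatenate to previous outputs
    return sequence[1:]                 # drop z, keep the decoded elements

def toy_step(sequence):
    """Stand-in for the decoder network: three stacked paragraphs, then stop."""
    n = len(sequence) - 1               # elements decoded so far
    if n >= 3:
        return STOP
    return ("paragraph", 0.1, 0.1 + 0.3 * n, 0.8, 0.25)

layout = decode([0.0] * 4, toy_step)
```

The number of elements is never given to the decoder; it emerges from when the stop element is produced.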

Visualization of the proposed architecture

Turning Layouts into Input Data

A document is often composed of several design elements, such as paragraphs, tables, images, titles, and footnotes. To make this information easy for a neural network to digest, we define each element by four variables (x, y, width, height), representing the element's position on the page (x, y) and its size (width, height).
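One possible encoding of such an element is sketched below. The four box variables come from the description above; the one-hot class encoding, the normalization to page-relative units, the top-left origin, and the names `Element`, `CATEGORIES`, and `to_vector` are assumptions made for illustration.

```python
from dataclasses import dataclass

# Element types mentioned in the text; the ordering is arbitrary.
CATEGORIES = ["paragraph", "table", "image", "title", "footnote"]

@dataclass(frozen=True)
class Element:
    category: str
    x: float       # left edge of the box (assumed top-left origin)
    y: float       # top edge of the box
    width: float
    height: float

def to_vector(el, page_w, page_h):
    """One-hot class followed by the box scaled to [0, 1] page units."""
    one_hot = [1.0 if c == el.category else 0.0 for c in CATEGORIES]
    box = [el.x / page_w, el.y / page_h,
           el.width / page_w, el.height / page_h]
    return one_hot + box

title = Element("title", 50, 40, 500, 60)
vec = to_vector(title, page_w=600, page_h=800)
```

A whole document then becomes a sequence of such vectors, one per element, which is the form a sequence model consumes.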

Results

We evaluate the performance of the VTN according to two criteria: layout quality and layout diversity. We train the model on publicly available document datasets such as PubLayNet, a collection of scientific papers with layout annotations, and evaluate the quality of the generated layouts by quantifying the amount of overlap and alignment between elements. We measure how well the synthetic layouts resemble the training distribution using the Wasserstein distance over the distributions of element classes (e.g., paragraphs, images, etc.) and bounding boxes. To capture layout diversity, we use the DocSim metric to find the most similar real sample for each generated document; the higher the number of unique matches to the real data, the more diverse the results.
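To give a feel for these measurements, here is a rough sketch of a pairwise IoU score and of unique-match counting. The exact metric definitions are those of the paper and are not reproduced here: `mean_pairwise_iou` is one plausible way to quantify overlap, and the `similarity` argument is a placeholder for DocSim, which is not implemented. All function names are hypothetical.

```python
from itertools import combinations

def intersection_area(a, b):
    """a, b: boxes as (x, y, width, height), with (x, y) the top-left corner."""
    w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def iou(a, b):
    """Intersection over union of two boxes."""
    inter = intersection_area(a, b)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def mean_pairwise_iou(boxes):
    """Average IoU over all pairs of boxes in one layout (0.0 if fewer than 2)."""
    pairs = list(combinations(boxes, 2))
    return sum(iou(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def count_unique_matches(generated, real, similarity):
    """Count distinct real layouts that are the best match of a generated one."""
    matched = set()
    for g in generated:
        best = max(range(len(real)), key=lambda j: similarity(g, real[j]))
        matched.add(best)
    return len(matched)

box = (0.0, 0.0, 2.0, 2.0)
overlapping = iou(box, (1.0, 1.0, 2.0, 2.0))   # intersection 1, union 7

# Toy similarity: layouts with closer element counts are more similar.
toy_similarity = lambda g, r: -abs(len(g) - len(r))
n_unique = count_unique_matches([[box], [box], [box] * 3],
                                [[box], [box] * 3],
                                toy_similarity)
```

If a generator keeps producing near-copies of the same few real layouts, the unique-match count stays low regardless of how many samples are drawn.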

We compare the VTN approach with previous works such as LayoutVAE and Gupta et al. The former is a VAE-based formulation with an LSTM backbone, whereas Gupta et al. use a self-attention mechanism similar to ours, combined with a standard search strategy (beam search). The results below show that LayoutVAE struggles to comply with design rules such as strict alignment, as in the case of PubLayNet. Thanks to the self-attention operation, Gupta et al. can model these constraints much more effectively, but the use of beam search affects the diversity of the results.

| Method | IoU | Overlap | Alignment | Wasserstein Class ↓ | Wasserstein Box ↓ | # Unique Matches ↑ |
|---|---|---|---|---|---|---|
| LayoutVAE | 0.171 | 0.321 | 0.472 | – | 0.045 | 241 |
| Gupta et al. | 0.039 | 0.006 | 0.361 | 0.018 | 0.012 | 546 |
| VTN | 0.031 | 0.017 | 0.347 | 0.022 | 0.012 | 697 |
| Real data | 0.048 | 0.007 | 0.353 | – | – | – |

Down arrows (↓) indicate that a lower score is better; up arrows (↑) indicate that a higher score is better.

We also explore the ability of our approach to learn design rules in other domains, such as Android UIs (RICO), natural scenes (COCO), and indoor scenes (SUN RGB-D). Our method effectively learns the design rules of these datasets and produces synthetic layouts whose quality and diversity are comparable to the current state of the art.

| Method | IoU | Overlap | Alignment | Wasserstein Class ↓ | Wasserstein Box ↓ | # Unique Matches ↑ |
|---|---|---|---|---|---|---|
| LayoutVAE | 0.193 | 0.400 | 0.416 | – | 0.045 | 496 |
| Gupta et al. | 0.086 | 0.145 | 0.366 | 0.004 | 0.023 | 604 |
| VTN | 0.115 | 0.165 | 0.373 | 0.007 | 0.018 | 680 |
| Real data | 0.084 | 0.175 | 0.410 | – | – | – |

| Method | IoU | Overlap | Alignment | Wasserstein Class ↓ | Wasserstein Box ↓ | # Unique Matches ↑ |
|---|---|---|---|---|---|---|
| LayoutVAE | 0.325 | 2.819 | 0.246 | – | 0.062 | 700 |
| Gupta et al. | 0.194 | 1.709 | 0.334 | 0.001 | 0.016 | 601 |
| VTN | 0.197 | 2.384 | 0.330 | 0.0005 | 0.013 | 776 |
| Real data | 0.192 | 1.724 | 0.347 | – | – | – |

In both tables, down arrows (↓) indicate that a lower score is better; up arrows (↑) indicate that a higher score is better.

Below are some examples of layouts produced by our method compared with existing methods. The design rules learned by the network (position, margins, alignment) resemble those of the original data and show a high degree of variability.

Qualitative results of our method on PubLayNet compared with existing state-of-the-art methods (panels: LayoutVAE, Gupta et al., VTN).

Conclusion

In this work, we show the feasibility of using self-attention as part of the VAE formulation. We validate the effectiveness of this approach for layout generation, achieving state-of-the-art performance on various datasets and across different tasks. Our research paper also explores alternative architectures for integrating self-attention and VAEs, investigates non-autoregressive decoding strategies and different types of priors, and analyzes their advantages and disadvantages. The layouts produced by our method can help create synthetic training data for downstream tasks, such as document parsing or the automation of graphic design tasks. We hope that this work provides a foundation for continued research in this area, as many subproblems are still not completely solved, such as how to suggest styles for the elements in a layout (text font, which image to choose, etc.) or how to reduce the amount of training data the model needs in order to generalize.

Acknowledgments

We thank our co-authors Janis Postels, Alessio Tonioni, and Luca Prasso for helping design several of our experiments. We would also like to thank Tom Small for helping us create the animation for this post.

