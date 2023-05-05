



Posted by AJ Piergiovanni and Anelia Angelova, Research Scientists at Google Research

Vision-language foundation models are built on the premise of a single pre-training followed by adaptation to multiple downstream tasks. Two main, disjoint training scenarios are popular: CLIP-style contrastive learning and next-token prediction. Contrastive learning trains a model to predict whether image-text pairs are correctly matched, effectively building visual and textual representations of the corresponding image and text inputs. Next-token prediction, on the other hand, predicts the most likely next text token in a sequence and thereby learns to generate text, according to the required task. Contrastive learning enables image-text and text-image retrieval tasks, such as finding the image that best matches a given description, while next-token learning enables text-generative tasks, such as image captioning and visual question answering (VQA). Both approaches have shown strong results, but a model pre-trained contrastively typically performs poorly on text-generative tasks, and vice versa. Moreover, adaptation to other tasks is often done with complex or inefficient methods. For example, to extend a vision-language model to video, some models need to run inference on each video frame separately. This limits the size of the videos that can be processed to only a few frames and does not take full advantage of the motion information available across frames.
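To make the two objectives concrete, here is a minimal sketch in plain Python. It assumes a symmetric InfoNCE-style formulation for the contrastive loss and a standard cross-entropy for next-token prediction; the function names and the toy similarity matrices are illustrative and are not taken from the MaMMUT implementation.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(sim):
    """Symmetric contrastive loss over an N x N image-text similarity matrix.

    sim[i][j] is the similarity of image i and text j; matched pairs sit
    on the diagonal (an assumption of this toy setup).
    """
    n = len(sim)
    # Image -> text: each row should put its probability mass on the diagonal.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text -> image: the same, computed over columns.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

def next_token_loss(logits, targets):
    """Mean cross-entropy of the true next token at each position.

    logits[t] holds vocabulary scores predicted at position t;
    targets[t] is the index of the token that actually follows.
    """
    total = sum(log_softmax(l)[y] for l, y in zip(logits, targets))
    return -total / len(targets)

# Toy batches: in `good`, matched pairs score higher than mismatched ones.
good = [[5.0, 0.0], [0.0, 5.0]]
bad = [[0.0, 5.0], [5.0, 0.0]]
print(contrastive_loss(good) < contrastive_loss(bad))
```

The contrastive loss scores a whole image against a whole text, while the next-token loss scores one token at a time given its prefix, which is why the two objectives pull a shared text model in different directions.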

Motivated by this, we present “A Simple Architecture for Joint Learning for MultiModal Tasks”, called MaMMUT. MaMMUT can be trained jointly for these competing objectives and provides a foundation for many vision-language tasks, either directly or via simple adaptation. MaMMUT is a compact, 2B-parameter multimodal model that is trained across contrastive, text-generative, and localization-aware objectives. It consists of a single image encoder and a text decoder, which allows both components to be reused directly. Moreover, a straightforward adaptation to video-text tasks requires using the image encoder only once and can handle many more frames than prior work. In line with recent language models (e.g., PaLM, GLaM, GPT3), our architecture uses a decoder-only text model and can be thought of as a simple extension of language models. Despite its modest size, our model outperforms the state of the art or achieves competitive performance on image-text and text-image retrieval, video question answering (VideoQA), video captioning, open-vocabulary detection, and VQA.

The MaMMUT model enables a wide range of tasks, such as image-text/text-image retrieval (top left and top right), VQA (middle left), open-vocabulary detection (middle right), and VideoQA (bottom).

Decoder-only model architecture

One surprising finding is that a single language decoder is sufficient for all these tasks, which removes the need for both the complex constructs and the training procedures used before. For example, our model (shown on the left side of the figure below) consists of a single visual encoder and a single text decoder, connected via cross-attention, and trains simultaneously on both contrastive and text-generative losses. By comparison, prior work either cannot handle image-text retrieval tasks or applies only some losses to only some parts of the model. To enable multimodal tasks and take full advantage of the decoder-only model, the contrastive losses and the text-generative, captioning-like losses need to be trained jointly.

The MaMMUT architecture (left) is a simple construct consisting of a single vision encoder and a single text decoder. Compared to other popular vision-language models (e.g., PaLI (middle) and ALBEF, CoCa (right)), it trains jointly and efficiently for multiple vision-language tasks with both contrastive and text-generative losses.

Decoder two-pass learning

Decoder-only models for language learning show clear performance advantages at smaller model sizes (almost half the parameters). The main challenge in applying them to multimodal settings is to unify contrastive learning (which uses an unconditional sequence-level representation) with captioning (which optimizes the likelihood of a token conditioned on the previous tokens). We propose a two-pass approach to jointly learn these two conflicting types of text representations within the decoder. In the first pass, we use cross-attention and causal masking to learn the caption generation task: the text features can attend to the image features and predict the tokens in sequence. In the second pass, we disable cross-attention and causal masking to learn the contrastive task: the text features do not see the image features but can attend bidirectionally to all text tokens at once to produce the final text-based representation. Completing this two-pass approach within the same decoder accommodates both types of tasks, which were previously hard to reconcile. Although simple, this model architecture can provide a foundation for multiple multimodal tasks.
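The difference between the two passes comes down to two switches: whether text tokens may look at image features, and whether they may look at later text tokens. The sketch below illustrates this with boolean attention masks. The helper names (`text_self_attention_mask`, `decoder_pass_config`) and the dictionary layout are hypothetical, chosen for illustration only, and say nothing about how the actual model code is organized.

```python
def text_self_attention_mask(num_tokens, causal):
    """mask[q][k] is True when query token q may attend to key token k."""
    return [[(k <= q) if causal else True for k in range(num_tokens)]
            for q in range(num_tokens)]

def decoder_pass_config(mode, num_tokens):
    """Settings for one pass through the shared decoder (hypothetical helper)."""
    if mode == "generative":
        # Pass 1: captioning. Tokens see the image and only earlier tokens.
        return {"cross_attention": True,
                "mask": text_self_attention_mask(num_tokens, causal=True)}
    if mode == "contrastive":
        # Pass 2: text embedding. No image features, full bidirectional view.
        return {"cross_attention": False,
                "mask": text_self_attention_mask(num_tokens, causal=False)}
    raise ValueError(f"unknown mode: {mode}")

gen = decoder_pass_config("generative", 3)
con = decoder_pass_config("contrastive", 3)
for row in gen["mask"]:
    print(row)
```

In the generative pass the mask is lower-triangular, so each token attends only to itself and earlier tokens; in the contrastive pass every entry is True. Both configurations would be applied to the same decoder weights.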

MaMMUT decoder-only two-pass learning enables both contrastive and generative learning paths within the same model.

Another advantage of our architecture is that, because it is trained for these disjoint tasks, it can be seamlessly applied to multiple applications such as image-text and text-image retrieval, VQA, and captioning.
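As an illustration of how a contrastively trained model is typically used for retrieval, the sketch below ranks candidate image embeddings by cosine similarity to a text embedding. The vectors are made-up toy values and the `retrieve` helper is hypothetical; in practice the embeddings would come from the image encoder and from the contrastive pass of the text decoder.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def retrieve(query, candidates):
    """Index of the candidate with the highest cosine similarity to query."""
    q = normalize(query)
    scores = [sum(a * b for a, b in zip(q, normalize(c))) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings (assumed values, three images and one text query).
image_embeddings = [[1.0, 0.0, 0.2], [0.1, 1.0, 0.0], [0.0, 0.2, 1.0]]
text_embedding = [0.0, 0.9, 0.1]
best = retrieve(text_embedding, image_embeddings)
print(best)
```

Image-to-text retrieval works the same way with the roles swapped: the image embedding is the query and the text embeddings are the candidates.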

Moreover, MaMMUT adapts easily to video-language tasks. Previous approaches used a vision encoder to process each frame individually, which required applying it multiple times. This is slow and restricts the number of frames the model can handle, typically to only 6-8. With MaMMUT, we use sparse video tubes for lightweight adaptation that draws directly on the spatio-temporal information in the video. In addition, the model can be adapted to open-vocabulary detection simply by training it to detect bounding boxes via an object-detection head.
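A rough token count shows why tubes are cheaper than per-frame processing. The sketch below compares the two under assumed settings: a 32-frame clip at 224 x 224, 16 x 16 image patches for the per-frame baseline, and 8 x 8 x 8 tubes sampled with a large stride. All of these numbers are hypothetical and are not MaMMUT's actual configuration.

```python
def num_positions(size, kernel, stride):
    """Number of kernel placements along one axis (no padding)."""
    return (size - kernel) // stride + 1

def tube_token_count(video_shape, tube_shape, stride):
    """Tokens produced by sampling tubes of tube_shape at the given stride.

    video_shape, tube_shape and stride are (frames, height, width) triples.
    """
    count = 1
    for size, kernel, step in zip(video_shape, tube_shape, stride):
        count *= num_positions(size, kernel, step)
    return count

video = (32, 224, 224)  # assumed clip: 32 frames of 224 x 224 pixels

# Per-frame baseline: 16 x 16 patches, one encoder call per frame.
per_frame_tokens = tube_token_count((1, 224, 224), (1, 16, 16), (1, 16, 16))
per_frame_calls = video[0]

# Sparse tubes: 8 x 8 x 8 tubes with stride (16, 32, 32), one encoder call.
tube_tokens = tube_token_count(video, (8, 8, 8), (16, 32, 32))

print(per_frame_calls, per_frame_tokens, tube_tokens)
```

Under these assumptions the per-frame baseline needs 32 encoder calls of 196 tokens each, while the sparse tubes yield 98 tokens that go through the encoder in a single call.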

Adapting the MaMMUT architecture to video tasks (left) is simple and fully reuses the model. This is done by generating a video “tubes” feature representation, similar to image patches, which is projected to lower-dimensional tokens and run through the vision encoder. Unlike prior approaches (right), which need to run multiple individual images through the vision encoder, we use it only once.

Results

Our model achieves excellent zero-shot results on image-text and text-image retrieval without any adaptation, outperforming all previous state-of-the-art models. The results on VQA are competitive with state-of-the-art results, which are achieved by much larger models. The PaLI model (17B parameters) and the Flamingo model (80B) have the best performance on the VQA2.0 dataset, but MaMMUT (2B) has the same accuracy as the 15B PaLI.

MaMMUT outperforms the state of the art (SOTA) on zero-shot image-text (I2T) and text-image (T2I) retrieval on both the MS-COCO (top) and Flickr (bottom) benchmarks. Performance on the VQA2.0 dataset is competitive but does not surpass large models such as Flamingo-80B and PaLI-17B. VQA performance is evaluated in the more challenging open-ended text generation setting.

MaMMUT also outperforms the state of the art on VideoQA, as shown below on the MSRVTT-QA and MSVD-QA datasets. Note that it outperforms much larger models such as Flamingo, which is specifically designed for image and video pre-training and is pre-trained on both image-text and video-text data.

MaMMUT outperforms SOTA models on VideoQA tasks (MSRVTT-QA dataset, top; MSVD-QA dataset, bottom), including much larger models such as the 5B GIT2 and Flamingo, which uses 80B parameters and is pre-trained on both image-language and video-language tasks.

Our results also outperform the state of the art on open-vocabulary detection fine-tuning, as shown below.

Key ingredients

We show that joint training for both contrastive and text-generative objectives is not an easy task, and in our ablations we find that these tasks are better served by different design choices. Fewer cross-attention connections are better for retrieval tasks, whereas more are preferred for VQA tasks. This shows that our model's design choices may be suboptimal for individual tasks; even so, our model is more effective than more complex or larger models.

Ablation studies showing that fewer cross-attention connections (1-2) are better for retrieval tasks (top), whereas more connections favor text-generative tasks such as VQA (bottom).

Conclusion

We presented MaMMUT, a simple and compact vision-encoder language-decoder model that jointly trains a number of conflicting objectives to reconcile contrastive and text-generative tasks. Our model also serves as a foundation for many more vision-language tasks, achieving state-of-the-art or competitive performance on image-text and text-image retrieval, VideoQA, video captioning, open-vocabulary detection, and VQA. We hope it can be used further for more multimodal applications.

Acknowledgments

The work described is co-authored by Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, and Anelia Angelova. We would like to thank Mojtaba Seyedhosseini, Vijay Vasudevan, Priya Goyal, Jiahui Yu, Zirui Wang, Yonghui Wu, Runze Li, Jie Mei, Radu Soricut, Qingqing Huang, Andy Ly, Nan Du, Yuxin Wu, Tom Duerig, Paul Natsev, and Zoubin Ghahramani for their help and support.

