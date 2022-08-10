



Posted by AJ Piergiovanni and Anelia Angelova, Research Scientists, Google Research, Brain Team

Video is a ubiquitous source of media content that touches many aspects of people’s daily lives. Real-world video applications such as video captioning, video content analysis, and video question answering (VideoQA) increasingly rely on models that can connect video content with text or natural language. VideoQA is particularly difficult, however, because it requires understanding both semantic information, such as the objects in a scene, and temporal information, such as how things move and interact, and both must be interpreted in the context of a natural-language question with a specific intent. In addition, videos have many frames, so processing all of them to learn spatio-temporal information can be computationally expensive. Nevertheless, understanding all of this information allows a model to answer complex questions. For example, in the video below, a question about the second ingredient poured into the bowl requires identifying the objects (ingredients), the action (pouring), and the temporal order (second).

An example input question for the VideoQA task, “What is the second ingredient poured into the bowl?”, which requires a deeper understanding of both the visual and text inputs. The video is an example from the 50 Salads dataset, used under a Creative Commons license.

To address this, in “Video Question Answering with Iterative Video-Text Co-Tokenization”, we introduce a new approach to video-text learning called iterative co-tokenization, which efficiently fuses spatial, temporal, and language information for VideoQA. The approach is multi-stream: separate backbone models process the video at different scales to produce representations that capture different features (e.g., high spatial resolution or long temporal duration). The model then applies the co-tokenization module to learn efficient representations by fusing the video streams with the text. The model is highly efficient, using only 67 giga-FLOPs (GFLOPs), which is at least 50% fewer than previous approaches, while performing better than alternative state-of-the-art models.

The main goal of the video-text iterative co-tokenization model is to produce features from both the video and the text (i.e., the user question) while allowing the two inputs to interact jointly. A second goal is to do so efficiently, which is especially important for videos because they contain tens to hundreds of frames as input.

The model learns to tokenize the joint video-language inputs into a smaller set of tokens that jointly and efficiently represent both modalities. When tokenizing, we use both modalities to produce a joint compact representation, which is fed to a transformer layer to produce the next-level representation. A challenge here, as is typical in cross-modal learning, is that video frames often do not correspond directly to the associated text. We address this by adding two learnable linear layers that unify the visual and text feature dimensions before tokenization. This way, both video and text can condition how the video tokens are learned.
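A single tokenization step of this kind can be sketched in NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the parameter names (`w_video`, `w_text`, `w_attn`), the mean-pooled text summary, the `tanh` fusion, and the softmax pooling are all hypothetical choices made to keep the example short. What it does take from the text is the overall shape of the computation: two linear layers bring video and text features to a common dimension, and the text then influences how a small number of video tokens are formed.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def init_params(rng, d_video, d_text, d_joint, num_tokens):
    """Random weights standing in for learned parameters (hypothetical names)."""
    return {
        "w_video": rng.normal(0.0, d_video ** -0.5, (d_video, d_joint)),
        "w_text": rng.normal(0.0, d_text ** -0.5, (d_text, d_joint)),
        "w_attn": rng.normal(0.0, d_joint ** -0.5, (d_joint, num_tokens)),
    }


def co_tokenize(video, text, params):
    """Reduce (n_video, d_video) features to (num_tokens, d_joint) tokens."""
    # Two linear layers unify the feature dimensions of the two modalities.
    v = video @ params["w_video"]                      # (n_video, d_joint)
    t = text @ params["w_text"]                        # (n_text, d_joint)
    # Assumption: fuse by adding a mean text summary, then a nonlinearity,
    # so that the attention weights below depend on the question.
    joint = np.tanh(v + t.mean(axis=0, keepdims=True))
    # One attention map per output token, normalized over video positions.
    attn = softmax(joint @ params["w_attn"], axis=0)   # (n_video, num_tokens)
    # Each token is a weighted sum of the projected video features.
    return attn.T @ v                                  # (num_tokens, d_joint)


rng = np.random.default_rng(0)
params = init_params(rng, d_video=64, d_text=48, d_joint=32, num_tokens=8)
video = rng.normal(size=(8 * 7 * 7, 64))  # e.g. 8 frames of 7x7 feature maps
text = rng.normal(size=(12, 48))          # e.g. 12 question word embeddings
tokens = co_tokenize(video, text, params)
print(tokens.shape)  # (8, 32)
```

In this sketch 392 video positions are compressed to 8 tokens, so the transformer layer that follows operates on a much shorter sequence. The nonlinearity matters: with a purely additive text summary and a linear attention map, the softmax over positions would cancel the text contribution entirely.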

Moreover, a single tokenization step does not allow further interaction between the two modalities. To enable it, we use this new feature representation to interact with the video input features and produce another set of tokenized features, which are then fed into the next transformer layer. This iterative process creates new features, or tokens, that represent a continual refinement of the joint representation from both modalities. In the final step, the features are fed into a decoder that generates the text output.
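The iterative refinement described above can be sketched as a loop in which the output of each layer conditions the next tokenization. Again, this is a rough sketch under assumptions rather than the published architecture: the video and text features are assumed to be already projected to a shared dimension `d`, the transformer layer is replaced by a weight-free self-attention stand-in, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def tokenize(video, context, w_attn):
    """Pool (n_video, d) video features into (k, d) tokens, given a context."""
    joint = np.tanh(video + context.mean(axis=0, keepdims=True))
    attn = softmax(joint @ w_attn, axis=0)             # (n_video, k)
    return attn.T @ video                              # (k, d)


def transformer_layer_stand_in(x):
    """Stand-in for a transformer layer: single-head self-attention with a
    residual connection and no learned weights (an assumption for brevity)."""
    scores = softmax(x @ x.T / np.sqrt(x.shape[-1]), axis=-1)
    return x + scores @ x


def iterative_co_tokenization(video, text, attn_weights):
    """Run one tokenization plus one layer per entry in attn_weights."""
    context = text  # the first tokenization is conditioned on the question
    features = None
    for w_attn in attn_weights:
        tokens = tokenize(video, context, w_attn)
        features = transformer_layer_stand_in(tokens)
        # The next tokenization sees the refined tokens and the question.
        context = np.concatenate([features, text], axis=0)
    return features  # these would be passed on to a text decoder


rng = np.random.default_rng(0)
d, k, num_iterations = 32, 8, 3
video = rng.normal(size=(392, d))
text = rng.normal(size=(12, d))
attn_weights = [rng.normal(0.0, d ** -0.5, (d, k)) for _ in range(num_iterations)]
features = iterative_co_tokenization(video, text, attn_weights)
print(features.shape)  # (8, 32)
```

The point of the loop is that the video features are revisited at every iteration, so which parts of the video end up in the tokens can change as the joint representation is refined.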

As is customary in VideoQA, we pre-train the model before fine-tuning it on the individual VideoQA datasets. In this work, instead of pre-training on a large VideoQA dataset, we use videos from the HowTo100M dataset that are automatically annotated with text based on speech recognition. Even with this weaker pre-training data, the model is able to learn video-text features.

Visualization of the video-text iterative co-tokenization approach. Multi-stream video inputs, which are versions of the same video (for example, a high-resolution, low-frame-rate video and a low-resolution, high-frame-rate video), are efficiently fused with the text input to produce a text-based answer from the decoder. Instead of processing the inputs directly, the video-text iterative co-tokenization model learns a reduced number of useful tokens from the fused video-language inputs. The process is iterative, so the tokenization at the current step can influence the selection of tokens at the next iteration, refining the selection.

We apply the video-language iterative co-tokenization algorithm to three major VideoQA benchmarks, MSRVTT-QA, MSVD-QA, and IVQA, and show that the approach achieves better results than other state-of-the-art models while remaining modest in size. Moreover, iterative co-tokenization learning greatly reduces the compute required for video-text learning tasks. The method uses only 67 giga-FLOPs (GFLOPs), roughly one sixth of the 360 GFLOPs needed when using the popular 3D-ResNet video model jointly with text, and it is more than twice as efficient as the X3D model. It achieves this while producing highly accurate results that outperform state-of-the-art methods.

Comparison of the iterative co-tokenization approach with previous methods such as MERLOT and VQA-T, as well as with baselines using a single ResNet-3D or X3D-XL.
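The compute figures quoted above can be checked with a few lines of arithmetic. The two GFLOP values come from the text; the X3D bound is only what the phrase "more than twice as efficient" would imply if it refers to FLOPs, which is an interpretation rather than a reported number.

```python
co_tokenization_gflops = 67    # from the text
resnet3d_text_gflops = 360     # 3D-ResNet video model with text, from the text

reduction = resnet3d_text_gflops / co_tokenization_gflops
fraction = co_tokenization_gflops / resnet3d_text_gflops
print(f"{reduction:.1f}x fewer GFLOPs")     # 5.4x fewer GFLOPs
print(f"{fraction:.1%} of the baseline")    # 18.6% of the baseline

# Interpretation: "more than twice as efficient" as X3D would mean the X3D
# baseline needs more than 2 * 67 GFLOPs. This is a bound, not a measurement.
x3d_lower_bound_gflops = 2 * co_tokenization_gflops
print(x3d_lower_bound_gflops)               # 134
```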

Multi-stream video inputs

For VideoQA, and for many other tasks that involve video inputs, we find that multi-stream input is important for answering questions about both spatial and temporal relationships more accurately. Our approach uses three video streams with different resolutions and frame rates: a low-resolution, high-frame-rate video; a high-resolution, low-frame-rate video (8x224x224); and one in between (16x112x112). Despite the apparently larger amount of information to process across three streams, the iterative co-tokenization approach yields a very efficient model. At the same time, the additional streams allow the model to extract the most pertinent information. For example, as shown in the figure below, a question about a specific activity in time produces higher activations in the lower-resolution but higher-frame-rate video input, whereas a question about the general activity can be answered from the high-resolution input with very few frames. Another advantage of the algorithm is that the tokenization changes depending on the question asked.

Visualization of the attention maps learned per layer during video-text co-tokenization. The attention maps differ depending on the question asked about the same video. For example, if the question is related to the general activity (such as surfing in the figure above), the attention maps of the high-resolution, low-frame-rate inputs are more active and appear to consider more global information. If the question is more specific, such as asking what happens after an event, the feature maps are more localized and tend to be active in the high-frame-rate video input. We also see that the low-resolution, high-frame-rate video inputs provide more information related to activities in the video.
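The text does not say how the differently sized streams are derived from the source video. One plausible way, shown here purely as an assumption, is to sample frames uniformly and average-pool each frame spatially. The helper name `make_stream` and the 32-frame, 224x224 base clip are hypothetical; only the 8x224x224 and 16x112x112 stream shapes come from the text.

```python
import numpy as np


def make_stream(video, num_frames, size):
    """Build a (num_frames, size, size, C) stream from a (T, H, W, C) clip.

    Frames are sampled uniformly and each frame is average-pooled. For
    simplicity, T, H, and W must be divisible by the requested sizes.
    """
    t, h, w, c = video.shape
    if t % num_frames or h % size or w % size:
        raise ValueError("clip dimensions must be divisible by the stream shape")
    frames = video[:: t // num_frames]                 # uniform frame sampling
    fh, fw = h // size, w // size                      # pooling window per axis
    blocks = frames.reshape(num_frames, size, fh, size, fw, c)
    return blocks.mean(axis=(2, 4))                    # average over each window


# Hypothetical base clip: 32 frames of 224x224 RGB.
clip = np.zeros((32, 224, 224, 3), dtype=np.float32)
high_res = make_stream(clip, num_frames=8, size=224)
mid_res = make_stream(clip, num_frames=16, size=112)
print(high_res.shape)  # (8, 224, 224, 3)
print(mid_res.shape)   # (16, 112, 112, 3)
```

Under this assumption the two stated streams trade spatial detail against temporal coverage: the 8x224x224 stream has twice as many pixel positions per channel (401,408) as the 16x112x112 stream (200,704), but half as many frames.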

Conclusions

We present a new approach to video-language learning that focuses on joint learning across the video and text modalities, and we use it to address the important and challenging task of video question answering. Our approach is both highly efficient and accurate, outperforming current state-of-the-art models while requiring less compute. It results in modest model sizes and can be improved further with larger models and more data. We hope this work encourages more research in video-language learning.

Acknowledgments

This work was conducted by AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael Ryoo, and Anelia Angelova. We thank our collaborators in this research, Soravit Changpinyo for valuable comments and suggestions, and Claire Cui for suggestions and support. We also thank Tom Small for the visualizations.

