



Posted by Antoine Yang, Student Researcher, and Arsha Nagrani, Research Scientist, Google Research, Perception team

Videos have become an increasingly important part of our daily lives, spanning fields such as entertainment, education, and communication. Understanding the content of videos, however, is a challenging task, as videos often contain multiple events occurring at different time scales. For example, a video of a musher hitching a dog to a dog sled contains a long event (the dog pulling the sled) and a short event (the dog being hitched to the sled). One way to spur research in video understanding is the task of dense video captioning, which consists of temporally localizing and describing all events in a minutes-long video. This differs from single image captioning and standard video captioning, which describe a short video with a single sentence.

Dense video captioning systems could have a wide range of applications, such as making videos accessible to people with visual or hearing impairments, automatically generating chapters for videos, and improving the search for video moments in large databases. Current dense video captioning approaches, however, have several limitations. For example, they often contain highly specialized, task-specific components that make them difficult to integrate into powerful foundation models. Moreover, they are often trained exclusively on manually annotated datasets, which are very difficult to obtain and hence are not a scalable solution.

In this post, we introduce “Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning,” to appear at CVPR 2023. The Vid2Seq architecture augments a language model with special time tokens, allowing it to predict event boundaries and textual descriptions within the same output sequence. To pre-train this unified model, we leverage unlabeled narrated videos by reformulating the sentence boundaries of transcribed speech as pseudo-event boundaries and using the transcribed speech sentences as pseudo-event captions. The resulting Vid2Seq model, pre-trained on millions of narrated videos, improves the state of the art on a variety of dense video captioning benchmarks, including YouCook2, ViTT, and ActivityNet Captions. Vid2Seq also generalizes well to the few-shot dense video captioning setting, the video paragraph captioning task, and the standard video captioning task. Finally, we have also released the code for Vid2Seq here.

Vid2Seq is a visual language model that predicts dense event captions together with their temporal grounding in the video by generating a single sequence of tokens.

A visual language model for dense video captioning

Multimodal transformer architectures have improved the state of the art on a wide range of video tasks, such as action recognition. However, it is not straightforward to adapt such an architecture to the complex task of jointly localizing and captioning events in minutes-long videos.

As a general overview of how we achieve this, we augment a visual language model with special time tokens (like text tokens) that represent discretized timestamps in the video, similar to Pix2Seq in the spatial domain. Given visual inputs, the resulting Vid2Seq model can both take as input and generate sequences of text and time tokens. First, this enables the Vid2Seq model to understand the temporal information of the transcribed speech input, which is cast as a single sequence of tokens. Second, this allows Vid2Seq to jointly predict dense event captions and temporally ground them in the video while generating a single sequence of tokens.
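To make the idea of discretized timestamp tokens concrete, here is a minimal sketch of how a list of timed events could be serialized into one sequence. The token format `<time=k>`, the number of bins (`NUM_TIME_BINS = 100`), and both function names are illustrative assumptions, not the released Vid2Seq implementation.

```python
from typing import List, Tuple

NUM_TIME_BINS = 100  # assumption: size of the time-token vocabulary


def time_token(t_sec: float, duration_sec: float, num_bins: int = NUM_TIME_BINS) -> str:
    """Map an absolute timestamp to a relative, discretized time token."""
    idx = int(t_sec * num_bins / duration_sec)
    idx = max(0, min(idx, num_bins - 1))  # clamp so the video end falls in the last bin
    return f"<time={idx}>"


def events_to_sequence(events: List[Tuple[float, float, str]], duration_sec: float) -> str:
    """Serialize (start, end, caption) events, ordered by start time, into one string."""
    parts = []
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        start_tok = time_token(start, duration_sec)
        end_tok = time_token(end, duration_sec)
        parts.append(f"{start_tok} {end_tok} {caption}")
    return " ".join(parts)


if __name__ == "__main__":
    events = [
        (10.0, 12.0, "the dog is hitched to the sled"),
        (0.0, 50.0, "the dog pulls the sled"),
    ]
    print(events_to_sequence(events, duration_sec=100.0))
```

Because timestamps are expressed relative to the video duration, the same small set of time tokens can cover videos of any length, at the cost of coarser resolution on longer videos.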

The Vid2Seq architecture includes a visual encoder and a text encoder, which encode the video frames and the transcribed speech input, respectively. The resulting encodings are then forwarded to a text decoder, which autoregressively predicts the output sequence of dense event captions together with their temporal localization in the video. The architecture is initialized with a powerful visual backbone and a strong language model.

Vid2Seq model overview: We formulate dense event captioning as a sequence-to-sequence problem, using special time tokens that allow the model to seamlessly understand and generate sequences of tokens containing both textual semantic information and the temporal localization information grounding each text sentence in the video.

Large-scale pre-training on untrimmed narrated videos

Due to the dense nature of the task, manually collecting annotations for dense video captioning is particularly expensive. Hence, we pre-train the Vid2Seq model using unlabeled narrated videos, which are easily available at scale. In particular, we use the YT-Temporal-1B dataset, which includes 18 million narrated videos covering a wide range of domains.

We use the transcribed speech sentences and their corresponding timestamps as supervision, cast as a single sequence of tokens. We pre-train Vid2Seq with two objectives: a generative objective that teaches the decoder to predict the transcribed speech sequence given visual inputs only, and a denoising objective that encourages multimodal learning by requiring the model to predict masked tokens given a noisy transcribed speech sequence and visual inputs. In particular, noise is added to the speech sequence by randomly masking out spans of tokens.
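The span masking used to add noise to the speech sequence can be sketched as follows. This is a simplified illustration in the style of span-corruption objectives: the sentinel format `<mask_k>`, the parameters `start_prob` and `span_len`, and the function name are assumptions, and the actual Vid2Seq masking schedule may differ.

```python
import random
from typing import List, Optional, Tuple


def mask_spans(
    tokens: List[str],
    start_prob: float = 0.05,   # assumption: chance that a span starts at a position
    span_len: int = 5,          # assumption: fixed span length, for simplicity
    seed: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Replace random spans with sentinel tokens.

    Returns (corrupted, target): `corrupted` is the noisy encoder input, and
    `target` lists each sentinel followed by the tokens it replaced.
    """
    rng = random.Random(seed)
    corrupted: List[str] = []
    target: List[str] = []
    i, sentinel_id = 0, 0
    while i < len(tokens):
        if rng.random() < start_prob:
            span = tokens[i:i + span_len]
            sentinel = f"<mask_{sentinel_id}>"
            corrupted.append(sentinel)   # the whole span collapses to one sentinel
            target.append(sentinel)
            target.extend(span)          # the decoder must recover the hidden span
            i += len(span)
            sentinel_id += 1
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target


if __name__ == "__main__":
    speech = "<time=0> <time=50> the dog pulls the sled across the snow".split()
    noisy, to_predict = mask_spans(speech, start_prob=0.3, span_len=2, seed=0)
    print(noisy)
    print(to_predict)
```

In this sketch time tokens are masked like any other token, so the model may be asked to recover timestamps as well as words from the surrounding speech and the visual input.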

Vid2Seq is pre-trained on unlabeled narrated videos with a generative objective (top) and a denoising objective (bottom).

Results on downstream dense video captioning benchmarks

The resulting pre-trained Vid2Seq model can be fine-tuned on downstream tasks with a simple maximum likelihood objective using teacher forcing (i.e., predicting the next token given the previous ground-truth tokens). After fine-tuning, Vid2Seq notably improves the state of the art on three standard downstream dense video captioning benchmarks (ActivityNet Captions, YouCook2, and ViTT) and two video clip captioning benchmarks (MSR-VTT and MSVD). In our paper, we provide additional ablation studies, qualitative results, and results in the few-shot setting and on the video paragraph captioning task.
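As a rough illustration of teacher forcing, the sketch below computes the average negative log-likelihood of a target token sequence, always feeding the model the ground-truth prefix and never its own predictions. The `next_token_probs` callable is a hypothetical stand-in for the decoder; in the real model the prediction would also be conditioned on the visual and speech encodings, which this toy version omits.

```python
import math
from typing import Callable, List, Sequence


def teacher_forced_nll(
    next_token_probs: Callable[[Sequence[int]], Sequence[float]],
    target: Sequence[int],
    bos: int = 0,  # assumption: id of the start-of-sequence token
) -> float:
    """Average negative log-likelihood of `target` under teacher forcing."""
    if not target:
        raise ValueError("target must contain at least one token")
    prefix: List[int] = [bos]
    total = 0.0
    for token in target:
        probs = next_token_probs(prefix)  # distribution over the vocabulary
        total -= math.log(probs[token])
        prefix.append(token)              # feed the ground-truth token, not the prediction
    return total / len(target)


if __name__ == "__main__":
    def uniform_model(prefix: Sequence[int]) -> Sequence[float]:
        # Toy model: four-token vocabulary, every token equally likely.
        return [0.25, 0.25, 0.25, 0.25]

    print(teacher_forced_nll(uniform_model, [1, 2, 3]))
```

Minimizing this quantity over the training set is what maximum likelihood training amounts to; at inference time the ground-truth prefix is unavailable, so the decoder instead consumes its own previously generated tokens.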

Comparison with state-of-the-art methods on the CIDEr metric (higher is better) for dense video captioning (left) and video clip captioning (right).

Conclusion

We introduce Vid2Seq, a novel visual language model for dense video captioning that simply predicts all event boundaries and captions as a single sequence of tokens. Vid2Seq can be effectively pre-trained on unlabeled narrated videos at scale, and achieves state-of-the-art results on a variety of downstream dense video captioning benchmarks. Learn more from the paper and get the code here.

Acknowledgments

This research was conducted by Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid.

