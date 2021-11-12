



Posted by Dave Epstein, Student Researcher, and Chen Sun, Staff Research Scientist, Google Research

Machine learning (ML) agents are increasingly deployed in the real world to make decisions and assist people in their daily lives. Making reasonable predictions about the future at varying timescales is one of the most important capabilities for such agents, because it lets them anticipate changes in the world around them, including the behavior of other agents, and plan how to act next. Importantly, successful future prediction requires both capturing meaningful transitions in the environment (e.g., dough turning into bread) and adapting to how transitions unfold over time in order to make decisions.

Previous work on predicting the future from visual observations has largely been constrained by the format of its output (e.g., pixels that represent an image) or by a manually defined set of human activities (e.g., predicting whether someone will keep walking, sit down, or jump). These outputs are either too detailed and hard to predict, or they lack important information about the richness of the real world. For example, predicting “person jumping” does not capture why the person is jumping or what they are jumping onto. Also, with very few exceptions, previous models were designed to make predictions at a fixed offset into the future. This is a limiting assumption, because we rarely know when a meaningful future state will occur.

For example, in a video about making ice cream (shown below), the meaningful transition from “cream” to “ice cream” occurs over 35 seconds, so a model predicting such a transition would need to look 35 seconds ahead. But this time interval varies greatly across activities and videos: meaningful transitions can occur at any distance into the future. Learning to make such predictions at flexible intervals is hard because the desired ground truth can be relatively ambiguous. For example, the correct prediction could be the just-churned ice cream in the machine, or scoops of ice cream in a bowl. In addition, collecting such annotations at scale (i.e., frame by frame for millions of videos) is infeasible. However, many existing instructional videos come with speech transcripts, which often offer concise, general descriptions throughout the entire video. This source of data can guide a model’s attention toward the important parts of a video, removing the need for manual labeling and allowing a flexible, data-driven definition of the future.

In “Learning Temporal Dynamics from Cycles in Narrated Video”, published at ICCV 2021, we propose a self-supervised approach that uses a recent large unlabeled dataset of diverse human action. The resulting model operates at a high level of abstraction, can make predictions arbitrarily far into the future, and chooses how far into the future to predict based on context. Called Multi-Modal Cycle Consistency (MMCC), it leverages narrated instructional video to learn a strong predictive model of the future. We show how MMCC can be applied to a variety of challenging tasks without fine-tuning, and we qualitatively examine its predictions. In the example below, MMCC predicts the future (d) from the present frame (a), rather than the less relevant potential futures (b) or (c).

This work uses cues from vision and language to predict high-level changes in video, such as cream becoming ice cream (video from HowTo100M).

Viewing Videos as Graphs

The foundation of our method is to represent narrated videos as graphs. We view a video as a collection of nodes, where each node is either a video frame (sampled at 1 frame per second) or a segment of narrated text (extracted with an automatic speech recognition system), encoded by a neural network. During training, MMCC constructs a graph from these nodes, using cross-modal edges to connect video frames and text segments that refer to the same state, and temporal edges to connect the present (e.g., strawberry-flavored cream) with the future (e.g., soft-serve ice cream). The temporal edges operate on both modalities equally: they can start from a video frame, some text, or both, and they can connect to a future (or past) state in either modality. MMCC achieves this by learning a latent representation shared by frames and text, and then making predictions in that representation space.

Multi-Modal Cycle Consistency

To learn the cross-modal and temporal edge functions without supervision, we apply the idea of cycle consistency. Here, cycle consistency refers to the construction of cycle graphs, in which the model builds a series of edges from an initial node to other nodes and back again. Given a start node (e.g., a sampled video frame), the model is expected to find its cross-modal counterpart (i.e., the text describing the frame) and combine the two as the present state. To do this, at the start of training the model assumes that frames and text with the same timestamp are counterparts, but it relaxes this assumption later. The model then predicts a future state, and the node most similar to this prediction is selected. Finally, the model attempts to invert the above steps by predicting the present state backward from the future node, thereby connecting the future node back to the start node.

The discrepancy between the model’s prediction of the present from the future and the actual present is the cycle-consistency loss. Intuitively, this training objective requires the predicted future to contain enough information about its past to be invertible. This leads to predictions that correspond to meaningful changes to the same entities (e.g., tomatoes becoming marinara sauce, or flour and eggs in a bowl becoming dough). Moreover, the inclusion of cross-modal edges ensures that future predictions are meaningful in either modality.
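The loss described above can be illustrated with a toy example. The sketch below is an assumption, not the paper's exact formulation: it treats the loss as the negative log-probability that the backward prediction lands on the start node, with dot-product similarity over made-up 2-D embeddings. The names `cycle_loss` and `nodes` are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cycle_loss(backward_prediction, node_embeddings, start_index):
    """Negative log-probability that the backward prediction returns to the start node.

    Assumed form of the loss: cross-entropy over dot-product similarities.
    """
    scores = [
        sum(a * b for a, b in zip(backward_prediction, node))
        for node in node_embeddings
    ]
    probs = softmax(scores)
    return -math.log(probs[start_index])

# Toy embeddings for three nodes; node 0 is the start of the cycle.
nodes = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

good = cycle_loss([2.0, 0.0], nodes, start_index=0)  # points back at node 0
bad = cycle_loss([0.0, 2.0], nodes, start_index=0)   # points at node 1 instead
```

A backward prediction that points at the start node yields a smaller loss than one that points elsewhere, which is the pressure that pushes the model toward invertible futures.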

To learn the temporal and cross-modal edge functions end-to-end, we use soft attention. This technique first outputs how likely each node is to be the target node of an edge, and then “picks” a node by taking the weighted average over all possible candidates. Importantly, this cyclic graph constraint makes few assumptions about the kind of temporal edges the model should learn, as long as they end up forming a consistent cycle. This enables the emergence of the long-term temporal dynamics that are critical for future prediction, without requiring manual labels of meaningful changes.
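As a rough illustration of the soft attention step described above, the sketch below scores every candidate node against a query, converts the scores to probabilities, and returns the weighted average of the candidates. The dot-product similarity, the `temperature` parameter, and the toy 2-D embeddings are assumptions made for the example, not details taken from the source.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_select(query, candidates, temperature=1.0):
    """Soft attention: weight each candidate node by its similarity to the query.

    Returns (weights, blended), where blended is the weighted average of the
    candidate embeddings, so the "pick" stays differentiable.
    """
    weights = softmax([dot(query, c) for c in candidates], temperature)
    dim = len(candidates[0])
    blended = [sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(dim)]
    return weights, blended

query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

weights, blended = soft_select(query, candidates)
sharp_weights, _ = soft_select(query, candidates, temperature=0.01)
```

With a low temperature the weights concentrate on the single best match, which approximates a hard choice of node while the default setting spreads weight across several candidates.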

An example of the training objective: a cycle graph is expected to be constructed between the chicken with soy sauce and the chicken in chili oil, because they are two adjacent steps in the chicken’s preparation (video from HowTo100M).

Discovering Cycles in Real-World Video

MMCC is trained without any explicit ground truth, using only long video sequences and randomly sampled starting conditions (a frame or a text excerpt), and asking the model to find temporal cycles. After training, MMCC can identify meaningful cycles that capture complex changes in video.

Given a frame as input (left), MMCC selects the relevant text from the video narration and uses both modalities to predict a future frame (center). It then finds the text relevant to this future and uses it to predict the past (right). Using its knowledge of how objects and scenes change over time, MMCC “closes the cycle” and ends up where it started (videos from HowTo100M). The model can also start from narrated text rather than frames and still find relevant transitions (videos from HowTo100M).

Zero-Shot Applications

For MMCC to identify meaningful transitions over time in an entire video, we define a “likely transition score” for each pair of frames (A, B) in the video, according to the model’s predictions: the closer B is to the model’s prediction of the future of A, the higher the score assigned to the pair. We then rank all pairs by this score and show the highest-scoring pairs of present and future frames detected in previously unseen videos (examples below).
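The pair scoring described above might look roughly like the following sketch. It assumes cosine similarity as the measure of closeness and only scores pairs where B comes after A; `predict_future` is a hypothetical stand-in for the trained model (here a simple 90-degree rotation of toy 2-D embeddings), not an actual MMCC interface.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def rank_transitions(frames, predict_future):
    """Score every pair (A, B) with B later than A and sort by score, highest first.

    The score is how close B's embedding is to the predicted future of A.
    Returns a list of (score, index_of_A, index_of_B) tuples.
    """
    scored = []
    for a, frame_a in enumerate(frames):
        predicted = predict_future(frame_a)
        for b in range(a + 1, len(frames)):
            scored.append((cosine(predicted, frames[b]), a, b))
    return sorted(scored, reverse=True)

# Hypothetical stand-in for the trained predictor: rotate the embedding by 90 degrees.
def predict_future(v):
    return [-v[1], v[0]]

frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
ranked = rank_transitions(frames, predict_future)
best_score, best_a, best_b = ranked[0]
```

In this toy setup the top-ranked pair is frame 0 followed by frame 2, because frame 2 coincides with the predicted future of frame 0.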

The highest-scoring pairs from eight random videos, which showcase the versatility of the model across a wide range of tasks (videos from HowTo100M).

We can use this same approach to temporally sort an unordered collection of video frames, without any fine-tuning, by finding the ordering that maximizes the overall confidence score between all adjacent frames in the sorted sequence.
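The ordering step can be sketched as a search over permutations. The source does not say how the search is carried out, so the exhaustive search below is an assumption that is only practical for a handful of frames; `toy_score` and `true_position` are made-up stand-ins for the model's pairwise confidence scores.

```python
from itertools import permutations

def best_order(num_frames, score):
    """Return the ordering of frame indices with the highest sum of adjacent-pair scores.

    Exhaustive search: tries every permutation, so cost grows factorially.
    """
    def total(order):
        return sum(score(a, b) for a, b in zip(order, order[1:]))
    return max(permutations(range(num_frames)), key=total)

# Toy data: shuffled frame index -> true time step (hidden from the search).
true_position = {0: 2, 1: 0, 2: 3, 3: 1}

def toy_score(a, b):
    """Stand-in for the model's confidence that frame b directly follows frame a."""
    return 1.0 if true_position[b] - true_position[a] == 1 else 0.0

order = best_order(4, toy_score)
```

Here the search recovers the hidden order, returning frame 1, then 3, then 0, then 2. For longer sequences a greedy or dynamic-programming search would be needed in place of brute force.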

Left: Shuffled frames from three videos. Right: MMCC unshuffles the frames. The true order is shown under each frame. Even when MMCC does not predict the ground truth, its predictions often appear reasonable, so it can offer a plausible alternative ordering (videos from HowTo100M).

Evaluating Future Prediction

We evaluate the model’s ability to anticipate actions, potentially minutes in advance, using the top-k recall metric, which here measures a model’s ability to retrieve the correct future (higher is better). On CrossTask, a dataset of instructional videos with labels describing key steps, MMCC outperforms the previous self-supervised state-of-the-art models at inferring possible future actions.

Recall
Model          Top-1   Top-5   Top-10
Cross-modal      2.9    14.2     24.3
Repr. ant.       3.0    13.3     26.0
MemDPC           2.9    15.8     27.4
TAP              4.5    17.1     27.9
MMCC             5.4    19.9     33.8
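For readers unfamiliar with the metric, top-k recall can be computed as in the sketch below: an example counts as a hit when the true label appears among the model's first k ranked predictions. Reporting the result as a percentage is an assumption about the units used in the table, and the action labels are invented for the example.

```python
def top_k_recall(ranked_predictions, true_labels, k):
    """Percentage of examples whose true label is among the first k ranked predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return 100.0 * hits / len(true_labels)

# Invented ranked predictions for four examples, best guess first.
ranked = [
    ["pour milk", "stir", "whisk"],
    ["cut", "peel", "fry"],
    ["bake", "mix", "knead"],
    ["boil", "drain", "salt"],
]
truth = ["pour milk", "peel", "knead", "serve"]

top1 = top_k_recall(ranked, truth, k=1)
top2 = top_k_recall(ranked, truth, k=2)
top3 = top_k_recall(ranked, truth, k=3)
```

Recall can only stay the same or rise as k grows, which is why each row of the table increases from Top-1 to Top-10.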

Conclusions

We have introduced a self-supervised method for learning temporal dynamics by cycling through narrated instructional videos. Despite the simplicity of the model’s architecture, it discovers meaningful long-term transitions in vision and language, and it can be applied without further training to challenging downstream tasks, such as anticipating far-away actions and ordering collections of images. An interesting future direction is to transfer the model to agents so that they can use it for long-term planning.

Acknowledgments

The core team includes Dave Epstein, Jiajun Wu, Cordelia Schmid, and Chen Sun. We thank Alexei Efros, Mia Chiquier, and Shiry Ginosar for their feedback, and Allan Jabri for inspiration in figure design. Dave would like to thank Dídac Surís and Carl Vondrick for insightful early discussions on cycling through time in video.

