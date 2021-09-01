



Posted by Forrester Cole, Software Engineer, and Tali Dekel, Research Scientist

Image and video editing operations often rely on accurate mattes (images that define the separation between foreground and background). Modern computer vision techniques can produce high-quality mattes for natural images and videos, enabling real-world applications such as generating synthetic depth of field, editing and compositing images, and removing backgrounds from images. However, one fundamental piece is still missing: the various scene effects that a subject may produce, such as shadows, reflections, and smoke, are usually overlooked.

“Omnimatte: Associating Objects and their Effects in Video,” presented at CVPR 2021, describes a new approach to matte generation that leverages layered neural rendering to separate a video into layers called omnimattes, which include not only the subjects but also all of the effects related to them in the scene. A typical state-of-the-art segmentation model extracts masks for the subjects in a scene, such as a person and a dog, whereas the method proposed here can also isolate and extract additional details associated with each subject, such as the shadow it casts on the ground.

A state-of-the-art segmentation network (such as MaskRCNN) takes an input video (left) and produces plausible masks for people and animals (center), but misses their associated effects. Our method produces mattes that include not only the subjects but also their shadows (right; the individual channels for the person and the dog are visualized in blue and green).

Also, unlike segmentation masks, omnimattes can capture partially transparent, soft effects such as reflections, splashes, and tire smoke. Like conventional mattes, omnimattes are RGBA images that can be manipulated using widely available image or video editing tools, and they can be used wherever conventional mattes are used, for example to insert text into a video beneath a trail of smoke.

Layered Decomposition of Video

To generate omnimattes, we split the input video into a set of layers: one for each moving subject, and one additional layer for stationary background objects. In the example below, there is one layer for the person, one layer for the dog, and one layer for the background. When merged using conventional alpha blending, these layers reproduce the input video.
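The blending step can be illustrated with a minimal sketch. It assumes each layer is stored as a floating-point RGBA array with values in [0, 1] and that layers are listed from back to front; the function name `composite_layers` and the toy pixel values are hypothetical and are not taken from the Omnimatte code.

```python
import numpy as np

def composite_layers(layers):
    """Blend RGBA layers back to front with the standard "over" operator.

    layers: list of float arrays of shape (H, W, 4) with values in [0, 1],
    ordered from the background (first) to the frontmost layer (last).
    Returns an (H, W, 3) RGB frame.
    """
    height, width, _ = layers[0].shape
    frame = np.zeros((height, width, 3))
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        # Each layer covers what is behind it in proportion to its alpha.
        frame = alpha * rgb + (1.0 - alpha) * frame
    return frame

# Toy 1x2 frame: an opaque gray background and a half-transparent red
# subject that only touches the left pixel.
background = np.array([[[0.5, 0.5, 0.5, 1.0], [0.5, 0.5, 0.5, 1.0]]])
person = np.array([[[1.0, 0.0, 0.0, 0.5], [0.0, 0.0, 0.0, 0.0]]])
frame = composite_layers([background, person])
```

Because the alpha channel is continuous rather than binary, the same operation handles soft effects such as shadows or smoke: a pixel with an alpha of 0.5 simply mixes the layer color with whatever lies behind it.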

Besides reproducing the video, the decomposition must capture the correct effects in each layer. For example, if the person’s shadow appears in the dog’s layer, the merged layers will still reproduce the input video, but inserting an additional element between the person and the dog will produce an obvious error. The challenge is to find a decomposition in which each subject’s layer captures only that subject’s effects, producing a true omnimatte.

Our solution is to apply our previously developed layered neural rendering approach to train a convolutional neural network (CNN) that maps a subject’s segmentation mask and a background noise image to an omnimatte. Due to their structure, CNNs are naturally inclined to learn correlations between image effects, and the stronger the correlation between effects, the easier it is for a CNN to learn. For example, in the video above, the spatial relationships between the person and their shadow, and between the dog and its shadow, stay roughly the same as they walk from right to left. The relationships between the person and the dog’s shadow, or between the dog and the person’s shadow, change more (and are therefore more weakly correlated). The CNN learns the stronger correlations first, which leads to the correct decomposition.

The details of the omnimatte system are shown below. In a preprocessing step, the user selects the subjects and specifies a layer for each. A segmentation mask for each subject is extracted using an off-the-shelf segmentation network such as MaskRCNN, and camera transformations relative to the background are found using standard camera stabilization tools. A random noise image is defined in the background reference frame and sampled using the camera transformations to produce per-frame noise images. The noise images provide image features that are random but that consistently track the background over time, giving the CNN a natural input from which to learn to reconstruct the background colors.
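As a rough illustration of the noise sampling step, the sketch below looks up a per-frame noise image from a noise image defined over the whole background. It assumes the camera transformation for a frame is available as a 3x3 homography that maps frame pixel coordinates to background coordinates, and it uses nearest-neighbor lookup for brevity; both choices, along with the name `sample_noise`, are assumptions made for this example rather than details confirmed by the post.

```python
import numpy as np

def sample_noise(background_noise, homography, frame_height, frame_width):
    """Sample a per-frame noise image from noise defined in the background frame.

    homography: 3x3 matrix mapping homogeneous frame pixel coordinates
    (x, y, 1) to background reference coordinates (assumed convention).
    Uses nearest-neighbor lookup, clamped to the bounds of the noise image.
    """
    ys, xs = np.mgrid[0:frame_height, 0:frame_width]
    pixels = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])  # (3, N)
    mapped = homography @ pixels
    bg_x = np.rint(mapped[0] / mapped[2]).astype(int)
    bg_y = np.rint(mapped[1] / mapped[2]).astype(int)
    bg_x = np.clip(bg_x, 0, background_noise.shape[1] - 1)
    bg_y = np.clip(bg_y, 0, background_noise.shape[0] - 1)
    return background_noise[bg_y, bg_x].reshape(frame_height, frame_width, -1)

rng = np.random.default_rng(0)
background_noise = rng.random((8, 16, 3))  # noise covering the whole background
# Hypothetical camera transform: this frame sees the background 4 pixels to the right.
shift_right_by_4 = np.array([[1.0, 0.0, 4.0],
                             [0.0, 1.0, 0.0],
                             [0.0, 0.0, 1.0]])
frame_noise = sample_noise(background_noise, shift_right_by_4, 8, 8)
```

Since every frame reads from the same background noise image, a given background point receives the same noise value in every frame in which it is visible, which is what makes the feature consistent over time.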

The rendering CNN takes the segmentation mask and the per-frame noise image as input and produces an RGB color image and an alpha map, which captures the transparency of each layer. These outputs are merged using conventional alpha blending to produce the output frame. The CNN is trained from scratch to reconstruct the input frames by finding the effects not captured in a mask (shadows, reflections, smoke, and so on) and associating them with the given foreground layer, while ensuring that the subject’s alpha roughly includes the segmentation mask. A sparsity loss is also applied to the foreground alpha so that the foreground layers capture only foreground elements and none of the stationary background.
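The training objective described above can be sketched as a sum of three terms. This is a simplified illustration, not the loss from the paper: the L1 form of each term, the one-sided mask penalty, the weights `mask_weight` and `sparsity_weight`, and the function name are all assumptions chosen to keep the example short.

```python
import numpy as np

def omnimatte_style_loss(predicted_frame, target_frame, foreground_alphas, masks,
                         mask_weight=1.0, sparsity_weight=0.01):
    """Illustrative loss: reconstruction + mask containment + alpha sparsity."""
    # The composited layers should reproduce the input frame.
    reconstruction = np.mean(np.abs(predicted_frame - target_frame))
    # Penalize alpha only where it falls below the mask, so the alpha is
    # free to grow beyond the mask to take in shadows or reflections.
    mask_term = np.mean(np.maximum(masks - foreground_alphas, 0.0))
    # Discourage foreground layers from absorbing the static background.
    sparsity = np.mean(np.abs(foreground_alphas))
    return reconstruction + mask_weight * mask_term + sparsity_weight * sparsity

frame = np.zeros((2, 2, 3))
masks = np.array([[1.0, 0.0], [0.0, 0.0]])
loss_tight = omnimatte_style_loss(frame, frame, masks, masks)
loss_empty = omnimatte_style_loss(frame, frame, np.zeros((2, 2)), masks)
```

In this toy case an alpha that exactly covers the mask pays only the small sparsity cost, while an empty alpha pays the much larger mask penalty, so the terms pull the alpha toward covering the subject without spreading further than the reconstruction requires.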

A new rendering network is trained for each video. Because the network only needs to reconstruct a single input video, it can capture fine structures and fast motion in addition to separating the effects of each subject, as shown below. In the walking example, the omnimatte includes the shadow cast on the slats of the park bench. In the tennis example, even the faint shadow and the tennis ball are captured. In the soccer example, the player’s shadow and the ball are decomposed into their proper layers (with a slight error when the player’s foot is occluded by the ball).

This basic model already works well, but the results can be improved by augmenting the input of the CNN with additional buffers such as optical flow or texture coordinates.

Applications

Once the omnimattes are generated, how can they be used? As shown above, an object can be removed simply by removing its layer from the composition. An object can also be duplicated by repeating its layer in the composition. In the example below, the video has been “unwrapped” into a panorama, and the horse is duplicated several times to produce a stroboscopic photograph effect. Note that the shadows the horse casts on the ground and on the obstacle are captured correctly.

A more subtle but powerful application is to retime the subjects. Manipulation of time is widely used in film, but it usually requires separate shots for each subject and a controlled filming environment. A decomposition into omnimattes makes retiming effects possible for everyday videos using only post-processing, simply by changing the playback rate of each layer independently. Because omnimattes are standard RGBA images, this retiming edit can be done using conventional video editing software.
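Retiming a layer amounts to choosing, for each output frame, which source frame of that layer to show. A minimal sketch follows, assuming a constant playback speed and an optional time offset per layer, with the layer held on its first or last frame when the requested time falls outside the clip; the function `retime_layer` and these conventions are illustrative assumptions.

```python
import math

def retime_layer(num_frames, speed, offset=0.0):
    """Return, for each output frame, the source frame index to show for one layer.

    speed: playback rate relative to the original (0.5 is half speed).
    offset: source time, in frames, shown at output frame 0.
    Indices are clamped so the layer freezes at the ends of the clip.
    """
    last = num_frames - 1
    indices = []
    for t in range(num_frames):
        source = math.floor(offset + speed * t)
        indices.append(min(max(source, 0), last))
    return indices

slow = retime_layer(6, 0.5)               # [0, 0, 1, 1, 2, 2]
fast = retime_layer(6, 2.0)               # [0, 2, 4, 5, 5, 5]
delayed = retime_layer(6, 1.0, offset=-2)  # [0, 0, 0, 1, 2, 3]
```

Each layer gets its own index list, and the retimed layers are then composited frame by frame as usual, so the background can keep playing at its original rate while individual subjects are slowed, sped up, or shifted in time.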

The video below is decomposed into three layers, one for each child. The children’s initial, unsynchronized jumps are aligned by simply adjusting the playback rate of their layers, producing realistic retiming of the splashes and reflections in the water.

In the original video (left), each child jumps at a different time. After editing (right), everyone jumps together.

It is important to consider that any new technique for manipulating images could be misused to produce fake or misleading information, and so it must be developed and applied responsibly. Our technique was developed in accordance with our AI Principles and only allows rearrangement of content already present in the video, but as these examples show, even simple rearrangement can significantly alter the effect of a video. Researchers should be aware of these risks.

Future Work

There are a number of exciting directions for improving the quality of omnimattes. On a practical level, the system currently supports only backgrounds that can be modeled as panoramas, where the position of the camera is fixed. When the camera position moves, the panorama model cannot accurately capture the entire background, and some background elements may clutter the foreground layers (as shown in the figure above). Handling fully general camera motion, such as walking through a room or down a street, would require a 3D background model. Reconstructing 3D scenes in the presence of moving objects and effects is still a difficult research problem, but recent progress is encouraging.

On a theoretical level, the ability of CNNs to learn correlations is powerful, but it is still somewhat mysterious and does not always lead to the expected layer decomposition. Our system allows manual editing when the automatic result is imperfect, but a better solution would be to fully understand the capabilities and limitations of CNNs in learning image correlations. Such an understanding could lead to improved denoising, inpainting, and many other video editing applications besides layer decomposition.

Acknowledgments

Erika Lu, from the University of Oxford, developed the omnimatte system during two internships at Google, in collaboration with Google researchers Forrester Cole, Tali Dekel, Michael Rubinstein, William T. Freeman, and David Salesin, and University of Oxford researchers Weidi Xie and Andrew Zisserman.

Thank you to the friends and families of the authors who agreed to appear in the example videos. The “horse jump”, “lucia”, and “tennis” videos are from the DAVIS 2016 dataset. The soccer video is used with permission from Online Soccer Skills. The car drift video was licensed from Shutterstock.

