Date: 2024-03-15



Google researchers have been working overtime lately, rolling out new models and ideas. The latest is a method that takes a still image and turns it into a controllable avatar, and it follows closely on the heels of an AI agent that plays games.

VLOGGER isn't currently available to try, but the demo suggests you will be able to create an avatar and control it with your voice, and the results look surprisingly realistic.

You can already do something similar, to an extent, with tools like Pika Labs' Lip Sync, HeyGen's video translation service, and Synthesia, but this seems to be a simpler, lower-bandwidth option.

What is VLOGGER?

Currently, VLOGGER is just a research project with some fun demo videos, but once it goes into production, it could become a new way to communicate in Teams and Slack.

It is an AI model that can create animated avatars from still images and maintain the photorealistic appearance of the person in the photo in every frame of the final video.

The model also takes an audio file of the person speaking and generates body and lip movements that reflect how that person would naturally move while saying those words.

This includes head movements, facial expressions, gaze, and eye blinks, as well as hand gestures and upper body movements, with no reference beyond the image and the audio.

How does VLOGGER work?


The model is built on the diffusion architecture that powers text-to-image, video, and even 3D generators such as Midjourney and Runway, but it adds extra control mechanisms.

VLOGGER goes through multiple steps to produce the generated avatar. It first takes the audio and the image as input and runs them through a 3D motion generation process. A "temporal diffusion" model then determines timing and movement, and finally the result is upscaled into the final output.

Essentially, it uses a neural network that takes the still image as the first frame and the audio as a guide to predict the movement of the face, body, pose, gaze, and expression over time.
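The flow of data through the stages described above can be sketched in a few lines of Python. This is only an illustration of the order of the steps, not Google's code: every function name, the 25 frames-per-second rate, and the loudness-based "motion" are assumptions made up for the example, and each stage is a trivial placeholder where the real system uses a trained model.

```python
from dataclasses import dataclass
from typing import List

Image = List[List[float]]  # grayscale pixels in [0, 1]; stand-in for a real image
FPS = 25  # assumed output frame rate (not stated in the article)


@dataclass
class MotionFrame:
    """Hypothetical motion controls for one video frame."""
    mouth_open: float  # 0.0 = closed, 1.0 = fully open


def generate_motion(audio: List[float], sample_rate: int) -> List[MotionFrame]:
    """Stage 1 placeholder: one motion frame per video frame, driven by audio.

    Mouth opening simply follows the average loudness of each audio window.
    """
    samples_per_frame = sample_rate // FPS
    motion = []
    for start in range(0, len(audio) - samples_per_frame + 1, samples_per_frame):
        window = audio[start:start + samples_per_frame]
        loudness = sum(abs(s) for s in window) / len(window)
        motion.append(MotionFrame(mouth_open=min(1.0, loudness)))
    return motion


def render_frames(reference: Image, motion: List[MotionFrame]) -> List[Image]:
    """Stage 2 placeholder for the temporal diffusion model.

    The first frame is the reference image itself; later frames are copies
    whose bottom row is darkened in proportion to the mouth opening.
    """
    frames = []
    for i, m in enumerate(motion):
        frame = [row[:] for row in reference]
        if i > 0:
            frame[-1] = [p * (1.0 - m.mouth_open) for p in frame[-1]]
        frames.append(frame)
    return frames


def upscale(frame: Image, factor: int = 2) -> Image:
    """Stage 3 placeholder: nearest-neighbour upscaling."""
    out = []
    for row in frame:
        wide = [p for p in row for _ in range(factor)]
        out.extend(wide[:] for _ in range(factor))
    return out


def animate(reference: Image, audio: List[float], sample_rate: int) -> List[Image]:
    """Run the stages in order: motion, frame rendering, upscaling."""
    motion = generate_motion(audio, sample_rate)
    low_res = render_frames(reference, motion)
    return [upscale(f) for f in low_res]


image = [[0.5, 0.5], [0.8, 0.8]]
audio = [0.25] * 16000  # one second of constant-amplitude "speech" at 16 kHz
video = animate(image, audio, 16000)
print(len(video), "frames of", len(video[0]), "x", len(video[0][0]), "pixels")
```

The point of the sketch is the shape of the pipeline: audio alone decides the motion, the still image anchors the first frame, and upscaling happens only at the end.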

Training the model required a large multimedia dataset called MENTOR. It contains 800,000 videos of different people talking, with the different parts of their faces and bodies labeled.

What are the limitations of VLOGGER?

This is a research preview rather than an actual product, and although it can generate realistic movements, the video may not always match how the person really moves. It is still a diffusion model at its core, and such models can be prone to strange behavior.

According to the research team, it struggles especially with large movements and diverse environments. It can also handle only relatively short videos.

What are some use cases for VLOGGER?


According to Google researchers, one of the main use cases is video translation: for example, taking an existing video in one language and editing the lips and face to match the new, translated audio.

Other potential use cases include creating animated avatars for virtual assistants, chatbots, or virtual characters that look and move realistically in gaming environments.

There are already tools like Synthesia that do something similar, where users can go into the company's offices and create a virtual avatar of themselves to give presentations, but this new model seems to make the process much easier.

One potential use is to provide low-bandwidth video communications. A future version of this model could enable voice-to-video chat by animating still-image avatars.

This could prove especially useful in VR environments on headsets like Meta Quest and Apple Vision Pro, where it could operate independently of the platform's own avatar models.
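To get a feel for why sending audio plus a single still image could be a low-bandwidth option, here is a rough back-of-the-envelope calculation. The bitrates and the image size are illustrative assumptions chosen for the example, not figures from Google or from the article.

```python
def call_megabytes(bitrate_kbps: float, minutes: float) -> float:
    """Data sent at a constant bitrate, in megabytes (1 MB = 10**6 bytes)."""
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000


# All three values below are assumptions for illustration only.
AUDIO_KBPS = 32       # assumed voice-codec bitrate
VIDEO_KBPS = 1500     # assumed webcam video bitrate
STILL_IMAGE_MB = 0.5  # assumed one-off size of the reference photo

minutes = 30
video_call = call_megabytes(VIDEO_KBPS + AUDIO_KBPS, minutes)
avatar_call = call_megabytes(AUDIO_KBPS, minutes) + STILL_IMAGE_MB

print(f"Regular video call: {video_call:.1f} MB")
print(f"Audio plus still image: {avatar_call:.1f} MB")
```

Under these assumed numbers, the audio-driven call sends only a small fraction of the data, because the video frames would be generated on the receiving side instead of being transmitted.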

