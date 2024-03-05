



A collage of several “interactive environments” generated by Genie from still images or text prompts.

By now, anyone following generative AI is familiar with tools that can generate passive, consumable content in the form of text, images, video, and audio. Google DeepMind's recently announced Genie model (short for “GENerative Interactive Environment”) does something completely different, turning images into “interactive, playable environments that you can easily create, step into, and explore.”

DeepMind's Genie announcement page shows off many sample GIFs of simple platform-style games generated from static starting images (children's sketches, real-world photos, etc.) or text prompts passed through ImageGen2. Although these nice-looking GIFs gloss over some of the major current limitations discussed in the full research paper, AI researchers are still excited about how Genie's generalizable “fundamental world modeling” could help power machine learning.

Under the hood

At first glance, Genie's output looks similar to what you would get from a basic 2D game engine, but the model does not actually draw sprites and code a playable platformer the way a human game developer would. Instead, the system treats the starting image as a frame of video and generates its best guess about what the entire next frame will look like given a particular input.

To establish its model, Genie started with 200,000 hours of public internet gaming video, which was filtered down to 30,000 hours of standardized video from “hundreds of 2D games.” Individual frames from these videos were then tokenized into a 200-million-parameter model that machine learning algorithms could easily work with.

Images like this, generated via text prompts to the image generator, serve as the starting point for Genie's worldbuilding.

A sample of the interactive movements enabled by Genie from the starting image above.

From here, the system generates a “latent action model” to predict which interactive “actions” (i.e., button presses, etc.) could feasibly and consistently produce the frame-to-frame changes seen in the tokenized video. The system limits potential inputs to a “latent action space” of eight possible inputs (e.g., the four directions of the d-pad plus the diagonals) in order to “make it human playable” (which makes sense, since all of the training videos were human playable).
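To make the idea of a small, fixed action space concrete, here is a minimal sketch of one common way to reduce a continuous signal to one of eight discrete choices: nearest-neighbor lookup against a codebook. This is an illustration only. The text above does not say how Genie implements its action space, and the names `CODEBOOK` and `quantize_action`, the 2-D vectors, and the evenly spaced directions are all assumptions made for the example.

```python
import math

# Hypothetical codebook: eight 2-D vectors, one per discrete latent action.
# Evenly spaced unit directions are used purely for illustration; a trained
# model would learn its own codes from data.
CODEBOOK = [
    (math.cos(k * math.pi / 4), math.sin(k * math.pi / 4)) for k in range(8)
]


def quantize_action(change):
    """Return the index (0-7) of the codebook entry nearest to `change`.

    `change` stands in for whatever continuous summary of the difference
    between two frames an encoder might produce (an assumption here).
    """
    def squared_distance(code):
        return (code[0] - change[0]) ** 2 + (code[1] - change[1]) ** 2

    return min(range(len(CODEBOOK)), key=lambda i: squared_distance(CODEBOOK[i]))


# A mostly rightward change snaps to action 0; an upward change to action 2.
right = quantize_action((1.0, 0.1))
up = quantize_action((0.0, 2.0))
```

Whatever the underlying signal looks like, the result is always one of eight ids, which is what would let a player drive the model with an ordinary controller.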

Once the latent action model is established, Genie generates a “dynamics model” that can take any number of frames and latent actions and produce an educated guess about what the next frame will look like given those inputs. This final model ends up with 10.7 billion parameters trained on 942 billion tokens, but Genie's results suggest that even larger models would produce better results.
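The frame-by-frame generation described above can be sketched as a simple loop in which each predicted frame is fed back in as context for the next prediction. This is a toy under stated assumptions, not Genie's implementation: `toy_dynamics`, `rollout`, and the `MOVES` table are hypothetical names, and a “frame” here is just an (x, y) position rather than an image.

```python
# Hypothetical mapping from eight discrete action ids to (dx, dy) moves.
MOVES = {
    0: (1, 0), 1: (1, 1), 2: (0, 1), 3: (-1, 1),
    4: (-1, 0), 5: (-1, -1), 6: (0, -1), 7: (1, -1),
}


def toy_dynamics(frames, action):
    """Stand-in for a learned dynamics model.

    A real model would predict a full image from all prior frames; this toy
    only looks at the latest frame, which is an (x, y) position.
    """
    x, y = frames[-1]
    dx, dy = MOVES[action]
    return (x + dx, y + dy)


def rollout(dynamics, first_frame, actions):
    """Generate frames one at a time, feeding each prediction back in."""
    frames = [first_frame]
    for action in actions:
        frames.append(dynamics(frames, action))
    return frames


# Two steps right, then one step up.
print(rollout(toy_dynamics, (0, 0), [0, 0, 2]))
# prints [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because every new frame depends on the frames generated before it, any error in one prediction carries forward into the rest of the sequence.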

Previous work using generative AI to build similar interactive models has relied on “ground truth action labels” or textual descriptions of the training data to help guide machine learning algorithms. Genie differs from that work in its ability to “train without actions or text annotations,” inferring the latent actions behind a video using nothing but those hours of tokenized video frames.

“The ability to generalize to such [out-of-distribution] inputs highlights the robustness of our approach and the value of training on large-scale data, which would not have been possible using real actions as inputs,” the Genie team said in its research paper.

