2022 was a great year for generative AI, with the release of models such as DALL-E 2, Stable Diffusion, Imagen and Parti. And 2023 looks set to continue on that path, as Google introduced its latest text-to-image model, Muse, earlier this month.

Like other text-to-image models, Muse is a deep neural network that takes a text prompt as input and generates an image that fits the description. But what sets Muse apart from its predecessors is its efficiency and precision. By building on the experience of previous research in this area and adding new methods, Google researchers have created a generative model that requires fewer computational resources and improves on some of the problems that other generative models suffer from.

Google’s Muse uses token-based image generation

Muse builds on previous work in deep learning, including large language models (LLMs), quantized generative networks, and masked generative image transformers.

Dilip Krishnan, a research scientist at Google, said that an interest in unifying image and text generation through the use of tokens was a big motivator. Muse, he said, builds on ideas from his group’s previous paper, MaskGIT, and on masked modeling from large language models.

Muse leverages conditioning on a pre-trained language model, as used in previous work, and borrows the idea of cascaded models from Imagen. One interesting difference between Muse and other similar models is that it generates discrete tokens rather than pixel-level representations. This makes the model’s output more stable.

Like other text-to-image generators, Muse is trained on a large corpus of image-caption pairs. A pre-trained LLM processes the captions and generates embeddings, which are multi-dimensional numerical representations of the text descriptions. At the same time, a cascade of two image encoder/decoders transforms different resolutions of the input image into matrices of quantized tokens.
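To make the idea of "matrices of quantized tokens" concrete, here is a minimal sketch of nearest-neighbor vector quantization. It is not Muse's actual encoder: the two-dimensional features, the three-entry codebook and the function name `quantize_grid` are all hypothetical, chosen only to show how continuous image features can be mapped to a grid of discrete token ids.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_grid(feature_grid, codebook):
    """Replace every feature vector with the index of its nearest codebook entry."""
    return [
        [min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
         for vec in row]
        for row in feature_grid
    ]


# Hypothetical codebook (3 entries) and a 2x2 grid of patch features.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
feature_grid = [
    [(0.1, -0.1), (0.9, 0.2)],
    [(0.2, 0.8), (0.6, 0.1)],
]

token_grid = quantize_grid(feature_grid, codebook)
print(token_grid)  # [[0, 1], [2, 1]]
```

A real image tokenizer would learn both the feature extractor and the codebook, and would work with far larger grids and vocabularies.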

During training, the model trains a base transformer and a super-resolution transformer to align the text embeddings with the image tokens and use them to reproduce the image. The model tunes its parameters by randomly masking image tokens and trying to predict them.
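The random masking step can be sketched in a few lines. This is an illustration under stated assumptions, not Google's training code: the `MASK` id, the fixed `mask_ratio` and the helper name `mask_tokens` are hypothetical. A real training loop would feed the masked sequence and the text embeddings to the transformer and score its predictions at the masked positions.

```python
import random

MASK = -1  # hypothetical id reserved for the mask token


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens and remember what was hidden.

    Returns the masked sequence plus a {position: original_token} dict
    that a model would be trained to predict.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK
    targets = {p: tokens[p] for p in positions}
    return masked, targets


rng = random.Random(0)  # seeded so the example is repeatable
image_tokens = [5, 2, 7, 1, 9, 9, 3, 4]
masked, targets = mask_tokens(image_tokens, mask_ratio=0.5, rng=rng)
print(masked)
print(targets)
```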


Once trained, the model can generate image tokens from text embeddings of new prompts and use the image tokens to create new high-resolution images.

According to Krishnan, one of Muse’s innovations is parallel decoding in the token space, which is fundamentally different from both diffusion and autoregressive models. Diffusion models use progressive denoising, while autoregressive models use serial decoding. Parallel decoding makes Muse highly efficient without compromising visual quality.

According to Krishnan, Muse’s decoding process is similar to the process of painting, where an artist starts by sketching out key areas, then gradually fills in color and refines the result by tweaking details.
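A toy loop can illustrate how parallel decoding differs from serial decoding. Everything below is a sketch under assumptions: `toy_predict` is a random stand-in for the transformer, and the cosine schedule is borrowed from the MaskGIT family of methods rather than taken from this article. The point is only that every masked position is predicted at once, the most confident guesses are kept, and the rest are retried in the next step.

```python
import math
import random

MASK = -1  # hypothetical id for a position that has not been decoded yet


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the model: one (token, confidence) guess per masked slot."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def parallel_decode(length, vocab_size, steps, rng):
    """Fill `length` token slots in `steps` rounds instead of `length` rounds."""
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        guesses = toy_predict(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many slots stay masked after this step.
        still_masked = max(0, math.floor(length * math.cos(math.pi / 2 * step / steps)))
        n_commit = max(0, len(guesses) - still_masked)
        # Keep the most confident guesses; the rest remain masked for later steps.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


rng = random.Random(0)
decoded = parallel_decode(length=16, vocab_size=8, steps=4, rng=rng)
print(decoded)
```

Here 16 tokens are filled in 4 passes, whereas a serial, token-by-token decoder would need 16.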

Excellent results for Google Muse

Google has not yet made Muse public, citing the risk that the model could be used for misinformation and harassment, or could reflect various types of social and cultural bias.

However, according to the results published by the research team, Muse matches or beats other state-of-the-art models on CLIP and FID scores, two metrics that measure the quality and accuracy of images produced by generative models.

Muse is faster than Stable Diffusion and Imagen because it uses discrete tokens and a parallel sampling method that reduces the number of sampling iterations required to produce high-quality images.

Interestingly, Muse improves on other models in problem areas such as cardinality (prompts containing a certain number of objects), compositionality (prompts describing a scene containing multiple objects that are related to each other), and text rendering. However, the model still fails on prompts that require long text or a large number of objects to be rendered.

One of Muse’s key advantages is the ability to perform editing tasks without the need for fine-tuning. These features include inpainting (replacing part of an existing image with generated graphics), outpainting (adding detail around an existing image), and mask-free editing (for example, changing the background or specific objects within an image).
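One way to picture why a masked-token model lends itself to inpainting and outpainting is that both reduce to choosing which tokens to hide and regenerate. The sketch below is an assumption for illustration, not Muse's published editing procedure: `mask_region`, the `MASK` id and the 3x3 grid are hypothetical, and the regeneration step itself is omitted.

```python
MASK = -1  # hypothetical id for a token the model should regenerate


def mask_region(grid, top, left, height, width, invert=False):
    """Return a copy of `grid` with a rectangular box of tokens masked.

    invert=False masks the tokens inside the box (inpainting);
    invert=True masks everything outside the box (outpainting).
    """
    result = []
    for r, row in enumerate(grid):
        new_row = []
        for c, token in enumerate(row):
            inside = top <= r < top + height and left <= c < left + width
            new_row.append(MASK if inside != invert else token)
        result.append(new_row)
    return result


token_grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

inpaint_input = mask_region(token_grid, top=1, left=1, height=1, width=1)
outpaint_input = mask_region(token_grid, top=1, left=1, height=1, width=1, invert=True)
print(inpaint_input)   # [[1, 2, 3], [4, -1, 6], [7, 8, 9]]
print(outpaint_input)  # [[-1, -1, -1], [-1, 5, -1], [-1, -1, -1]]
```

In both cases the unmasked tokens act as context, and the same masked-token prediction used in training would fill in the rest, conditioned on the new text prompt.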

With all generative models, refining and editing prompts is a necessity. Krishnan said the efficiency of Muse allows users to do this refinement quickly, which helps the creative process. He added that token-based masking unifies the methods used for text and images, and could potentially be used for other modalities.

Muse is an example of how impressive advances in AI can be made with the right combination of techniques and architectures. The team at Google believes Muse still has room for improvement.

Krishnan sees generative modeling as an emerging research topic. He is interested in directions such as customizing editing based on the Muse model and further speeding up the generation process. These, he said, would also build on existing ideas in the literature.

