



Posted by Jonathan Ho, Research Scientist, and Chitwan Saharia, Software Engineer, Google Research, Brain Team

Natural image synthesis is a broad class of machine learning (ML) tasks with a wide variety of applications, and it poses a number of design challenges. One example is image super-resolution, in which a model is trained to transform a low-resolution image into a detailed high-resolution image (e.g., RAISR). Super-resolution has many uses, from restoring old family portraits to improving medical imaging systems. Another image synthesis task is class-conditional image generation, in which a model is trained to generate a sample image from an input class label. The resulting samples can be used to improve the performance of downstream models for tasks such as image classification and segmentation.

In general, these image synthesis tasks are performed by deep generative models such as GANs, VAEs, and autoregressive models. However, each of these generative models has drawbacks when trained to synthesize high-quality samples on difficult, high-resolution datasets. For example, GANs often suffer from unstable training and mode collapse, and autoregressive models typically suffer from slow synthesis speed.

Alternatively, diffusion models, first proposed in 2015, have recently seen a revival of interest due to their training stability and their promising sample quality results on image and audio generation. They therefore offer a potentially favorable trade-off compared to other types of deep generative models. Diffusion models work by corrupting the training data with progressively added Gaussian noise, slowly erasing details in the data until it becomes pure noise, and then training a neural network to reverse this corruption process. Running the reversed corruption process synthesizes data from pure noise by gradually denoising it until a clean sample is produced. This synthesis procedure can be interpreted as an optimization algorithm that follows the gradient of the data density to produce likely samples.
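The corruption process described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the linear noise schedule, its endpoints, the step count, and the names `make_alpha_bar` and `corrupt` are all assumptions chosen for the example. It uses the standard closed form in which the noisy sample at step t is a weighted mix of the clean data and Gaussian noise.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative fraction of signal kept at each step.

    The linear schedule and its endpoints are assumed values for illustration.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def corrupt(x0, t, alpha_bar, rng):
    """Draw a noisy version of x0 at step t. Returns (x_t, noise)."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))      # toy stand-in for an image
x_early, _ = corrupt(x0, 10, alpha_bar, rng)  # still close to x0
x_late, _ = corrupt(x0, 999, alpha_bar, rng)  # almost pure noise
```

During training, a network would be asked to recover `noise` (or equivalently the clean data) from `x_t`, which is what lets it reverse the corruption at synthesis time.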

Today we present two connected approaches that push the boundaries of image synthesis quality for diffusion models: a super-resolution model based on iterative refinement (SR3) and a class-conditional synthesis model called Cascaded Diffusion Models (CDM). We show that by scaling up diffusion models and using carefully selected data augmentation techniques, we can outperform existing approaches. Specifically, SR3 attains strong image super-resolution results that surpass GANs in human evaluations. CDM generates high fidelity ImageNet samples that surpass BigGAN-deep and VQ-VAE-2 by a large margin on both FID score and classification accuracy score.

SR3: Image Super-Resolution

SR3 is a super-resolution diffusion model that takes a low-resolution image as input and builds the corresponding high-resolution image from pure noise. The model is trained on an image corruption process in which noise is progressively added to a high-resolution image until only pure noise remains. It then learns to reverse this process, starting from pure noise and progressively removing noise to reach the target distribution, guided by the input low-resolution image.
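The reverse process can be sketched as a loop that starts from noise and denoises step by step while conditioning on the low-resolution input. The sketch below uses the standard DDPM ancestral sampling update, which is an assumption here since the post does not spell out the update rule. The denoiser is a hypothetical placeholder that pretends the upsampled input is the clean image; a real model would be a trained neural network. The nearest-neighbor upsampling, the noise schedule, and all function names are likewise illustrative.

```python
import numpy as np

def upsample_nearest(image, factor):
    """Nearest-neighbor upsampling, a simple stand-in for a proper interpolation."""
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

def make_placeholder_denoiser(alpha_bar):
    """Hypothetical stand-in for a trained network that predicts the noise in x_t.

    It assumes the clean image equals the upsampled low-resolution input.
    """
    def denoiser(x_t, cond, t):
        return (x_t - np.sqrt(alpha_bar[t]) * cond) / np.sqrt(1.0 - alpha_bar[t])
    return denoiser

def sample(low_res, factor, betas, rng):
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    denoiser = make_placeholder_denoiser(alpha_bar)
    cond = upsample_nearest(low_res, factor)  # guidance from the low-res input
    x = rng.standard_normal(cond.shape)       # start from pure noise
    for t in range(len(betas) - 1, -1, -1):
        eps = denoiser(x, cond, t)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        # No fresh noise is added on the final step.
        z = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * z
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.2, 50)            # assumed short schedule
low_res = rng.uniform(-1.0, 1.0, size=(4, 4))
result = sample(low_res, 4, betas, rng)       # 4x4 -> 16x16
```

With this placeholder, the loop simply converges to the upsampled input; the point is only to show where the low-resolution image enters each refinement step.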

With large-scale training, SR3 achieves strong benchmark results on super-resolution tasks for face and natural images when scaling to resolutions 4x to 8x that of the input low-resolution image. These super-resolution models can be further cascaded to increase the effective super-resolution scale factor. For example, stacking a 64×64 → 256×256 and a 256×256 → 1024×1024 face super-resolution model performs a 64×64 → 1024×1024 super-resolution task.
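The arithmetic of cascading is simple: the effective scale factor is the product of the per-stage factors. The sketch below assumes each stage can be treated as a function from image to image; `upsample_stage` is a hypothetical stand-in that uses nearest-neighbor upsampling in place of a trained super-resolution model.

```python
import numpy as np

def upsample_stage(factor):
    """Hypothetical stand-in for one super-resolution model."""
    def run(image):
        return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)
    return run

def cascade(image, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        image = stage(image)
    return image

low_res = np.zeros((64, 64))
stages = [upsample_stage(4), upsample_stage(4)]  # 64 -> 256 -> 1024
high_res = cascade(low_res, stages)
effective_factor = high_res.shape[0] // low_res.shape[0]  # 4 * 4 = 16
```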

We compare SR3 with existing methods using a human evaluation study. We conduct a two-alternative forced-choice experiment in which subjects are asked to choose between the reference high-resolution image and the model output when asked, “Which image would you guess is from a camera?” We measure the performance of the model through confusion rates (the percentage of the time that raters choose the model output over the reference image; a perfect algorithm would achieve a 50% confusion rate). The results of this study are shown in the figure below.
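As a small illustration of the metric, the confusion rate is just the share of trials in which a rater picked the model output. The function name and the string labels below are assumptions made for this example.

```python
def confusion_rate(choices):
    """Percentage of trials in which the rater chose the model output.

    `choices` holds one entry per trial, either "model" or "reference".
    """
    choices = list(choices)
    if not choices:
        raise ValueError("at least one trial is required")
    picked_model = sum(1 for choice in choices if choice == "model")
    return 100.0 * picked_model / len(choices)

# Raters who cannot tell the two apart pick the model output half the time.
rate = confusion_rate(["model", "reference", "reference", "model"])  # 50.0
```

A rate well below 50% means raters can usually spot the model output, while a rate near 50% means the outputs are hard to distinguish from the reference images.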

Top: On the 16×16 → 128×128 face task, we achieve a confusion rate close to 50%, outperforming the state-of-the-art face super-resolution methods PULSE and FSRGAN. Bottom: On the much more difficult task of 64×64 → 256×256 natural images, we achieve a confusion rate of 40%, outperforming the regression baseline by a large margin.

CDM: Class-Conditional ImageNet Generation

Having shown the effectiveness of SR3 for super-resolution of natural images, we go a step further and use these SR3 models for class-conditional image generation. CDM is a class-conditional diffusion model trained on ImageNet data to generate high-resolution natural images. Since ImageNet is a difficult, high-entropy dataset, we built CDM as a cascade of multiple diffusion models. This cascading approach chains together multiple generative models over several spatial resolutions: one diffusion model generates data at a low resolution, followed by a sequence of SR3 super-resolution diffusion models that gradually increase the resolution of the generated image to the highest resolution. Cascading is well known to improve quality and training speed for high-resolution data, as shown by previous studies (for example, on autoregressive models and VQ-VAE-2) and by concurrent work on diffusion models. As the quantitative results below show, CDM further highlights the effectiveness of cascading in diffusion models, both for sample quality and for usefulness in downstream tasks such as image classification.

Example of a cascading pipeline that includes a sequence of diffusion models: the first generates a low-resolution image, and the rest upsample it to the final high-resolution image. The pipeline here is for class-conditional ImageNet generation, starting with a class-conditional diffusion model at 32×32 resolution, followed by 2x and 4x class-conditional super-resolution using SR3.

Selected generated images from our 256×256 cascaded class-conditional ImageNet model.

In addition to including the SR3 model in the cascading pipeline, we also introduce a new data augmentation technique, which we call conditioning augmentation, that further improves the sample quality results of CDM. The super-resolution models in CDM are trained on original images from the dataset, but during generation they must perform super-resolution on images generated by the low-resolution base model rather than on original images. This causes a train-test mismatch for the super-resolution models. Conditioning augmentation refers to applying data augmentation to the low-resolution input image of each super-resolution model in the cascading pipeline. These augmentations, which in our case include Gaussian noise and Gaussian blur, prevent each super-resolution model from overfitting to its low-resolution conditioning input, and ultimately improve the quality of CDM's high-resolution samples.
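A minimal sketch of conditioning augmentation follows, under stated assumptions: it blurs and/or adds Gaussian noise to the low-resolution conditioning image before it would be passed to a super-resolution model during training. The function names, the separable blur implementation, and the parameter values are illustrative and are not taken from the CDM code.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at about three sigma."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur of a 2-D image with edge padding."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def augment_conditioning(low_res, rng, blur_sigma=None, noise_std=0.0):
    """Corrupt the low-resolution conditioning image (training time only)."""
    out = low_res
    if blur_sigma is not None:
        out = gaussian_blur(out, blur_sigma)
    if noise_std > 0.0:
        out = out + noise_std * rng.standard_normal(out.shape)
    return out

rng = np.random.default_rng(0)
low_res = rng.uniform(-1.0, 1.0, size=(32, 32))
augmented = augment_conditioning(low_res, rng, blur_sigma=1.0, noise_std=0.1)
```

Because the super-resolution model only ever sees slightly degraded conditioning inputs during training, it is less likely to rely on fine details that a generated low-resolution image may not reproduce faithfully.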

Overall, CDM generates high fidelity samples that are superior to BigGAN-deep and VQ-VAE-2 in terms of both FID score and classification accuracy score on class-conditional ImageNet generation. Unlike other models such as ADM and VQ-VAE-2, CDM is a pure generative model that does not use a classifier to improve sample quality. See below for quantitative results on sample quality.

Class-conditional ImageNet FID scores at 256×256 resolution for methods that do not use additional classifiers to improve sample quality. BigGAN-deep is reported at its best truncation value. (Lower is better.)

ImageNet classification accuracy scores at 256×256 resolution, measuring the validation set accuracy of a classifier trained on generated data. Data generated by CDM attains significant gains over existing methods, narrowing the gap in classification accuracy between real and generated data. (Higher is better.)

Conclusion

With SR3 and CDM, we have pushed the performance of diffusion models to the state of the art on super-resolution and class-conditional ImageNet generation benchmarks. We are excited to further test the limits of diffusion models on a wide variety of generative modeling problems. For more information on our work, see Image Super-Resolution via Iterative Refinement and Cascaded Diffusion Models for High Fidelity Image Generation.

Acknowledgments: We thank our co-authors William Chan, Mohammad Norouzi, Tim Salimans, and David Fleet. We would also like to thank Ben Poole, Jascha Sohl-Dickstein, Doug Eck, and others on the Google Research, Brain Team for research discussions and support. Thanks to Tom Small for helping with the animations.

