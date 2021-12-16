



Posted by Timothy Nguyen¹, Research Engineer, and Jaehoon Lee, Senior Research Scientist, Google Research

For a machine learning (ML) algorithm to be effective, useful features must be extracted from (often) large amounts of training data. However, this process can be challenging because of the costs of training on such large datasets, both in compute requirements and in wall-clock time. Distillation plays an important role in these situations by reducing the resources required for the model to be effective. The most widely known form is model distillation (also known as knowledge distillation), in which the predictions of large, complex teacher models are distilled into smaller models.

An alternative to this model-space approach is dataset distillation [1, 2], in which a large dataset is distilled into a smaller, synthetic dataset. Training a model on such a distilled dataset can reduce memory and compute requirements. For example, instead of using all 50,000 images and labels in the CIFAR-10 dataset, one could use a distilled dataset consisting of only 10 synthetic data points (1 image per class) to train an ML model that still performs well on the unseen test set.

Top: Natural (i.e., unmodified) CIFAR-10 images. Bottom: Distilled dataset (1 image per class) for the CIFAR-10 classification task. Using only these 10 synthetic images as training data, a model can achieve a test set accuracy of approximately 51%.

In “Dataset Meta-Learning from Kernel Ridge Regression”, published at ICLR 2021, and “Dataset Distillation with Infinitely Wide Convolutional Networks”, published at NeurIPS 2021, we introduce two new dataset distillation algorithms, Kernel Inducing Points (KIP) and Label Solve (LS), which optimize datasets using the loss function that arises from kernel regression (a classical machine learning algorithm that fits a linear model to features defined through a kernel). Applying the KIP and LS algorithms, we obtain highly efficient distilled datasets for image classification, reducing the datasets to 1, 10, or 50 data points per class while still achieving state-of-the-art results on a number of benchmark image classification datasets. We are also excited to release our distilled datasets to benefit the wider research community.

Methodology

One of the key theoretical insights into deep neural networks (DNNs) in recent years is that increasing the width of DNNs results in more regular behavior that makes them easier to understand. As the width is taken to infinity, DNNs trained by gradient descent converge to the familiar and simpler class of models arising from kernel regression with respect to the neural tangent kernel (NTK), a kernel that measures input similarity through dot products of gradients of the neural network. Thanks to the Neural Tangents library, neural kernels for various DNN architectures can be computed in a scalable way.

We used the above infinite-width limit theory of neural networks to tackle dataset distillation. Dataset distillation can be formulated as a two-stage optimization process: an “inner loop” that trains a model on learned data, and an “outer loop” that optimizes the learned data for performance on natural (i.e., unmodified) data. The infinite-width limit replaces the inner loop, which would train a finite-width neural network, with a simple kernel regression. With the addition of a regularization term, the kernel regression becomes a kernel ridge regression (KRR) problem. This is a highly valuable outcome because the kernel ridge regressor (i.e., the predictor produced by the algorithm) has an explicit formula in terms of its training data (unlike a neural network predictor), which means the KRR loss function can easily be optimized in the outer loop.
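
To make the explicit formula concrete, here is a minimal NumPy sketch of a kernel ridge regressor and of an outer-loop loss measured on natural data. It is an illustration under stated assumptions, not the authors' implementation: an RBF kernel stands in for the neural kernel, the loss is taken to be the mean squared error, and the names `rbf_kernel`, `krr_predict`, and `krr_loss`, the regularization value, and the toy data are all hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Stand-in kernel (assumption): the post uses neural kernels instead."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * sq_dists)

def krr_predict(x_support, y_support, x_target, reg=1e-6):
    """Closed-form kernel ridge regressor trained on the support set."""
    k_ss = rbf_kernel(x_support, x_support)
    k_ts = rbf_kernel(x_target, x_support)
    # Solve (K_ss + reg * I) alpha = y_support rather than forming an inverse.
    alpha = np.linalg.solve(k_ss + reg * np.eye(len(x_support)), y_support)
    return k_ts @ alpha

def krr_loss(x_support, y_support, x_target, y_target, reg=1e-6):
    """Outer-loop objective: mean squared error on the natural (target) data."""
    preds = krr_predict(x_support, y_support, x_target, reg)
    return np.mean((preds - y_target) ** 2)

# Toy data (assumption): random vectors in place of images and labels.
rng = np.random.default_rng(0)
x_target = rng.normal(size=(40, 5))   # "natural" data
y_target = rng.normal(size=(40, 3))
x_support = rng.normal(size=(6, 5))   # small support set to be optimized
y_support = rng.normal(size=(6, 3))

loss = krr_loss(x_support, y_support, x_target, y_target)
```

Because `krr_loss` is a closed-form function of `x_support` and `y_support`, the inner loop needs no iterative training, which is what makes the outer-loop optimization tractable.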

The original data labels can be represented by one-hot vectors: the true label is given a value of 1 and all other labels are given a value of 0. Thus, an image of a cat has the label “cat” assigned a value of 1, while the labels for “dog” and “horse” are 0. The labels we use involve a subsequent mean-centering step, in which we subtract the reciprocal of the number of classes from each component (0.1 for 10-way classification) so that the expected value of each label component across the dataset is normalized to zero.
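
The mean-centering step can be illustrated with a few lines of plain Python. The helper name `centered_one_hot` is a hypothetical choice for this sketch and does not come from the authors' code.

```python
def centered_one_hot(class_id, num_classes):
    """One-hot label with 1/num_classes subtracted from every component."""
    offset = 1.0 / num_classes
    return [
        (1.0 if k == class_id else 0.0) - offset
        for k in range(num_classes)
    ]

# 10-way classification: the true class becomes 0.9, all others -0.1.
label = centered_one_hot(3, num_classes=10)

# Over a class-balanced dataset, each label component averages to zero.
balanced = [centered_one_hot(c, 10) for c in range(10)]
component_means = [sum(column) / len(balanced) for column in zip(*balanced)]
```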

While the labels for natural images appear in this standard form, the labels for our learned distilled datasets are free to be optimized for performance. Having obtained the kernel ridge regressor from the inner loop, the KRR loss function in the outer loop computes the mean squared error between the original labels of the natural images and the labels predicted by the kernel ridge regressor. KIP optimizes the support data (images and possibly labels) by minimizing the KRR loss function with gradient-based methods. The Label Solve algorithm directly solves for the set of support labels that minimizes the KRR loss function, producing a unique dense label vector for each (natural) support image.
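
Since KRR predictions are linear in the support labels, solving directly for the labels can be sketched as an ordinary linear least-squares problem. The NumPy example below works under that assumption and is not the authors' code: an RBF kernel stands in for the neural kernel, and the names `rbf_kernel` and `label_solve` and the toy data are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Stand-in kernel (assumption): the post uses neural kernels instead."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * sq)

def label_solve(x_support, x_target, y_target, reg=1e-6):
    """Least-squares support labels for fixed (natural) support inputs."""
    k_ss = rbf_kernel(x_support, x_support)
    k_ts = rbf_kernel(x_target, x_support)
    # KRR predictions are linear in the labels: preds = a @ y_support,
    # with a = K_ts (K_ss + reg * I)^{-1}. K_ss is symmetric, so a
    # transposed solve gives the same matrix without an explicit inverse.
    a = np.linalg.solve(k_ss + reg * np.eye(len(x_support)), k_ts.T).T
    y_support, *_ = np.linalg.lstsq(a, y_target, rcond=None)
    return y_support, a

# Toy data (assumption): random vectors in place of images.
rng = np.random.default_rng(0)
x_target = rng.normal(size=(60, 4))
class_ids = rng.integers(0, 3, size=60)
y_target = np.eye(3)[class_ids] - 1.0 / 3.0   # mean-centered one-hot labels
x_support = x_target[:8]                      # natural points kept as support
one_hot_support = y_target[:8]

dense_labels, a = label_solve(x_support, x_target, y_target)
mse_one_hot = np.mean((a @ one_hot_support - y_target) ** 2)
mse_solved = np.mean((a @ dense_labels - y_target) ** 2)
```

By construction, `mse_solved` cannot exceed `mse_one_hot`, because the solved labels minimize the same squared error that the one-hot labels are evaluated on. KIP differs in that it also updates the support inputs, which requires gradient-based optimization rather than a single solve.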

Example of labels obtained by Label Solve. Left and middle: Sample images with possible labels listed below. The raw one-hot label is shown in blue, and the final dense label generated by LS is shown in orange. Right: Covariance matrix between the original labels and the learned labels. Here, 500 labels were distilled from the CIFAR-10 dataset. Using these labels for kernel ridge regression achieves a test accuracy of 69.7%.

For ease of distributed computation, we focus on architectures consisting of convolutional neural networks with pooling layers. Specifically, we focus on the so-called “ConvNet” architecture and its variants, because it has been featured in other dataset distillation studies. We used a slightly modified version of ConvNet with a simple architecture: three blocks of convolution, ReLU, and 2×2 average pooling, followed by a final linear readout layer, plus an additional 3×3 convolution and ReLU layer (see GitHub for exact details).

ConvNet architecture used by DC / DSA. Our version adds a 3×3 Conv and ReLU.

We used the Neural Tangents library to compute the neural kernels needed for our work.

The first stage of this work, in which we applied KRR, focused on fully connected networks, whose kernel elements are cheap to compute. A hurdle for neural kernels of models with convolutional layers and pooling, however, is that computing each kernel element between two images scales as the square of the number of input pixels (because the kernel captures pixel-to-pixel correlations). Therefore, in the second stage of this work, we needed to distribute the computation of the kernel elements and their gradients across many devices.
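
A quick back-of-the-envelope calculation shows how fast this quadratic scaling grows. The sketch below only counts pixel-pixel pairs for the standard MNIST (28×28) and CIFAR-10 (32×32) image sizes; it ignores color channels and constant factors, and the helper name `pixel_pairs` is hypothetical.

```python
def pixel_pairs(height, width):
    """Number of pixel-pixel pairs between two images of the same size."""
    num_pixels = height * width
    return num_pixels ** 2

mnist_pairs = pixel_pairs(28, 28)   # 784 pixels per image
cifar_pairs = pixel_pairs(32, 32)   # 1,024 pixels per image
ratio = cifar_pairs / mnist_pairs   # relative cost per kernel element
```

Under these assumptions a single kernel element involves roughly a million pixel pairs for CIFAR-10-sized inputs, and a full kernel matrix needs one such element for every pair of images.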

Distributed computation for large-scale meta-learning.

We use a client-server model of distributed computation in which a server distributes independent workloads to a large pool of client workers. A key part of this is dividing the backpropagation step in a computationally efficient way (explained in detail in the paper).

We implement this framework with two open-source tools: Courier (part of DeepMind’s Launchpad), which lets us distribute computations across GPUs working in parallel, and JAX, where a novel use of the jax.vjp function enables computationally efficient gradients. This distributed framework allows us to use hundreds of GPUs per dataset distillation, for both the KIP and LS algorithms. Given the compute required for such experiments, we are releasing our distilled datasets to benefit the wider research community.
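
The idea of handing independent workloads to a pool of workers can be sketched on a single machine. The example below is only an analogy under stated assumptions: a thread pool stands in for the GPU client workers, a plain dot-product kernel stands in for the neural kernel, and neither Courier nor jax.vjp is used. The names `linear_kernel_block` and `distributed_kernel` are hypothetical, and the gradient splitting described in the paper is not shown.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def linear_kernel_block(x_rows, x_cols):
    """Stand-in kernel block (assumption): plain dot products."""
    return x_rows @ x_cols.T

def distributed_kernel(x, y, block_size=4, max_workers=4):
    """Split the kernel matrix into independent blocks and farm them out."""
    jobs = [
        (i, j)
        for i in range(0, len(x), block_size)
        for j in range(0, len(y), block_size)
    ]
    kernel = np.empty((len(x), len(y)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each block depends only on its own slices of x and y, so the
        # workers never need to communicate with one another.
        blocks = pool.map(
            lambda ij: linear_kernel_block(
                x[ij[0]:ij[0] + block_size], y[ij[1]:ij[1] + block_size]
            ),
            jobs,
        )
        # The "server" side assembles the returned blocks in job order.
        for (i, j), block in zip(jobs, blocks):
            kernel[i:i + block.shape[0], j:j + block.shape[1]] = block
    return kernel

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 6))
y = rng.normal(size=(7, 6))
k = distributed_kernel(x, y)
```

The assembled matrix matches a direct computation of `x @ y.T`; only the scheduling of the work differs.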

Example

In the first set of distilled images above, we used KIP to distill CIFAR-10 down to one image per class while keeping the labels fixed. Next, the figure below compares the test accuracy of training on natural MNIST images, KIP-distilled images with fixed labels, and KIP-distilled images with optimized labels. We highlight that learning the labels provides an effective, albeit mysterious, benefit when distilling datasets. Indeed, the resulting set of images provides the best test performance (for infinite-width networks) despite being harder to interpret.

MNIST dataset distillation with trainable and non-trainable labels. Top: Natural MNIST data. Middle: Kernel Inducing Points distilled data with fixed labels. Bottom: Kernel Inducing Points distilled data with learned labels.

Results

Our distilled datasets achieve state-of-the-art performance on benchmark image classification datasets, improving on previous state-of-the-art models that used convolutional architectures: Dataset Condensation (DC) and Dataset Condensation with Differentiable Siamese Augmentation (DSA). In particular, for the CIFAR-10 classification task, a model trained on a dataset of only 10 distilled data entries (1 image / class, 0.02% of the whole dataset) achieves 64% test set accuracy. Here, learning the labels and an additional image preprocessing step significantly improve performance beyond the 50% test accuracy shown in the first figure (see the paper for details). With 500 images (50 images / class, 1% of the whole dataset), the model reaches 80% test set accuracy. Although these numbers are for neural kernels (using the KRR infinite-width limit), the distilled datasets can also be used to train finite-width neural networks. In particular, on CIFAR-10, a finite-width ConvNet achieves 50% test accuracy with 10 images and 68% test accuracy with 500 images, which are also state-of-the-art results. We provide a simple Colab notebook that demonstrates this transfer to a finite-width neural network.

Dataset distillation using Kernel Inducing Points (KIP) with a convolutional architecture outperforms the previous state-of-the-art models (DC / DSA) in all benchmark settings for image classification tasks. Label Solve (LS, middle column), which distills only the information in the labels, can often outperform previous state-of-the-art models as well (e.g., CIFAR-10 with 10 or 50 data points per class).

In some cases, our learned datasets are more effective than a natural dataset that is 100 times larger.

Conclusion

We believe that our work on dataset distillation opens up many interesting future directions. For example, our algorithms KIP and LS have demonstrated the effectiveness of using learned labels. In addition, we expect that efficient kernel approximation methods can reduce the computational burden and allow scaling to larger datasets. We hope this work encourages researchers to explore other applications of dataset distillation, including neural architecture search and continual learning, and even potential applications to privacy.

Anyone interested in the KIP and LS learned datasets for further analysis is encouraged to check out our papers [ICLR 2021, NeurIPS 2021] and the open-source code and datasets available on GitHub.

Acknowledgments

This project was done in collaboration with Zhourong Chen, Roman Novak and Lechao Xiao. Special thanks to Samuel S. Schoenholz, who proposed and helped develop the overall strategy for our distributed KIP learning methodology.

¹ Now at DeepMind.

