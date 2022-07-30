



Posted by Ehsan Amid (Research Scientist) and Rohan Anil (Principal Engineer, Google Research, Brain Team)

Although model design and training data are key ingredients of a successful deep neural network (DNN), the specific optimization method used to update the model parameters (weights) is discussed less often. Training a DNN involves minimizing a loss function that measures the discrepancy between the ground truth labels and the model’s predictions. Training is carried out by backpropagation, which adjusts the model weights via gradient descent steps. Gradient descent, in turn, updates the weights using the gradient (that is, the derivative) of the loss with respect to the weights.
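As a minimal sketch of this loop, the snippet below runs plain gradient descent on a toy linear model with a squared-error loss. The data, the model, and names such as `loss_and_grad` and `learning_rate` are illustrative assumptions, not part of LocoProp.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Mean squared error of a linear model and its gradient w.r.t. w."""
    residual = X @ w - y
    loss = 0.5 * np.mean(residual ** 2)
    grad = X.T @ residual / len(y)
    return loss, grad

# Toy data (assumed for illustration): labels come from a known weight vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
learning_rate = 0.1
losses = []
for _ in range(200):
    loss, grad = loss_and_grad(w, X, y)
    losses.append(loss)
    w = w - learning_rate * grad  # move against the gradient
```

In a deep network the only difference is how the gradient is obtained: backpropagation computes it layer by layer with the chain rule.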

The simplest weight update corresponds to stochastic gradient descent, which at every step moves the weights in the direction of the negative gradient (scaled by an appropriate step size, a.k.a. the learning rate). More advanced optimization methods modify the direction of the negative gradient before updating the weights, using information from past steps or local properties of the loss function around the current weights, such as curvature information. For example, the momentum optimizer encourages moving along the average direction of past updates, while the AdaGrad optimizer scales each coordinate based on past gradients. These optimizers are commonly known as first-order methods because they generally use only information from the first derivative (the gradient) to modify the update direction. More importantly, the components of the weight parameters are treated independently of one another.
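The three update rules mentioned above can be sketched in a few lines. These are common textbook forms; the function names and default hyperparameters (`beta`, `eps`) are assumptions, and real libraries differ in details such as dampening or bias correction.

```python
import numpy as np

def sgd_step(w, grad, lr):
    """Plain (stochastic) gradient descent."""
    return w - lr * grad

def momentum_step(w, grad, velocity, lr, beta=0.9):
    """Heavy-ball momentum: follow a running sum of past gradients."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adagrad_step(w, grad, accum, lr, eps=1e-8):
    """AdaGrad: scale each coordinate by its accumulated squared gradients."""
    accum = accum + grad ** 2
    return w - lr * grad / (np.sqrt(accum) + eps), accum
```

Every operation here is element-wise, which is the sense in which these methods treat the weight components independently: no step mixes information between two different coordinates of `w`.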

More advanced optimizers such as Shampoo and K-FAC capture correlations between the gradients of parameters and have been shown to improve convergence, reduce the number of iterations, and improve solution quality. These methods capture information about local changes in the derivative of the loss, i.e., changes in the gradient. Using this additional information, higher-order optimizers can find more efficient update directions for training models by taking into account correlations between different groups of parameters. On the downside, calculating higher-order update directions is computationally more expensive than first-order updates: it uses more memory to store the statistics and involves matrix inversion, which hinders the applicability of higher-order optimizers in practice.
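To make the trade-off concrete, here is a toy comparison on a two-parameter quadratic loss whose parameters are strongly correlated. It uses a plain Newton step as a stand-in for a higher-order update; it is not Shampoo or K-FAC, and the matrix `H` and the step size are assumptions chosen for illustration.

```python
import numpy as np

# Quadratic loss 0.5 * w^T H w with strongly coupled parameters; minimum at w = 0.
H = np.array([[2.0, 1.9],
              [1.9, 2.0]])

def grad(w):
    return H @ w

w0 = np.array([1.0, -1.0])

# First-order step: the step size is capped by the largest curvature,
# so progress along the low-curvature direction is slow.
w_gd = w0 - 0.25 * grad(w0)

# Higher-order (Newton) step: precondition the gradient with the inverse
# curvature matrix. This needs the full matrix and a linear solve.
w_newton = w0 - np.linalg.solve(H, grad(w0))
```

On this quadratic the Newton step lands on the minimum (up to floating-point error), while the gradient step barely moves. The price is storing and solving with a matrix whose size grows quadratically in the number of parameters, which is the cost the paragraph above refers to.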

In “LocoProp: Enhancing BackProp via Local Loss Optimization”, we introduce a new framework for training DNN models. Our framework, LocoProp, views a neural network as a modular composition of layers. In general, each layer of a neural network applies a linear transformation to its input followed by a nonlinear activation function. In the new construction, each layer is assigned its own weight regularizer, output target, and loss function. The loss function of each layer is designed to match the layer’s activation function. Using this formulation, training minimizes the local losses for a given mini-batch of examples, iteratively and in parallel across layers. Our method performs multiple local updates per mini-batch using a first-order optimizer (such as RMSProp), yet we show that the combined local updates resemble a higher-order update. Empirically, LocoProp outperforms first-order methods on a deep autoencoder benchmark and performs on par with higher-order optimizers such as Shampoo and K-FAC, without their high memory and computation requirements.

Method

A neural network is generally viewed as a composite function that transforms the model’s input into an output representation, layer by layer. LocoProp adopts this view while decomposing the network into layers. In particular, instead of updating a layer’s weights to minimize the loss function at the output, LocoProp applies a predefined local loss function specific to each layer. The loss function is chosen to match the activation function of the given layer. For example, for a layer with a tanh activation, a tanh loss is chosen. Each layer’s loss measures the discrepancy between the layer’s output (for a given mini-batch of examples) and a notion of a target output for that layer. Additionally, a regularization term ensures that the updated weights do not drift too far from their current values. The combination of the per-layer loss function (with its local target) and the regularizer serves as the new objective function for each layer.
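One way to write down such a per-layer objective is sketched below. The notation is an assumption made for this post rather than the paper’s exact formulation: $W$ denotes the layer’s current weights, $\widetilde{W}$ the weights being optimized, $x$ the layer’s input, $\hat{a} = \widetilde{W}x$ the predicted pre-activation, $a$ the layer’s target, $L_f$ a loss matched to the activation function $f$, and $\lambda > 0$ the strength of the regularizer.

```latex
\min_{\widetilde{W}} \;
  L_f\big(\hat{a},\, a\big)
  \;+\; \frac{\lambda}{2}\,\big\lVert \widetilde{W} - W \big\rVert^2,
\qquad \hat{a} = \widetilde{W} x
```

The first term pulls the layer’s output toward its target, and the second keeps the new weights close to the current ones. Because each layer’s objective involves only that layer’s weights, inputs, and targets, the problems for different layers are decoupled.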

Similar to backpropagation, LocoProp applies a forward pass to compute the activations. In the backward pass, LocoProp sets a “target” for each neuron in each layer. Finally, LocoProp splits model training into independent problems across layers, where multiple local updates can be applied to each layer’s weights in parallel.
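The forward and backward passes can be sketched as follows for a small stack of tanh layers. The target rule used here, one gradient step of size `gamma` away from each neuron’s pre-activation, is a simple assumed choice for illustration and is not necessarily the exact rule from the paper. The squared-error output loss and all names are likewise hypothetical.

```python
import numpy as np

def forward(weights, x):
    """Forward pass through tanh layers; records each layer's input and pre-activation."""
    inputs, pre_acts = [], []
    h = x
    for W in weights:
        a = h @ W.T
        inputs.append(h)
        pre_acts.append(a)
        h = np.tanh(a)
    return inputs, pre_acts, h

def backward_targets(weights, pre_acts, output, labels, gamma):
    """Backward pass: one target per neuron (assumed rule: a - gamma * dLoss/da)."""
    grad_h = output - labels  # gradient of 0.5 * squared error w.r.t. the output
    targets = [None] * len(weights)
    for m in reversed(range(len(weights))):
        grad_a = grad_h * (1.0 - np.tanh(pre_acts[m]) ** 2)  # back through tanh
        targets[m] = pre_acts[m] - gamma * grad_a
        grad_h = grad_a @ weights[m]  # gradient w.r.t. this layer's input
    return targets

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(4, 3)), rng.normal(scale=0.5, size=(2, 4))]
x = rng.normal(size=(5, 3))
labels = rng.uniform(-1.0, 1.0, size=(5, 2))

inputs, pre_acts, output = forward(weights, x)
targets = backward_targets(weights, pre_acts, output, labels, gamma=0.1)
```

Once `inputs[m]` and `targets[m]` are fixed, layer `m` can be trained without looking at any other layer, which is what makes the per-layer problems independent.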

Perhaps the simplest loss function one can think of for a layer is the squared loss. While the squared loss is a valid choice, LocoProp takes into account the possible nonlinearity of the layer’s activation function and applies a layer-wise loss tailored to it. This lets the model emphasize regions of the input that are more important for the model’s prediction and de-emphasize regions that have less effect on the output. Below are examples of tailored losses for the tanh and ReLU activation functions.

Loss functions induced by the (left) tanh and (right) ReLU activation functions. Each loss is more sensitive to the regions that affect the output prediction. For example, the ReLU loss is zero as long as both the prediction (â) and the target (a) are negative, because the ReLU function applied to any negative number equals zero.
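One standard way to derive a loss from an activation function is the so-called matching loss: integrate the gap between the activation and its value at the target, from the target a up to the prediction â. The sketch below assumes this construction, since the post does not spell out the formula, and the function names are hypothetical. It does reproduce the property described in the caption.

```python
import numpy as np

def tanh_matching_loss(a_hat, a):
    """Integral of tanh(z) - tanh(a) for z from a to a_hat."""
    log_cosh = lambda z: np.logaddexp(z, -z) - np.log(2.0)  # stable log(cosh(z))
    return log_cosh(a_hat) - log_cosh(a) - (a_hat - a) * np.tanh(a)

def relu_matching_loss(a_hat, a):
    """Integral of relu(z) - relu(a) for z from a to a_hat."""
    half_sq = lambda z: 0.5 * np.maximum(z, 0.0) ** 2  # antiderivative of relu
    return half_sq(a_hat) - half_sq(a) - (a_hat - a) * np.maximum(a, 0.0)
```

With this definition, the derivative of the loss with respect to the prediction is simply the activation at the prediction minus the activation at the target. For ReLU, both terms vanish whenever the prediction and the target are negative, so the loss is flat there.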

After forming the objective for each layer, LocoProp updates the layer’s weights by repeatedly applying gradient descent steps on that objective. The update typically uses a first-order optimizer (such as RMSProp). However, we show that the overall behavior of the combined updates closely resembles higher-order updates (see below). Thus, LocoProp delivers training performance close to that of higher-order optimizers without their heavy memory and computation requirements, such as matrix inversion. We also show that LocoProp is a flexible framework that recovers well-known algorithms and enables the construction of new ones via different choices of losses, targets, and regularizers. LocoProp’s layer-wise view of neural networks also allows the weights to be updated in parallel across layers.
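The local update loop for a single tanh layer might look like the sketch below. It assumes a tanh matching loss plus a squared-distance regularizer as the local objective and a basic RMSProp rule. The names (`local_objective`, `local_updates`), the hyperparameters, and the toy targets generated from a hypothetical `W_star` are illustrative and not taken from the released code.

```python
import numpy as np

def log_cosh(z):
    return np.logaddexp(z, -z) - np.log(2.0)

def local_objective(W, W_old, x, a_target, reg):
    """Tanh matching loss to the targets plus (reg / 2) * ||W - W_old||^2."""
    a_hat = x @ W.T
    matching = (log_cosh(a_hat) - log_cosh(a_target)
                - (a_hat - a_target) * np.tanh(a_target))
    return matching.sum(axis=1).mean() + 0.5 * reg * np.sum((W - W_old) ** 2)

def local_objective_grad(W, W_old, x, a_target, reg):
    a_hat = x @ W.T
    grad_loss = (np.tanh(a_hat) - np.tanh(a_target)).T @ x / len(x)
    return grad_loss + reg * (W - W_old)

def local_updates(W_old, x, a_target, steps=10, lr=0.01, reg=0.1,
                  decay=0.9, eps=1e-8):
    """Several RMSProp steps on one layer's local objective."""
    W = W_old.copy()
    second_moment = np.zeros_like(W)
    for _ in range(steps):
        g = local_objective_grad(W, W_old, x, a_target, reg)
        second_moment = decay * second_moment + (1.0 - decay) * g ** 2
        W = W - lr * g / (np.sqrt(second_moment) + eps)
    return W

# Toy usage: targets come from a hypothetical "teacher" weight matrix W_star.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))
W_star = np.array([[1.0, -1.0, 0.5], [-0.5, 1.0, 1.0]])
a_target = x @ W_star.T
W_old = np.zeros((2, 3))

W_new = local_updates(W_old, x, a_target)
before = local_objective(W_old, W_old, x, a_target, reg=0.1)
after = local_objective(W_new, W_old, x, a_target, reg=0.1)
```

Only element-wise operations and matrix products with the mini-batch appear here, with no matrix inversion. Since `local_updates` reads nothing but its own layer’s inputs and targets, the calls for different layers could run in parallel.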

Experiments

In our paper, we describe experiments on a deep autoencoder model, a commonly used baseline for evaluating the performance of optimization algorithms. We performed extensive tuning on several commonly used first-order optimizers, including SGD, SGD with momentum, AdaGrad, RMSProp, and Adam, as well as the higher-order Shampoo and K-FAC optimizers, and compared the results with LocoProp. Our findings show that LocoProp performs significantly better than the first-order optimizers and is comparable to the higher-order optimizers, while being significantly faster when run on a single GPU.

Train loss vs. number of epochs (left) and wall-clock time, i.e., the real time that passes during training (right), for RMSProp, Shampoo, K-FAC, and LocoProp on the deep autoencoder model.

Summary and Future Directions

We introduced a new framework, called LocoProp, for optimizing deep neural networks more efficiently. LocoProp decomposes a neural network into separate layers, each with its own regularizer, output target, and loss function, and applies local updates in parallel to minimize the local objectives. Although the local optimization problems are solved with first-order updates, the combined updates closely resemble higher-order update directions, both theoretically and empirically.

LocoProp provides the flexibility to choose the regularizer, target, and loss function for each layer, so new update rules can be developed based on these choices. Our code for LocoProp is available online on GitHub. We are currently working on scaling the ideas behind LocoProp to much larger models. Stay tuned!

Acknowledgments

We would like to thank our co-author, Manfred K. Warmuth, for his critical contributions and inspiring vision. We would also like to thank Sameer Agarwal for discussions on viewing this work from a composite-functions perspective, Vineet Gupta for discussions on and development of Shampoo, Zachary Nado for discussions on K-FAC, Tom Small for developing the animation used in this blog post, and finally Yonghui Wu and Zoubin Ghahramani for providing a nurturing research environment in the Google Brain Team.

