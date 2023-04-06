



Posted by Zi Wang and Kevin Swersky, Research Scientists, Google Research, Brain Team

Bayesian optimization (BayesOpt) is a powerful and widely used tool for global optimization tasks such as hyperparameter tuning, protein engineering, synthetic chemistry, robot learning, and even baking cookies. BayesOpt is a good strategy for these problems because they all involve optimizing black-box functions that are expensive to evaluate. The underlying mapping of a black-box function from inputs (configurations of the thing we want to optimize) to outputs (a measure of performance) is unknown. However, we can try to understand its inner workings by evaluating the function on different combinations of inputs. Because each evaluation can be computationally expensive, we want to find the best input in as few evaluations as possible. BayesOpt works by repeatedly building a surrogate model of the black-box function and strategically evaluating the function at the most promising or informative input location, given the information observed so far.

Gaussian processes are popular surrogate models for BayesOpt because they are easy to use, can be updated with new data, and provide a confidence level for each prediction. A Gaussian process model constructs a probability distribution over possible functions. This distribution is specified by a mean function (what these possible functions look like on average) and a kernel function (how much these functions can vary across inputs). The performance of BayesOpt depends on whether the confidence intervals predicted by the surrogate model contain the black-box function. Traditionally, experts use domain knowledge to quantitatively define the mean and kernel parameters (such as the range or smoothness of the black-box function) to express their expectations about what the black-box function should look like. However, in many real-world applications such as hyperparameter tuning, it is very difficult to understand the landscape of the tuning objective. Even experts with relevant experience can find it hard to narrow down appropriate model parameters.
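To make the roles of the mean function and the kernel function concrete, here is a minimal sketch of a one-dimensional Gaussian process posterior in plain Python. It is an illustration under stated assumptions, not the model used in this work: the zero mean function, the squared-exponential kernel, its parameters, and all function names (`rbf_kernel`, `solve`, `gp_posterior`) are choices made for the example.

```python
import math

def rbf_kernel(x1, x2, signal_var=1.0, length_scale=1.0):
    """Squared-exponential kernel (assumed): how strongly f(x1) and f(x2) co-vary."""
    return signal_var * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def gp_posterior(x_obs, y_obs, x_new, mean_fn, kernel_fn, noise_var=1e-6):
    """Posterior mean and variance of f(x_new) after conditioning on observations."""
    K = [[kernel_fn(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(x_obs)] for i, a in enumerate(x_obs)]
    k_star = [kernel_fn(a, x_new) for a in x_obs]
    residual = [y - mean_fn(x) for x, y in zip(x_obs, y_obs)]
    alpha = solve(K, residual)   # K^{-1} (y - m(X))
    beta = solve(K, k_star)      # K^{-1} k_*
    mean = mean_fn(x_new) + sum(k * a for k, a in zip(k_star, alpha))
    var = kernel_fn(x_new, x_new) - sum(k * b for k, b in zip(k_star, beta))
    return mean, max(var, 0.0)

# Three observations of a stand-in "black-box" function (sin is an assumption).
x_obs = [0.0, 1.0, 2.0]
y_obs = [math.sin(x) for x in x_obs]
mean_fn = lambda x: 0.0  # assumed zero mean function

for x_new in (1.0, 1.5, 5.0):
    mu, var = gp_posterior(x_obs, y_obs, x_new, mean_fn, rbf_kernel)
    half_width = 1.96 * math.sqrt(var)
    print(f"x={x_new:.1f}  mean={mu:.3f}  "
          f"95% interval=[{mu - half_width:.3f}, {mu + half_width:.3f}]")
```

The interval is narrow at an observed input, wider between observations, and close to the prior far from the data. If the hand-set mean or kernel parameters are a poor match for the true function, these intervals can miss it, which is the difficulty described above.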

In "Pre-trained Gaussian Processes for Bayesian Optimization", we consider the challenge of hyperparameter optimization for deep neural networks using BayesOpt. We propose Hyper BayesOpt (HyperBO), a highly customizable interface with an algorithm that removes the need to quantify the model parameters of Gaussian processes in BayesOpt. For a new optimization problem, experts can simply select previous tasks that are related to the task they are trying to solve. HyperBO pre-trains a Gaussian process model on data from these selected tasks and automatically defines the model parameters before running BayesOpt. HyperBO enjoys theoretical guarantees on the alignment between the pre-trained model and the ground truth, and on the quality of its solutions for black-box optimization. We share strong results for HyperBO both on our new tuning benchmarks for state-of-the-art deep learning models and on classic multi-task black-box optimization benchmarks (HPO-B). We also show that HyperBO is robust to the selection of relevant tasks and has low requirements on the amount of data and the number of tasks used for pre-training.

The traditional BayesOpt interface requires experts to carefully choose the mean and kernel parameters of the Gaussian process model. HyperBO replaces this manual specification with a selection of related tasks, making Bayesian optimization much easier to use. The selected tasks are used for pre-training, in which we optimize a Gaussian process so that it gradually generates functions similar to those corresponding to the selected tasks. The similarity shows up both in individual function values and in how function values vary across inputs.

Loss functions for pre-training

We pre-train a Gaussian process model by minimizing the Kullback-Leibler divergence (a commonly used divergence) between the ground truth model and the pre-trained model. Since the ground truth model is unknown, we cannot compute this loss function directly. To address this, we introduce two data-driven approximations: (1) the empirical Kullback-Leibler divergence (EKL), which is the divergence between an empirical estimate of the ground truth model and the pre-trained model; and (2) the negative log-likelihood (NLL), which is the sum of the negative log-likelihoods of the pre-trained model over all training functions. The computational cost of EKL or NLL scales linearly with the number of training functions. In addition, stochastic gradient-based methods such as Adam can be used to optimize the loss functions, which further reduces the computational cost. In well-controlled environments, optimizing EKL and NLL leads to the same result, but their optimization landscapes can be very different. For example, in the simplest case where the function has only one possible input, its Gaussian process model becomes a Gaussian distribution described by a mean (m) and a variance (s). The loss function then has only these two parameters, m and s, so EKL and NLL can each be visualized as a landscape over them.
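For this one-input case, both losses can be written out in a few lines. The sketch below is an illustration under stated assumptions rather than the exact formulation from the paper: it takes the KL divergence from the empirical Gaussian to the model, estimates the empirical variance with a 1/n normalizer, and uses made-up data in `values` (one function value per hypothetical training function). The function names `nll` and `ekl` are also chosen for the example.

```python
import math

def nll(m, s, values):
    """Sum of negative log-likelihoods of each value under N(mean=m, variance=s)."""
    return sum(0.5 * math.log(2 * math.pi * s) + (y - m) ** 2 / (2 * s)
               for y in values)

def ekl(m, s, values):
    """KL( N(m_hat, s_hat) || N(m, s) ), with m_hat and s_hat estimated from data.

    The direction of the divergence and the 1/n variance estimate are assumptions.
    """
    n = len(values)
    m_hat = sum(values) / n
    s_hat = sum((y - m_hat) ** 2 for y in values) / n
    return 0.5 * (math.log(s / s_hat) + (s_hat + (m_hat - m) ** 2) / s - 1.0)

# Hypothetical data: the single function value observed for each training function.
values = [0.8, 1.1, 1.4, 0.9, 1.3]

# Evaluate both losses on a coarse grid over (m, s) and locate each minimum.
grid = [(0.5 + 0.01 * i, 0.01 + 0.002 * j)
        for i in range(101) for j in range(100)]
best_nll = min(grid, key=lambda p: nll(p[0], p[1], values))
best_ekl = min(grid, key=lambda p: ekl(p[0], p[1], values))

print("NLL minimizer (m, s):", best_nll)
print("EKL minimizer (m, s):", best_ekl)
```

Under these assumptions both losses are minimized at the empirical mean and variance of the data, even though the loss values themselves differ. Evaluating `nll` and `ekl` over the same grid is also one way to produce heatmaps of the two landscapes.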

We simulate the loss landscapes of EKL (left) and NLL (right) for a simple model with parameters m and s. The colors represent a heatmap of the EKL or NLL values, with red corresponding to higher values and blue to lower values. The two loss landscapes are very different, but both aim to match the pre-trained model to the ground truth model.

Pre-training improves Bayesian optimization

The BayesOpt algorithm decides iteratively where to evaluate the black-box function. The decision criteria are based on the confidence levels provided by the Gaussian process, which are updated at each iteration by conditioning on the data points BayesOpt has acquired so far. Intuitively, the updated confidence levels should be just right: neither overconfident nor too unsure, because in either of these two cases BayesOpt cannot make decisions that match what an expert would do.
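The iterative loop described above can be sketched as follows. This is a minimal illustration, not HyperBO's implementation: the upper-confidence-bound decision rule, the squared-exponential kernel, its length scale, the candidate grid, and the stand-in `black_box` function are all assumptions made for the example. The prior mean and kernel are set by hand here, which is exactly the part HyperBO would replace with a pre-trained Gaussian process.

```python
import math

def kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel with a hand-set length scale (assumed)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def posterior(xs, ys, x_new, prior_mean=0.0, noise_var=1e-4):
    """Posterior mean and standard deviation at x_new, conditioned on (xs, ys)."""
    K = [[kernel(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k_star = [kernel(a, x_new) for a in xs]
    alpha = solve(K, [y - prior_mean for y in ys])
    beta = solve(K, k_star)
    mean = prior_mean + sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_new, x_new) - sum(k * b for k, b in zip(k_star, beta))
    return mean, math.sqrt(max(var, 0.0))

def black_box(x):
    """Stand-in for an expensive evaluation (hypothetical objective to maximize)."""
    return -(x - 0.65) ** 2

def bayes_opt(f, candidates, budget=8, explore=2.0):
    """Evaluate f at `budget` inputs chosen by an upper-confidence-bound rule."""
    xs = [candidates[0]]
    ys = [f(xs[0])]
    while len(xs) < budget:
        remaining = [c for c in candidates if c not in xs]

        def ucb(c):
            mean, std = posterior(xs, ys, c)  # confidence updated on all data so far
            return mean + explore * std       # promising (mean) or informative (std)

        x_next = max(remaining, key=ucb)
        xs.append(x_next)
        ys.append(f(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs, ys, xs[best], ys[best]

candidates = [i / 50 for i in range(51)]
xs, ys, best_x, best_y = bayes_opt(black_box, candidates)
print("evaluated inputs:", xs)
print("best input and value:", best_x, best_y)
```

The `explore` weight controls how much the rule favors uncertain inputs. If the posterior standard deviation is too small, the loop stops exploring early; if it is too large, the loop keeps sampling uninformative regions, which is why well-calibrated confidence levels matter.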

HyperBO replaces the hand-specified model in traditional BayesOpt with a pre-trained Gaussian process. Under mild conditions and with enough training functions, we can mathematically verify good theoretical properties of HyperBO: (1) Alignment: the pre-trained Gaussian process is guaranteed to be close to the ground truth model when both are conditioned on observed data points. (2) Optimality: HyperBO is guaranteed to find a near-optimal solution to the black-box optimization problem for any function distributed according to the unknown ground truth Gaussian process.

We visualize the Gaussian process (purple shaded areas are 95% and 99% confidence intervals) conditioned on observations (black dots) from an unknown test function (orange line). Compared to traditional BayesOpt without pre-training, the confidence levels predicted by HyperBO capture the unknown test function much better, which is a critical prerequisite for Bayesian optimization.

Empirically, to define the structure of the pre-trained Gaussian process, we chose a highly expressive mean function modeled by a neural network, and we apply a well-defined kernel function to inputs that a neural network encodes into a higher-dimensional space.

To evaluate HyperBO on challenging and realistic black-box optimization problems, we created the PD1 benchmark, which contains a dataset for multi-task hyperparameter optimization of deep neural networks. PD1 was developed by training tens of thousands of configurations of state-of-the-art deep learning models on popular image and text datasets, as well as a protein sequence dataset. PD1 contains about 50,000 hyperparameter evaluations from 24 different tasks (such as tuning Wide ResNet on CIFAR100) and took roughly 12,000 machine days to compute.

We show that, after pre-training for only a few hours on a single CPU, HyperBO can significantly outperform BayesOpt with carefully hand-tuned models on unseen, challenging tasks such as tuning ResNet50 on ImageNet. Even with only ~100 data points per training function, HyperBO remains competitive against the baselines.

Tuning validation error rates of ResNet50 on ImageNet and of Wide ResNet (WRN) on the Street View House Numbers (SVHN) dataset and CIFAR100. By pre-training on only ~20 tasks and ~100 data points per task, HyperBO can significantly outperform traditional BayesOpt (with a carefully hand-tuned Gaussian process) on previously unseen tasks.

Conclusion and future work

HyperBO is a framework that pre-trains a Gaussian process and then performs Bayesian optimization with the pre-trained model. With HyperBO, we no longer need to hand-specify the exact quantitative parameters of a Gaussian process. Instead, we only need to identify related tasks and their corresponding data for pre-training. This makes BayesOpt both more accessible and more effective. An important future direction is to enable HyperBO to generalize over heterogeneous search spaces, for which we are developing new algorithms by pre-training a hierarchical probabilistic model.

Acknowledgments

The following members of the Google Research Brain Team conducted this research: Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zachary Nado, Justin Gilmer, Jasper Snoek, and Zoubin Ghahramani. We would like to thank Zelda Mariet and Matthias Feurer for their assistance and consultation on transfer learning baselines. We also thank Rif A. Saurous for constructive feedback, and Rodolphe Jenatton and David Belanger for feedback on previous versions of the manuscript. In addition, thanks to Sharat Chikkerur, Ben Adlam, Balaji Lakshminarayanan, Fei Sha, Eytan Bakshy for their comments, and Setareh Ariafar and Alexander Terenin for conversations about animation. Finally, thanks to Tom Small for designing the animations for this post.

