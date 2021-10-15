



Posted by Zachary Nado, Research Engineer, and Dustin Tran, Research Scientist, Google Research, Brain Team

Machine learning (ML) is increasingly being used in real-world applications, so understanding the uncertainty and robustness of a model is necessary to ensure performance in practice. For example, how does a model behave when it is deployed on data that differs from the data on which it was trained? How does a model signal when it is likely to make a mistake?

To understand the behavior of an ML model, its performance is often measured against a baseline for the task of interest. With each baseline, researchers must try to reproduce results using only the descriptions in the corresponding papers, which leads to serious replication challenges. Having access to the code for the experiments may be more useful, assuming it is well documented and maintained. But even this is not enough, because the baselines must be rigorously validated. For example, in retrospective analyses over a collection of works [1, 2, 3], authors often find that a simple, well-tuned baseline outperforms more sophisticated methods. To truly understand how models perform relative to each other, and to let researchers measure whether new ideas actually yield meaningful progress, the models of interest must be compared to a common baseline.

In "Uncertainty Baselines: Benchmarks for Uncertainty and Robustness in Deep Learning", we introduce Uncertainty Baselines, a collection of high-quality implementations of standard and state-of-the-art deep learning methods for a variety of tasks, with the goal of making research on uncertainty and robustness more reproducible. The collection spans 19 methods across nine tasks, each with at least five metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extensible components and with minimal dependencies outside the framework in which it is written. The included pipelines are implemented in TensorFlow, PyTorch, and JAX. In addition, the hyperparameters of each baseline have been extensively tuned over numerous iterations to provide even stronger results.

Uncertainty Baselines

As of this writing, Uncertainty Baselines provides a total of 83 baselines, comprising 19 methods that cover both standard and more recent strategies across nine datasets. Example methods include BatchEnsemble, Deep Ensembles, Rank-1 Bayesian Neural Nets, Monte Carlo Dropout, and Spectral-normalized Neural Gaussian Processes. It acts as a successor that merges several popular benchmarks in the community: Can You Trust Your Model's Uncertainty?, BDL Benchmarks, and Edward2's baselines.

A subset of five of the nine available datasets for which baselines are provided. The datasets span tabular, text, and image modalities.

Uncertainty Baselines sets up each baseline under a choice of base model, training dataset, and suite of evaluation metrics. Each baseline is then tuned over its hyperparameters to maximize performance on those metrics. The available baselines vary along these three axes:

- Base models (architectures) include Wide ResNet 28-10, ResNet-50, BERT, and simple fully connected networks.
- Training datasets include standard machine learning datasets (CIFAR, ImageNet, and UCI) as well as more real-world problems (Clinc Intent Detection, Kaggle's Diabetic Retinopathy Detection, and Wikipedia Toxicity).
- Evaluation includes predictive metrics (e.g., accuracy), uncertainty metrics (e.g., selective prediction and calibration error), compute metrics (inference latency), and performance on in- and out-of-distribution datasets.
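
To make the uncertainty metrics named above more concrete, here is a minimal pure-Python sketch of two of them: expected calibration error with equal-width confidence bins, and accuracy at a fixed coverage level for selective prediction. The function names, the binning scheme, and the toy inputs are assumptions made for illustration; this is not the implementation used by the baselines, which rely on the Robustness Metrics library for evaluation.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between confidence and accuracy over bins.

    Illustrative sketch: bins are equal-width intervals [i/B, (i+1)/B),
    with a confidence of exactly 1.0 placed in the last bin.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    count = [0] * num_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1
    ece = 0.0
    for c_sum, h_sum, cnt in zip(conf_sum, hit_sum, count):
        if cnt:
            gap = abs(h_sum / cnt - c_sum / cnt)  # |accuracy - confidence|
            ece += gap * cnt / n                  # weight by bin population
    return ece


def accuracy_at_coverage(confidences, correct, coverage):
    """Accuracy on the most confident `coverage` fraction of predictions.

    Illustrative sketch of selective prediction: the model abstains on
    the least confident examples and is scored only on the rest.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = max(1, int(coverage * n))
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    kept = ranked[:n_keep]
    return sum(1 for _, ok in kept if ok) / n_keep


# Toy predictions: top-class confidence and whether the prediction was right.
toy_conf = [0.9, 0.9, 0.6, 0.6]
toy_correct = [True, True, True, False]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
toy_full = accuracy_at_coverage(toy_conf, toy_correct, coverage=1.0)
toy_half = accuracy_at_coverage(toy_conf, toy_correct, coverage=0.5)
```

In the toy data, accuracy rises when the model is allowed to abstain on its least confident half of the examples, which is the behavior selective prediction metrics are meant to reward.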

Modularity and Reusability

So that researchers can use and build on the baselines, we deliberately optimized them to be as modular and minimal as possible. As the workflow diagram shows, Uncertainty Baselines introduces no new class abstractions and instead reuses classes that already exist in the ecosystem (e.g., TensorFlow's tf.data.Dataset). The training and evaluation pipeline for each baseline is contained in a standalone Python file for that experiment and can run on CPU, GPU, or Google Cloud TPUs. Because of this independence between baselines, we are able to develop baselines in any of TensorFlow, PyTorch, or JAX.

Workflow diagram showing how the different components of Uncertainty Baselines are structured. All datasets are subclasses of the BaseDataset class, which provides a simple API for use in baselines written in any of the supported frameworks. The outputs from any baseline can then be analyzed with the Robustness Metrics library.

One area of debate among research engineers is how to manage hyperparameters and other experiment configuration values, which can easily number in the dozens. Instead of using one of the many frameworks built for this, and risking that users have to learn yet another library, we opted to simply use Python flags, i.e., flags defined with Abseil that follow Python conventions. This should be a familiar technique to most researchers, and it is easy to extend and plug into other pipelines.
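
As a rough illustration of this flag-based configuration style, the sketch below defines a few flags with Abseil (the absl-py package) and parses a command line into a plain dictionary. The flag names, defaults, and the parse_config helper are hypothetical and chosen only for this example; they are not taken from the Uncertainty Baselines repository.

```python
from absl import flags

# Hypothetical flags for illustration only; a real baseline defines its own.
flags.DEFINE_string("dataset", "cifar10", "Name of the training dataset.")
flags.DEFINE_float("base_learning_rate", 0.1, "Initial learning rate.")
flags.DEFINE_integer("ensemble_size", 4, "Number of ensemble members.")
flags.DEFINE_integer("seed", 0, "Random seed for the experiment.")

FLAGS = flags.FLAGS


def parse_config(argv):
    """Parses command-line style arguments and returns flag values as a dict.

    argv[0] is treated as the program name, as with sys.argv. Flags that
    are not given on the command line keep their defaults.
    """
    FLAGS(argv)
    return {
        "dataset": FLAGS.dataset,
        "base_learning_rate": FLAGS.base_learning_rate,
        "ensemble_size": FLAGS.ensemble_size,
        "seed": FLAGS.seed,
    }


# Example: override two values and keep the remaining defaults.
config = parse_config(
    ["train.py", "--base_learning_rate=0.05", "--ensemble_size=8"]
)
```

A standalone script would more commonly hand control to absl's app.run(main), which parses sys.argv before calling main; the explicit FLAGS(argv) call is used here only so the example can be run on its own without exiting the interpreter.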

Reproducibility

In addition to making it possible to run each baseline with the documented commands and obtain the same reported results, we also aim to release hyperparameter tuning results and final model checkpoints for further reproducibility. Right now these are fully open-sourced only for the Diabetic Retinopathy baselines, but we will continue to upload more results as we run them. Additionally, we have examples of baselines that are exactly reproducible up to hardware determinism.

Practical Impact

Each baseline in our repository has gone through extensive hyperparameter tuning, and we hope that researchers can readily reuse this effort without the need for expensive retraining or retuning. Additionally, we hope to avoid minor differences in pipeline implementations affecting baseline comparisons.

Uncertainty Baselines has already been used in numerous research projects. If you are a researcher with other methods or datasets you would like to contribute, please open a GitHub issue to start a discussion.

Acknowledgments

We would like to thank the many people who are co-developers, provided guidance, and/or helped review this post: Neil Band, Mark Collier, Josip Djolonga, Michael W. Dusenberry, Sebastian Farquhar, Angelos Filos, Marton Havasi, Rodolphe Jenatton, Ghassen Jerfel, Jeremiah Liu, Zelda Mariet, Jeremy Nixon, Shreyas Padhy, Jie Ren, Tim G. J. Rudner, Yeming Wen, Florian Wenzel, Kevin Murphy, D. Sculley, Balaji Lakshminarayanan, Jasper Snoek, and Yarin Gal.

