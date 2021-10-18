



Posted by Alex D’Amour and Katherine Heller, Research Scientists, Google Research

Machine learning (ML) models are more widely used and more influential today than ever before. However, when deployed in real-world domains, they often show unexpected behavior. For example, computer vision models can be surprisingly sensitive to irrelevant features, while natural language processing models can depend unpredictably on demographic correlations that are not directly indicated by the text. Some reasons for these failures are well known: for example, training ML models on poorly curated data, or training models to solve prediction problems that are structurally mismatched with the application domain. However, even when these known issues are addressed, model behavior can still be inconsistent in deployment and can vary even between training runs.

In “Underspecification Presents Challenges for Credibility in Modern Machine Learning”, published in the Journal of Machine Learning Research, we show that a key failure mode especially common in modern ML systems is underspecification. The idea behind underspecification is that although ML models are validated on held-out data, this validation is often insufficient to guarantee well-defined behavior when the model is used in a new setting. We show that underspecification appears in a wide variety of practical ML systems and propose some strategies for mitigation.

Underspecification

ML systems have been successful largely because they incorporate validation of the model on held-out data to ensure high performance. However, for a fixed dataset and model architecture, there are often many distinct ways in which a trained model can achieve high validation performance. Under standard practice, models that encode distinct solutions are often treated as equivalent because their held-out predictive performance is about the same.

Importantly, the differences between these models become apparent when they are measured on criteria beyond standard predictive performance, such as fairness or robustness to irrelevant input perturbations. For example, among models that perform equally well on standard validation, some may show larger performance disparities between social groups than others, or may rely more heavily on irrelevant information. These differences can translate into real differences in behavior when the model is used in real-world scenarios.

Underspecification refers to the gap between the requirements that practitioners often have in mind when building an ML model and the requirements that are actually enforced by the ML pipeline (that is, the design and implementation of the model). An important consequence of underspecification is that even if the pipeline could, in principle, return a model that meets all of these requirements, there is no guarantee that the model will in practice satisfy any requirement beyond accurate prediction on held-out data. In fact, the returned model may have properties that depend on arbitrary or opaque choices made in the implementation of the ML pipeline, such as those arising from random initialization seeds, data ordering, or hardware. As a result, an ML pipeline that contains no explicit defects can still return a model that behaves unexpectedly in real-world settings.

Identifying underspecification in real-world applications

In this work, we investigated the concrete implications of underspecification for the kinds of ML models used in real-world applications. Our empirical strategy was to build sets of models using nearly identical ML pipelines, to which we applied only small changes that had no practical effect on standard validation performance. Here we focused on the random seed used to initialize training and determine data ordering. If important properties of the model can be substantially affected by these changes, it indicates that the pipeline does not fully specify this real-world behavior. In every domain where we ran this experiment, we found that these small changes induced substantial variation along axes that matter in real-world use.
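The seed-variation strategy can be sketched on a toy problem. This is an illustration under stated assumptions, not one of the pipelines studied in the paper: the synthetic data, the two-feature logistic regression, and the names make_data, train, and accuracy are all hypothetical. In training, the second feature is an exact copy of the first, so held-out data cannot distinguish models that split weight between the two features differently. The stress set replaces that copy with noise.

```python
import math
import random
import statistics


def make_data(n, rng, shifted=False):
    """Binary labels with an informative feature x1 and a second feature x2.

    In-distribution, x2 is an exact copy of x1 (a redundant feature).
    Under the shift, x2 is independent noise, so models leaning on x2 suffer.
    """
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x1 = (2 * y - 1) + rng.gauss(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0) if shifted else x1
        rows.append((x1, x2, y))
    return rows


def train(data, seed, epochs=30, lr=0.05):
    """Logistic regression by SGD; the seed sets initial weights and data order."""
    rng = random.Random(seed)
    w1, w2, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x1, x2, y = data[i]
            z = max(-30.0, min(30.0, w1 * x1 + w2 * x2 + b))  # avoid overflow
            grad = 1.0 / (1.0 + math.exp(-z)) - y
            w1 -= lr * grad * x1
            w2 -= lr * grad * x2
            b -= lr * grad
    return w1, w2, b


def accuracy(model, data):
    w1, w2, b = model
    hits = sum((w1 * x1 + w2 * x2 + b > 0) == (y == 1) for x1, x2, y in data)
    return hits / len(data)


data_rng = random.Random(0)  # the data is fixed; only the training seed varies
train_set = make_data(500, data_rng)
iid_test = make_data(2000, data_rng)
stress_test = make_data(2000, data_rng, shifted=True)

models = [train(train_set, seed) for seed in range(20)]
iid_acc = [accuracy(m, iid_test) for m in models]
stress_acc = [accuracy(m, stress_test) for m in models]

iid_spread = statistics.pstdev(iid_acc)
stress_spread = statistics.pstdev(stress_acc)
print(f"held-out: mean={statistics.mean(iid_acc):.3f} spread={iid_spread:.4f}")
print(f"stress:   mean={statistics.mean(stress_acc):.3f} spread={stress_spread:.4f}")
```

Because both features receive identical gradient updates during training in this toy setup, the split of weight between them is fixed by the random initialization. Held-out accuracy therefore barely moves across seeds, while accuracy on the stress set is expected to spread much more widely.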

Underspecification in computer vision

As an example, consider the relationship between underspecification and robustness in computer vision. A central challenge in computer vision is that deep models are often brittle under distribution shifts that humans do not find difficult. For example, image classification models that perform well on the ImageNet benchmark are known to perform poorly on benchmarks such as ImageNet-C, which applies common image corruptions such as pixelation and motion blur to the standard ImageNet test set.

Our experiments showed that the sensitivity of the model to these corruptions is underspecified by the standard pipeline. Following the strategy above, we generated 50 ResNet-50 image classification models using the same pipeline and the same data. The only difference between these models was the random seed used in training. When evaluated on the standard ImageNet validation set, these models achieved practically equivalent performance. However, when the models were evaluated on different test sets from the ImageNet-C benchmark (that is, on corrupted data), performance on some tests varied by orders of magnitude more than on standard validation. This pattern persisted in larger models pre-trained on much larger datasets (for example, the BiT-L model pre-trained on the 300 million image JFT-300M dataset). For these models, varying the random seed at the fine-tuning stage of training produced a similar pattern of variation.

Left: Parallel axis plot showing the variation in accuracy between identical, randomly initialized ResNet-50 models on strongly corrupted ImageNet-C data. The lines represent the performance of each model in the ensemble on classification tasks using uncorrupted test data and corrupted data (pixelation, contrast, motion blur, brightness). The values shown are the deviation in accuracy from the ensemble mean, scaled by the standard deviation of accuracies on the “clean” ImageNet test set. The solid black line highlights the performance of an arbitrarily selected model, showing that performance on one test may not be a good indication of performance on another. Right: Example images from the standard ImageNet test set, with corrupted versions from the ImageNet-C benchmark.
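The scaling described in the caption can be written out as a short function. This is a sketch under assumptions: the function name scaled_deviations, the dictionary layout, and the use of the population standard deviation are choices made here, and the accuracy numbers are invented for illustration and are not results from the paper.

```python
import statistics


def scaled_deviations(acc_by_test, clean_key="clean"):
    """For each test set, return each model's accuracy deviation from the
    ensemble mean, divided by the std. dev. of accuracies on the clean set."""
    clean_sd = statistics.pstdev(acc_by_test[clean_key])
    if clean_sd == 0:
        raise ValueError("clean accuracies are identical; cannot scale")
    scaled = {}
    for name, accs in acc_by_test.items():
        mean = statistics.fmean(accs)
        scaled[name] = [(a - mean) / clean_sd for a in accs]
    return scaled


# Invented accuracies for three models (one list entry per random seed).
example = {
    "clean": [0.760, 0.762, 0.758],
    "pixelate": [0.30, 0.40, 0.35],
}
dev = scaled_deviations(example)
for name, values in dev.items():
    print(name, [round(v, 1) for v in values])
```

On this scale the clean test set has unit spread by construction, so a value far outside roughly plus or minus one on a corrupted test set means the seed-to-seed variation there is much larger than standard validation would suggest.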

We also showed that underspecification can have practical implications for special-purpose computer vision models built for medical imaging, where deep learning models hold great promise. We considered two research pipelines intended as precursors to medical applications: an ophthalmology pipeline for building models that detect diabetic retinopathy and referable diabetic macular edema from retinal fundus images, and a dermatology pipeline for building models that recognize common skin conditions from photographs of skin. In our experiments, we considered pipelines that were validated only on randomly held-out data.

We then stress-tested the models produced by these pipelines along practically important dimensions. For the ophthalmology pipeline, we tested how models trained with different random seeds performed when applied to images taken with a new camera type not encountered during training. For the dermatology pipeline, the stress test was similar, but for patients with different estimated skin types (i.e., non-dermatologist assessment of skin tone and response to sunlight). In both cases, standard validation proved insufficient to fully specify the trained model's performance along these axes. In the ophthalmology application, the random seed used in training induced wider variation in performance on the new camera type than would be expected from standard validation. In the dermatology application, the random seed induced similar variation in performance within skin type subgroups, even though the overall performance of the models was stable across seeds.
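The kind of subgroup breakdown used in such a stress test can be sketched as follows. The helper names subgroup_accuracy and spread_across_seeds, the group labels, and the tiny set of predictions are hypothetical and exist only to show the bookkeeping. They are not data from the ophthalmology or dermatology pipelines.

```python
from collections import defaultdict


def subgroup_accuracy(preds, labels, groups):
    """Accuracy computed separately for each subgroup label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        totals[g] += 1
        hits[g] += int(p == y)
    return {g: hits[g] / totals[g] for g in totals}


def spread_across_seeds(preds_by_seed, labels, groups):
    """Gap between the best and worst seed, per subgroup."""
    per_seed = [subgroup_accuracy(p, labels, groups) for p in preds_by_seed.values()]
    return {g: max(s[g] for s in per_seed) - min(s[g] for s in per_seed)
            for g in per_seed[0]}


# Invented labels, subgroups, and per-seed predictions.
labels = [1, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B"]
preds_by_seed = {
    0: [1, 0, 1, 1, 1, 0],  # one error, in subgroup A
    1: [1, 0, 1, 0, 1, 1],  # one error, in subgroup B
}

overall = {seed: sum(p == y for p, y in zip(preds, labels)) / len(labels)
           for seed, preds in preds_by_seed.items()}
spread = spread_across_seeds(preds_by_seed, labels, groups)
print("overall accuracy per seed:", overall)
print("subgroup spread across seeds:", spread)
```

In this invented example both seeds have the same overall accuracy, yet their subgroup accuracies differ, which is the pattern that an aggregate held-out metric cannot reveal.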

These results reiterate that standard hold-out testing alone is not sufficient to guarantee acceptable model behavior in medical applications, and they underscore the need for expanded testing protocols for ML systems intended for use in the medical domain. In the medical literature, such validation is called “external validation” and has historically been part of reporting guidelines such as STARD and TRIPOD. It is being emphasized in updates such as STARD-AI and TRIPOD-AI. Finally, as part of regulated medical device development processes (see, for example, US and EU regulations), there are other forms of safety- and performance-related considerations, such as mandatory compliance with standards for risk management, human factors engineering, clinical validation, and accredited body reviews, that aim to ensure acceptable medical application performance.

Relative variation of medical imaging models on stress tests, using the same conventions as in the figure above. Top left: Variation in AUC between diabetic retinopathy classification models trained with different random seeds when evaluated on images from different camera types. In this experiment, camera type 5 was not encountered during training. Bottom left: Variation in accuracy between skin condition classification models trained with different random seeds when evaluated on different estimated skin types (estimated by dermatologist-trained laypersons from retrospective photographs, and potentially subject to labeling errors). Right: Example images from the original test set (left) and the stress test set (right).

Underspecification in other applications

The cases above are a small subset of the models we probed for underspecification. Other cases we investigated include:

Natural Language Processing: On a variety of NLP tasks, we showed that underspecification affects how models derived from BERT process sentences. For example, depending on the random seed, the pipeline can produce a model that depends more or less on correlations involving gender (for example, between gender and occupation) when making predictions.

Acute Kidney Injury (AKI) Prediction: We showed that underspecification affects how much AKI prediction models based on electronic health records rely on operational versus physiological signals.

Polygenic Risk Scores (PRS): We showed that underspecification affects the ability of PRS models, which predict clinical outcomes from patient genomic data, to generalize across different patient populations.

In each case, we showed that these important properties are left ill-defined by the standard training pipeline, which makes them sensitive to seemingly innocuous choices.

Conclusion

Addressing underspecification is a difficult problem. It requires full specification and testing of model requirements beyond standard predictive performance. Doing this well requires full engagement with the context in which the model will be used, an understanding of how the training data were collected, and often the incorporation of domain expertise when the available data fall short. These aspects of ML system design are often underemphasized in today’s ML research. A key goal of this work is to show how underinvestment in this area can manifest concretely, and to encourage the development of processes for fuller specification and testing of ML pipelines.

An important first step in this area is to specify stress testing protocols for any applied ML pipeline that is meant to see real-world use. Once these criteria are codified as measurable metrics, a number of algorithmic strategies, such as data augmentation, pre-training, and incorporation of causal structure, may help to improve them. Keep in mind, however, that ideal stress testing and improvement processes usually require iteration: both the requirements of ML systems and the world in which they are used are constantly changing.
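One way a stress test protocol might be codified as measurable criteria is sketched below. Everything here is an assumption made for illustration: the StressCriterion fields, the evaluate function, the test names, the thresholds, and the scores are hypothetical, and real requirements would come from the application domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StressCriterion:
    name: str         # which test set the requirement applies to
    min_worst: float  # lowest acceptable score for the worst seed
    max_range: float  # largest acceptable gap between best and worst seed


def evaluate(criteria, scores_by_test):
    """Return {test name: list of violated requirements}; empty list means pass."""
    report = {}
    for c in criteria:
        scores = scores_by_test[c.name]
        problems = []
        if min(scores) < c.min_worst:
            problems.append(f"worst seed {min(scores):.2f} < {c.min_worst:.2f}")
        if max(scores) - min(scores) > c.max_range:
            problems.append(f"seed range {max(scores) - min(scores):.2f} > {c.max_range:.2f}")
        report[c.name] = problems
    return report


# Invented scores, one per random seed, for a held-out set and one stress test.
scores_by_test = {
    "held_out": [0.91, 0.90, 0.91],
    "new_camera": [0.88, 0.74, 0.83],
}
criteria = [
    StressCriterion("held_out", min_worst=0.85, max_range=0.03),
    StressCriterion("new_camera", min_worst=0.80, max_range=0.05),
]
report = evaluate(criteria, scores_by_test)
for name, problems in report.items():
    print(name, "PASS" if not problems else problems)
```

Stating the requirement per stress test and per seed, rather than only as an average, makes it possible to rerun the same check each time the pipeline or the deployment setting changes.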

Acknowledgments

We would like to thank all of our co-authors, Dr. Nenad Tomasev (DeepMind), Professor Finale Doshi-Velez (Harvard SEAS), UK Biobank, and our partners EyePACS, Aravind Eye Hospital, and Sankara Nethralaya.

