Date: 2023-07-14



Imagine a team of scientists has developed a machine learning model that can predict whether a patient has cancer from a lung scan. They hope to share this model with hospitals around the world so clinicians can use it for diagnosis.

But there is a problem. To teach the model how to predict cancer, they showed it millions of real lung scans, a process called training. That sensitive data is now encoded in the inner workings of the model and could potentially be extracted by a malicious agent. The scientists can prevent this by adding noise, or more generic randomness, to the model, which makes it harder for an adversary to guess the original data. However, such perturbation reduces the model's accuracy, so the less noise one has to add, the better.

Researchers at MIT have developed a technique that lets the user potentially add the smallest amount of noise possible while still ensuring that the sensitive data is protected.

The researchers created a new privacy metric, which they call “Probably Approximately Correct (PAC) Privacy,” and built a framework based on this metric that can automatically determine the minimum amount of noise that needs to be added. Additionally, the framework does not require knowledge of the inner workings of the model or its training process, making it easier to use with different types of models and applications.

In several cases, the researchers show that PAC privacy requires far less noise than other approaches to protect sensitive data from adversaries. This could help engineers create machine learning models that provably hide their training data while maintaining accuracy in real-world settings.

“PAC privacy exploits the uncertainty or entropy of the sensitive data in a meaningful way, and this allows us to add, in many cases, an order of magnitude less noise. This framework allows us to understand the characteristics of arbitrary data processing and privatize it automatically without manual modifications. We are still in the early days and working through simple examples, but we are excited about the promise of this technique,” says Srini Devadas, the Edwin Sibley Webster Professor of Electrical Engineering and co-author of a new paper on PAC privacy.

Devadas wrote the paper with lead author Hanshen Xiao, a graduate student in electrical engineering and computer science. The research will be presented at the International Cryptology Conference (Crypto 2023).

Definition of privacy

A fundamental question in data privacy is how much sensitive data an adversary could recover from a machine learning model to which noise has been added.

One common definition of privacy, differential privacy, says that privacy is achieved when an adversary who observes the released model cannot infer whether an arbitrary individual's data was used in the training process. However, provably preventing an adversary from distinguishing data usage often requires a large amount of noise to obscure it. This noise reduces the accuracy of the model.
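To make that noise cost concrete, here is a minimal Python sketch of the Laplace mechanism, a textbook way to achieve differential privacy for a simple counting query. It is not taken from the paper and is far simpler than privatizing a trained model; the function names, the toy records, and the epsilon values are all illustrative assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng):
    """Release a count under epsilon-differential privacy (Laplace mechanism).

    Adding or removing one record changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
# Toy records (assumption): True means a scan was labeled positive.
records = [rng.random() < 0.3 for _ in range(1_000)]

for epsilon in (0.1, 1.0, 10.0):
    noisy = private_count(records, lambda r: r, epsilon, rng)
    print(f"epsilon={epsilon}: noisy count = {noisy:.1f}")
```

A smaller epsilon means a stronger privacy guarantee and a larger noise scale, which is the accuracy trade-off described above.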

PAC privacy looks at the problem a bit differently. It characterizes how hard it would be for an adversary to reconstruct any part of randomly sampled or generated sensitive data after noise has been added, rather than focusing only on the distinguishability problem.

For example, if the sensitive data are images of human faces, differential privacy would focus on whether an adversary can tell if someone's face was in the dataset. PAC privacy, on the other hand, could be used to look at whether an adversary can extract a silhouette, an approximation, that someone could recognize as a particular individual's face.

After establishing the definition of PAC privacy, the researchers created an algorithm that automatically tells the user how much noise to add to a model to prevent an adversary from confidently reconstructing a close approximation of the sensitive data. The algorithm guarantees privacy even if the adversary has infinite computing power, Xiao says.

To find the optimal amount of noise, the PAC privacy algorithm relies on the uncertainty, or entropy, of the original data from the adversary’s perspective.

This automatic technique takes samples randomly from a data distribution or a large data pool and runs the user's machine learning training algorithm on that subsampled data to produce an output learned model. It does this many times on different subsamplings and compares the variance across all outputs. This variance determines how much noise must be added: the smaller the variance, the less noise is needed.
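The subsample, train, and compare loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers' implementation: the "model" is a single number (the sample mean), and `noise_multiplier` is a hypothetical knob standing in for the calibration that the real framework derives from the user's privacy target.

```python
import random
import statistics


def train_mean_model(samples):
    """Stand-in for a training algorithm: the 'model' is one number, the mean."""
    return statistics.fmean(samples)


def estimate_output_variance(data_pool, train, subsample_size, num_trials, rng):
    """Run `train` on many random subsamples and return the variance of its outputs.

    `train` is treated as a black box: only its outputs are inspected.
    """
    outputs = []
    for _ in range(num_trials):
        subsample = rng.sample(data_pool, subsample_size)
        outputs.append(train(subsample))
    return statistics.pvariance(outputs)


def add_calibrated_noise(output, output_variance, noise_multiplier, rng):
    """Add zero-mean Gaussian noise scaled by the measured output spread.

    `noise_multiplier` is a hypothetical knob; this sketch does not derive
    the noise level from a privacy target as the real framework does.
    """
    sigma = noise_multiplier * output_variance ** 0.5
    return output + rng.gauss(0.0, sigma)


rng = random.Random(0)
# Toy data pool (assumption): 10,000 numeric measurements.
data_pool = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

variance = estimate_output_variance(
    data_pool, train_mean_model, subsample_size=500, num_trials=200, rng=rng
)
model = train_mean_model(rng.sample(data_pool, 500))
released = add_calibrated_noise(model, variance, noise_multiplier=1.0, rng=rng)
print(f"output variance across subsamples: {variance:.4f}")
print(f"released (noisy) model output: {released:.3f}")
```

Because the training function is only ever called, never inspected, the same loop would work with any training procedure that returns a numeric output.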

Algorithm advantage

Unlike other privacy approaches, the PAC privacy algorithm does not require knowledge of the inner workings of the model or the training process.

When implementing PAC privacy, a user can specify their desired level of confidence at the outset. For example, perhaps the user wants a guarantee that an adversary will not be more than 1 percent confident that they have successfully reconstructed the sensitive data to within 5 percent of its actual value. To meet that target, the PAC privacy algorithm automatically tells the user the optimal amount of noise that needs to be added to the output model before it is shared publicly.

“The noise is optimal in the sense that if you add less than we tell you, all bets could be off. But the effect of adding noise to neural network parameters is complicated, and we are making no promises about the drop in utility the model may experience with the added noise,” Xiao says.

This illustrates one limitation of PAC privacy: the technique does not tell the user how much accuracy the model will lose once the noise is added. PAC privacy also involves repeatedly training a machine learning model on many subsamplings of the data, which can be computationally expensive.

One approach to improving PAC privacy is to modify the user's machine learning training process so that it is more stable, meaning that the output model it produces does not change much when the input data is subsampled from the data pool. This stability would create smaller variances between subsample outputs, so the PAC privacy algorithm would not only need fewer runs to identify the optimal amount of noise, it would also need to add less noise.
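A small Python experiment can illustrate why stability matters. It is only a sketch with stand-in "training" rules chosen for illustration, not anything from the paper: one rule returns the mean of the subsample, the other returns its maximum.

```python
import random
import statistics


def output_variance(data_pool, train, subsample_size, num_trials, rng):
    """Variance of a black-box training rule's output across random subsamples."""
    outputs = [
        train(rng.sample(data_pool, subsample_size)) for _ in range(num_trials)
    ]
    return statistics.pvariance(outputs)


# Two toy "training" rules (assumptions, not from the paper):
# the mean barely moves between subsamples, while the maximum depends on
# which few extreme points happen to be drawn.
stable_train = statistics.fmean
unstable_train = max

rng = random.Random(0)
data_pool = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

stable_var = output_variance(data_pool, stable_train, 500, 200, rng)
unstable_var = output_variance(data_pool, unstable_train, 500, 200, rng)

print(f"stable rule variance:   {stable_var:.4f}")
print(f"unstable rule variance: {unstable_var:.4f}")
```

Since the amount of noise is set from the variance across subsample outputs, the rule with the smaller variance would need less noise to reach the same privacy target.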

An added benefit of more stable models is that they often have less generalization error, which means they can make more accurate predictions on previously unseen data, a win-win situation between machine learning and privacy, Devadas adds.

“Over the next few years, we would love to look a little deeper into this relationship between stability and privacy, and the relationship between privacy and generalization error. We are knocking on a door here, but it is not yet clear where the door leads,” he says.

“Obfuscating the use of an individual's data in a model is paramount to protecting their privacy. However, doing so can come at the cost of the data's and the model's utility,” says Jeremy Goodsitt, senior machine learning engineer at Capital One, who was not involved with this research. “PAC provides an empirical, black-box solution that can reduce the added noise compared to current practices while maintaining equivalent privacy guarantees. In addition, its empirical approach broadens its reach to more data-consuming applications.”

This research was funded, in part, by DSTA Singapore, Cisco Systems, Capital One, and a MathWorks Fellowship.

