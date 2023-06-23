



The internet could soon be swamped by a storm of AI-generated nonsense, and no one knows how to stop it.

That is the sobering possibility raised by two papers that examine AI models trained on AI-generated data. This possibly avoidable fate is not news to AI researchers. But the two new papers put forward concrete results that detail the consequences of a feedback loop in which models are trained on their own output. While the studies could not replicate the scale of the largest AI models, such as ChatGPT, the results still aren’t pretty. And they may reasonably extrapolate to larger models.

“Over time, those errors pile up. Then at some point the data basically becomes dominated by errors instead of the original data.”
Ilia Shumailov, University of Cambridge

“The concept of generating data and reusing it to retrain, tune, or perfect a machine learning model is a very dangerous game,” says Jennifer Prendki, CEO and founder of DataPrepOps company Alectio.

AI rushes towards collapse

Both papers are preprints, and they approach the problem from slightly different angles. “The Curse of Recursion: Training on Generated Data Makes Models Forget” examines the potential effect on large language models (LLMs) such as ChatGPT and Google Bard, as well as on Gaussian mixture models (GMMs) and variational autoencoders (VAEs). The second paper, “Towards Understanding the Interplay of Generative Artificial Intelligence and the Internet,” examines the effect on diffusion models, such as those used by image generators like Stable Diffusion and DALL-E.

Although the models discussed differ, the papers reach similar conclusions. Both found that training a model on model-generated data can lead to a failure known as model collapse.

“It’s because when the first model fits the data, it has its own errors. And then the second model, which trains on data produced by the first model with those errors inside, basically learns the set errors and adds its own errors on top,” says Ilia Shumailov, a University of Cambridge computer science Ph.D. candidate and coauthor of the “Recursion” paper. “Over time, those errors pile up. Then at some point the data basically becomes dominated by errors instead of the original data.”
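The compounding that Shumailov describes can be illustrated with a toy simulation. The sketch below is not the experiment from either paper. It assumes the simplest possible “model,” a single Gaussian fitted to a handful of samples, and the function name, sample size, and generation count are all hypothetical choices made for illustration. Each generation fits a mean and standard deviation to the previous generation’s samples, then produces the next dataset by sampling from that fit.

```python
import random
import statistics


def simulate_recursive_fitting(generations=200, sample_size=10, seed=0):
    """Fit a Gaussian, sample from the fit, refit on those samples, and repeat.

    Returns the fitted standard deviation of every generation. Index 0 is the
    fit to the original "real" data; later entries are fits to model output.
    """
    rng = random.Random(seed)
    # Generation 0 trains on real data drawn from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    stds = []
    for _ in range(generations + 1):
        mu = statistics.fmean(data)      # each fit carries its own sampling error
        sigma = statistics.pstdev(data)  # maximum-likelihood estimate of the spread
        stds.append(sigma)
        # The next generation never sees real data, only samples from this fit.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return stds


if __name__ == "__main__":
    stds = simulate_recursive_fitting()
    for generation in (0, 10, 50, 100, 200):
        print(f"generation {generation:3d}: fitted std = {stds[generation]:.3e}")
```

In this toy setting the fitted spread tends to shrink from one generation to the next, so later generations describe an ever narrower slice of the original distribution. Larger sample sizes slow the drift but do not remove it.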

The quality of the output produced by LLMs degrades with each generation of training on AI-generated data. Source: “The Curse of Recursion: Training on Generated Data Makes Models Forget”

And the errors pile up quickly. Shumailov and his coauthors took OPT-125M, an open-source LLM introduced by researchers at Meta in 2022, and fine-tuned it on the wikitext2 dataset. Early generations produced decent results, but the responses became nonsensical within ten generations. A response from the ninth generation repeated the phrase “tailed jackrabbits” and cycled through a variety of colors, none of which had anything to do with the initial prompt about the architecture of towers in Somerset, England.

Diffusion models are similarly affected. “It seems that as soon as you have a reasonable volume of artificial data, it does degenerate,” says Rik Sarkar, coauthor of “Towards Understanding” and deputy director of the Laboratory for Foundations of Computer Science at the University of Edinburgh. The paper found that a simple diffusion model trained on a specific category of images, such as photos of birds or flowers, produced unusable results within two generations.

Sarkar cautions that these results are a worst-case scenario: the dataset was limited, and the output of each generation was fed directly back into the model. Still, the paper’s results show that model collapse can occur when a model’s training dataset contains too much AI-generated data.

AI training data represents a new frontier in cybersecurity

This comes as no shock to those who closely study the interaction between AI models and the data used to train them. Prendki is an expert in machine learning operations (MLOps) who also holds a Ph.D. in particle physics, and she views the problem through a more fundamental lens.

“It’s basically the concept of entropy, right? Data has entropy. More entropy means more information, right?” says Prendki. “But a dataset that is twice as large does not guarantee twice the entropy. It’s like putting some sugar in a teacup and then adding more water. You’re not increasing the amount of sugar.”
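The teacup analogy can be made concrete with a small calculation. The sketch below is one possible reading of it, under the assumption that “entropy” means the Shannon entropy of the empirical distribution of items in a dataset. The tiny datasets and the function name are hypothetical and chosen only so that the numbers come out exactly.

```python
import math
from collections import Counter


def shannon_entropy(samples):
    """Shannon entropy, in bits, of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy dataset: four distinct, equally frequent items (the "sugar").
original = ["a", "b", "c", "d"]

# Twice as many records, but nothing new: more water in the same teacup.
doubled = original * 2

# Twice as many records, all of them new: more sugar.
enriched = original + ["e", "f", "g", "h"]

if __name__ == "__main__":
    print(shannon_entropy(original))  # 2.0 bits
    print(shannon_entropy(doubled))   # still 2.0 bits
    print(shannon_entropy(enriched))  # 3.0 bits
```

Doubling the dataset by repeating what it already contains leaves the entropy unchanged. Only genuinely new items raise it.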

“This is a next-generation cybersecurity problem that most people don’t mention.”
Jennifer Prendki, CEO of Alectio

From this perspective, model collapse looks like an obvious problem with an obvious solution. Just turn off the faucet and add another spoonful of sugar. But that is easier said than done. Pedro Reviriego, a coauthor of “Towards Understanding,” says that while methods exist to weed out AI-generated data, they quickly become obsolete because new AI models are released every day. “It’s like [cyber]security,” says Reviriego. “You have to keep chasing something that is moving fast.”

Prendki agrees with Reviriego and takes the argument one step further. She says organizations and researchers training AI models should treat training data as a potential adversary and scrutinize it to avoid degrading their models. “This is a next-generation cybersecurity problem that most people don’t mention,” Prendki says.

There is one solution that could solve the problem completely: watermarks. Images generated by OpenAI’s DALL-E include a specific pattern of colors as a watermark by default (though users have the option to remove it). LLMs can carry watermarks as well, in the form of algorithmically detectable patterns that are not apparent to humans. A watermark provides an easy way to detect and filter out AI-generated data.
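To show what an “algorithmically detectable pattern” might look like, the sketch below uses a toy version of one published idea, the keyed “green list” text watermark. It is an assumption made for illustration, not the scheme used by OpenAI or any other vendor. A secret key and the previous token split the vocabulary roughly in half; a watermarking generator prefers “green” tokens, and a detector counts how often consecutive token pairs land on green. Every name here (`is_green`, `green_z_score`, `generate_watermarked`, `filter_unwatermarked`, the key, the threshold) is hypothetical.

```python
import hashlib
import math
import random


def is_green(prev_token, token, key="demo-key"):
    """Keyed hash that marks about half of all (prev_token, token) pairs green."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0


def green_z_score(tokens, key="demo-key"):
    """How far the green-pair count sits above the 50% expected by chance."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    hits = sum(is_green(prev, tok, key) for prev, tok in pairs)
    n = len(pairs)
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)


def generate_watermarked(vocab, length, rng, key="demo-key"):
    """Stand-in for a watermarking generator: it only ever picks green tokens."""
    tokens = [rng.choice(vocab)]
    while len(tokens) < length:
        candidates = [t for t in vocab if is_green(tokens[-1], t, key)]
        tokens.append(rng.choice(candidates or vocab))
    return tokens


def filter_unwatermarked(documents, threshold=4.0, key="demo-key"):
    """Keep only documents whose z-score stays below the detection threshold."""
    return [doc for doc in documents if green_z_score(doc, key) < threshold]


if __name__ == "__main__":
    rng = random.Random(42)
    vocab = [f"w{i}" for i in range(50)]  # hypothetical 50-word vocabulary
    marked = generate_watermarked(vocab, 101, rng)
    plain = [rng.choice(vocab) for _ in range(101)]
    print("watermarked z-score:", green_z_score(marked))
    print("plain z-score:      ", green_z_score(plain))
    print("documents kept:     ", len(filter_unwatermarked([marked, plain])))
```

Detection in this sketch works only for a party that holds the key, and only when the generator cooperates by embedding the pattern in the first place.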

Effective watermarking, however, requires some agreement on how watermarks are implemented, as well as a means of enforcement to stop bad actors from distributing AI-generated data without them. China has introduced legislation that would (among other regulations) mandate watermarks on AI content, but it is an unlikely template for Western democracies.

Images created with OpenAI’s DALL-E have a watermark in the bottom right corner, but users can choose to remove it. Credit: OpenAI

A faint glimmer of hope remains. The models studied in both papers are small compared with the largest models in use today, such as Stable Diffusion and GPT-4, so large-scale models may prove more robust. New methods of data curation may also improve the quality of future datasets. Absent such solutions, however, AI models could face a first-mover advantage, because early models will have had better access to datasets untainted by AI-generated data, Shumailov says.

“Once the ability to generate synthetic data with some errors inside it is established, and such models are used at large scale, the data these models produce will inevitably end up online,” says Shumailov. “Say you want to start a company that provides a large language model as a service to someone [today]. You go and collect a year’s worth of data online, and when you try to build a model, it has model collapse inside of it.”

