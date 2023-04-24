



The AI boom is built on data, the data came from the internet, and the internet came from us.

Driving the news: A Washington Post analysis of one public data set widely used to train AI shows how broadly today’s AI industry has sampled 30 years of web publishing to train its neural networks.

Why it matters: Have you ever written a blog? Created a web page? Joined a Reddit thread? I have. Your words could be helping educate AI chatbots everywhere.

The big picture: This massive repurposing of words has triggered a significant legal debate over whether it should be treated as fair use or theft. It is also prompting some personal calculations among the millions of contributors who built today’s online world.

We thought we were sharing heart to heart with one another, and of course we were.

But unwittingly, we were also creating an incomplete but rich database of human expression. That database enables the amazingly clever sentence-completion gymnastics of ChatGPT and its competitors.

Visual AI tools like Dall-E, Midjourney and Stable Diffusion became popular before text chatbots like ChatGPT came along, so visual creators such as photographers, illustrators and fine artists were among the first to confront the reuse of their work.

But far more people have typed a few words on the internet than have ever recorded a song or drawn a picture.

The Washington Post project lets you enter any internet domain name and see how much it contributed to one AI training database. (This is not the data set OpenAI used for ChatGPT and its other projects; OpenAI does not disclose its training data sources.) The data set contains more than 500,000 personal blogs, representing 3.8% of the total “tokens,” or discrete chunks of language, in the data, the Post team found. (Posts on proprietary social media platforms such as Facebook, Instagram and Twitter do not show up, though those companies have access to that data internally.)

NOTE: These training databases are huge, but mostly unrepresentative. Some cultures, groups and subjects are oversampled. Many others are unfairly neglected. And all the prejudices, limitations, and harmful aspects of internet culture show up in AI training data.

My thought bubble: A personal blog I wrote consistently for 15 years appears to show up in the Post data set, as do most of the articles I contributed over 10 years to a web magazine I helped create.

If you have any kind of online history, the self-search that the Post’s tool offers is as tempting as Googling your name. (For visual work, there’s a similar search tool called “Have I Been Trained?”) “What?” you ask yourself, and “Why wasn’t I consulted?” and “What would I have done if I had known this was coming?”

Get smart: AI’s hunger for training data is shedding new light on the entire 30-year history of the popular internet.

Today’s AI breakthroughs would not have been possible without tapping into the digital stockpiles and landfills of information, ideas, and emotions that the Internet has inspired people to generate. But we created them all for each other, not for the AI.

Seen in this light, the existence of these vast data “corpora” was a very important unintended consequence of the rise of the web itself.

In 1995, when one generation fell in love with “www” and browsers, or a decade later, when another celebrated the advent of blogging and the “wisdom of the crowds,” this outcome was nowhere in sight. In the 2010s, the rumblings of the machine learning revolution began to unnerve some visionaries. But it took much longer for anyone to realize that the entire web was about to become fodder for AI training.

Today, this unintended consequence is at the forefront of our online experience, and it reminds us that everything we do with or against AI today will shape the future in unpredictable ways.

For example, flooding public networks with AI-generated simulacra risks discouraging people from sharing and creating their own original work. This could leave future AI models stuck forever with human output frozen around the year 2000. Nothing new to learn in 2020.

