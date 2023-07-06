



Google has updated its privacy policy to confirm that it collects public data from the internet to train its AI models and services, such as its chatbot Bard and its search engine, which can generate instant responses to queries.

Under research and development, the fine print now reads: “Google uses information to improve our services and to develop new products, features and technologies that benefit our users and the public. For example, we use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”

Interestingly, Reg staff outside the US were unable to see the text quoted at the link above. However, the PDF version of Google’s policy states: “We may collect information that’s publicly available online or from other public sources to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”

The change defines the scope of Google’s AI training. Previously, the policy mentioned only “language models” and made a reference to Google Translate. The wording has now been broadened to “AI models” and covers Bard and other systems built as applications on Google’s cloud platform.

A Google spokesperson told The Register that the update didn’t fundamentally change how AI models are trained.

“Our privacy policy has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate. This latest update simply makes it clear that newer services like Bard are also included. We incorporate privacy principles and safeguards into our privacy policy. We develop AI technology in line with our AI principles,” the spokesperson said in a statement.

Over the years, developers have scraped the internet, photo albums, books, social networks, source code, music, articles, and more to gather training data for AI systems. However, the process is controversial given that the material is usually protected by copyrights, terms of service and licenses, all of which have led to lawsuits.

Some people are upset that not only is their content being used to build machine learning systems that replicate their work, thus endangering their livelihoods, but that the models’ output also comes too close to copyright or license infringement by regurgitating this training data unchanged.

AI developers may argue that their efforts fall under fair use, and that what the model outputs is a new form of work, not really a copy of the original training data. It’s a hotly debated issue.

For example, Stability AI is being sued by Getty Images for allegedly collecting and misusing millions of images from its stock image website to train its text-to-image tools. Meanwhile, OpenAI and its backer Microsoft have faced multiple lawsuits accusing them of improperly scraping “300 billion words from the internet, ‘books, articles, websites and posts containing personal information obtained without consent’” and of copying source code from public repositories to create the AI pair programming tool GitHub Copilot.

Google officials declined to say whether the advertising and search giant scrapes copyrighted or licensed public data or social media posts to train its systems.

Now that people are more aware of how AI models are trained, some internet companies have started charging developers for access to their data. For example, Stack Overflow, Reddit, and Twitter introduced fees or new rules this year for accessing content through their APIs. Other sites, such as Shutterstock and Getty, have chosen to license their images to AI model builders and have partnered with the likes of Meta and Nvidia.

