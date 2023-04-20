



New concerns have been raised about the training material used for some of the largest and most powerful artificial intelligence models, after several investigations revealed the fascist, pirated and malicious sources from which the data was collected.

One such dataset is the Colossal Clean Crawled Corpus (C4). It was assembled by Google from more than 15 million websites and used to train both the search company's LaMDA AI and Meta's GPT competitor, LLaMA.

The dataset is publicly available, but its size makes it difficult to explore. It is supposedly a clean version of Common Crawl, a more expansive dataset, with noisy content, offensive language, and racist slurs removed from the material.

But a Washington Post investigation revealed that C4's cleanliness is only skin deep. It draws on websites such as the Guardian, which makes up 0.05% of the entire dataset, and Wikipedia, as well as large databases such as Google Patents and the scientific journal hub PLOS, but it also includes less reputable sites.

The white nationalist site VDARE is in the database and is one of the 1,000 largest sites, as is the far-right news site Breitbart. The Russian government-backed propaganda site RT is one of the 100 largest providers of training data for the C4 corpus.

Few sites explicitly agreed to be included, although Common Crawl, the nonprofit that collected the scraped data, said it honors requests to be excluded from its crawl. Some sources, however, push the boundaries of fair use. Formerly known as Bookzz, b-ok.org was a vast repository of pirated ebooks until it was seized by the FBI in 2022. Despite that, content from the site remains in the C4 database.

Such vast collections of data are critical to building AI, because the large language models (LLMs) that underpin tools like ChatGPT need huge datasets to improve.

Harvesting the hundreds of gigabytes of text needed to train such models from explicitly licensed sources would be a daunting task, and many AI researchers choose to ask for forgiveness rather than permission, claiming that their work is protected by a fair use defense to copyright.

Some even choose to forgo the cleaning that Google applied to its dataset in order to have access to more data for their systems to learn from. London-based Stability AI released its new LLM, StableLM, on Wednesday. It was trained on the Pile, an 850 GB dataset containing the entire uncleaned Common Crawl database, 2 million pirated ebooks from the BitTorrent site Bibliotik, 100 GB of data scraped from the coding site GitHub, and more esoteric sources, such as all internal emails sent by the now-defunct energy company Enron and all proceedings of the European Parliament.

The Pile is published by a group of anonymous data enthusiasts called the Eye. Its copyright takedown policy links to a video of a choir of clothed women pretending to masturbate imaginary penises while singing.

The version Stability uses is currently private, but the company says it is three times larger. Details about the additional content in that dataset have not yet been released, but the company says it gives StableLM surprisingly high performance on conversational and coding tasks.

Stability said it is open-sourcing its models to promote transparency and foster trust. Researchers, it said, can look inside to verify performance, work on interpretability techniques, identify potential risks, and help develop safeguards.

The company added that public and private sector organizations can adapt (fine-tune) these open-source models for their own applications without sharing sensitive data or relinquishing control over their AI capabilities.

Google has been asked for comment.

