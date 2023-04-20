



A large language model can generate strings of text based on word patterns learned from the web pages, books, and other bodies of text in its training data. Beyond ChatGPT, these programs form the backbone of search chatbots such as Microsoft’s Bing Chat and Google’s Bard, as well as a growing number of applications that produce professional and creative copy in seconds. Their image- and video-generating counterparts draw patterns from image datasets, such as photos collected from Pinterest and Flickr.

Datasets used in AI development are often constructed through informal means, such as dispatching software to scrape content from websites. In the United States, this is generally considered legal, but it remains controversial because of copyright concerns and website terms of use that discourage the practice.

Some websites, such as Reddit and Stack Overflow, are more accommodating. They provide downloadable data dumps or real-time data portals, called APIs, that allow software to access their content. In the case of Stack Overflow, Chandrasekar said, LLM developers use a combination of dumps, APIs, and scraping to get their hands on the data. All of this is currently free.

But Chandrasekar says the LLM developers have violated Stack Overflow’s terms of service. As outlined in the terms, users own the content they post to Stack Overflow, but that content is covered by a Creative Commons license, which requires anyone who later uses the data to credit its source. Chandrasekar said that when an AI company sells a model to a customer, it fails to credit all of the community members whose questions and answers were used to train the model, violating the Creative Commons license.

Neither Stack Overflow nor Reddit has published pricing information. Reddit is working on the issue and will share more with its partners in the coming weeks, according to Reddit spokesperson Tim Rathschmidt. Stack Overflow will study Reddit’s strategy and consult with its own potential customers, some of whom have already been in touch about data access, Chandrasekar said.

A potential roadmap for pricing could come from Elon Musk, who raised the price of access to Twitter data this month. Access to 50 million tweets now starts at $42,000 per month; nearly three times that volume was previously offered for free. In a tweet this week, Musk accused Microsoft, a leading AI developer and a close partner of OpenAI, of illegally using Twitter data to train algorithms. “Lawsuit time,” he added, without elaborating.

Both Stack Overflow and Reddit will continue to license their data for free to select individuals and businesses. Chandrasekar said Stack Overflow is seeking compensation only from companies developing LLMs for large-scale commercial purposes. Once people start charging for products built on community-built sites like Stack Overflow, he says, it is no longer fair use.

Reddit CEO Steve Huffman told The New York Times this week that he doesn’t want to give Reddit’s data away for free to the world’s biggest companies. What Reddit has a problem with, he said, is companies crawling the site, creating value from it, and not returning any of that value to its users.

