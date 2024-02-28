



Social media platform Reddit signed a licensing agreement with Google on Thursday, giving the search giant access to Reddit users' posts to train its artificial intelligence (AI) engine. As part of the deal, Google will pay the social news aggregator $60 million annually for access to user-generated content from the platform.

This agreement could not have come at a better time for both companies. Reddit is hoping for cash and investor goodwill ahead of its planned initial public offering (IPO), and Google is trying to save face after its AI failures.

Although Reddit generates revenue, the company is not profitable. Its IPO documents filed with U.S. stock market regulators show 2023 revenue of $804 million, most of it from advertisers. However, the platform posted a net loss of $90.8 million.

Google's annual payment gives Reddit cash and a way to monetize its content. Additionally, a data partnership with one of the biggest names in the AI business could boost Reddit's standing ahead of its IPO and, in an age when AI chatbots are disrupting monolithic social platforms, help investors see value in the platform.

The licensing deal gives the Mountain View, Calif.-based company a trove of data to help it navigate the AI troubles it is currently facing.

What's wrong with Google?

Google's sporadic attempts to break OpenAI's dominance in AI have seriously hurt the search giant. The company's first AI chatbot, Bard, launched as a rival to OpenAI's ChatGPT, had flaws. Its first demo video contained factual errors. Subsequent iterations were not much more reliable.

More recently, the company's Gemini chatbot overcompensated for a lack of diversity by producing images that did not match the queries. Its AI-based image generator returned an image of a Black woman in response to the question, “Who is the founding father of the United States?” In another example, it depicted Asian people as German soldiers of the Nazi era. Such ill-considered responses caused quite a stir.

These blunders prompted Prabhakar Raghavan, the company's senior vice president who oversees search operations, to apologize and say the product had missed the mark.

While these issues relate to large language models (LLMs) and the weights attached to tokens, another challenge facing Google is raw data. LLMs are data-intensive algorithms, so the quality of the information flowing into them is very important.

To be able to output accurate text, a generative AI (GenAI) model must first read a large amount of text. For a long time, tech companies have been free riding by scraping the web for text or using open source crawling tools to sneak into websites and retrieve data from them.

The tactic has been called into question as users and publishers push back against AI companies indiscriminately collecting data from the web. A proposed class action lawsuit filed in July 2023 accuses Google of misusing vast amounts of web users' personal information to train its AI models.

Separately, in December, news publisher The New York Times sued OpenAI and Microsoft for copyright infringement. The lawsuit alleges that the AI companies used millions of the newspaper's articles to train their AI models, including the one behind ChatGPT.

Complaints like these from individuals and businesses are prompting lawmakers to step up and create policies for the ethical use of information available on the Web.

U.S. lawmakers have introduced a bill, the AI Foundation Model Transparency Act, that would require the Federal Trade Commission (FTC) and the National Institute of Standards and Technology (NIST) to set rules for reporting data transparency for AI models. It would require builders of foundation AI models to disclose the sources of their training data.

If such a law were passed, AI companies would have to compensate content owners for using their data to train models. As a result, the cost of building AI models would increase. To pre-empt such laws, big tech companies are entering into licensing agreements with news publishers and other content sources. OpenAI's deal with news agency Associated Press is a case in point.

Other news organizations, including Gannett (the largest newspaper company in the U.S.) and News Corp (owner of the Wall Street Journal), are also in talks with OpenAI, according to media reports. Publications that sign deals with AI companies will receive fees based on how often their content is used.

How different is this deal?

Google's deal with Reddit comes against this backdrop. However, unlike other platforms, Reddit functions as a social news website, with content curated and promoted by its own users. The platform is made up of thousands of sub-communities, known as subreddits, where members post content that is upvoted or downvoted by other members.

Under the agreement, Google will have access to Reddit's Data API, which provides unique content in real time from a large and dynamic platform. This helps the company's AI models access behavioral and trend data. Separately, Google will continue to use crawlers to access information on the web.

However, there is one problem with Reddit. Concerns over content moderation and accessibility arose in July 2023 when Reddit decided to implement a new policy that would charge some third-party apps to access data on the platform. Several groups protested the changes proposed by Reddit. More than 8,000 subreddits went dark. At the time, these subreddit groups said the change threatened to eliminate historically important ways to customize the platform.

To avoid such conflicts this time around, Reddit is giving an unspecified number of top users, including moderators and users with high Karma scores, the opportunity to buy shares in the IPO, The Verge reports.

Reddit plans to do that through a tier-based allocation system. Tier 1 will comprise specific users and moderators whom Reddit identifies as having contributed meaningfully to its community programs. Tier 2 consists of users with a Karma score of 2,000 or higher (Karma reflects how much a user has contributed to the Reddit community) and moderators who have performed at least 5,000 moderator actions.

This is an unusual move since this privilege is usually given to professional investors who want to buy shares at a theoretically lower price before they are listed on an exchange. Reddit currently has approximately 267.5 million weekly active users, more than 100,000 active communities, and 1 billion total posts, according to SEC filings.

Have other platforms used user data to train AI models?

Unlike Reddit, few platforms have announced whether their users' public information will be used to train AI models. X (formerly Twitter) announced in September that it would use user posts to train AI models for purposes outlined in its policy. The policy does not specify which AI models it refers to.

Meta said user data from applications such as Facebook, Instagram, and Threads will be used to train its AI chatbot. TikTok and Snapchat have both launched AI chatbots, but neither has said whether it uses user posts to train its AI models.

Using user data to train algorithms is nothing new in the tech world. Most platforms' recommender engines use your personal usage data to suggest videos, articles, and movies. But using that information to train an AI model is new and requires caution, given that these chatbots tend to spew out personal information when responding to prompts.

A case in point is when Samsung banned the use of AI chatbots in its offices after employees used them and it was discovered that the bots spewed out trade secrets.

