



Robots.txt lets website owners choose whether to allow Google and other tech giants to scrape their online content. Most sites allow Google to do so because the search engine sends them a huge share of their valuable traffic.

Then the AI wars began. It turns out that all this content has ended up in the underlying datasets used to train powerful AI models from companies such as OpenAI, Google, and Meta. Because these models often answer user questions directly, less traffic may be sent to websites, and the grand bargain of the web could begin to unravel.

Part of Google's response was to launch a new tool that lets websites block the company from using their content to train AI models. It's called Google-Extended. It was released in September and is slowly gaining traction.
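For readers who want to see what such an opt-out looks like in practice, here is a minimal sketch. The robots.txt content and the example.com URL are hypothetical, not taken from any real site, and the code uses Python's standard `urllib.robotparser` only to illustrate how a rule-respecting crawler might read the file. It assumes the site wants to keep ordinary search crawling open while opting out of Google-Extended and GPTBot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opts out of Google-Extended and GPTBot,
# but leaves ordinary crawling (including Googlebot) open.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url="https://example.com/article"):
    """Return True if the rules above let this agent fetch the URL."""
    return parser.can_fetch(agent, url)


# Check each token against the same (placeholder) article URL.
results = {
    agent: allowed(agent)
    for agent in ("Googlebot", "Google-Extended", "GPTBot")
}
print(results)
```

In this sketch, regular Google search crawling stays permitted while the two AI-training tokens are turned away, which is the trade-off the tool is meant to offer.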

As of late March, Google-Extended snippets were used by about 10% of the top 1,000 websites, according to data shared by Originality.ai.

Usage of code snippets that block tech companies from using online content to train AI models. Source: Originality.ai

The New York Times has enabled the Google-Extended blocker, according to a review of its robots.txt file. The publication, which is in a heated AI copyright battle with OpenAI, is also blocking the startup from accessing its content.

The company is also at odds with other companies that use online data to train AI models or that compile this type of data for others to use in similar ways.

“The use of any device, tool, or process designed to data mine or scrape content using automated means is prohibited without prior written permission,” the NYT says on its robots.txt page.

Prohibited uses include “any software, machine learning, artificial intelligence (AI), and/or large language model (LLM) development,” the publisher added. A NYT spokesperson declined to comment.

Google is blocked less than OpenAI

As for Google-Extended, other websites have also turned it on, including CNN, BBC, Yelp, and Business Insider, the publisher of this article.

However, Google-Extended is used far less than the blocker for OpenAI's GPTBot, which is enabled on about 32% of the top 1,000 websites. The blocker for CCBot, Common Crawl's crawler, is also more widely enabled.
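Adoption figures like these can, in principle, be derived by reading each site's robots.txt and counting how many turn a given crawler away. The sketch below shows one way that tally might work. It is not Originality.ai's method; the site names and file contents are made-up placeholders, and the helper names `blocks` and `block_rates` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

AI_TOKENS = ("GPTBot", "CCBot", "Google-Extended")


def blocks(robots_txt, token):
    """True if this robots.txt disallows the site root for the token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(token, "/")


def block_rates(robots_by_site):
    """Share of sites that block each AI token at the site root."""
    total = len(robots_by_site)
    return {
        token: sum(blocks(text, token) for text in robots_by_site.values()) / total
        for token in AI_TOKENS
    }


# Made-up sample data for illustration only; not real sites or real files.
SAMPLE = {
    "site-a.example": "User-agent: GPTBot\nDisallow: /\n\n"
                      "User-agent: Google-Extended\nDisallow: /\n",
    "site-b.example": "User-agent: GPTBot\nDisallow: /\n\n"
                      "User-agent: CCBot\nDisallow: /\n",
    "site-c.example": "User-agent: *\nAllow: /\n",
    "site-d.example": "User-agent: *\nDisallow: /private/\n",
}

rates = block_rates(SAMPLE)
print(rates)
```

A real survey would also have to fetch each file and handle missing or malformed ones, which this sketch leaves out.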

BI asked Originality.ai CEO Jonathan Gillham why Google-Extended is less used than other AI training data blockers.

He said that if Google makes its generative AI search engine widely available to the public, there is a risk that sites that block the company from using their content as training data will not appear in AI-generated search results.

“If the query is, 'What's the best deep-dish pizza in Chicago?' they no longer have knowledge about it and can't include it in their responses,” Gillham explained.

Google is testing an early version of generative AI search through its Search Generative Experience (SGE). It's unclear whether the company will launch this more broadly in the future, or how different it will be from the traditional Google search engine.

These decisions will go a long way in determining the future of the web in this new AI world.

Axel Springer, Business Insider's parent company, has a global deal that allows OpenAI to train models based on its media brands' reporting.

