Amazon investigating Perplexity over scraping abuse allegations

Amazon's cloud division has launched an investigation into Perplexity AI, asking whether the AI search startup is violating Amazon Web Services rules by scraping websites that have tried to prevent such scraping, WIRED reports.

An AWS spokesperson, who spoke to WIRED on the condition of anonymity, confirmed that the company is investigating Perplexity. WIRED previously found that the startup, which is backed by the Jeff Bezos family fund and Nvidia and was recently valued at $3 billion, appears to rely on content scraped from websites that had blocked such access through the Robots Exclusion Protocol, a common web standard. While the Robots Exclusion Protocol is not legally binding, terms of use generally are.

The Robots Exclusion Protocol is a decades-old web standard that lets site owners place a plain text file, robots.txt, on a domain to specify which pages automated bots and crawlers shouldn't visit. Companies that use scrapers can choose to ignore the protocol, but most have traditionally respected it. An Amazon spokesperson told WIRED that AWS customers must follow the robots.txt standard when crawling websites.
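To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser` module. The file contents, the bot name `ExampleBot`, and the `example.com` URLs are hypothetical and chosen only for illustration; they are not taken from any site named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for illustration):
# one named bot is banned from the whole site, and all other bots
# are banned only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The banned bot is refused everywhere on the site.
blocked = may_fetch(parser, "ExampleBot", "https://example.com/news/story.html")
# Any other bot may read ordinary pages...
allowed = may_fetch(parser, "OtherBot", "https://example.com/news/story.html")
# ...but not the paths listed under the wildcard rule.
private = may_fetch(parser, "OtherBot", "https://example.com/private/page.html")

print(blocked, allowed, private)
```

The check runs entirely on the crawler's side. Nothing in the protocol stops a scraper that skips this step, which is why compliance is voluntary.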

AWS' terms of service prohibit customers from using its services for any illegal activity, and customers are responsible for complying with its terms and all applicable laws, a spokesperson said in a statement.

The investigation into Perplexity's activities follows a June 11 Forbes report that the startup had allegedly stolen at least one article. WIRED's investigation confirmed the activity and found further evidence of scraping abuse and theft by systems linked to Perplexity's AI-powered search chatbot. Engineers at WIRED's parent company, Condé Nast, block Perplexity's crawlers on all of its websites using robots.txt files. However, WIRED found that the company had used a server with an undisclosed IP address to access Condé Nast properties and scrape its websites at least hundreds of times in the past three months.

Machines associated with Perplexity appear to be conducting extensive crawls of news websites that ban bots from accessing their content, and spokespeople for The Guardian, Forbes and The New York Times said they have also found the IP address on their servers multiple times.

WIRED traced the IP address to a virtual machine, known as an Elastic Compute Cloud (EC2) instance, hosted on AWS. The AWS investigation was launched after WIRED asked whether using AWS infrastructure to scrape websites that prohibit scraping violated the company's terms of use.

Last week, Perplexity CEO Aravind Srinivas responded to WIRED's inquiry by first saying that the questions posed to the company reflected a fundamental, deep-seated misunderstanding of Perplexity and how the Internet works. Srinivas then told Fast Company that the covert IP address WIRED saw scraping Condé Nast websites and a test site WIRED created was operated by a third-party company that provides web crawling and indexing services. He declined to name the company, citing a non-disclosure agreement. When asked if he would tell the third-party company to stop crawling WIRED, Srinivas said, “It's complicated.”




