Anthropic has stepped up AI data scraping in a big way

Hitting websites millions of times in 24 hours

Tom Allen
30 July 2024 • 2 min read

Anthropic has come under fire for its "egregious" and "aggressive" data scraping practices, as it seeks to hoover up data to train its Claude.ai chatbot.

Data scraping is the process of pulling data from multiple websites and collating it in one place. It has been around for years in the form of news aggregators, price comparison sites and search engines, but has grown sharply in the AI era.

Website owners, though, have big problems with scraping for AI training. The crawlers rarely ask permission, can ingest (and later reproduce) premium paid content, and are now flooding websites with requests.

Matt Barrie, CEO of Freelancer.com, told the Financial Times that Anthropic is "the most aggressive scraper by far," hitting his website more than 3.5 million times in four hours.

ClaudeBot, Anthropic's web crawler, was waking the site's system administrators with high-traffic alerts, and Barrie said Freelancer.com's only recourse was to block the bot in its robots.txt file.
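
As a rough illustration of what such a block looks like, the sketch below uses Python's standard urllib.robotparser module to check a two-line robots.txt rule against the ClaudeBot user agent. The rule text and the example.com URL are assumptions for illustration, not Freelancer.com's actual configuration. Note that robots.txt is advisory: it only works for crawlers that choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; not Freelancer.com's real file.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/jobs"  # placeholder URL
    print("ClaudeBot allowed:", is_allowed(ROBOTS_TXT, "ClaudeBot", url))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

With this rule, only the named crawler is refused; every other user agent is still permitted by default.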

Even a robots.txt block is not guaranteed to stop scraping. 404 Media reports that AI developers are bypassing blocks by launching new crawlers under different names, each of which must then be added to robots.txt.

"The ecosystem of agents is changing quickly, so it's basically impossible for website owners to manually keep up. For example, Apple (Applebot-Extended) and Meta (Meta-ExternalAgent) just added new ones last month and last week, respectively," said the anonymous owner of Dark Visitors, a site that tracks the ever-changing landscape of web crawlers.
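
The maintenance burden can be sketched in a few lines of Python. The example below builds a robots.txt blocklist from the crawler names mentioned in this article and shows that a crawler launched under a new name passes straight through until someone adds it. The helper functions and the name HypotheticalNewBot are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the article; a real list would be much longer.
BLOCKED_AGENTS = ["ClaudeBot", "Applebot-Extended", "Meta-ExternalAgent"]


def build_blocklist(agents: list[str]) -> str:
    """Build robots.txt text that disallows every path for each named agent."""
    return "\n".join(f"User-agent: {name}\nDisallow: /\n" for name in agents)


def is_blocked(robots_txt: str, user_agent: str) -> bool:
    """Return True if robots_txt tells user_agent not to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "https://example.com/")


if __name__ == "__main__":
    robots_txt = build_blocklist(BLOCKED_AGENTS)
    # "HypotheticalNewBot" is a made-up name standing in for a newly launched crawler.
    for agent in BLOCKED_AGENTS + ["HypotheticalNewBot"]:
        print(f"{agent}: blocked={is_blocked(robots_txt, agent)}")
```

A blocklist only covers names the site owner already knows about, which is why a list that changes monthly or weekly is hard to keep current by hand.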

Freelancer.com is only the latest website to complain about the excesses of AI web scraping. Kyle Wiens, CEO of iFixit, said last week that ClaudeBot had hit its servers a million times in 24 hours.
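
To put the two figures on the same scale, the short calculation below converts them to average requests per second. It assumes the hits were spread evenly over the reported window, which real crawler traffic rarely is, so the numbers are averages rather than peaks.

```python
def requests_per_second(total_requests: int, hours: float) -> float:
    """Average request rate, assuming traffic is spread evenly over the window."""
    return total_requests / (hours * 3600)


# Figures as reported in the article.
freelancer_rate = requests_per_second(3_500_000, 4)   # "more than 3.5 million" in four hours
ifixit_rate = requests_per_second(1_000_000, 24)      # "a million times" in 24 hours

if __name__ == "__main__":
    print(f"Freelancer.com: about {freelancer_rate:.0f} requests per second")
    print(f"iFixit: about {ifixit_rate:.1f} requests per second")
```

On these assumptions, Freelancer.com saw roughly 243 requests per second on average and iFixit roughly 12.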

Documentation hosting service Read the Docs has also complained about crawler bots. In a recent blog post, co-founder Eric Holscher said "AI products and services" are "recklessly crawling many sites across the web." One crawler in particular downloaded 73TB of data in May. That activity, which was blamed on a bug and didn't benefit Read the Docs in any way, cost the site more than $5,000 in bandwidth charges.
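
A back-of-the-envelope calculation gives a sense of scale for the Read the Docs figures. The sketch assumes decimal terabytes (10^12 bytes) and that the download was spread evenly across all 31 days of May; neither detail is stated in the source, and $5,000 is a lower bound on the cost.

```python
TB = 10**12  # assumption: decimal terabytes
SECONDS_IN_MAY = 31 * 24 * 3600  # assumption: traffic spread evenly over the month


def cost_per_terabyte(cost_usd: float, terabytes: float) -> float:
    """Bandwidth cost in US dollars per terabyte transferred."""
    return cost_usd / terabytes


def average_megabytes_per_second(terabytes: float, seconds: int) -> float:
    """Average transfer rate in MB/s (10^6 bytes) over the given period."""
    return terabytes * TB / seconds / 10**6


if __name__ == "__main__":
    print(f"At least ${cost_per_terabyte(5_000, 73):.2f} per TB")
    print(f"About {average_megabytes_per_second(73, SECONDS_IN_MAY):.1f} MB/s on average")
```

That works out to at least $68 per terabyte and a sustained average of about 27 MB/s from a single crawler.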

Although Anthropic is in the spotlight right now - Barrie says it consumes "probably about five times the volume of the number two [crawler]" - it's only one part of the problem. Elon Musk, for example, recently made headlines for training his chatbot Grok on all Twitter/X user data. The feature is on by default, which may violate the GDPR, although users can turn the setting off.

Because of crawlers' aggressive data harvesting, some websites have opted for a whitelist approach; Reddit, for example, only allows Google's crawler. CDN provider Cloudflare is working on its own solution, having announced a tool to block web scrapers earlier this month.
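
A whitelist inverts the blocklist logic: everything is refused unless it is explicitly named, so newly launched crawlers are shut out by default. The sketch below shows one way such a robots.txt could look and checks it with Python's standard urllib.robotparser. The file content is an assumption for illustration and is not Reddit's actual robots.txt; HypotheticalNewBot is a made-up name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt; not Reddit's actual file.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str = "https://example.com/") -> bool:
    """Return True if the whitelist above permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(WHITELIST_ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ["Googlebot", "ClaudeBot", "HypotheticalNewBot"]:
        print(f"{agent}: allowed={can_crawl(agent)}")
```

As with any robots.txt rule, this only restrains crawlers that respect the file, which is why network-level blocking tools are also being developed.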
