Date: 2024-07-30

Data scraping is the process of pulling data from multiple websites and collating it into a single source. It has been around for years in the form of news aggregators, price comparison sites and search engines, but has grown sharply in the AI era, as scraped content is now used to train AI models.

Website owners, though, have big problems with AI training. The systems rarely ask permission, can use (and replicate) premium paid content, and are now flooding websites with requests.

Matt Barrie, CEO of Freelancer.com, told the Financial Times that Anthropic is "the most aggressive scraper by far," hitting his website more than 3.5 million times in four hours.
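For a sense of scale, the figure Barrie quotes can be turned into an average request rate. The calculation below is only a back-of-the-envelope sketch: it assumes the traffic was spread evenly over the four hours, which the source does not say, and it treats 3.5 million as a lower bound.

```python
# Average request rate implied by the figures quoted above.
# Assumption: requests were spread evenly across the window (not stated in the source).
requests = 3_500_000          # "more than 3.5 million" hits, so this is a lower bound
window_seconds = 4 * 60 * 60  # four hours expressed in seconds

requests_per_second = requests / window_seconds
print(f"At least {requests_per_second:.0f} requests per second on average")
```

That works out to roughly 243 requests every second, sustained for four hours, from a single crawler.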

ClaudeBot's activities were waking the site's system admins up with high traffic alerts, so Barrie said Freelancer.com's only recourse was to block the crawler bot in robots.txt.

Even that's not a guaranteed way to block scraping. 404 Media says AI developers are bypassing blocks by launching new crawlers with different names, which then also need to be added to robots.txt.

"The ecosystem of agents is changing quickly, so it's basically impossible for website owners to manually keep up. For example, Apple (Applebot-Extended) and Meta (Meta-ExternalAgent) just added new ones last month and last week, respectively," said the anonymous owner of Dark Visitors, a site that tracks the ever-changing landscape of web crawlers.
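The blocking approach described above, and its weakness, can be illustrated with a short sketch. It generates a robots.txt that disallows the crawler names mentioned in this article and checks the result with Python's standard `urllib.robotparser`. The helper names, the `example.com` URL and the `BrandNewCrawler` agent are hypothetical and used only for illustration; this is not Freelancer.com's actual configuration. Note also that robots.txt is advisory, so it only stops crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the article; a real list would need regular updates.
BLOCKED_AGENTS = ["ClaudeBot", "Applebot-Extended", "Meta-ExternalAgent"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows the whole site for each named agent."""
    lines = []
    for agent in blocked_agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
        lines.append("")  # a blank line separates one group from the next
    # Every crawler that is not named above remains allowed.
    lines.extend(["User-agent: *", "Allow: /"])
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, agent, url="https://example.com/jobs"):
    """Check whether a well-behaved crawler with this name may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(BLOCKED_AGENTS)
for agent in BLOCKED_AGENTS + ["BrandNewCrawler"]:  # last name is hypothetical
    print(agent, "allowed" if is_allowed(robots_txt, agent) else "blocked")
```

The three listed agents are reported as blocked, while the hypothetical `BrandNewCrawler` is still allowed. That gap is the problem Dark Visitors describes: a name-based blocklist covers only the names the site owner already knows about.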

Freelancer.com is only the latest website to complain about the excesses of AI web scraping. Kyle Wiens, CEO of iFixit, said last week that ClaudeBot had hit its servers a million times in 24 hours.

"Hey @AnthropicAI: I get you're hungry for data. Claude is really smart! But do you really need to hit our servers a million times in 24 hours? You're not only taking our content without paying, you're tying up our devops resources. Not cool." Kyle Wiens (@kwiens), July 24, 2024

Read the Docs, a documentation hosting service for software projects, also complained about crawler bots. In a recent blog post, co-founder Eric Holscher said "AI products and services" are "recklessly crawling many sites across the web." One crawler in particular downloaded 73TB of data in May. That activity, which was blamed on a bug and didn't benefit Read the Docs in any way, cost the site more than $5,000 in bandwidth charges.

Although Anthropic is in the spotlight right now (Barrie says it consumes "probably about five times the volume of the number two [crawler]"), it's only one part of the problem. Elon Musk, for example, recently made headlines for training his chatbot Grok on all Twitter/X user data. The feature is on by default, which may violate the GDPR, though users can turn it off in their account settings.

Because of crawlers' aggressive data harvesting, some websites have opted for a whitelist approach; Reddit, for example, only allows Google's crawler. CDN provider Cloudflare is working on its own solution, having announced a tool to block web scrapers earlier this month.
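A whitelist inverts the usual robots.txt logic: everything is disallowed by default and only named crawlers are let in. The sketch below shows what such a policy could look like and checks it with Python's standard `urllib.robotparser`. The file contents, the `example.com` URL and the `UnlistedCrawler` name are assumptions made for illustration and are not Reddit's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical allowlist: one named crawler may fetch everything, all others nothing.
ALLOWLIST_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(agent, url="https://example.com/some-page"):
    """Check whether a compliant crawler with this name may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ALLOWLIST_ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


for agent in ("Googlebot", "ClaudeBot", "UnlistedCrawler"):  # last name is hypothetical
    print(agent, "allowed" if can_crawl(agent) else "blocked")
```

Unlike a blocklist, this policy does not need updating when a new crawler name appears, because unknown agents fall under the catch-all rule. It still relies on crawlers honoring robots.txt, so it does not enforce anything by itself.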