Cloudflare debuts one-click nuke of web-scraping AI
Cloudflare on Wednesday offered web hosting customers a way to block AI bots from scraping website content and using the data without permission to train machine learning models.
It did so based on customer loathing of AI bots and, “to help preserve a safe internet for content creators,” it said in a statement.
“We hear clearly that customers don’t want AI bots visiting their websites, and especially those that do so dishonestly. To help, we’ve added a brand new one-click to block all AI bots.”
There’s already a somewhat effective method to block bots that’s widely available to website owners: the robots.txt file. When the file is placed in a website’s root directory, automated web crawlers are expected to read it and comply with its directives telling them to stay out.
Given the widespread belief that generative AI is based on theft, and the many lawsuits attempting to hold AI companies accountable, firms trafficking in laundered content have graciously allowed web publishers to opt out of the pilfering.
Last August, OpenAI published guidance about how to block its GPTBot crawler using a robots.txt directive, presumably aware of concern about having content scraped and used for AI training without consent. Google took similar steps the following month. Also in September last year, Cloudflare began offering a way to block rule-respecting AI bots, and 85 percent of customers – it’s claimed – enabled this block.
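For the curious, the opt-out amounts to a couple of lines in robots.txt. The sketch below uses the crawler tokens OpenAI and Google have published for AI training (GPTBot and Google-Extended, respectively); any other AI crawlers a site wants to block would need their own entries, and compliance is entirely voluntary on the bot's part.

```
# robots.txt in the site's root directory
# Ask OpenAI's training crawler to stay out entirely
User-agent: GPTBot
Disallow: /

# Ask Google not to use the site for Gemini training / grounding
User-agent: Google-Extended
Disallow: /
```

Note that Google-Extended is a control token rather than a separate crawler: regular Googlebot still indexes the site for search, which is exactly why publishers who want search traffic but not AI training rely on it.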
Now the network services biz aims to provide a more robust barrier to bot entry. The internet is “now flooded with these AI bots,” it said, which visit about 39 percent of the top one million web properties served by Cloudflare.
The problem is that robots.txt, like the Do Not Track header implemented in browsers fifteen years ago to declare a preference for privacy, can be ignored, generally without consequences.
And recent reports suggest AI bots do just that. Amazon last week said it was looking into evidence that bots working on behalf of AI search outfit Perplexity, an AWS client, had crawled websites, including news sites, and reproduced their content without suitable credit or permission.
Amazon cloud customers are supposed to obey robots.txt, and Perplexity was accused of not doing that. Aravind Srinivas, CEO of the AI upstart, denied his biz was underhandedly ignoring the file, though conceded third-party bots used by Perplexity were the ones observed scraping pages against the wishes of webmasters.
Spoofed
“Sadly, we’ve observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent,” Cloudflare said. “We’ve monitored this activity over time, and we’re proud to say that our global machine learning model has always recognized this activity as a bot, even when operators lie about their user agent.”
Cloudflare said its machine-learning scoring system consistently rated the disguised Perplexity bot below 30 from June 14 through June 27, indicating that it’s “likely automated.”
This bot detection approach relies on digital fingerprinting, a technique commonly used to track people online and deny privacy. Crawlers, like individual internet users, often stand out from the crowd based on technical details that can be read through network interactions.
These bots tend to use the same tools and frameworks for automating website visits. And with a network that sees an average of 57 million requests per second, Cloudflare has ample data to determine which of these fingerprints can be trusted.
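The idea is simple enough to sketch in a few lines. The toy scorer below is not Cloudflare's actual model – the signature values and thresholds are invented for illustration – but it shows the principle: compare the browser a request *claims* to be (its User-Agent) against low-level traits that are harder to fake in bulk, such as header ordering and the TLS handshake fingerprint.

```python
# Toy fingerprint-based bot scoring (illustrative only; signatures are
# made up, not real browser or Cloudflare values).

KNOWN_BROWSER_SIGNATURES = {
    # browser family -> (expected header order, expected TLS fingerprint)
    "chrome": (("host", "user-agent", "accept"), "tls-sig-chrome"),
    "firefox": (("host", "user-agent", "accept-language"), "tls-sig-firefox"),
}

def bot_score(claimed_family: str, header_order: tuple, tls_sig: str) -> int:
    """Return a 0-99 score; lower means 'likely automated'."""
    expected = KNOWN_BROWSER_SIGNATURES.get(claimed_family)
    if expected is None:
        return 10  # claims to be a browser we've never seen: suspicious
    expected_headers, expected_tls = expected
    score = 99
    if header_order != expected_headers:
        score -= 40  # header order doesn't match the claimed browser
    if tls_sig != expected_tls:
        score -= 40  # TLS handshake doesn't match it either
    return score

# A scraper spoofing Chrome's User-Agent from a Python HTTP library:
spoofed = bot_score("chrome", ("user-agent", "host", "accept"), "tls-sig-requests")
print(spoofed)  # 19 -> below 30, i.e. "likely automated"
```

Spoofing the User-Agent string is trivial; making every other observable property of a connection match a real Chrome install is much harder, which is what the scoring exploits.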
So this is what it’s come to: machine learning models defending against bots foraging to feed AI models, available even for free tier customers. All customers have to do is click the Block AI Scrapers and Crawlers toggle button in the Security -> Bots menu for a given website.
“We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection,” Cloudflare said. “We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help keep the Internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on.” ®