Google open sources robots.txt web parser

Google reveals details of its Robots Exclusion Protocol parser

Google has released its robots.txt web parser, the tool that interprets the Robots Exclusion Protocol and tells Google's web crawlers which website content to ignore.

The Robots Exclusion Protocol, applied through robots.txt files, enables website operators to specify which parts of their websites web crawlers should examine and which they should ignore. This helps prevent some parts of a site from being indexed, although it does not necessarily keep them private.
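
A robots.txt file sits at the root of a website and lists simple directives. A minimal, purely illustrative example (the paths and crawler name here are made up) might look like this:

    User-agent: *
    Disallow: /admin/
    Disallow: /search

    User-agent: BadBot
    Disallow: /

Here, every crawler is asked to stay out of /admin/ and /search, while a crawler identifying itself as BadBot is asked to stay away from the site entirely.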

For 25 years or so, the Protocol has served as a de facto standard for managing the web-page-sniffing conducted by web crawlers, but it has never become an official standard.

That means no official guidelines on how to use it have ever been issued, and different websites have ended up interpreting the robots.txt format in different ways. Web crawlers, in turn, can interpret the same robots.txt file differently, which can affect the results different search engines return when users look for the most relevant content.
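
For example, a file along the lines of this hypothetical snippet can be read in more than one way:

    User-agent: *
    Disallow: /private/
    Allow: /private/annual-report.html

A crawler that supports the Allow directive, which was not part of the original 1994 protocol, and that prefers the most specific matching rule would fetch the report; one that only understands the original directives, or that simply applies the first rule that matches, would skip everything under /private/.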

For creators of web crawlers, the lack of an official standard leaves uncertainty about how their crawlers should handle things like large robots.txt files.

Google hopes that by open sourcing the robots.txt parser, developers will be able to examine the C++ library the Googlebot uses for parsing and matching rules in robots.txt files. Essentially, this should lead to a better understanding of how crawlers interact with robots.txt files and pave the way for some form of standard practice.
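
To give a flavour of that parse-and-match step, here is a minimal sketch using Python's standard-library robots.txt parser rather than Google's newly released C++ library; the crawler name, domain and paths are invented for illustration:

    # Parse a robots.txt file and check whether a crawler may fetch a URL,
    # using Python's built-in parser (urllib.robotparser).
    # ExampleBot, example.com and the paths are made-up names for this sketch.
    from urllib.robotparser import RobotFileParser

    robots_txt = """\
    User-agent: *
    Allow: /private/annual-report.html
    Disallow: /private/
    """

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # load the rules into memory

    # A well-behaved crawler asks before fetching each URL.
    print(parser.can_fetch("ExampleBot", "https://example.com/private/secret.html"))         # False
    print(parser.can_fetch("ExampleBot", "https://example.com/private/annual-report.html"))  # True
    print(parser.can_fetch("ExampleBot", "https://example.com/public/index.html"))           # True

Google's open sourced library does the same job for the Googlebot, but in C++ and with the edge-case handling the company has refined over two decades of crawling.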

Google has published parts of a draft proposal for how it reckons the REP should be used, which it has submitted to the Internet Engineering Task Force (IETF), the folks that handle standards on the internet.

"We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers. Together with the original author of the protocol, webmasters, and other search engines, we've documented how the REP is used on the modern web, and submitted it to the IETF," claimed the company in a Google Webmaster Central blog.

It continued: "The proposed REP draft reflects over 20 years of real world experience of relying on robots.txt rules, used both by Googlebot and other major crawlers, as well as about half a billion websites that rely on REP.

"These fine grained controls give the publisher the power to decide what they'd like to be crawled on their site and potentially shown to interested users. It doesn't change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web."

There's no guarantee that open sourcing the parser will lead to the Protocol becoming an official standard, but it's a step in the right direction. For the average web user, it should mean better and more accurate content is served up when searching, not just on Google but on Bing, DuckDuckGo, Qwant, Ecosia, Startpage, Mojeek or any number of other alternative search engines.