How Googlebot and Search Engine Crawlers Have Evolved Over Time
Evolution of Crawling Technologies
Google’s Gary Illyes highlighted key changes in search engine crawlers, emphasizing that while the fundamental crawling process remains similar, the underlying technologies have improved. Notably, the shift from HTTP/1.1 to HTTP/2 and HTTP/3 allows more efficient crawling by enabling multiple requests over a single connection. Although Googlebot does not yet support HTTP/3, Google plans to adopt it in the future for better efficiency.
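To illustrate the multiplexing benefit mentioned above, here is a minimal sketch of fetching several URLs over a single HTTP/2 connection. It uses the httpx library (with its optional h2 extra installed); the URLs are placeholders, not anything Googlebot actually fetches.

```python
import httpx

urls = [
    "https://example.com/",
    "https://example.com/sitemap.xml",
    "https://example.com/robots.txt",
]

# One client reuses a single connection; with http2=True, requests to the
# same host are multiplexed over that connection instead of opening a new
# TCP connection per request, as HTTP/1.1 would.
with httpx.Client(http2=True) as client:
    for url in urls:
        response = client.get(url)
        print(url, response.status_code, response.http_version)
```

The efficiency gain comes from the single connection handling many concurrent requests, which reduces connection overhead for both the crawler and the server.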
Robots.txt and Crawler Behavior
Crawler behavior policies differ among companies, but most well-behaved crawlers respect the Robots Exclusion Protocol (robots.txt) and adjust their activity based on site load signals. Discussions at the IETF reveal publishers’ concerns about some crawlers acting disruptively, highlighting the need for standardized crawler conduct.
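As a rough sketch of what "respecting the Robots Exclusion Protocol" looks like in practice, the snippet below checks robots.txt before fetching a page using Python's standard urllib.robotparser. The site URL and the "ExampleBot" user-agent are hypothetical.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

user_agent = "ExampleBot"
target = "https://example.com/private/page.html"

# A well-behaved crawler only fetches URLs the site allows for its agent.
if rp.can_fetch(user_agent, target):
    print("Allowed to crawl:", target)
else:
    print("Disallowed by robots.txt:", target)

# Crawl-delay, if present, is one signal a polite crawler can use to
# throttle its request rate and reduce load on the site.
print("Crawl-delay:", rp.crawl_delay(user_agent))
```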
Handling Adversarial Crawlers and Spam
Besides legitimate search crawlers, there are adversarial ones, such as malware scanners and privacy scanners. These require different handling strategies, since they may operate stealthily so that malicious sites cannot detect them and hide harmful content. This adversarial dynamic influences how crawler policies develop.
Impact of Crawling on Internet Resources
- Google has been working to minimize its crawler footprint on the internet.
- New AI-driven products are increasing fetching activities, offsetting some efficiency gains.
- Illyes argues that indexing and processing fetched data consume more resources than the crawling itself.
Overall, while crawling methods have become more efficient, the biggest resource demands come from how the fetched data is processed and served.
Source: Search Engine Roundtable (Barry Schwartz).