Web Crawl Refusals: Insights From Common Crawl

Ansar, M.

This study investigates server-side blocks encountered by Common Crawl, a major web crawling project. Unlike previous studies that rely on HTTP status codes or Cloudflare errors to identify server-side blocks, this research uses semantic analysis of page contents to cover a broader range of refusals. Using 147 fine-grained regular expressions that we constructed to identify refusal pages precisely, we find that approximately 1.68% of websites in a Common Crawl snapshot exhibit some form of explicit refusal. Significant contributors to these refusals include large hosting providers and website builders. Our analysis categorizes the diverse forms of refusal messages, which range from outright blocks to challenges and rate-limiting responses and appear across multiple HTTP status codes. The study also examines the temporal dynamics of refusals, offering insights into the persistence of these blocks and the effectiveness of Common Crawl's retry strategies. Our findings highlight the diversity of server-side blocks and suggest using tailored approaches to navigate and mitigate them.
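
As a rough illustration of content-based detection, the sketch below classifies a response body by matching it against a few regular expressions, one per refusal category. The patterns, the category names, and the function `classify_refusal` are hypothetical and chosen only for this example; they are not the 147 expressions used in the study.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; they are NOT the
# expressions used in the study.
REFUSAL_PATTERNS = {
    "block": re.compile(
        r"access (?:to this (?:page|site) )?(?:is |has been )?denied"
        r"|you have been blocked",
        re.IGNORECASE,
    ),
    "challenge": re.compile(
        r"verify(?:ing)? (?:that )?you are (?:a )?human|complete the captcha",
        re.IGNORECASE,
    ),
    "rate_limit": re.compile(
        r"too many requests|rate limit(?:ed)? exceeded",
        re.IGNORECASE,
    ),
}


def classify_refusal(body: str) -> Optional[str]:
    """Return the first refusal category whose pattern matches the body.

    Returns None when no pattern matches. The HTTP status code is
    deliberately not consulted: only the page content is inspected.
    """
    for category, pattern in REFUSAL_PATTERNS.items():
        if pattern.search(body):
            return category
    return None


if __name__ == "__main__":
    # A refusal page can be labeled the same way whatever status it arrived with.
    print(classify_refusal("<h1>Access Denied</h1>"))
    print(classify_refusal("<p>Welcome to my blog</p>"))
```

Because the classifier looks only at the body, it can flag a refusal page served with a 200 status, which a status-code filter would miss. In practice the patterns would need to be far more specific than these to avoid matching ordinary pages that merely mention such phrases.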

Ansar, M. (2024)