University of Twente Student Theses
Web Crawl Refusals: Insights From Common Crawl
Ansar, M. (2024) Web Crawl Refusals: Insights From Common Crawl.
PDF (2MB)
Abstract: This study investigates server-side blocks encountered by Common Crawl, a major web crawling project. Unlike previous studies that rely on HTTP status codes or Cloudflare errors to identify server-side blocks, this research uses semantic analysis of page contents to cover a broader range of refusals. By constructing and applying 147 fine-grained regular expressions crafted to identify various refusal pages precisely, we found that approximately 1.68% of websites in a Common Crawl snapshot exhibit some form of explicit refusal. Large hosting providers and website builders are significant contributors to these refusals. Our analysis categorizes the diverse forms of refusal messages, ranging from outright blocks to challenges and rate-limiting responses across multiple HTTP status codes. The study also examines the temporal dynamics of refusals, offering insights into the persistence of these blocks and the effectiveness of Common Crawl's retry strategies. Our findings highlight the diversity of server-side blocks and suggest that tailored approaches are needed to navigate and mitigate them.
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)
Link to this item: https://purl.utwente.nl/essays/98890
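
The abstract describes detecting refusal pages by matching their textual content with fine-grained regular expressions rather than relying on HTTP status codes alone. The sketch below illustrates that idea in Python; the patterns, category names, and sample texts are illustrative assumptions made for this example and are not the 147 expressions used in the thesis.

```python
import re

# Hypothetical refusal patterns in the spirit of the approach described in
# the abstract. The thesis's actual 147 expressions are not reproduced here.
REFUSAL_PATTERNS = {
    "block": re.compile(
        r"access (?:to this (?:page|site|resource) )?(?:is |has been )?denied", re.I
    ),
    "challenge": re.compile(
        r"(?:checking your browser|verify(?:ing)? (?:that )?you are (?:a )?human)", re.I
    ),
    "rate_limit": re.compile(
        r"(?:too many requests|rate limit(?:ed|ing)? (?:exceeded|reached))", re.I
    ),
}


def classify_refusal(page_text: str) -> str | None:
    """Return the first matching refusal category, or None if the page
    does not look like an explicit server-side refusal."""
    for category, pattern in REFUSAL_PATTERNS.items():
        if pattern.search(page_text):
            return category
    return None


if __name__ == "__main__":
    # Illustrative page snippets, not drawn from the Common Crawl data set.
    samples = [
        "Access to this page has been denied.",
        "Checking your browser before accessing the site...",
        "Error 429: Too many requests. Please try again later.",
        "Welcome to our homepage!",
    ]
    for text in samples:
        print(f"{classify_refusal(text) or 'no refusal'}: {text!r}")
```

In practice such patterns would be applied to the extracted page text of each record in a Common Crawl snapshot, which is what allows refusals to be found across multiple HTTP status codes, including responses that return 200 OK.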