Optimization of the BUbiNG web crawler

Buijsrogge, Anne A (2014) Optimization of the BUbiNG web crawler.

Abstract: This research studies the performance and optimization of BUbiNG, an open-source web crawler developed by Paolo Boldi et al. The crawler aims for high throughput, measured in crawled pages per unit time. We consider three of its data structures: the sieve, the workbench and the workbench virtualizer. The goal is to keep as many hosts as possible in the workbench, since the number of hosts in the workbench is the crucial quantity for achieving high throughput. To increase this number, the workbench virtualizer can be used for URLs that have already been extracted from the sieve. For the case in which the workbench virtualizer is not used, we derive analytical results under the assumption that host sizes are homogeneous. When host sizes are heterogeneous, we find that several hosts dominate the workbench, resulting in low throughput. To overcome this problem, the workbench virtualizer is used. We found that none of the natural decision policies for the workbench virtualizer improves the throughput of the web crawler. Instead, applying the workbench virtualizer to the hosts that dominate the workbench yields a decision policy that prevents these hosts from dominating it.
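The domination effect described in the abstract can be illustrated with a toy model. The sketch below is not BUbiNG's implementation and not the model analysed in the thesis; it assumes a workbench with a fixed number of URL slots, a stream of (host, path) pairs standing in for URLs leaving the sieve, and a hypothetical per-host cap (`per_host_cap`) beyond which a host's URLs are diverted to a stand-in for the workbench virtualizer. All names, host labels and numbers are made up for illustration.

```python
from collections import Counter, deque


def fill_workbench(urls, capacity, per_host_cap=None):
    """Toy model: admit (host, path) pairs into a workbench of `capacity` slots.

    If `per_host_cap` is given, URLs of a host that already holds that many
    slots are diverted to the virtualizer instead (an assumed policy).
    Returns (Counter of host -> URLs in the workbench, virtualizer queue).
    """
    workbench = Counter()
    virtualizer = deque()
    used = 0
    for host, path in urls:
        if used >= capacity:
            break  # workbench is full; remaining URLs are left unprocessed
        if per_host_cap is not None and workbench[host] >= per_host_cap:
            virtualizer.append((host, path))  # spill the dominating host
            continue
        workbench[host] += 1
        used += 1
    return workbench, virtualizer


# Heterogeneous host sizes: one large host followed by six single-URL hosts.
stream = [("big.example", f"/p{i}") for i in range(8)]
stream += [(f"s{i}.example", "/") for i in range(6)]

plain, _ = fill_workbench(stream, capacity=6)
capped, spilled = fill_workbench(stream, capacity=6, per_host_cap=2)

print(len(plain), len(capped), len(spilled))  # prints: 1 5 6
```

Without the cap, the single large host occupies all six slots, so only one host is in the workbench. With the cap, the large host keeps two slots, six of its URLs go to the virtualizer, and four small hosts fit in the remaining slots, giving five hosts in total.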
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 31 mathematics
Programme: Applied Mathematics MSc (60348)
Link to this item: http://purl.utwente.nl/essays/66134