University of Twente Student Theses
Optimization of the BUbiNG web crawler
Buijsrogge, Anne A (2014) Optimization of the BUbiNG web crawler.
PDF (21MB)
Abstract: | This research concerns the performance and optimization of BUbiNG, an open-source web crawler developed by Paolo Boldi et al. that aims for high throughput, measured in crawled pages per unit time. We consider three of the crawler's data structures: the sieve, the workbench and the workbench virtualizer. The goal is to keep as many hosts as possible in the workbench, since the number of hosts is the crucial factor for high throughput. To increase the number of hosts in the workbench, the workbench virtualizer can be used for URLs that have already been extracted from the sieve. For the case in which the workbench virtualizer is not used, we derive analytical results under the assumption that host sizes are homogeneous. When host sizes are heterogeneous, we find that several hosts dominate the workbench, resulting in low throughput. To overcome this problem, the workbench virtualizer is used. We found that no natural decision policy for the workbench virtualizer improves the throughput of the web crawler. Instead, applying the workbench virtualizer to the hosts that dominate the workbench yields a decision policy that prevents these hosts from dominating it. |
Item Type: | Essay (Master) |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 31 mathematics |
Programme: | Applied Mathematics MSc (60348) |
Link to this item: | https://purl.utwente.nl/essays/66134 |