University of Twente Student Theses

As of Friday, 8 August 2025, the current Student Theses repository is no longer available for thesis uploads. A new Student Theses repository will be available starting Friday, 15 August 2025.

The use of rare key indexing for distributed web search

Tinselboer, Koen (2007) The use of rare key indexing for distributed web search.

PDF
766kB

Abstract:	In the last few years we have seen a rise in the of peer-to-peer applications in areas like file sharing [1][2][3]. However distributed information retrieval applications have not taken off yet. In such an application every peer (website) helps to maintain a global index of all information in the global document collection. When a website is updated the P2P search engine index can also be directly updated by the peer. Each peer only needs to contribute a limited amount of disk space and network bandwidth. Groups of websites can even form their own search engine which specializes in a their area of expertise. In this way the search process can be driven more by the Internet community. In the first part of this thesis the previous work done in the field of distributed information retrieval is discussed. Most of the recently developed systems (like ALVIS [4], Minerva [5] and pSearch [6]) use a conceptually global but physically distributed index. This index is distributed using a distributed hash table (DHT [7]) based approach. The scalability of such systems is very good and the reported retrieval performance also approaches that of a centralized information retrieval system. However each project uses a different collection and a different set of queries to test their application. Therefore we cannot compare them directly with each other. In the second part I discuss the implementation and evaluation of a distributed information retrieval system based on rare key indexing. Such an index stores sets of terms that appear near each other in a limited number of documents. This approach was first presented as part of the ALVIS project. In this thesis we tested the suitability of the approach for indexing and searching a realistic collection of websites using a subset of the WT10g collection [8]. To measure performance we look at the top-10 overlap between a single term index and a multi term index. In the best case the average overlap ratio was found to be only 7.5%. In the ALVIS project the average overlap ratio was between 83% and 97% [9]. I outline several causes that attribute to this huge difference. Based on the outcome of the experiments I have to conclude that the rare key indexing method scales well. However its retrieval performance on a realistic collection of websites is very poor. Therefore the rare key indexing method cannot be considered a good choice for a distributed web search application.
Item Type:	Essay (Master)
Faculty:	EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:	54 computer science
Programme:	Computer Science MSc (60300)
Link to this item:	https://purl.utwente.nl/essays/770
Export this item as:	BibTeX EndNote HTML Citation Reference Manager

Show download statistics for this publication

Repository Staff Only: item control page