A Large-Scale Real World Data Integration Use Case Analysis for DuBio

Author(s): Hesthaven, D. (2024)

Abstract:
This research examines the limitations and execution performance of semantic duplicate cleaning on real-world data in DuBio, a probabilistic database developed at the University of Twente. The WDC Product Data Corpus, a large collection of product data, was used to measure the overhead of running similar queries on two versions of the same database, one probabilistic and one not. The goal is to determine how increasing the size of the database, or the size of clusters within it, affects the relative difference in overhead between the two versions, and to identify any additional factors that may influence the overhead.

Document(s):

Hesthaven_BA_EEMCS.pdf