University of Twente Student Theses


Aggregation in probabilistic databases : implemented for DuBio

Le, T.N.H. (2022) Aggregation in probabilistic databases : implemented for DuBio.

[img] PDF
Abstract:Probabilistic databases lack of the support of aggregate functions, since there is difficulty in representing query result. Specifically, probabilistic databases have exponential number of possible worlds with potentially different answers for each worlds, it is impractical to represent all of them to the user. To tackle the problem, a few probabilistic databases present aggregate queries as expected value or top-k result. In this thesis, the ultimate goal is implementing aggregate function in probabilistic databases, particularly for DuBio, a probabilistic database developed by the University of Twente. This research includes a methodology to build such aggregate functions in DuBio, experiments to study the feasible number of records these functions can achieve, what is their scalability and how uncertainties of the database affects them. We devise 2 approaches to represent the results of aggregate queries. Firstly, representing all possible answers to the user. Secondly, representing the query result approximately as a histogram or top k. We have implemented 3 aggregate functions in total, i.e. Combination and All possible worlds to return all possible aggregate answers, TopK to return aggregate answers over top K probable worlds. To evaluate these functions, we conduct 5 experiments including changing the database size by increasing number of records, number of random variables, number of alternatives and sentence’s complexity. Our conclusion is as follows: Combination and All possible worlds basically have expo- nential growth rate. However All possible worlds is favored over Combination because it grows exponentially w.r.t. number of variables while Combination grows exponentially w.r.t. number of records. The maximum number of record Combination and All possible worlds can query safely under 30 seconds on a commodity hardware is 15 and 26, respectively. Despite exponential growth, in practice, user can safely use All possible worlds aggregate queries when the intermediate result before aggregation has max 26 tuples. TopK algorithm has log-linear growth rate, it reaches safely about 350 records on a commodity hardware. Therefore this function is practical for large-sized databases, and scalable on server hardware. The contribution of this research is providing aggregate functions which can return all possible answers, and answers on Top K probable worlds for DuBio database.
Item Type:Essay (Master)
Faculty:EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:54 computer science
Programme:Computer Science MSc (60300)
Link to this item:
Export this item as:BibTeX
HTML Citation
Reference Manager


Repository Staff Only: item control page