Querying Probabilistic XML

Kessel, Ruud van (2008) Querying Probabilistic XML.

Abstract:In the scientific field and in working with data integration, uncertain data is a very common subject. In [KKA05] a compact representation is proposed for storing uncertain data in XML. A naive way of querying this data is by calculating all possible worlds and execute the query on each of the worlds. Calculating these possible worlds is however very inefficient because of the exponential growth of worlds. In this thesis we will investigate how the compact representation can be queried in an efficient way. We will compare two methods for querying the compact representation: Recursive path analysis and the Compare paths method. Recursive path analysis. Using a script, for each step of the query a piece of XQuery code is generated, which returns each possible answer for that step. The output of step one is the input of step two and so on. The increase in performance is obtained by calculating the possible answers for the query, instead of calculating all possible worlds for the document. Compare paths method. The query is converted by adding the needed possibility and probability steps to the query. When executing queries that include a predicate, an extra check has to be performed to examine if the returned results indeed can occur together with one of the elements referred to in the predicate. We do this by comparing the paths of node identifiers belonging to the probability and possibility ancestors of the candidate elements with the ones of the predicate elements. Two elements occur in the same world only if the number of probability ancestors that occur in both paths of the two elements is equal to the number of probability ancestors that occur in both paths. We test both methods by executing several queries on test documents of different sizes and containing different levels of uncertainty. This leads to the following conclusions: – Even for large documents (up to an address book containing 1000 people) the compare paths method works well. However when requesting documents with a lot of descendants in the result, the performance decreases quickly. This is a point of interest for future work. – The performance of the recursive path analysis is more dependent of uncertainty. Therefore it works better on smaller documents and documents with a smaller level of uncertainty. – In the recursive path analysis, no feature of checking the correctness of child nodes is implemented. For this reason it performs better than the compare paths method when elements with a lot of children are returned. However, when using predicates there are several cases in which the result can contain incorrect child nodes because simply every node is returned.
Item Type:Essay (Master)
Faculty:EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:54 computer science
Programme:Computer Science MSc (60300)
Link to this item:http://purl.utwente.nl/essays/58330
Export this item as:BibTeX
HTML Citation
Reference Manager


Repository Staff Only: item control page