University of Twente Student Theses

Improving data quality in a probabilistic database by means of an autoencoder

Mauritz, R.R. (2020) Improving data quality in a probabilistic database by means of an autoencoder.

Abstract: Data integration often yields data that contains uncertainty. One way to deal with this is to store the data in a probabilistic database (PDB). The data we work with is nominal categorical, and the uncertainty is at the attribute level. We note that the data in the PDB indirectly contains evidence about the underlying ground truth distribution. Based on this observation, we propose a model that captures this evidence and incorporates it into the data, with the aim of improving data quality. We do this by first modelling the PDB and the notion of 'data quality improvement'. We then develop a probabilistic model that can improve data quality when the underlying ground truth distribution is known. Building on this, we develop a second model, based on a denoising autoencoder, that can improve data quality when the underlying ground truth distribution is not known. We test both models on synthetic data sets and observe that data quality indeed improves.
Item Type: Essay (Bachelor)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 31 mathematics, 54 computer science
Programme: Applied Mathematics BSc (56965)
Link to this item: https://purl.utwente.nl/essays/80505