University of Twente Student Theses
Improving data quality in a probabilistic database by means of an autoencoder
Mauritz, R.R. (2020) Improving data quality in a probabilistic database by means of an autoencoder.
Abstract: | Data integration often produces data that contains uncertainty. One way to deal with this is to store the data in a probabilistic database (PDB). The data we work with is nominal categorical, and the uncertainty is at the attribute level. We note that the data in the PDB indirectly contains evidence about the underlying ground-truth distribution. Based on this observation, we propose a model that captures this evidence and incorporates it into the data, with the aim of improving data quality. We first formalise the PDB and the notion of 'data quality improvement'. We then develop a probabilistic model that can improve data quality when the underlying ground-truth distribution is known. We use this insight to develop a model, based on a denoising autoencoder, that can improve data quality when the underlying ground-truth distribution is not known. We test both models on synthetic data sets and find that data quality indeed improves. |
Item Type: | Essay (Bachelor) |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 31 mathematics, 54 computer science |
Programme: | Applied Mathematics BSc (56965) |
Link to this item: | https://purl.utwente.nl/essays/80505 |
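
To make the abstract's idea of attribute-level uncertainty concrete, the sketch below stores one uncertain nominal attribute as a probability distribution over its possible values and re-weights it with a known ground-truth distribution. This is only an illustration under stated assumptions: the function name `reweight`, the example values, and the multiply-and-renormalise update are hypothetical and are not taken from the thesis, whose actual model may differ.

```python
def reweight(stored, prior):
    """Multiply a stored attribute-level distribution by a prior and renormalise.

    Both arguments map category labels to probabilities. This update rule is an
    illustrative assumption, not the model described in the thesis.
    """
    weights = {value: p * prior.get(value, 0.0) for value, p in stored.items()}
    total = sum(weights.values())
    if total == 0.0:
        # The prior gives no support to any stored value; leave the cell unchanged.
        return dict(stored)
    return {value: w / total for value, w in weights.items()}


# Hypothetical cell: integration left the 'colour' attribute undecided.
stored = {"red": 0.5, "blue": 0.5}
# Hypothetical known ground-truth distribution for this attribute.
prior = {"red": 0.8, "blue": 0.2}

improved = reweight(stored, prior)
print(improved)
```

When the ground-truth distribution is not known, the abstract's second model replaces the explicit prior with a denoising autoencoder trained on the PDB itself; that part is not sketched here.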