Improving data quality in a probabilistic database by means of an autoencoder

Mauritz, R.R. (2020)

Data resulting from a data integration process often contains uncertainties. One way to deal with this is to store the data in a probabilistic database (PDB). The data we work with is categorical (nominal), and the uncertainty is at the attribute level. We observe that the data in the PDB indirectly carries evidence about the underlying ground-truth distribution. Based on this observation, we propose a model that captures this evidence and incorporates it into the data with the aim of improving data quality. We first formally model the PDB and the notion of 'data quality improvement'. We then develop a probabilistic model that can improve data quality when the underlying ground-truth distribution is known, and use these insights to develop a model based on a denoising autoencoder that can improve data quality when the ground-truth distribution is not known. We test both models on synthetic data sets and show that they indeed improve data quality.
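To make the general idea concrete, the sketch below shows one possible denoising-autoencoder setup for attribute-level uncertainty over categorical data: each attribute is represented as a probability vector over its possible values, a corrupted version of the clean tuple is fed to the network, and the network is trained to reconstruct the clean per-attribute distributions. This is a minimal illustration under assumptions, not the thesis's actual model; the schema, noise model, and hyperparameters are hypothetical, and PyTorch is used for brevity.

```python
# Minimal sketch (assumed setup, not the thesis's model): a denoising autoencoder
# over categorical attributes encoded as concatenated per-attribute probability vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CategoricalDenoisingAE(nn.Module):
    def __init__(self, cardinalities, hidden_dim=16):
        super().__init__()
        self.cardinalities = cardinalities            # number of values per attribute
        input_dim = sum(cardinalities)
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        logits = self.decoder(self.encoder(x))
        # Softmax per attribute block, so the output is again a set of probability vectors.
        blocks, start = [], 0
        for card in self.cardinalities:
            blocks.append(F.softmax(logits[:, start:start + card], dim=1))
            start += card
        return torch.cat(blocks, dim=1)


def corrupt(x, cardinalities, level=0.2):
    """Mix each clean attribute distribution with a uniform one to simulate uncertainty."""
    blocks, start = [], 0
    for card in cardinalities:
        blocks.append((1 - level) * x[:, start:start + card] + level / card)
        start += card
    return torch.cat(blocks, dim=1)


if __name__ == "__main__":
    torch.manual_seed(0)
    cards = [4, 3, 5]                                 # hypothetical attribute cardinalities
    # Synthetic 'ground truth': one-hot encoded tuples drawn uniformly at random.
    clean = torch.cat(
        [F.one_hot(torch.randint(c, (256,)), c).float() for c in cards], dim=1
    )

    model = CategoricalDenoisingAE(cards)
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(200):
        optimiser.zero_grad()
        recon = model(corrupt(clean, cards))
        # Cross-entropy between the reconstruction and the clean distributions.
        loss = -(clean * torch.log(recon + 1e-9)).sum(dim=1).mean()
        loss.backward()
        optimiser.step()
    print(f"final reconstruction loss: {loss.item():.4f}")
```

In this sketch the corrupted input plays the role of the uncertain PDB data and the clean one-hot tuples play the role of the ground truth; a trained model maps uncertain attribute distributions back toward distributions that are closer to the ground truth.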