University of Twente Student Theses


Automatic instance-based matching of database schemas of web-harvested product data

Drechsel, Alexander (2019) Automatic instance-based matching of database schemas of web-harvested product data.

This is the latest version of this item.

Abstract: Every day more information becomes available on the internet, and companies can benefit significantly from integrating the useful parts of it into their own systems. However, there is no set standard for information on the internet, which makes integrating it time-consuming and costly. In this thesis we present a semi-automated method for matching web-harvested product database schemas on the basis of data characteristics and commonality. We provide a pre-processing system that turns web-harvested product information into feature sets ready for a machine learner, together with a machine learner that uses these feature sets to match groups or columns of data from different sources on the basis of similarity, so that matched columns are taken to represent the same property. Multiple methods were developed and tested for sampling, machine learning algorithm selection, and training set selection, and the best-performing option for each was used. For sampling, the best performer was a resampler that generates samples of 100 data values and limits the number of samples it can generate based on the overall size of the dataset, to prevent overtraining. Among the machine learning algorithms, both nearest neighbour and RbfSVC performed well at the classification task. The system achieves good matching accuracy (in excess of 50% for textual cases and 67% for numerical cases, despite a large number of possible classes), provided that few new properties are introduced beyond those in the training set. The thesis also describes several clear directions in which the system could be expanded and improved.
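The pipeline outlined in the abstract (capped resampling into samples of 100 values, feature extraction, then nearest-neighbour matching) can be illustrated with a minimal sketch. This is not the thesis implementation: the cap rule (`max_fraction`), the three features in `features`, and all function names are assumptions chosen for illustration, and a plain 1-nearest-neighbour search in pure Python stands in for the classifiers the thesis evaluated.

```python
import random
import statistics

SAMPLE_SIZE = 100  # sample size reported in the abstract


def resample(values, max_fraction=0.5, seed=0):
    """Draw samples of SAMPLE_SIZE values from one column.

    The number of samples is capped in proportion to the column size
    (an assumed rule) so small columns are not over-represented.
    """
    rng = random.Random(seed)
    if len(values) < SAMPLE_SIZE:
        return [list(values)]
    max_samples = max(1, int(len(values) * max_fraction) // SAMPLE_SIZE)
    return [rng.sample(values, SAMPLE_SIZE) for _ in range(max_samples)]


def features(sample):
    """Hypothetical feature set: mean and spread of value length, digit share."""
    texts = [str(v) for v in sample]
    lengths = [len(t) for t in texts]
    total_chars = sum(lengths) or 1
    digit_chars = sum(ch.isdigit() for t in texts for ch in t)
    return (
        statistics.mean(lengths),
        statistics.pstdev(lengths),
        digit_chars / total_chars,
    )


def nearest_label(query, training):
    """1-nearest-neighbour over (feature_vector, label) pairs."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda pair: squared_distance(query, pair[0]))[1]


# Two known columns from one source (synthetic data for illustration).
prices = [f"{i}.99" for i in range(1000)]
colours = ["red", "green", "blue", "black"] * 250
training = [(features(s), "price") for s in resample(prices)]
training += [(features(s), "colour") for s in resample(colours)]

# An unlabelled column from another source is matched to the closest property.
new_column = [f"{i}.50" for i in range(300)]
match = nearest_label(features(resample(new_column)[0]), training)
print(match)
```

Each sample becomes one training example labelled with its property, so a column of 1000 values yields at most five examples under the assumed cap, while a 300-value column yields one.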
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)