University of Twente Student Theses
Clustering with Outliers in a Biased Animal Movement Database
Go, M.L. (2019) Clustering with Outliers in a Biased Animal Movement Database.
Abstract: | Accelerometers attached to horses were used to collect data that can be classified into behavioural patterns such as "walking", "resting", "running-rider", and "running-natural". To study the behaviour of horses and investigate the structure of the data, the data has to be grouped according to these behaviours. This research focused on grouping the horse data by means of clustering, an unsupervised procedure that groups data based on similarity. Unsupervised means that the clustering algorithm has no classified (or labeled) data to train with. Finding a suitable clustering algorithm is a challenge because the database is large and the data is high-dimensional (21 dimensions). Furthermore, the data is biased: most of the samples represent the horse trotting with a rider, and few samples are available of other behaviour, such as the horse shaking. Because of the small number of samples for such behaviour, an algorithm could identify these samples as outliers (deviations from normal patterns) and remove them. In this research, two algorithms were identified for clustering large, high-dimensional data: DBSCAN and OPTICS. Their performance was evaluated using the V-measure. The algorithms were also assessed for biases towards labelling larger or smaller clusters as outliers, and for clustering samples wrongly (False Negatives). The tests showed that, with the chosen parameter values, DBSCAN performed better. Although OPTICS had a far smaller percentage of False Negatives (21 percent per class on average, compared to 61 percent for DBSCAN), this could be explained by the high percentage of samples that OPTICS marked as outliers. DBSCAN was, in other words, better at identifying outliers. Furthermore, DBSCAN had a higher V-measure of 0.512 (1 is the most desirable value), whereas OPTICS had a V-measure of 0.304.
Performance can be improved further through extended parameter optimization. |
Item Type: | Essay (Bachelor) |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 43 environmental science, 54 computer science |
Programme: | Business & IT BSc (56066) |
Link to this item: | https://purl.utwente.nl/essays/79158 |
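The comparison described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn is available. It is not the thesis code: the data is a synthetic, imbalanced 21-dimensional stand-in generated with `make_blobs`, the parameter values (`eps=5.0`, `min_samples=10`) are hypothetical, and the helper names `outlier_share_per_class` and `evaluate` are introduced here for illustration only. Both algorithms label noise points as `-1`; the sketch scores the labels with the V-measure and reports, per true class, the share of samples marked as outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Synthetic stand-in for the horse data (assumption): 21 features and
# three classes of very different sizes to mimic a biased database.
X, y_true = make_blobs(
    n_samples=[600, 150, 40],
    n_features=21,
    cluster_std=1.0,
    random_state=0,
)


def outlier_share_per_class(y_true, labels):
    """Fraction of each true class that the algorithm labelled as noise (-1)."""
    return {
        int(c): float(np.mean(labels[y_true == c] == -1))
        for c in np.unique(y_true)
    }


def evaluate(model, X, y_true):
    """Fit a clustering model and return its labels, V-measure and outlier shares."""
    labels = model.fit_predict(X)
    # Note: v_measure_score treats the noise label -1 as one more cluster.
    v = v_measure_score(y_true, labels)
    return labels, v, outlier_share_per_class(y_true, labels)


# Hypothetical parameter values, not the ones chosen in the thesis.
models = {
    "DBSCAN": DBSCAN(eps=5.0, min_samples=10),
    "OPTICS": OPTICS(min_samples=10),
}

results = {name: evaluate(model, X, y_true) for name, model in models.items()}

for name, (labels, v, shares) in results.items():
    print(f"{name}: V-measure = {v:.3f}, outlier share per class = {shares}")
```

Reporting the outlier share per class next to the V-measure makes it visible when a small class is discarded as noise, which is the bias the abstract describes. How the thesis itself handled noise points when computing the V-measure is not stated in the abstract, so treating `-1` as a regular cluster is an assumption of this sketch.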