University of Twente Student Theses
Cross-domain textual geocoding: the influence of domain-specific training data
Kolkman, M.C. (2015) Cross-domain textual geocoding: the influence of domain-specific training data.
PDF
949kB |
Abstract: | Modern technology is more and more able to understand natural language. To do so, unstructured texts need to be analysed and structured. One such structuring method is geocoding, which is aimed at recognizing and disambiguating references to geographical locations in text. A word or phrase that refers to a location is called a toponym. Approaches to tackle the geocoding task mainly use natural language processing techniques and machine learning. The difficulty of the geocoding task is dependent of multiple aspects, one of which is the data domain. The domain of a text describes the type of the text, like its goal, degree of formality, and target audience. When texts come from two (or more) different domains, like a Twitter post and a news item, they are said to be cross-domain. This thesis presents a geocoding system, called XD-Geocoder, aimed at robust cross-domain performance by using text-based and lookup list based features only. Features are built up using word shape, part-of-speech tags, dictionaries and gazetteers. The features are used to train SVM and CRF classifiers. Both classifiers are trained and evaluated on three corpora from three domains: Twitter posts, news items and historical documents. These evaluations show Twitter data to be the best for training out of the tested data sets, because both classifiers show the best overall performance when trained on tweets. Although the XD-Geocoder is outperformed by state-of-the-art geocoders on single-domain evaluations (trained and evaluated on data from the same domain), it outperforms the baseline systems on cross-domain evaluations. |
Item Type: | Essay (Master) |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 54 computer science |
Programme: | Interaction Technology MSc (60030) |
Link to this item: | https://purl.utwente.nl/essays/67542 |
Export this item as: | BibTeX EndNote HTML Citation Reference Manager |
Repository Staff Only: item control page