Cross-domain textual geocoding: the influence of domain-specific training data

Kolkman, M.C. (2015) Cross-domain textual geocoding: the influence of domain-specific training data.

[img]
Preview
PDF
949kB
Abstract:Modern technology is more and more able to understand natural language. To do so, unstructured texts need to be analysed and structured. One such structuring method is geocoding, which is aimed at recognizing and disambiguating references to geographical locations in text. A word or phrase that refers to a location is called a toponym. Approaches to tackle the geocoding task mainly use natural language processing techniques and machine learning. The difficulty of the geocoding task is dependent of multiple aspects, one of which is the data domain. The domain of a text describes the type of the text, like its goal, degree of formality, and target audience. When texts come from two (or more) different domains, like a Twitter post and a news item, they are said to be cross-domain. This thesis presents a geocoding system, called XD-Geocoder, aimed at robust cross-domain performance by using text-based and lookup list based features only. Features are built up using word shape, part-of-speech tags, dictionaries and gazetteers. The features are used to train SVM and CRF classi�ers. Both classifiers are trained and evaluated on three corpora from three domains: Twitter posts, news items and historical documents. These evaluations show Twitter data to be the best for training out of the tested data sets, because both classi�ers show the best overall performance when trained on tweets. Although the XD-Geocoder is outperformed by state-of-the-art geocoders on single-domain evaluations (trained and evaluated on data from the same domain), it outperforms the baseline systems on cross-domain evaluations.
Item Type:Essay (Master)
Faculty:EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:54 computer science
Programme:Human Media Interaction MSc (60030)
Link to this item:http://purl.utwente.nl/essays/67542
Export this item as:BibTeX
EndNote
HTML Citation
Reference Manager

 

Repository Staff Only: item control page