University of Twente Student Theses


Finding you on the Internet: Entity resolution on Twitter accounts and real world people

Been, Henry (2013) Finding you on the Internet: Entity resolution on Twitter accounts and real world people.

Abstract: Over the last few years, online social network sites (SNSs) have become very popular. There are many scenarios in which it might prove valuable to know which accounts on an SNS belong to a person. For example, the Dutch social investigative authority is interested in extracting characteristics of a person from Twitter to aid in its risk analysis for fraud detection. In this thesis a novel approach to finding a person's Twitter account using only known real-world information is developed and tested. The developed approach operates in three steps. First, a set of heuristic queries using known information is executed to find possibly matching accounts. Secondly, all these accounts are crawled and information about the account, and thus its owner, is extracted. Currently, the name, URLs, description, language of the tweets and geotags are extracted. Thirdly, all possible matches are examined and the correct account is determined. This approach differs from earlier research in that it does not work with extracted and cleaned datasets, but directly with the Internet. The prototype has to cope with all the "noise" on the Internet, such as slang, typos and incomplete profiles. Another important part of the approach is the repetition of the three steps. It was expected that repeatedly discovering candidates, enriching them and eliminating false positives would increase the chance that over time the correct account "surfaces." During development of the prototype, ethical concerns surrounding both the experiments and the application in practice were considered, and both were judged morally justifiable. Validation of the prototype showed that the first step is executed very well. In an experiment with 12 subjects with a Twitter account, an inclusion of 92% was achieved. This means that for 92% of the subjects the correct Twitter account was found and thus included as a possible match.
A number of variations of this experiment were run, which showed that inclusion of both first and last name is necessary to achieve this high inclusion. Leaving out physical addresses, e-mail addresses and telephone numbers does not influence inclusion. In contrast to the first step, the results of the third step were less accurate. The currently extracted features cannot be used to predict whether a possible match is actually the correct Twitter account. However, there is much ongoing research into feature extraction from tweets and Twitter accounts in general. It is therefore expected that, with feature extraction enhanced by new techniques, it is a matter of time before correct matches can also be identified in the candidate set.
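The inclusion figure reported in the abstract can be illustrated with a minimal sketch. This is not code from the thesis: the function name `inclusion`, the subject labels and the account handles are all hypothetical, and the example assumes that the reported 92% corresponds to 11 of 12 subjects, which is the only count that rounds to that figure.

```python
def inclusion(candidates_by_subject, true_account_by_subject):
    """Fraction of subjects whose correct account is in their candidate set."""
    subjects = list(true_account_by_subject)
    if not subjects:
        raise ValueError("at least one subject is required")
    hits = sum(
        1
        for subject in subjects
        if true_account_by_subject[subject]
        in candidates_by_subject.get(subject, set())
    )
    return hits / len(subjects)


# Hypothetical data: 12 subjects, each with one known correct account.
truth = {f"subject{i}": f"@account{i}" for i in range(1, 13)}

# Assumed outcome of the first step: the candidate set contains the correct
# account plus a false positive for every subject except the last one.
candidates = {s: {account, "@someone_else"} for s, account in truth.items()}
candidates["subject12"] = {"@someone_else"}

rate = inclusion(candidates, truth)
print(f"inclusion: {rate:.0%}")
```

Inclusion only measures whether the correct account is present among the candidates. It says nothing about how many false positives accompany it, which is why the third step (selecting the correct match) is evaluated separately.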
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)
Link to this item: http://purl.utwente.nl/essays/63395