Finding you on the Internet : Entity resolution on Twitter accounts and real world people
Been, Henry (2013)
Over the last few years, online social network sites (SNS) have become very popular.
There are many scenarios in which it might prove valuable to know which accounts
on an SNS belong to a person. For example, the Dutch social investigative authority
is interested in extracting characteristics of a person from Twitter to aid in their risk
analysis for fraud detection.
In this thesis a novel approach to finding a person's Twitter account using only
known real world information is developed and tested. The developed approach
operates in three steps. First, a set of heuristic queries using known information is
executed to find possibly matching accounts. Secondly, all these accounts are crawled
and information about the account, and thus its owner, is extracted. Currently,
name, URLs, description, language of the tweets and geo tags are extracted. Thirdly,
all possible matches are examined and the correct account is determined.
This approach differs from earlier research in that it does not work with extracted
and cleaned datasets, but directly with the Internet. The prototype has to cope with
all the "noise" on the Internet, such as slang, typos and incomplete profiles. Another
important part of the approach was repetition of the three steps. It was expected that
repeatedly discovering candidates, enriching them and eliminating false positives
would increase the chance that over time the correct account "surfaces."
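
The repeated three-step loop described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the thesis prototype: the function name `resolve_account`, the callables `find_candidates`, `enrich` and `is_false_positive`, and the `rounds` parameter are all hypothetical placeholders for the heuristic queries, the crawler and the matching logic.

```python
from typing import Callable, Dict, Iterable, Set


def resolve_account(
    person: dict,
    find_candidates: Callable[[dict], Iterable[str]],   # hypothetical: step 1
    enrich: Callable[[str], dict],                      # hypothetical: step 2
    is_false_positive: Callable[[dict, dict], bool],    # hypothetical: step 3
    rounds: int = 3,                                    # assumed repetition count
) -> Set[str]:
    """Repeat discover -> enrich -> eliminate; return the surviving candidates."""
    profiles: Dict[str, dict] = {}
    rejected: Set[str] = set()
    for _ in range(rounds):
        # Step 1: discover possibly matching accounts from known information.
        for handle in find_candidates(person):
            if handle not in profiles and handle not in rejected:
                # Step 2: crawl the account and extract features about it.
                profiles[handle] = enrich(handle)
        # Step 3: eliminate candidates judged to be false positives.
        to_drop = [h for h, p in profiles.items() if is_false_positive(person, p)]
        for handle in to_drop:
            rejected.add(handle)
            del profiles[handle]
    return set(profiles)
```

Passing the three steps in as functions keeps the sketch independent of any particular search API or feature extractor; how the prototype actually implements each step is not specified here.
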
During development of the prototype, ethical concerns surrounding both the experiments
and the application in practice were considered, and both were judged morally justifiable.
Validation of the prototype in an experiment showed that the first step performs
very well. In an experiment with 12 subjects who have a Twitter account, an inclusion
of 92% was achieved. This means that for 92% of the subjects the correct Twitter
account was found and thus included as a possible match. A number of variations of
this experiment were run, which showed that inclusion of both first and last name
is necessary to achieve this high inclusion. Leaving out physical addresses, e-mail
addresses and telephone numbers does not influence inclusion.
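
The inclusion measure used above can be written out as a small computation. The sketch below is an illustration only: the function name `inclusion` and the subject and account identifiers are hypothetical, and the example data is made up purely to show that 11 hits out of 12 subjects rounds to the reported 92%.

```python
from typing import Dict, Set


def inclusion(
    candidates_by_subject: Dict[str, Set[str]],
    true_account_by_subject: Dict[str, str],
) -> float:
    """Fraction of subjects whose correct account is in their candidate set."""
    if not true_account_by_subject:
        raise ValueError("at least one subject is required")
    hits = sum(
        1
        for subject, account in true_account_by_subject.items()
        if account in candidates_by_subject.get(subject, set())
    )
    return hits / len(true_account_by_subject)


# Hypothetical data: 12 subjects, the correct account is found for 11 of them.
truth = {f"subject{i}": f"account{i}" for i in range(12)}
candidates = {s: {a, "unrelated_account"} for s, a in truth.items()}
candidates["subject11"] = {"unrelated_account"}  # correct account missed

print(f"{inclusion(candidates, truth):.0%}")
```

Note that inclusion says nothing about how many incorrect accounts are also in the candidate set; it only measures whether the correct one was found at all.
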
In contrast to the first step, the results of the third step were less accurate.
The currently extracted features cannot be used to predict if a possible match is
actually the correct Twitter account or not. However, there is much ongoing research
into feature extraction from tweets and Twitter accounts in general. It is therefore
expected that enhancing feature extraction using new techniques will make it a
matter of time before it is also possible to identify correct matches in the candidate
set.