Using statistical methods to create a bilingual dictionary

Hiemstra, D.

A probabilistic bilingual dictionary assigns to each possible translation a probability that indicates how likely the translation is. This master's thesis covers a method to compile a probabilistic bilingual dictionary (or bilingual lexicon) from a parallel corpus (i.e. large documents that are each other's translations). Two research questions are answered in this thesis. How can statistical methods applied to bilingual corpora be used to create the bilingual dictionary? And what can be said about the performance of the created bilingual dictionary in a multilingual document retrieval system? To build the dictionary, we used a statistical algorithm called the EM-algorithm. The EM-algorithm was first used to analyse parallel corpora at IBM in 1990. In this thesis we took a new approach: we developed an EM-algorithm that compiles a bi-directional dictionary. We believe that there are two good reasons to take a bi-directional approach instead of a uni-directional approach. First, a bi-directional dictionary will need less space than two uni-directional dictionaries. Second, we believe that a bi-directional approach will lead to better estimates of the translation probabilities than the uni-directional approach. We do not yet have a theoretical proof that our symmetric EM-algorithm is indeed correct. However, we do have preliminary results that indicate better performance of our EM-algorithm compared to the algorithm developed at IBM.
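
To make the uni-directional baseline concrete, the following is a minimal Python sketch of EM estimation of word translation probabilities from sentence-aligned text, in the style of the simplest IBM word-translation model (commonly called IBM Model 1). It is an illustration under stated assumptions, not the algorithm developed in the thesis: the bi-directional (symmetric) EM-algorithm is not specified in this abstract and is not reproduced here. The function name train_model1, the tokenised sentence-pair input format, and the omission of a NULL source word are all assumptions made for the example.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM, IBM Model 1 style.

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    Returns a dict keyed by (target_word, source_word).
    Assumption: no NULL source word is used, to keep the sketch short.
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word

        # E-step: distribute each target word over the source words
        # of its sentence, in proportion to the current probabilities.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac

        # M-step: renormalise expected counts per source word.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    corpus = [
        ("the house".split(), "das Haus".split()),
        ("the book".split(), "das Buch".split()),
        ("a book".split(), "ein Buch".split()),
    ]
    table = train_model1(corpus, iterations=10)
    for (f, e), p in sorted(table.items(), key=lambda kv: -kv[1]):
        print(f"t({f} | {e}) = {p:.3f}")
```

The resulting table gives probabilities in one direction only (target given source). A second table for the opposite direction would have to be trained separately, which is the space cost of two uni-directional dictionaries that the abstract contrasts with a single bi-directional one.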

Hiemstra, D. (1996)