Extraction of synonyms and semantically related words from chat logs

Fredrik Norlindh

Uppsala University
Department of Linguistics and Philology
Master's Programme in Language Technology
Master's Thesis in Language Technology
November 20, 2012

Supervisors:
Mats Dahllöf, Uppsala University
Sonja Petrovic Lundberg, Artificial Solutions

Abstract

This study explores synonym extraction from domain-specific chat log collections by means of a tool-kit called JavaSDM. JavaSDM uses random indexing and measures distributional similarities. The focus of this study is to evaluate the effect of different preprocessing operations on the training data and of different extraction criteria. Four chat log collections containing approximately 1,000,000 tokens were compared: one English and one Swedish from a retail company, and one English and one Swedish from a travel company. One gold standard was based on synonym dictionaries and one was a manually extended version of that gold standard. The extended gold standard included antonyms, misspellings and near-related hyponyms/hypernyms/siblings and was about 20% larger. On average, about two of the extracted synonym candidates per test word were falsely classified as incorrect because they were not included in the dictionary-based gold standard.

Precision, recall and f-score were computed. Test words were either nouns, verbs or adjectives. The f-scores were three to five times higher when using the extended gold standard.

The best f-scores were achieved when the training data had been lemmatized. POS-tagging improved precision but decreased recall, and it reduced the number of misspellings extracted as synonyms. A cosine similarity score threshold of 0.5 could be used to increase precision and f-score without substantially decreasing recall.

Contents

Acknowledgments
1 Introduction
2 Background
  2.1 Synonymy and other lexical sense relations
  2.2 Usage of extracted synonyms
  2.3 Random Indexing
  2.4 Synonym Extraction Methods
  2.5 Evaluation Methods
3 Data and Method
  3.1 Data
  3.2 Preprocessings
    3.2.1 Extraction of User Input from Chat Logs
    3.2.2 Tokenization
    3.2.3 Part-Of-Speech Tagging
    3.2.4 Lemmatization
    3.2.5 Stop Word Lists
  3.3 Synonym Extraction by Random Indexing
    3.3.1 Random Indexing Tool-Kit
    3.3.2 Extraction Criteria
  3.4 Evaluation
    3.4.1 Test Words
    3.4.2 Gold Standard
    3.4.3 Adapted Gold Standard
4 Results
  4.1 Tests
  4.2 Threshold
  4.3 Part-of-speech tagging
  4.4 Lemmatization
  4.5 The different gold standards
  4.6 Precision
  4.7 Recall
  4.8 F-score
  4.9 Comparison Overview
5 Conclusions
  5.1 Overview of the results
  5.2 Discussion
    5.2.1 Future Study suggestions
References
Bibliography

Acknowledgments

I would like to thank Mats Dahllöf for continuous support and feedback throughout this project.

I am very grateful to Artificial Solutions for the opportunity to do this project with a leading language technology company and for the use of their data. I especially want to thank my supervisor Sonja Petrovic Lundberg.

I would also like to thank Per Starbäck for the time he took to read my thesis and the advice he gave me regarding formatting aesthetics, and Jörg Tiedemann for helpful comments and feedback.
