Extraction of synonyms and semantically related words from chat logs
Fredrik Norlindh
Uppsala University
Department of Linguistics and Philology
Master's Programme in Language Technology
Master's Thesis in Language Technology
November 20, 2012
Supervisors: Mats Dahllöf, Uppsala University; Sonja Petrović Lundberg, Artificial Solutions
Abstract
This study explores synonym extraction from domain-specific chat log collections by means of a tool-kit called JavaSDM. JavaSDM uses random indexing and measures distributional similarities. The focus of this study is to evaluate the effect of different preprocessing operations on the training data and of different extraction criteria. Four chat log collections containing approximately 1,000,000 tokens were compared: one English and one Swedish from a retail company and one English and one Swedish from a travel company. One gold standard was based on synonym dictionaries and one was a manually extended version of that gold standard. The extended gold standard included antonyms, misspellings and near-related hyponyms/hypernyms/siblings and was about 20% bigger. On average, around two of the extracted synonym candidates per test word were falsely classified as incorrect because they were not included in the dictionary-based gold standard.
Precision, recall and f-score were computed. Test words were either nouns, verbs or adjectives. The f-scores were three to five times higher when using the extended gold standard.
The best f-scores were achieved when the training data had been lemmatized. POS-tagging improved precision but decreased recall, and it reduced the number of misspellings extracted as synonyms. A cosine similarity score threshold of 0.5 could be used to increase precision and f-score without substantially decreasing recall.
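The extraction criterion and the evaluation measures named above can be illustrated with a minimal sketch. It is not JavaSDM and does not reproduce the thesis setup: the context vectors, the word lists and the function names are invented for illustration, and the f-score is assumed to be the balanced harmonic mean of precision and recall.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_candidates(word, vectors, threshold=0.5):
    """Return all other words whose similarity to `word` reaches the threshold."""
    target = vectors[word]
    return {
        other
        for other, vec in vectors.items()
        if other != word and cosine(target, vec) >= threshold
    }


def precision_recall_f(extracted, gold):
    """Precision, recall and balanced f-score (assumed F1) for one test word."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy context vectors; real ones would come from random indexing.
vectors = {
    "cheap": [1.0, 0.0, 1.0],
    "inexpensive": [1.0, 0.1, 0.9],
    "expensive": [0.9, 0.2, 1.0],
    "ticket": [0.0, 1.0, 0.0],
}

extracted = extract_candidates("cheap", vectors, threshold=0.5)

# Hypothetical gold standards: the extended one also accepts the antonym.
dictionary_gold = {"inexpensive"}
extended_gold = {"inexpensive", "expensive"}

dict_scores = precision_recall_f(extracted, dictionary_gold)
ext_scores = precision_recall_f(extracted, extended_gold)
```

In this toy case the antonym "expensive" is distributionally close to "cheap", so it counts as an error against the dictionary-based gold standard but as a hit against the extended one, which is the kind of difference between the two gold standards that the abstract describes.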
Contents
Acknowledgments
1 Introduction
2 Background
2.1 Synonymy and other lexical sense relations
2.2 Usage of extracted synonyms
2.3 Random Indexing
2.4 Synonym Extraction Methods
2.5 Evaluation Methods
3 Data and Method
3.1 Data
3.2 Preprocessings
3.2.1 Extraction of User Input from Chat Logs
3.2.2 Tokenization
3.2.3 Part-Of-Speech Tagging
3.2.4 Lemmatization
3.2.5 Stop Word Lists
3.3 Synonym Extraction by Random Indexing
3.3.1 Random Indexing Tool-Kit
3.3.2 Extraction Criteria
3.4 Evaluation
3.4.1 Test Words
3.4.2 Gold Standard
3.4.3 Adapted Gold Standard
4 Results
4.1 Tests
4.2 Threshold
4.3 Part-of-speech tagging
4.4 Lemmatization
4.5 The different gold standards
4.6 Precision
4.7 Recall
4.8 F-score
4.9 Comparison Overview
5 Conclusions
5.1 Overview of the results
5.2 Discussion
5.2.1 Future Study suggestions
References
Bibliography
Acknowledgments
I would like to thank Mats Dahllöf for continuous support and feedback throughout this project.
I am very grateful to Artificial Solutions for the opportunity to do this project with a leading language technology company and for the use of their data. I especially want to thank my supervisor Sonja Petrovic Lundberg.
I would also like to thank Per Starbäck for the time he took to read my thesis and the advice he gave me regarding formatting aesthetics, and Jörg Tiedemann for helpful comments and feedback.