Master Thesis

Using Machine Learning Methods for Evaluating the Quality of Technical Documents

Authors: Michael LUCKERT, Moritz SCHAEFER-KEHNERT

Supervisor: Prof. Dr. Welf LÖWE
Examiner: Prof. Dr. Andreas KERREN
Reader: Dr. Ola PETERSSON
Semester: HT2015
Subject: Computer Science
Course: 15HT - 5DV50E/4DV50E

Abstract

In an increasingly networked world, the availability of high-quality translations is critical for success amid growing international competition. Large international companies as well as medium-sized companies are required to provide well-translated, high-quality technical documentation for their customers, not only to be successful in the market but also to meet legal regulations and avoid lawsuits. Therefore, this thesis focuses on the evaluation of translation quality, specifically concerning technical documentation, and answers two central questions:

• How can the translation quality of technical documents be evaluated, given the original document is available?

• How can the translation quality of technical documents be evaluated, given the original document is not available?

These questions are answered using state-of-the-art machine learning algorithms and translation evaluation metrics in the context of a knowledge discovery process. The evaluations are done on a sentence level and recombined on a document level by classifying each sentence as either automated translation or professional translation. The research is based on a database containing 22,327 sentences and 32 translation evaluation attributes, which are used to optimize five different machine learning approaches. An optimization process consisting of 795,000 evaluations shows a prediction accuracy of up to 72.24% for the binary classification. Based on the developed sentence-based classification systems, documents are classified by recombining their sentences, and a framework for rating document quality is introduced. The approach taken thus yields a working classification and evaluation system.
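The recombination of sentence-level classifications into a document-level label can be illustrated with a minimal sketch. The abstract does not state which recombination rule the thesis uses, so the share-of-sentences rule, the 0.5 threshold, and the name classify_document below are assumptions made purely for illustration.

```python
from typing import Sequence


def classify_document(sentence_labels: Sequence[str], threshold: float = 0.5) -> str:
    """Recombine per-sentence labels into a single document label.

    Assumption: the document counts as "automated" when the share of
    sentences labelled "automated" exceeds the threshold. This rule is
    hypothetical and is not taken from the thesis.
    """
    if not sentence_labels:
        raise ValueError("document has no classified sentences")
    automated = sum(1 for label in sentence_labels if label == "automated")
    share = automated / len(sentence_labels)
    return "automated" if share > threshold else "professional"


# Hypothetical sentence-level predictions for one document.
labels = ["automated", "professional", "automated", "automated"]
print(classify_document(labels))
```

With this rule, a sentence classifier that is only moderately accurate can still yield a stable document label, because individual misclassifications are outvoted by the remaining sentences.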

Contents

List of Figures

List of Tables

1 Introduction

1.1 Motivation

1.2 Related Work

1.3 Purpose and Research Question

1.4 Approach and Methodology

1.5 Scope and Limitation

1.6 Target group

1.7 Outline

2 Theoretical Background

2.1 Knowledge Discovery in Databases

2.2 Data Mining

2.3 Machine Learning

2.3.1 Decision Tree

2.3.2 Artificial Neural Networks

2.3.3 Bayesian Networks

2.3.4 Instance-Based Learning (kNN)

2.3.5 Support Vector Machines

2.3.6 Evaluation of Machine Learning

2.4 Machine Translation

2.4.1 Rule-Based Machine Translation

2.4.2 Example-Based Machine Translation

2.5 Evaluation of Machine Translation

2.5.1 Round-Trip Translation

2.5.2 Word Error Rate

2.5.3 Translation Error Rate

2.5.4 BLEU

2.5.5 NIST

2.5.6 METEOR

2.6 Technical Documentation

3 Method

3.1 Identification of the Data Mining Goal

3.2 Translation Data

3.3 Choice of Attributes

3.4 Specification of a Data Mining Approach

3.5 Interpretation

3.6 Further Use of the Results

3.6.1 Document-Based Analysis

3.6.2 Proposal of an Evaluation Framework

4 Results / Empirical data

4.1 Empirical Data

4.2 Results

4.2.1 Research Question 1

4.2.2 Research Question 2

4.2.3 Quality Ranking of Technical Documentation

4.2.4 Deliverables

5 Discussion

5.1 Results Interpretation

5.1.1 Research Question 1 Sentence-Based

5.1.2 Research Question 1 Document-Based

5.1.3 Research Question 2 Sentence-Based

5.1.4 Research Question 2 Document-Based

5.1.5 Comparison of the two Research Questions

5.1.6 Evaluation Framework

5.2 Method reflection

5.3 Reliability

6 Conclusion

6.1 Conclusions

6.2 Further research

Bibliography

A Optimization Ranges

B Additional Results

C Detailed Working Steps
