


Hidden Markov Model for Text Analysis

Student: Tun Tao Tsai
Advisor: Dr. Mark Stamp
Committee Member: Dr. Jeff Smith
Committee Member: Dr. Chris Pollett
Department of Computer Science,
San Jose State University
Email: joetsai@

Contents

Abstract
1. Introduction
2. The basic probability theory
2. The Cave and Neuwirth experiment
3. Hidden Markov model
3.1 The Markov property
3.2 The Hidden Markov model definition
3.3 Finding the probability of the observed sequence
3.4 The forward-backward algorithm
3.4.1 The forward recursion
3.4.2 The backward recursion
3.5 Choosing the best state sequence
3.6 Parameter re-estimation
4. Chinese information processing
5. Phonology
5.1 Chinese phonemic transcription
5.2 English phoneme transcription
6. Experiment design and the software
6.1 Number of iterations
6.2 Number of states
6.3 Chinese corpus
6.4 The software
7. Experiment results
7.1 English alphabet experiment results
7.2 English phoneme results
7.2.1 English phoneme experiment using 2 states
7.2.2 English phoneme experiment with more than two states
7.3 Chinese character experiment results
7.4 Zhuyin experiment results
7.5 Entropy
8. Summary and conclusions
9. Future work
References
Appendix 1: Experiment results
Appendix 2: Entropy experiment results
Appendix 3: Brown University corpus
Appendix 4: Chinese character encoding
Appendix 5: Zhuyin – Pinyin conversion table
Appendix 6: CMU pronouncing dictionary phoneme chart
Appendix 7: Trial experiment result to determine the number of iterations to use

Abstract

In the field of natural language processing, the Hidden Markov Model (hereafter HMM) has proven useful for finding patterns in sequences of data. In this study, we apply the HMM technique to the Brown corpus [1], to the Brown corpus in its phonemic representation, to a Chinese corpus, and to that Chinese corpus in its Zhuyin representation, in an attempt to find statistical models that can explain certain language features.

We first give a brief overview of the original experiment conducted by Cave and Neuwirth [14], the Hidden Markov Model, the Chinese language processing issues and English phonetic properties that are relevant to our experiments, the design of our experiments, and the findings from our experiments. Finally, we discuss some of the challenges for future research.

Keywords: Chinese, Hidden Markov Model

1. Introduction

The theory of the Hidden Markov Model (HMM) was first introduced in the late 1960s by Baum and his colleagues. Quickly spotting the potential of the method, Jack Ferguson of the Institute for Defense Analyses brought it into the area of speech recognition. During the 1970s, Baker at Carnegie Mellon University (CMU) and Jelinek at International Business Machines (IBM) applied the HMM technique to speech processing. Several new lines of language-modeling research quickly followed the introduction of the HMM technique, such as part-of-speech tagging and the word segmentation problem that is common in Asian languages [12]. Even today, HMMs are widely used in research areas such as DNA sequence alignment, network intrusion detection, and vision recognition.

An HMM is a statistical model of a Markov process with hidden states. One use of an HMM is to determine the relationship between the hidden states and the observations, which depends on the associated probabilities. The structure of an HMM is illustrated in Figure 1: the states of the Markov chain are hidden, while the outputs of the Markov chain are observable.

[Figure 1: Graphic illustration of the Hidden Markov Model]
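As a concrete illustration of Figure 1, the following is a minimal sketch of the three HMM components (the initial state distribution, the state transition matrix, and the observation matrix) and of how such a model generates an observable sequence while its state sequence stays hidden. The sketch is written in Python with NumPy; the two states, three symbols, and all probability values are made-up toy numbers, not parameters from this report.

# A minimal HMM sketch: the state sequence is hidden, only the emitted
# symbols are observable. All values below are illustrative toy numbers.
import numpy as np

states = ["S0", "S1"]              # hidden states
symbols = ["a", "b", "c"]          # observable output symbols

pi = np.array([0.6, 0.4])          # initial state distribution
A = np.array([[0.7, 0.3],          # A[i, j] = P(next state j | current state i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],     # B[i, k] = P(symbol k | state i)
              [0.1, 0.3, 0.6]])

rng = np.random.default_rng(0)
state = rng.choice(len(states), p=pi)
observed = []
for _ in range(10):
    observed.append(symbols[rng.choice(len(symbols), p=B[state])])
    state = rng.choice(len(states), p=A[state])

print("".join(observed))           # the observation sequence; the states stay hidden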

This is a common technique in language processing. One classic example is the experiment conducted by R. L. Cave and L. P. Neuwirth, published in 1980. In that experiment, they applied an HMM to a large body of English text (known as a corpus), taking the individual letters as the set of observations. With two states, they found that the letters naturally separated into vowels and consonants. In some sense, this is not surprising. But since no a priori assumption was made regarding the two hidden states, the experiment clearly demonstrates that the vowel-consonant separation of letters in English is significant in a very strong statistical sense. Cave and Neuwirth [14] also conducted similar experiments with more than two hidden states, and they were able to sensibly interpret the results for up to 12 hidden states.
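One way to read such a separation off a trained two-state model is to assign each letter to whichever hidden state is more likely to emit it, according to the trained observation matrix B. The sketch below assumes a trained 2 x 26 matrix B indexed by state and letter; the random matrix used here is only a stand-in for a trained one, so the printed grouping is not the actual Cave and Neuwirth result.

# Hypothetical readout of a trained 2-state observation matrix B, where
# B[s, k] = P(letter k | state s). The random B below only stands in for
# a trained matrix; it does not reproduce the Cave and Neuwirth grouping.
import string
import numpy as np

letters = list(string.ascii_lowercase)
B = np.random.default_rng(1).dirichlet(np.ones(26), size=2)   # placeholder for a trained B

groups = {0: [], 1: []}
for k, letter in enumerate(letters):
    groups[int(np.argmax(B[:, k]))].append(letter)            # state most likely to emit this letter

print("state 0:", "".join(groups[0]))
print("state 1:", "".join(groups[1]))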

We have conducted an HMM analysis of Chinese text somewhat analogous to the HMM study of English text discussed in the previous paragraph. Chinese is, of course, very different from English. Unlike English, Chinese has a large character set, and its characters are written in sequence without spaces or other separators. The only form of separation in Chinese is the use of punctuation marks at locations such as sentence boundaries. This language structure affects the outcomes of our experiments; more detail is given in the chapter on Chinese information processing.

In addition, we have extended this study to the area of phonology. We applied the HMM technique to both corpora in their respective phonemic transcriptions. The results are interesting since phonemes are more primitive than graphemes. The experiments may also reveal differences between the two languages in their use of nasal and oral sounds.

We applied HMMs with two to six states, using the individual characters or phonemes as the observations. Such an HMM application is computationally intensive and requires a vast amount of data in order to have any chance of success. We also attempt to give meaningful linguistic interpretations to the results we obtain. A sketch of this experimental setup is given below.
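The sketch below illustrates the kind of setup just described: a text is converted into a sequence of observation indices over a small alphabet, and a two-state model is initialized with near-uniform, row-stochastic matrices ready for re-estimation. The alphabet, the perturbation size, and the sample string are assumptions made for illustration, not the actual settings of this project.

# Sketch: build an observation sequence from text and a randomly perturbed,
# row-stochastic initial model (pi, A, B). Alphabet, epsilon, and the sample
# text are illustrative choices, not the settings used in this project.
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz "                # 27 observation symbols
index = {ch: i for i, ch in enumerate(alphabet)}

text = "the quick brown fox jumps over the lazy dog"
obs = np.array([index[ch] for ch in text if ch in index])

def row_stochastic(rows, cols, eps=0.01, seed=0):
    # near-uniform rows with a small random perturbation; each row sums to 1
    rng = np.random.default_rng(seed)
    m = np.full((rows, cols), 1.0 / cols) + eps * rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

n_states = 2
pi = row_stochastic(1, n_states)[0]                      # initial state distribution
A = row_stochastic(n_states, n_states)                   # state transition probabilities
B = row_stochastic(n_states, len(alphabet))              # observation probabilities
# pi, A, and B would then be re-estimated with the forward-backward
# (Baum-Welch) algorithm described in section 3.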

2. The basic probability theory

The mathematics behind the Hidden Markov Model is fairly straightforward. In this section, we describe the basic probability concepts that are required to understand the HMM technique. This material can be found in many sources [11].

1. Probability axioms: Given a finite sample space S and an event A in S, we define $P(A)$ to be the probability of A. Then

a. $0 \le P(A) \le 1$ for each event A in S.

b. $P(S) = 1$.

c. $P(A \cup B) = P(A) + P(B)$ if A and B are mutually exclusive events in S.

2. Joint probability: If A and B are two events (for example, values taken by two random variables), then the joint probability $P(A, B) = P(A \cap B)$ is the probability that A and B both occur.

3. Conditional probability: the probability of A conditioned on B is defined as $P(A \mid B) = P(A, B) / P(B)$.

4. Product rule: from the definition of conditional probability, the product rule is $P(A, B) = P(A \mid B)\, P(B)$.

5. Chain rule: the chain rule extends the product rule to the general form $P(A_1, A_2, \ldots, A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$.

6. Bayes' rule: Bayes' rule is an alternative way to calculate a conditional probability when the joint probability $P(A, B)$ is not known directly. From the definition of conditional probability, we know

$P(A \mid B) = P(A, B) / P(B)$ as well as $P(B \mid A) = P(A, B) / P(A)$.

Bayes' rule is $P(A \mid B) = P(B \mid A)\, P(A) / P(B)$.

7. Entropy: Entropy is a measure of the randomness of a system. If S takes the values $s_i$, $1 \le i \le n$, according to a probability distribution P, then the entropy is $H(S) = -\sum_{i=1}^{n} P(s_i) \log_2 P(s_i)$. A small numeric illustration of these definitions follows this list.
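The following short sketch evaluates the conditional probability, Bayes' rule, and entropy definitions above on arbitrary example numbers; it is only a numeric illustration, and the probability values are not drawn from any experiment in this report.

# Numeric illustration of conditional probability, Bayes' rule, and entropy.
# All probability values here are arbitrary examples.
import math

p_A = 0.3                                  # P(A)
p_B = 0.4                                  # P(B)
p_AB = 0.12                                # joint probability P(A, B)

p_A_given_B = p_AB / p_B                   # conditional probability, 0.3
p_B_given_A = p_AB / p_A                   # 0.4
bayes = p_B_given_A * p_A / p_B            # Bayes' rule, equals p_A_given_B
print(p_A_given_B, bayes)

dist = [0.5, 0.25, 0.25]                   # a distribution over three values
entropy = -sum(p * math.log2(p) for p in dist if p > 0)
print(entropy)                             # 1.5 bits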