Speech Recognition Software and Vidispine


Tobias Nilsson

April 2, 2013

Master's Thesis in Computing Science, 30 credits

Supervisor at CS-UmU: Frank Drewes
Examiner: Fredrik Georgsson

Umeå University

Department of Computing Science
SE-901 87 UMEÅ
SWEDEN

Abstract

To evaluate libraries for continuous speech recognition, a test based on TED-talk videos was created. The speech recognition libraries evaluated were PocketSphinx, Dragon NaturallySpeaking, and the Microsoft Speech API. From the words that the libraries recognized, the Word Error Rate (WER) was calculated. The results show that Microsoft SAPI performed worst with a WER of 60.8%, PocketSphinx came second with 59.9%, and Dragon NaturallySpeaking performed best with 42.6%. All of these results were achieved with a Real Time Factor (RTF) of less than 1.0.
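For reference, WER is conventionally computed as the word-level Levenshtein distance between the reference transcript and the recognizer's hypothesis, divided by the number of reference words, while RTF is the ratio of processing time to audio duration. The sketch below (in Python, with hypothetical function names; not code from the evaluation itself) illustrates both metrics:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as a word-level Levenshtein distance via dynamic programming."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,           # deletion
                           dp[i][j - 1] + 1,           # insertion
                           dp[i - 1][j - 1] + sub_cost)  # match / substitution
    return dp[len(ref)][len(hyp)] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """RTF < 1.0 means recognition runs faster than real time."""
    return processing_seconds / audio_seconds


# One substitution ("sit") and one deletion ("the") against 6 reference words:
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 0.333...
print(real_time_factor(processing_seconds=30.0, audio_seconds=60.0))    # 0.5
```

An RTF below 1.0, as reported above, is what makes a recognizer usable for live video, since the transcript cannot fall progressively behind the audio.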

PocketSphinx was chosen as the best candidate for the intended system on the basis that it is open-source, free of charge, and a better match for the system.

By modifying the language model and dictionary to more closely resemble typical TED-talk content, it was also possible to improve the WER for PocketSphinx to 39.5%. This came at the cost of an RTF exceeding the 1.0 limit, however, making that configuration less useful for live video.


Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Report Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Models and Algorithms for Speech Recognition . . . . . . . . . . . 5

2.1 What is Speech Recognition? . . . . . . . . . . . . . . . . . . . 5

2.1.1 Speech Recognition systems . . . . . . . . . . . . . . . . 5

2.1.2 Requirements for speech recognition . . . . . . . . . . . 6

2.1.3 Difficulties in speech recognition . . . . . . . . . . . . . 7

2.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 Statistical Models . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3.1 Acoustic models . . . . . . . . . . . . . . . . . . . . . . 9

2.3.2 Language models . . . . . . . . . . . . . . . . . . . . . . 11

2.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.4.1 Formal Overview . . . . . . . . . . . . . . . . . . . . . . 13

2.4.2 Forward Search . . . . . . . . . . . . . . . . . . . . . . . 14

2.4.3 Viterbi search . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4.4 A* decoder / Stack decoder . . . . . . . . . . . . . . . . 23

3 Speech Recognition Systems and Libraries . . . . . . . . . . . . . 29

3.1 CMU Sphinx . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Julius . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3 Kaldi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4 HTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.5 Windows Speech Recognition . . . . . . . . . . . . . . . . . . . 32

3.6 Dragon NaturallySpeaking . . . . . . . . . . . . . . . . . . . . . 32

3.7 SpeechMagic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.8 Comparison of libraries . . . . . . . . . . . . . . . . . . . . . . 33

