


An Application for Visualization of Parameterized Speech Signals

LUKAS DZBANEK, ANDY KUIPER

Institute of Radioelectronics

Brno University of Technology

Purkynova 118, 61200 Brno

Czech Republic

Abstract: - The problem of speech recognition has been investigated by a large number of specialists for many years. At present, two methods dominate automatic speech recognition. The first is based on “Hidden Markov Models” and is preferred for systems with a large vocabulary. The second is the so-called “Dynamic Time Warping”, which is suitable for smaller systems (a small number of words to be recognized).

In this paper, a software package developed in MatLab is presented. The application visualizes parameterized speech signals and thus illustrates the possibilities of a recognition variant based on digital image processing.

Key-Words: - Application, MatLab, Speech Processing, Image Processing, MFCC, LPC, FFT, Cepstrum

1 Introduction

Human beings recognize the content of speech based on their knowledge of individual phonemes as well as on grammar (context). If spoken words are phonetically similar, the human brain concentrates not only on the phonetic structure but also on the differences between the words, and is able to differentiate them even in a noisy environment (e.g. a telephone call on public transport). However, knowledge of these differences alone is not sufficient for speech recognition; it can only be seen as a supplement to the classic methods.

The introduced software package (Fig. 1) allows the user to observe the parameterization of speech and therefore makes the selection of such differences more convenient.

Fig.1: Appearance of the program’s interface

(top: six windows with transformed signals, bottom: configurability of the segmentation, parameterization, transformation and representation).

2 Description of the System

The system was developed in the mathematical programming environment MatLab 6.5. The application consists of a GUI (Graphical User Interface) and calls the implemented functions, which segment the loaded speech signal as desired and can also parameterize, transform, represent and store it (Fig. 2).


Fig.2: Block diagram of the developed environment
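
To give a rough idea of the processing chain in Fig. 2, the following MatLab sketch chains the individual stages together. All function names (segment_signal, parameterize, transform_image) and the file name are hypothetical placeholders, not the actual names used in the package.

% Hypothetical sketch of the processing chain in Fig. 2 (all names are placeholders)
[x, fs] = wavread('dve.wav');                          % load a speech signal (example file name)
frames  = segment_signal(x, fs, 20e-3, 10e-3);         % segmentation: 20 ms segments, 10 ms overlap
P       = parameterize(frames, 'mfcc', 12);            % parameterization, e.g. 12 Mfcc coefficients
I       = transform_image(P, 'sobel', 'both', 0.001);  % graphical transformation and binarization
imagesc(I), colormap(gray);                            % representation in one of the six windows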

2.1 Segmentation

The input signal is segmented using classical procedures [1]. The length and the overlap of the segments can be defined, as well as the use of pre-emphasis or a Hamming weighting function (Fig. 3).


Fig. 3: Configurability of the segmentation
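
A minimal sketch of this segmentation stage is given below, assuming base MatLab plus the Signal Processing Toolbox (for hamming); the file name and parameter values are only examples and do not reproduce the package's actual code.

% Minimal segmentation sketch (example values)
[x, fs] = wavread('dve.wav');             % load a speech signal (example file name)
x = filter([1 -0.95], 1, x);              % pre-emphasis
N = round(20e-3 * fs);                    % segment length: 20 ms (160 samples at 8000 Hz)
S = round(10e-3 * fs);                    % frame shift corresponding to 10 ms overlap
w = hamming(N);                           % Hamming weighting function
nFrames = floor((length(x) - N) / S) + 1;
frames  = zeros(N, nFrames);
for k = 1:nFrames
    seg = x((k-1)*S + (1:N));             % cut out one segment
    frames(:, k) = seg(:) .* w;           % apply the window
end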

2.2 Parameterization

At present, the following methods for parameterizing the signal are implemented [2], [3], [4], [5]:

Mel-frequency cepstral coefficients (Mfcc)

LPC

Logarithmic

Root Cepstral coefficients

ParCor coefficients

FFT.

As described in chapter 2 (“Description of the System”), it is very simple to implement additional parameterization functions. For this option there is also a large number of settings, which can be seen in Fig. 4.
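
As an illustration of one of these methods, a simplified Mfcc computation could look as follows. It continues from the segmentation sketch above; melfilterbank is a hypothetical helper returning the Mel filter matrix, and dct requires the Signal Processing Toolbox. The values (13 Mel banks, 12 coefficients, 256-point FFT) follow the example in chapter 3.

% Simplified Mfcc sketch (melfilterbank is a hypothetical helper)
NFFT = 256;                              % FFT window length
M    = melfilterbank(13, NFFT, fs);      % hypothetical: 13 Mel filters, size 13 x (NFFT/2+1)
C    = zeros(12, nFrames);
for k = 1:nFrames
    X = abs(fft(frames(:, k), NFFT));    % magnitude spectrum of one segment
    X = X(1:NFFT/2+1);                   % keep the non-redundant half
    E = log(M * X + eps);                % log energy in each Mel band
    c = dct(E);                          % the DCT decorrelates the bands (cepstrum)
    C(:, k) = c(2:13);                   % keep 12 cepstral coefficients
end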

2.3 Calculation and Representation

As soon as the parameters for segmentation and parameterization are correctly set, the speech signal can be represented as a picture (Fig. 5 b). This representation can be modified using the dialogue shown in Fig. 6 or by using different methods of binarization.

For this purpose, five different binarization methods were implemented. Each of them can be influenced via the “Direction Method” and “Threshold” settings. To obtain a better visualization, the user can choose between 20 freely selectable color maps.
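
One of the possible binarization variants can be sketched with the Image Processing Toolbox function edge, which supports a Sobel operator, a threshold and a direction argument; the package's five methods may of course be implemented differently. The sketch continues from the Mfcc example above.

% One possible binarization (sketch only)
I  = mat2gray(C);                        % scale the parameter matrix to the range [0, 1]
BW = edge(I, 'sobel', 0.001, 'both');    % Sobel operator, direction 'both', threshold 0.001
imagesc(BW), colormap(gray(2));          % display the binary picture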


Fig. 5: a) Representation of a speech signal (Czech number 2 – „dvě“)

b) Representation of the speech signal as a picture. The parameters used are shown in the top border of the picture.


Fig. 6: Modification and Representation

It is possible to display different words (up to six words per work window, Fig. 1) with the same parameterization, or the same word with different parameterizations. The representation, the parameterization, or both can be updated (as shown in Fig. 5b) using the checkboxes R-cpm (“refresh computing”) and R-grph (“refresh graph”). Once the picture is generated, it can be saved as a BMP image or as a compressed JPG image. The application can also be extended with further profiles to save data in other image formats; at present only the two most common formats are implemented.
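
Saving the generated picture can be sketched with the standard MatLab function imwrite; the file names below are only examples.

% Saving the generated picture (example file names)
imwrite(BW, 'dve_mfcc.bmp');                              % uncompressed BMP image
imwrite(uint8(255 * BW), 'dve_mfcc.jpg', 'Quality', 90);  % compressed JPG image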

3 Practical Example

The developed application was used to test different variants of speech recognition. Fig. 7 shows six pictures: the German words “Hand” and “Haut”, each with two repetitions. They were transformed using the following parameters: segment length 160 samples (corresponding to 20 ms at a sampling frequency of 8000 Hz), window overlap 10 ms, 12 Mfcc coefficients, 13 Mel banks, FFT window length 256, graphical transformation: Sobel operator, direction method: both, and binarization threshold 0.001.
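
For clarity, this parameter set can be written as a MatLab structure; the field names below are hypothetical and merely mirror the settings dialogues of the application.

% Parameter set used for Fig. 7 (hypothetical field names)
p.fs            = 8000;      % sampling frequency in Hz
p.segmentLength = 160;       % segment length in samples (20 ms)
p.overlap       = 10e-3;     % overlapping of the windows in seconds
p.numMfcc       = 12;        % number of Mfcc coefficients
p.numMelBanks   = 13;        % number of Mel banks
p.nfft          = 256;       % FFT window length
p.transform     = 'sobel';   % graphical transformation
p.direction     = 'both';    % direction method
p.threshold     = 0.001;     % threshold for the binarization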


Fig. 7: Representation of six words (the words “Hand” and “Haut” with two repetitions each).

The first three pictures represent the word “Hand”, while the last three represent the word “Haut”. Although these words are phonetically similar, clear differences are noticeable. Using additional methods yields even better and more concrete results, as can be seen in Fig. 8.


Fig. 8: Example of extracted word characteristics – selected examples from Fig. 7 (from top to bottom, two variants of the words “Hand”, a) and b), and “Haut”, c) and d). The symbol “P” stands for parameter).

The extracted characteristics [6] in Fig. 8 were formed using erosion and dilation functions (a further description of these methods is beyond the scope of this paper).
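
As an illustration only, this morphological post-processing can be sketched with the Image Processing Toolbox functions imerode and imdilate, applied to a binary picture BW such as the one produced in the binarization sketch above; the structuring element is an example.

% Sketch of the morphological post-processing (structuring element is an example)
se  = strel('square', 2);                % small structuring element
BW2 = imerode(BW, se);                   % erosion suppresses isolated pixels
BW2 = imdilate(BW2, se);                 % dilation emphasizes the remaining dominant characteristics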

4 Conclusion and Future Work

Based on the results of chapter 3 it can be seen that this software package yields good results in a short time, with only a brief familiarization period. Fig. 8 (recognition of dominant characteristics) points to a new possibility: visualizing words as pictures. Speech recognition could therefore take place not only in the 1-D domain but also in the 2-D domain, making image processing techniques applicable. In future work, digital image processing algorithms will be used to measure the reliability of the system.

References:

[1] Vary P., Heute U., Hess W. (1998) Digitale Sprachsignal-Verarbeitung. B.G. Teubner Stuttgart, Karlsruhe, Germany, 591 p.

[2] O'Shaughnessy D. (1987) Speech Communication: Human and Machine. Addison-Wesley Publishing Company, USA, 568 p.

[3] Psutka J. (1995) Komunikace s počítačem mluvenou řečí. Academica Praha, Czech Republic, 287 p.

[4] Becchetti C., Riccoti P.L. (2002) Speech Recognition. Theory and C++ Implementation. John Wiley & Sons Ltd, Chichester, England, 407 p.

[5] Sarilaya R., Hansen J.H.L. (2001) Analysis of the Root-Cepstrum for Acoustic Modeling and Fast Decoding in Speech Recognition. In “EUROSPEECH-2001”, Aalborg, Denmark. URL: (May 2004)

[6] Mohamad R. Rafimanzelat, Babak N. Araabi, Emad Sharifi (2004) New Features from Fourier Spectrum for Induction Machine Broken Bar Detection using Statistical Pattern Recognition. In 3rd WSEAS Int. Conf. on Signal Processing, Robotics and Automation (ISPRA 2004), Salzburg, Austria.

[7] Prasad M.G., Omkar S.N., Mani V., Honne H. Gowda (2004) Comparative Study of Neural Network Approach and Genetic Programming Approach to Land Cover Mapping. In 5th WSEAS Int. Conf. on Neural Networks and Applications (NNA 2004), Udine, Italy.

ACKNOWLEDGEMENT

This paper was supported by the Grant Agency of the Czech Republic, Project No. 102/03/H109.
