Speech Recognition and Understanding



Carnegie Mellon University

Language Technologies Institute

An Acoustic Model of the Water Noise

in the Dolphin Project

11-751 Speech Recognition and Understanding

Fall 2004

Instructor: Dr. Tanja Schultz

Kai-min Kevin Chang

Alex Kang

Introduction

Towards Communication with Dolphins is an ambitious, long-term scientific study at Carnegie Mellon University of the vocalizations of Tursiops truncatus (the Atlantic bottlenose dolphin). Working with the Wild Dolphin Project, a non-profit, all-volunteer organization, the two groups aim to use speech recognition technology to better classify dolphin recordings. Since the establishment of the Wild Dolphin Project in 1989, gigabytes of dolphin sound have been recorded but left untranscribed. Manual transcription would consume hours of tedious, routine work. Thus, as our term project for 11-751 Speech Recognition and Understanding, we propose to train acoustic models for water noise, with the goal of automatically classifying dolphin sound and water noise.

Literature survey

Water noise modeling falls into two schools. On one hand, a noise detection model uses a standard HMM to classify a sequence of sound units into noise and non-noise. On the other hand, a noise filter model uses signal processing techniques to separate noise and non-noise into different channels. Given that water noise is background noise that is often present throughout an entire recording, a noise filter model would ultimately be of greatest interest. However, noise filtering is still an open problem in signal processing and is beyond the scope of a term project. For the purpose of automatically classifying dolphin sound and water noise, a noise detection model suffices.

A typical recording in the Dolphin Project contains a range of dolphin sounds and noises. There are three types of dolphin sound: broadband clicks used in echolocation, broadband burst pulses, and whistles. The noises are more difficult to characterize. Depending on the recording equipment and environment, they include human speech, various machine noises due to the microphones and the boat's propeller, and various types of water noise due to splashes, animals in the water, and perhaps different types and depths of the seabed.

Previous work in the Dolphin Project used the Janus speech recognition toolkit to identify individual dolphins. Janus is an HMM-based speech recognizer developed in the Interactive Systems Lab at CMU. In an HMM speech recognizer, acoustic models are trained to capture the acoustic characteristics of the classes of interest. Accordingly, the dolphin-ID project trained an acoustic model for each recognized dolphin based on its signature whistles, while aggregating all other noises into a single acoustic model. Initial success in distinguishing dolphins has been achieved. The present work extends the dolphin-ID project, focusing on separating water noise from dolphin sound and all other noises. That is, we aggregate one acoustic model for all dolphin sound, one acoustic model for water noise, and one acoustic model for all other noises.

Approach

Models

Setup

Extending the Dolphin-ID project, we set up the current project in the Interactive Systems Laboratories. The Linux box meenie.is.cs.cmu.edu has been set up with audio support, which enables us to view the spectrogram of an utterance in the Emulabel program and to listen to the utterance through a headset or speaker. The Janus speech recognition toolkit, an HMM-based speech recognizer, will be used to train the acoustic models.

Acoustic models

The phoneme topology for non-silence phonemes (i.e., dolphin sound) will be three states, each with a transition to itself and to the next state. The silence, water noise, and garbage phones will use only one state. All transitions have equal probability (0.5). We will start with a fully continuous density system with 39 Gaussian mixture models. This setup is adopted from the dolphin-ID project.
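
To make the topology concrete, the sketch below (Python with NumPy, our own illustration; it is not part of the Janus configuration, and the function names are ours) writes out the transition structure just described:

import numpy as np

# Illustrative sketch only; the real topologies are defined inside Janus.
def dolphin_topology():
    """3-state left-to-right HMM for dolphin sound: each state either
    loops to itself or advances to the next state, both with probability
    0.5. The last column models exiting the final state."""
    return np.array([
        [0.5, 0.5, 0.0, 0.0],   # state 0 -> {0, 1}
        [0.0, 0.5, 0.5, 0.0],   # state 1 -> {1, 2}
        [0.0, 0.0, 0.5, 0.5],   # state 2 -> {2, exit}
    ])

def single_state_topology():
    """1-state HMM for silence, water noise, and garbage: the single
    state loops or exits, each with probability 0.5."""
    return np.array([
        [0.5, 0.5],             # state 0 -> {0, exit}
    ])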

Language models and Dictionary

Initially, the dictionary contains only five entries: dolphin, water, pause, garbage, and silence. A uni-gram language model will be used, in which every vocabulary entry is assigned the same probability.
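
As a rough illustration (our own sketch, not the Janus language model format), a uniform uni-gram model simply gives every dictionary entry probability 1/|V|:

import math

# Illustrative sketch only; the actual language model is built inside Janus.
VOCAB = ["dolphin", "water", "pause", "garbage", "silence"]

def unigram_logprob(word):
    """Uniform uni-gram: every vocabulary entry has probability 1/|V|."""
    if word not in VOCAB:
        raise ValueError("unknown word: " + word)
    return math.log(1.0 / len(VOCAB))

print(unigram_logprob("water"))     # log(1/5) = -1.609...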

Database

There are two sets of dolphin recordings: the CMU database and the Bahamas database.

CMU database - The CMU database contains utterances recorded by the Wild Dolphin Project. Utterances were selected for clear dolphin sound; the recordings contain either dolphin sound or water noise, and rarely silence or garbage. There are 166 utterances in the CMU database. Each utterance is roughly 15 seconds long and sampled at 10 kHz. 100 utterances will be randomly selected for the training set, while the remaining 66 utterances form the unseen test set. The recordings were labeled by Dr. Alan Black. The CMU database was the source of data for the Dolphin-ID project.
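
A minimal sketch of the 100/66 random split (the utterance IDs below are hypothetical; the actual split uses the database's own file names):

import random

# Hypothetical utterance IDs standing in for the real CMU file names.
utterances = ["cmu_%03d" % i for i in range(166)]
random.seed(0)                 # fixed seed so the split can be reproduced
random.shuffle(utterances)
train_set = utterances[:100]
test_set = utterances[100:]    # the remaining 66 unseen test utterances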

Bahamas database – The Bahamas database contains utterances recorded during Dr. Bob Frederking's trip to the Bahamas in 2004. There are five days' worth of untranscribed recordings. Unfortunately, a constant buzzing noise throughout the first four days makes those recordings a poor source of data. Thus, we decided to select utterances from the last day, when recording was done by manually pointing the microphone toward the dolphins. There are 56 utterances in the Bahamas database. Each utterance is roughly 15 seconds long and sampled at 10 kHz. Because only 56 recordings are available, we elected not to separate training and test sets. During the labeling process, we found the recordings to be qualitatively different from the CMU database: whereas the CMU database contains distinctive dolphin and water sound, the Bahamas database contains a great deal of silence. This difference is expected to affect recognition accuracy.

Evaluation plan

Sclite

The conventional metric for speech recognition is recognition accuracy, computed from the word accuracy (WACC) as follows:

WACC = ((Len - (Sub + Ins + Del)) / Len) * 100%

where Sub, Ins, Del, and Len are the numbers of substitutions, insertions, deletions, and words in the manual transcription, respectively.
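
For concreteness, here is a small sketch (ours, not part of Sclite or eval.pl) that computes WACC from these four counts:

def word_accuracy(sub, ins, dele, ref_len):
    """WACC = (Len - (Sub + Ins + Del)) / Len * 100%."""
    if ref_len == 0:
        raise ValueError("reference transcription is empty")
    return (ref_len - (sub + ins + dele)) / ref_len * 100.0

# Example: 2 substitutions, 1 insertion, and 1 deletion against a
# 20-word reference give WACC = 80%.
print(word_accuracy(2, 1, 1, 20))   # 80.0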

Although the conventional WACC measure is accepted as a standard benchmark for speech recognition systems, it is not applicable to the current task, for three reasons. First, we have a very small vocabulary, consisting only of silence, dolphin, water, and garbage. Second, a word in this task can be arbitrarily long; nothing prevents a recording from being nothing but silence or water. Last and most importantly, WACC does not consider time information. As long as the speech recognizer decodes an utterance into the same sequence of labels as the human transcriber, WACC gives a high score regardless of the actual time span of each label. For example, both the human and the speech recognizer may label an utterance as the sequence [water, dolphin, water], but one may mark dolphin from time t to t+k while the other marks it from t+k to t+2k. Despite the high WACC, the resulting labels are not useful to the transcriber.

In light of the last problem, we decided to use Sclite, a standard ARPA speech recognition benchmarking tool, to evaluate the test results. Sclite provides a '-T' option for time-mediated alignment. Time-mediated alignment is a variation of DP alignment in which word-to-word distances are based on the times of occurrence of individual words: the standard word-to-word distance weights of 0, 3, 3, and 4 are replaced with measures based on beginning and ending word times.

Eval.pl

In addition to the time-mediated alignment score provided by Sclite, we decided to include a frame-based measure of recognition accuracy, since we are primarily interested in the correctness of label positions. We derive the accuracy by comparing the hypothesis labels with the human labels, counting each frame as either a match or a mismatch. The script eval.pl was developed for this purpose. In addition to the accuracy score, the script also outputs a confusion matrix detailing common substitutions.
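
eval.pl itself is a Perl script; the Python sketch below only illustrates the kind of computation it performs (the frame labels in the example are made up):

from collections import Counter

def frame_accuracy(ref_frames, hyp_frames):
    """Per-frame comparison of reference and hypothesis labels, returning
    the accuracy and a (hypothesis, reference) confusion matrix."""
    assert len(ref_frames) == len(hyp_frames)
    confusion = Counter(zip(hyp_frames, ref_frames))
    correct = sum(n for (hyp, ref), n in confusion.items() if hyp == ref)
    return correct / len(ref_frames), confusion

# Toy example: 3 of 4 frames match the reference -> 75% accuracy.
acc, conf = frame_accuracy(["water", "water", "dolphin", "SIL"],
                           ["water", "water", "dolphin", "water"])
print("%.0f%%" % (100 * acc))       # 75%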

Development Cycle

The general development cycle is as follows. In iteration 0, we train the acoustic models on the training set of the CMU database. The resulting acoustic models are used to decode both the CMU and Bahamas databases, and Sclite and eval.pl scores are calculated. A human transcriber then takes the decoder output for the Bahamas database as initial labels and corrects them in the Emulabel program. The corrected labels become new training data for iteration 1, in which the acoustic models from iteration 0 are retrained on the new data. The resulting acoustic models are used to decode both databases again. The cycle continues until evaluation shows no further improvement, or until we run out of data. In the current task, one iteration was performed.

Figure 1. Development cycle.

Result

|            | Data      | Eval.pl | Sclite -T |
| Dolphin-ID | cmu.0     | 41.05%  | 39.80%    |
| Waternoise | cmu.0     | 57.64%  | 61.80%    |
|            | Bahamas.0 | 16.67%  | 45.70%    |
|            | Bahamas.1 | 63.43%  | 12.00%    |
|            | cmu.1     | 42.45%  | 22.00%    |

Table 1. Evaluation results (eval.pl and Sclite -T scores).

Cmu.0
| Hyp \ Ref | SIL    | Dolphin | Water  | Garbage | Total  |
| SIL       | 3.27%  | 4.94%   | 6.71%  | 0.10%   | 15.01% |
| Dolphin   | 0.74%  | 21.95%  | 1.85%  | 0.07%   | 24.61% |
| Water     | 10.16% | 10.51%  | 32.29% | 0.62%   | 53.59% |
| Garbage   | 0.43%  | 4.63%   | 1.60%  | 0.13%   | 6.79%  |
| Total     | 14.59% | 42.03%  | 42.45% | 0.93%   |        |

Bahamas.0
| Hyp \ Ref | SIL    | Dolphin | Water  | Garbage | Total  |
| SIL       | 3.01%  | 0.50%   | 2.16%  | 2.24%   | 7.90%  |
| Dolphin   | 1.52%  | 0.93%   | 3.11%  | 1.37%   | 6.93%  |
| Water     | 52.83% | 1.70%   | 12.28% | 16.55%  | 83.35% |
| Garbage   | 0.63%  | 0.31%   | 0.42%  | 0.45%   | 1.82%  |
| Total     | 57.98% | 3.44%   | 17.97% | 20.61%  |        |

Cmu.1
| Hyp \ Ref | SIL    | Dolphin | Water  | Garbage | Total  |
| SIL       | 0.57%  | 0.56%   | 0.65%  | 0.01%   | 1.80%  |
| Dolphin   | 5.33%  | 20.13%  | 14.92% | 0.28%   | 40.65% |
| Water     | 6.73%  | 15.13%  | 21.60% | 0.48%   | 43.93% |
| Garbage   | 1.96%  | 6.21%   | 5.29%  | 0.16%   | 13.62% |
| Total     | 14.59% | 42.03%  | 42.45% | 0.93%   |        |

Bahamas.1
| Hyp \ Ref | SIL    | Dolphin | Water  | Garbage | Total  |
| SIL       | 37.33% | 0.53%   | 0.98%  | 0.45%   | 39.30% |
| Dolphin   | 3.73%  | 1.19%   | 1.94%  | 2.30%   | 9.17%  |
| Water     | 12.25% | 0.96%   | 13.60% | 6.55%   | 33.36% |
| Garbage   | 4.67%  | 0.76%   | 1.45%  | 11.31%  | 18.18% |
| Total     | 57.98% | 3.44%   | 17.97% | 20.61%  |        |

Table 2. Confusion matrices (percentages of all frames; rows are hypothesis labels, columns are reference labels).

Analysis

Iteration 0

1. Improvement over Dolphin-ID. As seen in the first two rows of Table 1, the eval.pl score improves from 41.05% for the dolphin-ID project to 57.64% for the present system. A similar pattern is observed in the Sclite evaluation, which improves from 39.8% to 61.8%. This result is expected, since dolphin-ID was aimed at a different goal: differentiating individual dolphins. Consequently, that project deliberately ignored water noise and labeled all of it as garbage, so its performance suffers whenever a water label occurs. Given that the two tasks have different goals, the water-noise system is expected to produce better results; the comparison is provided as a baseline only.

2. Low accuracy on the new recordings. As seen in the third row of Table 1, the initial recognition accuracy on the Bahamas database was very low (16.67% according to eval.pl and 45.7% according to Sclite). We hypothesized that this results from the qualitative differences between the two databases, and the hypothesis is borne out. As seen in Table 2, the CMU database contains 14.59% silence, 42.03% dolphin, 42.45% water, and 0.93% garbage, whereas the Bahamas database contains 57.98% silence, 3.44% dolphin, 17.97% water, and 20.61% garbage. The two databases are indeed qualitatively different.

Iteration 1 (more labels from Bahamas database)

1. Large improvement on the Bahamas database after labeling. As seen in Table 1, the eval.pl score improves from 16.67% to 63.43%. The improvement is again attributed to the prevalence of silence in the Bahamas recordings. One phenomenon observed during the labeling process is that the speech recognizer often confuses silence with water in the Bahamas database, because silence is rare in the CMU database but prevalent in the Bahamas database. The confusion matrix confirms this hypothesis: as seen in the Bahamas.0 matrix of Table 2, 52.83% of all frames are reference silence decoded as water. This confusion drops to 12.25% in the next iteration, once the recognizer has been trained toward the new database, and the gain in recognition accuracy is expected to come largely from this reduction in silence-water confusion. Another explanation of the large improvement is that we are testing on the training set itself; unfortunately, due to data scarcity, we were not able to separate training and test sets. Lastly, the improvement does not appear in the Sclite score. The reason for this is unknown; we suspect that Sclite is not a good evaluation criterion for the current task.

2. Decrease in CMU performance. As seen in Table 1, performance on the CMU database drops from 57.64% to 42.45% in eval.pl, and from 61.8% to 22.0% in Sclite. This result is expected, since we are fitting the model to accommodate the new data. The reduction comes from lower accuracy in identifying dolphin and water (from 21.95% to 20.13%, and from 32.29% to 21.6%, respectively), which in turn reflects the fact that the Bahamas database contains very little dolphin and water sound (3.44% and 17.97% of frames, respectively).

Discussion

Present work

The present work seeks to build an automatic classifier of dolphin sound and water noise that will eventually speed up the manual labeling process. One example of the classifier's output is shown in Figure 2. The utterance contains an initial silence, followed by a segment of water splashes; no dolphin sound is present. As seen in the figure, the classifier correctly labels the silence and most of the water segment, so the human transcriber only needs to correct the single mislabeled dolphin segment. The initial results seem promising.

Figure 2. Sample label from the speech recognizer.

Future work

Many improvements are envisioned for future work. One direction is to tune the model parameters, experiment with different acoustic model topologies, or extend the language model. Although it is unclear whether extending the language model from a uni-gram to a bi-gram model would be beneficial, better probability estimates for the vocabulary entries should help; in fact, better probability estimates should address the problem of qualitatively different recordings observed in the present work. Another direction is to obtain more data: more utterances should be labeled to enable more training iterations.

At the beginning of the semester, we were aiming for multiple water noise acoustic models, ones that would model water at different depths, over different seabeds, or with different inhabitants. However, this was discouraged by (a) the late arrival of the logs and (b) insufficient data. The logs of the Bahamas recordings have now arrived, so differentiated water noise models may be possible, provided we have sufficient data. If the data scarcity problem persists, one remedy is to simulate different water conditions in a water tank; that is, we can set up controlled recordings in a water tank simulating different types of water noise. Water tank recording has the additional benefit that the noise is single-channel, so we can be sure we are modeling one source of water noise at a time.

Lessons learnt

1. Janus speech recognizer – In the current task, we learned the basics of using the Janus speech recognizer. Through cleaning up the “waternoise” directory Tanja copied for us, we learned which files are essential to start the recognizer, and we even went a step further by re-organizing the directory. One thing we are proud of is that each step in the project is clearly documented in the README and Makefile. In addition, we created some useful environment setups (for Emulabel, and for playing audio directly in Vim), as well as common utilities such as ctm2lab.pl, ctm2frm.pl, and lab2ctm.pl. All the utilities were created with extensibility and re-usability in mind, and we hope they prove useful for follow-up researchers.

2. Hands-on experience in training acoustic models. The project is our first hands-on experience in training acoustic models. It was very exciting to see our first model output “water” when tested on a segment of an utterance that consists of water; the excitement will never be forgotten. Just as every programmer writes “Hello World” in his or her first program, the water noise acoustic model is our way of saying “Hello World” to the field of speech recognition.

3. Labeling process. Part of the project was spent hand-labeling some of the utterances in the Bahamas database. Although we had initial labels generated by the speech recognizer to aid the process, manual transcription was still the worst part of the project. Labeling requires high concentration, both visual and auditory; it took us 5 hours to label merely 60 utterances (Emulabel crashes a lot). Thus, the lesson is well learnt: respect those who transcribe for you.

Timeline and goals reached

Project outline:

1. Introduction
   1.1 Problem Statement (1)
   1.2 Literature Survey (2)
2. Approach
   2.1 Input
       2.1.1 Recording from Bahamas (1)
       2.1.2 Transcription (2)
       2.1.3 Recording from water tank (2)
       2.1.4 Transcription (1)
   2.2 Algorithm
       2.2.1 Types of noises (1)
       2.2.2 Algorithms for acoustic models (3)
       2.2.3 Implementation (3)
       2.2.4 Evaluating function (1)
   2.3 Output
       2.3.1 Hypothesis from acoustic models (1)
   2.4 Pre-test (1)
   2.5 Tuning (3)
       2.5.1 Model parameters (2)
       2.5.2 Combining noise models (2)
   2.6 Post-test (1)
   2.7 Analysis / Comparison (2)
   2.8 Conclusion (1)

Timeline:

| Proposed    | Actual | Milestone                                       |
| Sep 24      | Sep 24 | First meeting + 1.1                             |
| Sep 25      | Sep 25 | Proposal + 1.2                                  |
| Sep 27      | Sep 27 | 2.1.1 + 2.1.2 + 2.2.1                           |
| Sep 29      | Sep 29 | Proposal due                                    |
| Oct 2       | Oct 2  | 2.2.2 + 2.2.3 + 2.2.4                           |
| Oct 4       |        | Feedback on proposal                            |
| Oct 9       |        | 2.1.3 + 2.1.4 + spill over from 2.2.3 and 2.2.4 |
| Oct 16      | Nov 17 | 2.3.1 + 2.4                                     |
| Oct 20      |        | Progress Report                                 |
| Oct 23      | Nov 24 | 2.5.1                                           |
| Oct 30      | Dec 1  | 2.5.2                                           |
| Nov 6       | Dec 15 | 2.6 + 2.7                                       |
| Nov 13      | Dec 15 | 2.8                                             |
| Nov 20      | Dec 8  | Draft Report                                    |
| Nov 24 - 28 | Dec 15 | Thanksgiving                                    |
| Nov 27      |        |                                                 |
| Dec 4       |        |                                                 |
| Dec 8       | Dec 17 | Presentation                                    |
| Dec 11      |        |                                                 |
| Dec 15      | Dec 18 | Final Report                                    |

Meeting time: 30 minutes after class on Monday and Wednesday, plus 2 hours on the weekend.
The above chart shows the milestones and timeline as proposed in the project proposal. In the original chart, milestones were colored blue if accomplished, red if planned but not yet accomplished at the time of writing, struck through if dropped from the proposal, and black for work that still lies ahead. As the chart shows, progress on the model was largely delayed, partly due to Tanja's absence in October and November. Fortunately, we were able to seek help from YuePan, Alan, Yunghui, and Stan, and progress picked up very quickly when Tanja returned. We were also fortunate to have been conservative in our proposal for the last few weeks of the semester.

Conclusions

In summary, early success in recognizing water noise has been achieved, with a combined accuracy of 47.85%. The automatic classification of water, dolphin, silence, and garbage is expected to aid the human transcription process. Future work in the Dolphin Project includes experimenting with different acoustic model topologies, extending the language model, and labeling more utterances. Furthermore, multiple water noise acoustic models might be made possible by simulating different water conditions in a water tank.

