
Development of a speech data transcription tool for building a spoken corpus

Hyangrae Noh

Hanbat National University

Daejeon, South Korea

nhr1712@

Yunsoo Kim

IIR TECH Inc.

Daejeon, South Korea

yunsookim@iirtech.co.kr

Yeonguk You

Hanbat National University

Daejeon, South Korea

ryk012@

Yongjin Kwak

IIR TECH Inc.

Daejeon, South Korea

silhuett@iirtech.co.kr

Jaeeun Park

IIR TECH Inc.

Daejeon, South Korea

jaeeun@iirtech.co.kr

Yoonjoong Kim

Hanbat National University

Daejeon, South Korea

yjkim@hanbat.ac.kr

Abstract— In this study, we developed a speech data transcription tool that integrates speech segmentation, speaker classification, speech transcription, and editing processes in order to shorten the transcription time of audio data. The system converts speech data into standardized transcription data that is used as input to a spoken corpus construction system. The speech segmentation and speaker classification processes were developed using deep learning technologies, and the transcription process uses the Google API. An experiment comparing the proposed tool with the existing ELAN and notepad tools confirmed that it cuts the processing time roughly in half.

Keywords— transcription tool, deep learning, speech corpus construction, speech segmentation, speaker classification

Introduction

In recent years, products and services using artificial intelligence technology have emerged rapidly. Securing high-quality data is essential for improving and advancing artificial intelligence performance. Artificial intelligence research institutes and companies in Korea are calling for good data because such data is commonly lacking [1].

Useful tools have emerged for building good-quality data. Bratt [2] is a tool for constructing and visualizing named-entity and language annotation data, AI-C [3] is a data tool for generating concept-object relationship information, ELAN [4] is an audio-based multimedia annotation tool, and Anvil [5] is a tool for annotating video. These tools share the goal of creating language resources, but they differ in their detailed techniques and specialized areas. Moreover, since they do not support Korean and implement their functions around the characteristics of English, they are of limited use for constructing a spoken corpus of Korean, which has a different language structure. They also demand highly technical, customized development capabilities and a language resource system, as well as design and deployment capabilities.

When a spoken corpus is constructed with ELAN and a text editor to handle the audio, the use of annotation (or tag) symbols that are not strictly defined results in poor data usability and increased working time. When a speech corpus is constructed with an audio controller and an annotation editor [6], audio segmentation and transcription are performed by hand, so automation is needed to shorten the time required.

In this study, we developed a speech data transcription tool that integrates speech segmentation, speaker classification, speech transcription, and editing. The system converts speech data into standardized transcription data that is used as input to the spoken corpus construction system.

Configuration of the system

The proposed system has the structure shown in Figure 1, and its main screen is shown in Figure 2. When a speech file is loaded, its waveform is displayed on an audio chart. Speech segments can be detected by either an automatic or a manual segmentation function. The speech signals of each detected segment are assigned to a speaker by automatic or manual speaker classification (ACLS, MCLS) and transcribed into a Korean string by automatic or manual transcription (ATRANS, MTRANS). The waveform of the voice file is visualized in the audio chart, and each detected speech segment is placed as a rectangle on the timeline of its speaker in the speaker chart.

The dialog list generation (DLGEN) module creates a list of the audio waveform, speaker, dialog, and segment information for all speech segments and outputs it to a grid view on the screen. When the automatic transcription button on the main screen is clicked, automatic transcription (ATRANS) and dialog list generation (DLGEN) are executed in order.
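
The paper does not specify the internal representation of a dialog-list entry, so the following is only a hypothetical sketch of how one row of the grid view (segment boundaries, speaker, and transcribed text) might be modeled:

```python
from dataclasses import dataclass

@dataclass
class DialogEntry:
    """One row of the dialog list (hypothetical structure, not from the paper)."""
    start_sec: float  # start time of the speech segment in the audio file
    end_sec: float    # end time of the speech segment
    speaker: str      # speaker name assigned by speaker classification
    text: str         # Korean transcription produced by (A/M)TRANS

# The dialog list shown in the grid view could then simply be a list of
# such entries, ordered by segment start time.
dialog_list = [
    DialogEntry(0.00, 2.35, "Speaker 1", "안녕하세요."),
    DialogEntry(2.35, 5.10, "Speaker 2", "네, 반갑습니다."),
]
```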

Audio chart management covers displaying the signal waveform of the voice file, zooming in and out of the waveform, setting the playback position with a cursor, displaying the current time at the cursor, marking the start and end of a speech segment, and automatic and manual scrolling.

Figure 1. The overall configuration of the system (A/MSPCLS: automatic/manual speaker classification; A/MTRANS: automatic/manual transcription).

Figure 2. The user interface of the main screen.

Speaker chart management places each speech segment on the timeline of its speaker. It also provides functions to add and delete speaker timelines, to set and modify speaker names, and to move, merge, split, and delete speech segments.

The audio controller is used when verifying the contents of a speaker or a dialog; it can play back the entire voice file and repeatedly play or stop a single speech segment. The playback position is displayed in synchronization with the audio chart and the speaker chart.

In the dialog list window, errors can be verified visually or audibly. If an error is found, the manual editor is used to merge and delete speech segments and to modify speaker names and dialog text.

When the Next button is clicked, the contents of the dialog list are converted into standardized transcription data that can be used as input to the spoken corpus construction system.
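
The exact layout of the standardized transcription data is not described in the paper; as a rough illustration only, the dialog list could be exported as a simple JSON document (the file name and field names below are assumptions, not the tool's actual schema):

```python
import json

# Hypothetical dialog list: each entry mirrors one row of the grid view.
dialog_list = [
    {"start_sec": 0.00, "end_sec": 2.35, "speaker": "Speaker 1", "text": "안녕하세요."},
    {"start_sec": 2.35, "end_sec": 5.10, "speaker": "Speaker 2", "text": "네, 반갑습니다."},
]

# Write one standardized transcription file that a downstream spoken corpus
# construction system could read.
with open("interview_transcript.json", "w", encoding="utf-8") as f:
    json.dump({"audio_file": "interview.wav", "segments": dialog_list},
              f, ensure_ascii=False, indent=2)
```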

Speech segmentation, speaker classification, and speech transcription

Speech segmentation, speaker classification, and transcription each support automatic and manual processing. For manual processing, a set of user interfaces allows audio listening, waveform viewing, and keyboard operation at the same time; for automatic processing, the engines were developed using deep learning methods.

Automatic speech segmentation is performed by a Voice Activity Detection Engine (VADE). The VADE was trained and verified on 20 hours of data from the CallHome voice database [7] using an architecture composed of a Bidirectional Long Short-Term Memory (BLSTM) network [8] and a Deep Neural Network (DNN) in a TensorFlow and Python environment. Automatic speaker classification is performed by a Speaker Clustering Engine (SCE), which groups speech segments by speaker. This engine was developed by training an architecture consisting of a Convolutional Neural Network (CNN) [9] and a Recurrent Neural Network (RNN) [10]. When speakers are classified, the set of segments is fed to the engine to generate a set of d-vectors, which are then grouped into speakers by the K-means algorithm. Automatic transcription (ATRANS) uses the Google Cloud Speech API to transcribe the speech signal of each segment into Korean characters.
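
As a minimal sketch of the clustering step described above: once a d-vector has been produced for each detected segment, K-means can group the segments by speaker. The d-vectors here are random placeholders, and the embedding dimension, length normalization, and speaker count are assumptions rather than values taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_speakers(dvectors: np.ndarray, n_speakers: int) -> np.ndarray:
    """Group segment-level d-vectors into speaker clusters with K-means.

    dvectors   : array of shape (n_segments, embedding_dim), one embedding per
                 detected speech segment (produced by the CNN/RNN encoder).
    n_speakers : expected number of speakers in the recording.
    Returns one cluster label per segment.
    """
    # Length-normalize the embeddings so Euclidean K-means behaves like a
    # cosine-distance clustering (a common choice for d-vectors; an assumption here).
    norms = np.linalg.norm(dvectors, axis=1, keepdims=True)
    dvectors = dvectors / np.clip(norms, 1e-8, None)
    return KMeans(n_clusters=n_speakers, n_init=10, random_state=0).fit_predict(dvectors)

# Illustrative use with random embeddings standing in for real d-vectors.
fake_dvectors = np.random.randn(20, 256)
print(cluster_speakers(fake_dvectors, n_speakers=2))
```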

Experiment

In order to evaluate the performance of this system, we performed experiments using the existing ELAN and notepad tools (ET), the manual processing tool (MT), and the automatic processing tool (AT).

The experimental data is an audio file of an interview with a Korean female teacher. The input file is stored in 16 kHz, 16-bit, mono-channel Windows PCM format. It contains 5,382,240 samples, its size is 10,513 KB, and the recording lasts 5 minutes 36 seconds.
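
These figures are mutually consistent: 5,382,240 samples at 16 kHz correspond to about 336.4 s, i.e. 5 minutes 36 seconds, and at 2 bytes per sample the audio data occupies roughly 10,500 KB. The check below reads these properties from the WAV header (the file name is assumed; any 16 kHz, 16-bit, mono PCM file works the same way):

```python
import wave

# Read the sample count and sampling rate of the test recording and
# derive its duration.
with wave.open("interview.wav", "rb") as w:
    n_samples = w.getnframes()
    rate = w.getframerate()
    duration_sec = n_samples / rate
    print(f"{n_samples} samples at {rate} Hz -> "
          f"{duration_sec // 60:.0f} min {duration_sec % 60:.1f} s")
    # Expected output for this file: 5382240 samples at 16000 Hz -> 5 min 36.4 s
```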

Table 1. Processing time by work tool and work process (times in [h:]mm:ss)

| Tool / Process             | ELAN + Notepad (ET) | Manual tool (MT) | Automatic tool (AT) |
|----------------------------|---------------------|------------------|---------------------|
| Data registration          | a                   | a                | a                   |
| Speech segmentation        | 11:27               | 16:51            | 08:36               |
| Speaker classification     |                     |                  | 02:22               |
| Speech transcription       | 35:04               | 10:33            | 10:18               |
| Voice annotation           |                     | 11:27            | 11:27               |
| Inspection and correction  | 52:02               | 17:12            | 17:12               |
| Data conversion            | n/a                 | a                | a                   |
| Total                      | 1:38:33             | 56:03            | 49:55               |

Each entry is the mean of three measurements for each of the 3 tools and 5 processes; the value 'a' means applicable and 'n/a' means not applicable. There is no large difference between the processing time of the automatic tool (AT), 49:55, and that of the manual tool (MT), 56:03, because automatic processing still requires listening to the entire audio at least once for verification. The AT processing time (49:55) is about 49.3% less than the existing ET processing time (1:38:33). Considering that the working time may vary with the skill of the user and the characteristics of the data, this means that about 50% of the working time is saved.
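
The reduction figure follows directly from the totals in the table: 1:38:33 is 5,913 seconds and 49:55 is 2,995 seconds, so the saving is (5913 - 2995) / 5913, about 49.3%. The comparison can be reproduced with a few lines (times taken from the table above):

```python
def to_seconds(t: str) -> int:
    """Convert an '[h:]mm:ss' string into seconds."""
    sec = 0
    for part in t.split(":"):
        sec = sec * 60 + int(part)
    return sec

et, at = to_seconds("1:38:33"), to_seconds("49:55")
print(f"Saving: {100 * (et - at) / et:.1f}%")  # ~49.3%
```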

Conclusion

ELAN is an excellent tool, but because of its generality it is limited when building data for a specific purpose. In other words, recalling that the first requirement of a good corpus is that it meets the purpose of its construction [11], a corpus construction tool should provide the annotation system and the usage environment necessary to achieve the purpose for which the corpus is built.

In this study, we developed a system that automatically segments voice data into words, sentences, and speakers and also provides functions for manually correcting the transcription, for voice annotation, and for verification. As a result, not only the speech segmentation stage but also all subsequent tasks in the spoken corpus construction process take less time; for example, the voice annotation process involves almost no lap time or idle time.

In the experiment to measure system performance, we confirmed that the proposed system reduces the processing time by about 50% compared with the ELAN and notepad tools.

We plan to devise a new architecture for speech segmentation and speaker classification to further shorten the processing time and increase accuracy. An integrated architecture will be designed that merges the statistical features of each voice segment with the frame-level Mel-Frequency Cepstral Coefficient (MFCC) features [12] used in the current system.

References

[1] S. Lee, "Secure high-quality learning data for intelligence," ICT R&D Mid/Long-term Technology Roadmap 2022, Artificial Intelligence, Institute for Information & Communication Technology Promotion, 2016, p. 414.

[2] S. Rashid, G. Carenini, and R. Ng, "A visual interface for analyzing text conversations," BIRTE 2012, Lecture Notes in Business Information Processing, vol. 154, pp. 93-108, 2013.

[3] AI-Cortex Knowledge Base.

[4] H. Sloetjes and P. Wittenburg, "Annotation by category – ELAN and ISO DCR," Proceedings of the 6th International Conference on Language Resources and Evaluation, pp. 816-820, 2008.

[5] The Video Annotation Research Tool.

[6] H. Kang, "Korea Learner's Corpus Construction," National Institute of Korean Language, 2017.

[7] A. Canavan, D. Graff, and G. Zipperlen, "CALLHOME American English Speech," Linguistic Data Consortium, 1997.

[8] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[9] Convolutional Neural Network, /Convolutional_neural_network (accessed June 29, 2018).

[10] J. Schmidhuber, "Recurrent Neural Networks," http://people.idsia.ch/~juergen/rnn.htm (accessed June 29, 2018).

[11] J. Sinclair, "Developing Linguistic Corpora: A Guide to Good Practice," Tuscan Word Centre, /chapter1.htm#section2 (accessed June 29, 2018).

[12] MFCC (Mel-Frequency Cepstral Coefficients) Algorithm, https://en.wiki/Mel-frequency_cepstrum (accessed June 29, 2018).
