EE 3414 – Multimedia Communication Systems



EE 3414 – Multimedia Communication Systems Final Project Report

Team Members: Lawrence Pan, Jamme Tan, Syed Hassan

Concepts of Advanced Voice Recognition

Abstract:

Voice recognition is a technology that allows a person to use speech as an input device for an application. Early voice recognition systems required each user to train the system with his or her own voice. One of the first commercial uses of speech recognition was word processor dictation, which allowed people to speak fluently into a microphone and have their words transcribed into text on the screen. As the technology progressed, its flexibility and marketability grew tremendously.

The two graphs below show, first, the error rate of a voice recognition task against the difficulty of that task and, second, the computing resources available against the computing resources required to perform voice recognition tasks. Over time, both the error rates and the computing resources required decreased significantly.

|[pic] |[pic] |

Fields that have already adopted automated answering systems, made possible by ever-improving speech recognition technology, include automobile companies, credit unions, banks, insurance companies, hospitals, pharmacies, medical labs, health care offices, corporate offices, shipping services, airlines, hotels, and travel agencies. Before long, any substantial business will have an automated system handling and directing incoming phone traffic on its lines.

Project Plan:

• Conduct research on history and technology behind speech recognition

o Uses in the corporate world

▪ Automated real time telephone systems.

▪ Different companies which employ speech recognition.

• Understand the process of speech recognition

o How the speech is received, broken down, and analyzed

o Current driving technologies

▪ Search and Compare algorithms

▪ Phonetic Models

▪ Analysis of spectral representation

• Current leaders in speech recognition

o Research products of leading companies behind speech recognition.

o Compare the variety of products available from each vendor along with their performance.

• Testing and evaluation of products.

o Download demo software from each company

o Test product’s usability, response time, reliability, interface, and result.

o Determine most reliable speech recognition software.

• Create simple automated telephony system

o Customize winning software.

o Input will be pre-recorded sound file or real time.

o Testing and revision of program.

• Final PowerPoint presentation

Accomplished Tasks:

• Research on the history of voice recognition conducted

o Radio Rex, developed in 1922, was a celluloid dog that lived in an electromagnetic house and responded to his name.

o In the 1940s, the United States Department of Defense wanted to develop an automatic language translator. The effort failed because the language recognizer did not translate words correctly and needed to be more complex.

▪ As a result, the government funded Carnegie Mellon University, MIT and other institutions to design better voice recognition systems.

o Bell Laboratories, in 1952, designed the first successful recognition system, which recognized the digits 0 to 9.

o MIT, in 1959, developed a system where vowels were recognized with 93% accuracy.

o Carnegie Mellon University, later in the 1970s, developed the HARPY system, which recognized complete sentences instead of individual words separated by long pauses.

▪ Approximately 50 computers were needed for the HARPY system to perform adequately.

o Speech recognition later became usable by everyone, regardless of native accent, gender, or age group.

• Why Voice Recognition became popular

o There were many conveniences for individuals using voice recognition systems:

▪ People with disabilities can use their voice when dialing numbers or pressing computer keys is difficult.

▪ People who are traveling or driving may find it easier to speak to an automated voice menu than to type in number keys.

• Technology behind voice recognition systems

o These systems use 5 major steps, shown in the diagram below:

[pic]

• In the Capture and Digitization stage, the system takes the voice signal from the telephone network and samples it at the standard rate of 8000 samples per second.

• Voice samples are then converted to the spectral domain.

• In this domain, certain parts of the signal are taken and segmented.

o This reduces computation and therefore decreases the time it takes to process the signal and find a match to the word in real time.

[pic]
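The capture, spectral conversion, and segmentation steps above can be illustrated with a minimal Python sketch. It is not the method used by any particular product: the 10 ms frame size, the naive discrete Fourier transform, the function names, and the synthetic 1000 Hz tone standing in for captured speech are all assumptions made for illustration. Only the 8000 samples per second rate comes from the text.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second, the telephone rate quoted above
FRAME_SIZE = 80     # assumed: 10 ms frames (80 samples at 8000 Hz)


def split_into_frames(samples, frame_size=FRAME_SIZE):
    """Cut the sampled signal into consecutive, non-overlapping frames."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def magnitude_spectrum(frame):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def dominant_frequency(frame, sample_rate=SAMPLE_RATE):
    """Frequency in Hz of the strongest spectral bin in one frame."""
    spectrum = magnitude_spectrum(frame)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sample_rate / len(frame)


# A synthetic 1000 Hz tone stands in for captured speech (assumption).
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(800)]
frames = split_into_frames(tone)
print(len(frames), dominant_frequency(frames[0]))
```

Working on short frames keeps each transform small, which is the computational saving the segmentation step is after. A production recognizer would likely use a fast Fourier transform and overlapping, windowed frames rather than this direct sum.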

o These representations are used by the voice recognizer system to match against data in acoustic and language models to determine what words are being spoken by the user.

o Acoustic Models

▪ Phonemes are the smallest phonetic units in a given language

• They create distinctions between words such as the b in boy and the t in toy in the English language.

▪ Allophones are the different pronunciations of each phoneme

• E.g. t in tab, t in stab, and tt in stutter.

▪ Lexicons are databases of all words known to the system from a given language.

• They contain different pronunciations for each word

• E.g. the word “the” can be pronounced “duh” or “dee”.

▪ Trellises are data structures made up of all possible combinations of allophones

▪ Training of acoustic models happens when a user is asked to read a piece of already known text to the system. For multi-user systems, utterances spoken by many users are compiled into a database

▪ Weights are assigned to frequently occurring allophones.
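As a rough illustration of the lexicon and trellis ideas above, the Python sketch below stores several pronunciations per word and lists every pronunciation sequence for a short phrase. The lexicon entries, the informal phoneme spellings, and the function name are hypothetical. A real trellis shares states between paths instead of enumerating each path in full, so this shows only the "all possible combinations" idea.

```python
from itertools import product

# Hypothetical lexicon: each word maps to its known pronunciations,
# written as informal phoneme strings (not a real phonetic alphabet).
LEXICON = {
    "the": ["dh ah", "dh iy"],
    "boy": ["b oy"],
    "toy": ["t oy"],
}


def pronunciation_paths(words, lexicon=LEXICON):
    """List every pronunciation sequence for a sequence of words."""
    options = [lexicon[word] for word in words]
    return [" | ".join(path) for path in product(*options)]


paths = pronunciation_paths(["the", "boy"])
print(paths)  # ['dh ah | b oy', 'dh iy | b oy']
```

The recognizer would then score each path against the incoming sound, with the weights mentioned above favoring the more frequent pronunciations.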

o Language Models

▪ Language models take advantage of the fact that all languages have set grammar structures

• The difference between two words that sound alike can be distinguished by using the previous word (using context)

• E.g. “ours” and “hours” can be told apart if the previous word is “two”, which makes “hours” the likely choice
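The use of the previous word as context can be sketched with a simple word-pair (bigram) count. This is an illustrative assumption about how such a language model might work, not a description of any vendor's system. The training sentence and the function names are made up.

```python
from collections import Counter

# Tiny made-up training text; a real system would use a large corpus.
TRAINING_TEXT = "we waited two hours . the car is ours . it took two hours"


def bigram_counts(text):
    """Count each adjacent word pair in the training text."""
    words = text.split()
    return Counter(zip(words, words[1:]))


def pick_word(previous_word, candidates, counts):
    """Choose the candidate seen most often after previous_word."""
    return max(candidates, key=lambda word: counts[(previous_word, word)])


counts = bigram_counts(TRAINING_TEXT)
choice = pick_word("two", ["ours", "hours"], counts)
print(choice)  # hours
```

With this training text, the same function picks "ours" when the previous word is "is", since that is the only pair seen in the corpus.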

• Common Application of Voice Recognition systems

o Automated call centers

▪ Used in many industries that provide customer service, including airlines, banking (“pay-by-phone”, account access), and delivery services such as FedEx and UPS.

o Integrated into personal computers:

▪ Speech-to-text, which is used mostly by people with disabilities or by those who prefer dictating to typing.

o Recent development in automobiles

▪ One example is the 2004 Infiniti Q45 which uses Visteon Voice Technology™

• This system controls the car’s climate, navigation system and compact disc player

• Standards of voice recognition

o VoiceXML (Voice Extensible Markup Language), which counts AT&T, IBM, Motorola, Lucent Technologies and many more among its clients.

▪ Used in voice portals where users call up and access information

▪ The focus is shifting toward web development, where these voice systems can be used on the internet.

o SALT (Speech Application Language Tags)

▪ Partners include Microsoft, Intel, Cisco and SpeechWorks

o These standards are focusing on bringing voice recognition to the next level, the World Wide Web.

o These two are the main competing standards at present.

o VoiceXML is currently the more popular standard, with a few hundred clients listed as using the technology.

o Adoption of voice recognition systems has been slow because the differences between these two standards are unclear and companies are not well informed about which standard to adopt.

• Current Leaders in voice recognition

o Dragon Systems, which developed the speech-to-text dictation product NaturallySpeaking

o SpeechWorks, which connects customers of different companies to those companies’ voice portals

▪ Client companies include AOL, FedEx, E*Trade and many more

o BeVocal provides voice portals for BellSouth

o TellMe provides voice portals for AT&T and Merrill Lynch

o Philips Speech Recognition provides services to the automobile industry, mobile device companies and consumer electronic industries

o Other leaders of voice recognition include IBM ViaVoice and MS Agent by Microsoft

A handful of companies are working to take voice recognition to the next level. Dragon Systems is one of the leading companies enabling people to speak naturally to computers; it has developed user-side programs for Automated Speech Recognition (ASR). IBM ViaVoice is another leader; it works with MS Agent, an application created with the help of Microsoft. MS Agent helps users speak to computers to perform tasks. SpeechWorks, BeVocal, and Philips deal with voice portals, which are used in customer service applications. Both VoiceXML and SALT are working to improve voice recognition. A clash between the two standards is inevitable as each tries to outperform its competitor.

Magical Merlin’s Grade Retrieval System

Microsoft Agent provides a utility that helps make a personal computer voice activated. It has the ability to recognize a trained voice and can perform desired tasks when spoken to. With the help of Visual Basic, we were able to create a voice-activated grade retrieval system for our Multimedia Systems class. We used Magical Merlin and programmed him to be the host of the program. Merlin fetches the grades of the desired student and speaks the specific grade for exams and projects, all by voice activation. A user simply says a student’s name and a specific element, and Merlin retrieves the grade for that element. A database of students was created for Merlin, with each student’s grades stored under his or her name. When a specific name is heard, Merlin goes to the case statement that represents the desired student and loads that student’s information as the current student on file. This makes the student’s information accessible but also poses a large security issue. Improvements to Merlin would include secure access to personal information (grades), available either through a voice portal or through an online JavaScript program.
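The lookup logic described above can be sketched in a few lines. The project itself was written in Visual Basic with Microsoft Agent, so the Python below is only an illustration of the name-then-element lookup, not the team's code. The student names, grades, element labels, and function name are invented, and the recognizer's output is assumed to arrive as plain text.

```python
# Hypothetical student records; names and grades are invented for illustration.
GRADES = {
    "alice": {"exam one": 88, "exam two": 92, "project": 95},
    "bob": {"exam one": 75, "exam two": 81, "project": 90},
}


def retrieve_grade(utterance, grades=GRADES):
    """Find a known student name and grade element in recognized text."""
    text = utterance.lower()
    for name, record in grades.items():
        if name in text:
            # The matched record plays the role of the "current student on file".
            for element, grade in record.items():
                if element in text:
                    return f"{name.title()} scored {grade} on {element}."
            return f"Which grade for {name.title()}?"
    return "Student not found."


reply = retrieve_grade("Alice exam two")
print(reply)  # Alice scored 92 on exam two.
```

A dictionary keyed by student name stands in for the per-student case statements. Like the original system, this sketch performs no authentication, which is the security issue noted above.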

Future Challenges

• Speech recognition is in many ways far better than past technologies but is still far from perfect, owing to the diverse vocabularies and grammar of people all over the world. In addition, recognition systems have to work in noise-free environments, which are hard to find in busy cities.

• The confusion over the two standards, VoiceXML and SALT, will further slow the widespread use of voice recognition systems.

• Voice portals are not yet real time, and their databases are not updated frequently enough, which limits their uses. One example is the buying, selling and trading of stocks.

• Advertising is another key issue: revenue cannot come in because users will not tolerate waiting and listening to ads for long periods, compared with glancing at banners on the web.

• There are security issues: users do not want to say their social security or bank account numbers out loud for everyone nearby to hear.
