


● Introduction

● Initial problem

● How to Compare Recordings

● Dependence of System's Accuracy

● Algorithm Instructions

● Source Code

● Software Requirements

● Hardware Requirements

● References

Introduction

The project “Attendance through Voice Recognition” is a tool that helps an organization or academic institute take the attendance of its employees, students, and faculty members. It also records the date and time at which each member is present.

This project allows an organization or academic institute to overcome the problem of proxy attendance to a great extent.

Many organizations face the problem of proxy attendance. An employee's attendance may be marked by someone else, and the organization may not detect it, because there is no verification process and it is difficult to recognize the face or voice of every person.

The same situation exists in academic institutes.

Faculty members may have their attendance marked by colleagues even when they are late or absent, which is a common scenario in government institutes.

Faculty members can in turn use this software to detect proxy attendance among students.

The project describes the process of implementing a voice recognition algorithm in MATLAB. The algorithm uses the Discrete Fourier Transform to compare the frequency spectra of two voices. Chebyshev's Inequality is then used to determine, with reasonable certainty, whether the two voices came from the same person.

If the two voices match, the person is marked present in the attendance register, i.e. in a database, and the date and time of attendance are stored along with the entry.

Initial Problem

Speech is a natural mode of communication for people. We learn all the relevant skills during early childhood, without instruction, and we continue to rely on speech communication throughout our lives. It comes so naturally to us that we don't realize how complex a phenomenon speech is. The human vocal tract and articulators are biological organs with nonlinear properties, whose operation is not just under conscious control but also affected by factors ranging from gender to upbringing to emotional state. As a result, vocalizations can vary widely in terms of their accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed; moreover, during transmission, our irregular speech patterns can be further distorted by background noise and echoes, as well as electrical characteristics (if telephones or other electronic equipment are used). All these sources of variability make speech recognition, even more than speech generation, a very complex problem.

A human can easily recognize a familiar voice; however, getting a computer to distinguish a particular voice among others is a more difficult task. Several problems arise immediately when trying to write a voice recognition algorithm. Most of these difficulties stem from the fact that it is almost impossible to say a word exactly the same way on two different occasions: the speed at which the word is spoken and the emphasis placed on different parts of the word, among other factors, change continuously in human speech. Furthermore, even if a word could be said the same way on different occasions, another major dilemma would remain: to analyze two sound files in the time domain, the recordings would have to be aligned just right, so that both would begin at precisely the same moment.

How to Compare Recordings

Frequency Domain

Given the difficulties mentioned above, one thing becomes very evident: any attempt to analyze sounds in the time domain will be extremely impractical. Instead, we analyze the frequency spectrum of a voice, which remains predominantly unchanged as speech is slightly varied. We use the Discrete Fourier Transform to convert every recording into the frequency domain before any comparisons are made. Working in the frequency domain eliminates the need to exactly align audio tracks in order to compare them.
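The point about alignment can be demonstrated in a few lines. The project itself is written in MATLAB; the sketch below uses Python/NumPy as a stand-in, and the function name, sampling rate, and test tone are invented for illustration. Two copies of the same tone, deliberately misaligned in time, still produce practically identical magnitude spectra:

```python
import numpy as np

def magnitude_spectrum(signal, fs, f_max=600):
    # One-sided DFT magnitudes of the recording.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    # Keep only the bins up to f_max (the 600 Hz cutoff used later).
    keep = freqs <= f_max
    return spectrum[keep], freqs[keep]

# Two renditions of the same 220 Hz tone, deliberately offset by 10 ms:
fs = 8000
t = np.arange(fs) / fs
a = np.sin(2 * np.pi * 220 * t)
b = np.sin(2 * np.pi * 220 * (t + 0.01))
spec_a, _ = magnitude_spectrum(a, fs)
spec_b, _ = magnitude_spectrum(b, fs)
# Despite the misalignment in time, the two magnitude spectra agree.
```

In the time domain the two waveforms disagree at almost every sample; in the frequency domain the shift only affects phase, which the magnitude spectrum discards.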

Finding a Norm

Due to the nature of human speech, all data pertaining to frequencies above 600 Hz can safely be discarded. It then follows that the files can be regarded as vectors in 600-dimensional Euclidean space. After normalizing the vectors, we take the norm of the difference of two frequency spectra as a way of comparing them. Unfortunately, exactly which norm to use is not immediately clear. After carefully comparing and contrasting the L1, L2, and L∞ norms, we concluded that the L2 norm most accurately measures how close the frequency spectra of two different voices are. At this point, all that remains is to decide exactly how small the norm of the difference of two frequency spectra must be in order to conclude that both recordings originated from the same person.
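A minimal sketch of this comparison step, again in Python/NumPy rather than the project's MATLAB, with invented toy spectra: normalize each spectrum to unit length (so overall loudness does not matter) and take the L2 norm of the difference.

```python
import numpy as np

def spectral_distance(spec_a, spec_b):
    # Unit-normalize each spectrum, then take the L2 norm of the
    # difference (np.linalg.norm with ord=1 or ord=np.inf would give
    # the L1 or L-infinity alternatives discussed above).
    a = spec_a / np.linalg.norm(spec_a)
    b = spec_b / np.linalg.norm(spec_b)
    return np.linalg.norm(a - b)

# Toy 3-bin "spectra": the second is the first at double volume,
# so after normalization they are identical.
same = spectral_distance(np.array([0.0, 3.0, 4.0]), np.array([0.0, 6.0, 8.0]))
different = spectral_distance(np.array([0.0, 3.0, 4.0]), np.array([4.0, 3.0, 0.0]))
```

The normalization is what makes a quiet and a loud rendition of the same name compare as equal; only the shape of the spectrum matters.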

Chebyshev's Inequality

Recall that Chebyshev's Inequality implies, in particular, that at least 3/4 of all measurements from the same population fall within 2 standard deviations of the mean. Hence, in response to the problem posed at the end of the previous paragraph, we adopt the following solution: by requiring that the norm of the difference fall within 2 standard deviations of the speaker's average voice, we ensure that the algorithm recognizes a genuine voice correctly at least 3/4 of the time.

Dependence of System's Accuracy

The accuracy of voice recognition depends on many factors. A system's accuracy depends on the conditions under which it is evaluated: under sufficiently narrow conditions almost any system can attain human-like accuracy, but it's much harder to achieve good accuracy under general conditions. The conditions of evaluation - and hence the accuracy of any system - can vary along the following dimensions:

● Vocabulary size and confusability.

As a general rule, it is easy to discriminate among a small set of words, but error rates naturally increase as the vocabulary size grows. For example, the 10 digits 'zero' to 'nine' can be recognized essentially perfectly, but vocabulary sizes of 200, 5000, or 100000 may have error rates of 3%, 7%, or 45%. On the other hand, even a small vocabulary can be hard to recognize if it contains confusable words. For example, the 26 letters of the English alphabet (treated as 26 'words') are very difficult to discriminate because they contain so many confusable words (most notoriously, the E-set: 'B, C, D, E, G, P, T, V, Z'); an 8% error rate is considered good for this vocabulary.

● Speaker dependence vs. independence.

By definition, a speaker-dependent system is intended for use by a single speaker, while a speaker-independent system is intended for use by any speaker. Speaker independence is difficult to achieve because a system's parameters become tuned to the speaker(s) it was trained on, and these parameters tend to be highly speaker-specific.

● Isolated, discontinuous, or continuous speech.

Isolated speech means single words; discontinuous speech means full sentences in which words are artificially separated by silence; and continuous speech means naturally spoken sentences. Isolated and discontinuous speech recognition is relatively easy because word boundaries are detectable and the words tend to be cleanly pronounced.

● Task and language constraints.

Even with a fixed vocabulary, performance will vary with the nature of constraints on the word sequences that are allowed during recognition. Some constraints may be task-dependent (for example, an airline-querying application may dismiss the hypothesis 'The apple is red'); other constraints may be semantic (rejecting 'The apple is angry') or syntactic (rejecting 'Red is apple the'). Constraints are often represented by a grammar, which ideally filters out unreasonable sentences so that the speech recognizer evaluates only plausible sentences. Grammars are usually rated by their perplexity, a number that indicates the grammar's average branching factor (i.e., the number of words that can follow any given word). The difficulty of a task is more reliably measured by its perplexity than by its vocabulary size.

● Read vs. spontaneous speech.

Systems can be evaluated on speech that is either read from prepared scripts, or speech that is uttered spontaneously. Spontaneous speech is vastly more difficult, because it tends to be peppered with disfluencies like 'uh' and 'um', false starts, incomplete sentences, stuttering, coughing, and laughter; and moreover, the vocabulary is essentially unlimited, so the system must be able to deal intelligently with unknown words (e.g., detecting and flagging their presence, and adding them to the vocabulary, which may require some interaction with the user).

● Adverse conditions.

A system's performance can also be degraded by a range of adverse conditions. These include environmental noise (e.g., noise in a car or a factory); acoustical distortions (e.g., echoes, room acoustics); different microphones (e.g., close-speaking, omnidirectional, or telephone); limited frequency bandwidth (in telephone transmission); and altered speaking manner (shouting, whining, speaking quickly, etc.).
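The "average branching factor" reading of perplexity mentioned under task and language constraints can be illustrated with a toy word-pair grammar. The grammar and numbers below are entirely invented, and Python is used purely for illustration:

```python
# A toy word-pair grammar: each word maps to the words allowed to follow it.
grammar = {
    "show":    ["flights", "fares"],
    "flights": ["to", "from"],
    "fares":   ["to"],
    "to":      ["boston", "denver", "dallas"],
    "from":    ["boston", "denver", "dallas"],
}

# Average branching factor: mean number of words that may follow any word.
branching = sum(len(nxt) for nxt in grammar.values()) / len(grammar)
# branching == 2.2 here; a looser grammar has a higher branching factor
# and hence a harder recognition task, regardless of vocabulary size.
```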

Algorithm Instructions

The project contains a folder named 'Matlab Files' that holds 10 audio recordings of each person whose voice is to be recognized.

Each person should record ten samples of their voice saying their own name.

Also, that folder contains two m-files: project.m and voicerec.m.

project.m is the voice recognition program that accomplishes the goals of the project. The script can be run from the Matlab command window. Please ensure that Matlab's current directory is set to the directory containing project.m and the audio recordings.

Once project.m is run, it will ask you to "Enter the name that must be recognized". Type in the name to be recognized; the typed name must have its recorded voice samples in the audio folder.

After that, the program will inform you that you have 2 seconds to record yourself saying the name. After recording, Matlab will play the recording back, and you then have the option to record again or proceed. After proceeding, Matlab generates a plot showing how the normalized frequency spectrum of your voice (top window) compares with the average normal vector of the typed name's voice (bottom window).

Matlab will then display 'YOUR NOT “the particular person”!!!!' in the command window if you do not fall within 2 standard deviations of the average voice. If you do fall within 2 standard deviations, the command window will display 'HELLO “typed name” your attendance is marked!!!'.

The second m-file in that folder is voicerec.m. This program is also run by entering voicerec in the command window. It allows you to record your voice ten times and saves these recordings in the current directory.
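The whole pipeline, ten enrollment spectra from voicerec.m, the Chebyshev comparison in project.m, and the timestamped attendance entry, can be sketched end to end. Python/NumPy is again a stand-in for the MATLAB code; the function name, the register (a plain dict in place of the project's database), and all data below are hypothetical:

```python
import numpy as np
from datetime import datetime

def mark_attendance(name, live_spec, enrolled_specs, register, k=2.0):
    # Unit-normalize the enrollment spectra and form their average.
    enrolled = [s / np.linalg.norm(s) for s in enrolled_specs]
    mean_vec = np.mean(enrolled, axis=0)
    # Spread of the speaker's own recordings around that average.
    dists = [np.linalg.norm(s - mean_vec) for s in enrolled]
    mu, sigma = np.mean(dists), np.std(dists, ddof=1)
    # Accept if the live spectrum lies within k standard deviations.
    live = live_spec / np.linalg.norm(live_spec)
    if np.linalg.norm(live - mean_vec) <= mu + k * sigma:
        register.setdefault(name, []).append(datetime.now().isoformat())
        return True
    return False

# Toy enrollment: ten slightly perturbed copies of one 601-bin spectrum,
# standing in for the ten recordings saved by voicerec.m.
rng = np.random.default_rng(0)
base = 1.0 + rng.random(601)
enrolled_specs = [base + 0.001 * rng.standard_normal(601) for _ in range(10)]
register = {}
matched = mark_attendance("alice", np.mean(enrolled_specs, axis=0),
                          enrolled_specs, register)
impostor = np.zeros(601)
impostor[0] = 1.0  # a spectrally unrelated "voice"
rejected = mark_attendance("alice", impostor, enrolled_specs, register)
```

A genuine voice is accepted and logged with a timestamp; a spectrally unrelated one falls far outside the 2-standard-deviation band and leaves the register untouched.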

Screenshots

[Three screenshots of the program appear here in the original document.]

Software Requirements

● Windows Operating System

● MATLAB

● Microsoft Access

Hardware Requirements

The minimum hardware required to run the system is:

● Mouse

● Keyboard

● Monitor

● Intel Pentium machine

● 1 GHz processor / 16 MB RAM

● 4 GB main storage (depending on the estimated size of the database used)

● UPS (500 VA)

References

1. MATLAB product help

2. Readings in Speech Recognition, Alex Waibel and Kai-Fu Lee (eds.)
