


Chapter 7

FEATURE CALCULATION

7.1 Basic Prosodic Attributes

This section explains the calculations and procedures employed to obtain the basic feature contours. These essential attributes (i.e. pitch and energy) are the starting point for obtaining more complex features, which contain valuable information for our purposes. The software used in section 7.1 is part of the Verbmobil long-term project of the Federal Ministry of Education, Science, Research and Technology.

In order to achieve feasible estimations and avoid the difficulties caused by the non-stationary nature of speech, it is assumed that the properties of the signal change relatively slowly with time. This allows examination of a short-time window of speech to extract relevant parameters that are presumed to be fixed within the duration of the window. Most techniques yield parameters averaged over the course of the time window. Thus, if dynamic parameters are to be modelled, the signal must be divided into successive windows or analysis frames so that the parameters can be calculated often enough to follow relevant changes. Consequently, in order to obtain the F0 and energy contours, smaller fragments of speech, called frames, are considered.

For each frame, one F0 value and one energy value are computed. For their calculation a longer analysis window is employed; all the speech signal values inside the analysis window are considered, so that successive analysis windows always overlap. Frame durations of 10 ms and 20 ms are commonly used in speech processing, while window lengths for F0 and energy calculations are usually set between 25 ms and 40 ms. The analysis performed in the present work uses frame durations of 10 ms and analysis window lengths of 40 ms.
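
As an illustration, this framing scheme can be sketched in C as follows. It is a minimal sketch under the parameters stated above (16 kHz sampling rate, 10 ms frame shift, 40 ms analysis window); the callback interface is a hypothetical helper, not part of the Verbmobil code.

#define FS        16000               /* sampling frequency in Hz             */
#define FRAME_LEN (FS / 100)          /* 10 ms frame shift   -> 160 samples   */
#define WIN_LEN   (FS * 40 / 1000)    /* 40 ms analysis window -> 640 samples */

/* Iterate over all frames of a signal; the analysis window is centred on
   each frame, so neighbouring windows always overlap. */
void for_each_frame(const short *s, long n_samples,
                    void (*analyse)(const short *win, long len, long frame))
{
    long n_frames = n_samples / FRAME_LEN;
    for (long t = 0; t < n_frames; t++) {
        long centre = t * FRAME_LEN + FRAME_LEN / 2;
        long start  = centre - WIN_LEN / 2;
        long end    = start + WIN_LEN;
        if (start < 0)        start = 0;          /* clip at the signal edges */
        if (end > n_samples)  end   = n_samples;
        analyse(s + start, end - start, t);
    }
}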

Since the voiced/unvoiced decision is the basis of the F0 computation, it is the first algorithm described in this section. The decision is frame-based, and F0 will be estimated only over voiced frames.

7.1.1 Voiced/unvoiced decision.

Voiced speech involves the vibration of the vocal folds in response to airflow from the lungs. This vibration is periodic and can be examined independently of the properties of the vocal tract. Its periodicity refers to the fundamental frequency of the vibration or to the resulting periodicity in the speech signal, also called “pitch”.

Figure 7.1. Waveform of the glottal source.

In unvoiced speech the sound source is not a regular vibration but rather turbulence caused by airflow through a constriction in the vocal tract. The sound created as a result of the constriction is described as a noise source. It contains no dominating periodic component and has a relatively flat spectrum, meaning that every frequency component is represented roughly equally (in fact, for some sounds the noise spectrum may slope down at around 6 dB/octave). Looking at the time waveform of a noise source, only a random pattern of movement around the zero axis is observed. In this context, without any periodicity, pitch estimation makes no sense.


Figure 7.2. Different sources in speech production.

Therefore, for F0 estimation it is essential to define which frames are considered voiced and which unvoiced. In contrast with the F0 and energy calculations, non-overlapping windows are employed for the voiced/unvoiced decision: the algorithm uses only the signal values contained within one frame duration.

Voiced frames differentiate themselves from unvoiced frames by high amplitude values, a relatively low zero-crossing rate and high energy values. The zero-crossing rate is understood as the number of zero-crossings per time unit, defined from now on as the frame length, i.e. 10 ms. Several procedures for the voiced/unvoiced decision are introduced in [Hes83]. The algorithm used here applies thresholds, which are presented in [Hes83], and is described in [Kie97]. As a result of that work, the following quantities proved to be appropriate for the voiced/unvoiced decision:

Zero-crossing rate in Hz: f_{zc} = \frac{n_{cross} \cdot f_s}{N} (7.1)

Normalised energy of the signal: EneNorm = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{s_n}{MaxRange/2} \right)^2 (7.2)

Normalised absolute maximum: MaxNorm = \frac{Range}{MaxRange} (7.3)

Where

fs = sampling frequency in Hz (here 16000)

N = frame length in samples (here 160)

sn = nth sample value of the signal

n_cross = number of zero-crossings within a frame

Range = difference between the maximum and the minimum value of the signal in the frame

MaxRange = maximum feasible range, dependent on the quantisation (here 16 bit, so MaxRange = 65536)

The normalisation in (7.2) and (7.3) accounts for the fact that the speaker may speak at different energy levels at different times.

The decision rule is obtained by comparing theoretically based thresholds θn_cross, θEneNorm and θMaxNorm with the vector whose components result from equations (7.1) to (7.3):

If

n_cross < θn_cross and

EneNorm > θEneNorm and (7.4)

MaxNorm > θMaxNorm

then

→ Voiced

else

→ Unvoiced

Where

(n_cross, EneNorm, MaxNorm) = vector obtained from (7.1) – (7.3)
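
For illustration, the complete decision rule can be sketched in C as follows. The threshold constants are placeholders (the tuned Verbmobil values are not given here), and the normalised energy follows the reconstruction of (7.2) above, which is our reading of the text.

#define FS        16000
#define FRAME_LEN 160
#define MAX_RANGE 65536.0

/* Placeholder thresholds -- not the actual tuned Verbmobil values. */
static const double TH_ZCR = 2000.0;   /* Hz */
static const double TH_ENE = 1e-4;
static const double TH_MAX = 0.01;

int is_voiced(const short *s)
{
    int    n_cross = 0;
    double energy  = 0.0;
    short  min = s[0], max = s[0];

    for (int n = 0; n < FRAME_LEN; n++) {
        if (n > 0 && ((s[n-1] < 0) != (s[n] < 0)))
            n_cross++;                            /* sign change = zero-crossing */
        energy += (double)s[n] * s[n];
        if (s[n] < min) min = s[n];
        if (s[n] > max) max = s[n];
    }

    double zcr      = (double)n_cross * FS / FRAME_LEN;                     /* eq. (7.1) */
    double ene_norm = energy / (FRAME_LEN * (MAX_RANGE/2) * (MAX_RANGE/2)); /* eq. (7.2) */
    double max_norm = (max - min) / MAX_RANGE;                              /* eq. (7.3) */

    return zcr < TH_ZCR && ene_norm > TH_ENE && max_norm > TH_MAX;          /* rule (7.4) */
}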

The definition of appropriate thresholds was optimised in order to reach the best algorithm performance for the various speech samples available, starting from the theoretical background. The thresholds were selected through experiments made during the development of the Verbmobil project [Hag95]. After some simple trial-and-error experiments, further experiments were conducted using neural networks as classifier for the voiced/unvoiced decision at the frame level. It was observed that this procedure provided thresholds whose values yield better results. Detailed information and additional data about voiced/unvoiced decision methods can be found in [Rab78], [Hes83] and [Kie97].

Since the speech signal conditions in this Diploma Thesis are similar, these thresholds are kept for all calculations computed in it. Before these values were adopted, it was verified that they distinguish voiced and unvoiced frames reliably. The Praat program was employed to compare the regions selected as voiced. Both programs coincided consistently in which regions were classified as voiced; however, the Verbmobil program seemed to yield more accurate boundaries of the voiced regions, while Praat created, in certain cases, overly long regions, which included some undesirable unvoiced sounds.

7.1.2 Fundamental Frequency Contour.

7.1.2.1 Previous remark.

This section deals with the fundamental frequency (F0 or pitch) of a periodic signal, which is the inverse of its period T (see figure 7.1). The period is defined as the smallest positive member of the set of time shifts that leave the signal invariant, and it only makes sense for a perfectly periodic signal. The speech signal results from a source of sound modulated by a transfer (filter) function (see figure 7.3) determined by the shape of the supra-laryngeal vocal tract, according to the source-filter theory described in section 3.3.2. This theory stems from the experiments of Johannes Müller (1848), who tested a functional theory of phonation by blowing air through larynges excised from human cadavers.

Obviously, a signal cannot be switched on or off or modulated without losing its perfect periodicity, and this combination causes the speech signal to be only quasi-periodic, due to small period-to-period variations in the vocal cord vibration or in the vocal tract shape. Therefore, the art of fundamental frequency estimation is to deal with this information in a consistent and useful way.

7.1.2.2 Difficulties in estimating pitch contour.

F0 is considered one of the most important features for the characterisation of emotions and is the acoustic correlate of the perceived pitch. Its perception by the human ear is non-linear and depends on the frequency. In addition, the human voice is not a pure sinusoid, but a complex combination of diverse frequencies.

Estimating the pitch of voiced speech sounds has always been a complex problem. Though it appears to be a rather simple task on the surface, there are many subtleties that need to be kept in mind. F0 is usually defined, for voiced speech, as the rate of vibration of the vocal folds. Periodic vibration at the glottis may produce speech that is less than perfectly periodic, due to the changes in the shape of the vocal tract that filters the glottal source waveform, making it hard to estimate the fundamental periodicity from the speech waveform.

Therefore, F0 estimation involves a large number of considerations; it can be influenced by many factors such as phone-intrinsic parameters or coarticulation. Furthermore, the excitation signal itself is not truly periodic, but shows small variations in period duration (jitter) and in period amplitude (shimmer). These aperiodicities take the form of relatively smooth changes in amplitude, rate or glottal waveform shape (for example the duty cycle of open and closed phases), of intervals where the vibration seems to reflect several superimposed periodicities (diplophony), or of intervals where glottal pulses occur without obvious regularity of time interval or amplitude (glottalisations, vocal creak or fry). They do not contribute to speech intelligibility, but to the naturalness of human speech. Therefore, the mapping between physical acoustics and perceived prosody is neither linear nor one-to-one; as said before, variations in F0 are the most direct cause of pitch perception, but amplitude and duration also affect pitch and make its estimation more intricate.

While there are many successfully implemented pitch estimation algorithms (s. [Che01, Hes83]), none of them works without making certain assumptions about the sound being analysed, and each has to face many difficulties and to admit certain failures. The next paragraphs give a brief historical overview of the different methods tried; it can be seen how, starting from the first method ever employed, they meet with diverse limitations.

The first method tried was simply to low-pass-filter the speech signal in order to remove all harmonics and then measure the fundamental frequency by any convenient means. This method faces two difficulties. First, the filter had to be adaptive, because pitch can easily cover a 2-to-1 range and the filter always had to pass the fundamental and reject the second harmonic. The filter frequency was set by tracking the pitch and predicting the forthcoming pitch value; hence any error in one frame of speech could cause the filter to select the wrong cut-off frequency in the next frame and so lose track of the pitch altogether. The second difficulty arose from the fact that in many cases pitch had to be estimated from speech in which the fundamental frequency was missing. For instance, in telephone speech the frequency response drops off rapidly below 300 Hz; hence for many male voices the fundamental frequency is absent or so weak as to be lost in the system noise.

In the absence of the fundamental, it is common to search for periodicities in a signal by examining its autocorrelation function. For a periodic function, the autocorrelation shows a maximum at a lag equal to the period of the function. A first problem is that speech is not exactly periodic, because of changes in pitch and in formant frequencies; therefore, the maximum may be lower and broader than expected, causing problems in setting the decision threshold. Another problem arises from the possibility that the first formant frequency is equal to or below the pitch frequency. If its amplitude is particularly high, this situation can yield a peak in the autocorrelation function that is comparable to the peak belonging to the fundamental. As a remedy, a pitch tracking process is used; such a process can usually ride out a single error, but not a string of errors.
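
To make the idea concrete, the following C sketch estimates the period of one analysis window from the normalised autocorrelation. It is a generic illustration of the autocorrelation approach, not the algorithm used in this thesis; the 35-550 Hz search range anticipates section 7.1.2.3, and the 0.3 peak threshold is an arbitrary placeholder.

#define FS 16000

/* Estimate F0 (Hz) of one analysis window by locating the maximum of the
   normalised autocorrelation within the admissible lag range.
   Returns 0.0 if no acceptable peak is found.                            */
double f0_autocorr(const double *s, int len)
{
    int min_lag = FS / 550;               /* highest F0 considered: 550 Hz */
    int max_lag = FS / 35;                /*  lowest F0 considered:  35 Hz */
    if (max_lag >= len) max_lag = len - 1;

    double r0 = 0.0;
    for (int n = 0; n < len; n++) r0 += s[n] * s[n];
    if (r0 <= 0.0) return 0.0;

    int    best_lag = 0;
    double best_r   = 0.0;
    for (int lag = min_lag; lag <= max_lag; lag++) {
        double r = 0.0;
        for (int n = 0; n + lag < len; n++) r += s[n] * s[n + lag];
        r /= r0;                          /* normalise by the window energy */
        if (r > best_r) { best_r = r; best_lag = lag; }
    }
    /* Reject weak peaks: 0.3 is an arbitrary placeholder threshold. */
    return (best_r > 0.3 && best_lag > 0) ? (double)FS / best_lag : 0.0;
}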

Pitch can be determined either from periodicity in the time domain or from regularly spaced harmonics in the frequency domain. Consequently, pitch estimation techniques can be classified into two main groups:

• period-synchronous procedures: These methods try to follow the periodic characteristics of the signal, e.g. positive zero-crossings, and estimate the signal period from this information.

• short-term analysis procedures (window-based): The short-term variety of estimators operates on a block (short-time frame) of speech samples and, for each of these frames, one pitch value is estimated. The series of estimated values yields the fundamental frequency contour of the signal. There are different short-time analysis procedures, e.g. cross- or autocorrelation, or algorithms that operate in the frequency domain. Spectral procedures transform the frames spectrally to enhance the periodicity information in the signal; periodicity then appears as peaks in the spectrum at the fundamental and its harmonics.

Period-synchronous procedures have the advantage of being generally faster and perform adequately in most applications. Short-term methods are considered more accurate and robust, due to the higher precision of calculating one changing attribute in a shorter time interval. In addition, they are less affected by noise and do not require complex post-processing. Consequently, a short-term analysis procedure is used in this thesis for the F0 calculation.

7.1.2.3 Description of the algorithm.

The program used for the F0 and energy contour calculations is part of the prosodic module employed in the second-phase prototype of the Verbmobil project. The procedure was developed in previous works at the Chair for Pattern Recognition of the Friedrich-Alexander-Universität Erlangen-Nürnberg and is detailed in several works (s. [Kom89, Not91, Pen93, Har94, Kie97]). Consequently, only a brief description is given here.

Fundamental frequency estimation through a window-based procedure

This procedure performs a short-term analysis, which works in the spectral domain and computes F0 sequentially. As already clarified, since F0 only makes sense for voiced frames, the voiced/unvoiced decision must be the first step when the F0 estimation problem is faced. The way this decision is made was detailed in section 7.1.1.

For the prosodic analysis of the human voice, F0 is usually expected to lie in the interval between 35 Hz and 550 Hz. According to the Shannon sampling theorem [Sha49], an analog signal must be sampled at at least twice its highest frequency in order to be recovered without any losses. In order to respect this theorem, voiced regions are low-pass-filtered with a cut-off frequency of 1100 Hz. Through this limitation of the maximum F0 to 550 Hz, noise and errors affect the algorithm less. The low-pass-filtered signal is then resampled at a lower sampling frequency (downsampling) in order to reduce the number of signal values that must be processed; consequently, the F0 estimation is accelerated. For the resulting frames, the short-time spectrum is calculated through the Fast Fourier Transform (FFT, s. [Nie83]).

The procedure is based on the assumption that the absolute maximum of the short-time spectrum corresponds to one harmonic of F0. The main difficulty of the algorithm is to find a proper decision rule that chooses the correct maximum of the spectrum inside a voiced frame. This decision is made here indirectly through an implemented Dynamic Programming (DP) procedure. For every estimated F0 value (one per voiced frame), several alternative candidates (obtained through dividers) are allowed. The dividers of all the frames in one voiced region hence yield a matrix, which is used by DP with a specific cost function to find the optimal F0 path. This cost function takes into account the distance to adjacent candidates and the distance to a known target value. For reasons of robustness, this target value is calculated from the voiced frames with the maximum of the signal energy using a multi-channel procedure.

Different possible candidates are calculated for every target value using correlation methods (periodic AMDF procedures, s. [Ros74]) and frequency-domain procedures (Seneff procedures, s. [Sen78]); the median of these values yields the target value of the voiced interval. The arithmetic mean of all the target values of the speech signal is the reference point R, which is applied for the divider determination within every voiced frame. For each frame t, the spectrum from the start frame S to the end frame E of the voiced interval is considered, and the frequency Ft with maximum energy in this spectrum is calculated. With the help of the divisors Kt (≈ Ft/R), the matrix J, containing the various F0 candidates, is defined:

[pic] with [pic] (7.5)

Preliminary tests showed that the correct F0 value is mostly included when five candidates are considered (n = 2). With the help of a recursive cost function and by means of DP, the best path through the matrix J can then be found, which finally yields the F0 contour of the voiced region.

In addition, the procedure has some other advantages. On the one hand, F0 values are not estimated in isolation for every frame; instead, the cost function establishes a relation with the nearest neighbours, so that their spectral characteristics are also taken into account. On the other hand, short irregular periods produce no perturbation of the results this way. An additional benefit is that the computational cost for every single frame, where the estimated value is calculated, is limited. For a further description of the cost function see [Pen93] and [Kie97].
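
The path search itself can be illustrated with a small DP sketch in C. The candidate matrix J, the weighting factor w and the cost terms are hypothetical simplifications of the actual cost function described in [Pen93] and [Kie97].

#include <math.h>
#include <float.h>

#define MAX_FRAMES 1000    /* T must not exceed this */
#define NCAND      5       /* five F0 candidates per frame (n = 2) */

/* Find the minimum-cost path through the candidate matrix J[t][c].
   Local cost: distance to the target value; transition cost: distance
   between adjacent candidates, weighted by the (hypothetical) factor w. */
void best_path(double J[][NCAND], int T, double target, double w,
               double *f0_out)
{
    static double cost[MAX_FRAMES][NCAND];
    static int    back[MAX_FRAMES][NCAND];

    for (int c = 0; c < NCAND; c++)
        cost[0][c] = fabs(J[0][c] - target);

    for (int t = 1; t < T; t++)
        for (int c = 0; c < NCAND; c++) {
            double best = DBL_MAX;
            for (int p = 0; p < NCAND; p++) {
                double v = cost[t-1][p] + w * fabs(J[t][c] - J[t-1][p]);
                if (v < best) { best = v; back[t][c] = p; }
            }
            cost[t][c] = best + fabs(J[t][c] - target);
        }

    int c = 0;                                 /* best final candidate */
    for (int k = 1; k < NCAND; k++)
        if (cost[T-1][k] < cost[T-1][c]) c = k;
    for (int t = T - 1; t >= 0; t--) {         /* backtrack the path   */
        f0_out[t] = J[t][c];
        c = back[t][c];
    }
}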

Post-processing of the F0 Contour

Independently of the F0 calculation method employed, post-processing is undoubtedly favourable, since the direct application of the raw F0 values to further prosodic feature calculations would be clearly inadequate. Post-processing of the F0 values is motivated by several reasons:

• Automatic algorithms for the F0 extraction generate errors.

• Values of F0 are not calculated for every single frame of the signal.

• Fluctuations between adjacent F0 values are disturbing under certain conditions.

• Calculations from the F0 contour are dependent on the voice reference (e.g. maximum).

Several possibilities for post-processing the fundamental frequency contour can be found in [Hes83]. In the framework of this work, post-processing is accomplished in the following steps:

• Smoothing of the F0 curve through a median filter.

• Zero-setting of all the F0 values between 35 Hz and 60 Hz (before interpolation).

• Interpolation of the unvoiced interval.

• Semitone transformation and mean value subtraction.

Small failures of the algorithm can yield undesirable noise in the contour. Smoothing of the F0 curve with a median filter is employed in order to remove some of these small failures. Smoothing increases the signal-to-noise ratio and allows the signal characteristics (peak position, height, width, area, etc.) to be measured more accurately.
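
A median smoothing of the contour can be sketched in C as follows; the window of five frames is an assumed length, since the thesis does not state the filter order.

#include <string.h>

#define MED_LEN 5                     /* assumed median window: 5 frames */

/* Median of a small buffer (insertion sort, since MED_LEN is tiny). */
static double median(const double *v, int n)
{
    double tmp[MED_LEN];
    memcpy(tmp, v, n * sizeof(double));
    for (int i = 1; i < n; i++) {
        double key = tmp[i];
        int j = i - 1;
        while (j >= 0 && tmp[j] > key) { tmp[j+1] = tmp[j]; j--; }
        tmp[j+1] = key;
    }
    return tmp[n/2];
}

/* Median-filter an F0 contour of len frames into out[]. Edges are copied. */
void median_smooth(const double *f0, double *out, int len)
{
    int h = MED_LEN / 2;
    for (int t = 0; t < len; t++) {
        if (t < h || t >= len - h) { out[t] = f0[t]; continue; }
        out[t] = median(f0 + t - h, MED_LEN);
    }
}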


Figure 7.4. Smoothing. The right peak is the result of smoothing the left noisy peak.

The zero-setting of all the values between 35 and 60 Hz before the interpolation is mainly adequate when the recordings are carried out by means of WOZ dialogues. The start and end points of the uttered expression are usually classified as voiced, due to the system response and the contribution of the human voice also present in these parts. The F0 values contained in such intervals habitually fall in the range between 35 and 60 Hz. The zero-setting thus accounts for the system response over the whole utterance.

Though F0 values are not computed over unvoiced frames, a continuous F0 contour is desirable for further feature calculation. Therefore, interpolation over the unvoiced frames is required. Numerous alternatives exist for interpolation over intervals whose F0 cannot be calculated. In the present Diploma Thesis, as proposed in [Kie97], linear interpolation is applied, with extrapolation used exclusively at the beginning and at the end of the phrase.
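
The interpolation step might look as follows in C, where unvoiced frames are marked by F0 = 0; the constant extension at the utterance boundaries is a simple stand-in for the extrapolation mentioned above.

/* Interpolate an F0 contour in place: frames with f0[t] == 0 (unvoiced)
   are filled linearly between the surrounding voiced frames. At the
   utterance boundaries the nearest voiced value is extended.           */
void interpolate_f0(double *f0, int len)
{
    int prev = -1;                     /* index of last voiced frame seen */
    for (int t = 0; t < len; t++) {
        if (f0[t] == 0.0) continue;    /* skip unvoiced frames            */
        if (prev < 0) {
            for (int k = 0; k < t; k++) f0[k] = f0[t];   /* leading gap   */
        } else {
            for (int k = prev + 1; k < t; k++)           /* inner gap     */
                f0[k] = f0[prev] + (f0[t] - f0[prev]) *
                        (double)(k - prev) / (double)(t - prev);
        }
        prev = t;
    }
    if (prev >= 0)                     /* trailing gap                    */
        for (int k = prev + 1; k < len; k++) f0[k] = f0[prev];
}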

In addition, in order to approximate the human ear response, a semitone transformation is performed on the resulting interpolated F0 contour using the following function:

F0_{ST} = c \cdot \ln\left(\frac{F0}{F0_{ref}}\right) (7.6)

By choosing c = 12/ln(2), the semitones relate to 1 Hz as reference value F0ref; for normalisation, the mean value of the F0 contour is subtracted from every F0 value.
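
In C, the semitone transformation and the mean subtraction can be sketched as follows, assuming the contour has already been interpolated so that all values are positive.

#include <math.h>

/* Transform an interpolated F0 contour (Hz) to semitones relative to
   1 Hz, eq. (7.6) with c = 12/ln(2), and subtract the contour mean.   */
void semitone_normalise(double *f0, int len)
{
    const double c = 12.0 / log(2.0);
    double mean = 0.0;

    for (int t = 0; t < len; t++) {
        f0[t] = c * log(f0[t]);        /* semitones, reference 1 Hz */
        mean += f0[t];
    }
    mean /= len;
    for (int t = 0; t < len; t++)
        f0[t] -= mean;                 /* zero-mean contour          */
}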

7.1.3 Energy Contour.

The coupling of loudness perception with an acoustic measurement is as complex as the coupling of perceived tone pitch with the computable F0. The sensation of loudness depends both on the frequency of the sound and on its duration and, the other way round, pitch perception depends on loudness (s. [Zwi67]). This complex interdependence is not directly taken into account in the following algorithm; the energy and F0 values are stored in a feature vector and, consequently, an implicit standardisation takes place.

The basic calculation procedures used for the computation of energy, as the acoustic correlate of perceived loudness, are based on relations between the physical acoustic pressure ps, measured in Pascal (1 Pa = 1 N/m²), and the acoustic intensity Is, whose unit is W/m²; Is is proportional to ps². With the help of the acoustic intensity reference value I0 = 1 pW/m² and the acoustic pressure reference value p0 = 20 µPa, which represent the human auditory threshold at mid-range frequencies, the absolute acoustic pressure level in decibels (dB) is given by:

L_s = 10 \cdot \log_{10}\left(\frac{I_s}{I_0}\right) = 20 \cdot \log_{10}\left(\frac{p_s}{p_0}\right) (7.7)

The acoustic magnitude loudness quantifies the perceived intensity ratio between two tones; a 1 kHz tone with a loudness of 40 phon (acoustic pressure level of 40 dB) is applied as reference. To a good approximation, loudness varies proportionally to the third root of the intensity.

Automatic computation of the energy contour can be achieved through different methods. In this Diploma Thesis a general method is employed, using the following formula:

E_i = \sum_{n=-\infty}^{\infty} T[s_n] \cdot w_{i-n} (7.8)

Here, T[·] represents a convenient transformation of the signal values sn, and wn corresponds to an adequate window function used to extract the desired segments of the signal. Values outside the used window are set to 0, in order to keep the procedure finite. There are many possibilities for the choice of the transformation and of the windowing function. In the loudness calculation process, a Hamming window wnH (figure 7.5) has been used, with the form:

w_n^H = 0.54 - 0.46 \cdot \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \quad (\text{0 otherwise}) (7.9)


Figure 7.5. The Hamming Window.

Here, N represents the window size in samples. The rectangular window gives maximum sharpness but large side-lobes (ripples), while the Hamming window blurs in frequency but produces much less leakage.

For the loudness calculation, the reference value I0 is needed, which can no longer be extracted directly from digitised signals. For a 16-bit quantisation and a maximum acoustic pressure level of 60 dB, which represents a standard value during normal conversation, I0 is computed with equation (7.7) as follows:

I_0 = \frac{(MaxRange/2)^2}{10^{60/10}} = \frac{32768^2}{10^6} \approx 1074 (7.10)

Using Hamming windows wnH of 40 ms duration, and thus with 16000 · 0.040 = 640 samples (N = 640), the intensity value Is of frame i can be estimated through the following expression:

I_{s,i} = \frac{1}{N} \sum_{n=0}^{N-1} \left( s_n \cdot w_n^H \right)^2 (7.11)

The effective loudness value Lhi of the frame i can therefore be estimated through its relation to the intensity as follows:

Lh_i = \sqrt[3]{\frac{I_{s,i}}{I_0}} (7.12)

In this Diploma Thesis, both loudness and energy describe this magnitude and are used as synonyms. For further details on different energy calculation procedures or windowing functions, refer to the corresponding section in [Kie97].
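
Putting equations (7.9) to (7.12) together, the per-frame loudness computation can be sketched in C. The constants follow the reconstructions of (7.10) and (7.11) given above, which are our reading of the text.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define WIN_LEN 640                        /* 40 ms at 16 kHz                */
#define I0      (32768.0 * 32768.0 / 1e6)  /* reference intensity, eq. (7.10) */

/* Loudness of one 640-sample analysis window, eqs. (7.9)-(7.12). */
double frame_loudness(const short *s)
{
    double intensity = 0.0;
    for (int n = 0; n < WIN_LEN; n++) {
        double w = 0.54 - 0.46 * cos(2.0 * M_PI * n / (WIN_LEN - 1));
        double x = s[n] * w;               /* Hamming-windowed sample, (7.9) */
        intensity += x * x;
    }
    intensity /= WIN_LEN;                  /* eq. (7.11)                     */
    return cbrt(intensity / I0);           /* eq. (7.12): third root         */
}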

7.2 Prosodic Features

Previous research on feature extraction for emotion recognition has focused on prosodic features, based on different linguistic units such as utterance vectors [Bat00], word vectors [Hub98] or intervals [Ami01]. In the present work we attempt to recognise emotions from the speech signal given a short command (approximately 2 to 4 seconds), without taking any profit from context or linguistic information. In the long term, the goal of the investigation initiated in this thesis is a speaker- and language-independent emotion classifier. This challenging purpose leads us to deal only with global acoustic features, computed over a whole utterance or command, which seem to be favoured by many recent studies (s. [Del96, Pet00]).

The term prosody, previously introduced in section 3.1, comprises a number of attributes that can be classified into basic or compound characteristics.

The main basic prosodic attributes are loudness, pitch and duration-related attributes such as segment duration, speaking rate and pauses. Compound attributes derive from them; they are intonation, accentuation, prosodic phrases, rhythm and hesitation.

With the aim of mapping emotions on the activation axis (see Chapter 2), we make a classification depending on prosodic characteristics, since most research points to them as the attributes most related to feelings that differ in the activation dimension. With this aim, we extracted features that model logarithmic F0, energy and durational aspects. Here we will mainly deal with acoustic prosodic features that are computed over the whole utterance.

During this work, different kinds of prosodic features have been used, divided into two main groups:

P1 - Features related to basic prosodic attributes (i.e. energy and pitch) and to the pitch derivative. Most features are rooted in statistics over the values of all the frames in a sentence and in linear regression coefficients of the contours. These parameters derive from the studies of [Bat00] and [Del96].

P2 - Features related to compound prosodic attributes, which are more relative and provide information closer to the intonation and to changes in the P1 features. These parameters are based on the features proposed in [Tis93].

Calculations of both sets of features were written in the C programming language, and the description of their extraction method is given below.

7.2.1 P1

In this section, the features of the first set are presented. Each feature is referenced with a number that corresponds to its index within the output vector of the C program that computes this set of features (ppal.c).

7.2.1.1 Energy-based features.

These features derive from the estimated energy contour. For every frame i an energy value Ei exists. For further information about how this curve is obtained, see section 7.1.3.

P1.0 - ENER_MAX: Short-term energy maximum.

Maximum value of the energy curve in the whole utterance. The value is obtained by inspecting the energy values of all the frames within one utterance and selecting the maximum among them.

P1.1 - ENER_MAX_POS: Position of short-term-energy maximum.

Relative time position of the maximum energy value within the utterance. The maximum energy value is P1.0, and its temporal position in the sentence is divided by the overall utterance length. Calculations are made in frames:

ENER_MAX_POS = i_{Emax} / N (7.13)

Where

iEmax= frame position of the maximum energy value on the time axis.

N= number of frames in the whole utterance.
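
For illustration, P1.0 and P1.1 reduce to a few lines of C (a hypothetical helper, not the original ppal.c code).

/* Short-term energy maximum (P1.0) and its relative position (P1.1). */
void ener_max_features(const double *E, int N,
                       double *max_out, double *pos_out)
{
    int i_max = 0;
    for (int i = 1; i < N; i++)
        if (E[i] > E[i_max]) i_max = i;
    *max_out = E[i_max];                 /* P1.0: ENER_MAX               */
    *pos_out = (double)i_max / N;        /* P1.1: ENER_MAX_POS, eq. (7.13) */
}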

P1.2 - ENER_MIN: Short-term-energy minimum.

Minimum value of the energy curve in the whole utterance. The value is obtained by inspecting the energy values of all the frames within one utterance and selecting the minimum among them.

P1.3 - ENER_MIN_POS: Position of short-term-energy minimum.

Relative time position of the minimum energy value within the utterance. The minimum energy value is P1.2, and its temporal position in the sentence is divided by the overall utterance length. Calculations are made in frames:

ENER_MIN_POS = i_{Emin} / N (7.14)

Where

iEmin = frame position of the minimum energy value on the time axis.

N = number of frames in the whole utterance.

P1.4 - ENER_REG_COEF: Regression coefficient for short-term-energy.

Slope coefficient of the regression line for the energy curve values in the utterance.

b_E = \frac{\sum_{i=1}^{N} (i - \mu_i)(E_i - \mu_E)}{\sum_{i=1}^{N} (i - \mu_i)^2} (7.15)

With

\mu_i = \frac{1}{N} \sum_{i=1}^{N} i (7.16)

\mu_E = \frac{1}{N} \sum_{i=1}^{N} E_i (7.17)

Where

i = frame position on the time axis.

Ei = Estimated energy in the ith frame according to the algorithm described in section 7.1.3.

N = Number of frames in the whole utterance.

P1.5 - ENER_SQR_ERR: Mean square error for regression coefficient for short-term-energy.

Mean square error value between the regression line and the real energy curve.

MSE_E = \frac{1}{N} \sum_{i=1}^{N} \left( E_i - \hat{E}_i \right)^2 (7.18)

With

\hat{E}_i = a_E + b_E \cdot i (7.19)

a_E = \mu_E - b_E \cdot \mu_i (7.20)

Where

i = frame position on the time axis.

Ei = Estimated energy in the ith frame according to the algorithm described in section 7.1.3.

N = Number of frames in the whole utterance.
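
P1.4 and P1.5 can be sketched together in C, following the reconstructions of equations (7.15) to (7.20) given above.

/* Linear regression slope of the energy contour (P1.4) and the mean
   square error of the fit (P1.5), following eqs. (7.15)-(7.20).      */
void ener_regression(const double *E, int N, double *slope, double *mse)
{
    double mu_i = 0.0, mu_E = 0.0;
    for (int i = 0; i < N; i++) { mu_i += i; mu_E += E[i]; }
    mu_i /= N;  mu_E /= N;

    double num = 0.0, den = 0.0;
    for (int i = 0; i < N; i++) {
        num += (i - mu_i) * (E[i] - mu_E);
        den += (i - mu_i) * (i - mu_i);
    }
    double b = num / den;                /* slope, eq. (7.15)           */
    double a = mu_E - b * mu_i;          /* intercept, eq. (7.20)       */

    double err = 0.0;
    for (int i = 0; i < N; i++) {
        double d = E[i] - (a + b * i);   /* regression line, eq. (7.19) */
        err += d * d;
    }
    *slope = b;
    *mse   = err / N;                    /* eq. (7.18)                  */
}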

P1.6 - ENER_MEAN: Mean of short-term-energy.

Mean energy value calculated over the whole utterance. Energy values of all the frames in a sentence are summed and then divided by the total number of frames.

\mu_E = \frac{1}{N} \sum_{i=1}^{N} E_i (7.21)

P1.7 - ENER_VAR: Variance of short-term-energy.

Variance of the energy values over the whole utterance.

\sigma_E^2 = \frac{1}{N} \sum_{i=1}^{N} \left( E_i - \mu_E \right)^2 (7.22)

Where

µE = energy mean (P1.6).

7.2.1.2 Fundamental-frequency-based features.

These features are extracted from the estimated F0 curve, i.e. the logarithmic (semitone-transformed) and interpolated F0 curve. F0i represents the F0 value of the ith frame. For a further description of how this curve is obtained, see section 7.1.2.

Since fundamental frequency only makes sense inside voiced frames, all the F0-related results are confined to voiced regions, where a ‘voiced region’ is understood as a speech interval containing more than three successive voiced frames. For further information about the voiced/unvoiced decision see section 7.1.1.

P1.8 - F0_MAX: F0 maximum.

Maximum value of the F0 curve in the voiced parts of the utterance. The value is obtained by inspecting the pitch values of all the frames labelled as voiced in the utterance and selecting the maximum among them.

P1.9 - F0_MAX_POS: Position of F0 maximum on time axis.

Relative time position of the maximum F0 value within the utterance. The maximum pitch value is P1.8, and its temporal position in the sentence is divided by the overall utterance length. Calculations are made in frames:

F0_MAX_POS = i_{F0max} / N (7.23)

Where

iF0max = frame position of the maximum F0 value on the time axis.

N= number of frames in the whole utterance.

P1.10 - F0_MIN: F0 minimum.

Minimum value of the F0 curve in the voiced parts of the utterance. The value is obtained by inspecting the pitch values of all the frames labelled as voiced in the utterance and selecting the minimum among them.

P1.11 - F0_MIN_POS: Position of F0 minimum on time axis.

Relative time position of the minimum F0 value within the utterance. The minimum pitch value is P1.10, and its temporal position in the sentence is divided by the overall utterance length. Calculations are made in frames:

F0_MIN_POS = i_{F0min} / N (7.24)

Where

iF0min = frame position of the minimum F0 value on the time axis.

P1.12 - F0_REG_COEF: Regression coefficient for F0.

Slope coefficient of the regression line for the F0 curve values in the utterance.

b_{F0} = \frac{\sum_{i=1}^{N} (i - \mu_i)(F0_i - \mu_{F0})}{\sum_{i=1}^{N} (i - \mu_i)^2} (7.25)

With

\mu_i = \frac{1}{N} \sum_{i=1}^{N} i (7.26)

\mu_{F0} = \frac{1}{N} \sum_{i=1}^{N} F0_i (7.27)

Where

i = frame position on the time axis.

F0i = Estimated pitch in the ith frame according to the algorithm described in 7.1.2.

N = Number of frames in the whole utterance.

P1.13 - F0_SQR_ERR: Mean square error for regression coefficient.

Mean square error value between the regression line and the real F0 curve.

MSE_{F0} = \frac{1}{N} \sum_{i=1}^{N} \left( F0_i - \hat{F0}_i \right)^2 (7.28)

With

\hat{F0}_i = a_{F0} + b_{F0} \cdot i (7.29)

a_{F0} = \mu_{F0} - b_{F0} \cdot \mu_i (7.30)

Where

i = frame position on the time axis.

F0i = Estimated pitch in the ith frame according to the algorithm described in section 7.1.2.

N = Number of frames in the whole utterance.

P1.14 - F0_MEAN: F0 mean.

Mean F0 value calculated over the voiced regions of the utterance. Pitch values of all the voiced frames in a sentence are summed and then divided by the total number of voiced frames.

\mu_{F0} = \frac{1}{N_V} \sum_{i \in V} F0_i (7.31), with V the set of voiced frames and NV their number.

P1.15 - F0_VAR: F0 variance.

Variance of the F0 values over the voiced regions in the utterance.

\sigma_{F0}^2 = \frac{1}{N_V} \sum_{i \in V} \left( F0_i - \mu_{F0} \right)^2 (7.32)

Where

µF0 = pitch mean (P1.14).

P1.36 - Jitter.

Periodic jitter is defined as the relative mean absolute third-order difference of the point process. This feature is exceptionally calculated using Praat and then included in the feature vector. It is computed through the following formula:

jitter = \frac{\frac{1}{N-2} \sum_{i=2}^{N-1} \left| 2 T_i - T_{i-1} - T_{i+1} \right|}{\frac{1}{N} \sum_{i=1}^{N} T_i} (7.33)

Where

Ti = ith interval (period).

N = number of intervals.

For its computation, two arguments are required:

- Shortest period: shortest possible interval that will be considered. For intervals Ti shorter than this, the (i-1)th, ith, and (i+1)th terms in the formula are taken as zero. This argument is set to a very small value, 0.1 ms.

- Longest period: longest possible interval that will be considered. For intervals Ti longer than this, the (i-1)th, ith, and (i+1)th terms in the formula are taken as zero. Establishing the minimum frequency of periodicity at 50 Hz, the value for this parameter is 1/50 Hz = 20 ms; intervals longer than that are considered unvoiced.
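
A C sketch of this jitter measure, under the reconstruction of (7.33) above, might look as follows; Praat's own implementation may differ in details such as the handling of out-of-range periods (skipped entirely here rather than zeroed term by term).

#include <math.h>

/* Relative mean absolute third-order difference of the period sequence
   T[0..N-1], cf. eq. (7.33). Periods outside [t_min, t_max] (seconds)
   invalidate the affected terms, mirroring the Praat arguments.        */
double jitter(const double *T, int N, double t_min, double t_max)
{
    double num = 0.0, mean = 0.0;
    int    terms = 0, valid = 0;

    for (int i = 0; i < N; i++)
        if (T[i] >= t_min && T[i] <= t_max) { mean += T[i]; valid++; }
    if (valid == 0 || N < 3) return 0.0;
    mean /= valid;

    for (int i = 1; i < N - 1; i++) {
        if (T[i-1] < t_min || T[i-1] > t_max) continue;
        if (T[i]   < t_min || T[i]   > t_max) continue;
        if (T[i+1] < t_min || T[i+1] > t_max) continue;
        num += fabs(2.0 * T[i] - T[i-1] - T[i+1]);
        terms++;
    }
    return terms ? (num / terms) / mean : 0.0;
}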

7.2.1.3 Voiced/unvoiced-region-based features.

These features are rooted in the voiced/unvoiced information, which is obtained through an algorithm that assigns 1 to voiced frames and 0 to unvoiced ones. For a further description of the decision algorithm, see section 7.1.1.

P1.16 - F0_FIRST_VCD_FRAME.

F0 value for the first voiced frame in the utterance.

P1.17 - F0_LAST_VCD_FRAME.

F0 value for the last voiced frame in the utterance.

P1.18 - NUM_VOICED_REGIONS.

Number of regions containing more than three successive voiced frames. Regions containing three or fewer voiced frames are not taken into consideration as regions, although their frames are still counted as voiced.

P1.19 - NUM_UNVCD_REGIONS.

Number of regions containing more than three successive unvoiced frames. The same considerations as for P1.18 are used to define regions.
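
Both region counts can be sketched with one C helper (hypothetical, not the original code); flag[] holds the 1/0 voiced decision per frame, and only runs of more than three frames count as regions.

/* Count runs of identical flag values longer than 3 frames.
   With target == 1 this yields NUM_VOICED_REGIONS (P1.18),
   with target == 0 NUM_UNVCD_REGIONS (P1.19).               */
int count_regions(const int *flag, int N, int target)
{
    int regions = 0, run = 0;
    for (int i = 0; i < N; i++) {
        if (flag[i] == target) {
            run++;
            if (run == 4) regions++;   /* "more than three" frames */
        } else {
            run = 0;
        }
    }
    return regions;
}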

P1.20 - NUM_VOICED_FRAMES.

Amount of voiced frames in the utterance. Isolated voiced frames as well as frames belonging to a voiced region are counted.

P1.21 - NUM_UNVCD_FRAMES.

Number of unvoiced frames in the utterance. Isolated unvoiced frames as well as frames belonging to an unvoiced region are counted.

P1.22 - LGTH_LNGST_V_REG.

Length of the longest voiced region. The number of frames for each voiced region is counted and the highest amount is taken as feature P1.22.

P1.23 - LGTH_LNGST_UV_REG.

Length of longest unvoiced region. The number of frames for each unvoiced region is counted and the highest amount is taken as feature P1.23.

P1.24 - RATIO_V_UN_FRMS.

Ratio of number of voiced frames and number of unvoiced frames.

RATIO_V_UN_FRMS = NUM_VOICED_FRAMES / NUM_UNVCD_FRAMES (7.34)

P1.25 - RATIO_V_UN_REG.

Ratio of number of voiced regions and number of unvoiced regions.

RATIO_V_UN_REG = NUM_VOICED_REGIONS / NUM_UNVCD_REGIONS (7.35)

P1.26 - RATIO_V_ALL_FRMS.

Ratio of number of voiced frames and number of all frames.

RATIO_V_ALL_FRMS = NUM_VOICED_FRAMES / N (7.36)

P1.27 - RATIO_UV_ALL_FRMS.

Ratio of number of unvoiced frames and number of all frames.

RATIO_UV_ALL_FRMS = NUM_UNVCD_FRAMES / N (7.37)

7.2.1.4 Pitch-contour-derivative-based features.

The derivative of the F0 contour is computed and similar operations are performed on it. The calculations follow the same procedures as in the F0 case and are therefore only listed.
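
The thesis does not spell out the difference scheme used; a simple first-order difference per frame is one plausible reading.

/* First-order difference of the F0 contour as an approximation of its
   derivative; d[0] is set to 0 since no previous frame exists.        */
void f0_derivative(const double *f0, double *d, int N)
{
    d[0] = 0.0;
    for (int t = 1; t < N; t++)
        d[t] = f0[t] - f0[t-1];        /* per-frame change (10 ms step) */
}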

P1.28 - F0_DER_MAX.

F0 derivative maximum.

P1.29 - F0_DER_MAX_POS.

Relative position of F0 derivative maximum.

P1.30 - F0_DER_MIN.

F0 derivative minimum.

P1.31 - F0_DER_MIN_POS.

Relative position of F0 derivative minimum.

P1.32 - F0_DER_REG_COEF.

Regression coefficient for F0 derivative.

P1.33 - F0_DER_SQR_ERR.

Mean square error for regression coefficient for F0 derivative.

P1.34 -F0_DER_MEAN.

F0 derivative mean.

P1.35 - F0_DER_VAR.

F0 derivative variance.

7.2.2 P2.

This section introduces the features included in the second set. The program used to calculate them is called complex_calcs.c (see chapter 10).

In order to obtain information associated with changes in the signal, the following features result from relations among signal parameters, instead of being direct measurements. In this section, N corresponds to the number of voiced regions in the utterance.

P2.0: Mean of the pitch means in every voiced region.

F0AbsMean = \frac{1}{N} \sum_{n=1}^{N} \mu_{F0,n} (7.38)

Where

µF0,n = mean of the pitch values in the voiced region n.

P2.1: Variance of the pitch means in every region.

F0MeanVar = \frac{1}{N} \sum_{n=1}^{N} \left( \mu_{F0,n} - F0AbsMean \right)^2 (7.39)

Where

µF0,n = mean of the pitch values in the voiced region n.

F0AbsMean = P2.0.

P2.2: Mean of the maximum pitch values in every region.

F0MaxMean = \frac{1}{N} \sum_{n=1}^{N} F0max_n (7.40)

Where

F0maxn = maximum of the pitch values within the voiced region n.

P2.3: Variance of the maximum pitch values in every region.

F0MaxVar = \frac{1}{N} \sum_{n=1}^{N} \left( F0max_n - F0MaxMean \right)^2 (7.41)
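
Features P2.0 to P2.3 reduce to a mean and a variance over per-region values; in C, under the reconstructions above:

/* Mean (out_mean) and variance (out_var) of N per-region values v[],
   covering P2.0/P2.1 (per-region pitch means) and P2.2/P2.3
   (per-region pitch maxima).                                        */
void region_stats(const double *v, int N, double *out_mean, double *out_var)
{
    double mean = 0.0, var = 0.0;
    for (int n = 0; n < N; n++) mean += v[n];
    mean /= N;
    for (int n = 0; n < N; n++) var += (v[n] - mean) * (v[n] - mean);
    *out_mean = mean;
    *out_var  = var / N;
}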


Figure 7.6. F0 contour and points selected for calculations of P2.4 and P2.5.

P2.4: Pitch increasing per voiced region.

This feature takes four points into account inside each voiced part of the utterance (see figure 7.6):

1. Beginning of the voiced region.

2. End of the voiced region.

3. Maximum pitch value.

4. Second maximum pitch value.

The sum of all pitch differences between two successive increasing points, divided by their respective time differences, is computed. The final value for this feature results from the arithmetic mean of this calculation over all voiced parts contained in the utterance.

[pic] (7.42)

Where

i, j = two of the four points considered, where point i appears before point j.

ti, tj = time positions of points i and j.
