



Periodic Progress Report

Research Training Network

Hearing, Organisation and Recognition of Speech in Europe

HOARSE

Contract N°: HPRN-CT-2002-00276

Commencement date of contract: 1/9/2002

Duration of contract (months): 48

Period covered by the report: 1/9/2003 to 31/8/2004

Coordinator:

Professor Phil Green

Department of Computer Science, University of Sheffield

Regent Court, Portobello St.,

Sheffield S1 4DP, UK

Phone: +44 114 222 1828; Fax: +44 114 222 1810; e-mail: p.green@dcs.shef.ac.uk

HOARSE Partners

1. The University of Sheffield [USFD] coordinator

2. Ruhr-Universität Bochum [RUB]

3. DaimlerChrysler AG [DCAG]

4. Helsinki University of Technology [HUT]

5. Institut Dalle Molle d’Intelligence Artificielle Perceptive [IDIAP]

6. Liverpool University [UNILIV]

7. University of Patras [PATRAS]

Part A. Research Results

A.1 Scientific Highlights

At least one young researcher is now in post at each HOARSE lab. Here are the highlights of their work so far.

At Sheffield, doctoral researcher JANA EGGINK continues her work on automatic instrument recognition in polyphonic music under Task 1.5, Auditory Scene Analysis in Music. A system for recognising the solo instrument in accompanied sonatas and concertos has been developed. Compared with the previous system, which used a missing-feature approach for instrument recognition in music with only a small number of concurrent tones, the focus has shifted from the background to the foreground: instead of identifying regions dominated by interfering sound sources, only the harmonic series belonging to the dominant instrument is identified and used for recognition. Test material is taken from commercially available classical music CDs, with no restrictions placed on the background accompaniment. The recognition accuracies achieved are comparable to those of systems developed for monophonic music only. In a further step, knowledge about the solo instrument is used to extract the F0s of the main melodic line played by that instrument. Combining different knowledge sources in a probabilistic framework led to a significant improvement in F0 estimation compared with a baseline system using only bottom-up processing.
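
As an illustration of the foreground-oriented idea described above, the following minimal Python sketch scores candidate F0s by the spectral energy found at their harmonic positions. The 1/h weighting, the candidate grid and all parameter values are our own illustrative choices, not details of the USFD system.

import numpy as np

def harmonic_score(spectrum, freqs, f0, n_harmonics=10):
    # Sum spectral magnitude at integer multiples of a candidate F0.
    # The 1/h weighting crudely discourages octave errors.
    score = 0.0
    for h in range(1, n_harmonics + 1):
        target = h * f0
        if target > freqs[-1]:
            break
        score += spectrum[np.argmin(np.abs(freqs - target))] / h
    return score

# Toy usage: a harmonic tone with F0 = 440 Hz; the candidate whose harmonic
# series captures the most (weighted) energy wins.
sr = 16000
t = np.arange(2048) / sr
tone = sum(np.sin(2 * np.pi * 440 * h * t) / h for h in range(1, 6))
spectrum = np.abs(np.fft.rfft(tone * np.hanning(len(tone))))
freqs = np.fft.rfftfreq(len(tone), 1.0 / sr)
candidates = np.arange(100.0, 1000.0, 10.0)
best = max(candidates, key=lambda f: harmonic_score(spectrum, freqs, f))
print(f"best F0 candidate: {best:.0f} Hz")  # approximately 440 Hz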

At Bochum, doctoral researcher JUHA MERIMAA has concentrated on HOARSE Tasks 2.1 (Researching the precedence effect), 2.2 (Reliability of auditory cues in multi-source scenarios) and 2.3 (Perceptual models of room reverberation with application to speech recognition). A novel auditory modelling mechanism has been proposed that predicts localization under precedence-effect, multi-source and reverberant conditions. The model is currently being investigated further by gathering new experimental data on the precedence effect. The perception of room reverberation has also been investigated in a study of spatial impression. The first part of this work involved developing the experimental methods and finding and training suitable test subjects for the listening experiments. The ongoing work concentrates on the effect of conflicting binaural cues on perception, and on grouping the cues into those related to sound sources and those related to acoustical environments. Furthermore, a novel method for multi-channel loudspeaker reproduction of room reverberation has been developed in collaboration with HUT, leading to several joint papers.
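
The published form of this model [Faller & Merimaa 04] selects binaural cues only at instants of high interaural coherence. The sketch below is a much simplified, frame-based caricature of that selection step; the frame length, hop and threshold are illustrative assumptions, not values from the model.

import numpy as np

def select_itds(left, right, frame_len=512, hop=256, c_min=0.95):
    # Estimate one ITD per frame, but keep it only when the normalised
    # cross-correlation peak (a simple coherence measure) is high, i.e.
    # when direct sound plausibly dominates reverberation and interferers.
    itds = []
    n = min(len(left), len(right))
    for start in range(0, n - frame_len, hop):
        l = left[start:start + frame_len]
        r = right[start:start + frame_len]
        xcorr = np.correlate(l, r, mode="full")
        denom = np.sqrt(np.dot(l, l) * np.dot(r, r)) + 1e-12
        k = int(np.argmax(np.abs(xcorr)))
        if np.abs(xcorr[k]) / denom >= c_min:
            itds.append(k - (frame_len - 1))  # ITD in samples
    return itds

# Toy usage: white noise reaching the right ear 5 samples after the left.
rng = np.random.default_rng(0)
src = rng.standard_normal(16000)
left, right = src[5:], src[:-5]
print(select_itds(left, right)[:5])  # consistently -5 under this convention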

At DCAG, doctoral researcher JULIEN BOURGEOIS is working on Task 4.2, Informing Speech Recognition. This year he concentrated on comparing linear blind source separation (BSS) methods with minimum-variance (beamforming) techniques for separating driver and co-driver speech in cars. We observed experimentally that BSS performs poorly when microphones are placed on the roof, as close as possible to the mouth of each speaker. We examined this theoretically and showed that when the input signal-to-interference ratio (SIR) at the microphone is above a certain threshold, BSS cannot provide any crosstalk reduction, whereas beamforming still improves the SIR. Another limitation of BSS methods is their slower convergence on so-called non-causal mixtures, which arise, for example, when both speakers lie in the same half-plane defined by the microphone positions. As a consequence, it is not advantageous to incorporate spatial prior information (available in cars) as hard constraints on the separation filters; this finding is confirmed in other experimental settings. Classical beamforming methods are therefore preferable whenever reasonable speaker activity detection can be achieved. Under Task 5.1, Speech recognition evaluation in multi-speaker conditions, DCAG made additional recordings using the commercial S-Klasse mirror beamformer and close-talking microphones. In further multi-speaker recognition evaluations on these recordings, BSS methods achieved a smaller word error rate reduction than beamforming.
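
For readers less familiar with the beamforming side of this comparison, the sketch below shows the simplest member of that family, a delay-and-sum beamformer, together with an SIR measurement. It is purely illustrative: the S-Klasse mirror beamformer mentioned above is a proprietary design, and all signals and delays here are synthetic.

import numpy as np

def delay_and_sum(mics, steering_delays):
    # Advance each channel by its steering delay (in samples) so the desired
    # talker adds coherently while the interferer adds incoherently.
    n = min(len(m) for m in mics)
    out = np.zeros(n)
    for sig, d in zip(mics, steering_delays):
        out[: n - d] += sig[d:n]
    return out / len(mics)

def sir_db(target, interferer):
    # Signal-to-interference ratio in dB.
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interferer ** 2))

# Toy usage: target aligned on both microphones, interferer offset by
# 8 samples between them. Tracking the two components separately is a
# simulation convenience so input and output SIR can both be measured.
rng = np.random.default_rng(1)
target = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
out_target = delay_and_sum([target, target], [0, 0])
out_noise = delay_and_sum([noise, np.roll(noise, 8)], [0, 0])
print(f"SIR in: {sir_db(target, noise):.1f} dB, "
      f"SIR out: {sir_db(out_target, out_noise):.1f} dB")  # roughly +3 dB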

At HUT Helsinki, doctoral researcher EVA BJÖRKNER, working on HOARSE Tasks 3.1, Glottal excitation estimation, and 3.2, Voice production studies, has studied physiological differences between chest and head register in the female singing voice. Oral airflow was recorded for a sequence of /pae/ syllables sung at constant pitch and decreasing vocal loudness in each register by seven female musical theatre singers, and the recordings were inverse filtered. Ten equidistantly spaced subglottal pressure (Ps) values were selected, and the relationships between Ps and several parameters were examined; the normalised amplitude quotient (NAQ) was used to measure glottal adduction. In parallel, the development and evaluation of inverse filtering has been studied using physiological modelling of voice production as well as high-speed digital imaging of vocal fold vibration. The experiment thus combines a range of Ps values with NAQ as a measure of glottal adduction.
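
The NAQ referred to above has a compact definition (Alku et al., 2002): the peak-to-peak amplitude of the glottal flow divided by the product of the negative peak amplitude of the flow derivative and the period length. A minimal sketch follows, with a crude synthetic pulse standing in for a real inverse-filtered flow; the pulse shape and all parameter values are illustrative assumptions.

import numpy as np

def naq(glottal_flow, f0, sr):
    # NAQ = f_ac / (d_peak * T): peak-to-peak flow amplitude divided by the
    # magnitude of the negative derivative peak times the fundamental period.
    T = 1.0 / f0
    f_ac = glottal_flow.max() - glottal_flow.min()
    d_peak = abs((np.diff(glottal_flow) * sr).min())
    return f_ac / (d_peak * T)

# Toy usage: one cycle of a squared half-sine standing in for a flow pulse.
sr, f0 = 16000, 220.0
t = np.arange(int(sr / f0)) / sr
pulse = np.maximum(0.0, np.sin(2 * np.pi * f0 * t)) ** 2
print(f"NAQ = {naq(pulse, f0, sr):.3f}")  # 1/(2*pi) for this pulse shape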

At IDIAP, doctoral researcher VIKTORIA MAIER studied contextual and temporal information in speech and its use in ASR. The relevant HOARSE Task is 4.2, Informing Speech Recognition.

• The classic experiment of Liberman et al (1952) was re-designed, and a perceptual test was run on a group of 37 listeners. The results were broadly consistent with Liberman et al (1952), and their implications for HMM-based speech recognition systems were discussed.

• The relationship between the number of emitting states in a model and phoneme duration has been analysed (see the sketch below).
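
One concrete reason these two quantities interact is that a left-to-right HMM without skip transitions must spend at least one frame in every emitting state, so the state count sets a hard floor on the durations the model can represent. A minimal sketch; the 10 ms frame shift is the common ASR default, not a figure taken from the report.

# A left-to-right HMM without skip transitions must occupy each emitting
# state for at least one frame, imposing a hard minimum phone duration.
def min_duration_ms(n_emitting_states, frame_shift_ms=10):
    return n_emitting_states * frame_shift_ms

for n in (1, 3, 5):
    print(f"{n} emitting state(s): minimum duration {min_duration_ms(n)} ms")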

Viktoria Maier is about to leave the HOARSE network and continue her doctoral work at Sheffield, under different funding.

Also at IDIAP, doctoral researcher PETR SVOJANOVSKY is extending the TRAP-TANDEM model proposed by IDIAP. The main effort is directed towards universal classifiers of frequency-localized patterns, extending (Hermansky and Jain, Eurospeech 2003). Recently, an interesting and apparently effective method has emerged from Svojanovsky's work: a classifier trained on a particular frequency band is applied, unchanged, at all other frequencies. The HOARSE task involved here is 4.3, Advanced ASR Algorithms.
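
The band-independence idea can be caricatured in a few lines: train a classifier on temporal patterns (TRAPs) of band energy from one band and test it on another. Everything below (the synthetic rising/falling trajectories, the network size, the 51-frame pattern length) is an illustrative assumption, not IDIAP's actual configuration.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def synthetic_traps(n, length=51):
    # Toy stand-in for TRAPs: temporal trajectories of band energy, with
    # two classes (rising vs falling) instead of phonetic targets.
    t = np.linspace(-1.0, 1.0, length)
    slopes = rng.choice([-1.0, 1.0], size=n)
    X = slopes[:, None] * t + 0.3 * rng.standard_normal((n, length))
    y = (slopes > 0).astype(int)
    return X, y

# Train on patterns from one "band", apply unchanged to another "band".
X_train, y_train = synthetic_traps(400)
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
clf.fit(X_train, y_train)
X_other, y_other = synthetic_traps(200)
print("accuracy on the unseen band:", clf.score(X_other, y_other))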

Svojanovsky was also involved in ASR experiments with nonsense syllables. In principle, this database allows automatic recognizers to be evaluated independently of any language-level constraints. This work falls under HOARSE Task 5.1, Speech recognition evaluation in multi-speaker conditions.

Doctoral researcher GUILLAUME LATHOUD is working under Task 5.2 (Signal and speech detection in sound mixtures) on overlaps between speakers. Previously proposed microphone-array-based speaker segmentation methods were extended into a generic short-term segmentation/tracking framework [Lathoud et al. 04] that successfully copes with an unknown number of speakers at unknown locations. An audiovisual database called AV16.3, including a variety of multi-speaker cases, 3D location annotation and some speech/silence segmentation annotation, is now accessible online [Lathoud et al. 04]. Recent work has focused on sector-based detection and localization of multiple sources [Lathoud et al. 04].
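
Microphone-pair localization of the kind underlying this work typically starts from a time-delay-of-arrival estimate. The sketch below shows the standard GCC-PHAT estimator as generic background machinery; it is not the sector-based method of [Lathoud et al. 04] itself.

import numpy as np

def gcc_phat(x, y, sr):
    # Generalised cross-correlation with phase transform: whiten the cross
    # spectrum, transform back, and read the time delay off the peak.
    n = len(x) + len(y)
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))
    lags = np.arange(-(n // 2), n // 2 + 1)
    return lags[int(np.argmax(cc))] / sr  # TDOA in seconds

# Toy usage: the first channel is a 12-sample delayed copy of the second.
rng = np.random.default_rng(2)
s = rng.standard_normal(4096)
tdoa = gcc_phat(np.roll(s, 12), s, sr=16000)
print(f"estimated delay: {tdoa * 16000:.0f} samples")  # about 12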

At Liverpool, the work of doctoral researcher ELVIRA PEREZ has concentrated on Task 1.3, Active/passive speech perception. After a year on a Fulbright fellowship she has now returned to Liverpool. Two sets of experiments were conducted to test whether listeners actively predict the temporal or spectral nature of masking sounds. The experiments evaluated speech intelligibility in two contexts:

• regularly spaced and randomly spaced noise bursts (to test temporal prediction)

• a predictable or unpredictable frequency-modulated sinewave that could be integrated into the speech percept or heard as a separate sound (to test spectral prediction).

Both experiments confirm that our ability to segregate signals from maskers does not exploit (or rely on) regularity of the masker. A paper on this work is in preparation.

Also at Liverpool, post-doctoral researcher PATTI ADANK has worked on Task 1.4, Envelope information and binaural processing. Adank concentrated on the use of voice characteristics to help the segregation of simultaneous speakers. Previous work has shown that listeners are able to segregate spatially disparate signals much better when they are spoken by different speakers (Darwin and Hukin, 2000). We hypothesized that a two-stage process may first segregate the signals on F0 and then bind components together using cues such as speaker location or voice characteristics (cf. Darwin et al., 2003). Important voice characteristics are local amplitude modulation (flutter) and random F0 variation (jitter). We tested whether jitter can be used as a primary or secondary segregation cue, because previous modelling work (Ellis 1993) has shown that jitter can be extracted by computational models and used for grouping. In a first experiment, synthetic vowel pairs were synthesized with a range of jitter and F0 values. We showed that while F0 differences lead to improved recognition, manipulation of F0 jitter does not, and we therefore conclude that jitter is not a primary grouping cue. In a second set of experiments, listeners were presented with sentences synthesized with pitch and jitter differences, to test whether jitter might aid stream formation. Again, our results show that introducing jitter does not aid the segregation of sentences. This leaves the intriguing question of how speaker-specific information aids stream formation. A technical report on this work is available.
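
For illustration, one simple way to realise an F0 jitter manipulation of the kind described is to perturb the period of an excitation pulse train cycle by cycle. The synthesis below is our own minimal sketch, not the stimulus generation actually used in the experiments.

import numpy as np

def jittered_pulse_train(f0, jitter_pct, duration, sr, seed=0):
    # Impulse train whose period is perturbed cycle by cycle; jitter_pct is
    # the standard deviation of the perturbation as a percentage of the period.
    rng = np.random.default_rng(seed)
    out = np.zeros(int(duration * sr))
    t = 0.0
    while t < duration:
        out[int(t * sr)] = 1.0
        t += (1.0 / f0) * (1.0 + 0.01 * jitter_pct * rng.standard_normal())
    return out

# Toy usage: a 120 Hz excitation with 3% cycle-to-cycle jitter; filtering it
# through a vowel-shaped filter would give a jittered synthetic vowel.
train = jittered_pulse_train(f0=120.0, jitter_pct=3.0, duration=1.0, sr=16000)
print(int(train.sum()), "pulses in 1 s")  # about 120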

Other work at Liverpool has addressed Task 2.2, Reliability of auditory cues in multi-source scenarios. A key question for systems that must integrate multiple cues is how to combine and weight the cues available. At Liverpool, a range of experiments examining combination rules for low-level auditory and visual motion signals was conducted. Three models of cue integration were formalised: independent decisions, probability summation (i.e. combination of independent local decisions) and linear summation (i.e. direct integration of the signals before a decision is made). The results show that human observers use probability summation for signals that are not ecologically plausible and linear summation for signals that are. The work was presented at ICA2004, Kyoto, and a paper on this topic has been accepted for publication.
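
In their textbook forms (the paper's exact formalisation may differ in detail), the two summation rules can be written down directly; the criterion placement in the example is an illustrative assumption.

import numpy as np
from scipy.stats import norm

def probability_summation(p1, p2):
    # Independent local decisions: detect if either channel fires.
    return 1.0 - (1.0 - p1) * (1.0 - p2)

def linear_summation_dprime(d1, d2):
    # Signals added before a single decision: with equal-variance Gaussian
    # noise in each channel, sensitivities add but noise grows by sqrt(2).
    return (d1 + d2) / np.sqrt(2.0)

# Toy numbers: two channels of equal single-channel sensitivity d' = 1,
# with an unbiased criterion placed at d'/2 (an illustrative assumption).
d = 1.0
p_single = norm.cdf(d / 2.0)
print(f"probability summation: hit rate "
      f"{probability_summation(p_single, p_single):.3f}")
print(f"linear summation: combined d' = {linear_summation_dprime(d, d):.3f}")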

Liverpool are collaborating with Bochum and Sheffield on Task 4.2, Informing Speech Recognition. Liverpool carried out initial studies that use linear prediction of the energy in 32 channels of an auditory filterbank to predict noise spectra from past data. The results, based on the AURORA noises, show that short-term prediction should yield much better noise estimates than measures such as the long-term average. The gains are larger for non-stationary noises than for stationary noises, for which the long-term average is already a good predictor. The current aim is to record a database of typical environmental noises so that the system can be evaluated on a reasonable sample of sounds. With help from Bochum, Liverpool built a set of in-ear microphones that can be used with a DAT recorder to record binaural environmental sounds, and are now collaborating with Sheffield to make the recordings.
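
A minimal sketch of per-channel linear prediction of a noise-energy track follows. The 32-channel filterbank and the use of past data come from the text above; the prediction order, the track shape and the least-squares fitting are illustrative assumptions.

import numpy as np

def lp_predict_next(history, order=4):
    # Least-squares linear prediction of the next value of one channel's
    # noise-energy track from its recent past.
    rows = np.array([history[i:i + order] for i in range(len(history) - order)])
    targets = history[order:]
    coeffs, *_ = np.linalg.lstsq(rows, targets, rcond=None)
    return history[-order:] @ coeffs

# Toy usage: a slowly modulated energy track (one of 32 channels, say) is
# predicted far better by short-term LP than by its long-term average.
t = np.arange(200)
track = 1.0 + 0.5 * np.sin(2 * np.pi * t / 40.0)
pred = lp_predict_next(track[:-1])
print(f"LP: {pred:.3f}  true: {track[-1]:.3f}  "
      f"long-term mean: {track[:-1].mean():.3f}")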

The team at Patras is engaged on several HOARSE tasks. The post-doctoral researcher involved is JOHN WORLEY (previously at Bochum).

Task 2.3, Perceptual models of room reverberation with application to speech recognition: work has been performed on the use of smoothed room response measurements. The tests have illustrated some novel aspects of response measurements when they are employed for real-time room acoustics compensation, and also the robustness of the method based on smoothed room responses. This work forms the starting point for the further tests described under Task 2.4.
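
In published Patras work the smoothing in question is a complex smoothing of the room transfer function; the sketch below shows only its simplest relative, fractional-octave smoothing of the magnitude response, as an illustration of the operation. All parameter values are our own choices.

import numpy as np

def octave_smooth(mag, freqs, fraction=3):
    # Average the magnitude response over a sliding window of constant
    # relative bandwidth (here 1/3 octave).
    smoothed = np.empty_like(mag)
    half_band = 2.0 ** (1.0 / (2.0 * fraction))
    for i, f in enumerate(freqs):
        if f <= 0.0:
            smoothed[i] = mag[i]
            continue
        band = (freqs >= f / half_band) & (freqs <= f * half_band)
        smoothed[i] = mag[band].mean()
    return smoothed

# Toy usage: smooth a deliberately ripply "room response" magnitude.
freqs = np.linspace(0.0, 8000.0, 1024)
mag = 1.0 + 0.5 * np.sin(freqs / 40.0)
smoothed = octave_smooth(mag, freqs, fraction=3)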

Task 2.4, Speech enhancement for reverberant environments: John Worley has designed an experiment that tests the spatial quality and sound quality of a complex smoothed room response filter. The initial stage of the experiment, now complete, involved building two graphical user interfaces to obtain subjective data on various aspects of spatial quality (source width, envelopment and room size) and sound quality (phase clarity, spectral balance, loudness and overall sound quality). The testing will reveal the factors that listeners consider important when assessing the reverberation characteristics of a room. Some work is also in progress on the use of beamforming arrays in speech enhancement and ASR tasks.

Pursuing Task 2.1, Researching the precedence effect, Worley travelled to Bochum to test subjects on the Franssen illusion in rooms of different sizes and with a range of onset transitions. He completed three experiments in Bochum, which show that the traditional illusion breaks down when it is performed in a large hall. The preliminary conclusion is that, for the precedence effect to operate, the secondary signal in the Franssen illusion must not become active until the listener has received the reflections within the room; the secondary signal is then, consistent with the ‘plausibility hypothesis’, perceived as a reflection and the illusion operates.

A.2 Joint Publications and Patents

Publications

IDIAP and USFD

• Andrew C. Morris, Viktoria Maier and Phil Green, “From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition”, in International Conference on Spoken Language Processing (ICSLP), Jeju Island, Korea, 2004

HUT and USFD

• Palomäki, K., Brown, G. and Barker, J., “Techniques for Handling Convolutional Distortion with ‘Missing Data’ Automatic Speech Recognition”, Speech Communication, vol. 43, no. 1-2, pp. 123-142, 2004

• Palomäki, K., Brown, G. and Wang, D., “A Binaural Processor for Missing Data Speech Recognition in the Presence of Noise and Small-Room Reverberation”, Speech Communication, 2004. In press.

HUT and RUB

• Merimaa, J. and Pulkki, V., “Perceptually-Based Processing of Directional Room Responses for Multichannel Loudspeaker Reproduction”, Proc. IEEE WASPAA, New Paltz, NY, USA, 2003, pp. 51-54

• Pulkki, V., Merimaa, J. and Lokki, T., “Multi-Channel Reproduction of Measured Room Responses”, 18th International Congress on Acoustics, Kyoto, Japan, 2004, pp. II 1273-1276

• Pulkki, V., Merimaa, J. and Lokki, T., “Reproduction of Reverberation with Spatial Impulse Response Rendering”, AES 116th Convention, Berlin, Germany, 2004, Preprint 6057

• Merimaa, J. and Pulkki, V., “Spatial Impulse Response Rendering”, 7th International Conference on Digital Audio Effects (DAFx'04), Naples, Italy, 2004. Invited paper.

Patents

HUT and RUB

• Pulkki, V., Merimaa, J. and Lokki, T., “A Method for Reproducing Natural or Modified Spatial Impression in Multichannel Listening”. International patent application, filed March 2004.

Part B - Comparison with the Joint Programme of Work (Annex I of the contract)

B.1 Research Objectives

The research objectives, as set down in Annex I of the contract, are still relevant and achievable. There are inevitable shifts in perspective and emphasis, to reflect scientific progress and the expertise and interests of the young researchers we have recruited.

B.2 Research Method

There were no additions to our methodological toolkit during the reporting period.

B.3 Work Plan

B3.1 Breakdown of tasks

We have made no changes to the task structure since year 1, though we recognise that the following table is looking somewhat dated.

B3.2 Schedule and milestones: Table 1

Note that here we are reporting on the work of the HOARSE teams, rather than the work of the young researchers alone.

|Task |Title |Lead Partner |12-Month Milestone |24-Month Milestone |Comments |

|1.2 |Modelling grouping integration by multisource decoding |USFD |Incorporation of noise estimation into oscillator-based grouping |Mask-level integration |Multisource decoding theory journal article published in Speech Communication |

|1.3 |Active/passive speech perception |Liverpool |Planning experiments |Experiments conducted |Experiments conducted |

|1.4 |Envelope information and binaural processing |Liverpool |Preliminary experiments |Experiments and analysis |Experiments conducted |

|1.5 |Auditory Scene Analysis in Music |USFD |F0 estimation |Development of a two-stage (lower and cognitive) precedence effect model |Second system completed |

|2.1 |Researching the precedence effect |RUB |Psychoacoustic experiments on the precedence effect in realistic scenarios |Development of a localisation model using an automatic weighting function for binaural cues |Model completed [Faller & Merimaa 04]. Further psychoacoustical experiments being conducted. Some work at Patras, in conjunction with Bochum, on the relationship of the precedence effect and the Franssen illusion |

|2.2 |Reliability of auditory cues in multi-source scenarios |RUB |The importance of single binaural cues in various multisource environments determined in psychoacoustic experiments |Extension to multiple sources and practical room conditions |Completed [Braasch 03], [Braasch et al 03], [Braasch & Blauert 03]. Research at RUB extended to spatial impression and the separation of binaural cues into source-related and room-related |

|2.3 |Perceptual models of room reverberation with application to speech recognition |Patras |Integrated response/signal perceptual model for a single source in reverberant environments |Extension to multiple sources |Significant part of the work completed |

|2.4 |Speech enhancement for reverberant environments |Patras |Research into auto-directive arrays, controlled from the perceptual directivity module |Development of new parameterisation techniques for the voice source |Some work completed (test interfaces ready), to be supplemented by subjective tests. Missing data techniques for handling reverberation developed at Sheffield |

|3.1 |Glottal excitation estimation |HUT |Research on combining new AR (autoregressive) models with inverse filtering |Inverse filtering experiments on intensity regulation of speech with soft and extremely loud voices |On schedule |

|3.2 |Voice production studies |HUT |Inverse filtering experiments on high-pitched voices |Research on the relationship between the main effects of the glottal flow (fundamental frequency, phonation type etc.) and brain functions using MEG |On schedule |

|3.3 |Voice production and cortical speech processing |HUT |Development of DSP algorithms for parameterisation of the voice source; getting familiar with MEG | |Ongoing |

|4.1 |Developments in MultiSource Decoding |USFD |Probabilistic decoding constraints |Design of predictive noise estimation algorithms. Known BSS algorithms adopted as a common base for evaluation |Probabilistic decoding implemented in current software. Adaptive noise estimation implemented in multisource models |

|4.2 |Informing Speech Recognition |Liverpool |Design of predictive noise estimation algorithms. Known BSS algorithms adopted as a common base for evaluation |HMM2 & DBM adaptation |Work at DCAG and IDIAP |

|4.3 |Advanced ASR Algorithms |IDIAP |Multistream adaptation |Assessment report 1. Targets for assessment report 2 |Work on this task reported at Eurospeech 03 and IEEE ASRU 03 |

|5.1 |Speech recognition evaluation in multi-speaker conditions |DCAG |Database specification. Targets for assessment report 1 | |First recognition test in a multi-speaker environment using separation algorithms (BSS and beamforming) |

|5.2 |Signal and speech detection in sound mixtures |IDIAP |Analysis of auditory cues |ASR performance for simulated deteriorated speech tested |Work reported: Ajmera et al 2003, Lathoud et al 2003 |

|5.3 |Speech technology assessment by simulated acoustic environments |RUB |Simulation environment for hands-free communication developed | |Completed and integrated into the IKA telephone line simulation tool. ASR, speaker recognition and speech synthesis assessment experiments carried out |

B3.3 Research effort in the reporting period: Table 2

|Participant |Young researchers financed by the contract (person-months) |Researchers financed from other sources (person-months) |Researchers contributing to the project (number of individuals) |

|1. USFD |12 |48 |1YR + 5 others=6 |

|2. RUB |15.5 |30 |2YRs + 4 others=6 |

|3. DCAG |12 |24 |1 YR + 2 others=3 |

|4. HUT |12 |12 |1YR + 2 others =3 |

|5. IDIAP |23.5 |24 |3YRs + 3 others=6 |

|6. LIVERPOOL |12 |24 |2YRs + 2 others=4 |

|7. PATRAS |7 |6 |1YR+ 1 other = 2 |

|Totals |94 |168 |11YR+19 other = 30 |

B.4 Organisation and Management

B4.1 Organisation and management

HOARSE is being managed in the way described in Annex 1 of the contract. The non-executive director is Dr. Jordan Cohen of VoiceSignal Inc., Boston, MA. Administration is handled at USFD by Gillian Callaghan (g.callaghan@dcs.shef.ac.uk).

B4.2 Communication Strategy

Most communication within HOARSE is conducted electronically. The HOARSE web site is . Meeting records and so on are on password-protected pages on that site. The email address for the whole network is hoarse@dcs.shef.ac.uk.

B4.3 Network Meetings

Our pattern is to hold a HOARSE workshop every 6 months. Most of the time is devoted to research updates: every young researcher makes a presentation, and we also have update talks from academics where appropriate. There is much discussion. Each meeting begins with a report from the coordinator and ends with a session planning activities for the next 6 months. Prior to this, the non-executive director has an opportunity to give feedback on the progress of the network. The workshops are scheduled over 2 days, and the steering committee meets at some point during this period. In the reporting period the following workshops were held:

• 3rd Workshop, hosted by IDIAP, 5-6 September 2003

• 4th Workshop, hosted by USFD, 20-21 February 2004

The non-executive director was present at both workshops. He is treated as an external expert for funding purposes. Dr. Stefan Launer of Phonak (a Swiss-based hearing aid company) attended the 3rd workshop as an external expert.

B4.4 Networking

HOARSE policy is that each young researcher should spend at least a week with each network partner.

Visits in the reporting period by members of the teams were as follows:

• Viktoria Maier from IDIAP to Sheffield, February 04

• Eva Björkner from HUT to Sheffield, March 04

• John Worley from Patras to Bochum

• Juha Merimaa from Bochum to Patras, May 04

• Juha Merimaa of RUB to HUT, December 2003

The following visits are planned in the 3rd year:

• Jana Eggink from Sheffield to HUT

• Julien Bourgeois from DCAG to IDIAP

• Guillaume Lathoud from IDIAP to DCAG

B.5 Training

B5.1 Publicising Positions

HOARSE opportunities have been publicised by means of

• email lists such as those maintained by ELSNET (European Language and Speech Network), ISCA (International Speech Communication Association) and SALT (UK Speech and Language Technology).

• The IHP network vacancies site

• The HOARSE web site

Though we have not been overwhelmed with applications, there has been a steady stream of high-quality ones. We are not recruiting at the moment, though there may be some further opportunities later.

B5.2 Recruitment Progress: Table 3

Recruitment has gone well: all partners have YRs in place.

|Participant |Contract deliverable of Young Researchers to be financed by the contract (person-months) |Young Researchers financed by the contract so far (person-months) |

| |Pre-doc (a) |Post-doc (b) |Total (a+b) |Pre-doc (c) |Post-doc (d) |Total (c+d) |

|1. USFD |18 |18 |36 |24 |0 |24 |

|2. RUB |18 |18 |36 |18 |9 |27 |

|3. DCAG |18 |18 |36 |24 |0 |24 |

|4. HUT |18 |18 |36 |18 |0 |18 |

|5. IDIAP |18 |18 |36 |32 |0 |32 |

|6. LIVERPOOL |18 |18 |36 |8 |11 |19 |

|7. PATRAS |18 |18 |36 |0 |7 |7 |

B5.3 Integration

We feel we have created an informal atmosphere in which young researchers can readily integrate with more experienced researchers and with their peers. It is difficult to be precise about how we have achieved this, but the quality of the interactions at workshops and the value of the discussions have been high. A young post-doctoral researcher comments:

"Having a mix of Ph.D and post-doctoral researchers, in addition to the more senior members of the group gives a synergistic effect. Since, the Ph.D students can learn from the direct contact with recent post-docs and the post-doc gains experience of advisement and discussion in an informal environment."

Our policy for encouraging integration into the network is that each young researcher has a supervisor in the host lab and an advisor in a different lab, usually but not necessarily within the network. These arrangements are as follows:

Table 4: Supervisors and Advisors

|Young Researcher |Supervisor |Advisor |

|Jana Eggink |Guy Brown, USFD |Georg Meyer, Liverpool |

|Juha Merimaa |Jens Blauert, RUB |Matti Karjalainen, HUT |

|John Worley |John Mourjopoulos, Patras |Jens Blauert, RUB |

|Julien Bourgeois |Udo Haiber, DCAG |Iain McCowan, IDIAP |

|Eva Björkner |Paavo Alku, HUT |Johan Sundberg, KTH, Sweden |

|Guillaume Lathoud |Hervé Bourlard, IDIAP |Klaus Linhard, DCAG |

|Elvira Perez |Georg Meyer, Liverpool |Martin Cooke, USFD |

|Patti Adank |Georg Meyer, Liverpool |Guy Brown, USFD |

|Viktoria Maier |Hynek Hermansky, IDIAP |Martin Cooke, USFD |

|Petr Svojanovsky |Hynek Hermansky, IDIAP |Roger Moore, USFD |

B5.4 Training Measures

At the IDIAP workshop there was a training session on the ‘smart meeting room’ facility.

Many Universities provide complementary skills programmes for researchers, and HOARSE students are encouraged to take advantage of these. An example is the Research Training Programme at USFD, which Jana Eggink has completed. There is a similar programme at Liverpool. At RUB, John Worley has taken a German language course.

B5.5 Equal Opportunities

We have taken no special equal opportunities measures, but 50% of the young researchers HOARSE has recruited are female.

B5.6 Multidisciplinarity

In HOARSE, multidisciplinarity is so central to the research that young researchers receive training across discipline boundaries every day. We have recruited from a variety of backgrounds: mathematics, phonetics, linguistics and music for instance. Much of our work involves a combination of experimental work, perhaps with human listeners, and computational or mathematical modelling.

B5.7 Industrial Training

HOARSE has a full industrial partner in DCAG, and in future years we anticipate young researchers spending time there to learn about the priorities and strengths of industrial research.

B.6 Difficulties

We are delighted to have recruited so many high-quality young researchers. Our only problem has been at Patras, where negotiations with two candidates foundered on the difficulty non-nationals face in enrolling for a PhD in Greece, and on the practical difficulty of travelling to Patras, which is relatively distant from the rest of mainland Europe.

Part C - Summary Reports by Young Researchers

Patti Adank’s report has been sent to the Commission.

Publications

Publications by young researchers

In print

• Eggink, J. and Brown, G.J. (2004): Instrument recognition in accompanied sonatas and concertos. Proc. International Conference on Acoustics, Speech, and Signal Processing, ICASSP'04, pp. 217-220

• Eggink, J. and Brown, G.J. (2004): Extracting melody lines from complex audio. Proc. International Conference on Music Information Retrieval, ISMIR'04

• Viktoria Maier and Hynek Hermansky, “Perception of synthetic consonant-vowel stimuli”, in Multimodal Interaction and Related Machine Learning Algorithms (MLMI), Martigny, Switzerland, 2004

• G. Lathoud, I.A. McCowan, and J.M. Odobez. Unsupervised Location-Based Segmentation of Multi-Party Speech. Proceedings of the 2004 NIST Meeting Recognition Workshop (NIST-RT04).

• J. Ajmera, G. Lathoud and I.A. McCowan. Clustering and Segmenting Speakers and their Locations in Meetings. Proceedings of the 2004 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-04).

• D. Zhang, D. Gatica-Perez, S. Bengio, I.A. McCowan and G. Lathoud. Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework. Proceedings of CVPR 2004.

• D. Gatica-Perez, G. Lathoud, I.A. McCowan and J.M. Odobez. A Mixed-State I-Particle Filter for Multi-Camera Speaker Tracking. Proceedings of the 2003 IEEE Int. Conf. on Computer Vision Workshop on Multimedia Technologies for E-Learning and Collaboration (ICCV-WOMTEC), 2003.

• G. Lathoud, I.A. McCowan, and D.C. Moore. Segmenting Multiple Concurrent Speakers Using Microphone Arrays. Proceedings of Eurospeech 2003.

• D. Gatica-Perez, G. Lathoud, I.A. McCowan, J.M. Odobez and D.C. Moore. Audio-Visual Speaker Tracking with Importance Particle Filters. Proceedings of the 2003 IEEE International Conference on Image Processing (ICIP-2003).

• G. Lathoud and I.A. McCowan. Location based speaker segmentation. Proceedings of the 2003 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-03).

• I.A. McCowan, S. Bengio, D. Gatica-Perez, G. Lathoud, F. Monay, D.C. Moore, P. Wellner and H. Bourlard. Modeling human interactions in meetings. Proceedings of the 2003 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-03).

• G. Lathoud, J.M. Odobez and D. Gatica-Perez. AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking. IDIAP Research Report 04-28, 2004.

• D. Zhang, D. Gatica-Perez, S. Bengio, I.A. McCowan and G. Lathoud. Multimodal Group Action Clustering in Meetings. IDIAP Research Report RR 04-24, 2004.

• I.A. McCowan, D. Gatica-Perez, S. Bengio, and G. Lathoud. Automatic Analysis of Multimodal Group Actions in Meetings. IDIAP Research Report 03-27, 2003.

• Eva Björkner, Johan Sundberg, Tom Cleveland and Ed Stone: “Voice source characteristics in different registers in classically trained musical theatre singers”, Proc. ICA2004, Kyoto, Japan, April 4-10, 2004; accepted for publication in Journal of Voice.

• Julien Bourgeois and Klaus Linhard. Frequency-Domain Multichannel Signal Enhancement: Minimum-Variance vs. Minimum Correlation. Eusipco 2004, Vienna.

• Merimaa, J: Auditorily Motivated Analysis of Directional Room Responses, 1st ISCA Tutorial & Research Workshop on Auditory Quality of Systems, Akademie Mont-Cenis, Germany, 2003. Invited talk (no written paper).

• Merimaa, J. & Hess, W: Training of Listeners for Evaluation of Spatial Attributes of Sound, AES 117th Convention, San Francisco, CA, USA, 2004.

In Press

• Fousek, P., Svojanovsky, P., Grezl, F. and Hermansky, H.: New Nonsense Syllables Database - Analyses and Preliminary ASR Experiments, in ICSLP 2004, Jeju Island, Korea

• G. Lathoud and I.A. McCowan. A Sector-Based Approach for Localization of Multiple Speakers with Microphone Arrays. To appear in Proceedings of the 2004 ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA-2004).

• G. Lathoud, J.M. Odobez and D. Gatica-Perez. AV16.3: an Audio-Visual Corpus for Speaker Localization and Tracking. To appear in proceedings of the 2004 MLMI Workshop, Bengio & Bourlard Eds., Springer-Verlag, 2004.

• D. Zhang, D. Gatica-Perez, S. Bengio, I.A. McCowan and G. Lathoud. Multimodal Group Action Clustering in Meetings. Proceedings of the 2004 ACM International Conference on Multimedia, Workshop on Video Surveillance and Sensor Networks (ACM MM-VSSN), 2004.

• I.A. McCowan, D. Gatica-Perez, S. Bengio, and G. Lathoud. Automatic Analysis of Multimodal Group Actions in Meetings. To appear in the IEEE Transactions on Speech and Audio Processing, 2005.

• Faller, C. & Merimaa, J: Source localization in complex listening situations: Selection of binaural cues based on interaural coherence, J. Acoust. Soc. Am., vol. 116, no 5, Nov. 2004.
