A neural oscillator model of binaural auditory attention

Stuart N. Wrigley and Guy J. Brown (University of Sheffield)

Motivation

Our ability to make sense of a world which is almost constantly filled with sounds has been investigated by numerous psychoacousticians ever since the term ‘cocktail party effect’ was introduced by Cherry (1953). The effect refers to the ability of a listener at a party, faced with multiple competing sound sources, to separate out and participate in a single conversation; the problem generalises to the question of how a listener makes sense of any complex auditory environment.

The auditory system must separate an acoustic mixture in order to create a perceptual description of each sound source. It has been proposed that this process of auditory scene analysis (ASA) takes place in two conceptual stages: segmentation in which the acoustic mixture is separated into its constituent ‘atomic’ units, followed by grouping in which units that are likely to have arisen from the same source are recombined. The perceptual ‘object’ produced by auditory grouping is called a stream. Each stream describes a single sound source.

Attempts to create computer models that mimic auditory scene analysis have led to the field of study known as computational auditory scene analysis (CASA) (e.g. Cooke, 1991/1993; Mellinger, 1991; Brown, 1992). Interestingly, few CASA systems attempt to incorporate attentional effects; typically, ASA is seen as a precursor to attentional mechanisms, which simply select one stream as the attentional focus. This assumption may be flawed: recent work by Carlyon et al. (2001) has suggested that attention plays a key role in the formation of streams, as opposed to the conventional view that attention merely selects a pre-constructed stream. Their study aimed to manipulate attention rigorously by monaurally presenting an alternating tone sequence capable of being streamed (van Noorden, 1975). To prevent attention being directed to the tone sequence, subjects were required to perform a distractor task in the contralateral ear. The experiment relied on the finding that the streaming percept tends to build up over time: at the beginning of the sequence, listeners perceive a single galloping rhythm, whereas towards the end, the rhythm is lost and only tone bursts at a single frequency are heard (Anstis and Saida, 1985). It was found that when attention was directed away from the alternating tones, the streaming percept did not build up. However, when listeners were instructed to ignore the distractor task and concentrate solely on the alternating tone sequence, the streaming percept built up as normal. Hence, it was concluded that attention is required for the creation of streams.

It has been proposed that attention can be divided into two different levels (Spence and Driver, 1994): a low-level exogenous mechanism, which is stimulus-driven and groups acoustic elements to form streams, and a higher-level endogenous mechanism, which is under conscious control and performs stream selection. Exogenous attention may overrule conscious (endogenous) selection (e.g. in response to a sudden loud bang). The work presented here incorporates both types of attention into a model of auditory grouping.

Novel aspects of work

Little work has been conducted on computational models of auditory attention. This study extends the work of Wrigley and Brown (2002) to incorporate a binaural attentional framework into a computational auditory scene analysis model. The model comprises a network of neural oscillators which performs stream segregation on the basis of oscillatory correlation (Wang, 1996). An important aspect of the model is its use of one-dimensional networks: sound input is processed on a frame-by-frame basis. The output of this processing is an ‘attentional stream’: a description of which frequencies are being attended at each epoch.

Method

von der Malsburg and Schneider (1986) proposed an extension to the temporal correlation theory of von der Malsburg (1981) by using oscillators as the processing substrate: an oscillator being regarded as a model for the behaviour of a single neuron or as a mean field approximation to a group of connected excitatory and inhibitory neurons. Within this mechanism, the phase of an oscillator’s activity can be used to assess synchrony. All oscillators whose activities are above a given threshold at time t are said to be synchronised. Hence, a set of features form a group if the corresponding oscillators oscillate in phase with zero phase lag (synchronisation); oscillators representing different groups oscillate out of phase (desynchronisation). Within this framework, attentional selection can be implemented by synchronising attentional activity with the stream of interest.
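
To make the oscillatory substrate more concrete, the following is a minimal sketch of a single relaxation oscillator of the Terman-Wang form (Terman and Wang, 1995), integrated with a simple Euler scheme; the parameter values, noise term and activity threshold are illustrative assumptions rather than those of the model described here.

```python
# Minimal sketch of a Terman-Wang relaxation oscillator (illustrative parameters).
import numpy as np

def terman_wang_step(x, y, I, dt=0.01, eps=0.02, gamma=6.0, beta=0.1, rho=0.02):
    """One Euler step of a Terman-Wang relaxation oscillator.

    x : fast 'activity' variable; its suprathreshold phase marks when the
        oscillator is active (i.e. part of the currently selected group)
    y : slow recovery variable
    I : total input (external stimulation + excitatory coupling - inhibition)
    """
    noise = rho * np.random.randn()
    dx = 3.0 * x - x ** 3 + 2.0 - y + I + noise
    dy = eps * (gamma * (1.0 + np.tanh(x / beta)) - y)
    return x + dt * dx, y + dt * dy

# Drive one oscillator with a constant positive input and record its active phase.
x, y = -2.0, 0.0
active = []
for _ in range(20000):
    x, y = terman_wang_step(x, y, I=0.8)
    active.append(x > 0.0)          # simple activity threshold on the fast variable

print(f"fraction of time spent in the active phase: {np.mean(active):.2f}")
```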

The auditory periphery is modelled by a bank of gammatone filters whose centre frequencies are distributed on the equivalent rectangular bandwidth (ERB) scale between 50 Hz and 2.5 kHz. Auditory nerve firing rate is approximated by half-wave rectifying and square-root compressing the output of each filter. A correlogram (Brown and Cooke, 1994) is used to extract pitch information from the auditory nerve responses. This allows later stages of the model (the oscillator network) to perform grouping by fundamental frequency (F0) (Bregman, 1990).
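
As a rough illustration of this peripheral stage, the sketch below builds a gammatone filterbank with centre frequencies spaced on the ERB-rate scale and approximates auditory nerve firing rate by half-wave rectification and square-root compression. The channel count, filter order, bandwidth factor and signal duration are assumptions chosen for illustration, not parameters taken from the model.

```python
# Sketch of a gammatone filterbank periphery (all constants are illustrative).
import numpy as np

def erb(f):
    """Equivalent rectangular bandwidth (Glasberg and Moore) in Hz."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def erb_rate(f):
    """ERB-rate (number of ERBs below frequency f)."""
    return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

def inverse_erb_rate(e):
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def centre_frequencies(n_channels=32, f_lo=50.0, f_hi=2500.0):
    """Centre frequencies equally spaced on the ERB-rate scale between f_lo and f_hi."""
    return inverse_erb_rate(np.linspace(erb_rate(f_lo), erb_rate(f_hi), n_channels))

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    """FIR approximation to a fourth-order gammatone impulse response."""
    t = np.arange(0.0, duration, 1.0 / fs)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def periphery(signal, fs):
    """Return an (n_channels, n_samples) array of simulated firing rates."""
    rates = []
    for fc in centre_frequencies():
        bm = np.convolve(signal, gammatone_ir(fc, fs))[: len(signal)]   # basilar membrane motion
        rates.append(np.sqrt(np.maximum(bm, 0.0)))                       # rectify and compress
    return np.array(rates)

# Example: a 500 Hz tone should drive the channels nearest 500 Hz most strongly.
fs = 8000
tone = np.sin(2 * np.pi * 500.0 * np.arange(0.0, 0.2, 1.0 / fs))
rates = periphery(tone, fs)
print(rates.shape, centre_frequencies()[np.argmax(rates.mean(axis=1))])
```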

The core of the model consists of two one-dimensional neural oscillator networks, one per ear; each oscillator in a network corresponds to an individual frequency channel. The networks used here are based upon locally excitatory globally inhibitory oscillator networks (LEGION; see Terman and Wang, 1995). Individual segments (contiguous regions of acoustic energy) are formed by placing local excitatory connections between neighbouring oscillators. A block of channels is deemed to constitute a segment if the cross-channel correlation of the correlogram exceeds a threshold for every channel in the block.
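
The sketch below illustrates one plausible reading of this segment-forming criterion: a correlogram is computed for a frame of firing rates, adjacent channels are compared by a normalised cross-channel correlation, and contiguous runs of channels exceeding a threshold are merged into segments. The threshold value, the maximum lag and the exclusion of single-channel blocks are assumptions introduced for illustration.

```python
# Sketch of cross-channel correlation and segment formation (illustrative values).
import numpy as np

def correlogram(rates, max_lag=200):
    """Autocorrelation of each channel's firing rate over one frame."""
    n_channels, n_samples = rates.shape
    acf = np.zeros((n_channels, max_lag))
    for c in range(n_channels):
        x = rates[c] - rates[c].mean()
        for lag in range(max_lag):
            acf[c, lag] = np.dot(x[: n_samples - lag], x[lag:])
    return acf

def cross_channel_correlation(acf):
    """Normalised correlation between correlogram channels c and c + 1."""
    unit = acf / (np.linalg.norm(acf, axis=1, keepdims=True) + 1e-12)
    return np.sum(unit[:-1] * unit[1:], axis=1)

def form_segments(acf, threshold=0.9):
    """Merge contiguous channels whose cross-channel correlation exceeds the threshold."""
    cc = cross_channel_correlation(acf)
    segments, current = [], [0]
    for c, corr in enumerate(cc):
        if corr > threshold:
            current.append(c + 1)
        else:
            if len(current) > 1:            # single-channel blocks are discarded here
                segments.append(current)
            current = [c + 1]
    if len(current) > 1:
        segments.append(current)
    return segments                          # each segment is a list of channel indices

# Tiny demonstration with synthetic firing rates: two groups of channels share
# a common periodicity and should therefore merge into two separate segments.
rng = np.random.default_rng(0)
t = np.arange(2000)
rates = rng.random((8, 2000)) * 0.1
rates[0:3] += np.cos(2 * np.pi * t / 50.0)   # channels 0-2: period of 50 samples
rates[3] += np.cos(2 * np.pi * t / 30.0)     # isolated channels with unrelated periods
rates[4] += np.cos(2 * np.pi * t / 70.0)
rates[5:8] += np.cos(2 * np.pi * t / 80.0)   # channels 5-7: period of 80 samples
print(form_segments(correlogram(rates)))     # expected: [[0, 1, 2], [5, 6, 7]]
```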

Every oscillator is connected to a global inhibitor, which receives excitation from each oscillator and in turn inhibits every oscillator in each network. This ensures that only one block of synchronised oscillators (corresponding to a perceptual group) can be active at any one time. Excitatory connections are made between segments to promote synchrony if they are consistent with the current F0 estimate. A segment is classed as consistent with the F0 if a majority of its corresponding correlogram channels exhibit a significant peak at the fundamental period. All such segments are then interconnected by (long-range) excitatory links, subject to old-plus-new constraints.
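
A hypothetical form of this F0-consistency test is sketched below: for a given segment, the correlogram channels with a significant peak at the fundamental period are counted, and a majority is required. The definition of ‘significant’ as a fraction of the zero-lag value, and the peak_ratio parameter, are assumptions introduced here.

```python
# Sketch of the majority-vote F0-consistency test (peak_ratio is an assumption).

def consistent_with_f0(acf, segment_channels, f0, fs, peak_ratio=0.7):
    """True if most channels in the segment peak at the fundamental period.

    acf              : (n_channels, max_lag) correlogram for the current frame
    segment_channels : channel indices making up the segment
    f0               : current fundamental frequency estimate in Hz
    fs               : sampling rate in Hz
    peak_ratio       : fraction of the zero-lag value counted as a 'significant' peak
    """
    period_lag = int(round(fs / f0))
    if period_lag >= acf.shape[1]:
        return False                       # fundamental period outside the correlogram range
    votes = sum(acf[c, period_lag] > peak_ratio * acf[c, 0] for c in segment_channels)
    return votes > len(segment_channels) / 2
```

In the full model this test gates the long-range excitatory links between segments; it is shown here in isolation.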

The old-plus-new heuristic (Bregman, 1990) refers to the auditory system’s preference to ‘interpret any part of a current group of acoustic components as a continuation of a sound that just occurred’. This is incorporated into the model by attaching ‘age trackers’ (slow time-scale leaky integrators) to each channel of the network. Excitatory links are placed between harmonically related segments only if the two segments are of similar age.
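
The sketch below illustrates the age-tracker idea under stated assumptions: each channel has a slow leaky integrator that charges while the channel carries a segment and decays otherwise, and a harmonicity link between two segments is permitted only if their mean ages are similar. The time constants, the tolerance and the example timings are illustrative values, not those of the published model.

```python
# Sketch of per-channel 'age trackers' and the similar-age test (illustrative constants).
import numpy as np

def update_ages(ages, channel_active, dt=0.01, tau_rise=0.5, tau_decay=2.0):
    """One frame's leaky-integrator update of every channel's age."""
    target = channel_active.astype(float)            # 1 where a segment is currently present
    tau = np.where(channel_active, tau_rise, tau_decay)
    return ages + dt * (target - ages) / tau

def similar_age(ages, segment_a, segment_b, tolerance=0.2):
    """Permit a harmonicity link only if the two segments' mean ages are close."""
    return abs(ages[segment_a].mean() - ages[segment_b].mean()) < tolerance

# Example: an 'old' group of channels versus one that starts much later.
ages = np.zeros(32)
active = np.zeros(32, dtype=bool)
active[5:10] = True                                  # old segment, present from frame 0
for frame in range(300):
    if frame == 250:
        active[20:25] = True                         # new segment arrives near the end
    ages = update_ages(ages, active)
print(similar_age(ages, list(range(5, 10)), list(range(20, 25))))   # False: ages differ
```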

Each oscillator is connected to an attentional leaky integrator (ALI) by excitatory links; the strength of these connections is modulated by endogenous attention. The attentional interest itself is modelled as a Gaussian for each ear, in accordance with the gradient model of attention (Mondor and Bregman, 1994). Initially, the connection weights between the oscillator array and the ALI are strong: all segments feed excitation to the ALI, and so all segments are attended. During sustained activity, these weights relax towards the endogenous attentional interest vector, so that strong weights remain for channels of high attentional interest and weak weights for channels of low attentional interest. ALI activity therefore coincides only with the activity of channels within the attentional interest peak, together with any harmonically related (synchronised) activity outside that peak; all other activity falls within a trough of ALI activity. This behaviour allows both individual tones and harmonic complexes to be attended using only a single attentional interest peak. A segment or group of segments is said to be attended if its oscillatory activity coincides temporally with a peak in the ALI activity.
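
As an illustration of this attentional stage, the sketch below combines a Gaussian attentional interest vector, connection weights that relax from strong initial values towards that vector, and a leaky integrator driven by the weighted oscillator activity. All constants (Gaussian width, relaxation rate, ALI time constant and threshold) and the helper names are assumptions for illustration; in the described model the weights relax only during sustained activity, a condition omitted here for brevity.

```python
# Sketch of the attentional leaky integrator (ALI) stage (all constants illustrative).
import numpy as np

def attentional_interest(n_channels, focus_channel, width=3.0):
    """Gaussian attentional interest across channels (gradient model of attention)."""
    c = np.arange(n_channels)
    return np.exp(-0.5 * ((c - focus_channel) / width) ** 2)

def relax_weights(weights, interest, rate=0.05):
    """Oscillator-to-ALI weights drift from their initial values towards the interest vector."""
    return weights + rate * (interest - weights)

def ali_step(ali, weights, oscillator_activity, dt=0.01, tau=0.1, threshold=0.3):
    """One leaky-integrator update of the ALI; returns the new value and whether it is peaked."""
    drive = np.dot(weights, oscillator_activity)
    ali = ali + dt * (drive - ali) / tau
    return ali, ali > threshold

# A segment is 'attended' if its oscillators are active while the ALI is peaked.
n_channels = 32
weights = np.ones(n_channels)                  # initially strong: everything is attended
interest = attentional_interest(n_channels, focus_channel=10)
activity = np.zeros(n_channels)
activity[8:13] = 1.0                           # an active segment lying near the focus
ali = 0.0
for _ in range(500):
    weights = relax_weights(weights, interest)
    ali, attended = ali_step(ali, weights, activity)
print(attended)                                # True: the segment lies under the interest peak
```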

Results

The model accounts for a number of interesting phenomena. The attentional stream can be subconsciously redirected by the onset of a new, loud stimulus. The attentional mechanism also accounts quantitatively for the streaming of alternating tone sequences (van Noorden, 1975) and for the build-up of stream segregation over time (Anstis and Saida, 1985). Furthermore, the model demonstrates the failure of streaming to occur when attention is directed towards a distractor task (Carlyon et al., 2001). In addition, the model reproduces a number of other phenomena, including the induction of two-tone streaming; the grouping of a mistuned harmonic with a complex tone (e.g. Darwin et al., 1995); and the capture of tones from a complex, which demonstrates the old-plus-new heuristic (Bregman, 1990).

Summary

A computational model of auditory scene analysis is presented in which attention plays an important role in the formation and segregation of perceptual streams. The implementation is based on a previous model (Wrigley and Brown, 2002) and avoids the two-dimensional time-frequency network structure found in other CASA models (e.g. Wang and Brown, 1999): the network processes input on a frame-by-frame basis. Furthermore, the attentional leaky integrator, together with the connection weights which link it to the oscillators, determines which segments form the attended perceptual stream. The model shows a good match to experimental findings and offers an explanation for attentional influences on the grouping process.

References

Anstis, S and Saida, S (1985). Adaptation to auditory streaming of frequency-modulated tones. Journal of Experimental Psychology: Human Perception and Performance 11 257-271.

Bregman, AS (1990). Auditory Scene Analysis. The Perceptual Organization of Sound, MIT Press.

Brown, GJ (1992). Computational auditory scene analysis: A representational approach, Doctoral thesis CS-92-22, Department of Computer Science, University of Sheffield.

Brown, GJ and Cooke, M (1994). Computational auditory scene analysis. Computer Speech and Language 8 297-336.

Carlyon, RP, Cusack, R, Foxton, JM and Robertson, IH (2001). Effects of attention and unilateral neglect on auditory stream segregation. Journal of Experimental Psychology: Human Perception and Performance 27(1) 115-127.

Cherry, EC (1953). Some experiments on the recognition of speech, with one and two ears. Journal of the Acoustical Society of America 25 975-979.

Cooke, MP (1991/1993). Modelling auditory processing and organisation. Cambridge University Press.

Darwin, CJ, Hukin, RW and Al-Khatib, BY (1995). Grouping in pitch perception: Evidence for sequential constraints. Journal of the Acoustical Society of America 98(2) Pt 1, 880-885.

Mellinger, DK (1991). Event formation and separation in musical sound. Doctoral thesis, Stanford University.

Mondor, TA and Bregman, AS (1994). Allocating attention to frequency regions. Perception and Psychophysics 56(3) 268-276.

Spence, CJ and Driver, J (1994). Covert spatial orienting in audition: Exogenous and endogenous mechanisms. Journal of Experimental Psychology: Human Perception and Performance 20(3) 555-574.

Terman, D and Wang, DL (1995). Global competition and local cooperation in a network of neural oscillators. Physica D 81 148-176.

van Noorden, LPAS (1975). Temporal coherence in the perception of tone sequences. Doctoral thesis, Institute for Perceptual Research, Eindhoven, NL.

von der Malsburg, C (1981). The correlation theory of brain function. Internal report 81-2, Max Planck Institute for Biophysical Chemistry, Göttingen, Germany.

von der Malsburg, C and Schneider, W (1986). A neural cocktail-party processor. Biological Cybernetics 54 29-40.

Wang, DL (1996). Primitive auditory segregation based on oscillatory correlation. Cognitive Science 20 409-456.

Wang, DL and Brown, GJ (1999). Separation of speech from interfering sounds based on oscillatory correlation. IEEE Transactions on Neural Networks 10 684-697.

Wrigley, SN and Brown, GJ (2002). A neural oscillator model of auditory selective attention. Presented at NIPS 2001, the Fifteenth Annual Conference on Neural Information Processing Systems, Vancouver, Canada, 4-6 December 2001. To appear in Advances in Neural Information Processing Systems 14, edited by T.G. Dietterich, S. Becker and Z. Ghahramani, MIT Press.
