


INTRODUCTION TO REVISED APPLICATION

Overview. This application is a first resubmission. We thank the reviewers for their thoughtful comments on the original submission, which allowed us to significantly strengthen the proposal. The primary concern of the reviewers was that insufficient methodological detail was provided regarding the proposed experiments and modeling projects. They agreed that adequate progress had been made in the previous budget period and that the proposed work was poised to make significant advances in our understanding of speech production. We have therefore focused our revision efforts on providing additional methodological details throughout the Research Plan. To this end, portions of the Progress Report (Section C) describing subprojects from the last budget period that were not directly related to the current application were shortened or removed so that we could focus on material relevant to evaluating the proposed research. Section D has also been extensively revised, as detailed in the following paragraphs. Paragraphs of the application text that have changed substantially from the previous submission are indicated with vertical bars in the left margin of the Research Plan.

Concerns of Reviewers 2 and 3. These reviewers were generally enthusiastic about the application and suggested few revisions. The primary concern of Reviewer 3 was that there were problems inherent in the design of the two proposed studies involving clinical populations. Reviewer 2 had a similar concern regarding one of these studies. Reviewer 3 felt the clinical population studies should either be dropped from the project or substantially revised. We chose to drop these studies from the project, which allows us to devote more space to methodological details for the remaining experiments.

Reviewer 2 had two additional concerns. The first was a general lack of methodological details; this was the primary concern of Reviewer 1 as well, and below we describe how this has been addressed. The second was a minor concern involving a possible missed opportunity to study the time course of adaptation in Production Experiments 1 and 2 in Section D.1. This has been addressed by adding text in Sections C.2 and D.1 making explicit how we will investigate the time course of learning in the experiments and model simulations.

The remaining paragraphs of this Introduction detail the changes we made to the application to address the concerns of Reviewer 1.

Rationale for studying prosody. Reviewer 1 felt that the prosodic component of the research was not well-motivated and seemed distinct from the rest of the project. This concern has been addressed in several ways. First, we have highlighted the application’s focus on investigating auditory feedback control in speech production. In Sections B and D.1 we describe how the control of prosody is an important component of this line of inquiry. Second, we have added text explicitly stating the prosody-related hypotheses to be tested, both our own and those posited by other researchers, and the manner in which they will be tested. (This was done for all experiments in the application, not just the prosody experiments.) It is important to note that, although many of the hypotheses to be tested are embodied by our proposed neural model, they are not simply our view; they reflect current hypotheses proposed by many researchers. The proposed experiments therefore address a wide range of the theoretical and experimental literature on speech, not just our own model. This aspect of the proposed research has been highlighted in Section D.

Finally, it should be stressed that speech motor disorders rarely affect only the segmental component of speech. Instead, both prosodic and segmental disturbances typically co-occur (Darley, Aronson, & Brown, 1969, 1975; Kent & Rosenbek, 1982). Yorkston, Beukelman, & Bell (1988) have argued that prosodic and phonetic parameters are intertwined in their effect on speech intelligibility. Removal of prosodic cues has been shown to affect intelligibility of impaired (Bunton, Weismer, & Kent, 2000; Patel, 2002b) and non-impaired speech (Bunton, Kent, Kent, & Duffy, 2001; Laures & Weismer, 1999; Laures & Bunton, 2003). Thus, a complete model of speech production must provide a unified account of both segmental and prosodic aspects of speech. Relatedly, a common critique by colleagues familiar with the DIVA model is its lack of a prosodic component. For these reasons, we have chosen to spend a significant amount of effort in the coming budget period addressing this shortcoming.

Details concerning the relationship between modeling, behavior, and fMRI. Reviewer 1 felt that the description of how the model simulations will be compared to the results of fMRI and behavioral experiments was not sufficiently detailed. We have addressed this concern in several ways. First, we have expanded the description of the model in C.1 to include a treatment of how we generate simulated fMRI activations from the model. Our method is based on the most recent results concerning the relationship between neural activity and the blood oxygen level dependent (BOLD) signal measured with fMRI, as detailed in Section C.1. The section also includes a treatment of how inhibitory neurons are modeled in terms of BOLD effects. Second, in Section C.2 we detail how the model’s productions are compared to behavioral results from auditory perturbation experiments. We also detail how model predictions can be tested with fMRI, and how we have successfully done this to verify a model prediction concerning auditory error cells in speech production. Third, we have added further description of how we will compare the model fMRI activations to the results of our fMRI experiments in the modeling portions of Section D.

Reviewer 1 also noted that our previous comparisons between the model and fMRI data appear to be qualitative. Indeed we have not yet performed quantitative fits of the model activations to fMRI results, though the model does provide quantitative fits to the acoustic trajectories measured while subjects perform speech tasks in the fMRI scanner (see Fig. 5 in the revised Research Plan). Currently there is no accepted method for quantitatively comparing fMRI activations generated from a system-level neural model to the results of an fMRI experiment. This is because, unlike our model, few system-level neural models make quantitative predictions regarding voxel-level activations across wide areas of the brain. To address this concern we have added text describing how we intend to investigate statistical methods for characterizing the goodness of fit of brain activity predicted by the DIVA model to statistical parametric maps of task-related BOLD signals collected in our fMRI experiments. Despite the qualitative nature of the model/fMRI comparisons to date, we believe the model-based hypotheses tested in our fMRI experiments are more precise and theoretically grounded than those tested in the large majority of fMRI experiments. For example, once a cell type has been localized in our model (note that we assign each cell type to a particular coordinate in the same stereotactic space used to analyze neuroimaging data), we can make a prediction about what should happen in that precise location under a certain speaking condition. We demonstrated this in our auditory perturbation study described in Section C.2. In that study, the model predicted both the existence of a cell type (auditory error cells that respond to a discrepancy between a speaker’s desired speech signal and the actual speech signal) and the location of this cell type in stereotactic space. These predictions were supported in the fMRI experiment. Furthermore, the same model provided quantitative fits to behavioral data collected during the fMRI experiment, specifically the acoustic signals produced by subjects under upward and downward perturbations of the first formant frequency. Section C.2 and Section D have been modified to highlight this aspect of the proposed research.

fMRI methodological details. Reviewers 1 and 2 felt that more methodological details concerning the fMRI experiments were needed in order to evaluate the proposed research. We have taken several steps to address this concern. First, we have expanded the “fMRI experimental protocol” subsection at the beginning of Section D. This section provides details concerning our event-triggered protocol and our rationale for using this protocol. Second, we have made explicit our hypothesis testing methods in Section D, as described above. Third, we have added a sub-section called “Effective connectivity analysis” in Section D of the proposal that describes the methods we will use for estimating path strengths in order to test predictions generated from the model regarding changes in effective connectivity across experimental conditions. We have also added descriptions in the hypothesis test portions of Section D of how effective connectivity analysis will be used to test predictions in the fMRI experiments. Fourth, we have added power analysis details for the behavioral and fMRI experiments to motivate the subject pool sizes. Included in the fMRI power analysis descriptions are details concerning the number of stimulus presentations for each condition.

Reviewer 1 also noted that it may be important to quantify the time courses of activations in the fMRI studies. This type of analysis is not possible on the time scale of a single trial given our event-triggered scanning paradigm, which utilizes sparse sampling and therefore does not contain enough time points of data to reconstruct the whole hemodynamic response time-course within a trial. (Instead, our paradigm essentially captures the peak activation in each trial and compares the size of this peak across different trial types.) This paradigm is necessary to avoid disruption of the auditory perturbation setup we use for our fMRI experiments: the setup will not work if the subject’s speech is contaminated by scanner noise. Although some studies in the literature are beginning to analyze time course information within a trial, this is not considered an absolute requirement in the neuroimaging community.

Relatedly, Reviewer 1 felt a need for fMRI analyses beyond “show[ing] just highly filtered images”. We have added text describing several additional analyses that will be performed on the results of the fMRI experiments, including structural equation modeling analyses of effective connectivity between regions, correlation analyses between behavioral data and the BOLD response, and quantitative fitting of behavioral data measured during the fMRI experiment with our model.

A. SPECIFIC AIMS

The primary goal of this research project is the continued development, testing, and refinement of a computational model describing the neural processes underlying speech. We believe this model, called the DIVA model[1], provides the most comprehensive current account of the neural computations underlying speech articulation, including treatment of acoustic, kinematic, and neuroimaging data. In the upcoming budget period, we propose to focus on the neural circuitry involved in auditory feedback-based control of speech movements. The project involves three closely related subprojects aimed at key issues concerning auditory feedback control of speech:

(1) Control of prosodic aspects of the speech signal. This subproject combines neural modeling with psychophysical and functional magnetic resonance imaging (fMRI) experiments to investigate the neural mechanisms responsible for the control of word- and phrase-level prosody. The psychophysical experiments are designed to identify the degree to which prosodic cues are controlled independently vs. in an integrated fashion. The fMRI experiments are designed to identify the neural circuitry responsible for feedback-based control of prosodic cues. The modeling project involves modification of the DIVA model to incorporate mechanisms for controlling prosody. Simulations of the model performing the same speech tasks used in the experiments will be compared to the experimental results to test the model’s account of the neural circuitry responsible for prosodic control.

(2) Representation of speech sounds in auditory cortical areas. In this subproject, we propose fMRI experiments and modeling work designed to further our understanding of the representation of speech sounds in the auditory cortical areas. This issue is central to the DIVA model, which utilizes an auditory reference frame to store speech sound targets for production. The modeling work will extend our model of auditory cortical maps (currently distinct from the DIVA model) and integrate it with the DIVA model. The modeling work will be guided by two fMRI experiments investigating important aspects of the neural representation of speech sounds: phonemic vs. auditory representations and talker-specific vs. talker-independent representations.

(3) Integration of feedforward and feedback commands in different speaking conditions. In this subproject, we propose to investigate the hypothesis that decreased use of auditory feedback control in fast speech and increased use in clear (carefully articulated) and stressed (emphasized) speech are responsible for differences in articulatory and auditory variability in the different speaking conditions. We propose an fMRI experiment to investigate fast, normal, clear, and stressed speech to test two model-based hypotheses regarding brain activity in the different conditions: (i) fast speech will lead to increased inhibition of auditory and somatosensory areas (indicative of less use of feedback control) when compared to normal speech, and (ii) clear and stressed speech will lead to more activation in the auditory and somatosensory cortical areas than normal speech due to increased reliance on feedback control.

The modeling work proposed in the three subprojects will be integrated into a single, improved version of the DIVA model, thus ensuring that the subprojects provide a unified account of the neural bases of auditory feedback control of speech. We believe the resulting model will be useful to researchers studying communication disorders by providing a much more detailed description of the functional and neuroanatomical underpinnings of normal speech than currently exists, and by providing a functional description of what to expect when part of the neural circuitry malfunctions. We believe this improved understanding will ultimately help guide improvements to diagnostic tools and treatments for speech disorders.

B. BACKGROUND AND SIGNIFICANCE

Note: Publications listed in boldface are available in the appendix materials of this application.

Combining neural models and functional brain imaging to understand the neural bases of speech. Recent years have witnessed a large number of functional brain imaging experiments studying speech and language, and much has been learned from these studies regarding the brain mechanisms underlying speech and its disorders. For example, functional magnetic resonance imaging (fMRI) and positron emission tomography (PET) studies have identified the cortical and subcortical areas involved in simple speech tasks (e.g., Hickok et al., 2000; Riecker et al., 2000a,b; Wise et al., 1999; Wildgruber et al., 2001). However, these studies do not, by themselves, answer the question of what role a particular brain region may play in speech. For example, activity in the anterior insula has been identified in numerous speech neuroimaging studies (e.g., Wise et al., 1999; Riecker et al., 2000a), but much controversy still exists concerning its particular role in the neural control of speech (Dronkers, 1996; Ackermann & Riecker, 2004; Hillis et al., 2004; Shuster & Lemieux, in press). We contend that a better understanding of the exact roles of different brain regions in speech requires the formulation of computational neural models whose components model the computations performed by individual brain regions (see Horwitz et al., 1999; Husain et al., 2004; Tagamets & Horwitz, 1997; Fagg & Arbib, 1998 for other examples of this approach). In the past decade we have developed one such model of speech production called the DIVA model (Guenther, 1994, 1995; Guenther et al., 1998, 2005). The model’s components correspond to regions of the cerebral cortex and cerebellum, and they consist of groups of neurons whose activities during speech tasks can be measured and compared to the brain activities of human subjects performing the same task (e.g., Guenther et al., 2005). These neurons are connected by synapses that become tuned during a babbling process and with repetition of learned sounds. Computer simulations of the model provide a simple, unified account of a wide range of observations concerning speech acquisition and production, including data concerning kinematics and acoustics of speech movements (Callan et al., 2000; Guenther, 1994, 1995; Guenther et al., 1998, 2005; Nieto-Castanon et al., 2005; Perkell et al., 2000, 2004a,b) as well as brain activities underlying speech production (Guenther et al., 2005).

In this application we propose a set of modeling, neuroimaging, and psychophysical experiments designed to address several current shortcomings of the DIVA model, in the process improving our understanding of several aspects of speech production, including the neural bases of prosodic control, the auditory representations used for speech production, and the control of speaking rate and clarity.

Neural bases of prosodic control. The speech signal consists of phonetic segments that correspond to the basic sound units (phonemes and/or syllables, or segments) and prosodic cues such as pitch, loudness, duration and rhythm that convey meaningful differences in linguistic or affective state (Bolinger, 1961, 1989; Lehiste, 1970, 1976; Netsell, 1973; Shriberg & Kent, 1982). To date we have focused on the segmental aspects of speech; in the current application we propose to address both prosodic and segmental aspects.

Prosody serves various grammatical, semantic, emotional, social and psychological roles in English. Prosodic cues, often referred to as suprasegmentals, have two defining characteristics: they tend to co-occur with segmental units and they can potentially extend over more than one segmental unit (Cruttenden, 1997; Lehiste, 1970). Young children can modulate the prosody of their cries and babbles well before they can produce any identifiable speech sounds (Bloom, 1973; Kent & Bauer, 1985; Kent & Murray, 1982; MacNeilage & Davis, 1993; Menyuk & Bernholtz, 1969; Snow, 1994). Recent findings indicate that people with severely impaired speech can also control prosody despite little or no segmental clarity (Patel 2002a,b, 2003, 2004; Vance 1994; Whitehill et al., 2001). Once regarded as subservient to speech segments, prosody is beginning to be viewed as the ‘scaffolding’ that holds different levels of phonetic description together. Thus far the DIVA model has focused on the segmental aspects of speech; in this application we propose to extend the model to address the neural mechanisms involved in the control of linguistic prosody.

Regarding the neural bases of prosody, there is agreement in the literature that no single brain region is responsible for prosodic control, but there is little consensus as to exactly which regions are involved and in what capacity (see Sidtis & Van Lancker Sidtis, 2003 for a review). Most studies have focused on hemispheric asymmetries in prosodic perception and control, with relatively little attention to different roles played by different regions within the cerebral hemispheres. One of the more consistent findings in the literature concerns perception and production of affective prosody, which appear to rely more on the right cerebral hemisphere than the left hemisphere (Adolphs et al., 2002; Buchanan et al., 2000; George et al., 1996; Ghacibeh & Heilman, 2003; Kotz et al., 2003; Mitchell et al., 2003; Pihan et al., 1997; Ross & Mesulam, 1979; Williamson et al., 2003), though the view of affective prosody as a unitary, purely right hemisphere entity is oversimplified (Sidtis & Van Lancker Sidtis, 2003). Some researchers have concluded that linguistic prosody primarily involves left hemisphere mechanisms (Emmorey, 1987; Walker et al., 2004), while others suggest significant involvement of both hemispheres (Mildner, 2004). Lexical pitch in tonal languages appears to be predominantly processed by the left hemisphere (Gandour et al., 2003). Phrase- or sentence-level aspects of linguistic prosody may be processed preferentially in the right hemisphere (Meyer et al., 2002), though some report bilateral activation for these processes (Doherty et al., 2004; Stiller et al., 1997).

We propose to investigate the neural circuitry involved in the control of linguistic prosodic cues at the word and phrase levels, with the goal of providing a more precise account of the roles of different brain regions in these processes. The proposed studies are novel in their use of a combination of modeling, neuroimaging, and psychophysics, as well as their use of an auditory perturbation paradigm to identify regions of the brain involved in online auditory feedback-based control of prosodic cues. This work, in combination with our previous and proposed research into the control of segmental cues, will also shed light on the interrelation between segmental and prosodic cues and its implications for speech motor control.

Auditory perturbation studies. Below we propose psychophysical and fMRI experiments in which the acoustic feedback of prosodic or segmental cues from the subject’s own voice is modified in near-real-time (less than 18 ms delay) to investigate how the subject compensates and adapts to these modifications. This paradigm, in which a subject’s perception of his/her own actions is “perturbed” by an experimenter and the subject’s response to the perturbation over time is measured, has been used in a number of prior studies to investigate motor control mechanisms. For example, it has long been known that adding noise to a speaker’s auditory feedback of his/her own speech, or attenuating the speaker’s auditory feedback, leads to a compensatory increase in speech volume (Lombard, 1911; Lane & Tranel, 1971). The majority of recent auditory perturbation studies have involved the manipulation of vocal pitch. The auditory percept of pitch depends primarily on the frequency of vocal fold vibration, which is referred to as the fundamental frequency (F0). Pitch shifting experiments typically involve shifting of the entire frequency spectrum (including F0 as well as the formant frequencies), which yields the subjective impression of hearing one’s own voice with a different pitch. Elman (1981) showed that speakers respond to frequency-shifted feedback by altering the fundamental frequency of their speech in the opposite direction to the shift. The details of this compensatory response have been investigated in several relatively recent studies by Larson, Burnett, Hain, and colleagues (Burnett et al., 1998; Burnett & Larson, 2002; Hain et al., 2001). In the studies of Larson and colleagues, subjects typically articulated sustained vowels in isolation, a rather unnatural speech task. Natke & Kalveram (2001) and Donath et al. (2002) extended these results by shifting pitch while subjects produced the nonsense word /tatatas/, where the first syllable could be either unstressed or stressed. If the first syllable was stressed (and therefore longer in duration), a compensatory response occurred during that syllable, but not if the first syllable was unstressed. The second syllable showed evidence of a compensatory response in both the stressed and unstressed cases. This suggests that subjects compensate for pitch shifting during production of syllable strings, though the compensation will occur during the first syllable only if this syllable is longer in duration than the latency of the compensatory response.
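For intuition, the whole-spectrum shift described above can be illustrated with a toy, offline Python sketch. This is not the near-real-time apparatus used in the proposed experiments (which must preserve timing and operate at under 18 ms latency); it simply shows that resampling a signal and interpreting it at the original rate scales every frequency component, F0 and formants alike, by a common factor. All values below are illustrative.

# Toy offline illustration (not the real-time feedback apparatus): resampling a
# signal and interpreting it at the original sampling rate scales every
# frequency component -- F0 and its harmonics alike -- by a common factor,
# mimicking the whole-spectrum shifts used in frequency-shifted feedback
# studies. All values are illustrative.
import numpy as np
from scipy.signal import resample

fs = 16000                                   # sampling rate (Hz)
t = np.arange(0, 0.5, 1.0 / fs)              # 0.5 s "vowel"
f0 = 120.0                                   # fundamental frequency (Hz)
# Crude vowel-like signal: F0 plus a few harmonics of decreasing amplitude.
speech = sum(a * np.sin(2 * np.pi * k * f0 * t)
             for k, a in [(1, 1.0), (2, 0.5), (3, 0.25), (4, 0.125)])

shift_factor = 1.06                          # roughly a one-semitone upward shift
n_out = int(round(len(speech) / shift_factor))
shifted = resample(speech, n_out)            # same content in fewer samples

# Interpreted at the original rate fs, every frequency in `shifted` is higher by
# shift_factor. Duration shrinks as a side effect; a real-time system would
# instead use a duration-preserving method such as a phase vocoder.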

In Section D we propose several auditory perturbation experiments that extend beyond existing studies in several important directions. First, unlike previous studies, we will use pitch shifting and amplitude shifting in particular portions of an utterance to perturb specific prosodic cues. Second, we will perform perturbation experiments in the fMRI scanner to help us identify and characterize the brain regions involved in auditory feedback control of speech. Third, the studies will be accompanied by computational modeling projects in which the DIVA model will simulate the experimental tasks and its output will be compared to the experimental results to test the model’s account of the neural circuits responsible for auditory error detection and correction.

C. PROGRESS REPORT / PRELIMINARY STUDIES

This report covers research performed during the current budget period. Our primary aims during this period were: (i) to further develop the DIVA model of speech production, (ii) to further develop our model of neural maps in auditory cortical areas, (iii) to develop laboratory infrastructure and software for performing[2] and analyzing neuroimaging experiments, and (iv) to carry out fMRI experiments designed to test our models. The following subsections describe our progress on these aims.

C.1. Development of the DIVA model of speech production. During the current grant period, we have continued the development of a neural network model of the brain processes underlying speech acquisition and production called the DIVA model, with emphasis on relating model components to particular regions of the brain. The model is able to produce speech sounds (including articulator movements and a corresponding acoustic signal) by learning mappings between articulations and their acoustic consequences, as well as auditory and somatosensory targets for individual speech sounds. It accounts for a number of speech production phenomena, including aspects of speech acquisition, coarticulation, contextual variability, motor equivalence, velocity/distance relationships, speaking rate effects, and perception-production interactions (Callan et al., 2000; Guenther, 1995; Guenther et al., 1998; Nieto-Castanon et al., 2005; Perkell et al., 2004a,b). The latest version of the model is detailed in Guenther et al. (2005). Here we briefly describe the model with attention to aspects relevant to the current proposal.

A schematic of the DIVA model is shown in Fig. 1. Each box in the diagram corresponds to a set of neurons in the model, and arrows correspond to synaptic projections that form mappings from one type of neural representation to another. The mappings labeled Feedback commands in Fig. 1 are tuned during a babbling phase in which semi-random articulator movements are used to produce auditory and somatosensory feedback; this combination of articulatory, auditory, and somatosensory information is used to tune the synaptic projections between the sensory error maps and the model’s motor cortex.

In the next learning stage, the model learns a time-varying auditory target region for every speech sound presented to it. A “speech sound” as defined herein can be a phoneme, syllable, or word. For the sake of readability, in this application we will typically use the term “syllable” to refer to a single speech sound unit, represented by its own speech sound map cell in the model. The target region is effectively stored in the synapses projecting from the Speech Sound Map cell chosen to represent the sound to the Auditory Error Map cells. The Speech Sound Map is hypothesized to lie in left hemisphere lateral premotor cortex, specifically in the ventral portion of the inferior frontal gyrus pars opercularis[3]; we will use the term frontal operculum to refer to this location hereafter. (See Appendix A of Guenther et al., 2005 for justification of the hypothesized anatomical locations of model components.)

Once the target region for a sound has been learned, the model can use the feedback control subsystem to attempt to produce the sound. This is done by activating the Speech Sound Map cell representing the sound, which in turn reads out the sound’s target region to the Auditory Error Map, where it is compared to incoming auditory feedback of the model’s own speech. If the auditory feedback is outside the auditory target region, cells in the Auditory Error Map become active, and this in turn leads to corrective motor commands (Feedback commands in Fig. 1). Additional synaptic projections from Speech Sound Map cells to the model’s motor cortex (both directly and via the cerebellum) form Feedforward commands; these commands are tuned by monitoring the overall motor command, which is formed by combining the Feedforward commands and Feedback commands. Early in development the feedforward commands are inaccurate, and the model depends on feedback control. Over time, the feedforward commands are tuned through monitoring of the movements commanded by the feedback subsystem.
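The interplay between the feedback and feedforward pathways just described can be summarized with a brief schematic sketch in Python. This is not the DIVA implementation itself; the scalar states, gain, and learning rate are hypothetical, and the point is only to show how a feedback command driven by auditory error is combined with a feedforward command that is gradually tuned toward the total command issued on earlier attempts.

# Schematic sketch (not the actual DIVA implementation) of combining feedforward
# and feedback commands, with the feedforward command tuned from the total
# command issued on earlier attempts. States are scalar and values hypothetical.
target = 1.0          # learned auditory target for the sound (illustrative)
feedforward = 0.0     # initially untrained feedforward command
alpha_ff = 0.3        # feedforward learning rate (hypothetical)
gain_fb = 0.8         # feedback controller gain (hypothetical)

for attempt in range(10):
    produced = feedforward                    # what the feedforward command alone achieves
    auditory_error = target - produced        # auditory error map activity
    feedback_cmd = gain_fb * auditory_error   # corrective feedback command
    total_cmd = feedforward + feedback_cmd    # combined motor command

    # Feedforward tuning: move the stored command toward the total command that
    # was actually needed, so later attempts rely less on feedback control.
    feedforward += alpha_ff * (total_cmd - feedforward)
    print(f"attempt {attempt}: error={auditory_error:+.3f}, feedforward={feedforward:.3f}")

Across iterations the auditory error shrinks and the feedback contribution fades, mirroring the developmental shift from feedback-dominated to feedforward-dominated control described above.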

We have performed a number of computer simulations of the model to verify that it can account for a wide range of experimental observations concerning the articulatory kinematics and acoustics of speech. As part of the current grant, we showed that the model can account for one of the more perplexing observations concerning speech kinematics: the very large amount of articulatory variability seen during production of the phoneme /r/ in different phonetic contexts, even by the same speaker (e.g., Delattre & Freeman, 1968; Westbury et al., 1998). As described in Nieto-Castanon et al. (2005), we collected structural MRI scans and acoustic data from two speakers who had participated in an earlier kinematic study of /r/ production (Guenther et al., 1999), and we used these data to construct two speaker-specific vocal tract models that mimicked the vocal tract shapes, movement degrees of freedom, and acoustics of the modeled speakers. These vocal tract models were embedded into the DIVA model, which then controlled their movements to produce /r/ in different phonetic contexts. The model’s articulatory gestures for /r/ were qualitatively and quantitatively very similar to those of the modeled subjects in each phonetic context, with both the model and speakers displaying a wide range of tongue configurations for /r/ across contexts. Furthermore, differences in the model’s gestures using the two different vocal tract models mimicked the differences in articulations observed for the two speakers. These results, backed by detailed quantitative kinematic and acoustic analyses (for details see Nieto-Castanon et al., 2005), provide very strong support for the type of control scheme used by the DIVA model. We have also shown that this type of controller can account for observations concerning the kinematics of human arm movements (Micci Barreca & Guenther, 2001; Dessing et al., in press).

Generating simulated fMRI activations from model simulations. An important feature of the DIVA model that differentiates it from other computational models of speech production is that all of the model’s components have been associated with specific anatomical locations in the brain. These locations, specified in the Montreal Neurological Institute (MNI) coordinate frame in Table 1 of Guenther et al. (2005), are based on the results of fMRI and PET studies of speech production and articulation carried out by our lab and many others (see Guenther et al., 2005 for details). Since the model’s components correspond to groups of neurons at specific anatomical locations, it is possible to generate simulated fMRI activations from the model’s cell activities during a computer simulation. The relationship between the signal measured in blood oxygen level dependent (BOLD) fMRI and electrical activity of neurons has been studied by numerous investigators in recent years. It is well-known that the BOLD signal is relatively sluggish compared to electrical neural activity. That is, for a very brief burst of neural activity, the BOLD signal begins to rise and continues rising well after the neural activity stops, peaking about 4-6 seconds after the burst, dipping slightly below the starting level around 10-12 seconds after the burst, and then slowly returning to the starting level. We use such a hemodynamic response function (HRF), which is part of the SPM software package for fMRI data analysis, to transform neural activities in our model cells into simulated fMRI activity. However, there are different possible definitions of “neural activity”, and the exact nature of the neural activity that gives rise to the BOLD signal is still under debate (e.g., Caesar et al., 2003; Heeger et al., 2000; Logothetis et al., 2001; Logothetis & Pfeuffer, 2004; Rees et al., 2000; Tagamets & Horwitz, 2001).

In our modeling work, each model cell is hypothesized to correspond to a small population of neurons that fire together. The output of the cell corresponds to neural firing rate (i.e., the number of action potentials per second of the population of neurons). This output is sent to other cells in the network, where it is multiplied by synaptic weights to form synaptic inputs to these cells. The activity level of a cell is calculated as the sum of all the synaptic inputs to the cell (both excitatory and inhibitory), and if the net activity is above zero, the cell’s output is proportional to the activity level. If the net activity is below zero, the cell’s output is zero. It has been shown that the magnitude of the BOLD signal typically scales proportionally with the average firing rate of the neurons in the region where the BOLD signal is measured (e.g., Heeger et al., 2000; Rees et al., 2000). It has been noted elsewhere, however, that the BOLD signal actually correlates more closely with local field potentials, which are thought to arise primarily from averaged postsynaptic potentials (corresponding to the inputs of neurons), than it does to the average firing rate of an area (Logothetis et al., 2001). In particular, whereas the average firing rate may habituate down to zero with prolonged stimulation (greater than 2 sec), the local field potential and BOLD signal do not habituate completely, maintaining non-zero steady state values with prolonged stimulation. In accord with this finding, the fMRI activations that we generate from our models are determined by convolving the total inputs to our modeled neurons (i.e., the activity level as defined above), rather than the outputs[4] (firing rates), with an idealized hemodynamic response function generated using default settings of the function ‘spm_hrf’ from the SPM toolbox (see Guenther et al., 2005 for details).
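As a deliberately simplified illustration of this procedure, the Python sketch below convolves a cell’s summed synaptic input with a double-gamma hemodynamic response function. The HRF parameters and the toy input time course are stand-ins rather than the values used in our simulations, which rely on SPM’s spm_hrf and the model’s own cell activities.

# Simplified sketch of generating a simulated BOLD response from one model cell:
# the cell's summed synaptic input (not its rectified firing-rate output) is
# convolved with an idealized HRF. The double-gamma HRF below is a generic
# stand-in for SPM's spm_hrf; all parameter values are illustrative.
import numpy as np
from scipy.stats import gamma

dt = 0.1                          # time step (s)
t = np.arange(0, 32, dt)          # 32 s of HRF support

def double_gamma_hrf(t, peak=6.0, undershoot=16.0, ratio=6.0):
    """Canonical-style HRF: peak near 5-6 s, undershoot near 10-16 s."""
    h = gamma.pdf(t, peak) - gamma.pdf(t, undershoot) / ratio
    return h / h.sum()

# Toy summed synaptic input: a brief net-excitatory burst followed by a brief
# net-inhibitory period (e.g., from an active inhibitory afferent).
net_input = np.zeros_like(t)
net_input[(t >= 1.0) & (t < 1.5)] = 1.0      # net excitatory drive
net_input[(t >= 3.0) & (t < 3.5)] = -0.3     # net inhibitory drive

# Rectified firing-rate output (what the cell sends to other cells); not used
# for the BOLD prediction, shown only to emphasize the input/output distinction.
firing_rate = np.maximum(net_input, 0.0)

# Simulated BOLD response: convolution of the summed input with the HRF.
bold = np.convolve(net_input, double_gamma_hrf(t))[: len(t)]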

In our models, an active inhibitory neuron has two effects on the BOLD signal: (i) the total input to the inhibitory neuron will have a positive effect on the local BOLD signal, and (ii) the output of the inhibitory neuron will act as an inhibitory input to excitatory neurons, thereby decreasing their summed input and, in turn, reducing the corresponding BOLD signal. Relatedly, it has been shown that inhibition to a neuron can cause a decrease in the firing rate of that neuron while at the same time causing an increase in cerebral blood flow, which is closely related to the BOLD signal (Caesar et al., 2003). Caesar et al. (2003) conclude that this cerebral blood flow increase probably occurs as the result of excitation of inhibitory neurons, consistent with our model. They also note that the cerebral blood flow increase caused by combined excitatory and inhibitory inputs is somewhat less than the sum of the increases to each input type alone; this is also consistent with our model since the increase in BOLD signal caused by the active inhibitory neurons is somewhat counteracted by the inhibitory effect of these neurons on the total input to excitatory neurons.

Figure 2 illustrates the process of generating fMRI activations from a model simulation and comparing the resulting activation to the results of an fMRI experiment designed to test a model prediction. The left panel of the figure illustrates the locations of the DIVA model components on the “standard” single subject brain from the SPM2 software package. The DIVA model predicts that unexpected perturbation of the jaw during speech will cause a mismatch between somatosensory targets and actual somatosensory inputs, causing activation of somatosensory error cells in higher-order somatosensory cortical areas in the supramarginal gyrus of the inferior parietal cortex. The location of these cells is denoted by ΔS in the left panel of the figure. Simulations of the DIVA model producing speech sounds with and without jaw perturbation were performed. The top middle panel indicates the neural activity (gray) of the somatosensory error signals in the perturbed condition minus activity in the unperturbed condition, along with the resulting BOLD signal (black). Since the somatosensory error cells are more active in the perturbed condition, a relatively large positive response is seen in the BOLD signal. Auditory error cells, on the other hand, show little differential activation in the two conditions since very little auditory error is created by the jaw perturbation (bottom middle panel), and thus the BOLD signal for the auditory error cells in the perturbed – unperturbed contrast is near zero. The derived BOLD signals are Gaussian smoothed spatially and plotted on the standard SPM brain in the top right panel. The bottom right panel shows the results of an fMRI study we carried out to compare perturbed and unperturbed speech (13 subjects, random effects analysis, false discovery rate = 0.05). In this case, the model correctly predicts the existence and location of somatosensory error cell activation, but activation not explained by the model is found in the left frontal operculum.

Fig. 3 compares results from an fMRI study performed by our lab of single syllable production to simulated fMRI data from the DIVA model in the same speech task. Comparison of the top and bottom panels indicates that the model qualitatively accounts for most of the fMRI activations (see Guenther et al., 2005 for details).

Research in this subsection funded by the current grant has been accepted/published in Journal of the Acoustical Society of America (Nieto-Castanon et al., 2005), Brain and Language (Guenther et al., 2005), Journal of Motor Behavior (Micci Barreca & Guenther, 2001), Journal of Cognitive Neuroscience (Dessing et al., in press) and Contemporary Issues in Communication Science and Disorders (Max et al., 2004).

C.2. Combining modeling with psychophysics and neuroimaging to investigate auditory feedback control in speech. The DIVA model predicts that unexpected perturbation of speech will cause a mismatch between target sensations and actual sensory inputs. For example, an auditory perturbation such as a shift of the first formant frequency, or F1 (so that a subject hears himself saying “bit” when he is attempting to say “bet”) should activate auditory error cells. These cells are hypothesized to reside in higher auditory cortical areas in the planum temporale and superior temporal gyrus (ΔA in the left panel of Fig. 2). If speech is perturbed in this manner repeatedly, rather than unpredictably, the model predicts that feedforward commands will eventually become tuned to include compensation for the frequency shift. Learning of this sort is sometimes termed sensorimotor adaptation (e.g. Houde & Jordan, 1998). If the shift is then removed, the model will display an “after-effect” while it retunes the feedforward command to work properly under normal conditions. To test these hypotheses, we performed two experiments, accompanied by modeling projects, investigating the effects of shifting F1 of a speaker’s auditory feedback in real-time: an unexpected auditory perturbation fMRI study and a sensorimotor adaptation psychophysical study (the latter performed as part of another grant; R01 DC01925, J Perkell, PI).

In each trial of the fMRI study, subjects read a one-syllable word (e.g., “neck” or “bet”), and on 1 in 4 trials (randomly dispersed) the subject’s auditory feedback was perturbed by shifting the first formant frequency of his/her own speech upward or downward by 30% in real time[5]. (See fMRI Experimental Protocol at beginning of Section D for further details regarding scanning protocol). The top part of Fig. 4 shows the areas with significantly more activation during shifted trials as compared to unshifted trials. As predicted by the DIVA model (illustrated by model simulation results in bottom panel of Fig. 4), increased activation is found in higher-order auditory cortical areas, specifically in the ventral posterior superior temporal gyrus in the right hemisphere and within the posterior portion of the planum temporale in the left hemisphere. In the model these activations arise from auditory error cells becoming active when a discrepancy exists between the auditory target and the incoming auditory signal during speech. It is noteworthy that the DIVA model prediction of the existence of auditory error cell activation, as well as its location, was made prior to the running of the fMRI experiment (e.g., Guenther et al., 2005). These results highlight the effectiveness of our approach in generating predictions to guide fMRI studies, and more generally in furthering our understanding of brain function during speech production.

The speech of subjects in the fMRI study was recorded and analyzed to identify whether subjects were compensating for the perturbation within the perturbed trial. (Note that such within-trial compensation differs from adaptation; compensation refers to on-line corrections in response to a perturbation, whereas adaptation implies a learned compensation that occurs even in the absence of a perturbation.) The gray shaded areas in Fig. 5 represent the 95% confidence interval for normalized[6] F1 values during the vowel for upward perturbation trials (darker shading) and downward perturbation trials (lighter shading). Subjects showed clear compensation for the perturbations, starting approximately 100-130 ms after the start of the vowel. Simulation results from the DIVA model are indicated by the dashed line (upward perturbation) and solid line (downward perturbation). The model’s productions fall within the 95% confidence interval of the subjects’ productions, indicating that the model quantitatively accounts for compensation seen in the fMRI subjects.
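The comparison just described can be made explicit with a short sketch: build the across-subject 95% confidence band for the normalized F1 trajectories in a given perturbation condition and check whether the model’s trajectory stays inside it. The arrays below are hypothetical placeholders; the sketch shows the form of the comparison rather than reproducing our analysis code.

# Schematic model/data comparison for within-trial compensation: compute a 95%
# confidence band for the subjects' normalized F1 trajectories and test whether
# the model trajectory falls inside it. Array contents are placeholders.
import numpy as np
from scipy import stats

n_subjects, n_timepoints = 11, 50                              # hypothetical sizes
rng = np.random.default_rng(0)
subj_f1 = rng.normal(0.0, 0.02, (n_subjects, n_timepoints))    # placeholder data
model_f1 = np.zeros(n_timepoints)                              # placeholder model output

mean_f1 = subj_f1.mean(axis=0)
sem_f1 = subj_f1.std(axis=0, ddof=1) / np.sqrt(n_subjects)
t_crit = stats.t.ppf(0.975, df=n_subjects - 1)
lower, upper = mean_f1 - t_crit * sem_f1, mean_f1 + t_crit * sem_f1

inside = (model_f1 >= lower) & (model_f1 <= upper)
print(f"model trajectory within the 95% CI at {inside.mean():.0%} of time points")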

This research was presented at the 2004 Acoustical Society of America conference in San Diego; Jason Tourville was awarded Best Student Paper in Speech Communication for this presentation.

In the psychophysical study (Villacorta, 2005), 20 subjects performed a sensorimotor adaptation experiment that involved four phases: a baseline phase in which the subject produced 15 repetitions of a short list of words (each repetition of the list corresponding to one epoch) with normal auditory feedback (epochs 1-15 in Fig. 6), a ramp phase during which the shift in F1 was gradually introduced to the subject’s auditory feedback (epochs 16-20), a training phase in which the full F1 perturbation (a 30% shift of F1) was applied on every trial (epochs 21-45), and a post-test phase in which the subject received unaltered auditory feedback (epochs 46-65). The subjects’ adaptive response (i.e., the percent change in F1 compared to the baseline phase in the direction opposite the perturbation) is shown by the solid line with error bars in Fig. 6. The shaded band in Fig. 6 represents the 95% confidence interval for simulations of the DIVA model (where different versions of the model were created to correspond to the different subjects; see Villacorta, 2005 for details). Except for only two epochs, the model’s productions were statistically indistinguishable from the experimental results. Notably, subjects showed an after-effect as predicted, and the model provides an accurate quantitative fit to this after-effect.
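For concreteness, the adaptive response measure can be written out as a short sketch: average F1 within each epoch, express it as a percent change from the baseline-phase mean, and sign it so that changes opposite to the perturbation direction are positive. The epoch boundaries follow the phases described above; the F1 values are hypothetical placeholders.

# Schematic computation of the adaptive response per epoch: percent change in F1
# relative to the baseline phase, signed so that movement opposite to the
# perturbation counts as positive adaptation. F1 values are placeholders.
import numpy as np

n_epochs = 65
baseline = slice(0, 15)        # epochs 1-15: normal feedback
# ramp: epochs 16-20; training (full 30% F1 shift): 21-45; post-test: 46-65

rng = np.random.default_rng(1)
f1_per_epoch = 500.0 + rng.normal(0.0, 5.0, n_epochs)     # placeholder mean F1 (Hz)

baseline_mean = f1_per_epoch[baseline].mean()
percent_change = 100.0 * (f1_per_epoch - baseline_mean) / baseline_mean

# If the perturbation shifts F1 upward, compensation is a downward shift in the
# produced F1, so the adaptive response is the negative of the percent change
# (and vice versa for a downward shift).
adaptive_response = -percent_change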

C.3. Modeling and neuroimaging of sound category representations in the auditory cortical areas. This subproject involved four fMRI experiments and associated neural modeling designed to investigate the representation of sound categories in human auditory cortex. fMRI Experiment 1 investigated the representation of prototypical (good) and non-prototypical (bad) examples of a vowel sound. We found that listening to prototypical examples of a vowel resulted in less auditory cortical activation than listening to non-prototypical examples (Guenther et al., 2004). fMRI Experiments 2 and 3 investigated the effects of categorization training and discrimination training with novel non-speech sounds on auditory cortical representations. The two training tasks were shown to have opposite effects on the auditory cortical representation of sounds experienced during training: discrimination training led to an increase in the amount of activation caused by the training stimuli, whereas categorization training led to decreased activation (Guenther et al., 2004). These results, which utilized powerful region-of-interest based fMRI analysis techniques that we developed under the current grant (Nieto-Castanon et al., 2003), indicate that the brain efficiently shifts neural resources away from regions of acoustic space where discrimination between sounds is not behaviorally important (e.g., near the center of a sound category) and toward regions where accurate discrimination is needed. They also provide a straightforward neural account of learned aspects of perceptual distortion near sound categories (e.g., the perceptual magnet effect demonstrated by Kuhl, 1991; Iverson & Kuhl, 1996): sounds from the center of a category are more difficult to discriminate from each other than sounds near category boundaries because they are represented by fewer cells in auditory cortical working memory. We noted that activities in inferior parietal and superior temporal regions appeared to correlate well with discrimination scores; that is, stimuli that were relatively easy to discriminate from each other generally caused more activation in these areas than stimuli that were more difficult to discriminate from each other. This result is consistent with the hypothesis that cells in the superior temporal and inferior parietal areas represent the auditory characteristics of the stimuli in the discrimination task, and that larger neural representations in these areas (as evidenced by more activation) are less susceptible to noisy processing in individual neurons than smaller representations (see Guenther et al., 2004). Thus sounds with larger superior temporal and inferior parietal representations are easier to discriminate from each other than sounds with smaller representations. Involvement of superior temporal and inferior parietal areas in auditory discriminability is also compatible with the results of studies performed elsewhere (e.g., Caplan et al., 1995). In fMRI Experiment 4, we found evidence that the inferior parietal cortex is part of an auditory working memory system: this area was engaged in a discrimination task that required storage of the auditory details of one sound in working memory for comparison to a second sound, but not in an identification task that involved only identifying which phoneme was heard. 
Participation of inferior parietal cortex in phonological working memory tasks has been posited by numerous researchers (e.g., Baddeley, 2003; Hickok & Poeppel, 2004; Mottaghy et al., 2002; Jonides et al., 1998; Ravizza et al., 2004). Our results indicate that auditory details, as well as phonological information, can be stored in inferior parietal working memory.

Based on these and other results in the literature, we have developed a neural model of speech sound processing in the auditory system (e.g., Guenther & Bohland, 2002; Guenther et al., 2003). We have further elaborated this model as schematized in Fig. 7. We will refer to this elaborated model as the auditory map model[7] in this proposal. It serves as the theoretical starting point for several studies proposed herein. The model consists of three interconnected maps of cells. For each map there is an equation that governs cell activities in the map; these equations[8] are biologically plausible shunting equations (e.g., Grossberg, 1980). The first map in the model, the auditory map in Fig. 7, corresponds to primary and secondary auditory cortical regions of the superior temporal gyrus and supratemporal plane, including Heschl’s gyrus, planum temporale, and posterior superior temporal gyrus. Cells in this map represent an acoustic signal in terms of relatively low-level auditory parameters. For example, there are cells tuned to particular frequencies and sweeps of frequencies; cells of this type have been identified in electrophysiological studies of auditory cortex in animals (e.g., Morel et al., 1993; Kosaki et al., 1997). The auditory map thus provides a rich spectral representation of an incoming sound. The second map, labeled auditory working memory, is hypothesized to lie in the inferior parietal cortex. It receives input from the auditory map and is responsible for keeping recent sounds in working memory for a short period after the sounds have stopped. For example, in a same/different discrimination task, the auditory working memory holds the first of a pair of sounds in working memory long enough for it to be compared to the second sound of the pair. This area includes the anterior and posterior supramarginal gyrus as well as the parietal operculum. It overlaps with the inferior parietal region shown to be involved in both speech perception and production tasks by Hickok and colleagues (Buchsbaum et al., 2001; Hickok et al., 2003). The third map, labeled category map, also receives input from the auditory map and is concerned with classification of sounds into behaviorally relevant categories such as phoneme categories. A good example of a phoneme category (such as a prototypical /i/ sound in the study of Kuhl, 1991) causes a large amount of activation in this map during a discrimination task, specifically in cells that represent the vowel /i/. A poor example of /i/ or a sound that is not categorically processed (e.g., an unfamiliar sound) would cause little or no activation. This effect is achieved in the model through a simple, biologically based learning algorithm in which tuning curves of cells are adjusted to align with frequently occurring inputs (e.g., familiar/prototypical sounds in the native language). We hypothesize that the category map is located lateral and ventral to the primary auditory cortex, in the superior temporal sulcus and neighboring middle temporal gyrus. For brevity’s sake, we will refer to this general region as the superior temporal sulcus (STS) in this proposal. This location is consistent with neuroimaging studies comparing speech to non-speech stimuli, reporting preferential activation of anterior STS (Mummery et al., 1999; Binder et al., 2000; Scott et al., 2000) and anterior middle temporal gyrus (Zahn et al., 2000) bilaterally. 
Our own fMRI studies found more activity in this region when subjects performed a phoneme identification task than when they performed a discrimination task using the same sounds (Guenther et al., 2003), and showed that sound category learning leads to an increase in activity in this area (Guenther et al., 2004). These results provide further support for an STS category map. Buchsbaum et al. (2001) report speech-related activity in a more posterior portion of the STS. One of the goals of the current proposal is to refine our understanding of speech maps in STS, including investigating the possibility of different speech maps in different parts of STS.
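For reference, the shunting dynamics mentioned above can be written in the generic membrane form of Grossberg (1980), where the excitatory drive E_i and inhibitory drive I_i would be assembled from the map inputs and A, B, and C are decay and saturation constants. This is an illustrative form rather than the model’s exact equations:

\frac{dx_i}{dt} = -A\,x_i + (B - x_i)\,E_i - (x_i + C)\,I_i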

According to the model, nerve impulses due to an incoming sound are first registered in cortex in the auditory map. This map represents relatively low-level details of auditory stimuli, such as the shape of the frequency spectrum and velocities of frequency sweeps. Active cells in the auditory map (schematized by the Gaussian “hump” in the map in Fig. 7) send projections to the other two maps in the model. Projections to the auditory working memory map transmit information about auditory details of the sound so they can be stored in short-term memory, e.g. for comparison to the second incoming sound in a same/different discrimination task[9]. We hypothesize that the detailed auditory representation in this region also plays an important role in the tuning of speech motor programs. In the DIVA model, a simplified version of such a representation is used to monitor the sensory consequences of speech motor commands. The projections to the category map are hypothesized to transform the sound into a categorical representation. A cell in the category map becomes active if a clear example of the cell’s preferred stimulus is processed. For instance, a good example of the phoneme /r/ is presumed to heavily activate the cells that represent the /r/ phoneme category in the map, whereas a poor /r/ example will only weakly activate these cells. In addition to connections from the auditory map to the other two maps, the model includes inhibitory connections between the inferior parietal auditory working memory and the category map. The model’s account of the perceptual magnet effect is schematized in Fig. 7. When a good category exemplar arrives, the category map is strongly activated, and this leads to relatively strong inhibition of the auditory working memory, resulting in fewer active cells in the working memory map (top panel). When a poor category example arrives (bottom panel), there is relatively weak category map activation and thus weak inhibition of auditory working memory, resulting in many more active cells in the auditory working memory map. Since larger neural representations are less susceptible to noisy processing in individual neurons, the larger working memory activation for poor category exemplars results in better discriminability of these sounds relative to good category exemplars. This model fits well with the dual language processing streams proposed by Hickok & Poeppel (2004). In terms of both function and cortical location, our proposed auditory working memory map corresponds to their hypothesized dorsal stream, while the category map corresponds to their hypothesized ventral stream.

The model has been implemented computationally as a neural network whose cell activities and synaptic weight strengths are governed by differential equations, allowing investigation of both the transient and equilibrium behaviors of the system. In the simulations described here, the model was trained by presenting it with sound distributions mimicking those of the American English phonemes /r/ and /l/ (approximated by Gaussian distributions centered on prototypical phoneme examples). Learning took place in the synaptic weights projecting from the superior temporal auditory map to the middle temporal category map using a biologically plausible self-organizing map learning law (Kohonen, 1982) that results in more cells becoming active for frequently occurring sounds (e.g., category prototypes). Following training, we tested the model using artificial stimuli from a stimulus continuum formed by interpolating the third formant frequency (F3) between values for the phonemes /r/ and /l/. To compare the model’s discrimination performance to psychophysical measures, we considered the cell activity within the working memory map as an internal psychological variable used for discrimination. Specifically, the perceptual distance (d’) between stimuli was calculated as the difference in the mean population response for each stimulus divided by the root mean square of the variances of this response (averaged over many trials). When more cells contribute to the population vector, the signal to noise ratio, and in turn discriminability, improves (see also Zohary, 1992). Fig. 8 shows that the model provides an accurate fit to behavioral data we collected in a discrimination experiment in which 5 subjects performed a same/different task on sounds from the same /r/-/l/ continuum presented to the model.
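One common formalization consistent with this measure is sketched below: d' between two stimuli is taken as the difference of the mean responses divided by the root-mean-square of the two standard deviations, so that, with independent cell noise, a larger responding population yields a higher d'. The population sizes, response means, and noise level are illustrative, not values taken from the model.

# Sketch of the population-based discriminability measure described above: d'
# grows when more cells contribute to the working-memory representation because
# independent cell noise averages out. All numeric values are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n_trials, noise = 2000, 1.0

def dprime(resp_a, resp_b):
    """d' between two sets of single-trial population responses."""
    pooled_sd = np.sqrt(0.5 * (resp_a.var(ddof=1) + resp_b.var(ddof=1)))
    return np.abs(resp_a.mean() - resp_b.mean()) / pooled_sd

for n_cells in (5, 50):                 # small vs. large active population
    # Per-trial population response = mean over active cells; stimulus A drives
    # each cell slightly more strongly than stimulus B, with independent noise.
    resp_a = rng.normal(1.1, noise, (n_trials, n_cells)).mean(axis=1)
    resp_b = rng.normal(1.0, noise, (n_trials, n_cells)).mean(axis=1)
    print(f"{n_cells:3d} active cells -> d' ~ {dprime(resp_a, resp_b):.2f}")

Under these toy assumptions, increasing the active population from 5 to 50 cells raises d' by roughly a factor of the square root of 10, which is the sense in which larger working-memory representations support better discrimination.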

This research was published in Journal of Speech, Language, and Hearing Research (Guenther et al., 2004), Acoustical Science and Technology (Guenther & Bohland, 2002), and NeuroImage (Nieto-Castanon et al., 2003) and was presented at several major conferences.

C.4. Investigation of brain mechanisms underlying sound initiation and sequencing. In this subproject, we conducted an fMRI experiment to explore brain activity underlying the sequencing of speech sounds. We investigated brain responses related to motor preparation and overt production of sequences of memory-guided non-word syllables such as “ba-ba-ba” or “stra-stru-stri”. Two parameters determined the linguistic content of the stimuli used. The first, syllable complexity, varied the number of phoneme segments that constituted each individual syllable in the sequence (i.e. CV vs. CCCV syllables). The second parameter, sequence complexity, varied the number of unique syllables comprising the three-syllable sequence (repetition of the same syllable vs. three different syllables). Each parameter took one of two values (simple or complex), yielding four stimulus types. Comparisons across conditions were used to assess sequence-related networks. Thirteen right-handed adult American English speakers participated (6 female, 7 male; ages 22-50). Stimuli were presented visually on a projection screen in the rear of the scanner (3T Siemens Trio). A single trial began with the display of a stimulus chosen randomly from all conditions. After 2.5s the stimulus was removed and immediately replaced by a white fixation cross. Subjects were asked to maintain fixation and to prepare to vocalize the stimulus they had just read. After a short random duration (0.5-2.0s), the white cross became green, signaling the subject to immediately vocalize the most recent sequence. During this 2.5s production period, the scanner remained silent. Subjects were instructed to speak each sequence at a typical volume and rate. Following the full production period, the scanner was triggered to acquire three full brain volumes (see fMRI Experimental Protocol in Section D for further details). At the end of the third volume acquisition, the fixation cross disappeared, and the next stimulus was presented. The subjects’ vocal responses were recorded using an MRI-compatible microphone and checked off-line for production errors; error trials were removed from the analysis. A first-level analysis was performed for each subject using the General Linear Model in SPM2. A second-level analysis using non-parametric permutation methods (Nichols & Holmes, 2002) from the SNPM2b toolbox assessed results across subjects.

Complex sequence responses were compared with simple sequence responses within each of the two syllable types. Several cortical regions responded more strongly for complex sequences, including the pre-SMA and SMA, left inferior frontal sulcus, left pre-central gyrus, anterior insula and frontal opercular regions, and left superior parietal cortex. Complex sequences also elicited subcortical activation in the right inferior cerebellum (lobule VIII) and left basal ganglia. Differential cortical activations related to sequence complexity were strongly left-lateralized (cf. the widely bilateral activity in overt production vs. baseline), suggesting a hemispheric distinction between areas used for speech sequence planning (which are left-lateralized) and those used simply for motor execution (which are bilateral). These results largely implicate the same regions described in clinical cases showing deficits in sequencing and/or initiation of speech (e.g. Jonas, 1981, 1987; Ziegler et al., 1997; Pickett et al., 1998; Dronkers, 1996; Riva, 1998), but also provide further functional and anatomical specificity regarding the network for speech sequencing.

Complex syllables required subjects to realize additional phonemic/phonetic targets for proper articulation compared to simple syllables. Differential brain responses related to syllable complexity were limited to the primary motor and somatosensory cortices in the left hemisphere around the lip, jaw, and tongue representations, and to the superior cerebellar cortex.

In related modeling work, we have described how a set of neural mechanisms may cooperate to mediate all the major stages of sequential action generation from initial sequence acquisition to precisely timed force production by muscles in fluent/skilled production (Bullock, 2004a,b; Rhodes et al., 2004).  Together these experimental and modeling results provide an important contribution to our understanding of how the brain plans and drives production of sequences of learned speech sounds.

This work has been presented at the 2004 Meeting of the Organization for Human Brain Mapping and published in the journals Trends in Cognitive Sciences, Motor Control, and Human Movement Science. An additional journal article is in preparation. We have proposed further work to investigate the role of prefrontal cortex and basal ganglia in the sequencing and initiation of speech as part of a recent new R01 application that is currently under review (“Sequencing and Initiation in Speech Production”, F. Guenther PI). None of the research proposed herein overlaps with that application.

C.5. Modeling investigation of cerebellar involvement in feedforward control and coarticulation. In this subproject, we are developing and testing a cerebellar component for use in the DIVA model that explores the role of the cerebellum in feedforward control and coarticulation. A difficult problem faced by the speech neural controller concerns tuning of feedforward motor commands, which involves monitoring corrective commands specified by the feedback control subsystem and incorporating them into the feedforward command. However, due to delays inherent in the processing of sensory feedback, these corrective commands arrive significantly later (approximately 50-200 ms) than desired, well after the error has occurred. It is preferable for the feedforward controller to preempt sensory errors before they arise. Our computer simulations indicate that, if the delays are not corrected for in the feedforward command, instabilities arise during fast speech. We hypothesize that the cerebellum, specifically the superior medial cerebellum, performs the task of temporal alignment in the learning of feedforward motor commands for speech, and that this same learning process is responsible for anticipatory coarticulation in speech production.

In our cerebellum model, the delayed feedback command generated from perceived errors causes activation of cerebellar climbing fibers, which leads to long-term depression (LTD) in the synapses between parallel fibers and Purkinje cells in the cerebellum. The effect of this LTD in the model is to shift the current corrective command (specified by the feedback control subsystem) to an earlier point in time on the next attempt to produce the same speech sound. This process continues to occur until the cerebellum starts to produce an error in the opposite direction because it activates the command too soon. This causes a decrease in the climbing fiber activation, which leads to long-term potentiation (LTP) in the parallel fiber-to-Purkinje synapses, which causes the output of the cerebellum to shift to later points in production. This dynamic balance between LTP and LTD results in feedforward commands that are generated as early as is possible without interfering with previous sounds, thus providing an account of both anticipatory coarticulation and the proper temporal alignment of corrective commands when adjusting the feedforward command.
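
The dynamic balance described above can be caricatured with a toy update rule. The sketch below is not the cerebellar model itself; the delay, step size, and time values are illustrative assumptions. Each attempt shifts the corrective command earlier while it still lags the error (the LTD-like case) and later once it arrives too soon (the LTP-like case):

```python
# A toy caricature (not the cerebellar model itself) of the hypothesized LTD/LTP
# balance: the corrective command is shifted earlier on each attempt while it
# still lags the error ("LTD") and shifted later once it arrives too soon ("LTP").
# The delay, step size, and time values are illustrative assumptions.
error_time = 100                       # ms at which the articulatory error occurs
delay = 150                            # ms of sensory feedback processing delay
command_time = error_time + delay      # initial corrective command arrives late
step = 20                              # ms shift per attempt (learning-rate analogue)

for attempt in range(12):
    if command_time > error_time:
        command_time -= step           # "LTD": shift the command earlier next time
    else:
        command_time += step           # "LTP": command came too soon, shift it later
    print(f"attempt {attempt:2d}: corrective command at {command_time} ms")
```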

In other cerebellar modeling work, we have elaborated a model for adaptively timed hand movement control, which shares many mechanisms with speech articulator control and is involved in sign language (Ulloa et al., 2003a,b). This work has been published in the journal Neural Networks, and a journal manuscript describing the speech coarticulation and feedforward control simulations is in preparation.

C.6. fMRI study of cerebellar, premotor and motor cortical contributions to speech in normal speakers and individuals with ataxic dysarthria. In order to investigate the role of the cerebellum in relation to the primary motor and premotor cortical areas during speech production, we conducted an fMRI experiment involving the production of simple speech sounds by neurologically normal subjects and individuals diagnosed with ataxic dysarthria, a speech motor impairment due to cerebellar damage. We examined the differences in brain activations during production of vowels (V), consonant-vowel syllables (CV) and two-syllable nonsense words (CVCV). Ten neurologically normal right-handed speakers of American English (NS group) and 5 subjects diagnosed with ataxic dysarthria (AD group) participated. The data were analyzed using the same methods as the previously described fMRI studies; results from normal subjects were presented previously in the top panel of Figure 3. Additional, more detailed ROI analyses of normal speaker data revealed superior lateral, superior medial and anterior medial cerebellum activity bilaterally for all utterance types (V, CV, CVCV) when compared to a baseline task involving no articulation. The deep cerebellar nuclei were also active in all conditions. We hypothesized that the cerebellar component of the feedforward command would be more important during consonant production since consonants have stricter timing constraints than vowels and the cerebellum is well known to be involved in timing of motor commands (e.g., Perrett et al., 1993; Medina et al., 2000; Spencer et al., 2003). This hypothesis was supported by the finding of significantly more activity in the CVCV and CV conditions than the V condition in the right superior medial and anterior cerebellum and left superior lateral cerebellum. In the AD group, as expected, the cerebellar hemispheres were not active. We hypothesized that speakers with ataxic dysarthria rely more heavily on premotor cortex than normal speakers, particularly for consonant-heavy utterances. In support of this hypothesis, we observed bilateral activation of premotor cortex in the AD group, but only left hemisphere premotor activity in normal subjects. This suggests that, in individuals with ataxic dysarthria, right hemisphere premotor cortex may compensate for cerebellar loss. This research was presented at the 2003 Organization for Human Brain Mapping Meeting, and a journal article is in preparation.

C.7. Publications supported by this grant. The following publications were funded in significant part or in their entirety by the current grant (R01 DC02852) during the current funding period (2/1/2001–2/1/2006). This list includes 13 published or accepted journal articles, 4 book chapters, 1 journal commentary, 8 conference articles, 11 conference abstracts, 4 technical reports, and 2 Ph.D. dissertations. In the following, authors who were supported by the grant are marked with asterisks (*). As described in C.1-C.6, a number of additional journal manuscripts are in preparation; we plan to complete and submit these papers during the remaining year of the current funding period (2/1/2005-2/1/2006).

Callan, D.E., Honda, K., Masaki, S., Kent, R.D., Guenther*, F.H., and Vorperian, H.K. (2001). Robustness of an auditory-to-articulatory mapping for vowel production by the DIVA model to subsequent developmental changes in vocal tract dimensions. ATR Technical Report TR-H-309. Kyoto, Japan: Advanced Telecommunications Research Institute.

Guenther*, F.H. (2001). Neural modeling of speech production. Proceedings of the 4th International Nijmegen Speech Motor Conference, Nijmegen, The Netherlands, June 13-16, 2001.

Guenther*, F.H. (2001). A neural model of cortical and cerebellar interactions in speech. Society for Neuroscience Abstracts.

Guenther*, F.H., Nieto-Castanon*, A., Tourville*, J.A., and Ghosh*, S.S. (2001). The effects of categorization training on auditory perception and cortical representations. Proceedings of the Speech Recognition as Pattern Classification (SPRAAC) Workshop, Nijmegen, The Netherlands, July 11-13, 2001.

Ghosh*, S., Nieto-Castanon*, A., Tourville*, J., and Guenther*, F. (2001). ROI-based analysis of fMRI data incorporating individual differences in brain anatomy. Proceedings of the 7th Annual Meeting of the Organization of Human Brain Mapping, Brighton, UK.

Micci Barreca, D., and Guenther*, F.H. (2001). A modeling study of potential sources of curvature in human reaching movements. Journal of Motor Behavior, 33, pp. 387-400.

Perkell, J.S., Guenther*, F.H., Lane, H., Matthies, M., Vick, J., and Zandipour, M. (2001). Planning and auditory feedback in speech production. Proceedings of the 4th International Nijmegen Speech Motor Conference, Nijmegen, The Netherlands, June 13-16, 2001.

Guenther*, F.H. (2002). Effects of category learning on auditory perception and cortical maps. Program of the 143rd Meeting of the Acoustical Society of America, Journal of the Acoustical Society of America, 111(5) Pt. 2, p. 2383.

Guenther*, F.H., and Bohland*, J.W. (2002). Learning sound categories: A neural model and supporting experiments. Acoustical Science and Technology, 23(4), pp. 213-220. Japanese-language version appeared in Journal of the Acoustical Society of Japan, 58(7), pp. 441-449, July 2002.

Rhodes, B. and Bullock*, D. (2002).  Neural dynamics of learning and performance of fixed sequences: Latency pattern reorganizations and the N-STREAMS model.  Boston University Technical Report CAS/CNS-02-007. Boston: Boston University.

Ghosh*, S.S., Bohland*, J., and Guenther*, F.H. (2003). Comparisons of brain regions involved in overt production of elementary phonetic units. Proceedings of the 9th Annual Meeting of the Organization for Human Brain Mapping, New York.

Guenther*, F.H. (2003). Introductory remarks on neural modeling in speech perception research. Program of the 145th Meeting of the Acoustical Society of America, Journal of the Acoustical Society of America, 113(4) Pt. 2, p. 2209.

Guenther*, F.H. (2003). Neural control of speech movements. In: A. Meyer and N. Schiller (eds.), Phonetics and Phonology in Language Comprehension and Production: Differences and Similarities. Berlin: Mouton de Gruyter.

Guenther*, F.H. and Ghosh*, S.S. (2003). A model of cortical and cerebellar function in speech. Proceedings of the XVth International Congress of Phonetic Sciences. Barcelona: 15th ICPhS Organizing Committee.

Guenther*, F.H., Ghosh*, S.S., and Nieto-Castanon*, A. (2003). A neural model of speech production. Proceedings of the 6th International Seminar on Speech Production, Sydney, Australia.

Guenther*, F.H., and Perkell*, J.S. (2003). A neural model of speech production and its application to studies of the role of auditory feedback in speech. In: B. Maassen, R. Kent, H. Peters, P. Van Lieshout, and W. Hulstijn (eds.), Speech Motor Control in Normal and Disordered Speech. Oxford: Oxford University Press.

Guenther*, F.H., Tourville*, J.A., and Bohland*, J. (2003). Modeling the representation of speech sounds in auditory cortical areas. Program of the 145th Meeting of the Acoustical Society of America, Journal of the Acoustical Society of America, 113(4) Pt. 2, p. 2210.

Hampson*, M., Guenther*, F.H., Cohen, M.A., and Nieto-Castanon*, A. (2003). Changes in the McGurk Effect across phonetic contexts. Boston University Technical Report CAS/CNS-TR-03-006. Boston: Boston University.

Max, L., Gracco, V.L., Guenther*, F.H., Ghosh*, S.S., and Wallace, M. (2003). A sensorimotor model of stuttering: Insights from the neuroscience of motor control. In A. Packman, A. Meltzer, & H.F.M. Peters (Eds.), Proceedings of the 4th World Congress on Fluency Disorders. Nijmegen, The Netherlands: University of Nijmegen Press.

Nieto-Castanon*, A., Ghosh*, S.S., Tourville*, J.A., and Guenther*, F.H. (2003). Region-of-interest based analysis of functional imaging data. NeuroImage, 19, pp. 1303-1316.

Nieto-Castanon*, A., and Guenther*, F.H. (2003). A model of auditory cortical representations underlying speech perception and production. Society for Neuroscience Abstracts.

Tourville*, J.A. and Guenther*, F.H. (2003). A cortical and cerebellar parcellation system for speech studies. Boston University Technical Report CAS/CNS-03-022. Boston, MA: Boston University.

Ulloa, A., Bullock*, D., and Rhodes, B. (2003a). A model of cerebellar adaptation of grip forces during lifting. Proceedings of the IJCNN, 4, pp. 3167-3172.

Ulloa, A., Bullock*, D., and Rhodes, B. (2003b). Adaptive force generation for precision-grip lifting by a spectral timing model of the cerebellum. Neural Networks, 16, pp. 521-528.

Bohland*, J.W. and Guenther*, F.H. (2004). An fMRI investigation of the neural bases of sequential organization for speech production. Proceedings of the 10th Annual Meeting of the Organization for Human Brain Mapping, Budapest, Hungary.

Brown, J., Bullock*, D., and Grossberg, S. (2004). How laminar frontal cortex and basal ganglia circuits interact to control planned and reactive saccades. Neural Networks, 17, pp. 471-510.

Bullock*, D. (2004a). Adaptive neural models of queuing and timing in fluent action. Trends in Cognitive Sciences. 8, pp. 426-433.

Bullock*, D. (2004b). From parallel sequence representations to calligraphic control: A conspiracy of adaptive neural circuit models. Motor Control, 8, pp. 371-391.

Ghosh*, S.S. (2004). Understanding cortical and cerebellar contributions to speech production through modeling and functional imaging. Boston University Ph.D. Dissertation. Boston, MA: Boston University.

Guenther*, F.H., Nieto-Castanon*, A., Ghosh*, S.S., and Tourville*, J.A. (2004). Representation of sound categories in auditory cortical maps. Journal of Speech, Language, and Hearing Research, 47, pp. 46-57.

Guenther*, F.H., and Perkell, J.S. (2004). A neural model of speech production and supporting experiments. Proceedings of From Sound to Sense: Fifty+ Years of Discoveries in Speech Communication, Cambridge, MA.

Max, L., Guenther*, F.H., Gracco, V.L., Ghosh*, S.S., and Wallace, M.E. (2004). Unstable or insufficiently activated internal models and feedback-biased motor control as sources of dysfluency: A theoretical model of stuttering. Contemporary Issues in Communication Science and Disorders, 31, pp. 105-122.

Nieto-Castanon*, A. (2004). An investigation of articulatory-acoustic relationships in speech production. Boston University Ph.D. Dissertation. Boston, MA: Boston University.

Rhodes, B.J., Bullock*, D., Verwey, W.B., Averbeck, B.B., and Page, M.P.A. (2004). Learning and production of movement sequences: Behavioral, neurophysiological, and modeling perspectives. Human Movement Science, 23, pp. 683-730.

Tourville*, J.A., Guenther*, F.H., Ghosh*, S.S., and Bohland*, J.W. (2004). Effects of jaw perturbation on cortical activity during speech production. Program of the 148th Meeting of the Acoustical Society of America, Journal of the Acoustical Society of America, 116, p. 2631.

Yoo, J.J., Guenther*, F.H., and Perkell, J.S. (2004). Cortical networks underlying audio-visual speech perception in normally hearing and hearing impaired individuals. Program of the 148th Meeting of the Acoustical Society of America, Journal of the Acoustical Society of America, 116, p. 2524.

Civier*, O., and Guenther*, F.H. (2005). Simulations of feedback and feedforward control in stuttering. Abstracts of the Oxford Dysfluency Conference, St. Catherine’s College, Oxford, June 29 - July 2, 2005.

Dessing, J.C., Peper, C.E., Bullock*, D., and Beek, P.J. (in press).  How position, velocity and temporal information combine in prospective control of catching: Data and model. Journal of Cognitive Neuroscience.

Nieto-Castanon*, A., Guenther*, F.H., Perkell*, J.S., and Curtin, H. (2005). A modeling investigation of articulatory variability and acoustic stability during American English /r/ production. J Acoust Soc Am., 117, pp. 3196-3212.

Guenther*, F.H., Ghosh*, S.S., and Tourville*, J.A. (2005). Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language, E-print ahead of publication.

Guenther*, F.H., Ghosh*, S.S., Nieto-Castanon*, A., and Tourville*, J.A. (in press). A neural model of speech production. In: J. Harrington & M. Tabain (eds.), Speech Production: Models, Phonetic Processes, and Techniques. London: Psychology Press.

Horwitz, B., Husain, F.T., and Guenther*, F.H. (in press). Auditory object processing and primate biological evolution: Commentary to Arbib’s “From monkey-like action recognition to human language”. Behavioral and Brain Sciences.

Perkell, J.S., Guenther*, F.H., Lane, H., Marrone, N., Matthies, M., Stockmann, E., Tiede, M. and Zandipour, M. (in press). Production and perception of phoneme contrasts covary across speakers. In: J. Harrington & M. Tabain (eds.), Speech Production: Models, Phonetic Processes, and Techniques. London: Psychology Press.

D. RESEARCH DESIGN AND METHODS

The proposed research consists of a combination of functional brain imaging, psychophysical studies, and computational neural modeling to investigate the neural substrates of auditory feedback control in speech production. These studies are organized around the neural models described in Sections C.1 (the DIVA model) and C.3 (the auditory map model). For the sake of clarity, the basic fMRI methods are described first, followed by descriptions of the three subprojects that make up the project.

fMRI experimental protocol. All fMRI sessions will be carried out on a 3 Tesla Siemens scanner at the Massachusetts General Hospital NMR Center. Prior to functional runs, a high-resolution structural image of the brain is collected. This structural image serves as the basis for localizing fMRI activations. The fMRI experiment parameters will be based on the sequences available at the time of scanning[10]. The faculty and research staff at MGH, together with engineers from Siemens, continuously develop and test pulse sequences that optimize T1 and BOLD contrast while providing maximum spatial and temporal resolution for the installed Siemens scanners (Allegra, Sonata and Trio). A potential problem with using fMRI for speech production is that changes in the size of the oral cavity produce artifacts in images collected while the speech articulators are moving (e.g., Munhall, 2001). Furthermore, the fMRI experiments proposed herein involve real-time perturbation of the subject’s acoustic signal while speaking; this type of perturbation is currently not possible in the presence of scanner noise since our digital signal processing system cannot reliably track important aspects of the speech signal when it is masked by scanner noise. To avoid these potential problems, we utilize an event-triggered paradigm (schematized in Fig. 9) in which the scanner is triggered to collect two whole-brain volumes starting approximately 4 seconds[11] after the subject has finished producing the stimulus for a particular trial (total inter-trial interval (ITI) of 10-14 seconds[12]). The scanner is silent during stimulus presentation and speech. Because the blood oxygen level changes induced by the stimulus persist for many seconds, this technique allows us to measure activation changes while avoiding scanner noise during stimulus presentation. Data analysis will correct for summation of blood oxygen level changes across trials using a general linear model (including correction for the effects of the scanner noise during the previous trial). Each session will consist of approximately 4-6 functional runs of 10-15 minutes each. During a run, stimuli will be presented in a pseudo-random sequence. For each experiment, the task(s) and stimulus type(s) are carefully chosen to address the aspect of speech being studied; these tasks and stimuli are described in the subproject descriptions in Sections D.1-D.3. We have developed software to allow us to send event triggers to the scanner and to analyze the resulting data, and we have successfully used this protocol to measure brain activity during speech production and perception in a number of previous studies (see Section C).
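
For illustration, the sketch below lays out the event timing of a single trial under this event-triggered paradigm. The 4 s trigger delay and two-volume acquisition follow the description above; the stimulus duration, preparation jitter, production window, and volume acquisition time are placeholder assumptions rather than the actual protocol parameters:

```python
# An illustrative sketch (not the actual stimulus-presentation software) of the
# event timing for one trial of the event-triggered paradigm. The 4 s trigger
# delay and two-volume acquisition follow the protocol description; the stimulus
# duration, preparation jitter, production window, and volume acquisition time
# are placeholder assumptions.
import random

def build_trial(stim_duration=2.5, jitter=(0.5, 2.0), production=2.5,
                trigger_delay=4.0, volumes=2, acquisition_time=2.0):
    prep = random.uniform(*jitter)
    events = [
        ("stimulus on", 0.0),
        ("fixation / prepare", stim_duration),
        ("go cue (speak)", stim_duration + prep),
        ("production ends", stim_duration + prep + production),
        ("scanner trigger", stim_duration + prep + production + trigger_delay),
    ]
    events.append(("acquisition ends", events[-1][1] + volumes * acquisition_time))
    return events

for name, t in build_trial():
    print(f"{t:6.2f} s  {name}")
```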

Effective connectivity analysis. Whereas commonly used voxel-based analyses of fMRI data rely on the notion of functional specialization, the brain, as well as the neural model proposed herein, is a connected structure in a graph-theoretic sense, and connections between specialized regions bring about functional integration (see e.g. Horwitz et al., 2000; Friston, 2002). Functional integration, or the task-specific interactions between brain regions, can be assessed through various network analyses that measure effective connectivity. In the current proposal, we will use structural equation modeling (SEM) to examine these interactions. SEM is the most widely used method for making effective connectivity inferences from fMRI (Penny et al., 2004), and benefits from a large literature regarding its application to neuroimaging (e.g. McIntosh & Gonzalez-Lima, 1994a; McIntosh et al., 1994b; Büchel & Friston, 1997; Bullmore et al., 2000; Mechelli et al., 2002) as well as its general theory (see e.g. Bollen, 1989). This method requires the specification of a causal, directed anatomical (structural) model, and estimates path coefficients (strength of influence) for each connection in the model that minimize the difference between the measured inter-regional covariance matrix and that implied by the model. We will utilize a single characteristic time-course of the BOLD response from each region-of-interest (ROI) corresponding to a component of our model in the SEM calculations. There exists a natural correspondence between structural equation models and neural network models, as both are specified by connectivity graphs and connection strengths. We can specify the connectivity structure in both models identically (based on known anatomy from primate studies, diffusion tensor imaging studies, etc.), and directly compare the resulting inferred path coefficients with the connectivity in the model. In both cases, inter-regional interaction may be dynamic in the sense that the activity in one region may be driven by different regions (or in different proportions) in varying tasks and contexts; likewise, learning may result in the strengthening or weakening of effective connections (e.g. Büchel et al., 1999). To assess the overall goodness of fit of the SEM we will use the χ2 statistic corresponding to a likelihood ratio test. If we are unable to obtain proper fits using our theoretical structural models (i.e. P(χ2) < 0.05), we will consider this evidence that the connectivity structure is insufficient and we will develop and test alternative models. To make inferences about changes in effective connectivity due to task manipulations or learning (see Sections D.1 and D.2 for details), we will utilize a “stacked model” approach; this consists of comparing a ‘null model’ in which path coefficients are constrained to be the same across conditions with an ‘alternative model’ in which the coefficients are unconstrained. A χ2 difference test will be used to determine if the alternative model provides a significant improvement in the overall goodness-of-fit. If so, the null model can be rejected, indicating that effective connectivity differed across the conditions of interest.
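
The following is a simplified, single-group sketch of the core SEM computation on synthetic data for three hypothetical ROIs connected A → B → C. The generative parameters, sample size, and optimizer are illustrative assumptions, and temporal autocorrelation of the fMRI time series is ignored; this is not the analysis pipeline itself:

```python
# A simplified single-group sketch of the core SEM computation on synthetic data
# for three hypothetical ROIs connected A -> B -> C. Generative parameters, sample
# size, and optimizer are illustrative assumptions (temporal autocorrelation of the
# fMRI time series is ignored); this is not the proposed analysis pipeline.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

rng = np.random.default_rng(0)
N = 200                                   # "time points" treated as independent samples

# Simulate observed ROI signals from a known A -> B -> C structure.
A = rng.normal(size=N)
B = 0.7 * A + rng.normal(scale=0.5, size=N)
C = 0.6 * B + rng.normal(scale=0.5, size=N)
S = np.cov(np.column_stack([A, B, C]), rowvar=False)   # observed covariance matrix
p = S.shape[0]

def implied_cov(theta):
    """Model-implied covariance: Sigma = (I - B)^-1 Psi (I - B)^-T."""
    b_ab, b_bc, v_a, v_b, v_c = theta
    Bmat = np.zeros((3, 3))
    Bmat[1, 0] = b_ab                     # path A -> B
    Bmat[2, 1] = b_bc                     # path B -> C
    Psi = np.diag([v_a, v_b, v_c])        # exogenous/residual variances
    inv = np.linalg.inv(np.eye(3) - Bmat)
    return inv @ Psi @ inv.T

def ml_discrepancy(theta):
    """Maximum-likelihood fit function for covariance structure models."""
    Sigma = implied_cov(theta)
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        return np.inf
    return logdet + np.trace(S @ np.linalg.inv(Sigma)) - np.linalg.slogdet(S)[1] - p

fit = minimize(ml_discrepancy, x0=np.array([0.5, 0.5, 1.0, 1.0, 1.0]),
               method="Nelder-Mead")
chi_square = (N - 1) * fit.fun            # likelihood-ratio chi-square
df = p * (p + 1) // 2 - fit.x.size        # 6 unique (co)variances - 5 free parameters
print("estimated path coefficients (A->B, B->C):", np.round(fit.x[:2], 2))
print("chi2 =", round(chi_square, 2), "df =", df, "p =", round(chi2.sf(chi_square, df), 3))
```

The stacked-model comparison described above extends this single fit by estimating the same structure across conditions twice, once with path coefficients constrained equal and once unconstrained, and testing the χ2 difference.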

The remainder of this section describes the three subprojects. The first two subprojects investigate the auditory cortical representations underlying feedback-based control of prosodic (Section D.1) and segmental (Section D.2) aspects of speech production. The third subproject (Section D.3) investigates interactions between the feedback-based control system investigated in D.1 and D.2 and feedforward control mechanisms.

D.1 Modeling, neuroimaging, and psychophysical investigation of the neural bases of prosodic control

The primary objective of this subproject is to identify and model the neural mechanisms responsible for the control of prosody in speech production. We propose 7 psychophysical experiments, 4 fMRI experiments, and a modeling project that investigate key issues regarding linguistic prosody. The psychophysical and fMRI experiments involve real-time perturbation of prosodic cues during speech in order to probe the feedback and feedforward control mechanisms involved in prosodic control. The results of the psychophysical and fMRI experiments will be used to guide development of a neural model of prosodic control.

Theoretical background. Acoustic cues that carry prosodic information include fundamental frequency (F0), perceived as pitch; amplitude, perceived as loudness; and duration, perceived as length (Cutler et al., 1997; Shriberg & Kent, 2003; Shattuck-Hufnagel & Turk, 1996). Each prosodic cue has multiple linguistic and communicative roles. In some instances the cues work together, and in others one cue is predominant. For example, a rising terminal pitch contour is used to signal yes/no questions. In contrast, increased loudness, duration and pitch are often used to indicate word stress. F0 and duration are thought to be the main carriers of linguistic and pragmatic information (Fry, 1958; Pierrehumbert, 1980). Others have argued that intensity and vocal effort are also highly informative cues for detecting linguistic stress and emphasis (Denes, 1959; Denes & Milton-Williams, 1962; Fry, 1955). An important open question regarding prosodic control is the degree to which the different prosodic cues are controlled independently or in an integrated fashion. This question will be addressed in the psychophysical and fMRI experiments proposed below using a novel auditory perturbation paradigm, and the experimental results will guide refinement of the DIVA model.

In this proposal we focus on linguistic aspects of prosody, rather than affective prosody (cf. Streeter et al., 1983; Williams & Stevens, 1972). While the acoustic features used to signal linguistic contrasts and those used to signal affective states may be analogous, these two functions of prosody have largely been studied independently. It has been argued that the neural substrates for linguistic and affective prosody differ and that the subset of salient acoustic features also differs. For example, voice quality is an essential feature for signaling affective prosody, yet it is not thought to be salient for linguistic contrasts (Ladd et al., 1985; Lehiste, 1970; Shattuck-Hufnagel & Turk, 1996; Streeter et al., 1983; Williams & Stevens, 1972).

Linguistic prosody shapes speech at lexical, phrasal and discourse levels. Here we address lexical and phrasal prosody. In particular, we explore noun/verb lexical stress and contrastive stress at the phrasal level. Noun/verb distinctions such as PROtest vs. proTEST are signaled by increasing the pitch, loudness and duration of the stressed syllable noted in uppercase. Similarly, at the phrasal level stress can be used to contrast an alternative meaning of the utterance, e.g., “JOHN hid his key” (i.e., not Bill). Once again the stressed word is acoustically represented by a longer duration, a higher fundamental frequency and greater intensity compared to when it is unstressed (Bolinger, 1961; Lehiste, 1970; Morton & Jassem, 1965).

As described in Section B, some researchers have suggested that linguistic prosody primarily involves left hemisphere mechanisms (Emmorey, 1987; Walker et al., 2004), while others suggest significant involvement of the right hemisphere (Riecker et al., 2002) or both hemispheres (Mildner, 2004). Simple hemispheric models of prosody try to break prosodic function into left- vs. right-hemisphere functions (for a review see Sidtis & Van Lancker Sidtis, 2003). One such model suggests that linguistic prosody is primarily processed in the left hemisphere, while emotional/affective prosody is primarily processed by the right (e.g., Van Lancker, 1980). Additional proposals include the suggestion that the left hemisphere is particularly involved in segmental and word-level prosody as opposed to sentence-level prosody (e.g., Baum & Pell, 1999). In their “dynamic dual pathway” model of auditory language comprehension, Friederici and Alter (2004) posit that the left hemisphere primarily processes syntactic and semantic information, whereas the right hemisphere primarily processes sentence-level prosody (see also Boutsen & Christman, 2002). A different type of hemispheric model suggests that the difference between hemispheres has to do with acoustic processing; the left hemisphere is specialized for the processing of timing (rate) whereas the right hemisphere is specialized for pitch processing (see Sidtis & Van Lancker Sidtis, 2003). A variant of this view suggests that the left hemisphere is specialized for analyzing sounds using a short temporal window, while the right hemisphere preferentially processes sounds using a longer temporal window (Poeppel, 2003). This view is supported by observations that left hemisphere damage seems to impair prosodic processing based on timing cues more than pitch cues (e.g., Van Lancker & Sidtis, 1992), whereas right hemisphere damage does not seem to impair processing of timing cues (e.g., Pell, 1999). Furthermore, the right hemisphere has often been implicated in perceptual processing of pitch (Sidtis & Volpe, 1988; Zatorre, 1988; Sidtis & Feldmann, 1990), and right hemisphere damage has been implicated in pitch production impairments (Sidtis, 1984), sometimes in the absence of a pitch perception deficit. Severe left hemisphere damage appears to typically spare the ability to perceive complex pitch (Sidtis & Volpe, 1988; Zatorre, 1988) and to manipulate pitch in singing (Sidtis & Van Lancker Sidtis, 2003).

The proposed experiments will build upon previous auditory perturbation studies, most of which have involved perturbation of pitch without regard to its specific role as a linguistic cue in speech (see Section B). It has been hypothesized that control of F0 at least in part involves feedback control mechanisms (e.g., Elliot & Niemoeller, 1970). Hain et al. (2001) determined that delaying the auditory feedback of a subject’s compensatory response to a pitch shift causes an increase in the duration of the initial response peak. This result was interpreted as strongly supporting the use of a closed-loop, negative feedback system for the control of F0 (see also Larson et al., 2000, 2001). Such a system is utilized in the auditory feedback control portion of the DIVA model as described in Section C.1, although currently prosodic cues are not controlled in the model.

The perturbation studies described thus far involve the use of closed-loop feedback control mechanisms to compensate for unpredictable changes in pitch. Additional studies indicate that sustained (and thus predictable) perturbations of the acoustic signal during speech can lead to sensorimotor adaptation, wherein compensatory responses continue after feedback is returned to normal. These residual “compensatory” responses result in incorrect-sounding productions for the first few trials after the perturbation is removed. This result is indicative of reorganization of sensory-motor neural mappings, rather than closed-loop feedback control. Houde and Jordan (1998) demonstrated sensorimotor adaptation in speech by perturbing the first two formant frequencies of the speech signal while subjects whispered one-syllable words. Over many trials, subjects modified their productions to compensate for this perturbation. This compensation continued after the perturbation was removed; i.e., the subjects overshot the vowel formant targets in the direction opposite to the now-removed perturbation. Adaptation effects have also been demonstrated with pitch-shifted speech (Jones & Munhall, 2000). Most studies have focused on changes in the acoustic signal; however, Max et al. (2003) showed adaptation in both articulator movements and acoustics to shifts in formant frequencies.

In the DIVA model, sustained perturbation of auditory feedback will lead to reorganization of the feedforward commands for speech sounds. This occurs because the model’s feedforward control subsystem constantly monitors the corrective commands generated by the feedback control subsystem, gradually incorporating repeatedly occurring corrections into the feedforward command (see Section C.2; Guenther et al., 2005; Villacorta, 2005). If the feedback perturbation is then removed, the first few non-perturbed productions will still show “compensation”, as in the sensorimotor adaptation experiment described in Section C.2. The model thus contains the basic mechanisms necessary to account for both closed-loop feedback control (as in the unpredictable pitch shift experiments of Larson and colleagues) and sensorimotor adaptation (as in the studies of Houde & Jordan, 1998, and Jones & Munhall, 2000) of segmental aspects of speech. Here we propose to extend the model to include the control of prosodic cues in addition to segmental cues.
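
As a concrete illustration of this account, the toy simulation below shows a feedforward command gradually absorbing the correction to a sustained downward shift and then overshooting in the opposite direction when the shift is removed. It is a sketch under simplifying assumptions, not the DIVA implementation: productions here are purely feedforward within a trial, and the learning rate, shift size, and trial counts are arbitrary:

```python
# A toy simulation (a sketch under simplifying assumptions, not the DIVA
# implementation) of feedforward adaptation to a sustained auditory shift.
target = 100.0          # arbitrary units, e.g. a formant or pitch target
feedforward = 100.0     # current feedforward command
alpha = 0.2             # fraction of each correction folded into the feedforward command
shift = -10.0           # sustained downward perturbation applied to auditory feedback

for trial in range(60):
    perturbation = shift if trial < 40 else 0.0     # perturbation removed at trial 40
    produced = feedforward
    heard = produced + perturbation
    correction = target - heard                     # feedback-based corrective command
    feedforward += alpha * correction               # gradual feedforward update
    if trial in (0, 39, 40, 59):
        print(f"trial {trial:2d}: produced = {produced:6.2f}, heard = {heard:6.2f}")
```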

Methods and hypotheses. The proposed methods involve tightly integrated modeling, psychophysical, and fMRI studies. For purposes of exposition, the model implementation is described first, followed by descriptions of the proposed experiments, and finally a description of how the modeling work will be integrated with the experimental studies.

Model implementation. Currently the DIVA model does not address the control of prosody; instead it focuses on segmental aspects of speech production. For example, the model does not explicitly control pitch; the pitch profile is specified by the modeler. In the proposed project we plan to extend the model to include neural mechanisms for controlling pitch, duration, and loudness to indicate stress in simple utterances.

Fig. 10 schematizes two hypothetical architectures for feedback control of prosodic cues. These schemes are based in part on the DIVA model circuitry for controlling formant frequencies. In the Integrated Model, cells in higher-order auditory cortex compare a target stress level to the perceived stress level to compute a “stress error” which is then transmitted to controllers for pitch (P), duration (D), and loudness (L). In the Independent Channel Model, errors are calculated for pitch, duration, and loudness separately, with each error projecting to the corresponding controller.

The transfer of informational cues between prosodic features has been referred to as cue trading (Howell, 1993; Lieberman, 1960). It has been shown that even though different speakers may use different combinations of prosodic cues to indicate stress, listeners are able to reliably identify stress (Howell, 1993; Patel, 2003, 2004; Peppé et al., 2000). For example, some speakers may rely more on duration than pitch or loudness to indicate stress, while others may use pitch or loudness more, and naïve listeners appear to be able to leverage this phenomenon to perceive stress even in cases of severely dysarthric speech (Patel, 2002b). Such cross-speaker cue trading is consistent with both the Integrated and Independent Channel models[13]. However, the models make differential predictions regarding the effect of perturbing one of the cues for stress in real time during speech production. In the Integrated Model, if the model’s perceived pitch level were perturbed downward while it was speaking a stressed syllable, it would compensate for the resulting Stress Error (Fig. 10, top) by increasing not only pitch but also loudness and duration. In the Independent Channel Model, however, the pitch perturbation would lead to a pitch error (ΔP in the bottom panel of Fig. 10) that would drive an increase in pitch only, not loudness or duration. These predictions will be tested in the experiments proposed below.
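
The contrast between the two predictions can be made explicit with a minimal sketch under assumed linear error-correction dynamics; the gains and cue weights below are illustrative values, not fitted parameters:

```python
# A minimal sketch, under assumed linear error-correction dynamics, of the
# contrasting predictions of the two architectures in Fig. 10 for a downward
# pitch perturbation. Gains and cue weights are illustrative, not fitted values.
import numpy as np

targets = np.array([10.0, 10.0, 10.0])   # [pitch, duration, loudness] stress targets
perturb = np.array([-3.0, 0.0, 0.0])     # perturbation applied to perceived pitch only
weights = np.array([0.4, 0.3, 0.3])      # cue weights defining perceived stress
gain = 0.5

def integrated_update(commands):
    # One stress error drives corrections to all three cues.
    stress_error = np.dot(weights, targets) - np.dot(weights, commands + perturb)
    return commands + gain * stress_error * weights

def independent_update(commands):
    # Each cue is corrected only from its own error; only pitch compensates.
    cue_errors = targets - (commands + perturb)
    return commands + gain * cue_errors

for name, update in [("Integrated", integrated_update), ("Independent", independent_update)]:
    cmd = targets.copy()
    for _ in range(25):
        cmd = update(cmd)
    print(f"{name} model final [P, D, L] commands:", np.round(cmd, 2))
```

In the integrated case all three commands rise to offset the pitch shift, whereas in the independent case only the pitch command changes, which is exactly the contrast the perturbation experiments are designed to detect.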

We will create two versions of the DIVA model: one involving integrated control of stress (Fig. 10 top) and one involving independent control of pitch, duration, and loudness (Fig. 10 bottom). The models will be implemented using the Matlab programming environment on Windows and Linux workstations equipped with large amounts of RAM (4 Gigabytes) to allow manipulation of large matrices of synaptic weights, as well as other memory-intensive computations, in the neural network simulations. Model implementation involves the definition of mathematical equations representing the “activity levels” of model neurons (corresponding approximately to neural spiking frequency) as well as the synaptic connections between sets of neurons. Three new sets of neurons are needed in the model to account for prosody (see Fig. 10): premotor cortex neurons representing prosodic targets, low-level auditory cortex neurons representing the pitch, duration and loudness cues available in the acoustic signal, and high-level auditory cortex prosodic cue error cells. The equations representing these neural activities will be constrained to be in general accord with cell properties found in neurophysiological studies in primate auditory cortex (e.g., Bendor & Wang, 2005; Godey et al., 2005) and ventral premotor cortex (e.g., Ferrari et al., 2003; Rizzolatti et al., 2003). For the synaptic weights, we will use biologically plausible synaptic modification equations that rely only on pre- and post-synaptically available information for weight adjustment, as in our prior work (e.g., Guenther, 1994, 1995; Guenther et al., 1998). After equations for the cell activities and synaptic weights have been defined, computer code implementing these equations will be written, tested, and integrated with the existing DIVA model software framework.
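
As an example of the kind of learning law intended, the sketch below shows a generic instar-style update in which each weight change depends only on pre- and post-synaptic terms; the learning rate, layer sizes, winner-take-all response, and input statistics are illustrative assumptions, not the specific equations that will be used:

```python
# A generic instar-style sketch of a learning law that uses only pre- and
# post-synaptic terms. All parameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n_pre, n_post = 8, 3
W = rng.uniform(0.0, 0.1, size=(n_post, n_pre))    # synaptic weight matrix
eta = 0.1                                           # learning rate

for _ in range(500):
    pre = rng.uniform(0.0, 1.0, n_pre)              # presynaptic activity pattern
    post = np.zeros(n_post)
    post[np.argmax(W @ pre)] = 1.0                  # simple competitive (winner-take-all) response
    W += eta * post[:, None] * (pre[None, :] - W)   # weight change uses only pre and post terms

print(np.round(W, 2))
```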

After computer implementation, the different versions of the model will be simulated on the same speech tasks used in the experimental studies described below, and the models’ adequacy in accounting for the experimental data will be analyzed. The details of this process are provided after the experimental descriptions.

Psychophysical experiments. The proposed psychophysical experiments investigate the adaptive responses of subjects to externally imposed perturbations of prosodic cues in their own speech. We have successfully implemented real-time perturbation of auditory cues using a Texas Instruments (TI DSK 6713) digital signal processing board that is portable and can be used for both psychophysical and fMRI experiments. In addition to F1 perturbations (as in Section C.2; also Villacorta et al., 2004), we have implemented, tested, and validated perturbations to pitch and intensity in pilot studies for this subproject, as well as real-time pitch tracking and syllable segmentation algorithms (see Fig. 11 below for pilot results).
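
To illustrate one ingredient of such a system, the sketch below shows a simplified, offline autocorrelation pitch estimate of the sort required before F0 can be shifted; it is not the real-time TI DSP implementation, and the sampling rate, frame length, and pitch range are assumed values:

```python
# A simplified offline sketch (not the real-time TI DSP implementation) of
# autocorrelation-based pitch estimation of the sort required before F0 can be
# shifted. Sampling rate, frame length, and pitch range are assumed values.
import numpy as np

def estimate_f0(frame, fs, fmin=75.0, fmax=400.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])      # strongest peak within the plausible pitch range
    return fs / lag

fs = 16000
t = np.arange(int(0.04 * fs)) / fs       # one 40 ms analysis frame
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print("estimated F0 (Hz):", round(estimate_f0(frame, fs), 1))
```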

The perturbation experiments are designed to clarify the degree to which the neural control mechanisms for different prosodic cues (pitch, duration, and loudness) are integrated (Fig. 10 top) or independent (Fig. 10 bottom). The psychophysical experiments will be performed and analyzed at Dr. Rupal Patel’s laboratory at Northeastern University (see letter of cooperation). Dr. Patel has experience conducting production and perception experiments on prosodic control in children and in healthy and dysarthric adults (e.g., Patel, 2002a,b, 2003, 2004). Her lab is equipped with a sound-treated booth and software infrastructure for perceptual and acoustic analyses (see Resources page for Northeastern University subcontract for details). Subjects. All subjects will be monolingual speakers of American English between the ages of 18 and 55 with no known speech, language, or neurological disorders (older subjects will be excluded due to possible high-frequency hearing loss). All subjects will be required to pass a hearing screening with thresholds at or below 25 dB in at least one ear at 500, 1000, 2000 and 4000 Hz. Power analysis. Following the methodology described in Zarahn and Slifstein (2001), we utilized the data from our previous sensorimotor adaptation experiment involving perturbation of the first formant frequency (see Section C.2) to obtain reference parameters from which to derive power estimations for the psychophysical studies proposed here. F1 measures from the last half of the training phase were used to obtain measures of within- and between-subject variability, and the difference between average F1 of downward-perturbed subjects compared to upward-perturbed subjects was used as a reference effect size. The power estimate indicates that 12 subjects are enough to detect (with probability>.9 at a P ................