INTRODUCTION

Overview. This application is a first resubmission. The original application proposed three inter-related subprojects concerning modeling and neuroimaging of (i) the brain regions involved in speech sound sequence generation, (ii) the neural processes underlying the learning of new sound sequences, and (iii) problems with this sequencing circuitry that may underlie stuttering. The reviewers of the original application generally agreed that the proposed research was highly innovative, was of high potential significance, and was theoretically well-motivated. We have therefore left the Background and Significance intact, as well as the theoretical description of the proposed model and the theoretical motivation for the fMRI experiments (with the exception of the removal of the stuttering component of the project, as described below). The reviewers’ primary concerns were that the application was too ambitious, and that (perhaps as a result of the large scope) important details concerning the experiments were lacking, as were descriptions of how the model simulations would be compared to the experimental results. Our revisions have therefore focused on improvements to these aspects of the proposal, as detailed in the following paragraphs. Changes to the application text are indicated with vertical bars in the left margin of the Research Plan.

Project was too ambitious as originally proposed. The reviewers generally felt that the project as originally proposed was too ambitious. Reviewer 3 explicitly referred to the stuttering work, which formed over 1/3 of the original research plan, as overextending the project. To address these concerns, we have removed the stuttering component of the proposed project and used the resulting space to expand the remainder of the research design and methods. Removal of the stuttering component also addresses the concern of Reviewers 1 and 3 that this component may be confounded by treatment history and/or compensatory strategies of the stuttering subjects. Removing the stuttering component entailed the removal of three fMRI experiments and three modeling projects.

Simulating a BOLD signal from the model and comparing model activations to fMRI results. The reviewers generally felt that the description of how the model simulations will be compared to the results of fMRI experiments was not sufficiently detailed. To address this concern, we have added a section entitled “Generating simulated fMRI activations from model simulations” (Section C.2), in which we detail how we simulate fMRI data from the models. Our method is based on the most recent results concerning the relationship between neural activity and the blood oxygen level dependent (BOLD) signal measured with fMRI, as detailed in this section. The section also includes a treatment of how inhibitory neurons are modeled in terms of BOLD effects, thus addressing Reviewer 3’s concern about this issue. Additional text describing how we will compare the model fMRI activations to the results of our fMRI experiments has been added after the descriptions of each of the fMRI experiments.

fMRI power analysis. Reviewer 2 expressed concern regarding how many subjects would be needed to obtain significant results in our fMRI experiments. To address this concern, we have added a subsection entitled “fMRI power analysis” at the beginning of Section D that justifies the subject population sizes proposed in the fMRI experiments. In this analysis, we have considered the possibility that many trials may contain production errors such as insertion of extra phonemes (as pointed out by Reviewers 2 and 3). When determining the number of subjects needed to obtain significant results, we have very conservatively assumed that as many as 30% of the trials in the experiments in Section D.1 and 50% of the trials in the experiments in Section D.2 may need to be removed from the analysis due to such errors. These assumed error rates are much higher than those obtained in our previous fMRI experiment on sound sequencing described in Section C.4 (which had an average error rate of 14.4% for the most difficult utterances) and our pilot studies for the learning experiments in D.2 (which had an average error rate of 21%).

Effective connectivity analyses. Reviewer 1 noted that effective connectivity analyses should be more emphasized in the proposal since they may provide a valuable means for testing model predictions. We have accordingly added a sub-section called “Effective connectivity analysis” in Section D of the proposal. We have also added more explicit descriptions of how effective connectivity analysis will be used to test predictions in the fMRI experiments.

Hypothesis tests. The reviewers also felt that descriptions of how specific hypotheses would be tested were lacking. We have thus added text explicitly stating the hypotheses to be tested and the manner in which they will be tested. To address reviewer concerns that it was not clear what we would conclude if our hypotheses are not supported, we have added descriptions of alternative interpretations. It is important to note that, although most of the hypotheses to be tested are embodied by our proposed neural model, they are not simply our view, but instead reflect current hypotheses proposed by many other researchers studying non-speech motor control, as detailed in the theoretical background portions of Section D. Indeed, these hypotheses were the primary forces that shaped the proposed model. Thus our experiments not only test our particular model, but they also test whether these non-speech motor control theories, which generally arise from the animal literature, generalize to the neural processes underlying speech production in humans. If the hypotheses are supported by our experimental results, then we have gained important information regarding similarities between the brain mechanisms underlying speech in humans and non-speech motor behaviors in animals. If the hypotheses are not supported, we have gained equally important information regarding how the neural bases of speech motor control differ from other forms of motor control.

Potential problems with the fMRI experiments of novel sound sequence learning. Two reviewers expressed concern regarding the design of fMRI experiments involving novel speech sound sequence learning in Section D.2, in particular the experiment involving generation of novel sub-syllabic sequences that are phonotactically illegal or highly unlikely in English. Specifically, the reviewers pointed out that subjects might insert additional phonemes or change some consonant clusters into more familiar forms, which would make these utterances easier to produce. Reviewer 1 also noted that the original experiment, which involved learning within a single fMRI scanning session, would necessarily be limited to very early stages of learning. To address these concerns, we have modified the design of the fMRI experiments in Section D.2. The experiments now involve learning in two training sessions that take place on different days, performed before the fMRI experiment. Furthermore, only subjects that show significant learning over the training sessions (as measured by improvements in error rate, duration, and reaction time) will be used in the corresponding fMRI experiment. Finally, all trials in the fMRI experiment will be checked for production errors, and any trials containing errors will be removed from subsequent data analysis.

Timing issues. Reviewer 3 stated that it was not clear how much priority is being given to modeling timing. We have provided text clarifying this in several places. The models to be developed will make systematic predictions regarding latencies and other aspects of the timing of speech sequences. In particular, we have emphasized that, in a recent comparison of four classes of sequencing models conducted by Farrell & Lewandowsky (2004), the modeling framework adopted here (competitive queuing, CQ) was the only one shown to be able to explain not just latency patterns for correct performances but also the latencies of errors.

Computational framework. Reviewer 3 also expressed concern about missing details regarding software, hardware, etc. used to implement the proposed computational model. A sub-section called “Computational framework” has been added to the beginning of Section D to address this concern.

Miscellaneous concerns. Two reviewers were concerned with potential confounds in the jaw clench condition in fMRI Experiment 1 in Section D.1. This condition has been removed in the revised experiment.


A. SPECIFIC AIMS

The primary aim of this project is to develop and experimentally test a neural model of the brain interactions underlying the production of speech sound sequences. In particular, we will focus on several brain regions thought to be involved in motor sequence production, including the lateral prefrontal cortex, lateral premotor cortex, supplementary motor area (SMA), pre-SMA, and associated subcortical structures (basal ganglia, cerebellum, and thalamus). Each of these brain regions will be modeled mathematically with equations governing neuron activities, and the interactions between the regions will be modeled with equations governing synaptic strengths. The resulting model will be implemented in computer software and integrated with an existing neural model of speech sound production, the DIVA[1] model (Guenther, 1994, 1995; Guenther et al., 1998, in press), to allow generation of simulated articulator movements (along with corresponding acoustic signals) for producing speech sound sequences. The results of these computer simulations will be compared to existing behavioral and functional neuroimaging data to guide model development. We also propose four new functional magnetic resonance imaging (fMRI) experiments designed to test key hypotheses of the model, to test between the model and competing hypotheses, and to fill in gaps in the existing neuroimaging literature.

The project is divided into two separate but highly integrated subprojects whose aims are as follows:

(1) Creating and testing a neural model of speech sequence production. The primary aim of this subproject is to develop a model of the neural circuits involved in the properly ordered and properly timed production of speech sound sequences, such as a sequence of phonemes, syllables, and words making up a sentence. The model will be implemented in computer software and integrated with the DIVA model. The DIVA model describes the brain mechanisms responsible for producing individual speech sounds, whereas the model developed in this proposal describes the “higher-level” brain mechanisms involved in representing a sequence of sounds and determining which sound in the sequence to produce next (sequencing), as well as when to produce it (initiation). Thus the output of the current model essentially acts as input to the DIVA model, which then commands the sequence of articulator movements needed to produce each sound. We also propose two speech production fMRI experiments which test key hypotheses of our preliminary model: (i) the processing of syllable “frames” by the SMA/pre-SMA and “content” by lateral premotor areas, and (ii) the existence of a working memory representation for speech sound sequences in the inferior frontal sulcus. Simulations of the model performing the same speech tasks as subjects in the fMRI experiments will be run, and the results of these simulations will be compared to results from the fMRI studies to test key aspects of the model.

(2) Investigating the learning of new speech sequences. The primary aim of this subproject is to further develop the model created in Subproject 1 to incorporate the effects of practice on the neural circuits underlying speech sound sequence generation. In two modeling projects, we will model learning effects as changes in synaptic strengths in two subcortical structures: the basal ganglia and the cerebellum. This work will be guided by the existing literature on learning of motor sequences, and we propose two behavioral experiments and two corresponding fMRI experiments to test hypotheses concerning learning to quickly and accurately produce novel supra-syllabic sequences (multi-syllable utterances involving novel combinations of known syllables) and sub-syllabic sequences (new syllables consisting of infrequent combinations of phonemes).

We believe our integrated approach of computational neural modeling and functional brain imaging will provide a clearer, more mechanistic account of the neural processes underlying speech production in normal speakers and individuals with disorders affecting speech sound initiation and sequencing. In the long term, we believe this improved understanding will aid in developing better treatments for these disorders.

B. BACKGROUND AND SIGNIFICANCE

Combining neural models and functional brain imaging to understand the neural bases of speech. Recent years have witnessed a large number of functional brain imaging experiments studying speech and language, and much has been learned from these studies regarding the brain mechanisms underlying speech and its disorders. For example, functional magnetic resonance imaging (fMRI) studies have identified the cortical and subcortical areas involved in simple speech tasks (e.g., Hickok et al., 2000; Riecker et al., 2000a,b; Wise et al., 1999; Wildgruber et al., 2001) as well as more complex language tasks (e.g., Dapretto & Bookheimer, 1999; Kerns et al., 2004; Vingerhoets et al., 2003). However, these imaging experiments do not, by themselves, answer the question of what important function, if any, a particular brain region may play in speech. For example, activity in the anterior insula has been identified in numerous speech neuroimaging studies (Wise et al., 1999; Hickok et al., 2000; Riecker et al., 2000a), but much controversy still exists concerning its particular role in the neural control of speech (Dronkers, 1996; Ackermann & Riecker, 2004; Hillis et al., 2004; Shuster & Lemieux, 2005). A better understanding of the exact roles of different brain regions in speech requires the formulation of computational neural models whose components model the computations performed by individual brain regions, as well as the interactions between these regions (see Horwitz & Braun, 2004; Husain et al., 2004; Tagamets & Horwitz, 1997; Fagg & Arbib, 1998 for other examples of this approach). In the past decade we have developed one such model of speech production called the DIVA model (Guenther, 1994, 1995; Guenther et al., 1998, in press). The model’s components correspond to regions of the cerebral cortex and cerebellum, and they consist of modeled neurons whose activities during speech tasks can be measured and compared to the brain activities of human subjects performing the same task (e.g., Guenther et al., in press, included in Appendix materials for this application). These neurons are connected by adaptive synapses that become tuned during a babbling process as well as with continued practice with a speech sound. Computer simulations of the model have been shown to provide a unified account for a wide range of observations concerning speech acquisition and production, including data concerning the kinematics and acoustics of speech movements (Callan et al., 2000; Guenther, 1994, 1995; Guenther et al., 1998, in press; Guenther and Ghosh, 2003; Perkell et al., 2000) as well as the brain activities underlying speech production (Guenther et al., in press). The DIVA model computes articulator movement trajectories for producing individual speech sounds that are presented to it by the modeler. Importantly, however, the DIVA model does not model the brain regions and computations that are responsible for the initiation and sequencing of speech sounds[2]. Understanding the computations performed by these brain regions could provide important insight into a number of communication disorders in which this circuitry appears to malfunction.

In this application we propose to develop a new model that identifies the neural computations underlying the initiation and sequencing of speech sounds during production. For clarity, we will refer to this model as the sequence model in this grant application. This model addresses several brain regions not treated in the DIVA model, including the supplementary and pre-supplementary motor areas, ventrolateral prefrontal cortex, and basal ganglia. These brain regions are believed to be involved in the initiation and sequencing of motor actions and appear to function abnormally in several communication disorders (see Communication disorders involving sequencing and/or initiation of speech sounds below). Furthermore, we propose fMRI studies of speech initiation and sequencing designed to test and refine this neural model. The “neural” nature of the model’s components makes possible direct comparisons between the model’s cell activities and the results of neuroimaging experiments. The resulting model will provide a computational account of normal speech mechanisms, and it will serve as a theoretical framework for investigating communication disorders involving malfunctions in speech initiation and sequencing, thus aiding in the development of better treatments/prostheses for these disorders.

Sequencing in motor control and language. A sine qua non of speech production is our ability to learn and perform many sequences defined over a relatively small set of elements. Behaviorist theories postulated that sequences are produced by sequential chaining, in which associative links allowed early responses in a sequence to elicit later ones. Recurrent neural network models (e.g., Elman, 1995; Dominey, 1998; Beiser & Houk, 1998) proposed revisions to the associative chaining theory, hypothesizing that an entire series of sequence-specific cognitive states must be learned to mediate any sequence recall. Although this type of recurrent net allows more than one sequence to be learned over the same alphabet of elements, there is no basis for performance of novel sequences, learning is often unrealistically slow with poor temporal generalization (Henson et al., 1996; Page, 2000; Wang, 1996), and internal recall of a sequence remains an iterative sequential operation. In contrast, competitive queuing (CQ) models allow performance of novel sequences (including reuse of the same alphabet of elements), rapid learning, and internal recall of a sequence representation as a parallel operation. Since Lashley (1951), behavioral evidence has accumulated (cf. Rhodes et al., 2004) to support the idea that parallel representation of elements constituting a sequence underlies much of our learned serial behavior. From speech and typing errors, Lashley inferred that there must be an active “co-temporal” representation of the items constituting a forthcoming sequence. He also inferred that item-item associative links may be unnecessary in, and even a hindrance to, the learning of many sequences defined over a small finite “alphabet”. Left unanswered were questions of mechanism: What is the nature of the parallel representation? How is the relative priority of simultaneously active item representations “tagged”? What limitations are inherent in this representation? What mechanisms convert the parallel representation to serial action? All four questions have been addressed, without any reliance on item-item associations, in various CQ models (Grossberg, 1978a,b; Houghton, 1990; Bullock & Rhodes, 2003). These neural network models postulate that a standing parallel representation of all the items constituting a planned sequence exists in a motor working memory prior to initiating performance of the first item. As explained in Section C.2, this parallel representation works in tandem with an iterated choice process to generate a sequential performance.

To date, CQ-compatible neural models have been used to account for data in many domains of learned serial behavior, including: eye movements (Grossberg & Kuperstein, 1986); recall of novel lists (Boardman & Bullock, 1991; Page & Norris, 1998) and highly practiced lists (Rhodes & Bullock, 2002); cursive handwriting (Bullock et al., 1993); working memory storage of sequential inputs (Bradski et al., 1994); word recognition and recall (Grossberg, 1986; Hartley & Houghton, 1996; Gupta & MacWhinney, 1997); language production (Dell et al., 1997; Ward, 1994); and music learning and performance (Mannes, 1994; Page, 1999). The stature of CQ as a neurobiological model has grown steadily due to an accumulation of directly pertinent neurophysiological observations (e.g., Averbeck et al., 2002, 2003; discussed in C.2). Section C.2 will summarize CQ theory and the evidence that led us to choose CQ circuitry as a core for a new model of sequential speech motor control.

Communication disorders involving sequencing and/or initiation of speech sounds. A number of communication disorders, including aphasias, apraxia of speech (AOS), and stuttering, include deficits in the proper initiation and/or sequencing of speech sounds. Sequencing errors known as literal or phonemic paraphasias, in which “well-formed sounds or syllables are substituted or transposed in an otherwise recognizable target word” (Goodglass, 1993), occur in most aphasic patients, most commonly in conduction aphasics. Most of the common symptoms reported in AOS patients[3] are selection and sequencing problems, including articulation errors, phonemic mistakes, prosodic disturbances, difficulties initiating speech, and slowed speech (McNeil and Doyle, 2004; Dronkers, 1996). Several brain areas have been implicated in AOS, including the left premotor cortex, Broca’s area, and the left anterior insula (Miller, 2002). Models of speech production have largely been unable to inform the study of AOS because “theories of AOS encounter a dilemma in that they begin where the most powerful models of movement control end and end where most cognitive neurolinguistic models begin” (Ziegler, 2002). The model proposed herein attempts to fill this gap between neurolinguistic models and movement control models.

Though different in many ways, stuttering, which affects approximately 1% of the adult population in the United States, shares with AOS the trait of improper initiation of speech motor programs without impairment of comprehension or damage to the peripheral speech neuromuscular system (Kent, 2000; Dronkers, 1996). Phenomenological, physiological, and psychological studies of developmental stuttering over the past several decades have produced a large body of data and spawned many theories regarding both its etiology (e.g., Geschwind & Galaburda, 1985; Starkweather, 1987; Travis, 1931; West, 1958) and its expression (e.g., Johnson & Knott, 1936; Mysak, 1960; Perkins et al., 1991; Postma & Kolk, 1993; Zimmerman, 1980). The advent of structural and functional imaging technologies has provided a means to investigate the neural underpinnings of developmental stuttering and greatly increased the available data while offering a new perspective on the disorder. A proliferation of imaging studies prompted Peter Fox, whose group has conducted several imaging studies of stuttering (e.g., Fox et al., 1996, 2000), to describe the need for a model of the neural systems of speech, their breakdown in stuttering, and their normalization with treatment, in order to advance the study of developmental stuttering (Fox, 2003). The modeling work described here will provide a substrate for exploring the initiation and repetition behaviors in persons who stutter. Perhaps more importantly, it will provide a cohesive framework for the examination of available data and the assessment of theories of stuttering and other disorders (see also van der Merwe, 1997).

In addition to considering symptom-based diagnoses, it is important to consider the effects of lesions and pathological conditions involving particular brain regions on speech processes. Case studies in patients with lesions of the supplementary motor area (e.g. Jonas, 1981, 1987; Ziegler et al., 1997; Pai, 1999) or basal ganglia pathologies (e.g. Ho et al., 1998; Pickett et al., 1998) have shown that these areas provide specific contributions to the sequencing and initiation of speech sounds. While pathological speech data are abundant, parsimonious explanations for differential syndromes remain elusive. Many authors have noted the importance of establishing well-specified models of normal and disordered speech to help provide differential diagnoses and treatment options for these conditions. In just this context, while describing the DIVA model of speech production, McNeil et al. (2004) write “While this model addresses phenomena that may be relevant in the differential diagnosis of motor speech disorders…in it’s current stage of development it has not been extended to make claims about the relationship between disrupted processing and speech errors in motor speech disorders” (p. 406). The work proposed here seeks to extend the DIVA model in this direction.

C. PRELIMINARY STUDIES

C.1. The DIVA model. Over the past decade our laboratory has developed and experimentally tested the DIVA model, a neural network model of the brain processes underlying speech acquisition and production (e.g., Guenther, 1994, 1995; Guenther et al., 1998; Guenther et al., in press). The model is able to produce speech sounds (including both articulator movements and a corresponding acoustic signal) by learning mappings between articulator movements and their acoustic consequences, as well as auditory and somatosensory targets for individual speech sounds. It accounts for a number of speech production phenomena including aspects of speech acquisition, coarticulation, contextual variability, motor equivalence, velocity/distance relationships, and speaking rate effects (see Guenther, 1995 and Guenther et al., 1998 in Appendix documents). The latest version of the DIVA model is detailed in Guenther et al. (in press), in the Appendix documents. Here we briefly describe the model with attention to aspects relevant for this proposal.

A schematic of the DIVA model is shown in Fig. 1. Each box in the diagram corresponds to a set of neurons in the model, and arrows correspond to synaptic projections that form mappings from one type of neural representation to another. Several mappings in the network are tuned during a babbling phase in which semi-random articulator movements lead to auditory and somatosensory feedback; the model’s synaptic projections are adjusted to encode sensory-motor relationships based on this combination of articulatory, auditory, and somatosensory information. The model posits additional forms of learning wherein (i) auditory targets for speech sounds are learned through exposure to the native language (Auditory Goal Region in Fig. 1), (ii) feedforward commands between premotor and motor cortical areas are learned during “practice” in which the model attempts to produce a learned sound (Feedforward Command), and (iii) somatosensory targets for speech sounds are learned through practice (Somatosensory Goal Region).

In the model, production of a phoneme or syllable starts with activation of a Speech Sound Map cell corresponding to the sound to be produced. Speech sound map cells are hypothesized to lie in left lateral premotor cortex, specifically posterior Broca’s area (left Brodmann’s Area 44; abbreviated as BA 44 herein). After the cell has been activated, signals project from the cell to the auditory and somatosensory cortical areas through tuned synapses that encode sensory expectations for the sound, where they are compared to incoming sensory information. Any discrepancy between expected and actual sensory information constitutes a production error (Auditory Error and/or Somatosensory Error) which leads to corrective movements via projections from the sensory areas to the motor cortex (Auditory Feedback-based Command and Somatosensory Feedback-based Command). Additional synaptic projections from speech sound map cells to the motor cortex form a feedforward motor command; this command is tuned by monitoring the effects of the feedback control system and incorporating corrective commands into the feedforward command.

Feedforward and feedback control signals are combined in the model’s motor cortex. Early in development the feedforward command is inaccurate, and the model depends on feedback control. Over time, however, the feedforward command becomes well tuned through monitoring of the movements controlled by the feedback subsystem. Once the feedforward subsystem is accurately tuned, the system can rely almost entirely on feedforward commands because no sensory errors are generated unless external perturbations are applied.
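
To make this control scheme concrete, the sketch below illustrates one way the combination could be computed at a single time step. This is a minimal Python sketch under our own simplifying assumptions: the function names, the pseudoinverse mapping J, and the gain and learning-rate parameters are illustrative, not the model’s actual equations (which are given in Guenther et al., in press).

```python
import numpy as np

def combined_command(feedforward, sens_target, sens_actual, J, k_fb=1.0):
    """One control step: sensory error drives a corrective feedback
    command, which is summed with the learned feedforward command.
    J maps articulator changes to sensory changes; its pseudoinverse
    converts sensory errors into corrective articulator movements."""
    error = sens_target - sens_actual            # auditory/somatosensory error cells
    feedback_cmd = k_fb * (np.linalg.pinv(J) @ error)
    return feedforward + feedback_cmd            # combined in model motor cortex

def update_feedforward(feedforward, feedback_cmd, eta=0.1):
    """Feedforward tuning: corrective commands issued by the feedback
    subsystem are gradually folded into the feedforward command, so
    sensory errors (and feedback corrections) shrink with practice."""
    return feedforward + eta * feedback_cmd
```

As the feedforward term absorbs the corrections, the error term approaches zero and the system comes to rely almost entirely on feedforward control, as described above.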

Of particular note for the current application are the BA 44 speech sound map cells that form the “input” to the DIVA model. In Section D we describe a new model of the sequencing and initiation of sound sequences. Activation of the speech sound map cells in the DIVA model essentially constitutes the “output” of this new model, which will be integrated with the DIVA model to form a more complete description of the neural bases underlying speech. Also of note for the current application is the fact that the latest version of the DIVA model (Guenther et al., in press, in Appendix documents) incorporates realistic transmission delays between brain regions, including sensory feedback delays. This development has made it possible to more precisely simulate the timing of articulator movements during normal and perturbed speech. The results of these simulations closely approximate behavioral results (see Guenther et al., in press in Appendix). The modeling project proposed in Section D.1 extends this work to account for serial reaction time studies of sequence production. Finally, a unique feature of the DIVA model is that each of the model’s components is associated with a particular neuroanatomical location based on the results of fMRI and PET studies of speech production and articulation (see Guenther et al., in press in Appendix documents for details). Since the model’s components correspond to groups of neurons, it is possible to generate simulated fMRI activations corresponding to model cell activities during a simulation (described further in Section C.2). Fig. 2 compares results from an fMRI study of single consonant-vowel (CV) syllable production performed by our lab to simulated fMRI data from the DIVA model in the same speech task. Comparison of the top and bottom panels of Fig. 2 indicates that the model accounts for most of the fMRI activations. The proposed research will extend this work to account for additional fMRI activations that occur in more complex multi-syllabic speaking tasks.

C.2 Generating simulated fMRI activations from model simulations. The relationship between the signal measured in blood oxygen level dependent (BOLD) fMRI and electrical activity of neurons has been studied by numerous investigators in recent years. It is well known that the BOLD signal is relatively sluggish compared to electrical neural activity. That is, for a very brief burst of neural activity, the BOLD signal will begin to rise and continue rising well after the neural activity stops, peaking about 4-6 seconds after the burst, falling to somewhat below the starting level around 10-12 seconds after the burst, and then slowly returning to the starting level. This hemodynamic response function (HRF) is schematized in the figure at right. We use such a response function, which is part of the SPM software package that we use for fMRI data analysis, to transform neural activities in our model cells into simulated fMRI activity. However, there are different possible definitions of “neural activity”, and the exact nature of the neural activity that gives rise to the BOLD signal is still a matter of debate (e.g., Caesar et al., 2003; Heeger et al., 2000; Logothetis et al., 2001; Logothetis and Pfeuffer, 2004; Rees et al., 2000; Tagamets and Horwitz, 2001).

In our modeling work, each model cell is hypothesized to correspond to a small population of neurons that fire together. The output of the cell corresponds to neural firing rate (i.e., the average number of action potentials per second across the population of neurons). This output is sent to other cells in the network, where it is multiplied by synaptic weights to form synaptic inputs to these cells. The activity level of a cell is calculated as the sum of all the synaptic inputs to the cell (both excitatory and inhibitory); if the net activity is above zero, the cell’s output is proportional to the activity level, and if the net activity is below zero, the cell’s output is zero. It has been shown that the magnitude of the BOLD signal typically scales proportionally with the average firing rate of the neurons in the region where the BOLD signal is measured (e.g., Heeger et al., 2000; Rees et al., 2000). It has been noted elsewhere, however, that the BOLD signal actually correlates more closely with local field potentials, which are thought to arise primarily from averaged postsynaptic potentials (corresponding to the inputs of neurons), than it does with the average firing rate of an area (Logothetis et al., 2001). In particular, whereas the average firing rate may habituate down to zero with prolonged stimulation (greater than 2 sec), the local field potential and BOLD signal do not habituate completely, maintaining non-zero steady state values with prolonged stimulation. In accord with this finding, the fMRI activations that we generate from our models are determined by convolving the total inputs to our modeled neurons (i.e., the activity level as defined above), rather than the outputs[4] (firing rates), with an idealized hemodynamic response function generated using default settings of the function ‘spm_hrf’ from the SPM toolbox (see Guenther et al., in press in Appendix for details).
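
The following minimal sketch illustrates the procedure (in Python rather than our Matlab implementation; the double-gamma HRF below approximates, but is not identical to, the spm_hrf defaults, and the input time course is an arbitrary example): the summed synaptic input to a cell is rectified to obtain its output firing rate, but it is the input, not the output, that is convolved with the HRF to produce the simulated BOLD response.

```python
import numpy as np
from scipy.stats import gamma

def hrf(dt, duration=32.0):
    """Double-gamma HRF resembling the spm_hrf defaults: positive lobe
    peaking near 5 s, undershoot near 15 s, amplitude ratio ~6:1."""
    t = np.arange(0.0, duration, dt)
    h = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return h / h.sum()

dt = 0.1                                        # time step (s)
t = np.arange(0.0, 20.0, dt)
excit = ((t > 2.0) & (t < 4.0)).astype(float)   # 2 s excitatory burst
inhib = 0.3 * np.roll(excit, int(0.5 / dt))     # weaker, delayed inhibitory input
activity = excit - inhib                        # summed synaptic input ("activity level")
output = np.maximum(activity, 0.0)              # firing rate: rectified activity

# Simulated BOLD: convolve the *input* with the HRF. A cell silenced by
# inhibition (output = 0) can therefore still contribute to the BOLD signal.
bold = np.convolve(activity, hrf(dt))[: len(t)]
```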

In our models, an active inhibitory neuron has two effects on the BOLD signal: (i) the total input to the inhibitory neuron will have a positive effect on the local BOLD signal, and (ii) the output of the inhibitory neuron will act as an inhibitory input to excitatory neurons, thereby decreasing their summed input and, in turn, reducing the corresponding BOLD signal. Relatedly, it has been shown that inhibition to a neuron can cause a decrease in the firing rate of that neuron while at the same time causing an increase in cerebral blood flow, which is closely related to the BOLD signal (Caesar et al., 2003). Caesar et al. (2003) conclude that this cerebral blood flow increase probably occurs as the result of excitation of inhibitory neurons, consistent with our model. They also note that the cerebral blood flow increase caused by combined excitatory and inhibitory inputs is somewhat less than the sum of the increases to each input type alone; this is also consistent with our model since the increase in BOLD signal caused by the active inhibitory neurons is somewhat counteracted by the inhibitory effect of these neurons on the total input to excitatory neurons.

Figure 4 illustrates the process of generating fMRI activations from a model simulation and comparing the resulting activation to the results of an fMRI experiment designed to test a model prediction. The left panel of the figure illustrates the locations of the DIVA model components on the “standard” single subject brain from the SPM2 software package. The DIVA model predicts that unexpected perturbation of the jaw during speech will cause a mismatch between somatosensory targets and actual somatosensory inputs, causing activation of somatosensory error cells in higher-order somatosensory cortical areas in the supramarginal gyrus of the inferior parietal cortex. The location of these cells is denoted by ΔS in the left panel of the figure. Simulations of the DIVA model producing speech sounds with and without jaw perturbation were performed. The top middle panel indicates the neural activity (gray) of the somatosensory error signals in the perturbed condition minus activity in the unperturbed condition, along with the resulting BOLD signal (black). Since the somatosensory error cells are more active in the perturbed condition, a relatively large positive response is seen in the BOLD signal. Auditory error cells, on the other hand, show little differential activation in the two conditions since very little auditory error is created by the jaw perturbation (bottom middle panel), and thus the BOLD signal for the auditory error cells in the perturbed – unperturbed contrast is near zero. The derived BOLD signals are Gaussian smoothed spatially and plotted on the standard SPM brain in the top right panel. The bottom right panel shows the results of an fMRI study we carried out to compare perturbed and unperturbed speech (13 subjects, random effects analysis, false discovery rate = 0.05). In this case, the model correctly predicts the existence and location of somatosensory error cell activation, but additional activation not explained by the model is found in the left frontal operculum region.

C.3 Competitive queuing (CQ) models of motor sequencing. As described above, competitive queuing (CQ) models have been applied to many domains of learned serial behavior (see Background and Significance) and account for a number of behavioral and neurophysiological observations. An Investigator on the proposed project, Dr. Daniel Bullock, and his colleagues have published a series of articles describing CQ models and providing computer simulations verifying the ability of this class of models to account for sequencing, timing, and kinematic phenomena in non-speech motor tasks (Boardman & Bullock, 1991; Bullock et al. 1993; Bullock et al., 1999; Rhodes & Bullock, 2002; Bullock & Rhodes, 2003; Brown et al., 2004; Bullock, 2004a,b; Rhodes et al., 2004). Here we describe the basic CQ mechanism as formulated in such studies; further details are available in Bullock (2004a) and Rhodes et al. (2004) in the Appendix materials.

A schematic of a CQ network is shown in Fig. 5. A fundamental principle of CQ networks is the parallel representation of items in a motor sequence (e.g., a sequence of phonemes making up an utterance) in working memory prior to initiation of movement. These items are represented by nodes in the model’s planning layer. The relative activations of these nodes determine the ordering of the segments within the sequence. Items in the planning layer compete via mutually inhibitory connections at the competitive choice layer. The “winning” item, typically the item with the highest activity in the planning layer, is selected by the “winner-take-all” (WTA) dynamics of the neural network making up the choice layer, such that only one node (sequence item) is active at the choice layer, and the motor program corresponding to that item is executed by downstream mechanisms outside the CQ model. At this point the chosen item’s representation is extinguished at the planning layer, a new competition is run, and the item with the next highest activation is chosen. This cycle continues until all sequence items are performed.

Biologically plausible recurrent or feedforward competitive neural nets (e.g., Grossberg, 1978b; Durstewitz & Seamans, 2002) provide the types of interactions required in a CQ model. Because of its need to store an activity pattern, the planning layer is modeled as a normalized recurrent net in which the activity of each item is lessened as more items are added, and when an item is extinguished its share of activity automatically redistributes to the remaining items. Such networks afford parametric modulation of the competition, e.g., the rate at which the choice layer selects the most highly active node from the planning layer.
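
A minimal sketch of these dynamics is given below (Python; this is our own illustrative reduction of CQ to its steady-state outcome, with arbitrary example activation values). It shows the planning-layer gradient, the iterated winner-take-all choice, extinction of the winner, and renormalization of the surviving items:

```python
import numpy as np

def cq_recall(plan, noise_sd=0.0, seed=None):
    """Recall a sequence from a CQ planning layer: the most active
    (noisy) item wins the choice-layer competition and is executed,
    its planning-layer representation is extinguished, and its share
    of activity is redistributed to the remaining items."""
    rng = np.random.default_rng(seed)
    plan = np.asarray(plan, dtype=float).copy()
    order = []
    while plan.sum() > 0.0:
        winner = int(np.argmax(plan + rng.normal(0.0, noise_sd, plan.size)))
        order.append(winner)
        plan[winner] = 0.0                  # extinguish the chosen item
        if plan.sum() > 0.0:
            plan /= plan.sum()              # normalized total activity
    return order

print(cq_recall([0.4, 0.3, 0.2, 0.1]))          # noiseless: [0, 1, 2, 3]
print(cq_recall([0.4, 0.3, 0.2, 0.1], 0.1, 1))  # noise can transpose similar items
```

With noise added, the most likely errors are transpositions of similarly activated (hence adjacent) items, consistent with the serial recall data discussed below.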

Psychophysical studies of serial planning and performance support these properties of the CQ model. CQ models explain error patterns in verbal immediate serial recall (ISR) studies including primacy and recency effects (Henson, 1996) and transposition or “fill-in” errors (e.g., Page & Norris, 1998) that occur when noise causes two items with similar activation levels to be selected in the wrong order. They also account for word length effects found in ISR (Cowan, 1994) and sequence length effects on latency and interresponse intervals in typing tasks (e.g., Sternberg et al., 1978; Rosenbaum et al., 1984; Boardman & Bullock, 1991; Rhodes et al., 2004). CQ models are unique for their ability to explain both timing and error data. Farrell & Lewandowsky (2004) recently compared four major classes of sequencing models, and showed that three of the four classes made incorrect predictions regarding the latencies of transposition errors. Only the CQ model predicted both the correct distribution of transposition errors and the latencies of those errors in production.

Complementary to such CQ-consistent behavioral patterns are recent neural recordings that provide direct evidence of CQ-like processing in the frontal cortex of monkeys. These studies have strikingly supported four key predictions of CQ models, as indicated by the neurophysiological data (top) and model simulations (bottom) in Figure 6. First, the primate electrophysiological studies of Averbeck et al. (2002; 2003) demonstrated that prior to initiating a serial act (using a cursor to draw a geometric form with a prescribed stroke sequence), there exists in prefrontal area 46 an active parallel (simultaneous) representation of each of the strokes planned as components of the forthcoming sequence. Small pools of active neurons code each stroke. Second, the relative strength of activation of a stroke representation (neural pool) predicts its order of production, with higher activation level indicating earlier production. Third, as the sequence is being produced, the initially simultaneous representations are serially deactivated in the order that the corresponding strokes are produced. Fourth, several studies (Averbeck et al., 2002; Basso & Wurtz, 1998; Cisek & Kalaska, 2002; Pellizer & Hedges, 2003) of neural planning sites also show partial activity normalization: the amount of activation that is spread among the plans grows more slowly than the number of plans in the sequence, and eventually stops growing. This property, which is critical to the CQ planning layer, explains why the capacity of working memory to encode novel serial orders is limited. A simulation of the planning layer dynamics of a CQ model (Boardman & Bullock, 1991) is shown at the bottom of Fig. 6 for comparison with recording data from Averbeck et al. (2002). The simulation traces correspond remarkably well with empirical observations made a decade later. Taken together, the physiological evidence of CQ-like processing in the brain and the behavioral results explained by the model provide a strong argument for choosing the CQ model as the basis of a sequencing mechanism for speech production.

In Section D.1 we propose the development of a model that effectively combines the CQ model with the DIVA model of speech production. CQ output will interface with the DIVA Speech Sound Map to enable serial performance of speech sound sequences.

C.4 fMRI study of brain activations during syllable sequence production. As part of an existing grant concerning the DIVA model (R01 DC02852), we have conducted an fMRI experiment to explore brain activity underlying the sequencing of speech sounds. Here we present results from this experiment. The experiment and associated modeling work serve as the starting point for much of the work proposed herein[5].

This experiment was designed to elucidate the roles of several brain regions in speech production, including the medial premotor cortex (the supplementary motor area proper (SMA), pre-supplementary motor area (pre-SMA), and cingulate motor areas), the peri- and intra-sylvian cortex (including the anterior insula, frontal operculum and inferior frontal gyrus), cerebellum, and basal ganglia. Clinical studies have suggested that these areas may be important for sequencing in speech motor control (e.g., Dronkers, 1996; Jonas, 1981, 1987; Riva, 1998; Pickett et al., 1998). Only a small portion of the functional imaging work dedicated to speech and language has dealt with overt speech production, but the largest body of relevant studies comes from Ackermann, Riecker, and colleagues (reviewed in Dogil et al., 2002). Regarding sequencing, Riecker et al. (2000b) examined brain activations evoked by overt production of speech stimuli of varying complexity: CVs, CCCVs, CVCVCV sequences, and lexical items (words). These speech test materials failed to elicit activation of SMA, cerebellum, or anterior insula. This finding contrasts with other studies (e.g., Fiez, 2001; Indefrey & Levelt, 2004) as well as our own findings suggesting involvement of these areas even in simple mono-syllable production (Guenther et al., in press). This issue provided further motivation for the following experiment.

In this study we examined the differences in brain activations during preparation for and overt production of memory-guided non-lexical sequences of three syllables. Two parameters determined the stimulus content. The first, sub-syllabic complexity, varied the number of phoneme segments that constituted each individual syllable (i.e., CV vs. CCCV). The second parameter, supra-syllabic complexity, varied the number of unique syllables comprising the three-syllable sequence (repetition of the same syllable vs. three different syllables). Each factor took one of two values (simple or complex), yielding four stimulus types. Each of these types was presented either for vocalization (GO condition) or for preparation only (NOGO condition).

Thirteen neurologically normal right-handed adult American English speakers participated. In a 3T Siemens Trio scanner, subjects were visually presented the syllable sequences. After 2.5 s the syllables were removed from the projection screen and replaced by a white fixation cross. In the GO case, following a short random duration (0.5-2.0 s), the white cross turned green, signaling the subject to begin vocalization of the most recent sequence. During this production period, the scanner remained silent. In the NOGO case, the fixation cross remained white throughout. Following a 2.5 s production period (or equivalent time in the NOGO case), the scanner was triggered (see fMRI experiment protocol in Section D for details) to acquire three full-brain functional images (TR = 2.5 s, 30 slices, 5 mm thickness). Following the third volume, the fixation cross disappeared, and the next stimulus was presented. The subjects’ vocal responses were recorded using an MRI-compatible microphone and checked offline for accuracy. Functional image volumes were realigned, coregistered to a high-resolution structural series acquired for each subject, normalized into stereotactic space, and smoothed using a Gaussian kernel with full-width at half-maximum of 12 mm. Analysis was performed using a random effects model with SPM2.

Fig. 7 shows statistically significant cortical activations during overt production (GO condition) of complex sequences of complex syllables compared to baseline (p < 0.05). Overt production activated a wide bilateral cortical and subcortical network: precentral gyrus and somatosensory cortices, the SMA and pre-SMA, auditory cortical regions, the intra-sylvian cortex and nearby frontal regions, as well as subcortical regions including the thalamus, basal ganglia, and superior cerebellum (not pictured).

Active regions relevant to the sequence model proposed in Section D include the left hemisphere SMA, pre-SMA, inferior frontal sulcus (IFS), and posterior Broca’s area (BA 44). These areas are labeled in Fig. 7. The remaining activations in the figure, along the sensorimotor cortex surrounding the central sulcus as well as superior temporal auditory cortical areas, are accounted for by the DIVA model (see Section C.1 and Guenther et al., in press in Appendix materials).

The preparation (NOGO) condition (not shown) activated much of the same cortical network to a lesser degree. The auditory cortical areas as well as motor and somatosensory face areas were much less active in the NOGO conditions, as expected. The superior cerebellum, basal ganglia/anterior thalamus, left anterior insula/frontal operculum, and SMA also showed significantly increased activity for overt production compared to preparation. These findings generally agree with studies comparing overt to covert speech production (Murphy et al., 1997; Wise et al., 1999; Sakurai et al., 2001; Riecker et al., 2000a), although there is considerable variability in experimental designs and outcomes.

Production of complex sequences of three distinct syllables (e.g. ba-da-ti) was expected to engage cortical sequencing mechanisms to a greater degree than production of syllable repetitions (e.g. ba-ba-ba). Several cortical regions responded more strongly for complex sequences in our experiment (Fig. 8, top), including bilateral pre-SMA, frontal operculum/anterior insula, left IFS, and superior parietal cortex. Furthermore, subcortical activation in right inferior cerebellum and bilateral basal ganglia was observed for complex – simple sequences of simple syllables (Fig. 8, bottom).

Increased syllable complexity (e.g. stra-stra-stra vs. ba-ba-ba) should engage brain mechanisms necessary for programming articulator movements at a sub-syllabic level. Additional activations for overt production of complex vs. simple syllable types were observed in the pre-SMA bilaterally, within the left pre-central gyrus, and in the superior paravermal cerebellum.

The results of this experiment give some important initial data points to aid in understanding how the brain represents and executes sequences of syllables. The region of activity around the left inferior frontal sulcus (IFS) represents the “highest-level” brain region that showed reliable differential activations in this experiment and is likely to serve as a working memory representation for planned utterances (see D.1 for further discussion). The pre-SMA showed great sensitivity to both supra- and sub-syllabic complexity and has suitable connectivity to serve as an interface between the prefrontal cortex and the motor executive system (Jurgens, 1984; Luppino et al, 1993). The SMA proper, in contrast, showed larger activations for overt production and is likely to provide for initiation of motor outputs. A region at the junction of the frontal operculum and anterior insula also showed speech complexity differences; this region may play a role in online monitoring during speech production (see also Ackermann & Riecker, 2004). The cerebellum showed distinct activation patterns along its superior and inferior aspects. The superior portions were more active for overt production than for preparation and more active for complex syllables than for simple ones. This region may be crucial for execution of well-learned syllables and/or for coarticulation effects. The right inferior cerebellum was significantly active for complex sequences but not simple ones. In line with cerebellar projections to prefrontal cortex (Middleton and Strick, 2000; Dum and Strick, 2003), this area may play a role in support of the working memory representation in IFS (discussed in D.2).

D. RESEARCH DESIGN AND METHODS

The proposed research consists of two closely inter-related subprojects that combine fMRI with computational neural modeling to investigate sequencing and initiation in speech production. For the sake of clarity, the basic fMRI methods and modeling framework are described first, followed by the subproject descriptions.

fMRI experimental protocol. The fMRI experiments proposed herein will each involve 17 subjects[6]. All fMRI sessions will be carried out on a 3 Tesla Siemens scanner at the Massachusetts General Hospital NMR Center. Prior to functional runs, a high-resolution structural image of the subject's brain is collected. This structural image serves as the basis for localizing task-related blood oxygenation level dependent (BOLD) activity. The fMRI experiment parameters will be based on the sequences available at the time of scanning[7]. The faculty and research staff at MGH, together with engineers from Siemens, continuously develop and test pulse sequences that optimize T1 and BOLD contrast while providing maximum spatial and temporal resolution for the installed Siemens scanners (Allegra, Sonata and Trio). Because scanner noise related to echo-planar imaging (EPI) may alter normal auditory cortical responses and/or cause subjects to adopt abnormal strategies during speech, and because articulator movements can induce artifacts in MR images, it is important to avoid image acquisition during stimulus presentation and articulation (e.g., Munhall, 2001). To this end, we will use an event-triggered paradigm in which the scanner is triggered to collect two full-brain image volumes following production of each stimulus. Because BOLD changes induced by the task persist for many seconds, this technique allows us to measure activation changes while avoiding scanner noise confounds and motion artifacts. The inter-stimulus interval will be determined for each experiment to be long enough to allow collection of two volumes starting approximately 3 seconds[8] after the speech production period is complete (total ISI of approximately 12-15 seconds). Data analysis will correct for summation of the BOLD response across trials using a general linear model (including correction for the effects of the scanner noise during the previous trial).
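
As an illustration of this trial-level correction (a generic Python sketch with placeholder timings and regressor names, not our actual SPM analysis), each trial contributes an HRF-convolved production regressor, and the acquisitions of the preceding trial contribute a nuisance regressor, so that overlapping BOLD responses are modeled within the general linear model rather than ignored:

```python
import numpy as np
from scipy.stats import gamma

def hrf(dt, duration=32.0):
    """Simplified double-gamma hemodynamic response function."""
    t = np.arange(0.0, duration, dt)
    h = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return h / h.sum()

def boxcar(onsets, width, t):
    """1 while any event is 'on', else 0 (simplified event model)."""
    return np.any([(t >= o) & (t < o + width) for o in onsets], axis=0).astype(float)

dt, run_len = 0.1, 180.0
t = np.arange(0.0, run_len, dt)
speech_on = np.arange(5.0, run_len - 20.0, 13.5)   # ~13.5 s ISI (placeholder)
noise_on = speech_on + 5.5                         # two triggered volumes per trial

X = np.column_stack([
    np.convolve(boxcar(speech_on, 2.5, t), hrf(dt))[: len(t)],  # production events
    np.convolve(boxcar(noise_on, 5.0, t), hrf(dt))[: len(t)],   # prior-trial scanner noise
    np.ones_like(t),                                            # baseline
])
# Per-voxel GLM fit: beta, *rest = np.linalg.lstsq(X, y, rcond=None)
```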

Each session will consist of approximately 4-8 functional runs of approximately 6-12 minutes each. During a run, stimuli will typically be presented in a pseudo-random order. For each experiment, the task(s) and stimulus type(s) are carefully chosen to address the aspect of speech sequencing and/or initiation being studied; these tasks and stimuli are described in the subproject descriptions in Sections D.1-D.2. We have developed software to allow us to send event triggers to the scanner and to analyze the resulting data, and we have successfully used this technique to measure brain activation during speech production as part of another grant (R01 DC02852; see Sections C.2 and C.4 and Guenther et al., in press in Appendix).

fMRI power analysis. Following the methodology described in Zarahn and Slifstein (2001), we utilized the data from our syllable sequence production experiment described in Section C.4 to obtain reference parameters from which to derive power estimations for the fMRI studies proposed in this application. Activations during overt speech production compared to baseline provided measures of within- and between-subject variability as well as a reference effect size of the SPM-derived general linear model parameters. The expected within-subject variability for our proposed studies was then computed from the reference value by using the number of conditions and stimulus presentations in the proposed studies compared to the same values for the reference study. Power estimates for the two proposed fMRI experiments in Section D.1, which involve 5 stimulus conditions, show that 17 subjects would be needed to detect effects of the reference size (with probability > .8) in a random-effects analysis (at p < 0.05).

Effective connectivity analysis. To test the model’s predictions concerning inter-regional interactions, we will fit path models (structural equation models) to the fMRI data, with paths corresponding to the anatomical connections posited by the model and path coefficients estimated from the observed inter-regional covariances. If the overall goodness-of-fit test indicates that a proposed path model cannot account for the observed data (p < 0.05), we will consider this evidence that the connectivity structure is insufficient and we will develop and test alternative models. To make inferences about changes in effective connectivity due to task manipulations or learning (see Sections D.1 and D.2 for details), we will utilize a “stacked model” approach; this consists of comparing a ‘null model’ in which path coefficients are constrained to be the same across conditions with an ‘alternative model’ in which the coefficients are unconstrained. A χ2 difference test will be used to determine if the alternative model provides a significant improvement in the overall goodness-of-fit. If so, the null model can be rejected, indicating that effective connectivity differed across the conditions of interest.
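
For orientation, the relationship between subject count and detectable effect size in a second-level random-effects analysis (a one-sample t-test across subjects) can be sketched with the noncentral t distribution. The Python sketch below is a generic illustration with an arbitrary example effect size; it is not the Zarahn and Slifstein (2001) procedure itself, which additionally derives the effect size and variance components from the reference data set.

```python
from scipy import stats

def subjects_needed(effect_size, alpha=0.05, power=0.80, n_max=200):
    """Smallest N for which a one-sample t-test (subjects as the
    random effect) detects a standardized effect of the given size
    with the requested power at two-sided significance level alpha."""
    for n in range(3, n_max + 1):
        df = n - 1
        t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)
        ncp = effect_size * n ** 0.5          # noncentrality parameter
        if 1.0 - stats.nct.cdf(t_crit, df, ncp) >= power:
            return n
    return None

print(subjects_needed(0.73))   # about 17 subjects for an illustrative effect of d ~ 0.73
```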

Computational modeling framework. The model will be implemented using the Matlab programming environment on Windows and Linux workstations equipped with large amounts of RAM (4 Gigabytes) to allow manipulation of large matrices of synaptic weights, as well as other memory-intensive computations, in the neural network simulations. Matlab allows graphical user interface generation, sophisticated matrix-based computations (ideal for simulating neural networks as proposed herein), generation of graphical output (in the form of moving speech articulators, simulated brain activation patterns, and plots of variables of interest), and generation of acoustic output (in the form of speech sounds produced by the model). Portions of the model that require intensive computations will be implemented in C and imported into Matlab as MEX executable files in order to speed up the simulations. Our department has all the relevant Matlab licenses, and our laboratory has used this environment for development of the DIVA model of speech production as part of another project.

D.1 Subproject 1: Combining the CQ and DIVA models to investigate the neural basis of speech sound sequence production. This subproject will consist of one modeling project and two neuroimaging experiments. The goal of the modeling project is to create a competitive queuing (CQ) based model whose components specifically relate to cells in brain areas underlying sequence generation in speech. This model, which we will refer to as the sequence model, will then be integrated with the DIVA model of speech production. The combined model will be referred to as the CQ+DIVA model in this application. Simulations of the CQ+DIVA model producing syllable strings in particular speaking tasks will be run and compared to the results of the proposed fMRI studies which involve those same speaking tasks. These experiments are designed to guide model development, test particular aspects of the model, and test between the model and alternative theories. This work offers three advancements over previous modeling work in this area: 1) it builds upon the CQ work of Bullock and colleagues (Boardman & Bullock, 1991; Rhodes & Bullock, 2002; Rhodes et al., 2004) which has thus far focused on sequences of externally cued manual movements, 2) the CQ network will work in concert with a well-developed neural model of speech production (the DIVA model) to allow quantitative measurement of interactions between the planning and motor execution stages of speech, and 3) by developing a model that includes both stages, a framework will exist for examining speech disorders that involve sequencing, initiation, and/or motor execution of speech sounds, such as stuttering and apraxia of speech.

In the following paragraphs we describe the theoretical framework that will be implemented (in equations and computer simulations) in the proposed modeling project and tested in the accompanying fMRI experiments. The framework synthesizes a number of theoretical and experimental contributions into a cohesive, unified account of a broad range of neurological and behavioral data. We first describe a basic computational unit, the basal ganglia (BG) loop, that will be used in the complete sequence model, before describing the rest of the model.

The cortico-basal ganglia-thalamo-cortical loop as a functional unit. The proposed sequence model is built upon the basic functional unit illustrated in Fig. 9. This unit consists of a cortico-BG-thalamo-cortical loop that begins and ends in the same portion of the cerebral cortex. Such neural loops have been widely reported in the monkey neuroanatomical literature (e.g., Alexander et al., 1986; Alexander & Crutcher, 1990; Cummings, 1993; Middleton & Strick, 2000). In our model we will include two "cells" to represent a cortical column: a superficial layer cell and a deep layer cell. This simplified breakdown of the layers in a cortical column is analogous to the breakdown utilized in the detailed model of BG function of Brown et al. (2004). The two-layer simplification allows the model to incorporate two major empirical generalizations regarding cortico-BG and cortico-cortical projections. First, the dominant cortico-striatal projection is from layers 5a and above ("superficial"), whereas the cortico-thalamic and cortico-subthalamic projections are from deeper layers (5b, 6). Second, cortico-cortical projections run either from deep layers to superficial layers or from superficial layers to deep layers; cortico-cortical projections between layers of equivalent depth appear to be excluded (e.g., Barbas & Rempel-Clower, 1997). Each superficial layer cortical cell projects to a corresponding cell in the striatum (the major input structure of the BG, which includes the caudate and putamen). The striatum then projects to cells in the internal segment of the globus pallidus (GPi) or the substantia nigra pars reticulata (SNr), the output structures of the BG, via two pathways: a direct pathway in which each cell in the striatum projects through an inhibitory connection to a corresponding output cell (Albin et al., 1989), and an indirect pathway that projects through inhibitory connections to the external segment of the globus pallidus (GPe), which in turn provides diffuse inhibition of the GPi/SNr (Parent & Hazrati, 1995; Mink, 1996). The BG output nuclei, which are typically tonically active, project through inhibitory connections to the thalamus (Penney and Young, 1981; Deniau and Chevalier, 1985), which in turn sends excitatory projections to the deep layer of the cortical column from which the loop originated.

We will typically assume that each cortical column represents a different planned motor action (e.g., a particular phoneme). Thus the cortex schematized in Fig. 9 represents four motor actions. The superficial layer in this example contains a parallel representation of the four actions, as in the planning layer of the CQ model described in Section C.3 (see Fig. 5). We follow prior interpretations (Mink & Thach, 1993; Kropotov & Etlinger, 1999; Brown et al. 2004) in hypothesizing that pathways through BG have the effect of selectively enabling output from the winner of the competition between motor actions. In the schematic of Fig. 9, a “winner take all” or “choice” dynamic enables the cortical column responsible for the largest input to the BG to receive enough thalamic activation to generate an output from its deep layer, which drives a subsequent stage of processing. In contrast, the columns representing the other items do not receive such output-enabling activation from thalamus and thus have no deep layer activity. In other instances, the BG loop may simply “gate off” or scale the activity in the deep layer of the cortical column instead of performing a “choice”.

To understand the functionality of the BG loop, it is useful to consider the net effect of the excitatory and inhibitory projections between the striatum and cortex via the direct and indirect pathways. In the direct pathway, activation of a striatal cell has the effect of inhibiting the corresponding GPi/SNr cell. This dis-inhibits the corresponding thalamic cell, which in turn excites the corresponding superficial cortical layer cell. In other words, the direct pathway is net excitatory and has the specific effect of exciting only the same column(s) as the superficial layer cell(s) that provided the initial excitatory input to the striatum. Because it involves one extra inhibitory connection, the indirect pathway has a net inhibitory effect on thalamus and cortex. Furthermore, due to the diffuse indirect path projection, this inhibition spreads across all competing columns (i.e., across all motor action representations). Thus the larger an action's advantage in superficial layer activation, the more it will excite itself and inhibit other actions in the BG-loop-mediated competition. This inherent connectivity provides the types of interactions that would be expected for a CQ-like choice network. The balance between direct and indirect pathways can change; e.g., the release of dopamine in the striatum has an excitatory effect on the direct pathway and an inhibitory effect on the indirect pathway (Gerfen & Wilson, 1996; Brown et al., 2004; Frank, 2005), thus biasing the system toward the direct pathway. Such a bias can affect competitions between actions. For example, a stronger indirect pathway will more strongly suppress all actions; such a process may be responsible for the overall lack of movement seen in Parkinson's disease, which is characterized by a lack of dopamine in the striatum and a corresponding bias toward the indirect pathway (e.g., Wichmann and DeLong, 1996), in addition to pathological neuronal bursting that leads to tremor (Bevan et al., 2002).
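A minimal Matlab sketch of this choice dynamic is given below; it deliberately collapses the direct pathway into column-specific self-excitation and the indirect pathway into diffuse inhibition, with gain values chosen only for illustration:

    % Winner-take-all choice via net direct-pathway self-excitation and
    % net indirect-pathway diffuse inhibition (a deliberate simplification).
    plan  = [0.9; 0.6; 0.4; 0.3];   % superficial-layer activities for 4 actions
    x     = plan;                   % competition state
    wSelf = 1.2;                    % net direct-pathway gain (assumed)
    wInh  = 0.8;                    % net indirect-pathway gain (assumed)
    dt    = 0.05;

    for t = 1:500
        netIn = wSelf*x - wInh*sum(x) + plan;   % self-excitation minus diffuse inhibition
        x     = x + dt*(-x + max(netIn, 0));    % leaky rectified update
    end
    [~, winner] = max(x);                       % only the winning column stays active
    fprintf('winner: action %d (activity %.2f)\n', winner, x(winner));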

This basic functional unit will be used to model several cortical regions along with associated basal ganglia and thalamic areas. The following paragraphs describe the model that will be implemented, tested, and refined in this subproject. For clarity of exposition, we will introduce the overall model in several stages.

Working memory sound sequence representation. An overview of the cortical interactions in our proposed model is presented in Fig. 10. The top part of the figure shows brain activations from the complex sequence/complex syllable vs. baseline comparison from our preliminary fMRI study (Section C.4), with the approximate locations of the proposed model’s components labeled. The bottom half of the figure schematizes the cortical interactions in the model, as described in the following paragraphs. Basal ganglia loops associated with the cortical areas are not shown for clarity.

A parallel working memory (WM) representation of speech sounds, hypothesized to exist in/along the left inferior frontal sulcus (labeled IFS in Fig. 10), constitutes the highest level of our model (in concert with pre-SMA, described later). The model's "job" is to produce a sound sequence represented in the IFS working memory in the proper order and with proper timing. Evidence for a verbal working memory representation in or near the left inferior frontal sulcus, including the dorsolateral and/or ventrolateral prefrontal cortex, has been found in a number of experimental studies (e.g., Crottaz-Herbette et al., 2004; Henson et al., 2000; Veltman et al., 2003; Wagner et al., 2001), though the exact nature and location of this representation remain unclear. In our fMRI study of syllable sequence production described in Section C.4, left IFS showed more activity for more difficult speech sequences, in keeping with its proposed role as a sound sequence working memory; in contrast, in an fMRI study of syllable production that did not include a working memory component (unlike the study in C.4), activation of this area was absent (see Guenther et al., in press in Appendix). Furthermore, Petrides & Pandya (1999) suggest this region may be the human homolog of the monkey posterior principal sulcus region, which was the site of the queued representation of drawing movement sequences identified by Averbeck et al. (2002, 2003) and used as the basis for the working memory representation in our model.

The model’s working memory representation is informed by the competitive queuing (CQ) framework described in Section C.3 and by the strikingly consistent neurophysiological findings from the Averbeck et al. (2002) study of drawing sequences. However, speech is more complex than drawing, and patterns of speech errors provide evidence against the sufficiency of a CQ model that utilizes a single parallel planning zone that mixes representations of all phonological units without regard to type. Error distribution data suggest that competition occurs between speech units of the same phonological type but not between units of different types, such as syllable onsets, nuclei, and codas. In speech errors known as Spoonerisms (MacKay, 1970), two parts of a phrase are switched, e.g. “a lack of pies” may be spoken instead of the intended “a pack of lies”. Such sound swapping errors in speech almost always involve two items of the same phonological or grammatical type. For example, syllable onset consonants often swap with other syllable onset consonants (as in the example above), entire initial syllables can swap with other initial syllables (“Birthington’s washday” instead of “Washington’s birthday”), verbs can swap with other verbs, and nouns can swap with other nouns (Nooteboom, 1969; Fromkin, 1971, 1973; Garrett, 1975, 1980). However very rarely are vowels swapped with consonants, initial syllables swapped with final syllables, or nouns swapped with verbs.

We therefore propose that the IFS working memory consists of several distinct CQ circuits, each of which mediates competition among speech units of a particular type (cf. Hartley & Houghton, 1996). That is, there are separate CQ circuits in IFS respectively dedicated to choosing syllable onsets (the initial consonants of a syllable), syllable nuclei (vowels), and syllable codas (final consonants). In this modeling project we will limit ourselves to these three phonological types for the sake of tractability. Within each of these CQ circuits, the model will include sets of cells to represent sets of speech sounds of the corresponding type (e.g., in the syllable onset circuit there is a cell for /p/, a cell for /t/, etc.). Our modeling starts with the assumption that a sequence of speech sounds becomes represented as parallel patterns of activity in these IFS CQ circuits. The brain mechanisms leading to this activation of IFS (that is, the mechanisms responsible for choosing the words/syllables to be spoken) are beyond the scope of the current model, which starts from this pattern of activation and performs the neural computations needed to transform it into a set of articulator movements that produce the sound sequence in the proper order and with the proper timing.
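The following Matlab sketch illustrates, in bare-bones form, how one such circuit (here, the syllable-onset circuit) would read out its items in order of activation strength; the sound labels and primacy gradient are illustrative:

    % Competitive-queuing readout of one IFS circuit (syllable onsets only).
    onsets   = {'p','d','t'};     % planned onset consonants, in serial order
    activity = [1.0 0.8 0.6];     % primacy gradient over the plan (illustrative)

    for k = 1:numel(onsets)
        [~, winner] = max(activity);          % competitive choice (cf. Fig. 5)
        fprintf('readout %d: /%s/\n', k, onsets{winner});
        activity(winner) = 0;                 % suppress the chosen item
    end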

Proper readout of a sound sequence requires coordination across the three CQ circuits in IFS. Each circuit only “knows” the order of sounds of a particular type (e.g., syllable onsets) in the current sequence; it does not know which sound type should come next in the utterance. We propose below that the pre-SMA sends signals to the CQ circuits in IFS indicating which sound type should occur next in the sequence. These “frame signals” cause the readout of the next sound in the CQ circuit that receives the signal. This breakdown of speech production into a subsystem that processes syllable structure without regard for the exact phonemes within a syllable (“syllable frames”, e.g. /CV/, /CVC/) and a subsystem that processes the phonemic “content” of these frames is in accord with “frame/content” or “slots-and-fillers” theories of speech production (e.g., MacNeilage, 1998; Shattuck-Hufnagel, 1979, 1983, 1987). Further, we agree with MacNeilage’s (1998) proposal that the medial premotor areas (specifically SMA and pre-SMA in our model) are primarily involved with processing frames while lateral areas (specifically IFS and BA 44 in our model) are primarily involved in processing content. The model we propose is also in the spirit of prior, more abstract, models, in which CQ appeared as a core of models that could explain grammar-respecting patterns of sequencing errors observed in language production (e.g., Dell et al., 1997; Hartley and Houghton, 1996; Ward, 1994). Our model differs in two key ways from prior efforts: it simulates in detail the lower levels of processing, including the interface between sequencing and the actual articulations needed for phone production, and its hypotheses have detailed brain-circuit interpretations that make them testable via brain activity measurements.

In the following subsections we delineate the neural circuitry believed to be involved in the “reading out” of sound sequences represented in working memory in the IFS.

Choosing the next sound from the working memory sequence representation. Once a syllable sequence representation has been activated in left IFS and the proper CQ circuit receives a frame signal from the pre-SMA (to be described in the next subsection), the triggered CQ circuit must read out the next sound from its parallel representation, as in the competitive layer of the CQ model (see Fig. 5 in Section C.3). We propose that this competition is carried out by the IFS-basal ganglia loop. In monkeys, BA 46v (thought to be equivalent to our IFS; cf. Petrides & Pandya, 1999) projects to the caudate nucleus (part of the striatum) and receives input from GPi/SNr via the thalamus (Middleton and Strick, 2002). We propose that this loop performs a competition between the different sound sequence items, as schematized in Fig. 9. The winning item projects back up to the IFS deep layers and then on to the BA 44 superficial layers (pathway labeled “next sound” in Fig. 10).

The DIVA model speech sound map cells are proposed to lie in the deep layers of BA 44. Recall that activation of these cells leads to readout of the motor program for producing the speech sound (see Section C.1 and Guenther et al., in press in Appendix materials for details). As described above, the superficial layer BA 44 cell corresponding to the chosen item from working memory becomes active due to projections from IFS. However, readout of the motor program for that sound does not occur until the corresponding deep layer cell (speech sound map cell) is activated. In the proposed sequence model, the activation of the deep layer cells depends multiplicatively on two factors: (i) the output of the BA 44 BG loop that starts from the superficial layers of BA 44 and terminates in the deep layers, and (ii) a “trigger signal” arising from the SMA. The SMA trigger signal (described further below) has a value of 1 or 0; it becomes active (value of 1) to initiate the motor production of the current syllable/phoneme, and its activity goes to 0 at completion of the motor program for the syllable. The model’s BA 44 BG loop has the effect of scaling the size of the activation represented in the superficial layer; this is hypothesized to control the speed of readout of the motor program (i.e., speaking rate). Thus the speech sound map cell corresponding to the chosen syllable will remain inactive until the SMA trigger signal arrives, at which time it will become active and will stay active during production of the current syllable, with its level of activation regulated by the BA 44 BG loop to control speaking rate. From this point the DIVA model takes over the motor execution of the syllable via a combination of learned feedforward commands to motor cortex and feedback control circuitry (see Guenther et al., in press in Appendix for details).
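The proposed activation rule for the deep layer (speech sound map) cells can be sketched as follows; the variable names and values are ours, chosen for exposition:

    % Multiplicative gating of a BA 44 speech sound map (deep layer) cell.
    superficial = 0.9;   % chosen item's superficial-layer activity (from IFS)
    bgGain      = 0.7;   % BA 44 BG-loop scaling; larger -> faster readout
    smaTrigger  = 0;     % SMA trigger signal: 0 before initiation, 1 during production

    deep = superficial * bgGain * smaTrigger;   % = 0: motor program withheld

    smaTrigger = 1;                             % SMA initiates the syllable
    deep = superficial * bgGain * smaTrigger;   % > 0: readout at a rate set by bgGain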

Next we address the generation of properly timed frame and trigger signals by the pre-SMA and SMA.

Initiating the chosen sound at the right time. Numerous studies in the past decade have suggested a separation of the medial wall premotor area previously described as the "supplementary motor area" into a posterior area termed the SMA proper (referred to here as SMA) and an anterior area termed the pre-SMA, with different functionality (e.g., Matsuzaka et al., 1992; Shima & Tanji, 2000). The SMA has been implicated in the initiation of specific actions in numerous studies (Eccles, 1982; Krainik et al., 2001; Picard & Strick, 1996; Tanji, 2001; Hoshi & Tanji, 2004). Many cells in SMA fire time-locked to the execution of a particular motor act (Matsuzaka et al., 1992; Tanji & Shima, 1994). The pre-SMA, on the other hand, appears to be involved in higher-level aspects of complex tasks such as choosing the right action in response to a particular stimulus ("cognitive set"; Matsuzaka et al., 1992), representing the numerical order of movements in a sequence (Clower & Alexander, 1998; Shima & Tanji, 2000), or updating an entire sequential motor plan (Shima et al., 1996; Kennerley et al., 2004). Whereas many SMA cells are only active immediately before and during movement execution and are only active for a specific movement type or effector system, pre-SMA cells typically begin firing well in advance of movements and often are not specific for which effector system is used to carry out an action (e.g., Shima & Tanji, 2000; Fujii et al., 2002; Hoshi & Tanji, 2004).

In keeping with these observations, we propose that the pre-SMA sends “frame signals” (see Fig. 10) to the IFS working memory that determine which CQ circuit (i.e., the onset, nucleus, or coda CQ circuit) will read out its next item to BA 44. Actual motor execution of the item, however, requires an SMA trigger signal projecting to BA 44. This is necessary in the model because pre-SMA does not have direct sensory and motor inputs (Luppino et al., 1993) that can act as “completion signals” indicating completion of the previous item in the sequence. The SMA, in contrast, receives a wide variety of sensory and motor inputs (Passingham, 1993) that can act as completion signals, and thus SMA is capable of timing the motor execution of the current item immediately upon completion of the previous item. We further propose that, in addition to sending trigger signals to BA44, the SMA also sends trigger signals to primary motor cortex. Such projections from SMA to motor cortex and F5 (the monkey analog of BA 44) have been identified in anatomical studies (Luppino et al., 1993; Jürgens, 1984). We propose that “inner speech” (with no overt articulation) involves only the trigger signals to BA 44, whereas overt speech involves SMA trigger signals to both BA 44 and primary motor cortex.

Generation of the pre-SMA frame signals and SMA trigger signals for a speech sequence is hypothesized to occur as follows. To fix ideas, consider production of a CVC syllable. A pre-SMA "frame cell" representing the syllable's frame structure (e.g., CVC) is activated by other parts of cortex (not treated in the model) at the same time that the phonemes for that syllable are loaded into the IFS working memory representation. Single cell recordings in monkeys have identified cells in pre-SMA that respond in advance of a particular movement sequence but not other sequences (e.g., Shima & Tanji, 2000). This frame cell activates another pre-SMA cell that represents the first item in the sequence (C for a CVC), and this cell in turn sends a frame signal (see Fig. 10) to the corresponding working memory CQ circuit (the "onset" circuit). Activation of the SMA trigger cell (to initiate production of the item) can occur in one of three ways in the model: via projections from pre-SMA (for internally timed sequences that have not been practiced heavily), via the SMA BG loop (for heavily practiced internally timed sequences), or via cortico-cortical connections from sensory areas processing external timing cues (in externally timed sequences). We focus here on internally timed sequences. Furthermore, for the current modeling project we will focus on the pre-SMA -> SMA route; the SMA BG loop route is explored in Section D.2. Once the SMA trigger cell is activated, it remains active until it receives a signal indicating completion of the motor execution of the current item. The SMA then projects back to the pre-SMA to signal completion of the item, extinguishing activity in the pre-SMA cell representing the completed frame component and activating the cell for the next component (V for a CVC). This process repeats (except for the activation of the pre-SMA frame cell, which happens only once per syllable) until the entire syllable has been completed. We further propose that the left and right SMA/pre-SMA control different aspects of speech production: the left hemisphere SMA/pre-SMA is more involved in signaling the frame type associated with a word and triggering the production of the particular content items in the frame, while the right hemisphere is more involved in sentence-level prosodic aspects of the utterance, in keeping with several studies and theoretical treatments (e.g., Ross & Mesulam, 1979; Heilman et al., 2004; Baum & Dwivedi, 2003; Perkins et al., 1991). We focus on left SMA/pre-SMA in the current proposal; we omit sentence-level prosodic control from the model for the sake of tractability.
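The pre-SMA/SMA handshake for a single CVC syllable can be summarized in the following Matlab sketch, an abstraction of the circuit in which the completion signal is stubbed out as a fixed per-item duration:

    % Frame-signal / trigger-signal handshake for one CVC syllable.
    % itemDuration stands in for the model's sensorimotor completion signal.
    frame        = {'C','V','C'};   % components of the pre-SMA frame cell
    circuits     = {'onset','nucleus','coda'};
    itemDuration = 0.15;            % assumed per-item execution time (s)
    t = 0;

    for k = 1:numel(frame)
        fprintf('t=%.2f  pre-SMA frame signal -> %s circuit\n', t, circuits{k});
        fprintf('t=%.2f  SMA trigger cell on\n', t);
        t = t + itemDuration;       % motor execution of the item (DIVA model)
        fprintf('t=%.2f  completion signal -> trigger off, advance frame\n', t);
    end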

The distinction between motor program storage in lateral premotor areas and timing signal generation by medial premotor areas is also similar to the FARS model of Arbib and colleagues, a model of monkey grasping circuits (Fagg & Arbib, 1998) that has also been used to investigate evolution of language (Arbib, in press). Although space limitations preclude detailed treatment, our model differs in several key respects, most notably in the use of CQ for working memory and the fact that our model will generate articulator movements and sounds that will be compared to experimental data.

Modeling project. In the modeling project, we will define mathematical equations for all model components described above and implement these equations in computer software that will allow us to simulate the sequence model to produce speech sound sequences, in concert with the DIVA model. The modeling project will be carried out in several stages. In the first stage, a generic cortex-BG loop model will be implemented in computer simulations, and we will verify its ability to perform a competitive “choice” of a parallel plan in the superficial layer (as schematized in Fig. 9). In the second stage, we will implement the circuitry for readout of an IFS working memory sound sequence plan to the speech sound map cells. We will verify the circuit’s ability to read out sequences in the proper order when provided with appropriate trigger signals from the modeler. In the third stage, we will implement the circuitry for generating the pre-SMA and SMA trigger signals, and in the fourth stage we will combine the various sequence model components into a single computer simulation and verify its ability to activate BA 44 speech sound map cells in the proper order. The timing of signals between these regions will reflect realistic transmission delays, as is the case in the DIVA model (see section C.1 and Guenther et al., in press in Appendix documents). Finally we will integrate this model with the DIVA model (thus forming the CQ+DIVA model) to allow generation of the articulator movements and acoustic signal for the sound sequence. We will verify the functionality of the overall system by performing simulations of the CQ+DIVA model producing a number of sound sequences, allowing us to qualitatively assess model performance by listening to its productions.

After the model has passed these qualitative tests, we will evaluate it more quantitatively by adapting protocols for model-data comparisons that consider both timing and error patterns (MacKay, 1970; Sternberg et al., 1978; Boardman & Bullock, 1991; Verwey, 1996; Page & Norris, 1998; Conway & Christiansen, 2001; Rhodes & Bullock, 2002; Klapp, 2003; Rhodes et al., 2004; Farrell & Lewandowsky, 2004; Agam et al., 2005). Regarding timing patterns, past studies of speeded immediate serial recall (ISR) of prepared verbal sequences have shown several systematic timing phenomena, including: (1) a sequence length effect on the latency to initiate a sequence; (2) a ratio much greater than one between the initiation latency and the continuation latencies (inter-item intervals); (3) an inverse sequence length effect on mean production rate; and (4) non-monotonic relationships between inter-item intervals and serial position within the sequence. Moreover, (5) high levels of practice with specific sequences cause the latency to initiate the practiced sequence to become independent of the length of that sequence. Thus learning can eliminate effect (1) while most of the other effects remain. All these patterns, which also hold for non-verbal key-press sequencing, were successfully modeled in the N-STREAMS model for key-pressing, an extended CQ model developed by Rhodes & Bullock (2002) and assessed vis-a-vis further data in Rhodes et al. (2004). In this modeling project we will verify that the CQ+DIVA model also has these properties, modifying the model if necessary to account for the data. Furthermore, adding noise at the CQ level of our deterministic model will create a stochastic model that is capable of making sequencing errors, notably transposition errors in which two elements of a planned sequence mistakenly swap positions in the output. Plotting data from a large number of stochastic simulations will allow us to verify that the model's phoneme transposition errors obey both the frame and adjacency constraints suggested by many prior reports (MacKay, 1970; Fromkin, 1973; Garrett, 1980; Dell et al., 1997). In particular, most transposition errors should be swaps between items of the same type (e.g., onset or coda), and they should be separated by only one serial position within their respective CQ circuit (i.e., contents of adjacent onsets in a multi-syllable sequence plan should swap much more often than contents of onsets separated by more than one serial position in the plan). Also, CQ-type models, unlike all other types so far proposed to explain transposition distributions, can explain the monotonic decline of the latency of a transposition error as a function of transposition distance (Farrell & Lewandowsky, 2004; cf. also Dell et al., 1997). The stochastic model's ability to explain this key interaction effect will also be assessed.
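The planned stochastic protocol can be sketched as follows: run a noisy CQ readout many times and tabulate transposition errors by serial-position distance. Under a primacy gradient, adjacent swaps should dominate. The noise level and gradient below are arbitrary illustrative values:

    % Noisy CQ readout: tabulate transposition errors by serial-position distance.
    plan   = [1.0 0.9 0.8 0.7 0.6];   % primacy gradient over 5 planned onsets
    sigma  = 0.08;                    % activation noise (arbitrary)
    nTrial = 10000;
    dist   = zeros(1, numel(plan)-1); % error counts by transposition distance

    for trial = 1:nTrial
        a = plan + sigma*randn(size(plan));
        [~, order] = sort(a, 'descend');       % noisy competitive readout
        swap = find(order ~= 1:numel(plan));   % positions produced out of order
        if numel(swap) == 2                    % a single pairwise transposition
            d = swap(2) - swap(1);
            dist(d) = dist(d) + 1;
        end
    end
    disp(dist)   % counts should fall off sharply with distance (adjacency constraint)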

As in our work with the DIVA model (see Section C.1 and Guenther et al., in press in Appendix), we will identify specific locations of the model’s cells in the stereotactic coordinate frame used in the SPM2 analysis software (based on the anatomy of the SPM “standard brain”) and extract simulated fMRI data from the model simulations (see Section C.2). These data will be compared to activations measured in our proposed fMRI studies (described further below).

Although the model described above is in keeping with a wide range of data from behavioral and neuroimaging studies, we consider it to be a preliminary model that will be refined in this modeling and experimental work. For example, we have omitted in this description the cingulate motor areas, which may also play a role in sequencing and initiation of motor actions (e.g. Paus et al., 1993; Procyk et al., 2000). We do this in part because the connectivity is not as well established (cf. Geyer et al., 2000), and also to keep the model as simple as possible. Nonetheless we will analyze the cingulate motor areas in our fMRI studies and look for evidence of their involvement in our tasks. If found, this evidence will be used to guide refinement of the model to incorporate these areas. Also, the anterior insula is not treated in the sequence model because it appears to be primarily involved in overt rather than covert articulation (Ackermann & Riecker, 2004) and therefore is considered part of the DIVA model, which is addressed by another grant. Nonetheless we will analyze insula activations in our fMRI studies and incorporate any relevant findings into the CQ+DIVA model.

fMRI Experiment 1: The representation of syllable frames in prefrontal cortex. According to our proposed model (and in accord with MacNeilage’s frame/content theory), the medial premotor areas, in particular the pre-SMA, are involved in syllable frame generation during speech, whereas the IFS is a working memory that stores content elements (i.e., specific phonemes to fill in the syllabic frames). Furthermore, projections from pre-SMA to IFS trigger the readout of the appropriate content element to the BA 44 speech sound map, and the production of this element takes place when the SMA sends a trigger signal to BA 44. Overt speech (as opposed to inner speech) additionally involves SMA trigger signals to primary motor cortex. In this experiment we propose to test these hypotheses.

Two different stimulus types will be used in this experiment: (1) three-syllable nonsense words with simple frames, and (2) three-syllable nonsense words with complex frames. The simple and complex frames will be matched for "content"; i.e., they will use the same phonemes. Each of these stimulus types will be used in one of two speech tasks performed in different runs: inner speech (i.e., saying the utterance "in your head", without articulating) and overt speech (saying the utterance out loud), for a total of four experimental conditions. Additionally, a control condition of resting quietly while viewing "XXXX" on the screen will be used as a baseline for determining activations in the production tasks. For simple frames, we will use CV-CV-CV and V-CV-CVC nonsense pseudowords (like "padita" and "adapit"), and for complex frames we will use V-CVC-CV (e.g., "akupni") and CV-V-CVC (e.g., "puabab"). These frames were chosen based in part on their relative frequency in English words. The publicly available CELEX-2 Lexical Database (Centre for Lexical Information, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands, 1995) contains detailed information about the phonology, morphology, syntax, and frequency of English words and syllables drawn from text passages. According to the CELEX-2 database, the frequency of occurrence of each frame type per 1 million spoken words is 6444 for CV-CV-CV, 2079 for V-CV-CVC, 177 for V-CVC-CV, and 29 for CV-V-CVC; thus the simple frames occur on average over 40 times more frequently than the complex frames ((6444 + 2079) / (177 + 29) ≈ 41). Subjects will be instructed before the experiment to stress the first syllable of each utterance to avoid variations in stress patterns across subjects. During the fMRI session, the experimenter will monitor the subject's productions, and if frequent errors in production are detected, the experimenter will remind the subject between runs of the correct pronunciations. The event-triggered design of our protocol allows us to remove trials in which subjects have committed production errors. After completion of the fMRI session, recordings of the subject's productions will be reviewed and erroneous productions removed from subsequent analyses. According to our power analysis described above, average error rates as high as 30% will still allow detection of significant activation changes in the fMRI analysis with 17 subjects. We expect the actual error rates to be much lower than this, given our rate of 14.4% for the most difficult syllable sequences in our prior experiment (Section C.4).

Hypothesis Test 1. It is very likely that complex frames require more neural processing than simpler frames in brain areas responsible for frame generation; fMRI and PET studies widely report more activation during more difficult tasks (e.g., Paus et al., 1998; Gould et al., 2003). If the pre-SMA is the primary site of frame processing as proposed in the model, then it should be more active in the complex frames task than in the simple frames task. This should be true for both the overt and covert speech cases. To test this hypothesis, we will perform a contrast (using the SPM software package) between the complex and simple frame cases for both the overt speech and inner speech conditions (denoted as complex – simple frame hereafter). This contrast will identify any statistically significant activity differences (random effects analysis, statistics controlled at a 0.05 false discovery rate (FDR)) between the two cases. If there is significantly greater activity for complex – simple frames in pre-SMA in both the overt and inner speech cases, this will be taken as support for the model's hypothesis that pre-SMA is involved in frame generation. Such a finding would also be consistent with MacNeilage's (1998) proposal that medial frontal cortex is involved in frame processing. If there is no significant difference in one or both cases (overt or inner speech), we will examine alternative accounts of frame generation and of the pre-SMA role in speech. Significant activity differences in other premotor areas, for example, will be taken as evidence that these areas may contribute to frame generation. If such a result is found, the model will be modified accordingly. Hypothesis Test 2. In contrast to pre-SMA, IFS should have the same amount of activity in the two tasks (complex vs. simple frames) according to our model since they involve the same content elements (phonemes). This hypothesis will be tested using the complex – simple frame contrast in both overt and inner speech conditions. If significant activation is found in IFS for complex compared to simple frames, the model's hypothesis will be rejected and the alternative hypothesis that IFS is involved in frame generation, not just content storage, will be supported. If no significant difference is found in the IFS region, this would support our model. This finding would also be consistent with the results shown in Figure 6 (greater activation for complex vs. simple sequences) and findings from previous imaging studies which show activation in IFS during verbal short term memory tasks (e.g., Wagner et al., 2001; Crottaz-Herbette et al., 2003; Veltman et al., 2003). Hypothesis Test 3. The model predicts that inner speech should involve less activation in SMA than overt speech since in overt speech cells within SMA send trigger signals to both BA 44 and primary motor cortex (involving two different subsets of SMA cells), while in inner speech only the cells that trigger BA 44 are recruited. To test this hypothesis, we will perform a contrast of overt speech – inner speech for both the complex and simple frame cases. If these contrasts show significantly greater activity in SMA for overt speech, the model's hypothesis will be supported; this would represent a generalization to human speech of the finding of SMA cell activity time-locked to movement onsets in monkeys (Matsuzaka et al., 1992; Tanji & Shima, 1994).
If no significant activity is found in SMA for the overt – inner speech contrast, this would support the alternative hypothesis that the SMA is equally involved in triggering speech regardless of whether or not the speech is overtly articulated (e.g., see Riecker et al., 2000a; Shuster & Lemieux, 2005). Finally, if significantly more activity is found in SMA for inner speech than overt speech, the alternative hypothesis that SMA actively inhibits motor cortex during inner speech will be supported (cf. Goldberg, 1985). Hypothesis Test 4. In contrast to SMA, the model predicts that pre-SMA activation should be the same for the overt and inner speech tasks since they both involve the same frames. This hypothesis will be supported if no significant activity is detected in pre-SMA in the overt – inner speech contrast. Significant activity in pre-SMA for this contrast will be taken as evidence for the alternative hypothesis that pre-SMA is, like SMA, more involved when speech is overtly articulated than when it is simply spoken internally. This would be evidence for a more "motoric" role of pre-SMA than previously thought (e.g., Shima & Tanji, 2000; Fujii et al., 2002; Hoshi & Tanji, 2004). Although fMRI studies of both inner speech and overt speech have been performed (e.g., Murphy et al., 1997; Riecker et al., 2000a; Shuster & Lemieux, 2005), none of these studies differentiated pre-SMA activation from SMA activation, likely because SMA and pre-SMA are abutting and thus difficult to differentiate using standard fMRI and PET analysis techniques. We have developed region-of-interest (ROI) based fMRI analysis tools that allow us to test hypotheses concerning particular anatomically defined regions of interest with greater statistical power and anatomical accuracy than standard fMRI analysis methods (see Nieto-Castanon et al., 2003, in Appendix documents for details). We will use these tools to differentiate SMA and pre-SMA when testing Hypotheses 3 and 4.

The components of our sequencing model are all hypothesized to lie in the left hemisphere, based on the fact that these regions were left-lateralized in our preliminary sequencing study (Section C.4). In addition to the hypothesis tests described above, we will also test for the predicted laterality for each of the regions in the proposed model (paired t-test within subjects across hemispheres for the contrasts of interest in Hypotheses 1 to 4), in addition to the contrasts comparing each stimulus type vs. the control condition. Though not repeated hereafter for brevity, laterality tests will be performed for all voxel-based hypotheses in all experiments.

In addition to the voxel-based tests described above, we plan to perform structural equation model (SEM) analyses to test hypotheses concerning the signaling between brain regions (see Effective connectivity analysis above for methodological details). It should be noted, however, that the validity of effective connectivity methods is not as well established as the validity of voxel-based methods. Thus caution must be exercised in interpreting SEM results as falsifying or strongly supporting a particular hypothesis. Nonetheless SEM can be expected to provide valuable supplemental information to the voxel-based hypothesis tests described above. Hypothesis Test 5. During both overt and inner speech the proposed model predicts signals from pre-SMA to IFS, from IFS to BA 44, from SMA to BA 44, and bidirectional signaling between pre-SMA and SMA. These signals are not expected during the baseline task (viewing 'XXXX'). We therefore expect greater effective connectivity between these regions during each speech condition than during the control condition. To test this hypothesis we will compare effective connectivity path strengths determined using SEM between each speech condition and the baseline condition, looking specifically at the pathways mentioned above. Greater path strength for a given pathway during each speech condition will support the model's hypothesis that signaling between the corresponding regions increases during speech. Hypothesis Test 6. The model predicts no change in effective connectivity for the paths outlined in Hypothesis Test 5 when comparing the overt and inner speech conditions. This hypothesis will be supported if SEM indicates no difference in path strength between these regions in the overt and covert conditions. If a difference is found, it will be taken as evidence for the alternative hypothesis that signaling between these areas is modulated by whether or not speech is overtly articulated. Hypothesis Test 7. The model predicts no modulation of effective connectivity from IFS to BA 44 and from SMA to BA 44 due to frame complexity. This hypothesis will be supported if SEM indicates no difference in path strength between these regions in the simple and complex frame conditions. If a difference is found, it will be taken as evidence for the alternative hypothesis that signaling between these areas is modulated by frame complexity, and the model will be modified accordingly. Specifically, a difference in the IFS to BA 44 coupling would indicate that IFS is involved in frame generation, not just content storage. Likewise, a difference in the coupling between SMA and BA 44 would indicate that SMA plays a role in frame generation and not just timing signals. Hypothesis Test 8. As outlined in Hypothesis 3 above, the model posits trigger signals from SMA to the speech articulator portions of motor cortex during overt speech but not during inner speech. This hypothesis will be supported if SEM analysis indicates a stronger positive path from SMA to ventral motor cortex during the overt than the inner speech condition. If on the other hand a negative path strength is found between SMA and motor cortex for the inner speech condition, this will support the alternative hypothesis that SMA inhibits motor cortex to prevent movement during inner speech.

Comparing model cell activations to fMRI experimental results. We will perform computer simulations of the proposed CQ+DIVA model performing the same speech tasks as the speakers in the fMRI experiment. Specifically, we will first train the model to produce the utterances from the stimulus set (see Guenther et al., in press in Appendix materials for details regarding the learning of new speech sounds in the DIVA model). The model will then produce each utterance in the stimulus list in both overt and inner speech modes. In addition to verifying that the model is capable of producing the syllables for each utterance in the proper order (thus testing the functionality of the CQ components of the model as well as the frame generation mechanism), we will also generate simulated fMRI data as the model produces the utterances from the experiment. The same contrasts performed for the fMRI experiment will be performed on the simulated data to allow direct comparisons between simulated fMRI activity and the experimental results (see Section C.2 for details). As in our earlier work with the DIVA model (see Guenther et al., in press in the Appendix materials for details), each cell in the proposed model will be associated with a precise anatomical location specified in Montreal Neurological Institute (MNI) normalized spatial coordinates, the same coordinate frame used to analyze fMRI data in the SPM software package. This allows us to project the model’s simulated fMRI activations onto the same brain surface used to plot the results of our fMRI experiments, and to directly compare the locations and magnitudes of the peak activations in the model to those found in the fMRI experiments. If the model’s cell activities differ significantly from the experimental results (e.g., if the locations of the peak activations of the model differ substantially from the peak activity locations in the experimental data for a particular contrast), the model will be modified to be in accord with the experimental findings by relocating model cells and/or changing the hypothesized functionality of a particular brain region in the model. There is currently no standard for quantitatively comparing the locations of simulated BOLD activations produced by a model such as the one described here to fMRI experimental results. To test a proposed location of a cell type in the model that is active in a particular simulated contrast, we will simply determine whether the stereotactic coordinate of the model cell falls within a cluster of activation in the corresponding fMRI contrast.
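Operationally, this test reduces to checking whether the model cell's coordinate lands in a suprathreshold voxel of the thresholded contrast image. A minimal sketch using SPM's spm_vol/spm_read_vols utilities follows; the image filename and coordinate are hypothetical:

    % Does a model cell's MNI coordinate fall within an activation cluster?
    % Requires SPM on the Matlab path; the filename and coordinate are hypothetical.
    V   = spm_vol('spmT_complex_vs_simple_thresholded.img');  % thresholded contrast image
    Y   = spm_read_vols(V);                                   % voxel data
    mni = [-44; 18; 24];                                      % model cell location (e.g., left IFS)

    vox = round(V.mat \ [mni; 1]);                 % MNI (mm) -> voxel indices via the affine
    inCluster = Y(vox(1), vox(2), vox(3)) > 0;     % nonzero voxels survive thresholding

    if inCluster
        disp('Model cell lies within a suprathreshold cluster for this contrast.');
    end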

fMRI Experiment 2: IFS as a sound sequence working memory or semantic processing center. Our model proposes that the left IFS is the site of a sound sequence working memory needed during production of speech sound sequences. Activity in this same region was identified in the word generation study of Crosson et al. (2001), who surmised from their results that the area was involved in semantic processing. This contrasts with our model as well as the results of our preliminary fMRI study described in Section C.4, wherein IFS activation was found for semantically meaningless syllable sequences. To resolve this issue and further inform our model, we propose to directly test between these two hypotheses concerning function of the IFS in this experiment, and test additional hypotheses generated from our model.

Subjects will perform three tasks in the fMRI scanner. In the semantic generation task, subjects will be briefly presented with a word (text on a video screen) that describes a category of items (e.g., "birds" or "furniture") and will be asked to think of two items (few items condition) or four items (many items condition) that are members of that category. The word will disappear, and after a few seconds a "go" signal will be presented on the screen, at which time the subject will say the items that he/she thought of. This is similar in several respects (though not identical) to the paced word generation condition of Crosson et al. (2001). In the nonsense utterance task, subjects will be briefly presented with two short nonsense words (few items condition) or four short nonsense words (many items condition) on the screen, then the words will disappear. A few seconds later a go signal will be presented and the subject will say the nonsense words. Finally, the baseline control task will consist of the brief presentation of "XXXX" on the video screen instead of a category word or nonsense words, and a few seconds later the go signal will be presented. However, subjects will be instructed before the experiment to rest quietly and look at the screen when they see the X's, rather than producing speech. This task will be used as a baseline for measuring brain activations in the other tasks. Behavioral pilot studies will be run in advance of the scanning sessions (involving approximately 10-15 subjects) to determine an appropriate delay period between word presentation and the go signal, i.e., one that is long enough to allow a subject to generate the category items in the semantic generation task, but not so long that subjects cannot remember the nonsense words in the nonsense utterance task. This delay is designed to increase the working memory demand of the tasks in order to highlight the differences in the predictions of the two hypotheses. We will also measure the word lengths and phonological makeup of the words that subjects most frequently generate in the semantic task of the pilot study and use these to determine the length and phonological makeup of the nonsense words to use in the fMRI experiment's nonsense utterance task.

Hypothesis Test 1. According to the Crosson et al. (2001) hypothesis of IFS function, the IFS should be significantly more active in the semantic generation tasks than in the nonsense utterance tasks. This contrasts with our model’s prediction that both tasks should lead to the same amount of activation in IFS since they involve approximately the same number of speech sounds buffered in working memory. We will test between these alternatives in Hypothesis Tests 1 and 2. The first test involves assessing a semantic generation – nonsense utterance contrast with SPM. Significantly more activity (random effects analysis, FDR = 0.05) in IFS for the semantic generation task will be taken as support for the Crosson et al. (2001) hypothesis. If no significant activation difference is found, this will support our model’s explanation for the role of IFS as a speech sound buffer, which is specifically tested in Hypothesis Test 2. Hypothesis Test 2. Our model predicts that the many items conditions of both tasks should lead to higher activation in IFS than the few items conditions since more items need to be stored in the sound sequence working memory. This should be true in both the nonsense utterance and semantic generation tasks. In contrast, the Crosson et al. hypothesis predicts no difference in the amount of IFS activation for many items vs. few items in the nonsense utterance task since neither condition involves a semantic component. We will test these predictions by performing a many – few items contrast for both the nonsense and semantic generation tasks. The model’s hypothesis will be supported, and the Crosson hypothesis rejected, if significantly more activation is found in left IFS for this contrast in both the nonsense and semantic generation tasks. If no significant activation is found for this contrast in either the semantic generation or nonsense utterance task, we will take this as evidence against the model’s hypothesis that left IFS represents a speech sound working memory that does not differentiate between meaningful and nonsense utterances. Collectively, hypothesis tests 1 and 2 will allow us to choose between the competing proposals of our model and Crosson’s model. Hypothesis Test 3. Our model predicts greater signaling from IFS to BA 44 during the many items condition of both tasks since a greater number of speech sound map cells will be activated (see also Shuster & Lemieux, 2005). This hypothesis will be supported if SEM analysis indicates a stronger IFS-to-BA 44 path in the many items condition than the few items condition.

Comparing model activations to fMRI experimental results. The CQ+DIVA model will be trained to produce the utterances used in the preceding fMRI experiment, and simulations of the model producing the stimuli from the experiment will be performed. In simulations of the semantic generation task, utterances from the experimental subjects will be used for the model simulation stimulus set. Note that the CQ+DIVA model does not differentiate between semantically meaningful and nonsense utterances; thus we expect the model simulations to show the same activations in the two tasks (semantic generation and nonsense utterance). In addition to verifying that the model is capable of producing the syllables for each utterance in the proper order (thus testing the functionality of the CQ components of the model as well as the frame generation mechanism), we will also generate simulated fMRI data as the model produces the utterances from the experiment, and the same contrasts performed in the fMRI experiment will be performed on the resulting data to allow direct comparisons between simulated fMRI activity and the experimental results. We will compare the locations and magnitudes of the peak activations in the model to those found in the fMRI experiments. If the model’s cell activities differ significantly from the experimental results (as described above for Experiment 1), the model will be modified to be in accord with the experimental findings by relocating model cells and/or changing the hypothesized functionality of a particular brain region in the model.

D.2 Subproject 2: Investigating the learning of new speech sound sequences. This subproject will consist of two modeling projects and two corresponding fMRI/behavioral experiments that investigate the learning of new speech sequences. Sakai et al. (2003) define motor sequence learning as "a process whereby a series of elementary movements is re-coded into an efficient representation for the entire sequence" (p. 229). In terms of speech learning, this implies that familiar phoneme or syllable motor programs might be combined into larger "chunks" that can be efficiently manipulated and read out for rapid speech. The proposed modeling projects will investigate two possible loci for learning to produce familiar speech sound sequences: the basal ganglia and the cerebellum. These areas are well studied and quite distinct anatomically. Our computational studies are designed to help determine how these areas individually contribute to sequence learning, and will generate predictions that can be verified or falsified by assessing the results of the proposed fMRI studies.

Both the basal ganglia and cerebellum have been shown to be important for sequence learning in various studies (e.g. Miyachi et al., 1997; Lu et al., 1998; Nixon and Passingham, 2000; Doyon et al., 2002; Shin and Ivry, 2003). The storage of detailed sequence-specific information in these subcortical structures may reduce demands on higher-level cognitive processes mediated by prefrontal cortical areas during sequence performance. While this type of learning-driven shift in processing regions has inspired theories of sequential motor control for limb, eye, and finger movement sequences (e.g. Hikosaka et al., 1999; 2000; Nakahara et al., 2001; Rhodes and Bullock, 2002; Rhodes et al., 2004), the nature and limits of such off-loading remain controversial and have yet to be determined, particularly for well-learned sequences in speech. In the modeling studies, we hypothesize ways in which the basal ganglia and cerebellum could be utilized in learning without sacrificing flexibility in performance. We will use behavioral studies to verify that subjects indeed improve performance in our sequence learning tasks prior to the corresponding fMRI investigations of supra-syllabic (Exp. 1) and sub-syllabic (Exp. 2) sequence learning. The results of these experiments will, in turn, be used to further constrain our sequence learning model.

Modeling Project 1: Basal ganglia contributions in sequence learning. A key finding from studies that examined visuo-motor sequence learning through early, intermediate, and late phases (Hikosaka et al., 2000) is a progressive decrease in activity in prefrontal/pre-SMA/anterior striatum (caudate) and a progressive increase in activity in SMA/posterior striatum (putamen). This generalization is supported by single-cell studies in non-human primates (Nakamura et al., 1998, 1999; Miyachi et al., 1997; Miyachi et al., 2002) and by brain imaging experiments in humans (Sakai et al., 1998). Nakahara et al. (2001) proposed a non-CQ computational model of processes that might underlie these activity shifts; in contrast, we propose a CQ-consistent model to address these and further data on the contributions of BG circuits and their associated frontal cortical areas.

In our proposed model, there are two different sources of timing signals that initiate SMA trigger cell activation and thus trigger production of the next item during self-timed speech: the SMA BG loop or cortico-cortical input from the pre-SMA. According to our model, when a child first begins producing speech sequences, his/her brain relies heavily on prefrontal cortical areas and the pre-SMA to trigger the onset of each sound in the sequence. During this time, the BG are effectively "monitoring" the time course of SMA activity generated by cortex, and after some practice with a particular sequence, the SMA BG loop becomes capable of activating and deactivating the SMA trigger cells internally (i.e., without requiring pre-SMA timing input to SMA). This is possible because the SMA BG loop receives a wide range of sensory, motor, and prefrontal input (including input from IFS, which codes the planned sequence, as well as motor and sensory cortical information concerning ongoing speech movements) that it can use as context for determining when to activate a particular SMA cell. Thus, the SMA BG loop is hypothesized to be involved in activating SMA trigger cells for "automatic tasks" that have been practiced before, including commonly occurring syllable strings. The generation of trigger signals by the BG when producing a familiar speech sound sequence can be envisioned as follows. High-level cortical areas such as BA 45 generate the sequence and activate the appropriate IFS working memory cells and pre-SMA frame cells. A pre-SMA to SMA signal activates an SMA trigger cell for the first item in the sequence, and this leads to motor execution of that item. Based on earlier learning, the BG are capable of recognizing the motor commands and sensory feedback associated with completion of the item. When the BG recognize completion, they send a signal to SMA that terminates trigger cell activation for the current sound and initiates activation of the trigger cell for the next sound. In this way, the BG are responsible for the timing of heavily practiced sound sequences. In keeping with this view, several studies have indicated a role for BG in timing of movements within a sequence (Rao et al., 1997; Harrington and Haaland, 1998).

In this framework, prior to any practice, production of a novel speech sequence (and, more generally, a novel movement sequence of any sort) relies heavily on prefrontal and pre-SMA circuits for timing. These areas are part of BG loops that primarily involve anterior portions of the striatum, particularly in the caudate (Parent and Hazrati, 1995). With practice, as described above, the SMA BG loop becomes more and more involved in production of the sequence. This loop involves more posterior portions of the striatum, particularly the putamen. This view is thus in accord with the main results reported by Hikosaka and colleagues. Below we propose experiments that extend these studies to the learning of new speech sequences.

In the first modeling study we will define the “learning law” equations that modulate the strengths of synapses in the SMA BG loop, and we will implement these equations in the CQ+DIVA model software. We will also modify the SMA BG loop as described below to allow it to identify completion of the current sound. We will perform simulations of the extended model performing speech sequence learning tasks to verify its ability to acquire new sequences in a manner consistent with human behavioral data. These tasks will include the learning of sub-syllabic and supra-syllabic sequences, as in the experiments described below. The model will be fit to data on the rate of learning of new sequences, reaction times, and acoustic durations over the course of learning; this fitting will involve adjustment of the learning rate parameters in the learning law equations. Synthetic fMRI data will be generated from the model simulations and compared to the results of the imaging experiments (described further below).
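
As one concrete possibility (a sketch only; the final form of the equations will be established during the modeling study itself), a learning law of the gated Hebbian type used in related models could take the form

    Δw_ij = ε · y_j · (x_i − w_ij),

where w_ij is the adaptive weight from contextual input i onto striatal cell j, x_i is the activity of that contextual input, y_j is a postsynaptic gating signal that is nonzero only when the corresponding SMA trigger cell changes state, and ε is the learning rate parameter that would be adjusted when fitting the model to the behavioral data.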

In order to recognize completion of an ongoing sound, the BG must receive inputs from the relevant motor, somatosensory, and (possibly) auditory areas. The DIVA model currently includes motor, somatosensory, and auditory representations that are appropriate for this purpose. We will implement projections from these areas to the striatum in the CQ+DIVA model, specifically to the portions of the putamen that are part of the SMA BG loop. This is in keeping with the notion of a sensorimotor loop (Parent and Hazrati, 1995). Furthermore, projections from the model’s working memory representation (left IFS) to the portion of the striatum corresponding to the SMA BG loop will be added; the pattern of IFS inputs to the BG identifies the sequence being produced, whereas the motor and sensory inputs identify the state of motor execution of the current sound. Together they provide all the information needed for identifying completion of the current sound in a sequence and for triggering the next sound. Recall that each striatal cell in the model’s SMA BG loop corresponds to a different cortical column in SMA (Section D.1). Each time the SMA trigger cell from that column changes state (i.e., goes from inactive to active or from active to inactive), the BG will effectively take a “snapshot” of the current contextual inputs and will associate it with the change in state of the SMA trigger cell. This type of associative memory operation is a common property of many neural network architectures. Over time, this learning process will allow the BG to generate an appropriate timing signal based on the contextual inputs alone, without requiring cortico-cortical processes to activate/inactivate the SMA trigger cell via pre-SMA.
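
A minimal sketch of this snapshot operation is given below, assuming simple cosine matching between context vectors; the class and all names are hypothetical stand-ins for the model’s striatal circuitry rather than a reproduction of it:

    import numpy as np

    class StriatalSnapshotMemory:
        """Toy associative memory pairing contexts with trigger-cell state changes."""

        def __init__(self):
            self.snapshots = []  # (context_vector, new_trigger_state) pairs

        def observe(self, context, old_state, new_state):
            # A "snapshot" is stored only when the SMA trigger cell changes state.
            if old_state != new_state:
                self.snapshots.append((np.asarray(context, float), new_state))

        def recall(self, context, threshold=0.9):
            # If the current context matches a stored snapshot closely enough,
            # return the associated state change; otherwise return None
            # (in that case cortex must drive the trigger cell via pre-SMA).
            context = np.asarray(context, float)
            best_sim, best_state = 0.0, None
            for stored, state in self.snapshots:
                sim = float(np.dot(context, stored) /
                            (np.linalg.norm(context) * np.linalg.norm(stored) + 1e-12))
                if sim > best_sim:
                    best_sim, best_state = sim, state
            return best_state if best_sim >= threshold else None

    bg = StriatalSnapshotMemory()
    bg.observe([1.0, 0.0, 0.5], old_state=0, new_state=1)  # learn at a transition
    print(bg.recall([1.0, 0.0, 0.5]))  # -> 1 (BG-driven trigger)
    print(bg.recall([0.0, 1.0, 0.0]))  # -> None (unfamiliar context)

Once enough snapshots have been stored, recall of the appropriate state changes from context alone is exactly the condition under which the BG route can replace the pre-SMA route.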

Modeling Project 2: Sequence learning in the cerebellum. In this modeling project, we will examine how cerebellar learning can contribute to sequencing of speech sounds. As in Modeling Project 1, we will implement biologically realistic learning equations that modify synaptic weights, in this case within the model cerebellum and cortico-cerebellar loops (e.g. Allen & Tsukahara, 1974; Middleton & Strick, 2000), and we will again compare data simulated from the model to the reaction time, duration, and fMRI results of the learning experiments described below. Like the basal ganglia, the cerebellum receives widespread inputs from motor, somatosensory, and prefrontal cortical areas (e.g. Schmahmann & Pandya, 1997), and its output nuclei project to many cortical areas including the SMA and pre-SMA (Wiesendanger & Wiesendanger, 1985; Matelli et al., 1995) and BA 46/9 (Dum & Strick, 2003). Furthermore, portions of both the basal ganglia and the cerebellum were more active for complex syllable sequences than for simple syllable sequences in our preliminary fMRI study (Section C.4). Although cerebellar and basal ganglia input systems are not completely redundant, the two receive much of the same contextual information from the cortex. However, unlike the striatum (Zheng & Wilson, 2002), the cerebellar cortex has connectivity sufficient to perform a sparse expansive recoding of its input. By implementing both of these adaptive components in neurobiologically realistic computer simulations of the CQ+DIVA model, and by comparing simulations to the results of our proposed fMRI and behavioral studies, we hope to clarify the precise nature of cerebellar and basal ganglia involvement in the learning of speech sound sequences.
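
One standard way to realize such a recoding, in the spirit of classical models of cerebellar granule cells, is to project the contextual input through fixed random weights into a much larger layer and let only the most strongly activated units fire. The toy fragment below illustrates the idea; the dimensions and the top-k sparsification rule are arbitrary choices for this sketch, not parameters of the proposed model:

    import numpy as np

    rng = np.random.default_rng(0)
    N_INPUT, N_GRANULE, K_ACTIVE = 20, 1000, 20  # ~50:1 expansion ratio

    W = rng.normal(size=(N_GRANULE, N_INPUT))    # fixed random expansion weights

    def expand_and_sparsify(context):
        """Return a sparse binary code: only the k most activated units fire."""
        activation = W @ context
        winners = np.argsort(activation)[-K_ACTIVE:]  # top-k (cf. Golgi-cell inhibition)
        code = np.zeros(N_GRANULE)
        code[winners] = 1.0
        return code

    context = rng.normal(size=N_INPUT)
    print(int(expand_and_sparsify(context).sum()), "of", N_GRANULE, "units active")

Such sparse expanded codes make similar contexts highly discriminable, supporting context-specific learning of a kind the striatum’s connectivity does not easily afford.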

In prior research, we defined biologically plausible neural network models of learning in the cerebellum to characterize its function in typing-like performance and in eyeblink conditioning (Rhodes & Bullock, 2002; Fiala et al., 1996). In this project, we will incorporate such a network within the CQ+DIVA model (in addition to the BG learning circuit described in Modeling Project 1). In particular, Rhodes & Bullock (2002) simulated how the cortico-cerebellar loop could learn long-term sequence memories for familiar items and retrieve them into a CQ planning layer. This enables the cortex to treat familiar sequences as single items, i.e., “chunks” (which behave similarly in manual and speech sequences; see Klapp, 2003). This will augment the CQ+DIVA model with the ability to incrementally learn what Levelt et al. (1999) called a “syllabary”, i.e., a representation of high-frequency syllables that does not necessarily respect word boundaries but aids in rapid speech production[18].
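
The basic CQ readout that such a learned chunk feeds can itself be stated compactly: the chunk loads a primacy gradient of activations over item nodes into the planning layer, the strongest item is chosen and produced, and the produced item is deleted from the plan. The fragment below is a bare-bones illustration with made-up gradient values, not the Rhodes & Bullock (2002) implementation:

    import numpy as np

    def cq_readout(items, gradient):
        """Produce items in the order implied by a primacy gradient."""
        plan = np.array(gradient, float)  # parallel planning-layer activations
        produced = []
        while plan.max() > 0:
            idx = int(plan.argmax())      # competitive choice of the strongest item
            produced.append(items[idx])
            plan[idx] = 0.0               # self-inhibition deletes the produced item
        return produced

    # A learned chunk for "nerGERpretez" stores one gradient value per syllable.
    print(cq_readout(["ner", "GER", "pre", "tez"], [0.9, 0.7, 0.5, 0.3]))
    # -> ['ner', 'GER', 'pre', 'tez']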

In the experiments described below, participants will learn to quickly and accurately produce novel strings of syllables (Experiment 1) or phonemes (Experiment 2). Lesion and imaging studies suggest that multiple memory systems and brain areas may be recruited to support such learning, which occurs many times daily during the rapid vocabulary acquisition phase of language learning. If elements of the sequence are already familiar and easily reproduced in isolation, then such sequence learning has three notable phases: (1) good perceptual sequence recognition, but perhaps incomplete (error-prone) recall, following one or a few exposures, (2) robust sequence recall sufficient to accurately reproduce the sequence with moderate fluency, and (3) integration of the sequence into a single "chunk" with highly fluent production. The number of training trials in our experiments will allow subjects to reach at least stage (2) for learned sequences (cf. Klapp, 2003).

Behavioral Experiment 1: Learning of new supra-syllabic sequences. The purpose of this experiment is to prepare subjects for a subsequent fMRI experiment that will look for differences in brain activation for learned sequences of syllables compared to novel sequences. In order to carefully control the degree to which particular sequences of syllables have been learned, and to avoid potential confounds due to semantic content, we will construct a list of 18 nonsense pseudowords, each consisting of four concatenated syllables (e.g., “nerGERpretez”), and train each subject to produce a subset of the pseudowords. The pseudowords are generated as random permutations of four syllables subject to several constraints: 1) all stimuli have the same syllable structure ([CVC][CVC][CCV][CVC]); 2) stress is always assigned to the second syllable; 3) all syllables occur with approximately the same frequency in English; 4) consecutive syllable-syllable pairs occur with negligible frequency in English; 5) the phonological neighborhood density of all stimuli is zero; 6) none of the syllables is a word in English. Frequency information is derived from the CELEX-2 database; neighborhood density is assessed using the Washington University Neighborhood Database.
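
For illustration, the generation procedure amounts to rejection sampling over candidate permutations, as in the hypothetical fragment below; the syllable inventories shown are placeholders, and the constraint check is a stub standing in for the actual queries against CELEX-2 and the Washington University Neighborhood Database (constraints 3-6 above):

    import random

    CVC_SYLLABLES = ["ner", "tez", "gom", "dak"]  # placeholder inventories
    CCV_SYLLABLES = ["pre", "sta", "klo"]

    def candidate():
        s1, s2, s4 = random.sample(CVC_SYLLABLES, 3)
        s3 = random.choice(CCV_SYLLABLES)
        return (s1, s2.upper(), s3, s4)  # [CVC][CVC][CCV][CVC]; stress on syllable 2

    def passes_constraints(word):
        # Stub: would verify matched syllable frequencies, negligible
        # syllable-pair frequencies, zero neighborhood density, and no
        # real-word syllables against the databases named above.
        return True

    stimuli = set()
    while len(stimuli) < 18:
        w = candidate()
        if passes_constraints(w):
            stimuli.add(w)
    print(len(stimuli), "pseudowords, e.g.", "".join(sorted(stimuli)[0]))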

The experiment will involve two approximately 1-hour sessions for each of 17-34 healthy adult subjects. The first run of the first session[19] will be a pre-test that measures mean error rates, reaction times, and durations as the subject produces the 18 pseudowords, presented 5 times each, in a simple reaction time task. At the beginning of each trial, the pseudoword will be presented orthographically and auditorily (to promote proper pronunciation) for 1.5 s. After a random delay period (500 ms to 2000 ms), a go signal in the form of a beep will be presented, and the subject will be instructed to say the word as quickly and as accurately as possible after the beep; the next trial then begins (total trial length approximately 5 s). After the pre-test, the subject will perform 6 training runs (each approximately 8-10 minutes long) in which he/she repeatedly produces 6 of the pseudowords from the list in the same simple reaction time task (10 presentations of each stimulus per run). These 6 pseudowords will constitute the learned sequences for that subject, and the remaining 12 pseudowords, which are not encountered during the training runs, will constitute the unlearned sequences for that subject[20]. The second training session will be scheduled one day after the first training session[21]; at that time the subject will perform six additional training runs followed by a post-test identical to the pre-test.
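
The trial structure can be summarized programmatically; the sketch below uses only the timing constants given above (1.5 s prompt, 500-2000 ms random delay, roughly 5 s per trial), with placeholder stimulus labels, to estimate the length of a pre-test run:

    import random

    def trial(pseudoword):
        """One simple reaction time trial, using the timing given in the text."""
        return {"word": pseudoword,
                "prompt_s": 1.5,                                # orthographic + auditory
                "delay_s": round(random.uniform(0.5, 2.0), 2),  # random pre-go delay
                "go": "beep",
                "trial_s": 5.0}                                 # approximate total length

    words = [f"pw{i:02d}" for i in range(18)]                   # placeholder stimulus IDs
    pretest = [trial(w) for w in words for _ in range(5)]       # 18 words x 5 repetitions
    random.shuffle(pretest)
    print(len(pretest), "trials; approx.", len(pretest) * 5.0 / 60, "min per pre-test run")

By this estimate, a 90-trial pre-test run lasts roughly 7.5 minutes.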

Hypothesis Test 1. Based on the results of serial reaction time studies (Schendan et al., 2003; Doyon et al., 2002, 2003; Aizenstein et al., 2004) and prepared sequence reaction time studies (Klapp, 2003) involving the learning of novel sequences, we expect reaction times to decrease more for the learned sequences than for the unlearned sequences as a result of training[22]. This hypothesis will be tested by performing a repeated measures ANOVA (two-way interaction term between training session [pre- vs. post-training] and sequence set [learned vs. unlearned], one-tailed, p < 0.05). Hypothesis Test 2. We also expect a relatively larger decrease in the duration of the learned pseudoword productions after training compared to the unlearned pseudowords (same test as Hypothesis 1, but with duration as the outcome measure). Hypothesis Test 3. Finally, we expect a relatively larger decrease in the error rates of the learned pseudoword productions after training compared to the unlearned pseudowords (log-linear test for the two-way interaction in the crosstabulation of correct/incorrect productions, training session, and sequence set, p < 0.05).
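
For concreteness, a generic version of the interaction test for Hypothesis 1 could be run as in the following sketch, which applies the statsmodels AnovaRM routine to synthetic data (the data frame, effect size, and noise level are fabricated for illustration; AnovaRM reports a two-tailed p, which would be halved under the directional prediction):

    import numpy as np
    import pandas as pd
    from statsmodels.stats.anova import AnovaRM

    rng = np.random.default_rng(1)
    rows = []
    for subj in range(1, 21):                        # 20 synthetic subjects
        for session in ["pre", "post"]:
            for seq_set in ["learned", "unlearned"]:
                rt = 800.0 + rng.normal(0.0, 30.0)   # baseline RT in ms, plus noise
                if (session, seq_set) == ("post", "learned"):
                    rt -= 150.0                      # synthetic training effect
                rows.append({"subject": subj, "session": session,
                             "seq_set": seq_set, "rt": rt})
    df = pd.DataFrame(rows)

    res = AnovaRM(df, depvar="rt", subject="subject",
                  within=["session", "seq_set"]).fit()
    print(res)  # the session:seq_set row gives the interaction F and p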