Background and Significance - Gene ontology



Specific Aims

The Gene Ontology Consortium (GOC) provides the scientific community with a consistent and robust infrastructure for describing, integrating, and comparing the structures of genetic elements and the functional roles of gene products within and between organisms. In just six years, its constituent ontologies have become the de facto community standard for expressing, in a machine-usable form, the biological domains of genome features, molecular function, biological process, and cellular localization. The Gene Ontology (GO) provides a set of well-defined terms organized into specialization and part-of hierarchies that are technology and data format neutral. This technical adaptability has led to its adoption by a wide range of databases, and the GO has been integrated into a wide variety of technical environments. Hence, the breadth and diversity of organisms annotated with the GO and its associated Sequence Ontology (SO) continue to increase. This adaptability has also encouraged its use for many unforeseen purposes, e.g., Natural Language Processing (NLP) and Information Retrieval of the biomedical literature. The GOC will now increase the depth and taxonomic breadth of the ontologies and associated annotations while keeping quality high, so that they may be reliably used to draw inferences and translate knowledge across organisms. We will advance the understanding of the molecular basis of human health and disease by focusing on the following key aims to integrate and standardize biomedical and genomics information:

Aim 1: We will maintain comprehensive, logically rigorous and biologically accurate ontologies. We will work closely with biological experts to ensure that the ontologies accurately reflect biological reality. We will incorporate new relationship types into the ontologies as needed and we will recast compound terms as explicit cross-products with orthogonal ontologies. We will keep the ontologies logically rigorous so that, when used to query for terms associated with gene products, they will neither omit relevant annotations nor return incorrect annotations.

Aim 2: We will comprehensively annotate reference genomes in as complete detail as possible. Genomes that are fully and reliably annotated empower scientific research and are essential for use in automatic inference. We will annotate reference organisms selected according to the following criteria: a large body of scientific literature exists; a reasonably sized community of researchers study that organism; the organism’s relative importance as an experimental model in the study of human disease; and high impact on discovery in the scientific community.

Aim 3: We will support annotation across all organisms. Emerging genomes, or any collection of gene products (e.g. EST libraries or proteomic data), are best understood in a comparative context; and inferring function from highly reliable sets of annotation on related organisms, such as those provided by Aim 2, is the only practicable method to annotate the less well-known genomes. We will provide a standardized, structured methodology for functional annotation of emerging genomes. We will support and encourage the submission of functional annotations to the central GOC repository from the broadest possible spectrum of organisms.

Aim 4: We will provide our annotations and tools to the research community. Sharing the cumulative knowledge of the functional roles of each protein and non-coding RNA is the primary goal of the GOC. Therefore, we will support the use of the GO by all researchers in functional genomics, comparative biology, and other related fields. We will continue to provide all GO resources publicly and freely to the research community.

Background and Significance

1 The Origin and Development of the Gene Ontology Consortium 1998-2005.

The Gene Ontology (GO) was founded, in the fall of 1998, by researchers from three community Model Organism Databases (MODs): Mouse Genome Informatics (MGI), Saccharomyces Genome Database (SGD), and FlyBase. The objectives of the founders were very simple: to provide a common vocabulary for the description of what gene products 'do', and to apply this vocabulary to annotate gene products in these three databases. The motivation of the founders of the GO was two-fold. First, recording the function of every gene product was an essential responsibility of the MODs. Second, this task would be best done collaboratively, as it would then be more efficient, accurate and comparable. If these MODs were to use a shared structured vocabulary for annotation then one could foresee a common query interface and the de facto integration of these databases within this domain, a result greatly facilitating comparative biology research. These considerations remain major motivations for the work of the GOC [GOC][1]. Indeed the needs are even more urgent now: in 1998 only two eukaryote genomes (S. cerevisiae, C. elegans) and 18 bacterial genomes had been published. As of December 2005 the numbers are 53 (although not all of these are closed or finished) and over 200, respectively.

1 Growth of the Gene Ontology Consortium.

A series of small informal meetings in late 1998 and early 1999 between members of the three founding MODs, generously supported by funding from AstraZeneca (through the good offices of Dr. Ken Fasman), established the backbone of the Gene Ontology. The key decisions taken at these meetings were: (a) to limit initially the scope of the GO to three independent sub-domains: the molecular function, biological role and cellular location of gene products (be they protein or RNA); (b) to structure vocabularies as directed acyclic graphs (DAGs) using just two relationships between terms, is_a and part_of; (c) to provide each term with a dictionary style definition; (d) to provide a common database and interface to the GO and annotations supplied by the MODs; (e) to maintain and track the history of changes to each term; and (f) to provide all of the work of the GOC to the public without any constraint whatsoever. Finally, we determined that immediate testing and usage would drive the work of the GOC, solving immediate problems simply, without precluding increasing sophistication.
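The consequence of decision (b) for querying can be sketched in a few lines of Python. This is a minimal illustration only: the term names, edges, gene names and function names below are hypothetical and are not taken from the GO files or from any GOC software. It shows why a query for a term should also return gene products annotated to any term beneath it in the graph, whether the path runs through is_a or part_of.

```python
from collections import defaultdict

# Hypothetical toy graph: (child, relationship, parent). Illustrative only.
EDGES = [
    ("nucleolus", "part_of", "nucleus"),
    ("nucleus", "is_a", "organelle"),
    ("mitochondrion", "is_a", "organelle"),
    ("organelle", "part_of", "cell"),
]


def build_parents(edges):
    """Index the edge list as child -> set of (relationship, parent)."""
    parents = defaultdict(set)
    for child, rel, parent in edges:
        parents[child].add((rel, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable from `term` by following is_a or part_of."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _rel, parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def genes_for(term, annotations, parents):
    """Genes annotated directly to `term` or to any term beneath it."""
    return sorted(
        gene
        for gene, annotated in annotations.items()
        if annotated == term or term in ancestors(annotated, parents)
    )


PARENTS = build_parents(EDGES)
# Hypothetical annotations: one term per gene, for brevity.
ANNOTATIONS = {"geneA": "nucleolus", "geneB": "mitochondrion", "geneC": "cell"}
```

Under these assumptions, `genes_for("organelle", ANNOTATIONS, PARENTS)` returns both `geneA` and `geneB`, although neither is annotated to 'organelle' directly.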

The first substantive application of the GO was made during the first annotation of the then newly completed genome of Drosophila melanogaster at the Celera/Berkeley Drosophila Genome Project (BDGP) annotation jamboree in November 1999 (Adams et al. 2000). Encouraged by the success of this project and by the informal interest in the GO shown by others, we published, in May 2000, the first formal account of the GO in Nature Genetics (The Gene Ontology Consortium, 2000) and, in March 2000, made our first application to the NIH for funding. This was awarded in full with a start date of January 2001, and renewed in 2003.

This funding enabled the development of the GO (Appendix 11: Progress Report year 5). We are delighted that all of the major model organism databases for eukaryotic organisms now annotate with the GO. The phylogenetic range is about as wide as it could be, from protozoa such as Plasmodium and Tetrahymena (in progress) to human and rice. We are disappointed that, for a variety of reasons (primarily tradition and logistics), there has been less universal use of the GO for the annotation of bacterial genomes, with the notable exception of the work done at The Institute for Genomic Research [TIGR]. However, we are encouraged that increasingly many groups outside TIGR are now using the GO, although these projects have yet to deposit their data with the GOC (e.g., the Pseudomonas annotations, see [Pseudomonas]). We have been in extensive discussion with many of the major players in this field, e.g., the Joint Genome Institute's (JGI) Integrated Microbial Genomes database [IMG], the Sanger's Pathogen Sequencing Unit [PSU] (who do use GO for their eukaryotic genomes), the E. coli community and the NIAID's Bioinformatics Resource Centers [NIAID]. We address these issues in more detail below (Aim 3).

The use of the Gene Ontology by many major cross-species databases has grown. These include not only the major GO annotation project (GOA) for UniProt at the European Bioinformatics Institute ([UniProt], Wu et al. 2006), the TIGR Comprehensive Microbial Resource [CMR] and the Sanger's GeneDB [GeneDB], but also the NCBI, where GO annotations are incorporated into Entrez-Gene [NCBI] and the Protein Data Bank [PDB], which has recently released annotations of its structures with GO terms. The GO is also incorporated into other open bioinformatics standards, e.g. the BioPax Level 2 specification ([BioPax]) and support for GO is now provided in Cytoscape (Ver. 2.2) ([Cytoscape]).

We are also encouraged by the very extensive use of the GO in industry, not only by most of the major pharmaceutical companies and many small biotech companies, but also by companies offering information services to the pharmaceutical industry (Table 1). This use of the GO has been paralleled by the extraordinary development of 'third-party' tools to manipulate the GO or GO annotations: we are now aware of over 70 different tools; some are commercial (e.g. the DecisionSite Ontology Browser of Spotfire [SPOTFIRE]), but most are open source [GO.tools]. These include ontology browsers, tools for the annotation of proteins with GO terms, tools for the analysis of high throughput expression data and tools for the use of GO in text mining.

|Category |Products that use GO |
|NLP & ontology products |Biowisdom; ReelTwo; IBM, Japan |
|Array products and data analysis |Affymetrix; BioMind; Spotfire; GenePilot; DNA Array Systems; Molecular Station; Proteome Software; Medicel; Sapio Sciences; Avadis; BioSieve; Molmine; MWG Genome Information; Persistent; VizXlabs; Seqexpress; Gene Pilot; Macrogen; Biomind; Clontech; Inpharmatica; Ocimum Biosolutions; Bioin4matrix; Operon; Elashoff Consulting |
|Reagents & services |Actigenics; Macrogen USA; Sigma; ExactAntigen; Abnova Corporation; Mirus; Admetis; Treenomix |

Table 1. Some commercial products that incorporate or use the GO.

It was no coincidence that the idea of the GO came at just the time that Stanford researchers had first shown the power of microarray analysis of gene expression. Indeed, those charged with the analysis of microarray data are among the most intensive users of the GO, and many major manufacturers of both gene expression and protein arrays include GO annotation of their probes (e.g., [AFFY], [CLONTECH]). GO is even being used to describe commercial reagents (e.g., Abnova, a company that boasts over 10,000 monoclonal antibodies to human proteins [ABNOVA], and oligonucleotide libraries from Sigma-Genosys [GENOSYS]) (Table 1). However, it has been surprising to us that the GO has found so much use in the evolving NLP field (see, for example, Jenssen et al. 2001; Raychaudhuri et al. 2002; Hirschman et al. 2005). Commercial tools for literature mining that incorporate the GO are now beginning to be released (e.g. [AGILENT], [GO-KDS]), as has a public tool that classifies PubMed abstracts with GO terms ([GOPUBMED]).

The idea of structured controlled vocabularies or ontologies was not new in 1998, even in biology; witness the development of SNOMED ([SNOMED]) and the UMLS ([UMLS]). A very important step was taken in 1993 by Monica Riley's development of a hierarchical controlled vocabulary for the description of gene 'function' in E. coli (Riley, 1993). This has been developed as MultiFun [MultiFun], with which the GO has been mapped in collaboration with Greta Serres at Woods Hole. At about the same time, Overbeek, Maltsev and Gaasterland in the Argonne Group did pioneering work by developing the PUMA (see [PUMA2]) and WIT resources ([WIT]). The FunCat project of the Munich Information Center for Protein Sequences (MIPS) [FunCat] has also been mapped to the GO, in collaboration with MIPS staff. Both MultiFun and FunCat are relatively small (505 and 1307 terms respectively); both are strict subsumption hierarchies and both are very stable, not being regularly updated as biological knowledge changes. Although derivatives of Riley's 1993 classification and of MultiFun have been widely used for the annotation of bacterial genomes, Riley herself recognizes their limitations and has stated that the GO is needed for this task (Serres et al. 2001). Significantly, the most recent analyses of the E. coli K-12 genome use the GO (Riley et al. 2006; [EcoCyc]).

The accurate annotation of gene products with the GO depends on the availability of high quality genome annotation, and a robust mechanism for exchanging annotations between multiple groups and databases. The latter was developed by Durbin and Haussler in 1997 [GFF] and has been enhanced by Stein and colleagues as GFF3 [GFF3]. A major difference between GFF and GFF3 is that GFF3 incorporates an ontology that constrains the description of annotation feature types. This is the Sequence Ontology, developed by the GOC in collaboration with Richard Durbin, Lincoln Stein and Mark Yandell (Eilbeck et al. 2005, [SO]). The reason for the GOC's investment in this project is simply the GO's reliance on high quality genome annotations. Annotations will be of higher quality if the annotators all agree on the definitions of the objects they annotate. Moreover, as shown by a small example in Eilbeck et al. (2005), we discovered, having already built the SO, the unexpected benefit of using tools from the discipline of extensional mereology (see Simons, 1987, and [Mereology]). These tools promise novel methods for the analysis of genomes (see [CGL]). The SO adds a fourth domain to the efforts of the GOC.

There has been a major change in the bioinformatics community with respect to ontologies in the last six or seven years. Prior to 1999 only a few were advocating ontology development (e.g. Schulze-Kremer, 1997; 1998; see also Karp, 1995; Karp and Paley, 1996). Now, we see not only the development of ontologies for many different domains within biomedicine, but also their very extensive use by biologists and bioinformaticians. These changes have been driven, we think, by three considerations. First, the ever-growing number of completed genomes, and the increasing amounts of 'post-genomic' data that follow as a consequence, have opened the eyes of the community to the need to bring semantic order to biomedical data. Second, the concept of the Semantic Web (Berners-Lee et al. 2001), whose success is predicated on the development, availability and use of domain ontologies, has influenced biomedical informatics (although for the counter case see [Shirky, 2005]). Finally, we like to think that the success of the GO project has proved the benefits that accrue to a community from the adoption of an ontology. We also believe that the GO is an example of how an open ontology can be developed with widespread community participation.

The development of several new ontologies in the biomedical research domain is to be welcomed, but presents the community with several problems. The first of these is access, finding out just what is available. To solve this problem the GOC established the OBO site [OBO] as a 'single-stop shopping site' for biomedical research ontologies. As of December 2005, there were nearly 60 contributed ontologies accessible from this site (the majority maintained in the OBO Concurrent Versioning System (CVS) repository). Many of these are of central importance to the future development of the GO (see Aim 1), for example the EBI's ChEBI ontology for chemicals of biological interest [CHEBI], the Cell Ontology (Bard et al. 2005 and [CELL]), and many anatomical ontologies. Early in 2006 responsibility for the OBO site will move from the GOC to the newly established National Center for Biomedical Ontology [NCBO]. NCBO is an NIH funded consortium of biologists, clinicians, informaticians and ontologists who are developing novel methods for the creation, dissemination and management of biomedical information. NCBO is not responsible for the content of ontologies, but will, inter alia, provide services for their maintenance, evaluation, distribution and usage.

2 The Sequence Ontology.

The Sequence Ontology project was initiated by the GOC, in collaboration with Drs. R. Durbin and L. Stein, to provide a structured controlled vocabulary for the description of features used for genome annotation. Traditionally, the Feature Table descriptors of the International Sequence Data Library (GenBank/EMBL/DDBJ) ([FT]) have been used. While this has served the community well for many years, it suffers from certain disadvantages. First, it is quite restricted in its scope, providing only 65 terms for the description of sequence features. Second, it is very cumbersome to change: any alteration must be agreed by the international collaborators of the three data libraries and then only implemented after 6 months' notice to the community. Most seriously, the groupings of terms are not formalized. Just as the MODs needed to express formally the attributes of gene products, so they also need to express formally the attributes of sequence. The computational analysis of annotated sequence would be enormously helped if the MODs expressed these attributes using the same terms, used with accepted formal definitions. A second justification for the development of the SO was that there was no easy, and certainly no rigorous, way to retrieve sequences based on some biological property from the sequence data libraries. Example queries are: retrieve all of the genes from the mouse genome that are 'maternally imprinted'; retrieve all of the genes from mouse, worm, fly and yeast whose transcripts are translated with a +1 ribosome frameshift. The SO provides a small subset of locatable features, specifically for use by GFF3; this subset, SOFA (Sequence Ontology for Feature Annotation, see [SO.SOFA]), is only changed once a year, so as to afford stability to GFF3 files.

The development of the SO has been very closely associated with two other projects. The first is GFF3 ([GFF3]), a project to create a standardized file format for the exchange of annotations. The second is the Generic Model Organism Database's (GMOD) chado database project ([CHADO]), whose goal is to construct a generic database schema so that the annotations from different genomes can be archived, searched and managed using a single set of database tools. The Sequence Ontology will provide the terms and specify the relationships used to describe the contents of GFF3 files and GMOD databases. Thus SO is a necessary and planned adjunct to both of these projects.
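How an ontology can constrain the feature types in a GFF3 file is illustrated by the following minimal Python sketch. It assumes the standard nine tab-separated GFF3 columns; the allowed-type set is a small illustrative stand-in for the SOFA subset, and the function name, the example line and its identifiers are hypothetical.

```python
# Illustrative stand-in for the SOFA subset; the real list is much longer.
SOFA_TYPES = {"gene", "mRNA", "exon", "CDS"}

# The nine tab-separated columns of a GFF3 feature line.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line and reject feature types outside the set."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # Column 9 holds semicolon-separated tag=value pairs.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    record["attributes"] = attributes

    if record["type"] not in SOFA_TYPES:
        raise ValueError(f"feature type {record['type']!r} is not an allowed term")
    return record


# Hypothetical example line; the identifiers are made up for illustration.
EXAMPLE = "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=demo"
```

Because every producer draws the type column from the same vocabulary, a consumer can filter on `record["type"]` without knowing which group generated the file.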

3 OBO and OBO-Edit.

The OBO site has encouraged developers of biomedical ontologies to use the file format developed by the GOC. This OBO format has been enhanced considerably over the last few years and now has the advantages of great expressivity, computability and human readability [OBO-format]. A single common format promotes community (re-)use of tools, e.g., tools for ontology editing [OBO-edit] and browsing [AMIGO]. OBO-Edit, the ontology editing tool developed and used by the GOC community, is now widely used, and the AmiGO browser has also been used for the Plant Ontology [PO] and the Drosophila anatomy ontology [IMAGO].
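The readability of the format can be illustrated with a minimal Python sketch that reads [Term] stanzas made of tag: value lines. This is a simplification under stated assumptions, not the parser used by OBO-Edit or AmiGO: it handles only the id, name, is_a and relationship tags, treats everything after "!" as a comment, and uses made-up EX: identifiers.

```python
# Made-up stanzas in the style of an OBO file; the EX: identifiers are
# hypothetical and do not correspond to real ontology terms.
OBO_TEXT = """\
format-version: 1.2

[Term]
id: EX:0000001
name: organelle

[Term]
id: EX:0000002
name: nucleus
is_a: EX:0000001 ! organelle

[Term]
id: EX:0000003
name: nucleolus
relationship: part_of EX:0000002 ! nucleus
"""


def parse_obo_terms(text):
    """Return {term id: {"name", "is_a", "part_of"}} for each [Term] stanza."""
    terms = {}
    current = None
    for raw in text.splitlines():
        # Simplification: drop anything after "!" as a trailing comment.
        line = raw.split("!", 1)[0].strip()
        if not line:
            continue
        if line.startswith("["):
            # Start a new stanza; stanza types other than [Term] are skipped.
            if line == "[Term]":
                current = {"name": None, "is_a": [], "part_of": []}
            else:
                current = None
            continue
        if current is None:
            continue  # header lines or lines inside a skipped stanza
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
        elif tag == "relationship":
            relation, _, target = value.partition(" ")
            if relation == "part_of":
                current["part_of"].append(target.strip())
    return terms


TERMS = parse_obo_terms(OBO_TEXT)
```

The same few tags are enough to rebuild the is_a and part_of graph, which is one reason a shared format makes editors and browsers reusable across ontologies.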

The GOC is often asked: “What is the difference between Protégé-OWL and OBO-Edit?” Protégé and OBO-Edit are tools with many similarities, but fundamental design differences. Protégé is a frame-based editing tool, while OBO-Edit is a graph-based editing tool. Both tools have optional extensions that allow for successively richer and more expressive ontological modeling (namely the Protégé OWL plug-in, and a series of optional plug-ins in OBO-Edit, such as the cross product plug-in). Even when these extensions are used, the initial design philosophies of the two respective tools are apparent.

OBO-Edit's graph-based approach is ideal for the rapid generation of large ontologies focusing on relationships between relatively simple classes. In its default view, OBO-Edit hides the complex details of the ontology to allow the user to focus on the overall graph structure of the ontology. GO curators are not aware of (and have no use for) "slots", for example; they only see a graph of labeled relations between ontology terms. Since OBO-Edit's user community consists largely of biologists with little computer science background, this simpler, high-level editing approach is ideal for the target community. Further, by hiding the low-level complexity of an ontology, OBO-Edit can be a more usable tool for editing very large ontologies.

OBO-Edit and Protégé are interoperable tools, inasmuch as they both contain support for description-logic languages such as OWL (see section D.1.c.iii for more on interoperability). Most editing operations that are possible in one tool are possible in the other. Yet the two tools are highly complementary. Each is specialized for a different user community and editing paradigm. It is possible that some advanced users will install both tools, and use each in different circumstances.

2 Rationale for the continued funding of the GOC.

“The growth of scientific data, and of scientific databases, in the biomedical field—a growth not only of size but also of complexity—has been remarked upon so often as to become a cliché. The urge for database 'integration' has been a mantra of both the bioinformatics community and of the funding agencies for decades." So we wrote in our 2003 proposal to the NIH and, frankly, we can do no better now. The 2006 Database Issue of Nucleic Acids Research [NAR] includes 162 papers each describing at least one database in the general domain of genomics and molecular biology. The Molecular Biology Database Collection, maintained by M. Galperin (Galperin, 2006) in association with NAR, records 858 databases[2]. Other than within the Generic Model Organism Database community [GMOD], there is very little agreement or collaboration between the providers of these databases with respect to semantic standards. Within the GMOD community, the chado database schema [CHADO] (designed by C. Mungall and D. Emmert) has ontologies at its heart, and these are all OBO ontologies. Chado has been adopted by several MODs, including FlyBase, dictyBase, ParameciumDB and TIGR.

The first uses of the GO for annotation were for the Drosophila genome (Adams et al. 2000) and for yeast, both in 2000. This was followed, in 2001, by its use for the annotation of mouse cDNAs by the FANTOM project (Kawai et al. 2001). Since then the GO has been used in over 166 published genome, EST or protein annotation projects, mostly by groups outside the GOC (data from [GO.pubs]). The extraordinary utility of GO annotations is seen by their very extensive use for the analysis of gene expression data. The first such papers were published in 2002, and we are now aware of over 230 studies to date (data from [GO.pubs]).

The nature of high-throughput gene expression data challenges the ability of biologists and bioinformaticists to extract useful knowledge from it. It is very clear that the GO is an essential tool, as indicated by the number of tools that have been developed for this purpose (39 to date) [GO.tools]. Many companies, academics and government institutions have developed products for specialized GO applications, e.g. GOFFA for toxicogenomics developed by the FDA [GOFFA]. We could quote scores of examples of the use of the GO for microarray analysis, but will restrict ourselves to just three. In the first, the authors of a very recent paper on gene expression in the invasive front of colorectal liver metastases write: "Using the gene ontology (GO) classification, we were able to determine patterns of up- and down-regulated genes in the liver part of the invasive front. We observed a pronounced overrepresentation, e.g., of the GO terms "extracellular matrix," "cell communication," "response to biotic stimulus," "structural molecule activity" and "cell growth," indicating a very pronounced host cell response to tumor invasion." (Bandapalli et al. 2006). Similarly, Zindy et al. (2005) discovered a "new set of genes involved in DNA repair and damage" using the GO for a study of gene expression in cirrhosis associated with liver cancer. Our final example is of a study of the effects of the EGFR inhibitor drug erlotinib on gene expression in metastatic breast cancer, from Swain's group at the National Cancer Institute: "Gene ontology comparison analysis pre-treatment and post-treatment in EGFR-negative tumors revealed biological process categories that have more genes differentially expressed than expected by chance. Among 495 gene ontology categories, the significant differed gene ontology groups include G-protein-coupled receptor protein signaling (34 genes, P = 0.002) and cell surface receptor-linked signal transduction (74 genes, P = 0.007)." (Yang et al. 2005).
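The phrase "more genes differentially expressed than expected by chance" can be made concrete with a one-sided hypergeometric test, a common way to score over-representation of a GO term in a gene list. The Python sketch below is an illustration under stated assumptions: it is not claimed to be the method used in the studies quoted above, the function and parameter names are hypothetical, and it applies no correction for testing many terms at once.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes by chance.

    population: number of genes on the array
    annotated:  number of those genes annotated to the GO term
    sample:     number of differentially expressed genes
    hits:       number of differentially expressed genes annotated to the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probabilities for k = hits .. upper.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

For example, with 10 genes of which 4 carry the term, drawing 3 genes and finding at least 2 annotated has probability (6 x 6 + 4 x 1) / 120 = 1/3, so such a result would not be considered significant.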

The growing reliance of the biomedical community on the GO can also be seen by the sheer growth in the number of publications that cite the GO in their abstracts, as indexed by PUBMED [PUBMED]: from 7 in 2001, to 322 in 2005. A search for 'gene ontology' on Google Scholar reveals over 9,300 links and that our 2000 Nature Genetics paper has been cited 1291 times (01/16/06). Enthusiasm for the GO is also seen by the fact that 1470 people took the time to respond to our recent web survey (Appendix 13). An analysis of just 77 primary research papers showed that research supported by 17 of the 24 Institutes of the NIH has used the GO. Its importance has also been recognized by its incorporation into the National Library of Medicine's Unified Medical Language System [UMLS], see Lomax and McCray (2004), and the National Cancer Institute's Enterprise Vocabulary System [NCI].

Our case for the GO has been, and remains, that semantic integration of biomedical research data is both achievable and essential if these data are to be used by the communities of experimental and computational biologists to their greatest effect. We also argue that, while not inexpensive, the investment required to bring about this semantic integration is but a small fraction of what these biomedical data cost to discover and but a small fraction of what they can yield in long-term benefits to society. The relevance of this work for public health is that comprehensive integration and standardization of biomedical and genomics information is an essential component of advancing the understanding of the molecular basis underlying human health and disease outcomes.

The GO was invented out of necessity – the necessity of the MODs having a rigorous method for their users to query their databases for gene products by their 'functional' attributes. We have argued that this necessity has not gone away, but rather has become even greater with the dramatic increase in the number of 'completed' genomes in the last three years. Annotations of gene products with GO terms have now become absolutely central to the analysis of most high-throughput genomics data. We are firmly of the view that the GO can be one solution to the problem of biomedical data management and analysis, and that our work since 1998 has shown that this solution is both achievable and maintainable.

PROGRESS REPORT.

The GO includes four controlled vocabularies for describing biology: the Molecular Function ontology, describing catalytic activity and other molecular properties of a gene product; the Biological Process ontology, describing the role that a gene product plays in a higher biological pathway; the Cellular Component ontology, describing the sub-cellular compartment in which a gene product can be found; and the Sequence Ontology, describing features that are located on, or are attributes of, biological sequences, thus providing a systematic way to annotate genomes.

The GOC is responsible for the content and structure of the four GO ontologies, for the software to edit and display the ontologies, for the structure of gene association data files, for supporting the application of the GO to classify data through tools and training, for the GO database, and for the project’s web presence.

1 GO curation infrastructure.

A primary objective of the GO is to provide robust and biologically accurate ontologies for the research community. During the current granting period, we continued to improve the GO and to provide the mechanisms for community input into the GO development process.

1 Gene Ontology development.

The GO Editorial Office at the EBI manages the distributed tasks of developing and maintaining the GO vocabularies for molecular function, biological process and cellular component. The office consists of the GO Editor (Dr. Midori Harris) and three GO curators who have primary responsibility for the biological content (terms and definitions) of the GO database. The editor and curators facilitate internal consistency by coordinating all additions and changes to the GO suggested by contributors. They work closely with model organism database annotators to identify areas of the GO that require expansion or revision, respond to requests from the larger biological research community, and initiate recruitment of experts to refine specific sub-domains. The Editorial Office staff also maintain and coordinate the GO project's documentation, develop the GOC web pages, present the GO at scientific meetings, and answer the many questions about the GO and its resources received from community members.

| |January 2004 |January 2005 |January 2006 |
|Component |1269 |1397 |1681 |
|Process |7867 |8924 |10291 |
|Function |6907 |6929 |7384 |
|Total |16043 |17250 |19356 |
|Defined |86.1% |93.0% |95.4% |
|Obsolete |716 |968 |992 |

The totals exclude obsolete terms.

Table 2. The content of the Gene Ontology, January 2004 to January 2006: numbers of terms per ontology.
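The totals in Table 2 can be checked, and the year-on-year growth derived, with a short Python sketch. The counts are copied from the table; the variable and function names are illustrative only.

```python
# Term counts per ontology, copied from Table 2 (obsolete terms excluded).
TERM_COUNTS = {
    "January 2004": {"Component": 1269, "Process": 7867, "Function": 6907},
    "January 2005": {"Component": 1397, "Process": 8924, "Function": 6929},
    "January 2006": {"Component": 1681, "Process": 10291, "Function": 7384},
}

# Totals as reported in the table.
REPORTED_TOTALS = {
    "January 2004": 16043,
    "January 2005": 17250,
    "January 2006": 19356,
}


def computed_totals(counts):
    """Sum the three ontologies for each snapshot."""
    return {date: sum(per_ontology.values()) for date, per_ontology in counts.items()}


def growth_percent(old, new):
    """Percentage increase from `old` to `new`."""
    return 100.0 * (new - old) / old


TOTALS = computed_totals(TERM_COUNTS)
GROWTH_2004_2005 = growth_percent(TOTALS["January 2004"], TOTALS["January 2005"])
GROWTH_2005_2006 = growth_percent(TOTALS["January 2005"], TOTALS["January 2006"])
```

The computed sums agree with the reported totals, and the growth between snapshots works out to roughly 7.5% and 12.2% respectively.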

Development continues on the GO ontologies and the GO vocabularies now comprise nearly 20,000 terms (Table 2). The editorial group performs regular integrity checks on the ontologies, and provides a summary of changes to GO structures monthly ([GO.reports]). We use three strategies to keep the GO current: content meetings, special interest groups, and on-line request tracking. Before a new term is committed to the database a broad consensus must be reached as to the wording of the term, its placement and relationships, and its definition. The discussions around new terms and their definitions involve a broad group of people, to ensure the integrity of the ontologies. Only members of the GO Editorial Office and a few senior annotators associated with model organism databases have permissions to modify the GO master file directly.

Interest groups: Interest groups work together to develop the terms needed to describe a specific topic e.g. development, cell cycle, plant biology, metabolism. They include domain experts, GO curators and representatives from the model organism databases. When GO moves into an unfamiliar biological area we actively recruit external experts to form the core of a new interest group. Interest groups have also formed spontaneously when biologists volunteer their services to improve the terms in their domain of interest. The interest groups communicate through their own mailing lists (Table 3) and at meetings. Anyone may join an interest group. There are now 29 ontology development interest groups providing the GO with biological domain expertise in such areas as ‘cell cycle’ and ‘development’. Major changes this year included new high-level component terms including ‘organelle’, ‘protein complex’ and ‘receptor complex’. The current interest groups are described at [GO.interests].

Tracking Requests and Results: Individual users can submit requests for change to the ontologies via the GO request tracker system hosted by SourceForge. Over 700 tracker entries were processed in 2005 ([GO.sf]). There are approximately 70 new requests per month. A log of all requests, the associated discussion, and the status of each is available at the site. Suggestions are submitted by a wide range of groups, including the various model organism databases, UniProt, TIGR, BRENDA ([BRENDA]) and Incyte (now BioBase), as well as by individual researchers.

|Mailing List Name |Number of Subscribers |

|GOFRIENDS – general announcements and discussions on GO. |481 |

|GO – Consortium discussions (closed list). |103 |

|ANNOTATION – discussions of functional annotation. |86 |

|PATHOGENESIS – special topic discussions on pathogenesis, cell killing and immunology. |41 |

|FARMANIMALS – special topic discussions on Farm and Work Animals genome annotations to GO terms. |44 |

|METABOLISM – special topic discussions on metabolism and GO annotations. |24 |

|DEVELOPMENT – special topic discussions on Developmental Biology. |22 |

|NEUROBIOLOGY – special topic discussions on Neurobiology. |14 |

Table 3. Major GO mailing lists.

Ontology Content Meetings: We organized three ontology content meetings in 2004-5 to bring together specialists to develop specific branches of the ontology. Each focused on one or a few major ontology development issues, such as pathogenesis, metabolism, cell cycle, and immunology. The Editorial staff invites experts from relevant fields to participate in each meeting and initiates discussions with them to assess the requirements. Materials discussing the issues and the alternative approaches that may be taken are prepared before each meeting, so that attendees arrive fully prepared for the intense discussions needed to reach resolution. The materials and minutes for these meetings are available on the GO website (see [GO.meetings]). The most recent GO Ontology Content meeting, on the representation of immunology in the GO, was held at TIGR in November 2005. In addition to core participants from the GOC, the attendees included immunologists working on ontological representation from three academic institutions as well as NIH intramural researchers. Major revisions of the immunology sections of the GO were proposed and discussed. The result of the meeting was agreement on the representation and definitions for new terms and on changes to the organization of the Biological Process ontology structure in the area of immunology.

Ontology Quality Control: This year, we created software ([OBOL]) for parsing GO term names to infer missing relationships in GO. As missing relationships are determined, the GO editors correct and update the ontologies. A major project has been to curate logical definitions for GO terms involving cell types and to integrate the GO and OBO cell type ([CELL]) ontologies. These logical definitions are machine-interpretable, and thus can be used to detect inconsistencies between the ontologies, keeping the GO and cell ontologies in synchrony and permitting cross-ontology queries (see Mungall, 2004).
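The idea of parsing term names to propose missing relationships can be illustrated with a minimal Python sketch. This is not the OBOL implementation: the single naming pattern, the sample term names, and the function name suggest_missing_links are assumptions chosen only for illustration, and every suggestion would still be reviewed by a GO editor.

```python
import re

# Hypothetical naming pattern: "<positive|negative> regulation of X"
# is expected to be a kind of "regulation of X".
PATTERN = re.compile(r"^(positive|negative) (regulation of .+)$")


def suggest_missing_links(names, is_a):
    """Return (child, parent) name pairs implied by term names but absent from is_a.

    names: iterable of term names present in the ontology
    is_a:  set of (child_name, parent_name) pairs already asserted
    """
    known = set(names)
    suggestions = []
    for name in sorted(known):
        match = PATTERN.match(name)
        if not match:
            continue
        parent = match.group(2)
        # Only suggest a link when the parent term exists and the link is missing.
        if parent in known and (name, parent) not in is_a:
            suggestions.append((name, parent))
    return suggestions


# Illustrative input: one relationship is asserted, one is missing.
terms = [
    "regulation of cell cycle",
    "negative regulation of cell cycle",
    "positive regulation of cell cycle",
]
asserted = {("positive regulation of cell cycle", "regulation of cell cycle")}
print(suggest_missing_links(terms, asserted))
```

In this example only the link from the "negative regulation" term to its parent is reported, because the other link is already asserted.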

2 Sequence Ontology curation.

The Sequence Ontology is managed by Dr. K Eilbeck at LBNL, and is hosted via SourceForge ([SO]), a web-based resource for managing open source projects. The SO is continually updated in response to the growing needs of the community. Feedback from the community is predominantly expressed via an Internet mailing list. In this way the discussion reaches a wide variety of users and developers and the response is immediate. The mailing list is open to all and prior discussions are available from an archive. The Sequence Ontology held two content meetings in 2004-5 where particular issues were tackled in greater detail. The developers also solicit input from domain experts, and have worked with other groups on the development of common vocabularies, for example the MGED Ontology group [MGED], the alternate splicing terminology group [SANBI], and the RNA Ontology group [RNA].

Architecture of the Sequence Ontology. SO has kept to its aim of capturing the concepts necessary to describe four different orthogonal aspects of biological sequence: located_sequence_features are terms that describe concepts that can be located on the sequence in base coordinates; sequence_attributes describe the properties a feature may have. For example, a gene may have the attribute maternally_expressed, but this attribute cannot be located on a sequence; chromosome_variation catalogues the large-scale changes to the chromosomes such as ploidy and rearrangements; consequences_of_mutation details the terms necessary to outline what the mutation does, such as causing a frameshift mutation.

Development of the vocabularies. SO now contains 980 terms, including 180 SOFA terms, those terms specifically used by genomic annotation projects (see Figure 1). There has been considerable improvement in the number of terms that have a referenced biological definition. All of the terms in the current SOFA release are now defined, and over half of SO terms (550) are defined. A defined term has a free text description and a cross-reference to the resource that provided the definition. Mappings of SO to the GenBank/EMBL/DDBJ Feature Table and to the MGED Ontology are now provided (see [SO.mappings]).

[pic]

2 Annotation infrastructure.

The second objective of the GOC is to use the ontologies to describe biological data. From the outset, GO was intended as a practical means to achieve comparability across organisms and across sites. Each member of the GOC applies the ontologies to the gene products of its organism or discipline of interest, and deposits these annotations in the central GO repository where they can be accessed on the Web via the AmiGO interface ([AMIGO]). All GO annotations submitted to the central repository are backed up by associated evidence using a series of 12 evidence codes that describe the type and reliability of the underlying evidence (Table 4), along with the appropriate cross-references. For example, when citing published literature, the evidence includes supporting PubMed IDs.

|IC |Inferred by curator |

|IDA |Inferred from direct assay |

|IEA |Inferred from electronic annotation |

|IEP |Inferred from expression pattern |

|IGI |Inferred from genetic interaction |

|IMP |Inferred from mutant phenotype |

|IPI |Inferred from physical interaction |

|ISS |Inferred from sequence or structural similarity |

|ND |No biological data available |

|RCA |Reviewed computational analysis |

|TAS |Traceable author statement |

|NR |Not Recorded |

The success of the work done by the GOC is evidenced by the fact that all of the major model organism databases now use GO for annotation, and most of the new databases that will be developed in the near future have announced plans to use GO (e.g. Tetrahymena, Xenopus and Chlamydomonas). For a summary of all annotations that GO users have contributed see [GO.annotations]. The GO is integral to the UniProt database of proteins being developed by the SwissProt groups at the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). GO is used by a number of other protein resource databases, including PDB and BRENDA, and has recently been adopted by the National Center for Biotechnology Information (NCBI) for use in the Entrez-Gene and RefSeq projects (see Letter of Support from D. Maglott).

There is also extensive use of GO within the private sector for the annotation of in-house databases. AstraZeneca, Bayer HealthCare, Caprion, Celera, Eli Lilly, Genentech, GlaxoSmithKline, Hoffmann-La Roche, Incyte, Johnson&Johnson, Mendel Biotechnology, Merck, Millennium, and Unilever are just a few of the industrial users that we are aware of (see Letters of Support).

Six of the eight GOC groups are funded under our previous grant to curate and provide GO annotations to their user communities and to the common GO database resource. GO annotations are also regularly contributed to the GO resource from other groups such as the TIGR microbial and eukaryotic annotation groups, and from the GeneDB at Sanger Institute. In addition, Drosophila, Danio, Oryza and Candida model organism database groups participate in the work of the GOC and contribute annotation sets, but these efforts are not funded under the existing grant (nor are we now proposing funding). GO annotations are now regularly submitted by ten Model Organism Databases (CGD, dictyBase, FlyBase, MGI, RGD, SGD, TAIR, WormBase, ZFIN, Gramene) and three multi-organism databases (TIGR, UniProt and GeneDB).

Annotation Consistency and Quality Control 1: We have enhanced the quality control checks applied to the annotation sets that are provided by the GO resource. These checks include filters to ensure proper syntax and currency of identifiers. A major change implemented this year is the daily automatic removal, from the contributed datasets, of any annotations that fail a stringent check of the data. Any 'bounced' records are reported to the contributing project (see C.3). In addition to the checks of syntax and data currency, the daily checks also eliminate duplication of annotations between datasets. This major enhancement uses the NCBI Taxonomy IDs provided for each annotation and designates a particular model organism project as the authority for providing annotations for a particular species. Within these annotation sets attribution is given to the source of the annotations, whether the model organism database or another contributor.

Annotation Consistency and Quality Control 2: To ensure consistency between mouse, rat and human we have developed an additional measure (described in Dolan et al. 2005) that allows curators to examine the consistency of GO annotations for orthologous gene sets. Every month, MGI generates reports for mouse, rat and human genes, and MGI, RGD and GOA curators jointly review them. These comparative annotation reports are graphical, making it easy for curators to see differences in the annotations quickly, both in consistency and in granularity. The software creates an individual graph for every mouse-human-rat ortholog triple (Figure 2, [MGI-GOGRAPH]). Each node of the graph is a term, and the nodes are color-coded to indicate which combination of organisms is annotated to that term. Examination of these graphs shows that annotations are often complementary, reflecting the fact that the different model organisms are used to study different aspects of biology. While the GOgraph comparative tool is currently used only for these three organisms, it can be extended to any combination of organism orthology sets (see, for example, the mouse-human-rat-fly-chicken HomoloGene set: [pare]).

[pic]

Figure 2. Comparative GO graph for Shc3. This comparative graph for Shc3 and its orthologs illustrates how combined knowledge leads to a more comprehensive understanding of the biological role of a gene. The combination of annotations from human, rat and mouse databases gives us a more complete picture of this gene than any of the individual species' annotations alone. The picture illustrates what we know and generates testable hypotheses concerning what we don't know about this gene. As annotations from each species become more and more complete, an even finer understanding of this gene will emerge.
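The color coding of the comparative graphs depends on knowing, for each term, which combination of organisms is annotated to it. A minimal Python sketch of that grouping step follows. It is not the MGI software: the function name organisms_per_term and the placeholder term labels are assumptions made for illustration, and a real report would also propagate annotations up the ontology graph.

```python
from collections import defaultdict


def organisms_per_term(annotations):
    """Map each term to the set of organisms whose ortholog is annotated to it.

    annotations: iterable of (organism, term) pairs for one ortholog set.
    """
    by_term = defaultdict(set)
    for organism, term in annotations:
        by_term[term].add(organism)
    return {term: frozenset(orgs) for term, orgs in by_term.items()}


# Hypothetical annotations for one mouse-human-rat ortholog triple;
# "term_A" etc. are placeholders, not real GO terms.
example = [
    ("mouse", "term_A"),
    ("human", "term_A"),
    ("rat", "term_B"),
    ("mouse", "term_C"),
    ("rat", "term_C"),
    ("human", "term_C"),
]
combos = organisms_per_term(example)
for term in sorted(combos):
    # Each distinct combination would be drawn in its own color.
    print(term, "+".join(sorted(combos[term])))
```

Terms annotated in only one organism, such as the placeholder term_B here, are the ones that point to complementary knowledge or to possible gaps in curation.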

GO Annotation Consistency Workshops: The focus of the first Annotation Workshop (Cambridge, UK, June 2004) was to evaluate the annotation methods used by the different Consortium members and to define a set of common annotation practices which should be used by GO annotation projects. The second Annotation Workshop (Stanford, June 2005) focused on educating non-GOC members on these annotation standards and working to facilitate the use of these standards. A total of 54 people, from 38 institutions representing five countries, attended the second Annotation Workshop. Workshop minutes are available online ([GO.camp]). Roughly half of the workshop was devoted to lectures and discussions on the use of the GO ontologies and evidence codes. For the other half, the attendees divided into small groups (3-5 people, including one person experienced with GO annotation from a GOC member group) to do annotations. These small groups read papers and then discussed the GO terms and evidence codes that best represented the information reported in the publication.

Sequence Ontology Annotation. Unlike the three other ontologies of the GO that characterize gene products, the SO is focused on the actual parts of sequence that comprise the gene, and the attributes that describe them. The Sequence Ontology provides a platform independent approach to describe the contents of a genome annotation. This flexibility means that there are many ways to utilize SO compliant annotation information ranging from database schemas to ad hoc flat file formats. Currently there are several public data models that are using SO or SOFA to describe the features in annotations. These include the flat file specification GFF3 ([GFF3]) and the chado ([CHADO]) relational model from the Generic Model Organism database. Two related kinds of XML specifications, chado XML and Chaos XML are derived from the chado relational model. The model organism communities using the SO in their annotations are FlyBase, WormBase, SGD, MGI, and dictyBase. The SO website does not currently provide a repository of annotations. Other groups that use the SO include MGED ([MGED]), FlyMine ([FLYMINE]) and Atlas ([ATLAS]).

3 GO Database and AmiGO.

The primary community access to GO and GOC annotations is via downloads from the GO database. The AmiGO Browser provides web-based access that supports both semantic and sequence-based queries on the GO database.

The GO Database: There are three sources for the information provided by the GO database: the gene association files provided by the annotation projects, the GO ontologies in OBO format, and protein sequences obtained from UniProt and NCBI. Each annotation project provides its gene associations in a specific format ([GO.ga]) for all gene product annotations defined by that group. The projects are responsible for updating their file(s) within the CVS repository. The repository has been very effective in allowing this international collaboration to share responsibility for maintaining input files. A Perl script is provided as a quality control check to validate the format and to partially check the data provided within the gene association files. This script is run on all gene association files before they are loaded into the GO Database. It is intended to be generic and to enforce the standards defined by the GOC.

Each night any updated files are validated. Once a week all files are validated to maintain their currency. Any annotation found not to comply with the standards is removed, and the results of this validation step are reported back to the submitting group. The validation script is available from the GO servers and maintained in CVS. Thus, groups can use the script to validate their file before it is submitted. However, with ongoing revision of the ontologies some previously valid annotations will fail validation, hence the weekly reprocessing of all available annotations. The checks provided define the minimum standard format for the repository:

• GO identifiers must be current and not secondary or obsolete.

• Abbreviations used with any identifier must be defined within the GO crossreference abbreviations file [GO.xref].

• IEA annotations must have been determined within the past year. Transitive annotations must be regularly updated to maintain a minimal level of quality.

• The WITH column is not allowed for annotations using the TAS, NAS or ND evidence codes.

• The cardinality of all columns must strictly match the file specification.
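The checks listed above can be sketched as a small validation function. The sketch below is in Python and is not the GOC's Perl script: the record keys, the function name check_annotation, the expected column count, and the sample identifier are all assumptions made for illustration, and the authoritative rules are those in the file specification ([GO.ga]).

```python
from datetime import date, timedelta

# Evidence codes for which the WITH column must be empty (from the rules above).
NO_WITH_CODES = {"TAS", "NAS", "ND"}
# Assumption: expected number of tab-separated columns; the real value
# is defined by the gene association file specification.
EXPECTED_COLUMNS = 15


def check_annotation(fields, current_go_ids, known_abbreviations, today):
    """Return a list of rule violations for one annotation record.

    fields: dict with the hypothetical keys 'n_columns', 'db', 'go_id',
            'evidence', 'with' and 'date' (a datetime.date).
    """
    problems = []
    if fields["n_columns"] != EXPECTED_COLUMNS:
        problems.append("wrong number of columns")
    if fields["go_id"] not in current_go_ids:
        problems.append("GO identifier is not current")
    if fields["db"] not in known_abbreviations:
        problems.append("undefined database abbreviation")
    if fields["evidence"] == "IEA" and today - fields["date"] > timedelta(days=365):
        problems.append("IEA annotation older than one year")
    if fields["evidence"] in NO_WITH_CODES and fields["with"]:
        problems.append("WITH column not allowed for this evidence code")
    return problems


today = date(2006, 1, 16)
current_ids = {"GO:9999991"}   # made-up identifier, for illustration only
abbreviations = {"SGD"}
record = {"n_columns": 15, "db": "SGD", "go_id": "GO:9999991",
          "evidence": "TAS", "with": "", "date": date(2005, 6, 1)}
stale = dict(record, evidence="IEA", date=date(2004, 1, 1))

print(check_annotation(record, current_ids, abbreviations, today))
print(check_annotation(stale, current_ids, abbreviations, today))
```

Returning a list of violations, rather than a single pass or fail flag, mirrors the reporting step described above, in which the reasons for each bounced record are sent back to the submitting group.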

[pic]

Figure 3. Flow of information provided by GO.

Annotations for the major model organisms are limited to a defined set of annotation projects. This removes redundancy and motivates the annotation projects to collaborate. The GOA project includes the annotations from the MODs within their submitted file, but these are filtered (based on NCBI taxon IDs) to remove taxa covered by the MODs. The GOA is the authority for human (taxon:9606) annotations.

|Organism |Project |Taxonomy ID |

|Leishmania major |GeneDB |5664 |

|Plasmodium falciparum |GeneDB |5833 |

|Schizosaccharomyces pombe |GeneDB |4896 |

|Trypanosoma brucei TREU927 |GeneDB |185431 |

|Glossina morsitans morsitans |GeneDB |37546 |

|Candida albicans |Candida Genome Database |5476 |

|Dictyostelium discoideum & sp. |dictyBase |5782 & 44689 |

|Drosophila melanogaster |FlyBase |7227 |

|Gallus gallus |GOA at UniProt |9031 |

|Homo sapiens |GOA at UniProt |9606 |

|Oryza sp. |Gramene |4528, 4530, 4532 … |

|Mus musculus |Mouse Genome Informatics |10090 |

|Rattus norvegicus |Rat Genome Database |10116 |

|Saccharomyces cerevisiae |Saccharomyces Genome Database |4932 |

|Arabidopsis thaliana |The Arabidopsis Information Resource |3702 |

|Bacillus anthracis |The Institute for Genomic Research |198094 |

|Coxiella burnetii |The Institute for Genomic Research |227377 |

|Campylobacter jejuni |The Institute for Genomic Research |195099 |

|Dehalococcoides ethenogenes |The Institute for Genomic Research |243164 |

|Geobacter sulfurreducens |The Institute for Genomic Research |243231 |

|Listeria monocytogenes |The Institute for Genomic Research |265669 |

|Methylococcus capsulatus |The Institute for Genomic Research |243233 |

|Pseudomonas syringae |The Institute for Genomic Research |223283 |

|Shewanella oneidensis |The Institute for Genomic Research |211586 |

|Silicibacter pomeroyi |The Institute for Genomic Research |246200 |

|Trypanosoma brucei |The Institute for Genomic Research |5691 |

|Vibrio cholerae O1 biovar eltor |The Institute for Genomic Research |686 |

|Caenorhabditis elegans |WormBase |6239 |

|Danio rerio |Zebrafish Information Network |7955 |

Table 5. Project responsibility for taxon annotation file. Annotations for the listed species are limited to the gene association files from the stated project. The NCBI Taxonomy identifier is used to filter the annotations.

The GOC has defined particular projects as solely responsible for all annotations for specific organisms using the NCBI Taxonomy identifiers. For example, SGD is responsible for all S. cerevisiae annotations (taxid:4932). Any annotations specifying that taxon identifier in gene association files other than SGD's are removed. A complete list of the responsible groups and the associated taxon IDs is available at [GO.QC]. The filtered gene association files are provided via HTTP, FTP and CVS. The ontology file is maintained as an OBO format file. A variety of other formats are also provided: the obsolete GO flat file is generated nightly from the OBO file, as are the OBO XML, RDF XML and OWL formatted files. These files are provided via HTTP and FTP.
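The taxon-based filtering described above can be sketched in a few lines of Python. This is an illustration only, not the GOC's filtering code: the dictionary holds three project and taxon pairs taken from Table 5, while the short project labels, the record layout, and the function name filter_by_authority are assumptions.

```python
# Three taxon -> authoritative project pairs taken from Table 5.
# The short project labels are assumptions for this sketch.
AUTHORITY = {
    "4932": "SGD",      # Saccharomyces cerevisiae
    "9606": "GOA",      # Homo sapiens
    "7227": "FlyBase",  # Drosophila melanogaster
}


def filter_by_authority(annotations, source, authority=AUTHORITY):
    """Split annotations from one file into (kept, bounced).

    An annotation is bounced when another project is the authority for its
    taxon. Taxa with no designated authority are kept from any source.
    """
    kept, bounced = [], []
    for ann in annotations:
        owner = authority.get(ann["taxon"])
        if owner is None or owner == source:
            kept.append(ann)
        else:
            bounced.append(ann)
    return kept, bounced


# Hypothetical contents of a multi-organism file submitted by GOA.
goa_file = [
    {"gene": "gene1", "taxon": "9606"},      # human: GOA is the authority
    {"gene": "gene2", "taxon": "4932"},      # yeast: SGD is the authority
    {"gene": "gene3", "taxon": "unlisted"},  # placeholder taxon with no authority
]
kept, bounced = filter_by_authority(goa_file, "GOA")
print([a["gene"] for a in kept])
print([a["gene"] for a in bounced])
```

Keeping the bounced records in a separate list, rather than discarding them silently, allows them to be reported back to the contributing project.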

There are three flavors of the GO Database: GO full, GO lite and GO TermDB. GO full includes IEA annotations and GO lite excludes them. The two also differ in how often they are built. GO full is recreated once a month and archived. GO lite is the backend database that drives the AmiGO interface; it is built three times a week and archived once a week on the FTP archive, so AmiGO is at most three days out of sync with the ontologies or gene associations. This is a substantial change made during 2005: before May 2005 the AmiGO back end was updated only once a month. The GO full and GO lite databases include the protein sequences, identified by the submitting MOD, that are associated with non-IEA annotations. A FASTA formatted file of the sequences is also provided with the MySQL data dumps. The TermDB flavor is created nightly and includes only those tables defined by the ontologies; that is, no annotations or sequences are included. The MySQL data dumps of the TermDB are useful for those wishing to maintain a database of the ontology content only.

AmiGO: AmiGO is the GOC's GO browser ([AmiGO]). It allows users to browse the ontologies in an intuitive manner and determine which gene products have been associated with any one term. This past year the interface was updated to enhance the querying and display of annotations. The development procedures have also been refined to allow effective prioritization of new features.

The AmiGO team, or working group, consists of annotators from SGD, dictyBase and GeneDB, an editor from the GO editorial office, and a developer. All AmiGO development is tracked using the AmiGO request tracker, including requests from both the GOC and the general public. The AmiGO working group regularly collects current items from the request tracker and from these formulates the next release plan, which is announced to the GOC using the group mailing lists. The plan may be revised based on the feedback following this notification. Once the plan is finalized, mockups and specifications are generated and distributed to the AmiGO team for input. Mockups and specifications are iterated upon until the group has reached consensus. In the next phase, the GOC iterates upon these mockups and specifications to reach mutual approval. Based on the approved specifications, the developer implements the new features for the release, and the AmiGO team jointly tests and debugs the software until it is ready for the GOC to test. Once accepted by the GOC, the fully tested new AmiGO is released to the general public. A typical release cycle is about 2 months, and each release tackles anywhere from 5 to 10 problems or features.

The GO repository and FTP: All information provided by the GOC is maintained within a CVS repository, with archival access to information available via FTP. CVS is a standard software environment used by collaborative software development projects and has served the GOC very well. The information is kept secure: consortium members use SSH (secure shell) software to log in to the CVS server at Stanford. Each consortium site has at least one person with an account on the CVS server. Updates to files occur from the remote site via a CVS client application, a standard tool included with Unix computers including Linux and Mac OS X. CVS handles the file creation and modification issues that arise in a distributed project. Explicit file versions are maintained on the server and an update to an older version is not allowed. A remote user is informed when they have an old version of a file and the CVS client assists them in updating or merging the versions. The CVS repository contains the ontologies, documentation, gene association files and mapping files, as well as meeting minutes and presentations. The file system that contains the web site is a “checked out” version of the CVS repository, also called a sandbox within the CVS environment. Every 30 minutes automatic processes update the web site from CVS. Thus any group anywhere in the world can update a web page by updating the CVS repository, and in 30 minutes or less that page will be live on the Internet. The FTP site contains more than the contents of the CVS repository: it also includes large archival files. As these files are archival, they will not change, and thus it is not necessary to maintain them within CVS. The FTP-specific files include the ontologies, the e-mail list discussion archives for all lists, and the MySQL data dump files that define the GO database. The Stanford group maintains the CVS and FTP servers.

For the convenience of users we archive a freeze of the ontologies on the first of every month. The GOC's FTP site archives these monthly releases from January 1, 2001. The GO Database archive ([GO.database-arch]) provides downloads of the GO Database (MySQL), as well as an archive of the monthly freezes (from Dec 1, 2002). Since March 2005 it has also provided a weekly copy of the "lite" version of the database (from which IEA annotations have been removed), schema documentation, and a Perl library of modules for parsing and navigating OBO files and annotation data. We will continue to provide these resources.

4 GO Education, Documentation, And Outreach.

The GOC presents talks, posters and tutorials at scientific meetings and is constantly striving to develop new educational and visualization tools to make the organization of the GO ontologies and services more accessible to users. A complete list of outreach efforts, including tutorials and workshops, is presented in Appendix 12. Recent innovations have included a self-guided tutorial prepared by SGD ([SGD.tutorial]), an AmiGO visualization module that allows the relationships among terms to be visualized on the fly as interactive diagrams, and comparative GO annotation graphs for mouse, rat and human annotation available at MGI ([MGI-GOGRAPH]) (see C.2).

The success of GO is demonstrated by the increasing numbers of research projects that cite it. A search for “Gene Ontology” in the abstracts at PubMed brings up hundreds of papers as shown in Figure 4. This is an underestimate because there are many other papers that only refer to the GO in the text, citations, or in table headings. Likewise, papers written by the GOC members are frequently cited in the literature, the most cited paper (The Gene Ontology Consortium 2000) being referenced 1291 times (data from Google Scholar, 16 Jan 2006).

Figure 4. Annual numbers of published papers that contain the key phrase “Gene Ontology” in their title or abstract (excluding papers published by members of the GO Consortium itself) from EntrezGene PubMed. This number does not reflect the many papers that use the GO for data analysis and representation but do not include ‘Gene Ontology’ in the abstract.

The GOC maintains two Internet domains for access to the GO ontologies, databases, software and community outreach. The main site is , and the number of links from external sites to this site is on a par with other major genomic sites such as UC Santa Cruz [UCSC], Protein Data Bank (PDB), and EBI [EBI]. Figure 5 shows growth in use of the site. Between May 1 2005 and Jan 8 2006 the GO sites served an average of 68,768 high quality hits per week (excluding all robots, hits to images, style sheet files, etc). During this period there was a weekly average of 18,462 visits, 9,000 unique hosts and 2,407 trails. A visit represents a set of hits within a short period of time from a single user. A trail is a unique list of URLs observed in at least one visit; in effect, trails are paths through the web pages provided. The top five GO terms that researchers search for are “transport” (72020), “ATP binding” (56862), “virion” (53622), “immune response” (47773), and “DNA binding” (46943). Two of these five are biological process terms, indicating the community’s interest in these areas.

[pic]
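The definitions of visits and trails used in these statistics can be made concrete with a short Python sketch. It is not the log analysis software that produced the figures above: the 30-minute gap that ends a visit, the use of the host as a stand-in for a single user, and the function name visits_and_trails are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumption: hits from one host separated by more than this gap start a new visit.
VISIT_GAP = timedelta(minutes=30)


def visits_and_trails(hits, gap=VISIT_GAP):
    """Group hits into visits and collect the distinct URL sequences (trails).

    hits: iterable of (host, timestamp, url) tuples.
    Returns (visits, trails): a list with one URL tuple per visit, and the
    set of distinct URL tuples seen in at least one visit.
    """
    visits = []
    open_visit = {}  # host -> (time of last hit, URL list of the current visit)
    for host, when, url in sorted(hits, key=lambda h: (h[0], h[1])):
        last = open_visit.get(host)
        if last is not None and when - last[0] <= gap:
            last[1].append(url)              # same visit continues
            open_visit[host] = (when, last[1])
        else:
            urls = [url]                     # a new visit starts
            visits.append(urls)
            open_visit[host] = (when, urls)
    visit_tuples = [tuple(v) for v in visits]
    return visit_tuples, set(visit_tuples)


t0 = datetime(2005, 5, 1, 9, 0)
hits = [
    ("host_a", t0, "/"),
    ("host_a", t0 + timedelta(minutes=5), "/docs"),
    ("host_b", t0, "/"),
    ("host_b", t0 + timedelta(minutes=5), "/docs"),
    ("host_a", t0 + timedelta(hours=2), "/"),
]
visits, trails = visits_and_trails(hits)
print(len(visits), "visits,", len(trails), "trails")
```

In this example two hosts follow the same path, so the three visits give rise to only two trails, which is why the weekly trail count is much smaller than the visit count.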

The Sequence Ontology website is hosted via [SO]. This is a web-based resource for managing open source projects. Through this platform SO provides the following tools: a file release system, a CVS repository, a mailing list for developers, a bug and feature tracking service, and a collection of project pages that house documentation and other information. The documentation includes a FAQ, poster and PowerPoint presentations, publications, a style guide, and the minutes of content meetings. We provide information such as mapping tables to other vocabularies, and maintain lists of groups annotating or using SO, software that incorporates SO, and information about using SO compliant annotation formats.

We have also seen growth in the use of the web-based SO resources. The homepage and SourceForge-based tools are served about 4000 times a month, and the SO project pages have peak usage of 180MB per month. There have been two releases of SOFA, in May 2004 and May 2005 (and one will follow in May 2006). Notice is given prior to the release of a new version of SOFA to prepare software developers for the changes. The download history of the file release system shows that SOFA has been downloaded 652 times, and that usage peaks after each yearly release. The mailing list has proved a very successful tool in the development of SO. There are over 80 subscribers who actively discuss the ontology. It is also open to non-subscribers who wish to post questions or comments. It reaches a wide variety of scientists, both those involved in software and those in the lab, from at least 8 countries, and from academia, hospitals, institutes and industry.

Research Plan.

The GOC faces a challenging task. Our vision is to ensure that all possible functional descriptions, of every protein and non-coding RNA spanning the spectrum of organisms, are accurate, detailed, and, most importantly, semantically compatible. Semantic compatibility is the bedrock for meaningful discussions, comparisons, and contrasts of the annotations. Our goal is to ensure the integrity of the ontologies and completeness of the annotations supported by the GOC community.

Ontological content: Biological data is now largely maintained on computer systems, and consequently, biology has become an information science. The implication is that the biological community must agree on standard, computationally tractable definitions to communicate their results with one another via these computer systems. Working closely with the biological community to create this standard semantic framework for describing gene products is our first aim.

Reference genomes: Insights into the roles of unknown gene products are largely based upon transference from well-characterized gene products, making it critical that such dependable well-characterized descriptions are available. Providing a core set of accurate, detailed annotations of the genomic sequence and the gene products for nine reference genomes is our second aim.

Annotation outreach: Currently, there are over 180 eukaryotic genomes in the sequencing and assembly pipeline (see Aim 3). For most of the 40 or so published eukaryote genome sequences the standard of their annotation is, with respect to the standards of S. cerevisiae, M. musculus, A. thaliana, and D. melanogaster, relatively poor. The extrapolated annotation forecast, for genomes in production, is more of the same: that is to say, unwieldy, difficult to compare, unreliable, and in some cases non-existent annotation. This has serious consequences, because it diminishes the utility of these sequences: for functional genomics, for improving the standard of annotation of the human genome (essential for its maximal exploitation) and for comparative genomics. Therefore, our third aim is to provide a basic, standardized methodology for functional annotation of emerging genomes.

Research community support: Our final aim is dissemination of this resource to the research and education community. We will work to educate the end users about how to use the GO to facilitate their research. We will also develop and implement ways to obtain feedback to ensure that GO is relevant and useful. The use of the ontologies and annotations to these ontologies is their enduring legacy.

1 Aim 1: We will maintain comprehensive, logically rigorous and biologically accurate ontologies.

Ontologies have thus far been used primarily for logically undemanding tasks such as search and retrieval. Increasingly, however, three key measures for evaluating an ontology need to be applied: Does it cover the domain sufficiently to allow the precise and exhaustive classification of all relevant entities and relations? Can it reliably be used for automatic reasoning? Is it made available with high-quality documentation and with software tools, which meet the needs of the users? The last point is particularly important—ontologies must be shared, and sharing demands high-quality documentation. In this section we present our plans for each of these three sub-topics: the expansion of the ontology into new biological areas to increase its descriptiveness; the continued logical development of the ontology; and the development of the software tools and documentation that are essential to accomplish these tasks.

1 Comprehensive biological domain coverage.

Ontological content cannot be developed without experienced, qualified biologists doing the real work: thus it is absolutely essential that we enlist biologists who are aware of the possibilities of high quality shared ontologies which can support genuine logical reasoning. Within GO, ontology content is curated and controlled by the EBI editorial office. However, content development is a collaborative effort, and the GO curators depend upon other biologists, both within and external to the GO consortium, to develop content. We use three methods to collaborate on content development:

• Request tracking, using the SourceForge system

• Special Interest Groups (SIGs)

• Content Meetings

Of these three, the content meetings are most critical for new domains and for comprehensively specifying existing domains. At these meetings, where domain experts are actively engaged in developing entire branches of the ontology, the more significant changes occur. As discussed in the progress report, there are four steps to the process.

• We initiate discussions with invited experts from relevant fields to assess the requirements and identify fundamental issues. We prepare material to summarize the most challenging questions to be resolved. These discussions are held via interest group e-mail lists, where the discussion is recorded and archived.

• We organize a face-to-face meeting between biological experts and members of the GOC. We will have at least two ontology development meetings each year. The agenda is established according to the key ontological issues established in the first step. The choices and alternative approaches are vigorously debated and ultimately resolved (or a date for a second meeting is set).

• The changes are implemented in the ontologies, using the Consortium's normal procedures for change. More minor questions are settled by using the interest group e-mail.

• The ontology is updated and released into production.

Quite often, as annotation begins, it becomes obvious that further changes to the ontologies are needed. This does not present a difficulty, since the GOC has a well-established mechanism for change. The entire process for developing ontology content is designed for agility and frequent reiteration (whilst ensuring an electronic audit trail is recorded, to help automate change management in annotations). It ensures that new terms are efficiently and reliably added to the GO, as appropriate to represent biological knowledge. It allows the GOC to rapidly respond to the needs of the biological curators.

This rapid response is important, because advances in biological knowledge, as deduced from new experimental results, drive ontology development. Aims 2 and 3 deal directly with annotation and therefore will be the primary drivers for new terms in the GO. The nine reference genomes will need new terms as they increase the precision of their annotation. The reference genome annotations are driven by current research results, largely collected from the literature. Review articles in particular are useful, and review authors are among the best candidates to enlist for ontology development efforts, as they have the necessary knowledge of the domain. Annotation projects for new organisms will also certainly require new terms. A wide variety of organisms are used in research because each offers unique leverage to explore certain aspects of life. The unique biology of individual organisms will broaden the biological content of the ontology, just as the reference genomes will deepen the biological content.

We are commonly asked whether, and how, we can enlist these experts, who are generally very busy individuals. Thus far this has not been an issue, for two reasons. First, many domain experts are actively managing data from high-throughput research projects; they are already highly motivated because they appreciate the challenge of, and the need for, organizing and classifying large data sets. In some cases there are legacy issues, but we resolve these by providing mappings between their existing terms and GO terms. The second factor is intellectual. Because developing the ontology itself requires original, careful thought, the work often results in papers describing the biology that underlies a branch of the ontology; this provides additional motivation for the experts as well as education for others.

2 Logical development of the GO.

A logically robust ontology requires unambiguous formal definitions of the relational expressions used, coupled with their consistent application throughout the ontology. These formal definitions should enable software to draw inferences from the ontology and from data annotated with its terms. If software tools and databases are to interoperate and provide consistent answers to biological questions posed by end-users, then it is essential for relations to be shared and precisely defined. For example: a biologist querying for gene products localized to the nucleus [GO:0005634] should receive amongst their query results gene products annotated to GO:0016592; Srb-mediator complex, as this cellular component is part_of the nucleus.
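The nucleus query above can be sketched in a few lines. This is a minimal illustration rather than the GO database implementation: the `EDGES` and `ANNOTATIONS` structures, the gene product names, and the `EX:` identifier are hypothetical; only the part_of link between GO:0016592 and GO:0005634 is taken from the example.

```python
# Child term -> list of (relation, parent term).
# The single edge below is the one described in the text.
EDGES = {
    "GO:0016592": [("part_of", "GO:0005634")],  # Srb-mediator complex part_of nucleus
}

# Hypothetical annotations: gene product -> term it is annotated to.
ANNOTATIONS = {
    "geneA": "GO:0016592",
    "geneB": "GO:0005634",
    "geneC": "EX:0000001",  # placeholder for an unrelated term
}


def ancestors(term, edges):
    """Return the term itself plus every term reachable via is_a or part_of."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def gene_products_for(query_term, annotations, edges):
    """Gene products annotated to the query term or to any of its descendants."""
    return sorted(
        gene for gene, term in annotations.items()
        if query_term in ancestors(term, edges)
    )


print(gene_products_for("GO:0005634", ANNOTATIONS, EDGES))
```

A query for the nucleus returns both the gene product annotated directly to it and the one annotated to the Srb-mediator complex; a tool that ignored the part_of edge would return only the former.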

1 Improving Biological Relations.

Biological ontologies are concerned with the kinds of biological entities that exist, and the relationships between these entities. Currently the GO only admits two relations: is_a and part_of. SO extends this core set with other relations such as homologous_to. Historically, other OBO ontologies have also taken these relations and extended them, often in an ad-hoc manner.

This was the primary motivation for the creation of the OBO Relations Ontology (Smith et al., 2005; [RO]), a collection of definitions for relations to be used by GO and other biological ontologies. The development of this ontology was a collaboration between members of the GO consortium and formal ontologists, including members of the National Center for Biomedical Ontologies (NCBO). This ontology also includes additional relations not yet used by GO, but anticipated to be useful for creating relationships between GO and other OBO ontologies, as well as within GO and SO.

We will use these relations, and work with the OBO Relations Ontology content developers to extend and clarify these relations to enhance GO and SO. We will make five types of modifications to the GO term relationships: completion, replacement, additions within a single GO ontology, additions across current GO categories, and additions between GO ontologies and other external OBO ontologies[3]. The following set of modifications to the relationships in the GO is approximately ordered by implementation priority.

2 Providing comprehensive is_a relationships for all GO terms.

We will include appropriate is_a relationships for every GO term. The GO is currently incomplete in that not all terms have an is_a parent (terms without an is_a parent will always have a part_of parent, except for the 3 root terms). This has negative consequences at both an ontological level, and at a practical software level. Therefore filling in the missing is_a relationships is a critical goal. This task is currently underway.

On the ontological level, the missing is_a relationships create holes in the ontology, making inference unreliable. The formal OBO definition of is_a takes both time and contextual contingencies into account. The definition of is_a states: If, at all possible time points, all instances of Class_A are also instances of Class_B, then Class_A is_a Class_B. For example, when the GO affirms that glucose metabolism is_a carbohydrate metabolism, we are stating that all instances of the former are ipso facto also instances of the latter. In common language, every class in GO must be a subtype of one of the three upper level classes. If a class is not a subtype of one of these three, and yet still falls within the province of the GO ontologies, then there is no way to determine what class of entity it actually is. Moreover, there is no way to carry out inferences on the basis of data annotated to that term.

On a practical level, the missing is_a relationships can lead to non-intuitive displays, and hinder interoperability. The current visualization paradigm for the GO is to show all possible lineages to the root term, combining a mixture of is_a and part_of relationships. For example, the term AP-1 adaptor complex has 164 distinct lineages to the root traversing both is_a and part_of relationships. This leads to ‘tangled’ DAG displays that are confusing to users. It would be better to show distinct subsumption hierarchies and partonomies (is_a and part_of DAGs), but, because of missing is_a relationships, the filtered display would contain hundreds of orphan terms. An additional reason is the difficulty these is_a gaps create when importing GO into alternative ontology tools and software such as Protégé. These tools expect all terms (bar the root term) to have an is_a parent; unexpected things happen when this is not the case. We could circumvent this problem during import by creating default is_a parents to root terms, but this strategy produces errors in the ontology, as the true is_a parent may be a more intermediate term.
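To illustrate how quickly lineages multiply in a DAG, the following sketch counts the distinct paths from a term to the root. The graph is hypothetical and much smaller than the GO; the function simply sums, for each term, the number of paths through each of its parents, whatever the relationship type.

```python
from functools import lru_cache

# Hypothetical DAG: term -> tuple of parents (is_a and part_of mixed together).
PARENTS = {
    "root": (),
    "a": ("root",),
    "b": ("root",),
    "c": ("a", "b"),
    "d": ("c", "a"),
}


def count_lineages(term, parents):
    """Count the distinct paths from `term` up to a term with no parents."""

    @lru_cache(maxsize=None)
    def paths(current):
        ups = parents.get(current, ())
        if not ups:
            return 1  # reached a root: exactly one (empty) path remains
        return sum(paths(parent) for parent in ups)

    return paths(term)


print(count_lineages("d", PARENTS))
```

Each extra parent multiplies the paths of everything beneath it, which is why a single complex can accumulate well over a hundred lineages in the combined display.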

As of early November 2005 the number of terms without is_a relationships was: 287 cellular component terms, 658 biological process terms, and 2 molecular function terms. In January 2006 we carried out a pilot experiment on a non-public version of the cellular component ontology, making it is_a complete. This was a non-trivial exercise but it allows us to show is_a lineages distinctly from part_of lineages. Also note that this ontology can be imported into non-consortium tools such as Protégé (via conversion to OWL([OWL])) without users experiencing unnatural displays and unusual behavior. We have still to perform full QC and error checking on the new is_a complete cellular component ontology, and, reciprocally, many complexes have yet to be provided with a part_of parent, but we expect to merge it into the public GO in the first quarter of 2006. We will then embark on the more difficult task of making the biological process ontology is_a complete.
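Counts of this kind can be obtained with a simple check over the term graph. The sketch below is illustrative only: the term names and the `TERMS` structure are hypothetical, and it reports every non-root term that has no is_a parent.

```python
# Hypothetical fragment: term -> list of (relation, parent).
TERMS = {
    "cellular_component": [],
    "cell": [("is_a", "cellular_component")],
    "organelle": [("is_a", "cellular_component")],
    "nucleus": [("part_of", "cell")],  # only a part_of parent: not is_a complete
}

ROOTS = {"cellular_component"}


def missing_is_a(terms, roots):
    """Return non-root terms that have no is_a parent."""
    return sorted(
        term for term, parents in terms.items()
        if term not in roots
        and not any(relation == "is_a" for relation, _parent in parents)
    )


print(missing_is_a(TERMS, ROOTS))
```

Running such a check on each release would give curators a work list and a measure of progress towards is_a completeness.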

In addition to is_a completeness, we must ensure there are no errors of omission in the is_a hierarchy. Some element of human error is inevitable in the construction of large ontologies; fortunately it is possible to use automated means to help detect errors of omission and report these errors to curators. Our OBOL software (Mungall 2005 [OBOL]) has been used to successfully detect hundreds of these omissions in GO. We will continue to develop this software and better integrate this and other ontology analysis tools into OBO Edit and the curation workflow.

3 Improving ‘regulates’ relationship in the GO.

We will replace some current part_of relationships with “regulates” relationships. One of the factors contributing to the complexity of the GO is the way it currently handles the concept of regulation. Each biological process generally possesses a requisite sub-process entitled regulation of X. These regulatory terms are presently represented in GO by part_of relationships to the process being regulated. In most cases, these terms also possess two direct is_a children: negative regulation of X and positive regulation of X. In addition, regulation is itself a bona fide process, and therefore the term regulation of X also has an is_a relationship to the parent term regulation (see simple example in Figure 6).

Figure 6. An illustration of the replacement of a part_of relationship by a regulates relationship.

Ontologically and biologically this is inaccurate because a regulatory process that controls another process P need not itself necessarily be a part_of P. For example, it may not be appropriate to classify the "regulation of sleep" as a part_of “sleep” itself, as it is now classified. The complexities here are further compounded when we recognize that any particular process is itself often composed of several sub-processes (e.g., mitosis, immune cell differentiation, et cetera).

Replacing part_of with regulates will clearly not reduce the number of relationships—the total will be the same. However, it does introduce an important distinction that will more accurately reflect the biology, and in this sense it will also simplify the GO and make it easier to curate and use in the future. Making the distinction between part_of and regulates will also enable us to exclude regulatory processes from slimmed-down versions of the GO produced for specific application purposes. It will also enable more sophisticated distinctions between biological processes, clearly distinguishing any process X from the biological process regulation of X. Replacing part_of by regulates in appropriate cases is a reasonably straightforward task to implement, but will require a number of downstream changes to displays, querying, and inferencing tools to take advantage of this change. We will provide sufficient notice to other tool developers to allow them to also take advantage of this change.
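The effect on slimmed-down views can be sketched as follows. This is a simplified illustration using the sleep example; the edge lists are hypothetical reductions of the ontology, and `descendants` is an assumed helper, not part of any GO tool.

```python
# Edges are (child, relation, parent). The shared edges hold before and after the change.
SHARED = [
    ("regulation of sleep", "is_a", "regulation"),
    ("negative regulation of sleep", "is_a", "regulation of sleep"),
    ("positive regulation of sleep", "is_a", "regulation of sleep"),
]
OLD = SHARED + [("regulation of sleep", "part_of", "sleep")]    # current representation
NEW = SHARED + [("regulation of sleep", "regulates", "sleep")]  # proposed representation


def descendants(term, edges, follow):
    """Terms that reach `term` through edges whose relation is in `follow`."""
    found = set()
    frontier = [term]
    while frontier:
        target = frontier.pop()
        for child, relation, parent in edges:
            if parent == target and relation in follow and child not in found:
                found.add(child)
                frontier.append(child)
    return found


structural = {"is_a", "part_of"}
print(sorted(descendants("sleep", OLD, structural)))  # regulation terms are pulled in
print(sorted(descendants("sleep", NEW, structural)))  # regulation terms are excluded
```

With the distinct relation, a tool can choose whether to follow regulates edges, so the same ontology serves both the slim view and the full view.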

4 Generating relationships between current GO ontologies and other ontologies.

We will recast the implicit ontologies contained in the GO as explicit links, in order to simplify maintenance and assist annotators. The GO is complex and continues to expand, and adding new terms and relationships will always be an ongoing exercise. This complexity is inherent to the GO because biology itself is complex, and the GO must necessarily reflect the current state of biological knowledge. The GO must be sufficient for its purpose: accurately classifying and connecting the biological entities on which researchers are collecting data. Furthermore, the reason to organize the data carefully is that it propels science forward. Well-structured data gives researchers the power of ratiocination: the ability to draw reasonable inferences from these data, form new hypotheses, and design experiments to test these hypotheses. The terms in the ontology must be familiar to biologists, and no relationships should be overlooked, since in either case this would lead to erroneous or partial query results. The strength of GO lies in the fact that its domain is precise. There are, however, many aspects of biology that are not covered by the GO, yet are needed for precise queries. These include protein structures, chemicals, anatomy, and cell types, to name a few. To satisfy this need, many GO terms implicitly refer to other domains of biology (for example, cysteine biosynthesis, or myoblast fusion). These are compound terms, and their necessary presence in the GO means that the GO implicitly includes ontologies, for example of chemicals, anatomy and cell types. Most, if not all, of these implicit terms can be found in other OBO ontologies (e.g. cysteine is in ChEBI, and myoblast is in the Cell Type ontology).

Figure 7. This figure illustrates the intersection of ChEBI and GO in construction of cross-product terms.

The fact that the GO includes these implicit ontologies of other domains was historically inevitable, but is not desirable. Maintaining a chemical or cell type ontology is not within the domain of the GOC. Fortunately, there are now good OBO ontologies maintained for all of these domains. We will now decompose compound terms within the GO and then recast these as explicit cross-products between the GO and an external OBO ontology. This has enormous benefits to the GO, both for its maintenance and utility.

We have begun this project using a relatively simple case, the Cell Type Ontology [CELL]. CELL has 710 cell type terms and 2 relation types, is_a and derives_from. The GO has around 756 terms that have a cell-type axis of classification, e.g. "B cell differentiation". Discovery of the implicit CELL terms in the GO was accomplished using OBOL (Mungall, 2005). This highlighted two classes of difference between GO and CELL. The first was syntactic, e.g., "B cell" vs. "B-cell". All of these syntactic differences have now been reconciled. The second was structural. For example, in the GO "plasmatocyte differentiation" and "lamellocyte differentiation" are siblings, whilst in CELL "lamellocyte" is a child of "plasmatocyte". Because of this structural difference, queries requesting all gene products annotated to "plasmatocyte differentiation" will omit any gene products annotated to "lamellocyte differentiation". We have begun resolving these differences between CELL and GO. These new relationships will be represented in the GO OBO format file as illustrated in the following example, showing the explicit link between the GO term "cell differentiation" and the CELL term "B cell" for the GO term "B cell differentiation":

[Term]
id: GO:0030183
name: B cell differentiation
is_a: GO:0042113 ! B cell activation
is_a: GO:0030098 ! lymphocyte differentiation
intersection_of: is_a GO:0030154 ! cell differentiation
intersection_of: has_participant CL:0000236 ! B cell

We have already made significant progress towards implementing these changes, integrating the explicit creation of cross-product terms into GO content development. In addition to OBOL, which parses implicit relationships from ontologies, the most recent version of OBO-Edit fully supports creation of explicit cross-product terms. The OBO flat-file syntax can now represent relationships between different ontologies, explicitly stating, for example, that the GO term "B cell differentiation" is a cross-product between the GO term "cell differentiation" and the CELL term "B cell" (see the example above, and [OBO.flat] for the specification). The primary impact of cross-product terms on end-users will be ontologies (GO and other OBO ontologies) that are of higher quality and robustness. We must point out that there will be minimal impact on the lexical strings of GO terms: "cysteine biosynthesis" will remain "cysteine biosynthesis", and its unique identifier within the GO will not change. The newly structured GO will remain backward compatible with previous versions.
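As an illustration of how software might read such a cross-product definition, the sketch below extracts the intersection_of lines from the B cell differentiation stanza. It handles only the tag forms that appear in that stanza and is not a general OBO parser; the function name `parse_cross_product` is our own.

```python
STANZA = """\
[Term]
id: GO:0030183
name: B cell differentiation
is_a: GO:0042113 ! B cell activation
is_a: GO:0030098 ! lymphocyte differentiation
intersection_of: is_a GO:0030154 ! cell differentiation
intersection_of: has_participant CL:0000236 ! B cell
"""


def parse_cross_product(stanza):
    """Collect the id, name, and (relation, target) pairs of one term stanza."""
    term = {"id": None, "name": None, "intersection_of": []}
    for line in stanza.splitlines():
        line = line.split("!", 1)[0].strip()  # drop the trailing "! comment"
        if ":" not in line:
            continue  # skips the "[Term]" header and blank lines
        tag, value = line.split(":", 1)
        value = value.strip()
        if tag in ("id", "name"):
            term[tag] = value
        elif tag == "intersection_of":
            relation, target = value.split()
            term["intersection_of"].append((relation, target))
    return term


print(parse_cross_product(STANZA))
```

The two pairs recovered here, one pointing into the GO and one into CELL, are all a reasoner needs to treat the term as the intersection of the two classes.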

A consequence of this change in the way the GO is built is that the GO editors will no longer need to concern themselves with the relationships between terms from domains represented by these external ontologies; they will simply inherit these relationships from those ontologies. This will significantly reduce the amount of time needed to maintain the integrity of GO and will facilitate the scaling of GO to capture biology more comprehensively and accurately. In addition, this will greatly simplify the writing of definitions of terms within the GO.
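The inheritance of relationships can be sketched with the plasmatocyte and lamellocyte example. This is a simplified illustration, assuming each cell type has a single is_a parent and that both GO terms are defined as cross-products of "cell differentiation" with a CELL term; the data structures and function names are hypothetical.

```python
# From the text: in CELL, "lamellocyte" is a child of "plasmatocyte".
CELL_IS_A = {"lamellocyte": "plasmatocyte"}

# Assumed cross-product definitions: GO term -> (GO genus term, CELL term).
CROSS_PRODUCTS = {
    "plasmatocyte differentiation": ("cell differentiation", "plasmatocyte"),
    "lamellocyte differentiation": ("cell differentiation", "lamellocyte"),
}


def cell_ancestors(cell, is_a):
    """All is_a ancestors of a cell type (single-parent simplification)."""
    result = set()
    while cell in is_a:
        cell = is_a[cell]
        result.add(cell)
    return result


def inferred_is_a(cross_products, cell_is_a):
    """GO is_a links implied by the CELL hierarchy for terms sharing a genus."""
    links = []
    for child, (child_genus, child_cell) in cross_products.items():
        for parent, (parent_genus, parent_cell) in cross_products.items():
            if (child != parent and child_genus == parent_genus
                    and parent_cell in cell_ancestors(child_cell, cell_is_a)):
                links.append((child, parent))
    return sorted(links)


print(inferred_is_a(CROSS_PRODUCTS, CELL_IS_A))
```

The inferred link places "lamellocyte differentiation" beneath "plasmatocyte differentiation", so a query on the latter no longer omits annotations to the former, and no GO editor has to maintain that link by hand.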

There are many other benefits. We will be able to offer queries across ontologies, automate inconsistency detection, and create dynamic filters to provide domain specific views of the ontology (which will make an annotator's job easier). The major disadvantage of introducing cross-product relationships between ontologies is that it becomes more difficult to visualize the structure of the combined ontology. We will counteract this in AmiGO, by restricting its hierarchical view to a single, particular is_a or part_of tree (users will be able to switch between them, or open multiple trees). The other relations will be shown as attributes that link into the other ontology.

There are three major "external" classes of ontology now implicit in the GO: cell types, chemicals and anatomies. Work to combine the GO and CELL ontologies has made good progress and should be finished within a few months. Our next priority will be chemicals. For this we will use the EBI's ChEBI ([CHEBI]) resource, to which Ashburner is an advisor. The reason for using ChEBI, and not the NCBI's PubChem database ([PUBCHEM]) is that the former has a rigorous ontological basis. The integration of the MOD anatomical ontologies will only be done after that for cells and chemicals has been completed. Most of the MODs (FlyBase, WormBase, dictyBase, TAIR, MGI, ZFIN; see, e.g., Grumbling et al. 2006) now have reasonably mature anatomical ontologies, and these are available on the OBO site.

5 Improving relationships within single GO ontologies.

We will refine and extend the use of ‘part_of’ relationships in the GO. Presently in the GO, ‘part_of’ is applied with two different meanings, one in Biological Process the other in Cellular Component. These correspond to the part_of roles, feature part_of activity, and object part_of composite whole. Like is_a, both part_of relationships are transitive, so that, for example, if ‘GO process A part_of GO process B’, and ‘GO process B part_of GO process C’, then it can be inferred that ‘GO process A part_of GO process C’. These part_of relationships may thus be utilized in reasoning in much the same way that the is_a relationship is used.

Taking a step backward, the definition of the part_of relationship does not specifically indicate any restrictions on the necessity for inclusions. That is, part_of does not stipulate whether any particular instance of a parent includes a given child as a part. Reciprocally, any particular instance of the child may or may not be found as a part of that parent. Hence, the part_of relationship itself has three children that make further distinctions to restrict its meaning:

1. All instances of ‘GO process A’ have an instance of ‘GO process B’ as a parent (RO: located_in)

2. All instances of ‘GO process B’ have an instance of ‘GO process A’ as a child (RO: has_part)

3. All instances of ‘GO process A’ have an instance of ‘GO process B’ as a parent AND all instances of ‘GO process B’ have an instance of ‘GO process A’ as a child (both relationships are used in conjunction)

The three child relationships are specializations, which the ontology editors can use to describe more accurately the relationship between a parent and child (e.g. process and sub-process); for example, to say whether a process always includes a given sub-process, and conversely, whether a sub-process only occurs within a given process. Currently, the particular usage in Biological Process corresponds to the RO's located_in relationship (although colloquially it might be better expressed as occurs_in). We will, over the course of this grant period, employ the other part_of relationships in the Biological Process ontology when appropriate, to enable those using the GO to distinguish between these cases.
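The distinction between the three cases can be made concrete by checking them against a set of observed instances. The sketch below is illustrative only: the instance names are hypothetical, and the returned numbers refer to the three numbered cases above (0 means that only the unrestricted part_of holds).

```python
def classify_part_of(child_instances, parent_instances, part_links):
    """Decide which numbered case holds for a child class A and parent class B.

    part_links is a set of (child_instance, parent_instance) pairs.
    """
    children_in_some_parent = {child for child, _parent in part_links}
    parents_with_some_child = {parent for _child, parent in part_links}
    all_children = set(child_instances) <= children_in_some_parent
    all_parents = set(parent_instances) <= parents_with_some_child
    if all_children and all_parents:
        return 3  # both directions hold
    if all_children:
        return 1  # every instance of A is part of some instance of B
    if all_parents:
        return 2  # every instance of B has some instance of A as a part
    return 0      # neither restriction holds


# Every A occurs within a B, but instance b2 of B has no A as a part.
print(classify_part_of({"a1", "a2"}, {"b1", "b2"}, {("a1", "b1"), ("a2", "b1")}))
```

In practice the editors assert these relationships from biological knowledge rather than deriving them from instance data; the check simply shows what each assertion commits the ontology to.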

In Cellular Component the meaning of part_of is determined by spatial relationships. Spatial arrangements, of biological entities that overlap or in which one entity contains another entity, are designated by relationships between the corresponding terms (e.g. nucleus contained_in cell). RO defines the contained_in relationship as a child component completely enclosed within a parent component. The notion of containment, enclosed regions with boundaries separating the region from the outside and not extending beyond the enclosing region, is critical for describing where reactions and transport between regions occur. The contained_in relationship will be incorporated into the cellular component ontology to capture such phenomena.

6 Building relationships between current GO ontologies.

There is an increasing community demand for pathway data, and most of the model organism databases (MODs) are either actively curating metabolic pathway data (TAIR, SGD, dictyBase), some using the Pathways Tools system ([PATHTOOL]), or are in various stages of obtaining funding and other resources to do so in the near future (FlyBase, MGI, RGD, WormBase). Other types of pathway data, such as signal transduction pathways, are not yet captured by any MOD to our knowledge, largely due to the lack of infrastructure for representing such pathway information robustly. A pathway representation of biological processes (such as KEGG (Kanehisa et al. 2006), Reactome, MetaCyc, or Panther ([PANTHER])) and the GO representation are different. There are, nevertheless, important commonalities in these views of the world. Here we propose a number of ways to bring a variety of benefits to those who capture pathway data. Specifically, we propose to improve the efficiency of pathway curation efforts; provide an organismal context for molecular pathways; increase the precision of the GO gene product associations; and provide independent indicators of orthology.

Efficiency: The functional annotation provided by the GOC offers a natural starting point for pathway curation. When initiating work on a new pathway, for example, pathway curators will be able to query the GO database and download all proteins that are associated with that biological process (a pathway is a class of biological processes) by querying for the gene products associated with the corresponding biological process term and its children. The gene annotation database provided by the GOC currently contains curated functional annotations of over 180,000 gene products. This number of annotations to the GO will continue to grow. While a pathway curator’s finely detailed description of pathways with respect to their components is one method of ascribing function, it is labor-intensive and can be made far more efficient. Pathway groups have requested a reliable mechanism to access GO functional descriptions to assist in their description of pathways. From the perspective of the GO, a pathway is a set (temporal ordering is not captured) of sub-processes (its parts), and, at the deepest granularity, a set of particular molecular functions. We will provide a web service to make these data accessible programmatically. Software will query using a GO process ID and will receive the genes associated with that process (and its children) in return.

Context: The GO includes higher order biological processes. A pathway may be associated with these either explicitly, because the pathway is directly related to that higher order process as a descendant, or implicitly, because of gene products that are annotated to both the pathway and the organism level process. In either case, users will be able to see where, at the level of the organism, a pathway is critical (e.g. the role of the Notch signaling pathway in heart, or mammary gland development). Pathway groups that incorporate mappings into their curation effort, between GO terms and molecular events in the pathway, will be able to provide navigation between a GO site (via its web site or any of the many affiliated sites that offer GO term searches) and the pathway group’s web site.

Precision: The incorporation of pathway data will increase the specificity of GO functional annotations, because pathway curators work at the molecular level. A pathway curator simultaneously associates a particular gene product with a specific molecular function, cellular component, and biological process, and hence can concomitantly drive increasing granularity in the GO as they work.

Comparative analysis: Evolutionary analyses are greatly enhanced by the availability of pathway information from multiple species. In order to detect important biological differences it is crucial that identical frameworks are used, so that differences will become apparent. The pathways should all be equally well curated, be expressed in identical syntax, and share semantic definitions. We will work with all interested pathway groups to ensure that the GO is structurally compatible with their biological pathway representations.

An evaluation study: To understand the challenges we would face in integrating pathway descriptions and the GO, we carried out an initial evaluation comparing the Reactome (Release 14) description of the Notch signaling pathway to that in the GO. Higher-order pathways, such as “Notch signaling,” correspond to terms in the GO biological process ontology; molecular events correspond to terms in the GO molecular function ontology; and compartment terms correspond directly to the GO sub-cellular component ontology. In principle, creating a mapping is simply a matter of connecting pathway terms to GO biological process terms, connecting event terms to GO molecular function terms, and connecting the respective sub-cellular component terms. What makes this process challenging are the differences in the details of the representation between Reactome and GO. In most cases an exact match is not found, and we must deal with synonymous terms, missing child terms, and missing parent terms. The work was performed by Jane Lomax of the GO, and is summarized in Table 6.

In summary, based on these preliminary results, the GOC has concluded that mapping of ontologies to different pathway representations is both feasible and highly desirable. The representations will converge, rather than conflict, and both the GOC and pathway groups will benefit from the cross checking. Achieving this will require a careful combination of computation and manual curation. We will proceed as follows: 1) carry out mappings, as above, of all existing Reactome pathways, and subsequently of all other pathways available in a standard syntax such as BioPAX [BioPax] or SBML ([SBML]; Hucka et al. 2006); 2) instantiate, within the GO, new relationships connecting molecular functions to the lowest level sub-processes; and 3) develop some of the tools and protocols to maintain this congruency between pathways and the GO (this will be in conjunction with the pathway groups).

Term Mapping: We will first computationally collect the GO terms provided by pathway projects that have already been attached to events and catalysts (currently several thousand associations). A GO editor, working in collaboration with a pathway curator (e.g. from Reactome), will use this list to identify those cases where synonyms or refinements to the ontologies are needed, just as described above. A team of one GO editor and one pathway editor will manually step through all molecular events in the pathway descriptions (currently the collection of curated pathways is small enough that this is possible) and, based on their joint evaluation, take the appropriate action, including adjustments to the part_of structure in both the GO and the pathway database. There are great advantages to sharing identical process hierarchies. The two projects will then be able to directly share data, and users will be able to move seamlessly from one to the other. The portions of the process ontology in GO that describe the “parts” of a process will correspond to their counterparts in the pathway descriptions. We will use the BioPAX descriptions as the intermediary form to move pathway data into the OBO format used by GO curation tools. The GO editor will compare the pathway graph to the GO graph, identify places that need modifications in order to be concordant, and consult with their counterpart in the pathway group to bring the two descriptions into complete agreement.

Function to Process relationships: We will instantiate relationships between processes, functions, and cellular locations by incorporating curated pathway data. For this aspect of the effort we will exploit the pathway graph and the term mappings from above. Terms will have been mapped from the pathway terms to both biological process and molecular function terms. From these we will infer what relationships need to be added to the GO between function and process. Instantiating direct links between the three axes of the GO will improve the quality of existing genome annotations by enabling annotators to detect errors of omission (e.g. a gene product is annotated to a function, but a process for that gene product is missing), as well as of commission (e.g. a gene product is annotated to a function and to a process, but that function is not an event in that process). Instantiating these relationships between the ontologies is a major goal in the development of the GO.
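The two kinds of check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the link from Notch binding (GO:0005112) to the Notch signaling pathway (GO:0007219) is assumed from the mapping in Table 6, the gene names and annotation sets are hypothetical, and flagged items are candidates for curator review rather than confirmed errors.

```python
# Assumed function -> process links derived from curated pathway data.
FUNCTION_TO_PROCESSES = {
    "GO:0005112": {"GO:0007219"},  # Notch binding is an event in Notch signaling
}

# Hypothetical gene product annotations.
GENE_ANNOTATIONS = {
    "geneX": {"functions": {"GO:0005112"}, "processes": set()},
    "geneY": {"functions": {"GO:0005112"}, "processes": {"GO:0006606"}},
    "geneZ": {"functions": {"GO:0005112"}, "processes": {"GO:0007219"}},
}


def check_annotations(gene_annotations, function_to_processes):
    """Return (omissions, commissions) suggested by function-to-process links."""
    omissions, commissions = [], []
    for gene, ann in sorted(gene_annotations.items()):
        for function in sorted(ann["functions"]):
            expected = function_to_processes.get(function)
            if expected is None:
                continue  # no curated link for this function, nothing to check
            if not ann["processes"]:
                omissions.append((gene, function))  # function but no process
            for process in sorted(ann["processes"] - expected):
                commissions.append((gene, function, process))  # unlinked process
    return omissions, commissions


print(check_annotations(GENE_ANNOTATIONS, FUNCTION_TO_PROCESSES))
```

A gene product may legitimately take part in a process through a different function, so the second list is best treated as a prompt for review; the value of the links is that they make such cases visible at all.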

|Reactome |GO |Mapping |
|Notch signaling pathway |Notch signaling pathway GO:0007219 |Exact match |
|1. Transport of Notch precursor to Golgi |protein-Golgi targeting GO:0000042 |Use synonym & GO and Reactome need future refinement |
|2. Maturation of Notch via proteolytic cleavage |proteolysis during protein maturation GO:0051605 |Use synonym |
|3. Notch traffics to plasma membrane |protein-membrane targeting GO:0006612 |GO and Reactome need further refinement |
|4.1. Notch binds with ligand in extracellular space (Notch receptor) |Transmembrane receptor activity GO:0004888 |GO needs refinement |
|4.2. Notch binds with ligand in extracellular space (ligand) |Notch binding GO:0005112 |Use synonym |
|5. |Notch receptor processing GO:0007220 |Missing in Reactome |
|5.a. proteolytic cleavage of Notch, when bound to ligand, producing NEXT in plasma membrane |endopeptidase activity GO:0004175 |GO needs refinement |
|5.b. proteolytic cleavage of NEXT releasing NICD into cytoplasm |endopeptidase activity GO:0004175 |GO needs refinement |
|6. NICD traffics to nucleus |protein-nucleus import GO:0006606 |GO and Reactome need further refinement |

Table 6. Comparison of Reactome events to GO terms. One event (5) is missing in Reactome; it is needed to group the proteolytic cleavage events that take place after the Notch receptor binds a ligand and that lead to the active NICD peptide that regulates transcription. In the three cases (1, 3, and 6) where the biology is not yet fully specified, both the GO and Reactome will need refinement.

Develop tools for maintaining GO to pathway correspondence: In the longer term we need a more direct way to accomplish mapping between the GO and pathways that is simpler and less error-prone than manual curation. We want to use manual studies as a basis for developing automatic procedures for finding and resolving discrepancies between pathway descriptions. We propose to work with all interested pathway groups (see Stein and Karp Letters of Support) to embed GO editing capabilities in their own curation tools. Curators can then directly extend the GO while they are describing the pathway, generating transient identifiers for the terms they create during curation (these transient IDs will be replaced by the GO IDs when these are available). The pathway groups will regularly submit these new and modified terms to the GOC in the OBO flat file format (which allows inclusion of transient identifiers). The GOC, on receiving the pathway curator’s electronic request, will follow the standard protocol of notifying any affected contributor, considering the term in the overall context of the GO, and contacting the pathway curator with any concerns or requests for clarification. Once agreement is reached, the GOC will add the suggested term to the GO and provide the pathway group with a permanent term ID that they will use to automatically replace the initial transient term ID.
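
The final replacement step could look roughly like the following Python sketch. The "TEMP:" prefix for transient identifiers, the sample stanza, and the permanent ID "GO:9999999" are all placeholders invented for this illustration; the actual form of transient identifiers would be agreed with the pathway groups.

```python
import re

# Assumed (hypothetical) form of a transient identifier: "TEMP:" plus digits.
TRANSIENT_ID = re.compile(r"\bTEMP:\d+\b")


def replace_transient_ids(obo_text, id_map):
    """Swap each transient ID for its permanent ID where one has been issued.

    Transient IDs that are not yet in id_map are left unchanged, so a file
    can be processed repeatedly as permanent IDs become available.
    """
    return TRANSIENT_ID.sub(
        lambda match: id_map.get(match.group(0), match.group(0)), obo_text
    )


# Example: one term has been accepted, its parent is still under review.
stanza = "[Term]\nid: TEMP:0000001\nname: example step\nis_a: TEMP:0000002\n"
updated = replace_transient_ids(stanza, {"TEMP:0000001": "GO:9999999"})
```

Leaving unresolved identifiers untouched keeps the pathway database usable while a term request is still being discussed.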

The tools and methods developed from this project will be tested with another pathway database, MetaCyc [MetaCyc]. Currently MetaCyc2GO mapping is carried out manually by a GO editor and MetaCyc curators and is a time-consuming effort requiring extensive manual curation. One of the PIs of this proposal (Sue Rhee) is a PI of the MetaCyc project and will coordinate the testing of the application of mapping tools to MetaCyc.

Incorporate gene product associations from pathway curators: The GOC will work with pathway groups to include their annotations in the GO database. This work is discussed in detail in Aim 2.

7 Sequence Ontology.

Maintain the ontology. We will continue to maintain and develop the Sequence Ontology, providing a unified vocabulary, with synonyms, definitions and the structure relating these terms, to annotate the features located on biological sequence and the attributes of these features. We will also provide the terms to describe chromosomal variation and the consequences of mutation. We will respond to the needs of the community by extending the ontology as required.

Add new relationships between terms. We will add new relationships between the terms to better describe the biological reality regarding sequence that the ontology is capturing. This may require extending the set of core relations in the Relations Ontology ([RO]), including spatial relationships that apply to 1-dimensional biological sequence space.

In SO, the part_of relationship is used to describe that a feature is spatially a part of a whole or is a member of a collection. For example, a primary_transcript has different portions along its length where introns and exons occur. This means that we would expect the sequence coordinates of these parts to be contained within the sequence coordinates of the whole. There are also situations where collections rather than portions best describe the biology. A good example of this is the relationship between a gene and its regulatory regions. Genes have many types of regulatory regions that may not necessarily be located on the same strand of sequence where transcription takes place. For example, regulatory elements such as enhancers may even act in trans (e.g., Hendrickson and Sakonju 1995). The gene therefore is not a single physical entity but is a collection of entities, and the regulatory region is one of the members. This distinction between the different types of part_of is vital for displaying features in genome browsers and internal consistency checking of annotation validation software. We will continue to clarify the part_of relationships in the ontology, using its subtypes as needed.

The relationships between features of biological sequence can also be described in terms of topology, as defined by Egenhofer (1989) (Figure 8). Features, which are one-dimensional ranges on a single physical sequence, can be disjoint, meet, or overlap. These relationships are defined in terms of boundaries and interiors as depicted in Figure 10. We currently use the adjacent_to (meets) relationship in the ontology to specify when features share a boundary. For example, mRNA and polyA_sequence are asserted to be adjacent to each other in an annotation of a cDNA. These topological relations also provide the basis for validation of the annotations, based on the principles of spatial integrity. For example, exons within the same transcript must be disjoint, an intron must meet two exons, and 5'UTR and 3'UTR must not be adjacent to each other. We will continue to add topological relationships to the ontology to capture such spatial relations between features.

[pic]
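
To make the validation idea concrete, the following Python sketch classifies two features as disjoint, meeting, or overlapping and applies two of the spatial integrity rules mentioned above. It is a simplification under stated assumptions: features are half-open coordinate ranges on one sequence, "overlaps" here also covers containment and equality, and the Feature type and function names are ours rather than part of SO.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int  # first position, inclusive
    end: int    # last position, exclusive


def relation(a, b):
    """Classify two ranges as 'disjoint', 'meets', or 'overlaps'."""
    if a.end < b.start or b.end < a.start:
        return "disjoint"
    if a.end == b.start or b.end == a.start:
        return "meets"  # the ranges share a boundary only
    return "overlaps"


def validate_transcript(exons, introns):
    """Check: exons are pairwise disjoint; each intron meets exactly two exons."""
    errors = []
    for i, first in enumerate(exons):
        for second in exons[i + 1:]:
            if relation(first, second) != "disjoint":
                errors.append(f"exons {first.name} and {second.name} are not disjoint")
    for intron in introns:
        touching = [e for e in exons if relation(intron, e) == "meets"]
        if len(touching) != 2:
            errors.append(f"intron {intron.name} meets {len(touching)} exons, expected 2")
    return errors
```

For example, exons at 0-100 and 200-300 with an intron at 100-200 pass both checks, whereas moving the second exon to start at 90 produces an error for each rule.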

Sequence feature qualities. We have compiled a large set of sequence quality terms in order to more fully describe sequence features. Qualities are dependent on the feature, in the sense that they do not exist except as an attribute of a feature. In SO there is an upper level term in the ontology that details these qualities of features, called sequence_attribute. In the same way that a rose may have the quality of being red, so features in the SO may have qualities of various kinds, such as imprinted and post_translationally_regulated. Including sequence qualities in SO allows us to provide more information about the specific kind of feature with which we are dealing; even though a feature is always of one single SO type, databases can attach any number of qualities to a feature. There are transcript_attributes such as frameshifted, transpliced and multicistronic. There are qualities that refer to the gene such as imprinted and post_translationally_regulated. There are qualities that refer to the results of experiments such as CDS_predicted versus CDS_independently_known. We will associate these quality terms to the relevant feature types using a new relationship quality_of; for example, imprinted is the quality_of a gene. This helps ensure that qualities are applied consistently.

Logical definitions. We will provide logical definitions for terms in the form of necessary and sufficient conditions; this can be done in a way that is amenable to computation using the OBO file format. For example, the necessary and sufficient conditions for a feature being a dicistronic_gene are (a) it is a gene and (b) it has the quality of being dicistronic. This allows computers to return the correct results in queries for the type dicistronic gene, based on the core feature type and recorded qualities of that feature. These logical definitions will complement existing free text definitions.
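
A minimal Python sketch of how such a definition might be evaluated is given below. It assumes each logical definition is recorded as a core feature type plus a set of required qualities; the dictionary layout and the single is_a link are hypothetical and serve only to illustrate the idea.

```python
# Defined type -> (core feature type, qualities that must all be present).
LOGICAL_DEFINITIONS = {
    "dicistronic_gene": ("gene", frozenset({"dicistronic"})),
}

# Hypothetical is_a links, child -> parent, included for illustration.
IS_A = {"protein_coding_gene": "gene"}


def is_a(term, ancestor):
    """True if term equals ancestor or reaches it by following is_a links."""
    while term is not None:
        if term == ancestor:
            return True
        term = IS_A.get(term)
    return False


def classify(feature_type, qualities):
    """Return the defined types whose conditions this feature satisfies."""
    return sorted(
        name
        for name, (core_type, required) in LOGICAL_DEFINITIONS.items()
        if is_a(feature_type, core_type) and required <= set(qualities)
    )
```

A query for dicistronic_gene can then return any feature recorded as a gene (or a subtype of gene) that carries the dicistronic quality, even if a database never assigned the compound type directly.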

3 Tools and technical support.

1 Maintenance and management of the GO.

We will continue to utilize the SourceForge tracking system to manage the daily work of maintenance and improvement of the GO. This system provides both a mechanism for suggestions and discussions for improvements and new terms to be recorded, and provides a tracking system to monitor progress and results. The GO Editorial Office will continue to oversee the day to day management of the GO including the checking of ontology consistency and the oversight for additions to the terminology.

To manage the growing volume of requests for changes, we will add to our existing mechanisms for prioritizing incoming SourceForge requests. At present, each new item is assigned a priority, with high priority conferred upon small alterations that can be done quickly, issues that are of interest to many groups, and additions that cover previously unrepresented areas of biology.

We will review outstanding SourceForge requests on a regular basis to ensure that assignments reflect progress in ontology development and annotation, with out-of-date items closed, and open items assigned to projects and curators. To improve communication of content development within the GOC and to the wider public, we will establish an “editorial calendar” for content-related tasks. For each content project, the calendar will show the curator coordinating the project, an estimated finish date, and a link to a more detailed project progress page. This will allow us to group related content requests, and more easily identify what new projects may need to be initiated in the future. Also, users and annotators will be able to see at a glance whether a certain topic is already under review, and have a convenient mechanism to contribute to its development. Projects are linked to the appropriate interest groups, so anyone wanting to contribute to a project can communicate to the group through the interest group mailing lists.

2 Develop and support OBO-Edit for maintaining ontologies.

The OBO-Edit[4] working group consists of editors from SGD, dictyBase, MGI, GOC editors, and a developer (J. Day-Richter). OBO-Edit development is tracked using the SourceForge OBO-Edit request tracker, including requests from both the GOC and the general public. The OBO-Edit working group regularly collects current items from the request tracker and from these formulates the next release plan, which is announced to the GOC using the group mailing lists. The plan may be revised based on the feedback following this notification. Once the plan is finalized, it is implemented by the developer and released to the OBO-Edit working group, for both regression testing (retesting modified software to ensure that the targeted bugs have been fixed and that no other previously working functions have failed as a result of these modifications) and testing of the new features. The test suite protocol defines the requirements and is used to specify the behavior demanded by the users for their approval of OBO-Edit releases. Tests to evaluate new features are established and augmented as an integral part of developing the OBO-Edit software. Once OBO-Edit has successfully passed the appropriate test suite, the OBO-Edit team then jointly releases the tool to the general public. A typical release cycle is about 3 months and each release tackles anywhere from 5 to 10 problems or new features. We will continue to improve the OBO-Edit application.

It is now difficult for a relatively inexperienced GO user to discover when and why a particular change was made to the ontologies. Indeed, this is a hard task even for the experienced. We will institute a method by means of which all transactions are accessible to the public. This information is already tracked by the GOC, but it is difficult to extract from our current systems.

After a year of development and testing, OBO-Edit is now ready to be released as production software. All of the original DAG-Edit functionality is in place and working correctly. OBO-Edit's improved, extensible framework has allowed the rapid addition of OBO-Edit-specific features, such as the new filtering/rendering system, a fast mini-reasoner, faster and more flexible data adapters, and a plugin system for user-developed instance editors. OBO-Edit now uses the JavaHelp documentation system, and a comprehensive user's manual is almost complete.

The new instance editing facilities will be an important direction for future development. Instance editors will make OBO-Edit an extremely useful resource for the rapid prototyping of new ontology-driven applications. As third-party instance editors are written, we will learn more about the software tools that OBO-Edit needs to help developers create instance editors quickly and easily.

OBO-Edit will also begin to include new methods of visualizing ontologies. As more open source graph-drawing tools become available, OBO-Edit will offer new, clearer methods for displaying and editing the ontology. OBO-Edit may also provide browser-based methods of ontology interaction, including fast servlets for viewing ontologies on the web.

Finally, interoperability is an important long-term goal for OBO-Edit. OBO-Edit will support more file formats, including full support for OWL ontologies. OBO-Edit will provide an adapter architecture for the integration of third-party reasoning tools to supplement and extend OBO-Edit's mini-reasoner. OBO-Edit will provide facilities for interaction with other GO resources, including plugins that interface with OBOL and GONG ([GONG]), and enhanced support for gene product searches. OBO-Edit will continue to serve as the reference tool for the OBO file format. As the OBO format grows and develops, OBO-Edit will expand in parallel.

3 Interoperability.

The GOC has always been aware of the need for better interoperation between bioinformatics software and database systems. In addition to web interfaces geared towards biologists, we provide both ontologies and annotation sets in a variety of formats, including the native OBO file format, simple tab-delimited text, XML and relational database exports. We have also provided software and APIs to help integrate GO data into software applications. However, until recently it has not been our primary concern to enable interoperation with software and tools from outside the life sciences domain, such as traditional ontology platforms like Protégé. Whilst our primary commitment remains the biologist users of GO, we recognize that interoperation on a wider scale is increasingly important. The growth of ontology standards, platforms and tools (often termed “Semantic Web” technologies) in recent years promises to be an interesting avenue to explore better ways of integrating databases and knowledge systems. These technologies are being adopted by many data architects within the life sciences community.

We have made the GO ontologies and annotation sets available in the W3C standard “semantic web” language RDF (Resource Description Framework) since 2002, and have recently made the ontologies available in OWL (Web Ontology Language) format, another W3C standard, more expressive than and layered on top of RDF. OWL is the major standard for ontologies, and it allows GO to be imported into ontology tools developed outside the bioinformatics world; these include editors such as Protégé-OWL and SWOOP, databases such as Sesame and Kowari, and reasoners such as FaCT++ and Pellet.

We would like to do more to improve interoperation with these technologies, in particular by using them in a way that allows the GOC and users of the GO to benefit. We would like to develop Java interoperation layers between the OBO-Edit datamodel and APIs such as the Protégé-OWL API and the commonly used Jena API.

4 Summary of Aim 1.

Our objective in Aim 1 is to continue the development of the GOC ontologies so that they are logically rigorous and biologically accurate. We recognize the challenge of representing changing knowledge in complex and dynamic systems. We will continue to develop tools and applications useful to scientific curators and to ontology developers. We will organize biannual domain workshops and actively seek out biologists to work with us. We will coordinate with other ontology content developers to remove redundancies and ensure consistency. We will incorporate new relationship types into the ontologies as needed; for example, between Molecular Function and Biological Process, to better support groups who are utilizing the GO in conjunction with work on pathways. We will recast compound terms as explicit cross-products with orthogonal OBO ontologies. We will apply the principles of ontology development now being defined by the National Center for Biomedical Ontology [NCBO].

2 AIM 2. We will provide Comprehensive Annotation of Reference Genomes.

We will produce comprehensive, high-quality, functional annotations for a selected set of reference genomes. These comprehensive annotations will be publicly available through the GOC resources. These functional annotations are essential to researchers from a variety of disciplines including genetics, genomics, bioinformatics, computational biology, cell biology, biochemistry and medicine. We will foster SO compliant annotations of the nine reference genomes.

1 Rationale.

Why is the detailed functional annotation of reference genomes important for biomedical research? The genomes of a relatively small number of organisms (those of the major model organisms and that of human) are being subjected to very detailed experimental and computational analyses. This research is motivated by many factors: one, of course, is the desire, indeed the need, to understand the genetic basis of human disease. Another motivation is to understand the fundamental properties of genomes, their function and their evolution. Whatever the motivation for research, the fact is that it is generating an enormous amount of information about the structure and function of genomes. The GOC seeks to capture and describe a subset of these data using its ontologies. For the small number of organisms with a substantial community of experimentalists, we intend to capture this knowledge as comprehensively as possible. There are two reasons for this ambition: the first is that the experience of modern genetics has repeatedly shown the power of the comparative method. Experimental information about gene function in mouse, fly, worm, yeast, and other model organisms informs understanding of gene function in human, and vice versa. For the very many emerging genomes (see Aim 3), the comparative method is the only practical way that we will be able to infer the properties of their genes. Given the reliance on this method of inference, it behooves the GOC to ensure that the annotation of the reference genomes is as representative as possible. Without comprehensive annotation of reference genomes this inference is impossible.

The second reason for this ambition is that GO users desire a comprehensive annotation of reference genomes. A recent survey, conducted online by Advanced Survey ([SURVEY]), was completed by 1474 people during the last three weeks of October 2005. This large response was gratifying and provided important insights into what a wide sample of biomedical researchers, ranging from wet bench scientists to mathematicians and computer scientists, viewed as important with respect to the functional annotation of genomes. Over 97% of the respondents said that the availability of carefully reviewed, precise, high-confidence functional descriptions of gene products was very or moderately important for their research (question 6). More than 61% found the GO functional annotations to be the most valuable, with GenBank functional descriptions coming in a distant second with only 36% of the respondents finding them most valuable (question 9). Thus, GO functional annotations were preferred nearly two-to-one over the next best source of information. More than 83% of GO users found GO data useful, while 53% of GenBank users found GenBank functional data useful. The results also indicated that our highest priority should be the detailed annotation of a wide range of organisms (89%, question 7), while more than 66% wanted more detailed functional annotations (question 14). The details of the survey are provided in Appendix 13. Surveys do not, of course, represent a complete cross-section of the user community. Nevertheless, the very clear message that we see in the answers provides an important rationale for pursuing comprehensive functional annotation of reference genomes. We provide the details for our plan to accomplish this task in the next section.

2 Produce a Comprehensive Set of Functional Annotations for Reference Genomes.

We will produce a dataset of comprehensive functional annotations for a selected set of reference genomes. This will be done by leveraging existing GO supported staff with annotators supported by the MODs and other genomics databases. This cooperative effort will be coordinated by the GOC. We will establish priorities, and produce metrics to track progress toward these priorities. Databases for the identified reference genomes support this effort and have agreed to provide both data for metrics and to cooperate by incorporating priorities established by the GOC.

1 We have defined a set of reference genomes for comprehensive GO functional annotation.

We have used the following criteria to determine which genomes should be included within the GO reference genome set: (1) The organism must have a completed genome sequence. (2) There must be a substantial literature of experimental studies with the organism relating to the general domain of 'gene function'. (3) The community of researchers using the organism must be large enough to sustain continued significant functional studies. (4) The organism should be representative of a broad taxonomic group, so as to maximize its value for comparative genomics. (5) There should be on-going large-scale genome-wide functional studies in the organism. (6) The genome should be represented by a curated MOD that takes ownership of the genome annotations, and provides them to the GO database. This requirement also has the benefit of leveraging GO resources with the functional annotation expertise of curators rooted in the particular research community. Based on these criteria, we will focus on the following set of reference genomes: mouse, fruit fly, worm, Arabidopsis, budding yeast, cellular slime mold, and zebrafish.

Two other genomes will also be included in this reference set. The first is E. coli. The annotation of the genome of E. coli with GO terms is important for three reasons: (1) E. coli K-12 is a very important model organism, with a huge body of experimental data; (2) it is the reference genome for the annotation of many other microbial, and all eubacterial, genomes; and (3) other strains of E. coli are of significant medical importance. Unfortunately, today, no community genome database for E. coli exists. Fortunately, the NIH issued an RFA for the development of an E. coli database [RFA] in 2005. One of the criteria in the RFA is the use of GO terms for annotation. Thus, there is every reason to believe that E. coli will also soon have a curated database, which can be leveraged in our efforts to provide comprehensive functional annotation for this important experimental system. Once the E. coli database has been established, the GOC will work with that project to assure that E. coli annotation is integrated into the dataset of reference genome functional annotation. The translation of existing annotation terms to GO terms for E. coli K-12 has already been achieved (Riley et al. 2006) and these have been included in the latest release of EcoCyc (v9.6) [EcoCyc].

The other genome requiring special consideration is human. Like E. coli, there is no community database currently available for human. However, there are enormous amounts of functional data available, especially in relation to human disease. Although it is obvious, it is essential to state that detailed functional annotation of human genes is a critical focus of the reference genome task. The GO Annotations group at the EBI ([GOA]), part of the UniProt project, will continue to lead the human functional annotation effort. GOC contributions to human annotation are discussed below.

2 Defining and measuring comprehensive functional annotation.

There are two perspectives for defining comprehensive annotation: breadth and depth. Breadth addresses the relative coverage of the genes in the genome, i.e., what proportion of an organism's gene products have been annotated?[5] Depth is viewed from a single gene's perspective: what fraction of the available information concerning that gene has been captured by GO annotation? We will now describe the metrics that we will use to evaluate our progress toward the goal of "complete" breadth and depth of annotation.

Comprehensive Annotation. Our goal is to provide comprehensive annotation for each of the reference genomes. We will accomplish this task through the definition of GO annotation standards for the curation of experimental data from the biomedical literature. We will provide metrics to observe our progress and to ensure that new experimental data is incorporated into the annotation dataset. A simple metric that will indicate progress toward this goal is the percentage of genes that have GO annotations. In practice we will provide three metrics for each of the reference genomes:

• The percentage of genes with any GO annotation (other than to the root terms)[6].

• The percentage of genes with GO annotation with an ISS[7] or experimental code.

• The percentage of genes with GO annotations with an experimental evidence code.
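
As an illustration, the three percentages could be computed along the following lines in Python. The sketch assumes that annotations are available as (gene, GO ID, evidence code) triples and that the experimental evidence codes are IDA, IMP, IGI, IPI and IEP; both the input layout and the function name are assumptions made for this example.

```python
# Assumed set of experimental evidence codes for this sketch.
EXPERIMENTAL_CODES = {"IDA", "IMP", "IGI", "IPI", "IEP"}
# Root terms: molecular_function, biological_process, cellular_component.
ROOT_TERMS = {"GO:0003674", "GO:0008150", "GO:0005575"}


def coverage_metrics(all_genes, annotations):
    """Percentage of genes with any, ISS-or-experimental, and experimental
    GO annotation, ignoring annotations made to the three root terms."""
    with_any, with_iss_or_exp, with_exp = set(), set(), set()
    for gene, go_id, evidence in annotations:
        if go_id in ROOT_TERMS:
            continue  # root annotations record the absence of information
        with_any.add(gene)
        if evidence in EXPERIMENTAL_CODES:
            with_exp.add(gene)
            with_iss_or_exp.add(gene)
        elif evidence == "ISS":
            with_iss_or_exp.add(gene)

    total = len(all_genes)

    def percent(genes):
        return 100.0 * len(genes) / total

    return {
        "any": percent(with_any),
        "iss_or_experimental": percent(with_iss_or_exp),
        "experimental": percent(with_exp),
    }
```

Each percentage is taken over all genes in the genome, so the three values are nested: the experimental figure can never exceed the ISS-or-experimental figure, which in turn can never exceed the figure for any annotation.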

Our ideal dataset, where all the genes in a reference genome have manual GO annotation (reflecting experimental knowledge for that genome), cannot be achieved of course: even in S. cerevisiae about 25% of predicted gene products have no information (either published or computationally predicted) that would allow any meaningful GO annotation of Molecular Function. The proportion of such gene products in the other reference genomes will be about the same or higher. Where there is simply no information available, they will be annotated with one of the three GO "root" terms: Molecular Function, Biological Process or Cellular Component. This will allow us to determine, for each organism, the number of genes with no experimental results, thus enabling us to monitor one measure of the state of knowledge for the reference genomes (Figure 9). Note that annotation to the root nodes is only done as a result of manual curation efforts, signifying that no information was available at the time of curation.

[pic]

Figure 9. Annotation metrics as of January 2006. Reported here is the percentage of gene products with experimentally based annotations in each GO domain for each of the reference genomes.

Depth of Annotation. The second measure of annotation is its depth, the extent to which the known functional attributes of a particular gene have been captured. Again, the ideal would be to attach to every gene, a GO term for each function, process or component that has been reported for the gene products of that gene. From a practical point of view it is extremely difficult to find a reliable and objective measure of the size of the literature that would need to be curated for this objective to be achieved. While most MODs maintain comprehensive bibliographies of papers that at least mention any particular gene, we know from experience that only a subset of these papers are relevant to GO annotation and, furthermore, that there is considerable redundancy in information among these papers. We consider that any metric based upon the volume of literature is impossible to calculate in a way that would be comparable between MODs.

We can, however, easily produce two metrics:

• The total number of papers curated for GO annotation for each organism.

• The average number of papers curated for GO per gene for each organism.

Changes in these metrics over time will give us an indication of the progress of annotation.

There are two other metrics that we will also use:

• The granularity of GO annotation (Table 7).

• The average number of annotations (all, ISS plus experimental, and only experimental evidence codes) per gene.

The granularity of annotation with respect to the GO graphs is a measure of the depth of GO annotation. For a deeply annotated genome we expect the GO annotations to be closer to leaf, rather than internal node, terms. We will compute for each gene, and then average across each genome, the fraction of GO terms used that are leaf terms. We will track the change in the measure over time, as an indication of the depth of annotation.
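
The leaf-term fraction described above can be sketched in Python as follows. The sketch assumes the ontology is available as a mapping from each term to its direct child terms and that each gene's annotations are given as a set of terms; the term names in the usage note are invented.

```python
def leaf_fraction(children, gene_terms):
    """Average, over genes, of the fraction of a gene's terms that are leaves.

    children:   dict mapping a term to its direct child terms
    gene_terms: dict mapping a gene to the set of terms used to annotate it
    """
    fractions = []
    for terms in gene_terms.values():
        if not terms:
            continue  # unannotated genes do not contribute to the average
        leaves = sum(1 for term in terms if not children.get(term))
        fractions.append(leaves / len(terms))
    return sum(fractions) / len(fractions) if fractions else 0.0
```

With a toy graph in which "root" has children "a" and "b", and "a" has child "a1", a gene annotated to "a" and "a1" scores 0.5, a gene annotated only to "b" scores 1.0, and the genome average is 0.75.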

|Average distance to leaf |Database |Organism |
|4.0560 |UniProt |Escherichia coli |
|4.2383 |FB |Drosophila melanogaster |
|4.2841 |UniProt |Homo sapiens |
|4.4800 |MGI |Mus musculus |
|4.7170 |DDB |Dictyostelium discoideum |
|4.7358 |WB |Caenorhabditis elegans |
|4.8267 |SGD |Saccharomyces cerevisiae |
|6.5618 |TAIR |Arabidopsis thaliana |
|7.3846 |ZFIN |Danio rerio |

Table 7. A measure of the granularity of annotation. For all terms used for annotation, for all gene products in each organism, we determined the distance between that term and the most distant terminal term (leaf) beneath it, and calculated the average of these distances. This is one way of assessing the granularity of annotation. If all gene products were annotated with terminal (leaf) terms then this average would be 0. On the other hand, if all gene products were annotated with the root terms (“unknown”) the average would be 17.
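
For readers who wish to reproduce a measure of this kind, the Python sketch below computes the distance from a term to the most distant leaf beneath it and averages that distance over a list of annotations. It is not the script used to produce Table 7; it assumes an acyclic graph supplied as a mapping from each term to its direct children, and the function names are ours.

```python
from functools import lru_cache


def make_depth_below(children):
    """Build a cached function giving the longest path from a term to a leaf."""

    @lru_cache(maxsize=None)
    def depth_below(term):
        kids = children.get(term, ())
        if not kids:
            return 0  # a leaf is at distance 0 from itself
        return 1 + max(depth_below(kid) for kid in kids)

    return depth_below


def average_granularity(children, annotated_terms):
    """Mean distance to the most distant leaf, one entry per annotation."""
    depth_below = make_depth_below(children)
    distances = [depth_below(term) for term in annotated_terms]
    return sum(distances) / len(distances)
```

Caching matters because a term with several parents is reached many times in a graph of this size; each term's distance is computed once and then reused.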

General remarks. We are fully cognizant of the fact that fully satisfactory objective measures for the "completeness" of reference genome annotation are probably impossible to achieve. This does not, however, diminish the ambition of this aim. The acceptance of this aim by the GOC is itself a major achievement, for now all of the MODs are committed to these objectives. Moreover, the annotation metrics that we will supply, imperfect though they will be, will be useful in highlighting gaps in our knowledge and understanding of these genomes. The metrics described above will be archived on the GO site every quarter.

We also point out that "complete" annotation of all the reference genomes with experimentally determined GO annotations will probably never be achieved. The reason for this is obvious: Different experimental systems have different experimental strengths. For example, mouse has been widely used as a model of human disease and for studying development, while Drosophila and yeast are particularly well suited for defining biochemical and regulatory pathways. Thus, by collecting annotations from the selected reference species, gaps in the functional annotation for one organism are covered by the experimental information gathered in others. Since actual curation is done on an organism basis we will focus on assessing the depth and extent of annotations relative to a specific genome, relying on the breadth of genomes in the reference set to satisfy a biologist’s view.

3 Methods.

Organizational. The GOC currently supports GO annotators at several of the model organism databases. These curators are Ph.D. level biologists who have extensive experience in the curation of the biomedical literature for GO annotations. During the next grant period we will refocus the efforts of these experienced GO annotators to achieve the goal of comprehensive annotation of the reference genomes. These GO annotators will now play several new roles. They will continue to coordinate GO annotation locally, with the databases that serve as their homes. These GOC supported annotators will provide an embedded liaison with most of the identified reference genome annotation efforts. Currently these GO liaisons exist at the Mouse Genome Database (MGI), Saccharomyces Genome Database (SGD), WormBase, The Arabidopsis Information Resource (TAIR), dictyBase and GOA (responsible for human annotation). In addition, FlyBase, TIGR and ZFIN have agreed to identify a member of their annotation teams that will play this role for each of their databases (see letters of collaboration from M. Westerfeld and M. Gwinn). FlyBase has specific funding from the UK Medical Research Council for this purpose. They will form a GO annotation team that will monitor the consistency of annotation at their associated reference genome database and work directly with the GOC to provide the data to generate the metrics discussed above. These databases have agreed to work in a concerted fashion with the GO Coordinator of Reference Genome annotation (see below) to implement the priorities of comprehensive GO annotation, in both breadth and depth. The structure we propose has an important benefit. This is that each of the MODs will be providing, on average, three annotators using the GO, in addition to those staff funded directly by this award. This will produce a team of almost 30 annotators contributing to the goal of producing a comprehensive set of GO functional annotations for these reference genomes.

This GO annotation team will focus their efforts using priorities set by the GOC. The highest of these priorities for the reference genome annotation will be the orthologs of human genes known to be involved in human diseases, as well as the human disease genes themselves. These disease-related genes will be those identified by OMIM, and other resources such as MEDGENE ([MEDGENE]), the Human Gene Mutation Database ([HGMD]), and specialized resources such as Homophila ([Homophila]), as well as from literature curation. The GO annotation team will assure that these genes receive the highest priority at their respective reference genomes, as well as assuring that the human genes themselves have been correctly annotated. To accomplish this goal, the GOC will produce a list of human genes involved in human disease by extracting them from OMIM and the other sources mentioned above. Each MOD will use this list, together with computationally derived lists of orthologs, such as those produced for the model organisms by InParanoid, TreeFam, and other similar resources to provide a list of annotation targets. In addition, most of the MODs also have local tools to identify orthologous genes that will be employed to maximize the list of target genes in their reference genome. The GO Coordinator for Reference Genome Annotation and the GO liaison for each reference genome will assist the MODs with high priority annotation of the genes on this list. The reference genome MODs will be responsible for identifying the literature relevant to these priority genes in their organism’s genome. As the orthologous genes are annotated in the reference genomes, the GO liaison curators will review and where appropriate update the annotation for the human gene itself. The GO annotation team will provide annotations of human genes to the GOA curators, so as to expand the GO supported efforts to provide the most complete functional annotation of the genes important for understanding human disease.
This list of human genes will be maintained on the GO website, and will be updated to indicate which reference genome orthologs have been functionally annotated. In addition the GO Coordinator of Reference Genome Annotation will coordinate and monitor the annotation of human genes contributed by the MODs via GOA. This comprehensive list will be continually updated and available on the GO website.
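As an illustration only, the construction of per-genome target lists from the disease gene list and a computed ortholog table might be sketched as follows. The function names, the tuple layout of the ortholog table and all identifiers are hypothetical; the actual inputs would come from OMIM and from resources such as InParanoid or TreeFam in whatever formats they provide.

```python
from collections import defaultdict


def build_annotation_targets(disease_genes, ortholog_pairs):
    """Group orthologs of human disease genes by reference genome.

    disease_genes  -- iterable of human gene identifiers (hypothetical input)
    ortholog_pairs -- iterable of (human_gene, genome, ortholog_id) tuples,
                      e.g. flattened from a computed ortholog table
    """
    wanted = set(disease_genes)
    targets = defaultdict(set)
    for human_gene, genome, ortholog_id in ortholog_pairs:
        if human_gene in wanted:
            targets[genome].add(ortholog_id)
    return {genome: sorted(ids) for genome, ids in targets.items()}


def annotation_progress(targets, annotated_ids):
    """Return {genome: (annotated_count, total_count)} for the target lists."""
    done = set(annotated_ids)
    return {
        genome: (sum(1 for i in ids if i in done), len(ids))
        for genome, ids in targets.items()
    }


# Made-up identifiers, for illustration only.
pairs = [
    ("HUMAN:G1", "mouse", "MOUSE:g1"),
    ("HUMAN:G1", "fly", "FLY:g1"),
    ("HUMAN:G2", "mouse", "MOUSE:g2"),  # G2 is not on the disease list
]
targets = build_annotation_targets(["HUMAN:G1"], pairs)
progress = annotation_progress(targets, ["MOUSE:g1"])
print(targets)
print(progress)
```

The progress counts are the kind of per-genome figure that could be posted alongside the gene list on the GO website.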

The second priority for the GO annotation team will be to work with special interest groups, typically knowledge domain experts, to curate large sets of genes. One example is the NIH funded effort to annotate genes involved in the immune response through the Immunology Database and Analysis Portal ([IMMPORT]). This group has identified 3000 human genes important for researchers studying the immune response. Members of the GO annotation team will be assigned to work with groups like this to assist in annotation and to provide annotation consistency between the groups. Additional high priority areas will include genes involved in cardiovascular disease and genes involved in neurodevelopment, both efforts currently being led by MGI.

A senior member of the GOC management team (R. Chisholm, Northwestern), the Coordinator for Reference Genome Annotation, will be responsible for overseeing this coordinated annotation effort. He will be responsible for implementing the Consortium established annotation priorities and for collating the statistics required to monitor progress toward comprehensive annotation, as discussed above. Dr. Chisholm will work with the individual reference genome curation teams to monitor progress toward these goals. The GOC leadership will regularly review this progress and help set priorities for large datasets or newly identified domains for focused annotation efforts. In addition, members of the GO annotation team will be actively involved in outreach to proactively identify new areas for potential special interest groups and to identify and contact groups as they emerge (see section D.6.a.).

Procedures. The primary method for annotation will be the curation of the primary research literature and authoritative reviews. Each of the MODs has in place mechanisms for the identification of the relevant literature; most rely on PubMed as their primary resource. Each of the MODs also has in place methods to triage their literature. These will remain in place, although we anticipate that the interactions between the different databases resulting from this coordination will likely lead to improvements in literature triage methods. In addition, there are other mechanisms in place (the GOC meetings and the annual BioCurators meetings ([BioCurator])) for the MODs and the GO annotation team to discuss these issues. We expect to see an increasing reliance on text processing methods (e.g., TextPresso, an effort being led by WormBase, ([TextPresso])) for the identification of papers relevant for GO annotation. Several of the MODs, for example FlyBase, MGI and SGD, have ongoing collaborations with Natural Language Processing groups. The GOC will facilitate the information exchange between the MODs to fully document their literature annotation methods.

In general, there are six steps in a comprehensive annotation process:

1. Identify the genes requiring annotation. The first step in the comprehensive annotation of a gene is the identification of a gene to annotate. A variety of strategies can be used to choose genes for annotations ranging from genes that have very little annotation to genes that are important for their roles in human disease.

2. Identify relevant papers. The second step in comprehensive annotation is to identify relevant papers. MODs are a good start, since they already capture literature about genes, but other resources such as NCBI’s EntrezGene ([EntrezGene]) can also serve as starting places.

3. Read papers. The third step in comprehensive annotation is for a curator to read papers that are associated with genes of interest to identify relevant GO annotations. Often, curators start by reading reviews and summary material. Unless a part of the ontology has been used extensively for annotation, this step usually leads to identification of areas where the ontology can be improved.

4. Modify GO. During the curation steps, areas of the ontologies that need improvement are recognized. Curators use SourceForge to suggest major changes to the ontologies. These changes are agreed upon after discussion and the GO is modified.

5. Define Annotation. Curators use primary experimental data from the literature to annotate genes with GO terms. This process often leads to the identification of new genes as targets for comprehensive annotation, identification of new papers for annotation, and identification of minor modifications that are still necessary for the ontologies.

6. Evaluate Annotation Consistency. The last step in the process is to evaluate the annotations of orthologs provided by other groups, as well as the annotations of homologous proteins within the same organism. This is part of quality control within an annotation group and between the contributing groups of the GOC. The results of this evaluation assist in the definition of standards and highlight potential topic areas that need more attention.

The GOC will conscientiously endeavor to ensure the breadth, depth, quality and consistency of annotations. We have discussed above the difficulties in measuring the breadth and depth of annotations. There are two issues with respect to the quality and consistency. The first is that the consistency of annotation is driven by communication. The GOC will facilitate this communication by several actions. The first is by holding "annotation camps". Two of these have already been held ([GO.camp]) and we will organize further camps on an annual basis. The second is to ensure that at individual sites there are regular (monthly) sessions at which all of the GO annotators independently annotate two papers and then discuss any differences in annotation (this is already normal practice at dictyBase and SGD). Thirdly, we will extend this practice between sites. Each month one site (in rotation) will provide two papers for all GO annotators. This site will then collate the results of this exercise and initiate a monthly conference call during which the results and lessons learned will be used to enhance the annotation documentation. The SGD is beginning a collaboration with J. MacMullen (School of Information & Library Science, University of North Carolina, Chapel Hill) to develop methods to assess annotation quality. We expect all in the GOC to learn from this collaboration.

Annotations that arrive from alternate sources (different algorithms or different curators or both) may contain what appear to be contradictory data. We will develop tools to detect discrepancies in the gene product annotations. For example, when a pathway group submits new GO associations, scripts will automatically check for discrepancies with existing annotations submitted from other sources. When the discrepancies may affect another submitter’s associations (as in the case of a non-human ortholog of a protein), the script will alert both submitters by email. These differences in annotations may indicate errors, or they may simply reveal genuine biological dualities. Therefore, the script will only serve to detect differences, and to put appropriate people into contact. The questions will be resolved by direct dialog between the curators. A list of the acceptable biological discrepancies will be maintained so that the software utility will not contact curators repeatedly for the same question. These checks of gene annotations will also help shape the GOC efforts to define procedures to improve consistency. The existing consistency checks between mouse, rat and human annotations have been discussed above (Progress Report).
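A minimal sketch of such a discrepancy check is given below, purely as an illustration. It assumes annotations are available as (gene product, GO identifier, submitter) tuples and treats any difference in term sets between two submitters as a discrepancy; a production script would also need to take the ontology structure into account, since a more specific term is not in conflict with its parent. The function name, the tuple layout and the identifiers are hypothetical, and the sketch returns the pairs of submitters to be put in contact rather than sending email.

```python
def find_discrepancies(existing, submitted, accepted=frozenset()):
    """Report differences between new and existing annotations.

    existing, submitted -- iterables of (gene_product, go_id, submitter)
    accepted            -- set of (gene_product, go_id) pairs already judged
                           to be genuine biological differences
    Returns a list of (gene_product, new_submitter, other_submitter, terms).
    """
    by_gene = {}
    for gene, go_id, submitter in existing:
        by_gene.setdefault(gene, {}).setdefault(submitter, set()).add(go_id)

    new_terms = {}
    for gene, go_id, submitter in submitted:
        new_terms.setdefault((gene, submitter), set()).add(go_id)

    reports = []
    for (gene, submitter), terms in sorted(new_terms.items()):
        for other, other_terms in sorted(by_gene.get(gene, {}).items()):
            if other == submitter:
                continue
            # Terms held by only one of the two submitters, minus those
            # already on the list of accepted discrepancies.
            differing = {
                t for t in terms ^ other_terms if (gene, t) not in accepted
            }
            if differing:
                reports.append((gene, submitter, other, sorted(differing)))
    return reports


# Made-up data, for illustration only.
existing = [("P1", "GO:0000001", "groupA")]
submitted = [("P1", "GO:0000002", "groupB")]
reports = find_discrepancies(existing, submitted)
print(reports)
```

Passing the differing pairs back in through `accepted` corresponds to the maintained list of acceptable biological discrepancies, so that the same question is not raised twice.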

Our plan is to develop automatic methods to compare annotations between the identified reference genomes. The null hypothesis is that a pair of orthologous genes, one in species A, the other in species B, will have identical or very similar GO annotation. We are, of course, very aware that there are cases where this will not be so, where orthologs have diverged in their function. Nevertheless, the null hypothesis is a good starting point. We will develop a tool (see D.2.b.v below) that will allow annotators to easily review the annotations of the orthologs of a gene product that they are in the act of annotating. We discuss in Aim 3 the methods by means of which we will determine orthologs, and will not repeat them here. If the inspection of annotation of orthologs identifies gaps or differences (e.g. in the granularity of annotation) between species then an annotator has three main choices. S/he could use the annotation of the ortholog as a route into the literature concerning her gene of interest, and then annotate it directly. S/he could decide to annotate her gene transitively, using the 'inferred from sequence similarity' (ISS) evidence code. If there is experimental evidence that her gene of interest has in fact diverged in function from its orthologs, then she could annotate that fact, using the NOT qualifier ([GO.assoc-format]).
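One simple way to test the null hypothesis for a given gene is to score the overlap between its GO terms and those of each ortholog, and to list the terms found on only one side. The sketch below uses the Jaccard index for the score; this is an assumption made for illustration, not a measure the GOC has adopted, and the function name and identifiers are hypothetical.

```python
def compare_orthologs(annotations, gene, orthologs):
    """Compare the GO terms of `gene` with those of each ortholog.

    annotations -- dict mapping gene identifier to a set of GO identifiers
    Returns {ortholog: {"similarity", "missing_here", "missing_there"}}.
    """
    own = annotations.get(gene, set())
    report = {}
    for orth in orthologs:
        other = annotations.get(orth, set())
        union = own | other
        # Jaccard index: shared terms over all terms; 1.0 if both are empty.
        similarity = len(own & other) / len(union) if union else 1.0
        report[orth] = {
            "similarity": similarity,
            "missing_here": sorted(other - own),   # candidate gaps for `gene`
            "missing_there": sorted(own - other),  # candidate gaps for ortholog
        }
    return report


# Made-up data, for illustration only.
annotations = {
    "speciesA:g1": {"GO:0000001", "GO:0000002"},
    "speciesB:g1": {"GO:0000002", "GO:0000003"},
}
report = compare_orthologs(annotations, "speciesA:g1", ["speciesB:g1"])
print(report)
```

The `missing_here` list is what an annotator would use as a route into the literature or as the basis for an ISS annotation.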

The use of the ISS code for annotation evidence also provides an interesting possibility for checking annotation quality since the evidence must be based on experimental data obtained in the genome source of the sequence. It is a rule of the GOC that if protein A is annotated with GO term x by sequence similarity, then the annotation must declare the target to which protein A is similar (typically as a UniProt identifier). The consequence of this rule is that should the function of the target protein be changed in a database (either UniProt or the GO database) we have a method to change the, now outdated, annotation of protein A. We will implement scripts to highlight differences in GO annotation between query and target proteins, using the data in the 'WITH' column of the GO's gene_association data.
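The check on ISS annotations could be sketched as below. The sketch assumes the tab-delimited gene_association layout in which the first two columns hold the database and object identifier, the fifth the GO identifier, the seventh the evidence code and the eighth the 'WITH' value (with multiple targets separated by '|'). It also assumes that identifiers in the 'WITH' column use the same DB:ID form as the first two columns joined by a colon. The function names are hypothetical, and targets that are absent from the file are simply skipped.

```python
def parse_gene_association(lines):
    """Yield one dict per annotation row, skipping '!' comment lines."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        yield {
            "id": f"{cols[0]}:{cols[1]}",
            "go_id": cols[4],
            "evidence": cols[6],
            "with": [w for w in cols[7].split("|") if w],
        }


def stale_iss(rows):
    """Flag ISS annotations whose target no longer carries the same term.

    Only non-ISS annotations of the target are counted as support.
    Returns a list of (query_id, go_id, target_id).
    """
    rows = list(rows)
    supported = {}
    for row in rows:
        if row["evidence"] != "ISS":
            supported.setdefault(row["id"], set()).add(row["go_id"])

    flagged = []
    for row in rows:
        if row["evidence"] != "ISS":
            continue
        for target in row["with"]:
            if target in supported and row["go_id"] not in supported[target]:
                flagged.append((row["id"], row["go_id"], target))
    return flagged


# Made-up rows, for illustration only (nine of the columns are shown).
query = "\t".join(
    ["DBX", "Q1", "q1", "", "GO:0000001", "REF:1", "ISS", "DBX:T1", "F"]
)
target = "\t".join(
    ["DBX", "T1", "t1", "", "GO:0000002", "REF:2", "IDA", "", "F"]
)
flagged = stale_iss(parse_gene_association(["!comment", query, target]))
print(flagged)
```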

As detailed in Aim 1 there will be an increasingly close and synergistic collaboration between the GOC and Reactome and the Cyc family of pathway databases. This will be to the mutual benefit of both projects, but in particular will help the GOC increase the depth and quality of reference genome annotation.

Finally, the GOC will engage with its community to increase the quality and depth of annotation. The community's contribution to the structure of the ontologies themselves has been very important for their development, but we have not, as a GOC, engaged the community in annotation issues (although this may happen at the MODs). We will organize an annual annotation jamboree with domain specialists so as to increase the depth and quality of annotations in a particular sub-field – for example immunology, or host-pathogen interactions. We will solicit comments (corrections, additions) on annotations directly through a modified AmiGO browser site.

4 Sequence annotation management.

We will provide the means to manage the contents of sequence annotations. In a recent paper (Eilbeck, et al., 2005) we describe how the mathematical operations of Extensional Mereology can be applied to the parts of sequence annotations; for example, to classify the kinds of alternate splicing within a genome, and to compare these to the sequence annotations of other genomes (see Figure 10). The results of such analyses can be used to prioritize re-annotation and be used to redirect possible errors in annotation back to the curators. Some classes are more likely to indicate annotation problems than others—particularly those genes having one or more sequence-disjoint transcripts. Parts-disjoint transcripts, on the other hand, are more suggestive of complex biology. Alternatively spliced genes having only overlapping transcripts (0:0:N) comprise the vast majority of instances. We will provide the numbers of each kind of alternate splicing for each of the core organisms.

[pic]

Figure 10. Examples of alternatively spliced genes from Entrez Gene at the NCBI. Seven classes of alternatively spliced genes are shown, derived from asking three questions: How many pairs are there of spatially-disjoint transcripts? How many pairs are there of mereologically-disjoint transcripts? How many pairs are there of mereologically-overlapping transcripts?
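The three counts behind these classes can be illustrated with a short sketch. It assumes each transcript is represented as a set of exons given as inclusive (start, end) coordinates, that two transcripts overlap mereologically when they share an identical exon, that they are parts-disjoint when they share no exon but some of their exons overlap in sequence, and that they are sequence-disjoint otherwise. These working definitions are a simplification made for illustration; the published analysis (Eilbeck, et al., 2005) should be consulted for the exact criteria.

```python
from itertools import combinations


def classify_gene(transcripts):
    """Count transcript pairs of each kind for one gene.

    transcripts -- list of sets of (start, end) exon coordinates (inclusive)
    Returns (sequence_disjoint, parts_disjoint, overlapping) pair counts.
    """
    seq_disjoint = parts_disjoint = overlapping = 0
    for a, b in combinations(transcripts, 2):
        if a & b:
            # At least one exon (part) in common.
            overlapping += 1
        elif any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b):
            # No shared exon, but some exonic sequence in common.
            parts_disjoint += 1
        else:
            seq_disjoint += 1
    return seq_disjoint, parts_disjoint, overlapping


# Made-up coordinates, for illustration only.
t1 = {(1, 10), (20, 30)}
t2 = {(1, 10), (40, 50)}
t3 = {(5, 15)}
t4 = {(100, 110)}
counts = classify_gene([t1, t2, t3, t4])
print("%d:%d:%d" % counts)
```

A gene whose pairs all fall in the last position, such as the two transcripts `t1` and `t2` on their own, corresponds to the 0:0:N class described in the text.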

5 Orthology annotation tool.

We will develop a tool to present biological annotators with the GO annotations for every protein in an orthology set (Figure 11). A floating palette, listing the terms annotated to each protein, will appear and disappear when the user clicks on the protein. The terms will be color coded according to the evidence for that annotation. The annotator will be able to add terms to a protein by dragging terms from one protein’s list onto another's. The evidence code for these annotations will automatically be set to IC (inferred by curator). The reference will indicate the orthology software that was used to build the tree (either a URI or a paper describing the software). The protein that was the source for this inference will also be automatically captured. Additional fields that are required for generating an annotation format ([GO.assoc-format]) will be configurable. The ontology itself will also be available on an adjoining panel, and terms may also be dragged onto a protein from there (a dialog box will prompt for the reference and evidence fields in this case).
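The record created by a drag-and-drop transfer might look like the following sketch. The class and field names are hypothetical and cover only the fields named above: the evidence code is fixed to IC, the reference is set to the orthology software, and the source protein is captured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    protein: str
    go_id: str
    evidence: str
    reference: str
    source_protein: str = ""  # protein from which the term was transferred


def transfer(source, target_protein, orthology_reference):
    """Create the annotation that results from dragging a term onto a protein."""
    return Annotation(
        protein=target_protein,
        go_id=source.go_id,
        evidence="IC",  # inferred by curator, set automatically
        reference=orthology_reference,  # URI or paper for the orthology software
        source_protein=source.protein,
    )


# Made-up identifiers, for illustration only.
original = Annotation("speciesA:p1", "GO:0000001", "IDA", "REF:1")
copied = transfer(original, "speciesB:p1", "REF:orthology-software")
print(copied)
```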

This tool is primarily aimed at the reference genome groups, although we anticipate that others will also find it useful. It will be of particular use in comparing the annotations for orthologs of human disease genes with the human disease related genes themselves. We are agnostic as to the source of the ortholog sets, and will work with any group who is interested by providing the necessary protein sets. It is obviously very important, for comparability, that the reference genome groups and the orthology building groups use identical protein sets. The TreeFam group (see R. Durbin, Letter of Support) ([TREEFAM], Li et al. 2006) has indicated an interest in working with the GOC on this effort. TreeFam provides a full MySQL version of its orthology builds. The GOC will download this and install it in the GO database. The annotation tool will access this database to retrieve the orthology sets for display. We will provide TreeFam, and other interested groups, with updates of the protein sets for new orthology builds every quarter. The protein sets for any number of organisms may be included in the build, and the tool will have the ability to filter out organisms that are not of interest to the annotator. We believe that this tool will have applications far beyond facilitating annotations and many of our end users (e.g. bench scientists) will find the tool useful for their research.

3 Summary of Aim 2.

Our objective here is to comprehensively annotate reference genomes in as complete detail as possible. We will develop metrics and quality controls to minimize the gap between what has been annotated and what could possibly be annotated. We will provide visualization and editing tools to make the process as efficient as possible. We will involve biologists with specific biological expertise to provide the most detailed annotations possible for these genomes. We will respond to community initiatives to completely annotate specialized sets of gene products as needed by the community.

Figure 11. A mockup of an "orthology" annotation tool based on a TreeFam view. Users will be able to select an ortholog set using a wide variety of methods, as sketched in the box at the bottom of the picture: by family name, family identifier, family accession, gene name, gene identifier, sequence accession, sequence match, GO term, or GO identifier. The interface will display the selected tree and users will be able to bring up individual panels showing all of the current annotations of each protein (e.g. NOTC1 for mouse and rat as shown here). More detailed information about an individual protein’s annotations will be shown in a summary panel (“Gene details”); users can click on these links to bring up the evidence that is referenced for that annotation. Users will be able to transfer annotations from one protein to another simply by clicking on a term in one protein’s panel and dragging it onto another.

3 AIM 3. WE WILL SUPPORT ANNOTATION ACROSS ALL ORGANISMS.

The objective of Aim 3 is to encourage the use of the GO for the annotation and curation of as wide a spread of sequences as possible—viral, prokaryotic, fungi, protists, animal, and plant. The sequences may be 'complete' or draft genomes, or collections of EST, cDNA or protein sequences. This aim will be achieved by providing the community with suitable tools, training and documentation as well as with annotations from the reference genomes (see Aim 2) to support annotation efforts. We will work with the community to jointly develop the ontologies in specialized domains, where appropriate.

The outlook, with respect to genomic sequencing projects, is quite daunting: The GOLD site lists (as of 1/27/2006) 1867 genome projects: 339 published completed genomes, 914 ongoing prokaryotic genomes, 588 ongoing eukaryotic genomes and 26 metagenomes ([GOLD]). Many eukaryotic genome projects have been funded, but have yet either to commence or to be finished. 97 genomes were made available in 2005 (59 in 2004, 49 in 2003). All of these data reside, or will reside, in the International Sequence Data Library (i.e., GenBank/EMBL/DDBJ) [INSDC], and without focused effort, they will vary greatly in the depth, quality, and comparability of annotation. However, GenBank and its collaborators at the EBI and DDBJ will reflect the data submitted to them by the genome projects. If the current situation persists, inconsistent implementation of the INSDC Feature Table and inconsistent annotation of the functions (known or predicted) of gene products will make these data increasingly difficult to use in any meaningful way, both for bench biologists and for computational biologists. The GOC's aim is to work with the relevant communities to ensure that both genomic features and the functions of gene products are represented in a uniform way, and in such a way as to encourage their analysis, both small and large scale. This basic framework is illustrated in Figure 12.

A basic toolkit of training material. The role of the GOC is to support the comprehensive annotation of emerging genomes. It can only do this by working with other communities. The GOC will identify these communities by a variety of methods (see D.3.a and D.3.b). We will support them in the following ways:

1. We will provide a corpus of reference annotations from the reference genomes. These will be sets of proteins (as FASTA files) and their associated GO annotations, clearly identified as being determined by experiment, sequence similarity or computation (using GO evidence codes, see [GO.evidence]). These can, if appropriate, be broken down by taxonomic group. We will also provide groups with data from the automatic annotation of proteins by the GOA group at the EBI (through mappings of InterPro domains), and by TIGR using similar methods. HAMAP Protein Families [HAMAP] are also mapped to GO terms. These data will be very useful for the GO annotation since they are manually curated families of orthologous prokaryotic proteins (1256 have now been made).

2. We will provide a vehicle for discussion and debate on the appropriate tools (see [GO.tools]) for the prediction of GO terms from the reference set to novel proteins by transference. We will provide best practice methods for the curation of GO terms from the literature (see also #4, below).

3. We will provide tools for comparing GO annotations of cognate ('orthologous') proteins from related taxa as described in Aim 2. There is, as yet, no widely accepted standard for computing orthologous proteins across taxa. The GO, and most of the MODs, adopted InParanoid because of its availability; Sonnhammer's willingness to work with us; the project’s well-supported web and FTP sites; and its regular data updates. However, it is clear that tree-based methods have certain advantages to the GO, for example to visualize relationships between proteins and for incorporation into new curatorial tools (see D.2.b.v). Two projects have expressed their interest in working with the GO, Phylogenetically Inferred Groups [PHIGS], led by Jeff Bourne of the JGI and Berkeley, and TreeFam [TREEFAM] led by Richard Durbin (Sanger) and colleagues at the Beijing Genomics Group (see Letters of Support). For reasons of comparability it is important that all three of these databases work with common protein data sets and the GOC will coordinate the availability of these. TreeFam is, at the moment, concentrating its efforts on animal protein families, using Arabidopsis, Saccharomyces and Schizosaccharomyces as outgroups. PhIGS does not yet include either plants or bacteria, but their inclusion is planned. Other computed orthologous protein sets are provided to the community through the Homologene [Homologene] and Compara [COMPARA] resources. These complement the use of the InParanoid data, and provide the widest availability of comparative GO annotations. The GOC already has considerable experience in working cooperatively to provide curated ortholog sets in the context of mouse, rat and human (Blake et al. 2006).

4. We will continue to maintain, and develop as necessary, mappings of the GO to other systems of classification. Eighteen of these are now available [GO.mappings], and several, e.g. interpro2go, metacyc2go, reactome2go, hamap2go, spkw2go, are updated on a regular cycle (monthly, or less). These mappings are very useful when inferring appropriate GO annotations to genomes that have been annotated with other terminologies. When new terminologies are published we will, in concert with their authors, map these to the GO.

5. We will provide hands-on training for new groups who wish to annotate with the GO. This training will involve not only specific training on the use of GOC created tools, and tools from our associates (such as TextPresso [TEXTPRESSO] and the EBI and TIGR pipelines) and the use of the GOC's SourceForge site for requesting new or changed GO terms ([GO.sf]) but also training on the GOC's philosophy of annotation. This will be done in three ways.

i. By arranging open annotation workshops, similar to those held in Cambridge (June 2004) and Palo Alto (June 2005) [GO.meetings]. We will organize these workshops on an annual basis, alternating their location between Europe and North America. These will be open to all, and advertised via the GO-friends and other email lists, on the GOC's homepage, and at meetings.

ii. By personal interactions between the GO and new genome groups. This will involve both outgoing working visits (to the genome group) and incoming working visits (from the genome group to either the GO office at the EBI or to an existing GO curation site).

iii. We will collaborate with other organizations that conduct training in the area of genome annotation, for example TIGR [TIGR.training]. GO staff will continue to provide lectures and tutorials at suitable venues, e.g. the Plant & Animal Genomes Meeting, ISMB, Oxford Genomics Course, GO Users Meetings etc. (see Appendix 12).

6. Each new genome group will be assigned a 'mentor' from either the GO editorial team or from the GO annotation team. The 'mentor' will be chosen on the basis of 'taxonomic affinity'. For example, mentors for new fungal genomes would be drawn from SGD, those for new insect genomes from the Cambridge (UK) FlyBase and those for new mammalian genomes from MGI (see D.3).

7. We will write, publish and maintain a 'standard operating procedure' for the computational annotation of genomes with the GO. This will be drafted by the GOC, but we will rely heavily on advice from groups that have recently begun to use the GO for annotation (e.g., dictyBase, Gramene, CGD) and large-scale providers (e.g., TIGR, GOA).

8. We will encourage and enable new groups to submit their annotations, as GO gene association files, to the GO repository. This will ensure that they can then be viewed and queried in the context of all other GO annotations, either using a GO browser (e.g. [AMIGO]) or by direct query of the downloadable GO Database [GO.database].
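The use of mappings described in item 4 can be illustrated with a short sketch that reads a mapping file and proposes GO terms for proteins already annotated with another terminology. It assumes one mapping per line in the form "external identifier, optional name, ' > ', GO term name, ' ; ', GO identifier", with comment lines starting with '!'. The function names and all identifiers in the example are hypothetical, and terms inferred this way would need an appropriate evidence code.

```python
def parse_mapping(lines):
    """Return {external_id: set of GO identifiers} from mapping lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!") or " > " not in line:
            continue
        left, right = line.split(" > ", 1)
        external_id = left.split()[0]  # drop the optional descriptive name
        go_id = right.rsplit(";", 1)[-1].strip()
        mapping.setdefault(external_id, set()).add(go_id)
    return mapping


def infer_go(protein_features, mapping):
    """Propose GO identifiers for each protein from its external features."""
    return {
        protein: sorted(
            set().union(*(mapping.get(f, set()) for f in features))
        )
        for protein, features in protein_features.items()
    }


# Made-up mapping lines and identifiers, for illustration only.
lines = [
    "!example mapping file",
    "ExtDB:X0001 example family > GO:example term one ; GO:0000001",
    "ExtDB:X0001 example family > GO:example term two ; GO:0000002",
    "ExtDB:X0002 > GO:example term three ; GO:0000003",
]
mapping = parse_mapping(lines)
proposed = infer_go({"p1": ["ExtDB:X0001"], "p2": ["ExtDB:X9999"]}, mapping)
print(proposed)
```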

[pic]

Figure 12. GO annotation for emerging genomes. The GOC will provide the materials and training for those elements of the diagram in orange.

In the next sections of this proposal we will detail our priorities for the GO outreach to 'new' communities. In many cases we have already made contact with these communities, as noted below. We will also discuss where we see the major hurdles to progress, and how these may be overcome.

1 The players: Viral and prokaryotic genomes.

There are several reasons why a comprehensive annotation of prokaryotes with GO terms is needed. One is that much of our fundamental understanding of metabolism comes from experimental studies with bacteria. Much of this knowledge has been transferred to other organisms, on the basis of sequence similarities, and often its original experimental basis has been lost. Many microbes interact with eukaryotes, and an understanding of microbial gene function is vital to our understanding of the functions of eukaryote gene products that respond to microbial activities. Understanding these interactions is, of course, vital for human health, but is also very important in the field of agriculture. The microbial world is an extraordinarily rich resource of metabolic diversity. The representation of this diversity by GO annotation will be one way in which it can be explored and compared. It is for these reasons that the GOC will encourage the greater use of GO annotation for prokaryotes and viruses.

As mentioned in the Progress Report, the penetration of the GO for the annotation of prokaryotes and viruses has not been as extensive as we desired. There are several reasons why this is so. The first is that these 'communities' are far more diverse and fractured than those for the 'higher' organisms. In addition, for any one species, the community is usually very small. The second is that, for many, Riley's classification (Riley, 1993; [MultiFun]) and its derivatives have seemed to be satisfactory. The third is that the major microbial sequencing centers have a production process (of sequencing and annotation) whose funding usually does not afford the time for detailed curation of gene functions. This we can address with the development of new tools. Finally, there are, at present, structural issues within the GO ontologies that make the GO less than ideal for some groups annotating bacterial genomes (see below).

There are now many 'rival' consolidators of genomic data for prokaryotes. These include the TIGR Comprehensive Microbial Resource [CMR], the HAMAP [HAMAP] project of the Swiss Bioinformatics Institute and SwissProt, Integr8 [Integr8], JGI's Integrated Microbial Genomes [IMG], the Fellowship for the Interpretation of Genomes [FIGS], the Argonne National Lab.'s PUMA group [PUMA2] and the BioCyc family of databases [BioCyc], and many with a more restricted taxonomic view.

The annotation of microbial (prokaryote and fungal) genomes in general has been the subject of a recent report by the American Academy of Microbiology (2004). This report identified, as a new initiative, a database of annotations which would contain: "Predictions regarding the function of genes of unknown function, deposited by bioinformaticians, based on computationally inferred clues …" Should this initiative get off the ground the GOC would be delighted to collaborate with it. We are also aware of initiatives such as that of the Natural Environment Research Council in the UK [egenomics] and the US National Research Council's new project on metagenomics [NRC], to be chaired by J. Tiedje and J. Handelsman (and of which Ashburner is a member).

Another major initiative is that from NIAID, to sequence and annotate over 50 bacterial or viral pathogens and their vectors. The annotation and curation of these sequences will be the responsibilities of the 8 funded Bioinformatics Resource Centers [BRC]. We already have close links with one of these, VectorBase [VECTORBASE] responsible for the curation of the sequences of the insect vectors of disease (Ashburner is a member of their advisory board), and there are very clear common interests between VectorBase and FlyBase. A very fruitful meeting with Dr. V. Di Francesco, responsible for the NIAID BRC's, was held in November 2005. Dr. Di Francesco confirmed that it was NIAID policy to annotate 'their' genomes with the GO, and members of the GOC will attend a meeting of the BRCs in February to take this forward.

Ten bacterial genomes have been manually curated (by TIGR), although the computational annotation of many more has been done. The TIGR Comprehensive Microbial Resource [CMR] has 241 completed bacterial and viral genomes, including both their original annotations and those assigned by TIGR to a TIGR role (which is mapped to the GO) and to GO terms. Thus, a search of the CMR with GO:0004764 ; shikimate 5-dehydrogenase activity will yield over 200 records from over 120 eubacterial genomes [TIGR.search]. Mapping enzymes is, of course, relatively low hanging fruit, because nearly all annotations include EC numbers. However, with the TIGR CMR, this mapping is also good for non-enzymatic functions or processes, e.g. GO:0006355; regulation of transcription, DNA-dependent.

Useful as the annotation from the CMR is, it is, with the exception of 10 genomes, automatic, and has not been assessed by an expert curator. This is because of the limited resources available to TIGR for this project. It is the GOC's task to work with TIGR, and similar groups, to make automatic annotation more rigorous, by developing and promoting suitable tools.

Another large project, the Integrated Microbial Genomes resource at the Joint Genome Institute [IMG], has a broadly similar aim to that of the CMR: to provide consistent annotations of public microbial genomes. It now contains data from 289 bacterial and archaeal genomes, plus 260 viral genomes. The IMG is not yet using GO for primary annotation, although it does map to the GO. For instance, a search of the IMG for shikimate 5-dehydrogenase activity, with a GO term filter applied, returns 305 gene products. We have had discussions with the IMG group about using GO, especially for the primary annotation of those microbial genomes (44 finished so far) sequenced at the JGI. One of the main stumbling blocks for this group is the absence, in the GO, of any links between the molecular function, biological process and cellular component ontologies. We have addressed this issue elsewhere in this proposal, and state there that instantiating such links is now a priority for the GOC. We will continue to work with the IMG group to resolve these issues. The PUMA project at the Argonne National Laboratory [PUMA2] includes the computational analyses of 213 prokaryotic genomes, including GO data from the GO Database. N. Maltsev is now mapping their Functional Overview [PUMA2.over] to the GO (personal communication).

The Sanger Institute's Pathogen Sequencing Unit [PSU] is another major microbial sequencing center; it has finished 45 genomes and has a further 27 in progress [PSU.microbes]. Although the PSU uses the GO for its primary curation of its eukaryotic genomes (e.g. Schizosaccharomyces, Plasmodium, Leishmania, Trypanosoma), it uses various home-constructed versions of Riley's classification for the annotation of its prokaryotic genomes. We have had much discussion with J. Parkhill (Sanger, Pathogen Sequencing Unit) as to how the GOC can aid the transition to the use of the GO for this annotation. The reason why this is not done now is that the Unit cannot afford the time to both change annotation systems and maintain its throughput of genomes. We have already made a subset of the GO for prokaryote genomes [GO.prok], and we will work with the PSU to produce GO annotations (by translation from EC numbers and their 'Riley' annotations) of their genomes.

Although TIGR, the JGI and the Sanger are major prokaryote sequencing centers, many of these genomes are being sequenced and annotated by smaller communities of biologists. It is critical that the GO reach out to these. A start has been made with the Pseudomonas community [Pseudomonas], with the E. coli community and with the NIAID's Bioinformatics Resource Centers [NIAID] (see above). We regard the curation of the E. coli K-12 genome with GO terms, using not only transitive annotation from that already existing, but also curation of the experimental literature, as a very high priority (see Aim 2). We have also recently started to work with ACLAME, a database of mobile genetic elements ([ACLAME]). This database already uses the GO [ACLAME.fun], but requires many new terms to cover its domain. We have an ongoing project to include these; in addition we are working with this group to incorporate their annotations in the GO database. The community annotation project ASAP (Glasner et al. 2006) uses the GO for the manual curation of genomes, with an emphasis on the enterobacteria (15 genomes), as do some of the single prokaryotic organism databases, e.g. that for Listeria ([LEGER]).

In summary, working with TIGR, the JGI and the Sanger's PSU, and with the new E. coli K-12 database (see Aim 2), are the GOC's major priorities under this Aim. We will work with other microbial communities on a more opportunistic basis. The larger the corpus of genomes with GO annotations and GO curations (i.e., annotations validated from the literature or by human analysis of predictions), the easier it will be for all biologists to use the GO for their own work. We also expect that the methods that we develop with these major groups will be very helpful to all others who could annotate prokaryote genomes with the GO.

2 The players: Eukaryotic genomes.

All of the major eukaryotic model organisms are now annotated with the GO, and our plans for increasing the quality and depth of these annotations, including assessing their quality, have been discussed in Aim 2. Of the emerging eukaryotic genomes the great majority are phylogenetically close to one of these models, whose database will act as mentor. The following (Table 8) is an estimate, from various published resources, of the taxonomic spread of these emerging genomes:

|Group |# Species |Cognate GO group |

|Protozoa |26 |TGD via SGD, GeneDB |

|Porifera, Placozoa |2 |SGD |

|Fungi |73 |SGD |

|Amoebozoa |4 |dictyBase |

|Cnidaria |4 |SGD |

|Platyhelminthes |2 |WormBase |

|Annelida |2 |FlyBase |

|Mollusca |5 |FlyBase |

|Nematoda |11 |WormBase |

|Arachnida |1 |FlyBase |

|Crustacea |5 |FlyBase |

|Insecta |23[8] |FlyBase |

|Echinoderms |2 |ZFIN |

|Early chordates etc |5 |ZFIN |

|Vertebrates |48 |ZFIN, MGI |

|Algae & plants |43 |TAIR, Gramene |

Table 8. Estimate of number of genomes being sequenced and corresponding GOC mentor group.

We are already in close contact with several of these emerging genomes and have advised several with respect to GO annotation. These include Tetrahymena [TGD] (maintained by M. Cherry), Chlamydomonas [ChlamyDB], Xenopus [Xenbase], Physcomitrella (B. Mischler, see letter of support), the honey bee [BeeBase], VectorBase [VECTORBASE], and the farm animal genome community (see below).

The JGI's Eukaryotic Genomes group [JGI] has sequenced a number of eukaryote genomes (and has many more in its pipeline). These (Chlamydomonas reinhardtii, Ciona intestinalis, Fugu rubripes, Phanerochaete chrysosporium, Phytophthora ramorum, Phytophthora sojae, Populus trichocarpa, Thalassiosira pseudonana, Trichoderma reesei and Xenopus tropicalis) are all automatically annotated with GO terms. We will work with the JGI (see Letter of Support from D. Rokhsar) and, where they exist, the cognate database groups (e.g., ChlamyDB, XenoDB) to improve the depth and accuracy of these annotations and to import them into the GO Database.

Resources will limit our work with all of these genome groups; our priorities will therefore be set by the criterion of the existence of a community database. In general, if a community database exists and is maintained, then (a) there will be a sufficient community of researchers for there to be the resources for genome curation, and (b) we have a single point of contact.

We have identified the following databases (Table 9) as our first target list (in addition to those listed above). Many of these already predict GO terms, or have published plans to do so, and our priority in these cases is to encourage them to submit these annotations to the GO database and to work with the groups to improve the quality and depth of their annotations, both by computation and by manual curation. In some cases additions to the GO itself will be required, e.g. in the area of silica metabolism for the diatoms, and in secondary product metabolism for the Aspergilli. We will work with these groups to achieve these additions:

|Organism |Database |Organism |Database |

|Protozoa | |Plants | |

|Cryptosporidium |CryptoDB * |Antirrhinum |DragonDB |

|Toxoplasma gondii |ToxoDB |Brassica |Brassica.Astra (EST) * |

|Trypanosoma cruzi |TcruziDB |Glycine |SoyBase |

|Giardia lamblia |GiardiaDB |Leguminaceae |LIS * |

|Cnidaria |StellaBase, Cnidbase |Solanaceae & Coffee |SGN * |

|Diatoms | |Forest trees |TreeGene |

|Phaeodactylum tricornutum |PtDB (EST) * |Zea mays |MAIZE, PANZEA |

|Thalassiosira pseudonana |TpDB (EST) * |Annelids |Lumbribase * |

|Fungi | |Nematodes |Nema |

|Ashbya gossypii |Ashbya Genome Database |Pristionchus pacificus |PRISTO |

|Aspergillus nidulans |Aspergillus Genome Database |Insects | |

|Neurospora crassa |Neurospora crassa Genome Database|Bombyx mori |SILKDB |

|Phytopathogenic fungi |COGEME (EST) |Spodoptera |SPODOBASE |

|Oomycetes |OGD *, VBI * |Tribolium |BeetleBase |

|Phytophthora |PGD * |Vertebrates | |

| | |Medaka |MBASE (EST) |

Table 9. Organisms and Databases representing emerging genomes and possible GOC collaborations. Asterisk identifies resources that already provide the GO.

The farm animal community presents the GOC with a particular challenge, because of its breadth and, in some cases, rivalries between different database groups. However, working with this community we have already formed a 'Farm Animal' GO Interest Group [GO.farm] (with a corresponding email list with 44 subscribers) and we have close contacts with the Ark family of databases [ARK] and the US Livestock Genome Projects [AGO]. The curation of GO annotations for the chicken has already begun under the auspices of the US Poultry Genome Project [CHICK] and for the cow by the AgBase group [COWGO] and by the Bovine Genome Database [BGD]. The GO annotation of the bovine genome has just been released by the EBI GOA group (in collaboration with [COWGO] and [BGD]). A proposal for a GO curator for the chicken genome who will be located at the EBI is being submitted to the UK BBSRC (D. Burt, pers. comm.). AgBase, undertaking GO annotations for a series of farm animals, is now a GO Associate.

3 Annotation refresh tools.

Another challenge we face is keeping GO annotations up to date. For well-supported MOD projects, refreshing the annotations is an integral part of their normal operations. However, for many of the newer groups the original annotation effort will be their sole opportunity to describe these gene products. Even if the gene product set remains the same, changes to the GO and additional annotations from other organisms will eventually erode their value. Annotations, particularly automated annotations, must be regularly updated.

We do not, however, intend to build an automated compute pipeline, or a sophisticated Natural Language Processing (NLP) tool, to do this. There are many other groups working on automatic annotation methods, and we intend to use them. We propose an annual annotation challenge, much like CASP ([CASP]), the BiocreAtIve initiative (Critical Assessment of Information Extraction Systems [BioCreative]) or the TREC-genomics ([TREC]) competitions. Over the course of each year we will identify particular groups of gene products to use as test cases. For the challenge we will provide annotation datasets that include all gene products, but omit the annotations for the selected test set. Our contribution (in addition to the dataset) will be utilities for evaluating the results. The most basic utility will compare annotation results from alternative methods, provided that all methods have been applied to the same gene products and the same version of the GO. This utility will indicate when two independently derived annotations are identical, differ in granularity, or appear to contradict one another (which in some cases may be biologically valid). Each year we will use the winning method (or methods) to refresh annotation sets whose usefulness would otherwise erode with time.
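To illustrate the kind of comparison such a utility would make, a minimal sketch follows. It is not the GOC's evaluation software: the toy parent map, the placeholder term identifiers, the function names, and the rule that two terms "differ in granularity" when one is an ancestor of the other are all assumptions made for this example.

```python
# Toy parent map (is_a / part_of links); identifiers are placeholders, not GO terms.
PARENTS = {
    "T:root": set(),
    "T:binding": {"T:root"},
    "T:dna_binding": {"T:binding"},
    "T:catalysis": {"T:root"},
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def compare(term_a, term_b, parents=PARENTS):
    """Classify two annotations made to the same gene product."""
    if term_a == term_b:
        return "identical"
    if term_a in ancestors(term_b, parents) or term_b in ancestors(term_a, parents):
        # One term is a more general form of the other.
        return "granularity"
    # Unrelated terms: flagged for review, since both may be biologically valid.
    return "possible_conflict"
```

A pair flagged as `possible_conflict` would still need expert review, because a gene product can legitimately carry annotations on unrelated branches of an ontology.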

4 Metrics.

The purpose of Aim 3 is to widen the use of GO to organisms whose genomes have been sequenced (or which have large EST collections), and to include these annotations in the GO Database. Ideally, we would like to see these emerging genomes being manually curated with GO terms, rather than just annotated by some computational method. We realize, of course, that many of the responsible database groups simply do not have the resources to do this. Our primary objective must therefore be to work with these groups to ensure that they annotate with the most reliable methods, and to provide them with tools which will help them assess the quality of their GO annotation.

We will measure our success by two metrics: The first will simply be the number of new organismal groups submitting curation or annotation to the GO Database. The second will be our own in-house evaluation of the quality of these data. Criteria for submission of GO annotations to the GO Database are detailed in Section D.4.d.i. Initial annotation sets will be evaluated through the GO mentor and via the standard GO annotation quality control system as discussed previously.

5 Summary of Aim 3.

Our objective here is to support functional annotation using the GO for all organisms. We will provide a basic toolkit of training material, standard procedures, tools, documentation, and reference data to new genome groups for automated functional annotation. We will offer biannual training sessions in methods of annotation. Curators, from the GO annotation team, will mentor groups who are interested in manual annotation. We will support and encourage the submission of resulting functional annotations to the central GO repository to increase its taxonomic breadth.

4 Aim 4. We will provide our annotations, tools and training materials to the research community and support widespread use of GO resources.

Our final aim is to disseminate GO resources to the research and education community. We will support the use of the GO and GO annotations by all researchers in biology. There are two major components of this objective. Firstly, we will work to increase our understanding of the needs and expectations of researchers studying specific topics or using particular methods. Secondly, for each of these groups, we will incorporate and enhance methods to provide the ontologies, annotations, tools and supporting documentation and resources.

Background: The biological community is a large and diverse group of scientists. Although it is difficult to determine an accurate measure of its size, there may be well over 1.5 million researchers in the biological research community (Newman 2001). The GO system fulfills many of the requirements for sophisticated data retrieval and analysis platforms needed for biological research today. However, the development and maintenance of the ontologies, as well as the curation of gold-standard annotations by highly trained biologists, is costly, and these efforts are futile if the GO and its annotations are not widely adopted and used both for descriptions and for data analysis. Currently, only a minority of the biological research community are using GO to its fullest potential. Moreover, many users of the GO annotations do not take full advantage of the evidence code system, do not fully understand either its power or its limitations, and are often not aware of the powerful tools and applications developed to exploit this resource of functional information.

1 We will determine community needs and priorities.

We will employ a variety of communication techniques and technologies to facilitate and enhance our ability to communicate with our users, to understand their priorities, to track and respond to their insights and suggestions, and to aid in the use of the GO for biological research. Each research community has differing priorities with regard to its need for resources provided by the GOC. For example, experimental biologists responding to our survey request improvements in the completeness of annotations, whereas ontology developers request partition of the ‘part-of’ relationships, others request better integration of GO with representations of biological pathways, and still others desire enhancements to the project’s web interfaces.

We will conduct usability studies of the user interfaces. This is a design exercise to learn how effectively users find the information they desire. Such a project requires observation of users exploring the web site while they verbally describe what they are doing to an observer. The observer does not provide assistance to the user during this observation. The goals of this exercise include determining the assumptions users bring to the web site, such as: where they would expect to find a particular type of information, how they would expect the information to be described or linked to other information, whether the provided hyperlinks are appropriate and sufficient for their needs, and whether they understand the words (i.e. headings or hyperlink tags) on the interface. These observations are of a small group of individuals. This is not a scientific study of the usability of the site; rather, the goal is to gather information that is used to improve the interface through a better understanding of the users’ perspective. These observations must happen in conjunction with the interface development and occur regularly. We are more interested in the reasoning of general users than in that of people who know and use the site regularly. Regular users are important, but for this purpose they are less useful, because they will have learned how to find information even if the design is poor. Observations of this type will not only aid in the interface design but will also identify new features that would be useful to include.

We will continue to use the SourceForge Tracker site to allow the community to suggest changes. We currently provide five tracking systems on various topics at SourceForge. These include Ontology Requests, Annotation Issues, AmiGO Requests, OBO-Edit Requests, and GO Website Requests. These tracking systems are very useful as they allow members of the Consortium and users to submit requests and suggestions and to see the other requests that are active. A hyperlink to the Ontology Requests tracker is provided from the AmiGO and HTML web pages. AmiGO also provides a hyperlink to the AmiGO Requests tracker. Requests to these trackers are triaged by the GO Editorial Office staff and assigned to an appropriate person(s) within the GOC.

We will have GO call-in sessions to provide a forum for discussion of any of the GO resources. The motivation for these sessions is to allow open discussions to occur at more frequent intervals. We currently have many discussions, internal and with our users, via email. However, we have discovered that telephone conferencing is more appropriate for complicated topics and for issues where concern for someone’s feelings is important. Interpersonal communication etiquette allows difficult topics to be discussed more easily in person and via conference calls than via email. Statements can be made on the conference call to check that the meaning of a statement is clear before going into more detail. The call-in session will allow individuals who are not secure with the terminology to pose a question without worrying about the way they ask it. Finally, there are many who would like to have more information and don’t know where to start. A call-in session allows these individuals to participate by just listening to the discussion. We will also use Virtual Network Computing [VNC] technologies for communication between groups. We have recently been using VNC between members of the OBO-Edit working group with good success.

We will provide an on-line questionnaire and continue to solicit specific requests from the communities that use the GO resources on a regular basis. The response to the survey that was conducted in October 2005 was larger than we anticipated. Conducting regular surveys on particular topics will assist our tools and interface development projects by indicating how important a new feature is to a tool's potential users. The questionnaires and requests for input are particularly necessary as GO begins to be used by new communities. These communities are likely to have needs that have not been expressed to the GOC before and thus would not be on our radar.

2 We will support the use of the GO by all researchers.

We recognize that different groups of users require different types of support. We will concentrate on support for research groups in functional genomics (large-scale, high-throughput projects), bioinformatics (including the NLP community), and researchers investigating smallish numbers of gene products. We will determine and respond to the needs and priorities of each group. We will identify priorities, provide appropriate access, tools and documentation, and support the use of the GO and of GO annotations in the particular research communities. We will coordinate with other biological database research groups to integrate GO structures and annotations within the biological database community. For each group, we will carefully assess and implement necessary and appropriate measures to facilitate communications between the user community and the GOC, to provide the ontologies and the annotations in useful formats and representations, and to support compatible access to GO tools, documentation and resources.

1 We will support research groups working with large-scale high-throughput projects.

GO annotations are currently the best way to analyze high-throughput data, including the systematic categorization of analyses into their broader biological context, and to provide statistically robust means of assessing the quality of high-throughput data. Microarray analysis of transcript abundance was the first type of high-throughput functional genomics analysis. There are now many other ‘omics’ technologies. We will support the continued use of the GO ontologies and annotations for data analysis for high-throughput data sets.

Communication: This user community is currently well connected with the GO via the MODs. Continued outreach is required to assist these researchers as the experimental and analysis methods continue to be enhanced.

Access to ontologies: The GOC will continue to provide its ontologies to all without restrictions other than our request that the GO not be altered and then redistributed under the same name, and with the same namespace. The ontologies are available for download in OBO format, the (deprecated) GO flat file format, XML format, RDF-XML format and OWL format. On average during 2005 the main ontology file, in OBO format, was updated twice a day by the Editorial Office. The ontologies are provided in other formats on the FTP site, and these are converted from the OBO file daily. The files can be downloaded dynamically from the Web site, by anonymous FTP, or from the GOC's CVS server.
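As an illustration of how a downloaded OBO-format file might be consumed, the following is a minimal sketch that collects the tag-value pairs of each `[Term]` stanza. The sample text and its `EX:` identifiers are invented for the example, the function name is hypothetical, and the parser ignores most of the format (escapes, trailing modifiers, other stanza types), so it should be read as a starting point rather than a complete reader.

```python
# Invented sample in OBO style; the EX: identifiers are placeholders.
SAMPLE_OBO = """\
format-version: 1.0

[Term]
id: EX:0000001
name: example root

[Term]
id: EX:0000002
name: example child
is_a: EX:0000001 ! example root

[Term]
id: EX:0000003
name: example part
relationship: part_of EX:0000001 ! example root
"""

def parse_obo_terms(text):
    """Return {term id: {tag: [values]}} for each [Term] stanza."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            continue
        if current is None or not line or ":" not in line:
            continue
        tag, _, value = line.partition(":")
        value = value.split(" ! ")[0].strip()  # drop a trailing "! comment"
        current.setdefault(tag, []).append(value)
        if tag == "id":
            terms[value] = current
    return terms
```

Calling `parse_obo_terms(SAMPLE_OBO)` gives one dictionary per term, from which the specialization (`is_a`) and `part_of` links can be read.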

Access to annotations: We will provide GO annotations for all genes. One of the priorities for researchers is the incorporation into the GO annotation sets of information about gene products for which there is no annotation. That is, they seek to know whether a gene product is missing GO annotation because of a lack of annotation or because of the lack of any information. We will now include, in the GO annotation files, all gene product features from each genome, indicating whether they have been assessed for functional information and are poorly characterized, or whether they have not (yet) been assessed and there is simply a lack of annotation.

Tools and Resources: We will continue to post information about tools for data analysis using the GO on our website. Most researchers will access the GO and its annotations via software tools that help them analyze their datasets. Indeed, a large number of tools that incorporate the GO have already been developed ([GO.tools]). We will continue to be the conduit between tool developers and users of the tools. For example, we will continue to maintain the GO tools page to help disseminate software that incorporates the GO to the research community.

|Research Group |Listen to Community |Provide Ontologies |Provide Annotations |Provide Tools and Support |

|Functional Genomics (whole genome and large-scale studies) |Questionnaires and Meeting Workshops; Web email access to GO support group; GOFriends email; specialty mailing lists |Formatted ontology files; OBO, OWL, OBO RDF, XML, … |Download data files from GO HTTP, FTP and CVS servers |Documentation; GOFriends mail list; Call-in teleconferences |

|Investigation of small sets of genes | | |AmiGO browser access to GO database |Above and special interest groups and associated mailing lists. |

|BioInformatics (including NLP) | | |Downloadable data files: annotations, ontologies, FASTA protein sequences, database table dumps. Formatted in OBO, OWL, XML, FASTA, RDF, mySQL, … |Downloadable database; files and documentation; GO database API. |

|Bioinformatics & Database Groups | | |All of the above. |Annotation and special interest mailing lists. |

Table 10. General division of tasks and products to support use of GO by major research groups.

2 We will support individual researchers investigating smallish sets of genes.

This user community is a heavier user of Internet resources as an access point for information. We will provide email access to GO editors to answer user questions and concerns. We will provide FAQs and call-in teleconferences to aid users to connect with the GO.

Access to ontologies: This group is also an important source of specialized domain expertise. Thus members of this community will continually be recruited to participate in ontology development. Also because of their specialized expertise this community often provides specialized databases with annotations for all members of a single protein family across all species.

Access to annotations: The GO annotators will actively seek out these annotations and work with their providers to have them incorporated within the GO. There is an added benefit of this effort as these specialized annotations will also enhance the annotations available from the MODs.

Access to tools and resources: The AmiGO interface allows a set of gene identifiers to be input and we will continue to improve resources that work on groups of annotations. Several MODs provide tools to assess the statistical relationship of the annotations within a set of genes. Tools such as GOTermFinder at SGD (Boyle et al. 2004) are an example of the type of tool useful for these users.
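A common way for tools of this kind to assess whether a GO term is over-represented in a gene set is a hypergeometric test. The sketch below is a minimal illustration under that assumption; the function name and its parameters are hypothetical, it is not the code of any particular tool, and it applies no correction for testing many terms at once. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb

def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 6000 genes in the genome, 40 annotated to the term,
# and 8 of the 100 genes in the user's list carry that annotation.
p_value = hypergeom_pvalue(6000, 40, 100, 8)
```

A small `p_value` suggests the term occurs in the list more often than chance would predict; in practice the result would be adjusted for the number of terms tested.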

3 We will support the bioinformatics community

We will work with bioinformatics research groups to facilitate their use of the GO to investigate the structure and utility of ontologies for data representation and exchange.

Communication: We will hold a GO users meeting specifically for software developers (many of whom have already attended GO users meetings). We will continue to facilitate work by the semantic mining community through our publicly available curated datasets. Because GO annotations are supported with citations to the biological literature, the GO annotation sets provide an important resource for the testing of NLP algorithms to retrieve relevant information from text. These data provide a measure for assessing information retrieval systems in the genomics domain, such as the BiocreAtIve initiative (Critical Assessment of Information Extraction Systems [BioCreative]) and the TREC-genomics competition ([TREC]). We will provide gold-standard data sets as requested and will continue to actively collaborate with this community.

Access to ontologies and annotations: As with the high-throughput users, this community will primarily access the GO and GO annotations via downloads from the GO database.

The GO Database includes annotations of gene products with the GO provided by members of the GOC. Each member of the GOC uploads to the GO CVS server repository a TAB-delimited flat file (known as a gene_association file). The format of this file is described in [GO.ga].
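As an illustration, a tab-delimited file of this kind can be read with a few lines of code. The sketch below assumes a 15-column layout and lines beginning with `!` as comments; the column names are our own labels, and the sample row is invented. The authoritative description of the format is the one given in [GO.ga].

```python
import csv
import io
from collections import namedtuple

# Column labels assumed for this sketch; see [GO.ga] for the definitive layout.
COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon", "date", "assigned_by",
]
Annotation = namedtuple("Annotation", COLUMNS)

def read_gene_association(handle):
    """Yield one Annotation per data row, skipping '!' comment lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue
        # Pad or trim so that short rows still map onto the 15 fields.
        row = (row + [""] * len(COLUMNS))[:len(COLUMNS)]
        yield Annotation(*row)

# Invented example: a comment line followed by one annotation row.
SAMPLE = "!example header\n" + "\t".join([
    "ExDB", "EX001", "abc1", "", "GO:0004764", "ExDB:ref1", "IEA", "",
    "F", "", "", "gene", "taxon:0", "20060101", "ExDB",
]) + "\n"

rows = list(read_gene_association(io.StringIO(SAMPLE)))
```

Each item in `rows` exposes the fields by name, for example `rows[0].go_id` and `rows[0].evidence`, which makes it straightforward to filter by evidence code.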

In addition to providing the annotations of gene products, the GOC also makes available files of the annotated (protein) gene products themselves. These are provided as simple tables of the unique identifier of each gene from the MODs and the UniProt Accession Numbers of the corresponding gene products. FASTA formatted sequences for non-IEA annotated proteins are also included in the weekly and monthly database archives.

These files and procedures have now been well tested by time, and no major changes are envisaged.

4 We will support close communication with other biological database resource groups:

We will work closely with interested biological database resources to make the GO annotations available through their databases and websites. Examples of resources that are important to the GO and our users are those that provide hyperlinks to the AmiGO tool. An analysis of over 260,000 referrals is provided (Figure 13). This shows that NCBI is the top resource directing users to the GOC, with over 150,000 users directed from an NCBI page to AmiGO between May 2005 and January 2006. We will reach out to sites that use GO and to those that are interested.

[pic]

Figure 13. Analysis of how users of AmiGO reached the site. The web logs include a “referrer” field that states the URL of the page that provided the user with the hyperlink to the AmiGO site. This analysis was performed on the logs from May 2005 until Jan 2006 (8 months). Over 260,000 referring URLs are represented in the results.
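An analysis of this kind can be approximated by tallying referring URLs by host. The following is a minimal sketch, not the procedure actually used for Figure 13; the function name and the sample URLs are invented, and it assumes the referrer values have already been extracted from the log lines.

```python
from collections import Counter
from urllib.parse import urlparse

def count_referrer_hosts(referrers):
    """Tally referring URLs by host name, ignoring entries with no host."""
    hosts = Counter()
    for url in referrers:
        host = urlparse(url).netloc.lower()
        if host:
            hosts[host] += 1
    return hosts

# Invented referrer values; "-" stands for a request with no referrer.
SAMPLE_REFERRERS = [
    "http://www.example.org/gene/1",
    "http://www.example.org/gene/2",
    "http://search.example.com/?q=kinase",
    "-",
]

counts = count_referrer_hosts(SAMPLE_REFERRERS)
```

`counts.most_common()` then lists the hosts in descending order of referrals, which is the ranking a chart like Figure 13 would display.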

Communication: We will provide workshops and presentations at meetings attended by biological database curators and developers such as the International BioCurators meetings ([BioCurator]), Generic Model Organism developers meetings ([GMOD]), The Cold Spring Harbor/Hinxton Genome Informatics ([GENOMEINFO]) meetings, Biology of Genomes ([CSHL]), and Plant and Animal Genome ([PAG]) conference. In addition, we will encourage database groups to attend our annual GO Users meeting by sending the meeting announcements to the BioCurator and GMOD mailing lists, among others. We will also continue to work with large data repositories such as NCBI, UniProt and Ensembl ([Ensembl], Birney et al. 2006) to disseminate GO ontologies and annotations.

Access to ontologies: Central to all the aims of the project is the need for communication about the content represented within the GO ontologies. Some resources may not use GO as the primary source of annotations because it does not represent a specific aspect of biological knowledge. We will work with these groups to enhance the ontologies to appropriately incorporate missing biological knowledge. We will also empower the GOC annotators to engage other resources in discussions about how we can be more useful to them. By working to enhance our methods of distribution, as well as education on how to use the current methods, we will make the GO better.

Access to annotations: In addition to the types of discussions mentioned above we need to connect with the many resources that have useful annotations to share. Of the several hundred biological databases available on the Internet today many are focused on providing very detailed information about one protein family or the genes involved in a particular human disease. These data would be important additions to the annotations provided by the GOC. We will thus work to make the connection with these resources through our outreach efforts. The GO annotators will be charged with contacting databases and the PIs will follow-up with these contacts. As necessary the GO annotation staff will be able to help reformat the annotations from these databases for submission to the GOC repository.

Access to tools and resources: Through the connections that are made by opening communication with other biomedical resource projects we will identify tools that are necessary to fulfill our mission of providing useful access to the ontologies and annotations. We will also discover the needs of special communities that had not been previously identified. The GO managers working with the development teams will define specifications and establish priorities for the creation of these new tools. In addition to the software development identified to aid other resources we will provide a web page that lists use case scenarios for software development. The scenarios draw on the existing documentation from the GO database, its loading and API, as well as documentation from other tools that assist in the manipulation of the ontologies. This information is necessary to minimize duplication of effort and to provide easy access to a complete understanding of how and when different software tools are appropriate.

3 We will regularly assess our ability to deliver to the research communities

Through the use of online surveys, Users Meetings, call-in sessions, email lists and directed solicited feedback we will be ever mindful that our goals include broad support for the use of the GO, its tools and resources, by the greater biological research community. A significant challenge for the future is the outreach and support for the experimental biologists. The cohesive interactions between members of the model organism communities have allowed us to work easily with those communities. As functional genomic tools become part of the diverse physiological and developmental biology communities (to mention just two) we need to reassess whether we are meeting the needs of these communities. Thus we will redouble our commitment to outreach. The GOC annotators will each determine groups of users who should be targeted (e.g. via attendance at specialist meetings) and topic areas that should be developed. We will also create a Community Requirements group, led by Drs. Lomax (EBI) and Hong (Stanford). This group will work on web site development, communicate with the diverse biological research groups, and act as liaison to academic societies so as to organize workshops and tutorials at major scientific meetings (such as FEBS ([FEBS])).

4 Technical infrastructure and engineering management

We will continue to improve access and usability of the GO Internet resources

1 GO database

We will continue to improve query response time and the frequency of updates. We will explore making more computational predictions available via the database for use by AmiGO. Changes to the hardware used to provide the GO database via a cluster of small fast machines (see Stanford budget justification) and enhancements to the database indexing machines will aid this process in year one. Future enhancements to the database loading are required for the long term. The current loading procedures reload all the information anew rather than updating the existing database. Moving to incremental updates will require a major rewrite of the current procedures. The Chado schema has growing support within the MOD community and we will seriously consider using it for the GO database.
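The difference between reloading everything and updating the existing database can be sketched as follows. This is a minimal illustration under stated assumptions, not the GO database loader: the three-field record, the identifiers in the example data, and the function name `plan_incremental_update` are hypothetical.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical minimal annotation record (not the GO database schema)."""
    gene_product: str
    go_id: str
    evidence: str


def plan_incremental_update(current, incoming):
    """Return (to_insert, to_delete): the changes that turn `current` into `incoming`.

    A full reload would discard `current` and write every row of `incoming`;
    an incremental load only touches the rows that differ.
    """
    current_set, incoming_set = set(current), set(incoming)
    to_insert = incoming_set - current_set  # rows that are new in this submission
    to_delete = current_set - incoming_set  # rows withdrawn or changed upstream
    return to_insert, to_delete


# Illustrative data only; the gene product identifiers are made up.
existing = [
    Annotation("DB:gene1", "GO:0005634", "IDA"),
    Annotation("DB:gene2", "GO:0003677", "IEA"),
]
submitted = [
    Annotation("DB:gene1", "GO:0005634", "IDA"),
    Annotation("DB:gene2", "GO:0003677", "IMP"),
]
inserts, deletes = plan_incremental_update(existing, submitted)
```

In this example only one row is inserted and one deleted, while the unchanged row is left alone. A production loader would also have to handle ontology changes and referential integrity, which this sketch ignores.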

2 AmiGO Browser: Access to GO Database

We will continue improvements for AmiGO web access to the GO database. The AmiGO team, or working group, consists of annotators from SGD, dictyBase, GeneDB, the GO editorial office, and a developer (Shu ShengQiang). All AmiGO development is tracked using the AmiGO request tracker, including requests from both the GOC and the general public. The AmiGO working group regularly collects current items from the request tracker and from these formulates priorities for the next round of development. The plan may be revised based on feedback following notification on the GOC mailing list. Once the plan is finalized, mockups and specifications are generated. These mockups and specifications are iterated upon until a consensus is reached, first within the working group and then within the GOC. Based on the approved specification, the developer implements the new features for the release, and the AmiGO team jointly tests and debugs the software until it is ready for GOC beta testing. The fully tested new AmiGO, after approval by the GOC, is released to the general public. A typical release cycle is about 2 months and each release tackles anywhere from 5 to 10 problems or new features.

The decision not to provide the IEA annotations was made in part to give faster responses to database queries. There are over 9 million annotations in the validated (filtered) gene association files, and of these only about half a million carry a non-IEA evidence code (indicating that the annotation was assigned on the basis of an experimental result, not a computational predictive analysis). However, a redesign of the AmiGO web pages is also required so that the higher quality annotations are not lost within a sea of IEA annotations.
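Separating IEA from non-IEA annotations is a simple filter over a gene association file. The sketch below assumes the tab-delimited gene association layout in which comment lines begin with `!` and the evidence code is the seventh column; the function name `split_by_evidence` and the sample rows are hypothetical and for illustration only.

```python
EVIDENCE_COLUMN = 6  # zero-based index of the evidence code (seventh column)


def split_by_evidence(lines):
    """Split annotation lines into (non_iea, iea) lists of parsed fields.

    Comment lines (starting with '!'), blank lines, and rows too short
    to contain an evidence code are skipped.
    """
    non_iea, iea = [], []
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= EVIDENCE_COLUMN:
            continue
        if fields[EVIDENCE_COLUMN] == "IEA":
            iea.append(fields)
        else:
            non_iea.append(fields)
    return non_iea, iea


# Made-up rows: database, object ID, symbol, qualifier, GO ID, reference, evidence.
sample = [
    "!illustrative comment line",
    "\t".join(["DB", "obj1", "sym1", "", "GO:0005634", "REF:1", "IDA"]),
    "\t".join(["DB", "obj2", "sym2", "", "GO:0003677", "REF:2", "IEA"]),
    "\t".join(["DB", "obj3", "sym3", "", "GO:0003677", "REF:3", "ISS"]),
]
curated, electronic = split_by_evidence(sample)
```

A browser could use such a split to show the non-IEA annotations first and offer the IEA annotations as an optional, clearly marked layer.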

At time of writing (12/2005) there are approximately 30 outstanding feature requests pending for AmiGO. The release plan just described can easily stay ahead of new incoming requests. Certain new features however will require major work. Specifically, changes to AmiGO are needed to accommodate the new relationships in the GO. This work will occur in parallel with the normal release cycle. The current visualization approach exploits the generally hierarchical nature of the ontology and presents the DAG as an index, or tree. As more and more new relationships are added, the structure of the ontology bears less and less resemblance to a hierarchy: it becomes a mesh of interconnected terms. We are developing multiple approaches for presenting this mesh to users, which will clearly convey an overall perspective of the ontology. One prototype has already been implemented that presents the ontology as tables. We will also prototype a tiled display, where each tiled web page frame presents a different view of the data and selection of terms between frames is synchronized.
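The limitation of the tree-style display can be seen in a small sketch. When a DAG is unrolled into an indented index, a term with several parents is repeated once for every path that reaches it, so the display grows as relationships are added. The term names and the function `dag_to_tree_rows` are hypothetical; this is not AmiGO code.

```python
def dag_to_tree_rows(children, root, depth=0):
    """Yield (depth, term) rows for an indented tree view of a DAG.

    `children` maps each term to its child terms. A term that has more
    than one parent is emitted once per path from the root.
    """
    yield depth, root
    for child in sorted(children.get(root, ())):
        yield from dag_to_tree_rows(children, child, depth + 1)


# Toy ontology: term C has two parents, A and B.
children = {
    "root": ["A", "B"],
    "A": ["C"],
    "B": ["C"],
}
rows = list(dag_to_tree_rows(children, "root"))
for depth, term in rows:
    print("  " * depth + term)
```

Here the four-term ontology produces five rows because C is listed under both A and B. With many relationship types the duplication compounds, which is one motivation for the table and tiled views described above.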

5 Summary of Aim 4.

Our objective here is to provide our annotations, ontologies and tools to the research community. We will support the use of the GO by researchers in functional genomics, comparative biology, and other related fields. We will maintain our Web presence, continue to offer other Internet services, and develop new visualization strategies for comparisons of annotations. As an outreach effort, we will modify existing utilities to make them more generic. We will make all of the data submitted to the GOC (sequences, annotations, and ontology) available for local installation and use.

5 TIMELINE AND MILESTONES.

|O==Ontology Editors |

|S==Software Group |

|C==Community Group |

|R==Reference Annotation Group |

|A==Annotation Outreach Group |

We have, in the following table, set out our management goals and project allocations for the next 5 years. Following the table, in section D.6., we discuss each of these groups, identify the staffing, and provide a text overview of our objectives and expectations for this effort. On page 190, we provide a Resource Staffing Responsibilities table (Table 15) for all positions for which we seek funding.

Table 11. Timeline and Milestones allocated to designated groups.

|Year |Task |Group |

|1 |2 |3 |

|Software development |C. J. Mungall, M.Sc. |LBNL |

|GO content |M. Harris, PhD. |EBI |

|Annotation Outreach |D. Hill, PhD. |Jackson Laboratory |

|Community Needs Assurance |J. Lomax, PhD./E. Hong, PhD. |EBI/Stanford |

|Reference genomes |R. Chisholm, PhD. |Northwestern U. |

|SO development |K. Eilbeck, PhD. |U. Utah/LBNL |

Table 12. GO Management Team.

Christopher Mungall will be responsible for the software group, who design, implement, document, and maintain the technical infrastructure used by the project. He is a key developer of the GO bringing his strong background in logical systems development to this project. He will work with the annotation outreach team to provide a toolkit (based on existing contributed software) to jumpstart automatic annotation efforts. All software of the GO project is openly available for use by the community and will be supported.

Midori Harris will be responsible for the editorial group, who develop and maintain the ontology. She will work with the GO representatives of the 9 reference genome projects to identify and recruit experts to develop the ontology in specific domain areas. Dr. Harris has been the Editorial Coordinator since April 2001 and will continue in this role, providing continuity and oversight to the operations of the Editorial Office.

David Hill will be responsible for coordinating outreach to emerging genome projects, developing training material, arranging visits to reference genome centers for mentoring, and facilitating the distribution of software utilities. He will work with the 9 reference genome projects to support the annotation of emerging genomes that are members of that organism’s clade. Dr. Hill is a developmental biologist and senior scientific curator in the MGI group. He has been an active participant in GOC since its inception, bringing a strong background in biological systems and overall knowledge of model organism database annotation systems to this work.

Jane Lomax and Eurie Hong will act as liaisons to the community. They will be responsible for quality assurance, Web site ease of use, the availability of appropriate downloads, fostering communications, and providing easy to use mechanisms that encourage feedback. Dr. Lomax is a very experienced GO editor at the EBI. Dr. Hong is Head Scientific Curator at SGD where she has been very involved in GO annotation, curation consistency, and user interface design. She is currently leading the AmiGO interface feature discussions.

Rex Chisholm will be responsible for coordinating the discussions and determining solutions leading to the production of uniform standards of annotation. He will work to establish mechanisms for comparative assessments of annotations, and metrics to determine the completeness of the annotations for each of the reference genomes. He will coordinate the production and presentation of reference genome breadth and depth assessments.

Karen Eilbeck will be responsible for developing and maintaining the Sequence Ontology. She will work with the GO representatives of the 9 reference genome projects to provide the respective sequence sets for each organism to the community in comparable formats. She will work with Dr. Hill, assisting individuals from the emerging genome projects to establish appropriate methodologies to produce genomic annotations in standardized form. Dr. Eilbeck has been the SO Coordinator since 2003 and will continue in this role, providing continuity and oversight to the operations of the SO Office.

Fortnightly tactical teleconference: Tactical decisions affecting the GO data model, software infrastructure, curatorial issues, annotation procedures, and community needs will be discussed at a conference call between the 7 managers held every other week. The moderator, who will be selected in advance on a rotating basis, will send out the agenda in advance and follow up with minutes. The fortnightly teleconference will also be used to review progress and to identify critical issues affecting productivity and quality. The manager who runs the meeting will provide a brief written report to the directors following every call.

In addition, ongoing e-mail correspondence on the go-exec mailing list will serve for immediate concerns that may arise. All of the managers will also need to interact regularly with the staff members responsible for the focused annotation of the reference genomes. Only some of the reference genome curators are funded through the GOC. These curators are expected to play a dual role. They will be actively annotating gene products for their organism of interest, and will also act as liaisons to emerging genome projects. The former gives them real-world, first-hand context for the challenges of annotation, so that they can speak from direct experience when supporting new groups.

|Organism |Contact |Institution |

|Homo |R. Apweiler/E. Camon* |EBI |

|Escherichia |M. Gwinn-Giglio |TIGR |

|Drosophila |M. Ashburner/M. Williams |Cambridge |

|Saccharomyces |J. M. Cherry/K. Christie* |Stanford |

|Mus |J. Blake/A. Diehl* |JAX |

|Arabidopsis |S. Rhee/S. Mundodi* |CI |

|Caenorhabditis |P. Sternberg/R. Kishore* |CalTech |

|Danio |M. Westerfield/D. Howe |Oregon |

|Dictyostelium |R. Chisholm/P. Gaudet* |Chicago |

|*Paid for on GO grant and therefore also works on annotation outreach |

Table 13. GOC Reference Genomes, the project PI, and Contact Curator for each.

2 The GO Team.

The GOC team members are responsible for carrying out the necessary work and delivering results. The members of software, editorial, and community assessment groups all reside in offices at the same institutions (LBNL, EBI, and Stanford respectively). They will naturally be in close communication with other members of their own team on a daily basis. Daily correspondence with staff from other teams is handled over e-mail on topic mailing lists. Progress reports from each of the 5 teams will be written annually for the overall project and for NIH. Annual meetings will be held for the entire GOC.

The editorial team organizes two yearly meetings to develop ontology content, and arranges the schedule so that experts in the topic may attend. These meetings are preceded by face-to-face site visits, where a member of the GO editorial team visits to train the experts in necessary technical details of formal representation within the GO. Conclusions and minutes from these meetings are archived on the GO site. The editorial team is responsible for implementing the resulting changes to the ontology.

The annotation outreach team also organizes two yearly meetings to provide training in annotation protocols and best practices. Meetings may be of general interest, or focus on particular annotation issues (e.g. automated techniques). A substantial curriculum has already been established and this material is provided on the GO site.

A simplified organization chart is shown below.

Figure 15. GOC organization chart.

3 GO Associates.

GO Associates are groups recognized for their significant contributions to the GO project. Typically, GO Associate groups contribute in at least one of the following ways:

• Curation and maintenance of annotations provided to the GO repository for one of the non-core organisms.

• Provision of open source applications for use with GO, and ongoing development and support for these applications.

• Collaboration on major Gene Ontology content development to develop areas requiring specialized domain knowledge.

We recruit Associates both incidentally and proactively. To solicit spontaneous contributions we provide transparent links through our public interfaces encouraging researchers to contact us when they encounter deficiencies. We are receptive to their initiative and respond within 24 hours. In addition, we directly initiate contacts with researchers whenever our current development brings us into a new domain.

4 The Community of Biologists.

In the ideal world we strive for, accurate annotations for every gene product spanning the spectrum of organisms would be available in a semantically compatible, computable form. The rationale for the GOC is research support, and semantic compatibility is the bedrock for meaningful discussions, comparisons, and contrasts of the data. To accomplish our aim we rely heavily on feedback from the community. Every web page on the GOC’s site contains a link to request help or propose a change. We also attend meetings to gather feedback and report our summaries from these meetings back to the rest of the Consortium. We will conduct a survey every two years to explore ways to improve our resource. We will monitor publications using the GO to determine how we should adjust our priorities.

2 Resource Advisory Committee.

Once a year the entire project team will gather for a day and a half long meeting with the project Scientific Advisory Board (SAB). The role of the advisors is to review the progress of the project and to advise it on matters ranging from priorities to community outreach, to the usability of the project web sites and associated tools. The scientific disciplines we will represent on the advisory board include genome biologists, computational biologists, systems biologists, ontology specialists, natural language processing researchers, and representatives of the biotech/pharma industry. We seek expertise in genomics and proteomics, logical systems development, statistical inference, semantic mining, and information technologies.

At the recent SAB meeting held at CalTech (WormBase, host) in 2005, our Board discussed and reviewed with us broad outlines of necessary and challenging tasks. The Board emphasized to us the importance of high quality manual annotations, both for genomics researchers and to help speed up the automated annotation of other genomes. They identified the need to have a core set of annotators who would provide critical input for development of the ontology. They imagined a SWAT team to facilitate functional annotation with the GO for newly sequenced genomes. They encouraged a comprehensive reconceptualization of the organization of the GOC in order to carry out our mission. They encouraged us to survey our community, and to incorporate a wide variety of mechanisms to gather community input and to promote effective use of the resource. We have considered these suggestions carefully and have endeavored to incorporate them into this proposal.

The proposed project will use the current GO SAB, which represents researchers with a wide range of interests and expertise, both biological and computational, and academic and commercial (Table 14). Dr. Larry Hunter, the Chair of the SAB, is a computational biologist with decades of experience, who studies the application of machine learning and statistical inference techniques for high-throughput molecular assays. Ian Dix is actively involved in the provision of sequence, mutation, expression and annotation data for human, mouse and rat genes to AstraZeneca drug-discovery programs. David Botstein is Director and Anthony B. Evnin Professor of Genomics, Lewis-Sigler Institute at Princeton University, and was one of the pivotal ringleaders who instigated the formation of the GOC in 1998. Lynette Hirschman’s work in bioinformatics applies MITRE’s expertise in natural language processing to the problem of automatically extracting relevant facts from biology literature. Peter Tarczy-Hornoch is an ontologist whose interests include cross-database queries. He is a co-PI on a National Center for Excellence in Biomedical planning grant, working on a toolkit for sharing heterogeneous biological data from diverse existing data sources leveraging new and existing ontologies. Craig Nevill-Manning is a Senior Research Scientist at Google, which has an over-arching interest in organizing the world’s information. Mike Tyers studies the cell cycle from the perspective of the many signaling pathways that target it to coordinate growth and development. Finally, David States, of the Department of Human Genetics at Michigan, focuses his research on gene expression regulation and proteomics.

A program project officer from the NHGRI will be invited to observe all SAB meetings and will receive copies of the agenda, minutes and meeting report.

|Larry Hunter (Chair) |University of Colorado Boulder |

|Ian Dix |AstraZeneca |

|David Botstein |Princeton University |

|Lynette Hirschman |MITRE Inc. |

|Peter Tarczy-Hornoch |University of Washington |

|Craig Nevill-Manning |Google Inc. |

|Mike Tyers |Mount Sinai Hospital Research Institute, Toronto |

|David States |University of Michigan |

Table 14. Scientific Advisory Board.

3 Resource Collaborations.

The GO is an intensively collaborative project. Not only does it bring nine of the model organism databases together in a single collaboration (SGD, TAIR, ZFIN, dictyBase, RGD, MGI, Gramene, WormBase and FlyBase), but it also directly involves four major database providers: TIGR, Reactome, GeneDB and the UniProt group (EBI, SIB and PIR). The extent of the collaboration is far greater than this, however, since the GO must collaborate with all of the major genomic database groups worldwide.

The GOC has two levels of collaboration. The first is the Consortium itself, composed of the groups named in the previous paragraph. It is unlikely that the Consortium itself will grow significantly. Since membership of the Consortium requires very active participation, by attendance at the GOC's meetings and by other means, there is a limit to its size. For it to grow significantly would make its management and day-to-day activities unwieldy. For this reason we established the level of GO Associates. This is for groups that make a significant contribution to the GO's work, without being involved in its management. There are now four GO Associates (AgBase, the Candida Genome Database, the Muscle Trait Database and the Plant-Associated Microbe GO group). We expect many other groups to join as Associates, and we have set out clear criteria for membership ([GO.as]).

In addition to these formal levels of collaboration the GOC works with many other groups. Many of these are database providers, others are software providers, while some have more specialized interests, e.g. using the GO for Natural Language Processing. While some of these collaborations are transitory and not very intensive, others are long-term and require investment on both sides, for example our collaboration with Sonnhammer's inParanoid group in Stockholm and that with the BRENDA group in Cologne. We have mentioned many of these collaborations in the body of our proposal, and include as Letters of Support evidence of our long-term relationships with others.

The entire world of genome informatics must collaborate. There is simply too much information for any one group or organization to bring it all together. In the last few years the principals of the GOC have been in the forefront of forging collaborations in genome informatics. This is evident from our roles in the Generic Model Organism Database organization, our roles in organizing major meetings in the field, such as the CSHL/Wellcome Trust Genome Informatics meetings, our encouragement of the new BioCurators Society, and our willingness, both individually and collectively, to devote considerable energy to the development of this field. These activities will continue, not only because of our self-interest in seeing the GO develop, but also because we believe that we will all benefit from such activities.

6 SUMMARY OF RESEARCH PLAN

The Gene Ontology Consortium (GOC) provides the scientific community with a consistent and robust infrastructure for describing, integrating, and comparing the structures of genetic elements and the functional roles of gene products within and between organisms. Our plan over the next five years is to develop the Gene Ontology, and its associated products, so as to best serve the biomedical research community. We will do so by meeting the objectives described in our four aims.

Our objective in Aim 1 is to continue the development of the GOC ontologies so that they are logically rigorous, comprehensive and accurate. We will increase the coverage of the ontologies in response to the needs of its users. We will incorporate new relationship types into the ontologies as needed; for example, between Molecular Function and Biological Process, to better support groups who are utilizing the GO in conjunction with work on pathways. We will make the GO more scalable and more rigorous by constructing its terms with explicit cross-reference to other OBO ontologies. We will apply the principles of ontology development now being defined by the National Center for Biomedical Ontologies.

Our objective in Aim 2 is to comprehensively annotate nine reference genomes in as complete detail as possible. These will be the genomes of Escherichia coli K-12, Dictyostelium discoideum, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, Mus musculus and human. We will develop metrics and quality controls to determine the number of genes that are fully annotated. We will provide visualization and editing tools to make this process as efficient as possible. We will involve biologists with domain expertise to provide the most detailed annotations possible for these genomes. We will respond to community initiatives to completely annotate specialized sets of genes.

Our objective in Aim 3 is to support functional annotation using the GO for all organisms. We will provide a basic toolkit of training material, standard procedures, tools, documentation, and reference data to emerging genome groups for automated functional annotation. We will offer biannual training sessions in methods of annotation. Curators from the GO annotation team will mentor groups who are interested in manual annotation. We will support and encourage the submission of the resulting functional annotations to the central GO repository so as to increase its taxonomic breadth.

Our objective in Aim 4 is to provide our annotations, ontologies and tools to the research community. We will support the use of the GO by researchers in functional genomics, comparative biology, and other related fields. We will maintain our Web presence, continue to offer other Internet services, and develop new visualization strategies for the comparison of annotations. As an outreach effort we will modify existing utilities to make them more generic. We will make all of the data submitted to the GOC (sequences, annotations, and ontology) available for local installation and use.

As we have shown in this proposal, we will meet the challenges of the future. We have addressed quality, scalability, interoperability, and administration. Our first principle is to focus on the biological content, making the GO an intuitive and important resource for biologists. We will minimize technical barriers, so that the bioinformatics community will continue to adopt the GO. We have defined communication channels within the consortium, critical for effective administration. We are excited by these challenges and will meet them with enthusiasm.

HUMAN SUBJECTS RESEARCH.

None.

Vertebrate Animals.

None.

Literature Cited.

Adams, M. et al. (2000). The genome sequence of Drosophila melanogaster. Science 287:2185-2195.

American Academy of Microbiology. (2004). An experimental approach to genome annotation. Washington, DC, American Academy of Microbiology.

Bandapalli, O.R. et al. (2006). Global analysis of host tissue expression in the invasive front of colorectal liver metastases. Int. J. Cancer 118:74-89.

Bard, J. et al. (2005). An ontology for cell types. Genome Biol. 6:R21.

Berners-Lee, T. et al. (2001). The Semantic Web. Scient. Amer. May 2001.

Birney, E. et al. (2006). Ensembl 2006. Nucleic Acids Res. 34:D556-D561.

Blake, J. et al. (2006). The Mouse Genome Database: Updates and enhancements. Nucleic Acids Res. 34:D562-D567.

Boyle, E.I. et al. (2004). GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20:3710-3715.

Dolan, M.E. et al. (2005). A procedure for assessing GO annotation consistency. Bioinformatics Suppl 1:i136-i143.

Egenhofer, M. (1989). A formal definition of binary topological relationships. Third International Conference on Foundations of Data Organization and Algorithms (FODO), Paris, France, W. Litwin and H. Schek (eds.), Lecture Notes in Computer Science 367:457-472.

Eilbeck, K. et al. (2005). The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 6:R44.

Galperin, M. (2006). The Molecular Biology Database Collection: 2006 update. Nucleic Acids Res. 34:D3-D5.

Glasner, J.D. et al. (2006). ASAP: a resource for annotating, curating, comparing, and disseminating genomic data. Nucleic Acids Res. 34:D41-D45.

Grumbling, G. et al. (2006). FlyBase: anatomical data, images and queries. Nucleic Acids Res. 34:D484-D488.

Hendrickson, J.E. and Sakonju S. (1995). Cis and trans interactions between the iab regulatory regions and abdominal-A and abdominal-B in Drosophila melanogaster. Genetics 139:835-848.

Hirschman, L. et al. (2005). Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinf. 16: Suppl.

Hucka, M. et al. (2004). Evolving a lingua franca and associated software infrastructure for computational systems biology: The Systems Biology Markup Language (SBML) project. Systems Biology 1:41-53.

Jenssen, T.-K. et al. (2001). A literature network of human genes for high-throughput analysis of gene expression. Nature Genet. 28:21-28.

Kanehisa, M. et al. (2006). From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34:D354-D357.

Karp, P. (1995). A strategy for database interoperation. J. Comput. Biol. 2:573-586.

Karp, P. and Paley, S. (1996). Integrated Access to Metabolic and Genomic Data. J. Comput. Biol. 3:191-212.

Kawai, J. et al. (2001). Functional annotation of a full-length mouse cDNA collection. Nature 409:685-690.

Li, H. et al. (2006). TreeFam: A curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 34:D572-D578.

Lomax, J. and McCray, A.T. (2004). Mapping the Gene Ontology into the Unified Medical Language System. Comp. Funct. Genomics 5:354-361.

Mungall, C. (2004). OBOL: integrating language and meaning in bio-ontologies. Comp. Funct. Genomics 5:509-520.

Newman, M.E. (2001). The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. USA. 98:404-409.

Raychaudhuri, S. et al. (2002). Associating genes with Gene Ontology codes using a maximum entropy analysis of biomedical literature. Genome Res. 12:203-214.

Riley, M. (1993). Functions of the gene products of Escherichia coli. Microbiological Reviews 57:862-952.

Riley, M. et al. (2006). Escherichia coli K-12: a cooperatively developed annotation snapshot - 2005. Nucleic Acids Res. 34:1-9.

Schulze-Kremer, S. (1997). Adding Semantics to Genome Databases: Towards an Ontology for Molecular Biology. In: Proceedings of the Fifth International Conference for Intelligent Systems for Molecular Biology Conference, pp. 272-275. AAAI Press, Palo Alto.

Schulze-Kremer, S. (1998). Ontologies for Molecular Biology. In: Proceedings of the Third Pacific Symposium on Biocomputing, pp. 693-704. AAAI Press, Palo Alto.

Serres, M.H. et al. (2001). A functional update of the Escherichia coli K-12 genome. Genome Biol. 2:research0035.1-0035.7.

Simons, P. M. (1987). Parts. A Study in Ontology. Clarendon Press, Oxford.

Smith, B. et al. (2005). Relations in biomedical ontologies. Genome Biol. 6:R46.

The Gene Ontology Consortium. (2000). Gene ontology: tool for the unification of biology. Nat Genet. 25:25-29.

The Gene Ontology Consortium. (2006). The Gene Ontology (GO) project in 2006. Nucleic Acids Res. 34:D322-D326.

Wu, C. et al. (2006). The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34:D187-D191.

Yang, S.X. et al. (2005). Gene expression patterns and profile changes pre- and post-erlotinib treatment in patients with metastatic breast cancer. Clin. Cancer Res. 11:6226-6232.

Zindy, P. et al. (2005). Up regulation of DNA repair genes in active cirrhosis associated with hepatocellular carcinoma. FEBS Lett. 579:95-99.

Consortium/Contractual Arrangements.

There are continuing consortium arrangements in place with Stanford University, Carnegie Institution, California Institute of Technology, Northwestern University, Medical College of Wisconsin, the European Molecular Biology Laboratory, and the Lawrence Berkeley National Laboratory. Below is a list of these, with the subcontract PI and the institution. Each of these groups brings a unique perspective to the development of the GO resource that could not be achieved within the MGI group alone, nor within any one resource group. Model organism resources represented include SGD, FlyBase, TAIR, WormBase, dictyBase, and RGD, the model organism databases for Saccharomyces, Drosophila, Arabidopsis, C. elegans, Dictyostelium and Rattus respectively. Each group brings a unique biological perspective to the building of ontologies for molecular biology. The LBNL group provides a software development resource that is focused on building robust and user-friendly applications for genome and model organism databases. The EBI Editorial Office provides GO resource support for ontology development and documentation. UniProt is a protein informatics resource that contributes to the development of the GO and that provides GO annotations for all protein sequences in the UniProt resource.

The work of the GO resource is distributed among these groups, taking advantage of the particular perspective and skills provided by each. The distributed ontology development and annotation efforts are described in the research plan. Each group contributes to the shared development and oversight in its area of biological expertise, and provides outreach to its own community and to closely related communities. The collective and cooperative efforts of these groups have been, and will continue to be, essential to the success of the GO. As an essential resource for the biological research community, the GO benefits greatly from the collaborative nature of this work.

|Resource Group |Sub-Contract PI |Institution |
|GO Editorial Office and FlyBase |Michael Ashburner |European Molecular Biology Laboratory and University of Cambridge |
|Saccharomyces Genome Database (SGD) |J. Michael Cherry |Stanford University |
|LBNL Biological Knowledge Research Group |Suzanna Lewis |Lawrence Berkeley National Laboratory |
|The Arabidopsis Information Resource (TAIR) |Seung Rhee |Carnegie Institution |
|WormBase |Paul Sternberg |California Institute of Technology |
|dictyBase |Rex Chisholm |Northwestern University |
|Rat Genome Database (RGD) |Simon Twigger |Medical College of Wisconsin |
|UniProt |Rolf Apweiler |European Molecular Biology Laboratory |

Table 16. Sub-Contract PI and Institutions.

Resource sharing.

The GO project is a community ontology and database resource project. We do not generate animals or reagents. All resources from the GO project, including ontologies, genome annotation sets, ontology development and genome annotation tools, and database resources, are made openly and freely available to all, including the academic and industry research communities. Details of documentation, versioning, API, database downloads, ontology syntax and files, and GO tools are discussed throughout the Research Plan.

Data in our public database are updated on a regular basis (see section C: Progress Report). Software changes are incorporated as they are developed, with notification to GOC members and to the research community through the GO web site and through mailings to interest groups, go-friends, and MOD mailing lists. The GOC is committed to supporting the widest possible distribution and use of the ontologies and the GO annotation sets.

Consultants and Letters of Support.

1 Consultants/collaborators.

1. Monte Westerfield, Ph.D.

Director, ZFIN and the Zebrafish International Resource Center

Member, Institute of Neuroscience

University of Oregon

Professor of Biology

Eugene, Oregon

2. Michelle Gwinn-Giglio, Ph.D.

Staff Scientist III

Coordinator for Prokaryotic Annotations

The Institute for Genomic Research (TIGR)

Rockville, MD

Drs. Westerfield and Gwinn-Giglio are actively involved in the development of GO ontologies and in the use of the GO for genome annotation for Danio rerio and for prokaryotic genomes, respectively. They and their staff participate, and will continue to participate, in the development of the GO and in the outreach and training for new groups seeking support for GO annotation efforts. We have requested travel funds to bring Drs. Westerfield and Gwinn-Giglio to GOC meetings, including ontology development meetings and annotation training workshops, as appropriate.

Dr. Westerfield and the ZFIN group will provide the comprehensive and ‘deep’ annotations for the zebrafish, a model organism identified as one of the important ‘reference’ genomes in this proposal. This work extends their existing efforts to contribute GO annotation sets for the Danio genome to the GO database.

Dr. Gwinn-Giglio leads the efforts at TIGR to provide functional annotations for the prokaryotic genomes sequenced and annotated by TIGR. She actively participates in the development of the GO, and continues to be a key person in the development of new GO resources to support the use of the GO by microbial genome annotation groups.

2 Letters of Support

Joseph A. Bedell, Ph.D., Director of Bioinformatics, Orion Genomics, LLC, Saint Louis, MO 63108.

Ewan Birney, Ph.D., Group Leader, Ensembl, European Bioinformatics Institute, Hinxton, Cambridge, UK.

Peer Bork, Ph.D., Senior Scientist, European Molecular Biology Laboratory, Heidelberg, Germany.

David Botstein, Ph.D., Director, Lewis-Sigler Institute for Integrative Genomics, Princeton University, 140 Carl Icahn Laboratory, Princeton, NJ 08544.

Steve DM Brown, Ph.D., Director, MRC Mammalian Genetics Unit, Harwell, Didcot, Oxfordshire OX11 0RD, United Kingdom.

David W. Burt, Ph.D., Professor, Department of Genomics and Genetics, Roslin Institute (Edinburgh), Midlothian EH25 9PS, United Kingdom.

John N. Calley, Ph.D., Global Lead for Annotation and Principal Research Scientist, Integrative Biology, Lilly Research Labs, Greenfield, IN.

Daniel Chelsky, Ph.D., EVP & Chief Scientific Officer, Caprion Pharmaceuticals Inc., 7150 Alexander-Fleming, Montreal Quebec CANADA H4S 2CB.

Tom Defay, Ph.D., Associate Director, Molecular Sciences and Chair of the Bioinformatics Strategic Steering Group, AstraZeneca R & D Wilmington, 1800 Concord Pike, Wilmington DE.

Ernst Dow, Global Lead for Expression Analysis and Principal Research Scientist, Bioinformatics, Lilly Research Labs, Greenfield, IN.

Sorin Draghici, Ph.D., Associate Professor, Department of Computer Science, Wayne State University Director of the Bioinformatics Core, Karmanos Cancer Institute, Associate Editor, IEEE Transactions on Computational Biology and Bioinformatics. 5143 Cass Avenue, 431 State Hall, Detroit, MI 48202.

Richard Durbin, Ph.D., Deputy Director and Head of Informatics, The Sanger Institute, Hinxton, Cambridge, UK.

Janan Eppig, Carol Bult, Martin Ringwald, Joel Richardson, Jim Kadin (Ph.D.s), Mouse Genome Informatics Consortium PI leadership group. The Jackson Laboratory, Bar Harbor ME 04609.

Ken Ichiro Fukuda, Ph.D., National Institute of Advanced Industrial Science and Technology (AIST), Computational Biology Research Center (CBRC), AIST Tokyo Waterfront Bio-IT Res. Bldg. 10F, 2-42 Aomi Koto-ku, Tokyo 135-0064 Japan.

Steve Gardner, Ph.D., Chief Technical Officer, Biowisdom, Harston Mill, Harston, Cambridge CB2 5GG, United Kingdom.

Ramneek Gupta, Group Lead for Bioinformatics Development and Research Scientist, Lilly Systems Biology, Republic of Singapore.

Lynette Hirschman, Ph.D., Director, Biomedical Informatics, Information Technology Center, The MITRE Corporation, Center for Integrated Intelligence Systems, 202 Burlington Road, Bedford, MA 01730.

John M. Hancock, Ph.D., Head of Bioinformatics, MRC Mammalian Genetics Unit, Harwell, Didcot, Oxfordshire OX11 0RD, United Kingdom.

Lawrence Hunter, Ph.D., Associate Professor of Pharmacology, Biometrics and Computer Science, Director, Computational Bioscience Program, University of Colorado School of Medicine, Fitzsimons Campus Box 6511 Mail Stop 8303, Aurora CO 80045-0511.

Peter Hunter, Ph.D., Director, Bioengineering Institute, University of Auckland, New Zealand.

Patrick Hurban, Ph.D., Scientific Director, Paradigm Array Labs, Director of Investigational Genomics, Icoria, Inc, A Clinical Data, Inc. Company, 108 T.W. Alexander Dr., P.O. Box 14528, Research Triangle Park, NC 27708-4528.

Robin J. Johnson, Ph.D., Editor-Scientific Curation, Biobase Corporation, Biological Databases, 100 Cummings Center, Suite 4208, Beverly MA 04609.

Peter Karp, Ph.D., Director, Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Room EK207, 333 Ravenswood Avenue, Menlo Park, CA 94025-3493.

Johan Lund, Vice President, Respiratory and Inflammation, Chair of the Heads of Molecular and Cellular Biology, AstraZeneca R & D Wilmington, 1800 Concord Pike, Wilmington DE 19850.

Donna R. Maglott, Ph.D., Staff Scientist, NCBI, Manager, Entrez Gene, National Institutes of Health, National Library of Medicine, Bethesda MD 20894.

Brent D. Mishler, Ph.D., Director, University and Jepson Herbaria, Professor, Department of Integrative Biology, PI, “Deep Gene” Research Coordination Network, Co-PI, CIPRES Project, Co-PI, The Moss Genome Project, University of California, Berkeley, California.

Mark A. Musen, M.D., Ph.D., Professor of Medicine and Computer Science, Chief, Stanford Medical Informatics, Stanford University, Medical School Office Building, Room X-215, 251 Campus Drive, Stanford, CA 94305-5479.

Jean-Marc Neefs, Ph.D., Senior Scientist, Functional Genomics, Johnson & Johnson PRD, Janssen Pharmaceutica, 30 Turnhoutseweg, B-2340, Beerse, Belgium.

Kenneth Paigen, Ph.D., Executive Research Fellow, Senior Staff Scientist, The Jackson Laboratory, 600 Main St., Bar Harbor ME 04609.

Dan Rokhsar, Ph.D., Programme Head for Computational Genomics, DOE Joint Genome Institute, Walnut Creek, CA; Professor of Molecular Cell Biology and Physics at the University of California at Berkeley, CA.

David B. Searls, Ph.D., Senior Vice President, Worldwide Bioinformatics, GlaxoSmithKline Pharmaceuticals, 709 Swedeland Road, P.O. Box 1539, King of Prussia, PA 19406-0939.

Gavin Sherlock, Ph.D., Assistant Professor, Stanford University, School of Medicine, Research Center for Clinical Sciences Research, Room 2255B, 269 Campus Drive, Stanford, CA 94305-5120.

Bradley K. Sherman, Director of Bioinformatics, Mendel Biotechnology, Inc., 21375 Cabot Blvd., Hayward CA 94545.

Ron Shigeta, Ph.D., Team Lead, NetAffx, Affymetrix, 6550 Vallejo Street, Suite 100, Emeryville CA 94608.

Barry Smith, Ph.D., Professor of Philosophy & Director, National Center for Ontological Research, University at Buffalo, 135 Park Hall, Buffalo NY 14260-4150.

Lincoln Stein, Ph.D., Cold Spring Harbor Laboratory, Cold Spring Harbor, New York.

Robert Stevens, Ph.D., BioHealth Informatics Group, School of Computer Science, The University of Manchester, Oxford Road, Manchester, M13 9PL, United Kingdom.

Mike Tyers, Ph.D., Professor, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, Canada.

Guy Warner, Ph.D., Unilever, Safety and Environmental Assurance Centre, Colworth Park, Bedford, MK44 1LQ, United Kingdom.

John Weinstein, M.D., Ph.D., Bethesda, MD.

Adam West, Group Leader and Information Consultant, Bioinformatics, Lilly Research Labs, Greenfield IN.

Barry Zeeberg, Ph.D., Bethesda, MD.

Table of Appendices.

Appendix 1. The Gene Ontology Consortium. 2006. The Gene Ontology (GO) project in 2006. Nucleic Acids Res. 34 (Database issue):D322-D326.

Appendix 2. Clark, J.I., Brooksbank, C. and Lomax, J. 2005. It's all GO for plant scientists. Plant Physiol. 138:1268–1279.

Appendix 3. Eilbeck, K., Lewis, S.E., Mungall, C., Yandell, M., Stein, L., Durbin, R., and Ashburner, M. 2005. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biology 6:R44.

Appendix 4. Camon, E., Barrell, D., Dimmer, E., Lee, V., Magrane, M., Maslen, J., Binns, D., and Apweiler, R. 2005. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics 6:S17.

Appendix 5. Harris MA, Lomax J, Ireland A, and Clark JI. 2005. The Gene Ontology project. In Encyclopedia of Genetics, Genomics, Proteomics and Bioinformatics, Part 4, Bioinformatics. S. Subramanian (section ed.), Wiley & Sons, Inc., New York.

Appendix 6. Lomax J, McCray AT. 2004. Mapping the Gene Ontology into the Unified Medical Language System. Comp. Funct. Genomics. 5:354-361.

Appendix 7. Lewis, S.E. 2005. Gene Ontology: looking backwards and forwards. Genome Biology 6:103.

Appendix 8. Dolan, M.E., Ni, L., Camon, E., and Blake, J.A. 2005. A procedure for assessing GO annotation consistency. Bioinformatics Suppl 1:i136-i143.

Appendix 9. Mungall C. 2004. Obol: integrating language and meaning in bio-ontologies. Comparative and Functional Genomics 5:509-520.

Appendix 10. Smith, B., Ceusters, W., Klagges, B., Köhler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A.L., and Rosse, C. 2005. Relations in biomedical ontologies. Genome Biology 6:R46.

Appendix 11. Letters of Support.

Appendix 12. Copy of 5th Year annual Progress Report for "The Gene Ontology Consortium" (5 P41 HG02773-06).

Appendix 13. GO Publications, tutorials, workshops & presentations, 2003–2006.

Appendix 14. Results of the GO Survey, October 2005.

-----------------------

[1] References within square brackets are listed under "Acronyms and Web References" table provided just before section D:Research Plan.

[2] There is now agreement for the GOC to provide NAR with a GO classification of many of these database projects for the 2007 issue. This will be most useful for specialty databases on particular processes and functions, for example, NESbase with "nuclear export" and NMPdb with "nuclear matrix".

[3] Some of this work is already in progress.

[4] OBO-Edit evolved from our first editor, DAG-Edit.

[5] At present, most of the MODs attach GO annotations to genes rather than to their explicit products. Over the next year or so, the MODs will all transition to attaching these annotations to particular gene products, allowing, for example, different isoforms of a protein (splice variants or post-translationally modified forms) to have different GO annotations.

[6] Annotations to the root terms (molecular function, biological process, and cellular component) now replace annotations to the old "unknown" terms, e.g., "molecular function unknown".

[7] ISS is the GO annotation evidence code, Inferred from Sequence Similarity (see [GO.evidence]).

[8] Excluding Drosophila species.

-----------------------


Figure 8. The Topological relationships application to 2-dimensional features (Egenhofer, 1989).

Table 4. Gene Ontology Evidence Code Abbreviations. The codes reflect the type of assay used to infer the annotation of the gene product.

Figure 1. The number of defined and undefined terms in the SO and SOFA in the Spring of 2003 and in the Fall of 2005.

Figure 5. GO Web Use. Growth of the weekly use of since 10/99. Combined with use of since 5/05.
