


Research Plan:

a. Specific Aims

As the number of organisms increases for which genomic sequence and/or EST data is available, so does the need for database systems that can leverage this data into powerful functional genomics resources. Most communities have not begun this process. For those that have, rather than build such a resource from scratch, some are decreasing their overhead by adapting established systems. The Genomics Unified Schema (GUS) developed by the Computational Biology and Informatics Laboratory (CBIL) at the University of Pennsylvania is a functional genomics database system with demonstrated success in multiple projects representing multiple organisms. Several groups, particularly from the pathogen organism community, have expressed an interest in GUS. We propose to package it as a reusable infrastructure for functional genomics. This packaging will make GUS installable and customizable by a small staff in a brief time.

To date, GUS has been developed as an outgrowth of specific genomics projects at CBIL. The projects have benefited by their joint use of GUS. However, in this arrangement the GUS development schedule and priorities have been secondary while the genomics projects have incurred the cost and complications of improving GUS. This state has also hampered our ability to help others investigate the use of GUS. We propose here as an alternative to host and manage GUS as an independent, open development project. In particular we will:

1. Package the GUS Relational Schema for release. The GUS Relational Schema is a time-tested extensive representation of bioinformatics concepts. We will develop thorough and sophisticated documentation for it, hardware and RDBMS guidelines, a robust installation system that will accommodate on-site schema extension and a sample data set for testing. We will support multiple RDBMSs, including at least one that is open source, and will host and arbitrate the ongoing schema evolution effort.

2. Extend the GUS schema to include proteomics and other biological domains. The five domains of the GUS schema are (1) genes, RNAs, proteins, biological sequence and features, (2) transcript expression, (3) regulatory regions, (4) controlled vocabularies and (5) tracking and evidence. We will add a protein expression analog to the transcript expression domain, providing a significant bridge between genomics and proteomics. We will add other functional genomics domains as they emerge and continue our commitment to promoting and adopting data representation standards.

3. Package the GUS Application Framework and GUS applications for release. The GUS Application Framework is an object oriented layer over the GUS schema that accelerates the development of the data acquisition programs, analysis programs and user interfaces needed by functional genomics resources. We will provide hardware and application server guidelines, improve its API, provide it with exhaustive documentation and develop standard operating procedures to enable less experienced programmers to tailor GUS. We will include key data acquisition and presentation applications and upgrade them for portability.

4. Manage the GUS system as an open development project. We will promote the GUS system as a genomics community resource by managing it as an open development project, distributing the development effort between us and the system’s users. We will improve the GUS web site at to provide the standard features of an open development project, including a public CVS repository. We will ensure the quality of the codebase by instituting an ongoing code review process for internally and externally submitted code and requiring regression testing as necessary, providing a public sample instance of GUS for testing.

b. Background and Significance

General solutions are needed for functional genomics databases

Genome and EST sequencing efforts have provided many model organism, disease and human genomics projects with raw material that could be used to build functional genomics databases. According to the Genomes Online Database (GOLD, Bernal et al. 2001), there are over 200 eukaryotic genome sequencing projects underway. Most have not yet situated their data in a functional genomics context.

To do so involves integrating a diversity of internal and external data sources and analyses on a large scale to ultimately provide functional details of genes, proteins and pathways and their phenotypic contributions. For example, technologies such as microarrays and protein mass spectrometry produce gene and protein expression data that are best exploited when associated with sequence data. Much of the work is in developing sophisticated data storage systems to organize the wide range of high-volume data. The work also involves developing and deploying substantial analysis software and running high-throughput computations. Sophisticated user interfaces are expected, both for data presentation and for manual curatorial efforts.

An investment of this size is often out of reach of smaller projects and the bioinformatics core facilities that are becoming more common in biomedical research facilities. For larger projects, infrastructure development is a significant source of overhead. For this reason some groups are seeking to decrease their costs by deploying an available system. Acquiring functionality such as database organization, data loading and data presentation frees the resource’s developers to concentrate on the specific problems and issues of their biological system. An appreciation by the scientific community and funding agencies for general solutions for genomics databases is evidenced by workshops addressing this issue (e.g., NIH/NIAID/Wellcome Trust Workshop on Model Organism Databases, April 29-30, 2002, Bethesda, Maryland, see ).

Model organism databases have taken diverse approaches to data management. Ensembl is a powerful system with a focus on sequence data and automated annotation that is available for installation at other sites (Clamp et al. 2003). EcoCyc (Karp et al. 2002a) provides a pathway-based view of organisms built on an ontological foundation. Although initially focused on metabolic pathways and E. coli, the system has been extended as BioCyc to encompass other organisms, pathways, and genomes. These have largely been generated using the program PathoLogic, which predicts metabolic pathways given an annotated genome (Karp et al. 2002). The C. elegans project introduced ACeDB, which was adopted by WormBase (Stein et al. 2001), Gramene (grain genomes) (Ware et al. 2002) and others. ACeDB is an intuitive and powerful object-oriented system but lacks features that industrial-strength relational systems such as Sybase and Oracle offer.

Most model organism databases now use (or are planning to use) an RDBMS in their architecture. The Generic Model Organism Database project (GMOD, ) is moving in this direction. Gramene and WormBase, along with MGD (mouse, Blake et al. 2003), SGD (Saccharomyces, Weng et al. 2003), FlyBase (Drosophila, The FlyBase Consortium 2003), TAIR (Arabidopsis, Rhee et al. 2003), and RGD (rat, Twigger et al. 2002), are collaborating in that effort. The goal of the GMOD project is to “develop reusable components suitable for creating new community databases of biology.” Planned components include database schemas, literature curation tools, Gene Ontology management tools, visualization tools, and data processing pipelines. Recently, an early (alpha) version of a database schema, “chado”, from FlyBase has been posted.

The Genomics Unified Schema (GUS)

GUS is a genomics database infrastructure that has proven to be a general solution to this problem. GUS was developed as part of the PlasmoDB (Kissinger et al. 2002, Bahl et al. 2003), AllGenes (human and mouse, ) and EPConDB () projects. The three projects are housed in a single instance of GUS at CBIL containing approximately 140 gigabytes of data. GUS will also provide the basis for a newly funded component of the Stem Cell Gene Anatomy Project (SCGAP, ). PlasmoDB is the widely recognized community repository for Plasmodium functional genomics data, covering six species of Plasmodium with Plasmodium falciparum as the central focus (see Figure 1 and Appendix). PlasmoDB provides integrated access to and automated annotation of the Plasmodium genome sequencing efforts from the Sanger Institute, Stanford, and TIGR. With a complete draft version of the genome, PlasmoDB contains the official annotation along with EST, microarray (oligo and cDNA), SAGE, and mass spectrometry-based expression data and links to metabolic pathways. AllGenes is a human and mouse transcript assembly based gene index with genomic alignments, links to external resources (e.g., MGD, GeneCards), automated annotations (e.g., GO function term assignment and transmembrane predictions) and manual curation. AllGenes has been used to annotate in detail a portion of mouse chromosome 5 (Crabtree et al. 2001). EPConDB is an endocrine pancreas disease-specific subset of AllGenes that focuses on gene expression from EST and microarray experiments and incorporates signal transduction pathways. EPConDB has been used in the development of the PancChip (Scearce et al. 2002) and will be the central repository for microarray experiments generated by the Beta Cell Biology Consortium (). Another instance of GUS is being used for fungi, protozoa, and microbes in GeneDB () at the Sanger Institute.

These GUS based systems are part of a new generation in genomics resources that focuses on querying and data mining. They offer ad hoc querying over a wide range of integrated data and provide highly structured results. Researchers use these facilities to ask genome-wide questions unsupported by many other organism databases. All three use Genes or RNAs as entry points (top level data types) into the data. EPConDB and PlasmoDB have moved in the direction of multi-dimensional resources by also offering transcript expression experiments at the top level.

As an infrastructure for functional genomics projects, GUS offers a vertical solution. GUS is organized into two subsystems, the GUS Relational Schema and the GUS Application Framework (Figure 2). These build on an RDBMS and provide: an extensive schema covering more than 300 tables and views designed for data mining; data robustness features such as tracking and versioning; data loading programs for a wide range of standard data sources and analysis results; a facility to ease development of data loading programs for new data sources; a pipeline facility to specify in silico protocols; a sophisticated web development kit designed to support advanced web based user queries; a set of manual data entry applications; sample code for advanced web sites and queries; and, an open architecture for extendibility.

The GMOD group has approached a similar problem from a different angle. It is assembling a federation of tools that it expects will converge on interoperability. This is a good approach, and we plan to both contribute to GMOD and use GMOD tools when possible. The advantage of an existing integrated system like GUS is that its components are known to work well together to form the basis of a sophisticated resource. The GMOD group has expressed an interest in the GUS schema described below. They are in the process of developing the chado schema, which will cover many concepts included in GUS. The two schemas have similar design philosophies, although the GUS schema has been in use for four years and is more comprehensive.


Figure 1. The PlasmoDB web site. Data types stored in GUS and available through the PlasmoDB site are shown on the upper left. The PlasmoDB home page is shown in the center with reports and analyses that are accessible through the site. At upper right is a self-contained CD version of PlasmoDB (GenePlot). This figure was used as the cover for the 2003 Nucleic Acids Research Database issue.

The GUS Relational Schema

The GUS Relational Schema runs on an RDBMS and is implemented in Oracle 9i but is largely designed to be server independent (work on this is included in this proposal). The schema represents many years of development. It evolved by combining features of the prototype systems EpoDB (Stoeckert et al 1999), GAIA (Bailey et al. 1998), and DoTS (Database of Transcribed Sequences). These systems focused on sequence and sequence annotation in terms of curation (EpoDB), automated genomic annotation (GAIA), and clustering and assembly of ESTs (DoTS). A key improvement in GUS is an explicit representation of the central dogma relationships: genes → RNAs → proteins. Genes, RNAs and Proteins have stable identifiers and form the backbone of GUS. They contain, for example, name, symbol and functional assignment. They are higher level representations than sequence features and include individual sequences (those with the relevant gene, RNA or protein as a feature) as instances. This is a move away from sequence-centrism that characterizes sequence archives such as GenBank (Benson et al. 2003). Queries directed towards Genes, for example, return the Gene itself, links to related RNAs (and through them to Proteins) and the sequences that are the Gene’s instances. Because the concepts are stable, manual annotation of them persists across versions of the database.

We have recently augmented the GUS schema with the gene expression database, RAD (Stoeckert et al. 2001), and the transcription factor binding site search system, TESS (Schug 2003). RAD is a fully realized system for microarray (both cDNA and oligo) and SAGE data. As part of the inclusion of RAD into GUS we have upgraded the schema to include and keep up with the MIAME (Brazma et al 2001) and MAGE (Spellman et al. 2002) standards. To our knowledge, GUS is the only freely available system that integrates sequence and expression data with the degree of expressiveness that DoTS and RAD provide. The addition of TESS, a system for regulatory sequence analysis, provides a unique dimension for data mining.


Figure 2. The GUS system. GUS includes a relational schema with five domains. The application framework is a Perl and Java layer that offers objects and direct SQL database access. The schema and application framework will be managed as an open development project housed in CVS. The GUS applications include data integration and analysis pipelines, data entry systems such as the RAD Web Forms and the Annotator’s Interface and sample web sites. The applications are open source (not open development) and will be released via download (not CVS).

To accommodate the wide range of data represented in GUS, and the continuing expansion we envision, we have divided GUS into domains, each with its own namespace. The five domains currently included in GUS are: DoTS, RAD, TESS, Core (data provenance tables) and SRes (shared resources, including ontologies and controlled vocabularies).

GUS uses the warehouse approach to data integration (Davidson et al. 2001), storing external data sources internally (see Appendix). If the data source is a recognized standard, GUS will model it directly. Otherwise it uses its own representation and a transformation program (which evolves as the external model changes) for loading. Warehousing provides a straightforward way to efficiently query across the external data sources. The snapshot of an external resource also provides a semi-stable basis on which to derive computed data such as annotation. Some external sources (such as the GO hierarchy) provide explicit new releases while others (such as NRDB) are updated incrementally. In the former case, GUS stores each release individually. Computations derived from a particular release of the external data source can persist after the next release is loaded. The derived data may or may not be recomputed at that time. In the case of an incrementally updated external source, all derived data is recomputed at the time the external source is refreshed in GUS.

The relational schema is strongly typed, explicitly representing concepts such as different types of sequence features (Gene, RNA, exon, repeat, etc.) and using controlled vocabulary tables rather than unconstrained properties. To manage the large number of data types we introduce with strong typing, we use views to implement specialization. For example there is one general table for all nucleic acid features but 26 specialized views (GeneFeature, RNAFeature, etc.). This approach is taken throughout the domains in GUS and is interpreted as a single level of subclassing in the GUS Application Framework discussed below. While the GUS schema is extensive and highly normalized, its utilization in a number of projects has demonstrated its ability to handle queries efficiently.

Every table in GUS has a set of standard attributes. Each table has a single numeric primary key attribute with automatically generated values. The value serves as a unique handle in the GUS Object layer described below. Each table also includes attributes to specify data ownership by project and tracking information such as who updated the row and how (i.e., using what algorithm).
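To make these conventions concrete, the sketch below shows how a specialized view can be layered over a general implementation table carrying the standard bookkeeping attributes. It is a minimal illustration only: the table, view and column definitions are hypothetical and do not reproduce the actual GUS schema.

#!/usr/bin/perl
# Illustrative sketch only: the table, view and column definitions below are
# hypothetical and do not reproduce the actual GUS schema.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:gusdev', 'gususer', 'secret',
                       { RaiseError => 1, AutoCommit => 0 });

# One general implementation table holds all nucleic acid features; every GUS
# table also carries a numeric surrogate key plus ownership and tracking columns.
$dbh->do(<<'SQL');
CREATE TABLE NAFeatureImp (
  na_feature_id         NUMBER(10)   NOT NULL,  -- surrogate primary key
  subclass_view         VARCHAR2(30) NOT NULL,  -- which specialized view owns the row
  na_sequence_id        NUMBER(10),
  name                  VARCHAR2(80),
  string1               VARCHAR2(255),          -- generic columns reused by subclasses
  number1               NUMBER(12),
  row_project_id        NUMBER(5)    NOT NULL,  -- data ownership by project
  row_alg_invocation_id NUMBER(12)   NOT NULL,  -- who/what wrote the row
  modification_date     DATE         NOT NULL,
  PRIMARY KEY (na_feature_id)
)
SQL

# A specialized view exposes one feature type under meaningful column names;
# the Application Framework treats each such view as a single-level subclass.
$dbh->do(<<'SQL');
CREATE VIEW GeneFeature AS
SELECT na_feature_id, na_sequence_id, name,
       string1 AS gene_symbol,
       number1 AS number_of_exons,
       row_project_id, row_alg_invocation_id, modification_date
FROM   NAFeatureImp
WHERE  subclass_view = 'GeneFeature'
SQL

$dbh->commit;
$dbh->disconnect;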

The integrative and strongly typed design of the GUS schema facilitates queries that may be difficult or impossible using other infrastructures. For example, the PlasmoDB inquiry “Find all genes whose proteins are predicted to contain a signal peptide and for which there is evidence that they are expressed in Plasmodium falciparum’s late schizont stage” combines: (1) proteins predicted to contain signal peptides; (2) proteins with evidence of expression in the form of late-stage schizont specific (a) EST overlap, (b) SAGE tag overlap, (c) proteomic fragment overlap or (d) expression signal on a microarray contained within the database; and (3) links from proteins to genes. This query efficiently provides a viable list of candidate genes to a bench researcher for evaluation. Functionality such as this has led the Plasmodium community to embrace PlasmoDB.

The GUS Application Framework

The GUS Application Framework is a tier on top of the relational database. It abstracts the relational schema into objects, and facilitates the development of data acquisition and analysis programs and user interfaces. The framework includes the GUS Object Layer (Perl, with a Java implementation underway), the GUS Plugin Facility (Perl), the GUS Pipeline API (Perl) and the GUS Web Development Kit (Java and Perl). It has been used by AllGenes, PlasmoDB, EPConDB and GeneDB to accelerate their development of sophisticated data acquisition and presentation programs.

The Perl and Java object layers are native representations. They map directly onto the relational tables and views, and are generated automatically from them. They offer functionality such as accessor methods for attributes and relationships, cascading submit and delete and transaction management. They also include a facility to hand-edit objects to add business logic, which is retained across releases. The Perl objects are typically used in applications for loading and exporting data (see the GUS Plugin Facility below). The Java object layer is targeted towards GUI applications such as an Annotator’s Interface that is in development. The object layer’s single-level subclassing simplifies the coding of generalization-specialization cases.
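As a rough sketch of how the Perl object layer is typically used, the fragment below creates a parent object and a child, and submits both in one cascading transaction. The package, method and attribute names (GUS::Model::DoTS::GeneFeature, addChild, submit, and so on) are assumptions for illustration and may differ from the generated API.

#!/usr/bin/perl
# Illustrative sketch: package, method and attribute names are assumed and
# may not match the generated GUS object layer exactly.
use strict;
use warnings;

use GUS::Model::DoTS::GeneFeature;   # hypothetical generated classes
use GUS::Model::DoTS::ExonFeature;

# Objects map one-to-one onto tables/views; accessors are generated
# automatically from the schema.
my $gene = GUS::Model::DoTS::GeneFeature->new({ name => 'example gene' });
$gene->setIsPredicted(1);            # hypothetical generated accessor

# Children added to a parent are written along with it.
my $exon = GUS::Model::DoTS::ExonFeature->new({ order_number => 1 });
$gene->addChild($exon);

# A single submit cascades to the child inside one database transaction,
# filling in primary and foreign keys and the standard tracking columns.
$gene->submit();

print 'new gene feature id: ', $gene->getId(), "\n";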

The native objects offer a transform from and to a direct XML representation, which GUS uses as a data exchange format. They are also transformed into non-native object models in the context of particular applications. For example, objects generated from the RAD portion of GUS are mapped to MAGE objects to generate MAGE-ML (Spellman et al. 2002), and vice versa.

Data acquisition and update programs are written using the GUS Plugin Facility. They are called “plugins” and are invoked by the GUS Plugin Facility’s guswrite program. guswrite handles robustness functions such as argument validation, data tracking (who ran the plugin, what algorithm was run, etc.) and versioning. The gbParse plugin, for example, parses a GenBank record into GUS objects that are inserted into the DoTS domain of GUS. The GUS Object Layer and GUS Plugin Facility have substantially simplified the process of writing data loading applications, allowing scientists with less programming experience to produce clean data acquisition and analysis programs. GUS includes a set of plugins for common data loading and analysis operations.
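The skeleton below suggests the general shape of a plugin as described above: a Perl module that declares its arguments and documentation for guswrite and implements a run method. The base class name, callback methods and argument-declaration keys are assumptions for illustration, not the definitive plugin API.

package MyProject::Plugin::LoadTabDelimitedFeatures;
# Illustrative plugin skeleton: the base class and callback/method names are
# assumptions based on the description above, not the definitive plugin API.
use strict;
use warnings;
use base qw(GUS::PluginMgr::Plugin);   # hypothetical base class

sub new {
    my ($class) = @_;
    my $self = bless {}, $class;
    # Declarations let guswrite validate arguments and record tracking data.
    $self->initialize({
        requiredDbVersion => '3.0',
        name              => ref($self),
        argsDeclaration   => [
            { name => 'file', descr => 'tab-delimited input file', reqd => 1 },
        ],
        documentation     => { purpose => 'load simple features from a tab-delimited file' },
    });
    return $self;
}

# guswrite invokes run() after validating arguments and starting a transaction.
sub run {
    my ($self) = @_;
    my $file  = $self->getArg('file');
    my $count = 0;
    open my $fh, '<', $file or $self->error("cannot open $file: $!");
    while (my $line = <$fh>) {
        chomp $line;
        my ($name, $start, $end) = split /\t/, $line;
        # ... build and submit GUS objects for each feature here ...
        $count++;
    }
    close $fh;
    return "loaded $count features";   # result summary recorded by the facility
}

1;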

Developers and scientists can use the GUS Pipeline API to specify and manage large and small protocols. It is a lightweight Perl API that eases the specification and running of a protocol, particularly protocols that use GUS plugins. For example, developers can define a pipeline to download a set of external data sources, run plugins to insert the data and run analyses such as BLAST (Altschul et al. 1990) or FrameFinder () to annotate inserted sequences.
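A sketch of what a small protocol specified with the Pipeline API might look like follows; the module name (GUS::Pipeline::Manager), the step method and its options are assumptions about the API's general style rather than its actual interface. The plugin names are taken from Table 1 below.

#!/usr/bin/perl
# Illustrative pipeline sketch: the module name and step() interface shown here
# are assumptions about the API's general style, not its actual interface.
use strict;
use warnings;
use GUS::Pipeline::Manager;   # hypothetical module name

my $mgr = GUS::Pipeline::Manager->new(workingDir => '/data/pipeline');

# Each step either runs a shell command or invokes a plugin through guswrite.
$mgr->step('download_nrdb',
           cmd => 'wget -q ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz');

$mgr->step('load_nrdb',
           plugin => 'LoadNRDB',
           args   => { file => '/data/pipeline/nr.gz' });

$mgr->step('blast_assemblies',
           cmd => 'blastall -p blastx -d nr -i assemblies.fsa -o blast.out');

$mgr->step('load_similarities',
           plugin => 'LoadBlastSimilarities',
           args   => { file => 'blast.out' });

# Steps record completion state so an interrupted pipeline restarts where it
# left off rather than repeating finished work.
$mgr->run();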

The GUS Web Development Kit (WDK) offers web developers substantial assistance in developing a sophisticated web interface for their project. The WDK provides web access to GUS for AllGenes, PlasmoDB, and EPConDB. The WDK includes session management and a declarative mechanism to generate query forms and result pages. The forms issue advanced SQL queries to the database and allow users to build complex queries out of simpler ones with boolean operations. Results are strongly typed and are stored in a result history. Users can also use boolean operations to combine the history results, iteratively refining their result. The WDK utilizes the GUS objects and SQL and is implemented in a combination of Java servlets, PHP4 and Perl.

GUS applications

The GUS system includes a set of data acquisition, analysis and presentation programs. Some, such as the RAD Web Forms (described in more detail in the Research Design and Methods section: Specific Aim 3), are likely to be a key component of a GUS installation and some, such as the AllGenes transcript assembly pipeline, are special purpose programs with narrower applicability. Some provided applications, such as the AllGenes web site, will serve only as code examples for resource developers, not as running applications.

Standards in Functional Genomics

Functional genomics is driven by high-throughput technologies that in turn drive biological databases. Sequence data was the initial motivator, but more recently microarray-based gene expression data has spurred new informatics needs. Proteomics is moving forward as mass spectrometry and protein interaction technologies gain usage and generate large datasets. Other biological data types are likely to follow (e.g., metabolomics).

Standards have proven essential for biological databases to efficiently import, exchange and distribute these different data types. They also facilitate data mining and data integration by providing terms that can be consistently interpreted, and well-defined and stable formats usable by applications. For these reasons, the FASTA format (Pearson 1990) is a de facto standard for sequence and the GFF () is gaining usage for representing sequence features. The Gene Ontology Consortium has established as standards the Gene Ontology (GO) terms for molecular function, biological process and cellular components (Ashburner et al. 2000). Sequence Ontology (SO) terms for sequence features are being developed ().

Standards for microarray data are also under development. Microarray data is more complex than sequence reports because it is context-dependent. The experimental design and information on samples, arrays and protocols are needed to interpret and verify the quantified and processed data. Over the last 3 years, the Microarray Gene Expression Data (MGED) Society () has developed standards for the Minimum Information About a Microarray Experiment (MIAME, Brazma et al. 2001) and a Microarray Gene Expression Object Model (MAGE-OM, Spellman et al. 2002). Dr. Stoeckert has been involved with both of these efforts. MAGE is an approved OMG standard () and is implemented as an XML format (MAGE-ML). An effort (led by Dr. Stoeckert) is underway to develop an MGED Ontology of microarray annotation terms. The Human Proteome Organization (HUPO) has begun generating standards for mass spectrometry and protein-protein interactions as part of a Proteomics Standards Initiative (PSI, ). MGED has offered to work with HUPO to leverage the MGED experience and standards as a starting point for the PSI effort.

Our long standing involvement with standards efforts has influenced our structuring of GUS. The schema utilizes standards throughout. For example, the SRes domain (shared resources) incorporates complete representations of the NCBI taxonomy, GO function terms, SO terms, MGED Ontology and others. The RAD gene expression schema is MIAME compliant.

c. Preliminary Data

While GUS has been in use for years in multiple projects at the University of Pennsylvania, its development has been driven by funding from those projects, and subject to their priorities. Those priorities such as releasing new versions of AllGenes and PlasmoDB have been at odds with packaging GUS for release or managing it as an open development project. Nonetheless, we have begun these efforts on a modest scale. The results are encouraging, the most significant being our delivery of GUS to two external sites, and a commitment from a third to proceed. These three sites, it turns out, exemplify the types of projects we have in mind for GUS: a heavily staffed major genomics project, a lightly staffed smaller genomics project, and a bioinformatics core facility. We have also set up a web site which is a start at open source project management and have made progress documenting and packaging GUS.

Delivery to external sites

We began work with the Pathogen Sequencing Unit (PSU) of the Wellcome Trust Sanger Institute in early 2002. PSU is developing GeneDB (), a genome data resource initially focusing on pathogens and fungi, such as sequence and annotation for Schizosaccharomyces pombe. They chose GUS over other systems they evaluated because GUS holds the type of information they need and the PlasmoDB project demonstrated that GUS is suitable for their organisms of interest. PlasmoDB is user-friendly and supports sophisticated queries, which has led it to be well received by the malaria community. This demonstrated to PSU that the information in GUS can be extracted in a biologically relevant manner. PSU was also attracted to GUS because GUS permits curation of a genome to continue beyond its sequencing project, and the genome sequence information can be accessed in biological context. They plan to provide phenotypic data and links to alleles and genes. PSU has contributed significantly to the development of the GUS schema in this area and has provided creation scripts for introduced tables. Coordination of the project involves discussions on a SourceForge mailing list and monthly conference calls. Prior to setting up GUS in the summer of 2002, Ms. Marie-Adele Rajandream and three members of her group visited CBIL for a week. Working with the PSU group clarified many issues associated with setting up GUS at external sites. (See letter of support.)

Dr. Jessica Kissinger’s group at the University of Georgia (UGA) has also installed GUS. Dr. Kissinger was a major developer of PlasmoDB and is familiar with GUS. Dr. Kissinger and a computer science Masters student managed the installation with consultation from an Oracle database administrator and a series of calls and emails with CBIL. We believe this demonstrates that using GUS does not require a major investment or a large staff. Dr. Kissinger is using GUS to house TcruziDB (Trypanosoma cruzi) and plans to move ApiDB, a database of Apicomplexan data, into GUS. (See letter of support.)

A third external installation of GUS is about to start at the University of Pennsylvania’s Bioinformatics Core Facility. The Director of the core, Dr. Brian Brunk, was a developer of GUS (DoTS in particular) when he was a member of CBIL. Dr. Brunk expects the DoTS domain of GUS to be of widespread use to the Penn community and the RAD domain and the RAD Web Forms to be heavily used by the Penn Microarray Core facility. (See letter of support.)

Packaging GUS

The current schema, code, and documentation for GUS are available at the GUS web site. This site provides pointers to a public Concurrent Versions System (CVS, ) repository and a public SourceForge () mailing list. There are reciprocal links between the GUS and GMOD sites.

1 Licensing

We have adopted the Apache Open Source License () which allows unrestricted use as long as the copyright notice is retained.

2 CVS Repository

Code placed in a public CVS repository is most useful if the repository is well organized. We have completed the process of designing a highly structured project- and component-based organization of our CVS repository. The projects specify their dependencies upon other projects, and the components within projects specify their dependencies upon other components. This has enhanced the modular structure of our software and has aided in software maintenance and project management. We have placed a subset of our codebase in this new structure (and will continue the migration as part of this proposal).

3 Relocatable executables

We have transitioned a significant subset of our executable software to a relocatable file system location designated by the $GUS_HOME environment variable. This helps ensure that GUS is simple to install, allows multiple running versions of GUS and avoids conflicts with a site’s pre-existing file structure.

4 Build System

We have completed the development of an Ant-based () build system to transform source code as stored in our structured CVS repository into the structured executable form expected in $GUS_HOME. The build system detects inter-project and inter-component dependencies, ensuring that the installed codebase is internally consistent, complete, and up to date.

5 Documentation

The existing documentation for the GUS schema is available in the GUS schema browser on the GUS web site. The schema browser displays an alphabetical list of tables divided by domains (Figure 3). The listing is generated dynamically by querying GUS. Each table name links to a detailed view displaying attribute names and types and links to related tables. The browser also displays descriptions of the tables and attributes that are stored as meta information in the database. The documentation effort to create those descriptions is approximately 30% complete. UML diagrams of selected portions of the schema are available from the GUS web site.

6 Installation tutorial

Dr. Kissinger’s group has developed an installation tutorial that documents the detailed procedure they followed to install GUS. It begins from a blank slate, and explains installing an Oracle server, installing the GUS schema, populating the database, running Plugins and installing Java, Tomcat and Java servlets.

7 Platform independence

The Oracle RDBMS we have used for GUS and the GUS Application Framework run on Red Hat Linux 7.x and 8.x. We have not purposefully relied on any Linux specific UNIX features but have not tested GUS on other UNIX variants (although it previously ran on Solaris). The RDBMS could in theory be moved to any operating system because we communicate with it as a client.

In general we have designed GUS to be RDBMS independent. We were exposed to the issues when we moved GUS from Sybase to Oracle two years ago. At that time we factored most of the RDBMS-specific code into a small number of files. There is still some work to be done, in particular re-implementing the RDBMS-specific features for each new RDBMS that we support.
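The sketch below illustrates one way the remaining vendor-specific code can be isolated behind a small dispatch module, in the spirit of the factoring described above, so that supporting a new RDBMS means adding one implementation class. The module and method names, and the choice of PostgreSQL as the second platform, are illustrative assumptions.

package GUS::DbPlatform;
# Illustrative sketch of isolating vendor-specific SQL behind one dispatch
# module; the module, method names and supported platforms are assumptions.
use strict;
use warnings;

my %IMPL = (
    oracle     => 'GUS::DbPlatform::Oracle',
    postgresql => 'GUS::DbPlatform::Postgres',
);

sub new {
    my ($class, $vendor) = @_;
    my $impl = $IMPL{ lc $vendor } or die "unsupported RDBMS '$vendor'\n";
    return $impl->new();   # implementation classes are defined below
}

package GUS::DbPlatform::Oracle;
sub new { return bless {}, shift }
# Vendor-specific idioms live only here (sequences, datatypes, limits).
sub nextValSql { my ($self, $seq) = @_; return "SELECT $seq.NEXTVAL FROM dual" }
sub clobType   { return 'CLOB' }

package GUS::DbPlatform::Postgres;
sub new { return bless {}, shift }
sub nextValSql { my ($self, $seq) = @_; return "SELECT nextval('$seq')" }
sub clobType   { return 'TEXT' }

package main;
my $db = GUS::DbPlatform->new('postgresql');
print $db->nextValSql('na_sequence_id_sq'), "\n";   # vendor-appropriate SQL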

The GUS WDK runs on the 4.x releases of Tomcat, which implement the Java Servlet 2.3 and JSP 1.2 specifications.  The WDK code is compatible with any servlet container implementation that meets these specifications, although the configuration details may vary slightly. (Tomcat is the "Reference Implementation" for these Java technologies.)

8 Schema installation programs

We have developed prototype schema installation programs for Oracle 8i and 9i. More sophisticated programs, and programs for other platforms such as MySQL, are included as part of this proposal.


Figure 3. The GUS Schema Browser. This figure shows the Gene table in detail. Relationships for parent and child tables are based on foreign keys.


9 Mirroring GUS

Both PSU and Dr. Kissinger’s group have installed replicas of our development GUS database as an initial data set. This was useful but we expect in the future to provide a smaller sample dataset. In the case of PSU, we used the standard Oracle IMPORT and EXPORT utilities, transferring the data by ftp. This approach requires manual intervention unless the two instances’ table spaces match. It also requires an amount of temporary storage equal to the data size. For Dr. Kissinger we experimented with a script that transfers tables (both their schemas and data) directly from one Oracle instance to another. Installing the UGA mirror tested this script over a relatively slow long-distance network link, and the installation was too slow (20 days) to consider for the future. However, other tests indicate that the script is practical for use over a higher speed (100 Mbps) local network connection and would be faster than using the Oracle utilities.
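A minimal sketch of the direct table-transfer idea follows, using two DBI connections and batched commits so the copy streams row by row; the connection strings, credentials, table name and batch size are placeholders.

#!/usr/bin/perl
# Minimal sketch of copying one table directly between two Oracle instances
# over DBI; connection strings, credentials, the table name and the batch
# size are placeholders.
use strict;
use warnings;
use DBI;

my $src = DBI->connect('dbi:Oracle:gus_cbil', 'reader', 'secret',
                       { RaiseError => 1 });
my $dst = DBI->connect('dbi:Oracle:gus_uga', 'writer', 'secret',
                       { RaiseError => 1, AutoCommit => 0 });

my $table = 'SRes.TaxonName';
my $sth   = $src->prepare("SELECT * FROM $table");
$sth->execute();

my @cols = @{ $sth->{NAME_lc} };
my $ins  = $dst->prepare(sprintf 'INSERT INTO %s (%s) VALUES (%s)',
                         $table, join(',', @cols), join(',', ('?') x @cols));

# Stream rows from source to destination, committing in batches so the copy
# needs no intermediate dump file or temporary storage.
my $n = 0;
while (my @row = $sth->fetchrow_array) {
    $ins->execute(@row);
    $dst->commit if ++$n % 10_000 == 0;
}
$dst->commit;
print "copied $n rows of $table\n";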

These two test cases did not address two important issues: transferring a subset of the database, and transferring across RDBMS platforms. These issues will be considered as part of this proposal.

d. Research Design and Methods

Specific Aim 1: Package the GUS Relational Schema for release

The GUS Relational Schema has established its effectiveness through years of service for the AllGenes, PlasmoDB and EPConDB projects. These projects have relied upon informal local communication at the University of Pennsylvania to understand and specify the schema. To package GUS for widespread release, we face the challenge of educating external developers and providing them with a means to easily install and extend the schema.

1 Schema documentation

The current schema documentation consists of a web based schema browser and hand-made UML diagrams of subsets of the schema. This has been serviceable internally, but needs improvement for external use.

1 Subdividing the DoTS domain

Users approaching the schema turn first to the domains as an organizational aid. While RAD, TESS, Core and SRes are natural divisions, the DoTS domain grew historically from the DoTS database, and includes a very large number of tables and views (200). We will break DoTS into three new domains: Dogma, the central dogma tables; Seq, the sequence and feature tables; and, Assem, the transcript assembly tables. This represents a minor adjustment but will have a large impact in terms of accessibility.

2 Table categories

To make the documentation manageable throughout, we will categorize the tables and views by topic, such as “Gene Annotation” or “Processing and Normalizing Expression Experiment Results.” Each table and view may belong to more than one category. The category information will be stored in the database as meta information.

3 Schema browser

Our schema browser is an efficient way to get detailed information about a table, including its attributes, relationships and documentation. We will continue the effort to document each of the tables and views. We will make minor improvements on the browser’s user-friendliness, such as adding a search function that lists tables whose attributes or descriptions match provided keywords. We will supply each table with a link to its categories (where a category specific tutorial will be available) and to the relationship browser described below.

4 Tutorials and UML diagrams

We will provide tutorials and UML diagrams for each table category. The tutorials will cover standard use cases for the category, explaining aspects of the schema that are difficult to deduce. We use Rational Rose () to generate the UML diagrams from the schema. However, because GUS uses views to implement an ISA relationship, the generated Rational Rose UML diagrams do not correctly convey the subclassing. We currently correct the UML manually, but as the XML Metadata Interchange (XMI) standard evolves to include diagramming capability, we expect to be able to alter the diagrams automatically as the schema changes, decreasing the cost of maintaining them. In the meantime, we will explore using a graphics format such as Scalable Vector Graphics (SVG) to render objects representing the GUS tables, and layout algorithms such as Graphviz () to place them in a legible graph display.

5 Relationship browser

The GUS schema is large and includes a complex has-a relationship structure. UML diagrams display these relationships but, because they are static representations of subsets of the schema, sometimes suffer from being incomplete, overwhelming or both. We propose to introduce a simple relationship browser to provide developers (and other interested scientists) concise and complete relationship information for the GUS tables. We will model this browser on the AmiGO Gene Ontology browser (), which displays the GO ontology in an expandable folder format. Each folder will be a link which invokes a detailed page about a table, preceded with a + or - icon to expand or collapse the tree at that point. A parent-child relationship is one between two tables where the child table has a foreign key to the parent. The default view shows only root tables, i.e., tables that have no parent. These expand to show children. The children expand to show their children, and their other parents (the links are marked as such). Cycles are broken (and marked as such).

The user will navigate from a table to any others with which it has relationships, seeing the relationships laid out in a single display. A search function will take as input either one table or two tables. The single-table search creates a display with that table as root. The two-table search display is the same, but the tree is pruned of all paths that do not lead to the second table.
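As a sketch of how the parent-child structure can be derived, the script below reads foreign key constraints from the Oracle data dictionary and prints an expandable-style tree of tables, marking repeats to break cycles. It is a command-line stand-in for the proposed web display and assumes an Oracle dictionary rather than any particular GUS metadata.

#!/usr/bin/perl
# Sketch of building the relationship tree from foreign keys in the Oracle
# data dictionary; output is a nested text listing rather than the proposed
# web display. The tables shown are whatever the connected schema contains.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:gusdev', 'gususer', 'secret',
                       { RaiseError => 1 });

# child table -> parent table pairs, derived from foreign key constraints
my $pairs = $dbh->selectall_arrayref(q{
    SELECT c.table_name AS child, p.table_name AS parent
    FROM   user_constraints c
    JOIN   user_constraints p ON c.r_constraint_name = p.constraint_name
    WHERE  c.constraint_type = 'R'
});

my (%children, %has_parent);
for my $row (@$pairs) {
    my ($child, $parent) = @$row;
    push @{ $children{$parent} }, $child;
    $has_parent{$child} = 1;
}

# Roots are tables with no parent; expand children recursively, marking
# already-visited tables so cycles are broken.
sub show {
    my ($table, $depth, $seen) = @_;
    print '  ' x $depth, $table, $seen->{$table} ? " (cycle)\n" : "\n";
    return if $seen->{$table}++;
    show($_, $depth + 1, $seen) for sort @{ $children{$table} || [] };
}

my %all = map { $_->[0] => 1, $_->[1] => 1 } @$pairs;
show($_, 0, {}) for sort grep { !$has_parent{$_} } keys %all;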

2 Definitive representation of the schema

Co-development and distribution of the GUS schema requires a definitive and portable representation of it. We propose to store this information in the form of GUS definition files. The schema itself will be stored in files containing SQL create table statements (one file per domain). We will also have formats to store table and attribute comments and meta information describing the tables for the benefit of the GUS Application Framework (see Specific Aim 3). The definition files will be housed in the source code repository, and be subject to release cycles in the same way as all other components (see Specific Aim 4). We will provide a program to generate definition files from an existing GUS instance.

3 Schema installation

We will provide scripts that create or revise the schema of a GUS instance based on a set of GUS definition files. The installation process will take as configuration information a directory tree in which to find definition files for on-site extensions to the schema.
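A sketch of such an installation script is shown below: it applies the per-domain SQL definition files and then any on-site extension files found under a configured directory. The directory layout, file naming convention and statement-splitting approach are simplifying assumptions.

#!/usr/bin/perl
# Sketch of a schema installer that applies per-domain SQL definition files,
# then any on-site extension files found under a configured directory.
# The directory layout and file naming are assumptions for illustration.
use strict;
use warnings;
use DBI;
use File::Find;

my ($defDir, $extDir) = @ARGV;   # e.g. gus_schema/defs  site/schema_extensions
die "usage: $0 definition-dir [extension-dir]\n" unless $defDir;

my $dbh = DBI->connect('dbi:Oracle:gusdev', 'gususer', 'secret',
                       { RaiseError => 1, AutoCommit => 1 });

sub apply_sql_file {
    my ($file) = @_;
    open my $fh, '<', $file or die "cannot open $file: $!";
    local $/ = ';';                 # one DDL statement per semicolon
    while (my $stmt = <$fh>) {
        $stmt =~ s/;\s*\z//;
        next unless $stmt =~ /\S/;
        $dbh->do($stmt);
    }
    close $fh;
    print "applied $file\n";
}

# core domains first (one definition file per domain), then site extensions
apply_sql_file($_) for sort glob("$defDir/*.sql");
if ($extDir && -d $extDir) {
    find(sub { apply_sql_file($File::Find::name) if /\.sql\z/ }, $extDir);
}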

4 Migrating data across schema versions

We have recently upgraded the GUS schema to version 3.0. This version included significant changes. We have developed and put to use a programming framework to help manage migrating data across versions of the schema.

We will use this approach to provide, for each release of the schema, a migration program that transforms the data from the previous release to the new release. Sites with GUS installations that have fallen behind releases will have the option of installing and migrating the intervening releases in succession to catch up, or of installing only the latest release and writing a migration script to make the jump.
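A single step in such a migration program might look like the sketch below, which moves a free-text column into a reference to a controlled-vocabulary table. The table names, column names and release number are purely illustrative; real migrations will be generated within the framework described above.

#!/usr/bin/perl
# Illustrative migration step: table, column and release names are hypothetical.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Oracle:gusdev', 'gususer', 'secret',
                       { RaiseError => 1, AutoCommit => 0 });

# Hypothetical change in release 3.1: replace a free-text "source" column
# with a foreign key to a controlled-vocabulary table.
$dbh->do('ALTER TABLE DoTS.GeneAnnotation ADD (annotation_source_id NUMBER(10))');

$dbh->do(q{
    UPDATE DoTS.GeneAnnotation ga
    SET    annotation_source_id = (SELECT s.annotation_source_id
                                   FROM   SRes.AnnotationSource s
                                   WHERE  s.name = ga.source)
});

$dbh->do('ALTER TABLE DoTS.GeneAnnotation DROP COLUMN source');
$dbh->commit;
print "DoTS.GeneAnnotation migrated to the release 3.1 layout\n";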

5 Sample data set

New installations of GUS need a sample dataset to begin the process of understanding the system in detail, and testing plugins and other programs. To date we have approached this problem by providing a dump of the entire database. This had the advantage that it was relatively simple to implement. However, it is overkill for the intended purpose, and is cumbersome or impractical on a wider basis.

Instead we will provide a sample dataset that covers much of the schema (and includes only freely available data). It will be delivered in a flat file format for each RDBMS that we support. For simplicity, this file will have embedded ids, which means that it can only be inserted into an empty database. We will also provide a sample pipeline (discussed below) that will generate and insert this dataset, even into an already populated database.

The sample dataset will evolve as the schema changes. We will include migration scripts to update the sample for new releases of the schema.

A more distant goal will be to develop a system to transfer large subsets of or entire existing GUS instances.

6 Platform independence

We will complete the process of localizing RDBMS dependence that began when we ported GUS from Sybase to Oracle. We will specify in general how to deploy GUS on a particular server, and implement this on at least one open source server, such as MySQL or PostgreSQL. We will assist co-developers (see Specific Aim 4) in their efforts to support additional platforms.

7 Meta information

The GUS Application Framework uses meta information stored in the database (see Specific Aim 3). For example, the schema browser uses this information to display table attributes and relations. It also contains an as yet unused representation for categorizing tables. We will populate this with information categorizing tables by use case.

We will also extend the meta information to include a semi-formal representation of changes between versions of the schema. It will capture tables and attributes that have been added or deleted and the list of relevant tables and attributes from the previous schema. The information stored is intended for manual not automated interpretation, and will be used, for example, when generating release notes and writing schema migration programs. We will develop a simple user interface to enter this information. It will be initialized with the differences found between the definition files of the current and previous releases of the schema.

8 Release Notes

We will use the meta information stored in the database to generate release notes for the schema. These will detail the differences between schema versions.

Specific Aim 2: Extend the GUS schema to include other biological data types such as proteomics data

The RAD domain of the GUS schema contains microarray and SAGE data. We will add an analogous domain for proteomics. In PlasmoDB we are using sequence feature tables for mass spec. data because we are storing only limited information on peptide mass and assignments. As is the case with microarray experiments, we expect more detailed descriptions of mass spec. experiments to be forthcoming, and we are fortunate to have Dr. John Yates, one of the leaders in this field, as a contributor to PlasmoDB (see letter of support). While it is relatively simple to capture high-level and highly processed results from mass spec. experiments, it is more difficult to capture the minimal information needed to interpret and reproduce these experiments. We will continue to monitor efforts led by HUPO to generate standards along the lines of MIAME and MAGE for proteomics data and incorporate them into GUS as they develop. A summary of key points from HUPO on mass spec. working groups is available at . It provides a starting point for modeling representations of raw data and their interpretation, determining which upstream components need to be recorded, and identifying the questions the data repository should be able to answer.

A graduate student, Andrew Jones, at the University of Glasgow has been adapting MAGE to mass spec. data () as part of a project with Dr. Jonathan Wastling (see letter of support). Mr. Jones is planning to spend time working with CBIL to develop a schema based on this work prior to the funding period of this proposal.

In conjunction with the proteomics schema domain, we will develop the plugins required to load this data into the database.

The approach we take for mass spec. data will be applicable to other types of functional genomics experiments, such as protein-protein interaction data, use of microarrays for comparative genomic hybridizations (CGH), chromatin immunoprecipitations (ChIP), RNA interference assays and phenotypic descriptions. Through MGED, we are well positioned to monitor and help direct standards efforts for capturing these different types of information. Once the standards are underway we will model them as domains in GUS, on an ongoing basis.

Specific Aim 3: Package the GUS Application Framework and GUS applications for release.

The GUS Application Framework consists of the GUS Object Layer, the GUS Plugin Framework, the GUS Pipeline API and provided GUS plugins and applications. We will improve the usability of each of these components by reviewing and improving their APIs as necessary, internally commenting the code, generating well commented web-based API documentation and by providing user guides and tutorials. We will specify the third party software required by the GUS Application Framework.

1 Platform independence

The GUS Application Framework runs on Red Hat Linux 7.x and 8.x and on Tomcat 4.x. We will assist co-developers (see Specific Aim 4) in their efforts to support additional platforms. Client-side applications run in web browsers and in Java. We will test these on Linux, Windows and Mac with recent versions of Internet Explorer, Netscape and Java.

2 GUS Object Layer

We have established that the native GUS Object Layer is enormously useful in data acquisition programs such as GUS plugins and in GUI oriented applications such as our Annotator’s Interface (described below). We have also shown that they are compatible with integration into applications that use non-native object models. We have written adapters from GUS objects to MAGE-ML (used by the MAGEtoGUS application) and to the bioWidgets (Fischer et al. 1999) visualization API (used by our Annotator’s Interface).

The Perl version of the object layer was developed three years ago, and the Java version is under development now. We will re-implement the Perl version to be largely identical to the newer Java version. This will reduce their combined maintenance cost and permit users to feel comfortable in both implementations. We will use coding conventions in Perl to clarify the API in ways that are natural in Java, such as specifying the visibility of methods and including publishable inline documentation (using Perl's POD). The Java layer will include a Java RMI based remote access layer. This will facilitate the use of the object layer in Java-based client-side applications such as the Annotator’s Interface.

We will also fill gaps in the object layer that have become apparent over time, including improving garbage collection to avoid current limitations on object counts, re-engineering the object-relational mapping to include transparent handling of many-to-many relationships, explicit accessor methods for each table attribute, nested value objects (such as Date), and a GUS base class.

3 GUS Plugin Facility

The GUS Plugin Facility helps programmers of all experience levels populate their GUS database by simplifying and structuring the process of developing data loading and analysis programs. Together, our staff of biologists and software experts has written more than 100 plugins over three years. New installations will utilize existing plugins to load standard data formats, but will also have site-specific data to load that will entail writing new plugins.

We have recently upgraded the GUS Plugin Facility's API to make it more self-explanatory. We have also added a feature that detects when a plugin has changed with respect to the version registered in the database, forcing the plugin developer to re-register the new version (ensuring correct tracking information).

Plugins implicitly depend on a particular version of the schema, leaving them vulnerable to errors when the tables they use have undergone a schema change. We will introduce a schema certification mechanism for plugins and a schema-check mode. Each plugin will explicitly declare the tables it depends on and the last release number for which it is certified. The installation of a schema release will run each registered plugin in schema-check mode (using the meta information in the database that captures differences between schema versions). If the plugin uses tables that have changed since its certified version, the schema-check mode will provide an error report summarizing the changes made to the relevant tables. After bringing the plugin up to date and testing it against the new schema, the developer will mark it as certified for the new schema version.
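The sketch below illustrates what the certification declarations (extending the hypothetical plugin sketched earlier) and the schema-check report could look like; the method names, the form of the change metadata and the simple numeric release comparison are assumptions offered for discussion, not a finished design.

# Illustrative sketch of the proposed certification declarations and of the
# report a schema-check pass could produce; names and formats are assumptions.
package MyProject::Plugin::LoadTabDelimitedFeatures;
use strict;
use warnings;

# Each plugin declares the tables it reads or writes and the last schema
# release for which it was certified.
sub tablesDependedOn { return ['DoTS.GeneFeature', 'DoTS.NALocation'] }
sub certifiedSchema  { return '3.0' }

package GUS::PluginMgr::SchemaCheck;

# $changes maps schema release -> list of tables altered in that release,
# taken from the change meta information recorded in the database.
sub report {
    my ($plugin, $current_release, $changes) = @_;
    my %uses = map { $_ => 1 } @{ $plugin->tablesDependedOn };
    my @stale;
    for my $release (grep { $_ > $plugin->certifiedSchema } sort keys %$changes) {
        push @stale, grep { $uses{$_} } @{ $changes->{$release} };
    }
    return @stale
        ? "NOT CERTIFIED for $current_release; changed tables: " . join(', ', @stale)
        : "certified through $current_release";
}

package main;
print GUS::PluginMgr::SchemaCheck::report(
    'MyProject::Plugin::LoadTabDelimitedFeatures',
    '3.1',
    { '3.1' => ['DoTS.GeneFeature', 'SRes.GOTerm'] },
), "\n";   # prints: NOT CERTIFIED for 3.1; changed tables: DoTS.GeneFeature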

4 GUS Plugins

CBIL has developed more than 100 plugins (see Table 1 for examples) to parse and load data for nucleic acid sequences, gene expression results, external ontologies (e.g., NCBI taxonomy, GO terms), BLAST (Altschul et al 1990) similarity searches, and the results of other types of sequence analyses (e.g., BLAT [Kent 2002] alignments, protein sequence predictions, signal peptide predictions, transmembrane predictions, gene function predictions, etc.). We have recently completed an initial pass over approximately 35 plugins to bring them into compliance with the GUS Plugin Facility's improved API and with our most recent revisions to the schema. We will release these first, and make the others available over time.

The plugins included with GUS have been programmed over a number of years, often by biologists with limited computer science training and, except for use of the GUS Plugin API, they diverge stylistically. We will define a plugin coding standard to make plugin programming as accessible as possible to less experienced programmers. This will include specifications for: making plugins compatible with the GUS Pipeline API (e.g., proper return status, restartability, non-interactivity); making them platform independent (no hard coded file system paths, no use of Oracle specific features); making them compatible with automatically formatted documentation; including declarations of GUS tables used; ensuring that they use superclasses for proper generality; and, making them user-friendly (e.g., minimizing use of SQL as input arguments). We will bring the most widely used subset of released plugins up to this standard, and improve the rest as needed.

We will keep plugins up to date with respect to new schema versions on an ongoing basis.

A significant number of the plugins are parsers of standard data formats. We will keep these current as needed, including coding improvements required to scale with what can sometimes be dramatic increases in input data set size.

|Plugin Name |Plugin Description |
|GBParser |Parse Genbank records and store results |
|InsertDbRefAndDbRefNASequence |Insert external database identifiers into DbRef and associate them with NA sequences |
|InsertExternalSequences |Insert new ExternalSequences (NA or AA) from a FASTA file |
|LoadBlastSimilarities |Load BLAST similarity results into the Similarity table |
|LoadNRDB |Load a new NCBI nr protein version into NRDBEntry and ExternalAASequence |
|LoadTaxon |Load new NCBI taxon files into Taxon, TaxonName, GeneticCode |
|UpdateGusFromXML |Update the database from objects in an XML file |
|MakeGoPredictions |Associate GO predictions with sequences |
|ArrayLoader |RAD: Load information about the array used and the elements spotted or synthesized on this array |
|ArrayResultLoader |RAD: Load all measurements output by the image quantification software utilized to attach intensity values to the elements on an array for a given hybridization |
|ProcessedResultLoader |RAD: Load processed data (e.g. normalized or averaged data) and protocol information into the appropriate RAD tables for processed data |

Table 1. Example GUS plugins.

5 GUS Pipeline API

The GUS Pipeline API must interact with compute clusters to do large batch tasks such as whole database BLAST. It does so now, but is limited to the type of cluster manager we use at Penn. We will incorporate changes that ease the burden of communicating with other cluster managers.

The AllGenes pipeline has ninety stages, covering a very wide range of EST assembly and gene, RNA, protein and sequence annotation. We will provide it as a sample.

6 GUS Website Development Kit

The GUS WDK facilitates the development of web interfaces for GUS. It is implemented using the Java Servlet specification and handles the low-level details of a typical web-based application, such as interacting with the application server (e.g., Tomcat), tracking user sessions, and issuing queries to the database. A web site developer supplies the WDK with a list of SQL queries in a declaratively-specified configuration file. The toolkit generates a web page and HTML form for each of the queries, including the appropriate form elements (checkboxes, pull-down menus, etc.) to gather query parameters. The form elements may themselves depend on database queries. For example, a pull-down list of HUGO-approved gene names is generated using a query that retrieves HUGO names. This mechanism ensures that the terms and controlled vocabularies displayed on the web pages are always up to date with those stored in the database.

Each query that a user runs is stored for the duration of the user’s session and is available in the user’s “query history”. The results can be manipulated to perform simple data mining tasks such as “list all of the genes that were returned by both of the last two queries”. Query result sets are strongly typed, and those with the same type (e.g., two sets of genes) may be combined using boolean operators to refine the results.

The WDK includes a set of "wrapper" classes that implement customizable web interfaces to a number of useful tools (e.g., BLAST, Xcluster []). It also includes a number of library routines for generating graphical views, (e.g., of genes, proteins, and chromosomes) and performing other common tasks.

We will initially document and package the Java Servlet based WDK for release. In the following releases we will repackage the WDK so that it can be used with Java Server Pages (JSP, ). This will give web developers more flexibility in choosing which WDK components to use and also make it easier to create more sophisticated web interfaces, or simpler ones, as needed.

We will also significantly improve the WDK's data mining power. Where queries now ask for a single value, they will allow lists of values in a text box or file. For example, a user may ask for all genes that match a set of accessions previously isolated in his or her research. The result will go into the query history. The user may then find all genes that fit similar criteria (possibly using a series of queries and/or complex boolean operations) and use set subtraction to filter away the previous genes, highlighting a set of new candidates. Boolean operations will also be available across diverse complex data types, joining on common attributes. For example, a user may produce a set of genes of interest and a set of expression experiments of interest, and then issue a query to determine which of the genes appear in all of the experiments. We will also add a simple report-maker facility so users can download their results in a tab-delimited file with columns that they have chosen.

The WDK is implemented in Perl and Java, with the Java component running on the Tomcat application server. The initial release of the GUS WDK will include installation guidelines for Tomcat, with subsequent releases incorporating documentation and examples showing how to use the WDK in conjunction with JSP.

We plan on submitting the WDK to GMOD which requires that it be GUS independent. The WDK currently assumes that each table in the database has a unique primary key column (surrogate key), but this assumption will be relaxed. We also plan to improve the integration between the WDK and third-party tools, making it easier to use the features of the WDK (e.g., query history tracking, boolean queries for data mining) with an institution or lab’s existing tools and web services.

7 GUS applications

We have developed, or are developing for internal use, a number of GUS applications for data acquisition, data analysis and data presentation. We will release these projects under our open source license and make them available for download. (We do not plan to include them in the GUS open development CVS repository.) We will also include code that, while not directly incorporated into the user's project, will be useful as a template.

For each program we release, we will develop documentation. For the programs that will run on the user’s site (as opposed to the templates) we will provide an installation system and specify any required third party software.

The set of released programs will grow on an ongoing basis.

1 Sample data load pipeline

We will develop a sample data load pipeline (using the GUS Pipeline API) that will create a sample dataset and serve to test much of the GUS Application Framework. The pipeline will load standard data sets such as GenBank, NRDB and the NCBI Taxonomy (Wheeler et al. 2003) and will attempt to exercise as much of the schema as possible.

2 Sample web site

We will develop a sample web site that issues queries against, and formats results from, the subset of the schema that is populated by the sample data set. The web site will run on the user's site and provide a starting point for web site customization.

3 Web sites

We will release the AllGenes, PlasmoDB and EPConDB web sites as sample code (this code is not intended to run on the user's site). Each of these sites is an extensive code base incorporating a broad range of functionality. They demonstrate the use of the GUS WDK, as well as standard web site design approaches.

Web sites are non-trivial to build, particularly when queries must join several large database tables and still return results quickly. The provided web site code will include templates for many optimized queries, such as retrieving mRNAs and gene models based on text descriptions, various accessions, gene trap lines, pathways, mapping data, predicted GO function terms, predicted signal peptides, predicted transmembrane domains, expression profiles, location on chromosomal contigs, gene structure, and presence of polymorphisms.
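
As a rough illustration only, a template of this kind might resemble the following JDBC sketch, which retrieves gene models whose predicted GO function matches a keyword; the table and column names are placeholders, not the actual GUS schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/*
 * Sketch of a parameterized query template joining several tables.
 * Usage: java QueryTemplateSketch <jdbc-url> <user> <password> <keyword>
 * Table and column names are hypothetical.
 */
public class QueryTemplateSketch {
    public static void main(String[] args) throws SQLException {
        String sql =
            "SELECT g.gene_id, g.symbol, gt.name " +
            "FROM gene g " +
            "JOIN gene_go_association ga ON ga.gene_id = g.gene_id " +
            "JOIN go_term gt ON gt.go_term_id = ga.go_term_id " +
            "WHERE lower(gt.name) LIKE ?";
        try (Connection c = DriverManager.getConnection(args[0], args[1], args[2]);
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setString(1, "%" + args[3].toLowerCase() + "%");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next())
                    System.out.println(rs.getString(1) + "\t" + rs.getString(2) + "\t" + rs.getString(3));
            }
        }
    }
}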

4 The RAD Web Forms

The RAD Web Forms application is an annotation system for microarray experiments. Users specify the information mandated by MIAME: a description of the study design, and the biomaterials, protocols and parameter settings for the bench work and for image scanning and quantification. This functionality is unique, and it is important to functional genomics databases whose mission includes microarray data.

The forms are written in PHP4 (a rapid-development web scripting language []). They query ontologies stored in GUS to populate pull-down menus with controlled terms. The MGED ontology () provides terms for standardized classes (e.g., terms for gender), enabling MAGE-based data sharing (e.g., between RAD and ArrayExpress). New terms are collected and can become part of the ontology upon approval by the MGED ontology working group.

The PlasmoDB and EPConDB projects have been using a prototype of the RAD Web Forms (Figure 4). A production version is slated for March 2003. Recipients of microarray slides from the MR4 () will use it to enter data for PlasmoDB in conjunction with the ATCC (). Members of the Beta Cell Biology Consortium () will use it to enter data for EPConDB.


Figure 4. The RAD Web Forms. The forms capture the different parts of a microarray experiment that are accessible through the menu in the left-hand frame. The form for describing the experimental or study design is shown. The forms make use of the MGED ontology (the class hierarchy on the right).

5 GUStoMAGE

GUStoMAGE transforms GUS objects representing gene expression data and annotations from RAD into the MicroArray Gene Expression Object Model (MAGE-OM) for submission to ArrayExpress (Brazma et al. 2003). GUStoMAGE uses a declarative rule structure, expressed in XML, to represent the object transformation, together with an engine implemented in Perl. The input is Perl GUS objects and the output is Perl MAGE-OM objects, which serialize themselves to MAGE-ML.

6 TESS

The TESS domain of GUS stores information about transcription factors, examples of their binding sites, and grammar-based models of their binding sites. We will provide plugins to load TESS and DoTS with known binding sites from TRANSFAC (Wingender et al. 2000) and COMPEL (Kel-Margoulis et al. 2000), to predict binding sites on DNA sequence in DoTS using site models in TESS, and to identify novel motifs in DNA using third-party programs such as AlignACE (McGuire et al. 2000). This functionality is now part of the EPConDB project.

7 Annotator's Interface

The AllGenes project uses a web-based Annotator’s Interface. Annotators correct the automated grouping of putative transcripts into putative genes and the automated GO functional assignment of the putative transcripts. They also assign approved (HUGO or MGI) gene symbols and synonyms to putative genes.

After evaluating existing options (Apollo [Lewis et al. 2002] and Artemis [Rutherford et al. 2000]), we determined that the best route to meeting our new, broader requirements is to re-engineer the Annotator's Interface using the new Java GUS Object Layer, an effort that is already underway. Version 2.0 will add significant functionality, including creating gene models, defining alternative transcripts, and associating RNAs and their corresponding protein isoforms with tissue expression and function. It will operate as a client-side Java application (permitting our annotators for AllGenes in Russia to work offline) and include a bioWidgets-based (Fischer et al. 1999) genomic viewer to aid in annotating and correcting gene models.


Figure 5. Setting up GUS. GUS provides (1) server acquisition and hardware guidelines, (2) a default system to learn on, (3) the means to customize the schema and data acquisition programs, and (4) a means to create a sophisticated custom web site.

8 Additional applications

As we develop additional GUS-compatible applications we will consider releasing them if there is demand. Two existing applications may be candidates. GoPredictor is a heuristic rule-based system that assigns GO molecular function predictions to sequences (Schug et al. 2002). OrthoMCL uses a Markov clustering algorithm to construct ortholog groups across multiple eukaryotic taxa.

Specific Aim 4: Provide facilities for and manage GUS as an open development project.

1 The GUS web site

We will expand the GUS web site at to make it the hub of the GUS open development effort. It will encourage GUS users to contribute their GUS programs and will act as an open forum for mutual support. It will incorporate facilities available through SourceForge and the CVS server at the Sanger Institute, including the standard features of an open source project: a CVS repository, bug and feature trackers, API documentation, tutorials, user guides, mailing lists, FAQs, a list of the people involved, references, and links. We will also include a schema browser as described in Specific Aim 1 and will investigate the use of a Wiki Wiki Web ( ) server to promote a free exchange of ideas about GUS.

2 Co-developed applications

We intend to release what we have developed so far with GUS and to take responsibility for developing additional GUS applications. More importantly, we expect to facilitate the co-development of GUS. One of our collaborators, Dr. Kissinger, has proposed to develop web services for GUS. Similarly, we plan to rely upon co-developers to provide services such as a DAS server (Dowell et al. 2001), support for direct SQL access, and email notification of query results.

3 The CVS repository

We will continue our effort to organize our source code into highly structured modules. The WDK, the Java object layer, the Annotator's Interface and the TESS system have yet to be moved into the repository.

4 The Build System

We will provide our Ant-based build system and continue its development to support tasks such as C and Java compilation, macro substitution, running unit-test suites, and automated generation of API documentation.

5 The GUS test server

We will set up an instance of GUS using a freely available RDBMS such as MySQL for use as a GUS testbed. The server will be available to open development collaborators. We will maintain the standard GUS sample dataset on the server.

6 Code reviews and regression testing

One of the challenges of an open source project is incorporating and maintaining code contributed by a diverse group of developers. We will adopt an approach similar to that used by the Bioperl project (Stajich et al. 2002). We will institute a code review process for all code placed in the public repository. We will expect regression tests for submitted plugins, data extraction programs and data analysis programs. The tests will run against the sample database. Each test will provide an input and an expected output, and will fail if the actual output does not match the expected output. We will provide a mechanism to efficiently restore the sample database to its standard state after each test.
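
The following Java sketch illustrates the intended shape of such a regression test; the plugin interface, file handling and restore step are hypothetical placeholders for the mechanisms we will provide.

import java.nio.file.Files;
import java.nio.file.Path;

/*
 * Sketch of a plugin regression test: run the program against the sample
 * database, compare its output with the expected output, then restore the
 * sample database. All names are hypothetical.
 */
public class RegressionTestSketch {

    interface Plugin { String run(Path input) throws Exception; }

    static boolean runTest(Plugin plugin, Path input, Path expected) throws Exception {
        try {
            String actual = plugin.run(input);          // exercises the sample database
            String wanted = Files.readString(expected);
            return actual.equals(wanted);               // fail if output differs from expected
        } finally {
            restoreSampleDatabase();                    // return the sample DB to its standard state
        }
    }

    static void restoreSampleDatabase() {
        // e.g., reload a dump of the standard sample dataset (mechanism to be provided)
    }

    public static void main(String[] args) throws Exception {
        Plugin noOp = in -> Files.readString(in);       // trivial plugin used only for illustration
        Path input = Files.writeString(Files.createTempFile("in", ".txt"), "ATGC\n");
        Path expected = Files.writeString(Files.createTempFile("out", ".txt"), "ATGC\n");
        System.out.println(runTest(noOp, input, expected) ? "PASS" : "FAIL");
    }
}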

7 Schema evolution

We expect the GUS schema to evolve on an ongoing basis, through our initiative and through that of the open source collaborators. We will arbitrate that process, and schedule releases approximately quarterly.

8 Participation in GMOD

We are encouraged by the GMOD effort and expect to contribute to it. Our first goal will be to post the GUS schema, its installation program, a sample dataset and an adapter for GMOD's GBrowse genome browser (Stein et al. 2002). We will continue to develop our tools, when appropriate, in a GUS-independent manner so that we can contribute them to GMOD. Our first candidate tool will be the GUS WDK. We will also use available GMOD tools when possible and will consider compatibility with them as part of our designs.

Timeline

Aim 1 (Release GUS Schema)
  Year 1: comment tables; tutorials; sample dataset; schema install system
  Year 2: comment tables; tutorials; relationship browser; platform independence; MySQL support; migration system
  Years 3-5: comment tables; tutorials; ongoing development

Aim 2 (Develop New Schema Domains)
  Year 1: proteomics schema; proteomics plugins
  Years 2-5: additional domains

Aim 3 (Release GUS Application Framework)
  Year 1: document and clean plugin API; document and clean plugins; sample data pipeline; sample web site; release WDK; upgrade WDK to JSP; release RAD Web Forms
  Year 2: upgrade Perl object layer; improve WDK data mining; release Annotator's Interface; release GUStoMAGE; release TESS; document and clean plugins
  Years 3-5: new plugins; ongoing development

Aim 4 (Open Development)
  Year 1: move WDK to CVS; set up test server; set up web site; post schema to GMOD; regression test framework; code reviews
  Year 2: open development project management; code reviews
  Years 3-5: open development project management; code reviews

e. Human subjects research

No human subjects.

f. Vertebrate animals

No vertebrate animals.

g. Literature cited

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol. 215:403-10.

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 25:25-9.

Bahl A, Brunk B, Crabtree J, Fraunholz MJ, Gajria B, Grant GR, Ginsburg H, Gupta D, Kissinger JC, Labo P, Li L, Mailman M, Milgram AJ, Pearson DS, Roos DS, Schug J, Stoeckert CJ Jr, Whetzel P. (2003) PlasmoDB: The Plasmodium Genome Resource. A database integrating experimental and computational data. Nucleic Acids Res. 31:212-5.

Bailey LC Jr, Fischer S, Schug J, Crabtree J, Gibson M, Overton GC. (1998) GAIA: framework annotation of genomic sequence. Genome Res. 8:234-50.

Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL. (2002) The Pfam protein families database. Nucleic Acids Res. 30:276-80.

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. (2003) GenBank. Nucleic Acids Res. 31:23-7.

Bernal A, Ear U, Kyrpides N. (2001) Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res. 29, 126-127.

Blake JA, Richardson JE, Bult CJ, Kadin JA, Eppig JT. (2003) MGD: the Mouse Genome Database. Nucleic Acids Res. 31:193-5.

Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Glenisson P, Holstege FCP, Kim IF, Markowitz V, Matese JC, Robinson A, Sarkans U, Stewart J, Taylor R, Vilo J, Vingron M. (2001) Minimum Information About a Microarray Experiment – MIAME – towards Standards for Microarray Data. Nature Genetics. 29:365-371.

Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, Oezcimen A, Rocca-Serra P, Sansone SA. (2003) ArrayExpress: a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 31:68-71.

Clamp M, Andrews D, Barker D, Bevan P, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Hubbard T, Kasprzyk A, Keefe D, Lehvaslaiho H, Iyer V, Melsopp C, Mongin E, Pettett R, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Birney E. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res. 31:38-42.

Corpet F, Gouzy J, Kahn D. (1999) Browsing protein families via the 'Rich Family Description' format. Bioinformatics.15:1020-7.

Crabtree J, Wiltshire T, Brunk B, Zhao S, Schug J, Stoeckert CJ Jr, Bucan M. (2001) High-resolution BAC-based map of the central portion of mouse chromosome 5. Genome Res. 11:1746-1757.

Davidson SB, Crabtree J, Brunk B, Schug J, Tannen V, Overton GC, Stoeckert CJ Jr. (2001) Data integration and warehousing in genomics: Two case studies. IBM Systems Journal. 40:512-531.

Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. (2001) The Distributed Annotation System. BMC Bioinformatics. 2:7.

Fischer S, Crabtree J, Brunk B, Gibson M, Overton GC. (1999) bioWidgets: data interaction components for genomics. Bioinformatics.15:837-46.

The FlyBase Consortium. (2003) The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 31:172-5.

Karp PD, Paley S, Romero P. (2002) The Pathway Tools software. Bioinformatics 18 Suppl 1:S225-32.

Karp PD, Riley M, Saier M, Paulsen IT, Collado-Vides J, Paley SM, Pellegrini-Toole A, Bonavides C, Gama-Castro S. (2002a) The EcoCyc Database. Nucleic Acids Res. 30:56-8.

Kel-Margoulis OV, Romashchenko AG, et al. (2000) COMPEL: a database on composite regulatory elements providing combinatorial transcriptional regulation. Nucleic Acids Res. 28:311-5.

Kent WJ. (2002) BLAT--the BLAST-like alignment tool. Genome Res.12:656-64.

Kissinger JC, Brunk BP, Crabtree J, Fraunholz MJ, Gajria B, Milgram AJ, Pearson DS, Schug J, Bahl A, Diskin SJ, Ginsburg H, Grant GR, Gupta D, Labo P, Li L, Mailman MD, McWeeney SK, Whetzel P, Stoeckert CJ, Roos DS. (2002) The Plasmodium genome database. Nature. 419:490-2.

Lewis SE, Searle SM, Harris N, Gibson M, Iyer V, Richter J, Wiel C, Bayraktaroglu L, Birney E, Crosby MA, Kaminker JS, Matthews BB, Prochnik SE, Smith CD, Tupy JL, Rubin GM, Misra S, Mungall CJ, Clamp ME. (2002) Apollo: a sequence annotation editor. Genome Biol. 3:RESEARCH0082.

McGuire AM, Hughes JD, et al. (2000) Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes. Genome Res. 10:744-57.

Pearson WR. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183:63-98.

Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, Miller N, Mueller LA, Mundodi S, Reiser L, Tacklind J, Weems DC, Wu Y, Xu I, Yoo D, Yoon J, Zhang P. (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res. 31:224-8.

Rutherford K, Parkhill J, Crook J, Horsnell T, Rice P, Rajandream MA, Barrell B. (2000) Artemis: sequence visualization and annotation. Bioinformatics. 16:944-5.

Scearce LM, Brestelli JE, McWeeney SK, Lee CS, Mazzarelli J, Pinney DF, Pizarro A, Stoeckert CJ Jr, Clifton SW, Permutt MA, Brown J, Melton DA, Kaestner KH. (2002) Functional genomics of the endocrine pancreas. The pancreas clone set and PancChip, new resources for diabetes research. Diabetes. 51:1997-2004

Schug J, Diskin S, Mazzarelli J, Brunk BP, Stoeckert CJ Jr. (2002) Predicting Gene Ontology functions from ProDom and CDD protein domains. Genome Research. 12:648-655.

Schug J. (2003) Using TESS to Predict Transcription Factor Binding Sites in DNA Sequence. In: Baxevanis AD, Davison DB, Page RDM, Petsko GA, Stein LD, Stormo GD (eds.) Current Protocols in Bioinformatics. John Wiley & Sons.

Schultz J, Copley RR, Doerks T, Ponting CP, Bork P. (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res. 28:231-4.

Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, Bernhart D, Sherlock G, Ball C, Lepage M, Swiatek M, Marks WL, Goncalves J, Markel S, Iordan D, Shojatalab M, Pizarro A, White J, Hubley R, Deutsch E, Senger M, Aronow BJ, Robinson A, Bassett D, Stoeckert CJ Jr, Brazma A. (2002) Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biology 3: RESEARCH0046

Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res.12:1611-8.

Stein L, Sternberg P, Durbin R, Thierry-Mieg J, Spieth J. (2001) WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res. 29:82-86.

Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S. (2002) The generic genome browser: a building block for a model organism system database. Genome Res.12:1599-610.

Stoeckert CJ Jr, Salas F, Brunk B, Overton GC. (1999) EpoDB: a prototype database for the analysis of genes expressed during vertebrate erythropoiesis. Nucleic Acids Res. 27:200-203.

Stoeckert C, Pizarro A, Manduchi E, Gibson M, Brunk B, Crabtree J, Schug J, Shen-Orr S, Overton GC. (2001) A relational schema for array and non-array based gene expression data. Bioinformatics. 17:300-308.

Stoeckert CJ Jr, Causton HC, Ball CA. (2002) Microarray Databases: Standards and ontologies. Nature Genetics. 32 Suppl 2:469-73.

Twigger S, Lu J, Shimoyama M, Chen D, Pasko D, Long H, Ginster J, Chen CF, Nigam R, Kwitek A, Eppig J, Maltais L, Maglott D, Schuler G, Jacob H, Tonellato PJ. (2002) Rat Genome Database (RGD): mapping disease onto the genome. Nucleic Acids Res. 30:125-8.

Ware D, Jaiswal P, Ni J, Pan X, Chang K, Clark K, Teytelman L, Schmidt S, Zhao W, Cartinhour S, McCouch S, Stein L. (2002) Gramene: a resource for comparative grass genomics. Nucleic Acids Res. 30:103-5.

Weng S, Dong Q, Balakrishnan R, Christie K, Costanzo M, Dolinski K, Dwight SS, Engel S, Fisk DG, Hong E, Issel-Tarver L, Sethuraman A, Theesfeld C, Andrada R, Binkley G, Lane C, Schroeder M, Botstein D, Michael Cherry J. (2003) Saccharomyces Genome Database (SGD) provides biochemical and structural information for budding yeast proteins. Nucleic Acids Res. 31:216-8.

Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Tatusova TA, Wagner L. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res. 31:28-33.

Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, Meinhardt T, Prüß M, Reuter I, Schacherer F. (2000) TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 28:316-9.
