


FuGE: Functional Genomics Experiment model specification

Status of This Document

This document provides information to the biomedical research community on the Functional Genomics Experiment model, which can be used both to capture descriptions of high-throughput experimental approaches and to develop modular data formats for technologies with specific requirements. Distribution is unlimited.

Abstract

This document describes the Functional Genomics Experiment (FuGE) model, which has been created to aid the development of data standards in the life sciences and biomedical domains. FuGE serves as a base model to be extended for creating technology-specific data formats and as a platform for collating external files in non-FuGE-based formats within the framework of a laboratory workflow. It is anticipated that widespread adoption of FuGE will lead to greater uniformity of technology-specific data formats and will allow the collation of functional genomics studies that cross technological boundaries. This document should be read in conjunction with the reference guide.

Table of Contents

Abstract
1 Introduction
1.1 Development process
1.2 Goals of FuGE
2 Concepts and Terminology
3 Platform independent model
3.1 Package structure
3.2 FuGE.Common
3.2.1 Base classes
3.2.2 Audit
3.2.3 Description
3.2.4 Reference
3.2.5 Ontology
3.2.6 Measurement
3.2.7 Protocol
3.3 FuGE.Bio
3.3.1 ConceptualMolecule
3.3.2 Data
3.3.3 Investigation
3.3.4 Material
3.4 FuGE.Collection
4 Unique identifiers
5 Mapping to XML Schema
5.1 Mapping from classes to elements
5.2 Mapping UML attributes
5.3 Mapping composite associations
5.3.1 Mapping composite associations to an element
5.4 Mapping non-composite associations
5.4.1 Mapping non-composite associations to an element and an attribute
5.4.2 Mapping non-composite associations to an attribute
5.5 Mapping multiple associations
5.6 Mapping enumerations
5.7 Implementation of cardinalities
5.8 Mapping UML data types to XML Schema data types
6 Mapping to other platforms
7 Dependencies on external ontologies
8 Contributors
9 Intellectual Property Statement
Copyright Notice
10 References

1 Introduction

The FuGE model has been created to facilitate the development of data formats for functional genomics and related data-intensive experimental processes. This document presents the formal specifications for FuGE, in terms of the platform independent model (object model) represented in the Unified Modeling Language (UML, [14]). The document also describes the mapping of the object model to an XML Schema [13] following a set of defined rules. The XML Schema defines the transfer format for representing FuGE-compliant data sets in XML [3]. It is anticipated that FuGE will be extended to create technology-specific formats. The rules presented in this document will also be used for creating XML Schemas for models that extend from FuGE. This document does not provide a guide to developing extensions to FuGE, nor does it describe the available tools for FuGE. Such information is provided in separate documents on the FuGE website.

1.1 Development process

The initial FuGE model was derived from the microarray data model, MAGE-OM [6, 11], by removing the concepts that were specific to microarray technology. Models from other domains, such as PEDRo (the Proteomics Experiment Data Repository), were also taken into account [12]. Further requirements were gathered from a number of use cases from a range of ‘omics experiments, which are available on the FuGE website. Use cases describing experimental designs, employing multi-omics and conventional technologies, were provided by the Reporting Structures for Biological Investigations (RSBI) working groups [10]. Modeling requirements were also gathered by discussion with implementers of databases for omics experiments, such as ArrayExpress [4]. The Ontology package is a simplification of the Ontology Definition Model [7], which has been submitted to the Object Management Group (OMG). The Data package contains similar concepts to models that represent multi-dimensional scientific data, such as HDF5.

FuGE has been developed by a process of managed evolution, with the release of milestone versions to allow developers to begin working with the model prior to the release of the formal specifications. The primary mechanism for evolving the model was to gather feedback during the development of extensions and the deployment of FuGE milestone releases in systems. One such example is CPAS (Cancer Proteomics Analysis System), developed at the Fred Hutchinson Cancer Research Center [9], which contains an archiving framework based on FuGE milestone 1. To date, the following modular formats have been created, or are under development, by extending FuGE:

• GelML – a model for capturing the metadata about gel electrophoresis used in proteomics.

• spML – a model of sample processing in proteomics, including liquid chromatography.

• MAGE 2 – version 2 of the MAGE format for microarrays.

• GelInfoML – a data transfer format for the results of informatics performed on gel images in proteomics.

• analysisXML – a format for capturing the results of informatics performed on mass spectrometry results.

• A model for NMR experiments in metabolomics.

1.2 Goals of FuGE

FuGE has been developed to support the following goals with respect to functional genomics experiments and the wider context of biomedical research:

1) To provide a model representing laboratory workflows and experimental pipelines used in the context of high-throughput biological investigations. The model should capture descriptions of experimental workflows, including protocols, samples and references to the data sets produced, which may or may not be in FuGE-compliant formats. The model should also capture the design of the experiment and the independent variables that are being studied.

2) To provide a framework for building new data models with a common structure for techniques that have specific requirements, to facilitate the creation of new data standards. The model should define extension points to allow developers to create formats by adding subclasses to FuGE.

For both of these goals, there are two relevant sub-goals:

a) The platform independent model should facilitate the creation of databases and software for in-house management of laboratory workflows, including reference mechanisms for pre-existing (non-FuGE-based) formats and for new data formats developed by extending FuGE.

b) The model should define an exchange format to facilitate sharing FuGE-compliant data sets and data sets that comply with extensions of FuGE. The format is represented in XML and the conversion from UML to XML Schema is explicitly defined by a set of rules and an implementation of the rules.

A relational database schema and toolkit for Java and Perl are in development but these will exist only as supporting mechanisms and do not constitute part of the formal specification presented here. The platform independent model can be implemented using various technologies (in support of sub-goal a).

2 Concepts and Terminology

This document assumes familiarity with two data modelling notations, namely UML and XML Schema. Models are described using UML class diagrams; such diagrams provide concise structural descriptions of the artifacts in an application, which can then be implemented in different ways. One such way is through a mapping to XML Schema; this document assumes an automated mapping, which is described in Section 5.

The key words “MUST,” “MUST NOT,” “REQUIRED,” “SHALL,” “SHALL NOT,” “SHOULD,” “SHOULD NOT,” “RECOMMENDED,” “MAY,” and “OPTIONAL” are to be interpreted as described in RFC-2119 [2].

3 Platform independent model

This section describes the model, which is represented in UML 2.0 (the model can be downloaded from the Web address of this document). The model’s diagrams can be viewed with MagicDraw version 11.5 and upwards, and with other compatible tools. The model contains the AndroMDA plug-in for UML2, which provides the data types and stereotypes that are used to generate the XML Schema (Section 5). AndroMDA can also be used in the generation of other platform-specific models.

This section includes a general description of the purpose of each package in FuGE and the class diagrams. The reference document should be consulted for the exact specification of the associations, attributes and stereotypes used on each model element.

In the following text, a fixed width font is used to denote the names of classes, attributes, associations and stereotypes in the model. The names of packages in FuGE are denoted in the standard font. A tutorial on the UML structures used is provided on the FuGE website.

3.1 Package structure

|FuGE |Common |Audit |Contacts, auditing and security settings for all objects. |
| | |Description |Additional annotations and free-text descriptions for all objects. |
| | |Measurement |Classes for providing measurements within FuGE, including slots for the measurement value, the unit and the data type. |
| | |Ontology |A mechanism for referencing external ontologies or terms from a controlled vocabulary. |
| | |Protocol |A model of procedures, software, hardware and parameters. The package can define workflows by relating input and output materials and/or data to the protocols that act on them. |
| | |Reference |External bibliographic or database references that can be applied to many objects across the FuGE model. |
| |Bio |ConceptualMolecule |Captures database entries of biological molecules such as DNA, RNA or amino acid sequences and an extension point for other molecule types, such as metabolites or lipids. |
| | |Data |Defines the dimensions of data and storage matrices, or references to external data formats. |
| | |Investigation |Defines an overview of the investigation structure by capturing the overall design and the experimental variables and by providing associations to related data. |
| | |Material |Models material types such as organisms, samples or solutions. Materials are characterized by ontology terms or by extension of the Material package. |

Table 1 The package structure of FuGE. Column 3 contains the name of each package in Bio and Common and column 4 contains a description of the package. In addition to Common and Bio, a third package exists as a child of FuGE, Collection, which contains only classes and no child packages.

3.2 FuGE.Common

The Common namespace contains components used to develop models for high-throughput or data intensive experimental processes. This includes support classes used across the whole of FuGE for auditing, security, references to external information, measurements and protocols as described below.

3.2.1 Base classes

Every object of the FuGE model is a descendant of either one or both of two base classes (Figure 1), which are themselves arranged in a hierarchy. Describable is the top class in the hierarchy and all classes in FuGE extend from Describable. Describable provides functionality for aiding in-house management of the format, extensibility and the development of internal pipelines (sub-goal a). For data exchange, the extensibility capabilities provided by Describable SHOULD only be used to capture additional information where the model contains no other structures that could be used to capture the information. Describable has associations allowing auditing and security settings to be given (Section 3.2.2) and additional annotations as URIs (Uniform Resource Identifiers [1]), plain text descriptions, name-value-type triples (Section 3.2.3) or ontology terms (Section 3.2.5).

[pic]

Figure 1 The hierarchy of base classes in FuGE.

The other base class, Identifiable, inherits from Describable and adds a referencing mechanism. Classes that inherit from Identifiable can be referenced by other classes using the identifier attribute. The identifier attribute is understood to be a globally unique string that resolves to a particular instance of the object. The “globally unique” restriction implies that there is exactly one object with a particular identifier. The scope of unique identifiers in FuGE is described in Section 4. The name attribute stores a human readable name for the object that need not be unique. Classes that inherit from Identifiable can have associations to external database entries or bibliographic references (Section 3.2.4).
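The relationship between the two base classes can be illustrated with a minimal Python sketch. It is not part of the specification: the snake_case attribute names and the IdentifierRegistry helper are assumptions introduced here to show the "exactly one object per identifier" rule, and the identifier strings are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Describable:
    """Top of the hierarchy: carries free-text descriptions and other annotations."""
    descriptions: List[str] = field(default_factory=list)


@dataclass
class Identifiable(Describable):
    """Adds the referencing mechanism: a unique identifier and a readable name."""
    identifier: str = ""
    name: str = ""  # human readable, need not be unique


class IdentifierRegistry:
    """Hypothetical helper: allows only one object per identifier within its scope."""

    def __init__(self) -> None:
        self._objects: Dict[str, Identifiable] = {}

    def register(self, obj: Identifiable) -> None:
        if not obj.identifier:
            raise ValueError("an Identifiable object needs an identifier")
        if obj.identifier in self._objects:
            raise ValueError(f"duplicate identifier: {obj.identifier}")
        self._objects[obj.identifier] = obj

    def resolve(self, identifier: str) -> Identifiable:
        return self._objects[identifier]


registry = IdentifierRegistry()
sample_a = Identifiable(identifier="urn:example:material:1", name="liver sample")
sample_b = Identifiable(identifier="urn:example:material:2", name="liver sample")
registry.register(sample_a)
registry.register(sample_b)  # same name, different identifier: allowed
```

Registering a third object that re-uses "urn:example:material:1" would raise an error in this sketch, whereas the shared name "liver sample" causes no conflict.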

3.2.2 Audit

The Audit package contains classes that model contacts and the roles of contacts and that enable the annotation of objects with security settings and auditing information (Figure 2).

Contact is an abstract superclass which has subclasses of Person and Organization. In various places in FuGE, there are associations to ContactRole, which references Contact, to specify a particular role that a Contact plays, such as “data analyst”, “provider” or “principal investigator”. The name of the role is supplied by an ontology term.

All FuGE classes inherit from Describable and thus can be associated with Security objects and Audit objects. Security has associations for capturing the owner(s) of the security settings on the object, and for annotating the objects with particular rights (SecurityAccess). SecurityAccess can reference a group of contacts (SecurityGroup) and ontology terms for “read” or “write” access on an object.

Audit can be used to capture an audit trail of changes to the instance. Audit has attributes date for the date on which a change was made, and action, which is an enumeration of values (“creation”, “modification” or “deletion”). Audit has an association to Contact for the Person or Organization that has made the change.
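As an informal illustration of the audit trail, the following Python sketch models Audit with its date, its action enumeration and its link to a Contact. The attribute names performer and audit_trail, the last_modified helper and the example values are assumptions made for this sketch; the reference document gives the actual association names.

```python
import datetime
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class AuditAction(Enum):
    CREATION = "creation"
    MODIFICATION = "modification"
    DELETION = "deletion"


@dataclass
class Contact:
    """Abstract in FuGE; abstractness is not enforced in this sketch."""
    name: str


@dataclass
class Person(Contact):
    pass


@dataclass
class Organization(Contact):
    pass


@dataclass
class Audit:
    date: datetime.date
    action: AuditAction
    performer: Contact  # the Person or Organization that made the change


@dataclass
class Describable:
    audit_trail: List[Audit] = field(default_factory=list)

    def last_modified(self) -> datetime.date:
        """Date of the most recent audit entry (hypothetical helper)."""
        return max(entry.date for entry in self.audit_trail)


curator = Person(name="A. Curator")
record = Describable()
record.audit_trail.append(
    Audit(datetime.date(2007, 1, 15), AuditAction.CREATION, curator))
record.audit_trail.append(
    Audit(datetime.date(2007, 3, 2), AuditAction.MODIFICATION, curator))
```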

[pic]

Figure 2 The Audit package.

3.2.3 Description

The Description package (the subset of classes referenced from Describable in Figure 1) contains the class Description for adding descriptive text to any object and the URI class for providing a location to additional information about an object. Description and URI SHOULD only be used to capture additional information where the model contains no other structures that could be used to capture the information. For example, Description could be used for adding an annotation about what edits have been made to an object or tasks that still have to be completed. A URI could be used to reference a webpage for a help file or a representation of the Describable object in a proprietary data format.

NameValueType can be referenced from any Describable class for adding additional information about an object, for example for capturing in-house parameters that are not part of the core specification of a standard.

[pic]

Figure 3 The Description package contains Description, URI and NameValueType classes and associations to Security, Audit and OntologyTerm.

3.2.4 Reference

The Reference package (Figure 4) contains the classes BibliographicReference, DatabaseReference and Database, which can be used to annotate any subclass of Identifiable with literature references or entries in external databases. BibliographicReference is itself a subclass of Identifiable and hence can be associated with a DatabaseReference.

[pic]

Figure 4 The Reference package contains the classes BibliographicReference, Database and DatabaseReference.

3.2.5 Ontology

The Ontology package (Figure 5) provides structures for referencing ontological concepts and terms from controlled vocabularies. The structures provided SHOULD NOT be used for de novo modeling of ontologies or for representing complete ontologies. Instead, the package provides mechanisms for unambiguously identifying classes, properties, associations or instances from external ontologies or controlled vocabularies, which should be interpreted only in their original context, i.e. the source ontology. In many places in the FuGE specification, there are associations from classes to OntologyTerm, which is the top-level abstract class in the package, allowing objects to be associated with the non-abstract subclasses OntologyIndividual, ObjectProperty and DataProperty. All associations in FuGE to OntologyTerm are non-composite, meaning that ontology terms can be re-used across different objects.

The OntologyIndividual class serves multiple purposes, as it is intended to represent ontological classes, instances of classes (individuals) and terms from simple controlled vocabularies. The source of a term SHOULD be specified by the association to OntologySource. The term attribute stores the term itself (or the name of the ontology class) and termAccession stores the identifier or accession assigned to the ontology term within the source ontology. If no such accession is available, the ontology term SHOULD be repeated as the accession (by default). The inherited identifier attribute stores a unique identifier for the instance of the term used within the scope of the current FuGE document or instance.

[pic]

Figure 5 The Ontology package provides slots for referencing classes or properties from external ontologies or controlled vocabularies.

In an ontology, classes can be related to each other through associations. Such associations are modeled in FuGE by the abstract class OntologyProperty. If the association is between two terms, ObjectProperty models the association from the parent ontology class to the child ontology class (both modeled by OntologyIndividual). A second type of association exists in ontologies, modeled by DataProperty, which specifies that a data value can be entered by the user in a slot provided in the ontology.

Simple ontology example for the term “hour” from the unit ontology:



Ontology example for “a measurement of 20 minutes” encoded using the MGED Ontology:



Note that these examples illustrate the intended purpose of OntologyIndividual, DataProperty and ObjectProperty; they should not be taken as a definitive usage guide. A separate usage guide, describing how to encode different types of ontology structures within the Ontology package, is available from the FuGE website.
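The attributes of OntologyIndividual described in this section can also be pictured with a short Python sketch. It is illustrative only: the snake_case names and the identifier strings are assumptions, and no real accession is given for "hour", so the sketch falls back on the rule that the term is repeated as the accession when none is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OntologySource:
    name: str


@dataclass
class OntologyIndividual:
    identifier: str                          # unique within the FuGE document
    term: str                                # the term, or the ontology class name
    term_accession: Optional[str] = None     # accession within the source ontology
    source: Optional[OntologySource] = None  # SHOULD be specified

    def __post_init__(self) -> None:
        # With no accession available, the term is repeated as the accession.
        if self.term_accession is None:
            self.term_accession = self.term


unit_ontology = OntologySource(name="unit ontology")
hour = OntologyIndividual(
    identifier="urn:example:term:hour",
    term="hour",
    source=unit_ontology,
)

# Associations to ontology terms are non-composite, so one term object
# can be referenced from several places (hypothetical referencing objects).
incubation_unit = hour
centrifugation_unit = hour
```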

3.2.6 Measurement

The Measurement package contains classes to capture measurement values in the model (Figure 6), for example to provide default and runtime parameter values. The abstract Measurement class has associations to OntologyTerm that MAY be used to record the simple data type (examples: Boolean, string, non-negative integer) and the unit of measurement. The subclasses AtomicValue, BooleanValue, RangeValue and ComplexValue provide different slots for the value of the measurement. RangeValue MAY be associated with instances of OntologyTerm, for example to describe whether the values are inclusive or exclusive. ComplexValue SHOULD be used to capture any values that cannot be captured by the other three subclasses, referencing ontology instances for capturing more complex values.
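A rough Python sketch of three of these subclasses is given below. It rests on several assumptions: plain strings stand in for the OntologyTerm associations, the attribute names (value, lower, upper, range_descriptor) are invented for the sketch, and ComplexValue is omitted because its content depends on the ontology being referenced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    """Abstract in FuGE; strings stand in for OntologyTerm references here."""
    unit: Optional[str] = None
    data_type: Optional[str] = None


@dataclass
class AtomicValue(Measurement):
    value: str = ""


@dataclass
class BooleanValue(Measurement):
    value: bool = False


@dataclass
class RangeValue(Measurement):
    lower: float = 0.0
    upper: float = 0.0
    range_descriptor: str = "inclusive"  # assumed to come from an ontology term

    def contains(self, x: float) -> bool:
        """Hypothetical helper showing what the descriptor could mean."""
        if self.range_descriptor == "exclusive":
            return self.lower < x < self.upper
        return self.lower <= x <= self.upper


duration = AtomicValue(unit="minute", data_type="non-negative integer", value="20")
shaken = BooleanValue(data_type="Boolean", value=True)
temperature = RangeValue(unit="degree Celsius", lower=20.0, upper=25.0)
open_range = RangeValue(lower=20.0, upper=25.0, range_descriptor="exclusive")
```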

[pic]

Figure 6 The Measurement package in FuGE.

3.2.7 Protocol

The Protocol package represents any method or procedure in an experiment. The package is separated into two parts: i) an abstract representation of how a protocol should be structured, which can be extended for developing modular formats, for example in support of Goal 2 (Section 1.2) and ii) non-abstract classes that can be used without extension, for example in support of Goal 1. The abstract classes Protocol, Action, Equipment, Software and Parameter fall into category i (Figure 7). These classes cannot be instantiated as they are, because it is intended that they should be extended (by creating subclasses) in the development of new modules. In addition to the abstract classes, a set of non-abstract classes is provided that can be used without extension, called GenericProtocol, GenericAction, GenericEquipment, GenericSoftware, GenericParameter and GenericProtocolApplication (Figure 8). These classes can be instantiated with user-entered text and ontology terms to capture the details of the protocol as described below.

[pic]

Figure 7 The Protocol package contains the abstract classes Protocol, Software, Equipment and Action that should be extended in the development of new modules.

A general experimental protocol is structured as follows in FuGE. The Protocol class can be associated with Software and Equipment, each of which can have a set of parameters with default values. A Protocol consists of a set of steps (Action) that can be ordered. An Action can be associated with parameters (Figure 10) and/or it can be a reference to a child Protocol. This means that a complex procedure can be represented by building a Protocol that references other Protocol objects in a nested structure. The Protocol class is abstract and contains abstract associations to Software, Equipment and Action, all of which are also abstract. In effect, these classes act as a template to demonstrate how an extension of a FuGE Protocol should be developed, by extending these classes and associations.
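The nested structure of protocols and actions can be sketched informally in Python using the non-abstract classes. The attribute names (action_text, ordinal, child_protocol), the flatten helper and the example protocol are assumptions for illustration; they are not taken from the reference document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GenericAction:
    action_text: str = ""
    ordinal: int = 0  # position of the step within its protocol
    child_protocol: Optional["GenericProtocol"] = None


@dataclass
class GenericProtocol:
    name: str
    actions: List[GenericAction] = field(default_factory=list)


def flatten(protocol: GenericProtocol, depth: int = 0) -> List[str]:
    """List every step in order, descending into child protocols."""
    lines: List[str] = []
    for action in sorted(protocol.actions, key=lambda a: a.ordinal):
        if action.child_protocol is not None:
            lines.append("  " * depth + action.child_protocol.name)
            lines.extend(flatten(action.child_protocol, depth + 1))
        else:
            lines.append("  " * depth + action.action_text)
    return lines


lysis = GenericProtocol("cell lysis", [
    GenericAction("vortex", ordinal=2),
    GenericAction("add lysis buffer", ordinal=1),
])
extraction = GenericProtocol("protein extraction", [
    GenericAction(ordinal=1, child_protocol=lysis),
    GenericAction("centrifuge", ordinal=2),
])
steps = flatten(extraction)
```

In this sketch the first step of "protein extraction" is a reference to the "cell lysis" protocol, so its two ordered steps appear indented beneath it in the flattened listing.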

[pic]

Figure 8 GenericProtocol, GenericEquipment, GenericSoftware and GenericAction allow experimental procedures to be represented in FuGE without extension using free text and ontology terms.

A laboratory procedure is typically defined once but may be applied many times. FuGE represents this distinction by defining ProtocolApplication. ProtocolApplication represents the running of a Protocol, allowing runtime parameter values to be supplied if they differ from the default values. ProtocolApplication also provides mechanisms for recording the operator and date of the procedure. ProtocolApplication references the input and output materials and/or data that were acted upon. As such, it can be used to construct experimental workflows by tracking the identity of all samples, represented by the Material class (Section 3.3.4) and data files (Section 3.3.2). ProtocolApplication is an abstract class and the associations to Material and Data are also abstract. This signals to a developer that an extension of ProtocolApplication can have named associations to Material, Data or subclasses of Material and Data (for Goal 2). GenericProtocolApplication has non-abstract associations to Material and Data (Section 3.3.4) allowing it to be used to construct experimental workflows without extension of the model (e.g. for Goal 1).

[pic]

Figure 9 ProtocolApplication represents the running of a protocol. ProtocolApplication can record the operator(s) (Performers), the date, and any deviations in the Protocol or Action objects contained within the Protocol.

Protocol, Equipment, Software (inherited from Parameterizable) and Action have an abstract association to Parameter (Figure 10). This implies that extensions (subclasses) of these classes can have associations to Parameter, newly defined subclasses of Parameter or its subclass in FuGE. GenericProtocol, GenericEquipment, GenericSoftware and GenericAction have associations to GenericParameter. The type of GenericParameter SHOULD be provided by the association to OntologyTerm. Parameter has an association to Measurement for providing a default value. The classes ProtocolApplication, EquipmentApplication, SoftwareApplication and ActionApplication inherit an association to ParameterValue, which can be used for providing runtime parameter values if they differ from the defaults specified in the Protocol.
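The interplay between default and runtime values can be shown with a small Python sketch. It simplifies heavily: strings stand in for Measurement objects, ParameterValue objects are reduced to a dictionary keyed by parameter name, and the effective_values helper is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GenericParameter:
    name: str
    default_value: str  # stands in for the association to Measurement


@dataclass
class GenericProtocol:
    name: str
    parameters: List[GenericParameter] = field(default_factory=list)


@dataclass
class GenericProtocolApplication:
    protocol: GenericProtocol
    # Stands in for ParameterValue objects, keyed by parameter name.
    parameter_values: Dict[str, str] = field(default_factory=dict)

    def effective_values(self) -> Dict[str, str]:
        """Defaults from the protocol, overridden by any runtime values."""
        values = {p.name: p.default_value for p in self.protocol.parameters}
        values.update(self.parameter_values)
        return values


centrifugation = GenericProtocol("centrifugation", [
    GenericParameter("speed", "13000 rpm"),
    GenericParameter("duration", "20 min"),
])
standard_run = GenericProtocolApplication(centrifugation)
long_run = GenericProtocolApplication(centrifugation, {"duration": "30 min"})
```

Only the run that deviates from the protocol needs to record a value; the standard run carries no runtime values at all.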

It is expected that ProtocolApplication and Protocol will be extended to describe procedures such as sample preparation, data acquisition and data transformation where specific constraints must be placed on the details to be captured. GenericProtocol and GenericProtocolApplication should be used for capturing general details about laboratory procedures that do not require specific constraints.

[pic]

Figure 10 The Parameter class enables the definition of parameters and default values. Protocol, Software, Equipment and Action have an abstract association to Parameter. GenericProtocol (and related classes) have associations to GenericParameter. ProtocolApplication and related classes enable a runtime value to be supplied for a Parameter if it differs from the default value.

In a data transformation pipeline, the output parameters from one process may be input parameters to another process. This is modeled in FuGE by the ParameterPair class (Figure 11). ParameterPair can relate a sourceParameter (output of a process) to a targetParameter (input of a process). Instances of ParameterPair are owned by GenericAction (or other subclasses of Action). A GenericAction can be used to reference child protocols, for example representing a pipeline in which a series of processes are performed. Where ParameterPair has been used, an instance of GenericAction SHOULD have a reference to both the ParameterPair object and the childProtocol that owns the targetParameter, as exemplified in Figure 12.
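A minimal Python sketch of ParameterPair follows. The attribute names and the targets_owned_by_child check are assumptions made for illustration; the check simply mirrors the recommendation that the action holding a ParameterPair also references the child protocol that owns the target parameter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GenericParameter:
    name: str


@dataclass
class GenericProtocol:
    name: str
    parameters: List[GenericParameter] = field(default_factory=list)


@dataclass
class ParameterPair:
    source_parameter: GenericParameter  # output of one process
    target_parameter: GenericParameter  # input of the next process


@dataclass
class GenericAction:
    child_protocol: GenericProtocol
    parameter_pairs: List[ParameterPair] = field(default_factory=list)

    def targets_owned_by_child(self) -> bool:
        """True if every target parameter belongs to this action's child protocol."""
        return all(
            any(owned is pair.target_parameter
                for owned in self.child_protocol.parameters)
            for pair in self.parameter_pairs
        )


peak_list = GenericParameter("peak list file")
search_input = GenericParameter("spectra to search")
peak_picking = GenericProtocol("peak picking", [peak_list])
database_search = GenericProtocol("database search", [search_input])

# The second step of the pipeline takes its input from the first step's output.
step_two = GenericAction(
    child_protocol=database_search,
    parameter_pairs=[ParameterPair(peak_list, search_input)],
)
```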

[pic]

Figure 11 ParameterPair can relate input and output Parameter objects for a Protocol which is referenced as a ChildProtocol from an Action.

[pic]

Figure 12 A graphical example to demonstrate the use of ParameterPair, in which a data analysis pipeline has three sub-protocols. The value from an output parameter from sub-protocol 1 acts as an input parameter to sub-protocol 2, indicated by the ParameterPair object.

3.3 FuGE.Bio

The Bio package contains classes that relate to biological materials, data formats, investigational structure and database representations of bio-molecules, used in an omics experiment.

3.3.1 ConceptualMolecule

The ConceptualMolecule package (Figure 13) represents descriptions of biological molecules in databases (as opposed to Material that describes actual materials or samples used in an investigation). ConceptualMolecule could be used to capture information stored in database entries about biological sequences, lipids, metabolites and so on.

[pic]

Figure 13 The ConceptualMolecule package can represent biological sequences or it can be extended to represent database entries of other types of molecule.

In FuGE, ConceptualMolecule has only been extended for describing biological sequences (Sequence). A Sequence object has various properties to capture the actual sequence, the length, and the start and end positions if only a section of the total sequence is being specified, such as part of a chromosome. Multiple Sequence objects can be associated with a SequenceAnnotationSet object to describe the Species, Types of sequence (such as BAC, gene, EST, peptide) and PolymerType (DNA, RNA or protein for example). The inherited association to DatabaseEntry should be used to annotate a Sequence with external records containing the complete details of the sequence, stored for example in GenBank, EMBL or SwissProt.

3.3.2 Data

The FuGE Data object has been designed to enable representation of n-dimensional data matrices (Figure 14). The definitions of Data axes (Dimension and DimensionElement objects) are re-usable across Data instances. A simple example would be a tabular representation of gene expression, where one axis (Dimension) represents a gene list (a 9500 feature array would have 9500 DimensionElement instances for this Dimension), while another axis represents the types of measurements derived from scanning the slide (e.g. signal, normalized value and P-value, giving rise to three instances of DimensionElement). Dimension and DimensionElement, and the associations from Data to Dimension and from Dimension to DimensionElement, are abstract. Extension developers should extend Data, Dimension and DimensionElement to describe the relevant data axes and create associations that extend from the Dimensions and DimensionElements associations defined in FuGE.
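The gene expression example can be sketched in Python as follows. The classes here are concrete stand-ins for what would be subclasses in a real extension (the FuGE classes themselves are abstract), and the shape helper and element names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DimensionElement:
    name: str


@dataclass
class Dimension:
    name: str
    elements: List[DimensionElement] = field(default_factory=list)


@dataclass
class Data:
    dimensions: List[Dimension] = field(default_factory=list)

    def shape(self) -> Tuple[int, ...]:
        """Number of elements along each axis of the data matrix."""
        return tuple(len(d.elements) for d in self.dimensions)


# One axis for the features on the array, one for the measurement types.
genes = Dimension(
    "gene list",
    [DimensionElement(f"feature {i}") for i in range(1, 9501)],
)
measurement_types = Dimension(
    "measurement type",
    [DimensionElement(n) for n in ("signal", "normalized value", "P-value")],
)
expression = Data([genes, measurement_types])

# Axis definitions are re-usable: a second data set shares the gene list.
signal_only = Data([genes, Dimension("measurement type", [DimensionElement("signal")])])
```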

Data has two subclasses ExternalData and InternalData. ExternalData specifies the location of the data matrix or a data file in a non-FuGE based format. ExternalData also has an association to URI to capture a validation schema, descriptors or documentation on the external format.

[pic]

Figure 14 The Data package provides extension points for describing data dimensions and the elements within the dimensions. A data matrix can be stored either in an external file (ExternalData) or within FuGE itself (InternalData).

The InternalData class is abstract, and should be extended by module developers if they wish to specify that the data matrix should have a specific encoding or data type, such as base64 binary [5]. The GenericInternalData class provides a slot for the data matrix to be stored in any format as required, which can be specified by an ontology term. In XML, the storage attribute is mapped to a string datatype, as described in Section 5.8, which can store any type of character data.

If Dimension and DimensionElement have been extended to describe the data axes, the data matrix, instantiated in an InternalData or ExternalData object, should conform to the specification of its dimensions.

ProtocolApplication has abstract associations to Data, for inputs and outputs, indicating that a subclass of ProtocolApplication in a FuGE extension can be associated with specific types (subclasses) of Data. GenericProtocolApplication also has non-abstract associations to Data (as described below, Figure 17).

The DataPartition class can be used to describe a subset of a multidimensional data set by referencing certain DimensionElement objects and the storage matrix or external file in which the data are stored. DataPartition is abstract and hence should be extended to reference particular subclasses of DimensionElement and Data. The GenericDataPartition class can be used without extension for referencing any subclasses of DimensionElement and Data.

The PartitionPair class references two DataPartition objects, corresponding to a subset of the input Data to, and output Data from, a ProtocolApplication. PartitionPair is abstract (as are the input and output data associations), hence the class and associations should be extended for various purposes, such as to describe “supporting evidence” where certain results in the output Data are dependent only on certain parts of the input Data set. The algorithm that relates the input DataPartition to the output DataPartition can be captured using the association to Description.

A theoretical example of the use of PartitionPair is as follows. In a search with mass spectrometry data, particular peptides may be identified based on the masses of particular peaks within a trace. In this example, the trace could be modeled as the input Data to a ProtocolApplication and the set of identified peptides could be modeled as the output Data. Instances of PartitionPair could be used to relate specific peptides with a subset of the input Data, containing only those specific peaks from which the identifications were made.
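The mass spectrometry example can be pictured with the following Python sketch. It is a simplification under stated assumptions: strings stand in for references to Data and DimensionElement objects, the classes are concrete stand-ins for extensions of the abstract FuGE classes, and supporting_peaks is a hypothetical query helper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataPartition:
    data_identifier: str       # the Data object being partitioned
    elements: Tuple[str, ...]  # names of the referenced DimensionElement objects


@dataclass(frozen=True)
class PartitionPair:
    input_partition: DataPartition
    output_partition: DataPartition


def supporting_peaks(pairs: List[PartitionPair], peptide: str) -> List[str]:
    """Collect the input elements that are linked to one output element."""
    peaks: List[str] = []
    for pair in pairs:
        if peptide in pair.output_partition.elements:
            peaks.extend(pair.input_partition.elements)
    return peaks


pairs = [
    PartitionPair(
        DataPartition("trace-1", ("peak 3", "peak 7")),
        DataPartition("identifications-1", ("peptide A",)),
    ),
    PartitionPair(
        DataPartition("trace-1", ("peak 12",)),
        DataPartition("identifications-1", ("peptide B",)),
    ),
]
evidence_for_a = supporting_peaks(pairs, "peptide A")
```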

3.3.3 Investigation

The Investigation package captures a summary of the investigational design (Investigation), the independent experimental variables (Factor and FactorValue) and a summary of the main technologies used (InvestigationComponent).

[pic]

Figure 15 The Investigation package stores a summary of the Investigation, the technologies used (InvestigationComponent), and the independent variables being studied (Factor).

Investigation has an association to Material representing the important sources of material, as determined by the investigator to facilitate querying (the same Material objects should be defined within a workflow and referenced by instances of ProtocolApplication that act upon them). The overall design of the investigation can be captured by the InvestigationTypes associations to OntologyTerm; examples include “interventional design”, “observational design” or “drug-dose response”. InvestigationComponent represents a single technique used within the Investigation, such as “microarray analysis”, “proteome analysis” or “biometric testing”. A summary of the Investigation, for example in terms of the Hypothesis or Conclusion, MAY be stored as plain text in Description or using ontology terms under InvestigationSummary. InvestigationSummary MAY also be used to capture keywords about the investigation to facilitate retrieval or querying. Investigation has attributes start and end, allowing the start and end point of the investigation to be represented. InvestigationComponent allows the user to specify the number of replicates performed, the normalization strategy and quality control. There is also an association from InvestigationComponent to OntologyTerm to capture the design with respect to the particular technology, for example “dye swap”. The principal comparators in an Investigation are modeled by Factor, such as “time course” or “genetic difference”. Factor objects can, but need not, be shared across different InvestigationComponent objects. The actual values for Factor objects are stored in FactorValue and the association to Measurement. FactorValue has an association to DataPartition to reference extensions of DimensionElement and Data to define the data values that correspond to that FactorValue.

There is an association from Investigation to HigherLevelAnalysis for representing analyses that are specific to a technology, and the Data on which such analyses are based. It is intended that HigherLevelAnalysis will be extended by developers of standards for individual domains.

3.3.4 Material

All materials used in an investigation are modeled in the Material package by the class Material and the GenericMaterial subclass. Material has three named associations for capturing ontological information: MaterialType, Characteristics and QualityControlStatistics. This allows developers to create ontologies that define the characteristics and behavior of materials, or alternatively, the Material class can be extended by developers to define specific constraints or attributes within a new module.

ProtocolApplication has three abstract associations to Material for representing the inputs and outputs for a process: two direct associations and one indirect association via MaterialMeasurement. These associations exist to demonstrate that extensions of ProtocolApplication can be associated with specific subclasses of Material that are measured inputs (InputMaterials), complete inputs (InputCompleteMaterials) or outputs (OutputMaterials) for a process.

[pic]

Figure 16 The GenericMaterial class can be used to describe all materials used in an investigation by providing ontology terms (on the associations inherited from Material). The abstract Material class can be extended to capture specific properties of a particular substance.

The input and output samples to a process (GenericProtocolApplication) are modeled by GenericMaterial (Figure 17) or other subclasses of Material defined in a model extension. All output samples from a GenericProtocolApplication are modeled by the OutputMaterials association. There are two associations from GenericProtocolApplication modeling the input samples: InputMaterials to be used when a measured source of material is input to the process and InputCompleteMaterials to be used when the whole material is input to a process, such as a microarray or a polyacrylamide gel (where a measurement would never be required). The GenericMaterialMeasurement class extends from MaterialMeasurement (which allows a Measurement to be given) and references Material (allowing either a GenericMaterial to be referenced or any other subclass of Material in an extension of FuGE). GenericProtocolApplication also has two associations to Data to represent any data sets that are inputs or outputs for a process.

A complete workflow of Material objects (for example sample tracking) can only be traced through the instances of ProtocolApplication that have acted upon them to ensure that all processes are explicitly described. However, it should be noted that for database implementations of FuGE, for example in a LIMS, additional structures may be required to collect together chains of related Material instances to improve query performance.
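As a rough illustration of tracing materials through the processes that acted upon them, the following Python sketch walks backwards from a Material to all of its upstream sources. The class and field names are simplified assumptions (a single input list stands in for both InputMaterials and InputCompleteMaterials), and the identifiers are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Material:
    identifier: str


@dataclass
class ProtocolApplication:
    identifier: str
    input_materials: list = field(default_factory=list)   # measured or complete inputs
    output_materials: list = field(default_factory=list)


def trace_sources(material, applications):
    """Return the identifiers of every upstream Material that contributed to `material`."""
    # Index: output material identifier -> applications that produced it.
    produced_by = {}
    for app in applications:
        for out in app.output_materials:
            produced_by.setdefault(out.identifier, []).append(app)

    seen = set()
    stack = [material.identifier]
    while stack:
        current = stack.pop()
        for app in produced_by.get(current, []):
            for src in app.input_materials:
                if src.identifier not in seen:
                    seen.add(src.identifier)
                    stack.append(src.identifier)
    return seen


tissue = Material("mat:tissue")
extract = Material("mat:extract")
labeled = Material("mat:labeled")
apps = [
    ProtocolApplication("pa:extraction", [tissue], [extract]),
    ProtocolApplication("pa:labeling", [extract], [labeled]),
]
sources = trace_sources(labeled, apps)
```

The produced_by index is one example of the kind of additional structure a database implementation might maintain so that chains of related Material instances do not have to be rediscovered on every query.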

[pic]

Figure 17 GenericProtocolApplication has associations to Material to capture the inputs and outputs to a process. GenericProtocolApplication also has two associations to Data to represent input and output data for a process.

3.4 FuGE.Collection

The Collection package contains a class that defines the root of a FuGE document and collection classes that have composite associations to reference all objects used in FuGE (Figure 18). If a class is not directly associated from a collection class, it is wholly owned (by a composite association) by another class that is referenced by a collection class. As such, the Collection package defines the root structure of the XML format (see Section 5 for more details) and provides a mechanism for accessing all objects that relate to a single FuGE instance within other platform-specific models.

[pic]

Figure 18 The FuGE class defines the root of a document and has, directly or indirectly, composite associations to all classes in the object model.

4 Unique identifiers

Many classes in the model are capable of being referenced, either within the same FuGE instance or from other FuGE instances. Such classes extend from Identifiable, inheriting the identifier attribute. The identifier attribute SHOULD be completed with a value that uniquely identifies the object within the system. When data are exchanged between systems, for example when data are submitted to a public repository, the identifier MUST be globally unique; one way to achieve this is to use the URL of the institution as a prefix to the identifier. Life science identifiers (LSID) MAY be used within a particular extension, if required. Platform-specific models should implement identifiers in the manner appropriate to the platform. A unique identifier best practice document is provided on the FuGE website.
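A minimal sketch of the prefixing approach is given below. The function names, the "/" separator and the example URL are assumptions for illustration only; the specification does not prescribe a particular identifier syntax.

```python
def make_global_identifier(institution_url: str, local_id: str) -> str:
    """Prefix a locally unique identifier with an institution URL."""
    if not institution_url or not local_id:
        raise ValueError("both an institution URL and a local identifier are required")
    # Assumption: a single "/" separates the prefix from the local identifier.
    return institution_url.rstrip("/") + "/" + local_id.lstrip("/")


def find_duplicates(identifiers):
    """Return the set of identifiers that occur more than once."""
    seen, duplicates = set(), set()
    for identifier in identifiers:
        if identifier in seen:
            duplicates.add(identifier)
        seen.add(identifier)
    return duplicates


global_id = make_global_identifier("http://example.org/lab/", "Material-42")
```

A check such as find_duplicates could be run before exchanging a document to confirm that no two objects share an identifier.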

5 Mapping to XML Schema

The following rules govern how the XML Schema is generated for FuGE version 1. The same rules apply to models developed by extending FuGE. The rules have been implemented using the model-driven architecture platform AndroMDA. AndroMDA processes UML models and outputs text documents, such as Java code or a database schema, according to particular templates. AndroMDA contains a default template for producing an XML Schema, which has been modified for FuGE to implement the rules specified below. The AndroMDA stereotypes XmlSchemaType, Entity, Enumeration and XmlAttribute are used within the object model to govern the processing of model elements. Two additional stereotypes have been defined within FuGE to provide additional processing rules: AbstractAssociation and MapAssocToElement. In this section, XML Schema structures and data types are denoted by italic font to distinguish them from UML features.

5.1 Mapping from classes to elements

Classes modeled with the AndroMDA XmlSchemaType stereotype are converted to a complexType [13] in the XML Schema. Classes are also represented as a separate element which references the complexType definition, following a design pattern that allows elements to be reused across XML Schemas. Note that the AndroMDA Entity stereotype is also required for generation of platform specific models.
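The pattern of a named complexType plus a separate element that references it can be illustrated with a short Python sketch using the standard library. This is not the AndroMDA template: the helper name class_to_schema and the naming convention of appending "Type" to the class name are assumptions.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def class_to_schema(class_name: str) -> ET.Element:
    """Build a schema fragment for one UML class (illustrative only)."""
    schema = ET.Element(f"{{{XS}}}schema")
    # The class becomes a named complexType ...
    ET.SubElement(schema, f"{{{XS}}}complexType", {"name": f"{class_name}Type"})
    # ... and a separate global element that references that type,
    # so the type can be reused from other XML Schemas.
    ET.SubElement(
        schema,
        f"{{{XS}}}element",
        {"name": class_name, "type": f"{class_name}Type"},
    )
    return schema


schema = class_to_schema("Material")
print(ET.tostring(schema, encoding="unicode"))
```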

[pic]

Sample of XML Schema (note that the documentation of elements and attributes has been removed from these examples for clarity):

...



Example XML instance:



5.2 Mapping UML attributes

Attributes on classes are converted to XML attributes on the parent element if they have the AndroMDA XmlAttribute stereotype. In FuGE, all attributes have the XmlAttribute stereotype except for storage on InternalData, which is mapped to an element instead. Attributes are assigned the relevant XML Schema data type [13] using the mapping defined in Section 5.8; however, only the following have been used in FuGE version 1: int, string, dateTime, boolean and anyURI.
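The attribute rule can be sketched in Python as follows. The helper build_complex_type, the "Type" suffix, the identifier attribute on the example class and the string type chosen for storage are all assumptions for illustration; only the list of five data types and the special treatment of storage on InternalData come from the text above.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

# XML Schema data types listed as used in FuGE version 1.
FUGE_V1_TYPES = {"int", "string", "dateTime", "boolean", "anyURI"}


def build_complex_type(class_name, attributes):
    """attributes: iterable of (name, xsd_type, has_xml_attribute_stereotype)."""
    ctype = ET.Element(f"{{{XS}}}complexType", {"name": f"{class_name}Type"})
    # Child elements live in a sequence, which must precede attribute declarations.
    sequence = ET.SubElement(ctype, f"{{{XS}}}sequence")
    for name, xsd_type, as_attribute in attributes:
        if xsd_type not in FUGE_V1_TYPES:
            raise ValueError(f"unsupported data type: {xsd_type}")
        decl = {"name": name, "type": f"xs:{xsd_type}"}
        if as_attribute:
            ET.SubElement(ctype, f"{{{XS}}}attribute", decl)
        else:
            ET.SubElement(sequence, f"{{{XS}}}element", decl)
    return ctype


internal_data = build_complex_type(
    "InternalData",
    [
        ("identifier", "string", True),   # assumed attribute, mapped to an XML attribute
        ("storage", "string", False),     # mapped to an element; the type is an assumption
    ],
)
```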

[pic]

Sample of XML Schema:



Example XML instance:

5.3 Mapping composite associations

If a composite (containment) association is used, the associated element appears below the associating element in the XML tree, for example:

[pic]

Sample of XML Schema:

...

Example XML instance:
