Abstract



Extraction and Evaluation of Transcription Factor Gene-Disease Association

Thesis Proposal for Doctor of Philosophy

Bioinformatics Program

Warren Cheung

Thesis Supervisors

Francis Ouellette, Wyeth Wasserman

Committee Members

Angela Brooks-Wilson

Raphael Gottardo

Chair

Paul Pavlidis

Examination Date

August 9, 2007

Table of Contents

A. Problem Statement

Summary of Goals

Motivation

Example Use Cases

Existing Methods

B. Proposed Method

Genes

Disease

Features

Linkages

Quantitative Evaluation

Validation

C. Goals

1) Main TF Gene-Disease Association Prediction Model

2) TF Gene-Disease Association Property Predictions

3) Gene Cluster-Disease Association Predictions

Common Goals

D. Project

Principles

E. Appendix I - Data Sources

Genes

Disease

Evidence

Prototype Implementation

F. References

Problem Statement

The purpose of this research will be to identify effective methods to quantitatively evaluate the relationship between transcription factor genes and diseases via literature evidence, identifying existing associations and predicting novel associations. To accomplish this, I shall explore ways to link various forms of evidence with genes and diseases, develop quantitative methods to evaluate the resulting associations, and validate the resulting analyses.

Summary of Goals

Main TF Gene-Disease Association Prediction Model

Evaluate quantitatively the association between each transcription factor gene and each disease.

TF Gene-Disease Association Property Predictions

Evaluate quantitatively what properties are relevant to a gene-disease pairing.

Gene Cluster-Disease Association Predictions

Identify clusters of similar genes associated with disease.

Background [Motivate harder]

Identification of functional causes and contributing mechanisms of disease is a principal aim of biomedical research. In many cases the term “disease” broadly applies to a heterogeneous set of observable properties, which may arise from multiple molecular processes. Disease is often characterized by symptoms, such as indicator signs, and by a pattern of progression over time. Brain diseases are a particularly broad disease area, encompassing a wide range of complex, abnormal phenotypes, including combinations of lethality, neurodegeneration, paralysis and behavioural abnormality. Compared to diseases associated with other organs, diseases of the brain tend to be poorly understood; many are difficult to characterise and have complex genetic components involving multiple genes.

Dating back to the early postulations of King and Wilson[REFME?], there has been an expectation that regulatory changes are likely to have a strong impact on phenotype in complex organisms and systems. While examples exist of disease linkages for almost all classes of encoded proteins, genes encoding transcription factors are amongst the most commonly linked. Transcription factors are regulators of gene expression that act through processes such as recruiting transcription initiation factors or inducing DNA conformational change, working alone or as part of protein complexes. Transcription factors play a particularly important role in the brain. Given the incredible diversity of neuronal and glial cells and their complex arrangement, a careful balance of transcription factors is vital to the proper development of the brain, the determination of cell subtype and cell migration. This relationship continues through to the adult brain, where transcription factor activity is linked to neuronal survival, differentiation, proper cellular function and neuroplasticity.

In the proposed research, I will develop computational methods to identify linkages between genes encoding transcription factors and brain diseases. The research will emphasize the use of literature resources such as the PubMed archive of abstracts from scientific literature, controlled vocabularies for both medicine and genes, and a compendium of additional resources relevant to disease, transcription factors and/or the brain. By emphasising verifiable evidence types when performing data integration, the proposed method will deliver not only novel gene-disease associations but also the evidence supporting these predictions. This allows the user to gain new insight and, by ultimately focusing on the properties relevant to each gene-disease pairing, will help elucidate new understanding of gene-disease relationships.

Example Use Cases

Prior to examining existing approaches and laying out the proposed approach, it may be helpful to consider specific examples of how the research could be directly applied.

Case #1: A researcher performs a microarray experiment, comparing expression of genes within pineal tissue samples obtained from individuals with and without symptoms of a neurological disease. From the set of genes showing differential expression between these two conditions, the user is looking for the set of genes most likely to be involved in diseases of interest and supporting evidence for such relationships.

Case #2: A researcher wishes to obtain a ranked list of genes associated with a particular disease, with each entry linked to supporting evidence. The researcher may further desire to highlight potential pathways and regulatory relationships between these genes.

Existing Methods for Prediction of Gene-Disease Links

Several methods have been described for the computational prediction of linkages between genes and disease [REFER TO SUMMARY TABLE HERE]. Most of the existing methods take as input a preliminary list of candidate genes (e.g. genes in a genomic region linked in a genetic study to a disease), and return as output either a reduced or a ranked list. The underlying approaches differ substantively between methods. Examples of characteristics used in the methods include numerical features derived from the raw sequence of genes and/or encoded proteins, existing annotations of proteins and genes, and abstracts or articles directly referring to the gene. The current methods focus on using properties from a representative set of genes to identify similar genes from the candidate set. Here I will present an overview of eight recent methods used to link genes to disease.

[INSERT? Summary Table of Methods used by the related work]

One method [FIND THIS PAPER I TOOK THIS TO WALES OLD] for identifying disease-related genes clusters the diseases in OMIM, rather than the genes themselves, using indices such as tissue, age of onset, primary etiology, episodic occurrence and mode of inheritance. A measure of similarity between any two diseases is calculated from the weighted contributions of each of these indices. Once the clusters are determined (using a strategy that involves manual thresholding by a human expert), the candidate genes are compared to the disease genes underlying the diseases in each cluster using the GO annotations[1]. For a candidate gene in a disease cluster, each GO term is considered. If the candidate gene is not annotated with the GO term, the ratio is 0. Otherwise, the ratio is the number of occurrences of the GO term in the cluster divided by the number of occurrences of the GO term among all disease genes. The score for a candidate gene in a disease cluster is then the average of all the GO term ratios for that gene, downscaled by the number of genes in the cluster. The method was validated using leave-one-out cross-validation.
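The GO-term ratio score described above can be sketched as follows. This is only an illustration under stated assumptions, not the cited method's implementation: the function and argument names are hypothetical, the terms "considered" are taken to be those annotated to at least one gene in the cluster, "downscaled" is read as division by the cluster size, and the cluster is assumed to be a subset of the full disease gene set.

```python
from collections import Counter


def cluster_go_score(candidate_terms, cluster_genes, all_disease_genes):
    """Score one candidate gene against one disease cluster.

    candidate_terms: set of GO term IDs annotated to the candidate gene.
    cluster_genes, all_disease_genes: dicts mapping gene -> set of GO term IDs.
    Assumes every gene in cluster_genes also appears in all_disease_genes.
    """
    # Number of genes carrying each GO term, in the cluster and overall.
    cluster_counts = Counter(t for terms in cluster_genes.values() for t in terms)
    global_counts = Counter(t for terms in all_disease_genes.values() for t in terms)
    if not cluster_counts:
        return 0.0

    ratios = []
    for term, in_cluster in cluster_counts.items():
        if term in candidate_terms:
            ratios.append(in_cluster / global_counts[term])
        else:
            ratios.append(0.0)  # term not annotated to the candidate

    mean_ratio = sum(ratios) / len(ratios)
    # Assumed reading of "downscaled": divide by the number of cluster genes.
    return mean_ratio / len(cluster_genes)
```

With two cluster genes annotated {A, B} and {A}, a third disease gene annotated {B, C}, and a candidate annotated only with A, the ratios are 1 for A and 0 for B, giving a score of 0.25.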

Rather than ranking candidates, an alternative approach is to restrict a candidate gene set to those genes that meet some of a set of specific properties. GeneSeeker (Van Driel, Cuelenaere et al. 2005) can find genes within a chromosomal location that are localized in particular tissues, by looking at human and mouse expression data. Another method of associating disease genes with anatomical locations (Tiffin, Kelso et al. 2005) performed text mining of PubMed abstracts to associate eVOC anatomical ontology terms with gene names.

Machine learning approaches can be used when a representative set of disease genes is available to use as training data. In DGP (López-Bigas and Ouzounis 2004), a decision tree classification approach is used to find features common to disease genes based on a training set composed of sample disease and control proteins. Features were protein length, BLASTP ratios (conservation scores) between a protein and its highest scoring homologue within taxonomic groups (representing phylogenetic conservation and extent) and the conservation score with the closest paralog. The study indicates that, on average, hereditary disease genes (taken from OMIM) are longer, more conserved, more phylogenetically extended and less likely to have close paralogues than randomly selected genes.

PROSPECTR (Adie, Adams et al. 2005) uses a wider variety of features, including the length of the gene, the length of its coding sequence, the length of its cDNA, length of the protein, GC content and percentage protein identity with its nearest homologue in various species (mouse, worm, fly). The investigators used an alternating decision tree, taking genes from OMIM and comparing against genes not found in OMIM. They also generated two independent test sets: one using genes from the Human Gene Mutation Database with randomly selected control genes, and another set of 54 genes not in OMIM, again with a set of randomly selected control genes.

POCUS (Turner, Clutterbuck et al. 2003) takes another machine learning approach, using a selected training set of genes linked to the target disease. POCUS identifies features common to all the training genes (InterPro domains, GO annotations, similar expression profiles) and assesses the probability that such features would be shared by chance. This method depends on a carefully selected training set of genes, and focuses on the likelihood of these genes all sharing common, disease-related properties, in contrast to methods that focus on overrepresentation of properties among the training genes.

G2D (Perez-Iratxeta, Bork et al. 2002) links genes from a specified genomic locus to diseases based on PubMed MeSH disease and chemical term annotation and RefSeq GO annotations. MeSH disease terms are initially linked to MeSH chemical terms via co-occurring annotation of PubMed articles. Similarly, RefSeq GO annotations are linked to the MeSH chemical terms via the PubMed references in the GO annotations. Scores are generated for these pair-wise associations as the ratio of the cardinality of the intersection to that of the union. The score for the combined disease-chemical-gene relation is defined as the product of the two pair-wise relations, and the score for a disease-gene relation is simply the maximum of all possible scores. The most recent update (Perez-Iratxeta, Bork et al. 2007) includes additional methods of inferring disease-gene associations, which involve the user providing genes from other genomic regions related to the disease. The first method is more stringent: it looks for disease genes sharing functional similarity with the specified genes. The second method looks for functional association via protein-protein interactions (provided by the STRING database). [FIXME WEAK ENDING]
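To make the scoring chain concrete, the following sketch computes the intersection-over-union score for each pair-wise association, multiplies the two scores along a disease-chemical-gene path, and takes the maximum over all paths. It assumes that each term is represented by the set of PubMed IDs annotated with it; the function names and input layout are hypothetical and do not come from G2D.

```python
def jaccard(a, b):
    """Ratio of the size of the intersection to the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disease_gene_score(disease_pmids, chemical_pmids, gene_go_pmids):
    """Best disease-chemical-gene score over all chemical and GO terms.

    disease_pmids: set of PubMed IDs annotated with the MeSH disease term.
    chemical_pmids: dict mapping MeSH chemical term -> set of PubMed IDs.
    gene_go_pmids: dict mapping each GO term of the gene -> set of PubMed IDs.
    """
    best = 0.0
    for chem_articles in chemical_pmids.values():
        disease_chem = jaccard(disease_pmids, chem_articles)
        for go_articles in gene_go_pmids.values():
            chem_gene = jaccard(chem_articles, go_articles)
            # Combined relation: product of the two pair-wise scores.
            best = max(best, disease_chem * chem_gene)
    return best
```

Taking the maximum means a single strongly supported path is enough to link a gene to a disease, however many weak paths also exist.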

The Endeavor system (Aerts, Lambrechts et al. 2006) aims to create an extensible system for prioritizing disease genes using heterogeneous data sources. The input to the system is a training set of genes. The authors evaluated the performance of the system on monogenic diseases (automatically extracted from OMIM), polygenic diseases (six genes recently determined to be involved in polygenic disease) and also on functional roles in regulatory pathways (by examining differential RT-PCR expression). They also performed functional validation in zebrafish, searching for genes involved in DiGeorge syndrome (DGS) by using a training set of genes causing DGS and DGS-like symptoms. This resulted in the prioritisation of TBX1, a known DGS-related gene, and YPEL1, which produced DGS-like defects when its expression was knocked down in vivo. [FIXME add description of Order statistics integration method, results]

More recently, CAESAR (Gaulton, Mohlke et al. 2007) takes as input a text about the disease and analyses it for the presence of ontology terms. For each of the associated data sources (including GO annotations, InterPro and protein-protein interaction databases), genes are ranked based on the annotated ontology terms. The gene ranks are then integrated, using the functions sum, mean and maximum, as well as a transformed score that considers both the rank of a gene for each data source and the number of genes returned by that data source. [FIXME CLARIFY].

A large integrative analysis using most of these methods was performed to predict genes potentially linked to diabetes and/or obesity (Tiffin, Adie et al. 2006). [RESULTS OF STUDY – Integration of methods can meta-method, etc.] Several of the above-mentioned methods have emerged since that study, including CAESAR (Gaulton, Mohlke et al. 2007), Endeavor (Aerts, Lambrechts et al. 2006) and an update to G2D (Perez-Iratxeta, Bork et al. 2007).

These studies demonstrate a diverse variety of approaches to elucidating gene-disease associations. Some combine several techniques in a quantitative manner, but rely on qualitatively determined scoring mechanisms. All focus only on presenting the selected gene list (potentially with scores and a cut-off), without providing verifiable evidence for review.

Proposed Method

We propose a method of extracting gene-disease associations that will emphasise verifiable supporting evidence for the predicted associations, and a quantitative evaluation of the strength of the association. We shall investigate both associations between genes and disease and properties of the gene-disease association.

[pic]

Figure 1. An overview of the data modelled by the system, depicting genes, diseases and evidence (entities in rectangles), and their relationships (in diamonds), with examples of pertinent attributes (ellipses).

We shall consider three base entities – Genes, Diseases, Evidence – and the relationships between these entities. Our goal will be to predict Gene-Disease relationships based on the existence of relationships between other entity pairings. After initial study of mammalian gene-disease relationships, we will broaden the approach to incorporate entity relationships involving orthologous genes in model organisms or related diseases. These paths of supporting evidence will be quantitatively evaluated, making it possible to both extract strongly supported gene-disease linkages and to rank these linkages.

Although the thesis itself will investigate properties of transcription factor genes in diseases, the methods and analysis will be designed for general application. For the initial analysis of the main gene-disease associations, we shall investigate brain diseases specifically. Once we reach the stage of mining property associations and analysis of clusters of genes, we shall select a second broad disease area to examine, such as cancer, to allow a more diverse set of analyses and demonstrate the generality of the method.

Genes

We shall focus on the genes, and associated encoded products, represented within Entrez Gene. The Ensembl Gene set may be mapped to, combined with, or used as an alternative to Entrez Gene. We shall identify genes encoding transcription factors via GOA, supplemented by TF-Cat, our laboratory’s curated TF catalog.

Disease

We shall use the MeSH ontology as a primary source for disease terms, to facilitate interoperability with PubMed. Other vocabularies/ontologies, such as the UMLS Metathesaurus concepts, Disease Ontology, ICD and SNOMED CT, may also be used in conjunction with the MeSH ontology, to suit the needs of the other data sources used.

Evidence

In general, features encompass all descriptive properties of genes and diseases, qualitative and quantitative. Qualitative features include ontology or vocabulary annotations, such as GO annotation for genes and MeSH terms for PubMed articles, or may be free text, such as GeneRIFs. This also includes the presence or absence of protein domains, transcription factor binding domains or other functional genomic elements. Quantitative features include numerical attributes, such as the length of coding sequence, as well as derived numerical attributes, such as BLAST similarity score to the nearest murine homologue. We shall consider PubMed articles as a primary source of supporting evidence, and all other forms of experimental evidence (microarray data, gene linkage studies) will be mapped to concrete, verifiable evidence, such as the relevant PubMed article, in order to be considered.

Linkages

To find evidence associating transcription factors with diseases, we shall integrate and evaluate the strength of the links between genes, evidence and disease. This divides the linkages into five broad categories: Gene-Gene, Gene-Evidence, Evidence-Evidence, Evidence-Disease and Disease-Disease.

Gene-Gene relationships include homology and gene interactions. When considering a human transcription factor gene, information can be gleaned from paralogs, highly similar genes potentially arising from an ancestral gene duplication event, and orthologs in a closely related species. Gene interaction includes protein-protein interactions as well as regulatory mechanisms, from interfering RNA to the transcriptional regulation effects at DNA binding sites of the transcription factors. These related genes are likely to share elements in common with the human TF under consideration. [ORTHOLOG REF] From the presumed evolutionary relationships, orthologs perform equivalent roles and paralogs are likely to share function, although recent evidence supports significant divergence between mouse and human in genome-wide transcription factor-DNA binding behavior (Odom, Dowell et al. 2007). Interaction partners and downstream regulated genes are likely to be involved in some common process. The gene-gene relationships will be extracted from curated sources such as Orthologene, and also computationally derived via commonly used gene similarity metrics such as BLAST E-values. Interaction databases such as BIND, IntAct and STRING will be used to extract other protein-protein relationships. The PAZAR database can also provide TF-gene regulation relationships.

Gene-Evidence relationships include gene references in PubMed articles: GeneRIFs, RefSeq Related Articles and Gene Ontology literature references all point to articles relevant to the gene of interest. As well, the transcription factor database TF-Cat links transcription factor genes with relevant articles. Many of the links describe the reason for the linkage, whether via plain text (GeneRIFs and TF-Cat) or ontology terms (GO annotations). Other forms of evidence may involve quantitative values (for example, from the sequence we can compute the length of the gene and its nucleic acid content) as well as further qualitative values, such as the presence of protein domains or the existence of a homologue in another species.

Evidence can be interlinked. For example, PubMed Related Articles links articles in PubMed by the similarity of their abstract text. Citations provide further links, both to the articles that cite an article in question and to the articles it cites.

Disease-Evidence links are taken from references to a particular disease in an article. We shall use MeSH headings, annotated by NLM curators on PubMed articles, to link evidence to disease.
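As an illustration of how Gene-Evidence and Disease-Evidence links combine, the sketch below holds each kind of link as a mapping to sets of PubMed IDs and recovers the shared articles for a gene-disease pair by set intersection. The function names, the in-memory dictionaries and the ranking by raw count are hypothetical simplifications; the raw count is not the proposed score, which is developed under Quantitative Evaluation.

```python
def shared_evidence(gene_pmids, disease_pmids):
    """PubMed IDs linked to both the gene and the disease, in sorted order."""
    return sorted(gene_pmids & disease_pmids)


def rank_diseases_for_gene(gene, gene_to_pmids, disease_to_pmids):
    """List (disease, supporting PubMed IDs) pairs for one gene.

    gene_to_pmids: dict mapping gene -> set of PubMed IDs (Gene-Evidence).
    disease_to_pmids: dict mapping disease -> set of PubMed IDs
        (Disease-Evidence, e.g. from MeSH headings).
    """
    gene_pmids = gene_to_pmids.get(gene, set())
    hits = []
    for disease, pmids in disease_to_pmids.items():
        evidence = shared_evidence(gene_pmids, pmids)
        if evidence:
            hits.append((disease, evidence))
    # Most shared articles first; ties broken alphabetically by disease name.
    hits.sort(key=lambda item: (-len(item[1]), item[0]))
    return hits
```

Returning the article identifiers rather than only a count keeps the supporting evidence available for review.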

Disease-Disease linkages can be gleaned from ontological relationships, and from hierarchical arrangements in organized vocabularies. The MeSH hierarchy will be used to determine relationships between disease entities.

Quantitative Evaluation

Scoring Relationships

To evaluate the results obtained, we shall aim to generate relevant and intuitive numerical scoring methods. Our goal is for the scoring methods to be sufficiently general to allow evaluation and comparison between forms of evidence.

To evaluate the strength of a linkage between two entities (e.g. a TF gene and a disease) supported by evidence (e.g. from a subset of all PubMed articles), we consider a null hypothesis: that the linkage found occurred entirely by chance. We can therefore examine the probability of the evidence found occurring by chance. In the example, we consider the n PubMed articles that are referenced by the gene, and the k of those articles that are also annotated as linked to the disease. We then compare against the K articles in total that are annotated as linked to the disease, out of the N entity-linked articles in PubMed. If we consider each article referenced by the gene as a random draw, without replacement, from the pool of all articles available (the subset of all PubMed articles), we can use a hypergeometric distribution to model the number of articles we would see by chance annotated as linked to the disease and quantitatively evaluate our results. Therefore, if we observe that x articles referenced by the gene are associated with the disease (x = k in the example above),

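the probability of observing at least x such articles by chance is

$$
P(X \ge x) = \sum_{i=x}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}},
$$

where X is the number of disease-linked articles among n articles drawn at random from the N available.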

These results equate to performing a one-tailed Fisher’s exact test. Should this prove too computationally expensive or inaccurate to compute, we can approximate this using the binomial distribution, if n is much smaller than N-K and K.
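The tail probability and its binomial approximation can be sketched with the standard library alone. The function names are hypothetical; x, n, K and N follow the definitions above, and exact integer arithmetic is used for the hypergeometric sum, which is one possible design choice rather than a requirement of the method.

```python
from math import comb


def hypergeom_upper_tail(x, n, K, N):
    """P(X >= x) when n articles are drawn without replacement from N,
    of which K are linked to the disease (one-tailed Fisher's exact test)."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return favourable / comb(N, n)


def binomial_upper_tail(x, n, K, N):
    """Binomial approximation to the same tail, drawing with replacement
    with success probability K / N."""
    p = K / N
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x, n + 1))
```

For example, with N = 10, K = 4 and n = 3, observing x = 2 or more disease-linked articles has exact probability 40/120 = 1/3, whereas the binomial approximation gives 0.352; the two agree more closely as n becomes small relative to the pool.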

Multiple Testing Correction

Multiple testing correction will be employed in cases where we examine the potential association between a gene and each of the diseases, for example when the investigator specifies a particular gene and requests a list of all diseases associated with the gene. The danger in such a case is a potentially increased Type I (false positive) error rate. Here we can employ the Bonferroni (familywise error) correction: effectively, we divide the significance level we are looking for (e.g. α = 0.05) by the number of tests performed (e.g. the number of diseases tested) and count as significant the p-values that fall below this conservative threshold.

When considering many tests, the penalty imposed by Bonferroni correction may prove too extreme, resulting in a substantial increase in Type II (false negative) error. As an alternative, we could employ the Benjamini-Hochberg correction, which controls the false discovery rate. In this case, rather than controlling the probability of even a single erroneous rejection of the null hypothesis, we control the expected fraction of rejections that are erroneous.

This method has been shown to be applicable both when the tests are independent and when they are positively correlated, and has been used for correction of GO term overrepresentation.
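Both corrections are short enough to sketch directly. The function names are hypothetical, and the Benjamini-Hochberg function follows the standard step-up procedure: sort the p-values, find the largest rank r whose p-value is at most r·α/m, and reject the r smallest.

```python
def bonferroni(pvalues, alpha=0.05):
    """Flag p-values significant after Bonferroni (familywise) correction."""
    threshold = alpha / len(pvalues)
    return [p <= threshold for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Flag p-values rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            largest_rank = rank
    # Reject every hypothesis ranked at or below the largest passing rank.
    rejected = [False] * m
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected
```

For the p-values 0.01, 0.04, 0.03 and 0.005 at α = 0.05, Bonferroni keeps only 0.01 and 0.005 (threshold 0.0125), while Benjamini-Hochberg rejects all four, illustrating the difference in stringency.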

Joint Probability

In general, we can utilize the overrepresentation analysis to determine when two [FINISH ME] When considering two links, linking gene A to feature B, and feature B to disease C, with p-values p(AB) and p(BC), we often wish to estimate the p-value of the secondary relation between A and C. Assuming that the relation A->B->C is transitive and that the two links are independent, the combined link holds only if both constituent links hold, so it arises by chance if either link does. The probability of the combined link arising by chance will then be p(AB) + p(BC) - p(AB)p(BC), which equals 1 - (1 - p(AB))(1 - p(BC)).
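A small sketch of this combination rule follows. The two-link function is the formula as stated; the extension to longer chains rests on the same independence assumption and is an extrapolation, not something the text specifies. Function names are hypothetical.

```python
def combined_link_pvalue(p_ab, p_bc):
    """Combined value for A->B->C from two independent links,
    p(AB) + p(BC) - p(AB)p(BC)."""
    return p_ab + p_bc - p_ab * p_bc


def chain_pvalue(pvalues):
    """Assumed generalisation to a chain of independent links:
    one minus the product of (1 - p) over all links."""
    all_hold = 1.0
    for p in pvalues:
        all_hold *= 1.0 - p
    return 1.0 - all_hold
```

For links with values 0.01 and 0.02 the combined value is 0.0298, slightly below the simple sum, and it can never be smaller than the weaker of the two links.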

Another heuristic, useable when we wish to examine multiple links, is the shortest path heuristic (Zhou, Kao et al. 2002). Each link becomes the weight of an edge in a graph, and the length of the shortest path between two points is the value given for that relation.
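The shortest path heuristic can be sketched with Dijkstra's algorithm over a weighted graph. The graph layout (a dictionary of dictionaries) and the function name are hypothetical, and edge weights are assumed to be non-negative link values in which smaller means stronger; how link scores are best converted to weights is left open here.

```python
import heapq


def shortest_path_length(graph, source, target):
    """Length of the shortest path between two nodes (Dijkstra's algorithm).

    graph: dict mapping node -> dict of neighbour -> non-negative edge weight.
    Returns float('inf') when the target cannot be reached.
    """
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")
```

In a graph where a gene reaches a disease through feature f1 (weights 0.2 and 0.4) or feature f2 (weights 0.5 and 0.05), the relation is valued at 0.55, the length of the route through f2.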

Validation

The purpose of validating the results is to demonstrate the effectiveness of the method. There are two goals of the validation – first, to show that the method is effective at reconstructing the existing gene-disease knowledge, and secondly, to demonstrate that the method would be effective at discovering novel gene-disease relationships.

Validation of the data will be performed in three ways — using OMIM gene and disease entries, comparison with more recent data and manual verification. These validation strategies will test the sensitivity of our method, by providing positive examples. As it is impossible to rule out a future link between a gene and disease, there is no negative data.

To evaluate the basic accuracy of the relationship links suggested by the system, OMIM entries noting a link between a gene and a disease will comprise one set of positive data Y, to be compared against the results generated by our system X. By taking the ratio $|X \cap Y| / |Y|$, we can evaluate the sensitivity of our method, that is, the fraction of the positive examples that are correctly identified by the system. By evaluating the sensitivity using the most recent versions of the database, we can measure the ability of the system to reconstruct the known relationships in OMIM. Similarly, we can manually evaluate the results of the system using the associated evidence, such as determining whether the PubMed articles referenced support the gene-disease association hypothesis.
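As a minimal sketch, sensitivity can be computed by treating both the system output X and the positive set Y as sets of (gene, disease) pairs. The function name and the pair representation are assumptions made for illustration.

```python
def sensitivity(predicted_pairs, known_pairs):
    """Fraction of known (gene, disease) pairs recovered by the system.

    predicted_pairs: set of (gene, disease) tuples produced by the system (X).
    known_pairs: set of (gene, disease) tuples taken as positive data (Y).
    """
    if not known_pairs:
        raise ValueError("known_pairs must not be empty")
    return len(predicted_pairs & known_pairs) / len(known_pairs)
```

Because there is no negative data, predictions outside the known set do not lower this measure; it captures only how much of the existing knowledge is recovered.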

To look at the predictive ability of the system, we can also save the databases loaded by the system before a particular date and use these slightly obsolete databases for the analysis. We can then examine the more recent literature since the saved time-point for novel gene-disease linkage discoveries, providing a second set of positive data. Methods to generate this dataset of “novel” (as of the saved time-point) discoveries would be to look for new OMIM disease and gene entries, and manual examination of recently published articles. Manual examination of the articles can be assisted by using the system on the most recent version of the databases to generate gene-disease relationships and verifying the evidence manually.

OMIM is a fairly conservative source of gene and disease information, and therefore will not necessarily have all the most recent discoveries curated. One method to more accurately place the time [FINISH THE SENTENCE]

[Duplication of manual curation confusing. Replace by….experimental result validation data? Examples of results of manual eval] We can also manually verify the evidence supplied by the system for a particular gene-disease linkage. By examining the PubMed articles referenced, we can evaluate whether each is relevant to the gene, the disease, both or neither. This form of verification will evaluate the relevance of the data extracted by the system.

Goals

Main TF Gene-Disease Association Prediction Model

Associations between genes and diseases will be identified, implicating specific genes in specific diseases, each with a quantitative strength.

➢ Tool to derive associations between genes and diseases from the database

➢ Model to quantitatively evaluate associations extracted

o Validation of the associations derived and the model

TF Gene-Disease Association Property Predictions

Associations between genes and diseases will be expanded to elucidate additional properties, such as the functional role of the gene in the disease and the affected locations, and to investigate the relationships between genes involved in the disease, such as protein-protein interactions and transcriptional regulation.

o Tool to analyse gene-property-disease associations

o Model to quantitatively evaluate the properties derived

o Validation of the additional properties derived and the model

Gene Cluster-Disease Association Predictions

Meta-analysis, using the results from the previous association analyses, will focus on finding clusters of genes related to disease. Using traditional methods, such as k-means, as well as more recently developed methods such as (Jochen’s rank-based prior)[FIXME ask Jochen] and OPTICS, we shall investigate whether the genes can be clustered in ways that are meaningful with respect to diseases and disease properties.

o Cluster genes, looking for disease and disease-property clusters

o Validation by examining known disease-related genes and disease genes involved in pathways

Common Goals

Data on genes, diseases and evidence used to support the gene-disease associations will be extracted and stored, to support analysis and validation.

➢ Database of transcription factor genes, diseases and evidence data

➢ Tool to create and update the database from relevant data sources

Design Principles

Quantitative TF Gene-Disease Relationships

The tools will allow examination of known and predicted gene-disease relationships, and will quantitatively evaluate these relationships. The evidence supporting predictions will be accessible, giving users a direct means to confirm the predictions. The system will be designed to accommodate more general use in other disease areas or types of genes.

Open Access

Freely available data sources will be used. The tools developed and results of the analyses will be made publicly available and published in open access journals.

Modular, Efficient Programmatic Framework

A comprehensive toolkit for analysis will be developed. Scalable algorithms will be used to handle the extremely large, expanding datasets involved. Efficient methods to extract data from the large dataset will be developed.

Appendix I - Data Sources

The system will be designed to provide a complete storage solution for genes, diseases and evidence from disparate databases, as well as existing and computed annotations and relationships. A consistent interface will allow straightforward access to all the data. This data will be stored in a database, with programs written to both load and update from the data sources.

Three main entities will be considered — genes, diseases and evidence. Genes refer to loci on the chromosomes of humans, generally protein-coding, including the relevant regulatory elements. Diseases refer to abnormal human phenotypes. Evidence refers to all the data that will be used to link genes to disease.

Due to the extreme sizes of the data sources involved (16 million entries in PubMed alone), we shall consolidate the data in a local database. This will ensure maximal efficiency for accessing the data when performing the analyses. As well, this will put all the data in a common, controlled format, which will simplify the downstream analyses and make the development of the subsequent tasks independent of the data acquisition task.

Genes

The ultimate goal of the thesis is to link human transcription factor genes with human diseases. Other genes, such as genes in model organisms and other species, as well as genes regulated by transcription factors, will need to be considered. As well, in existing methods, candidate genes may be specified directly by the user or selected via broad chromosomal regions. To accommodate the range of genes that may be used in our analyses, I shall use Entrez Gene as the primary source reference for genes.

Entrez Gene

This NCBI database tracks genes annotated in genomes, from known genes to protein-coding regions and predicted genes. A unique gene identifier is assigned for each gene in each species. Data in Entrez Gene comes from both curated and automatically generated sources, including information from and links to sequences in NCBI Reference Sequence (RefSeq). Gene Ontology (GO) annotations are provided by the Gene Ontology Annotation (GOA) Database. Data from Entrez-accessible sources at the NCBI can be accessed via NCBI Entrez EUtils, as well as downloaded via FTP as compressed text files.

Gene Ontology

The Gene Ontology (GO) Consortium is a collaborative effort to provide a consistent nomenclature for gene annotations (GO terms) and for indicating the strength of the evidence supporting such annotations. In addition to the three original members, the model organism databases FlyBase, the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD), there are now over ten full members, including GOA, and several associate members. GO is composed of three main ontologies: biological processes, cellular components and molecular functions. Annotations are described by a three-letter controlled vocabulary of evidence codes (See Table [FIXE]). However, GO does not describe the "abnormal" state of features, such as mutant or disease-specific traits. The Gene Ontology Annotation Database is responsible for annotations to proteins in the human, chicken and cow genomes in UniProtKB, and is supplemented by annotations from other groups. Priority is given to proteins without annotation, those with disease relevance and those relevant to high-throughput analyses. We use the GO term “transcription factor” to identify genes that are transcription factors.
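The selection of transcription factor genes by GO term can be sketched as a single pass over gene-to-GO annotation rows. Several details here are assumptions rather than facts from this proposal: the tab-separated layout with tax_id, GeneID and GO_ID as the first three columns, the use of GO:0003700 as the identifier of the transcription factor activity term, and the function and variable names. The taxonomy identifier 9606 denotes Homo sapiens. The sample rows are invented for illustration.

```python
import csv
import io

TF_GO_ID = "GO:0003700"  # assumed identifier of the transcription factor activity term
HUMAN_TAX_ID = "9606"    # NCBI Taxonomy ID for Homo sapiens

# Invented sample rows in the assumed column layout.
SAMPLE = (
    "#tax_id\tGeneID\tGO_ID\tEvidence\n"
    "9606\t1\tGO:0003700\tIDA\n"
    "9606\t1\tGO:0005634\tIDA\n"
    "9606\t2\tGO:0005634\tIEA\n"
    "10090\t3\tGO:0003700\tISS\n"
)


def count_tf_genes(handle, tf_go_id=TF_GO_ID, tax_id=HUMAN_TAX_ID):
    """Count annotated genes, TF genes, human genes and human TF genes."""
    annotated, tf, human, human_tf = set(), set(), set(), set()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3 or row[0].startswith("#"):
            continue  # skip header, comments and malformed rows
        row_tax, gene_id, go_id = row[0], row[1], row[2]
        annotated.add(gene_id)
        if row_tax == tax_id:
            human.add(gene_id)
        if go_id == tf_go_id:
            tf.add(gene_id)
            if row_tax == tax_id:
                human_tf.add(gene_id)
    return {
        "annotated": len(annotated),
        "tf": len(tf),
        "human": len(human),
        "human_tf": len(human_tf),
    }


counts = count_tf_genes(io.StringIO(SAMPLE))
```

Note that this counts only genes carrying at least one GO annotation, so the first number is a lower bound on the total number of genes in the source database.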

[pic]

Figure 2. The number of genes in Entrez Gene is compared here to the number of genes annotated as having transcription factor (TF) activity (GO term), the number of genes that are from Homo sapiens, and the number of human genes marked as having transcription factor activity.
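
The selection of human transcription factor genes by GO annotation can be sketched as below. This assumes a tab-delimited gene-to-GO file with the columns tax ID, gene ID, GO ID, evidence code, qualifier, GO term, PubMed IDs and category, and assumes that GO:0003700 is the identifier of the transcription factor activity term. The rows and gene IDs in the sample are made up for illustration.

```python
import csv
import io

HUMAN_TAX_ID = "9606"        # NCBI taxonomy ID for Homo sapiens
TF_GO_ID = "GO:0003700"      # assumed GO ID for transcription factor activity

# Made-up rows in the assumed gene-to-GO column layout.
SAMPLE_GENE2GO = (
    "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory\n"
    "9606\t101\tGO:0003700\tIDA\t-\ttranscription factor activity\t-\tFunction\n"
    "9606\t102\tGO:0003700\tIEA\t-\ttranscription factor activity\t-\tFunction\n"
    "9606\t103\tGO:0005634\tIDA\t-\tnucleus\t-\tComponent\n"
    "10090\t201\tGO:0003700\tIDA\t-\ttranscription factor activity\t-\tFunction\n"
)


def tf_gene_ids(handle, tax_id=HUMAN_TAX_ID, go_id=TF_GO_ID):
    """Collect gene IDs of one species annotated with the given GO term."""
    genes = set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header
        if row[0] == tax_id and row[2] == go_id:
            genes.add(row[1])
    return genes


human_tfs = tf_gene_ids(io.StringIO(SAMPLE_GENE2GO))
print(sorted(human_tfs))
```

A fuller version would probably also filter on the evidence code column, for instance to set aside purely electronic annotations.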

Other sources for Transcription Factors

I shall also examine the integration of other data sources, both to increase the coverage of transcription factors and to provide more direct links to literature. Curated TF databases, such as the locally developed TF-Cat database, can provide a specialised, annotated resource for transcription factors.

Disease

No standard ontology or vocabulary for diseases is currently in widespread use. However, several closely related standards exist for categorizing data in various fields: Medical Subject Headings (MeSH), used to annotate PubMed articles; the International Classification of Diseases (ICD), the standard terminology used worldwide to track morbidity; and SNOMED CT, an emerging standard for health records. The Unified Medical Language System Metathesaurus and the Disease Ontology will provide methods of unifying these terminologies.

Medical Subject Headings (MeSH)

MeSH is a controlled vocabulary thesaurus of descriptors, arranged in a hierarchical structure. Sixteen main categories (e.g. Anatomy, Diseases) at the top are divided into subcategories, and the descriptors are then placed into the tree, running from more general terms near the top to the most specific terms at the bottom; a descriptor may occur more than once in the tree. We shall initially use category C, in particular tree number C10.228.140, "Brain Diseases", and the descriptors beneath it, as labels for disease. However, as MeSH is a general subject classification system, disease labels will often be general rather than specific. For example, “Spinocerebellar ataxias” (SCA) exists as a distinct MeSH term, but the specific SCA types do not.
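
Because tree numbers encode the hierarchy, the descriptors under "Brain Diseases" can be found by prefix matching on C10.228.140. The sketch below assumes each descriptor carries a list of tree numbers; apart from the root, the descriptor names and tree numbers are hypothetical placeholders.

```python
BRAIN_DISEASES_ROOT = "C10.228.140"

# Placeholder descriptors; only the "Brain Diseases" tree number is from MeSH.
EXAMPLE_DESCRIPTORS = {
    "Brain Diseases": ["C10.228.140"],
    "Hypothetical brain disorder": ["C10.228.140.001", "C16.000.001"],
    "Hypothetical other disorder": ["C10.228.1400.5"],
}


def is_under(tree_number, root=BRAIN_DISEASES_ROOT):
    """True if tree_number is the root itself or lies beneath it."""
    # Appending "." prevents "C10.228.1400" from matching "C10.228.140".
    return tree_number == root or tree_number.startswith(root + ".")


def descriptors_under(descriptor_trees, root=BRAIN_DISEASES_ROOT):
    """Names of descriptors with at least one tree position under root."""
    return sorted(
        name
        for name, trees in descriptor_trees.items()
        if any(is_under(t, root) for t in trees)
    )


print(descriptors_under(EXAMPLE_DESCRIPTORS))
```

A descriptor that occurs at several positions in the tree is included as soon as any one of its positions falls under the root.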

(Online) Mendelian Inheritance in Man (OMIM)

OMIM provides access to curated reports in human-readable text format on both genes implicated in diseases and diseases with a genetic component. Articles include inline PubMed references as supporting evidence. OMIM has been used as a source for genetic diseases in several previous methods; however, it would only provide a list of known (potentially) genetic diseases, leaving out diseases that do not yet have a known genetic component.

Other Terminologies Used for Disease-Related Tasks

The International Classification of Diseases (ICD), also known as the International Statistical Classification of Diseases and Related Health Problems, is a classification system published by the World Health Organisation. It provides codes to classify diseases, as well as signs of health problems such as symptoms, social circumstances and external causes of injury. It is currently in its 10th revision, ICD-10, and is used to track mortality statistics worldwide. ICD-10-CA is an enhanced version developed by the Canadian Institute for Health Information for morbidity classification, and was phased in from 2001-2006. ICD-9-CM, based on the 9th ICD release, is the current official standard used by U.S. hospitals. Incorporation of ICD variants would therefore allow interoperability with publicly available morbidity data.

Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT) was originally developed by the College of American Pathologists. As of April 2007, it is owned by the International Health Terminology Standards Development Organisation (IHTSDO). Canada is a founding member country of IHTSDO, and is represented by the Canada Health Infoway, an organization aimed at providing interoperable electronic health record solutions for Canadians. A U.S.-wide license for SNOMED-CT has been available from the National Library of Medicine as of June 2003.

Unifying Terminologies

The Unified Medical Language System (UMLS) Metathesaurus is a database of medical terminology, provided by the National Library of Medicine. It maps terms from vocabularies including MeSH, ICD and SNOMED CT to unique concept identifiers.

The Disease Ontology is a controlled vocabulary, currently under development at the Center for Genetic Medicine, Northwestern University, that aims to facilitate the mapping of diseases and conditions. It uses the UMLS to map terminologies such as SNOMED and ICD into the Open Biomedical Ontology format. The previous stable release version of the ontology was based primarily on ICD-9-CM, and there is currently discussion about .

Evidence

As we are focusing research on verifiable experimental evidence, our sources of data will include scientific articles summarizing the results of experiments in addition to other databases of experimental results and the results of simple analyses. PubMed will provide a basic source of scientific articles.

PubMed

PubMed is a searchable citation database at the NCBI, indexing biomedical literature. Bibliographical citation information is taken primarily from the National Library of Medicine (NLM) MEDLINE database, although some journals indexed for their biomedical articles have all their articles indexed, and there are also legacy articles from OLDMEDLINE, as well as other initiatives that experimented with indexing other scientific literature.

GeneRIF

Gene References Into Function (GeneRIFs) are annotations, both submitted by the public and curated by the National Library of Medicine, describing references to gene function. Gene function in this case is defined very broadly, referring not only to biological function, but also to information about the gene's role in disease, as well as its discovery and mapping. In addition to general GeneRIFs, there are two other major sources of GeneRIFs: information from the HIV-1, Human Protein Interaction Database, and information from the protein-protein interaction databases BIND, BioGRID, EcoCyc and HPRD. All GeneRIFs include a reference to at least one PubMed article as evidence. GeneRIFs therefore associate PubMed evidence with genes.

[pic]

Figure 3. The number of genes in Entrez Gene that have at least one GeneRIF annotation is compared here with the number of human genes annotated as having transcription factor activity in Entrez Gene that also have at least one GeneRIF annotation.
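
The gene-to-article links carried by GeneRIFs can be extracted with a short parser. This is a sketch under the assumption of a tab-delimited file with the columns tax ID, gene ID, comma-separated PubMed IDs, timestamp and annotation text. The sample rows, gene IDs and PubMed IDs are placeholders.

```python
from collections import defaultdict

# Placeholder rows in the assumed GeneRIF column layout.
SAMPLE_GENERIF = (
    "#Tax ID\tGene ID\tPubMed ID (PMID) list\tlast update timestamp\tGeneRIF text\n"
    "9606\t101\t11111,22222\t2007-01-01 00:00\tplaceholder annotation text\n"
    "9606\t101\t33333\t2007-01-02 00:00\tplaceholder annotation text\n"
    "10090\t201\t22222\t2007-01-03 00:00\tplaceholder annotation text\n"
)


def gene_to_pmids(lines, tax_id="9606"):
    """Map each gene ID of one species to the set of PubMed IDs citing it."""
    links = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] != tax_id:
            continue
        links[fields[1]].update(p for p in fields[2].split(",") if p)
    return dict(links)


links = gene_to_pmids(SAMPLE_GENERIF.splitlines())
print({gene: sorted(pmids) for gene, pmids in links.items()})
```

Counting the keys of the resulting mapping gives the number of genes with at least one GeneRIF, the quantity summarised in Figure 3.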

MeSH annotations

As new articles are added to PubMed/MEDLINE, these articles are also being indexed using MeSH terms by curators at the NLM. Each article is indexed by one or more MeSH terms, each of which may also have one of 83 topical qualifier subheadings (e.g. analysis, education or therapy) to potentially indicate a more specific topic.
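
One way these annotations could be combined with gene-to-article links is to count, for each gene and disease descriptor, the articles they share. The sketch below is only an illustration of that join, not the evaluation method itself. Each article is represented as a list of (descriptor, qualifier) pairs, and all PubMed IDs and gene IDs are placeholders.

```python
from collections import Counter

# Placeholder annotations: PubMed ID -> (descriptor, qualifier) pairs.
ARTICLE_MESH = {
    "11111": [("Brain Diseases", "genetics"), ("Transcription Factors", None)],
    "22222": [("Brain Diseases", "therapy")],
    "33333": [("Transcription Factors", "analysis")],
}

# Placeholder gene-to-article links, e.g. as obtained from GeneRIFs.
GENE_PMIDS = {"101": {"11111", "22222"}, "102": {"33333"}}


def cooccurrence_counts(gene_pmids, article_mesh, disease_descriptors):
    """Count articles shared by each (gene, disease descriptor) pair."""
    counts = Counter()
    for gene, pmids in gene_pmids.items():
        for pmid in pmids:
            # Use a set so an article counts once per descriptor,
            # even if the descriptor appears with several qualifiers.
            diseases = {
                descriptor
                for descriptor, _qualifier in article_mesh.get(pmid, [])
                if descriptor in disease_descriptors
            }
            for descriptor in diseases:
                counts[(gene, descriptor)] += 1
    return counts


counts = cooccurrence_counts(GENE_PMIDS, ARTICLE_MESH, {"Brain Diseases"})
print(dict(counts))
```

The qualifier is ignored here; it could instead be used to restrict the count to, say, articles indexed under the genetics subheading.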

Statistics

PubMed articles: 16,120,074

PubMed articles with MeSH headings: 15,806,221

PubMed articles with Brain Disease MeSH (or more specific) terms: 660,538

MeSH terms: 47,143

Unique MeSH terms: 24,355

MeSH terms under Brain Diseases: 312
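
The proportions implied by these counts can be checked with a few lines of arithmetic. The variable names are illustrative; the figures are those listed above.

```python
PUBMED_ARTICLES = 16_120_074
ARTICLES_WITH_MESH = 15_806_221
ARTICLES_WITH_BRAIN_DISEASE_MESH = 660_538


def fraction(part, whole):
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Share of PubMed articles that carry MeSH headings.
mesh_coverage = fraction(ARTICLES_WITH_MESH, PUBMED_ARTICLES)

# Share of MeSH-indexed articles carrying a Brain Diseases term.
brain_disease_share = fraction(ARTICLES_WITH_BRAIN_DISEASE_MESH, ARTICLES_WITH_MESH)

print(f"MeSH coverage: {mesh_coverage:.1%}")
print(f"Brain disease share: {brain_disease_share:.1%}")
```

Roughly 98% of PubMed articles carry MeSH headings, and roughly 4% of those are indexed with a Brain Diseases term or a more specific one.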

Other forms of Evidence

The PAZAR database identifies regulatory elements associated with genes, potentially revealing interconnected regulatory programs. The STRING database incorporates both experimental protein-protein interaction evidence and predicted interactions. The KEGG database provides pathways.

Experimental evidence has also been incorporated in programs such as GeneSeeker and POCUS. Other annotations used include eVoc annotations (Tiffin), InterPro domains, secondary properties derived from DNA or protein sequences (DGP, PROSPECTR). As well, some programs use text mining, extracting information from textual sources such as PubMed abstracts and OMIM articles.

Prototype Implementation

The data will be stored in a relational database. Each of the major concepts (genes, diseases and evidence) will be represented as an abstract entity. Each specific instance, such as an Entrez Gene record, a MeSH descriptor or a PubMed article, will be linked to its abstract entity, with its source-specific information stored separately. For efficiency, the abstract entities will contain only the data relevant to the analyses; specific information can be looked up afterwards.

[pic]
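
One possible layout for such a database is sketched below using SQLite. All table and column names are hypothetical and are not taken from the prototype. Abstract entities hold only keys, source-specific tables hold the details, and link tables connect the abstract entities.

```python
import sqlite3

SCHEMA = """
-- Abstract entities: only the keys needed for analysis.
CREATE TABLE gene     (gene_id     INTEGER PRIMARY KEY);
CREATE TABLE disease  (disease_id  INTEGER PRIMARY KEY);
CREATE TABLE evidence (evidence_id INTEGER PRIMARY KEY);

-- Source-specific instances, each tied to one abstract entity.
CREATE TABLE entrez_gene (
    gene_id   INTEGER PRIMARY KEY REFERENCES gene(gene_id),
    entrez_id INTEGER UNIQUE NOT NULL
);
CREATE TABLE mesh_disease (
    disease_id INTEGER PRIMARY KEY REFERENCES disease(disease_id),
    descriptor TEXT NOT NULL
);
CREATE TABLE pubmed_article (
    evidence_id INTEGER PRIMARY KEY REFERENCES evidence(evidence_id),
    pmid        INTEGER UNIQUE NOT NULL
);

-- Links between abstract entities.
CREATE TABLE gene_evidence (
    gene_id     INTEGER REFERENCES gene(gene_id),
    evidence_id INTEGER REFERENCES evidence(evidence_id),
    PRIMARY KEY (gene_id, evidence_id)
);
CREATE TABLE disease_evidence (
    disease_id  INTEGER REFERENCES disease(disease_id),
    evidence_id INTEGER REFERENCES evidence(evidence_id),
    PRIMARY KEY (disease_id, evidence_id)
);
"""


def shared_evidence_count(conn, gene_id, disease_id):
    """Count evidence items linked to both the gene and the disease."""
    row = conn.execute(
        "SELECT COUNT(*) FROM gene_evidence g "
        "JOIN disease_evidence d ON g.evidence_id = d.evidence_id "
        "WHERE g.gene_id = ? AND d.disease_id = ?",
        (gene_id, disease_id),
    ).fetchone()
    return row[0]


def build_demo():
    """Create an in-memory database with a few placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO gene VALUES (1)")
    conn.execute("INSERT INTO disease VALUES (1)")
    conn.executemany("INSERT INTO evidence VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany("INSERT INTO gene_evidence VALUES (?, ?)", [(1, 1), (1, 2)])
    conn.executemany("INSERT INTO disease_evidence VALUES (?, ?)", [(1, 2), (1, 3)])
    return conn


print(shared_evidence_count(build_demo(), 1, 1))
```

The analysis query touches only the abstract and link tables, which is the efficiency argument made above; the source-specific tables are consulted only when details such as a PubMed ID are needed.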

References

Adie, E., R. Adams, et al. (2005). "Speeding disease gene discovery by sequence based candidate prioritization." BMC Bioinformatics 6(1): 55.

Aerts, S., D. Lambrechts, et al. (2006). "Gene prioritization through genomic data fusion." Nat Biotech 24(5): 537-44.

Gaulton, K., K. Mohlke, et al. (2007). "A computational system to select candidate genes for complex human traits." Bioinformatics.

López-Bigas, N. and C. Ouzounis (2004). "Genome-wide identification of genes likely to be involved in human genetic disease." Nucleic Acids Research 32(10): 3108.

Odom, D., R. Dowell, et al. (2007). "Tissue-specific transcriptional regulation has diverged significantly between human and mouse." Nat Genet 39(6): 730-2.

Perez-Iratxeta, C., P. Bork, et al. (2007). "Update of the G2D tool for prioritization of gene candidates to inherited diseases." Nucleic Acids Research.

Perez-Iratxeta, C., P. Bork, et al. (2002). "Association of genes to genetically inherited diseases using data mining." Nat Genet 31(3): 316-9.

Perez-Iratxeta, C., M. Wjst, et al. (2005). "G2D: a tool for mining genes associated with disease." BMC Genetics 6(1): 45.

Tiffin, N., E. Adie, et al. (2006). "Computational disease gene identification: a concert of methods prioritizes type 2 diabetes and obesity candidate genes." Nucleic Acids Research 34(10): 3067.

Tiffin, N., J. Kelso, et al. (2005). "Integration of text- and data-mining using ontologies successfully selects disease gene candidates." Nucleic Acids Research 33(5): 1544-52.

Turner, F., D. Clutterbuck, et al. (2003). "POCUS: mining genomic sequence annotation to predict disease genes." Genome Biology 4(11): 75.

Van Driel, M. A., K. Cuelenaere, et al. (2005). "GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases." Nucleic Acids Research 33(web server): 758.

Zhou, X., M.-C. Kao, et al. (2002). "Transitive functional annotation by shortest-path analysis of gene expression data." Proceedings of the National Academy of Sciences of the United States of America 99(20): 12783-8.
