PHS 398 (Rev. 9/04), Continuation Page



A. SPECIFIC AIMS:

We propose to establish a Yale Resource Center for Epigenomics that will perform global analyses of epigenetic markers as they change through the course of differentiation, and of the allele specific effects of these marks on levels of gene expression. Our center, which builds on our expertise in the ENCODE project and Yale Stem Cell Center, will provide a detailed analysis of clonal variation as well as analyze changes in human embryonic stem cells and in adult hematopoietic stem cells as they differentiate towards mature erythroid cells, neutrophils, or B cells. It will globally analyze DNA methylation, chromatin modification and noncoding RNA expression patterns during the development of mature lineages from stem cells. Data integration will be used to develop a comprehensive view of the epigenomic events that occur during differentiation. The production center, approaches used, and data generated will be a valuable resource for the scientific community. Specifically we propose to:

1. Determine the full sequence and distribution of methylated bases of an individual. We will use massively parallel sequencing, comparative genomic hybridization (CGH) on high density genomic tiling arrays, and paired-end mapping to fully characterize the diploid genome sequence and the distribution and heterogeneity of DNA methylation in an individual, and to correlate these with copy number variations in this individual. This will provide a framework for the subsequent studies on allele and cell type specific modifications of DNA and chromatin.

2. Study the clonal variation in allele-specific expression in comparison to DNA methylation and histone modification. We will isolate a population of normal B cells and a series of independently and freshly derived lymphoblastoid cell lines from the same individual used for Aim 1, and use these to characterize relative allelic expression, methylation, and selected histone modifications for study of the effects of these modifications on relative levels of allele expression. We expect that this will require developing an appropriate quantitative framework and correlative toolset.

3. Study epigenetic changes during hematopoietic differentiation: We will quantitatively analyze the levels of transcription, DNA methylation, use of DNA replication initiation foci, histone methylation and association with chromatin remodeling factors across the genome during differentiation of embryonic stem cells and adult hematopoietic precursors into erythroid, myeloid, megakaryocytic and B cell lineages.

4. Determine the expression and epigenetic roles of Piwi-interacting RNAs (piRNAs) and other nuclear RNAs during the differentiation processes. We have recently discovered over 60,000 species of piRNAs as a novel class of non-coding small RNAs. Our preliminary data suggest that PIWI proteins and their associated piRNAs bind to piRNA-corresponding genomic sequences to exert epigenetic function. We will clone PIWI-associated piRNAs and genomic sequences from the cell types of Aim 3 by using a ChIP-seq approach. We will then characterize these piRNAs and the changes of epigenetic markers at these sites during differentiation to infer PIWI-piRNA function at these sites. These results will be correlated with the observed chromatin changes from Aim 3 and provide further insights into epigenomic effects in differentiation.

5. Extend studies of selected epigenomic markers to uncultured cells of various lineages. We will extend the studies of epigenetic markers to normal hematopoietic cells, including T cells, and to neural cells, as a basis for subsequent investigation of the role of these modifications in immunologic and neural disorders.

6. Develop a database and related analysis toolset and make information and reagents generally available through the web. We will focus on presenting the data in a fashion that makes it possible to readily correlate the various quantities under study (e.g. methylation patterns and allele-specific expression) in a genome wide fashion and to inter-relate them with annotation. We will work closely with the EDACC to make sure all data are correctly deposited into public databases.

B. Background and Significance:

1. Introduction

Over the last decade it has become increasingly clear that epigenomic mechanisms have a major role in the regulation of gene expression. Epigenomic regulation can occur by a wide variety of mechanisms, including DNA methylation, chromatin modifications, and small RNA regulation of genomic and mRNA activity. Gene regulation by each of these mechanisms has enormous impact on phenotypic outcome, and consequently, epigenomic mechanisms have been implicated in a variety of conditions such as cancer, degenerative diseases, and aging.

We propose to establish a Yale Reference Center for Epigenomics that will perform large-scale analysis of epigenomes in a variety of areas and apply this to the analysis of stem cell and hematopoietic differentiation, which is an ideal system in which to study epigenomic control of gene expression. The center builds on our expertise in global analysis of gene expression, high throughput production ability, hematopoiesis and stem cell biology.

2. DNA methylation and Allelic variation in gene expression: The principal post-replicative modification of eutherian DNA is methylation of cytosines, principally in the dinucleotide CpG. CpG dinucleotides are generally markedly under-represented in the genome sequence and tend to be clustered in regions of a few hundred bases to a few kilobases in length that have a relatively high G+C content and are known as CpG islands. About half of these are located at promoters of genes4, and methylation of these regions has been associated with inhibition of binding of sequence specific factors and gene silencing6, 7. Recently it has become more evident that a substantial number of CpG islands may be heterogeneously methylated within a single tissue or cell type 10.

Allelic variation in gene expression may be a consequence of polymorphisms in DNA sequence that affect gene regulation11 (Campbell, 2008). For some time it has been known that there may be mono-allelic gene expression in particular cell clones, and recently it has been shown that this phenomenon may occur in an apparently random way over a wide range of genes12, 13, perhaps related to DNA methylation13. The structural basis for this allelic silencing, and the question of whether stochastic effects may influence levels of allelic expression short of complete silencing, are presently unresolved.

3. Chromatin Modification: The proteins of the core nucleosome show an impressive diversity in detailed molecular structure, and this diversity is correlated with specific local aspects of DNA transcription, replication, and repair. Several “non-replicative” histones are encoded by separate genes and may be substituted for the more common “replicative” histones at specific chromosome positions, sites of transcription initiation, or other targeted regions of the DNA 20, 21. In addition the conventional replicative histones can undergo a variety of modifications including methylation at arginine residues, mono-, di- or tri-methylation at specific lysine residues, acetylation, phosphorylation, ubiquitinylation, and other modifications that are often less well characterized23, 24 (Bhaumik, 2007).

Methylation and demethylation of lysine residues of histone 3 are illustrative of the complexity and functional specialization of histone modifications26-28. Trimethylation of lysine 4 of this histone is a relatively robust marker for sites of transcription initiation, although it can spread across the body of some short active genes29 (Weissman, S.M., unpublished observations). Trimethylation of lysine 36 is a similarly robust marker of templates encoding the precursors for mature mRNAs, and trimethylation of lysine 27 may either mark inactive chromatin regions or be a part of a biphasic mark of “poised” promoters27, 30. Trimethylated lysine 9 characteristically marks blocks of inactive chromatin but, surprisingly, also occurs at a low level across some active genes31 (Mahajan, M., unpublished observations). Interestingly, dimethylation of the same lysine residues in histone 3 often occurs in a different pattern from trimethylation. For example, mono- and/or dimethylation of lysine 4 often occurs over distant enhancers as well as sites of initiation of transcription (Mahajan, M., Snyder, M., Weissman, S.M., unpublished observations), and may serve as an indicator of the presence or activity of such enhancers. Dimethylation of lysine 36 can be present over portions of the template of long transcripts but can also occur over blocks of DNA either overlapping or independent of templates for known transcripts (Lian, Z., et al., manuscript in preparation). The functional correlate of this modification remains obscure. Thus there appears to be a general rule that the trimethylated lysine modifications on histones are used differently in vivo than the dimethylated modifications, consistent with the chemical differences between tertiary and quaternary amines.

A variety of DNA-dependent ATPases are widely expressed. Several of these are parts of nucleosome “remodeling” complexes that may promote nucleosome migration along DNA, or loosening of the DNA wrapped around a nucleosome so as to expose the DNA to proteins such as restriction enzymes32-34. These DNA-dependent ATPases are often components of multiprotein complexes. While the core composition of certain complexes is known, in cells this core may itself be part of larger complexes that are difficult to study. However, the various ATPases of the SMARCA family show preferential sites for association with DNA in vivo, and these sites may differ for different members of the family. The differential role of these family members in vivo is not well understood, and is a subject for considerable further study.

[pic]

5. Hematopoietic Differentiation: Differentiation of cells along the immuno-hematopoietic lineages is an especially favorable process for detailed molecular characterization. The properties of hematopoietic stem cells have been at least partially defined, and a single cell may give rise in vivo to about 10 cell lineages and multiple sub-lineages, and maintain these cell populations through the lifetime of a recipient. The normal end-product cells can often be obtained in biochemically tractable amounts directly from individuals without culture or tissue disruption, and many sub-lineages of mature cells have been extensively characterized at the molecular level. Hematopoietic stem cells and multipotent precursor cells have been the subject of much study and some of their defining characteristics are established (for example, see 35). A number of lineage specific transcription factors have been identified and partial interactive networks have been presented for the role of these factors in various mature cell lineages36. Marked changes in nuclear morphology occur during development of mature lineages from hematopoietic stem cells, suggesting major alterations in chromatin structure. Importantly, in vitro systems are available for relatively large scale differentiation of certain precursors into cells of several mature lineages. Cell surface markers have been defined that permit FACS purification of cells at defined intermediate stages of differentiation.

6. Conclusion

Epigenomic regulation occurs at a variety of levels: DNA methylation, chromatin modification and small RNA control. Below we propose to establish an integrated epigenomics center that will focus on the analysis of stem cell and hematopoietic differentiation. The center will build on our extensive expertise in the global analysis of gene expression and epigenomics. As such we expect to generate a wealth of valuable information concerning epigenomic regulation and be an integral part of the epigenomics consortium.

C. Preliminary Results

We have extensive expertise in the large scale analysis of DNA methylation, chromatin modifications and small RNA control of gene expression, and in stem cell and hematopoietic differentiation. Moreover, we have helped develop technologies for the global analysis of gene expression and have considerable experience in production work for the high throughput generation of high quality data.

General

The Yale laboratories have been involved in a number of large-scale genomic mapping activities, many of them collaboratively. These include participation in the ENCODE program, a large CEGS program grant, and the founding and cooperation of the Yale Center for Genomics and Proteomics (Director-Snyder).

The Snyder laboratory has been involved in a number of large scale projects. These include the large scale transposon tagging of genes and localization of yeast proteins 37, 38, the effort to disrupt most genes in yeast 39, the generation of two expression collections for yeast and Arabidopsis 40-42, and development of the first protein chips and proteome 43, 44. The Snyder laboratory invented the ChIP-chip procedure with the P. Brown laboratory for yeast and has mapped several hundred factors with this procedure in yeast cells. Together with Drs. Weissman and Gerstein they extended this procedure to mammalian cells 45 and have mapped a large number of factors with this method. The Snyder, Weissman and Gerstein laboratories together built one of the first human chromosome tiling arrays 46 and used it to discover extensive transcription in the mammalian genome 46, map DNA replication timing 47 and perform the first chromosome mapping of a transcription factor on an entire human chromosome (see also 48). These groups also built the first high resolution tiling microarray of the human genome and used it to map transcription throughout the entire human genome 5. These groups together have invented numerous tools for designing, scoring and analyzing results of DNA microarrays. They also have been gaining experience with high throughput sequencing methods: they have sequenced a bacterial genome with 454 sequencing 49, generated and analyzed >20 million paired-end reads with 454 for structural genome variation50, and recently began investigating ChIP sequencing with Illumina technologies (see below).

During the ENCODE pilot phase the Snyder, Weissman, and Gerstein groups worked together to carry out a number of studies involving transcript mapping and ChIP-chip. Along with the Farnham and Struhl laboratories they established many of the ChIP-chip procedures and standards, and performed many of the pilot studies pertinent to establishment of a production center. Importantly for this work, Dr. Snyder was involved in establishing a company, Protometrix (now part of Invitrogen), that specialized in the high throughput production of protein chips containing thousands of proteins. There are many parallels between the establishment of our Yale production facilities and that of Protometrix, including establishment of robust Standard Operating Procedures (SOPs) for a high quality product in a cost effective manner, robotics, and the high throughput processing and handling of large numbers of samples.

In addition to the participation in the studies mentioned above, the Weissman lab has been involved for some time in aspects of technology development relevant to genomic studies, including the introduction of approaches such as cDNA normalization and chromosome jumping, improved methods for whole genome amplification, and identification of novel transcripts and mutations by cDNA selection techniques. They have demonstrated the occurrence of cell lineage specific patterns of relative expression of broadly expressed genes that are invariant to short term activation of cells51, 52. They have also begun biochemical characterization of cell type specific multi-protein complexes whose individual components are broadly expressed in many cell types. They purified a sequence specific DNA binding complex containing NURD, the BRG1 containing complex, and an RNA binding protein53, 54, and a second complex that contains the MCM complex, ORC2, several DNA repair proteins and a different RNA binding protein (unpublished results). These are of interest in view of the suggestion that RNA may direct chromatin modifying complexes to specific sites on the genome. Finally, in collaboration with the Snyder and Gerstein groups, they have begun to correlate these complexes with factor binding patterns detected with genomic tiling arrays and ChIP-seq experiments.

Professor Haifan Lin has recently joined Yale as Director of the Stem Cell Center and is already collaborating with Snyder and Weissman in a stem cell program grant to define genome-wide epigenetic changes during early stages in neural differentiation. His laboratory discovered the Argonaute/Piwi gene family in 1998 (ref. 22) and was the first to report the discovery of piRNAs (ref. 16). Recently, the Lin lab has demonstrated the epigenetic function of PIWI and its partner piRNAs by using the Drosophila genome as a model (ref. 9 and 60).

DNA Methylation:

We have been comparing and using several approaches for mapping methylated cytosine residues in DNA. In one approach we first digest genomic DNA with an enzyme that cuts at AATT sites; the DNA is then size fractionated on a gel and fragments larger than one kilobase are recovered. These are then cut with the methylation-sensitive enzyme HpaII and again size fractionated. Fragments below about 500 base pairs are recovered and labeled as the unmethylated markers. The larger fragments are recut with the methylation-insensitive enzyme MspI and again fragments of 500 bp or less are recovered. All small fragments are end-sequenced by Illumina and mapped onto the adjacent CCGG site in the genome (except where they fall more than ten bases away from any such annotated site). Each mapped end is scored as one “hit” at the adjacent CCGG site and the scores are compared for methylated and unmethylated reporter fragments. Sites occurring in both the methylated and unmethylated runs are presumably sites of heterogeneous methylation. One lane of an Illumina run is adequate for mapping the relative methylation of a large number of sites across the genome (Table I).

| |CS generated by HpaII (unmethylated sites) |CS generated by MspI (methylated sites) |CS generated by both HpaII and MspI (heterogeneously methylated sites) |
|Chromosome 22 |6135 |4908 |843 |
|Chromosome X |1990 |3264 |191 |

Table I: Example of methylation analysis by massively parallel sequencing. These are results obtained by a single lane of Illumina sequencing of fragments generated from male neutrophils by the procedure of the preceding paragraph.
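The hit scoring and site classification described before Table I can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used to produce Table I: the function names `score_sites` and `classify_sites`, the input layout (lists of read-end coordinates and CCGG site coordinates on one chromosome) and the rule that any single hit counts as evidence are all hypothetical. The ten-base cutoff is taken from the text.

```python
import bisect
from collections import Counter


def score_sites(read_ends, ccgg_sites, max_dist=10):
    """Score one hit per mapped read end at the nearest CCGG site.

    Read ends farther than max_dist bases from every annotated site
    are discarded, as described in the text.
    """
    sites = sorted(ccgg_sites)
    hits = Counter()
    for pos in read_ends:
        i = bisect.bisect_left(sites, pos)
        # The nearest site is either just left or just right of pos.
        candidates = sites[max(i - 1, 0): i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda s: abs(s - pos))
        if abs(nearest - pos) <= max_dist:
            hits[nearest] += 1
    return hits


def classify_sites(hpaii_hits, mspi_hits):
    """Label each site from the HpaII (unmethylated) and MspI (methylated) runs.

    Assumption: a single hit in a run is treated as evidence; a real
    analysis would likely require a minimum count or a statistical test.
    """
    labels = {}
    for site in set(hpaii_hits) | set(mspi_hits):
        in_hpaii = hpaii_hits.get(site, 0) > 0
        in_mspi = mspi_hits.get(site, 0) > 0
        if in_hpaii and in_mspi:
            labels[site] = "heterogeneous"
        elif in_hpaii:
            labels[site] = "unmethylated"
        else:
            labels[site] = "methylated"
    return labels
```

Counting the labels per chromosome would give the three columns of a table laid out like Table I.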

As a second approach we have been using chromatin immunoprecipitation of methylC containing DNA.

Identification of Genomic Regions Containing DNA methylation: The methylcytosine immunoprecipitation (MeDIP) method was adapted from a previous study55. In the MeDIP assay, methylated DNA is immunocaptured by an antibody specific for methylated cytosines in genomic DNA fragments. The MeDIP enriched DNA and input DNA (i.e., without immunoprecipitation) are differentially labeled using fluorescent dyes (Cy3/Cy5) and competitively hybridized to high resolution genomic DNA tiling arrays (Figure 2).

After optimizing the immunoprecipitation conditions, we carried out a number of control experiments to test further the specificity and efficiency of MeDIP. We produced methylated DNA that contained various patterns of methylated cytosines by treating unmethylated DNA fragments with DNA methylase enzymes. These enzymes included HhaI (GCGC), HpaII (CCGG) and SssI methylase (CG) (New England Biolabs). The unmethylated fragments were produced from genomic DNA of normal melanocytes using a whole-genome amplification procedure (Qiagen). The level of enrichment by MeDIP for each sequence increased in a linear manner with the number of methylated cytosines (Figure 3A). In contrast, the number of unmethylated cytosines did not affect the level of MeDIP enrichment (Figure 3B).

Next, we compared the relative enrichment of known methylated genomic sequences (eluted fraction) and unmethylated genomic sequences (unbound fraction) (Figure 4A). MeDIP enriched methylated DNA relative to CpG free controls by up to 10-fold in the elution fractions (Figure 4B). In contrast, and as predicted, DNA from the unbound fractions (depleted for methylated DNA) had significant enrichment of unmethylated over methylated sequences (Figure 4C). We also analyzed the H19 imprinting control region (ICR) sequence, which was previously shown to be consistently methylated on one allele in all somatic cells (Figure 3B,C). Importantly, the mono-allelic methylation of the H19 ICR showed less enrichment than the bi-allelic methylation of HOXA5 in the elution fractions. These preliminary data indicate that enrichment by MeDIP is specific for 5-methylcytosine and occurs in a sequence independent manner.

[pic][pic]

[pic]

[pic]

[pic]

To date, we have analyzed DNA this way by hybridizing immunoprecipitates to genomic tiling arrays, but we are currently beginning analysis of data obtained by high throughput sequencing. The two approaches to methylation analysis are complementary: the first approach shows homogeneity or heterogeneity at each of a subset of CpG sites (those in the sequence CCGG), while the second method shows the average level of methylation over each segment of DNA but does not distinguish between partial methylation at each site and full methylation at some sites with lack of methylation at other, proximate sites.

As a third approach we have been using conventional bisulfite treatment of DNA followed by PCR amplification with selected primers that would amplify regions independent of the degree of CpG methylation. This has worked well for conventional sequencing and we are currently both applying this on a whole genome level and working on array elution procedures to be able to sequence DNA from selected genomic regions before and after deamination of unmethylated C residues.

We have performed pilot studies comparing DNA methylation intensity in patients hemizygous for a 3 megabase region of chromosome 22 (VCF patients), patients with a duplication of the region, and normal controls (Fig. 5).

[pic]

| |HeLa |K562 |
|Polymerase II no-P |Y |Y |
|Polymerase II s2-P |Y |Y |
|Polymerase II s5-P |Y |Y |
|Histone 3 K4 Me3 |Y |Y |
|Histone 3 K4 Me2 |Y |Y |
|Histone 3 K9 Me3 |Y |Y |
|Histone 3 K9 Me2 |Y |Y |
|Histone 3 K9 Ac |Y |Y |
|Histone 3 K27 Me3 |Y |- |
|Histone 3 K27 Me2 |Y |- |
|Histone 3 K36 Me3 |Y |Y |
|Histone 3 K36 Me2 |Y |- |
|LSD1 |Y |- |
|Lamin A |Y |- |
|Lamin B/C |Y |- |
|BRG1 |Y |Y |
|INI1 |Y |Y |
|BAF155 |Y |Y |
|BAF250 |Y |Y |
|BRM1 |Y |Y |
|SMARCA6 |Y |Y |
|hSNF2 |Y |Y |
|NRSF |Y |Y |
|CoREST |Y |Y |
|SP1 |Y |- |
|STAT1 |Y |- |
|NFkappaB p65 |Y |- |
|CTBP1 |Y |- |
|BAF170 |Y |Y |
|Sin3A |Y |- |
|NFE2 |- |Y |
|GATA1 |- |Y |
|MBD3 |- |Y |
|H4 Ac |Y |Y |
|MCM5 |- |Y |
|Rad50 |- |Y |
|ORC2 |- |Y |
|CTCF |Y |Y |
|MBD1 |Y |- |
|MeCP2 |Y |- |

The results suggest that there is no overall uniform compensation for loss of one allele in this region by decreased methylation of the other allele. However, closer inspection shows interesting regions of variable methylation at the ends of the deletion that might correlate with copy number of the chromosomal region. More samples and a more detailed analysis are needed. The studies we propose comparing methylation patterns in normal B cells and those in cultured cell lines will help to decide whether these VCF studies will have to be carried out directly with normal cell populations or can be performed in cultured cells.

Chromatin Modifications: As part of the ENCODE project our labs have mapped about 40 types of histone modifications, sequence specific binding proteins and chromatin remodeling factors onto 1% of the genome, using three biological replicates on HeLa cells and, for many factors, K562 erythroleukemic cells or NB4 promyelocytic cells (Table 2).

Currently we have switched to the use of ChIP-seq with Illumina massively parallel sequencing to map factors and modifications along the entire genome. So far, using this technique we have mapped RNA polymerase II binding sites for 4 cell types and mapped STAT1 and INI1 sites for HeLa cells. We have antibodies available for all the factors we have previously mapped and are poised to initiate mapping on ES and hematopoietic stem cells in their resting state and at various stages of differentiation.

One of the interesting observations that has emerged is that there is gene and regional specificity in which chromatin remodeling factors are associated with chromatin. For an example, see Fig. 6.

[pic]

Lysine 9 methylation of histone 3 is generally considered to be a mark of inactive genes, although in the globin system we and others have found evidence of lysine 9 methylation at the promoters of active genes. Interestingly, dimethylation and trimethylation of lysine 9 of histone 3 occur in quite disparate patterns and may correlate with different aspects of gene silencing. For example, again in the globin system, we have observed that dimethylation of this residue may reflect stage specific gene silencing of globin genes while trimethylation may reflect cell type specific silencing. For an example see Figure 7.

[pic]

Often two or more modifications map to the same position, within the resolution of the methods used. These could represent alternative modifications of members of a dimeric pair of histones in a single nucleosome, alternative states of the nucleosome, or, in some cases, simply different modifications on adjacent nucleosomes. In selected cases we will investigate this by immunodepletion and analysis of alternative modifications in the supernate and by sequential immunoprecipitations.

Another important determinant of the different modification states of chromatin is the association of specific regions with various histone demethylases. As a pilot for these studies we have performed ChIP-chip with an antibody against the first known demethylase, LSD1. This demethylase turns out to be specifically associated with active promoters, and one might speculate that it is partly responsible for the high ratio of tri- to di-methylated lysine 4 of histone 3 that is characteristic of promoters. We plan to extend these studies with antibodies against known demethylases as they become available commercially or, if necessary, are raised by ourselves.

Hematopoietic Differentiation in vitro: In our laboratories we have been using two types of hematopoietic differentiation systems that permit in vitro analysis of various epigenetic changes during the course of differentiation. These systems are the in vitro differentiation of human ES cells (line H1 or H9) into erythroid cells, and the differentiation of adult CD34 cells into mature cells of the myeloid and erythroid lineages. The procedures and results of ES cell differentiation follow those that a member of our group, Dr. Cai-Hong Qiu, has already published56. These cells can be progressively differentiated into cells producing embryonic, then fetal globins, but not so far significant amounts of adult globins.

Human CD34 positive cells derived from peripheral blood are available through the Yale Hematopoietic Core Facility. These cells can be expanded in vitro 100 fold (or more) by growth in SCF and IL3 without loss of pluripotency. After expansion these cells can be driven to differentiate into mature enucleated erythrocytes, with little cell death along the way57. The intermediate stages in erythroid differentiation can be distinguished by the gradual accumulation of transferrin receptor on the surface of cells during the earlier stages of differentiation, followed by the appearance of surface glycophorin and then the gradual decrease in transferrin receptor, as well as a gradual decrease in the size of cells and accumulation of hemoglobin. Earlier differentiating cells produce a mixture of fetal and adult globins while later stages of maturation produce only adult beta globin (and alpha globin). Thus comparisons of ES cell and adult precursor cell development and regulation of globin production provide an interesting system for studying not only lineage differentiation but also the specifics of control of similar or the same genes at embryonic, fetal, and adult stages of development. Up to one hundred million cells at various stages of differentiation can be obtained from a single starting culture.

We have been optimizing the production of mature or nearly mature neutrophils from the same CD34 positive precursor cell preparations58. By successive addition and removal of cytokines and addition of all-trans retinoic acid at the last stages of differentiation we can obtain a population of cells with a high percentage of condensed and folded or segmented nuclei and large granules consistent with secondary mature neutrophil granules. These cells are produced in a medium that contains only GCSF. Unlike the murine multipotent precursor cell lines, there is little cell death during the maturation process and most cells develop along the myeloid lineage, so that there does not appear to be a selection for progeny of a minor subset of precursors.

Well defined systems have been described for differentiating these same precursors into B cells in vitro. We have not yet implemented these systems in our laboratory but do not anticipate major difficulties. In addition we have considerable experience in isolating and analyzing normal B cells at later stages of development, directly from human tissue, and obtaining them in large enough amounts for biochemical analysis.

Epigenetic regulation by non-coding RNAs: Most importantly, we have recently shown that the Drosophila Piwi/piRNA complexes directly bind to the piRNA-matching sequences in the genome to exert positive or negative epigenetic regulation in a site-specific manner 9, 60. Furthermore, we demonstrated that the negative epigenetic regulation at many of the Piwi/piRNA-target sites in the genome is achieved by Piwi directly recruiting heterochromatin protein 1 (HP1) to the sites 60. These studies revealed a novel dimension of epigenetic regulation that is completely unexplored.

The Piwi-piRNA mediated epigenetic regulation is likely involved in stem cell function because the Ago/Piwi protein family has an evolutionarily conserved function in stem cell maintenance in both animal and plant kingdoms 22. The stem cell function appears to be especially conserved within the Piwi subfamily 61, 62. The binding of Piwi proteins to piRNAs as partners and the requirement of Piwi proteins for piRNA biogenesis and function suggest that piRNAs are also involved in stem cell division. In support of this speculation, the Lin lab has recently demonstrated the epigenetic function of a specific piRNA-Piwi complex in regulating stem cell division in Drosophila 9. Given the highly conserved stem cell function of Piwi proteins during evolution, it is anticipated that these molecules together play a similar role in hESCs.

There are four Piwi proteins in humans, PIWIL1 (Hiwi), PIWIL2 (Hili), PIWIL3 and PIWIL4 (Hiwi2) 63. All are highly expressed in the testis 61, 63. Additionally, Hiwi is expressed in CD34+ hematopoietic stem/progenitor cells and putative gastric epithelial stem cells, but not in more differentiated cells of these lineages, suggesting that Hiwi might be required for the maintenance of these somatic stem cells 64, 66. Our preliminary data indicate that all four Piwi proteins and piRNAs are expressed in hESCs (see below). These findings provide a great opportunity for us to explore the role of the piRNA pathway in the epigenetic regulation of hESC self-renewal.

Determination of Allele Specific Binding: Using ChIP-Seq, transcription factor binding sites that are strongly enriched have many sequence reads mapped to the vicinity of the binding site. Given sufficient depth of sequencing, many binding sites are sequenced at a depth of greater than 50x. If a heterozygous single nucleotide polymorphism exists close to the binding site, then it will be evident in the assembled mapped sequence reads. For a binding site with the TF binding equally to both alleles we should see an equal number of sequence reads corresponding to each allele. However, if the TF binding is allele specific, then we will observe significantly more reads of one allele variant than the other. We have developed a framework for observing this and have observed a number of preliminary indications (see Fig. 8 below). Care must of course be taken to account for local copy number variation, as many immortalized cell lines suffer from aneuploidy.
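The comparison of read counts between the two alleles can be illustrated with an exact two-sided binomial test. This is only a minimal sketch of one possible test, not the framework referred to above: the function names and the `expected_ref_fraction` parameter (which stands in for a local copy number correction) are assumptions made for illustration.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def allelic_imbalance_pvalue(ref_reads: int, alt_reads: int,
                             expected_ref_fraction: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allele-specific read counts.

    expected_ref_fraction is 0.5 for a balanced diploid locus; it is a
    hypothetical hook for adjusting to local copy number (e.g. 2/3 if the
    reference allele is present in two of three copies).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    observed = binom_pmf(ref_reads, n, expected_ref_fraction)
    # Sum the probability of every outcome that is at most as likely as the observed one.
    total = sum(
        p for k in range(n + 1)
        if (p := binom_pmf(k, n, expected_ref_fraction)) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


balanced = allelic_imbalance_pvalue(25, 25)   # equal reads on both alleles
skewed = allelic_imbalance_pvalue(40, 10)     # strong excess of one allele
```

With 50 reads over a heterozygous SNP, a 40:10 split gives a small p-value under the balanced expectation, whereas the same split is unremarkable if the expected reference fraction is set to 0.8 to reflect a copy number imbalance.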

Bioinformatics

Intergenic Annotation: The Gerstein lab has been a major participant in a number of ENCODE efforts, focusing on intergenic annotation. We developed an approach to analyze the distribution of regulatory elements found in many different ChIP-chip experiments 67. By focusing on the overall chromosomal distribution of regulatory elements in the ENCODE regions, we showed that it is highly nonrandom. Our results indicate that these elements are clustered into regulatory rich ‘islands’ and poor ‘deserts’. We then performed a multivariate analysis on all the factors collectively, which grouped the transcription factors into sequence-specific and sequence-nonspecific clusters. Following on this, we developed an approach for integrating the results of many ChIP-chip experiments to discover new promoters in the human ENCODE regions 68. We have also carried out comprehensive pseudogene annotation of the human genome and cross referenced this annotation with tiling arrays 69-74. We performed a related analysis of structured RNAs in intergenic regions, also inter-relating them with tiling array data 3, 75.

Tiling Array Tools: The Gerstein Lab has developed a considerable number of tools and sophisticated machinery for processing tiling arrays. Most of these are described elsewhere in the proposal (e.g. scoring arrays, sect. C.4). One tool not described elsewhere is BoCaTFBS 76, which refines ChIP-chip hits by considering known motifs (Wang et al., 2006). Traditional computational algorithms used to identify binding sites, such as consensus sequences 77, profile methods (PWM/PSSM) 78, and HMMs 79, 80, generate high false-positive rates when applied genome-wide 81. Our method uses a boosted cascade of classifiers -- specifically, alternating decision trees (ADTboost) (Freund & Schapire, 1999), where ADTboost is a special extension of AdaBoost. We use the known motifs (e.g. from the TRANSFAC database 82) as positives and the results of the ChIP-chip experiments as negatives. Our method is the first motif finder that explicitly takes into account the data from ChIP-chip experiments. Moreover, BoCaTFBS differs from most other motif programs in that (1) it takes into consideration positional dependencies within a given motif and (2) it uses the negative data (regions where the transcription factor does not bind) in order to refine the binding site.

Regulatory Networks: The Gerstein lab has done quite a number of analyses on the large-scale structure of regulatory networks (e.g. 83, 84, 85, 86) and has developed tools to enable their analysis 87, 86 (networks.tyna).

Tools to integrate the list of functional elements – DART: We have developed DART 88 to facilitate the flexible storage, visualization, and comparison of the growing number of experimentally defined sets of transcription factor binding sites. DART has been designed to address a number of challenging issues that arise when attempting to analyze, compare, and store these types of data. The key aspects of DART include the following: dealing with heterogeneous datasets, flexibility for storing different tiling array hit attributes, accommodating new genome builds, and integrated linking to other web resources for broader visualization and analysis. Furthermore, DART provides machinery for helping to pick targets for validation.

Development of Pipelines for Intergenic Annotation 67: We have developed extensive pipelines and databases for intergenic annotation and for relating these annotations to novel non-coding RNAs. In particular, we have carried out comprehensive pseudogene annotation of the human genome and cross referenced this annotation with all the unannotated transcription uncovered by tiling arrays and other technologies 70, 72, 89, 71, 73. Much of this has been done in the framework of the ENCODE project. We have also characterized the structured RNAs in the ENCODE region and throughout the whole genome and inter-related this with transcriptional data 3, 67.

Preliminary Results related to Integrating Different ChIP-seq Datasets

During the pilot phase of the ENCODE project a large number of independent ChIP-chip and ChIP-sequencing experiments were conducted to characterize transcriptional regulatory elements in human cell lines. These datasets are often analyzed individually by imposing an arbitrary threshold to call regions with modified histones or bound by transcription factors. Such an approach makes the classification of the regions near the threshold difficult and unstable, depending on whether the emphasis is placed on specificity or sensitivity. In order to define transcriptional regulatory elements in a more robust way, we developed integrative approaches to evaluate these datasets as a whole in a quantitative fashion rather than analyzing each dataset individually.

We developed a Naive Bayesian method and a majority voting method to integrate ChIP-chip and ChIP-sequencing datasets generated in the pilot phase of the ENCODE project. The Bayesian method first thresholds ChIP enrichment signals and assigns a weight to each dataset based on how well it predicts a training set of known regulatory regions. This training step is followed by a prediction step that involves the computation of a score aggregating all the Bayesian weights from different datasets for each position in the genome. The majority voting method evaluates the level of experimental evidence for each genomic position, dictated by the number of cross-laboratory, cross-platform, or cross-factor experiments supporting that position above some statistical threshold.
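The two integration schemes can be illustrated as follows. This is a minimal sketch, not the implementation used in the ENCODE pilot analysis: the use of log-likelihood ratios as the per-dataset weights, the pseudocount, and all function names are assumptions chosen for illustration. Each dataset is represented only by its thresholded calls (True or False) at a region.

```python
from math import log


def train_weights(calls_on_positives, calls_on_negatives, pseudocount=1.0):
    """Return one (weight_if_called, weight_if_not_called) pair per dataset.

    calls_on_positives[d] lists whether dataset d calls each known regulatory
    region; calls_on_negatives[d] does the same for the negative examples.
    """
    weights = []
    for pos, neg in zip(calls_on_positives, calls_on_negatives):
        p_pos = (sum(pos) + pseudocount) / (len(pos) + 2 * pseudocount)
        p_neg = (sum(neg) + pseudocount) / (len(neg) + 2 * pseudocount)
        weights.append((log(p_pos / p_neg), log((1 - p_pos) / (1 - p_neg))))
    return weights


def bayes_score(calls, weights):
    """Aggregate the per-dataset weights for one genomic position."""
    return sum(w_hit if called else w_miss
               for called, (w_hit, w_miss) in zip(calls, weights))


def majority_vote(calls, min_votes):
    """Accept a position when at least min_votes datasets support it."""
    return sum(calls) >= min_votes


# Toy training data: dataset 0 is informative, dataset 1 is not.
weights = train_weights(
    calls_on_positives=[[True, True, True, False], [True, False, True, False]],
    calls_on_negatives=[[False, False, False, True], [True, False, True, False]],
)
```

In this toy case the uninformative dataset receives zero weight, so the aggregate score is driven entirely by the informative one, which is the behavior that motivates weighting over a single arbitrary threshold.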

These methods were initially used to predict transcriptional regulatory regions in ENCODE regions, and the predicted regions were experimentally tested using transient transfection and 5'-RACE assays 90. We trained the Bayesian method on 223 positive examples (with 4+ CAGE tags and at least one GIS-PET tag) and 225 negative examples of promoters. The majority voting method did not require training. The Myers group tested 163 predicted regions for promoter activity in four cell lines using transient transfection assays, and 25% of them showed transcriptional activity above background in at least one cell line. 5'-RACE experiments were performed on 64 novel regions, and 77% of these regions were associated with the 5'-ends of at least two RACE products. Our preliminary results indicate that there are at least 36% more functional transcriptional promoters in the human genome than currently annotated.

Preliminary Results Related to Correlating Tracks

To provide an initial exploratory investigation of the data we used a number of analytical and visualization tools to understand the complete dataset of regulatory elements. We focused on the overall chromosomal distribution of regulatory elements in the ENCODE regions and showed that it is highly nonrandom. Our results indicate that these elements are clustered into regulatory rich ‘islands’ and poor ‘deserts.’ We then performed a multivariate analysis on all the factors collectively, using a "biplot" (Fig. 9). This grouped the transcription factors into sequence-specific and sequence-nonspecific clusters.

This clustering suggested that classification of the elements was a sensible approach. Applying standard clustering methods to the data from the pilot phase of the ENCODE project, we concluded that long range elements fall into four different classes. Class I is identified by strong signals in H3/H4 acetylations, H3K4 methylations, cMyc, TAF250 and PolII binding. Class III is distinguished by HNF3b, HNF4a and USF1. Similarly, Class II and Class IV are identified by different sets of features. Most interestingly, a similar analysis can be performed for a range of distances from TSSs (or other genomic features) to see whether the regions at different distances from a genomic feature fall into similar classes. New datasets generated during the scale-up phase will potentially reveal novel biological classes.

D. Research Design and Methods

Our overall plan is to establish an integrated center to analyze epigenomic information in humans. We will first perform a comprehensive analysis of DNA methylation in a human genome (Aim 1), followed by analysis of clonal variation of DNA methylation, chromatin modifications and gene expression (Aim 2). We then hope to obtain a detailed understanding of DNA methylation, chromatin modifications and noncoding RNA expression in stem cells and cells induced to differentiate into several hematopoietic cell lineages (Aims 3-5). Important parts of this center are pipelines for the analysis, storage and dissemination of data (Aim 6) and cores for cell culture and DNA sequencing.

1. Determine the full sequence and distribution of methylated bases of an individual.

We plan to determine the first high resolution map of the methylated bases of a human. This will be accomplished by first determining the genomic diploid DNA sequence of an individual and then the methylated regions of the same individual.

1a. Determining the DNA sequence of an Individual

The diploid genome of the individual will be sequenced using a combination of experimental (e.g. Sanger, 454 and Illumina sequencing) and analytical (e.g. comparative and de novo sequencing assembly algorithms) technologies. The existing reference genome can be utilized to help guide the assembly. However, de novo sequencing assembly will still be needed in regions where the target genome has large insertions (>100 bp) compared to the reference genome.

i. Sample Preparation

Our clinical colleagues have selected a donor who is available for several blood donations, and whose parents are available for, at a minimum, cheek swabs for SNP haplotype mapping. For genomic sequencing we will use cells derived directly from peripheral blood, to avoid any possibility of sequence changes arising during cloning or expansion in vitro. We will isolate total peripheral blood B cells by negative selection on MACS columns using beads coated with anti-glycophorin, anti-CD15, anti-CD56, and anti-CD3. The negatively selected cells will then be positively selected for CD20 and CD19.

ii. Re-sequencing Protocol

To determine the genomic DNA sequences we will use standard protocols for aerosolization of DNA derived from the B cell preparation. The resulting fragments will be sequenced by use of the 454 machine and protocols, potentially with some additional Illumina runs. Currently 454 provides over 400,000 sequences per run at an average read length of approximately 450 bases; we expect these parameters to further improve in the course of the work. Sequencing will be performed to a depth of at least eight-fold coverage. For planning purposes we estimate that 70 sequencing runs of 450 b reads should yield XXX fold coverage, assuming 400K reads per run. Together with ~50 runs of paired end sequencing (a mixture of 3 kb, 12 kb and 20 kb), this should enable sequencing of most (>90%) of the diploid human genome excluding the highly repetitive regions (e.g. centromeric regions and rDNAs). This is based on our experience with paired end mapping (Korbel et al., 2007) and the sequence of the Watson genome (T. Harkins, pers. comm.). If necessary, additional paired end reads can be added to help combine contigs, and directed Sanger sequencing can be used to close gaps.
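The relationship between the number of runs, reads per run, read length and fold coverage used in this planning estimate can be written down directly. The sketch below is illustrative only: the genome size constant is an assumption that does not appear in the proposal, the function names are hypothetical, and the expected covered fraction uses the idealized Lander-Waterman expectation for randomly placed reads.

```python
from math import exp


def fold_coverage(runs, reads_per_run, read_length_bp, genome_size_bp):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return runs * reads_per_run * read_length_bp / genome_size_bp


def expected_fraction_covered(fold):
    """Idealized fraction of bases covered at least once: 1 - e^(-fold)."""
    return 1.0 - exp(-fold)


# Assumption: approximate haploid human genome size, not taken from the proposal.
ASSUMED_GENOME_BP = 3.1e9

planned_fold = fold_coverage(
    runs=70, reads_per_run=400_000, read_length_bp=450,
    genome_size_bp=ASSUMED_GENOME_BP,
)
planned_fraction = expected_fraction_covered(planned_fold)
```

Because the result scales linearly with each input, the same function can be rerun as read length and reads per run improve over the course of the work.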

We will use several additional approaches to detect copy number variation or copy number neutral sequence variations, and to align separately the paternally and maternally derived chromosomes of each pair. First, we will use hybridization to ten 2.1 M feature high density genomic tiling arrays, able to display heterozygous deletions or insertions of one kilobase or more in size, compared to a control DNA pool. Second, we will use commercial SNP chips to analyze 500k SNPs from the mother and the father of the donor. This should expedite aligning the separate parental chromosomes.

We expect a small number of errors in the resulting assembly, due not only to the limitations of the different techniques but also to the complex nature of the human genome sequence itself. Thus, it is important for us to define a reasonable performance measurement first, so that the project can be designed in such a way that its outcome will be "optimal" according to this metric. Here we propose a measure which is a linear combination of several sub-measures, including the SNP/MNP detection rate, the deletion/small insertion identification rate, and the large insertion recovery rate.
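As an illustration of such a measure, the sketch below computes a weighted sum of the three sub-measures named above. The weights and the rates are hypothetical placeholders chosen for the example; the proposal does not specify them, and the requirement that the weights sum to one is an assumption made so that the score stays between 0 and 1.

```python
def assembly_score(rates, weights):
    """Linear combination of sub-measure rates; both dicts must share keys."""
    if set(rates) != set(weights):
        raise ValueError("rates and weights must name the same sub-measures")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return sum(weights[name] * rates[name] for name in weights)


# Hypothetical weights and simulated rates, for illustration only.
weights = {
    "snp_mnp_detection": 0.5,
    "small_indel_identification": 0.3,
    "large_insertion_recovery": 0.2,
}
rates = {
    "snp_mnp_detection": 0.98,
    "small_indel_identification": 0.90,
    "large_insertion_recovery": 0.60,
}
score = assembly_score(rates, weights)
```

A single scalar of this kind makes it possible to compare simulated sequencing designs of equal cost against one another.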

iii. Assembly Methods

Based on the performance measure defined, we will first simulate the sequence assembly process in order to obtain a "best" set of parameters (e.g. the amount of long (Sanger), medium (454) and short (Illumina) reads, the amount of single and paired-end reads, and the amount of CGH data) so that we can achieve the desired performance at a relatively low cost in this project. The sequencing will then be carried out and the resulting sequence reads will be further processed in the following manner:

1) All the reads will be mapped back (allowing a few mismatches) to the reference genome in an efficient way and then be divided into three categories: those with unique matches, with multiple matches, and with no match. The last category usually corresponds to the new insertions to the reference genome. Most of the sequencing errors can be eliminated in this step by comparing similar reads to each other.

2) The SNP/MNPs will be identified immediately using those reads with unique matches, and the boundaries of the indels will also be detected by such reads. At this step, we can also have an estimate of the percentage of the genome that is represented by these uniquely matched reads.

3) More structural variants will be detected by further analysis on the paired-end data, and de novo sequence assembly will be performed in those large insertions by using mostly the overlapping no-match reads and some of the reads in other categories. At the same time, haplotype islands will be extracted based on paired-end information and prior knowledge of the haplotype pattern revealed by previous works (e.g. HapMap).

iv. Assembly Algorithm

1. Initial read partitioning. Attempt to map each read to the reference. Each read will:

a. map uniquely (possibly with some mismatches)

b. map to multiple locations

c. map very well at one end but not at the other (with paired-end reads, only one of the pair maps)

d. not map well

This step will use a simple sequence matcher, e.g. blat. We want something fast, simple, and easy to understand and control.

2. Find small substitutions/indels. For all reads mapping uniquely, record coverage and determine SNPs and short MNPs.

a. Fix homozygous SNP/MNP/indel: at least 2 reads agree with each other and disagree with the reference, and there is insufficient support for the reference.

b. Heterozygous SNP/MNP/indel: at least 2 reads agree with the reference and at least 2 disagree, with a minor allele frequency (MAF) > 25%

3. Phasing of SNPs into consistent haplotype blocks. A number of methods exist for this, including graph theoretical and Bayesian approaches 88, 89. All methods depend on reads that directly associate 2 or more variants on a single haplotype, and on transitively joining these chains into longer phased haplotypes. The methods vary chiefly in the way they resolve inconsistencies. An issue in SNP phasing is the read length as a function of the distance between variants. We are currently investigating Markov Chain Monte Carlo models that combine the sequence read data with HapMap population haplotype data. That is, we plan to use expected haplotype block frequencies in the population (from HapMap (2007)) to guide the haplotype phasing process.

The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851-861 (2007).

88. Lippert, R., Schwartz, R., Lancia, G. & Istrail, S. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in Bioinformatics 3, 23-31 (2002).

89. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biology 5, e254 (2007).

4. Detecting the variant locations with comparative genome assembly algorithms. If one end of a set of consistent reads can be mapped uniquely back to the reference genome sequence, then the location of a variation (either insertion or deletion) can be precisely determined based on the position where it maps back to the reference. Some other long range variants can also be identified by compiling the paired-end reads: when the two ends from one read are unambiguously mapped back to the reference, and the distance between these two mappings on the reference is significantly different from the gap size of that read, this indicates an insertion or deletion event taking place in between the two mapping locations. In this case, however, the exact locations may not be clearly identified solely based on the paired-end data, although an estimate of the size of the indel can be obtained. This is essentially the approach we used in Paired-End Mapping (PEM) 50.

5. De novo assembly of novel sequence insertions. De novo sequence assembly can be applied to those reads that do not map anywhere back to the reference genome. Long and short reads will first be combined using different assembling strategies (e.g. Arachne 91 for assembling long/medium reads and SHARCGS 92 for short reads), with the results further assembled. The resulting contigs will be further combined by using both the location information of the insertions identified in the previous step and the other types of reads when necessary. By doing so, the contents of those novel sequence insertions can be reconstructed. Figure 10 shows the simulation result of the recovery rate of such novel insertions when we combine long (Sanger), medium (454) and short (Illumina) sequencing technologies with a fixed total cost (11 long insertions in the Venter genome with size ~10 kbp).

6. Comparative reassembly of the variants from the reference sequence. For those variants that came from the reference sequence itself, when we identify them with partial mapping of the reads, it will be the case that the other part of that read can also be mapped back to the reference. If that part of the mapping is also unique and comes from a position of the reference that is unlikely to correspond to a deletion event, we can infer that this corresponds to an insertion in which one part of the reference sequence has been duplicated. Further support from other reads (such as paired-end reads) will be needed to precisely locate and reconstruct such duplication events. We expect this to be the hardest part of the assembly.
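The decision rules in steps 1, 2 and 4 of the algorithm above can be sketched as three small functions. This is an illustrative sketch under stated assumptions rather than the planned pipeline: the actual mapping would be done by an aligner such as blat, so hit counts and allele counts are taken as inputs; "insufficient support for reference" is interpreted here as fewer than two reference reads; and the three-standard-deviation cutoff for "significantly different" spans is an assumed value. All function and label names are hypothetical.

```python
def classify_read(n_full_hits: int, n_partial_hits: int) -> str:
    """Step 1: partition a read by how it maps to the reference."""
    if n_full_hits == 1:
        return "unique"
    if n_full_hits > 1:
        return "multiple"
    if n_partial_hits > 0:
        return "partial"      # one end maps well, the other does not
    return "unmapped"         # candidate novel insertion sequence


def call_site(ref_count: int, alt_count: int,
              min_reads: int = 2, min_maf: float = 0.25) -> str:
    """Step 2: call a site from uniquely mapped reads covering it."""
    if alt_count < min_reads:
        return "none"                                  # no variant supported
    if ref_count < min_reads:
        return "homozygous"                            # reference poorly supported
    maf = min(ref_count, alt_count) / (ref_count + alt_count)
    return "heterozygous" if maf > min_maf else "ambiguous"


def paired_end_event(mapped_span: int, expected_span: int,
                     span_sd: float, n_sd: float = 3.0):
    """Step 4: compare the mapped distance between ends with the library size.

    Returns (event, estimated_size_bp). Ends that map farther apart on the
    reference than the fragment length imply sequence missing from the
    sample (deletion); ends that map closer together imply an insertion.
    """
    diff = mapped_span - expected_span
    if abs(diff) <= n_sd * span_sd:
        return ("concordant", 0)
    return ("deletion", diff) if diff > 0 else ("insertion", -diff)
```

For example, with a 3 kb paired-end library whose span has a standard deviation of 200 bp, ends that map 5 kb apart on the reference suggest a deletion of roughly 2 kb in the sequenced genome, although the breakpoints themselves still have to come from reads that span them.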

1b. Determination of the methylated DNA sequence of the same individual

Once the sequence of the diploid genome is determined we will determine the methylated DNA sequence of the same individual. Very recently the methylation status of the Arabidopsis genome has been determined by use of Illumina sequencing of bisulfite treated DNA 93. We prefer to use 454 sequencing initially because of the greater complexity of human than Arabidopsis DNA. This complexity makes it easier to uniquely match the longer 454 reads than the Illumina reads. Of course the technology advances rapidly and Illumina may give longer reads in the future, reducing the overall cost of sequencing methylated DNA.

i. Sample Preparation

The DNA will be fragmented by aerosolization, and adapters that contain 5-methylcytosine instead of cytosine will be ligated. The DNA fragments will be treated with bisulfite using a standard protocol for distinguishing methylated cytosine, and then prepared for 454 sequencing. Although a large fraction of the inserts might be broken during the bisulfite treatment, the 454 protocol specifically permits recovery only of fragments with adapters on both ends, and quantifies the recovered DNA prior to sequencing. This avoids contamination with DNA fragments that cannot be sequenced. DNA sequences from bisulfite treated DNA will be aligned with a constructed DNA sequence based on conversion of all C outside of CpG into T, and CpG into YpG (and complementary sequences for the opposite strand).
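The constructed alignment target described above can be sketched in a few lines. This is a minimal illustration, not the planned implementation: the function name is hypothetical, Y and R are the standard IUPAC codes for C/T and A/G, and the opposite-strand target is written in top-strand coordinates (G outside CpG becomes A, and the G of a CpG becomes R), which is an assumed representation of "complementary sequences for the opposite strand".

```python
def bisulfite_reference(seq: str) -> tuple[str, str]:
    """Build in silico bisulfite-converted targets for both strands.

    Returns (top, bottom). In `top`, every C outside CpG is T and every CpG
    is YpG. In `bottom`, every G outside CpG is A and every CpG is CpR,
    i.e. the converted opposite strand expressed in top-strand coordinates.
    """
    seq = seq.upper()
    top, bottom = [], []
    for i, base in enumerate(seq):
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        prev = seq[i - 1] if i > 0 else ""
        if base == "C":
            top.append("Y" if nxt == "G" else "T")
        else:
            top.append(base)
        if base == "G":
            bottom.append("R" if prev == "C" else "A")
        else:
            bottom.append(base)
    return "".join(top), "".join(bottom)


top_target, bottom_target = bisulfite_reference("ACGTCCAG")
```

A C at the very end of a sequence fragment is treated as lying outside CpG because its neighbor is unknown; when converting a whole chromosome this edge case only affects the final base.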

ii. Refinements of Resequencing Approach to Handle Issues Associated with Methylation

One additional complexity that we see is to determine phasing of epigenetic annotations. Methods similar to those used for phasing SNPs into haplotypes can be used to phase the methylated bases. A number of methods exist for this, including graph theoretical and Bayesian approaches 88, 94, as discussed above. All methods depend on reads that directly associate 2 or more variants on a single haplotype, and on transitively joining these chains into longer phased haplotypes. A key parameter is the length of the reads as a function of the distance between variants. We are currently investigating Markov Chain Monte Carlo models that combine the sequence read data with HapMap population haplotype data 94.

Like any other sequencing method at the genomic level, our approach involves many sequencing parameters such as genomic coverage, sequencing depth, and sequencing error rate. It is also compounded by biological complexity such as the heterogeneity of sample cells and the heterozygosity of the methylated loci. Given the scale of sequencing needed, it is prohibitive to optimize the sequencing protocol through actual experiments. However, many interactions of those parameters can be studied through computer simulation. For example, the genomic coverage and the sequencing depth are directly related. The genomic coverage reflects the completeness of the profile, while the sequencing depth and the sequencing error rate determine the accuracy of the base call, which in turn determines the accuracy of the profile. As methylation only occurs in CpG dinucleotides, the global and local density of CpG will also affect the amount of sequencing needed.

Computer simulation could unravel the complicated interplay of those parameters and help to select a good experiment strategy.
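Two of these interactions can be explored analytically before any simulation is run. The sketch below is a simplified model under stated assumptions, not the planned simulation: read depth at a site is assumed to be Poisson distributed around the mean depth, read errors are assumed independent, and a methylation call at a site is assumed to be made by strict majority among the reads covering it. Function names are hypothetical.

```python
from math import comb, exp, factorial


def fraction_with_min_depth(mean_depth: float, min_reads: int) -> float:
    """Fraction of sites covered by at least min_reads reads (Poisson model)."""
    below = sum(exp(-mean_depth) * mean_depth**i / factorial(i)
                for i in range(min_reads))
    return 1.0 - below


def majority_call_accuracy(depth: int, error_rate: float) -> float:
    """Probability that a strict majority of `depth` reads is correct.

    Ties (possible at even depth) are counted as incorrect calls.
    """
    p = 1.0 - error_rate
    return sum(comb(depth, k) * p**k * error_rate ** (depth - k)
               for k in range(depth // 2 + 1, depth + 1))


covered_once_at_8x = fraction_with_min_depth(8.0, 1)
accuracy_at_depth_3 = majority_call_accuracy(3, 0.01)
```

Evaluating these two functions over a grid of mean depths and error rates gives a first impression of how completeness and per-site accuracy trade off against the total amount of sequencing; heterogeneity between cells and incomplete conversion would need to be added to any realistic simulation.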

Anticipated results and potential problems: While hypothetically the bisulfite conversion could be incomplete, giving misleading results, others have found 98-99% conversion with standard protocols. While a large fraction of DNA molecules may undergo cleavage by bisulfite treatment, even 1% recovery of molecules with adapters on both ends would be satisfactory for 454 sequencing.

We have an interesting preliminary observation that several of the regions we have detected as heterogeneously methylated in a B cell line lie in intergenic DNA relatively far from any annotated gene (Fig. 11). It is at least possible that some parentally imprinted regions might occur in such regions and not manifest themselves as complete parental imprinting of expression of a single gene, but could result in differential binding of various proteins to that segment of chromatin. These could be detected in the sequence data we would acquire, as long as the regions were linked to a polymorphic SNP that distinguished maternal from paternal chromosomes.

Some Cs not in the dinucleotide CpG may also be methylated, and this could give rise to sequences that match neither the original genome sequence nor the predicted sequence for bisulfite treated DNA, because the latter would be constructed on the assumption that CpG could either be retained or converted to UpG but that all other C would be converted to U. Therefore we will manually inspect at least a hundred sequences that show mismatches to both the normal and the bisulfite treated predicted genomes.

Quality Control: The quality of the genome sequence will be assessed by several means. First, we will clone 1500 1 kb fragments and determine their DNA sequence on both strands using Sanger sequencing. We expect our error rate to be less than 1 per thousand. Second, we will examine the annotated RefSeq genes for frameshifts, the most common error of 454 sequencing. Any frameshift identified will be examined by Sanger sequencing to determine whether it is bona fide or due to a sequencing error. Third, as noted above, we will perform high resolution CGH using 10 Nimblegen 2.1 million feature arrays to identify CNVs. These should correspond to those assembled in our final sequence and will be used to correct the sequence assembly, if necessary. In addition, we can use read frequency at different regions of the human genome as a measure to deduce CNVs. Finally, although we will strive for high accuracy, even if the assembly is not 100% perfect the epigenomic information will still be interpretable in terms of the regions of the human genome that are methylated, and thus the information will be extremely valuable.

Quality control of the sequence of the cytidine deaminated genome will be performed as follows: Total genomic DNA will be treated with bisulfite. Fifty regions containing CpG sequences in the conventional genome sequence will be picked and PCR amplified to produce fragments of over 500 base pairs in length. Fragments will be cloned into plasmids and ten clones of each fragment will be sequenced on both strands by Sanger sequencing. In addition, cloning and sequencing will be performed on at least fifty regions where the bisulfite treated DNA gives a sequence that doesn’t match the predicted sequence. Based on the results of these re-sequencing efforts we may need to extend the analysis to a larger set of discrepant sequences to obtain the highest feasible accuracy.


Heterogeneous methylation of CpG sequences will result in heterogeneity in the sequences of the deaminated DNA in these regions, and these will have to be correlated bioinformatically. The depth of sequencing that can be obtained practically will only uncover a small fraction of the heterogeneity at these regions when the sequence is derived from a mixture of normal cells, or from a lymphoblastoid cell line that is not a recently derived clonal isolate.

II. Studies of the clonal variation in allele-specific expression in comparison to DNA methylation and histone modification

Beyond parental imprinting, a number of regions of the genome show heterogeneous patterns and levels of methylation in cell populations. Allele specific silencing may also occur at multiple loci in individual cloned cell lines, independent of the parent of origin of the allele 13, 12. These effects could be of great importance for understanding non-environmental effects on the heterogeneity of intensity of effect of disease (or phenotype) modifying variants in DNA sequence. We propose to investigate certain questions that arise from these results: Is there clonal allelic variation in levels of expression of genes short of complete silencing of one allele? What is the distribution of methylated bases in heterogeneously methylated regions, and does allelic expression vary with total level of methylation of promoters and control regions or with selective methylation of specific sequences such as transcription factor binding sites in promoters? Is there clonal allele specific variation in chromatin structure and does this correlate with levels of allele specific transcription? Do any of these variations show consistent allele specific quantitative bias and do they correlate in any way with one another in clones?

To examine the effects of methylation heterogeneity on gene transcription, we will prepare a series of independently derived clonal lymphoblast cultures from the same donor used above. Each lymphoblastoid cell line is derived by adding EBV virus to a population of B cells and passaging the emerging immortalized cells. In general therefore, each line will represent a heterogeneous mixture of B cells. We will begin by using FACS sorting and visual inspection to prepare a true clonal cell line from each source. After 16 doublings, DNA from subclones will be isolated.

It is not economically feasible to do complete genomic sequencing on bisulfite treated DNA from each clone. Instead we will use microarray elution and selection 95, 96 to examine the methylation status of sequences from regions that are known from the studies of Aim 1 to be heterogeneously methylated. DNA will be randomly fragmented, methylated adapters added, and the fragments selected by hybridization to an array constructed based on the information derived in Aim 1. As a rough estimate we may assume that there are about 30,000 CpG islands in the genome and that one third of them may show heterogeneous methylation in B cells. If we select ten percent of these, selecting regions that contain SNPs that are heterozygous in the donor, this selection would represent about one million base pairs of DNA. Ten-fold coverage would require ten million bases of sequence. One run of 454 gives close to 200 million bases of sequence, so that, ideally, one could use nucleotide “bar codes” and pool up to 20 samples in a single 454 run. If we allow a safety factor of two fold, this would still reduce the cost to $500 per sample.
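The pooling estimate above can be reproduced step by step. In this sketch the figures for CpG islands, fractions, coverage and run yield come from the paragraph; the 1 kb average size per selected region is implied by the stated total of about one million base pairs, and the run cost of $5,000 is only the value implied by the $500 per sample figure, not a price stated in the proposal. Function and parameter names are hypothetical.

```python
def samples_per_run(n_islands=30_000, heterogeneous_one_in=3, selected_one_in=10,
                    bp_per_region=1_000, fold=10,
                    run_yield_bp=200_000_000, safety_factor=2):
    """Number of bar-coded samples that fit in one run under the stated estimate."""
    regions = n_islands // heterogeneous_one_in // selected_one_in   # 1,000 regions
    target_bp = regions * bp_per_region                              # ~1 Mb selected
    bp_per_sample = target_bp * fold                                 # 10 Mb at ten-fold
    return run_yield_bp // (bp_per_sample * safety_factor)


def cost_per_sample(run_cost_usd, n_samples):
    """Sequencing cost attributed to each pooled sample."""
    return run_cost_usd / n_samples


ideal_pool = samples_per_run(safety_factor=1)   # no safety margin
planned_pool = samples_per_run()                # two-fold safety factor
# Assumption: run cost implied by the $500 per sample figure, not stated in the text.
implied_cost = cost_per_sample(5_000, planned_pool)
```

Keeping the safety factor as an explicit parameter makes it easy to see how the number of samples per run, and therefore the cost per sample, changes if selection efficiency turns out to be lower than expected.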

The microarray selected DNA fragments would be subjected to bisulfite treatment to convert unmethylated C to U. PCR primers will be prepared and tested to be sure they can amplify each locus of interest, beginning with bisulfite treated DNA. PCR amplification will be used to prepare DNA fragments corresponding to several loci that we have identified previously as showing extensively heterogeneous patterns of methylation in the genomic sequences. The PCR amplicons will be cloned and multiple clones from each source sequenced to determine if there is heterogeneity between or within cell clones in the methylation patterns.

Anticipated results and potential problems: We have sequenced a number of DNA fragments subcloned from PCR amplicons of bisulfite treated DNA from a HapMap cell line. This approach detected a large amount of heterogeneity at a single locus derived from a HapMap cell line. We do not know if this heterogeneity is due to the multi-clonal nature of this cell line, or to development of heterogeneity in vitro, although we suspect the latter. The only estimate of the complexity of HapMap cell line clonal structure that we are aware of indicates that as many as 40% of the cell lines may be clonal, and presumably multi-clonality is an uncommon event. However, the data we show here on the complexity of methylation patterns at a single locus in a HapMap cell line, and alternative calculations from the published data, suggest substantially more heterogeneity in most HapMap cell lines and perhaps in vitro evolution of epigenetic marks in these lines. The rate of evolution of changes in methylation and the possibility of rapid early changes during adaptation to culture are unknowns, and it is possible that we will find even very early clones show more than two types of methylation pattern per locus.

Microarray elution might pose a problem if it produces too much DNA fragmentation, or if hybridization to CG rich sequences produces too much background or inefficient selection. In this case we would design PCR primers that would amplify bisulfite treated DNA regardless of the state of CpG methylation, and that would represent 100 chosen loci, and use these to prepare sequencing templates from the bisulfite treated DNA of the clonal cell lines. A large fraction of the bisulfite treated DNA from the array elution may break during the bisulfite treatment. However, the 454 procedure involves amplification steps that will only recover DNA fragments with an adapter at each end, and the amount of DNA for analysis is quantified after amplification. There is a risk that bias will be introduced as a consequence of differential fragmentation of DNA segments of different composition. As an alternative we would hybridize fragmented DNA to arrays before adding adapters. After elution we would treat the DNA with bisulfite, then use single stranded ligation to attach adapters for PCR amplification...

Quantitative comparison of methylation patterns at the chosen loci for each clone with the distribution of patterns seen by PCR amplification from DNA obtained directly from an isolated B cell population should give an indication as to whether there is a bias in cellular cloning or rapid evolution of in vitro patterns. If rapid evolution is occurring we would scale down the number of passages and the number of cells used for analysis, using our improved phi29 amplification protocols. A significant unknown is how few cells and/or how little DNA we need for bisulfite treatment.

Quality Control: We will match the sequence from the bisulfite treated DNA of each clone against the methylome sequence obtained in the previous aim. If there are any disagreements, we would PCR amplify the region directly from total bisulfite treated DNA, clone the PCR fragments, and perform Sanger sequencing of ten clones from each fragment. We might find that there is a high level of heterogeneity in the methylation of the genomic DNA at a particular site, in which case it is likely that the original sequence represented one variant of the methylation pattern of genomic DNA. If we do not find heterogeneity in the original genomic DNA, we will reamplify the region from the isolated clonal DNA, using cells set aside shortly after cloning.

Genomic scale clonal and allelic variation in levels of mRNA expression: To perform measurements of transcriptional activity, we will use total RNA extracted from B cells. Because of the high ratio of nucleus to cytoplasm this should sample nuclear RNA precursors as well as mature RNA. For quantitative analysis of expression of different loci, we will prepare libraries of 3’ end cDNA fragments, such that each fragment consists of an average of about 250 base pairs of sequence adjacent to the polyadenylation site. For sequencing, we will prepare an oligo-dT adapter with a restriction site (Mme) so located that cutting with this enzyme will remove all but about four bases from the oligo-dT sequence attached to the body of the cDNA. The composition of the library of fragments will be determined by Illumina sequencing. One lane of a sequence run provides 3 to 6 million reads. As a cell may contain of the order of 200,000 to 300,000 mRNA molecules, a fragment derived from an RNA species present at one copy per cell would give rise to 10 to 30 sequences, so that the 95% confidence limits for the abundance of such a fragment would be within about twofold of the measured abundance. For RNA species present at ten copies per cell, 95% confidence limits would be plus or minus less than 20% of the observed number of reads. This indicates that the sequencing approach is both more comprehensive and considerably more accurate than any array hybridization approach for quantitation of mRNAs. To determine allelic variation in mRNA expression, we will prepare total mRNA from a mixture of normal B cells and clonal B cell lines (using the latter in case there are mRNAs that are only expressed upon culturing in vitro).
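The read-count and confidence-limit figures in the preceding paragraph can be checked with a small sketch. It assumes that read counts follow Poisson statistics and uses the normal approximation for the 95% interval, which is crude at counts as low as 10; the function names are illustrative and not part of any existing pipeline.

```python
import math

def expected_reads(reads_per_lane, mrna_per_cell, copies_per_cell=1):
    """Expected reads for a transcript present at `copies_per_cell` copies,
    assuming reads sample the mRNA pool uniformly."""
    return reads_per_lane * copies_per_cell / mrna_per_cell

def approx_relative_ci(n_reads, z=1.96):
    """Relative half-width of an approximate 95% interval for a Poisson count
    (normal approximation: z * sqrt(n) / n)."""
    return z / math.sqrt(n_reads)

# Range of lane yields and mRNA content quoted in the text.
low = expected_reads(3_000_000, 300_000)    # one copy per cell, worst case
high = expected_reads(6_000_000, 200_000)   # one copy per cell, best case
print(f"1 copy/cell: {low:.0f} to {high:.0f} reads")

low10 = expected_reads(3_000_000, 300_000, copies_per_cell=10)
print(f"10 copies/cell, worst case: {low10:.0f} reads, "
      f"+/- {100 * approx_relative_ci(low10):.0f}%")
```

At 100 reads the approximate interval is just under plus or minus 20%, in line with the estimate above.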

We will perform quantitative measurements of cDNA expression by massively parallel sequencing on several different clonal lymphoblast cell lines that have shown differences in methylation patterns in genomic DNA, based on the test loci as described above. We will also test mono-allelic expression on a variety of clones, using heterozygous loci determined from the full genomic sequence and residing in mRNA that has been found by others to be monoallelically expressed in some B cell clones.

We propose further to compare the relative expression of the two alleles for each RNA species on a whole genome scale by performing extensive 454 sequencing of total oligo-dT primed cDNA. The advantage of doing this is that we will detect all sequence variations in template DNA regions and not be limited to common SNPs that are represented in commercial arrays. An important consideration for this type of approach is the amount of sequencing that is needed and the cost of 454 sequencing. Currently we can anticipate generating up to half a gigabase of sequencing from a single 454 run. If we presume 300,000 RNA molecules are present at significant levels in the B cell clone, and that the average length of an RNA is 3 kilobases, then it would require about one gigabase of sequence to represent each molecule once on the average. However, if the 300,000 molecules per cell represent only 15,000 RNA species, then one run of 454 would give an average of 10 reads per RNA species, enough to detect substantial variation in expression of alternate alleles for most genes, provided either solution normalization or array hybridization selection could substantially normalize the levels of transcripts for different genes.
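The coverage estimate above follows from dividing the run yield by the total length of the transcripts to be represented. The sketch below uses only the figures quoted in the text (half a gigabase per run, 3 kilobase average RNA length) and assumes perfectly even representation, which holds only if normalization is effective; `fold_coverage` is an illustrative helper name.

```python
def fold_coverage(run_yield_bases, n_transcripts, mean_length_bases):
    """Average depth if the run yield is spread evenly over all transcripts."""
    return run_yield_bases / (n_transcripts * mean_length_bases)

RUN_YIELD = 500_000_000   # up to half a gigabase per 454 run
MEAN_LENGTH = 3_000       # average RNA length of 3 kilobases

# Every molecule in the cell treated as distinct: under 1x per run.
per_molecule = fold_coverage(RUN_YIELD, 300_000, MEAN_LENGTH)

# Only 15,000 distinct species, evenly represented: about 11x per run.
per_species = fold_coverage(RUN_YIELD, 15_000, MEAN_LENGTH)

print(f"coverage per molecule: {per_molecule:.2f}x")
print(f"coverage per species:  {per_species:.1f}x")
```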

It is feasible to determine substantial variation in allelic expression for a large fraction of mRNAs by random sequencing of the total cDNA pool. However, a major gain in sensitivity and coverage, and a decrease in cost, would be achieved by normalization of the cDNA pool, so that all species of RNA would be present at more similar levels than in the original preparation. Because we are interested here only in the relative level of RNA derived from each allele rather than the absolute amount of each mRNA species in the cell, provided normalization does not differently affect one or the other allelic form of RNA, the normalized library would still give us the desired information. To reduce the relative excess of the most abundant RNA species we will perform a simple form of RNA normalization. To normalize RNA we will synthesize cDNA using oligo-dT primers covalently attached to magnetic beads. Total RNA will be hybridized to beads (2 cycles) and final non-bound material will be primed for normalized cDNA synthesis in solution.

We propose initially to do deep 454 sequencing for complete analyses of allele specific differences in levels of expression on at least two B cell clones and also on a total population of B cells directly isolated from the same donor. From these various analyses we expect to distinguish several types of RNAs as indicated in Table 3. As shown in the table, the present analyses would not distinguish between genes that showed allelic difference in expression because of parental imprinting versus those that showed differences in expression because of cis variation in regulatory elements in the two alleles. A select number of clones of each type will be taken for clonal analysis of DNA methylation and chromatin modifications as described in a subsequent section.

Table 3—Expected results for analysis of allele specific expression patterns in various contexts.

|Type of gene |Clonal allelic difference |Pooled cell allelic difference |Variable expression level in clones |

|Normal |0 |0 |0 |

|Imprinted |+ |+ |0 |

|Allele specific regulation |+ |+ |0 |

|Variable regulation |0 |0 |+ |

|Random epigenetic effects |+ |0 |+/- |
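The expected patterns in Table 3 can be written as a small lookup that returns the gene types compatible with a set of observations. This is an illustrative sketch only: the encoding of "+" as True, "0" as False, and "+/-" as None (either outcome allowed) and the function name are assumptions, and real data would need statistical thresholds to decide each call.

```python
from typing import List

# Columns: (clonal allelic difference, pooled-cell allelic difference,
#           variable expression level in clones), copied from Table 3.
# None encodes the "+/-" entry, meaning either outcome is compatible.
TABLE_3 = {
    "Normal": (False, False, False),
    "Imprinted": (True, True, False),
    "Allele specific regulation": (True, True, False),
    "Variable regulation": (False, False, True),
    "Random epigenetic effects": (True, False, None),
}

def candidate_types(clonal_diff: bool, pooled_diff: bool,
                    variable_level: bool) -> List[str]:
    """Return every gene type whose Table 3 pattern matches the observation."""
    observed = (clonal_diff, pooled_diff, variable_level)
    return [
        name
        for name, pattern in TABLE_3.items()
        if all(p is None or p == o for p, o in zip(pattern, observed))
    ]

# Imprinting and cis-regulatory variation give the same pattern, so both
# are returned; this analysis alone cannot separate them.
print(candidate_types(True, True, False))
print(candidate_types(True, False, True))
```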

From the results of the initial 454 sequencing of cDNA and the original genomic sequence we can calculate which cDNA segments would have sequence variations that would permit evaluation of the relative level of expression of the two alleles. We would then design a Nimblegen microarray that would select both forms of the sequence around as many of these variants as possible. This microarray would be used for hybridization selection of cDNA fragments from each of the lymphoblast clones. The enriched polymorphic clones would then be subject to high throughput sequencing to determine the ratio of the two alleles at each locus.

Potential problems and solutions: An unavoidable limitation of this type of study is that we can only distinguish the products of two alleles where we have found sequence difference in their templates. A caution of some potential interest is that RNA editing events will also become evident from the initial 454 sequencing, and, if detected, must be distinguished from template polymorphisms.

The solution normalization approach is simple in principle, but may give only limited normalization. An alternative approach is to prepare a custom Nimblegen array that has oligonucleotides complementary to regions containing polymorphic sites, as detected by the genomic or initial 454 sequencing. Once these arrays are prepared they can be used for hybridization selection of fragments of cDNA for subsequent sequencing. This would reduce the number of runs needed for coverage of the polymorphisms in each preparation of cDNA. By tagging samples from different cells with oligonucleotide markers, one could markedly reduce the cost of analysis per cell. For example, if three thousand polymorphisms are to be examined, then one run of 454 could give 20 reads at each polymorphic site for each of nine or ten samples.

It is possible that we will run into unanticipated problems with array hybridization and elution, including issues of background and selective enrichment. If issues arise, we will explore alternative possibilities for surveying allele specific variation in the remaining B cell clones with methods that are less costly although less exhaustive. Illumina SNP microarrays can be used to detect large variations in allele expression and can be performed on total cDNA preparations. As yet another alternative, we could contract with Agilent to prepare a set of complementary oligonucleotides that are biotin tagged, and then capture these oligos and their targets on beads. For this purpose we might want to use internally tagged oligonucleotides to decrease release of broken oligonucleotides from beads along with the target sequences.

Correlation of methylation patterns with allelic variation in gene expression: To investigate the correlation between allelic variation in methylation of genomic segments and variation in mRNA expression we will take a two-pronged approach: genome wide inspection for heterogeneity in CpG island methylation, and detailed bisulfite based sequencing of selected promoter regions. We will perform the overall correlation with the TrackSimplifier computational approach described in the bioinformatics section.

To survey the sites of heterogeneous methylation on a genome wide basis, we will use the procedure described in Preliminary Results in which one first digests DNA with a restriction enzyme with a recognition site containing only A and T, and separates out the larger fragments. These fragments are then digested successively with the methylation sensitive enzyme HpaII and the methylation insensitive enzyme MspI. After each digestion, smaller fragments and undigested DNA are separately recovered. This provides two types of small fragments representing short DNA sequences bounded by CCGG sequences that are respectively unmethylated and methylated. Each set of short fragments is cataloged by Illumina sequencing. Heterogeneity of methylation is indicated when a fragment appears in both sets of DNA. This procedure obviously does not detect all heterogeneity, but would on average measure heterogeneity at 5-10 sites per CpG island. The extent of heterogeneity detected by this method would be tested for correlation with the allelic variation in mRNA expression as determined in 1. Also, we would select a limited number of promoters that showed heterogeneity in DNA methylation without detectable effects on relative allelic expression, and these would be subject to bisulfite sequencing.
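The scoring rule described above, in which a fragment seen in both the HpaII-released and the MspI-released sets indicates heterogeneous methylation, amounts to a set intersection. The sketch below assumes each sequenced fragment has already been mapped to a hypothetical identifier of the form (chromosome, start, end); read mapping, count thresholds, and the identifier format are all assumptions, not part of the described protocol.

```python
def classify_ccgg_fragments(hpaii_released, mspi_released):
    """Classify CCGG-bounded fragments by the digest that released them.

    hpaii_released: fragments released by the methylation sensitive HpaII
                    digest (scored as unmethylated).
    mspi_released:  fragments released by the subsequent MspI digest
                    (scored as methylated).
    """
    unmethylated = set(hpaii_released)
    methylated = set(mspi_released)
    return {
        "heterogeneous": unmethylated & methylated,
        "unmethylated_only": unmethylated - methylated,
        "methylated_only": methylated - unmethylated,
    }

# Hypothetical fragment identifiers for illustration only.
hpaii = [("chr1", 1000, 1250), ("chr1", 5000, 5180), ("chr2", 300, 420)]
mspi = [("chr1", 5000, 5180), ("chr3", 7000, 7310)]

result = classify_ccgg_fragments(hpaii, mspi)
print(sorted(result["heterogeneous"]))
```

In practice a minimum read count in each set would be needed before calling a site heterogeneous, to guard against incomplete digestion.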

We would select three more groups of promoters for bisulfite sequencing, from among promoters that have SNPs that are heterozygous in the original donor. One group would be derived from genes that showed monoallelic expression. A second would be from genes where both alleles were expressed but at significantly different levels, and a third group would be from genes whose overall level of expression varied between clones, but where allelic variation was not prominent. In each case genomic DNA is first treated with bisulfite, then the promoter region is PCR amplified after finding primers suitable for the bisulfite modified DNA. Multiple clones from each PCR product would be sequenced to obtain redundant information about the methylation state of each allele and its intra-clone stability.

Heterogeneous methylation in intergenic regions: As mentioned in a previous section we have found several intergenic regions of heterogeneous methylation that are situated relatively far from any annotated gene. The significance of these sites is unknown but it is at least possible that they may reflect heterogeneity in the status of the chromatin embedding the site, and that this heterogeneity might be epigenetically transmitted at the time of cell division. The method we describe for differential detection of methylated and unmethylated CCGG sites by high throughput sequencing can often demonstrate such sites, as the same CCGG site could be scored as methylated and as unmethylated. A large number of such potential sites have shown up on our preliminary run using normal neutrophil DNA. A single lane of Illumina sequencing is enough to survey many such sites and we would propose to do such a run for each independent clone of B cells. Again we would confirm a subset of these sites by PCR amplification and cloning of bisulfite treated DNA.

Chromatin Modification associated with allelic variation in expression: One possibility is that allelic variation in gene expression may be mediated by changes in histone modifications around the promoter, even in the absence of changes in DNA methylation. To explore this we will pick a limited number of promoter regions that are polymorphic in the donor DNA and perform chromatin immunoprecipitation separately with antibodies against dimethyl K9 of histone 3, trimethyl K9, dimethyl K27, and trimethyl K27. The immunoprecipitates and supernates will be saved, and the DNA extracted and treated with bisulfite. PCR amplicons spanning the polymorphic sites will be prepared, and multiple molecules sequenced by tagging and multiplexing samples in a high throughput sequencing run.

Hemizygosity and epigenetic effects: In many cases even complete inactivation of one allele of an autosomal gene and reduction by 50% in the level of transcription might have only relatively small physiologic effects. However if the gene is in a hemizygous state, due either to allelic deletion or epigenetic inactivation, the effect might be disastrous. Presumptively, the X chromosome that is hemizygous in males is adapted so that epigenetic depression or silencing does not occur, at least over any critical genes. Even here there is interesting data from Turner syndrome, a disorder in which the subject has only a single X and no Y chromosome. There is a suggestion here that a different pattern of neuro-psychiatric abnormalities may occur depending on whether the X is derived from the father or mother97. X chromosome parental imprinting is known in mice98, and these results raise the question as to whether any such imprinting occurs in man, because the homology of the murine imprinted region to human DNA is very weak.

Copy number variation of relatively large blocks of genomic DNA occurs probably at hundreds of sites, between individuals or individual chromosomes. The phenotypic effects of these surprisingly large differences between individuals are largely unknown, and the possibility remains that feedback mechanisms including epigenetic changes moderate the effects of these variations in normal DNA. The index subject in these studies will undoubtedly exhibit a number of such variations and it will be interesting to see if there is a paucity of methylation in hemizygous regions or an excess in reduplicated regions. These studies will be extended to large deletions we and others have noted in normal subjects. There are also several disease syndromes associated with hemizygous deletions of large DNA blocks. In particular, our laboratories have been interested in the VCF (velo-cardio-facial) syndrome99, caused most commonly by a 3 megabase deletion of proximal chromosome 22. This deletion is associated with a spectrum of disorders including psychiatric disorders resembling schizophrenia that occur in a substantial fraction of the cases100. The severity of the VCF syndrome varies markedly from subject to subject for reasons that have not been determined. One possibility is that epigenetic effects may result in variable expression of the hemizygous genes. We have a fairly large collection of lymphoblastoid cell lines from subjects with this disorder, and would propose, as part of the present study, to use array elution to investigate the methylation status of the genes within the hemizygous region and compare it with the average methylation of genes from a normal individual, to see if there is any compensation for lack of a second allele or if methylation effects might contribute to the variability of the syndrome.

III. Epigenetic Changes During Hematopoietic Differentiation:

Systems for studying hematopoietic differentiation in vitro

1. Adult erythroid differentiation:

We have taken human CD34 positive hematopoietic precursors derived from normal blood and expanded them at least a thousand fold by growth in serum free medium with a cocktail of cytokines. This gives up to ten million cells per preparation, and it is possible that even further expansion could be done without loss of pluripotency. The expanded cells still can be differentiated into mature erythroid cells by addition of erythropoietin and later withdrawal of other cytokines. To decrease heterogeneity in the timing of steps of differentiation, we will freeze multiple samples of CD34 cells derived from a single individual. We will also refreeze aliquots from early stages of cell expansion so that we can go back to a single source for a variety of experiments. We have performed initial array analysis of RNA expression at two stages of differentiation of these cells towards erythrocytes. By three or four days after removal of cytokines other than erythropoietin the cells have already decreased transcription and presumably are underway towards complete nuclear condensation and expulsion. We propose to use preparative FACS sorting to isolate cells at three stages of differentiation toward mature erythrocytes and to perform methylated C immunoprecipitation analysis of methylation intensity on a genome wide basis on these cells, together with high throughput sequencing of cDNA to obtain more accurate measurements of mRNA expression than those we have previously obtained from Affymetrix U133 arrays.

Myeloid Differentiation from adult hematopoietic precursor cells: We have further developed a protocol for sequential addition and subtraction of cytokines that results in conversion of most of the cells in culture to CD15 reactive cells with the morphology of maturing neutrophils. Interestingly, unlike maturation of promyelocytic cell lines such as NB4 and HL60, the cells derived from normal precursors show large granules, consistent with the secondary granules of mature neutrophils (Fig. 12). We have not yet confirmed these chemically, but on morphologic grounds we are achieving extensive differentiation of neutrophils, without large amounts of apoptosis of precursor cells. For epigenome analysis we will initially sort cells at various stages of differentiation using CD15, DNA content, and forward and side scatter parameters. Cells will be evaluated morphologically and, if the preparations look heterogeneous in morphology, further myeloid markers will be used to optimize preparation of homogeneous neutrophils or precursors.

[pic]

Anticipated Problems. It is somewhat harder to isolate intact proteins from neutrophils, both because of the high proteinase activity and also, perhaps, because the cells have decreased synthetic activity as they mature and retain a relatively large fraction of “old” protein. Also, the neutrophils in the differentiated culture are not all fully matured. This may be partially because the cells have a limited life span once maturation is completed. Cells can be partially fractionated into more and less mature forms by use of forward and side scatter measurements in FACS based on cell size and granularity. Normal mature neutrophils can be isolated rapidly from peripheral blood and the results of RNA analysis, methylation, and chromatin structure of cells matured in vitro will be compared with mature neutrophils directly isolated from donors. There are suggestions in the literature that mature neutrophils may substitute other proteins for conventional histones in their chromatin. Although this is uncertain, if true then a large part of the DNA may not be isolatable as nucleosomes by controlled micrococcal nuclease digestion. By immunoprecipitation with antibodies against histone 3 that bind independently of known modifications, we can determine if the substitution of histones has occurred generally or left some areas unsubstituted, such as the limited number of genes up-regulated on neutrophil activation.

Lymphocyte Differentiation: We are in the process of implementing published procedures for B cell differentiation of precursor cells and as these procedures are perfected we will select cultured cells that are CD19 and CD20 positive for morphologic then molecular analysis.

Megakaryocytic differentiation: A colleague of ours, Diane Krause, has been able to obtain up to 90% megakaryocytic cells in vitro from the same type of peripheral blood CD34 positive cells that we have used for myeloid and erythroid differentiation. The present protocols use fresh CD34 cells and this gives a relatively low absolute number of differentiated cells, but we anticipate that the expansion protocols we use would allow us to generate ten million or more megakaryocytic cells per culture.

2. Embryonic and Fetal erythropoiesis: Dr. Cai Hong Qiu of the Yale Stem Cell Center is currently growing human H1ES cells and using the protocols she has previously published to differentiate these cells into embryonic erythroid cells that produce epsilon and zeta globins, then to further mature these into cells producing alpha and epsilon globin56. On further cultivation, erythrocytes more closely resembling fetal lineages are produced and these cells express gamma and alpha globin chains (Fig. 13). From 5 to 50 million differentiated cells can be produced in a culture, and this would provide enough material for many of our analyses. Thus the progression of these cells complements the lineages seen from adult CD34 positive cells, and overall produces models for all stages of erythroid development. These primary cell culture based embryonic, fetal and adult erythropoiesis systems are excellent model systems to study the epigenetics of both a differentiating cell system and the developmental switches between embryonic and adult stages.

3. Analysis of epigenetic changes during hematopoietic differentiation.

Chromatin Modifications: We propose to use the differentiation schemes and methods mentioned in the preceding paragraphs for a series of epigenomic studies to reveal changes in chromatin and associated factors that accompany lineage choice and restriction. An advantage of the study of the hematopoietic system is that a single stem cell can differentiate along several differentiation pathways, and that these differentiation events can be recapitulated in vitro in cell suspensions. During erythropoiesis, myelopoiesis and development of other hematopoietic lineages, chromatin undergoes profound changes in morphology and activity (Fig…), offering an excellent system to study epigenetic events in differentiating cells.

1. We propose initially to measure the distribution of DNA methylation and RNA composition in cells at various stages of differentiation, then, at selected time points, to study binding of initiating (CTD dephosphorylated) and elongating (CTD S5 and S2 phosphorylated) RNA polymerase II. These studies will generate a baseline for the correlation of DNA methylation, poised and active transcription, and gene silencing during differentiation. These data will be correlated with a series of epigenetic changes as follows.

2. High quality antibodies against a variety of histone modifications and proteins associated with such modifications are commercially available. We will initially use these antibodies for our studies and then as necessary we will prepare purified polyclonal antibodies against particularly interesting chromatin modifications. Prior to using antibodies for ChIP-chip/ChIP-seq analysis, we will perform western blotting of all the antibodies used in our studies to check their specificity for a particular modification, using appropriate controls. Then we will perform IP of nuclear extracts and crosslinked chromatin with these antibodies. The protein-DNA adducts of immunoprecipitated chromatin will be reverse cross linked at high temperatures and subjected to western analysis along with the IP sample of the nuclear extracts and the supernates from these IP experiments. In general we will only accept antibodies if a large fraction of the proteins in Western blots correspond to the desired target. The identity of bands from Western blots will be further confirmed by mass spectrometry (as in the current ENCODE procedures). We will also titrate the exact amount of antibody required for the immunodepletion of the target protein with a given amount of chromatin.

3) Extensive studies by others and ourselves confirm that certain histone modifications such as histone 3 lysine 4 trimethylation and lysine 36 trimethylation are closely correlated with transcription initiation or elongation28, 101 and give relatively little new information about epigenetic states of chromatin. Other modifications such as lysine 4 dimethylation or monomethylation102 may also mark enhancers outside of the body of the genes. Lysine 36 dimethylation occurs over portions of some genes but also occurs in blocks outside of known genes. Lysine 9 and 27 methylation are often considered as marks of inactive chromatin. However, these modifications occur in disparate patterns across chromatin, and lysine 9 trimethylation may also occur in the body of active genes such as the globin genes in erythroid cells. Overall the distribution of many of these modifications does not simply correlate with known functional elements or chromatin features. We suggest that the study of these markers would reveal new features of chromatin and nuclear organization.

Based on the above considerations we propose to study the patterns of histone 3 lysine 4, 9, 27, 36 and 79 mono and dimethylation, and lysine 9, 27, and 79 trimethylation at several time points during differentiation of erythroid, myeloid, and B lymphocytic cells in culture. In addition we propose to continue pilot studies of other modifications such as specific acetylations, variant histones, and arginine modifications, using K562 cells and ENCODE chips. We will look for modifications that show a pattern of distribution sufficiently distinct from those we have previously studied. We will then extend studies of these additional modifications to the in vitro development systems mentioned above.

4) Several modification sites such as Ser, Thr, Lys and Arg are situated adjacent to each other at several places on histones and may correlate with one another. For example, dimethylation of Arg-2 on histone H3 is antagonistic to Lys 4 trimethylation103, and Ser 10 phosphorylation is antagonistic to Lys 9 methylation. In later phases of these studies we will use ChIP-seq to investigate the acetylation and phosphorylation of relevant residues and correlate these with the results of methylation studies.

5) ChIP-chip and ChIP-seq studies detect multiple different modifications of histones covering the same segments of genomic DNA. An important issue is whether these modifications occur on the same nucleosomes, on different nucleosomes on the same DNA molecule, or on alternate configurations of nucleosome modifications on different copies of the same genomic segment. For example, di and tri methylation of Lys 4 on Histone H3 and acetylation and trimethylation of Lys 9 of Histone H3 occur at certain regions on the alpha globin locus in K562 cells (Fig……). We will investigate this by reciprocal immunodepletion with an antibody for one modification followed by ChIP analysis of the supernate and precipitate for the other modification. We will perform these studies both on mono-nucleosomes prepared by micrococcal nuclease digestion, and on sonicated chromatin fragments large enough to include three to four nucleosomes. Reciprocal experiments will be performed in which precipitation is carried out first with antibody A then antibody B, and vice versa.

6) Chromatin remodeling complexes may modulate the locations and spacing of nucleosomes, compaction of nucleosome-DNA interactions and histone modifications such as acetylation. The central ATPases of SWI/SNF related chromatin remodeling complexes, and their associated factors, are classified as SMARCAs. As part of our ENCODE project we have tested the recruitment of six of the SMARCA (SWI/SNF related matrix associated chromatin remodeling complexes) ATPases (SMARCA1 to SMARCA6) and their associated co-factors and found them to show widespread and partly divergent distributions in both HeLa and K562 cells (Fig.-). We propose to use the antibodies against these SMARCA ATPases and their associated sub-units and other ATPases such as members of the Mi2b family (CHD family of ATPases) to investigate genome wide distribution of these complexes during embryonic and adult hematopoiesis (Table…).

7) The chemical modifications of histones are brought about by specific acetylases, deacetylases, methyltransferases and demethylases. Antibodies against several such histone modifying enzymes are commercially available. We have conducted pilot studies with antibodies against the LSD1 and JMJD2d demethylases using ENCODE Chip arrays. Interestingly, LSD1 was strongly concentrated at the promoters of actively transcribing genes. This indicates that direct study of the distribution of these histone modifying enzymes can give basic information about the control of the epigenetic marks in chromatin.

Anticipated Problems and Solutions: We do not see any major problems in carrying out these experiments. We have extensive experience in this type of investigation from our ENCODE and CEGS (Center for Excellence in Genome Sciences) projects. We have rigorously tested several of the antibodies for the ChIP-chip analysis in the ENCODE project and the same antibodies would be used in these studies. We would get relatively small amounts of DNA after sequential ChIP analysis as described in Experiment set 5 above. Fortunately, Solexa sequencing requires considerably less DNA material than ChIP-chip experiments. Further, we have developed a whole genome amplification method that uniformly amplifies DNA without any major sequence bias (Fig. 15). Hence, we do not anticipate any difficulties in generating the data during these investigations.

Analysis of patterns of DNA replication: During S phase, inactive genes are often associated with late replicating DNA, and activation may be associated with a switch to early replication. The sites of initiation of DNA replication may change during early stages of development 104, and in cells carrying various deletions. However, there is relatively little information about changes in the use of replication origins during differentiation of various mature lineages of cells. We will analyze the sites of initiation of DNA replication and the timing of DNA replication during S phase in undifferentiated CD34 positive adult cells, human ES cells, and in cells at several stages of differentiation. We have obtained excellent results in mapping replication origins 105, 106, as estimated by PCR analysis of replication origins in the globin cluster (Fig 13). We detected as the most prominent origin the site that had been previously described upstream of the beta globin gene, but also at least one and probably two additional upstream sites that are used less frequently. One of these sites corresponds to the DNA region that preferentially binds a multi-protein complex we have purified that includes Rad50, ORC, and MCM components.

[pic]

To map the timing of DNA replication we will use preparative FACS to separate G1, early, mid, and late S phase, and G2 cells from exponentially growing populations. We anticipate obtaining on the order of ten thousand cells or fewer in each population. However, our modified phi29 amplification method (Pan, XH, Snyder, M., Weissman, S.M. submitted for publication) produces sufficiently uniform amplification of genomic DNA to allow high resolution tiling array analysis to detect deletions or duplications of as little as one kilobase of DNA from 5 nanograms or less of starting total genomic DNA (Fig 15). This is sufficiently sensitive that we could amplify and analyze the DNA from cells at several stages of replication and determine which genomic segments are present in diploid, triploid or tetraploid number per cell. If necessary, we could even do a second cycle of cell sorting to further enrich cells at particular stages of replication. This method would be more physiologic than methods involving the use of drugs, or selective labeling of nascent DNA with BrUdr 47.

[pic]

Fig. 15: The Pan-Weissman method produces amplified products that faithfully demonstrate the heterozygous duplication seen with the original un-amplified gDNA. The samples were hybridized on a high-density Chromosome 22 genomic tiling array for high resolution (HR)-CGH. The region under the half basket is the known segmental duplication (3 Mb, coordinate 17017434-20022987). The arrow points to a reversed signal due to cross hybridization of a repeat sequence, as demonstrated in 107 vs. amplified gDNA pool (as reference). Blue: un-amplified 05050 vs. un-amplified gDNA pool. The input for amplification was 5 nanograms of genomic DNA, corresponding to the DNA from fewer than one thousand cells. (a) Along the whole long arm of Chromosome 22; (b) the segmental duplication and its flanking sequences.

Anticipated Problems and Solutions: The number of cells that we can obtain from preparative FACS is somewhat limiting. By the time all separations have been accomplished, we would expect more than one million, but probably fewer than ten million, purified cells to be available for each cell type and time point. This number of cells is enough for analysis of DNA methylation. For chromatin immunoprecipitation studies we may need to amplify the immunoprecipitated products for microarray analysis. Fortunately, less starting material is needed for high throughput sequencing, and we may be able to avoid any preamplification for this application.

IV. Study of piRNAs during hESC differentiation into the hematopoietic lineage.

Experimental Strategy

Our working hypothesis is that the piRNA pathway plays an epigenetic role in regulating hESC self-renewal and differentiation. This hypothesis has two components: 1) the piRNA pathway is required for hESC self-renewal and differentiation; and 2) such requirement is achieved via epigenetic regulation. Our experiments are designed to test this hypothesis.

Specifically, we will conduct the following four sets of experiments to analyze the Piwi/piRNA-mediated epigenetic regulation:

Experiment 1. Clone and characterize piRNAs in hESCs, adult hematopoietic stem cells, and their derivative erythroid cells, neutrophils, or B cells. (Year 1-2)

Experiment 2. Determine the function of the Piwi/piRNA-mediated mechanism in hESC pluripotency, self-renewal, and differentiation by examining the effect of reducing the expression of human Piwi proteins on the self-renewal of hESCs and their ability to differentiate into the hematopoietic lineage. (Year 1-3)

Experiment 3. Identify regions/genes under the control of Piwi/piRNA-mediated epigenetic regulation in the hESC genome and characterize the changes of such target regions during differentiation by the CHIP-seq approach. (Year 3-5)

Experiment 4. Determine the epigenetic function of the Piwi/piRNA complex at their target sites by characterizing the changes of epigenetic markers at these sites in Piwi-deficient hESCs (Year 4-5).

Experiments 1 and 2 will test the first component of the hypothesis; whereas Experiments 3 and 4 will test the second component of the hypothesis.

Detailed Experimental Procedures

Experiment 1. Clone and characterize piRNAs in hESCs, adult hematopoietic stem cells, and their derivative erythroid cells, neutrophils, or B cells. (Year 1-2)

Preliminary Results: The first step in testing our hypothesis is to demonstrate the expression of human Piwi proteins and piRNAs in hESCs. We have examined the expression of human Piwi proteins in hESCs (H1) by immunoblot analysis using anti-Hiwi and anti-Hili antibodies. We cultured hESCs (H1) under feeder-free conditions to avoid contamination with proteins from feeder cells. Both Hiwi and Hili are expressed in the hESCs (Fig. 16A). This result is confirmed by RT-PCR analysis, which indicates that the mRNAs of all four human Piwi proteins are expressed in the hESCs (Fig. 16B). We then further examined the subcellular localization of Hiwi and Hili by immunofluorescence microscopy. Hili and Hiwi are located in both the nucleus and the cytoplasm, suggesting their possible involvement in nuclear and cytoplasmic events.

We have also recently shown that piRNAs are present not only in germline cells, but also in somatic cells, albeit at lower abundance 9. To determine whether piRNAs are present in hESCs, total RNAs were isolated from feeder-free H1 hESCs and their 5’ ends were labeled with 32P. A population of small RNAs ~27 nt in length was visualized following gel electrophoresis. Both the size and the typical smeary morphology of the band indicate that these small RNAs are likely piRNAs (Fig. 17). The expression of all four human Piwi proteins and the presence of putative piRNAs in hESCs provide a solid basis for our proposal.

Experimental design: We will conduct co-immunoprecipitation of piRNAs with each of the four human Piwi proteins from H1 hESCs according to the established protocol in our lab 9, 16. For Hiwi and Hili, we already have antibodies that work effectively for immunoprecipitation. We will raise antibodies against PiwiL3 and Hiwi2 using the antibody generation protocol that has been successfully used in the lab for Piwi proteins 108. After immunoprecipitation, RNA will be isolated by Trizol extraction. The 5’ ends of the RNAs will be radioactively labeled using T4 polynucleotide kinase (T4 PNK) and γ-32P-ATP, and the RNAs will be resolved on 15% denaturing polyacrylamide gels. The population of RNAs around 27-30 nt in length will be extracted from gels, PCR-amplified, and deep-sequenced using the 454 technology. Following sequencing, bioinformatics analysis will be done in-house to map the piRNA populations associated with each of the four Piwi proteins to the human genome and to compare their differences.

Fig. 16 (need to change figure numbering within the image: Fig 3 should be Fig 17)

Anticipated results, potential pitfalls, and alternative solutions: Our lab has discovered over 60,000 piRNAs in mice (ref 16 and Beyret, Yin, and Lin, unpublished data) and over 13,000 Piwi-associated piRNAs in Drosophila 9. We have extensive experience in all aspects of piRNA identification and characterization. Given the positive preliminary results, we do not anticipate any major pitfalls. The experiments in this aim should identify piRNAs that are associated with individual Piwi proteins in hESCs. From the genomic locations of these piRNAs, we will be able to get an initial view of the target sites in the genome that are regulated by the piRNA pathway. We should also be able to distinguish the individual functions of the four Piwi proteins by their association with different subsets of piRNAs, if any, which would then reflect different genomic loci as potential targets of their regulation. We will also compare piRNA target sites between hESCs and their derivative differentiated cells in embryoid bodies. This should allow us to identify piRNAs, and their potential genomic target sites, that are specific for hESCs and may be important for hESC self-renewal. These data will provide a valuable platform for future experiments to validate the function of these piRNAs by reducing their expression using transgenic knock-out or transient short-hairpin RNA (shRNA) knock-down techniques. Such experiments, which are beyond the scope of this proposal, would directly test the function of specific piRNAs in hESC proliferation, self-renewal, and differentiation.

Experiment 2. Determine the function of the Piwi/piRNA-mediated mechanism in hESC pluripotency, self-renewal, and differentiation by examining the effect of reducing the expression of human Piwi proteins on the self-renewal of hESCs and their ability to differentiate into the hematopoietic lineage. (Year 1-3)

Rationale: Given that Piwi proteins are essential for stem cell self-renewal in many organisms, including mice (Hao and Lin, unpublished), it is likely that human Piwi proteins are also critical for hESC self-renewal. To test this hypothesis, we propose to use an RNA interference (RNAi) approach to knock down the expression of the four human Piwi proteins, individually and in the necessary combinations, in hESCs. Because Piwi proteins are required for piRNA biogenesis, this knock-down is an effective way to confirm this requirement in hESCs and to determine the overall effect of knocking down the piRNA pathway on hESC self-renewal and differentiation. Technically, since hESCs are difficult to transfect, we will use a lentiviral vector-based RNAi method, which is known as the most efficient way of introducing shRNAs into hESCs 109, 110.

Experimental design & anticipated results: We will culture H1 and H9 hESCs and infect them with lentiviruses carrying an shRNA against one of the human Piwi proteins together with GFP as a reporter for selection. The infected cells will be subjected to a series of assays, such as colony proliferation assays, DNA content/cell cycle assays, apoptosis assays (TUNEL), pluripotency assays (examining the expression of Oct4 and the capability to form embryoid bodies), and, importantly, hematopoietic differentiation assays, using combinations of CD markers for various stages of differentiation via FACS analysis (Year 1-2). In all these assays, hESCs infected with lentiviral vector alone and uninfected hESCs will be used as negative controls. These analyses should provide us with a comprehensive picture of the function of human Piwi proteins in hESCs. Moreover, total RNA will be extracted from knock-down hESCs to examine the expression of total and specific piRNAs in these cells (Year 3). We anticipate that a sub-pool of piRNAs will decrease in response to knock-down of one or more of the Piwi proteins. These experiments should allow us to evaluate the requirement of specific human Piwi proteins for piRNA expression and hESC self-renewal.

Potential pitfalls and alternative solutions: We are experienced with all of the methods described above, and we are fortunate to have full support from Dr. Ivanova, the inventor of the lentiviral vector-based shRNA technique for ESCs, who has joined the Yale Stem Cell Center as our neighboring lab, so we do not anticipate any technical difficulties with this work. However, it is possible that individual Piwi proteins are functionally redundant in hESCs, so that their individual knock-down will not produce any detectable effect. If this is the case, we will knock down the four Piwi genes in various combinations, all the way up to quadruple knock-down, until we see an effect. It is also possible that the knock-down of these four genes will not be effective enough, so that leaky expression can sustain hESC self-renewal at a completely normal rate. If we encounter this unlikely possibility, we will resort to genomic knock-out of these genes, which is a challenging but achievable technique, with the help of the Human Embryonic Stem Cell Core at the Yale Stem Cell Center.

Experiment 3. Identify regions/genes under the control of Piwi/piRNA-mediated epigenetic regulation in the hESC genome and characterize the changes of such target regions during differentiation by the CHIP-seq approach. (Year 3-5)

Rationale: As mentioned above, the Drosophila Piwi-piRNA complex directly binds to the piRNA-matching genomic sequences to regulate their epigenetic state 9. Since Hiwi and Hili are present in the nucleus of hESCs (Fig. 16C), human Piwi protein/piRNA complexes may play a similar epigenetic role in hESCs. To understand the molecular mechanism of the piRNA pathway in hESCs, it will be important to explore the association of human Piwi/piRNAs with chromatin in these cells and how such association changes during hES cell differentiation.

Experimental design & anticipated results: We will conduct chromatin immunoprecipitation followed by Illumina sequencing (ChIP-seq) 27 to examine whether and where the Piwi/piRNA complexes are associated with chromatin in hESCs, adult hematopoietic stem cells, and their derivative erythroid cells, neutrophils, or B cells. ChIP-seq with antibodies against a core histone (e.g. H3) and a non-chromatin protein (such as GAPDH) will be used as positive and negative controls, respectively. Multiple binding sites in heterochromatic regions, telomeric regions and retrotransposon regions are anticipated for human Piwi/piRNA complexes. If necessary, the ChIP-seq data will be validated by PCR with specific primers for both heterochromatic and euchromatic regions. If the association is non-specific and at the background level, then we can conclude that the human Piwi/piRNA complexes do not have a significant role in epigenetic regulation.

Pitfalls and alternative solutions: Our team has extensive experience and resources for the techniques proposed in this aim. The two Illumina sequencers at the Yale Genomic Center can routinely produce up to 5 million independent reads per sample. We have been using them for our ChIP-seq analysis in Drosophila, and have achieved much better results than with the ChIP-chip approach, with essentially no background. We therefore do not anticipate any significant problems in achieving this aim.

Experiment 4. Reveal the epigenetic function of the Piwi/piRNA complex at their target sites by characterizing the changes of epigenetic markers at these sites in Piwi-deficient hESCs (Year 4-5).

Experimental design & anticipated results: We will use the method proposed in Experiment 2 to generate hESC lines with various Piwi proteins knocked down. We will use the ChIP-seq method proposed in Experiment 3 to quantify euchromatic and heterochromatic markers associated with specific genomic sequences. For euchromatic markers, we will use H3K9ac, H3K4me2 and phosphorylated RNA polymerase II; for heterochromatic markers, we will use HP1a and H3K9me3. When necessary, we will also use the other epigenetic markers employed in other aims of this proposal. We have antibodies to all these markers that have been successfully used for ChIP analysis. We do not anticipate any technical difficulties. These analyses should allow us to reveal systematically the epigenetic function of the Piwi-piRNA mechanism in the hESC genome, which would represent the first such analysis in any genome.

V. Epigenetic marks in normal mature cells of various hematopoietic lineages.

In previous studies we have prepared relatively large amounts of RNA from a variety of normal mature hematopoietic lineages including resting B cells, centroblasts, centrocytes, CD4 and CD8 T cells, monocytes, dendritic cells, macrophages, and resting and activated neutrophils. These analyses indicated not only that there was a series of mRNAs that were relatively specific for each lineage, but also that housekeeping genes were expressed in different relative ratios that were lineage specific. As part of the present epigenomics study we propose to isolate sufficient amounts of each of these mature cell types directly from donors in order to analyze DNA methylation patterns, to use our assays to look specifically for cell type specific changes in regions that are heterogeneously methylated, and to correlate these with changes in the relative level of gene expression. We will further compare specific histone modifications that have been found to change during differentiation in the in vitro systems, as described above.

In all proposed experiments we will confirm results by Q-PCR for randomly selected data points.

VI. Bioinformatics:

Bioinformatics Pipeline for Interpretation of NextGen Sequencing Results

Overview

The bioinformatics pipeline for handling and processing the high-throughput sequencing data related to the modifications consists of a number of modules:

a - LIMS

b - Data handling module for dealing with the large quantities of data produced by the sequencing machines

c - Scoring procedure for calling the actual modifications from the data

d - DART, our database of results and associated data flow into public databases

e - TrackSimplifier, web tools and computational approaches for interpreting and visualizing the data, i.e. correlating genome-wide modification patterns

We have linked all the tools and databases in our project into a computational pipeline. We first capture data in a LIMS. Then the raw sequencing data is processed, which involves scoring and calling hits. The resulting segments (i.e. modifications) are put into DART, and we use DART and additional associated downstream tools for further analysis. In particular, we are developing tools for genome-wide correlation (see below). Finally, this information flows to community databases (UCSC and GEO). We anticipate much of this data also flowing into the epigenomics EDACC.

Overall, our pipeline contains a number of integrated web tools that allow researchers to post information related to each step of the pipeline process. It provides for seamless data flow from one module to another via either XML exchange or direct database connectivity. The system provides for data archival, data mining, operations research, and dissemination. We are implementing key parts of the pipeline in a redundant and easily backed up fashion using virtual machine technology (i.e. VMware server).

(a) LIMS

We have had much experience creating secure, reliable distributed database systems for genome annotation. In particular, our past experience has led us to identify a number of key issues that are often stumbling blocks in their construction and maintenance -- i.e. network security in relation to data integrity and distributed annotation, the need for a rapid prototyping environment, and the use of semantic web technologies and XML for database interoperation. We discuss these issues in a number of papers 111-116. Furthermore, the Gerstein Lab has had the responsibility of creating a tracking LIMS as part of the NESG production center of the NIGMS Protein Structure Initiative (PSI) (SPiNE 117, 118). By reusing the code base from the SPiNE LIMS built for the Protein Structure Initiative, we are able to quickly tailor parts of the Wet LIMS to the needs of the ENCODE project.

Our Wet LIMS is a collection of platform independent LIMS applications used to capture production data from the selection, cell production, antibody validation, ChIP, qPCR, and hybridization segments of the pipeline. Each Wet LIMS module provides a method for capturing data in the manner best suited to that segment of the pipeline process. This entails a graphical user interface for data entry that is custom tailored for each type of work environment throughout the pipeline process. Tracking functionality can accommodate detailed histories for individual samples, thereby presenting a more complete framework for transmitting information through all stages in a high-throughput pipeline. A key feature is the ability of any registered member to add and modify entries via an intuitive web interface. This is regulated by journaling all changes to data (and the user responsible), and by restricting access to certain entries. More flexible and customized methods of data entry are also possible. The use of direct SQL and the Open Database Connectivity (ODBC) protocol enables a variety of remote interfaces to the server (such as Excel spreadsheets or Java programs). This enables bulk uploading of local datasets into the LIMS.

Recent advances in open source content management systems (CMS) have dramatically reduced the development time required to implement functional LIMS/database systems. We are now able to do very rapid development using Django. It is now possible to launch a simple yet functional system within hours of defining the data model. These rapid deployment environments allow the LIMS developers to more accurately tune the LIMS to researcher needs because refinements can be deployed within moments of a change request being received.

The specific database technology utilizes a standard three-tier architecture using a LAMP (Linux, Apache, MySQL, PHP/Perl) implementation. This approach offers a considerable speed advantage and allows sharing of libraries with offline programs used in the development of future releases and associated tools. Perl's suitability for systems programming also allows a wide variety of other modules to be used in the server with minimal setup and administrative overhead.

(b) Handling NextGen Sequencing Data

Overview

We have recently been devoting considerable effort to handling sequencing data, in particular the substantial data that flows from one run of an Illumina instrument. Each full run of this instrument generates about 700 GB of data. Currently this data is copied both to an archival store and to the Life Sciences cluster for analysis. We have helped to develop and deploy techniques for doing so in a reasonable amount of time (if space is available on the cluster, we can now have validated copies on both the archival store and the cluster’s high performance file system within a few hours of the completion of the run, which itself takes about three days). We have also deployed Illumina’s analysis software to hardware well suited for its execution (in particular, computation nodes with 4 dual core processors). This software generates another 200 GB or so of data spread out over tens of thousands of files. Effectively interacting with such a huge quantity of “results” presents a considerable challenge to the researchers providing the samples for the run. To meet this challenge, we have developed a web interface that simplifies navigating the results, and another to post-process the results into a form suitable for use with the researchers’ tool of choice for visualizing the reads relative to a reference genome.

Specific Protocol

Each Illumina machine has a Windows PC attached that controls the chemistry and acquires scanned images of the lanes in the flowcell. The resulting raw data, consisting of TIFF images and other files, totals over 700 GB per flowcell. Each PC remote mounts an external 2 TB hard drive from a Mac Mini. At appropriate moments in the processing, the Illumina control software invokes a utility called robocopy that incrementally copies this raw data from the local hard drive, where the data is deposited by the control software, to the remote mount. Thus, by the end of a run, most of the data has been copied to the remote mount, making it relatively quick to transfer the remaining data, validate the copy and then purge the local hard drive in preparation for the next run (the PCs have only enough local storage for one run). Software on the Mac Mini, meanwhile, incrementally copies incoming data to the Yale Life Sciences Supercomputer for analysis. When the data on an external drive has been copied and the copy validated, the external drive can be detached from the Mac Mini and put into storage, effectively archiving the raw data. An 8 processor node of the Yale Life Sciences Supercomputer is used to process the images, call bases and compare the resulting reads to the reference genome.
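The text above does not specify how a copy is validated before the local drive is purged. As an illustration only, one common approach is to compare checksums of every file in the source and destination trees. The following minimal Python sketch assumes SHA-256 digests and hypothetical function names (`file_digest`, `validate_copy`); it is not a description of the scripts actually deployed.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def validate_copy(src_root, dst_root):
    """Return relative paths that are missing from, or differ in, the copy.

    An empty list means every file under src_root has an identical
    counterpart under dst_root (assumption: identical digest == valid copy).
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    problems = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(rel.as_posix())
    return problems
```

Under this scheme the local drive would be purged only when `validate_copy` returns an empty list for the run directory.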

On the Supercomputer, the pipeline manager runs the Illumina pipeline software; first Firecrest processes the images copied over, then Bustard calls bases to determine read sequences, and finally Gerald maps them against the genome or genomes from which the samples were derived using an alignment algorithm called Eland. There are three sets of output totaling approximately 200 GB of data. Data for mapped reads are used to create SGR files indicating the sequences and genomic locations of the samples in the flowcell. Most likely the experimentalist will request additional runs of the pipeline or some of its parts (with differing parameters), creating another ~200 GB of data.

When the results of the run have been fully reviewed, the data is moved to our long term storage (see below).

Storage and Backup

By the time the experimentalist is satisfied with the final results, we generally have accumulated approximately 1.1 TB of data. One of the challenges is to understand how to store this data. Because of the widely differing costs of storage, our approach has been tiered storage. Currently the external hard drives that contain images copied by robocopy are archived when full. External hard drives are also used to back up the analysis data created on the Supercomputer; this process is done over the network from the Supercomputer back to the Illumina setup where these external hard drives are attached. External hard drives are used for data that are expected to be accessed rarely, if at all. For data that may be retrieved more frequently, we currently use SAN disks, which are a much more expensive but more reliable option with better access characteristics. Such data include the base calling and alignment data, as well as SGR files and downstream analysis results. Most will be zipped on the SAN disks, and when they are needed again they will be unzipped and transferred back onto the Supercomputer. Usage of archived data may include further analysis, analysis with updated or different software, or data submission.
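The 1.1 TB figure follows from the per-run volumes quoted earlier: about 700 GB of raw data, about 200 GB from the first pass of the analysis pipeline, and roughly another 200 GB from one re-analysis. The short Python sketch below reproduces that arithmetic; the function names are hypothetical, and the use of decimal units (1 TB = 1000 GB) is an assumption.

```python
RAW_GB = 700        # raw images and associated files per flowcell (from the text)
ANALYSIS_GB = 200   # output of one pass of the analysis pipeline (from the text)


def run_footprint_gb(reanalyses=1):
    """Total data per flowcell: raw data plus the first analysis and any re-runs."""
    return RAW_GB + ANALYSIS_GB * (1 + reanalyses)


def raw_runs_per_drive(drive_gb=2000):
    """Number of complete raw data sets that fit on one external drive."""
    return drive_gb // RAW_GB


if __name__ == "__main__":
    print(run_footprint_gb())      # 1100 GB, i.e. about 1.1 TB
    print(raw_runs_per_drive())    # 2 runs per 2 TB external drive
```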

Future Directions

• Greater automation and streamlining: currently there is a great deal of manual work in the process done by the pipeline manager. We will develop a system of scripts and programs that will result in fewer interventions without compromising the generation of correct data and their storage.

• Future storage solutions: as the number of flowcells put through the system increases, the need for more storage and a better storage management system becomes more urgent. We are continuing to study the different types of files and their storage needs, but it is inevitable that more storage at different price points will be needed.

• Alternative Software: we are exploring other algorithms for base calling and alignment, as well as more downstream analysis programs. Their presence in the analysis pipeline will give us more confidence in the sequence and genomic position results.

(c) Dataset Scoring

Sequence reads (typically 27-35 nt in length) off the ends of ChIP DNA fragments are aligned to the most recent build of the human genome (NCBIv36) using either the Illumina software package Eland or Maq (Durbin Lab, Sanger Center). The ChIP DNA that is sequenced is size selected to be in the range of 100-300 bp in length. Eland allows sequence reads to be mapped to the genome with up to two mismatches. Using only sequence reads that map uniquely to the genome, we construct a signal map of all the inferred ChIP-DNA fragments (extending the location of each mapped read by 200 bp to account for the average size of each fragment of ChIP DNA). The signal map is then the pile-up of mapped pieces of ChIP DNA, where the height at a given nucleotide position is the count of mapped fragments that overlap that position.
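As an illustration, the signal map construction can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than the pipeline's actual code: reads are given as (leftmost position, strand) pairs on one chromosome, each read is extended to a 200 bp fragment in its direction of sequencing, and the function names `extend_read` and `signal_map` are hypothetical.

```python
from collections import Counter

FRAGMENT_LEN = 200  # extension length stated in the text


def extend_read(pos, strand, read_len, fragment_len=FRAGMENT_LEN):
    """Return the half-open interval [start, end) of the inferred fragment.

    Assumption: pos is the leftmost mapped coordinate of the read; plus-strand
    reads extend rightward, minus-strand reads extend leftward from their
    rightmost base.
    """
    if strand == "+":
        return pos, pos + fragment_len
    end = pos + read_len
    return max(0, end - fragment_len), end


def signal_map(reads, read_len, fragment_len=FRAGMENT_LEN):
    """Pile up extended fragments; return a list of (start, end, height) segments."""
    delta = Counter()
    for pos, strand in reads:
        start, end = extend_read(pos, strand, read_len, fragment_len)
        delta[start] += 1   # one more fragment covers positions from start
        delta[end] -= 1     # and stops covering at end
    segments, height, prev = [], 0, None
    for coord in sorted(delta):
        if delta[coord] == 0:
            continue        # a start and an end cancel; height is unchanged
        if prev is not None and height > 0:
            segments.append((prev, coord, height))
        height += delta[coord]
        prev = coord
    return segments
```

For example, `signal_map([(100, "+"), (150, "+"), (400, "-")], read_len=30)` returns segments whose maximum height of 3 spans positions 230 to 300, where all three inferred fragments overlap.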

Transcription factor binding sites and modifications will appear as “peaks” in these maps. In order to determine a threshold to use for the peak height, we perform a computational simulation determining the number of peaks we would obtain from a random mapping of the same number of DNA fragments. At a given threshold we can compute the false discovery rate (FDR) as the number of false positives identified from the simulation divided by the number of positives identified from the actual mapped ChIP DNA sequences. Requiring an FDR of less than 0.05 determines the peak height threshold required. This analysis is done separately for each chromosome and can be done at an even more local genomic level if necessary (to potentially account for local copy number variation). Using these thresholds we can determine a set of regions for each chromosome that are potential transcription factor binding sites. Significant regions that are separated by less than 200 bp are combined. In addition, we have constructed a mappability map of the entire human genome in which, at each nucleotide position, we have determined whether a short sequence read from that location maps uniquely back to the correct location or to multiple locations in the genome. We use this map to correct for local variations in the amount of uniquely mappable sequence, which will locally modulate the signal maps.
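The threshold selection and region merging described above can be sketched as follows. This Python example is illustrative only: it assumes that peak heights from the real data and from one random-placement simulation are already available as lists, it picks the lowest threshold meeting the FDR requirement, and the names `choose_threshold` and `merge_regions` are hypothetical.

```python
def choose_threshold(observed_heights, simulated_heights, max_fdr=0.05):
    """Return the lowest peak height threshold whose FDR is below max_fdr.

    FDR at threshold t = (simulated peaks with height >= t)
                         / (observed peaks with height >= t).
    Returns None if no threshold satisfies the requirement.
    """
    for t in sorted(set(observed_heights)):
        n_obs = sum(h >= t for h in observed_heights)
        n_sim = sum(h >= t for h in simulated_heights)
        if n_obs > 0 and n_sim / n_obs < max_fdr:
            return t
    return None


def merge_regions(regions, max_gap=200):
    """Combine (start, end) regions separated by less than max_gap bp."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < max_gap:
            # Close enough to the previous region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

In practice both functions would be applied chromosome by chromosome, matching the per-chromosome analysis described in the text.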

For each set of transcription factor and chromatin modification experiments done using ChIP-Seq, we have sequenced the Input DNA for the same cell line using exactly the same experimental procedure. The Input DNA is sequenced at a greater depth than any of the individual ChIP DNA samples. Analysis of the Input DNA shows that we also observe “peaks” in the Input DNA around the promoters of many genes. Thus only “peaks” from the ChIP DNA maps that are enriched relative to the same location in the Input DNA (i.e. that show enrichment from the IP) are sites of real transcription factor binding or chromatin modification. To make this comparison, the number of mapped sequence reads must be normalized between the ChIP and Input DNA. We do not simply use the same number of mapped reads, because the ChIP DNA map typically has a significantly higher fraction of reads in binding sites, and we would like to normalize the background signal of the ChIP DNA map against the background of the Input DNA.

A normalization factor is computed by performing linear regression between the ChIP and Input DNA maps on the number of mapped fragments in each 10 kb region along a chromosome. This is done separately for each chromosome; analysis has shown that the normalization does not need to be done at a more local chromosomal level, as the correlation is consistent across a chromosome.
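
As an illustration of this normalization, the following Python sketch bins mapped fragment positions into 10 kb windows and fits a line to the ChIP counts as a function of the Input counts. Taking the slope of an ordinary least-squares fit with intercept as the scaling factor for the Input DNA is an assumption made here for illustration, as are the function names.

```python
import math


def bin_counts(positions, chrom_len, bin_size=10_000):
    """Count mapped fragments in consecutive bin_size windows of a chromosome."""
    counts = [0] * math.ceil(chrom_len / bin_size)
    for p in positions:
        counts[p // bin_size] += 1
    return counts


def regression_slope(input_counts, chip_counts):
    """Slope of the least-squares line chip = slope * input + intercept."""
    n = len(input_counts)
    mean_x = sum(input_counts) / n
    mean_y = sum(chip_counts) / n
    sxx = sum((x - mean_x) ** 2 for x in input_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(input_counts, chip_counts))
    return sxy / sxx


def normalization_factor(chip_positions, input_positions, chrom_len,
                         bin_size=10_000):
    """Per-chromosome factor by which Input counts are scaled (assumed usage)."""
    return regression_slope(
        bin_counts(input_positions, chrom_len, bin_size),
        bin_counts(chip_positions, chrom_len, bin_size),
    )
```

Multiplying an Input DNA count by the returned factor gives the normalized background against which a ChIP count for the same region can be compared.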

To determine whether a “peak” region from the ChIP DNA map is significantly enriched relative to the same region in the Input DNA, we count the number of mapped fragments in each region and compare it against the appropriately normalized number of fragments mapped to the same region in the Input DNA. Using a binomial test we determine a p-value for the significance with which the number of mapped tags in the ChIP DNA is enriched relative to the Input DNA. In addition, for each region we compute the enrichment as the ratio of the appropriately normalized numbers of mapped fragments. Only ChIP-DNA regions that are statistically significant (p-value ................
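
The binomial comparison of a ChIP region against the normalized Input DNA can be sketched as follows. The exact form of the test is not spelled out above, so this sketch assumes a one-sided test in which, under the null hypothesis, each tag in the region is equally likely to come from the ChIP sample or from the normalized Input sample (success probability 0.5). The function names and the rounding of the normalized Input count are likewise illustrative assumptions.

```python
from math import comb


def binomial_enrichment_p(chip_count, scaled_input_count):
    """One-sided p-value P(X >= chip_count) for X ~ Binomial(n, 0.5).

    n is the ChIP count plus the (rounded) normalized Input count; the
    0.5 null probability is an assumption of this sketch.
    """
    n = chip_count + round(scaled_input_count)
    tail = sum(comb(n, i) for i in range(chip_count, n + 1))
    return tail / 2 ** n


def enrichment_ratio(chip_count, scaled_input_count):
    """Ratio of ChIP fragments to normalized Input fragments in a region."""
    if scaled_input_count == 0:
        return float("inf")
    return chip_count / scaled_input_count
```

For example, a region with 8 ChIP fragments and 2 normalized Input fragments has an enrichment ratio of 4.0 and a p-value of 56/1024, about 0.055, under these assumptions.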
