Advances in DNA sequencing technology: methods and goals

Jay Shendure^, Rob Mitra*, George M. Church^+

^ Harvard Medical School, 77 Ave Louis Pasteur, Boston, MA 02115, USA.

* Dept. of Genetics, Washington University School of Medicine, 4566 Scott Avenue, St. Louis, MO 63110, USA

+ Corresponding author, email: church@arep.med.harvard.edu

Preface

Nearly three decades have passed since the concurrent invention of the Maxam-Gilbert and Sanger methods for DNA sequencing. Automation and miniaturization of Sanger sequencing led to the development of high-throughput sequencing machines, the workhorses that delivered the human genome under budget and ahead of schedule. A new generation of technologies aspires to push DNA sequencing to a new level, such that the genomes of individual humans can be sequenced at a cost compatible with routine health care. Here we review the technologies under development and discuss the potential impact of the "Personal Genome Project" on both the research community and society.

Introduction

Early investments in the development of cost-effective sequencing methods undoubtedly contributed to the resounding success of the Human Genome Project (HGP). Over the course of a decade, through the refinement, parallelization, and automation of established sequencing methods, the HGP motivated a 100-fold reduction in sequencing costs, from 10 dollars per finished base to 10 finished bases per dollar [Col03]. The relevance and utility of high-throughput sequencing and of large sequencing centers in the wake of the HGP have been a subject of recent debate: the number of species for which canonical genomes have been sequenced and assembled is growing rapidly, but the list of species that are both interesting and relevant is ultimately finite. Nevertheless, a number of academic and commercial efforts are pursuing new ultra-low-cost sequencing (ULCS) technologies that aim to reduce the cost of DNA sequencing by several orders of magnitude. Why? First, many biomedical and bioagricultural goals that are not practical with current cost structures become quite justifiable at a cost such as 1 million bases per dollar. Second, sequencing at the low costs achieved by the HGP requires the involvement of a large sequencing center, and these are few in number and under heavy demand; many of the new technologies aim to place high-throughput sequencing within reach of individual laboratories or core facilities. Finally, such technologies have the potential to bring genome sequencing out of the laboratory and into the clinic, as they approach costs at which the complete sequencing of "personal genomes" might be affordable.
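
To make these price points concrete, the back-of-the-envelope sketch below (in Python) converts them into the cost of one human-sized genome; the 3-billion-base haploid genome size is the only number assumed here rather than taken from the text.

# Back-of-the-envelope cost of one human-sized genome at the price points
# quoted above. The 3e9-base haploid genome size is an assumption.
GENOME_BASES = 3e9

price_points_dollars_per_base = {
    "start of the HGP (10 dollars per finished base)": 10.0,
    "end of the HGP (10 finished bases per dollar)": 1.0 / 10,
    "ULCS target (1 million bases per dollar)": 1.0 / 1e6,
}

for label, dollars_per_base in price_points_dollars_per_base.items():
    print(f"{label}: ~${GENOME_BASES * dollars_per_base:,.0f} per genome")
# -> roughly $30 billion, $300 million, and $3,000 per genome respectively.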

Here we review novel sequencing technologies that are under development and discuss their relative feasibility and progress to date. These technologies can be broadly classified into four groups: (a) micro-electrophoretic methods, (b) sequencing by hybridization, (c) cyclic-array sequencing, and (d) single-molecule sequencing. The field is still relatively small, with probably fewer than 10 academic laboratories and fewer than 10 companies focused on this area. All of the technologies are at an early stage of development, so it is difficult to gauge how long it will be before any given method becomes truly practical and lives up to expectations. Nevertheless, significant progress has been made with relatively limited resources, suggesting that substantial further investment is likely to be justified.

Boosting the capacity of research-oriented DNA sequencing was the original motivation for pursuing new technologies. Since 2001, the primary justification for these efforts has gradually shifted to the notion that sequencing could become so affordable that determining the full genome sequences of individual patients would be justified from a health-care perspective, with calls to reduce the cost to less than $5,000 per human genome, a threshold at which individual patients could conceivably afford "personal sequences" [Jon01][Pra02][Pen02][Sal03]. Here we use the phrase "Personal Genome Project" (PGP) to describe this goal. Such sequencing would affect health care both directly, by providing diagnostic and prognostic markers for the clinical setting, and indirectly, by accelerating the pace of basic and clinical biomedical research.

We are increasingly seeing the potential fruits of genomics in the clinic, but clearly there is a long way to go, and sequencing large numbers of human genomes could itself expedite this process. The idea that the cost of sequencing a full human genome (effectively, the sum of all possible genetic tests) could approach the current cost-to-patient of a single genetic test is certainly appealing. Yet important questions remain. What are the potential health-care benefits, for example, of sequencing the genome of a healthy baby? At what cost threshold does the PGP become justified? And with respect to issues such as consent, confidentiality, discrimination, and patient psychology, what are the risks?

A Brief History of Sequencing

In 1977, two research groups, one led by F. Sanger and the other by W. Gilbert, both familiar with peptide and RNA sequencing methods, made a technical leap forward for sequencing in general and DNA sequencing in particular by harnessing the separation power of gel electrophoresis [Gil81,San88]. Although dideoxy and Maxam-Gilbert sequencing were costly relative to today's methods, many scientists adopted and improved these techniques. By 1985, enough progress had been made that a small group of scientists set themselves the audacious goal of sequencing the entire human genome between 1990 and 2005 [Coo89][Col03]. This declaration met with considerable resistance from the wider community: at the time, many felt that the cost of DNA sequencing was far too high and the sequencing community too fragmented to complete such a vast undertaking. The critics, however, did not account for the rapid pace of technical and organizational innovation. In 2000, five years ahead of schedule and significantly under budget, a useful draft sequence of the human genome was published [ref] at a cost of roughly $300 million for the bulk of the sequence. Possibly even more significant was the emergence of a culture of technology development and of "open" data and software, which continues to shape genomics today [Col03].

How many nucleotides are out there?

In addition to Homo sapiens, the genomes of over 160 organisms have been fully sequenced, as well as parts of the genomes of over 100,000 taxonomic species. There are currently 20 billion bases in the international sequence databases [Gen03], only a tiny sample of the complexity of life on Earth. But is sequencing absolutely essential to understanding life? Or are there other means by which we can learn about the biosphere and our own species? We argue that, although sequencing the entire biosphere is unnecessary and impractical, whole-genome sequencing, once the technology is made faster and cheaper, will improve existing biological and biomedical investigations and open up new genomic and technological studies (see also Box 1).

Impact of sequencing on existing and new studies. By comparing genomic sequences between organisms, we are learning a great deal about our own molecular program, as well as those of the other organisms in the biosphere [Ure03,Car03]. This approach will only become more powerful as more genomes are sequenced [Tho03]. As we increasingly appreciate the significance of inter-individual and tissue-specific variation in sequences and their interactions, it is both humbling and amusing to compare the contents of today's databases [Gen03] with the complexity of sequences on Earth: a global biomass of over 2e18 g contains an estimated total biopolymer sequence of 1e38 residues. Although sequencing the entire biosphere is out of the question, we clearly have a great deal left to learn.

Why Continue Sequencing?

A ULCS technology would greatly change the way biologists work. Mutagenic or population-genetic screens in model organisms (yeast, nematode, fly, zebrafish, mouse) and in non-model organisms would be more powerful if one could simply sequence the genomes of individuals from a cross to find the responsible mutations.

In addition to making well-established techniques more powerful, improved sequencing capacity could open up new research fields. For example, one could directly query the antibody diversity that is generated in response to disease. ULCS would also benefit synthetic biology and genome engineering, which range from the selection of new enzymes to the building of new chromosomes, and which provide powerful tools for perturbing or designing complex biological systems. Whether the goal is the accurate synthesis of a designed DNA or RNA sequence or its deliberate combinatorial scrambling, such large syntheses will inevitably contain errors, and ULCS would help to verify that the engineered DNA meets its design specifications. Engineered genomes also remain subject to the forces of mutation and selection, so it will be important to monitor base changes as they occur. Even further from natural genomes than these synthetic constructs lie DNA computing [Bra02] and the use of DNA as an ultracompact memory, at roughly 1 nm3 per bit rather than the ~1e11 nm3 per bit of conventional storage media such as DVDs.
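
The "ultracompact memory" claim is easy to quantify from the densities quoted above; the sketch below is purely illustrative and assumes, as an example payload, the roughly 20 billion bases (2 bits per base) mentioned earlier as the current content of the international databases.

# Volume needed to store ~20 billion bases (2 bits per base) at the storage
# densities quoted above: ~1 nm^3 per bit for DNA versus ~1e11 nm^3 per bit
# for conventional media such as DVDs.
NM3_PER_M3 = 1e27
bits = 20e9 * 2

dna_volume_m3 = bits * 1.0 / NM3_PER_M3      # ~4e-17 m^3: far below a dust speck
dvd_volume_m3 = bits * 1e11 / NM3_PER_M3     # ~4e-6 m^3: a few cubic centimetres

print(f"DNA: {dna_volume_m3:.1e} m^3, DVD-style media: {dvd_volume_m3:.1e} m^3")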

The personal genome project and human health. Perhaps the most compelling reason to pursue low-cost, high-throughput sequencing is the impact it could have on human health. Inexpensive sequencing would give us access not only to our inherited diploid genetic make-up but also to many aspects of our constantly changing internal environment, including how our bodies respond to pathogens and allergens, and the genetic changes that arise in cancer and in immune cells. It is occasionally claimed that all we can afford (and hence all that we need) is information on "common" single-nucleotide polymorphisms (SNPs), or on the arrangements of these into haplotypes [Gib03], in order to understand so-called "multifactorial" or "complex" diseases [Hol00]. In a non-trivial sense, all diseases are "complex": as we get better at genotyping and phenotyping, we simply get better at finding contributing factors of ever lower penetrance and expressivity. A focus on "common" alleles will probably be successful for alleles maintained in human populations by heterozygote advantage (such as the textbook relationship between sickle-cell anaemia and malaria), but it would miss most of the genetic diseases documented so far [Vit03]. In any case, even for diseases that are amenable to the haplotype-mapping approach, ultra-high-throughput sequencing would allow geneticists to move more quickly from a haplotype that is linked to a disease to the causative SNP, and many candidate loci could be investigated by sequencing them across large populations. For diseases caused by multiple but individually rare mutations, it is critical to sequence the causative variants directly, and whole-genome sequencing technology would give geneticists the ability to do this.

Another medically important area in which ULCS could make a large impact is cancer biology. Cancer is fundamentally a disease of the genome: the gradual accumulation of somatic mutations during normal cell division is thought to give rise to malignant cells. Epidemiology suggests that mutations in three to seven genes are necessary to cause cancer [Mil80]. It is now becoming clear that different sets of genes are mutated in different cancers and that the order in which these genes acquire mutations can vary; in addition, genomic instability may play a role in cancer progression [Raj03]. There are therefore many different paths to cancer. Although a large number of distinct genomic mutations and aberrations have been found in cancers, patterns are starting to emerge. For example, there are thought to be six essential features that a cancerous cell must acquire, and by looking for disrupted pathways rather than just individual genes, a better understanding of tumorigenesis has been gained [Han00]. The ability to sequence and compare complete genomes from many normal, neoplastic, and malignant cells would greatly increase our understanding of tumorigenesis. The comprehensive detection of all somatic mutations (base changes, genomic rearrangements, and epigenetic variation) in a tumor genome should capture most, if not all, of the factors involved in cancer initiation and progression, and would allow us to exhaustively catalogue the molecular pathways and checkpoints that are inactivated as a tumor develops.

Is the PGP feasible?

One reason for the overwhelming success of sequencing is that the number of nucleotides that can be sequenced for a given price has increased exponentially over the past 30 years (Figure 1). This is biology's version of the Moore-Kurzweil law, which holds that the number of computer instructions per second available per unit cost doubles every 18 months or faster [Kur03]. The exponential trend is by no means guaranteed, however, and realizing a PGP in the next five years would probably require a greater commitment to technology development than was available in the pragmatic, production-oriented HGP (Figure 1). How might this be achieved? Obviously we cannot review technologies that are secret, but a number of truly innovative approaches have now been made public, marking this as an important time to compare and conceptually integrate these strategies. We review four major approaches below (Figure 2).
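
To see what the exponential trend implies about timescales, the short calculation below assumes that the 18-month doubling quoted above simply continues; the 1,000-fold and 100,000-fold targets are illustrative, taken loosely from the cost figures discussed in the Introduction.

import math

# Years needed for a given fold-reduction in cost per base, assuming cost
# halves every 18 months (the Moore-Kurzweil pace quoted above).
def years_to_reduce_cost(fold_reduction, doubling_time_years=1.5):
    return math.log2(fold_reduction) * doubling_time_years

for fold in (1e3, 1e5):
    print(f"{fold:,.0f}-fold cheaper: ~{years_to_reduce_cost(fold):.0f} years")
# -> about 15 and 25 years respectively, which is why a PGP within five years
#    would require a faster-than-historical pace of improvement.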

Sequencing technologies

Micro-Electrophoretic Sequencing

Nearly all publicly available DNA sequence has been obtained using technology based on the dideoxy sequencing method developed by Sanger and colleagues and published 26 years ago [ref]. Typically, 99.99% accuracy can be achieved with as few as three raw reads. The Mathies and Matsudaira groups are currently investigating just how inexpensive Sanger sequencing can be made. Using microfabrication techniques developed in the computer industry, they are building miniature DNA sequencers that can rapidly separate dideoxynucleotide ladders. They are also working to create microfabricated devices that can perform PCR, allowing several sequencing steps to be integrated. Read lengths of up to 900 bases have been demonstrated on a 384-lane microfabricated sequencing device [Emr02], and work is under way to create a single device that performs DNA amplification, purification, and sequencing in an integrated fashion.
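
The quoted accuracy illustrates the power of building a consensus from redundant reads. The sketch below assumes a roughly 1% raw per-base error rate and independent errors; both are illustrative simplifications rather than figures from the text.

# Consensus accuracy from three independent raw reads, assuming each read
# miscalls a given base with probability p (an illustrative 1%).
p = 0.01

# A consensus error requires at least two of the three reads to be wrong
# (and, strictly, to agree on the same wrong base, so this is an upper
# bound on the consensus error rate).
consensus_error_upper_bound = 3 * p**2 * (1 - p) + p**3

print(f"raw per-base accuracy:        {1 - p:.2%}")
print(f"3-read consensus accuracy: >= {1 - consensus_error_upper_bound:.3%}")
# -> about 99.97% or better, on the order of the 99.99% quoted above.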

Hybridization Sequencing

Other groups are making use of the property that single-stranded DNA preferentially hybridizes to its complementary sequence, a fact that can be exploited in several ways to sequence DNA.

One approach, demonstrated by Hyseq, is to immobilize the DNA to be sequenced on a membrane or glass chip and then perform serial hybridizations with 6-mer oligonucleotides. By monitoring which 6-mers bind, the unknown sequence can be decoded. This strategy can be used for both resequencing and de novo sequencing [Drm01].
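
The decoding step can be illustrated with a toy example: given the set of 6-mers that hybridize, the sequence is rebuilt by chaining overlaps of five bases. The sketch below uses a short made-up template and assumes that every 6-mer is observed exactly once, with no hybridization errors and no ambiguous overlaps, which real sequencing-by-hybridization data do not guarantee.

# Toy reconstruction of a sequence from its 6-mer "spectrum" (the set of
# 6-mers that would hybridize to it). Assumes every 6-mer occurs once and
# every overlap extension is unambiguous; real SBH data are not this clean.
K = 6
template = "ACGTTAGCCATGGATCGATCCAGT"          # hypothetical sample sequence
spectrum = {template[i:i + K] for i in range(len(template) - K + 1)}

# The starting 6-mer is the one whose 5-base prefix is not the suffix of any
# other 6-mer in the spectrum.
start = next(k for k in spectrum
             if not any(o != k and o[1:] == k[:-1] for o in spectrum))

sequence, remaining = start, spectrum - {start}
while remaining:
    matches = [k for k in remaining if k[:-1] == sequence[-(K - 1):]]
    if len(matches) != 1:
        break                                   # ambiguous or missing overlap
    sequence += matches[0][-1]
    remaining.remove(matches[0])

print(sequence == template)                     # True for this toy example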

A more widespread strategy for this "hybridization resequencing" method is to attach the oligonucleotides to the glass surface instead of the sample DNA, an approach first commercialized in the Affymetrix HIV chip in 1995 [Lip95]. Perlegen has greatly extended this approach to allow the resequencing of human genomes [Pat01]. The current maximum density is about one oligonucleotide "feature" per 5 microns, with each feature containing approximately 100,000 copies of a 25-mer oligonucleotide. For each base to be resequenced in the reference sequence, there are four features on the chip, whose middle base is either an "A", "C", "G", or "T"; the sequence flanking the variable middle base is identical for all four features and matches the reference. By hybridizing labeled sample DNA to the chip and determining which of the four features shows the strongest signal at each position in the reference sequence, a DNA sample can be rapidly resequenced.
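
In essence, the base call at each position is an argmax over the four feature intensities. The minimal sketch below uses invented intensity values and an arbitrary confidence threshold; a real analysis pipeline would also model background, cross-hybridization, and strand information.

# Base calling on a four-features-per-position resequencing array: call the
# middle base whose feature gives the strongest signal, or 'N' if the winner
# does not clearly beat the runner-up. All intensity values are invented.
intensities = [
    {"A": 120, "C": 3400, "G": 150, "T": 90},    # clear C
    {"A": 2900, "C": 210, "G": 180, "T": 160},   # clear A
    {"A": 800, "C": 760, "G": 700, "T": 740},    # ambiguous
]
MIN_RATIO = 2.0   # winner must exceed the runner-up by this factor

calls = []
for signals in intensities:
    ranked = sorted(signals, key=signals.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    calls.append(best if signals[best] >= MIN_RATIO * signals[runner_up] else "N")

print("".join(calls))                            # -> "CAN"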

Cyclic Array Sequencing

This class includes methods previously grouped under the name "sequencing-by-synthesis". However, because nearly all of the methods reviewed here have critical "synthesis" steps, we choose instead to emphasize the cyclic nature of this class: a dense array of templates is subjected to repeated cycles of enzymatic reaction and imaging, with a small amount of sequence read from every template in each cycle. Three key components of this class emerged in 1984: multiplexing in space and time, the avoidance of bacterial clones, and cycling [Chu84]. This led to the first commercially sold genome sequence, although its electrophoretic component held it back. In 1996, Ronaghi and Nyren introduced pyrosequencing, in which a sequencing primer is extended by adding a single type of nucleotide triphosphate and extension is detected by monitoring pyrophosphate release. Repeated cycles of nucleotide addition are performed, and the sequence is read by recording which bases are added at each cycle.
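
The cyclic logic is easy to simulate: nucleotides are introduced one species at a time in a fixed order, and the signal at each cycle reports how many bases were incorporated. The sketch below is an idealization (no noise, and long homopolymers are assumed to be read exactly) using a made-up template.

# Idealized pyrosequencing "flowgram": in each cycle one nucleotide species is
# flowed, and the recorded signal equals the number of template bases
# incorporated in that cycle (0 if the next base does not match).
FLOW_ORDER = "ACGT"                       # repeated cyclically
template = "TTACCGAGGT"                   # hypothetical template being read

def flowgram(template, flows=16):
    signals, pos = [], 0
    for i in range(flows):
        nt = FLOW_ORDER[i % len(FLOW_ORDER)]
        run = 0
        while pos < len(template) and template[pos] == nt:
            run += 1                      # extends through homopolymer runs
            pos += 1
        signals.append((nt, run))
    return signals

print(flowgram(template))
# Nonzero entries, read in order, reproduce the template:
# T2 A1 C2 G1, A1, G2 T1 -> TT A CC G A GG T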

This approach to DNA sequencing is attractive because it is non-electrophoretic and can be miniaturized well below the original 9-mm spacing. Mostafa Ronaghi and Ron Davis at the Stanford Genome Center, and 454, a company located in New Haven (Connecticut, USA), are developing methods to perform single-molecule PCR in 50-micron wells and then to sequence the amplification products [Lea03]; this approach has recently been used to sequence an adenoviral genome [Sar03]. By integrating the cloning, amplification, and sequencing steps into a miniaturized format, they expect a significant improvement in cost. One way to achieve this integration has been to use polymerase colonies, or "polonies" [Mit99] (see Figure 2c). In this technology, millions of individual DNA molecules are amplified by the polymerase chain reaction (PCR) in an acrylamide gel attached to the surface of a glass microscope slide. Because the acrylamide restricts the diffusion of the DNA, each single molecule included in the reaction produces a colony of DNA (a polony). Amplification can also be accomplished on beads in water-in-oil emulsions [Dre03]; with bead sizes as small as 1 micron, a conventional slide can carry up to 2 billion beads. To sequence these polonies, we perform sequential single-base extensions with reversible dye-labeled deoxynucleotides [Mit03B], with read lengths of at least 28 cycles.
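
The raw throughput implied by these numbers is striking. The arithmetic below uses only the figures quoted above (2 billion beads per slide, reads of 28 bases) plus an assumed 3-billion-base haploid genome.

# Throughput implied by a slide of 1-micron beads: 2e9 beads, 28 bases each.
beads_per_slide = 2e9
bases_per_bead = 28
genome_size = 3e9                  # assumed haploid human genome size

raw_bases = beads_per_slide * bases_per_bead
print(f"raw bases per slide: {raw_bases:.1e}")                          # 5.6e+10
print(f"coverage of one human genome: {raw_bases / genome_size:.0f}x")  # ~19x
# This idealization assumes every bead yields a usable read; in practice only
# a fraction do, but the scale of the numbers is the point.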

A related cyclic approach is Lynx's Massively Parallel Signature Sequencing (MPSS), in which each single molecule of DNA in a library is labeled with a unique oligonucleotide tag. The library is then amplified and hybridized to oligonucleotides that have been combinatorially synthesized on glass beads. After the hybridization reaction, each bead carries approximately 100,000 DNA molecules, all amplified from the same single library molecule. These are sequenced by cycles of fluorescent linker ligation followed by removal of four base pairs with a restriction enzyme. The accuracy is quite high, and the 20-base-pair read lengths are adequate for many purposes [Bre00].

Single Molecule Sequencing

Several groups are attempting to sequence single molecules of DNA. This approach could reduce costs by eliminating the cloning and amplification steps, and it requires only very small reaction volumes. One such technology is nanopore sequencing, which is being developed by Agilent and by the Branton and Deamer groups. As DNA passes through a 1.5-nm pore, different base pairs obstruct the pore to varying degrees, resulting in measurable changes in its electrical conductance. The accuracy of base calling currently ranges from 60% for single events to 99.9% when 15 events are analysed [Win03]; however, the method is so far limited to reading the terminal base pair of a specific type of hairpin.
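
The improvement from single events to 15 events is what one expects from repeating a noisy measurement. The Monte Carlo sketch below uses a deliberately crude model, with independent events, a 60% chance of the correct call, and errors spread evenly over the three wrong bases; it illustrates the principle rather than the published analysis, which works from the underlying current signals.

import random
from collections import Counter

# Majority vote over repeated, independent "events", each reporting the
# correct base with probability 0.6 and otherwise a uniformly chosen wrong
# base (a deliberately crude error model).
random.seed(0)
BASES = "ACGT"
P_CORRECT = 0.6

def call_base(true_base, n_events):
    events = []
    for _ in range(n_events):
        if random.random() < P_CORRECT:
            events.append(true_base)
        else:
            events.append(random.choice([b for b in BASES if b != true_base]))
    return Counter(events).most_common(1)[0][0]

trials = 100_000
for n_events in (1, 15):
    correct = sum(call_base("A", n_events) == "A" for _ in range(trials))
    print(f"{n_events:2d} event(s): accuracy ~{correct / trials:.3f}")
# -> ~0.6 for single events and well above 0.98 for 15 events under this
#    crude model; averaging the analog signals, as in the published work,
#    does better still.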

The cyclic fluorescent base extension used on microscopic polonies, as described above, can be scaled down further to single molecules. Solexa is developing a technology in which millions of single DNA molecules are sequenced in parallel using a sequencing-by-synthesis protocol: a sequencing primer is annealed to the DNA templates and single-base extensions are performed using nucleotides carrying reversible terminators and dye labels. Two of the founders of Solexa have published a peer-reviewed paper demonstrating that single molecules can be detected with an impressive signal-to-noise ratio [ref]. The Quake group has demonstrated that sequence information can be obtained from single DNA molecules using serial single-base extensions, with energy transfer used to improve the signal-to-noise ratio [Bra03]. Watt Webb's group at Cornell has recently proposed a strategy in which DNA sequence is acquired by detecting, in real time, the incorporation of dye-labeled nucleotide triphosphates by DNA polymerase. They achieve this by extending a sequencing primer with fluorescent nucleotides inside a nanofabricated structure that they have termed a "zero-mode waveguide". Because only a zeptoliter-scale volume of the reaction is excited by the laser, in principle only the fluorescent triphosphates that reside in the polymerase's active site are detected [Lev03]. The Genovoxx team has shown that standard optics can suffice for such measurements and has given details of one class of reversible terminators [Fag04].

It is important to note that many of the advantages of single-molecule approaches can also be achieved after a small amount of nucleic acid amplification. In particular, analysing molecules individually, whether amplified or not, allows the determination of combinations of features that are hard to disentangle in pools of molecules. For example, alternative RNA splicing contributes extensively to protein diversity and regulation but is poorly assayed by pooled RNAs on microarrays, whereas amplified single molecules allow accurate measurement of over 1,000 alternative splice forms of RNAs such as CD44 [Zhu03]. Similarly, haplotype (or diploid genotype) combinations of SNPs can be determined accurately from single cells or single DNA molecules [Mit03A]. The ability to amplify large DNAs by multiple displacement amplification (MDA) or whole-genome amplification (WGA) is also improving rapidly [Dea02][Nel03], which will enhance our ability to obtain complete sequences from single cells, even when the cells are dead or cannot be grown in culture [Sor04][Roo04].
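
The combinatorics behind the splice-form example are simple; the sketch below assumes roughly ten independently included cassette exons, which is an illustrative model rather than a description of the actual splicing rules of CD44.

# If each of n cassette exons is independently included or skipped, the
# number of distinct isoforms is 2**n. Pooled measurements report only the
# per-exon inclusion rates; single molecules report the full pattern.
n_cassette_exons = 10
print(2 ** n_cassette_exons)        # 1024, i.e. "over 1,000" splice forms

# A single-molecule read yields one complete inclusion pattern, for example:
example_pattern = (1, 0, 1, 1, 0, 0, 1, 0, 0, 1)
print(sum(example_pattern), "of", n_cassette_exons, "variable exons included")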

Technologies compared

A key consideration is cost per base. Accuracy goals will depend on the application, ranging from 21-base RNA tags [Sah02] to nearly error-free genomes (...
