My first university biology classes begin in September. In preparation, but increasingly out of curiosity, I've been studying some basic biology. That is to say, I know about as much about DNA replication as the Khan Academy videos teach.
Presently, I'm reading about the two efforts to sequence the genome. From what I've read, Collins' project used "polymerase catalysis and nucleotide labeling", whereas Venter's project used "shotgun sequencing". I didn't know enough to understand the article I read on "shotgun sequencing" and couldn't find one that described the other technique.
I suspect that, in order to comprehend a concise description of the two techniques, I'd need to learn more than I could learn from an answer to this question. That said, it'd still be nice to have a very rough idea of what each process involves and how they differ from each other. A crude analogy to something concrete would be great.
The public effort, led by Collins, relied on a physical map of each chromosome. Very large pieces of genomic DNA were subcloned into cloning vectors and used to create genomic libraries in cosmids, BACs and YACs. The subclones were ordered using hybridization probes from either known genes or RFLP genetic markers. In this way a physical map of each chromosome was built up, and the project would then march down the ordered clones, sequencing the subclones. If the physical map is good, then you have a very good idea of where each sequence comes from.
Shotgun sequencing, in contrast, involves randomly fragmenting the entire genome into small fragments (using sonication, for example), subcloning those pieces and sequencing everything.
In the first method each base is sequenced a minimal number of times, whereas in the second approach each base ends up being sequenced many times (on average).
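As a rough worked example of "sequenced many times on average": average coverage is the number of reads times the read length, divided by the genome size. The read count and read length below are made-up illustration values, not figures from either project, and the uncovered-fraction estimate assumes reads land uniformly at random, which real data only approximate.

```python
import math

def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size

def expected_uncovered_fraction(coverage: float) -> float:
    """Fraction of bases expected to be missed entirely if reads land
    uniformly at random (Poisson approximation)."""
    return math.exp(-coverage)

# Hypothetical numbers chosen only for illustration.
genome_size = 3_200_000_000   # bases, roughly a human-sized genome
read_length = 500             # bases per read (assumed)
num_reads = 32_000_000        # assumed

c = mean_coverage(num_reads, read_length, genome_size)
print(f"coverage = {c:.1f}x, uncovered = {expected_uncovered_fraction(c):.4f}")
```

Even at several-fold average coverage a small fraction of bases is never read, which is one reason a random approach has to oversample.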
For a whole genome, shotgun sequencing requires an excellent computer algorithm to "assemble" the resulting data into overlapping contigs.
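A crude analogy for assembly: shred many copies of the same book at random positions, then rebuild the text by finding strips whose ends overlap. The sketch below is a toy greedy version of that idea, not the algorithm used by either genome project; the function names, the `min_len` threshold and the example reads are all assumptions made for illustration. Real assemblers also have to cope with sequencing errors, repeats and reads from both strands.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap.
    Returns the contigs left when no overlap of min_len remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

# Three overlapping toy reads, deliberately given out of order.
reads = ["TCCGATAG", "AGCTTAGG", "AGGCATCC"]
print(greedy_assemble(reads))
```

With these reads the overlaps are unambiguous, so the three fragments collapse into a single contig. In a real genome, repeated sequences make many overlaps ambiguous, which is why the assembly software matters so much.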
PCR (polymerase chain reaction) and shotgun sequencing are two somewhat complementary approaches to sequencing. In the former case one uses a chemical reaction to create multiple copies of the genome of interest, which helps reduce the error in sequencing this genome. The latter is done by sequencing the material directly, i.e., without any genome amplification.
PCR is more suitable when we are studying a particular genome or a few genomes, e.g., when sequencing the genome of a particular person or a particular type of organism. However, when some very different genomes are present, the amplification might be very uneven, and may in fact introduce additional errors. This is why in some situations one prefers shotgun sequencing, e.g., when studying a microbiome containing many different types of bacteria, archaea, viruses, etc.
Both terms refer to the so-called High-Throughput Sequencing (HTS). You could learn a bit more about sequencing technologies from this popular science article.
Enhancing the accuracy of next-generation sequencing for detecting rare and subclonal mutations
The ability to identify low-frequency genetic variants among heterogeneous populations of cells or DNA molecules is important in many fields of basic science, clinical medicine and other applications, yet current high-throughput DNA sequencing technologies have an error rate between 1 per 100 and 1 per 1,000 base pairs sequenced, which obscures their presence below this level.
As next-generation sequencing technologies evolved over the decade, throughput has improved markedly, but raw accuracy has remained generally unchanged. Researchers with a need for high accuracy developed data filtering methods and incremental biochemical improvements that modestly improve low-frequency variant detection, but background errors remain limiting in many fields.
The most profoundly impactful means for reducing errors, first developed approximately 7 years ago, has been the concept of single-molecule consensus sequencing. This entails redundant sequencing of multiple copies of a given specific DNA molecule and discounting of variants that are not present in all or most of the copies as likely errors.
Consensus sequencing can be achieved by labelling each molecule with a unique molecular barcode before generating copies, which allows subsequent comparison of these copies or schemes whereby copies are physically joined and sequenced together. Because of trade-offs in cost, time and accuracy, no single method is optimal for every application, and each method should be considered on a case-by-case basis.
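The barcode-based variant of consensus sequencing can be sketched as follows. This is a minimal illustration under stated assumptions, not any published pipeline: reads arrive as (barcode, sequence) pairs, copies sharing a barcode have equal length, and the thresholds `min_copies` and `min_agreement` are hypothetical defaults.

```python
from collections import Counter, defaultdict

def consensus_by_barcode(reads, min_copies=3, min_agreement=0.6):
    """reads: iterable of (barcode, sequence) pairs. Sequences sharing a
    barcode are treated as copies of one original molecule.
    Returns {barcode: consensus}, with 'N' where the copies disagree too much."""
    families = defaultdict(list)
    for barcode, seq in reads:
        families[barcode].append(seq)

    result = {}
    for barcode, seqs in families.items():
        if len(seqs) < min_copies:
            continue  # too few copies to separate errors from real variants
        consensus = []
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            consensus.append(base if count / len(seqs) >= min_agreement else "N")
        result[barcode] = "".join(consensus)
    return result

reads = [
    ("AAT", "ACGT"), ("AAT", "ACGT"), ("AAT", "ACTT"),  # one copy has an error
    ("GGC", "TTGA"), ("GGC", "TTGA"), ("GGC", "TTGA"),
    ("CCA", "GGGG"),                                    # single copy, skipped
]
print(consensus_by_barcode(reads))
```

The lone `T` in the third copy of the first family is outvoted by the other two copies, which is the sense in which variants absent from most copies are discounted as likely errors.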
Major applications for high-accuracy DNA sequencing include non-invasive cancer diagnostics, cancer screening, early detection of cancer relapse or impending drug resistance, infectious disease applications, prenatal diagnostics, forensics and mutagenesis assessment.
Future advances in ultra-high-accuracy sequencing are likely to be driven by an emerging generation of single-molecule sequencers, particularly those that allow independent sequence comparison of both strands of native DNA duplexes.
The genome contains all the information an organism needs to exist, reproduce, and evolve. The human genome, for example, contains 3.2 billion bases. If we use the analogy of a book, then it would contain 3.2 billion letters (the bases A, C, G, T) without spaces divided into 46 chapters (the number of chromosomes), which would make it around 70 million letters (bp) per chapter. To put this into context, this article contains approximately 58 000 letters. When an organism dies, the combination of time and environment will break or alter the string of letters, making it hard to read and understand. Until the introduction of next-generation sequencing (NGS), we were often able to read a few of the larger remaining sentences, and only after a great deal of effort and expense. However, as technology has progressed, we are now much more easily able to read very short sentences, and combine these into complete chapters and sometimes the entire book. The progress made in the last decade is staggering: the field has gone from recovering hundreds of bp to hundreds of millions of bp. This advance in technology not only has enabled us to answer more complex questions, but also has led to an overload of information, not all of it useful.
2 - Genomes and Variants
One of the defining achievements of the early 21st century is the sequencing and alignment of more than 90% of the human genome. Of course, there is not a single human genome: individuals differ from each other by about 0.1% and from other primates by about 1%. Variation comes in many different forms, including single base changes and copy number changes in large segments of DNA. Even more challenging than sequencing the whole genome is documenting and understanding the clinical significance of human sequence variation. We are still very early in our understanding of the human genome.
Beginning with a historical perspective, the structure of the human genome is described in detail followed by comparison to other interesting species. Then different types of genomic variation are covered, including single base changes (substitutions, deletions, insertions), copy number variations, translocations and fusions, short tandem repeats of different size and number, and larger repetitive segments, some of which can hop around the genome as transposons. The function of different genomic elements is considered along with many different classes of RNA transcribed from the DNA. How to name all the different genes, variants, and elements is a daunting task, and accepted nomenclature is presented. Many databases are available to mine accumulated genomic information. We end with a description of basic informatics tools that provide a pipeline from the raw data of massively parallel DNA sequencing to finished sequence with annotations on the variations that are observed.
Technology advances: deep sequencing + dense bones = paleogenomics
It has long been realized that performing archaeogenetics research correctly is extremely difficult [38, 45, 46, 48,49,50]. However, by the same token, during the last three decades the challenging nature of aDNA research has spurred significant technical innovation and rapid deployment of state-of-the-art genomics and ancillary technologies [46, 50, 88,89,90,91,92,93]. Undoubtedly, the most important scientific advance was the introduction of high-throughput sequencing (HTS) to archaeogenetics [94,95,96,97]. High-throughput sequencing technologies have been commercially available since 2005, and between 2007 and 2019 there has been an almost 100,000-fold reduction in the raw, per-megabase (Mb) cost of DNA sequencing. Currently, the dominant commercial HTS technology is based on massively parallel sequencing-by-synthesis of relatively short DNA segments [100, 101], which is ideally suited to fragmented aDNA molecules extracted from archaeological and museum specimens. In addition, the vast quantities of sequence data generated, literally hundreds of gigabases (Gb) from a single instrument run, can facilitate cost-effective analyses of archaeological specimens containing relatively modest amounts of endogenous aDNA (for technical reviews see [89,90,91,92,93, 102]).
The introduction of HTS and ancillary specialized methods for sample treatment, aDNA extraction, purification and library preparation has represented a genuinely transformative paradigm shift in archaeogenetics. It has ushered in the era of paleogenomics and the capacity to robustly genotype, analyze and integrate SNP data from thousands of genomic locations in purified aDNA from human and animal subfossils [103,104,105,106,107,108,109,110,111,112,113]. In a comparable fashion to human archaeogenetics, the first HTS paleogenomics studies of domestic animals or related species were focused on a single or a small number of “golden samples” [10, 69, 109, 114, 115].
One of the first HTS studies directly relevant to domestic animals was a technical tour de force which pushed the time frame for retrieval of aDNA and reconstruction of paleogenomes beyond 500 kya to the early stages of the Middle Pleistocene. In this study, Ludovic Orlando and colleagues were able to generate a 1.12× coverage genome from a horse bone excavated from permafrost at the Thistle Creek site in north-western Canada and dated to approximately 560–780 kya. Using this Middle Pleistocene horse genome in conjunction with another ancient genome from a 43 kya Late Pleistocene horse, and genome sequence data from Przewalski’s horse (Equus ferus przewalskii), the donkey (Equus asinus) and a range of modern horses, these authors showed that all extant equids shared a common ancestor at least four million years ago (mya), which is twice the previously accepted age for the Equus genus. They also showed that the demographic history of the horse has been profoundly impacted by climate history, particularly during warmer periods such as the interval after the LGM (Fig. 1), when population numbers retracted dramatically in the 15 millennia prior to domestication 5.5 kya. Finally, by focusing on genomic regions exhibiting unusual patterns of derived mutations in domestic horses, it was possible to tentatively identify genes that may have been subject to human-mediated selection during and after domestication.
The origins of the domestic dog (C. familiaris) and the dispersal of dogs across the globe during the Late Pleistocene and Holocene periods have been extremely contentious, particularly as population genetic, archaeogenetic and paleogenomic data sets have accumulated during the last two decades [8, 116, 117]. Again, like the Thistle Creek horse bone, a small number of key subfossil specimens have provided critical paleogenomic evidence concerning the evolutionary origins of domestic dogs and their genetic relationships with Late Pleistocene Eurasian wolf populations [10, 11, 115]. Pontus Skoglund and colleagues were able to generate a low coverage (~1×) nuclear genome from a 35 kya wolf (C. lupus) from the Taimyr Peninsula in northern Siberia. Analysis of this Taimyr specimen with WGS data from modern canids showed that this ancient wolf belonged to a population that was genetically close to the ancestor of modern gray wolves and dogs. The results supported a scenario whereby the ancestors of domestic dogs diverged from wolves by 27 kya, with domestication happening at some point subsequent to that event. In addition, this study provided compelling evidence that high-latitude dog breeds such as the Siberian Husky trace some of their ancestry back to the extinct wolf population represented by the Taimyr animal.
Another important paleogenome study, published one year after the Taimyr wolf paper, described a high coverage (~28×) nuclear genome from a late Neolithic (4.8 kya) domestic dog specimen from Newgrange, a monumental passage grave tomb in eastern Ireland. Analyses of the ancient Newgrange dog genome, additional mtDNA genomes from ancient European dogs and modern wolf and dog genome-wide SNP data suggested that dogs were domesticated independently in the Late Pleistocene from distinct East and West Eurasian wolf populations and that East Eurasian dogs, migrating alongside humans at some time between 6.4 and 14 kya, partially replaced indigenous European dogs. In 2017, following publication of the Newgrange dog genome, Laura Botigué and colleagues generated two ~9× coverage domestic dog nuclear genomes from Early (Herxheim, ~7 kya) and Late (Cherry Tree Cave, ~4.7 kya) Neolithic sites in present-day Germany. Comparison of these two ancient dog genomes with almost 100 modern canid whole genomes and a large genome-wide SNP data set of modern dogs and wolves did not support the dual domestication hypothesis proposed by Frantz et al. one year earlier, or the suggested East Eurasian partial replacement of Late Paleolithic or Early Neolithic European dogs.
The origins and fate of the domestic dog populations of the Americas prior to contact with European and African peoples have been the subject of a recent paleogenomics study involving comparisons of ancient and modern dogs. Máire Ní Leathlobhair and colleagues sequenced 71 mitochondrial and seven nuclear genomes from ancient North American and Siberian dogs. Comparative population genomics analyses of these data demonstrated that the first American domestic dogs did not trace their ancestry to American wolves. Instead, these pre-contact American dogs (PCDs) represent a distinct lineage that migrated from northeast Asia across the Beringian Steppe with humans more than 10 kya. These analyses also demonstrated that PCD populations were almost completely replaced by European dogs due to large-scale colonization of North and South America within the last 500 years. In a similar fashion to the post-contact human demographic transition in the Americas [119, 120], the authors hypothesize that infectious disease likely played a major role in the replacement of PCDs by European dogs. Finally, they also show that the genome of the canine transmissible venereal tumor (CTVT) cancer lineage, which has evolved to become an obligate conspecific asexual parasite, is the closest genomic relative of the first American dogs.
As has been previously noted, understanding the origins and early domestic history of dogs has been complicated by population bottlenecks, expansions, local extinctions and replacements and geographically localized gene flow among wolves and dogs and genetically distinct dog populations. It will, therefore, require systematic large-scale retrieval and analysis of ancient wolf and dog genomes across space and time to accurately reconstruct the evolutionary history of the first animal domesticate. However, this and similar undertakings for other domestic species will be greatly facilitated by another recent technical breakthrough that is described below.
In 2014, a team of Irish geneticists and archaeologists showed that the petrous portion of the temporal bone (the densest bone in the mammalian skeleton) produced the highest yields of endogenous DNA: in some cases, up to 183-fold higher than other skeletal elements. The impact of this discovery has been such that the ancient DNA community now dub the period prior to 2014 “BP” (“before petrous”). During the last 5 years, DNA extraction from petrous bones, coupled with constantly improving HTS and ancillary technologies, has led to a dramatic scale-up of human archaeogenetics, the cutting edge of which is now the statistically rigorous field of high-resolution population paleogenomics [82, 125,126,127,128,129]. Another notable outcome has been a substantial increase in the proportion of the Earth’s surface area where archaeological excavation can uncover suitable material for successful aDNA extraction and paleogenomics analysis. Previously, for the most part, aDNA research has been confined to regions of the globe where climate and topography were conducive to taphonomic preservation of skeletal DNA (Fig. 3) [90, 130]. However, in recent years human paleogenomics studies have been successfully conducted using samples from arid, subtropical and even tropical zones [131,132,133,134,135,136,137,138,139,140,141,142].
Fig. 3 Geography of archaeological DNA survival prior to the discovery of high endogenous DNA content in the mammalian petrous bone. a Expected DNA survival after 10,000 years for 25-bp fragments and 150-bp fragments close to the ground surface (modified with permission). b Illustration of a sheep (Ovis aries) petrous bone retrieved from a Middle Neolithic site at Le Peuilh, France (modified with permission)
2010 marks the 10th anniversary of the completion of the first plant genome sequence (Arabidopsis thaliana). Triggered by advancements in sequencing technologies, many crop genome sequences have been produced, with eight published since 2008. To date, however, only the rice (Oryza sativa) genome sequence has been finished to a quality level similar to that of the Arabidopsis sequence. This trend to produce draft genomes could affect the ability of researchers to address biological questions of speciation and recent evolution or to link sequence variation accurately to phenotypes. Here, we review the current crop genome sequencing activities, discuss how variability in sequence quality impacts utility for different studies and provide a perspective for a paradigm shift in selecting crops for sequencing in the future.
In this report, we have described how NGS data are collected and analyzed. We showed that the mechanism of NGS is not a fundamental departure from its predecessor, but rather an improved and scaled version of Sanger sequencing that allows for a staggering increase in data quality and throughput. We argue that minimum depth is a better reflection of a test’s variant-call confidence than average depth, and demonstrate that SNPs, indels, and del/dups can be confidently identified using intuitive analysis techniques. Our primary hope is that we can make NGS-based genetic tests more accessible to patients by making the inner workings of the technology itself more accessible to practitioners of genetic medicine.
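The difference between average and minimum depth is easy to see with a small calculation. The per-base depths below are invented for illustration and do not come from the report.

```python
def depth_summary(depths):
    """Return (average depth, minimum depth) for a list of per-base read depths."""
    return sum(depths) / len(depths), min(depths)

# Hypothetical per-base depths across a short target region.
region = [60, 58, 61, 59, 4, 5, 60, 62, 57, 64]
avg, low = depth_summary(region)
print(f"average depth = {avg:.1f}x, minimum depth = {low}x")
```

Here the average looks comfortable while two positions are covered by only a handful of reads, so a variant at those positions would be called with far less confidence than the average suggests.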
Hi-C, a method for quantifying long-range physical interactions in the genome, was introduced by Lieberman-Aiden et al., and it was reviewed in Dekker et al. A Hi-C assay produces a so-called genome contact matrix, which – at a given resolution determined by sequencing depth – measures the degree of interaction between two loci in the genome. In the last 5 years, significant efforts have been made to obtain Hi-C maps at ever increasing resolutions [3–8]. Currently, the highest resolution maps are 1 kb. Existing Hi-C experiments have largely been performed in cell lines or for samples where unlimited input material is available.
In Lieberman-Aiden et al., it was established that at the megabase scale, the genome is divided into two compartments, called A/B compartments. Interactions between loci are largely constrained to occur between loci belonging to the same compartment. The A compartment was found to be associated with open chromatin and the B compartment with closed chromatin. Lieberman-Aiden et al. also showed that these compartments are cell-type specific, but did not comprehensively describe differences between cell types across the genome. In most subsequent work using the Hi-C assay, the A/B compartments have received little attention; the focus has largely been on describing smaller domain structures using higher resolution data. Recently, it was shown that 36 % of the genome changes compartment during mammalian development and that these compartment changes are associated with gene expression; they conclude “that the A and B compartments have a contributory but not deterministic role in determining cell-type-specific patterns of gene expression”.
The A/B compartments are estimated by an eigenvector analysis of the genome contact matrix after normalization by the observed–expected method. Specifically, boundary changes between the two compartments occur where the entries of the first eigenvector change sign. The observed–expected method normalizes bands of the genome contact matrix by dividing by their mean. This effectively standardizes interactions between two loci separated by a given distance by the average interaction between all loci separated by the same amount. It is critical that the genome contact matrix is normalized in this way, for the first eigenvector to yield the A/B compartments.
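The two steps described above (band-wise observed–expected normalization, then the sign pattern of the first eigenvector) can be sketched on a toy matrix. This is a simplified illustration and not the authors' implementation: the function names and the toy data are invented, the normalized matrix is merely centred on 1 rather than converted to a correlation matrix as is common practice, and plain power iteration stands in for a proper eigen-solver.

```python
import math

def observed_expected(contacts):
    """Divide each diagonal band of a symmetric contact matrix by its mean."""
    n = len(contacts)
    oe = [[0.0] * n for _ in range(n)]
    for d in range(n):
        band = [contacts[i][i + d] for i in range(n - d)]
        mean = sum(band) / len(band)
        for i in range(n - d):
            value = contacts[i][i + d] / mean if mean > 0 else 1.0
            oe[i][i + d] = oe[i + d][i] = value
    return oe

def first_eigenvector(matrix, iterations=500):
    """Plain power iteration; adequate for a toy matrix, not for real data."""
    n = len(matrix)
    v = [float(i + 1) for i in range(n)]  # arbitrary non-uniform start vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def compartment_signs(contacts):
    """Return +1/-1 per bin; compartment boundaries are where the sign flips."""
    oe = observed_expected(contacts)
    centred = [[x - 1.0 for x in row] for row in oe]  # O/E of 1 means "as expected"
    return [1 if x >= 0 else -1 for x in first_eigenvector(centred)]

# Toy data: bins 0-2 in one compartment, bins 3-5 in the other, with contacts
# decaying with distance and enriched fourfold within a compartment.
labels = [0, 0, 0, 1, 1, 1]
toy = [[(2.0 if labels[i] == labels[j] else 0.5) / (1 + abs(i - j))
        for j in range(6)] for i in range(6)]
signs = compartment_signs(toy)
print(signs)
```

The sign of an eigenvector is arbitrary, so the sketch only separates the bins into two groups; deciding which group is "A" and which is "B" has to come from outside information.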
Open and closed chromatin can be defined in different ways using different assays such as DNase hypersensitivity or chromatin immunoprecipitation (ChIP) sequencing for various histone modifications. While Lieberman-Aiden et al. established that the A compartment is associated with open chromatin profiles from various assays, including DNase hypersensitivity, it was not determined to what degree these different data types measure the same underlying phenomena, including whether the domain boundaries estimated using different assays coincide genome-wide.
In this manuscript, we show that we can reliably estimate A/B compartments as defined using Hi-C data by using Illumina 450 k DNA methylation microarray data as well as DNase hypersensitivity sequencing [10, 11], single-cell whole-genome bisulfite sequencing (scWGBS) and single-cell assay for transposase-accessible chromatin (scATAC) sequencing. Data from the first two assays are widely available for a large number of cell types. In particular, the 450 k array has been used to profile a large number of primary samples, including many human cancers; more than 20,000 samples are readily available through the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). We show that our methods can recover cell-type differences. This work makes it possible to study A/B compartments comprehensively across many cell types, including primary samples, and to investigate further the relationship between genome compartmentalization and transcriptional activity or other functional readouts.
As an application, we show how the somatic mutation rate in prostate adenocarcinoma (PRAD) is different between compartments and we show how the A/B compartments change between several human cancers; currently TCGA does not include assays measuring chromatin accessibility. Furthermore, our work reveals unappreciated aspects of the structure of long-range correlations in DNA methylation and DNase hypersensitivity data. Specifically, we observe that both DNA methylation and the DNase signal are highly correlated between distant loci, provided that the two loci are both in the closed compartment.
Before investigating the differences between the two machines, we wanted to rule out the possibility that the increased proportion of duplicates detected by the HiSeq 4000 was simply a result of sequencing more di-tags – i.e. the more a sample was sequenced the higher the probability that any given read was a duplicate. To check for this, the HiSeq 4000 FASTQ file was randomly downsized to the same number of reads as found in the file generated by the HiSeq 2500. During HiCUP processing 25% of the di-tags were now discarded during the de-duplication step, still far more than the 2% discarded when processing the HiSeq 2500 data.
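Random downsizing of a FASTQ file can be done in a single pass with reservoir sampling. The sketch below is one possible way to do it and is not the procedure the authors describe using; the function names and the fixed seed are assumptions for illustration.

```python
import random

def read_fastq(lines):
    """Group an iterable of lines into 4-line FASTQ records."""
    it = iter(lines)
    return zip(it, it, it, it)

def downsample(records, target, seed=0):
    """Reservoir-sample `target` records from an iterable, so the whole
    file never needs to be held in memory."""
    rng = random.Random(seed)
    reservoir = []
    for n, record in enumerate(records):
        if n < target:
            reservoir.append(record)
        else:
            k = rng.randint(0, n)  # uniform over all records seen so far
            if k < target:
                reservoir[k] = record
    return reservoir

lines = ["@r1", "ACGT", "+", "IIII", "@r2", "TTTT", "+", "IIII",
         "@r3", "GGGG", "+", "IIII"]
subset = downsample(read_fastq(lines), target=2)
print(len(subset))
```

Because the choices depend only on the record index and the seed, running the same function with the same seed over the two mate files of a paired-end run should keep the pairs in step, provided both files list reads in the same order.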
To investigate the possible cause of the duplication we analysed the spatial distribution of duplicate di-tags on the flow cells. For both machines the duplicates were scattered in a uniform manner and did not show significant duplication “hotspots”. While duplicates were not localised to particular regions of a flow cell, it was still possible that, in general, duplicate di-tags may have co-localised with their exact copies. To test this hypothesis we identified di-tags present in two copies and recorded whether they mapped to one or two tiles (each Illumina flow cell comprises multiple tiles). Significantly, 1% of HiSeq 2500 duplicates comprised di-tags originating from the same tile. In contrast, 92% of the duplicate pairs were located on a single tile for the HiSeq 4000. This close proximity suggests that the duplicates observed on the HiSeq 4000 were largely machine-specific artefacts.
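The same-tile check can be sketched from read headers alone, assuming headers follow the common Illumina layout `@instrument:run:flowcell:lane:tile:x:y` and that the duplicate pairs have already been identified elsewhere. The function names and example headers are invented for illustration.

```python
def cluster_position(header):
    """Parse (lane, tile, x, y) from a header of the form
    @instrument:run:flowcell:lane:tile:x:y [optional comment]."""
    fields = header.lstrip("@").split()[0].split(":")
    lane, tile, x, y = (int(f) for f in fields[3:7])
    return lane, tile, x, y

def same_tile_fraction(duplicate_pairs):
    """duplicate_pairs: list of (header_a, header_b) for di-tags seen twice.
    Returns the share of pairs whose two copies sit on the same lane and tile."""
    same = 0
    for a, b in duplicate_pairs:
        if cluster_position(a)[:2] == cluster_position(b)[:2]:
            same += 1
    return same / len(duplicate_pairs)

# Made-up headers: the first pair shares tile 1101, the second does not.
a = "@HWI:1:FCX:2:1101:1000:2000 1:N:0:ACGT"
b = "@HWI:1:FCX:2:1101:1050:2010 1:N:0:ACGT"
c = "@HWI:1:FCX:2:2204:900:100 1:N:0:ACGT"
print(same_tile_fraction([(a, b), (a, c)]))
```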
To further characterise this two-dimensional separation, we extracted duplicates which localised to only one tile and then recorded the relative position of a di-tag to its exact duplicate (this is possible because FASTQ files record the coordinates of each cluster). The figures below show these findings as density plots (for each di-tag, one read was specified as the origin, and the chart shows the relative position of the “other ends” to the origin).
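The relative-position calculation behind such density plots is a simple subtraction of cluster coordinates. In this sketch the coordinates and the `radius` cut-off are hypothetical values, and the input is assumed to be same-tile duplicate pairs whose (x, y) coordinates have already been extracted.

```python
def relative_offsets(duplicate_pairs):
    """For same-tile duplicate pairs given as ((x1, y1), (x2, y2)), treat the
    first cluster as the origin and return the offset of the other end."""
    return [(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in duplicate_pairs]

def fraction_near_origin(offsets, radius):
    """Share of offsets within `radius` (FASTQ coordinate units) of the origin."""
    near = sum(1 for dx, dy in offsets if dx * dx + dy * dy <= radius * radius)
    return near / len(offsets)

# Made-up coordinates: two pairs sit almost on top of each other, one is far apart.
pairs = [((1000, 2000), (1010, 1995)),
         ((500, 500), (15000, 9000)),
         ((300, 400), (303, 404))]
offsets = relative_offsets(pairs)
print(offsets, fraction_near_origin(offsets, radius=50))
```

Plotting the offsets as a two-dimensional density would give the kind of picture described in the text, with a peak at the origin if duplicates cluster next to their copies.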
For the HiSeq 2500 there is, in general, a uniform distribution across the plot, except for a high-density region in close proximity to the origin. This elevated density around the origin is much more pronounced when analysing the HiSeq 4000 data, in which almost all the other-ends localise to this region. We hypothesise that the other-ends positioned far away from the origin are true biological duplicates or experimental PCR duplicates. In contrast, those other-ends close to the origin are more likely to be generated by the machine itself. Again, this is indicative of the HiSeq 4000 generating more duplication artefacts.
We then investigated whether such duplicates on the HiSeq 4000 were confined to adjacent nanowells, or multiple nanowells in the same local region of a flow cell. Although we were unable to obtain direct information relating the FASTQ coordinate system to individual nanowells, it was possible, by creating a density plot of the region immediately around the origin, to visualise the ordered array of the HiSeq 4000 flow cell. The plot clearly shows that duplicates are found in multiple wells around the origin, and this trend decreases as one moves from the origin. Also shown below is the same plot, but of the HiSeq 2500 data. As expected no nanowell pattern is visible.
The Need for Barcoding
Taxonomy of living things was created by Carl von Linné, who formalized it by using a binomial classification system to differentiate organisms. Binomial nomenclature assigns a genus and a species name to each organism to provide an identity. These days, classification of organisms is becoming increasingly important as a measurement of diversity in the face of habitat destruction and global climate change. There is no consensus on how many life forms exist on this planet, but the estimation of extinction rates is about 1 species per 100-1000 million species. Classification in Linné's day was mostly performed by morphological differences. This was carried on in fossils. However, morphology has many drawbacks, especially in sexually dimorphic species or species with multiple developmental morphologies.
Larva (top) of the Green Lacewing and the adult (bottom).
Molecular biology and DNA technologies have revolutionized the classification system of living things, especially in providing the ability to match relatedness of these species. DNA barcoding, like the name implies, seeks to utilize DNA markers to differentially identify organisms. But what DNA markers should be used? What criteria do we use to develop barcodes? Discrimination, Universality and Robustness are the criteria used to define the usefulness of barcodes.
Since the goal of barcoding is to define specific organisms, discrimination is the primary objective. Discrimination refers to the difference of sequences that occur between species. However, science is easier when there is some universality in the locus used for discrimination. As it sounds, universality is an attempt to use the same locus in disparate genomes. While discrimination is about uniqueness of sequences, universality seeks to use a single set of PCR primers that will be able to amplify that same distinct region with variable sequence similarity. If some region of DNA has absolutely no sequence deviation between species, this has great universality but poor discrimination. But if a sequence has very low sequence similarity, this is great for discrimination but has absolutely no universality and cannot be amplified with the same set of primers. Robustness refers to the reliability of PCR amplification of a region. Some regions of DNA just don't amplify well or it is too difficult to design appropriate and unique primers for that locus.
A case where there is universality for designing primers, but no region where discrimination can occur.
While discrimination of different organisms can occur in this situation, the lack of similarity in sequence would make it difficult to design primers. That is, the lack of universality in sequence would also make this PCR not robust.
Enough variability in these sequences gives us the ability to discriminate between species. The high similarity provides us the universality required to design primers that may be robust enough to amplify by PCR.
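The trade-off between universality and discrimination can be made concrete with a small scoring sketch. Everything here is a simplifying assumption for illustration: the sequences are invented and already aligned, the first and last `primer_len` bases stand in for the primer sites, and "universal" is taken to mean identical primer sites in every species. Robustness is an experimental property of the PCR and is not something this sketch can measure.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def score_locus(aligned, primer_len):
    """aligned: {species: sequence}, all the same length and already aligned."""
    seqs = list(aligned.values())
    flanks = [s[:primer_len] + s[-primer_len:] for s in seqs]
    inner = [s[primer_len:-primer_len] for s in seqs]
    worst_primer_mismatch = max(hamming(a, b) for a, b in combinations(flanks, 2))
    least_inner_difference = min(hamming(a, b) for a, b in combinations(inner, 2))
    return {
        "universal": worst_primer_mismatch == 0,       # one primer pair fits all
        "discriminating": least_inner_difference > 0,  # every species pair differs
    }

good_locus = {
    "species_1": "ACGT" + "AAGGTT" + "TGCA",
    "species_2": "ACGT" + "AAGCTT" + "TGCA",
    "species_3": "ACGT" + "ATGGTA" + "TGCA",
}
too_conserved = {name: "ACGTAAGGTTTGCA" for name in good_locus}

print(score_locus(good_locus, primer_len=4))
print(score_locus(too_conserved, primer_len=4))
```

The first locus has conserved flanks and a variable interior, which is the combination a barcode needs; the second is perfectly universal but cannot tell the species apart.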
Sometimes, species are so similar for one sequence that a second marker is required. Just as the standard UPC barcode has a series of vertical lines of different spacing and width, a 2-dimensional barcode adds that second dimension of information into a square of dots like in a QR code (Quick Response code). We can also utilize a second or a third or a fourth set of loci that will aid in increased discrimination just as CODIS utilizes multiple STR sites to define individual people. In animals, the most commonly used barcode is the mitochondrial gene, Cytochrome Oxidase I (COI). Since all animals have mitochondria and have this mitochondrial gene, it offers high universality. It is a robust locus that is easy to amplify and has high copy number with enough sequence deviation between species to discriminate between them.
Animal mitochondrial genomes vary from 16kb-22kb. However, plants, fungi and protists have wildly different and larger mitochondrial genomes. For plants, we use a chloroplast gene, ribulose-bisphosphate carboxylase large subunit (rbcL) or maturase K (matK) (Hollingsworth et al. 2011). Prokaryotes are often discriminated by their 16S rRNA gene while eukaryotes can be identified by 18S rRNA. COI (a maternally transmitted gene) will not create a clear picture of species identity in the case of hybrid animals (mules, ligers, coydogs, etc.). Sometimes, closely related species are also indistinguishable by a single barcode, so the inclusion of 18S with COI may be necessary to define the identity of the species. Since it is so difficult to meet the three criteria (robustness, universality and discrimination) for all species, having these multiple barcodes is important. Fungi prove to be difficult to identify by COI, so another marker called the internal transcribed spacer (ITS) is used to aid in their identification. We must also remember that not everything with chloroplasts is a plant, and therefore additional markers are used to identify protists.