Information

How were genes located before the development of bioinformatics?


I would like to know the detailed procedure by which scientists in earlier times located genes. For example, how was the Huntington gene located in 1983?


The earliest genome maps were constructed for Escherichia coli by exploiting conjugation in special Hfr mutants. I'm assuming you are somewhat familiar with bacterial conjugation, the process of transferring genetic material between bacterial cells.

Usually, only mobile plasmids are transferred. They carry the necessary tra-genes for the construction of the sex-pilus, replication and transfer system. They also need the oriT-site (origin of transfer), at which transfer is initiated. A famous mobile plasmid is the F-plasmid.

Some mutant cells integrated the F-plasmid into their main chromosome via recombination. Thus, they are able to transfer their whole genome to a recipient cell. These strains are called Hfr-strains (high frequency of recombination) and they were used intensely in early genome mapping.

Researchers discovered that it takes roughly 100 minutes for an Hfr strain to transfer its entire genome to another cell. By interrupting conjugation (e.g. by vigorous shaking), the transfer was stopped and the genome could be divided into sections by transfer time (10 min, 20 min, 30 min and so on).

The function of the genes (or genome sections) was revealed by mating Hfr strains with defective mutants. If the mutant was unable to grow on saccharose for example, several conjugation experiments with different mating times were conducted. When a mutant was able to grow on saccharose again after mating, the transfer time was checked. If the strain was able to grow again after 30 minutes of conjugation, but not after 20, they knew the genes necessary for saccharose growth lay in the 20 - 30 min region.
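The interval logic described above can be sketched in a few lines of Python. This is only an illustration: the function name `locate_gene_interval` and the growth results are hypothetical, and it assumes each experiment is recorded as the mating time before interruption plus whether the recipient regained the ability to grow.

```python
def locate_gene_interval(growth_by_minutes):
    """Return (last_failing_time, first_rescuing_time) in minutes.

    growth_by_minutes maps the mating time before interruption to
    whether the recipient mutant regained the phenotype (True/False).
    Returns None if no tested mating time restored growth.
    """
    previous = 0  # transfer is assumed to start at minute 0
    for minutes in sorted(growth_by_minutes):
        if growth_by_minutes[minutes]:
            return previous, minutes
        previous = minutes
    return None


# Hypothetical results for a mutant unable to grow on saccharose.
results = {10: False, 20: False, 30: True, 40: True}
print(locate_gene_interval(results))  # (20, 30)
```

With these made-up results the gene is placed in the 20 to 30 minute region, matching the reasoning in the paragraph above.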

It took a painstaking amount of experiments with a wide range of mutants to construct the first map of E. coli genes, but it was achieved before even the nucleotide sequences were known.


Just to complement Pythagyros.

Sequencing and bioinformatics were already around in 1983. Frederick Sanger and colleagues published their chain-termination method of sequencing (Sanger sequencing) in 1977. They also published the complete sequence of the bacteriophage φX174. Other methods of DNA sequencing had been around since the first one, established by Ray Wu in 1970.


Bioinformatics

Introduction

Bioinformatics, as an emerging discipline, combines mathematics, information science, and biology and helps answer biological questions. The word ‘bioinformatics’ was first used in 1968 and its definition was first given in 1978. Bioinformatics has also been referred to as ‘computational biology’. However, strictly speaking, computational biology deals mainly with modeling of biological systems. The main components of bioinformatics are (1) the development of software tools and algorithms and (2) the analysis and interpretation of biological data by using a variety of software tools and particular algorithms.





The term "metagenomics" was first used by Jo Handelsman, Jon Clardy, Robert M. Goodman, Sean F. Brady, and others, and first appeared in publication in 1998. [5] The term metagenome referenced the idea that a collection of genes sequenced from the environment could be analyzed in a way analogous to the study of a single genome. In 2005, Kevin Chen and Lior Pachter (researchers at the University of California, Berkeley) defined metagenomics as "the application of modern genomics technique without the need for isolation and lab cultivation of individual species". [6]

Conventional sequencing begins with a culture of identical cells as a source of DNA. However, early metagenomic studies revealed that there are probably large groups of microorganisms in many environments that cannot be cultured and thus cannot be sequenced. These early studies focused on 16S ribosomal RNA (rRNA) sequences which are relatively short, often conserved within a species, and generally different between species. Many 16S rRNA sequences have been found which do not belong to any known cultured species, indicating that there are numerous non-isolated organisms. These surveys of ribosomal RNA genes taken directly from the environment revealed that cultivation based methods find less than 1% of the bacterial and archaeal species in a sample. [2] Much of the interest in metagenomics comes from these discoveries that showed that the vast majority of microorganisms had previously gone unnoticed.

Early molecular work in the field was conducted by Norman R. Pace and colleagues, who used PCR to explore the diversity of ribosomal RNA sequences. [7] The insights gained from these breakthrough studies led Pace to propose the idea of cloning DNA directly from environmental samples as early as 1985. [8] This led to the first report of isolating and cloning bulk DNA from an environmental sample, published by Pace and colleagues in 1991 [9] while Pace was in the Department of Biology at Indiana University. Considerable efforts ensured that these were not PCR false positives and supported the existence of a complex community of unexplored species. Although this methodology was limited to exploring highly conserved, non-protein coding genes, it did support early microbial morphology-based observations that diversity was far more complex than was known by culturing methods. Soon after that, Healy reported the metagenomic isolation of functional genes from "zoolibraries" constructed from a complex culture of environmental organisms grown in the laboratory on dried grasses in 1995. [10] After leaving the Pace laboratory, Edward DeLong continued in the field and has published work that has largely laid the groundwork for environmental phylogenies based on signature 16S sequences, beginning with his group's construction of libraries from marine samples. [11]

In 2002, Mya Breitbart, Forest Rohwer, and colleagues used environmental shotgun sequencing (see below) to show that 200 liters of seawater contains over 5000 different viruses. [12] Subsequent studies showed that there are more than a thousand viral species in human stool and possibly a million different viruses per kilogram of marine sediment, including many bacteriophages. Essentially all of the viruses in these studies were new species. In 2004, Gene Tyson, Jill Banfield, and colleagues at the University of California, Berkeley and the Joint Genome Institute sequenced DNA extracted from an acid mine drainage system. [13] This effort resulted in the complete, or nearly complete, genomes for a handful of bacteria and archaea that had previously resisted attempts to culture them. [14]

Beginning in 2003, Craig Venter, leader of the privately funded parallel of the Human Genome Project, has led the Global Ocean Sampling Expedition (GOS), circumnavigating the globe and collecting metagenomic samples throughout the journey. All of these samples are sequenced using shotgun sequencing, in hopes that new genomes (and therefore new organisms) would be identified. The pilot project, conducted in the Sargasso Sea, found DNA from nearly 2000 different species, including 148 types of bacteria never before seen. [15] Venter has circumnavigated the globe and thoroughly explored the West Coast of the United States, and completed a two-year expedition to explore the Baltic, Mediterranean and Black Seas. Analysis of the metagenomic data collected during this journey revealed two groups of organisms, one composed of taxa adapted to environmental conditions of 'feast or famine', and a second composed of relatively fewer but more abundantly and widely distributed taxa primarily composed of plankton. [16]

In 2005 Stephan C. Schuster at Penn State University and colleagues published the first sequences of an environmental sample generated with high-throughput sequencing, in this case massively parallel pyrosequencing developed by 454 Life Sciences. [17] Another early paper in this area appeared in 2006 by Robert Edwards, Forest Rohwer, and colleagues at San Diego State University. [18]

Recovery of DNA sequences longer than a few thousand base pairs from environmental samples was very difficult until recent advances in molecular biological techniques allowed the construction of libraries in bacterial artificial chromosomes (BACs), which provided better vectors for molecular cloning. [20]

Shotgun metagenomics

Advances in bioinformatics, refinements of DNA amplification, and the proliferation of computational power have greatly aided the analysis of DNA sequences recovered from environmental samples, allowing the adaptation of shotgun sequencing to metagenomic samples (known also as whole metagenome shotgun or WMGS sequencing). The approach, used to sequence many cultured microorganisms and the human genome, randomly shears DNA, sequences many short sequences, and reconstructs them into a consensus sequence. Shotgun sequencing reveals genes present in environmental samples. Historically, clone libraries were used to facilitate this sequencing. However, with advances in high throughput sequencing technologies, the cloning step is no longer necessary and greater yields of sequencing data can be obtained without this labour-intensive bottleneck step. Shotgun metagenomics provides information both about which organisms are present and what metabolic processes are possible in the community. [21] Because the collection of DNA from an environment is largely uncontrolled, the most abundant organisms in an environmental sample are most highly represented in the resulting sequence data. To achieve the high coverage needed to fully resolve the genomes of under-represented community members, large samples, often prohibitively so, are needed. On the other hand, the random nature of shotgun sequencing ensures that many of these organisms, which would otherwise go unnoticed using traditional culturing techniques, will be represented by at least some small sequence segments. [13]
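The sequencing effort needed for rare community members can be estimated with a back-of-the-envelope calculation. The sketch below assumes that mean coverage equals (reads from the organism × read length) / genome size, and that an organism receives reads in proportion to its share of the sampled DNA. The function name and all numbers are illustrative assumptions, not values from the studies cited above.

```python
import math


def reads_needed(genome_size_bp, read_length_bp, target_coverage, dna_fraction):
    """Total reads to sequence so one community member reaches the
    target mean coverage, given its fraction of the sampled DNA."""
    if not 0 < dna_fraction <= 1:
        raise ValueError("dna_fraction must be in (0, 1]")
    reads_from_member = target_coverage * genome_size_bp / read_length_bp
    return math.ceil(reads_from_member / dna_fraction)


# Hypothetical 5 Mb genome, 150 bp reads, 20x target coverage.
for fraction in (0.5, 0.01, 0.001):
    print(fraction, reads_needed(5_000_000, 150, 20, fraction))
```

Under these assumptions the required number of reads grows in inverse proportion to the organism's abundance, which is why resolving under-represented genomes quickly becomes prohibitive.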

High-throughput sequencing

An advantage to high throughput sequencing is that this technique does not require cloning the DNA before sequencing, removing one of the main biases and bottlenecks in environmental sampling. The first metagenomic studies conducted using high-throughput sequencing used massively parallel 454 pyrosequencing. [17] Three other technologies commonly applied to environmental sampling are the Ion Torrent Personal Genome Machine, the Illumina MiSeq or HiSeq and the Applied Biosystems SOLiD system. [22] These techniques for sequencing DNA generate shorter fragments than Sanger sequencing: the Ion Torrent PGM System and 454 pyrosequencing typically produce 400 bp reads, Illumina MiSeq produces 400–700 bp reads (depending on whether paired end options are used), and SOLiD produces 25–75 bp reads. [23] Historically, these read lengths were significantly shorter than the typical Sanger sequencing read length of 750 bp, although the Illumina technology is quickly coming close to this benchmark. This limitation is compensated for by the much larger number of sequence reads. In 2009, pyrosequenced metagenomes generated 200–500 megabases and Illumina platforms generated around 20–50 gigabases, but these outputs have increased by orders of magnitude in recent years. [24]

An emerging approach combines shotgun sequencing and chromosome conformation capture (Hi-C), which measures the proximity of any two DNA sequences within the same cell, to guide microbial genome assembly. [25] Long-read sequencing technologies, including PacBio RSII and PacBio Sequel by Pacific Biosciences, and Nanopore MinION, GridION and PromethION by Oxford Nanopore Technologies, are another option for obtaining long shotgun sequencing reads, which should ease the assembly process. [26]

The data generated by metagenomics experiments are both enormous and inherently noisy, containing fragmented data representing as many as 10,000 species. [1] The sequencing of the cow rumen metagenome generated 279 gigabases, or 279 billion base pairs of nucleotide sequence data, [28] while the human gut microbiome gene catalog identified 3.3 million genes assembled from 567.7 gigabases of sequence data. [29] Collecting, curating, and extracting useful biological information from datasets of this size represent significant computational challenges for researchers. [21] [30] [31] [32]

Sequence pre-filtering

The first step of metagenomic data analysis requires the execution of certain pre-filtering steps, including the removal of redundant, low-quality sequences and sequences of probable eukaryotic origin (especially in metagenomes of human origin). [33] [34] The methods available for the removal of contaminating eukaryotic genomic DNA sequences include Eu-Detect and DeConseq. [35] [36]
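Two of the pre-filtering steps named above, removing redundant and low-quality sequences, can be sketched as follows. This is a minimal illustration, not how Eu-Detect or DeConseq work: it assumes reads arrive as (sequence, Phred scores) pairs, uses a hypothetical mean-quality threshold of 20, and does not attempt to screen for eukaryotic sequences.

```python
def prefilter(reads, min_mean_quality=20):
    """Drop low-quality reads and exact duplicate sequences.

    reads: iterable of (sequence, phred_scores) tuples.
    Returns the surviving reads in their original order.
    """
    seen = set()
    kept = []
    for seq, quals in reads:
        # Skip reads with no scores or a mean quality below the threshold.
        if not quals or sum(quals) / len(quals) < min_mean_quality:
            continue
        # Skip exact duplicates of a read that was already kept.
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((seq, quals))
    return kept


reads = [
    ("ACGT", [30, 32, 31, 29]),
    ("ACGT", [35, 35, 35, 35]),  # duplicate sequence
    ("TTGA", [10, 12, 8, 9]),    # low quality
    ("GGCA", [25, 22, 28, 30]),
]
print([seq for seq, _ in prefilter(reads)])  # ['ACGT', 'GGCA']
```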

Assembly

DNA sequence data from genomic and metagenomic projects are essentially the same, but genomic sequence data offers higher coverage while metagenomic data is usually highly non-redundant. [31] Furthermore, the increased use of second-generation sequencing technologies with short read lengths means that much of future metagenomic data will be error-prone. Taken in combination, these factors make the assembly of metagenomic sequence reads into genomes difficult and unreliable. Misassemblies are caused by the presence of repetitive DNA sequences that make assembly especially difficult because of the difference in the relative abundance of species present in the sample. [37] Misassemblies can also involve the combination of sequences from more than one species into chimeric contigs. [37]

There are several assembly programs, most of which can use information from paired-end tags in order to improve the accuracy of assemblies. Some programs, such as Phrap or Celera Assembler, were designed to be used to assemble single genomes but nevertheless produce good results when assembling metagenomic data sets. [1] Other programs, such as Velvet assembler, have been optimized for the shorter reads produced by second-generation sequencing through the use of de Bruijn graphs. [38] [39] The use of reference genomes allows researchers to improve the assembly of the most abundant microbial species, but this approach is limited by the small subset of microbial phyla for which sequenced genomes are available. [37] After an assembly is created, an additional challenge is "metagenomic deconvolution", or determining which sequences come from which species in the sample. [40]
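To make the de Bruijn graph idea mentioned above concrete, here is a toy construction. It is a sketch of the general data structure only and makes no claim about Velvet's actual implementation: each k-mer in a read becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and the function name and reads are made up.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it.

    Every k-mer contributes one edge, so repeated k-mers appear as
    repeated entries (edge multiplicity is preserved).
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads that together spell ACGTA.
print(de_bruijn_edges(["ACGT", "CGTA"], k=3))
# {'AC': ['CG'], 'CG': ['GT', 'GT'], 'GT': ['TA']}
```

An assembler would then look for paths through this graph. In a metagenome, edges shared between species are one way chimeric contigs can arise.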

Gene prediction

Metagenomic analysis pipelines use two approaches in the annotation of coding regions in the assembled contigs. [37] The first approach is to identify genes based upon homology with genes that are already publicly available in sequence databases, usually by BLAST searches. This type of approach is implemented in the program MEGAN4. [41] The second, ab initio, uses intrinsic features of the sequence to predict coding regions based upon gene training sets from related organisms. This is the approach taken by programs such as GeneMark [42] and GLIMMER. The main advantage of ab initio prediction is that it enables the detection of coding regions that lack homologs in the sequence databases; however, it is most accurate when there are large regions of contiguous genomic DNA available for comparison. [1]
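As a much simpler stand-in for ab initio prediction, the sketch below scans the forward strand for open reading frames using only signals in the sequence itself (an ATG start followed in frame by a stop codon). Unlike GeneMark or GLIMMER it uses no trained statistical model and ignores the reverse strand. The function name and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) half-open coordinates of forward-strand
    ORFs, counting both the start and the stop codon in the length."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Toy contig with one ORF: ATG AAA TTT TAG at positions 2..13.
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14)]
```

On short, fragmented contigs many genes are cut off before a start or stop codon is seen, which illustrates why this kind of prediction works best on long contiguous sequence.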

Species diversity

Gene annotations provide the "what", while measurements of species diversity provide the "who". [43] In order to connect community composition and function in metagenomes, sequences must be binned. Binning is the process of associating a particular sequence with an organism. [37] In similarity-based binning, methods such as BLAST are used to rapidly search for phylogenetic markers or otherwise similar sequences in existing public databases. This approach is implemented in MEGAN. [44] Another tool, PhymmBL, uses interpolated Markov models to assign reads. [1] MetaPhlAn and AMPHORA are methods based on unique clade-specific markers for estimating organismal relative abundances with improved computational performance. [45] Other tools, like mOTUs [46] [47] and MetaPhyler, [48] use universal marker genes to profile prokaryotic species. With the mOTUs profiler it is possible to profile species without a reference genome, improving the estimation of microbial community diversity. [47] Recent methods, such as SLIMM, use the read coverage landscape of individual reference genomes to minimize false-positive hits and get reliable relative abundances. [49] In composition-based binning, methods use intrinsic features of the sequence, such as oligonucleotide frequencies or codon usage bias. [1] Once sequences are binned, it is possible to carry out comparative analysis of diversity and richness.
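The oligonucleotide frequencies used in composition-based binning can be computed as shown below. This is a minimal sketch rather than the method of any tool named above: `kmer_profile` and `profile_distance` are hypothetical helper names, and the L1 distance is just one simple way to compare two profiles.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Relative frequency of every possible k-mer over A, C, G, T,
    in a fixed lexicographic order (4**k values)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def profile_distance(a, b):
    """L1 distance between two profiles (0 means identical)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy AT-rich and GC-rich fragments compared by dinucleotide usage.
at_rich = kmer_profile("ATATATAT", k=2)
gc_rich = kmer_profile("GCGCGCGC", k=2)
print(profile_distance(at_rich, at_rich), profile_distance(at_rich, gc_rich))
```

Sequences whose profiles lie close together would be placed in the same bin. Real binning tools typically combine such profiles with clustering and coverage information.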

Data integration

The massive amount of exponentially growing sequence data is a daunting challenge that is complicated by the complexity of the metadata associated with metagenomic projects. Metadata includes detailed information about the three-dimensional (including depth, or height) geography and environmental features of the sample, physical data about the sample site, and the methodology of the sampling. [31] This information is necessary both to ensure replicability and to enable downstream analysis. Because of its importance, metadata and collaborative data review and curation require standardized data formats located in specialized databases, such as the Genomes OnLine Database (GOLD). [50]

Several tools have been developed to integrate metadata and sequence data, allowing downstream comparative analyses of different datasets using a number of ecological indices. In 2007, Folker Meyer and Robert Edwards and a team at Argonne National Laboratory and the University of Chicago released the Metagenomics Rapid Annotation using Subsystem Technology server (MG-RAST), a community resource for metagenome data set analysis. [51] As of June 2012 over 14.8 terabases (14×10^12 bases) of DNA have been analyzed, with more than 10,000 public data sets freely available for comparison within MG-RAST. Over 8,000 users have now submitted a total of 50,000 metagenomes to MG-RAST. The Integrated Microbial Genomes/Metagenomes (IMG/M) system also provides a collection of tools for functional analysis of microbial communities based on their metagenome sequence, based upon reference isolate genomes included from the Integrated Microbial Genomes (IMG) system and the Genomic Encyclopedia of Bacteria and Archaea (GEBA) project. [52]

One of the first standalone tools for analysing high-throughput metagenome shotgun data was MEGAN (MEta Genome ANalyzer). [41] [44] A first version of the program was used in 2005 to analyse the metagenomic context of DNA sequences obtained from a mammoth bone. [17] Based on a BLAST comparison against a reference database, this tool performs both taxonomic and functional binning, by placing the reads onto the nodes of the NCBI taxonomy using a simple lowest common ancestor (LCA) algorithm or onto the nodes of the SEED or KEGG classifications, respectively. [53]
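The lowest common ancestor idea can be illustrated on a toy taxonomy. This sketch is not MEGAN's code: the `parent` dictionary is a heavily simplified, hand-written stand-in for the NCBI taxonomy, and the function names are assumptions. A read whose database hits fall in several taxa is placed on the deepest node shared by all of their lineages.

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root of the taxonomy."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all given taxa."""
    taxa = list(taxa)
    if not taxa:
        raise ValueError("at least one taxon is required")
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        other = set(lineage(taxon, parent))
        common = [node for node in common if node in other]
    return common[0]


# Simplified toy taxonomy (child -> parent); many real ranks are omitted.
parent = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Bacillus subtilis": "Bacillus",
    "Bacillus": "Bacteria",
    "Bacteria": "root",
}
hits = ["Escherichia coli", "Salmonella enterica"]
print(lowest_common_ancestor(hits, parent))  # Enterobacteriaceae
```

A read matching only one species stays at that species, while a read matching distantly related taxa is pushed up towards the root, which keeps ambiguous assignments conservative.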

With the advent of fast and inexpensive sequencing instruments, the growth of databases of DNA sequences is now exponential (e.g., the NCBI GenBank database [54] ). Faster and efficient tools are needed to keep pace with the high-throughput sequencing, because the BLAST-based approaches such as MG-RAST or MEGAN run slowly to annotate large samples (e.g., several hours to process a small/medium size dataset/sample [55] ). Thus, ultra-fast classifiers have recently emerged, thanks to more affordable powerful servers. These tools can perform the taxonomic annotation at extremely high speed, for example CLARK [56] (according to CLARK's authors, it can classify accurately "32 million metagenomic short reads per minute"). At such a speed, a very large dataset/sample of a billion short reads can be processed in about 30 minutes.
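The 30-minute figure follows directly from the quoted throughput, as this small calculation shows. The function name is hypothetical, and the rate is assumed to stay constant for the whole run.

```python
def minutes_to_classify(total_reads, reads_per_minute):
    """Time in minutes to classify all reads at a constant rate."""
    return total_reads / reads_per_minute


# One billion short reads at 32 million reads per minute.
print(minutes_to_classify(1_000_000_000, 32_000_000))  # 31.25
```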

With the increasing availability of samples containing ancient DNA and due to the uncertainty associated with the nature of those samples (ancient DNA damage), [57] a fast tool capable of producing conservative similarity estimates, FALCON, has been made available. According to FALCON's authors, it can use relaxed thresholds and edit distances without affecting the memory and speed performance.

Comparative metagenomics

Comparative analyses between metagenomes can provide additional insight into the function of complex microbial communities and their role in host health. [58] Pairwise or multiple comparisons between metagenomes can be made at the level of sequence composition (comparing GC-content or genome size), taxonomic diversity, or functional complement. Comparisons of population structure and phylogenetic diversity can be made on the basis of 16S and other phylogenetic marker genes, or—in the case of low-diversity communities—by genome reconstruction from the metagenomic dataset. [59] Functional comparisons between metagenomes may be made by comparing sequences against reference databases such as COG or KEGG, and tabulating the abundance by category and evaluating any differences for statistical significance. [53] This gene-centric approach emphasizes the functional complement of the community as a whole rather than taxonomic groups, and shows that the functional complements are analogous under similar environmental conditions. [59] Consequently, metadata on the environmental context of the metagenomic sample is especially important in comparative analyses, as it provides researchers with the ability to study the effect of habitat upon community structure and function. [1]

Additionally, several studies have also utilized oligonucleotide usage patterns to identify the differences across diverse microbial communities. Examples of such methodologies include the dinucleotide relative abundance approach by Willner et al. [60] and the HabiSign approach of Ghosh et al. [61] This latter study also indicated that differences in tetranucleotide usage patterns can be used to identify genes (or metagenomic reads) originating from specific habitats. Additionally, some methods such as TriageTools [62] or Compareads [63] detect similar reads between two read sets. The similarity measure they apply to reads is based on the number of identical words of length k shared by pairs of reads.
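A similarity measure of this kind, counting words of length k shared by a pair of reads, might look like the following. It is a simplified sketch that counts distinct shared k-mers and is not taken from TriageTools or Compareads, whose exact scoring and thresholds may differ. The function names are made up.

```python
def kmers(read, k):
    """Set of distinct words of length k occurring in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def shared_kmers(read_a, read_b, k):
    """Number of distinct k-mers present in both reads."""
    return len(kmers(read_a, k) & kmers(read_b, k))


# Two overlapping toy reads and one unrelated read.
print(shared_kmers("ACGTAC", "CGTACG", k=3))  # 4
print(shared_kmers("AAAA", "TTTT", k=3))      # 0
```

Two reads would then be called similar when the count exceeds a chosen threshold, which avoids a full alignment for every pair.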

A key goal in comparative metagenomics is to identify microbial group(s) which are responsible for conferring specific characteristics to a given environment. However, artifacts arising from issues in the sequencing technologies need to be accounted for, as is done in metagenomeSeq. [30] Others have characterized inter-microbial interactions between the resident microbial groups. A GUI-based comparative metagenomic analysis application called Community-Analyzer has been developed by Kuntal et al. [64] which implements a correlation-based graph layout algorithm that not only facilitates a quick visualization of the differences in the analyzed microbial communities (in terms of their taxonomic composition), but also provides insights into the inherent inter-microbial interactions occurring therein. Notably, this layout algorithm also enables grouping of the metagenomes based on the probable inter-microbial interaction patterns rather than simply comparing abundance values of various taxonomic groups. In addition, the tool implements several interactive GUI-based functionalities that enable users to perform standard comparative analyses across microbiomes.

Community metabolism

In many bacterial communities, natural or engineered (such as bioreactors), there is significant division of labor in metabolism (syntrophy), during which the waste products of some organisms are metabolites for others. [65] In one such system, the methanogenic bioreactor, functional stability requires the presence of several syntrophic species (Syntrophobacterales and Synergistia) working together in order to turn raw resources into fully metabolized waste (methane). [66] Using comparative gene studies and expression experiments with microarrays or proteomics, researchers can piece together a metabolic network that goes beyond species boundaries. Such studies require detailed knowledge about which versions of which proteins are coded by which species and even by which strains of which species. Therefore, community genomic information is another fundamental tool (with metabolomics and proteomics) in the quest to determine how metabolites are transferred and transformed by a community. [67]

Metatranscriptomics

Metagenomics allows researchers to access the functional and metabolic diversity of microbial communities, but it cannot show which of these processes are active. [59] The extraction and analysis of metagenomic mRNA (the metatranscriptome) provides information on the regulation and expression profiles of complex communities. Because of the technical difficulties (the short half-life of mRNA, for example) in the collection of environmental RNA there have been relatively few in situ metatranscriptomic studies of microbial communities to date. [59] While originally limited to microarray technology, metatranscriptomics studies have made use of transcriptomics technologies to measure whole-genome expression and quantification of a microbial community, [59] first employed in analysis of ammonia oxidation in soils. [68]

Viruses

Metagenomic sequencing is particularly useful in the study of viral communities. As viruses lack a shared universal phylogenetic marker (such as 16S rRNA for bacteria and archaea, and 18S rRNA for eukarya), the only way to access the genetic diversity of the viral community from an environmental sample is through metagenomics. Viral metagenomes (also called viromes) should thus provide more and more information about viral diversity and evolution. [69] [70] [71] [72] [73] For example, a metagenomic pipeline called Giant Virus Finder showed the first evidence of the existence of giant viruses in a saline desert [74] and in Antarctic dry valleys. [75]

Metagenomics has the potential to advance knowledge in a wide variety of fields. It can also be applied to solve practical challenges in medicine, engineering, agriculture, sustainability and ecology. [31] [76]

Agriculture

The soils in which plants grow are inhabited by microbial communities, with one gram of soil containing around 10^9–10^10 microbial cells which comprise about one gigabase of sequence information. [77] [78] The microbial communities which inhabit soils are some of the most complex known to science, and remain poorly understood despite their economic importance. [79] Microbial consortia perform a wide variety of ecosystem services necessary for plant growth, including fixing atmospheric nitrogen, nutrient cycling, disease suppression, and sequestering iron and other metals. [80] Functional metagenomics strategies are being used to explore the interactions between plants and microbes through cultivation-independent study of these microbial communities. [81] [82] By allowing insights into the role of previously uncultivated or rare community members in nutrient cycling and the promotion of plant growth, metagenomic approaches can contribute to improved disease detection in crops and livestock and the adaptation of enhanced farming practices which improve crop health by harnessing the relationship between microbes and plants. [31]

Biofuel

Biofuels are fuels derived from biomass conversion, as in the conversion of cellulose contained in corn stalks, switchgrass, and other biomass into cellulosic ethanol. [31] This process is dependent upon microbial consortia (associations) that transform the cellulose into sugars, followed by the fermentation of the sugars into ethanol. Microbes also produce a variety of sources of bioenergy including methane and hydrogen. [31]

The efficient industrial-scale deconstruction of biomass requires novel enzymes with higher productivity and lower cost. [28] Metagenomic approaches to the analysis of complex microbial communities allow the targeted screening of enzymes with industrial applications in biofuel production, such as glycoside hydrolases. [83] Furthermore, knowledge of how these microbial communities function is required to control them, and metagenomics is a key tool in their understanding. Metagenomic approaches allow comparative analyses between convergent microbial systems like biogas fermenters [84] or insect herbivores such as the fungus garden of the leafcutter ants. [85]

Biotechnology

Microbial communities produce a vast array of biologically active chemicals that are used in competition and communication. [80] Many of the drugs in use today were originally uncovered in microbes; recent progress in mining the rich genetic resource of non-culturable microbes has led to the discovery of new genes, enzymes, and natural products. [59] [86] The application of metagenomics has allowed the development of commodity and fine chemicals, agrochemicals and pharmaceuticals where the benefit of enzyme-catalyzed chiral synthesis is increasingly recognized. [87]

Two types of analysis are used in the bioprospecting of metagenomic data: function-driven screening for an expressed trait, and sequence-driven screening for DNA sequences of interest. [88] Function-driven analysis seeks to identify clones expressing a desired trait or useful activity, followed by biochemical characterization and sequence analysis. This approach is limited by the availability of a suitable screen and the requirement that the desired trait be expressed in the host cell. Moreover, its low rate of discovery (less than one per 1,000 clones screened) and its labor-intensive nature further limit this approach. [89] In contrast, sequence-driven analysis uses conserved DNA sequences to design PCR primers to screen clones for the sequence of interest. [88] Compared with cloning-based approaches, a sequence-only approach further reduces the amount of bench work required. The application of massively parallel sequencing also greatly increases the amount of sequence data generated, which requires high-throughput bioinformatic analysis pipelines. [89] The sequence-driven approach to screening is limited by the breadth and accuracy of gene functions present in public sequence databases. In practice, experiments combine functional and sequence-based approaches depending on the function of interest, the complexity of the sample to be screened, and other factors. [89] [90] The discovery of the malacidin antibiotics is an example of the successful use of metagenomics as a biotechnology for drug discovery. [91]
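
As a rough illustration of sequence-driven screening, the sketch below searches clone sequences in silico for a degenerate primer written with IUPAC nucleotide codes, checking both strands. The primer `GGNTAYGA`, the clone names and sequences, and the helper functions are hypothetical and chosen only for this example; an actual screen would rely on PCR at the bench or on dedicated primer-matching software, and would usually tolerate mismatches.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_to_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def screen_clones(clones, primer):
    """Map clone name -> (strand, start) for the first primer match found.

    The forward strand is checked first; clones with no match on either
    strand are omitted from the result.
    """
    pattern = primer_to_regex(primer)
    hits = {}
    for name, seq in clones.items():
        seq = seq.upper()
        for strand, candidate in (("+", seq), ("-", reverse_complement(seq))):
            match = pattern.search(candidate)
            if match:
                hits[name] = (strand, match.start())
                break
    return hits


# Hypothetical clone inserts (toy data, not from any real library).
clones = {
    "clone_A": "TTACGGATACGATTC",  # carries the motif on the forward strand
    "clone_B": "AAAAACCCCCTTTTT",  # carries no motif
    "clone_C": "AATCATAGCCAA",     # carries the motif on the reverse strand
}
hits = screen_clones(clones, "GGNTAYGA")
print(hits)
```

Only the clones that match would then go forward to sequencing and characterization, which is where the reduction in bench work comes from.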

Ecology

Metagenomics can provide valuable insights into the functional ecology of environmental communities. [92] Metagenomic analysis of the bacterial consortia found in the defecations of Australian sea lions suggests that nutrient-rich sea lion faeces may be an important nutrient source for coastal ecosystems. This is because the bacteria that are expelled simultaneously with the defecations are adept at breaking down the nutrients in the faeces into a bioavailable form that can be taken up into the food chain. [93]

DNA sequencing can also be used more broadly to identify the species present in a body of water, [94] debris filtered from the air, a sample of soil, or even an animal's faeces. [95] This can establish the range of invasive and endangered species, and track seasonal populations.

Environmental remediation

Metagenomics can improve strategies for monitoring the impact of pollutants on ecosystems and for cleaning up contaminated environments. Increased understanding of how microbial communities cope with pollutants improves assessments of the potential of contaminated sites to recover from pollution and increases the chances of bioaugmentation or biostimulation trials to succeed. [96]

Gut microbe characterization

Microbial communities play a key role in preserving human health, but their composition and the mechanisms by which they do so remain mysterious. [97] Metagenomic sequencing is being used to characterize the microbial communities from 15–18 body sites from at least 250 individuals. This is part of the Human Microbiome initiative, whose primary goals are to determine whether there is a core human microbiome, to understand the changes in the human microbiome that can be correlated with human health, and to develop new technological and bioinformatics tools to support these goals. [98]

Another medical study, part of the MetaHIT (Metagenomics of the Human Intestinal Tract) project, examined 124 individuals from Denmark and Spain, comprising healthy, overweight, and inflammatory bowel disease patients. The study attempted to categorize the depth and phylogenetic diversity of gastrointestinal bacteria. Using Illumina GA sequence data and SOAPdenovo, a de Bruijn graph-based tool specifically designed for assembling short reads, the researchers generated 6.58 million contigs greater than 500 bp, for a total contig length of 10.3 Gb and an N50 length of 2.2 kb.
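
The N50 figure quoted for the assembly is a standard contiguity statistic: the contig length L such that contigs of length L or longer together contain at least half of all assembled bases. A minimal sketch of the calculation follows; the function name and the toy contig lengths are illustrative assumptions and are not MetaHIT data.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembly length.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy contig lengths in base pairs (illustrative only).
toy_contigs = [500, 700, 900, 1200, 2200, 4000]
print(n50(toy_contigs))  # 4000 + 2200 = 6200 >= 9500 / 2, so N50 is 2200
```

A higher N50 indicates that more of the assembly sits in long contigs, although on its own it says nothing about whether those contigs are correct.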

The study demonstrated that two bacterial divisions, Bacteroidetes and Firmicutes, constitute over 90% of the known phylogenetic categories that dominate distal gut bacteria. Using the relative gene frequencies found within the gut, these researchers identified 1,244 metagenomic clusters that are critically important for the health of the intestinal tract. These clusters carry two types of functions: housekeeping functions and functions specific to the intestine. The housekeeping gene clusters are required in all bacteria and are often major players in the main metabolic pathways, including central carbon metabolism and amino acid synthesis. The gut-specific functions include adhesion to host proteins and the harvesting of sugars from globoseries glycolipids. Patients with inflammatory bowel disease were shown to exhibit 25% fewer genes and lower bacterial diversity than individuals without the disease, indicating that changes in patients' gut biome diversity may be associated with this condition.

While these studies highlight some potentially valuable medical applications, only 31–48.8% of the reads could be aligned to 194 public human gut bacterial genomes and 7.6–21.2% to bacterial genomes available in GenBank, which indicates that far more research is still needed to capture novel bacterial genomes. [99]

In the Human Microbiome Project (HMP), gut microbial communities were assayed using high-throughput DNA sequencing. The HMP showed that, unlike individual microbial species, many metabolic processes were present in all body habitats, with varying frequencies. Microbial communities of 649 metagenomes drawn from seven primary body sites on 102 individuals were studied as part of the project. The metagenomic analysis revealed variations in niche-specific abundance among 168 functional modules and 196 metabolic pathways within the microbiome. These included glycosaminoglycan degradation in the gut, as well as phosphate and amino acid transport linked to host phenotype (vaginal pH) in the posterior fornix. The HMP has brought to light the utility of metagenomics in diagnostics and evidence-based medicine. Metagenomics is thus a powerful tool for addressing many of the pressing issues in the field of personalized medicine. [100]

Infectious disease diagnosis

Differentiating between infectious and non-infectious illness, and identifying the underlying etiology of infection, can be quite challenging. For example, more than half of cases of encephalitis remain undiagnosed, despite extensive testing using state-of-the-art clinical laboratory methods. Metagenomic sequencing shows promise as a sensitive and rapid method to diagnose infection. It compares genetic material found in a patient's sample against databases of all known microscopic human pathogens and thousands of other bacterial, viral, fungal, and parasitic organisms, as well as databases of antimicrobial resistance gene sequences with associated clinical phenotypes.
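
The database comparison described above can be pictured with a deliberately simplified k-mer voting scheme: each read is assigned to the reference sequence with which it shares the most k-mers. Everything in the sketch below is an assumption made for illustration, including the reference names, the toy sequences, the choice of k = 5, and the tie rule. Clinical pipelines use curated databases, much longer k-mers or full alignments, both strands, and careful handling of contamination and host DNA.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = {}
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def classify_read(read, index, k):
    """Return the reference sharing the most k-mers with the read.

    Returns None when no k-mer matches or when the top two references tie.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    ranked = votes.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical reference "database" and reads (toy sequences only).
K = 5
references = {
    "organism_A": "ATGGCGTACGTTAGC",
    "organism_B": "TTGACCGATCGGATA",
}
index = build_index(references, K)
reads = ["GCGTACGTT", "CCGATCGG", "AAAAAAAA"]
calls = [classify_read(read, index, K) for read in reads]
print(calls)
```

Reads that match nothing in the database come back as `None`, which mirrors the practical limit of this kind of diagnosis: an organism absent from the reference collection cannot be named.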

  1. Wooley JC, Godzik A, Friedberg I (February 2010). Bourne PE (ed.). "A primer on metagenomics". PLOS Computational Biology. 6 (2): e1000667. Bibcode:2010PLSCB...6E0667W. doi:10.1371/journal.pcbi.1000667. PMC 2829047. PMID 20195499.
  2. Hugenholtz P, Goebel BM, Pace NR (September 1998). "Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity". Journal of Bacteriology. 180 (18): 4765–74. doi:10.1128/JB.180.18.4765-4774.1998. PMC 107498. PMID 9733676.
  3. Marco, D, ed. (2011). Metagenomics: Current Innovations and Future Trends. Caister Academic Press. ISBN 978-1-904455-87-5.
  4. Eisen JA (March 2007). "Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes". PLOS Biology. 5 (3): e82. doi:10.1371/journal.pbio.0050082. PMC 1821061. PMID 17355177.
  5. Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM (October 1998). "Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products". Chemistry & Biology. 5 (10): R245-9. doi:10.1016/S1074-5521(98)90108-9. PMID 9818143.
  6. Chen K, Pachter L (July 2005). "Bioinformatics for whole-genome shotgun sequencing of microbial communities". PLOS Computational Biology. 1 (2): 106–12. Bibcode:2005PLSCB...1...24C. doi:10.1371/journal.pcbi.0010024. PMC 1185649. PMID 16110337.
  7. Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR (October 1985). "Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses". Proceedings of the National Academy of Sciences of the United States of America. 82 (20): 6955–9. Bibcode:1985PNAS...82.6955L. doi:10.1073/pnas.82.20.6955. PMC 391288. PMID 2413450.
  8. Pace NR, Stahl DA, Lane DJ, Olsen GJ (1986). "The Analysis of Natural Microbial Populations by Ribosomal RNA Sequences". In Marshall KC (ed.). Advances in Microbial Ecology. 9. Springer US. pp. 1–55. doi:10.1007/978-1-4757-0611-6_1. ISBN 978-1-4757-0611-6.
  9. Schmidt TM, DeLong EF, Pace NR (July 1991). "Analysis of a marine picoplankton community by 16S rRNA gene cloning and sequencing". Journal of Bacteriology. 173 (14): 4371–8. doi:10.1128/jb.173.14.4371-4378.1991. PMC 208098. PMID 2066334.
  10. Healy FG, Ray RM, Aldrich HC, Wilkie AC, Ingram LO, Shanmugam KT (1995). "Direct isolation of functional genes encoding cellulases from the microbial consortia in a thermophilic, anaerobic digester maintained on lignocellulose". Applied Microbiology and Biotechnology. 43 (4): 667–74. doi:10.1007/BF00164771. PMID 7546604. S2CID 31384119.
  11. Stein JL, Marsh TL, Wu KY, Shizuya H, DeLong EF (February 1996). "Characterization of uncultivated prokaryotes: isolation and analysis of a 40-kilobase-pair genome fragment from a planktonic marine archaeon". Journal of Bacteriology. 178 (3): 591–9. doi:10.1128/jb.178.3.591-599.1996. PMC 177699. PMID 8550487.
  12. Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, Mead D, et al. (October 2002). "Genomic analysis of uncultured marine viral communities". Proceedings of the National Academy of Sciences of the United States of America. 99 (22): 14250–5. Bibcode:2002PNAS...9914250B. doi:10.1073/pnas.202488399. PMC 137870. PMID 12384570.
  13. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, et al. (March 2004). "Community structure and metabolism through reconstruction of microbial genomes from the environment". Nature. 428 (6978): 37–43. Bibcode:2004Natur.428...37T. doi:10.1038/nature02340. PMID 14961025. S2CID 4420754. (subscription required)
  14. Hugenholtz P (2002). "Exploring prokaryotic diversity in the genomic era". Genome Biology. 3 (2): REVIEWS0003. doi:10.1186/gb-2002-3-2-reviews0003. PMC 139013. PMID 11864374.
  15. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, et al. (April 2004). "Environmental genome shotgun sequencing of the Sargasso Sea". Science. 304 (5667): 66–74. Bibcode:2004Sci...304...66V. CiteSeerX 10.1.1.124.1840. doi:10.1126/science.1093857. PMID 15001713. S2CID 1454587.
  16. Yooseph S, Nealson KH, Rusch DB, McCrow JP, Dupont CL, Kim M, et al. (November 2010). "Genomic and functional adaptation in surface ocean planktonic prokaryotes". Nature. 468 (7320): 60–6. Bibcode:2010Natur.468...60Y. doi:10.1038/nature09530. PMID 21048761. (subscription required)
  17. Poinar HN, Schwarz C, Qi J, Shapiro B, Macphee RD, Buigues B, et al. (January 2006). "Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA". Science. 311 (5759): 392–4. Bibcode:2006Sci...311..392P. doi:10.1126/science.1123360. PMID 16368896. S2CID 11238470.
  18. Edwards RA, Rodriguez-Brito B, Wegley L, Haynes M, Breitbart M, Peterson DM, et al. (March 2006). "Using pyrosequencing to shed light on deep mine microbial ecology". BMC Genomics. 7: 57. doi:10.1186/1471-2164-7-57. PMC 1483832. PMID 16549033.
  19. Thomas T, Gilbert J, Meyer F (February 2012). "Metagenomics - a guide from sampling to data analysis". Microbial Informatics and Experimentation. 2 (1): 3. doi:10.1186/2042-5783-2-3. PMC 3351745. PMID 22587947.
  20. Béjà O, Suzuki MT, Koonin EV, Aravind L, Hadd A, Nguyen LP, et al. (October 2000). "Construction and analysis of bacterial artificial chromosome libraries from a marine microbial assemblage". Environmental Microbiology. 2 (5): 516–29. doi:10.1046/j.1462-2920.2000.00133.x. PMID 11233160. S2CID 8267748.
  21. Segata N, Boernigen D, Tickle TL, Morgan XC, Garrett WS, Huttenhower C (May 2013). "Computational meta'omics for microbial community studies". Molecular Systems Biology. 9 (666): 666. doi:10.1038/msb.2013.22. PMC 4039370. PMID 23670539.
  22. Rodrigue S, Materna AC, Timberlake SC, Blackburn MC, Malmstrom RR, Alm EJ, Chisholm SW (July 2010). Gilbert JA (ed.). "Unlocking short read sequencing for metagenomics". PLOS ONE. 5 (7): e11840. Bibcode:2010PLoSO...511840R. doi:10.1371/journal.pone.0011840. PMC 2911387. PMID 20676378.
  23. Schuster SC (January 2008). "Next-generation sequencing transforms today's biology". Nature Methods. 5 (1): 16–8. doi:10.1038/nmeth1156. PMID 18165802. S2CID 1465786.
  24. "Metagenomics versus Moore's law". Nature Methods. 6 (9): 623. 2009. doi:10.1038/nmeth0909-623.
  25. Stewart RD, Auffret MD, Warr A, Wiser AH, Press MO, Langford KW, et al. (February 2018). "Assembly of 913 microbial genomes from metagenomic sequencing of the cow rumen". Nature Communications. 9 (1): 870. Bibcode:2018NatCo...9..870S. doi:10.1038/s41467-018-03317-6. PMC 5830445. PMID 29491419.
  26. Hiraoka S, Yang CC, Iwasaki W (September 2016). "Metagenomics and Bioinformatics in Microbial Ecology: Current Status and Beyond". Microbes and Environments. 31 (3): 204–12. doi:10.1264/jsme2.ME16024. PMC 5017796. PMID 27383682.
  27. Pérez-Cobas AE, Gomez-Valero L, Buchrieser C (2020). "Metagenomic approaches in microbial ecology: an update on whole-genome and marker gene sequencing analyses". Microbial Genomics. 6 (8). doi:10.1099/mgen.0.000409. PMC 7641418. PMID 32706331.
  28. Hess M, Sczyrba A, Egan R, Kim TW, Chokhawala H, Schroth G, et al. (January 2011). "Metagenomic discovery of biomass-degrading genes and genomes from cow rumen". Science. 331 (6016): 463–7. Bibcode:2011Sci...331..463H. doi:10.1126/science.1200387. PMID 21273488. S2CID 36572885.
  29. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. (March 2010). "A human gut microbial gene catalogue established by metagenomic sequencing". Nature. 464 (7285): 59–65. Bibcode:2010Natur.464...59.. doi:10.1038/nature08821. PMC 3779803. PMID 20203603. (subscription required)
  30. Paulson JN, Stine OC, Bravo HC, Pop M (December 2013). "Differential abundance analysis for microbial marker-gene surveys". Nature Methods. 10 (12): 1200–2. doi:10.1038/nmeth.2658. PMC 4010126. PMID 24076764.
  31. Committee on Metagenomics: Challenges and Functional Applications, National Research Council (2007). The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet. Washington, D.C.: The National Academies Press. doi:10.17226/11902. ISBN 978-0-309-10676-4. PMID 21678629.
  32. Oulas A, Pavloudi C, Polymenakou P, Pavlopoulos GA, Papanikolaou N, Kotoulas G, et al. (2015). "Metagenomics: tools and insights for analyzing next-generation sequencing data derived from biodiversity studies". Bioinformatics and Biology Insights. 9: 75–88. doi:10.4137/BBI.S12462. PMC 4426941. PMID 25983555.
  33. Mende DR, Waller AS, Sunagawa S, Järvelin AI, Chan MM, Arumugam M, et al. (23 February 2012). "Assessment of metagenomic assembly using simulated next generation sequencing data". PLOS ONE. 7 (2): e31386. Bibcode:2012PLoSO...731386M. doi:10.1371/journal.pone.0031386. PMC 3285633. PMID 22384016.
  34. Balzer S, Malde K, Grohme MA, Jonassen I (April 2013). "Filtering duplicate reads from 454 pyrosequencing data". Bioinformatics. 29 (7): 830–6. doi:10.1093/bioinformatics/btt047. PMC 3605598. PMID 23376350.
  35. Mohammed MH, Chadaram S, Komanduri D, Ghosh TS, Mande SS (September 2011). "Eu-Detect: an algorithm for detecting eukaryotic sequences in metagenomic data sets". Journal of Biosciences. 36 (4): 709–17. doi:10.1007/s12038-011-9105-2. PMID 21857117. S2CID 25857874.
  36. Schmieder R, Edwards R (March 2011). "Fast identification and removal of sequence contamination from genomic and metagenomic datasets". PLOS ONE. 6 (3): e17288. Bibcode:2011PLoSO...617288S. doi:10.1371/journal.pone.0017288. PMC 3052304. PMID 21408061.
  37. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P (December 2008). "A bioinformatician's guide to metagenomics". Microbiology and Molecular Biology Reviews. 72 (4): 557–78, Table of Contents. doi:10.1128/MMBR.00009-08. PMC 2593568. PMID 19052320.
  38. Namiki T, Hachiya T, Tanaka H, Sakakibara Y (November 2012). "MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads". Nucleic Acids Research. 40 (20): e155. doi:10.1093/nar/gks678. PMC 3488206. PMID 22821567.
  39. Zerbino DR, Birney E (May 2008). "Velvet: algorithms for de novo short read assembly using de Bruijn graphs". Genome Research. 18 (5): 821–9. doi:10.1101/gr.074492.107. PMC 2336801. PMID 18349386.
  40. Burton JN, Liachko I, Dunham MJ, Shendure J (May 2014). "Species-level deconvolution of metagenome assemblies with Hi-C-based contact probability maps". G3. 4 (7): 1339–46. doi:10.1534/g3.114.011825. PMC 4455782. PMID 24855317.
  41. Huson DH, Mitra S, Ruscheweyh HJ, Weber N, Schuster SC (September 2011). "Integrative analysis of environmental sequences using MEGAN4". Genome Research. 21 (9): 1552–60. doi:10.1101/gr.120618.111. PMC 3166839. PMID 21690186.
  42. Zhu W, Lomsadze A, Borodovsky M (July 2010). "Ab initio gene identification in metagenomic sequences". Nucleic Acids Research. 38 (12): e132. doi:10.1093/nar/gkq275. PMC 2896542. PMID 20403810.
  43. Konopka A (November 2009). "What is microbial community ecology?". The ISME Journal. 3 (11): 1223–30. doi:10.1038/ismej.2009.88. PMID 19657372.
  44. Huson DH, Auch AF, Qi J, Schuster SC (March 2007). "MEGAN analysis of metagenomic data". Genome Research. 17 (3): 377–86. doi:10.1101/gr.5969107. PMC 1800929. PMID 17255551.
  45. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C (June 2012). "Metagenomic microbial community profiling using unique clade-specific marker genes". Nature Methods. 9 (8): 811–4. doi:10.1038/nmeth.2066. PMC 3443552. PMID 22688413.
  46. Sunagawa S, Mende DR, Zeller G, Izquierdo-Carrasco F, Berger SA, Kultima JR, et al. (December 2013). "Metagenomic species profiling using universal phylogenetic marker genes". Nature Methods. 10 (12): 1196–9. doi:10.1038/nmeth.2693. PMID 24141494. S2CID 7728395.
  47. Milanese A, Mende DR, Paoli L, Salazar G, Ruscheweyh HJ, Cuenca M, et al. (March 2019). "Microbial abundance, activity and population genomic profiling with mOTUs2". Nature Communications. 10 (1): 1014. Bibcode:2019NatCo..10.1014M. doi:10.1038/s41467-019-08844-4. PMC 6399450. PMID 30833550.
  48. Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M (2011). "Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences". BMC Genomics. 12 Suppl 2: S4. doi:10.1186/1471-2164-12-S2-S4. PMC 3194235. PMID 21989143.
  49. Dadi TH, Renard BY, Wieler LH, Semmler T, Reinert K (2017). "SLIMM: species level identification of microorganisms from metagenomes". PeerJ. 5: e3138. doi:10.7717/peerj.3138. PMC 5372838. PMID 28367376.
  50. Pagani I, Liolios K, Jansson J, Chen IM, Smirnova T, Nosrat B, et al. (January 2012). "The Genomes OnLine Database (GOLD) v.4: status of genomic and metagenomic projects and their associated metadata". Nucleic Acids Research. 40 (Database issue): D571-9. doi:10.1093/nar/gkr1100. PMC 3245063. PMID 22135293.
  51. Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, et al. (September 2008). "The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes". BMC Bioinformatics. 9: 386. doi:10.1186/1471-2105-9-386. PMC 2563014. PMID 18803844.
  52. Markowitz VM, Chen IM, Chu K, Szeto E, Palaniappan K, Grechkin Y, et al. (January 2012). "IMG/M: the integrated metagenome data management and comparative analysis system". Nucleic Acids Research. 40 (Database issue): D123-9. doi:10.1093/nar/gkr975. PMC 3245048. PMID 22086953.
  53. Mitra S, Rupek P, Richter DC, Urich T, Gilbert JA, Meyer F, et al. (February 2011). "Functional analysis of metagenomes and metatranscriptomes using SEED and KEGG". BMC Bioinformatics. 12 Suppl 1: S21. doi:10.1186/1471-2105-12-S1-S21. PMC 3044276. PMID 21342551.
  54. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (January 2013). "GenBank". Nucleic Acids Research. 41 (Database issue): D36-42. doi:10.1093/nar/gks1195. PMC 3531190. PMID 23193287.
  55. Bazinet AL, Cummings MP (May 2012). "A comparative evaluation of sequence classification programs". BMC Bioinformatics. 13: 92. doi:10.1186/1471-2105-13-92. PMC 3428669. PMID 22574964.
  56. Ounit R, Wanamaker S, Close TJ, Lonardi S (March 2015). "CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers". BMC Genomics. 16: 236. doi:10.1186/s12864-015-1419-2. PMC 4428112. PMID 25879410.
  57. Pratas D, Pinho AJ, Silva RM, Rodrigues JM, Hosseini M, Caetano T, Ferreira PJ (February 2018). "FALCON: a method to infer metagenomic composition of ancient DNA". bioRxiv 10.1101/267179.
  58. Kurokawa K, Itoh T, Kuwahara T, Oshima K, Toh H, Toyoda A, et al. (August 2007). "Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes". DNA Research. 14 (4): 169–81. doi:10.1093/dnares/dsm018. PMC 2533590. PMID 17916580.
  59. Simon C, Daniel R (February 2011). "Metagenomic analyses: past and future trends". Applied and Environmental Microbiology. 77 (4): 1153–61. doi:10.1128/AEM.02345-10. PMC 3067235. PMID 21169428.
  60. Willner D, Thurber RV, Rohwer F (July 2009). "Metagenomic signatures of 86 microbial and viral metagenomes". Environmental Microbiology. 11 (7): 1752–66. doi:10.1111/j.1462-2920.2009.01901.x. PMID 19302541.
  61. Ghosh TS, Mohammed MH, Rajasingh H, Chadaram S, Mande SS (2011). "HabiSign: a novel approach for comparison of metagenomes and rapid identification of habitat-specific sequences". BMC Bioinformatics. 12 Suppl 13 (Supplement 13): S9. doi:10.1186/1471-2105-12-s13-s9. PMC 3278849. PMID 22373355.
  62. Fimereli D, Detours V, Konopka T (April 2013). "TriageTools: tools for partitioning and prioritizing analysis of high-throughput sequencing data". Nucleic Acids Research. 41 (7): e86. doi:10.1093/nar/gkt094. PMC 3627586. PMID 23408855.
  63. Maillet N, Lemaitre C, Chikhi R, Lavenier D, Peterlongo P (2012). "Compareads: comparing huge metagenomic experiments". BMC Bioinformatics. 13 Suppl 19 (Suppl 19): S10. doi:10.1186/1471-2105-13-S19-S10. PMC 3526429. PMID 23282463.
  64. Kuntal BK, Ghosh TS, Mande SS (October 2013). "Community-analyzer: a platform for visualizing and comparing microbial community structure across microbiomes". Genomics. 102 (4): 409–18. doi:10.1016/j.ygeno.2013.08.004. PMID 23978768.
  65. Werner JJ, Knights D, Garcia ML, Scalfone NB, Smith S, Yarasheski K, et al. (March 2011). "Bacterial community structures are unique and resilient in full-scale bioenergy systems". Proceedings of the National Academy of Sciences of the United States of America. 108 (10): 4158–63. Bibcode:2011PNAS..108.4158W. doi:10.1073/pnas.1015676108. PMC 3053989. PMID 21368115.
  66. McInerney MJ, Sieber JR, Gunsalus RP (December 2009). "Syntrophy in anaerobic global carbon cycles". Current Opinion in Biotechnology. 20 (6): 623–32. doi:10.1016/j.copbio.2009.10.001. PMC 2790021. PMID 19897353.
  67. Klitgord N, Segrè D (August 2011). "Ecosystems biology of microbial metabolism". Current Opinion in Biotechnology. 22 (4): 541–6. doi:10.1016/j.copbio.2011.04.018. PMID 21592777.
  68. Leininger S, Urich T, Schloter M, Schwark L, Qi J, Nicol GW, et al. (August 2006). "Archaea predominate among ammonia-oxidizing prokaryotes in soils". Nature. 442 (7104): 806–9. Bibcode:2006Natur.442..806L. doi:10.1038/nature04983. PMID 16915287. S2CID 4380804.
  69. Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, Mikhailova N, et al. (August 2016). "Uncovering Earth's virome". Nature. 536 (7617): 425–30. Bibcode:2016Natur.536..425P. doi:10.1038/nature19094. PMID 27533034. S2CID 4466854.
  70. Paez-Espino D, Chen IA, Palaniappan K, Ratner A, Chu K, Szeto E, et al. (January 2017). "IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses". Nucleic Acids Research. 45 (D1): D457–D465. doi:10.1093/nar/gkw1030. PMC 5210529. PMID 27799466.
  71. Paez-Espino D, Roux S, Chen IA, Palaniappan K, Ratner A, Chu K, et al. (January 2019). "IMG/VR v.2.0: an integrated data management and analysis system for cultivated and environmental viral genomes". Nucleic Acids Research. 47 (D1): D678–D686. doi:10.1093/nar/gky1127. PMC 6323928. PMID 30407573.
  72. Paez-Espino D, Pavlopoulos GA, Ivanova NN, Kyrpides NC (August 2017). "Nontargeted virus sequence discovery pipeline and virus clustering for metagenomic data" (PDF). Nature Protocols. 12 (8): 1673–1682. doi:10.1038/nprot.2017.063. PMID 28749930. S2CID 2127494.
  73. Kristensen DM, Mushegian AR, Dolja VV, Koonin EV (January 2010). "New dimensions of the virus world discovered through metagenomics". Trends in Microbiology. 18 (1): 11–9. doi:10.1016/j.tim.2009.11.003. PMC 3293453. PMID 19942437.
  74. Kerepesi C, Grolmusz V (March 2016). "Giant viruses of the Kutch Desert". Archives of Virology. 161 (3): 721–4. arXiv:1410.1278. doi:10.1007/s00705-015-2720-8. PMID 26666442. S2CID 13145926.
  75. Kerepesi C, Grolmusz V (June 2017). "The "Giant Virus Finder" discovers an abundance of giant viruses in the Antarctic dry valleys". Archives of Virology. 162 (6): 1671–1676. arXiv:1503.05575. doi:10.1007/s00705-017-3286-4. PMID 28247094. S2CID 1925728.
  76. Copeland CS (September–October 2017). "The World Within Us" (PDF). Healthcare Journal of New Orleans: 21–26.
  77. Jansson J (2011). "Towards "Tera-Terra": Terabase Sequencing of Terrestrial Metagenomes". Microbe. 6 (7). p. 309. Archived from the original on 31 March 2012.
  78. Vogel TM, Simonet P, Jansson JK, Hirsch PR, Tiedje JM, Van Elsas JD, Bailey MJ, Nalin R, Philippot L (2009). "TerraGenome: A consortium for the sequencing of a soil metagenome". Nature Reviews Microbiology. 7 (4): 252. doi:10.1038/nrmicro2119.
  79. "TerraGenome Homepage". TerraGenome international sequencing consortium. Retrieved 30 December 2011.
  80. Committee on Metagenomics: Challenges and Functional Applications, National Research Council (2007). Understanding Our Microbial Planet: The New Science of Metagenomics (PDF). The National Academies Press.
  81. Charles T (2010). "The Potential for Investigation of Plant-microbe Interactions Using Metagenomics Methods". Metagenomics: Theory, Methods and Applications. Caister Academic Press. ISBN 978-1-904455-54-7.
  82. Bringel F, Couée I (22 May 2015). "Pivotal roles of phyllosphere microorganisms at the interface between plant functioning and atmospheric trace gas dynamics". Frontiers in Microbiology. 6: 486. doi:10.3389/fmicb.2015.00486. PMC 4440916. PMID 26052316.
  83. Li LL, McCorkle SR, Monchy S, Taghavi S, van der Lelie D (May 2009). "Bioprospecting metagenomes: glycosyl hydrolases for converting biomass". Biotechnology for Biofuels. 2: 10. doi:10.1186/1754-6834-2-10. PMC 2694162. PMID 19450243.
  84. Jaenicke S, Ander C, Bekel T, Bisdorf R, Dröge M, Gartemann KH, et al. (January 2011). Aziz RK (ed.). "Comparative and joint analysis of two metagenomic datasets from a biogas fermenter obtained by 454-pyrosequencing". PLOS ONE. 6 (1): e14519. Bibcode:2011PLoSO...614519J. doi:10.1371/journal.pone.0014519. PMC 3027613. PMID 21297863.
  85. Suen G, Scott JJ, Aylward FO, Adams SM, Tringe SG, Pinto-Tomás AA, et al. (September 2010). Sonnenburg J (ed.). "An insect herbivore microbiome with high plant biomass-degrading capacity". PLOS Genetics. 6 (9): e1001129. doi:10.1371/journal.pgen.1001129. PMC 2944797. PMID 20885794.
  86. Simon C, Daniel R (November 2009). "Achievements and new knowledge unraveled by metagenomic approaches". Applied Microbiology and Biotechnology. 85 (2): 265–76. doi:10.1007/s00253-009-2233-z. PMC 2773367. PMID 19760178.
  87. Wong D (2010). "Applications of Metagenomics for Industrial Bioproducts". Metagenomics: Theory, Methods and Applications. Caister Academic Press. ISBN 978-1-904455-54-7.
  88. Schloss PD, Handelsman J (June 2003). "Biotechnological prospects from metagenomics" (PDF). Current Opinion in Biotechnology. 14 (3): 303–10. doi:10.1016/S0958-1669(03)00067-3. PMID 12849784. Archived from the original (PDF) on 4 March 2016. Retrieved 20 January 2012.
  89. Kakirde KS, Parsley LC, Liles MR (November 2010). "Size Does Matter: Application-driven Approaches for Soil Metagenomics". Soil Biology & Biochemistry. 42 (11): 1911–1923. doi:10.1016/j.soilbio.2010.07.021. PMC 2976544. PMID 21076656.
  90. Parachin NS, Gorwa-Grauslund MF (May 2011). "Isolation of xylose isomerases by sequence- and function-based screening from a soil metagenomic library". Biotechnology for Biofuels. 4 (1): 9. doi:10.1186/1754-6834-4-9. PMC 3113934. PMID 21545702.
  91. Hover BM, Kim SH, Katz M, Charlop-Powers Z, Owen JG, Ternei MA, et al. (April 2018). "Culture-independent discovery of the malacidins as calcium-dependent antibiotics with activity against multidrug-resistant Gram-positive pathogens". Nature Microbiology. 3 (4): 415–422. doi:10.1038/s41564-018-0110-1. PMC 5874163. PMID 29434326.
  92. Raes J, Letunic I, Yamada T, Jensen LJ, Bork P (March 2011). "Toward molecular trait-based ecology through integration of biogeochemical, geographical and metagenomic data". Molecular Systems Biology. 7: 473. doi:10.1038/msb.2011.6. PMC 3094067. PMID 21407210.
  93. Lavery TJ, Roudnew B, Seymour J, Mitchell JG, Jeffries T (2012). Steinke D (ed.). "High nutrient transport and cycling potential revealed in the microbial metagenome of Australian sea lion (Neophoca cinerea) faeces". PLOS ONE. 7 (5): e36478. Bibcode:2012PLoSO...736478L. doi:10.1371/journal.pone.0036478. PMC 3350522. PMID 22606263.
  94. "What's Swimming in the River? Just Look For DNA". NPR.org. 24 July 2013. Retrieved 10 October 2014.
  95. Chua PY, Crampton-Platt A, Lammers Y, Alsos IG, Boessenkool S, Bohmann K (25 May 2021). "Metagenomics: A viable tool for reconstructing herbivore diet". Molecular Ecology Resources: 1755–0998.13425. doi:10.1111/1755-0998.13425. PMID 33971086.
  96. George I, Stenuit B, Agathos SN (2010). "Application of Metagenomics to Bioremediation". In Marco D (ed.). Metagenomics: Theory, Methods and Applications. Caister Academic Press. ISBN 978-1-904455-54-7.
  97. Zimmer C (13 July 2010). "How Microbes Defend and Define Us". New York Times. Retrieved 29 December 2011.
  98. Nelson KE, White BA (2010). "Metagenomics and Its Applications to the Study of the Human Microbiome". Metagenomics: Theory, Methods and Applications. Caister Academic Press. ISBN 978-1-904455-54-7.
  99. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. (March 2010). "A human gut microbial gene catalogue established by metagenomic sequencing". Nature. 464 (7285): 59–65. Bibcode:2010Natur.464...59.. doi:10.1038/nature08821. PMC 3779803. PMID 20203603.
  100. Abubucker S, Segata N, Goll J, Schubert AM, Izard J, Cantarel BL, Rodriguez-Mueller B, Zucker J, Thiagarajan M, Henrissat B, White O, Kelley ST, Methé B, Schloss PD, Gevers D, Mitreva M, Huttenhower C (2012). "Metabolic Reconstruction for Metagenomic Data and Its Application to the Human Microbiome". PLOS Computational Biology. 8 (6): e1002358. Bibcode:2012PLSCB...8E2358A. doi:10.1371/journal.pcbi.1002358. PMC 3374609. PMID 22719234.



Will I lose my lab skills?

In 2013, I did an internship with a plant-evolution research group that was entirely computational — no pipettes, bottles or plants — and it was an exciting experience. The idea of doing genuine scientific discovery with my laptop and a cup of coffee sounded idyllic at first. But after a while, running a gel started to seem very inviting. After returning to my lab, I really enjoyed finding a balance between the bench top and the laptop.


1. Introduction

The Human Genome Project and high-throughput experimental methodologies such as microarray chromatin-immunoprecipitation DNA chips (ChIP-chip) have led to the development of biology as an increasingly information-rich science encompassing transcriptomes, proteomes, metabolomes, interactomes, and so forth [1,2]. Some have suggested that systems biology is nothing more than a new name for integrative physiology, which has been practiced for the past 50 years or more. Because of these novel technologies, biologists have been able to collect data at a rate that was unimaginable a decade ago. The context of biology has profoundly changed over the past 20 years. These changes provide a powerful new framework for systems biology that moves it far beyond classical integrative physiology. A systems biology approach implies that every system at any level of biological organization can be analyzed with respect to the system's structure, in particular, in terms of its dynamics, method of control, and method of system design. Systems biology involves genomic, transcriptomic, proteomic, and metabolic investigations from a systematic perspective. As a result, systems biology has become the frontier of modern biological research; large amounts of new omics data cannot be understood without a network or systems viewpoint and without highly sophisticated computational analyses [3,4,5,6,7,8,9,10,11].

The role of systems biology in modern biological research (Figure 1) requires powerful computational tools to mine large-scale data sets of information on genetics, proteins, DNA–protein binding, metabolism, and so forth. These tools are used to construct dynamic system models for the interpretation of specific mechanisms of some cellular phenotypes (or behaviors) from a system (or network) perspective [12,13,14,15,16,17]. To construct a dynamic system model of biological networks from omics data, system identification technologies (i.e., reverse-engineering schemes) are needed to estimate the parameter values of dynamic models and the order of biological networks [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44]. Synthetic biology and metabolic engineering have recently been developed for designing synthetic genetic networks that produce specific cellular functions in host cells [45,46,47,48,49,50,51]. Based on system models and mechanisms in systems biology, synthetic genetic circuits and metabolic engineering pathways can be designed to investigate cellular behaviors [52,53,54,55,56,57,58,59,60,61,62,63]. These synthetic biological technologies can be employed to investigate the models and mechanisms of systems biology. Discrepancies between the real behavior of synthetic genetic networks and the desired behavior predicted by the models and mechanisms of systems biology can be fed back to modify the models through methodologies of systems synthetic biology and systems metabolic engineering. Based on the role of systems biology (Figure 1), this review describes current developments in bioinformatics, systems synthetic biology, and systems metabolic engineering. It discusses how systems biology can serve as an integrated platform for bioinformatics, systems synthetic biology, and systems metabolic engineering in the future.

Figure 1. The role of systems biology as an integrated platform in modern biological research. Systems biology integrates information on genetics, proteins, DNA-protein binding, and metabolism with system dynamics modeling and system identification technology to develop models and mechanisms for the interpretation of phenotypes or behaviors of cellular physiology. Since large-scale data sets need to be mined, powerful computational tools are necessary. Based on system models and mechanisms in systems biology, synthetic genetic circuits are designed to investigate specific desired cellular behaviors of cellular physiology. Discrepancies between real and desired cellular behaviors are used as feedback to adjust system models and mechanisms. Systems biology is thus positioned to play the role of integrated platform for bioinformatics, systems synthetic biology, and systems metabolic engineering.

Bioinformatics is crucial in genome-wide analyses for understanding cell physiology at different cellular levels (i.e., genome, transcriptome, proteome, and metabolome levels) [12,13]. The various disciplines of bioinformatics provide invaluable information on the global cellular status for systems biology, systems synthetic biology, and systems metabolic engineering, as well as a thorough analysis of the cell. Genomic information in bioinformatics represents the whole genetic makeup of the organism [12,13,14,15,16,17], and comparative genomic analysis may contribute to systems synthetic biology or systems metabolic biology for targeting and engineering genetic circuits to create desirable cellular phenotypes. Transcriptome profiling uses DNA microarrays to decipher the expression levels of thousands of genes under various biological conditions [14,15]. The results can be used to select candidate genes for modification based on systematic analysis of regulatory genes in response to genetic variations and environmental changes, or to identify novel factors for the enhancement of heterologous product secretion in metabolic pathways [64,65,66,67,68]. Proteome profiling is also useful in obtaining transcriptome-profiling data at the protein level. The metabolome comprises the entirety of information on metabolites present within and/or outside the cell under specified conditions [61,62,63]. It is expected to contribute significantly to the understanding of the cell and of synthetic circuit engineering in its metabolic pathways. In this paper, recent advances in the application of bioinformatics to systems synthetic biology and systems metabolic engineering through systems biology are reviewed using specific examples.

Although bioinformatic information (e.g., data on genetics, protein binding, and metabolism) is available, several stages of systems biology are required to help us understand, via system dynamics modeling, the underlying molecular mechanisms of genetic regulatory (GR) networks [18,19,20,21,22,23,24], protein–protein interaction (PPI) networks [25,26,27,28,29], and metabolic networks [61,62,63] under various biological conditions. At the first stage, a putative GR or PPI network is created by large-scale integration of knowledge such as information from publications and databases, and high-throughput data (from data mining or deep curation). Based on this network and dynamic modeling, the actual GR or PPI network of cellular physiology can be identified with system identification methods (reverse-engineering scheme [34]) by using specific microarray gene or protein expression data [18,19,20,21,22,23,24,25,26,27,28]. For example, GR networks (GRNs) have been constructed by dynamic modeling via microarray data for cell cycles [18,23,24], environmental stressors [28,44], photosynthesis [69], aging [34], and cancer [39]. PPI networks have been constructed for cancer [3,9,33,39,70], inflammation [41], biofilm formation [43], and infection by Candida albicans. Comparison of PPI networks between healthy and cancer cells can provide network markers for the investigation of the systematic mechanism of cancer [70]. The integration of cellular networks of GRs and PPIs provides deeper insight into actual biological networks and is more predictive than an approach without integration [71]. A systematic and efficient method to integrate different kinds of omics data for the construction of integrated cellular networks via microarray data has been provided, based on coupling dynamic models and statistical assessments [44].
This method has been shown to be powerful and flexible, and can be used to construct integrated networks at different cellular levels to investigate cellular machinery under different biological conditions and for different species. Coupling dynamic models of the whole integrated cellular network is very useful for theoretical analyses and for further experiments in the field of network biology and synthetic biology.

In short, synthetic biology is the engineering of biological systems to fulfill a particular purpose. It is possible to build living machines from off-the-shelf genetic devices by employing many of the same strategies that electrical engineers use to manufacture computer chips [47,48,49]. The main goal of this nascent field is the design and construction of biological systems with desired behaviors [51]. Synthetic biology envisions the redesign of natural biological systems as well as the construction of functional “genetic circuits” by using a set of powerful biotechniques for the automated synthesis of DNA molecules and their assembly into genes and microbial genomes [47]. Synthetic biology is predicted to have important applications in biotechnology, metabolic engineering, and medicine. It may revolutionize how we conceptualize and approach the engineering of biological systems [49]. As illustrated in Figure 1 , synthetic genetic circuits can furthermore be employed to confirm network mechanisms derived using systems biology methods, and can be used as feedback for their improvement or revision. Synthetic biology is therefore expected to contribute significantly to a better understanding of the functioning of complex biological systems such as metabolic pathways.

However, the development of synthetic gene networks is still difficult. Most newly created gene networks are nonfunctional because of intrinsic parameter fluctuations, environmental disturbances, and functional variations in the intra- and extracellular context. For this reason, the design method based on dynamic models for robust synthetic gene networks has become an important topic in synthetic biology [52,53,54,55,56,57,58,59,60,68]. These system-dynamics-based design methods for synthetic biology lead to systems synthetic biology.

Heterologous genes have previously been combined into pathways in metabolic engineering, generating a myriad of non-native biochemical products, including isoprenoids, hydroxyacids, biofuels, polypeptides, and biopolymers [64,65,66,67]. Synthetic biologists developed synthetic tools to engineer genetic devices capable of performing complex biological functions such as sensing cell states, counting cellular events, and implementing computational logic. These tools have been applied to the modification and control of metabolic pathways in several organisms. They consist of one or more parts that have been combined to perform a complex function, and provide metabolic engineers with novel ways of exerting cellular control over heterologous production pathways. Some synthetic biological devices with potential relevance in metabolic engineering include orthogonal inducible promoters, light-sensitive promoters, state sensors, spatiotemporal controllers, and logic gates, as well as promoter and ribosome binding site (RBS) libraries [36,37]. Since metabolic engineering seeks to control cellular metabolism and manipulate it through heterologous pathways to maximize production of a desired molecule, metabolic engineers need elegant methods for gathering information about cells and their environment, and for modulating gene expression in response [37,61,62,63]. Hence, devices of synthetic biology promise to be a useful addition to the metabolic engineering toolbox. Some synthetic devices have already been used to increase product titers. However, many remain largely untested in an industrial setting, and the complexity of biology makes their application a feat of engineering [36,37,72].
From the systems biology perspective, continuous work with these devices can help elucidate design rules or aid the development of system dynamics models that facilitate their integration into metabolic industrial processes [36,37,72], and thereby lead to the development of systems metabolic engineering.

Several mathematical techniques based on systems biology have been developed to analyze the systematic properties of complex biological networks. For example, system sensitivity of a biological network in response to various parameter variations has been analyzed to determine the systematic properties that affect the robustness and fragility of a biological network. System sensitivity analysis not only can reveal the robust stability of a biological network against various perturbations, but may also provide information about the controllability of a biological network [7,8,9]. The system response ability of a biological network is a measure of response to environmental signals or disturbances [28,34]. From the system theory perspective, robustness to intrinsic system variation and the ability to respond to external stimuli are two important and complementary system characteristics in the evaluation of system performance [33,34]. A biological system that is more robust toward large intrinsic parameter fluctuations is less responsive to environmental disturbances, and vice versa. A systems biology investigation of the aging-related gene network via microarray data found that network robustness increases and network response ability decreases during the aging process [34]. The sensitivity of a biological genetic system to environmental molecular noise is considered an indication of the noise-filtering ability of the gene network [42]. Systems biology allows the measurement of these system characteristics, as well as the capabilities of the signal transduction pathway [8]. Similarly, flux amplification of the metabolic pathway can be estimated by using system dynamics models.

As nonlinear biological networks operate under different conditions of cellular homeostasis and homeodynamics, systems biology studies on complex biological models in the landscape of phenotypes are highly informative. These studies help discover possible equilibrium points (phenotypes) and dynamic behaviors, such as bifurcation, oscillation, robust stability, and phase drift to other equilibrium points (phenotype transition). Bifurcation analysis and phase-plane analysis of nonlinear dynamic networks can be useful in predicting system behavior of biological networks under intrinsic parameter changes. Through a systems biology approach and dynamic modeling [38,39,40,41,42], network robustness and noise-filtering ability can be improved via feedback, redundancy, and modular schemes. This is why there are so many feedback loops, redundant genes, and modular structures at different scales of biological networks. A unifying mathematical framework based on nonlinear stochastic dynamic models [73,74,75] was recently proposed to describe different levels of stochastic biological networks under different parameter fluctuations, genetic variations, and environmental disturbances [29,30,59]. The phenotype robustness criteria of biological networks in systems, evolutionary, ecological, and synthetic biology were investigated from a systematic perspective on the basis of robust stabilization and filtering ability. Network robustness of biological networks can confer intrinsic robustness toward intrinsic parameter fluctuations, genetic robustness for buffering genetic variations, and environmental robustness for resisting environmental disturbances. It was found that if the sum of intrinsic robustness, genetic robustness, and environmental robustness is less than or equal to the network robustness, then the phenotype is robust in different levels of biological networks in systems, evolutionary, ecological, synthetic, and metabolic biology.
These phenotype criteria at different levels of biological networks are useful for the design of synthetic and metabolic systems. A systems biology approach based on dynamic models can clearly provide not only a systematic insight into behaviors at different levels of biological networks, but also a design platform to improve system robustness, filtering ability, and transduction ability of synthetic and metabolic system networks, which are discussed in detail in the following sections.
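
The robustness criterion stated above can be written compactly. The symbols below are labels introduced here for illustration only; they are not notation taken from the cited work.

```latex
\[
R_{\mathrm{intrinsic}} + R_{\mathrm{genetic}} + R_{\mathrm{environmental}}
\;\le\; R_{\mathrm{network}}
\quad\Longrightarrow\quad
\text{the phenotype is robust}
\]
```

Read this way, the network robustness acts as a budget: whatever is spent on tolerating parameter fluctuations is no longer available for buffering genetic variation or environmental disturbances.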


Methods

Electroporation of mouse embryos

Embryos obtained from CBA/CA X C57Bl/10 crosses were dissected without removing placental membranes at embryonic day E10.5, and were transferred into Tyrodes solution [77]. At this stage Mash1 and Ngn2 are already expressed in the presumptive basal ganglia and cortex, respectively. The neuroepithelium is thus competent for proneural activity and GOF studies should allow the detection of genes regulated by Mash1 or Ngn2. Embryos were precultured for 2 h in a 'precision incubator' (BTC Engineering, Milton, Cambridge, UK) at 37°C with 65% oxygen in 75% v/v Rat serum + 25% v/v Tyrodes solution and 2 mg/ml of glucose. For microarray experiments both telencephalic vesicles were injected, whereas only one vesicle was injected in embryos processed for in situ hybridization. Injections were made using a FemtoJet Microinjector (Eppendorf) with 2 μl of a solution containing 3 μg/μl Mash1- or Ngn2-pCAGGS expression vector [78] and 2 μg/μl GFP control vector. Electroporation was performed in Tyrodes solution in a CUY520P20 chamber (Nepagene, Japan) using a BTX Electro Square Porator (Eppendorf), with the following settings: 70 V, five pulses, 50 ms at 1 s intervals. Electroporated embryos were cultivated in Rat serum supplemented with glucose as above, for the indicated time. After 18 h, embryos with a beating heart were dissected under a UV binocular microscope, and the electroporated (GFP-expressing) tissue was isolated and homogenized immediately in 300 μl Trizol (Invitrogen). The choice of the time of collection of the tissue was based on maximal expression of the Dll1 promoter-lacZ reporter located in the 0.8 kb distal promoter region described previously [79] (see Figure 2(a)) and on a more detailed analysis by quantitative RT-PCR of the time course of induction of Dll1 by Mash1 in P19 cells [12]. This work showed that induction of this direct target of Mash1 becomes detectable 4–7 h after Mash1 expression but reaches a plateau only after about 15 h.
Therefore, this time of collection maximizes the detection of putative direct targets; however, it does not rule out the possibility of detecting indirect targets as well. Total RNA was extracted following manufacturer recommendations and resuspended in 12 μl of diethylpyrocarbonate (DEPC)-treated water (Ambion). Between 3 and 10 electroporated cortices or basal ganglia were pooled to produce a minimum of 1 μg total RNA. The preparation of probes and hybridization to MG430 2.0 chips were performed following Affymetrix guidelines.

RNA in situ hybridization and immunocytochemistry

Electroporated embryos were washed for 30 min at 4°C in phosphate buffered saline (PBS) 1×, fixed in 4% paraformaldehyde (PFA) for 3 h, washed again in PBS 1× and incubated in 15% sucrose phosphate buffer 0.12 M (PB), pH 7.2, overnight at 4°C, incubated in gelatin 7.5%/sucrose 15% PB at 42°C and frozen in isopentane at -40°C. Wild-type and mutant embryos (Ngn2-/- and Mash1-/-) at stage E12.5 or E13.5 were fixed overnight in 4% PFA at 4°C and then incubated overnight as described above. Embryos were sectioned at 10 μm using a Microm cryostat. In situ hybridizations were carried out as described previously [12], with NBT/BCIP or fluorescent substrate in the case of Gp38/Podoplanin [80]. Mouse Elavl4 (HUC/D) polyclonal antibody was used as described previously [12]. All of the in situ analyses performed are summarized in Additional file 4. Briefly, eight genes putatively regulated by Ngn2 and eight genes regulated by Mash1 were selected based on their levels of downregulation and upregulation in the microarray data, as well as their potential involvement in neuronal development (Mnfg, Lnfg, Lhx8, HuC/D, Nscl1) or expression in the developing nervous system (Gp38/podoplanin, Rhomboid, Nrarp). Two out of eight Ngn2 candidate genes were not detected in situ in the cortex of WT embryos and were not tested further. Candidates showing a consistent regulation in Ngn2 or Mash1 LOF mutant embryos (five out of six expressed genes for Ngn2 and six out of eight genes for Mash1) were further analyzed by in situ hybridization on electroporated GOF embryos.

P19 cell transfection and quantitative RT-PCR

P19 embryonal carcinoma cells are pluripotent cells that specifically differentiate into neurons when induced by retinoic acid or when transfected with Mash1-, Ngn1-, or NeuroD-expressing vectors [81, 82]. We seeded 250,000 P19 cells into 21 cm culture dishes in DMEM (Gibco) supplemented with 5% goat serum and incubated overnight at 37°C. Cells were transfected in duplicate with 2 μg Ngn2-, Mash1-, or empty pCAGGS vectors and 0.1 μg GFP control vector mixed with Lipofectamine 2000 (Invitrogen) following the manufacturer's recommendations. Total RNA extraction was performed in 2 ml Trizol (Invitrogen). RNA pellets were resuspended in 50 μl DEPC-treated water (Ambion) and the RNA concentration was determined by spectrophotometry. A total of 2 μg RNA was treated with 10 units DNase I (Invitrogen) and reverse transcribed with Superscript III (Invitrogen). Quantitative PCR was performed in duplicate with SYBR Green (Roche) on a LightCycler instrument (Roche). A cDNA from hydroxymethylbilane synthase was used as a reference for normalization. Primer sequences are available from the authors upon request.

Global gene expression analyses

Twenty-eight separate gene expression datasets were used for the identification and quantification of GRNs for forebrain development using either the U74A and U74B or MOE430 2.0 Affymetrix microarray platforms. Analyses of tissues from dorsal and ventral telencephalon from wild-type (n = 14), Ngn1-/-, Ngn2-/-, Mash1-/-, Ngn1-/- Ngn2-/-, and Ngn2-/- Mash1-/- transgenic mice and GOF tissues from mice in which Ngn2 or Mash1 were electroporated on E10.5 and killed 18 h later (see above) are included in this dataset. Analyses of Ngn1, Ngn2, and Mash1 single and double knockouts were performed with RNA extracted from tissue dissected from E13.5 mice and hybridized to U74A and U74B Affymetrix chips. Basal ganglia from Mash1 KO embryos were dissected and processed for RNA trizol extraction as described previously [4]. Microarray data from cortical tissue have been described previously [4]. Two replicates of each control and a single knockout were analyzed, whereas one replicate for each double knockout and control was sampled. Microarray analyses of dorsal and ventral telencephalic tissues from control and GOF mice (E10.5 mice cultured for 18 h) were performed using the Affymetrix MOE430 2.0 chip. Replicates were performed for a total of two dorsal telencephalon controls, three ventral telencephalon controls, and three each of the Ngn2 and Mash1 GOF mice. Normalization was performed using GC-RMA software for background adjustment using sequence information [83] downloaded from [84]. MOE and U74 probesets were assigned Ensembl IDs based on Ensembl Version 37 and duplicate Ensembl IDs were collapsed within a set by taking the median value. Gene expression ratios used for the subsequent network analyses described below were derived from individual mutant arrays versus a time-matched wild-type control gene expression array. To generate a list of putative target genes we used a 1.3-fold cut-off. 
We note that the use of a fold change cut-off has been shown to be more reliable than p-value or false discovery rate (FDR) cut-offs, in a multicenter large-scale quality control analysis across laboratories and platforms [85]. Furthermore, our target gene lists are generated from two independent fold change cut-offs, from both transgenic and GOF experiments, thereby increasing confidence in the resultant target gene lists. GOF experiments were performed at an earlier stage in neurogenesis (E10.5 and cultured for 18 h) than LOF experiments (E13.5) so that expression of Mash1 and Ngn2 in the telencephalon is still low and the effect of overexpressing these genes is maximal, while the LOF analysis had been performed at a stage when Mash1 and Ngn2 expression is high, to maximize the effect of loss of these genes. Based on previous analysis of Mash1 and Ngn2 function in telencephalic development [3, 4], we do not expect these genes to have substantially different functions and target genes at these two stages.
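
The two data-reduction steps described above (collapsing duplicate Ensembl IDs by their median, then applying a 1.3-fold cut-off to mutant versus control ratios) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names, probe names, and gene IDs are hypothetical, and the values are assumed to be on a linear intensity scale (GC-RMA output is usually log2, in which case the ratio would become a difference).

```python
from collections import defaultdict
from statistics import median


def collapse_by_gene(probe_values, probe_to_gene):
    """Collapse probesets that map to the same gene ID by taking the median."""
    grouped = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # skip probes without an assigned gene ID
            grouped[gene].append(value)
    return {gene: median(values) for gene, values in grouped.items()}


def putative_targets(mutant, control, cutoff=1.3):
    """Return genes whose mutant/control ratio changes at least `cutoff`-fold,
    in either direction. Assumes linear-scale, strictly positive intensities."""
    hits = {}
    for gene, m in mutant.items():
        c = control.get(gene)
        if c is None or c <= 0 or m <= 0:
            continue
        ratio = m / c
        if ratio >= cutoff or ratio <= 1 / cutoff:
            hits[gene] = ratio
    return hits


# Hypothetical example data (not from the study).
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
mutant = collapse_by_gene({"p1": 100.0, "p2": 300.0, "p3": 50.0}, probe_to_gene)
control = {"GENE_A": 100.0, "GENE_B": 55.0}
print(putative_targets(mutant, control))
```

In this toy input only GENE_A passes the cut-off; GENE_B changes by less than 1.3-fold in either direction and is dropped.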

Quantification of networks

The strength of the relationships in GRNs was quantified by calculating the posterior probability distribution for the strength of the linkages, based on the fold changes seen in the gene expression datasets [8]. A log-linear function was used to describe relationships between genes:

where α_i is the level of gene expression independent of the network, I_ji is an indicator function (0, 1, -1) if a linkage exists from gene j to gene i, β_ji is the degree to which change in gene j will affect change in gene i, G_j is a variable associated with the relative expression level of gene j compared with normal level j, e_i is the random error in predicted value for gene i and n is the number of genes in the network.
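
The equation these symbols belong to did not survive in this copy. One form consistent with the definitions above is given below. It is a reconstruction, assuming that G denotes expression ratios on a log scale (which is what would make the model log-linear), and it should be checked against the original reference [8].

```latex
\[
G_i \;=\; \alpha_i \;+\; \sum_{j=1}^{n} I_{ji}\,\beta_{ji}\,G_j \;+\; e_i
\]
```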

The posterior distributions for the linkages in each network were derived using Markov chain Monte Carlo (MCMC) sampling methods as described elsewhere [8, 34]. For the current analysis, KO and GOF effects on genes are modeled as dedicated parents where the prior for α_i is set to zero; all other α are assumed to have normal priors. The priors for the β are assumed normal with mean zero and variance σ = 1. Finally, e_i is assumed to be normally distributed with mean zero and variance σ², where σ² is assumed to have a uniform prior with support defined by the observed data. The MCMC maximum sampling step sizes are 0.05 for the σ, 0.08 for the β, and 0.05 for the α, and 500,000 iterations were performed with decimation of every 10th value. The last 50,000 iterations were used to establish the mean value of β_ji and the significance of this value. Statistical significance of the parameter β_ji is defined by less than 5% of iterations with β_ji ≤ 0. To address the specificity of our method, we have permuted the gene labels from the microarray experiments (n = 12,357), generating 100 random datasets of gene expression. We then applied these datasets to quantitate the lit-based network to determine the number of times we see significance of these connections from each randomly generated dataset. It should be noted that gene expression correlations across experimental conditions are preserved in this analysis. Software for performing these analyses is available from JMG.
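
To make the sampling and significance rule concrete, the sketch below runs a random-walk Metropolis sampler for a single linkage strength β in a one-parent model y = βx + e. It is a deliberately reduced illustration, not the authors' software: the single-parent model, the fixed error standard deviation, the uniform proposal, and all function names are assumptions made here. It borrows from the text only the N(0, 1) prior on β, the 0.08 maximum step size, thinning to every 10th value, use of the last 10% of the chain, and the "fewer than 5% of draws with β ≤ 0" criterion.

```python
import math
import random


def log_posterior(beta, xs, ys, sigma=0.5):
    """Unnormalized log posterior for y = beta * x + e, with e ~ N(0, sigma^2)
    and a N(0, 1) prior on beta. sigma is held fixed here for simplicity."""
    log_prior = -0.5 * beta ** 2
    log_lik = sum(-0.5 * ((y - beta * x) / sigma) ** 2 for x, y in zip(xs, ys))
    return log_prior + log_lik


def sample_beta(xs, ys, n_iter=20000, max_step=0.08, thin=10, seed=0):
    """Random-walk Metropolis sampler; keeps every `thin`-th value of beta."""
    rng = random.Random(seed)
    beta = 0.0
    current = log_posterior(beta, xs, ys)
    kept = []
    for i in range(n_iter):
        proposal = beta + rng.uniform(-max_step, max_step)
        candidate = log_posterior(proposal, xs, ys)
        # Accept uphill moves always, downhill moves with probability exp(diff).
        if candidate >= current or rng.random() < math.exp(candidate - current):
            beta, current = proposal, candidate
        if i % thin == 0:
            kept.append(beta)
    return kept


def is_significant(samples, burn_fraction=0.9, alpha=0.05):
    """Use only the tail of the chain; call beta significant when fewer than
    `alpha` of the retained draws are <= 0."""
    tail = samples[int(len(samples) * burn_fraction):]
    frac_nonpositive = sum(b <= 0 for b in tail) / len(tail)
    return frac_nonpositive < alpha


# Hypothetical log-ratios for a parent gene (xs) and a target gene (ys).
xs = [1.0, -1.0, 0.5, -0.5, 2.0, -2.0]
ys = [0.8 * x for x in xs]
print(is_significant(sample_beta(xs, ys)))
```

The one-sided rule only detects positive β, which fits a model where the sign of the linkage is carried separately by the indicator I_ji.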

Identification of potential co-factors for Ngn2 and Mash1

Consensus binding sites for Ngn2 and Mash1 were defined as CANWTG and GCAGSTGK, or CAGSTG, respectively, based in part on [12, 28, 29] and unpublished data (DSC and FG) as described in Additional file 7. Due to the scale of the bioinformatics method performed for predicting co-factors, we limited our analysis to the sequence surrounding 11 predicted Ngn2 target genes, 14 predicted Mash1 target genes, and 6 common target genes, based on criteria similar to those used for in situ confirmation, in that we focused on the most differentially expressed as well as the best candidate genes from the literature (Additional file 8). For each gene we looked at a minimum of 500 kb of sequence in front of (approximately 300 kb) and behind (approximately 200 kb) the gene of interest, including UTRs and introns of that gene and of surrounding genes that fell within the 500 kb range. We utilized the ECR browser [86] to align human sequence with Mus musculus, Gallus gallus, Xenopus tropicalis, Fugu rubripes, and Danio rerio [87]. Sometimes no conserved regions were found within our search limits, in which case we removed alignments with the lower vertebrates (F. rubripes and D. rerio) and only analyzed alignments to G. gallus and/or X. tropicalis to find conserved non-coding regions. To further refine the alignment, the web-based Mulan program was utilized, which performs a full local multi-sequence alignment that can account for evolutionary reshuffling and inversions using the threaded blockset aligner program [87, 88]. From this analysis, evolutionary conserved regions (ECRs) with a minimum length of 100 bp and minimal percentage identity of 70% were defined. Finally, we applied Multitf, which searches across the identified ECRs for conserved TFBSs [88], to search for putative Mash1 (GCAGSTGK or CAGSTG) and Ngn2 (CANWTG) binding sites. In total, 160 conserved Ngn2 and 75 conserved Mash1 binding sites were identified.
These sites were distributed over the 500 kb analyzed, although the highest number of sites was found in the 20 kb of sequence surrounding the TSS (Additional file 12). Specifically, 19 Mash1 sites were found surrounding the 11 Ngn2 target genes (average number of sites per gene is 1.7) versus 56 Mash1 sites surrounding the 20 Mash1 and common target genes (average number of sites per gene is 2.8). However, this difference is not statistically significant (p = 0.2). Furthermore, Ngn2 binding sites are found just as often in front of Ngn2 targets versus Mash1 and/or common target genes (82 Ngn2 sites were found surrounding the 14 Mash1 targets and 22 Ngn2 sites were found surrounding the 6 common targets versus 58 sites found surrounding the 11 Ngn2 target genes). The similarity between Ngn2 and Mash1 and potential function in central nervous system development suggests Mash1 targets could be regulated by Ngn2 as well, in the telencephalon or other tissues. In addition, we note that Mash1 target genes are not equivalent to a random set of genes when analyzing for enrichment of Ngn2 sites. With regards to enrichment of Ebox sites in our putative target genes, the CONFAC analysis described below allowed us to show that several Ebox matrices were significantly enriched in the sequence surrounding our predicted Ngn2 and Mash1 target genes when compared with 250 randomly selected genes (Additional file 9).
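
The consensus sites above are written with IUPAC degenerate codes (N = any base, W = A or T, S = C or G, K = G or T). The sketch below shows how such a consensus can be expanded into a regular expression and scanned against a sequence. It is not Multitf and ignores conservation entirely; the function names and the test sequence are hypothetical, and only the strand supplied is scanned.

```python
import re

# Only the IUPAC codes that occur in the consensus sites quoted in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "[ACGT]", "W": "[AT]", "S": "[CG]", "K": "[GT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus into a compiled regex.
    The lookahead wrapper lets overlapping sites be reported."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_sites(sequence, consensus):
    """Return 0-based start positions of all matches on the given strand."""
    pattern = consensus_to_regex(consensus)
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Hypothetical sequence containing one CAGSTG site and one CANWTG site.
seq = "TTCAGCTGAACATATGTT"
print(find_sites(seq, "CAGSTG"))   # Mash1-type site
print(find_sites(seq, "CANWTG"))   # Ngn2-type site
```

A fuller scan would also search the reverse complement, since CANWTG is not identical to its own reverse complement.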

To identify potential co-factors, we searched for all vertebrate TRANSFAC annotated TFBSs within 30 bp upstream and downstream of the putative Ngn2 or Mash1 binding sites. The 30 bp length was based in part on prior research showing active modules containing Pou and bHLH binding sites within 15 bp of each other [12]. We removed those TRANSFAC annotated TFBSs that overlapped considerably with the putative Ngn2 and Mash1 sites including the following TRANSFAC matrices: E12, E2A, Heb, Hen1, Hand1, E47, Ebox, myogenin, NeuroD, Myod, Areb6, Tal1, Lbp1, Ap4, E47, and lmo2com. We also collapsed all similar TRANSFAC matrices that referenced the same family of transcription factors (for example, Pou domain containing factors and SRY domain containing factors) or were for the same transcription factor, but identified in different vertebrate species. TRANSFAC matrices were mapped to current mouse gene identifiers by following the original reference for the matrix found in TRANSFAC through the literature. To identify the most likely co-factors for Ngn2 and Mash1, we performed the Fisher's exact two-sided test with p < 0.05 to test for significantly enriched TFBSs in sequence surrounding Ngn2 sites versus Mash1 sites or vice versa. Gene expression from wild-type dorsal and ventral telencephalon tissue was analyzed to predict differential dorsal or ventral expression patterns of the predicted co-factors using a 1.5-fold change cut-off. These were subsequently compared with in situ analyses found in online databases (Additional file 10).
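
As an illustration of the enrichment test, here is a small standard-library sketch of a two-sided Fisher's exact test on a 2x2 table (windows around Ngn2 sites versus windows around Mash1 sites, with or without a given TFBS). The function name and the counts in the example are hypothetical; the two-sided p-value is computed by summing the probabilities of all tables that are no more likely than the observed one, which is one common convention and may differ from the software the authors used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts: rows = Ngn2 / Mash1 site windows,
# columns = candidate TFBS present / absent within 30 bp.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
is_enriched = p_value < 0.05
```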

Identification of putative co-regulators of Ngn2 and Mash1 targets

Promoter region sequence (10,000 bp upstream of TSS) from mouse and human orthologs of Ngn2 predicted targets and Mash1 and Mash1/Ngn2 predicted common targets (only those with RefSeq IDs associated with them) was automatically uploaded from the UCSC database via the CONFAC website [89]. CONFAC then identifies conserved TFBSs from the TRANSFAC database version 7.0 in the human and mouse sequence alignments [30]. As part of the CONFAC software, the Mann-Whitney statistical test was then applied to test for enrichment of TFBSs in the given gene lists. We compared each list with a list of 250 randomly picked genes available from the CONFAC website, as well as comparing our Ngn2 list with the Mash1/common targets list and vice versa. We then annotated the resulting lists of enriched TRANSFAC TFBSs as described above. Transcription factors that did not show minimal expression (> 4.5 median intensity) in wild-type microarray datasets were not analyzed further.
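
The comparison of a target gene list with a random gene list can be sketched as a Mann-Whitney U test on per-gene TFBS counts. The version below is a standard-library sketch using the normal approximation with a tie correction and no continuity correction; it is an assumption about the general form of the test, not a reproduction of the CONFAC code, and the per-gene counts are hypothetical.

```python
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Return (U for sample x, two-sided p) using the normal approximation."""
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term = 0.0, 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2          # mean of ranks i+1 .. j
        t = j - i                           # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0                      # all values tied: no evidence either way
    z = (u1 - n1 * n2 / 2) / sqrt(var_u)
    return u1, erfc(abs(z) / sqrt(2))


# Hypothetical counts of one conserved TFBS per promoter in two gene lists.
target_counts = [4, 5, 6]
random_counts = [1, 2, 3]
u_stat, p_value = mann_whitney_u(target_counts, random_counts)
```

For gene lists as small as this example an exact test would be preferable; the normal approximation is more reasonable for comparisons against a list of 250 genes.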

Algorithm-based network structure

The TAO-Gen algorithm identifies the optimal gene regulatory network given a specific gene expression dataset [34]. Briefly, our method utilizes a log-linear model (Equation 1) and MCMC to identify the network that best accounts for the variability seen in the microarray datasets. In order to explore larger networks, the number of possible networks in the search space is restricted. This is accomplished through use of an annealing algorithm that combines aspects of the Metropolis algorithm used for MCMC sampling and a simulated annealing algorithm used for optimization. The maximum number of parents for any given gene is restricted to three; however, no complexity penalty was used. Based on standard techniques in Bayesian networks [90], we include all network structures within the top 95% of scores based on the maximum likelihood and build a common network that includes those interactions that occur in more than 50% of these network structures. A detailed description, as well as a complete evaluation of this method through statistical simulation studies, has been published previously [34]. We have also performed detailed comparisons to another Bayesian network algorithm [35], which was subsequently coded for Matlab [36]. Results are described in detail in Additional file 1.
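
The final step, building a common network from interactions present in more than 50% of the retained structures, can be sketched as a simple edge vote. This is only an illustration: the function name is hypothetical, the gene labels are placeholders, and the selection of top-scoring structures is assumed to have happened upstream.

```python
from collections import Counter


def consensus_network(structures, threshold=0.5):
    """Keep (parent, child) edges found in more than `threshold` of the structures."""
    counts = Counter(edge for edges in structures for edge in set(edges))
    n = len(structures)
    return {edge for edge, count in counts.items() if count / n > threshold}


# Four hypothetical high-scoring structures over placeholder genes.
top_structures = [
    {("geneA", "geneB"), ("geneB", "geneC")},
    {("geneA", "geneB")},
    {("geneA", "geneB"), ("geneC", "geneD")},
    {("geneB", "geneC")},
]
common = consensus_network(top_structures)
```

Here geneA to geneB appears in three of four structures and is kept, while geneB to geneC appears in exactly half and is dropped because the threshold is strict.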

An informative prior structure was built utilizing several different data sources. The informative prior structure is represented as a matrix with 0 meaning forbidden connection, 1 meaning required connection, and 0.5 meaning no prior information is available. We considered 25 literature-based connections as required connections, based on previous literature data that is consistent with the current microarray data, which are highlighted in the resulting network. Genes whose functions are known and do not include direct transcription factor activity or DNA/RNA binding are forbidden from being parents, with the exception of the signaling molecules Wnt7b and Dll1, which are known initiators of transcription via Wnt/β-catenin and Notch pathways, respectively.

Parents are required to be expressed in the same tissue as children (greater than 4.5 median intensity (log2 base) in wild-type datasets); therefore, solely dorsally expressed genes are not permitted to parent ventrally expressed genes and vice versa. Each TFBS information source is given an informative prior value of 0.1, such that if a TFBS is found in a sequence in front of a given gene, the prior score is raised from 0.5 to 0.6. TFBS data were derived from both comparative genomics analyses described above. Results obtained with and without the prior structure are described in Additional file 1.
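
The prior rules described above can be sketched as a function that assigns a value to every candidate parent-child pair. All names below are hypothetical, and two details are assumptions not stated in the text: required connections take precedence over the forbidding rules, and self-edges are skipped.

```python
NO_INFO, FORBIDDEN, REQUIRED, TFBS_BONUS = 0.5, 0.0, 1.0, 0.1


def build_prior(genes, required, non_regulators, tissues, tfbs_sources):
    """Return {(parent, child): prior value}.

    required       -- set of (parent, child) literature-based connections
    non_regulators -- genes not allowed to act as parents
    tissues        -- gene -> set of tissues where it passes the expression cutoff
    tfbs_sources   -- (parent, child) -> number of TFBS sources supporting the edge
    """
    prior = {}
    for parent in genes:
        for child in genes:
            if parent == child:
                continue  # assumption: no self-regulation edges
            edge = (parent, child)
            if edge in required:
                prior[edge] = REQUIRED
            elif parent in non_regulators or not (tissues[parent] & tissues[child]):
                prior[edge] = FORBIDDEN
            else:
                bonus = TFBS_BONUS * tfbs_sources.get(edge, 0)
                prior[edge] = round(NO_INFO + bonus, 2)
    return prior


# Placeholder genes: two transcription factors and one enzyme.
prior = build_prior(
    genes=["tf1", "tf2", "enz"],
    required={("tf1", "tf2")},
    non_regulators={"enz"},
    tissues={"tf1": {"dorsal"}, "tf2": {"dorsal", "ventral"}, "enz": {"ventral"}},
    tfbs_sources={("tf2", "enz"): 1},
)
```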


Education and Training Resources

The educational institutions listed below have submitted information on their bioinformatics related online courses.


Focus Course Title University/Institution KeyInfo Goals
Computational Biology SysMIC University College London SysMIC is an online course in coding, modelling & data analysis for bioscience researchers. The module includes access to a comprehensive range of resources including course textbook, assignments, webinars and self-test quizzes, with dedicated module tutors who provide individual support through online forums. Participants will become confident in using the Python, R and MATLAB platforms, developing interdisciplinary skills that will make them more effective researchers with the confidence to develop and apply computational techniques in modelling and data analysis to their own work. Our materials were designed by bioscientists, for bioscientists and are focused on biological applications, and practical approaches that let participants develop coding and analysis skills while working in familiar contexts. The module has a fully flexible format. We recommend 5 hours of study a week over a 6 month period, which has been shown to fit well with the varying workload of research commitments, offering participants the opportunity to study at a time and pace that suits them. We are a well established and recognized course. Each year we welcome over 300 participants from across the UK research community who come from a wide range of academic (and industrial) backgrounds, and the module is approved by the Royal Society of Biology's CPD points scheme. Module 1 registrations are held twice yearly in April and November. Discounted rates are available for PhD students and group bookings. To apply for a place or for further information please contact us at: [email protected] Sysmic.ac.uk provides a comprehensive online course in the interdisciplinary skills which are increasingly important to cutting edge biological research.
Other: There are two concentration areas, one in biomolecular engineeri B.S. Biomolecular Engineering and Bioinformatics University of California, Santa Cruz The program has two concentrations: one on biomolecular engineering, which emphasizes wet-lab work with end-user bioinformatics and programming and one on bioinformatics, which emphasizes computational work. Both concentrations require both biochemistry and programming. For the bioinformatics concentration, we emphasize building new tools, more than using existing tools. We use stochastic modeling and machine learning extensively. The final year of the bioinformatics concentration is almost the same as the first year of our graduate program. We have strengths in comparative genomics, RNA genes, archaeal genomics, nanopore sequencing, ancient DNA, stem cells, protein engineering, gene-finding, and several other topics. To give students a very broad background to prepare them for graduate education in bioinformatics or biomolecular engineering. Students are also well prepared for jobs in the biotech industry.
Bioinformatics Python Skills for Handling Biological Data Uganda Virus Research Institute Introduction to concepts of Python programming. Biological data analysis code challenges. Equipping scientists, students and interns in Bioinformatics and Computational Biology disciplines with Python programming skills for handling biological data
Bioinformatics Statistical Analysis in Bioinformatics University System of Maryland This course is taught online via the edX platform. In this course, part of the Bioinformatics MicroMasters program, you will learn about the R language and environment and how to use it to perform statistical analyses on biological big datasets. Basic R Programming. Applying packages in the R environment to determine changes in gene expression. Applying packages in the R environment to locate genes in a full genomic sequence.
Bioinformatics Proteins: Alignment, Analysis and Structure University System of Maryland This course is delivered online via the edX platform. In this course, part of the Bioinformatics MicroMasters program, you will learn about protein structure and its impact on function, practice aligning protein sequences to discover differences, and generate model structures of proteins using web and software-based approaches. Analyze biological big data How to align protein sequences to discover differences and determine structure Generate model structures of unknown proteins
Bioinformatics DNA Sequences: Alignments and Analysis University System of Maryland This course is delivered on-line via the edX platform. You will learn about the theory and algorithms behind DNA alignments, practice doing alignments manually, and then perform more complicated alignments using web and software based approaches. Synthesize and analyze biological big data. Theory behind alignment algorithms and how they operate Examine the roles mutations play on cellular processes
Math/Statistics Data Analysis for the Life Sciences Series HarvardX An introduction to basic statistical concepts and R programming skills necessary for analyzing data in the life sciences. We will learn the basics of statistical inference in order to understand and compute p-values and confidence intervals. We will provide examples by programming in R in a way that will help make the connection between concepts and implementation. Problems sets requiring R programming will be used to test understanding and ability to implement basic data analyses. We will use visualization techniques to explore new data sets and determine the most appropriate approach. We will describe robust statistical techniques as alternative when data do not fit assumptions required by the standard approaches. We will also introduce the basics of using R scripts to conduct reproducible research. This class was supported in part by NIH grant R25GM114818. Topics: Distributions Exploratory Data Analysis Inference Non-parametric statistics These courses make up 2 XSeries and are self-paced: PH525.1x: Statistics and R for the Life Sciences PH525.2x: Introduction to Linear Models and Matrix Algebra PH525.3x: Statistical Inference and Modeling for High-throughput Experiments PH525.4x: High-Dimensional Data Analysis PH525.5x: Introduction to Bioconductor: annotation and analysis of genomes and genomic assays PH525.6x: High-performance computing for reproducible genomics PH525.7x: Case studies in functional genomics
Bioinformatics ArrayGen Technologies Online/offline Courses ArrayGen Technologies ArrayGen offers the following genomics and bioinformatics training courses with a focus on improving participants' practical applications, by using the appropriate theoretical knowledge: Bioinformatics ( Understanding Genomics ) Microarray Data analysis Next Generation Sequencing (NGS) De novo genome and transcriptome assembly Chip-Seq Data Analysis RNA-Seq Data Analysis miRNA Data Analysis Metagenomic Data Analysis MethylSeq(DNA Methylation) Data Analysis Genome Variant Data Analysis Objectives To create an awareness with respect to the basic tools and techniques used in bioinformatics at industrial level To provide complete hands-on-training in the basic tools and techniques To inspire and motivate all life sciences to apply these techniques in their research programmes
Bioinformatics Biology Meets Programming: Bioinformatics for Beginners University of California, San Diego Are you interested in learning how to program (in Python) within a scientific setting? This course will cover algorithms for solving various biological problems along with a handful of programming challenges helping you implement these algorithms in Python. It offers a gentler-paced alternative to the first course in our Bioinformatics Specialization (https://www.coursera.org/specializations/bioinformatics). Each of the four weeks in the course will consist of two required components. First, an interactive textbook provides Python programming challenges that arise from real biological problems. If you haven't programmed in Python before, not to worry! We provide "Just-in-Time" exercises from the Codecademy Python track (https://www.codecademy.com/learn/python). And each page in our interactive textbook has its own discussion forum, where you can interact with other learners. Second, each week will culminate in a summary quiz. Lecture videos are also provided that accompany the material.
Bioinformatics Bioinformatics Specialization on Coursera University of California San Diego How do we sequence and compare genomes? How do we identify the genetic basis for disease? When you complete this Specialization, you will learn how to answer many questions such as these in modern biology. In the process, you will learn about the algorithms and software tools that thousands of biologists apply at work every day in one of the fastest growing fields in science. The Bioinformatics Specialization's printed companion, Bioinformatics Algorithms: An Active Learning Approach, is available from the textbook website (http://bioinformaticsalgorithms.com), which contains additional educational materials, including lecture videos and slides.
Bioinformatics Algorithms for DNA Sequencing Johns Hopkins University This course is delivered online on the Coursera MOOC platform. It consists of: (a) about 1 hour of lectures per week by Prof. Langmead, (b) about 1 hour of practical lectures per week by Prof. Langmead and Jacob Pritt, (c) one multiple choice homework assignment per module, (d) one programming-based homework assignment per module and (e) some optional lectures covering a broader selection of research ideas. DNA sequencing is now a ubiquitous tool in life science. You can observe this trend just by reading the news. This course examines the computational problems that come with this onslaught of DNA sequencing data. How do we take a huge collection of DNA "puzzle pieces" and assemble them into a genome? How do we make it quick and easy to find a DNA "needle" in an enormous genomic "haystack"? We will spend the bulk of the course understanding the algorithms and data structures that underlie software tools for analyzing sequencing data. The course is also an opportunity to practice programming skills and gain exposure to basic algorithms and data structures.
Math/Statistics Networks and Systems East Tennessee State University The course is intended for those with degrees in math, biology, computer science, statistics and other related scientific fields who are interested in modeling biological complexity. It is part of a 5 course certificate that is offered completely online. Topics include complex networks, centrality and global measures, random models and applications in Systems Biology. The first half of the course will cover the mathematical formulation of networks while the second half is dedicated to applications including the study of molecular biology networks.
Bioinformatics Advanced sequence analysis The University of Manchester This advanced bioinformatics course is suitable for those with a first degree in either a biological science or in computer science. It covers the most recent methods for biological sequence analysis. It could be taken as an individual short course, for professional development, and could be combined with one of more units from our theme in Computational Systems Biology :- Bioinformatics for Systems Biology Mathematics for metabolic modelling Computational Systems Biology Bioinformatics for transcriptomics. A student successfully completing four units can graduate with a Postgraduate Certificate. Those people who wish to complete the Masters degree will be required to successfully complete six modules and a research project. The course provides an introduction to the data and methods for projects requiring Next Generation Sequence data analysis. It will cover : genes, genomes and genome sequencing technologies for high throughput sequencing understanding the data mapping to a genome RNA-seq : quantification and differential expression ChIP-seq. For practical work students will have accounts on our Galaxy server.
Math/Statistics Mathematics for metabolic modelling University of Manchester The course is delivered using a virtual learning environment called Moodle. This allows you to navigate and search through course notes, protocols, practicals and references to useful texts and URLs. The course notes provide background information, as web pages. Teaching and learning are then focussed around tutor-supported individual and group exercises. This course aims to ensure that the successful student has an understanding of the core mathematical concepts and techniques used in mathematical modelling of biological systems is able to express in mathematical terms simple representations of a biological system, manipulate and develop simplifying approximations of those representations in order to gain insight into the behaviour of the mathematical model and hence the real biological system has a basic understanding of how parameters within a mathematical model are inferred from or fitted to experimental data, and the basic issues and pitfalls of model fitting. It is designed to prepare participants for our core modelling course in ’Computational simulation and analysis of biochemical networks’.
Computational Biology Bioinformatics for Systems Biology University of Manchester The course is delivered using a virtual learning environment called Moodle. This allows you to navigate and search through course notes, protocols, practicals and references to useful texts and URLs. The course notes provide background information, as web pages. Teaching and learning are then focussed around tutor-supported individual and group exercises. In this course, participants discuss a tutorial problem for each section of the course, and then submit solutions for feedback from the course tutor This course is designed as an introduction to modelling for Systems Biology. It covers the range of different types of data now available for model building. The sections are : * the use of models in biology * public pathway and interaction databases * reconstruction of biological networks from experimental data * network statistics * the analysis and interpretation of experimental data in the context of biological networks * advanced topics (optional). The course could be taken as an individual short course, for professional development, or the credits could count towards one of our Masters programmes.
Bioinformatics Bioinformatics for transcriptomics The University of Manchester The course is delivered using a virtual learning environment called Moodle. This allows you to navigate and search through course notes, protocols, practicals and references to useful texts and URLs. The course notes provide background information, as web pages. Teaching and learning are then focussed around tutor-supported individual and group exercises. In this course, participants discuss a tutorial problem for each section of the course, and then submit solutions for feedback from the course tutor. The new methods for transcriptomics are bringing new challenges in bioinformatics. This course covers microarray data analysis in depth, and also introduces the areas where new work is needed for next generation sequence (RNA-seq) analysis. The sections are : * Microarrays and experimental design * Data capture and preliminary checks * Microarray data analysis * Other methods for transcriptome data capture * Gene Class Tests The course could be taken as an individual short course, for professional development, or the credits could count towards one of our Masters programmes.

International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176


‘Linked’ tools and resources

A general challenge when using comparative approaches to study BGCs is the varying quality of annotation in public sequence databases. Some BGCs that have been extensively studied experimentally are well annotated, whereas others (mostly identified in high-throughput sequencing efforts) were only annotated using standard genome annotation pipelines that do not provide specific annotations of secondary metabolite BGCs. Therefore, a community effort has been established to define a ‘MIBiG’ standard [17] and provide a standardized repository for BGCs that have been experimentally connected to their biosynthetic products. The MIBiG repository currently (as of April 2017) contains 1396 entries of BGCs that are validated to code for a specific biosynthetic pathway. Within this set, 396 of the entries contain comprehensive manually curated annotations of the specific features of the gene clusters, which were provided by the specialists that studied these respective BGCs. This collection now serves as a reference data set for a wide variety of applications and the validation of novel computational tools.

In addition to analyses integrated into antiSMASH, the annotation generated by antiSMASH can also be useful as a starting point for further downstream analyses. Therefore, antiSMASH 4 provides an application programming interface that allows third-party software to access antiSMASH annotation for further processing. Examples of such tools are the ‘Antibiotic Resistant Target Seeker’ (ARTS) [8], which predicts potential targets of antibiotics and uses the annotation provided by antiSMASH to mine for BGCs, and CRISpy-web [11], a Web tool that allows user-friendly design of single guide RNAs (sgRNAs) for CRISPR applications on nonmodel organisms.

antiSMASH is a comprehensive genome mining platform, but only provides information on individually submitted genomes and does not offer any integrated search functionality. Therefore, in 2016, the antiSMASH platform was extended with a database containing precomputed antiSMASH annotation on >3900 finished high-quality bacterial genome sequences [7]. Using the Web interface, it is possible to browse secondary metabolite clusters by BGC type or taxonomy of the producer organism. Additionally, custom queries can be constructed using an interactive query builder. This makes it possible to answer research questions such as ‘which clusters of type NRPS contain A domains that select for the nonproteinogenic amino acid 3,5-dihydroxy-phenylglycine?’ or ‘what BGCs of type RiPP exist in the genus Streptomyces that are not lanthipeptides?’. The results are displayed in the same antiSMASH Web format. They can also be exported in various file formats that allow further processing in other bioinformatics tools.



Primary information of p53 gene

p53 was identified in 1979 by Arnold Levine, David Lane and Lloyd Old, working at Princeton University, Dundee University (UK) and Sloan-Kettering Memorial Hospital, respectively. It had been hypothesized to exist before as the target of the SV40 virus, a strain that induced development of tumors. Although it was initially presumed to be an oncogene, its character as a tumor suppressor gene was revealed in 1989. In 1993, the p53 protein was voted Molecule of the Year by Science magazine.

The p53 protein is a phosphoprotein made of 393 amino acids. It consists of four units (or domains):

A domain that activates transcription factors.

A domain that recognizes specific DNA sequences (core domain).

A domain that is responsible for the tetramerization of the protein.

A domain that recognizes damaged DNA, such as misaligned base pairs or single-stranded DNA.


Wild-type p53 is a labile protein, comprising folded and unstructured regions which function in a synergistic manner (Bell et al. 2002).

It plays an important role in cell cycle control and apoptosis. Defective p53 could allow abnormal cells to proliferate, resulting in cancer. As many as 50% of all human tumors contain p53 mutants.

In normal cells, the p53 protein level is low. DNA damage and other stress signals may trigger the increase of p53 proteins, which have three major functions: growth arrest, DNA repair and apoptosis (cell death). The growth arrest stops the progression of cell cycle, preventing replication of damaged DNA. During the growth arrest, p53 may activate the transcription of proteins involved in DNA repair. Apoptosis is the "last resort" to avoid proliferation of cells containing abnormal DNA.

The cellular concentration of p53 must be tightly regulated. While it can suppress tumors, high level of p53 may accelerate the aging process by excessive apoptosis. The major regulator of p53 is Mdm2, which can trigger the degradation of p53 by the ubiquitin system.

p53 is a transcriptional activator, regulating the expression of Mdm2 (for its own regulation) and the genes involved in growth arrest, DNA repair and apoptosis. Some important examples are listed below.

    1. Growth arrest: p21, Gadd45, and 14-3-3σ.
    2. DNA repair: p53R2.
    3. Apoptosis: Bax, Apaf-1, PUMA and Noxa.

    As mentioned above, p53 is mainly regulated by Mdm2. The regulation mechanism is illustrated in the following figure.

    Figure 1.0. Regulation of p53.

    (a) Expression of Mdm2 is activated by p53.

    (b) Binding of p53 by Mdm2 can trigger the degradation of p53 via the ubiquitin system.

    (c) Phosphorylation of p53 at Ser15, Thr18 or Ser20 will disrupt its binding with Mdm2. In normal cells, these three residues are not phosphorylated, and p53 is maintained at low level by Mdm2.

    The roles of p53 in growth arrest and apoptosis are illustrated in Figure 2.0. p53 is also directly involved in DNA repair. One of its transcriptional target genes, p53R2, encodes ribonucleotide reductase, which is important for both DNA replication and repair. p53 also interacts directly with AP endonuclease and DNA polymerase, which are involved in base excision repair.

    Figure 2.0 . The roles of p53 in growth arrest and apoptosis.

    (a) The cell cycle progression into the S phase requires the enzyme Cdk2, which can be inhibited by p21. The progression into the M phase requires Cdc2, which can be inhibited by p21, GADD45 or 14-3-3σ. p53 regulates the expression of these inhibitory proteins to induce growth arrest.

    (b) Apoptosis can be induced by the binding of Caspase 9 to cytochrome c and Apaf1. p53 may activate the expression of Apaf1 and Bax. The latter can then stimulate the release of cytochrome c from mitochondria (see Mitochondria, Apoptosis and Aging).

    If the p53 gene is damaged, tumor suppression is severely reduced. People who inherit only one functional copy of p53 will most likely develop tumors in early adulthood, a disease known as Li-Fraumeni syndrome. p53 can also be damaged in cells by mutagens (chemicals, radiation or viruses), increasing the likelihood that the cell will begin uncontrolled division. More than 50 percent of human tumors contain a mutation or deletion of the p53 gene.

    In healthy cells, p53 is continually produced and degraded. The degradation of p53 is, as mentioned, associated with Mdm2 binding. In a negative feedback loop, Mdm2 is itself induced by p53. However, mutant p53 proteins often do not induce Mdm2 and are thus able to accumulate at very high concentrations. Worse, mutant p53 protein can itself inhibit normal p53 (Blagosklonny, 2002).

    POTENTIAL THERAPEUTIC USE

    In vitro introduction of p53 into p53-deficient cells has been shown to cause rapid death of cancer cells or prevention of further division. It is on these acute effects that therapeutic hopes rest (McCormick, 2001). The rationale for developing therapeutics targeting p53 is that "the most effective way of destroying a network is to attack its most connected nodes". p53 is extremely well connected (in network terminology it is a hub), and knocking it out cripples the normal functioning of the cell. This is reflected in the fact that 50% of cancers have missense point mutations in the p53 gene; these mutations impair its ability to induce anti-cancer genes. Restoring its function would be a major step in curing many cancers (Vogelstein et al. 2000).

    Various strategies have been proposed to restore p53 function in cancer cells (Blagosklonny, 2002). A number of groups have found molecules which appear to restore proper tumour suppressor activity of p53 in vitro. These work by altering the conformation of mutant p53 back to an active form. So far, no molecules have been shown to induce biological responses, but some may be lead compounds for more biologically active agents. A promising target for anti-cancer drugs is the molecular chaperone Hsp90, which interacts with p53 in vivo.

    Adenoviruses rely on their host cells to replicate; they do this by producing proteins which compel the host to replicate the viral DNA. Adenoviruses have been implicated in cancer-causing diseases, but in a twist it is now modified viruses which are being used in cancer therapy. ONYX-015 (dl1520, CI-1042) is a modified adenovirus which selectively replicates in p53-deficient cancer cells but not normal cells (Bischoff, 1996). It is modified from a virus that expresses the early region protein, E1B, which binds to and inactivates p53. p53 suppression is necessary for the virus to replicate. In the modified version of the virus, E1B has been deleted. It was hoped that the virus would select tumour cells, replicate and spread to surrounding malignant tissue, thus increasing distribution and efficacy. The cells in which the adenovirus replicates are lysed, and so the tumour dies.

    Preclinical trials using the ONYX-015 virus on mice were promising; however, clinical trials have been less so. No objective responses have been seen except when the virus was used in combination with chemotherapy (McCormick, 2001). This may be because E1B has been found to have other functions vital to the virus. Additionally, its specificity has been undermined by findings showing that the virus is able to replicate in some cells with wild-type p53. The failure of the virus to produce clinical benefits may in large part be due to extensive fibrotic tissue hindering virus distribution around the tumour (McCormick, 2001).

    Bates S, Phillips AC, Clark PA, Stott F, Peters G, Ludwig RL, Vousden KH. (1998). p14ARF links the tumour suppressors RB and p53. Nature, 395:124-125

    Bell S, Klein C, Muller L, Hansen S, Buchner J. (2002). p53 contains large unstructured regions in its native state. J Mol Biol, 322:917-927

    Bischoff JR, Kirn DH, Williams A, Heise C, Horn S, Muna M, Ng L, Nye JA, Sampson-Johannes A, Fattaey A, McCormick F. (1996). An adenovirus mutant that replicates selectively in p53-deficient human tumor cells. Science, 274:373-376

    Blagosklonny, MV. (2002). P53: An ubiquitous target of anticancer drugs. International Journal of Cancer, 98:161-166

    McCormick F. (2001). Cancer gene therapy: fringe or cutting edge? Nat Rev Cancer, 1:130-141

    Strachan T, Read AP. (1999). Human Molecular Genetics 2. Ch. 18, Cancer Genetics

    Vogelstein B, Lane D, Levine AJ. (2000). Surfing the p53 network. Nature, 408:307-310