Information

Resource for learning analysis of sequence data via QIIME 2


I am trying to study/learn metagenomics through 16s rRNA microbial data and I want to teach myself/review core biology concepts while also learning the QIIME 2 analysis pipeline, as the website suggests learning it instead of QIIME for beginners.

I found a great source : http://www.science.smith.edu/cmbs/wp-content/uploads/sites/36/2015/09/Tutorial-from-sample-to-analyzed-data-using-Qiime-for-analysis.pdf but was hoping for something with an analysis I could actually work through on my own, as this one involves the use of a cluster, where as I only have access to a laptop of moderate power and also one that involves QIIME 2.

Also something free and maybe as hand-holding as possible would be ideal too.

As for the platform I am learning on, it is a virtual box instance of Ubuntu.


The QIIME 2 tutorials are the place to get started - see here.


Resource for learning analysis of sequence data via QIIME 2 - Biology

qiime2 (the QIIME 2 framework)

Source code repository for the QIIME 2 framework.

QIIME 2™ is a powerful, extensible, and decentralized microbiome bioinformatics platform that is free, open source, and community developed. With a focus on data and analysis transparency, QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results.

Visit https://qiime2.org to learn more about the QIIME 2 project.

Detailed instructions are available in the documentation.

Head to the user docs for help getting started, core concepts, tutorials, and other resources.

Just have a question? Please ask it in our forum.

Please visit the contributing page for more information on contributions, documentation links, and more.

If you use QIIME 2 for any published research, please include the following citation:


Background

High-throughput sequencing technologies have transformed our ability to explore complex microbial communities, offering insight into microbial impacts on human health [1] and global ecosystems [2]. This is achieved most commonly by sequencing short, conserved marker genes amplified with ‘universal’ PCR primers, such as 16S rRNA genes for bacteria and archaea, or internal transcribed spacer (ITS) regions for fungi. Targeted marker-gene primers can also be used to profile specific taxa or functional groups, such as nifH genes [3]. These sequences often are compared against an annotated reference sequence database to determine the likely taxonomic origin of each sequence with as much specificity as possible. Accurate and specific taxonomic information is a crucial component of many experimental designs.

Challenges in this process include the short length of typical sequencing reads with current technology, sequencing and PCR errors [4], selection of appropriate marker genes that contain sufficient heterogeneity to differentiate target species but that are homogeneous enough in some regions to design broad-spectrum primers, quality of reference sequence annotations [5], and selection of a method that accurately predicts the taxonomic affiliation of millions of sequences at low computational cost. Numerous methods have been developed for taxonomy classification of DNA sequences, but few have been directly compared in the specific case of short marker-gene sequences.

We introduce q2-feature-classifier, a QIIME 2 (https://qiime2.org) plugin for taxonomy classification of marker-gene sequences. QIIME 2 is the successor to the QIIME [6] microbiome analysis package. The q2-feature-classifier plugin supports use of any of the numerous machine-learning classifiers available in scikit-learn [7, 8] for marker gene taxonomy classification, and currently provides two alignment-based taxonomy consensus classifiers based on BLAST+ [9] and VSEARCH [10]. We evaluate the latter two methods and the scikit-learn multinomial naive Bayes classifier (labeled “naive Bayes” in the “Results” section) for the first time. We show that the QIIME 2 classifiers provided in q2-feature-classifier match or outperform the classification accuracy of the widely used QIIME 1 methods for sequence classification, and that performance of the naive Bayes classifier can be significantly increased by providing it with information regarding expected taxonomic composition. Some of the taxonomy classification methods in QIIME 1 (RDP classifier [11] and BLAST [9]) are thin wrappers around the original software other methods based on uclust [12] SortMeRNA [13](QIIME 1), VSEARCH, and BLAST+ (QIIME 2) are also wrapped implementations of other software followed by consensus taxonomic assignment by QIIME software. Thus, while our analyses focus on methods currently implemented in these versions of QIIME, we expect that the results will generalize to similar applications of those tools outside of QIIME.

We also developed tax-credit (https://github.com/caporaso-lab/tax-credit-code/ and https://github.com/caporaso-lab/tax-credit-data/), an extensible computational framework for evaluating taxonomy classification accuracy. This framework streamlines the process of methods benchmarking by compiling multiple different test data sets, including mock communities [14] and simulated sequence reads. It additionally stores pre-computed results from previously evaluated methods, including the results presented here, and provides a framework for parameter sweeps and method optimization. Tax-credit could be used as an evaluation framework by other research groups in the future or its raw data could be easily extracted for integration in another evaluation framework.


Introduction

Sequencing of the 16S rRNA gene is the most popular method for characterizing the complexity of microbial communities. The latest advances in high-throughput sequencing techniques and next-generation (NGS) sequencing platforms allow users to assign taxonomy considering the hypervariable regions contained in the 16S rRNA gene. Studies suggest that 16S rRNA gene sequencing provides genus identification in most cases but less so with regard to species (Janda and Abbott, 2007). It has been pointed out that a key limitation of using 16S rRNA gene sequencing is that related bacterial species may be indistinguishable due to the homology between sequences (Plummer et al., 2015). Sequencing of hypervariable sub-regions of this gene produces single short-reads of 100–500 nucleotides, when the gene contains 1542 base pairs, missing variations out of sequencing area (Zhang et al., 2011). Although some sub-regions provide a good approximation of 16S diversity, most do not distinguish polymorphisms between closely related taxa, which could be avoided by the complete sequencing of the gene (Johnson et al., 2019). Also, the presence of multiple intragenomic copies of this gene (1 up to 5 or more), which could carry subtle nucleotide substitutions, may overestimate microbial abundance (Johnson et al., 2019 Větrovský and Baldrian, 2013).

Another source of bias when selecting sequencing platforms and the subsequent data processing is the software of choice and the analytical protocol (pipeline). It has been previously reported that the employment of different platforms can yield differences in sample characterization depending on their accuracy, sensitivity and error rate (Claesson et al., 2017). Up to date, no standard evaluations studies exist to select the best platform in terms of accuracy or sensitivity when evaluating complex communities. Escobar-Zepeda et al., have recently publish a benchmark tool that looks into error rates and sensitivity (Escobar-Zepeda et al., 2018). Loman and co-workers showed that Illumina MiSeq yielded the lowest error rate when compared with both Ion Torrent Personal Genome Machine (PGM) and the Roche 454 platform (Loman et al., 2012). Nowadays, the Roche 454 has been discontinued whereas PGM and Illumina MiSeq bench-top versions are the platforms of election for researchers starting microbiota characterization projects, and the prevalence of Illumina over PGM has significantly increased during the last years (Escobar-Zepeda et al., 2015). There have been some improvements on the PGM platform with the availability of new sequencing kits and chips that decrease the lower indel error rates, but the Illumina MiSeq platform seems to be better avoiding indel errors (Jünemann et al., 2013). The Illumina MiSeq technology uses an approach of sequencing by synthesis, utilizing fluorescently labeled reversible-terminator nucleotides and, depending on the model, it takes several days to obtain the data (Pylro et al., 2014). It is probably the most popular sequencing platform, in view of its competitive cost, lower than in the case of the PGM machine (Escobar-Zepeda et al., 2015). For Illumina MiSeq, the hypervariable regions V3-V4 are the most used for microbiota characterization analysis. The amplicon is generated by paired-ends reads covering around 460 nucleotides and allowing a certain degree of overlap in the readings at its 3′end, the area of greatest loss of quality (Barb et al., 2016 Claesson et al., 2010 Klindworth et al., 2013 Quail et al., 2012).

Considering the sequencing protocols, the PGM platform seems to be the best for characterization of hypervariable regions (V1-V9), due to the sequencing of all the regions at the same time by the most recent versions of their chips and larger single-read sequences (Barb et al., 2016). Due to this coverage, Quail et al. found that the performance of the PGM machine was better, but still at the expense of a higher false positive rate (Quail et al., 2012).

Analytical pipelines applied to data processing can also introduce bias factors, as not all available pipelines include the same steps (Allali et al., 2017). For a non-expert bioinformatician, to start a microbial community analysis project requires some initial input into the available algorithms, software and pipelines. There is limited literature comparing the advantages and disadvantages of available software packages. The choice depends on the level of taxonomic depth and complexity to be achieved in a study. Nilakanta et al., concluded that QIIME and Mothur where the most comprehensive pipelines and both include features and support documentation (Nilakanta et al., 2014). Dɺrgenio et al. did not find statistically significant differences in the taxonomic assignment and alpha diversity measures when using MG-RAST and QIIME, but stated that QIIME produced more accurate assignments (Dɺrgenio et al., 2014). QIIME1 protocols were initially established for 454 pyrosequencing raw data and later adapted to Illumina technology (Caporaso et al., 2012). Later, Bokulich et al., 2013 presented new guidelines trying to improve default parameters for quality filtering of Illumina reads (Bokulich et al., 2013). Now QIIME is under the transition to QIIME2, where important changes have been made in terms of overrepresentation of OTUs, characterization (now “features”) and other quality related factors (Bolyen et al., 2018).

VSEARCH is an open source alternative to the USEARCH tool and can be implemented in the QIIME package. It includes chimera detection, paired-end reads merging and dereplication (Rognes et al., 2016) and was designed as an alternative to the widely used USEARCH tool (Edgar, 2010) for which the source code is not free.

SPINGO is a novel method to annotate sequences at genus or species level (Allard et al., 2015). Considering the results reported by Allard et al., and more recently by Escobar-Zepeda the use of SPINGO based on the RDP database seems a good method to report high accuracy at the species level (Bokulich et al., 2013 Escobar-Zepeda et al., 2018). Escobar-Zepeda considered also the use of different databases concluding small, but more curated databases, like RDP improved and their results in terms of sensitivity, specificity and accuracy (Escobar-Zepeda et al., 2018).

After all those studies, there is not yet a well-established protocol to decide which platform adapts better to specific research topics and there are not defined protocols considering all the sources of constraints detailed above, especially for novices in the field. The main objective of this study was to analyze a known bacterial community (mock) in a pilot study using two common benchtop sequencing platforms that are currently on the market and using four different analysis pipelines: QIIME1, QIIME1MOD, VSEARCH and QIIME2. The use of a standardized mock allows to measure directly the sequencing and analytical methods without considering the bias introduced by sample preparation.

Also, to increase taxonomic resolution at species level of the studied mock community, the species classifier SPINGO was used.


Group membership

Professor Gavin Huttley is a group leader in the Research School of Biology at the Australian National University. He obtained a B.Sc (Hons I) from Macquarie University in Sydney, a PhD in Molecular Population Genetics from the University of California, Riverside and undertook postdoctoral research in the Laboratory of Genomic Diversity, National Cancer Institute (USA). He is a recipient of the Howard Florey Young Investigator award. Professor Huttleys research is focussed on genome decryption — identifying regions of the genome which encode functions that influence susceptibility to disease — through analysis of genetic variation.

Projects

  • Principal investigator, Bioinformatic / Computational methods
  • Principal investigator, Metagenomics and the analysis of microbial genomes
  • Principal investigator, Molecular evolution and Phylogenetics
  • Principal investigator, Statistical methods of sequence divergence
  • Su, X, Jing, G, Sun, Z et al 2020, 'Multiple-Disease Detection and Classification Across Cohorts via Microbiome Search', mSystems, vol. 5, no. 2, pp. e001500-20.
  • Zhu, Y, Ong, C & Huttley, G 2020, 'Machine Learning Techniques for Classifying the Mutagenic Origins of Point Mutations', Genetics (online), vol. 215, pp. 25-40.
  • Simon, H & Huttley, G 2020, 'Quantifying influences on intragenomic mutation rate', G3: Genes, Genomes, Genetics, vol. 10, no. 8, pp. 2641-2652.
  • Bolyen, E, Rideout, J, Dillon, M et al 2019, 'Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2', Nature Biotechnology, vol. 37, no. 8, pp. 848-857.
  • McDonald, D, Kaehler, B, Gonzalez, A et al 2019, 'redbiom: a Rapid Sample Discovery and Feature Characterization System', mSystems, vol. 4, no. 4, pp. e00215-e00219.
  • Kaehler, B, Bokulich, N, McDonald, D et al 2019, 'Species abundance information improves sequence taxonomy classification accuracy', Nature Communications, vol. 10, no. 4643, pp. 1-10.
  • Bokulich, N, Kaehler, B, Rideout, J et al 2018, 'Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2's q2-feature-classifier plugin', Microbiome, vol. 6, no. 90, pp. 17pp.
  • Sutherland, T, Sriskantha, A, Rapson, T et al 2018, 'Did aculeate silk evolve as an antifouling material?', PLOS ONE (Public Library of Science), vol. 13, no. 9, pp. 1-13pp.
  • Ying, H, Cooke, I, Sprungala, S et al 2018, 'Comparative genomics reveals the distinct evolutionary trajectories of the robust and complex coral lineages', Genome Biology, vol. 19, no. 175, pp. 1-24pp.
  • Wong, M, Arcos-Burgos PhD, M, Liu, S et al 2017, 'The PHF21B gene is associated with major depression and modulates the stress response', Molecular Psychiatry, vol. 22, no. 7, pp. 1015-1025.
  • Kaehler, B, Yap, V & Huttley, G 2017, 'Standard codon substitution models overestimate purifying selection for nonstationary data', Genome Biology and Evolution, vol. 9, no. 1, pp. 134-149.
  • Huttley, G, Neeman, T, Yap, V et al 2017, 'Statistical Methods for Identifying Sequence Motifs Affecting Point Mutations', Genetics (online), vol. 205, no. 2, pp. 843-856 pp.
  • Padovan, A, Patel, H, Chuah, A et al 2015, 'Transcriptome Sequencing of Two Phenotypic Mosaic Eucalyptus Trees Reveals Large Scale Transcriptome Re-Modelling', PLOS ONE (Public Library of Science), vol. 10, no. 5, pp. e0123226-e0123226.
  • Kaehler, B, Yap, V, Zhang, R et al 2015, 'Genetic distance for a general non-stationary markov substitution process', Systematic Biology, vol. 64, no. 2, pp. 281-293.
  • Johar, A, Mastronardi, C, Rojas-Villarraga, A et al 2015, 'Novel and rare functional genomic variants in multiple autoimmune syndrome and Sjögren's syndrome', Journal of Translational Medicine, vol. 13, no. 173, pp. 173-173.
  • Maitip, J, Trueman, H, Kaehler, B et al 2015, 'Folding behavior of four silks of giant honey bee reflects the evolutionary conservation of aculeate silk proteins', Insect Biochemistry and Molecular Biology, vol. 59, pp. 72-79.
  • Perez-Bercoff, A, Papanicolaou, A, Ramsperger, M et al 2015, 'DRAFT GENOME OF AUSTRALIAN ENVIRONMENTAL STRAIN WM 09 24 OF THE OPPORTUNISTIC HUMAN PATHOGEN SCEDOSPORIUM AURANTIACUM', Genome Announcements, vol. 3, no. 1, pp. e01526-14.
  • Kitahara, M, Lin, M, Foret, S et al 2014, 'The "Naked Coral'' Hypothesis Revisited - Evidence for and Against Scleractinian Monophyly', PLOS ONE (Public Library of Science), vol. 9, no. 4, pp. -.
  • Engelthaler, D, Hicks, N, Gillece, J et al 2014, 'Cryptococcus gattii in North American Pacific Northwest: Whole-Population Genome Analysis Provides Insights into Species Evolution and Dispersal', mBio, vol. 5, no. 4.
  • Da Paz Filho, G, Boguszewski, M, Mastronardi, C et al 2014, 'Whole Exome Sequencing of Extreme Morbid Obesity Patients: Translational Implications for Obesity and Related Disorders', GENES, vol. 5, no. 3, pp. 709-725.
  • Ye, Y, Woolfit, M, Huttley, G et al 2013, 'Infection with a Virulent Strain of Wolbachia Disrupts Genome Wide-Patterns of Cytosine Methylation in the Mosquito Aedes aegypti', PLOS ONE (Public Library of Science), vol. 8, no. 6.
  • Verbyla, K, Yap, V, Pahwa, A et al 2013, 'The Embedding Problem for Markov Models of Nucleotide Substitution', PLOS ONE (Public Library of Science), vol. 8, no. 7.
  • Casewell, N, Huttley, G & Wuster, W 2012, 'Dynamic evolution of venom proteins in squamate reptiles', Nature Communications, vol. 3, pp. 1066-1066.
  • Huttley, G & Yap, V 2012, 'Robust estimation of natural selection using parametric codon models', in Gina M. Cannarozzi and Adrian Schneider (ed.), Codon Evolution Mechanisms and Models, Oxford University Press, UK, pp. 111-125.
  • Bellani, M, Epps, J & Huttley, G 2012, 'A Comparison of Periodicity Profile Methods for Sequence Analysis', IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS 2012), IEEE Engineering in Medicine and Biology Society, Washington DC USA, p. 4.
  • Nekrasov, M, Amrichova, J, Parker, B et al., 'Histone H2A.Z inheritance during the cell cycle and its impact on promoter organization and dynamics', Nature Structural and Molecular Biology. 2012, vol. 19, no. 11, pp. 1076-1083.
  • Soboleva TA, Nekrasov M, Pahwa A, Williams R, Huttley GA, Tremethick DJ. 'A unique H2A histone variant occupies the transcriptional start site of active genes', Nature Structural and Molecular Biology. 2012 Jan 19(1): 25-30.
  • Epps, J, Ying, H & Huttley, G 2011, 'Statistical methods for detecting periodic fragments in DNA sequence data', Biology Direct, vol. 6, no. 21, p. 16.
  • Ying, H & Huttley, G 2011, 'Exploiting CpG Hypermutability to Identify Phenotypically Significant Variation Within Human Protein-Coding Genes', Genome Biology and Evolution, vol. 3, no. 1, pp. 938-949.
  • Adcock, G, Easteal, S, Huttley, G et al 2001, 'Human origins and ancient human DNA [3]', Science, vol. 292, no. 5522, pp. 1655-1656.
  • Smith, A, Box, N, Marks, L et al 2001, 'The human melanocortin-1 receptor locus: Analysis of transcription unit, locus polymorphism and haplotype evolution', Gene, vol. 281, no. 1-2, pp. 81-94.
  • Huttley, G, Easteal, S, Southey, M et al 2000, 'Adaptive evolution of the tumour suppressor BRCA1 in humans and chimpanzees', Nature Genetics, vol. 25, no. 4, pp. 410-413.
  • Sammut, R & Huttley, G 2011, 'Regional Context in the Alignment of Biological Sequence Pairs', Journal of Molecular Evolution, vol. 72, no. 2, pp. 147-159.
  • Yap, V, Lindsay, H, Easteal, S et al 2010, 'Estimates of the effect of natural selection on protein-coding content', Molecular Biology and Evolution, vol. 27, no. 3, pp. 726-734.
  • Ying, H, Epps, J, Williams, R et al 2010, 'Evidence that localized variation in primate sequence divergence arises from an influence of nucleosome placement on DNA repair', Molecular Biology and Evolution, vol. 27, no. 3, pp. 637-649.
  • Caporaso, J, Kuczynski, J, Stombaugh, J et al 2010, 'QIIME allows analysis of high-throughput community sequencing data', Nature Methods: techniques for life scientists and chemists, vol. 7, no. 5, pp. 335-336.
  • Simpson, N, Gatenby, P, Wilson, A, Malik, S, Fulcher, DA, Tangye, SG, Manku, H,Vyse, TJ, Roncador, G, Huttley, G, Goodnow, CC, Vinuesa, CG, Cook, MC, 2010, 'Expansion of Circulating T Cells Resembling Follicular Helper T Cells Is a Fixed Phenotype That Identifies a Subset of Severe Systemic Lupus Erythematosus', Arthritis & Rheumatism, vol. 62, no. 1, pp. 234-244.
  • Huttley, G 2009, 'Do genomic datasets resolve the correct relationship among the placental, marsupial and monotreme lineages?', Australian Journal of Zoology, vol. 57, pp. 167-174.
  • Caporaso, J, Smit, S, Easton, B et al 2008, 'Detecting coevolution without phylogenetic trees? Tree-ignorant metrics of coevolution perform as well as tree-aware metrics', BMC Evolutionary Biology, vol. 8:327, pp. 1-25.
  • Lindsay, H, Yap, V, Ying, H et al 2008, 'Pitfalls of the most commonly used models of context dependent substitution', Biology Direct, vol. 3, no. 52, pp. 1-17.
  • Oscamou, M, McDonald, D, Yap, V et al 2008, 'Comparison of methods for estimating the Nucleotide Substitution Matrix', BMC Bioinformatics, vol. 9, pp. 1-18.
  • Warren, W, Hillier, L, Graves, J et al 2008, 'Genome analysis of the platypus reveals unique signatures of evolution', Nature, vol. 453, no. 7192, pp. 175-83.
  • Schranz, H, Yap, V, Easteal, S et al 2008, 'Pathological rate matrices: from primates to pathogens', BMC Bioinformatics, vol. 9, no. 1, pp. 1-10.
  • Faux, N, Huttley, G, Mahmood, K et al 2007, 'RCPdb: An evolutionary classification and codon usage database for repeat-containing proteins', Genome Research, vol. 17, no. 7, pp. 1118-27.
  • MacArthur, D, Seto, J, Raftery, J et al 2007, 'Loss of ACTN3 gene function alters mouse muscle metabolism and shows evidence of positive selection in humans', Nature Genetics, vol. 39, no. 10, pp. 1261-5.
  • Mikkelsen, T, Wakefield, M, Aken, B et al 2007, 'Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences', Nature, vol. 447, no. May 10, pp. 167-178.
  • Huttley, G, Wakefield, M & Easteal, S 2007, 'Rates of genome evolution and branching order from whole genome analysis', Molecular Biology and Evolution, vol. 24, no. 8, pp. 1722-30.
  • Knight, R, Maxwell, P, Birmingham, A et al 2007, 'PyCogent: a toolkit for making sense from sequence', Genome Biology, vol. 8, no. 8, pp. R171.
  • Easton, B, Isaev, A, Huttley, G et al 2007, 'A probabilistic method to identify compensatory substitutions for pathogenic mutations', Asia-Pacific Bioinformatics Conference (APBC 2007), ed. David Sankoff, Lusheng Wang, Francis Chin, Imperial College Press, London, UK, pp. 195-204.
  • Wakefield, M, Maxwell, P & Huttley, G 2005, 'Vestige: Maximum likelihood phylogenetic footprinting', BMC Bioinformatics, vol. 6, no. 130, pp. 1-9.
  • Robinson, A, Huttley, G, Booth, H et al 2004, 'Modelling and bioinformatics studies of the human Kappa-class glutathione transferase predict a novel third glutathione transferase family with similarity to prokaryotic 2-hydroxychromene-2-carboxylate isomerases', Biochemical Journal, vol. 379, pp. 541-552.
  • Butterfield, A, Vedagiri, V, Lang, E et al 2004, 'Pyevolve: a toolkit for statistical modelling of molecular evolution', BMC Bioinformatics, vol. 5, no. 1, pp. 1-12.
  • Huttley, G 2004, 'Modeling the Impact of DNA Methylation on the Evolution of BRCA1 in Mammals', Molecular Biology and Evolution, vol. 21, no. 9, pp. 1760-1768.
  • Wilson, S & Huttley, G 2002, 'Non-replicability of disease gene results: A modelling perspective', International Conference on Statistics, Combinatorics and Related Areas / International Conference of the Forum for the Interdisciplinary Mathematics 2001, ed. Gulati C, Lin Y-X, Mishra S, Rayner J, World Scientific Publishing Company, Singapore, pp. 374-382.
  • Prichard, Z, Jorm, A, Prior, M et al 2002, 'Association of polymorphisms of the estrogen receptor gene with anxiety-related traits in children and adolescents: A longitudinal study', American Journal of Medical Genetics, Part A, vol. 114, no. 2, pp. 169-176.
  • Peacock, W, Dennis, E, Easteal, S et al 2001, 'Lake Mungo 3: A response to recent critiques', Archaeology in Oceania, vol. 36, pp. 163-174.
  • Adcock, G, Dennis, E, Easteal, S et al 2001, 'Mitochondrial DNA sequences in ancient Australians: Implications for modern human origins', PNAS - Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 2, pp. 537-542.
  • Cooper, A, Rambaut, A, Macaulay, V et al 2001, 'Human Origins and Ancient Human DNA', Science, vol. 292, no. 2001, pp. 1656-1656.
  • Huttley, G, Jakobsen, I, Wilson, S et al 2000, 'How important is DNA replication for mutagenesis?', Molecular Biology and Evolution, vol. 17, pp. 929 - 937.
  • Huttley, G & Wilson, S 2000, 'Testing for concordant equilibria between population samples', Genetics (Print) ceased Dec 2009 - don't use, vol. 156, pp. 2127 - 2135.
  • Huttley, G, Jakobsen, I & Easteal, S 2000, 'How important is DNA replication for mutagenesis?', Molecular Biology and Evolution, vol. 17, pp. 929-937.
  • Huttley, G, Smith, M, Carrington, M et al 1999, 'A scan for linkage disequilibrium across the human genome', Genetics (Print) ceased Dec 2009 - don't use, vol. 152, no. 4, pp. 1711-1722.

+61 2 6125 5111
The Australian National University, Canberra
CRICOS Provider : 00120C
ABN : 52 234 063 906


INTRODUCTION

This tutorial illustrates the use of QIIME 2 (Bolyen et al., 2019 ) for processing, analyzing, and visualizing microbiome data. Here we use, as an example, a high-throughput 16S rRNA gene sequencing study, starting with raw sequences and producing publication-ready analysis and figures (see Basic Protocol). QIIME 2 can also process other types of microbiome data, including amplicons of other markers such as 18S rRNA, internal transcribed spacers (ITS), and cytochrome oxidase I (COI), shotgun metagenomics, and untargeted metabolomics. We will also show how to combine results from an individual study with data from other studies using the Qiita public database framework (Gonzalez et al., 2018 ), which can be used to confirm relationships between microbiome and phenotype variables in a new cohort, or to generate hypotheses for future testing.

Typical QIIME 2 analyses can vary in many ways, depending on your experimental and data analysis goals and on how you collected the data. In this tutorial, we use the QIIME 2 command-line interface, and focus on processing and analyzing a subset of samples from the Early Childhood Antibiotics and the Microbiome (ECAM) study (Bokulich, Chung, et al., 2016 ). We will start with raw sequence files and use a single analysis pipeline for clarity, but note where alternative methods are possible and why you might want to use them.

Before Starting

We recommend readers to follow the enhanced live version of this protocol (https://curr-protoc-bioinformatics.qiime2.org) which will be updated frequently to always reflect the newest release of QIIME 2. We also recommend that you read about the core concepts of QIIME 2 (https://docs.qiime2.org/2019.10/concepts/) before starting this tutorial, to familiarize yourself with the platform's main features and concepts, including enhanced visualization methods through QIIME 2 View, decentralized provenance tracking (which ensures reproducible bioinformatics), multiple interfaces (including the Python 3 API and QIIME 2 Studio graphical interface), the plugin architecture (which enables anyone to expand QIIME 2's functionality), and semantic types (which enable QIIME 2 to help users avoid misusing their data). In general, we suggest referring to the QIIME 2 website (https://qiime2.org), which will always be the most up-to-date source for information and tutorials on QIIME 2, including newer versions of this tutorial. Questions, suggestions, and general discussion should always be directed to the QIIME 2 Forum (https://forum.qiime2.org). A brief “Glossary of terms” for common QIIME 2 terminology is provided as an Appendix.

: USING QIIME 2 WITH MICROBIOME DATA

Necessary Resources

Hardware

  • QIIME 2 can be installed on almost any computer system (native installation is available on Mac OS and Linux, or on Windows via a virtual machine). The amount of free disk space and memory that you will need varies dramatically depending on the number of samples and sequences you will analyze and the algorithms you will use to do so. At present, QIIME 2 requires a minimum of 6 to 7 GB for installation, and we recommend a minimum of 4 GB of memory as a starting point for small datasets and 8 GB of memory for most other real-world datasets. Other types of analyses, such as those using shotgun metagenomics plugins, may require significantly more memory and disk space.

Software

  • An up-to-date web browser, such as the latest version of Firefox or Chrome, is needed for visualizations using QIIME 2 View

Installing QIIME 2

The latest version of QIIME 2, as well as detailed instructions on how to install on various operating systems, can be found at https://docs.qiime2.org. QIIME 2 utilizes a variety of external independent packages, and while we strive to maintain backward compatibility, occasionally changes or updates to these external packages may create compatibility issues with older versions of QIIME 2. To avoid these problems, we recommend always using the most recent version of QIIME 2 available online. The online tutorial will always provide installation instructions for the most up-to-date, tested, and stable version of QIIME 2.

Troubleshooting

If you encounter any issues with installation, or with any other stage of this tutorial, please get in touch with the QIIME 2 Forum at https://forum.qiime2.org.

The QIIME 2 Forum is the hub of the QIIME 2 user and developer communities. Technical support for users and developers is provided there, free of charge. We try to reply to technical support questions on the forum within 1 to 2 business days (though sometimes we need more time). Getting involved on the QIIME 2 Forum, for example by reading existing posts, answering questions, or sharing resources that you have created such as educational content, is a great way to get involved with QIIME 2. We strive to create an inclusive and welcoming community where we can collaborate to improve microbiome science. We hope you will join us!

(Re)Activating QIIME 2

Using this tutorial

The following protocol was completed using QIIME 2 2019.10 and demonstrates usage with the command line interface (CLI). For users comfortable with Python 3 programming, an application programmer interface (API) version of this protocol is also available at https://github.com/qiime2/paper2/blob/master/notebooks/qiime2-protocol-API.ipynb. No additional software is needed for using the API. Jupyter notebooks for both of these protocols are also available at https://github.com/qiime2/paper2/tree/master/notebooks. Finally, an enhanced interactive live version of the CLI protocol is also available at https://curr-protoc-bioinformatics.qiime2.org with all intermediate files precomputed. While we strongly encourage users to install QIIME 2 on their own computers and follow along with this tutorial. The enhanced live version provides an alternative for when time and computational resources are limited. Following along with the live version of this protocol enables users to skip any step and instead download the pre-processed output required for a subsequent step. Additionally, the live version also provides simple “copy to clipboard” buttons for each code block which, unlike copying from a PDF file, retains the original formatting of the code, making it easy to paste into other environments. The enhanced live protocol will also be updated regularly with every new release of QIIME 2, unlike the published version, which will remain static with the 2019.10 version.

Acquire the data from the ECAM study

  • mkdir qiime2-ecam-tutorial
  • cd qiime2-ecam-tutorial
  • wget -O 81253.zip https://qiita.ucsd.edu/public_artifact_download/?artifact_id=81253
  • unzip 81253.zip
  • mv mapping_files/81253_mapping_file.txt metadata.tsv

Ignore the warning errors during the unzipping step this is expected behavior. You can now delete the original zip file 81253.zip to save space.

Explore sample metadata files

In the previous step, in addition to downloading sequence data, we downloaded a set of researcher-generated sample metadata. In the context of a microbiome study, sample metadata are any data that describe characteristics of the samples that are being studied, the site they were collected from, and/or how they were collected and processed. In this example, the ECAM study metadata include characteristics like age at the time of collection, birth mode and diet of the child, the type of DNA sequencing, and other information. This is all information that is generally compiled at the time of sample collection, and thus is something the researcher should be working on prior to a QIIME 2 analysis. Suggested standards for the type of study metadata to collect, and how to represent the values, are discussed in detail in MIMARKS and MIxS (Yilmaz et al., 2011 ). In this article, we also include a Support Protocol on metadata preparation to help users generate quality metadata. In QIIME 2, metadata are most commonly stored as a TSV (i.e., tab-separated values) file. These files typically have a .tsv or .txt file extension. TSV files are text files used to store data tables, and the format can be read, edited, and written by many types of software, including spreadsheets and databases. Thus, it is usually straightforward to manipulate QIIME 2 sample metadata using the software of your choosing. You can use a spreadsheet program of your choice such as Google Sheets to edit and export your metadata files, but you must be extremely cautious about automatic, and often silent, reformatting of values using these applications. For example, the use of programs like Excel can lead to unwanted reformatting of values, insertion of invisible spaces, or sorting of a table in ways that scramble the connection between sample identifiers and the data. These problems are very common and can lead to incorrect results, including missing statistically significant patterns. See the “Metadata preparation” section in the Support Protocol below for details regarding best practices for creating and maintaining metadata files.

Detailed formatting requirements for QIIME 2 metadata files can be found at https://docs.qiime2.org/2019.10/tutorials/metadata/. Metadata files stored in Google Sheets can be validated using Keemei (Rideout et al., 2016 ), an open-source Google Sheets plugin available at https://keemei.qiime2.org. Once Keemei is installed, select Add-ons > Keemei > Validate QIIME 2 metadata file in Google Sheets to determine whether the metadata file meets the required formatting of QIIME 2.

Open the metadata.tsv file with your software of choice and explore the content. Take note of the column names, as we will be referring to these throughout the protocol. Cual-ID may be useful for creating sample identifiers, and the Cual-ID paper (Chase, Bolyen, Rideout, & Caporaso, 2015 ) provides some recommendations on best practices for creating sample identifiers for data management.

Importing DNA sequence data into QIIME 2 and creating a visual summary

The next step is to import our DNA sequence data (in this case, from the 16S rRNA gene) into QIIME 2. All data used and generated by QIIME 2, with the exception of metadata, exist as QIIME 2 artifacts and use the .qza file extension. Artifacts are zip files containing data (in the usual formats, such as FASTQ) and QIIME 2−specific metadata describing the various characteristics of the data such as their semantic type, data file format, relevant citations for analysis steps that were performed up to this point, and the QIIME 2 steps that were taken to generate the data (i.e., the data provenance). See the Appendix (Glossary) at the end of this article for additional information.

QIIME 2 allows you to import and export data at many different steps, so that you can export them to other software or try out alternative methods for particular steps. When importing data into QIIME 2, you need to provide detail on what the data are, including the file format and the semantic type. Currently, the most common type of raw data from high-throughput amplicon sequencing are in FASTQ format. These files may contain single-end or paired-end DNA sequence reads, and will be in either multiplexed or demultiplexed format. Multiplexed files typically come as two (or three in the case of paired-end runs) files consisting of your sequences (forward and/or reverse, often but not always referred to as R1 and R2 reads, respectively) and a separate barcode file (often but not always referred to as the I1 reads). In demultiplexed format, you will have one (or two in the case of paired-end data) sequence files per sample, as the sequences have already been assigned to their designated sample IDs based on the barcode files. For the demultiplexed format, the sample name will typically be a part of the file name. In this protocol, our sequences are in single-end demultiplexed FASTQ format produced by Illumina's Casava software. As our data are split across multiple files, to import we will need to provide QIIME 2 with the location of our files and assign them sample IDs this is done using the manifest file. A manifest file is a user-created tab-separated values file with two columns: the first column, sample-id , holds the name you assign to each of your samples, and the second column, absolute-filepath , provides the absolute file path leading to your raw sequence files. For example:

sample-id absolute-filepath
10249.M001.03R $PWD/demux-se-reads/10249.M001.03R.fastq.gz
10249.M001.03SS $PWD/demux-se-reads/10249.M001.03SS.fastq.gz
10249.M001.03V $PWD/demux-se-reads/10249.M001.03V.fastq.gz

Alternatively, your sample metadata file can also be made to double as a manifest file by simply adding the absolute-filepath column to it in this protocol we demonstrate the creation and use of a separate manifest file. You can create a manifest file in a variety of ways using your favorite text editor application. Here we use a simple bash script to create ours.

  • for f in `ls per_sample_FASTQ/81253/*.gz' do n=`basename $f'
  • echo -e "12802.$ $PWD/$f" done >> manifest.tsv
  • qiime tools import
    • --input-path manifest.tsv
    • --type `SampleData[SequencesWithQuality]'
    • --input-format SingleEndFastqManifestPhred33V2

    Your data may not be demultiplexed prior to importing to QIIME 2. Instructions on how to import multiplexed FASTQ files, as well as a variety of other data types, can be found online at https://docs.qiime2.org/2019.10/tutorials/importing/. With multiplexed data, you will also need to demultiplex your sequences prior to the next step. Demultiplexing in QIIME 2 can be performed using either the q2-demux (https://docs.qiime2.org/2019.10/plugins/available/demux/) plugin which is optimized for data produced using the EMP protocol (Caporaso et al., 2012 ), or the q2-cutadapt (https://docs.qiime2.org/2019.10/plugins/available/cutadapt/) plugin (which additionally supports demultiplexing of dual-index barcodes using cutadapt Martin, 2011 ).

    The demultiplexed artifact allows us to create an interactive summary of our sequences. This summary provides information useful for assessing the quality of the DNA sequencing run, including the number of sequences that were obtained per sample and the distribution of sequence quality scores at each position.

    5. Explore the Visualization results:

    In the first Overview tab, we see a summary of our sequence counts followed by a per-sample breakdown. If you click on the Interactive Quality plot tab (Fig. 1), you can interact with the sequence quality plot, which shows a boxplot of the quality score distribution for each position in your input sequences. Because it can take a while to compute these distributions from all of your sequence data (often tens of millions of sequences), a subset of your reads are selected randomly (sampled without replacement), and the quality scores of only those sequences are used to generate the box plots. By default, 10,000 sequences are subsampled, but you can control that number with --p-n on the demux summarize command. Keep in mind that because of this random subsampling, every time you run demux summarize on the same sequence data, you will obtain slightly different plots.

    Click and drag on the plot to zoom in. When you hover the mouse over a boxplot for a given base position, the boxplot's data are shown in a table below the interactive plot as a parametric seven-number summary This is a standard summary statistics of a dataset composed of 2nd, 9th, 25th, 50th, 75th, 91st, and 98th percentiles, and can be used as a simple check for assumptions of normality. These values describe the distribution of quality scores at that position in your subsampled sequences. You can click and drag on the plot to zoom in, or double click to zoom back out to full size. These interactive plots can be used to determine if there is a drop in quality at some point in your sequences, which can be useful in choosing truncation and trimming parameters in the next section.

    Sequence quality control and feature table construction

    Traditionally, quality control of sequences was performed by trimming and filtering sequences based on their quality scores (Bokulich et al., 2013 ), followed by clustering them into operational taxonomic units (OTUs) based on a fixed dissimilarity threshold, typically 97% (Rideout et al., 2014 ). Today, there are better methods for quality control that correct amplicon sequence errors and produce high-resolution amplicon sequence variants that, unlike OTUs, resolve differences of as little as one nucleotide. These “denoisers” have many advantages over traditional clustering-based methods, as discussed in Callahan, McMurdie, & Holmes ( 2017 ). QIIME 2 currently offers denoising via the DADA2 ( q2-dada2 ) and Deblur ( q2-deblur ) plugins. The inferred ESVs produced by DADA2 are referred to as amplicon sequence variants (ASVs), while those created by Deblur are called sub-OTUs (sOTUs). In this protocol, we will refer to products of these denoisers, regardless of their method of origin, as features. The major differences in the algorithms and motivation for these and other denoising methods are reviewed in Nearing, Douglas, Comeau, & Langille ( 2018 ) and Caruso, Song, Asquith, & Karstens ( 2019 ). According to these independent evaluations, denoising methods were consistently more successful than clustering methods in identifying true community composition, while only small differences were reported among the denoising methods. We therefore view method selection here as a personal choice that research teams should make. Some practical differences may drive selection of these methods. For instance, DADA2 includes joining of paired-end reads in its processing workflow, and is therefore simpler to use when paired-end read joining is desired, while Deblur users must join reads independently prior to denoising using other plugins such as q2-vsearch ’s join-pairs method (Rognes, Flouri, Nichols, Quince, & Mahé, 2016 ).

    In this tutorial, we will denoise our sequences with q2-deblur , which uses a pre-calculated static sequence error profile to associate erroneous sequence reads with the true biological sequence from which they are derived. Unlike DADA2, which creates sequence error profiles on a per-analysis basis, this allows Deblur to be simultaneously applied across different datasets, reflecting its design motivation to perform meta-analyses. Additionally, using a pre-defined error profile generally results in shorter runtimes.

    Deblur is applied in two steps:

    • qiime quality-filter q-score
      • --i-demux se-demux.qza
      • --o-filtered-sequences demux-filtered.qza
      • --o-filter-stats demux-filter-stats.qza
      • qiime deblur denoise-16S
        • --i-demultiplexed-seqs demux-filtered.qza
        • --p-trim-length 150
        • --p-sample-stats
        • --p-jobs-to-start 4
        • --o-stats deblur-stats.qza
        • --o-representative-sequences rep-seqs-deblur.qza

        The denoising step is often one of the longest steps in microbiome analysis pipelines. Luckily, both DADA2 and Deblur are parallelizable, meaning you can significantly reduce computation time if your machine has access to multiple cores. To increase the number of cores you wish to designate for this task, use the --p-jobs-to-start parameter to change the default value of 1 to a value suitable for your machine. As a reminder, if you are following the online version of this protocol, you can skip this step and download the output artifacts, and use those in the following steps.

        Deblur generates three outputs: an artifact with the semantic type FeatureTable[Frequency] , which is a table of the counts of each observed feature in each sample, an artifact with the semantic type FeatureData[Sequence] , which contains the sequence that defines each feature in the table that will be used later for assigning taxonomy to features and generating a phylogenetic tree, and summary statistics of the Deblur run in a DeblurStats artifact . Each of these artifacts can be visualized to provide important information.

        • qiime deblur visualize-stats
          • --i-deblur-stats deblur-stats.qza
          • --o-visualization deblur-stats.qzv

          The statistics summary (Fig. 2) provides us with information about what happened to each of the samples during the deblur process. The reads-raw column gives information on the number of reads presented to the deblur algorithm. Because deblur works by deleting erroneous reads that it detects, the final number of reads is smaller than the starting number. The three columns that follow ( fraction-artifact-with-minsize , fraction-artifact , and fraction-missed-reference ) summarize the data from other columns in a convenient way. They identify potential problems with the data at an early stage. Fraction-artifact-with-minsize is the fraction of sequences detected as artifactual, including those that fall below the minimum length threshold (specified by the --p-trim-length parameter). Fraction-artifact is the fraction of raw sequences that were identified as artifactual. Fraction-missed-reference is the fraction of post-deblur sequences that were not recruited by the positive reference database. The subsequent columns provide information about the number of sequences remaining after dereplication ( unique-reads-derep, reads-derep ), following deblurring ( unique-reads-deblur, reads-deblur ), number of hits that were recruited to the negative reference database following the deblurring process ( unique-reads-hit-artifact , reads-hit-artifact ), chimeric sequences detected ( unique-reads-chimeric and reads-chimeric ), sequences that match/miss the positive reference database ( unique-reads-hit-reference , reads-hit-reference , unique-reads-missed-reference , and reads-missed-reference ). The number in the reads-hit-reference column is the final number of per-sample sequences present in the table-deblur.qza QIIME 2 artifact.

          The shorthand “artifact” in the per-sample Deblur statistics denotes artifactual sequences (i.e., those erroneously generated as byproducts of the PCR and DNA sequencing process), not a QIIME 2 artifact (i.e., a valid data product of QIIME 2).

          • qiime feature-table tabulate-seqs
            • --i-data rep-seqs-deblur.qza
            • --o-visualization rep-seqs-deblur.qzv

            This Visualization (Fig. 3) will provide statistics and a seven-number summary of sequence lengths, and, more importantly, show a sequence table that maps feature IDs to sequences, with links that allow you to easily BLAST each sequence against the NCBI nt database. To BLAST a sequence against the NCBI nt database, click the sequence and then click the View report button on the resulting page. This will be useful later in the tutorial, when you want to learn more about specific features that are important in the data set. Note that automated taxonomic classification is performed at a later step, as described below the NCBI-BLAST links provided in this Visualization are useful for assessing the taxonomic affiliation and alignment of individual features to the reference database. Results of the “top hits” from a simple BLAST search such as this are known to be poor predictors of the true taxonomic affiliations of these features, especially in cases where the closest reference sequence in the database is not very similar to the sequence that you are using as a query.

            By default, QIIME 2 uses MD5 hashing of a feature's full sequence to assign a feature ID. These are the 32-bit strings of numbers and characters you see in the Feature ID column above. Hashing in q2-deblur can be disabled by adding the --p-no-hashed-feature-ids parameter.

            • qiime feature-table summarize
              • --i-table table-deblur.qza
              • --m-sample-metadata-file metadata.tsv
              • --o-visualization table-deblur.qzv

              The first Overview tab gives information about how many sequences come from each sample, histograms of those distributions, and related summary statistics. The Interactive Sample Detail tab (Fig. 4) shows a bar plot of the number of samples associated with the metadata category of interest, and the feature count in each sample is shown in the table below. Note that you can choose the metadata categories and change sampling depth by dragging the bar or typing in the value. The Feature Detail Detail tab shows the frequency and number of observed samples associated with each feature.

              If traditional OTU clustering methods are desired, QIIME 2 users can perform these using the q2-vsearch plugin (Rognes et al., 2016 ) at https://docs.qiime2.org/2019.10/plugins/available/vsearch/. However, we recommend that denoising methods be used prior to clustering in order to utilize the superior quality-control procedures within these tools.

              Generating a phylogenetic tree

              Although microbiome data can be analyzed without a phylogenetic tree, many commonly used diversity analysis methods such as Faith's phylogenetic diversity (Faith, 1992 ) and UniFrac (Lozupone & Knight, 2005 ) require one. To use these methods, we must construct a phylogenetic tree that allows us to consider evolutionary relatedness between the DNA sequences.

              QIIME 2 offers several methods for reconstructing phylogenetic trees based on features found in your data. These include several variants of traditional alignment-based methods of building a de novo tree, as well as a fragment-insertion method that aligns your features against a reference tree. It should be noted that de novo trees reconstructed from short sequences result in low-quality trees because the sequences do not contain enough information to give the correct evolutionary relationships over large evolutionary distances, and thus should be avoided when possible (Janssen et al., 2018 ). For this tutorial, we will use the fragment-insertion tree-building method as described by Janssen et al. ( 2018 ) using the sepp action of the q2-fragment-insertion plugin, which has been shown to outperform traditional alignment-based methods with short 16S amplicon data. This method aligns our unknown short fragments to full-length sequences in a known reference database and then places them onto a fixed tree. Note that this plugin has only been tested and benchmarked on 16S data against the Greengenes reference database (McDonald et al., 2012 ), so if you are using different data types you should consider the alternative methods mentioned below.

              • qiime fragment-insertion sepp
                • --i-representative-sequences rep-seqs-deblur.qza
                • --i-reference-database sepp-refs-gg-13-8.qza
                • --p-threads 4
                • --o-tree insertion-tree.qza
                • --o-placements insertion-placements.qza

                The newly formed insertion-tree.qza is stored as a rooted phylogenetic tree (of semantic type Phylogeny[Rooted] ) and can be used in downstream analysis for phylogenetic diversity computations.

                Building a tree using SEPP can be computationally demanding and often has longer run times than most steps in a typical microbiome analysis pipeline. The --p-threads parameter which, similar to the --p-jobs-to-start parameter from q2-deblur , allows this action to be performed in parallel across multiple cores, significantly reduces run time. See the developers’ recommendations with regard to run-time optimization at https://github.com/qiime2/q2-fragment-insertion#expected-runtimes. As a reminder, if you are following the online version of this protocol, you can skip this step, download the output artifacts, and use those in the following steps.

                Once the insertion tree is created, you must filter your feature table so that it only contains fragments that are in the insertion tree. This step is needed because SEPP might reject the insertion of some fragments, such as erroneous sequences or those that are too distantly related to the reference alignment and phylogeny. Features in your feature table without a corresponding phylogeny will cause diversity computation to fail, because branch lengths cannot be determined for sequences not in the tree.

                • qiime fragment-insertion filter-features
                  • --i-table table-deblur.qza
                  • --i-tree insertion-tree.qza
                  • --o-filtered-table filtered-table-deblur.qza
                  • --o-removed-table removed-table.qza

                  This command generates two feature tables: The filtered-table-deblur.qza contains only features that are also present in the tree, while the removed-table.qza contains features not present in the tree. Both of these tables can be visualized as shown in step 5 of the previous section, titled “Sequence quality control and feature table construction.”

                  If a traditional de novo phylogenetic tree is desired/required, QIIME 2 offers several methods [FastTree (Price, Dehal, & Arkin, 2010 ), IQ-TREE (Nguyen, Schmidt, von Haeseler, & Minh, 2015 ), and RAxML (Stamatakis, 2014 )] to reconstruct these using the q2-phylogeny plugin (https://docs.qiime2.org/2019.10/plugins/available/phylogeny/). A tree produced by any of these alignment-based methods can be used with your original feature table without the need for the filtering that SEPP requires. However, if some of your sequences are not 16S rRNA genes, the tree will be incorrect in ways that may severely affect your results.

                  4. Visualize the phylogenetic tree:

                  The phylogenetic tree artifact (semantic type: Phylogeny[Rooted] ) produced in this step can be readily visualized using q2-empress (https://github.com/biocore/empress) or iTOL's (Letunic & Bork, 2019 ) interactive web-based tool by simply uploading the artifact at https://itol.embl.de/upload.cgi. The underlying tree, in Newick format, can also be easily exported for use in your application of choice (see the “Exporting QIME 2 data” section in the Support Protocol, below).

                  Taxonomic classification

                  While sequences derived from denoising methods provide us with the highest possible resolution of our features given our sequencing data, it is usually desirable to know the taxonomic affiliation of the microbes from which sequences were obtained. QIIME 2 provides several methods to predict the most likely taxonomic affiliation of our features through the q2-feature-classifier plugin (Bokulich, Kaehler, et al., 2018 ). These include both alignment-based consensus methods and Naive Bayes (and other machine-learning) methods. In this tutorial, we will use a Naive Bayes classifier, which must be trained on taxonomically defined reference sequences covering the target region of interest. Some pre-trained classifiers are available through the QIIME 2 Data Resources page (https://docs.qiime2.org/2019.10/data-resources/), and some have been made available by users on the QIIME 2 Community Contributions channel (https://forum.qiime2.org/c/community-contributions). If a pre-trained classifier suited for your region of interest or reference database is not available through these resources, you can train your own by following the online tutorial (https://docs.qiime2.org/2019.10/tutorials/feature-classifier/). In the present protocol, we will train a classifier specific to our data that (optionally) also incorporates environment-specific taxonomic abundance information to improve species inference. This bespoke method has been shown to improve classification accuracy (Kaehler et al., 2019 ) when compared to traditional Naive Bayes classifiers, which assume that all species in the reference database are equally likely to be observed in your sample (i.e., that sea-floor microbes are just as likely to be found in a stool sample as microbes usually associated with stool).

                  To train a classifier using this bespoke method, we need three files: (1) a set of reference reads, (2) a reference taxonomy, and (3) taxonomic weights. Taxonomic weights can be customized for specific sample types and reference data using the q2-clawback plugin (Kaehler et al., 2019 ) (see alternative pipeline recommendation below), or we can obtain pre-assembled taxonomic weights from the readytowear collection (https://github.com/BenKaehler/readytowear). This collection also contains the reference reads and taxonomies required. The taxonomic weights used in this tutorial have been assembled with 16S rRNA gene sequence data using the Greengenes reference database trimmed to the V4 domain (bound by the 515F/806R primer pair as used in the ECAM study). Here, we will use the pre-calculated taxonomic weights specific to human stool data. For other sample types, make sure to pick the appropriate weights best fit for your data, and the appropriate sequence reference database a searchable inventory of available weights is available at https://github.com/BenKaehler/readytowear/blob/master/inventory.tsv.

                  • wget
                  • https://github.com/BenKaehler/readytowear/raw/master/data/gg_13_8/515f-806r/human-stool.qza
                  • wget
                  • https://github.com/BenKaehler/readytowear/raw/master/data/gg_13_8/515f-806r/ref-seqs-v4.qza
                  • wget
                  • https://github.com/BenKaehler/readytowear/raw/master/data/gg_13_8/515f-806r/ref-tax.qza
                  • qiime feature-classifier fit-classifier-naive-bayes
                    • --i-reference-reads ref-seqs-v4.qza
                    • --i-reference-taxonomy ref-tax.qza
                    • --i-class-weight human-stool.qza
                    • --o-classifier gg138_v4_human-stool_classifier.qza
                    • qiime feature-classifier classify-sklearn
                      • --i-reads rep-seqs-deblur.qza
                      • --i-classifier gg138_v4_human-stool_classifier.qza
                      • --o-classification bespoke-taxonomy.qza

                      This new bespoke-taxonomy.qza data artifact is a FeatureData[Taxonomy] type which can be used as input in any plugins that accept taxonomic assignments.

                      • qiime metadata tabulate
                        • --m-input-file bespoke-taxonomy.qza
                        • --m-input-file rep-seqs-deblur.qza
                        • --o-visualization bespoke-taxonomy.qzv

                        The Visualization (Fig. 5) shows the classified taxonomic name for each feature ID, with additional information on confidence level and sequences. You can reorder the table by clicking the sorting button next to each column name. Recall that the rep-seqs.qzv Visualization that we created above allows you to easily BLAST the sequence associated with each feature against the NCBI nt database. Using that Visualization and the bespoke-taxonomy.qzv Visualization created here, you can compare the taxonomic assignments of features of interest with those from BLAST's top hit. Because these methods are only estimates, it is not uncommon to find disagreements between the predicted taxonomies. The results here will generally be more accurate than those received from the simple BLAST search linked from the rep-seqs.qzv Visualization.

                        To assemble your own taxonomic weights for regions not available in the readytowear inventory, follow the detailed instructions at https://forum.qiime2.org/t/using-q2-clawback-to-assemble-taxonomic-weights.

                        Filtering data

                        So far, in addition to our sample metadata, we have obtained a quality-controlled FeatureTable[Frequency] , a Phylogeny[Rooted] , and a FeatureData[Taxonomy] artifact. We are now ready to explore our microbial communities and perform various statistical tests. In the following sections, we will explore the microbial communities of our samples from children only, and thus will separate these samples from those of the mothers.

                        • qiime feature-table filter-samples
                          • --i-table filtered-table-deblur.qza
                          • --m-metadata-file metadata.tsv
                          • --p-where "[mom_or_child]= `C'"
                          • --o-filtered-table child-table.qza
                          • qiime feature-table summarize
                            • --i-table child-table.qza
                            • --m-sample-metadata-file metadata.tsv
                            • --o-visualization child-table.qzv

                            Load this new Visualization artifact and keep it open, as we will be referring to this in the following section.

                            Alpha rarefaction plots

                            • qiime diversity alpha-rarefaction
                              • --i-table child-table.qza
                              • --i-phylogeny insertion-tree.qza
                              • --p-max-depth 10000
                              • --m-metadata-file metadata.tsv
                              • --o-visualization child-alpha-rarefaction.qzv

                              Load the child-alpha-rarefaction.qzv Visualization.

                              The resulting Visualization (Fig. 6) has two plots. The top plot is an alpha rarefaction plot, and is primarily used to determine if the within-sample diversity has been fully captured. If the lines in the plot appear to “level out” (i.e., approach a slope of zero) at some sampling depth along the x axis, this suggests that collecting additional sequences is unlikely to result in any significant changes to our samples’ estimated diversity. If the lines in a plot do not level out, the full diversity of the samples may not have been captured by our sampling efforts, or this could indicate that a lot of sequencing errors remain in the data (which are being mistaken for novel diversity).

                              The bottom plot in this visualization is important when grouping samples by our metadata categories. It illustrates the number of samples that remain in each group when the feature table is rarefied to each sampling depth. If a given sampling depth “d” is larger than the total frequency of a sample “s” (i.e., the number of sequences that were obtained for sample “s”), it is not possible to compute the diversity metric for sample “s” at sampling depth “d.” If many of the samples in a group have lower total frequencies than “d,” the average diversity presented for that group at “d” in the top plot will be unreliable because it will have been computed on relatively few samples. When grouping samples by metadata, it is therefore essential to look at the bottom plot to ensure that the data presented in the top plot are reliable. Try using the drop-down menus at the top of the plots to switch between the different calculated diversity metrics and metadata categories.

                              As mentioned earlier, a normalization method to account for unequal sampling depth across samples in microbiome data is essential to avoid the introduction of bias. One common approach to dealing with this problem is to sample a random subset of sequences without replacement for each sample at a fixed depth (also referred to as rarefying) and discard all remaining samples with total read counts below that threshold. This approach, which is not ideal because it discards a large amount of information (McMurdie & Holmes, 2014 ), has nonetheless been shown to be useful for many different microbial community analyses that are otherwise dominated by sample-to-sample variation in the number of sequences per sample obtained (Weiss et al., 2017 ). Selecting the depth to which to rarefy samples is a subjective decision motivated by the desire to maximize the rarefying threshold while minimizing loss of samples due to insufficient coverage.

                              Let us consider our current dataset as an example. In the rarefaction plots above, we can see that there is a natural leveling of our diversity metrics starting at 1000 sequences/sample, with limited additional increases observed beyond 3000 sequences/sample. This should be our target minimum sampling depth. Now let us revisit the child-table.qzv Visualization from the Filtering data step. Select the Interactive Sample Detail tab from the top left corner, and use the Metadata Category drop-down menu to select month . Hover over each bar in the plot to see the number of samples included at each month. Now, try moving the Sampling Depth bar on the right starting from the left (zero) to the right. You will see that as the sampling depth increases we begin to rapidly lose samples, as shown by the grayed areas in the bar plot. In this dataset, the time point 0 month is better represented than the subsequent months. We would therefore ideally minimize discarding samples from the other underrepresented months to maintain sufficient statistical power in downstream analyses. Start moving the Sampling Depth bar from zero again this time stop at the first instance where we begin to see a loss of sample at a month that is not 0. Now scroll down to the bottom of the page. The samples highlighted in red are the would-be discarded samples at that chosen sampling depth. Here we see that at a depth of exactly 3400 we are able to retain all the samples from months 6, 12, and 24, while still maintaining a minimum depth that will capture the overall signature of the alpha diversity metrics as seen by our rarefaction plots.

                              Alternative Pipeline

                              Newer methods are actively being developed that circumvent the need for rarefying by taking advantage of the compositional nature of microbiome data we will show examples of these methods in subsequent sections. However, for some commonly used analysis tasks, no such solution yet exists.

                              Basic data exploration and diversity analyses

                              • qiime feature-table filter-samples
                                • --i-table child-table.qza
                                • --m-metadata-file metadata.tsv
                                • --p-where "[month_replicate]= `no'"
                                • --o-filtered-table child-table-norep.qza
                                • qiime feature-table summarize
                                  • --i-table child-table-norep.qza
                                  • --m-sample-metadata-file metadata.tsv
                                  • --o-visualization child-table-norep.qzv

                                  We are now ready to explore our microbial communities. One simple method to visualize the taxonomic composition of samples is to visualize them individually as stacked barplots. We can do this easily by providing our feature table, taxonomy assignments, and sample metadata file to the taxa plugin's barplot action.

                                  • qiime taxa barplot
                                    • --i-table child-table-norep.qza
                                    • --i-taxonomy bespoke-taxonomy.qza
                                    • --m-metadata-file metadata.tsv
                                    • --o-visualization child-bar-plots.qzv

                                    This barplot (Fig. 7) shows the relative frequency of features in each sample, where you can choose the taxonomic level to display and sort the samples by a sample metadata category or taxonomic abundance in an ascending or descending order. You can also highlight a specific feature in the barplot by clicking it in the legend. The snapshot above shows a barplot at the phylum level (level 2) where samples were sorted by day. Three phyla were highlighted to show that Proteobacteria (gray) dominate at birth, but by 6 months of age the relative abundance of Bacteroidetes (green) and Firmicutes (purple) make up the majority of the community.

                                    While barplots can be informative with regard to the composition of our microbial communities, it is difficult to use them to disentangle meaningful signals from noise.

                                    Many microbial ecology studies use alpha diversity (within-sample richness and/or evenness) and beta diversity (between-sample dissimilarity) to reveal patterns in the microbial diversity in a set of samples. QIIME 2's diversity analyses are available through the q2-diversity plugin, which computes a range of alpha and beta diversity metrics, applies related statistical tests, and generates interactive visualizations. The diversity metrics used in any given study should be based on the overall goals of the experiment. For a list of available diversity metrics in QIIME 2 and a brief description of the motivation behind them, we recommend reviewing the following tutorial: https://forum.qiime2.org/t/alpha-and-beta-diversity-explanations-and-commands.

                                    In this tutorial, we will utilize the pipeline action core-metrics-phylogenetic , which simultaneously rarefies a FeatureTable[Frequency] to a user-specified depth, computes several commonly used alpha and beta diversity metrics, and generates principal coordinates analysis (PCoA) plots using the EMPeror visualization tool (Vázquez-Baeza, Pirrung, Gonzalez, & Knight, 2013 ) for each of the beta diversity metrics. For this tutorial, we will use a sampling depth of 3400 as determined from the previous step.

                                    • qiime diversity core-metrics-phylogenetic
                                      • --i-table child-table-norep.qza
                                      • --i-phylogeny insertion-tree.qza
                                      • --p-sampling-depth 3400
                                      • --m-metadata-file metadata.tsv
                                      • --p-n-jobs 1
                                      • --output-dir child-norep-core-metrics-results

                                      By default, the following metrics are computed by this pipeline and stored within the child-core-metrics-results directory.

                                      Alpha diversity metrics

                                      1. Shannon's diversity index (a quantitative measure of community richness Shannon & Weaver, 1949 )
                                      2. Observed features (a quantitative measure of community richness, called “observed OTUs” here for historical reasons)
                                      3. Evenness (or Pielou's Evenness a measure of community evenness Pielou, 1966 )
                                      4. Faith's Phylogenetic Diversity (a qualitative measure of community richness that incorporates phylogenetic relationships between the features Faith, 1992 ) this metric is sometimes referred to as PD_whole_tree, but we discourage the use of that name in favor of Faith's Phylogenetic Diversity or Faith's PD

                                      Beta diversity metrics

                                      1. Jaccard distance (a qualitative measure of community dissimilarity (Jaccard, 1908 )
                                      2. Bray-Curtis distance (a quantitative measure of community dissimilarity Sørensen, 1948 )
                                      3. unweighted UniFrac distance (a qualitative measure of community dissimilarity that incorporates phylogenetic relationships between the features Lozupone & Knight, 2005 ) implementation based on Striped UniFrac (McDonald et al., 2018 ) method
                                      4. weighted UniFrac distance (a quantitative measure of community dissimilarity that incorporates phylogenetic relationships between the features Lozupone, Hamady, Kelley, & Knight, 2007 ) implementation based on Striped UniFrac (McDonald et al., 2018 ) method.

                                      After computing the core diversity metrics, we can begin to explore the microbial composition of the samples in the context of their metadata.

                                      Performing statistical tests on diversity and generating interactive visualizations

                                      Alpha diversity

                                      • qiime diversity alpha-group-significance
                                        • --i-alpha-diversity child-norep-core-metrics-results/shannon_vector.qza
                                        • --m-metadata-file metadata.tsv
                                        • --o-visualization child-norep-core-metrics-results/shannon-group-significance.qzv

                                        Load the newly created shannon-group-significance.qzv Visualization.

                                        From the boxplots and Kurskal-Wallis test results (Fig. 8), it appears that there are no differences between the child samples in terms of Shannon H diversity when mode of delivery is considered (p value = 0.63). However, exposure to antibiotics appears to be associated with higher diversity (p value = 0.026). What are the biological implications?

                                        One important confounding factor here is that we are simultaneously analyzing our samples across all time points and in doing so potentially losing meaningful signals at a particular time point. Importantly, having more than one time point per subject also violates the assumption of the Kurskal-Wallis test that all samples are independent. More appropriate methods that take into account repeated measurements from the same samples are demonstrated in the longitudinal data analysis section below. It is important to note that QIIME 2 is not able to detect that: you must always be knowledgeable about the assumptions of the statistical tests that you are applying, and whether they are applicable to your data. These types of questions are common on the QIIME 2 Forum, so if you are unsure, start by searching for your question on the forum and posting your own question if you do not find a pre-existing answer.

                                        • qiime feature-table filter-samples
                                          • --i-table child-table-norep.qza
                                          • --m-metadata-file metadata.tsv
                                          • --p-where "[month]= `24'"
                                          • --o-filtered-table table-norep-C24.qza
                                          • qiime diversity core-metrics-phylogenetic
                                            • --i-table table-norep-C24.qza
                                            • --i-phylogeny insertion-tree.qza
                                            • --p-sampling-depth 3400
                                            • --m-metadata-file metadata.tsv
                                            • --p-n-jobs 1
                                            • --output-dir norep-C24-core-metrics-results
                                            • qiime diversity alpha-group-significance
                                              • --i-alpha-diversity norep-C24-core-metrics-results/shannon_vector.qza
                                              • --m-metadata-file metadata.tsv
                                              • --o-visualization norep-C24-core-metrics-results/shannon-group-significance.qzv

                                              Load this new Visualization.

                                              We can see now that at month 24, vaginal birth appears to be associated with a higher Shannon value than cesarean birth (p value = 0.02, Fig. 9), while antibiotic exposure is no longer associated with differences in Shannon diversity (p value = 0.87).

                                              Beta diversity

                                              Next, we will compare the structure of the microbiome communities using beta diversity. We start by making a visual inspection of the principal coordinates (PCoA) plots that were generated in the previous step. Load the unweighted_ unifrac_emperor.qzv Visualization from the norep-C24-core-metrics-results folder.

                                              Each dot in the PCoA plot (Fig. 10) represents a sample, and users can color them according to their metadata category of interest and rotate the 3D figure to see whether there is a clear separation in beta diversity driven by these covariates. Moreover, users can customize their figures using existing drop-down menus by hiding certain samples in Visibility , changing the brightness of dots in Opacity , controlling their size in Scale , choosing different shapes for samples in Shape , modifying the color of axes and background in Axes , and creating a moving picture under the Animations tab.

                                              Alternative Pipeline

                                              For longitudinal studies, we have made great use of visualizing temporal variability using animated traces in Emperor. By doing this, you can follow the longitudinal dynamics sample by sample and subject by subject. In order to do so, you need two metadata categories—one to order the samples ( Gradient category ) and one to group the samples ( Trajectory category ). For this dataset, we can use animations_gradient as the category that orders the samples, and animations_subject as the category that groups our samples.

                                              The values in animations_gradient represent the age in months. In this category, samples with no longitudinal data are set to 0 note that all values have to be numeric in order for the animation to be displayed. The animations_subject category includes unique identifiers for each subject. Put together, these two categories will result in animated traces on a per-individual basis.

                                              In Emperor's user interface, go to the Animations tab, select animations_gradient under the Gradient menu, and select animations_subject under the Trajectory menu. Then click play , and you will see animated traces moving on the plot. You can adjust the speed and the radius of the trajectories. To start over, click on the back button. Using the ECAM dataset, we have generated an animation visualizing the temporal trajectories of one vaginally born and one cesarean baby in the 3D PCoA plot. This animation is available at https://tinyurl.com/y7sbzpnd. For more information about animated ordinations, visit Emperor's online tutorial at https://biocore.github.io/emperor/build/html/tutorials/animations.html.

                                              • qiime diversity beta-group-significance
                                                • --i-distance-matrix norep-C24-core-metrics-results/unweighted_unifrac_distance_matrix.qza
                                                • --m-metadata-file metadata.tsv
                                                • --m-metadata-column delivery
                                                • --p-pairwise
                                                • --o-visualization norep-C24-core-metrics-results/uw_unifrac-delivery-significance.qzv

                                                The overview statistics (Fig. 11) provide us with the parameters used in the PERMANOVA test and the resulting values of test statistic and p value. The boxplots (Fig. 9) show the pairwise distance between cesarean and vaginal birth. Lastly, the table (in Fig. 9) summarizes the results from PERMANOVA and gives an additional q value (adjusted p value for multiple testing). The PERMANOVA test confirms our initial assessment that vaginal-born microbial communities are not statistically different from cesarean-born communities in beta diversity (as represented by unweighted UniFrac distances) at month 24 (p value = 0.6). These results, however, should be interpreted cautiously given the limited sample size in this dataset. We would conclude that further experiments would be needed to confirm our findings.

                                                Alternative Pipeline

                                                The beta diversity analysis above was carried on a rarefied subset of our data. An alternative method that does not require rarefying is offered through the external q2-deicode plugin (https://library.qiime2.org/plugins/deicode). DEICODE is a form of Aitchison Distance that is robust with respect to compositional data with high levels of sparsity (Martino et al., 2019 ). This plugin can be used to generate a beta diversity ordination artifact that can easily be utilized with the existing architecture in QIIME 2 such as visualization with q2-emperor and hypothesis testing with the beta-group-significance as above.

                                                Longitudinal data analysis

                                                When microbial data are collected at different timepoints, it is useful to examine dynamic changes in the microbial communities (longitudinal analysis). This section is devoted to longitudinal microbiome analysis using the q2-longitudinal plugin (Bokulich, Dillon, Zhang, et al., 2018 ). This plugin can perform a number of analyses such as: visualization using volatility plots, testing temporal trends in alpha and beta diversities, using linear mixed-effects models to test for changes in diversity metrics or individual features with regard to metadata categories of interest, and more. A comprehensive list of available methods and instructions on how to perform them are available in the online tutorial: https://docs.qiime2.org/2019.10/tutorials/longitudinal/. Here we will demonstrate some of these methods.

                                                Linear mixed effects (LME) models

                                                • qiime diversity core-metrics-phylogenetic
                                                  • --i-table child-table.qza
                                                  • --i-phylogeny insertion-tree.qza
                                                  • --p-sampling-depth 3400
                                                  • --m-metadata-file metadata.tsv
                                                  • --p-n-jobs 1
                                                  • --output-dir child-core-metrics-results
                                                  • qiime longitudinal linear-mixed-effects
                                                    • --m-metadata-file metadata.tsv
                                                    • --m-metadata-file child-core-metrics-results/shannon_vector.qza
                                                    • --p-metric shannon
                                                    • --p-random-effects day_of_life
                                                    • --p-group-columns delivery,diet
                                                    • --p-state-column day_of_life
                                                    • --p-individual-id-column host_subject_id
                                                    • --o-visualization lme-shannon.qzv

                                                    In this Visualization (Fig. 12), the model results provide all the outputs from the LME model, where we see a significant birth mode effect in Shannon diversity over time (p value = 0.02), while the diet has no bearing in Shannon diversity across time (p value = 0.55). The regression scatterplots (top) overlap the predicted group mean trajectories on the observed data (dots), and the projected residuals plot (bottom) can help users to check the validity of an LME model. For more details, see https://docs.qiime2.org/2019.10/tutorials/longitudinal/.

                                                    Volatility visualization

                                                    The volatility visualizer generates interactive line plots that allow us to assess how volatile a dependent variable is over a continuous, independent variable (e.g., time) in one or more groups. Multiple metadata files (including alpha and beta diversity) and feature tables can be used as input, and in the interactive visualization we can select different dependent variables to plot on the y axis. Here we examine how variance in Shannon diversity changes across time in our cohort, both in groups of samples (interactively selected) and in individual subjects.

                                                    • qiime longitudinal volatility
                                                      • --m-metadata-file metadata.tsv
                                                      • --m-metadata-file child-core-metrics-results/shannon_vector.qza
                                                      • --p-default-metric shannon
                                                      • --p-default-group-column delivery
                                                      • --p-state-column month
                                                      • --p-individual-id-column host_subject_id
                                                      • --o-visualization shannon-volatility.qzv

                                                      The volatility plot (Fig. 13) shows the mean curve of each group in the selected group column on top of individual trajectories over time. This plot can be useful in identifying outliers qualitatively, by turning on show global control limits to show ± 2× and 3× standard deviation lines from global mean. Observations above those global control limits are suspected to be outliers. In this analysis, we see high variance at time zero, while they become more similar by month 8, and by month 24, vaginally born children appear to be higher than cesarean-born (as expected).

                                                      Differential abundance testing

                                                      So far, we have analyzed our data using a variety of approaches employing various diversity metrics and between-sample distances that are useful in comparing our communities in a broad approach. Now, we want to identify individual taxa whose relative abundances are significantly different across groups. Differential abundance testing in microbiome analysis is an active area of research (see the “compositional data analysis” section in the Support Protocol for more details). Two QIIME 2 plugins that can be used for this are q2-songbird (Morton et al., 2019 ) and q2-composition . In this section, we will use the ANCOM test in the q2-composition plugin to identify differential abundant features between vaginal- and cesarean-born children. Moreover, we will use q2-songbird to perform a similar task, but with the additional adjustment for potential confounders.

                                                      ANCOM

                                                      As with any bioinformatics method, you should be aware of the assumptions and limitations of ANCOM before using it. For example, ANCOM assumes that few (less than ∼25%) features differ between groups. If you expect that more features differ between your groups, you should not use ANCOM because it will be more error-prone (an increase in both Type I and II errors is possible). We recommend reading the ANCOM paper (Mandal et al., 2015 ) before using this method. For the simplicity of the analysis, we will focus on identifying differential abundant features in children born with different birth modes at month 6 only. We have selected 6 months because this time point contains the highest number of samples (after baseline time 0), which greatly increases the power of our analysis.

                                                      • qiime feature-table filter-samples
                                                        • --i-table child-table-norep.qza
                                                        • --m-metadata-file metadata.tsv
                                                        • --p-where "[month]=`6'"
                                                        • --o-filtered-table table-norep-C6.qza

                                                        When performing differential abundance testing, it is generally a good idea to filter out features that have very low abundances across your dataset, as well those that are present in only a few samples. These features tend to add noise to the results, so we will remove them. Here we use the filter-features action to filter out features appearing in less than ∼10% of our samples (min 5 of 43 samples) and those that have a total frequency less than 20 counts across all samples.

                                                        • qiime feature-table filter-features
                                                          • --i-table table-norep-C6.qza
                                                          • --p-min-samples 5
                                                          • --p-min-frequency 20
                                                          • --o-filtered-table filtered-table-C6.qza

                                                          Because ANCOM operates on relative abundance data, it requires as input a feature table of type FeatureTable[Composition] it also cannot tolerate frequencies of zero. To resolve both of these requirements, we will use the add-pseudocount action to simultaneously apply relative abundance transformation and add a pseudocount of 1 to all of our counts.

                                                          • qiime composition add-pseudocount
                                                            • --i-table filtered-table-C6.qza
                                                            • --o-composition-table comp-table-C6.qza
                                                            • qiime composition ancom
                                                              • --i-table comp-table-C6.qza
                                                              • --m-metadata-file metadata.tsv
                                                              • --m-metadata-column delivery
                                                              • --o-visualization ancom-C6-delivery.qzv

                                                              The Visualization of ANCOM results (Fig. 14) first shows a volcano plot, where the x axis summarizes the effect size difference of the given features between interested metadata categories (delivery modes in our case) and the y axis is the strength of the ANCOM test statistic W. As ANCOM is essentially running pairwise tests, the W value is a count of the number of sub-hypotheses that have passed for a given feature. Hence, the differentially abundant features will be those ASVs with high values on both the x and y axis, in other words, points that are close to the top right or left corners (in the figures for this tutorial, the one identified feature has been highlighted with a red circle). The identified features are summarized underneath the ANCOM statistical results section. Lastly, the percentile abundance table shows the number of sequences assigned to each identified feature in how many of the samples. Regarding the identified feature in our analysis, of the samples in the cesarean group, in the sample with the lowest count of sequences assigned to detected feature, one sequence was observed that was ultimately assigned to this feature. Then, in 75% of the samples in the cesarean group, 1 or fewer sequences were observed that were ultimately assigned to this feature (recall that adding the pseudocount ensures that every sample will appear to have at least 1 count of every feature). However, in 75% of the samples in the Vaginal group, 884.75 or fewer sequences were observed that were ultimately assigned to this feature. This percentile abundance table suggests that the detected feature is higher in vaginally born than cesarean-born babies.

                                                              The ANCOM test has identified 1 feature that differs significantly by birth mode. To identify which taxa this feature corresponds to, we can load our bespoke-taxonomy.qzv artifact made in step 4 of the “Taxononomic classification” section and look up the feature ID in the search-bar at the top.

                                                              This identified feature and its corresponding taxonomic assignment are as follows:

                                                              • d75b7080930e7a77ef3de8c6154895b9 ->
                                                              • k__Bacteria p__Actinobacteria c__Actinobacteria
                                                              • o__Bifidobacteriales f__Bifidobacteriaceae
                                                              • g__Bifidobacterium s__

                                                              Perhaps not surprisingly, these results echo findings from the original ECAM paper (Bokulich, Chung, et al., 2016 ) encompassing the full dataset.

                                                              Songbird

                                                              Songbird (Morton et al., 2019 ) can be used to identify differentially abundant features while accounting for confounding variables in the data. This is a multinomial regression designed for compositional microbiome data (in technical terms, it is an L2 regularized multinomial regression that avoids overfitting by using the sum of squares of all feature weights as penalty term to the loss function, as in Ridge regression). Here, we control for confounding variables such as antibiotic exposure and infants’ diet and sex when identifying features that are significantly different between babies born vaginally versus through C-section.

                                                              • qiime songbird multinomial
                                                                • --i-table table-norep-C6.qza
                                                                • --m-metadata-file metadata.tsv
                                                                • --p-formula "delivery+abx_exposure+diet+sex"
                                                                • --p-epochs 10000
                                                                • --p-differential-prior 0.5
                                                                • --o-differentials songbird-results/differentials6monthControlled.qza
                                                                • --o-regression-stats songbird-results/regression-stats6monthControlled.qza
                                                                • --o-regression-biplot songbird-results/regression-biplot6monthControlled.qza

                                                                Note that users can adjust their model parameters and validate their fitted models by using the existing model diagnostic tools in songbird, such as plotting graphs of prediction accuracy and visualizing convergence summary.

                                                                • qiime tools export
                                                                  • --input-path songbird-results/differentials6monthControlled.qza
                                                                  • --output-path songbird-results/exported-differentials6monthControlled

                                                                  Based on the estimated coefficients for delivery[T.Vaginal] in the output of regression stats, we consider the features with the positive coefficients to be differential relative to negative coefficients in vaginal-born infants versus cesareans, and vice versa. There is no clear cutoff in songbird on the value of coefficients to assist in choosing of number of features, but since there are few features with coefficients higher than 2.5 or lower than -2.5, we use this threshold as our cut-off for regression coefficients and thus identify five vaginally born-associated and four C-section born-associated features, as shown below:

                                                                  • d75b7080930e7a77ef3de8c6154895b9 ->
                                                                  • k__Bacteria p__Actinobacteria c__Actinobacteria
                                                                  • o__Bifidobacteriales f__Bifidobacteriaceae g__Bifidobacterium s__
                                                                  • 2a99ec1157a90661db7ff643b82f1914 ->
                                                                  • k__Bacteria p__Bacteroidetes c__Bacteroidia o__Bacteroidales
                                                                  • f__Bacteroidaceae g__Bacteroides s__fragilis
                                                                  • c162a4f3943238810eba8a25f0563cca ->
                                                                  • k__Bacteria p__Bacteroidetes c__Bacteroidia o__Bacteroidales
                                                                  • f__Bacteroidaceae g__Bacteroides s__ovatus
                                                                  • c4f9ef34bd2919511069f409c25de6f1 ->
                                                                  • k__Bacteria p__Bacteroidetes c__Bacteroidia o__Bacteroidales
                                                                  • f__Bacteroidaceae g__Bacteroides s__
                                                                  • 1ad289cd8f44e109fd95de0382c5b252 ->
                                                                  • k__Bacteria p__Firmicutes c__Clostridia o__Clostridiales
                                                                  • f__Lachnospiraceae g__Clostridium s__hathewayi
                                                                  • C18afe570abfe82d2f746ecc6e291bab ->
                                                                  • k__Bacteria p__Proteobacteria c__Gammaproteobacteria
                                                                  • o__Enterobacteriales f__Enterobacteriaceae g__Klebsiella s__
                                                                  • bca0b81a0b8d59e90c25a323c2f62f31 ->
                                                                  • k__Bacteria p__Firmicutes c__Clostridia
                                                                  • o__Clostridiales f__Clostridiaceae g__Clostridium s__perfringens

                                                                  Meta-analysis through the Qiita database using redbiom

                                                                  After identifying differentially abundant features using ANCOM or songbird, users can search through available samples in Qiita (Gonzalez et al., 2018 ) using redbiom (McDonald et al., 2019 ) to see the characteristics of samples. This type of analysis can be used to examine what environments a particular feature was previously observed in. In addition, the FeatureTable[Frequency] data for the samples that contain a feature of interest can be extracted for further analysis. A detailed tutorial can be found on the QIIME 2 Forum (https://forum.qiime2.org/t/querying-for-public-microbiome-data-in-qiita-using-redbiom/4653). Here, we will search an individual differentially abundant feature to see whether that feature appears enriched in different infants by birth mode. Note that the exact numbers and results shown below may change over time as more samples get indexed by redbiom.

                                                                  • ContextName SamplesWithData FeaturesWithData Description
                                                                  • Pick_closed-reference_OTUs-Greengenes-Illumina-16S-V4-125nt-65468f 16622 40899 Pick closed-reference OTUs (reference-seq: |databases|gg|13_8|rep_set|97_otus.fasta) | Trimming (length: 125)
                                                                  • Deblur-Illumina-16S-V4-150nt-780653 127413 7299964 Deblur (Reference phylogeny for SEPP: Greengenes_13.8, BIOM: reference-hit.biom) | Trimming (length: 150)
                                                                  • Pick_closed-reference_OTUs-Greengenes-LS454-16S-V4-41ebc6 7326 27248 Pick closed-reference OTUs (reference-seq: |databases|gg|13_8|rep_set|97_otus.fasta) | Split libraries
                                                                  • Pick_closed-reference_OTUs-Greengenes-LS454-16S-V4-100nt-a243a1 7434 29507 Pick closed-reference OTUs (reference-seq: |databases|gg|13_8|rep_set|97_otus.fasta) | Trimming (length: 100)
                                                                  • Deblur-Illumina-16S-V4-125nt-3aae8b 15064 378537 Deblur (Reference phylogeny for SEPP: Greengenes_13.8, BIOM: reference-hit.biom) | Trimming (length: 125)

                                                                  For the analysis here, we are going to use the Deblur-Illumina-16S-V4-150nt-780653 context this context is composed of samples which sequenced the 16S V4 region, are all 150 nucleotides in length, and were processed with Deblur. The context contains 127,413 samples spanning over 7.2 million unique features, representing hundreds of publicly available studies in Qiita.

                                                                  • redbiom search features --context Deblur-Illumina-16S-V4-150nt-
                                                                  • 780653
                                                                  • TACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCTGCGCCGGGTACGGGCGGGCTGGAGTGCGGTAGGGGAGACTGGAATTCCCGG > observed_samples.txt

                                                                  If we examine the observed_samples.txt file, we will see that over 17,000 samples contain this particular feature. These samples are part of 137 different studies in Qiita. We can now begin to explore what is known about the samples. A major challenge for meta-analysis, though, is having common metadata categories across studies.

                                                                  What we can see from this output is that (as expected) the feature is primarily observed in samples associated with the animal distal gut:

                                                                  Animal distal gut 7124
                                                                  Animal surface 331
                                                                  Surface (non-saline) 204
                                                                  Sterile water blank 102
                                                                  Animal secretion 91
                                                                  animal distal gut 68
                                                                  Animal corpus 58
                                                                  Water (non-saline) 15
                                                                  Plant corpus 13
                                                                  Animal proximal gut 12
                                                                  Aerosol (non-saline) 9
                                                                  Single strain 6
                                                                  Water (saline) 6
                                                                  Soil (non-saline) 6
                                                                  not provided 2
                                                                  Sediment (saline) 2
                                                                  Surface (saline) 1
                                                                  Total samples 8050
                                                                  • redbiom select samples-from-metadata
                                                                    • --context Deblur-Illumina-16S-V4-150nt-780653
                                                                    • --from observed_samples.txt "where (host_age < 3 or age < 3) and qiita_study_id != 10249" > infant_samples.txt

                                                                    We can see that birth_mode is represented by thousands of samples:

                                                                    From this summary, it appears our feature of interest is present in many more samples associated with a vaginal birth than cesarean section:

                                                                    This suggests the variable is not extremely unbalanced between C-section and vaginal births, and that actually more of the samples are associated with C-sections.

                                                                    We see that nine different Qiita studies are represented by the infant samples:

                                                                    10581 54
                                                                    10918 30
                                                                    11076 19
                                                                    1064 15
                                                                    11358 10
                                                                    11947 10
                                                                    2010 4
                                                                    10512 3
                                                                    11284 1
                                                                    Total samples 146

                                                                    Further exploration of these samples can be performed, such as extracting the samples and integrating them directly in a meta-analysis (see redbiom fetch to obtain feature tables and sample metadata).

                                                                    : FURTHER MICROBIOME ANALYSES

                                                                    The following sections are offered as stand-alone additional support for further microbiome analyses and do not rely on the ECAM dataset used in previous sections.

                                                                    Exporting QIIME 2 data

                                                                    • qiime tools export
                                                                      • --input-path insertion-tree.qza
                                                                      • --output-path extracted-insertion-tree

                                                                      Analysis of shotgun metagenomic data

                                                                      Whole-metagenome shotgun (WMS) sequencing explores the totality of genomes in the microbial community. Compared to amplicon-based analyses, it provides higher taxonomic resolution (typically beyond the genus level), direct observation of functional genes, and further information on the genome organization. Although assembly into draft genomes usually demands high sequencing depth, which is expensive, investigation of the microbial community can be as affordable as amplicon sequencing, hence enabling survey of larger quantity of samples. It has been demonstrated that “shallow” shotgun sequencing (0.5 million sequences per sample) delivers nearly equal insights into the community's taxonomic composition compared to sequencing with 100 times as much depth (Hillmann et al., 2018 )—although functional profiles are not nearly as accurate in shallow shotgun sequencing. Therefore, experimental design and budget arrangements should be made based on the goals of the study.

                                                                      Two plugins that are dedicated to shotgun metagenomics are currently available for QIIME 2: q2-shogun (Hillmann et al., 2018 ) and q2-metaphlan2 (Truong et al., 2015 ). They need to be installed separately. In the example below, we demonstrate the use of q2-shogun , a wrapper for the SHOGUN pipeline (Hillmann et al., 2018 ).

                                                                      • conda install -c bioconda bowtie2
                                                                      • conda install cytoolz
                                                                      • pip install https://github.com/knights-lab/SHOGUN/archive/master.zip
                                                                      • pip install https://github.com/qiime2/q2-shogun/archive/master.zip
                                                                      • qiime dev refresh-cache
                                                                      • for i in query refseqs taxonomy bt2-database do wget https://github.com/qiime2/q2-shogun/raw/master/q2_shogun/tests/data/$i.qza done
                                                                      • qiime shogun nobunaga
                                                                        • --i-query query.qza
                                                                        • --i-reference-reads refseqs.qza
                                                                        • --i-reference-taxonomy taxonomy.qza
                                                                        • --i-database bt2-database.qza
                                                                        • --o-taxa-table taxatable.qza

                                                                        In this example, SHOGUN is called to align query sequences query.qza against a reference sequence database refseqs.qza using the popular short-sequence aligner Bowtie2 (Langmead & Salzberg, 2012 ). The query sequences may be demultiplexed or multiplexed data. In the latter case, SHOGUN will automatically stratify alignment results by sample ID. The taxonomy artifact taxonomy.qza defines the mapping of reference sequences to taxonomic lineages. In addition to taxonomy, this artifact could be any hierarchical (semicolon-delimited) or simple mappings, for example, functional annotations. A Bowtie2 index containing the reference sequence database is necessary for this operation.

                                                                        The output file, taxatable.qza , is a feature table in which columns are sample IDs and rows are taxonomic lineages. Starting from this table, we may perform various subsequent analyses in a similar manner to amplicon sequencing data, as detailed above, such as taxonomy plots, alpha and beta diversity analyses, and differential abundance testing.

                                                                        If the user wants to prepare a custom reference sequence database from a multi-FASTA file (e.g., refseqs.fa ), it can be done as follows:

                                                                        NOTE: The below sections are presented for demonstration purposes only and are not to be executed unless the file refseqs.fa is first imported by the user.

                                                                        • qiime tools import
                                                                          • --input-path refseqs.fa
                                                                          • --type FeatureDate[Sequence]
                                                                          • --output-path refseqs.qza

                                                                          5. Build a Bowtie2 index based on the sequences:

                                                                          QIIME 2 is flexible in the types of metagenomic analyses it supports. In addition to calling SHOGUN or MetaPhlAn2 from the QIIME 2 interface, one may perform taxonomic or functional profiling of shotgun metagenomic data separately using any external tool, then import the resulting profile into QIIME 2. BIOM formatted files are supported as input. Questions about other supported formats should be directed to the QIIME 2 Forum, as this will expand over time.

                                                                          Source tracking

                                                                          Source tracking of microbial communities attempts to estimate the relative contributions of a set of host, environmental, and contamination sources to a novel community. QIIME 2 currently offers two methods for microbial source tracking through the external plugins q2-FEAST (https://github.com/cozygene/FEAST) (Shenhav et al., 2019 ) and q2-SourceTracker2 (https://github.com/biota/sourcetracker2) (Knights et al., 2011 ). FEAST (Fast Expectation-mAximization microbial Source Tracking) and SourceTracker2 vary in their statistical approach and assumptions for the estimation of source contributions. Therefore, we view method selection here as a personal choice that research teams should make if they do not have a prior hypothesis that one tool addresses directly.

                                                                          Compositional data analysis

                                                                          Feature tables contain magnitudes determined by random sequencing depths that vary dramatically between samples irrespective of the initial microbial load, making the data compositional in nature (Fernandes et al., 2014 ). Compositional data contain relative information where the abundance of one feature can only be interpreted relative to another.

                                                                          Numerous normalization methods have been proposed to restore absolute abundances such as rarefaction (Weiss et al., 2017 ), as well as median (Love, Huber, & Anders, 2014 ), quantile (Paulson, Stine, Bravo, & Pop, 2013 ), and constant sum normalization. However, due to erroneous assumptions, these methods cannot control false-positive rates (Hawinkel, Mattiello, Bijnens, & Thas, 2019 Morton et al., 2017 ) and contribute to irreproducibility (Fernandes et al., 2014 Gloor, Macklaim, Pawlowsky-Glahn, & Egozcue, 2017 Gloor, Wu, Pawlowsky-Glahn, & Egozcue, 2016 ).

                                                                          Transformation-independent and -dependent methods developed in the field of compositional data analysis (CoDA) offer an assumption-free solution (Quinn et al., 2019 ). Transformation-dependent methods such as the centered- (clr Aitchison, 1982 ), isometric- (ilr Egozcue, Pawlowsky-Glahn, Mateu-Figueras, & Barceló-Vidal, 2003 ), and additive- (alr Aitchison, 1982 ) log ratio transform the data with regard to a reference. Transformation-independent methods operate on a single feature or ratios of features (Greenacre, 2019 ).

                                                                          CoDA methods rely on logarithms to enforce symmetry in the weighting of relative increases or decreases between features (Aitchison, 1982 ). The logarithm of zero is undefined, and therefore the non-trivial task of zero handling is often the first step in CoDA analysis (Silverman, Roche, Mukherjee, & David, 2018 ). There are many proposed methods (Martín-Fernández, Barceló-Vidal, & Pawlowsky-Glahn, 2003 ), but QIIME 2 provides two steps to ameliorate the zero problem. First, features that have only a few entries across many samples can be filtered out (https://docs.qiime2.org/2019.10/plugins/available/feature-table/filter-features/). Second, a small pseudocount value (often of one) can be added uniformly to the data prior to applying a transform (https://docs.qiime2.org/2019.10/plugins/available/composition/add-pseudocount/).

                                                                          After zero handling, multiple CoDA transforms are available in QIIME 2, including clr and ilr on both a hierarchical and phylogenetic basis via gneiss (https://docs.qiime2.org/2019.10/plugins/available/gneiss/) (Morton et al., 2017 ). Downstream analysis of transformed data is often focused on finding differential features between sample groups. In QIIME 2, both Songbird (https://github.com/biocore/songbird) (Morton et al., 2019 ) and ALDEx2 (https://github.com/ggloor/q2-aldex2) (Fernandes et al., 2014 ) provide supervised differential abundance ranking. QIIME 2 also provides compositional unsupervised dimensionality reduction methods in two forms of Aitchison distance that use different zero-handling methods (https://docs.qiime2.org/2019.10/plugins/available/diversity/beta/ https://library.qiime2.org/plugins/deicode) (Martino et al., 2019 Pawlowsky-Glahn, Egozcue, & Tolosana-Delgado, 2015 ). Using both supervised and unsupervised CoDA methods, the differential features can be obtained with regard to sample groupings (e.g., armpit versus foot).

                                                                          After identifying differential features, QIIME 2 also provides methods for transform-independent analysis using Qurro (https://library.qiime2.org/plugins/qurro https://doi.org/10.1101/2019.12.17.880047). By taking the log-ratio between two or the sum of multiple differential features, the sample groupings can be directly visualized.

                                                                          Supervised classification and regression methods for predicting sample metadata

                                                                          Supervised learning (SL) methods predict sample data (e.g., metadata values) as a function of other sample data (e.g., microbiota composition) by training an SL model on training data. Various SL methods can predict either categorical data (a classification problem) or continuous values (a regression problem). SL methods have become increasingly common in microbiome studies to predict sample characteristics (e.g., disease state or location data) or to identify features that are associated with particular characteristics or sample classes (Bokulich, Collins, et al., 2016 Knights et al., 2011 Pasolli, Truong, Malik, Waldron, & Segata, 2016 ). The ability of many SL methods to perform feature selection—the identification (and ranking) of features associated with particular sample classes or values—is a particularly useful feature of these methods for application in microbiome experiments. The QIIME 2 plugin q2-sample-classifier (Bokulich, Dillon, Bolyen, et al., 2018 https://library.qiime2.org/plugins/q2-sample-classifier/) contains methods for performing supervised classification/regression and feature selection using microbiome data and metadata.

                                                                          Metadata preparation

                                                                          Metadata are a critical component of a successful study and, unlike other elements such as sequencing quality or completeness of the reference database, metadata are largely under the control of the investigator. Unfortunately, they are often treated as an afterthought, leading to uninterpretable results due to missing information. To ensure a successful data analysis, begin metadata generation at the time of sample collection. Be sure to record all sample attributes that are relevant to your hypotheses, as these attributes are the basis of QIIME 2's visualizations and statistical tests.

                                                                          Spreadsheets are the most commonly used vehicle for metadata storage and management due to their ubiquity and convenience, but they have well-known drawbacks. For example, by default Microsoft Excel performs irreversible modification of certain kinds of inputs into dates or floating-point numbers (Zeeberg et al., 2004 ), and auto-completes values based on earlier entries (https://support.office.com/en-ie/article/turn-automatic-completion-of-cell-entries-on-or-off-0f4aa749-b927-4ea7-adaa-86f8d4f9fe20). As these modifications are performed silently, without warning to the user, they frequently lead to mangled metadata. Although other spreadsheet programs (such as Google Sheets and LibreOffice) have slightly different defaults, all have “convenience” features that can cause data corruption, so it is critical to learn the default features of your preferred spreadsheet program, follow spreadsheet best-practices (Broman & Woo, 2018 ), and actively monitor the validity of your records. Alternatively, generate your metadata file in a dedicated software tool such as ISAcreator (Rocca-Serra et al., 2010 ), which provides a structured interface designed to prevent common errors.

                                                                          Consistency is the key to high-quality metadata. Much effort has already been put into identifying and standardizing the crucial pieces of metadata for various sorts of studies, so investigate these guidelines before beginning your metadata collection. The Genomic Standards Consortium (GSC) has created the “Minimum Information about any (x) Sequence” (MIxS) and “Minimum Information about a MARKer gene Sequence” (MIMARKS) specifications (Yilmaz et al., 2011 ) as well as 15 “environmental packages” that extend and refine these standards for samples from environments ranging from air to human skin to waste water. To ease compliance with these standards, the GSC provides checklists outlining the expected inputs, syntax, preferred units, and more for the fields in each standard and package (https://press3.mcs.anl.gov/gensc/mixs/). Many of these fields take values specified by subsets of controlled vocabularies such as the Experimental Factor Ontology (Malone et al., 2010 ) and the Environment Ontology (Buttigieg et al., 2016 ). Consider employing a tool such as the stand-alone ISAconfigurator (Rocca-Serra et al., 2010 ) or the Excel-based QIIMP (The Quick and Intuitive Interactive Metadata Portal https://qiita.ucsd.edu/iframe/?iframe=qiimp) to identify all the fields necessary for your study type and to enforce the validity of their content.

                                                                          While creating and maintaining consistent and compliant metadata is not trivial, it is well worth the effort. Not only are standards-compliant metadata required for submission to a growing number of public databases and journals [e.g., the European Nucleotide Archive (https://www.ebi.ac.uk/training/online/course/ebi-metagenomics-portal-submitting-metagenomics-da/what-are-metadata-and-why-are-they-so-impo), Qiita (https://qiita.ucsd.edu/static/doc/html/tutorials/prepare-information-files.html#required-fields-for-centralized-qiita), and Microbiome Journal (https://microbiomejournal.biomedcentral.com/submission-guidelines/preparing-your-manuscript/microbiome-announcement)], but they are also critical to enable future meta-analyses—both between your own data and others’—and between your own data today and your new data tomorrow! It is much easier to record required information up front than it is to retroactively track this information down when you are working toward a tight paper submission deadline.


                                                                          The kind of data used in this workshop

                                                                          This workshop will not start with the raw reads, since the first steps in a metabarcoding workflow are typically done using command line tools such as QIIME or mothur (dada2 is an exception) in the cloud. Data that can be analysed using techniques presented here is typically the result of the following steps (Comeau, Douglas, and Langille 2017) :

                                                                          1. Sample environments/soil/tissue/water and extract DNA.
                                                                          2. Perform PCR using standard primers.
                                                                          3. Sequence using a high-throughput sequencing platform such as the Illumina MiSeq.
                                                                          4. Call OTUs/ESVs and assign a taxonomic classification by comparing them to a reference database, such as Greengenes.
                                                                          5. Construct an abundance matrix of read counts for each OTU in each sample.

                                                                          Here we focus on the statistical analysis and visualizations following OTU calling that include:

                                                                          • Reading files into R
                                                                          • Manipulating tabular and taxonomic data
                                                                          • Heat trees (Foster, Sharpton, and Grünwald 2017) , stacked bar charts and related visualizations
                                                                          • Alpha and beta diversity
                                                                          • Ordination methods

                                                                          DNA Sequence Quality - Phred - provides base calling, chromatogram display and high quality sequence region evaluation and presentation for up to five sequences simultaneously.

                                                                          Sequence assembly - you don't need your own contig assembly program when you can use:

                                                                          EGassember - aligns and merges sequence fragments resulting from shotgun sequencing or gene transcripts (EST) fragments in order to reconstruct the original segment or gene ( Reference: A. Masoudi-Nejad et al. 2006. Nucl. Acids Res. 34: W459-462).

                                                                          CGE Assembler 1.2 - assembles Illumina, 454, SOLid and Ion Torrent data ( Reference: Larsen MV, et al. J. Clin. Micobiol. 2012. 50(4): 1355-1361).
                                                                          CGE SPAdes 3.9 - assembles Illumina and Ion Torrent data ( Reference: S. Nurk et al. Research in Computational Molecular Biology: pp 158-170).

                                                                          CAP3 (PBIL, France ), ( Reference: Huang,X. & Madan A. 1999. Genome Res. 9: 868-877), and here.
                                                                          CAP EST Assembler (Istituto FIRC di Oncologia Molecolare, Italy) - Maximum sequence length for each sequence is 30 kb - Maximum number of sequences 10 kb

                                                                          MicroScope web site (hosted at Genoscope), provides an environment for expert annotation and comparative genomics. Genome project: Annotation and comparative analyses of finished or draft genome sequences. For pre-annotated sequences, they only integrate annotations from NCBI RefSeq complete genome section. Metagenome project: Annotation and comparative analyses of assembled metagenomic sequences. Currently, they are able to integrate datasets below 20 Mb of contigs per bin.

                                                                          NanoPipe - was developed in consideration of the specifics of the MinION sequencing technologies, providing accordingly adjusted alignment parameters. The range of the target species/sequences for the alignment is not limited, and the descriptive usage page of NanoPipe helps a user to succeed with NanoPipe analysis. The results contain alignment statistics, consensus sequence, polymorphisms data, and visualization of the alignment. ( Reference: Shabardina V et al. (2019) Gigascience 8(2). pii: giy169).


                                                                          COV2HTML: a visualization and analysis tool of bacterial next generation sequencing (NGS) data for postgenomics life scientists - allows performing both coverage visualization and analysis of NGS alignments performed on prokaryotic organisms (bacteria and phages). It combines two processes: a tool that converts the huge NGS mapping or coverage files into light specific coverage files containing information on genetic elements and a visualization interface allowing a real-time analysis of data with optional integration of statistical results. ( Reference: Monot M. et al. 2014. OMICS 18(3): 184-95).

                                                                          DCA Divide-and-Conquer Multiple Sequence Alignment ( Universitat Bielefeld, Germany) - is a program for producing fast, high quality simultaneous multiple sequence alignments of amino acid, RNA, or DNA sequences. ( Reference: Brinkmann, G. et al. Mathematical Programming 79: 71-97, 1997).

                                                                          PhageTerm - is a fast and user-friendly software package which can be used to determine bacteriophage termini and packaging mode from randomly fragmented NGS data. It is part of the Galaxy package, and can be found in the "NGS: Mapping" directory. Ideal is you want an automated answer. ( Reference: Garneau JR, et al. 2017. Sci Rep. 7(1):8292).

                                                                          QUAST - a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with a reference genome, as well as without a reference. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications. ( Reference: A. Gurevich et al. 2013. Bioinformatics, 29(8): 1072&ndash1075). N.B. This server is as of April 2020, but there are hopes that it will be back online (see here for software downloads).

                                                                          Sequencing errors: - if your DNA sequence doesn't match the expected protein sequence you can check for errors at GeneWise (EMBL-EBI) which compares a protein sequence to a genomic DNA sequence, allowing for introns and frameshifting errors. Other programs include:

                                                                          FrameD ( Reference: T. Schliex et al. 2003. Nucl. Acids Res. 31: 3738-3741)
                                                                          AMIGene - annotation of microbial genes ( Reference: Bocs S et al. (2003) Nucleic Acids Res. 13(31): 3723-3726).
                                                                          path :: protein back-translation and alignment - addresses the problem of finding distant protein homologies where the divergence is the result of frameshift mutations and substitutions. Given two input protein sequences, the method implicitly aligns all the possible pairs of DNA sequences that encode them, by manipulating memory-efficient graph representations of the complete set of putative DNA sequences for each protein. ( Reference: Gîrdea M et al. 2010. Algorithms for Molecular Biology 5:)

                                                                          In-silico.com (Dr. Joseba Bikandi & co-workers, Faculty of Pharmacy, in the University of the Basque Country) - allows in silico experiments including theoretical PCR amplification, AFLP-PCR , restriction analysis and pulsed field gel electrophoresis [PFGE] with bacterial & archael genomes found in the public database.

                                                                          NCBI Prokaryotic Genomes Automatic Annotation Pipeline. This will completely annotate your bacterial genome and provide you with a Sequin submission file. N.B. an NCBI Phage Automatic Annotation Pipeline is in developement.

                                                                          RAST (Rapid Annotation using Subsystem Technology) is a fully-automated service for annotating bacterial and archaeal genomes. It provides high quality genome annotations for these genomes across the whole phylogenetic tree. Requires registration. ( Reference: Aziz, RK et al. 2008. BMC Genomics 9:75.).

                                                                          BASys Bacterial Annotation Tool - this incredible tool supports automated, in-depth annotation of bacterial genomic sequences. It accepts raw DNA sequence data and an optional list of gene identification information (Glimmer) and provides extensive textual annotation and hyperlinked image output. BASys uses >30 programs to determine 60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways. ( Reference: G.H. Van Domselaar et al. 2005. Nucl. Acids Res. 33(Web Server issue): W455-W459).

                                                                          MicroScope - (CEA, Institut de Génomique - Genoscope, France) is a microbial genome annotation & analysis platform which provides access to a wide range of tools including COG analysis, comparative genomics . ( Reference: Vallenet D et al. (2017) Nucleic Acids Res. 45(D1): D517-D528). Requires registration.

                                                                          MAKER Web Annotation Service (MWAS) is an easily configurable web-accesible genome annotation pipeline. It's purpose is to allow research groups with small to intermediate amounts of eukaryotic and prokaryotic genome sequence (i.e. BAC clones, small whole genomes, preliminary sequencing data, etc.) to independently annotate and analyse their data and produce output that can be loaded into a genome database. ( Reference: Holt, C. & Yandell, M. 2011. BMC Bioinformatics 12:491).

                                                                          MITOS - a pipeline is designed to provide consistent and high quality de novo annotation of Metazoan mitochondrial genomes sequences. We show that the results of MITOS match RefSeq and MitoZoa in terms of annotation coverage and quality. At the same time we avoid biases, inconsistencies of nomenclature, and typos originating from manual curation strategies. ( Reference: M. Bernt et al. 2013. Molecular Phylogenetics & Evolution 69:313-319).

                                                                          GenSAS - Genome Sequence Annotation Server - provides a one-stop website with a single graphical interface for running multiple structural and functional annotation tools, enabling visualization and manual curation of genome sequences. Users can upload sequences into their account and run gene prediction programs, protein homology searches, map ESTs, identify repeats, ORFs and SSRs with custom parameter settings. Each analysis is displayed on separate tracks of the graphical interface with custom editabe tracks to select final annotation of features and create gff3 files for upload to genome browsers such as GBrowse. Additional programs can be easily added using this Drupal based software.

                                                                          Viral Genome ORF Reader (VIGOR) - supports high throughput feature prediction and annotation. VIGOR employs an extrinsic strategy and boasts sensitivity and specificity greater than 98% for the RNA viral genomes we tested. Genome-specific features identified by VIGOR include frameshifts, ribosomal slippage, RNA editing, stop codon read-through, overlapping genes, embedded genes, and mature peptide cleavage sites. Genotyping capability for influenza and rotavirus is built into the program.
                                                                          ( Reference: S. Wang et al. 2011. BMC Bioinformatics 2010, 11:451)

                                                                          FLAN (FLu ANnotation) is an NCBI web server for genome annotation of influenza virus is a tool for user-provided influenza A virus or influenza B virus sequences. It can validate and predict protein sequences encoded by an input flu sequence. ( Reference: Y. Bao et al. 2007. Nucleic Acids Res. Web Server issue) 35: W280-W284.)

                                                                          CpGAVAS ( Chloroplast Genome Annotation, Visualization, Analysis and GenBank Submission Tool) - allows accurate chloroplast genome annotation, the generation of circular maps, the provision of useful analysis results of the annotated genome, the creation of files that can be submitted to GenBank directly. ( Reference: C. Liu et al. 2012. BMC Genomics 13: 715)

                                                                          Genome Annotation Transfer Utility (GATU) annotates a genome based on a very closely related reference genome. The proteins/mature peptides of the reference genome are BLASTed against the genome to be annotated in order to find the genes/mature peptides in the genome to be annotated ( Reference: T. Tcherepano v et al. 2006. BMC Genomics 7:150.)

                                                                          BioGPS (The Scripps Research Institute, USA) - is a one-stop gene annotation portal that emphasizes user-customizability and community-extensibility It is a customizable gene annotation portal and a complete resource for learning about gene and protein function.

                                                                          BAGEL (Groningen Biomolecular Sciences and Biotechnology Institute, Haren, the Netherlands) - will determine from an existing or non submitted GenBank file the presence of bacteriocins based on a database containing information of known bacteriocins and adjacent genes involved in bacteriocin activity. An alternative site for bacteriocins is BACTIBASE which is a data repository of bacteriocin natural antimicrobial peptides. See . LABioicin if you are interested in the topic of Lactic Acid Bacteria (LAB) and its bacteriocins.

                                                                          MICheck (MIcrobial genome Checker) - enables rapid verification of sets of annotated genes and frameshifts in previously published bacterial genomes, or genomes for which the user has a *.gbk file. This tool can be seen as a preliminary step before the functional re-annotation step to check quickly for missing or wrongly annotated genes. It worked nicely with phage genomes from 43-135kb. ( Reference: S. Cruveiller et al. 2005. Nucl. Acids Res. 33: W471- W479).

                                                                          WebGeSTer - Genome Scanner for Terminators - my favourite terminator search program is finally web enabled. Please note that if you want to analyze data from a *.gbk file you need to use their conversion program "GenBank2GeSTer" first. A complete description of each terminator including a diagram is produced by this program. This site linked to an extensive database of transcriptional terminators in bacterial genome (WebGeSTer DB) ( Reference: Mitra A. et al. 2011. Nucl. Acids Res. 39(Database issue):D129-35).

                                                                          RibEx: Riboswitch Explorer - scans <40kb DNA for potential genes (which are linked to BLASTP) and several hundred regulatory elements, including riboswitches. If you click on the "search for attenuators" it finds terminators and antiterminators. It presents the capculated genes and perits BLAST analysis at NCBI ( Reference: C. Abreu-Goodger & E. Merino. 2005. Nucl. Acids Res. 33: W690-W692).

                                                                          tRNAs: tRNAscan-SE - is incredibly sensitive & also provides secondary structure diagrams of the tRNA molecules (Reference: Schattner, P. et al. 2005. Nucleic Acids Res. 33: W686-689). Alternatively use ARAGORN ( Reference: Laslett, D. & Canback. 2004. Nucleic Acids Research 32:11-16).
                                                                          Test sequences.

                                                                          LTR_Finder - is an efficient program for finding full-length LTR retrotranspsons in genome sequences. The size of input file is now limited to 50MB ( Reference: Z. Xu & H. Wang. 2007. Nucl. Acids Res.35(Web Server issue): W265-W268).
                                                                          RTAnalyzer - finds retrotransposons and detects L1 retrotransposition signatures ( Reference: J-F. Lucier et al. 2007. Nucl. Acids Res. 35(Web Server issue):W269-W274

                                                                          MG-RAST (Metagenome Rapid Annotation using Subsystem Technology) is a fully-automated service for annotating metagenome samples. It provides annotation of sequence fragments, their phylogenetic classification and an initial metabolic reconstruction. The service also provides means for comparing phylogenetic classifications and metabolic reconstructions of metagenomes ( Reference: F. Meyer et al. 2008. BMC Bioinformatics 9: 386).

                                                                          The following four programs can be used to prediction phage proteins:

                                                                          PVPred ( Reference: Ding H et al (2014) Mol Biosyst 10(8): 2229-2235).
                                                                          PHPred ( Reference: Ding H (2016) Computers Biol Med 71: 156&ndash161).
                                                                          PVP-SVM ( Reference: Manavalan B et al. (2018) Front Microbiol 9: 476).
                                                                          PVPred-SCM ( Reference: Charoenkwan P et al. (2020) Cells 9(2) pii: E353.

                                                                          Chromosome replication origin:

                                                                          Ori-Finder and Ori-Finder 2 - are useful platforms for the identification and analysis of replication origins (oriCs) in the bacterial and archaeal genomes, respectively. ( Reference: Luo H et al. (2019) Brief Bioinform 20(4): 1114-1124). Please note that these tools have been used to create DoriC - a database of replication origins in prokaryotic genomes including chromosomes and plasmids. (Reference: Luo H & Gao F (2019) Nucleic Acids Res. 47(D1): D74-D77).

                                                                          One of the problems with GenBank is that scientists do not update their submission data nor correct errors. In part this is due to laziness but is also due to the fact that GenBank is, in most cases, unwilling to accept a new version of the Sequin file. Tbl2asn is a command-line program that automates the creation of sequence records for submission to GenBank but, from my perspective, it is not easy to use. The only online program is GenBank 2 Sequin which generates not only a Sequin file (*.sqn), but also a five-column "Annotation Table" (*.tbl). This together with the fasta-formatted DNA sequence can be submitted to GenBank by Email ( [email protected] ). In its absence I recommend the perl script gbf2tbl.pl available for downloading here.


                                                                          PlasmidFinder 1.3 - identifies plasmids in total or partial sequenced isolates of bacteria. The method uses BLAST for identification of replicons of plasmids belonging to the major incompatibility (Inc) groups of Enterobacteriaceae. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms. See also pMLST ( Reference: Carattoli A et al. 2014. Antimicrob. Agents Chemother. 58: 3895-903)

                                                                          PHACTS can be used to quickly classify the lifestyle of a phage (temperate or lytic). All that is needed is the proteome of the phage to be classified and PHACTS will predict the lifestyle of that phage and return a confidence value for that prediction. ( Reference: K. McNair et al. 2012. Bioinformatics 28: 614-618).

                                                                          SpeciesFinder 1.0 (Danish Technical University) - predicts the species of bacteria from pre-assembled, complete or partial genomes, and short sequence reads. The prediction is based on the 16S rRNA gene.

                                                                          CSI Phylogeny 1.1 (Call SNPs & Infer Phylogeny) - calls SNPs, filters the SNPs, does site validation and infers a phylogeny based on the concatenated alignment of the high quality* SNPs. ( Reference: Kaas, R.S. et al. PLoS ONE 2014 9: e104984.)

                                                                          KmerFinder 2.0 &ndash predicts the species of bacteria from pre-assembled, complete or partial genomes, and short sequence reads. The prediction is based on the number of co-occurring k-mers (substrings of k nucleotides in DNA sequence data, in this case 16-mers) between the genomes of reference bacteria in a database and the genome provided by the user. ( Reference: Hasman H et al. 2013. J Clin Microbiol. 52:139-146)

                                                                          VIOLIN: Vaccine Investigation and Online Information Network - allows easy curation, comparison and analysis of vaccine-related research data across various human pathogens VIOLIN is expected to become a centralized source of vaccine information and to provide investigators in basic and clinical sciences with curated data and bioinformatics tools for vaccine research and development. VBLAST: Customized BLAST Search for Vaccine Research allows various search strategies against against 77 genomes of 34 pathogens. ( Reference: He, Y. et al. 2014. Nucleic Acids Res. 42 (Database issue):D1124-32).

                                                                          MLST 1.8 (MultiLocus Sequence Typing) - currently only works with assembled genomes and contigs ( Reference: Larsen MV et al. 2012. J. Clin. Micobiol. 50: 1355-1361).

                                                                          ECFfinder - extracytoplasmic function (ECF) sigma factors - the largest group of alternative sigma factors - represent the third fundamental mechanism of bacterial signal transduction, with about six such regulators on average per bacterial genome. Together with their cognate anti-sigma factors, they represent a highly modular design that primarily facilitates transmembrane signal transduction. ( Reference: Staron A et al. (2009) Mol Microbiol 74(3): 557-581).

                                                                          BacWGSTdb - is designed for monitoring the emergence and outbreak of important bacterial pathogens. In detail, it serves two particular purposes: Typing & Tracking. The former refers to an integrated genotyping at both the traditional multi-locus sequence typing (MLST) and whole-genome sequencing typing (WGST) level. The latter refers to source tracking (i.e., finding highly similar isolates) according to the typing result and isolates information stored in BacWGSTdb. ( Reference: Z. Ruan 7 Y. Feng, Nucleic Acids Research. 2016 44(D1): D682-D687).

                                                                          SISTR: Salmonella In Silico Typing Resource - (Public Health Agency of Canada, Laboratory for Foodborne Zoonoses) is a bioinformatics resource for rapidly interpreting in silico data for multiple Salmonella subtyping methods from draft bacterial genome assemblies. In addition to performing serovar prediction by genoserotyping, this resource integrates sequence-based typing analyses for: Multi-Locus Sequence Typing (MLST), ribosomal MLST (rMLST), and core genome MLST (cgMLST). Google Chrome is recommended Firefox is also supported but the SVG visualizations within this app may not be as responsive. Internet Explorer is unsupported.

                                                                          FSFinder2 (Frameshift Signal Finder) - Programmed ribosomal frameshifting is involved in the expression of certain genes from a wide range of organisms such as virus, bacteria and eukaryotes including human. In programmed frameshifting, the ribosome switches to an alternative frame at a specific site in response to a special signal in a messanger RNA. Programmed frameshift plays role in viral particle morphogenesis, autogenous control, and alternative enzymatic activities. The common frameshift is a -1 frameshift, in which the ribosome shifts a single nucleotide in the upstream direction. The major elements of -1 frameshifting consist of a slippery site, where the ribosome changes reading frames, and a stimulatory RNA structure such as pseudoknot or stem-loop located a few nucleotides downstream. +1 frameshifts are much less common than -1 frameshifting but are observed in diverse organisms.

                                                                          InBase, The Intein Database and Registry - Protein splicing is defined as the excision of an intervening protein sequence (the INTEIN) from a protein precursor and the concomitant ligation of the flanking protein fragments (the EXTEINS) to form a mature extein host protein and the free intein (Perler 1994). Protein splicing results in a native peptide bond between the ligated exteins. This is a database site which permits BLAST analysis. ( Reference: Perler, F.B. 2002. Nucleic Acids Res. 30: 383-384).

                                                                          P2RP (Predicted Prokaryotic Regulatory Proteins) - users can input amino acid or genomic DNA sequences, and predicted proteins therein are scanned for the possession of DNA-binding domains and/or two-component system domains. RPs identified in this manner are categorised into families, unambiguously annotated. ( Reference: Barakat M, et al. 2013. BMC Genomics 14:269).

                                                                          P2CS (Prokaryotic 2-Component Systems) is a comprehensive resource for the analysis of Prokaryotic Two-Component Systems (TCSs). TCSs are comprised of a receptor histidine kinase (HK) and a partner response regulator (RR) and control important prokaryotic behaviors. It can be searched using BLASTP. ( Reference: P. Ortet et al. 2015. Nucl. Acids Res. 43 (D1): D536-D541).

                                                                          COG analysis - Clusters of Orthologous Groups - COG protein database was generated by comparing predicted and known proteins in all completely sequenced microbial genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous across at least three lineages and likely corresponds to an ancient conserved domain (CloVR) . Sites which offer this analysis include:

                                                                          WebMGA ( Reference: S. Wu et al. 2011. BMC Genomics 12:444), RAST ( Reference: Aziz RK et al. 2008. BMC Genomics 9:75), and BASys (Bacterial Annotation System Reference: Van Domselaar GH et al. 2005. Nucleic Acids Res. 33(Web Server issue):W455-459.) and JGI IMG (Integrated Microbial Genomes Reference: Markowitz VM et al. 2014. Nucl. Acids Res. 42: D560-D567. )

                                                                          Other sites:

                                                                          EggNOG - A database of orthologous groups and functional annotation that derives Nonsupervised Orthologous Groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. ( Reference: Powell S et al. 2014. Nucleic Acids Res. 42 (D1): D231-D239

                                                                          OrthoMCL - is another algorithm for grouping proteins into ortholog groups based on their sequence similarity. The process usually takes between 6 and 72 hours.( Reference: Fischer S et al. 2011. Curr Protoc Bioinformatics Chapter 6:Unit 6.12.1-19).

                                                                          KAAS (KEGG Automatic Annotation Server) provides functional annotation of genes by BLAST or GHOST comparisons against the manually curated KEGG GENES database. The result contains KO (KEGG Orthology) assignments and automatically generated KEGG pathways. ( Reference: Moriya Y et al. 2007. Nucleic Acids Res. 35(Web Server issue):W182-185).

                                                                          ResFinder (Acquired antimicrobial resistance gene finder) - uses BLAST for identification of acquired antimicrobial resistance genes in whole-genome data. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms. Tested with 1411 different resistance genes with 100% identity. ( Reference: Zankari E et al. 2012. J Antimicrob Chemother. 67:2640-2644)

                                                                          ARG-ANNOT (Antibiotic Resistance Gene-ANNOTation) is a new tool that was created to detect existing and putative new antibiotic resistance (AR) genes in bacterial genomes. ARG-ANNOT uses a local blast program in Bio-Edit software that allows the user to analyze sequences without web interface ( Reference: Gupta, S.K. et al. 2014. Antimicrob Agents Chemother. 58: 212&ndash220).

                                                                          CARD (The Comprehensive Antibiotic Resistance Database) - a rigorously curated collection of known resistance determinants and associated antibiotics, organized by the Antibiotic Resistance Ontology (ARO) and AMR gene detection models ( Reference: Jia, B. et al. 2017. Nucleic Acids Research, 45: D566-573).

                                                                          MEGARes - is a hand-curated antimicrobial resistance database and annotation structure that provides a foundation for the development of high throughput acyclical classifiers and hierarchical statistical analysis of big data ( Reference: Lakin, S.N.. et al. 2017. Nucleic Acids Research, 45: D574-D580 ) .

                                                                          BacMet (Antibacterial Biocide & Metal Resistance Genes Database) - a database of biocide and metal resistance genes with highly reliable content. In BacMet version 1.1, the experimentally confirmed database contains 704 resistance genes, whereas the predicted database contains 40,556 resistance genes ( Reference: Pal, C. et al. 2014. Nucleic Acids Research, 42: D737-743 ) .

                                                                          Specialized annotation - CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats):

                                                                          CRISPRfinder - enables the easy detection of CRISPRs in locally-produced data and consultation of CRISPRs present in the database. It also gives information on the presence of CRISPR-associated (cas) genes when they have been annotated as such. . ( Reference: I. Grissa et al. 2007. Nucl. Acids Res. 35 (Web Server issue): W52-W57).

                                                                          CRISPRmap -provides a quick and detailed insight into repeat conservation and diversity of both bacterial and archaeal systems. It comprises the largest dataset of CRISPRs to date and enables comprehensive independent clustering analyses to determine conserved sequence families, potential structure motifs for endoribonucleases, and evolutionary relationships. ( Reference: S.J. Lange et al. 2013. Nucleic Acids Research, 41: 8034-8044).

                                                                          CRISPI : a CRISPR Interactive database - includes a complete repertory of associated CRISPR-associated genes (CAS). A user-friendly web interface with many graphical tools and functions allows users to extract results, find CRISPR in personal sequences or calculate sequence similarity with spacers.( Reference: Rousseau C et al. 2009. Bioinformatics. 25: 3317&ndash3318).

                                                                          CRISPRTarget - that predicts the most likely targets of CRISPR RNAs. This can be used to discover targets in newly sequenced genomic or metagenomic data. ( Reference: Biswas A et al. 2013. RNA Biol. 10:817-827).

                                                                          CRISPy-web - is an easy to use web tool based on CRISPy to design sgRNAs for any user-provided microbial genome. CRISPy-web allows researchers to interactively select a region of their genome of interest to scan for possible sgRNAs. After checks for potential off-target matches, the resulting sgRNA sequences are displayed graphically and can be exported to text files. ( Reference: K. Blin et al. 2016. Synthetic and Systems Biotechnology 1(2): 118-121).

                                                                          Specialized annotation - virulence determinants: This is of particular interest to those working on bacteriophages for therapy

                                                                          VirulenceFinder (Danish Technical University) &ndash identification of virulence genes. The method uses BLAST for identification of known virulence genes in Escherichia coli. The method is being extended to also include virulence genes for Enterococcus and Staphylococcus aureus. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms.

                                                                          ClanTox: a classifier of short animal toxins - predicts whether each sequence is toxin-like and provides a ranked list of positively predicted candidates according to statistical confidence. For each protein, additional information is presented including the presence of a signal peptide, the number of cysteine residues and the associated functional annotations. ( Reference: G. Naamati et al. 2009. Nucleic Acids Res. 37(Web Server issue): W363&ndashW368).

                                                                          t3db the Toxin and Toxin Target Database - combines detailed toxin data with comprehensive toxin target information. The database currently houses 3,053 toxins which are linked to 1,670 corresponding toxin target records. Each toxin record (ToxCard) contains over 50 data fields and holds information such as chemical properties and descriptors, toxicity values, molecular and cellular interactions, and medical information. ( Reference: Lim E et al. 2010. Nucleic Acids Res. 38(Database issue): D781-786).

                                                                          TAfinder 2.0 - is a web-based tool to identify Type II toxin-antitoxin loci in bacterial genome ( Reference: Xie Y et al. (2018) Nucleic Acids Res. 46(D1): D749-D753 ).

                                                                          DBETH Database of Bacterial ExoToxins for Humans is a database of sequences, structures, interaction networks and analytical results for 229 exotoxins, from 26 different human pathogenic bacterial genus. All toxins are classified into 24 different Toxin classes. The aim of DBETH is to provide a comprehensive database for human pathogenic bacterial exotoxins. ( Reference: Chakraborty A et al. 2012. Nucleic Acids Res. 40(Database issue): D615-620).

                                                                          VFDB - is an integrated and comprehensive database of virulence factors for bacterial pathogens (also including Chlamydia and Mycoplasma). ( Reference : L.H. Chen et al. 2012. Nucleic Acids Res. 40(Database issue): D641-D645).

                                                                          PAIDB (Pathogenicity Island Database) - Pathogenicity islands (PAIs) and resistance islands (REIs) are key to the evolution of pathogens and appear to play complimentary roles in the process of bacterial infection. While PAIs promote disease development, REIs give a fitness advantage to the host against multiple antimicrobial agents. An anncillary program, PAI Finder, identifies PAI-like regions or REI-like regions in a multi-sequence query. ( Reference : S.H Yoon et al. 2015. Nucl. Acids Res. 43 (D1): D624-D630).

                                                                          IslandViewer - includes a new interactive genome visualization tool, IslandPlot, and expanded virulence factor, antimicrobial resistance gene, and pathogen-associated gene annotations, as well as homologs of these genes in closely related genomes. Notably, incomplete genomes are accepted as input in IslandViewer 3, though they strongly urge users to use complete genomes whenever possible. ( Reference : B.K. Dhillon et al. 2015. Nucl. Acids Res. 43 (W1): W104-W108).

                                                                          Gypsy Database - an open editable database about the evolutionary relationship of viruses, mobile genetic elements (MGEs Ty3/Gypsy, Retroviridae, Ty1/Copia and Bel/Pao LTR retroelements and the Caulimoviridae pararetroviruses of plants) and other genomic repeats. Equipped for BLAST and HMM searches. ( Reference : Llorens, C et al. 2011. Nucl. Acids Res. 39(suppl 1): D70-D74).

                                                                          PanDaTox (Pan Genomic Database for Genomic Elements Toxic to Bacteria) - is a database of genes and intergenic regions that are unclonable in E. coli, to aid n the discovery of new antibiotics and biotechnologically beneficial functional genes. It is also designed to improve the efficiency of metabolic engineering. BLAST Search feature included. ( Reference : Mitai G & Sorek R. 2012. Bioengineered, 3: 218-221.)

                                                                          PathogenFinder (predicts pathogenic potential) &ndash Based on complete genomes from 513 bacteria annotated as human non-pathogens and 372 bacteria annotated as human pathogens, a database of protein families, which are either mainly associated with non-pathogens or with pathogens have been created. This database is then used for predicting the pathogenic potential of bacteria. As input, the method can use both pre-assembled, complete or partial genomes, and short sequence reads from four different sequencing platforms. ( Reference: Cosentino S et al. 2013. PLoS ONE 8: e77302)

                                                                          VirulentPred - is a SVM based method to predict bacterial virulent proteins sequences, which can be used to screen virulent proteins in proteomes. Together with experimentally verified virulent proteins, several putative, non annotated and hypothetical protein sequences have been predicted to be high scoring virulent proteins by the prediction method. ( Reference: Garg A & Gupta G. 2008. BMC Bioinformatics 9: 62).

                                                                          The Type III secretion system (T3SS) is an essential mechanism for host-pathogen interaction in the infection process. The proteins secreted through the T3SSmachinery of many Gram-negative bacteria are known as T3SS effectors (T3SEs). These can either be localized subcellularly in the host, or be part of the needle tip of the T3SS that interacts directly with the host membrane to bring other effectors into the target cell. T3SEdb represents such an effort to assemble a comprehensive database of all experimentally determined and putative T3SEs into a web-accessible site. BLAST search is available. ( Reference: Tay DM et al. 2010. BMC Bioinformatics. 11 Suppl 7:S4).

                                                                          Effective (University of Vienna, Austria & Technical University of Munich, Germany) - Bacterial protein secretion is the key virulence mechanism of symbiotic and pathogenic bacteria.Thereby effector proteins are transported from the bacterial cytosol into the extracellular medium or directly into the eukaryotic host cell. The Effective portal provides precalculated predictions on bacterial effectors in all publicly available pathogenic and symbiontic genomes as well as the possibility for the user to predict effectors in own protein sequence data.

                                                                          SIEVE Server is a public web tool for prediction of type III secreted effectors. The SIEVE Server scores potential secreted effectors from genomes of bacterial pathogens with type III secretion systems using a model learned from known secreted proteins. The SIEVE Server requires only protein sequences of proteins to be screened and returns a conservative probability that each input protein is a type III secreted effector. ( Reference: McDermott JE et al. 2011. Infect Immun. 79:23-32).

                                                                          T3SE - Type III secretion system effector prediction ( Reference: Löwer M, & Schneider G. 2009. PLoS One. 4:e5917. Erratum in: PLoS One. 20094(7).

                                                                          Phage_Finder - was created to identify prophage regions in completed bacterial genomes. Using a test dataset of 42 bacterial genomes whose prophages have been manually identified, Phage_Finder found 91% of the regions, resulting in 7% false positive and 9% false negative prophages. A search of 302 complete bacterial genomes predicted 403 putative prophage regions, accounting for 2.7% of the total bacterial DNA. Analysis of the 285 putative attachment sites revealed tRNAs are targets for integration slightly more frequently (33%) than intergenic (31%) or intragenic (28%) regions, while tmRNAs were targeted in 8% of the regions. ( Reference: D.E. Fouts. 2006. Nucleic Acids Res. 34 : 5839&ndash5851).

                                                                          Prophinder - is the tool used for detecting prophages in bacterial genomes. Select a GenBank formatted file.

                                                                          PHAST (PHAge Search Tool) - is designed to rapidly and accurately identify, annotate and graphically display prophage sequences within bacterial genomes or plasmids. It accepts either raw DNA sequence data or partially annotated GenBank formatted data and rapidly performs a number of database comparisons as well as phage &ldquocornerstone&rdquo feature identification steps to locate, annotate and display prophage sequences and prophage features. Relative to other prophage identification tools, PHAST is up to 40 times faster and up to 15% more sensitive. It is also able to process and annotate both raw DNA sequence data and Genbank files, provide richly annotated tables on prophage features and prophage &ldquoquality&rdquo and distinguish between intact and incomplete prophage. PHAST also generates downloadable, high quality, interactive graphics that display all identified prophage components in both circular and linear genomic views.Furthermore, tests indicate that PHAST is as accurate or slightly more accurate than all available phage finding tools, with sensitivity of 85.4% and positive predictive value of 94.2%. ( Reference: Zhou, Y. et al. 2011. Nucl. Acids Res. 39(suppl 2): W347-W352).

                                                                          PHASTER PHAge Search Tool Enhanced Release - is a significant upgrade to PHAST for the rapid identification and annotation of prophage sequences within bacterial genomes and plasmids. Numerous software improvements and significant hardware enhancements have now made PHASTER faster, more efficient, more visually appealing and much more user friendly. In particular, PHASTER is now 4.3X faster than PHAST. ( Reference: D. Arndt et al. Nucleic Acids Res. 2016 44(W1):W16-21).

                                                                          Prophage Hunter - provides a one-stop web service to extract prophage genomes from bacterial genomes, evaluate the activity of the prophages, identify phylogenetically related phages, and annotate the function of phage proteins. ( Reference: Song W et al. (2019) Nucleic Acids Res 47(W1): W74&ndashW80).

                                                                          IslandViewer - integrates two sequence composition GI prediction methods SIGI-HMM and IslandPath-DIMOB, and a single comparative GI prediction method IslandPick ( Reference: Langille et al. 2008. BMC Bioinformatics 9: 329).

                                                                          PAIDB (PAthogenicity Island DataBase) has made an effort to collect known PAIs and to detect the potential PAI regions in the prokaryotic complete genomes. Pathogenicity islands (PAIs) are distinct genetic elements of pathogens encoding various virulence factors. ( Reference: Yoon SH et al. 2007. Nucleic Acids Res. 35 (Database Issue): D395-D400).

                                                                          MTGIpick can identify genomic islands from a single genome, without annotated information of genomes or prior knowledge from other datasets. In simulations with alien fragments from artificial and real genomes, MTGIpick reported robust results across different experiments ( Reference: Dai Q et al. (2018) Brief Bioinform 19(3): 361-373).


                                                                          SyntTax - is a web server linking synteny to prokaryotic taxonomy. SyntTax incorporates a full hierarchical taxonomic tree allowing intuitive access to all completely sequenced prokaryotes (Archaea and Bacteria). Single or multiple organisms can be chosen on the basis of their lineage by selecting the corresponding rank nodes in the tree. This is my favourite among the synteny programs ( Reference: Oberto J. 2013. BMC Bioinformatics. 14:4). The results below were generated using the heat-shock sigma factor (RpoH) from Salmonella Typhimurium against the Pseudomonadales.

                                                                          Cinteny Server for Synteny Identification and Analysis of Genome Rearrangement (A. U. Sinha & J. Meller, University of Cincinnati, USA) - this server can be used for finding regions syntenic across multiple genomes and measuring the extent of genome rearrangement using reversal distance as a measure. You may create a project and upload your own data or work with pre-loaded prokaryote or eukaryote data.

                                                                          SimpleSynteny - provides a pipeline for evaluating the synteny of a preselected set of gene targets across multiple organismal genomes. An emphasis has been placed on ease-of-use, and users are only required to submit FASTA files for their genomes and genes of interest. SimpleSynteny then guides the user through an iterative process of exploring and customizing genomes individually before combining them into a final high-resolution figure. ( Reference: Veltri D et al. 2016. Nucleic Acids Res. 44(Web Server issue): W41&ndashW45).

                                                                          Synteny Portal - eukaryotic genome users can easily (i) construct synteny blocks among multiple species by using prebuilt alignments in the UCSC genome browser database, (ii) visualize and download syntenic relationships as high-quality images, (iii) browse synteny blocks with genetic information and (iv) download the details of synteny blocks to be used as input for downstream synteny-based analyses, all in an intuitive and easy-to-use web-based interface. ( Reference: Lee J et al. 2016. Nucleic Acids Res 44(W1): W35&ndashW40).

                                                                          AutoGRAPH is an integrated web server for multi-species comparative genomic analysis. It is designed for constructing and visualizing synteny maps between two or three species, determination and display of macrosynteny and microsynteny relationships among species, and for highlighting evolutionary breakpoints.
                                                                          The web server constructs synteny maps by pairwise comparison of marker/anchor orders between a reference chromosome and one or two tested genome(s). It permits users to visualize and characterize several features: Conserved segments (CS), Conserved Segments Ordered (CSO) and breakpoints. ( Reference: Derrien T et al. 2007. Bioinformatics 23:498-499).

                                                                          Sibelia (University of California San Diego, USA) - is a tool for finding synteny blocks in multiple closely related microbial genomes using iterative de Bruijn graphs. Unlike most other tools, Sibelia can find synteny blocks that are repeated within genomes as well as blocks shared by multiple genomes. It represents synteny blocks in a hierarchy structure with multiple layers, each of which representing a different granularity level.

                                                                          Kablammo helps you create interactive visualizations of BLAST results from your web browser. Find your most interesting alignments, list detailed parametersfor each, and export a publication-ready vector image. Incredibly easy to use - here are the results for a BLASTN comparison to Escherichia phages T1 (query) and ADB-2. ( Reference: Wintersinger JA et al. Bioinformatics 31:1305-1306).


                                                                          M1CR0B1AL1Z3R - is a 'one-stop shop' for conducting microbial genomics data analyses via a simple graphical user interface. Some of the features implemented in M1CR0B1AL1Z3R are: (i) extracting putative open reading frames and comparative genomics analysis of gene content (ii) extracting orthologous sets and analyzing their size distribution (iii) analyzing gene presence-absence patterns (iv) reconstructing a phylogenetic tree based on the extracted orthologous set (v) inferring GC-content variation among lineages. M1CR0B1AL1Z3R facilitates the mining and analysis of dozens of bacterial genomes using advanced techniques. ( Reference: Avram O et al. (2019) Nucleic Acids Res. 47(W1): W88-W92).

                                                                          GeneOrder 4.0 (D. Seto, Bioinformatics & Computational Biology, George Mason Univ., U.S.A.) is designed to can be used to compare the gene order between two bacterial genomes ( Reference: Mahadevan P. & Seto D. 2010. BMC Research Notes 3:41).
                                                                          CoreGenes (D. Seto & P. Mahadevan, Bioinformatics & Computational Biology, George Mason Univ., U.S.A) - tallies the total number of genes in common between the two genomes being compared displays the percent value of genes in common with a specific genome determines the unique genes contained in a pair of proteomes. CoreGenes 3.5 is the batch CoreGenes server. I have extensively used this set of resources in the classification of bacterial viruses.

                                                                          If you have a a gbk file for a phage which has not yet been deposited in GenBank you can use these instructions to convert your data into CoreGenes format for use here.

                                                                          WebACT - this is the web version of ACT (Artemis Comparison Tool) a DNA sequence comparison viewer based on Artemis ( Reference: 21: 3422 - 3423 Visit the database page of EMBL-EBI and select EMBL and "Standard Query Form" to determine the EMBL accession number for the sequence you are interested in.

                                                                          Panseq (Chad Laing, Public Health Agency of Canada) - a group of tools for the analysis of the 'pan genome' of a group of genomic sequences. The pan-genome of a bacterial species consists of a core genome and an accessory gene pool, the latter of which allows subpopulations of the organism to adapt to specific environments. These include Novel Region Finder, which will find sequences that are unique to a strain or group of strains with respect to another strain or group of strains. Pan-genome Analysis identifies the pan-genome among your sequences and, finds SNPs in the core genome and determine the distribution of accessory genomic regions.Loci Selector identifies loci that offer the best discrimination among your dataset. ( Reference: Laing, C. et al. 2010. BMC Bioinformatics . 11: 461).

                                                                          PARIGA - enables users to perform all-against-all BLAST searches on two sets of sequences selected by the user. Moreover, since it stores the two BLAST output in a python-serialized-objects database, results can be filtered according to several parameters in real-time fashion, without re-running the process and avoiding additional programming efforts. ( Reference: Orsini M. et al. 2013. PLoS One 8(5):e62224).

                                                                          EDGAR (Efficient Database framework for comparative Genome Analyses using BLAST score Ratios) - EDGAR is designed to automatically perform genome comparisons in a high throughput approach and can be used for core genome, pan genome and singleton analysis, and Venn diagram construction. ( Reference: Blom J. et al. 2009. BMC Bioinformatics 10: 154).

                                                                          OrthoVenn - is a web server for genome wide comparison and annotation of orthologous clusters across multiple species. It provides coverage of vertebrates, metazoa, protists, fungi, plants and bacteria for the comparison of orthologous clusters and also supports uploading of customized protein sequences from user-defined species. An interactive Venn diagram, summary counts, and functional summaries of the disjunction and intersection of clusters shared between species are displayed as part of the OrthoVenn result. OrthoVenn also includes in-depth views of the clusters using various sequence analysis tools. Furthermore, it identifies orthologous clusters of single copy genes and allows for a customized search of clusters of specific genes through key words or BLAST. ( Reference: Y. Yang et al. 2015. Nucl. Acids Res. 43 (W1): W78-W84). Also found here.

                                                                          BEACON is a software tool that compares annotations of a particular genome from different Annotation Methods (AMs). It uses GenBank format as input and derives Extended Annotation (EA) along side listing original annotations from individual AMs. ( Reference: Kalkatawi M, BMC Genomics. 201516(1): 1-8).

                                                                          ANI (Average Nucleotide Identity) calculator - estimates the average nucleotide identity using both best hits (one-way ANI) and reciprocal best hits (two-way ANI) between two genomic datasets. Typically, the ANI values between genomes of the same species are above 95% (e.g., Escherichia coli). Values below 75% are not to be trusted, and AAI should be used instead. This tool supports both complete and draft genomes (multi-fasta). ( Reference: Goris J et al. 2007. Int J Syst Evol Microbiol. 57(Pt 1): 81-91).

                                                                          Average Nucleotide Identity (ANI) calculator - their ANI Calculator uses the OrthoANIu algorithm, an improved iteration of the original OrthoANI algorithm, which uses USEARCH instead of BLAST ( Reference: Yoon, S. H. et al. (2017). Antonie van Leeuwenhoek. 110:1281&ndash1286).

                                                                          VIRIDIC (Virus Intergenomic Distance Calculator C. Moraru, Institute for Chemistry and Biology of the Marine Environment, Germany) - the first level of bacteriophage classification by ICTV involves computing the overall DNA sequence identity between two viruses. This new tool computes pairwise intergenomic distances/similarities amongst phage genomes. To run it, upload a single fasta file with all phage genomes of interest, create a project and press run. Save the project ID that will be displayed when the project is created. You will need it to access the data if the calculations take a long time.

                                                                          GGDC (Genome-To-Genome Distance Calculator) - provides methods for inferring whole-genome distances which are well able to mimic DNA-DNA hybridization (DDH). Values calculated with GGDC yield a somewhat better correlation with wet-lab DDH values than alternative approaches such as "ANI". These distance functions can also cope with heavily reduced genomes and repetitive sequence regions. Some of them are also very robust against missing fractions of genomic information (due to incomplete genome sequencing). Thus, this web service can be used for genome-based species delineation. ( Reference: Meier-Kolthoff JP et al. 2013. BMC Bioinformatics 14: 60).

                                                                          POGO-DB - Based on computationally intensive whole-genome BLASTs, POGO-DB provides several metrics on pairwise genome: (a) Average Amino Acid Identity of all bi-directional best blast hits that covered at least 70% of the sequence and had 30% sequence identity (b) Genomic Fluidity that estimates the similarity in gene content between two genomes (c) Number of orthologs shared between two genomes (as defined by two criteria) (d) Pairwise identity of the most similar 16S rRNA genes (e) Pairwise identity of 73 additional globally-conserved marker genes (which were determined by us to exist in at least 90% of all the genomes). ( Reference: Lan Y et al. 2014. Nucl. Acids Res. 42 (D1): D625-D632).

                                                                          VICTOR (Virus Classification and Tree Building Online Resource Leibniz-Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH). This web service compares bacterial and archaeal viruses ("phages") using their genome or proteome sequences. The results include phylogenomic trees inferred using the Genome-BLAST Distance Phylogeny method (GBDP), with branch support, as well as suggestions for the classification at the species, genus and family level. (The service can be applied to other kinds of viruses, too, but has not yet been tested in this respect.) Upload your FASTA files, GenBank files and/or GenBank accession IDs. ( Reference: JP Meier-Kolthoff & M Göker. 2017. Bioinformatics 33(21): 3396&ndash3404).

                                                                          VIRFAM is dedicated to the recognition of head-neck-tail modules and of recombinase genes in phage genomes. You can use this server to search for remote homologs of specific protein families within protein sequences of bacteriophages. Input: protein sequences you&rsquore your phage output includesd a phylogenetic tree with the placement of your virus. ( Reference: Lopes A et al. Nucleic Acids Res. (2010) 38(12): 3952-62).

                                                                          Seeker - is a deep-learning tool for reference-free identification of phage sequences. Seeker allows rapid detection of phages in sequence datasets and clean differentiation of phage sequences from bacterial ones, even for phages with little sequence similarity to established phage families. We comprehensively validate Seeker ability to identify unknown phages and employ Seeker to detect unknown phages, some of which are highly divergent from known phage families. ( Reference: Auslander N et al. (2020) doi.org/10.1101/2020.04.04.025783)

                                                                          VipTree - generates a "proteomic tree" of viral genome sequences based on genome-wide sequence similarities computed by tBLASTx. The original proteomic tree concept (i.e., "the Phage Proteomic Tree&rdquo) was developed by Rohwer and Edwards, 2002. A proteomic tree is a dendrogram that reveals global genomic similarity relationships between tens, hundreds, and thousands of viruses. It has been shown that viral groups identified in a proteomic tree well correspond to established viral taxonomies. ( Reference: Nishimura Y et al. (2017) Bioinformatics 33: 2379&ndash2380).

                                                                          MiGA (Microbial Genomes Atlas) - a webserver that allows the classification of an unknown query genomic sequence, complete or partial, against all taxonomically classified taxa with available genome sequences, as well as comparisons to other related genomes including uncultivated ones, based on the genome-aggregate Average Nucleotide and Amino Acid Identity (ANI/AAI) concepts. ( Reference: Rodriguez-R et al (2018) Nucleic Acids Research 46(W1): W282-W288).

                                                                          CGView Server - is a comparative genomics tool for circular genomes that allows sequence feature information to be visualized in the context of sequence analysis results. A genome sequence is supplied to the program in FASTA, GenBank, EMBL or raw format. Up to three comparison sequences (or sequence sets) in FASTA format can also be submitted. The CGView Server uses BLAST to compare the genome sequence to the comparison sequences, and then converts the results and any available feature information (from the GenBank, EMBL or optional GFF file) or analysis information (from an optional GFF file) into a high-quality graphical map showing the entire genome sequence, or a zoomed view of a region of interest. Several options are available for specifying how the BLAST comparisons are conducted, and for controlling how results are displayed.( Reference: Grant JR & Stothard P. 2008. Nucleic Acids Res. 36(Web Server issue): W181-184)

                                                                          Jena Prokaryotic Genome Viewer (JPGV) - from a GenBank flatfile (*.gbk) generates linear or circular plots including if desired GC content, GC skew, purine excess and keto excess can be displayed. Also allows BLAST analysis against related genomes. Requires free registration.

                                                                          GenomeVx - makes editable, publication-quality, maps of mitochondrial and chloroplast genomes and of large plasmids. These maps show the location of genes and chromosomal features as well as a position scale. The program takes as input either raw feature positions or GenBank records. In the latter case, features are automatically extracted and colored, an example of which is given. Output is in the Adobe Portable Document Format (PDF) and can be edited by programs such as Adobe Illustrator.( Reference: G. Conant & K. Woolfe. 2008. Bioinformatics 24:861-862).

                                                                          myGenomeBrowser - is a web-based environment that provides biologists with a way to build, query and share their genome browsers. This tool, that builds on JBrowse, is designed to give users more autonomy while simplifying and minimizing intervention from system administrators. They have extended genome browser basic features to allow users to query, analyze and share their data. ( Reference: S. Carrere & J. Gouzy. Bioinformatics (2017) 33 (8): 1255-1257).

                                                                          DNAPlotter - is an interactive Java application for generating circular and linear representations of genomes. Making use of the Artemis libraries to provide a user-friendly method of loading in sequence files (EMBL, GenBank, GFF) as well as data from relational databases, it filters features of interest to display on separate user-definable tracks. It can be used to produce publication quality images for papers or web pages.( Reference: Carver, T. et al. 2008. Bioinformatics 25:119-120)

                                                                          GeneWiz (Center for Biological Sequence Analysis, Danish Technical University) produces linear or circular genome altases such as the one below. They have ready name ones for most bacteria, but by uploading custom data in GenBank format (.gbk) one can make one's own diagram showing the genetic and physical properties of your genome.

                                                                          OrganellarGenomeDRAW - is a suite of software tools that enable users to create high-quality visualrepresentations of both circular and linear annotated genome sequences provided as GenBank files oraccession numbers. Although all types of DNA sequences are accepted as input, the software has beenspecifically optimized to properly depict features of organellar genomes. A recent extension facilitates theplotting of quantitative gene expression data, such as transcript or protein abundance data, directly ontothe genome map ( Reference: Lohse M, et al. 2013. Nucleic Acids Res. 41(Web Server issue):W575-81) .

                                                                          PlasmaDNA - Starting with a primary DNA sequence, PlasmaDNA looks for restriction sites, open reading frames, primer annealing sequences, and various common domains. The databases are easily expandable by the user to fit his most common cloning needs. PlasmaDNA can manage and graphically represent multiple sequences at the same time, and keeps in memory the overhangs at the end of the sequences if any. This means that it is possible to virtually digest fragments, to add the digestion products to the project, and to ligate together fragments with compatible ends to generate the new sequences. Excellent package for plasmids. (Reference: Angers-Loustau A et al. 2007. BMC Mol Biol. 2007 8:77).

                                                                          GSDraw (Gene Structure Draw Server) is a web server for gene family to draw gene structure schematic diagrams. Users can submit genomic, CDS and transcript sequences. GSDraw uses this information to obtain the gene structure, protien motif and phylogenetics tree, then draw diagram for it. (Reference: Wang Y, et al. 2013. Nucleic Acids Res. 41(Database issue):D1159-66).

                                                                          GECA is a user-friendly tool for representing gene exon/intron organization and highlighting changes in gene structure among members of a gene family. It relies on protein alignment, completed with the identification of common introns in the corresponding genes using CIWOG. GECA produces a main graphical representation showing the resulting aligned set of gene structures, where exons are to scale. The important and original feature of GECA is that it combines these gene structures with a symbolic display highlighting sequence similarity between subsequent genes. It is worth noting that this combination of gene structure with the indications of similarities between related genes allows rapid identification of possible events of gain or loss of introns, or points to erroneous structural annotations. The output image is generated in a portable network graphics format which can be used for scientific publications. ( Reference: Fawal N, et al. 2012. Bioinformatics 28:1398-9).

                                                                          GeneDesign - is an excellent resource for designing synthetic genes. It includes tools for codon optimization and removal of restriction sites ( Reference: Richarson, S.M. et al. 2006. Genome Research 16:550-556)

                                                                          Orphelia - Orphelia is a metagenomic ORF finding tool for the prediction of protein coding genes in short, environmental DNA sequences with unknown phylogenetic origin. Orphelia is based on a two-stage machine learning approach that was recently introduced by our group. After the initial extraction of ORFs, linear discriminants are used to extract features from those ORFs. Subsequently, an artificial neural network combines the features and computes a gene probability for each ORF in a fragment. A greedy strategy computes a likely combination of high scoring ORFs with an overlap constraint. ( Reference: K.J. Hoff et al. 2009. Nucl. Acids Res. 37(Web Server issue:W101-W105).

                                                                          WebMGA is a customizable web server for fast metagenomic analysis which includes over 20 commonly used tools for analyses such as ORF calling, sequence clustering, quality control of raw reads, removal of sequencing artifacts and contaminations, taxonomic analysis, functional annotation etc. All the tools behind WebMGA were implemented to run in parallel on our local computer cluster. ( Reference: Wu S, et al. 2011. BMC Genomics. 12:444).

                                                                          MG-RAST (the Metagenomics RAST) server is an automated analysis platform for metagenomes providing quantitative insights into microbial populations based on sequence data. The server primarily provides upload, quality control, automated annotation and analysis for prokaryotic metagenomic shotgun samples. ( Reference: Wilke A, et al. 2016. Nucleic Acids Res. 44(D1):D590-4).

                                                                          MetaBin Comprehensive Taxonomic Assignment of Metagenomic Sequences (Laboratory for Integrated Bioinformatics, RIKEN, Japan) web server and standalone program allow faster and more accurate taxonomic assignment of single and paired-end sequence reads of varying lengths (&ge45 bp) obtained from both Sanger and next-generation sequencing platforms. Has a tutorial.

                                                                          AmphoraNet - uses 31 bacterial and 104 archaeal protein coding marker genes for metagenomic and genomic phylotyping. Most of these are single copy genes, therefore AmphoraNet is suitable for estimating the taxonomic composition of bacterial and archaeal communities from metagenomic shotgun sequencing data. ( Reference: Kerepesi C, et al. 2014. Gene. 533:538-40).

                                                                          METAGENassist - allows users to take bacterial census data from different environment sites or different biological hosts, and perform comprehensive multivariate statistical analyses on the data. These multivariate analyses can be done using either taxonomic or automatically generated phenotypic labels and visualized using a variety of high quality graphical tools. The bacterial census data can be derived from 16S rRNA data, NextGen shotgun sequencing or even classical microbial culturing techniques. Includes a tutorial. ( Reference: Arndt D, et al. 2012. Nucleic Acids Res. 40(Web Server issue):W88-95).

                                                                          Real Time Metagenomics (Dr. Robert Edwards, San Diego State University, USA) - is the next revolution in metagenome annotation: Real time data processing and analysis. You can finally annotate a metagenome in real time, with no waiting. You can upload your own data for analysis. They accept either fasta or fastq files, and you can provide zip or gzip compressed data.

                                                                          EBI Metagenomics (EMBL-EBI) - is an automated pipeline for the analysis and archiving of metagenomic data that aims to provide insights into the phylogenetic diversity as well as the functional and metabolic potential of a sample. You can freely browse all the public data in the repository. The service identifies rRNA sequences, using rRNASelector, and performs taxonomic analysis upon 16S rRNAs using Qiime. The remaining reads are submitted for functional analysis of predicted protein coding sequences using the InterPro sequence analysis resource. InterPro uses diagnostic models to classify sequences into families and to predict the presence of functionally important domains and sites. By utilising this resource, the service offers a powerful and sophisticated alternative to BLAST-based functional metagenomic analyses. Data submitted to the EBI Metagenomics service is automatically archived in the European Nucleotide Archive (ENA). Accession numbers are supplied for sequence data.

                                                                          Kaiju - is a fast and sensitive taxonomic classification for metagenomics which takes nucleotide sequences in compressed FASTA or FASTQ format. Reads are directly assigned to taxa using the NCBI taxonomy and a reference database of protein sequences from bacterial, archaeal and viral genomes. By default, Kaiju uses either the available complete genomes from NCBI RefSeq or the microbial subset of the non-redundant protein database nr used by NCBI BLAST. Kaiju translates reads into amino acid sequences, which are then searched in the database using a modified backward search on a memory-efficient implementation of the Burrows-Wheeler transform, which finds maximum exact matches (MEMs), optionally allowing mismatches in the protein alignment. ( Reference: Menzel P et al. 2016. (Nat. Commun. 7:11257)

                                                                          PhyloPythiaS - is a fast and accurate sequence composition-based classifier that utilizes the hierarchical relationships between clades. Taxonomic assignments with the web server can be made with a generic model, or with sample-specific models that users can specify and create. Several interactive visualization modes and multiple download formats allow quick and convenient analysis and downstream processing of taxonomic assignments. ( Reference: Patil KR, et al. 2012. PLoS One. 7:e38581).

                                                                          Virtual Metagenome - A web server to reconstruct metagenomes from 16S rRNA sequences. a novel method for the rapid and efficient reconstruction of a virtual metagenome in environmental microbial communities without using large-scale genomic sequencing. We demonstrate this approach using 16S rRNA gene sequences obtained from denaturing gradient gel electrophoresis analysis, mapped to fully sequenced genomes, to reconstruct virtual metagenome-like organizations. ( Reference: Okuda S, et al. 2012. Nat Commun. 3:1203.)

                                                                          MetaPhlAn2 (version 2.0.0) - is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data with species level resolution. It is also able to identify specific strains and to track strains across samples for all species. It allows for unambiguous taxonomic assignments, accurate estimation of organismal relative abundance, and species-level resolution for bacteria, archaea, eukaryotes and viruses. ( Reference: Segata N, et al. 2012. Nature Methods 8: 811&ndash814).

                                                                          CoMet-Universe &mdash a web-server for comparative analysis of metagenomes based on protein domain signatures. Starting with an upload of your DNA sequences the CoMet pipeline performs all necessary steps for a comprehensive metagenome analysis including gene prediction, protein domain detection using Pfam 27, metabolic profiling based on KEGG pathways and taxon abundance estimation across all domains of life and viruses. ( Reference: Aßhauer KP et al. Int J Mol Sci. 2014 15(7):12364-78).

                                                                          16S Classifier - is a tool for fast and accurate taxonomic classification of 16S rRNA hypervariable regions in metagenomic datasets. On real metagenomic datasets, it showed up to 99.7% accuracy at the phylum level and up to 99.0% accuracy at the genus level. ( Reference: N. Chaudhary et al. 2015. PLoS One 10(2): e0116106). It can also be accessed here

                                                                          DNAATLAS (DNA2.0 Inc., U.S.A.) - A place for all your sequences. Easily import all your constructs including Genbank, Gene Designer, Excel, Word, and nearly any text-based format. DNA Atlas immediately parses your upload files and infers whether each sequence is a feature, construct, primer, DNA or amino acid. Upload features and primers to see them annotated in your sequences. Instantly view constructs annotated with our curated list of over 1000 features, or add your own. Use the BLAST-based sequence search to quickly align and compare your sequences.Keep track of your sequences, features, and primers. Categorize them using tags - from freezer locations to characterization data. (requires registration).

                                                                          SuperPhy (Chad Laing & Vic Gannon, Public Health Agency of Canada) is an online tool for the predictive genomics of Escherichia coli. The platform integrates the analyses tools and genome sequence data for all publicly available E. coli genomes and facilitates the upload of new genome sequences from users under public or private settings. SuperPhy provides real-time analyses of thousands of genome sequences based on strain metadata, including geospatial and phylogenetic context.

                                                                          Naming your bacteriophage: This is of prime importance for members of the bacterial virus community to name their newly isolated phages appropriately. A good place to start is " How to Name and Classify Your Phage: An Informal Guide." ( Reference: Adriaenssens E & Brister JR. 2017. Viruses 9(4). pii: E70) to which I will add the following points (a) please check that the name you propose has not been used already and, (b) Do not name your phage Enterobacter ia phage ø1234 or Enterobacteria phage 2017/ABC_567 since these names are incompatable with the creation of new species and genera taxa by the International Committee on Taxonomy of Viruses (ICTV). To find if your proposed name is unique consult:

                                                                          Phage Name Check (Stephen T. Abedon, Ohio State University, USA) - to see whether 'your' phage name is currently found on Google Scholar, Google Books, PubMed, or even Bacteriophage Names 2000.

                                                                          CPT Phage Name Search (Center for Phage Technology at Texas A&M University)


                                                                          Estimating Community Dissimilarities (Beta-Diversity)

                                                                          �ta-diversity,” as coined by Whittaker (1960), is “The extent of change of community composition, or degree of community differentiation, in relation to a complex gradient of environment, or a pattern of environments.” In other words, beta-diversity is the degree to which two samples are different. This is a rather different issue than within-sample richness and evenness, and can be measured in many different ways.

                                                                          The choice of beta-diversity metric can have important consequences to subsequent analyses, such as clustering and ordination. This is partially due to the interplay between distance metrics and normalization techniques, which can widen or reduce the apparent distance between samples (Figure 3).

                                                                          A true distance metric is one that is always positive, in which the distance between a point and itself is 0, the distance between A and B is identical to the distance between B and A and the sum of the distance between A and B and between B and C is no greater than the distance between A and C. This last assumption is the one that often fails for other dissimilarity measures. The appropriate metric for a study might depend on the size of the effect of interest and on the depth of sampling.

                                                                          The most widely known true distance metric is the euclidean:

                                                                          Where S1 and S2 are two samples and S1i, S2i are the abundance of OTU i in samples S1 and S2, respectively.

                                                                          However, the euclidean distance requires very large effect sizes for statistical significance (Kuczynski et al., 2010) and doesn't perform well in datasets with many zeroes. A more appropriate metric is thus Jensen-Shannon's, a symmetric version of the Kullback-Leibler divergence. In Kullback-Leibler, the distance between S1, S2 is:

                                                                          Thus, Kullback-Leibler is not applicable for 0-rich datasets. However, since Jensen-Shannon's compares samples S1 and S2 to their average, the problem of 0's disappears:

                                                                          This formulation also automatically satisfies the other requirements for a distance metric.

                                                                          In microbial ecology it is also common to use correlation coefficients, such as Pearson's product moment (Figure 4A):

                                                                          Figure 4. Visual intuition to selected community dissimilarity metrics. In each panel, the same set of OTU (blue dots) is represented in a scatter plot from two highly correlated samples. Pearson's and Spearman's correlations can be intuitively thought as the degree to which the scatter deviates from a straight diagonal line, except Pearson is based on the numeric values of distances (A) and Spearman on their ranks (B). Bray-Curtis dissimilarity is displayed in (C).

                                                                          To minimize the influence of noise, other researchers prefer Spearman's rank correlation, which is identical to Pearson's except that instead of the measured values, their ranks are used (Figure 4B). Finally, Bray-Curtis dissimilarity, while not very sensitive, is appropriate for 0-inflated datasets (Figure 4C):

                                                                          An alternative to OTU-based distances is to use phylogenetic distances. While these approaches also require several non-trivial choices, such as the underlying phylogenetic tree and the placement of OTU in it, evolutionary distances are often more biologically meaningful, not least because phylogenetic relatedness is often associated to trait conservation (Martiny et al., 2015). As is the case for OTU-based metrics, using a quantitative or qualitative approach to community comparison can lead to very different results (Lozupone et al., 2007). This can be ameliorated through an appropriate weighting procedure, such as generalized Unifrac (Chen et al., 2012). Recent work by Schmidt and colleagues does a thorough review of commonly used distance metrics and proposes new taxonomic and phylogenetic distances based on co-occurrence networks (Schmidt et al., 2016).

                                                                          Different approaches to community dissimilarity, such as OTU-based vs. phylogenetic, may highlight different aspects of the community and its functioning. It can therefore be useful to combine these different analyses to gain deeper insight into the system under study.


                                                                          BLAST Servers

                                                                          "The Bio-Web: Resources for Molecular and Cell Biologists" is a non-commercial, educational site with the only purpose of facilitating access to biology-related information over the internet. Keywords: biology books, molecular biology, cell biology, cell and molecular biology, bio, bioinformatics web development, scientific web development, web applications, open source, linux, strider, biology news, bioinformatics, biology software, mac software, biology software for macintosh, dna and protein sequence analysis. All logos and trademarks in this site are property of their respective owner. The comments are property of their posters. privacy policy
                                                                          Legacy pages: Bioinformatics FAQ - Macintosh Software for Molecular Biology - Rotating DNA. We found only this place for now to link to: CERN First Website Project - http://info.cern.ch/ - Proposal.
                                                                          Quote: "When describing a complex system, many people resort to diagrams with circles and arrows. Circles and arrows leave one free to describe the interrelationships between things in a way that tables, for example, do not. The system we need is like a diagram of circles and arrows, where circles and arrows can stand for anything. We can call the circles nodes, and the arrows links." - Tim Barner Lee


                                                                          Watch the video: Course: Analyzing amplicon sequencing data with Qiime 2 - pt. II (January 2022).