
RNA-Seq library normalization and experimental setup


I'm working on an RNA-Seq project and I am trying to figure out library normalization. I'm aware that tools such as Cuffdiff use the geometric mean of FPKM values for normalization.

However, I was wondering why people don't add a unique RNA sequence of known concentration to their samples prior to amplification. After sequencing, you would then have a known measure by which to normalize your FPKM values. I would think this would also help minimize batch effects.

Is there a technical reason why this isn't done?


Normalizing single-cell RNA sequencing data: challenges and opportunities

Single-cell transcriptomics is becoming an important component of the molecular biologist's toolkit. A critical step when analyzing data generated using this technology is normalization. However, normalization is typically performed using methods developed for bulk RNA sequencing or even microarray data, and the suitability of these methods for single-cell transcriptomics has not been assessed. We here discuss commonly used normalization approaches and illustrate how these can produce misleading results. Finally, we present alternative approaches and provide recommendations for single-cell RNA sequencing users.


Background

In the last few years, high-throughput sequencing assays have been replacing microarrays as the assays of choice for measuring genome-wide transcription levels, in so-called RNA-Seq [1,2], as well as DNA copy number (DNA-Seq), protein-nucleic acid interactions (ChIP-Seq), and DNA methylation (methyl-Seq and RRBS). Several studies assessing technical aspects of RNA-Seq have shown good reproducibility and significant improvements over microarrays in terms of dynamic range and accuracy of expression fold-change estimation [3-5]. Nonetheless, as with microarrays, major technology-related artifacts and biases affect the expression measures [3,6-20] and normalization remains an important issue, despite initial optimistic claims such as: "One particularly powerful advantage of RNA-Seq is that it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets" [2].

Here, we focus on biases related to GC-content in the context of RNA-Seq data generated using the Illumina Genome Analyzer platform. Briefly, mRNA is converted to cDNA fragments which are then sequenced to produce millions of short reads (typically 25-100 bases). These reads are then mapped back to a reference genome and the number of reads mapping to a particular gene reflects the abundance of the transcript in the sample of interest. However, raw counts are neither directly comparable between genes within a lane, nor between replicate lanes (i.e., lanes assaying the same library) for a given gene, and normalization of the counts is needed to allow accurate inference of differences in transcript levels. Indeed, by virtue of the assay, one expects the read count for a given gene to be roughly proportional to both the gene's length and its transcript abundance. The read count will also vary between replicate lanes as a result of differences in sequencing depth, i.e., total number of reads produced in a given lane.

Furthermore, as detailed in the literature review below, previous studies have reported selection biases related to the sequencing efficiency of genomic regions, whereby read counts depend not only on length but also on sequence features such as GC-content and mappability (i.e., uniqueness of a particular sequence compared to the rest of the genome) [3,6-20]. For instance, GC-rich and GC-poor fragments tend to be under-represented in RNA-Seq, so that, within a lane, read counts are not directly comparable between genes. Additionally, GC-content effects tend to be lane-specific, so that the read counts for a given gene are not directly comparable between lanes. Biases related to length and GC-content confound differential expression (DE) results as well as downstream analyses, such as those involving Gene Ontology (GO). As GC-content varies throughout the genome and is often associated with functionality, it may be difficult to infer true expression levels from biased read count measures. Proper normalization of read counts is therefore crucial to allow accurate inference of differences in expression levels.

Herein, we distinguish between two main types of effects on read counts: (1) within-lane gene-specific (and possibly lane-specific) effects, e.g., related to gene length or GC-content, and (2) effects related to between-lane distributional differences, e.g., sequencing depth. Accordingly, within-lane and between-lane normalization adjust for the first and second types of effects, respectively.

Within-lane normalization

The most obvious and well-known selection bias in RNA-Seq is due to gene length. Bullard et al. [3] and Oshlack & Wakefield [14] show that scaling counts by gene length is not sufficient for removing this bias and that the power of common tests of differential expression is positively correlated with both gene length and expression level. Indeed, the longer the gene, the higher the read count for a given expression level; thus, any method for which precision is related to read count will tend to report more significant DE statistics for longer genes, even when considering per-base read counts. Hansen et al. [12] incorporate length effects on the mean of a Poisson model for read counts using natural cubic splines and adjust for this effect using robust quantile regression. Young et al. [19] propose a method that accounts for gene length bias in Gene Ontology analysis after performing DE tests.

Another documented source of bias for the Illumina sequencing technology is GC-content, i.e., the proportion of G and C nucleotides in a region of interest. Several authors have reported strong GC-content biases in DNA-Seq [7,10] and ChIP-Seq [17]. Yoon et al. [18] propose a GC-content normalization method for DNA copy number studies, which involves binning reads in 100-bp windows and scaling bin-level read counts by the ratio between the overall median and the median for bins with the same GC-content. More recently, Boeva et al. [8] propose a polynomial regression approach, based on binning reads in non-overlapping windows and regressing bin-level counts on GC-content (with default polynomial degree of three). Still in the context of DNA-Seq, Benjamini & Speed [6] report that read counts are most affected by the GC-content of the actual DNA fragments from the sequence library (vs. that of the sequenced reads themselves) and that the effect of GC-content is sample-specific and unimodal, i.e., both GC-rich and GC-poor fragments are under-represented. They develop a method for estimating and correcting for GC-content bias that works at the base-pair level and accommodates library, strand, and fragment length information, as well as varying bin sizes throughout the genome.
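To make the binning idea concrete, below is a minimal R sketch of this style of bin-based GC correction on simulated data (illustrative only, not the published implementations): per-window counts are rescaled by the ratio of the overall median count to the median count of windows with similar GC-content.

```r
# Toy per-bin data with an artificial unimodal GC effect (GC-rich and GC-poor bins under-represented).
set.seed(42)
gc <- runif(10000, 0.2, 0.8)                                   # GC fraction of each 100-bp window
counts <- rpois(10000, lambda = 100 * exp(-8 * (gc - 0.5)^2))  # simulated window-level read counts

gc_bins <- cut(gc, breaks = seq(0, 1, by = 0.02), include.lowest = TRUE)
overall_med <- median(counts)
bin_med <- tapply(counts, gc_bins, median)                       # median count within each GC stratum
scaled <- counts * overall_med / bin_med[as.character(gc_bins)]  # rescale each window's count
```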

Sequence composition biases have also been observed in RNA-Seq. Hansen et al. [11] report large and reproducible base-specific read biases associated with random hexamer priming in Illumina's standard library preparation protocol. The bias takes the form of patterns in the nucleotide frequencies of the first dozen or so bases of a read. They provide a re-weighting scheme, where each read is assigned a weight based on its nucleotide composition, to mitigate the impact of the bias and improve the uniformity of reads along expressed transcripts.

Roberts et al. [16] also consider the problem of non-uniform cDNA fragment distribution in RNA-Seq and use a likelihood-based approach for correcting for this fragment bias.

When analyzing RNA-Seq data from a yeast diploid hybrid for allele-specific expression (ASE), Bullard et al. [9] note that read counts from an orthologous pair of genes might overestimate the expression level of the more GC-rich ortholog. To correct for this confounding effect, they develop a resampling-based method where the significance of differences in read counts is assessed by reference to a null distribution that accounts for between-species differences in nucleotide composition.

While there has been general agreement about the need to adjust for GC-content effects when comparing read counts between genomic regions for a given sample (as in DNA-Seq and ChIP-Seq) or between orthologs (as in ASE with RNA-Seq in an F1 hybrid organism [9]), the need to do so was not immediately recognized for standard RNA-Seq DE studies, where one compares read counts between samples for a given gene. The common belief was that, for a given gene, the GC-content effect was the same across samples and hence would cancel out when considering DE statistics such as count ratios. Pickrell et al. [15] seem to be the first to note the sample-specificity of the GC-content effect in the context of RNA-Seq and the resulting confounding of expression fold-change estimates. To address this problem, they developed a lane-specific correction procedure which involves binning exons according to GC-content, defining for each GC-bin and each lane a relative read enrichment factor as the proportion of reads in that bin originating from that lane divided by the overall proportion of reads in that lane, and scaling exon-level counts by the spline-smoothed enrichment factors. As noted by Hansen et al. [12], this approach suffers from two main drawbacks. Firstly, as the enrichment factors are computed for each lane relative to all others, the procedure equalizes the GC-content effect across lanes instead of removing it. Secondly, by adding counts across exons and lanes, the method does not account for the fact that regions with higher counts also tend to have higher variances.

Zheng et al. [20] note that base-level read counts from RNA-Seq may not be randomly distributed along the transcriptome and can be affected by local nucleotide composition. They propose an approach based on generalized additive models to simultaneously correct for different sources of bias, such as gene length, GC-content, and dinucleotide frequencies.

In their recent manuscript, Hansen et al. [12] show that GC-content has a strong impact on expression fold-change estimation and that failure to adjust for this effect can mislead differential expression analysis. They develop a conditional quantile normalization (CQN) procedure, which combines both within- and between-lane normalization and is based on a Poisson model for read counts. Lane-specific systematic biases, such as GC-content and length effects, are incorporated as smooth functions using natural cubic splines and estimated using robust quantile regression. In order to account for distributional differences between lanes, a full-quantile normalization procedure is adopted, in the spirit of that considered in Bullard et al. [3]. The main advantage of this approach is that it is lane-specific, i.e., it works independently in each lane, aiming at removing the bias rather than equalizing it across lanes. Simultaneously modeling GC-content and length (and in principle other sources of bias) leads to a flexible normalization method. On the other hand, for some datasets, such as the Yeast dataset analysed in the present article, a regression approach may be too weak to completely remove the GC-content effect and other more aggressive normalization strategies may be needed.
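As an illustration, CQN is implemented in the Bioconductor package cqn; a minimal, hedged sketch on simulated inputs (the toy counts, GC-content, and gene lengths below are made up for the example) obtains the normalized log2-scale values by adding the fitted offsets.

```r
library(cqn)

set.seed(1)
counts <- matrix(rnbinom(2000 * 4, mu = 50, size = 5), nrow = 2000)  # toy gene x sample counts
gc  <- runif(2000, 0.3, 0.7)    # per-gene GC-content
len <- sample(500:5000, 2000)   # per-gene length in bp

fit <- cqn(counts, x = gc, lengths = len)   # smooth GC and length effects via quantile regression
norm_log2 <- fit$y + fit$offset             # GC- and length-adjusted log2-scale expression values
```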

Between-lane normalization

The simplest between-lane normalization procedure adjusts for lane sequencing depth by dividing gene-level read counts by the total number of reads per lane (as in the multiplicative Poisson model of Marioni et al. [4] and the Reads Per Kilobase of exon model per Million mapped reads (RPKM) of Mortazavi et al. [5]). However, this still widely-used approach has proven ineffective, and better-behaved procedures have been proposed [3,12,21,22].

In particular, Bullard et al. [3] consider three main types of between-lane normalization procedures: (1) global-scaling procedures, where counts are scaled by a single factor per lane (e.g., total count as in RPKM, count for a housekeeping gene, or a single quantile of the count distribution); (2) full-quantile (FQ) normalization procedures, where all quantiles of the count distributions are matched between lanes; and (3) procedures based on generalized linear models (GLM). They demonstrate the large impact of normalization on differential expression results; in some contexts, sensitivity varies more between normalization procedures than between DE methods. Standard total-count normalization (cf. RPKM) tends to be heavily affected by a relatively small proportion of highly-expressed genes and can lead to biased DE results, while the upper-quartile (UQ) and full-quantile normalization procedures proposed in [3] tend to be more robust and improve sensitivity without loss of specificity.
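For concreteness, here is a minimal base-R sketch of upper-quartile scaling on a toy count matrix (one of the global-scaling procedures above): each lane is rescaled so that the lanes share a common upper quartile of non-zero gene counts.

```r
set.seed(1)
counts <- matrix(rnbinom(5000 * 4, mu = 30, size = 5), nrow = 5000)  # toy gene x lane counts

uq <- apply(counts, 2, function(x) quantile(x[x > 0], 0.75))  # per-lane upper quartile of non-zero counts
norm_counts <- sweep(counts, 2, uq / mean(uq), "/")           # scale lanes to a common upper quartile
```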

In this article, we propose three different strategies to normalize RNA-Seq data for GC-content following a within-lane (i.e., sample-specific) gene-level approach. We examine their performance on two different types of data: a new RNA-Seq dataset for yeast grown in three different media and well-known benchmarking RNA-Seq datasets for two types of human reference samples from the MicroArray Quality Control (MAQC) Project [23]. For the latter datasets, the gene expression measures from qRT-PCR and Affymetrix chips serve as useful standards for performance assessment of RNA-Seq. We compare our approaches to the state-of-the-art CQN procedure of Hansen et al. [12] (which was shown to outperform competing methods such as that of Pickrell et al. [15]), in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. We demonstrate how properly correcting for GC-content bias, as well as for between-lane differences in count distributions, leads to more accurate estimation of gene expression levels and fold-changes, making statistical inference of differential expression less prone to false discoveries. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq.
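As a brief, hedged sketch of the package interface, a two-step normalization of this kind can be applied to a count matrix with per-gene GC-content using EDASeq's withinLaneNormalization() and betweenLaneNormalization() (the toy data below stand in for a real dataset):

```r
library(EDASeq)

set.seed(1)
counts <- matrix(rnbinom(2000 * 4, mu = 50, size = 5), nrow = 2000)  # toy gene x lane counts
gc <- runif(2000, 0.3, 0.7)                                          # per-gene GC-content

norm1 <- withinLaneNormalization(counts, gc, which = "full")  # adjust the GC effect within each lane
norm2 <- betweenLaneNormalization(norm1, which = "full")      # match count distributions across lanes
```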


7.3 Normalization by deconvolution

As previously mentioned, composition biases will be present when any unbalanced differential expression exists between samples. Consider the simple example of two cells, where a single gene X is upregulated in one cell (A) compared to the other cell (B). This upregulation means that either (i) more sequencing resources are devoted to X in A, thus decreasing coverage of all other non-DE genes when the total library size of each cell is experimentally fixed (e.g., due to library quantification), or (ii) the library size of A increases when X is assigned more reads or UMIs, increasing the library size factor and yielding smaller normalized expression values for all non-DE genes. In both cases, the net effect is that non-DE genes in A will incorrectly appear to be downregulated compared to B.

The removal of composition biases is a well-studied problem for bulk RNA sequencing data analysis. Normalization can be performed with the estimateSizeFactorsForMatrix() function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014) or with the calcNormFactors() function (Robinson and Oshlack 2010) in the edgeR package. Both assume that most genes are not DE between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias and is used to compute an appropriate size factor for its removal.

However, single-cell data can be problematic for these bulk normalization methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the size of the counts for accurate size factor estimation (Lun, Bach, and Marioni 2016). Pool-based size factors are then “deconvolved” into cell-based factors for normalization of each cell’s expression profile. This is performed using the calculateSumFactors() function from scran, as shown below.

We use a pre-clustering step with quickCluster(), where cells in each cluster are normalized separately and the size factors are rescaled to be comparable across clusters. This avoids the assumption that most genes are non-DE across the entire population; only a non-DE majority is required between pairs of clusters, which is a weaker assumption for highly heterogeneous populations. By default, quickCluster() will use an approximate algorithm for PCA based on methods from the irlba package. The approximation relies on stochastic initialization, so we need to set the random seed (via set.seed()) for reproducibility.
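A minimal sketch of this workflow, assuming a SingleCellExperiment object named sce that holds raw counts (e.g., the Zeisel brain dataset); computeSumFactors() stores the deconvolved size factors in the object so that they are used by downstream log-normalization.

```r
library(scran)
library(scuttle)

set.seed(100)                                   # quickCluster() uses stochastic PCA initialization
clust <- quickCluster(sce)                      # pre-clustering to relax the non-DE assumption
sce <- computeSumFactors(sce, cluster = clust)  # pool-and-deconvolve size factors per cell
summary(sizeFactors(sce))

# Compare deconvolution size factors with simple library size factors (cf. Figure 7.2).
lib.sf <- librarySizeFactors(sce)
plot(lib.sf, sizeFactors(sce), log = "xy",
     xlab = "Library size factor", ylab = "Deconvolution size factor")
abline(0, 1, col = "red")                       # identity line

sce <- logNormCounts(sce)                       # log-expression values using the deconvolution factors
```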

We see that the deconvolution size factors exhibit cell type-specific deviations from the library size factors in Figure 7.2. This is consistent with the presence of composition biases that are introduced by strong differential expression between cell types. Use of the deconvolution size factors adjusts for these biases to improve normalization accuracy for downstream applications.

Figure 7.2: Deconvolution size factor for each cell in the Zeisel brain dataset, compared to the equivalent size factor derived from the library size. The red line corresponds to identity between the two size factors.

Accurate normalization is most important for procedures that involve estimation and interpretation of per-gene statistics. For example, composition biases can compromise DE analyses by systematically shifting the log-fold changes in one direction or another. However, it tends to provide less benefit over simple library size normalization for cell-based analyses such as clustering. The presence of composition biases already implies strong differences in expression profiles, so changing the normalization strategy is unlikely to affect the outcome of a clustering procedure.


Abstract

In high-throughput omics disciplines like transcriptomics, researchers face a need to assess the quality of an experiment prior to an in-depth statistical analysis. To efficiently analyze such voluminous collections of data, researchers need triage methods that are both quick and easy to use. Such a normalization method for relative quantitation, CONSTANd, was recently introduced for isobarically-labeled mass spectra in proteomics. It transforms the data matrix of abundances through an iterative, convergent process enforcing three constraints: (I) identical column sums; (II) each row sum is fixed (across matrices); and (III) each row sum is identical to all other row sums. In this study, we investigate whether CONSTANd is suitable for count data from massively parallel sequencing, by qualitatively comparing its results to those of DESeq2. Further, we propose an adjustment of the method so that it may be applied to identically balanced but differently sized experiments for joint analysis. We find that CONSTANd can process large data sets at well over 1 million count records per second whilst mitigating unwanted systematic bias, thus quickly uncovering the underlying biological structure when combined with a PCA plot or hierarchical clustering. Moreover, it allows joint analysis of data sets obtained from different batches, with different protocols and from different labs, without exploiting information from the experimental setup other than the delineation of samples into identically processed sets (IPSs). CONSTANd’s simplicity and applicability to proteomics as well as transcriptomics data make it an interesting candidate for integration in multi-omics workflows.
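To illustrate the idea only (this is a simplified sketch in the spirit of iterative proportional fitting, not the published CONSTANd implementation; the function name and targets are made up), the constraints can be enforced by alternately rescaling rows and columns until convergence:

```r
# Simplified CONSTANd-style normalization sketch: alternate row and column rescaling.
constand_sketch <- function(mat, row_target = 1, max_iter = 50, tol = 1e-6) {
  col_target <- row_target * nrow(mat) / ncol(mat)         # totals must agree: m * r = n * c
  for (i in seq_len(max_iter)) {
    mat <- sweep(mat, 1, rowSums(mat) / row_target, "/")   # enforce identical, fixed row sums
    mat <- sweep(mat, 2, colSums(mat) / col_target, "/")   # enforce identical column sums
    err <- max(abs(rowSums(mat) - row_target), abs(colSums(mat) - col_target))
    if (err < tol) break                                   # stop once both constraints (nearly) hold
  }
  mat
}

set.seed(1)
counts <- matrix(rpois(5 * 4, lambda = 50), nrow = 5)      # toy 5-gene x 4-sample matrix
norm <- constand_sketch(counts)
round(rowSums(norm), 3); round(colSums(norm), 3)
```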


RNA-Seq library normalization and experimental setup - Biology

Identifying the relevant genes (or other genomic features such as transcripts, miRNAs, lncRNAs, etc.) across conditions (e.g. tumor and non-tumor tissue samples) is a common research interest in gene-expression studies. In this gene-selection setting, researchers are often interested in detecting a small set of genes for diagnostic purposes in medicine, which involves identifying the minimal subset of genes that achieves maximal predictive performance: the biomarker discovery and classification problem.

VoomDDA is a decision support tool developed for RNA-Sequencing datasets to assist researchers in their decisions for the diagnostic biomarker discovery and classification problem. VoomDDA consists of both sparse and non-sparse statistical learning classifiers adapted with the voom method. Voom is a recent method that estimates the mean-variance relationship of the log-counts of RNA-Seq data (log counts per million, log-cpm) at the observational level. It also provides precision weights for each observation that can be incorporated with the log-cpm values for further analysis. The algorithms in our tool incorporate the log-cpm values and the corresponding precision weights into the biomarker discovery and classification problem. For this purpose, these algorithms use weighted statistics in estimating the discriminant functions of the underlying statistical learning algorithms.
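For context, here is a hedged sketch of the voom transformation itself on a toy count matrix, using limma and edgeR; v$E holds the log-cpm values and v$weights the observation-level precision weights that the VoomDDA classifiers build on.

```r
library(limma)
library(edgeR)

set.seed(1)
counts <- matrix(rnbinom(1000 * 6, mu = 20, size = 10), nrow = 1000)  # toy gene x sample counts
group <- factor(rep(c("A", "B"), each = 3))
design <- model.matrix(~ group)

dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)            # library size normalization factors
v <- voom(dge, design, plot = FALSE)   # estimate the mean-variance trend

head(v$E)        # log2 counts per million (log-cpm)
head(v$weights)  # observation-level precision weights
```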

VoomNSC is a sparse classifier that is developed to bring together two powerful methods for RNA-Seq classification:

1. to extend the voom method to RNA-Seq classification studies,
2. to make the nearest shrunken centroids (NSC) algorithm available for RNA-Seq technology.

VoomNSC provides fast, accurate, and sparse classification results for RNA-Seq data. More details can be found in the research paper [11]. This tool also includes RNA-Seq extensions of diagonal linear and diagonal quadratic discriminant classifiers: (i) voomDLDA and (ii) voomDQDA.



Tutorial

Two example datasets are available in the voomDDA web application: cervical cancer is a miRNA dataset and lung cancer is a gene expression dataset. For GO analysis, users should select the appropriate option (miRNA or gene) to obtain the related analysis results.

The VoomDDA application requires three inputs from the user. The train and test sets should be text files (.txt) that contain the raw mapped read counts in matrix form, where rows correspond to genomic features (for simplicity of language, let's say genes) and columns correspond to observations (or samples). This type of count data can be obtained from feature-counting software such as HTSeq [1] or featureCounts [2]. Note that the count data should contain the raw number of mapped reads and should not be normalized or contain RPKM values. Class labels should also be supplied in a text file (.txt) containing the condition of each sample, with each row holding exactly one label for one sample. Example files for the Witten et al. cervical dataset are provided with the application.
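As an informal illustration of the expected format, the three inputs could be read into R as follows (the file names are hypothetical placeholders, not files shipped with the application):

```r
# Genes in rows, samples in columns; raw mapped read counts only (no normalization, no RPKM).
train  <- as.matrix(read.table("train_counts.txt", header = TRUE, row.names = 1, sep = "\t"))
test   <- as.matrix(read.table("test_counts.txt",  header = TRUE, row.names = 1, sep = "\t"))
labels <- scan("train_labels.txt", what = character())  # one condition label per line / per sample

stopifnot(ncol(train) == length(labels))  # every training sample needs exactly one class label
```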

If the purpose is the prediction of class labels for new test observations, users should upload all three files. However, a test set is not required when the purpose is only the identification of diagnostic biomarkers.

After uploading the data, make sure that it is displayed on the screen.

2. Pre-processing the Data

The VoomDDA classifiers (VoomNSC, VoomDLDA and VoomDQDA) introduced in this application share the assumptions of the voom+limma pipeline [3]; that is, rows with zero or very low counts should be filtered out. RNA-Seq count data often contain rows with a single unique value (mostly zero). Such data may lead to unreliable estimation of the mean-variance relationship and to unstable model fitting for the introduced classifiers. Three filtering criteria are available: (i) DESeq2 outlier and independent filtering, (ii) near-zero variance filtering, and (iii) variance filtering.

The DESeq2 package [4] provides a filtering criterion based on outlier detection and independent filtering. Outliers are detected based on Cook's distance, and independent filtering is applied based on the gene-wise mean of normalized counts. More details can be found in the vignette of the DESeq2 package [5].

Near-zero variance filtering is described in the caret package of R [6]. This package applies filtering based on two criteria: (i) the ratio of the frequency of the most frequent value to the frequency of the second most frequent value is higher than 19 (95/5), and (ii) the number of unique values divided by the sample size is less than 10%.
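For reference, a hedged sketch of an equivalent near-zero variance filter applied directly in R with caret::nearZeroVar() on a toy count matrix; nearZeroVar() screens columns, so the genes-by-samples matrix is transposed first.

```r
library(caret)

set.seed(1)
counts <- matrix(rnbinom(5000 * 20, mu = 5, size = 0.5), nrow = 5000)  # toy genes x samples counts

nzv <- nearZeroVar(t(counts), freqCut = 95/5, uniqueCut = 10)  # indices of near-zero variance genes
filtered <- if (length(nzv) > 0) counts[-nzv, ] else counts    # drop the flagged genes
nrow(counts) - nrow(filtered)                                  # number of genes removed
```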

Variance filtering is another option to filter out non-informative genes. This option may also be selected to decrease the computational cost of model building for very large datasets. After selecting this option, users can enter the number of genes to be included in the classification models.
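A simple base-R sketch of such variance filtering, reusing the toy counts matrix from the previous example and keeping the top N most variable genes (N mimics the user-chosen number):

```r
n_keep <- 2000                                 # user-chosen number of genes to keep
gene_var <- apply(counts, 1, var)              # per-gene variance across samples
keep <- order(gene_var, decreasing = TRUE)[seq_len(min(n_keep, nrow(counts)))]
counts_top <- counts[keep, ]                   # most variable genes retained for model building
```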

After selecting one or more filtering criteria, the filtering statistics are displayed on the screen.

Library sizes for each observation depend on the experimental design and may introduce technical biases. These biases can have a significant effect on the classification results and should be corrected before classification model building. In our experiments, we found that normalization has a significant effect on the classification results for datasets with very large library size differences across samples. Two normalization approaches are available in the application: (i) DESeq median ratio [7] and (ii) trimmed mean of M values (TMM) [8]. More details about these approaches can be found in the referenced papers.
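For reference, both kinds of normalization factors can be computed directly in R from a raw count matrix; this is a hedged sketch of the equivalent calculations, not the application's internal code.

```r
library(DESeq2)
library(edgeR)

set.seed(1)
counts <- matrix(rnbinom(5000 * 20, mu = 50, size = 5), nrow = 5000)  # toy genes x samples counts

# (i) DESeq median-ratio size factors
sf <- estimateSizeFactorsForMatrix(counts)

# (ii) edgeR TMM normalization factors
dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge, method = "TMM")

sf                        # one size factor per sample
dge$samples$norm.factors  # one TMM factor per sample
```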

3. Model Building for Classification

After data processing, users can build classification models with the three introduced algorithms: (i) voomNSC, (ii) voomDLDA, (iii) voomDQDA. VoomNSC is a sparse classifier that brings together two powerful methods, the voom method [3] and the nearest shrunken centroids algorithm [9], for the classification of RNA-Seq data. VoomDLDA and voomDQDA are non-sparse classifiers that are extensions of diagonal discriminant classifiers [10]. Details of these classifiers are given in the referenced paper [11].

After selecting any of the three classifiers, a summary of the fitting process is displayed on the screen. A confusion matrix and several statistical diagnostic measures are given to examine how well the classifier fits the given data. Furthermore, a heatmap plot is constructed to display the expression levels of genes and the gene-wise and sample-wise relationships. The heatmap is displayed for the full set of genes for the non-sparse classifiers, and for the selected gene subset for the sparse voomNSC classifier.

4. Identification of Diagnostic Biomarkers

If VoomNSC is the selected classifier, the subset of genes most relevant to the class condition is identified and the gene names are displayed on the screen. Several plots are also given. The first plot demonstrates the selection of the threshold parameter; the parameter that yields the most accurate and sparsest model is identified as optimal. The second plot displays the distribution of the selected genes in each class. The third plot displays the shrunken differences of the selected genes. The final plot is the heatmap discussed in the previous section.

Based on the selected classifier, predictions appear on the screen for each test observation. Note that the test observations should be processed in the same way as the training observations: the same experimental and computational procedures should be applied before obtaining the raw count data. The test data should be in the same format as the training data, should contain the raw mapped read counts, and the gene names should match those of the training data.

The VoomDDA application filters and normalizes the test data based on the information obtained from the training data; that is, the parameters estimated from the training data are applied to the test data. This guarantees that both sets are on the same scale and homoscedastic with respect to each other.

6. Downstream Analysis

After detecting diagnostic biomarkers via the voomNSC algorithm, it may be useful to visualize the results to see interactions or to perform further analyses, such as GO analysis. For this purpose, several downstream analysis tools are also available in this web application. These tools include heatmaps, network analysis, and gene ontology analysis. Detailed information about gene ontology analysis can be found in the topGO Bioconductor package.

References

[1] Anders, S., Pyl, P.T., and Huber, W. (2015) HTSeq - a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2):166-9.

[2] Liao, Y., Smyth, G.K., and Shi, W. (2013). featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics. doi: 10.1093/bioinformatics/btt656.

[3] Law, C.W., Chen, Y., Shi, W. and Smyth, G.K. (2014). voom: Precision weights unlock linear model analysis tools for RNA-Seq read counts. Genome Biology 15:R29.

[4] Love, M.I., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15:550. doi:10.1186/s13059-014-0550-8.

[5] Love, M.I., Huber, W. and Anders, S. (2015). Differential analysis of count data – the DESeq2 package. http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf (19.06.2015).

[6] Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software 28(5).

[7] Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology 11:R106. doi:10.1186/gb-2010-11-10-r106.

[8] Robinson, M.D., and Oshlack, A. (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology 11:R25.

[9] Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS 99(10): 6567–72.

[10] Dudoit, S., Fridlyand, J. and Speed, T.P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association 97(457): 77-87.

[11] Zararsiz, G., Goksuluk, D, Korkmaz, S., et al. (2015). VoomDDA: Discovery of Diagnostic Biomarkers and Classification of RNA-Seq Data.


Introduction

RNA-sequencing (RNA-seq) has become the primary technology used for gene expression profiling, with the genome-wide detection of differentially expressed genes between two or more conditions of interest being one of the most commonly asked questions by researchers. The edgeR (Robinson, McCarthy, and Smyth 2010) and limma packages (Ritchie et al. 2015), available from the Bioconductor project (Huber et al. 2015), offer a well-developed suite of statistical methods for dealing with this question for RNA-seq data.

In this article, we describe an edgeR-limma workflow for analysing RNA-seq data that takes gene-level counts as its input, and moves through pre-processing and exploratory data analysis before obtaining lists of differentially expressed (DE) genes and gene signatures. This analysis is enhanced through the use of interactive graphics from the Glimma package (Su et al. 2017), which allows for a more detailed exploration of the data at both the sample and gene level than is possible using static R plots.

The experiment analysed in this workflow is from Sheridan et al. (2015) (Sheridan et al. 2015) and consists of three cell populations (basal, luminal progenitor (LP) and mature luminal (ML)) sorted from the mammary glands of female virgin mice, each profiled in triplicate. RNA samples were sequenced across three batches on an Illumina HiSeq 2000 to obtain 100 base-pair single-end reads. The analysis outlined in this article assumes that reads obtained from an RNA-seq experiment have been aligned to an appropriate reference genome and summarised into counts associated with gene-specific regions. In this instance, reads were aligned to the mouse reference genome (mm10) using the R based pipeline available in the Rsubread package (specifically the align function (Liao, Smyth, and Shi 2013) followed by featureCounts (Liao, Smyth, and Shi 2014) for gene-level summarisation based on the in-built mm10 RefSeq-based annotation).
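A hedged sketch of that alignment and summarisation step with Rsubread (the index, FASTQ, and BAM file names are placeholders):

```r
library(Rsubread)

buildindex(basename = "mm10_index", reference = "mm10.fa")   # one-off index build from the genome FASTA
align(index = "mm10_index", readfile1 = "sample1.fastq.gz",
      output_file = "sample1.bam")                           # align 100 bp single-end reads
fc <- featureCounts("sample1.bam", annot.inbuilt = "mm10")   # gene-level counts, in-built RefSeq annotation
head(fc$counts)
```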


Cuffdiff options:

-h/–help

Prints the help message and exits

-o/–output-dir <string>

Sets the name of the directory in which Cuffdiff will write all of its output. The default is “./”.

-L/–labels <label1,label2,…,labelN>

Specify a label for each sample, which will be included in various output files produced by Cuffdiff.

-p/–num-threads <int>

Use this many threads to align reads. The default is 1.

-T/–time-series

Instructs Cuffdiff to analyze the provided samples as a time series, rather than testing for differences between all pairs of samples. Samples should be provided in increasing time order at the command line (e.g first time point SAM, second timepoint SAM, etc.)

–total-hits-norm

With this option, Cufflinks counts all fragments, including those not compatible with any reference transcript, towards the number of mapped fragments used in the FPKM denominator. It is inactive by default.

–compatible-hits-norm

With this option, Cufflinks counts only those fragments compatible with some reference transcript towards the number of mapped fragments used in the FPKM denominator. Using this mode is generally recommended in Cuffdiff to reduce certain types of bias caused by differential amounts of ribosomal reads which can create the impression of falsely differentially expressed genes. It is active by default.

-b/–frag-bias-correct <genome.fa>

Providing Cufflinks with the multifasta file your reads were mapped to via this option instructs it to run our bias detection and correction algorithm, which can significantly improve accuracy of transcript abundance estimates. See How Cufflinks Works for more details.

-u/–multi-read-correct

Tells Cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome. See How Cufflinks Works for more details.

-c/–min-alignment-count <int>

The minimum number of alignments in a locus needed to conduct significance testing on changes in that locus observed between samples. If no testing is performed, changes in the locus are deemed not significant, and the locus’ observed changes don’t contribute to correction for multiple testing. The default is 10 fragment alignments.

-M/–mask-file <mask.(gtf/gff)>

Tells Cuffdiff to ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts, or other abundant transcripts you wish to ignore in your analysis in this file. Due to the variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates.

–FDR <float>

The allowed false discovery rate. The default is 0.05.

–library-type

–library-norm-method

–dispersion-method

Cuffdiff advanced options:

-m/–frag-len-mean <int>

This is the expected (mean) fragment length. The default is 200bp.

Note: Cuffdiff now learns the fragment length mean for each SAM file, so using this option is no longer recommended with paired-end reads.

-s/–frag-len-std-dev <int>

The standard deviation for the distribution on fragment lengths. The default is 80bp.

Note: Cuffdiff now learns the fragment length standard deviation for each SAM file, so using this option is no longer recommended with paired-end reads.

–max-mle-iterations <int>

Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000

-v/–verbose

Print lots of status updates and other diagnostic information.

-q/–quiet

Suppress messages other than serious warnings and errors.

–no-update-check

Turns off the automatic routine that contacts the Cufflinks server to check for a more recent version.

–poisson-dispersion

Use the Poisson fragment dispersion model instead of learning one in each condition.

–emit-count-tables

Cuffdiff will output a file for each condition (called <sample>_counts.txt) containing the fragment counts, fragment count variances, and fitted variance model. For internal debugging only. This option will be removed in a future version of Cuffdiff.

-F/–min-isoform-fraction <0.0-1.0>

Cuffdiff will round down to zero the abundance of alternative isoforms quantified at below the specified fraction of the major isoforms. This is done after MLE estimation but before MAP estimation to improve robustness of confidence interval generation and differential expression analysis. The default is 1e-5, and we recommend you not alter this parameter.

–max-bundle-frags <int>

Sets the maximum number of fragments a locus may have before being skipped. Skipped loci are marked with status HIDATA. Default: 1000000

–max-frag-count-draws <int>

Cuffdiff will make this many draws from each transcript’s predicted negative binomial random number generator. Each draw is a number of fragments that will be probabilistically assigned to the transcripts in the transcriptome. Used to estimate the variance-covariance matrix on assigned fragment counts. Default: 100.

–max-frag-assign-draws <int>

For each fragment drawn from a transcript, Cuffdiff will assign it this many times (probabilistically), thus estimating the assignment uncertainty for each transcript. Used to estimate the variance-covariance matrix on assigned fragment counts. Default: 50.

–min-reps-for-js-test <int>

Cuffdiff won’t test genes for differential regulation unless the conditions in question have at least this many replicates. Default: 3.

–no-effective-length-correction

Cuffdiff will not employ its “effective” length normalization to transcript FPKM.

–no-length-correction

Cuffdiff will not normalize fragment counts by transcript length at all. Use this option when fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Experimental option, use with caution.

Cuffdiff takes a GTF2/GFF3 file of transcripts as input, along with two or more SAM files containing the fragment alignments for two or more samples. It produces a number of output files that contain test results for changes in expression at the level of transcripts, primary transcripts, and genes. It also tracks changes in the relative abundance of transcripts sharing a common transcription start site, and in the relative abundances of the primary transcripts of each gene. Tracking the former allows one to see changes in splicing, and the latter lets one see changes in relative promoter use within a gene.

If you have more than one replicate for a sample, supply the SAM files for the sample as a single comma-separated list. It is not necessary to have the same number of replicates for each sample.

Note that Cuffdiff can also accept BAM files (which are binary, compressed SAM files). It can also accept CXB files produced by Cuffquant. Note that mixing SAM and BAM files is supported, but you cannot currently mix CXB and SAM/BAM. If one of the samples is supplied as a CXB file, all of the samples must be supplied as CXB files.

Cuffdiff requires that transcripts in the input GTF be annotated with certain attributes in order to look for changes in primary transcript expression, splicing, coding output, and promoter use. These attributes are:

tss_id

The ID of this transcript’s inferred start site. Determines which primary transcript this processed transcript is believed to come from. Cuffcompare appends this attribute to every transcript reported in the .combined.gtf file.

p_id

The ID of the coding sequence this transcript contains. This attribute is attached by Cuffcompare to the .combined.gtf records only when it is run with a reference annotation that includes CDS records. Further, differential CDS analysis is only performed when all isoforms of a gene have p_id attributes, because neither Cufflinks nor Cuffcompare attempt to assign an open reading frame to transcripts.

Note: If an arbitrary GTF/GFF3 file is used as input (instead of the .combined.gtf file produced by Cuffcompare), these attributes will not be present, but Cuffcompare can still be used to obtain these attributes with a command like this:

The resulting cuffcmp.combined.gtf file created by this command will have the tss_id and p_id attributes added to each record and this file can be used as input for cuffdiff.

FPKM tracking files

Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group. The results are output in FPKM tracking files in the format described here. There are four FPKM tracking files:

isoforms.fpkm_tracking Transcript FPKMs
genes.fpkm_tracking Gene FPKMs. Tracks the summed FPKM of transcripts sharing each gene_id
cds.fpkm_tracking Coding sequence FPKMs. Tracks the summed FPKM of transcripts sharing each p_id, independent of tss_id
tss_groups.fpkm_tracking Primary transcript FPKMs. Tracks the summed FPKM of transcripts sharing each tss_id

Count tracking files

Cuffdiff estimates the number of fragments that originated from each transcript, primary transcript, and gene in each sample. Primary transcript and gene counts are computed by summing the counts of transcripts in each primary transcript group or gene group. The results are output in count tracking files in the format described here. There are four Count tracking files:

isoforms.count_tracking Transcript counts
genes.count_tracking Gene counts. Tracks the summed counts of transcripts sharing each gene_id
cds.count_tracking Coding sequence counts. Tracks the summed counts of transcripts sharing each p_id, independent of tss_id
tss_groups.count_tracking Primary transcript counts. Tracks the summed counts of transcripts sharing each tss_id

Read group tracking files

Cuffdiff calculates the expression and fragment count for each transcript, primary transcript, and gene in each replicate. The results are output in per-replicate tracking files in the format described here. There are four read group tracking files:

isoforms.read_group_tracking Transcript read group tracking
genes.read_group_tracking Gene read group tracking. Tracks the summed expression and counts of transcripts sharing each gene_id in each replicate
cds.read_group_tracking Coding sequence FPKMs. Tracks the summed expression and counts of transcripts sharing each p_id, independent of tss_id in each replicate
tss_groups.read_group_tracking Primary transcript FPKMs. Tracks the summed expression and counts of transcripts sharing each tss_id in each replicate

Differential expression tests

This tab delimited file lists the results of differential expression testing between samples for spliced transcripts, primary transcripts, genes, and coding sequences. Four files are created:

isoform_exp.diff Transcript-level differential expression.
gene_exp.diff Gene-level differential expression. Tests differences in the summed FPKM of transcripts sharing each gene_id
tss_group_exp.diff Primary transcript differential expression. Tests differences in the summed FPKM of transcripts sharing each tss_id
cds_exp.diff Coding sequence differential expression. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id

Each of the above files has the following format:

Column number Column name Example Description
1 Tested id XLOC_000001 A unique identifier describing the transcript, gene, primary transcript, or CDS being tested
2 gene Lypla1 The gene_name(s) or gene_id(s) being tested
3 locus chr1:4797771-4835363 Genomic coordinates for easy browsing to the genes or transcripts being tested.
4 sample 1 Liver Label (or number if no labels provided) of the first sample being tested
5 sample 2 Brain Label (or number if no labels provided) of the second sample being tested
6 Test status NOTEST Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
7 FPKMx 8.01089 FPKM of the gene in sample x
8 FPKMy 8.551545 FPKM of the gene in sample y
9 log2(FPKMy/FPKMx) 0.06531 The (base 2) log of the fold change y/x
10 test stat 0.860902 The value of the test statistic used to compute significance of the observed change in FPKM
11 p value 0.389292 The uncorrected p-value of the test statistic
12 q value 0.985216 The FDR-adjusted p-value of the test statistic
13 significant no Can be either “yes” or “no”, depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing
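As a small usage example, the gene-level results can be loaded into R and filtered to the significant calls (the file name assumes Cuffdiff's default output location, and the column names status, significant, p_value and q_value are assumed to match the file header):

```r
de <- read.table("gene_exp.diff", header = TRUE, sep = "\t", stringsAsFactors = FALSE)

sig <- subset(de, status == "OK" & significant == "yes")   # loci tested and called DE at the chosen FDR
sig <- sig[order(sig$q_value), c("gene", "locus", "p_value", "q_value")]
head(sig)
```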

Differential splicing tests - splicing.diff

This tab delimited file lists, for each primary transcript, the amount of isoform switching detected among its isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript. Only primary transcripts from which two or more isoforms are spliced are listed in this file.

Column number Column name Example Description
1 Tested id TSS10015 A unique identifier describing the primary transcript being tested.
2 gene name Rtkn The gene_name or gene_id that the primary transcript being tested belongs to
3 locus chr6:83087311-83102572 Genomic coordinates for easy browsing to the genes or transcripts being tested.
4 sample 1 Liver Label (or number if no labels provided) of the first sample being tested
5 sample 2 Brain Label (or number if no labels provided) of the second sample being tested
6 Test status OK Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
7 Reserved 0
8 Reserved 0
9 √JS(x,y) 0.22115 The amount of isoform switching between the isoforms originating from this TSS, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the splice variants
10 test stat 0.22115 The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
11 p value 0.000174982 The uncorrected p-value of the test statistic.
12 q value 0.985216 The FDR-adjusted p-value of the test statistic
13 significant no Can be either “yes” or “no”, depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing

Differential coding output - cds.diff

This tab delimited file lists, for each gene, the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples. Only genes producing two or more distinct CDS (i.e. multi-protein genes) are listed here.

Column number Column name Example Description
1 Tested id XLOC_000002 A unique identifier describing the gene being tested.
2 gene name Atp6v1h The gene_name or gene_id
3 locus chr1:5073200-5152501 Genomic coordinates for easy browsing to the genes or transcripts being tested.
4 sample 1 Liver Label (or number if no labels provided) of the first sample being tested
5 sample 2 Brain Label (or number if no labels provided) of the second sample being tested
6 Test status OK Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
7 Reserved 0
8 Reserved 0
9 √JS(x,y) 0.0686517 The CDS overloading of the gene, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the coding sequences
10 test stat 0.0686517 The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
11 p value 0.00546783 The uncorrected p-value of the test statistic
12 q value 0.985216 The FDR-adjusted p-value of the test statistic
13 significant no Can be either “yes” or “no”, depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing

Differential promoter use - promoters.diff

This tab delimited file lists, for each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples. Only genes producing two or more distinct primary transcripts (i.e. multi-promoter genes) are listed here.

Column number Column name Example Description
1 Tested id XLOC_000019 A unique identifier describing the gene being tested.
2 gene name Tmem70 The gene_name or gene_id
3 locus chr1:16651657-16668357 Genomic coordinates for easy browsing to the genes or transcripts being tested.
4 sample 1 Liver Label (or number if no labels provided) of the first sample being tested
5 sample 2 Brain Label (or number if no labels provided) of the second sample being tested
6 Test status OK Can be one of OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL, when an ill-conditioned covariance matrix or other numerical exception prevents testing.
7 Reserved 0
8 Reserved 0
9 √JS(x,y) 0.0124768 The promoter overloading of the gene, as measured by the square root of the Jensen-Shannon divergence computed on the relative abundances of the primary transcripts
10 test stat 0.0124768 The value of the test statistic used to compute significance of the observed overloading, equal to √JS(x,y)
11 p value 0.394327 The uncorrected p-value of the test statistic
12 q value 0.985216 The FDR-adjusted p-value of the test statistic
13 significant no Can be either “yes” or “no”, depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple-testing

Read group info - read_groups.info

This tab delimited file lists, for each replicate, key properties used by Cuffdiff during quantification, such as library normalization factors. The read_groups.info file has the following format:

Column number Column name Example Description
1 file mCherry_rep_A/accepted_hits.bam BAM or SAM file containing the data for the read group
2 condition mCherry Condition to which the read group belongs
3 replicate_num 0 Replicate number of the read group
4 total_mass 4.72517e+06 Total number of fragments for the read group
5 norm_mass 4.72517e+06 Fragment normalization constant used during calculation of FPKMs.
6 internal_scale 1.23916 Scaling factor used to normalize for library size
7 external_scale 1.0 Currently unused, and always equal to 1.0.



