Information

Is my RNA-seq experimental design correct to use it for SNP calling?


I am a newbie here and would highly appreciate your advice about one particular experimental design.

We have data from RNAseq experiment which was originally designed to assess differential expression. The details of experiment are as follows:

2 modalities of the phenotype

Each phenotype is represented by 4 samples. 1 sample = 60 individuals pooled together at the stage of RNA isolation.

Molecule - polyadenylated mRNA

Sequencing chemistry - Illumina paired-end, read length - 2*100 bp

My question is whether it is correct to use this RNAseq data to call for SNPs? I made previous search and found that most of people calling SNP from RNAseq use 40-1000 samples (= individuals). But they initially designed RNAseq experiment for further GWAS. I see that this analysis cannot be applied to my data (at least because in my case individual flies were pooled without barcoding - 60 flies per a sample). However, can I still call for SNPs and upload the list to database as a list of potential targets for GWAS with, for example, estimation of functional impact upon protein structure? Will they be “true” SNPs, or our experimental design makes even this step invalid?

I found this paper https://www.ncbi.nlm.nih.gov/pubmed/27458203 where people used 2 phenotypes each represented by 2 samples what is almost like our experiment, but still have doubts.


Using RNAseq for SNP analysis is not the best tool available for several reasons. First, you will only find SNP in genes that are expressed (you need more than 120 reads per SNP in your experiment). Second, you will only find SNP in coding regions of the genes. For your particular experiment, where you have a pool of 60 individuals per sample, it's another drawback because you'll have 120 alleles in your sample (without knowing the allele combinations of the individuals).

Of course you can still try to find SNPs that are present in your phenotypes, but real GWAS kind of analysis is in my opinion not possible.


ScSNV: accurate dscRNA-seq SNV co-expression analysis using duplicate tag collapsing

Identifying single nucleotide variants has become common practice for droplet-based single-cell RNA-seq experiments however, presently, a pipeline does not exist to maximize variant calling accuracy. Furthermore, molecular duplicates generated in these experiments have not been utilized to optimally detect variant co-expression. Herein, we introduce scSNV designed from the ground up to “collapse” molecular duplicates and accurately identify variants and their co-expression. We demonstrate that scSNV is fast, with a reduced false-positive variant call rate, and enables the co-detection of genetic variants and A>G RNA edits across twenty-two samples.


Design is a fundamental step of a particular RNA-Seq experiment. Some important questions like sequencing depth/coverage or how many biological or technical replicates must be carefully considered. Design review. [5]

  • PROPER : PROspective Power Evaluation for RNAseq.
  • RNAtor Android Application to calculate optimal parameters for popular tools and kits available for DNA sequencing projects.
  • Scotty : a web tool for designing RNA-Seq experiments to measure differential gene expression.
  • ssizeRNA Sample Size Calculation for RNA-Seq Experimental Design.

Quality assessment of raw data [6] is the first step of the bioinformatics pipeline of RNA-Seq. Often, is necessary to filter data, removing low quality sequences or bases (trimming), adapters, contaminations, overrepresented sequences or correcting errors to assure a coherent final result.

Quality control Edit

  • AfterQC - Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data.
  • dupRadar[7] An R package which provides functions for plotting and analyzing the duplication rates dependent on the expression levels.
  • FastQC is a quality control tool for high-throughput sequence data (Babraham Institute) and is developed in Java. Import of data is possible from FastQ files, BAM or SAM format. This tool provides an overview to inform about problematic areas, summary graphs and tables to rapid assessment of data. Results are presented in HTML permanent reports. FastQC can be run as a stand-alone application or it can be integrated into a larger pipeline solution.
  • fastqp Simple FASTQ quality assessment using Python.
  • Kraken: [8] A set of tools for quality control and analysis of high-throughput sequence data.
  • HTSeq . [9] The Python script htseq-qa takes a file with sequencing reads (either raw or aligned reads) and produces a PDF file with useful plots to assess the technical quality of a run.
  • mRIN[10] - Assessing mRNA integrity directly from RNA-Seq data.
  • MultiQC[11] - Aggregate and visualise results from numerous tools (FastQC, HTSeq, RSeQC, Tophat, STAR, others..) across all samples into a single report.
  • NGSQC: cross-platform quality analysis pipeline for deep sequencing data.
  • NGS QC Toolkit A toolkit for the quality control (QC) of next generation sequencing (NGS) data. The toolkit comprises user-friendly stand alone tools for quality control of the sequence data generated using Illumina and Roche 454 platforms with detailed results in the form of tables and graphs, and filtering of high-quality sequence data. It also includes few other tools, which are helpful in NGS data quality control and analysis.
  • PRINSEQ is a tool that generates summary statistics of sequence and quality data and that is used to filter, reformat and trim next-generation sequence data. It is particular designed for 454/Roche data, but can also be used for other types of sequence.
  • QC-Chain is a package of quality control tools for next generation sequencing (NGS) data, consisting of both raw reads quality evaluation and de novo contamination screening, which could identify all possible contamination sequences.
  • QC3 a quality control tool designed for DNA sequencing data for raw data, alignment, and variant calling.
  • qrqc Quickly scans reads and gathers statistics on base and quality frequencies, read length, and frequent sequences. Produces graphical output of statistics for use in quality control pipelines, and an optional HTML quality report. S4 SequenceSummary objects allow specific tests and functionality to be written around the data collected.
  • RNA-SeQC[12] is a tool with application in experiment design, process optimization and quality control before computational analysis. Essentially, provides three types of quality control: read counts (such as duplicate reads, mapped reads and mapped unique reads, rRNA reads, transcript-annotated reads, strand specificity), coverage (like mean coverage, mean coefficient of variation, 5’/3’ coverage, gaps in coverage, GC bias) and expression correlation (the tool provides RPKM-based estimation of expression levels). RNA-SeQC is implemented in Java and is not required installation, however can be run using the GenePattern web interface. The input could be one or more BAM files. HTML reports are generated as output.
  • RSeQC[13] analyzes diverse aspects of RNA-Seq experiments: sequence quality, sequencing depth, strand specificity, GC bias, read distribution over the genome structure and coverage uniformity. The input can be SAM, BAM, FASTA, BED files or Chromosome size file (two-column, plain text file). Visualization can be performed by genome browsers like UCSC, IGB and IGV. However, R scripts can also be used to visualization.
  • SAMStat[14] identifies problems and reports several statistics at different phases of the process. This tool evaluates unmapped, poorly and accurately mapped sequences independently to infer possible causes of poor mapping.
  • SolexaQA calculates sequence quality statistics and creates visual representations of data quality for second-generation sequencing data. Originally developed for the Illumina system (historically known as “Solexa”), SolexaQA now also supports Ion Torrent and 454 data.
  • Trim galore is a wrapper script to automate quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequence files (for directional, non-directional (or paired-end) sequencing).

Improving the quality Edit

Improvement of the RNA-Seq quality, correcting the bias is a complex subject. [15] [16] Each RNA-Seq protocol introduces specific type of bias, each step of the process (such as the sequencing technology used) is susceptible to generate some sort of noise or type of error. Furthermore, even the species under investigation and the biological context of the samples are able to influence the results and introduce some kind of bias. Many sources of bias were already reported – GC content and PCR enrichment, [17] [18] rRNA depletion, [19] errors produced during sequencing, [20] priming of reverse transcription caused by random hexamers. [21]

Different tools were developed to attempt to solve each of the detected errors.

Trimming and adapters removal Edit

  • BBDuk multithreaded tool to trim adapters and filter or mask contaminants based on kmer-matching, allowing a hamming- or edit-distance, as well as degenerate bases. Also performs optimal quality-trimming and filtering, format conversion, contaminant concentration reporting, gc-filtering, length-filtering, entropy-filtering, chastity-filtering, and generates text histograms for most operations. Interconverts between fastq, fasta, sam, scarf, interleaved and 2-file paired, gzipped, bzipped, ASCII-33 and ASCII-64. Keeps pairs together. Open-source, written in pure Java supports all platforms with no recompilation and no other dependencies.
  • clean_reads cleans NGS (Sanger, 454, Illumina and solid) reads. It can trim bad quality regions, adaptors, vectors, and regular expressions. It also filters out the reads that do not meet a minimum quality criteria based on the sequence length and the mean quality.
  • condetri[22] is a method for content dependent read trimming for Illumina data using quality scores of each base individually. It is independent from sequencing coverage and user interaction. The main focus of the implementation is on usability and to incorporate read trimming in next-generation sequencing data processing and analysis pipelines. It can process single-end and paired-end sequencing data of arbitrary length.
  • cutadapt[23] removes adapter sequences from next-generation sequencing data (Illumina, SOLiD and 454). It is used especially when the read length of the sequencing machine is longer than the sequenced molecule, like the microRNA case.
  • Deconseq Detect and remove contaminations from sequence data.
  • Erne-Filter[24] is a short string alignment package whose goal is to provide an all-inclusive set of tools to handle short (NGS-like) reads. ERNE comprises ERNE-FILTER (read trimming and continamination filtering), ERNE-MAP (core alignment tool/algorithm), ERNE-BS5 (bisulfite treated reads aligner), and ERNE-PMAP/ERNE-PBS5 (distributed versions of the aligners).
  • FastqMcf Fastq-mcf attempts to: Detect & remove sequencing adapters and primers Detect limited skewing at the ends of reads and clip Detect poor quality at the ends of reads and clip Detect Ns, and remove from ends Remove reads with CASAVA 'Y' flag (purity filtering) Discard sequences that are too short after all of the above Keep multiple mate-reads in sync while doing all of the above.
  • FASTX Toolkit is a set of command line tools to manipulate reads in files FASTA or FASTQ format. These commands make possible preprocess the files before mapping with tools like Bowtie. Some of the tasks allowed are: conversion from FASTQ to FASTA format, information about statistics of quality, removing sequencing adapters, filtering and cutting sequences based on quality or conversion DNA/RNA.
  • Flexbar performs removal of adapter sequences, trimming and filtering features.
  • FreClu improves overall alignment accuracy performing sequencing-error correction by trimming short reads, based on a clustering methodology.
  • htSeqTools is a Bioconductor package able to perform quality control, processing of data and visualization. htSeqTools makes possible visualize sample correlations, to remove over-amplification artifacts, to assess enrichment efficiency, to correct strand bias and visualize hits.
  • NxTrim Adapter trimming and virtual library creation routine for Illumina Nextera Mate Pair libraries.
  • PRINSEQ[25] generates statistics of your sequence data for sequence length, GC content, quality scores, n-plicates, complexity, tag sequences, poly-A/T tails, odds ratios. Filter the data, reformat and trim sequences.
  • Sabre A barcode demultiplexing and trimming tool for FastQ files.
  • Scythe A 3'-end adapter contaminant trimmer.
  • SEECER is a sequencing error correction algorithm for RNA-seq data sets. It takes the raw read sequences produced by a next generation sequencing platform like machines from Illumina or Roche. SEECER removes mismatch and indel errors from the raw reads and significantly improves downstream analysis of the data. Especially if the RNA-Seq data is used to produce a de novo transcriptome assembly, running SEECER can have tremendous impact on the quality of the assembly.
  • Sickle A windowed adaptive trimming tool for FASTQ files using quality.
  • SnoWhite[26] is a pipeline designed to flexibly and aggressively clean sequence reads (gDNA or cDNA) prior to assembly. It takes in and returns fastq or fasta formatted sequence files.
  • ShortRead is a package provided in the R (programming language) / BioConductor environments and allows input, manipulation, quality assessment and output of next-generation sequencing data. This tool makes possible manipulation of data, such as filter solutions to remove reads based on predefined criteria. ShortRead could be complemented with several Bioconductor packages to further analysis and visualization solutions (BioStrings, BSgenome, IRanges, and so on).
  • SortMeRNA is a program tool for filtering, mapping and OTU-picking NGS reads in metatranscriptomic and metagenomic data. The core algorithm is based on approximate seeds and allows for analyses of nucleotide sequences. The main application of SortMeRNA is filtering ribosomal RNA from metatranscriptomic data.
  • TagCleaner The TagCleaner tool can be used to automatically detect and efficiently remove tag sequences (e.g. WTA tags) from genomic and metagenomic datasets. It is easily configurable and provides a user-friendly interface.
  • Trimmomatic[27] performs trimming for Illumina platforms and works with FASTQ reads (single or pair-ended). Some of the tasks executed are: cut adapters, cut bases in optional positions based on quality thresholds, cut reads to a specific length, converts quality scores to Phred-33/64.
  • fastp A tool designed to provide all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported.
  • FASTX-Toolkit The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.

Detection of chimeric reads Edit

Recent sequencing technologies normally require DNA samples to be amplified via polymerase chain reaction (PCR). Amplification often generates chimeric elements (specially from ribosomal origin) - sequences formed from two or more original sequences joined together.

  • UCHIME is an algorithm for detecting chimeric sequences.
  • ChimeraSlayeris a chimeric sequence detection utility, compatible with near-full length Sanger sequences and shorter 454-FLX sequences (

Error correction Edit

High-throughput sequencing errors characterization and their eventual correction. [28]

  • Acacia Error-corrector for pyrosequenced amplicon reads.
  • AllPathsLG error correction.
  • AmpliconNoise[29] AmpliconNoise is a collection of programs for the removal of noise from 454 sequenced PCR amplicons. It involves two steps the removal of noise from the sequencing itself and the removal of PCR point errors. This project also includes the Perseus algorithm for chimera removal.
  • BayesHammer. Bayesian clustering for error correction. This algorithm is based on Hamming graphs and Bayesian subclustering. While BAYES HAMMER was designed for single-cell sequencing, it also improves on existing error correction tools for bulk sequencing data.
  • Bless[30] A bloom filter-based error correction solution for high-throughput sequencing reads.
  • Blue[31] Blue is a short-read error-correction tool based on k-mer consensus and context.
  • bf A sequencing error corrector designed for Illumina short reads. It uses a non-greedy algorithm with a speed comparable to implementations based on greedy methods.
  • Denoiser Denoiser is designed to address issues of noise in pyrosequencing data. Denoiser is a heuristic variant of PyroNoise. Developers of denoiser report a good agreement with PyroNoise on several test datasets.
  • Echo A reference-free short-read error correction algorithm.
  • Lighter. A sequencing error correction without counting.
  • LSC LSC uses short Illumina reads to corrected errors in long reads.
  • Karect Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data.
  • NoDe NoDe: an error-correction algorithm for pyrosequencing amplicon reads.
  • PyroTagger PyroTagger: A fast, accurate pipeline for analysis of rRNA amplicon pyrosequence data.
  • Quake is a tool to correct substitution sequencing errors in experiments with deep coverage for Illumina sequencing reads.
  • QuorUM: An Error Corrector for Illumina Reads.
  • Rcorrector. Error correction for Illumina RNA-seq reads.
  • Reptile is a software developed in C++ for correcting sequencing errors in short reads from next-gen sequencing platforms.
  • Seecer SEquencing Error CorrEction for Rna reads.
  • SGA.
  • SOAP denovo.
  • UNOISE.

Bias correction Edit

  • Alpine[32] Modeling and correcting fragment sequence bias for RNA-seq.
  • cqn[33] is a normalization tool for RNA-Seq data, implementing the conditional quantile normalization method.
  • EDASeq[34] is a Bioconductor package to perform GC-Content Normalization for RNA-Seq Data.
  • GeneScissors A comprehensive approach to detecting and correcting spurious transcriptome inference due to RNAseq reads misalignment.
  • Peer[35] is a collection of Bayesian approaches to infer hidden determinants and their effects from gene expression profiles using factor analysis methods. Applications of PEER have: a) detected batch effects and experimental confounders, b) increased the number of expression QTL findings by threefold, c) allowed inference of intermediate cellular traits, such as transcription factor or pathway activations.
  • RUV[36] is a R package that implements the remove unwanted variation (RUV) methods of Risso et al. (2014) for the normalization of RNA-Seq read counts between samples.
  • svaSurrogate Variable Analysis.
  • svaseq removing batch effects and other unwanted noise from sequencing data.
  • SysCall[37] is a classifier tool to identification and correction of systematic error in high-throughput sequence data.

Other tasks/pre-processing data Edit

Further tasks performed before alignment, namely paired-read mergers.

  • AuPairWise A Method to Estimate RNA-Seq Replicability through Co-expression.
  • BamHash is a checksum based method to ensure that the read pairs in FASTQ files match exactly the read pairs stored in BAM files, regardless of the ordering of reads. BamHash can be used to verify the integrity of the files stored and discover any discrepancies. Thus, BamHash can be used to determine if it is safe to delete the FASTQ files storing raw sequencing reads after alignment, without the loss of data.
  • BBMerge Merges paired reads based on overlap to create longer reads, and an insert-size histogram. Fast, multithreaded, and yields extremely few false positives. Open-source, written in pure Java supports all platforms with no recompilation and no other dependencies. Distributed with BBMap.
  • Biopieces are a collection of bioinformatics tools that can be pieced together in a very easy and flexible manner to perform both simple and complex tasks. The Biopieces work on a data stream in such a way that the data stream can be passed through several different Biopieces, each performing one specific task: modifying or adding records to the data stream, creating plots, or uploading data to databases and web services.
  • COPE[38] COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly.
  • DeconRNASeq is an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data.
  • FastQ Screen screens FASTQ format sequences to a set of databases to confirm that the sequences contain what is expected (such as species content, adapters, vectors, etc.).
  • FLASH is a read pre-processing tool. FLASH combines paired-end reads which overlap and converts them to single long reads.
  • IDCheck
  • ORNA and ORNA Q/K A tool for reducing redundancy in RNA-seq data which reduces the computational resource requirements of an assembler
  • PANDASeq.is a program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.
  • PEAR[39] PEAR: Illumina Paired-End reAd mergeR.
  • qRNASeq script The qRNAseq tool can be used to accurately eliminate PCR duplicates from RNA-Seq data if Molecular Indexes™ or other stochastic labels have been used during library prep.
  • SHERA[40] a SHortread Error-Reducing Aligner.
  • XORRO Rapid Paired-End Read Overlapper.

After quality control, the first step of RNA-Seq analysis involves alignment of the sequenced reads to a reference genome (if available) or to a transcriptome database. See also List of sequence alignment software.

Short (unspliced) aligners Edit

Short aligners are able to align continuous reads (not containing gaps result of splicing) to a genome of reference. Basically, there are two types: 1) based on the Burrows-Wheeler transform method such as Bowtie and BWA, and 2) based on Seed-extend methods, Needleman-Wunsch or Smith-Waterman algorithms. The first group (Bowtie and BWA) is many times faster, however some tools of the second group tend to be more sensitive, generating more correctly aligned reads.

  • BFAST aligns short reads to reference sequences and presents particular sensitivity towards errors, SNPs, insertions and deletions. BFAST works with the Smith-Waterman algorithm. See also seqanwers/BFAST.
  • Bowtie is a short aligner using an algorithm based on the Burrows-Wheeler transform and the FM-index. Bowtie tolerates a small number of mismatches.
  • Bowtie2 Bowtie 2 is a memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly recommended for aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM-index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
  • Burrows-Wheeler Aligner (BWA) BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.
  • Short Oligonucleotide Analysis Package (SOAP)
  • GNUMAP performs alignment using a probabilistic Needleman-Wunsch algorithm. This tool is able to handle alignment in repetitive regions of a genome without losing information. The output of the program was developed to make possible easy visualization using available software.
  • Maq first aligns reads to reference sequences and after performs a consensus stage. On the first stage performs only ungapped alignment and tolerates up to 3 mismatches.
  • Mosaik Mosaik is able to align reads containing short gaps using Smith-Waterman algorithm, ideal to overcome SNPs, insertions and deletions.
  • NovoAlign (commercial) is a short aligner to the Illumina platform based on Needleman-Wunsch algorithm. It is able to deal with bisulphite data. Output in SAM format.
  • PerM is a software package which was designed to perform highly efficient genome scale alignments for hundreds of millions of short reads produced by the ABI SOLiD and Illumina sequencing platforms. PerM is capable of providing full sensitivity for alignments within 4 mismatches for 50bp SOLID reads and 9 mismatches for 100bp Illumina reads.
  • RazerS
  • SEAL uses a MapReduce model to produce distributed computing on clusters of computers. Seal uses BWA to perform alignment and Picard MarkDuplicates to detection and duplicate read removal.
  • segemehl
  • SeqMap
  • SHRiMP employs two techniques to align short reads. Firstly, the q-gram filtering technique based on multiple seeds identifies candidate regions. Secondly, these regions are investigated in detail using Smith-Waterman algorithm.
  • SMALT
  • Stampy combines the sensitivity of hash tables and the speed of BWA. Stampy is prepared to alignment of reads containing sequence variation like insertions and deletions. It is able to deal with reads up to 4500 bases and presents the output in SAM format.
  • Subread[41] is a read aligner. It uses the seed-and-vote mapping paradigm to determine the mapping location of the read by using its largest mappable region. It automatically decides whether the read should be globally mapped or locally mapped. For RNA-seq data, Subread should be used for the purpose of expression analysis. Subread can also be used to map DNA-seq reads.
  • ZOOM (commercial) is a short aligner of the Illumina/Solexa 1G platform. ZOOM uses extended spaced seeds methodology building hash tables for the reads, and tolerates mismatches and insertions and deletions.
  • WHAM WHAM is a high-throughput sequence alignment tool developed at University of Wisconsin-Madison. It aligns short DNA sequences (reads) to the whole human genome at a rate of over 1500 million 60bit/s reads per hour, which is one to two orders of magnitudes faster than the leading state-of-the-art techniques.

Spliced aligners Edit

Many reads span exon-exon junctions and can not be aligned directly by Short aligners, thus specific aligners were necessary - Spliced aligners. Some Spliced aligners employ Short aligners to align firstly unspliced/continuous reads (exon-first approach), and after follow a different strategy to align the rest containing spliced regions - normally the reads are split into smaller segments and mapped independently. See also. [42] [43]

Aligners based on known splice junctions (annotation-guided aligners) Edit

In this case the detection of splice junctions is based on data available in databases about known junctions. This type of tools cannot identify new splice junctions. Some of this data comes from other expression methods like expressed sequence tags (EST).

  • Erange is a tool to alignment and data quantification to mammalian transcriptomes.
  • IsoformEx
  • MapAL
  • OSA
  • RNA-MATE is a computational pipeline for alignment of data from Applied Biosystems SOLID system. Provides the possibility of quality control and trimming of reads. The genome alignments are performed using mapreads and the splice junctions are identified based on a library of known exon-junction sequences. This tool allows visualization of alignments and tag counting.
  • RUM performs alignment based on a pipeline, being able to manipulate reads with splice junctions, using Bowtie and Blat. The flowchart starts doing alignment against a genome and a transcriptome database executed by Bowtie. The next step is to perform alignment of unmapped sequences to the genome of reference using BLAT. In the final step all alignments are merged to get the final alignment. The input files can be in FASTA or FASTQ format. The output is presented in RUM and SAM format.
  • RNASEQR.
  • SAMMate
  • SpliceSeq
  • X-Mate

De novo splice aligners Edit

De novo Splice aligners allow the detection of new Splice junctions without need to previous annotated information (some of these tools present annotation as a suplementar option).

  • ABMapper
  • BBMap Uses short kmers to align reads directly to the genome (spanning introns to find novel isoforms) or transcriptome. Highly tolerant of substitution errors and indels, and very fast. Supports output of all SAM tags needed by Cufflinks. No limit to genome size or number of splices per read. Supports Illumina, 454, Sanger, Ion Torrent, PacBio, and Oxford Nanopore reads, paired or single-ended. Does not use any splice-site-finding heuristics optimized for a single taxonomic branch, but rather finds optimally-scoring multi-affine-transform global alignments, and thus is ideal for studying new organisms with no annotation and unknown splice motifs. Open-source, written in pure Java supports all platforms with no recompilation and no other dependencies.
  • ContextMap was developed to overcome some limitations of other mapping approaches, such as resolution of ambiguities. The central idea of this tool is to consider reads in gene expression context, improving this way alignment accuracy. ContextMap can be used as a stand-alone program and supported by mappers producing a SAM file in the output (e.g.: TopHat or MapSplice). In stand-alone mode aligns reads to a genome, to a transcriptome database or both.
  • CRAC propose a novel way of analyzing reads that integrates genomic locations and local coverage, and detect candidate mutations, indels, splice or fusion junctions in each single read. Importantly, CRAC improves its predictive performance when supplied with e.g. 200 nt reads and should fit future needs of read analyses.
  • GSNAP
  • GMAP A Genomic Mapping and Alignment Program for mRNA and EST Sequences.
  • HISAT HISAT is a spliced alignment program for mapping RNA-seq reads. In addition to one global FM-index that represents a whole genome, HISAT uses a large set of small FM-indexes that collectively cover the whole genome (each index represents a genomic region of

48,000 indexes are needed to cover the human genome). These small indexes (called local indexes) combined with several alignment strategies enable effective alignment of RNA-seq reads, in particular, reads spanning multiple exons. The memory footprint of HISAT is relatively low (

De novo splice aligners that also use annotation optionally Edit
  • MapNext
  • OLego
  • STAR is a tool that employs "sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure", detects canonical, non-canonical splices junctions and chimeric-fusion sequences. It is already adapted to align long reads (third-generation sequencing technologies) and can reach speeds of 45 million paired reads per hour per processor. [46]
  • Subjunc[41] is a specialized version of Subread. It uses all mappable regions in an RNA-seq read to discover exons and exon-exon junctions. It uses the donor/receptor signals to find the exact splicing locations. Subjunc yields full alignments for every RNA-seq read including exon-spanning reads, in addition to the discovered exon-exon junctions. Subjunc should be used for the purpose of junction detection and genomic variation detection in RNA-seq data.
  • TopHat[47] is prepared to find de novo junctions. TopHat aligns reads in two steps. Firstly, unspliced reads are aligned with Bowtie. After, the aligned reads are assembled with Maq resulting islands of sequences. Secondly, the splice junctions are determined based on the initially unmapped reads and the possible canonical donor and acceptor sites within the island sequences.
Other spliced aligners Edit

Evaluation of alignment tools Edit

  • AlignerBoost is a generalized software toolkit for boosting Next-Gen sequencing mapping precision using a Bayesian-based mapping quality framework.
  • CADBURE Bioinformatics tool for evaluating aligner performance on your RNA-Seq dataset.
  • QualiMap : Evaluating next generation sequencing alignment data.
  • RNAseqEVAL A collection of tools for evaluating RNA seq mapping.
  • Teaser: Individualized benchmarking and optimization of read mapping results for NGS data.

General tools Edit

These tools perform normalization and calculate the abundance of each gene expressed in a sample. [48] RPKM, FPKM and TPMs [49] are some of the units employed to quantification of expression. Some software are also designed to study the variability of genetic expression between samples (differential expression). Quantitative and differential studies are largely determined by the quality of reads alignment and accuracy of isoforms reconstruction. Several studies are available comparing differential expression methods. [50] [51] [52]

  • ABSSeq a new RNA-Seq analysis method based on modelling absolute expression differences.
  • ALDEx2 is a tool for comparative analysis of high-throughput sequencing data. ALDEx2 uses compositional data analysis and can be applied to RNAseq, 16S rRNA gene sequencing, metagenomic sequencing, and selective growth experiments.
  • Alexa-Seq is a pipeline that makes possible to perform gene expression analysis, transcript specific expression analysis, exon junction expression and quantitative alternative analysis. Allows wide alternative expression visualization, statistics and graphs.
  • ARH-seq – identification of differential splicing in RNA-seq data.
  • ASC[53]
  • Ballgown
  • BaySeq is a Bioconductor package to identify differential expression using next-generation sequencing data, via empirical Bayesian methods. There is an option of using the "snow" package for parallelisation of computer data processing, recommended when dealing with large data sets.
  • GMNB[54] is a Bayesian method to temporal gene differential expression analysis across different phenotypes or treatment conditions that naturally handles the heterogeneity of sequencing depth in different samples, removing the need for ad-hoc normalization.
  • BBSeq
  • BitSeq (Bayesian Inference of Transcripts from Sequencing Data) is an application for inferring expression levels of individual transcripts from sequencing (RNA-Seq) data and estimating differential expression (DE) between conditions.
  • CEDER Accurate detection of differentially expressed genes by combining significance of exons using RNA-Seq.
  • CPTRA The CPTRA package is for analyzing transcriptome sequencing data from different sequencing platforms. It combines advantages of 454, Illumina GAII, or other platforms and can perform sequence tag alignment and annotation, expression quantification tasks.
  • casper is a Bioconductor package to quantify expression at the isoform level. It combines using informative data summaries, flexible estimation of experimental biases and statistical precision considerations which (reportedly) provide substantial reductions in estimation error.
  • Cufflinks/Cuffdiff is appropriate to measure global de novo transcript isoform expression. It performs assembly of transcripts, estimation of abundances and determines differential expression (Cuffdiff) and regulation in RNA-Seq samples. [55]
  • DESeq is a Bioconductor package to perform differential gene expression analysis based on negative binomial distribution.
  • DEGSeq
  • Derfinder Annotation-agnostic differential expression analysis of RNA-seq data at base-pair resolution via the DER Finder approach.
  • DEvis is a powerful, integrated solution for the analysis of differential expression data. Using DESeq2 as a framework, DEvis provides a wide variety of tools for data manipulation, visualization, and project management.
  • DEXSeq is Bioconductor package that finds differential differential exon usage based on RNA-Seq exon counts between samples. DEXSeq employs negative binomial distribution, provides options to visualization and exploration of the results.
  • DEXUS is a Bioconductor package that identifies differentially expressed genes in RNA-Seq data under all possible study designs such as studies without replicates, without sample groups, and with unknown conditions. [56] In contrast to other methods, DEXUS does not need replicates to detect differentially expressed transcripts, since the replicates (or conditions) are estimated by the EM method for each transcript.
  • DGEclust is a Python package for clustering expression data from RNA-seq, CAGE and other NGS assays using a Hierarchical Dirichlet Process Mixture Model. The estimated cluster configurations can be post-processed in order to identify differentially expressed genes and for generating gene- and sample-wise dendrograms and heatmaps. [57]
  • DiffSplice is a method for differential expression detection and visualization, not dependent on gene annotations. This method is supported on identification of alternative splicing modules (ASMs) that diverge in the different isoforms. A non-parametric test is applied to each ASM to identify significant differential transcription with a measured false discovery rate.
  • EBSeq is a Bioconductor package for identifying genes and isoforms differentially expressed (DE) across two or more biological conditions in an RNA-seq experiment. It also can be used to identify DE contigs after performing de novo transcriptome assembly. While performing DE analysis on isoforms or contigs, different isoform/contig groups have varying estimation uncertainties. EBSeq models the varying uncertainties using an empirical Bayes model with different priors.
  • EdgeR is a R package for analysis of differential expression of data from DNA sequencing methods, like RNA-Seq, SAGE or ChIP-Seq data. edgeR employs statistical methods supported on negative binomial distribution as a model for count variability.
  • EdgeRun an R package for sensitive, functionally relevant differential expression discovery using an unconditional exact test.
  • EQP The exon quantification pipeline (EQP): a comprehensive approach to the quantification of gene, exon and junction expression from RNA-seq data.
  • ESAT The End Sequence Analysis Toolkit (ESAT) is specially designed to be applied for quantification of annotation of specialized RNA-Seq gene libraries that target the 5' or 3' ends of transcripts.
  • eXpress performance includes transcript-level RNA-Seq quantification, allele-specific and haplotype analysis and can estimate transcript abundances of the multiple isoforms present in a gene. Although could be coupled directly with aligners (like Bowtie), eXpress can also be used with de novo assemblers and thus is not needed a reference genome to perform alignment. It runs on Linux, Mac and Windows.
  • ERANGE performs alignment, normalization and quantification of expressed genes.
  • featureCounts an efficient general-purpose read quantifier.
  • FDM
  • FineSplice Enhanced splice junction detection and estimation from RNA-Seq data.
  • GFOLD[58] Generalized fold change for ranking differentially expressed genes from RNA-seq data.
  • globalSeq[59] Global test for counts: testing for association between RNA-Seq and high-dimensional data.
  • GPSeq This is a software tool to analyze RNA-seq data to estimate gene and exon expression, identify differentially expressed genes, and differentially spliced exons.
  • IsoDOT – Differential RNA-isoform Expression.
  • Limma Limma powers differential expression analyses for RNA-sequencing and microarray studies.
  • LPEseq accurately test differential expression with a limited number of replicates.
  • Kallisto "Kallisto is a program for quantifying abundances of transcripts from RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads. It is based on the novel idea of pseudoalignment for rapidly determining the compatibility of reads with targets, without the need for alignment. On benchmarks with standard RNA-Seq data, kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build."
  • MATS Multivariate Analysis of Transcript Splicing (MATS).
  • MAPTest provides a general testing framework for differential expression analysis of RNA-Seq time course experiment. Method of the pack is based on latent negative-binomial Gaussian mixture model. The proposed test is optimal in the maximum average power. The test allows not only identification of traditional DE genes but also testing of a variety of composite hypotheses of biological interest. [60]
  • MetaDiff Differential isoform expression analysis using random-effects meta-regression.
  • metaseqR is a Bioconductor package that detects differentially expressed genes from RNA-Seq data by combining six statistical algorithms using weights estimated from their performance with simulated data estimated from real data, either public or user-based. In this way, metaseqR optimizes the tradeoff between precision and sensitivity. [61] In addition, metaseqR creates a detailed and interactive report with a variety of diagnostic and exploration plots and auto-generated text.
  • MMSEQ is a pipeline for estimating isoform expression and allelic imbalance in diploid organisms based on RNA-Seq. The pipeline employs tools like Bowtie, TopHat, ArrayExpressHTS and SAMtools. Also, edgeR or DESeq to perform differential expression.
  • MultiDE
  • Myrna is a pipeline tool that runs in a cloud environment (Elastic MapReduce) or in a unique computer for estimating differential gene expression in RNA-Seq datasets. Bowtie is employed for short read alignment and R algorithms for interval calculations, normalization, and statistical processing.
  • NEUMA is a tool to estimate RNA abundances using length normalization, based on uniquely aligned reads and mRNA isoform models. NEUMA uses known transcriptome data available in databases like RefSeq.
  • NOISeq NOISeq is a non-parametric approach for the identification of differentially expressed genes from count data or previously normalized count data. NOISeq empirically models the noise distribution of count changes by contrasting fold-change differences (M) and absolute expression differences (D) for all the features in samples within the same condition.
  • NPEBseq is a nonparametric empirical Bayesian-based method for differential expression analysis.
  • NSMAP allows inference of isoforms as well estimation of expression levels, without annotated information. The exons are aligned and splice junctions are identified using TopHat. All the possible isoforms are computed by a combination of the detected exons.
  • NURD an implementation of a new method to estimate isoform expression from non-uniform RNA-seq data.
  • PANDORA An R package for the analysis and result reporting of RNA-Seq data by combining multiple statistical algorithms.
  • PennSeq PennSeq: accurate isoform-specific gene expression quantification in RNA-Seq by modeling non-uniform read distribution.
  • Quark Quark enables semi-reference-based compression of RNA-seq data.
  • QuasR Quantify and Annotate Short Reads in R.
  • RapMap A Rapid, Sensitive and Accurate Tool for Mapping RNA-seq Reads to Transcriptomes.
  • RNAeXpress Can be run with Java GUI or command line on Mac, Windows, and Linux. It can be configured to perform read counting, feature detection or GTF comparison on mapped rnaseq data.
  • Rcount Rcount: simple and flexible RNA-Seq read counting.
  • rDiff is a tool that can detect differential RNA processing (e.g. alternative splicing, polyadenylation or ribosome occupancy).
  • RNASeqPower Calculating samples Size estimates for RNA Seq studies. R package version.
  • RNA-Skim RNA-Skim: a rapid method for RNA-Seq quantification at transcript-level.
  • rSeq rSeq is a set of tools for RNA-Seq data analysis. It consists of programs that deal with many aspects of RNA-Seq data analysis, such as read quality assessment, reference sequence generation, sequence mapping, gene and isoform expressions (RPKMs) estimation, etc.
  • RSEM
  • rQuant is a web service (Galaxy (computational biology) installation) that determines abundances of transcripts per gene locus, based on quadratic programming. rQuant is able to evaluate biases introduced by experimental conditions. A combination of tools is employed: PALMapper (reads alignment), mTiM and mGene (inference of new transcripts).
  • Salmon is a software tool for computing transcript abundance from RNA-seq data using either an alignment-free (based directly on the raw reads) or an alignment-based (based on pre-computed alignments) approach. It uses an online stochastic optimization approach to maximize the likelihood of the transcript abundances under the observed data. The software itself is capable of making use of many threads to produce accurate quantification estimates quickly. It is part of the Sailfish suite of software, and is the successor to the Sailfish tool.
  • SAJR is a java-written read counter and R-package for differential splicing analysis. It uses junction reads to estimate exon exclusion and reads mapped within exon to estimate its inclusion. SAJR models it by GLM with quasibinomial distribution and uses log likelihood test to assess significance.
  • Scotty Performs power analysis to estimate the number of replicates and depth of sequencing required to call differential expression.
  • Seal alignment-free algorithm to quantify sequence expression by matching kmers between raw reads and a reference transcriptome. Handles paired reads and alternate isoforms, and uses little memory. Accepts all common read formats, and outputs read counts, coverage, and FPKM values per reference sequence. Open-source, written in pure Java supports all platforms with no recompilation and no other dependencies. Distributed with BBMap. (Seal - Sequence Expression AnaLyzer - is unrelated to the SEAL distributed short-read aligner.)
  • semisup[62] Semi-supervised mixture model: detecting SNPs with interactive effects on a quantitative trait
  • Sleuth is a program for analysis of RNA-Seq experiments for which transcript abundances have been quantified with kallisto.
  • SplicingCompass differential splicing detection using RNA-Seq data.
  • sSeq The purpose of this R package is to discover the genes that are differentially expressed between two conditions in RNA-seq experiments.
  • StringTie is an assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. It was designed as a successor to Cufflinks (its developers include some of the Cufflinks developers) and has many of the same features.
  • TIGAR Transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference.
  • TimeSeq Detecting Differentially Expressed Genes in Time Course RNA-Seq Data.
  • TPMCalculator[63] one-step software to quantify mRNA abundance of genomic features.
  • WemIQ is a software tool to quantify isoform expression and exon splicing ratios from RNA-seq data accurately and robustly.

Evaluation of quantification and differential expression Edit

  • CompcodeR RNAseq data simulation, differential expression analysis and performance comparison of differential expression methods.
  • DEAR-O Differential Expression Analysis based on RNA-seq data – Online.
  • PROPER comprehensive power evaluation for differential expression using RNA-seq.
  • RNAontheBENCH computational and empirical resources for benchmarking RNAseq quantification and differential expression methods.
  • rnaseqcomp Several quantitative and visualized benchmarks for RNA-seq quantification pipelines. Two-condition quantifications for genes, transcripts, junctions or exons by each pipeline with nessasery meta information should be organizd into numeric matrices in order to proceed the evaluation.

Multi-tool solutions Edit

  • DEB is a web-interface/pipeline that permits to compare results of significantly expressed genes from different tools. Currently are available three algorithms: edgeR, DESeq and bayseq.
  • SARTools A DESeq2- and EdgeR-Based R Pipeline for Comprehensive Differential Analysis of RNA-Seq Data.

Transposable Element expression Edit

  • TeXP is a Transposable Element quantification pipeline that deconvolves pervasive transcription from autonomous transcription of LINE-1 elements. [64]

Commercial solutions Edit

  • ActiveSite by Cofactor Genomics
  • Avadis NGS (currently Strand NGS)
  • BaseSpace by Illumina
  • Biowardrobe an integrated platform for analysis of epigenomics and transcriptomics data.
  • BBrowser a platform for analyzing public and in-house single-cell transcriptomics data
  • CLC Genomics Workbench
  • DNASTAR
  • ERGO
  • Genedata
  • GeneSpring GX
  • Genevestigator by Nebion (basic version is for free for academic researchers).
  • geospiza
  • Golden Helix
  • Maverix Biomics
  • NextGENe
  • OmicsOffice
  • Partek Flow Comprehensive single cell analysis within an intuitive interface.
  • Qlucore. Easy to use for analysis and visualization. One button import of BAM files.

Open (free) source solutions Edit

  • ArrayExpressHTS is a BioConductor package that allows preprocessing, quality assessment and estimation of expression of RNA-Seq datasets. It can be run remotely at the European Bioinformatics Institute cloud or locally. The package makes use of several tools: ShortRead (quality control), Bowtie, TopHat or BWA (alignment to a reference genome), SAMtools format, Cufflinks or MMSEQ (expression estimation).
  • BioJupies is a web-based platform that provides complete RNA-seq analysis solution from free alignment service to a complete data analysis report delivered as an interactive Jupyter Notebook.
  • BioQueue is a web-based queue engine designed preferentially to improve the efficiency and robustness of job execution in bioinformatics research by estimating the system resources required by a certain job. At the same time, BioQueue also aims to promote the accessibility and reproducibility of data analysis in biomedical research. Implemented by Python 2.7, BioQueue can work in both POSIX compatible systems (Linux, Solaris, OS X, etc.) and Windows. See also. [65]
  • BioWardrobe is an integrated package that for analysis of ChIP-Seq and RNA-Seq datasets using a web-based user-friendly GUI. For RNA-Seq Biowardrobe performs mapping, quality control, RPKM estimation and differential expression analysis between samples (groups of samples). Results of differential expression analysis can be integrated with ChIP-Seq data to build average tag density profiles and heat maps. The package makes use of several tools open source tools including STAR and DESeq. See also. [66]
  • Chipster is a user-friendly analysis software for high-throughput data. It contains over 350 analysis tools for next generation sequencing (NGS), microarray, proteomics and sequence data. Users can save and share automatic analysis workflows, and visualize data interactively using a built-in genome browser and many other visualizations.
  • DEWE (Differential Expression Workflow Executor) is an open source desktop application that provides a user-friendly GUI for easily executing Differential Expression analyses in RNA-Seq data. Currently, DEWE provides two differential expression analysis workflows: HISAT2, StringTie and Ballgown and Bowtie2, StringTie and R libraries (Ballgown and edgeR). It runs in Linux, Windows and Mac OS X.
  • easyRNASeq Calculates the coverage of high-throughput short-reads against a genome of reference and summarizes it per feature of interest (e.g. exon, gene, transcript). The data can be normalized as 'RPKM' or by the 'DESeq' or 'edgeR' package.
  • ExpressionPlot
  • FASTGenomics is an online platform to share single-cell RNA sequencing data and analyses using reproducible workflows. Gene expression data can be shared meeting European data protection standards (GDPR). FASTGenomics enables the user to upload their own data and generate customized and reproducible workflows for the exploration and analysis of gene expression data (Scholz et al. 2018).
  • FX FX is a user-Frendly RNA-Seq gene eXpression analysis tool, empowered by the concept of cloud-computing. With FX, you can simply upload your RNA-Seq raw FASTQ data on the cloud, and let the computing infra to do the heavy analysis.
  • Galaxy: Galaxy is a general purpose workbench platform for computational biology.
  • GENE-Counter is a Perl pipeline for RNA-Seq differential gene expression analyses. Gene-counter performs alignments with CASHX, Bowtie, BWA or other SAM output aligner. Differential gene expression is run with three optional packages (NBPSeq, edgeR and DESeq) using negative binomial distribution methods. Results are stored in a MySQL database to make possible additional analyses.
  • GenePattern offers integrated solutions to RNA-Seq analysis (Broad Institute).
  • GeneProf Freely accessible, easy to use analysis pipelines for RNA-seq and ChIP-seq experiments.
  • GREIN is an interactive web platform for re-processing and re-analyzing GEO RNA-seq data. GREIN is powered by the back-end computational pipeline for uniform processing of RNA-seq data and the large number (>5,800) of already processed data sets. The front-end user friendly interfaces provide a wealth of user-analytics options including sub-setting and downloading processed data, interactive visualization, statistical power analyses, construction of differential gene expression signatures and their comprehensive functional characterization, connectivity analysis with LINCS L1000 data, etc.
  • GT-FAR is an RNA seq pipeline that performs RNA-seq QC, alignment, reference free quantification, and splice variant calling. It filters, trims, and sequentially aligns reads to gene models and predicts and validates new splice junctions after which it quantifies expression for each gene, exon, and known/novel splice junction, and Variant Calling.
  • MultiExperiment Viewer (MeV) is suitable to perform analysis, data mining and visualization of large-scale genomic data. The MeV modules include a variety of algorithms to execute tasks like Clustering and Classification, Student's t-test, Gene Set Enrichment Analysis or Significance Analysis. MeV runs on Java.
  • NGSUtils is a suite of software tools for working with next-generation sequencing datasets.
  • Rail-RNA Scalable analysis of RNA-seq splicing and coverage.
  • RAP RNA-Seq Analysis Pipeline, a new cloud-based NGS web application.
  • RSEQtools "RSEQtools consists of a set of modules that perform common tasks such as calculating gene expression values, generating signal tracks of mapped reads, and segmenting that signal into actively transcribed regions. In addition to the anonymization afforded by this format it also facilitates the decoupling of the alignment of reads from downstream analyses."
  • RobiNA provides a user graphical interface to deal with R/BioConductor packages. RobiNA provides a package that automatically installs all required external tools (R/Bioconductor frameworks and Bowtie). This tool offers a diversity of quality control methods and the possibility to produce many tables and plots supplying detailed results for differential expression. Furthermore, the results can be visualized and manipulated with MapMan and PageMan. RobiNA runs on Java version 6.
  • RseqFlow is an RNA-Seq analysis pipeline which offers an express implementation of analysis steps for RNA sequencing datasets. It can perform pre and post mapping quality control (QC) for sequencing data, calculate expression levels for uniquely mapped reads, identify differentially expressed genes, and convert file formats for ease of visualization.
  • S-MART handles mapped RNA-Seq data, and performs essentially data manipulation (selection/exclusion of reads, clustering and differential expression analysis) and visualization (read information, distribution, comparison with epigenomic ChIP-Seq data). It can be run on any laptop by a person without computer background. A friendly graphical user interface makes easy the operation of the tools.
  • Taverna is an open source and domain-independent Workflow Management System – a suite of tools used to design and execute scientific workflows and aid in silico experimentation.
  • TCW is a Transcriptome Computational Workbench.
  • TRAPLINE a standardized and automated pipeline for RNA sequencing data analysis, evaluation and annotation.
  • ViennaNGS A toolbox for building efficient next- generation sequencing analysis pipelines.
  • wapRNA This is a free web-based application for the processing of high-throughput RNA-Seq data (wapRNA) from next generation sequencing (NGS) platforms, such as Genome Analyzer of Illumina Inc. (Solexa) and SOLiD of Applied Biosystems (SOLiD). wapRNA provides an integrated tool for RNA sequence, refers to the use of High-throughput sequencing technologies to sequence cDNAs in order to get information about a sample's RNA content.

General tools Edit

  • Alternative Splicing Analysis Tool Package(ASATP) Alternative splicing analysis tool package (ASATP) includes a series of toolkits to analyze alternative splicing events, which could be used to detect and visualized alternative splicing events, check ORF changes, assess regulations of alternative splicing and do statistical analysis.
  • Asprofile is a suite of programs for extracting, quantifying and comparing alternative splicing (AS) events from RNA-seq data.
  • AStalavista The AStalavista web server extracts and displays alternative splicing (AS) events from a given genomic annotation of exon-intron gene coordinates. By comparing all given transcripts, AStalavista detects the variations in their splicing structure and identify all AS events (like exon skipping, alternate donor, etc.) by assigning to each of them an AS code.
  • CLASS2 accurate and efficient splice variant annotation from RNA-seq reads.
  • Cufflinks/Cuffdiff
  • DEXseq Inference of differential exon usage in RNA-Seq.
  • Diceseq Statistical modeling of isoform splicing dynamics from RNA-seq time series data.
  • EBChangepoint An empirical Bayes change-point model for identifying 3′ and 5′ alternative splicing by RNA-Seq.
  • Eoulsan A versatile framework dedicated to high throughput sequencing data analysis. Allows automated analysis (mapping, counting and differencial analysis with DESeq2).
  • GESS for de novo detection of exon-skipping event sites from raw RNA-seq reads.
  • LeafCutter a suite of novel methods that allow identification and quantication of novel and existing alternative splicing events by focusing on intron excisions.
  • LEMONS[67] A Tool for the Identification of Splice Junctions in Transcriptomes of Organisms Lacking Reference Genomes.
  • MAJIQ. Modeling Alternative Junction Inclusion Quantification.
  • MATS Multivariate Analysis of Transcript Splicing (MATS).
  • MISO quantifies the expression level of splice variants from RNA-Seq data and is able to recognize differentially regulated exons/isoforms across different samples. MISO uses a probabilistic method (Bayesian inference) to calculate the probability of the reads origin.
  • Rail-RNA Scalable analysis of RNA-seq splicing and coverage.
  • RPASuite[68] RPASuite (RNA Processing Analysis Suite) is a computational pipeline to identify differentially and coherently processed transcripts using RNA-seq data obtained from multiple tissue or cell lines.
  • RSVP RSVP is a software package for prediction of alternative isoforms of protein-coding genes, based on both genomic DNA evidence and aligned RNA-seq reads. The method is based on the use of ORF graphs, which are more general than the splice graphs used in traditional transcript assembly.
  • SAJR calculates the number of the reads that confirms segment (part of gene between two nearest splice sites) inclusion or exclusion and then model these counts by GLM with quasibinomial distribution to account for biological variability.
  • SGSeq A R package to de novo prediction of splicing events.
  • SplAdder Identification, quantification and testing of alternative splicing events from RNA-Seq data.
  • SpliceGrapher Prediction of novel alternative splicing events from RNA-Seq data. Also includes graphical tools for visualizing splice graphs. [69][70]
  • SpliceJumper a classification-based approach for calling splicing junctions from RNA-seq data.
  • SplicePie is a pipeline to analyze non-sequential and multi-step splicing. SplicePie contains three major analysis steps: analyzing the order of splicing per sample, looking for recursive splicing events per sample and summarizing predicted recursive splicing events for all analyzed sample (it is recommended to use more samples for higher reliability). The first two steps are performed individually on each sample and the last step looks at the overlap in all samples. However, the analysis can be run on one sample as well.
  • SplicePlot is a tool for visualizing alternative splicing and the effects of splicing quantitative trait loci (sQTLs) from RNA-seq data. It provides a simple command line interface for drawing sashimi plots, hive plots, and structure plots of alternative splicing events from .bam, .gtf, and .vcf files.
  • SpliceR An R package for classification of alternative splicing and prediction of coding potential from RNA-seq data.
  • SpliceSEQ SpliceViewer is a Java application that allows researchers to investigate alternative mRNA splicing patterns in data from high-throughput mRNA sequencing studies. Sequence reads are mapped to splice graphs that unambiguously quantify the inclusion level of each exon and splice junction. The graphs are then traversed to predict the protein isoforms that are likely to result from the observed exon and splice junction reads. UniProt annotations are mapped to each protein isoform to identify potential functional impacts of alternative splicing.
  • SpliceTrap[71] is a statistical tool for the quantification of exon inclusion ratios from RNA-seq data.
  • Splicing Express – a software suite for alternative splicing analysis using next-generation sequencing data.
  • SUPPA This tool generates different Alternative Splicing (AS) events and calculates the PSI ("Percentage Spliced In") value for each event exploiting the quantification of transcript abundances from multiple samples.
  • SwitchSeq identifies extreme changes in splicing (switch events).
  • Portcullis identification of genuine splice junctions.
  • TrueSight A Self-training Algorithm for Splice Junction Detection using RNA-seq.
  • Vast-tools A toolset for profiling alternative splicing events in RNA-Seq data.

Intron retention analysis Edit

  • IRcall / IRclassifier IRcall is a computational tool for IR event detection from RNA-Seq data. IRclassifier is a supervised machine learning-based approach for IR event detection from RNA-Seq data.

Differential isoform/transcript usage Edit

  • IsoformSwitchAnalyzeR IsoformSwitchAnalyzeR is an R package that enables statistical identification of isoform switches with predicted functional consequences where the consequences of interest can be chosen from a long list but includes gain/loss of protein domains, signal peptides changes in NMD sensitivity. [72] IsoformSwitchAnalyzeR is made for post analysis of data from any full length isoform/transcript quantification tool but directly support Cufflinks/Cuffdiff, RSEM Kallisto and Salmon.
  • DRIMSeq An R package that utilizes generalized linear modeling (GLM) to identify isoform switches from estimated isoform count data. [73]
  • BayesDRIMSeq An R package containing a Bayesian implementation of DRIMSeq. [74]
  • Cufflinks/Cuffdiff Full length isoform/transcript quantification and differential analysis tool which amongst other test for changes in usage for isoform belonging to the same primary transcript (sharing a TSS) via a one-sided t-test based on the asymptotic of the Jensen-Shannon metric. [55]
  • rSeqNP An R package that implements a non-parametric approach to test for differential expression and splicing from RNA-Seq data. [75]
  • Isolator Full length isoform/transcript quantification and differential analysis tool which analyses all samples in an experiment in unison using a simple Bayesian hierarchical model. Can identify differential isoform usage by testing for probability of monotonic splicing. [76]

Genome arrangements result of diseases like cancer can produce aberrant genetic modifications like fusions or translocations. Identification of these modifications play important role in carcinogenesis studies. [77]

  • Arriba[78] is a fusion detection algorithm based on the STAR [46] RNA-Seq aligner. It is the winner of the DREAM Challenge about fusion detection. [79] Arriba can also detect viral integration sites, internal tandem duplications, whole exon duplications, circular RNAs, enhancer hijacking events involving immunoglobulin/T-cell receptor loci, and breakpoints in introns or intergenic regions.
  • Bellerophontes
  • BreakDancer
  • BreakFusion
  • ChimeraScan
  • EBARDenovo
  • EricScript
  • DEEPEST is a statistical fusion detection algorithm. [80] DEEPEST can also detect Circular RNAs.
  • DeFuse DeFuse is a software package for gene fusion discovery using RNA-Seq data.
  • FusionAnalyser FusionAnalyser uses paired reads mapping to different genes (Bridge reads).
  • FusionCatcher FusionCatcher searches for novel/known somatic fusion genes, translocations, and chimeras in RNA-seq data (stranded/unstranded paired-end reads from Illumina NGS platforms) from diseased samples.
  • FusionHunter identifies fusion transcripts without depending on already known annotations. It uses Bowtie as a first aligner and paired-end reads.
  • FusionMap FusionMap is a fusion aligner which aligns reads spanning fusion junctions directly to the genome without prior knowledge of potential fusion regions. It detects and characterizes fusion junctions at base-pair resolution. FusionMap can be applied to detect fusion junctions in both single- and paired-end dataset from either gDNA-Seq or RNA-Seq studies.
  • FusionSeq
  • JAFFA is based on the idea of comparing a transcriptome against a reference transcriptome rather than a genome-centric approach like other fusion finders.
  • MapSplice[81]
  • nFuse
  • Oncomine NGS RNA-Seq Gene Expression Browser.
  • PRADA
  • SOAPFuse detects fusion transcripts from human paired-end RNA-Seq data. It outperforms other five similar tools in both computation and fusion detection performance using both real and simulated data. [82]
  • SOAPfusion
  • TopHat-Fusion is based on TopHat version and was developed to handle reads resulting from fusion genes. It does not require previous data about known genes and uses Bowtie to align continuous reads.
  • ViralFusionSeq is high-throughput sequencing (HTS) tool for discovering viral integration events and reconstruct fusion transcripts at single-base resolution.
  • ViReMa (Viral Recombination Mapper) detects and reports recombination or fusion events in and between virus and host genomes using deep sequencing datasets. [83]
  • CNVseq detects copy number variations supported on a statistical model derived from array-comparative genomic hybridization. Sequences alignment are performed by BLAT, calculations are executed by R modules and is fully automated using Perl.

Single cell sequencing. The traditional RNA-Seq methodology is commonly known as "bulk RNA-Seq", in this case RNA is extracted from a group of cells or tissues, not from the individual cell like it happens in single cell methods. Some tools available to bulk RNA-Seq are also applied to single cell analysis, however to face the specificity of this technique new algorithms were developed.

  • CEL-Seq[84] single-cell RNA-Seq by multiplexed linear amplification.
  • Drop-Seq[85] Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Single cell transcriptome sequencing in situ, i.e. without dissociating the cells.
  • Oscope: a statistical pipeline for identifying oscillatory genes in unsynchronized single cell RNA-seq experiments.
  • SCUBA[86] Extracting lineage relationships and modeling dynamic changes associated with multi-lineage cell differentiation.
  • scLVM[87] scLVM is a modelling framework for single-cell RNA-seq data that can be used to dissect the observed heterogeneity into different sources, thereby allowing for the correction of confounding sources of variation.
  • scM&T-Seq Parallel single-cell sequencing.
  • Sphinx[88] SPHINX is a hybrid binning approach that achieves high binning efficiency by utilizing both 'compositional' and 'similarity' features of the query sequence during the binning process. SPHINX can analyze sequences in metagenomic data sets as rapidly as composition based approaches, but nevertheless has the accuracy and specificity of similarity based algorithms.
  • TraCeR[89] Paired T-cell receptor reconstruction from single-cell RNA-Seq reads.
  • VDJPuzzle[90] T-cell receptor reconstruction from single-cell RNA-Seq reads and link the clonotype with the functional phenotype and transcriptome of individual cells.

Integrated Packages Edit

  • Monocle Differential expression and time-series analysis for single-cell RNA-Seq and qPCR experiments.
  • SCANPY[91] Scalable Python-based implementation for preprocessing, visualization, clustering, trajectory inference and differential expression testing.
  • SCell integrated analysis of single-cell RNA-seq data.
  • Seurat[92] R package designed for QC, analysis, and exploration of single-cell RNA-seq data.
  • Sincell an R/Bioconductor package for statistical assessment of cell-state hierarchies from single-cell RNA-seq.
  • SINCERA[93] A Pipeline for Single-Cell RNA-Seq Profiling Analysis.

Quality Control and Gene Filtering Edit

  • Celloline A pipeline for mapping and quality assessment single cell RNA-seq data.
  • OEFinder A user interface to identify and visualize ordering effects in single-cell RNA-seq data.
  • SinQC A Method and Tool to Control Single-cell RNA-seq Data Quality.

Normalization Edit

  • BASICS Understanding changes in gene expression at the single-cell level.
  • GRM Normalization and noise reduction for single cell RNA-seq experiments.

Dimension Reduction Edit

Differential Expression Edit

  • BPSC An R package BPSC for model fitting and differential expression analyses of single-cell RNA-seq.
  • MAST a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data.
  • SCDE Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis.

Visualization Edit

These Simulators generate in silico reads and are useful tools to compare and test the efficiency of algorithms developed to handle RNA-Seq data. Moreover, some of them make possible to analyse and model RNA-Seq protocols.

  • BEERS Simulator is formatted to mouse or human data, and paired-end reads sequenced on Illumina platform. Beers generates reads starting from a pool of gene models coming from different published annotation origins. Some genes are chosen randomly and afterwards are introduced deliberately errors (like indels, base changes and low quality tails), followed by construction of novel splice junctions.
  • compcodeR RNAseq data simulation, differential expression analysis and performance comparison of differential expression methods.
  • CuReSim a customized read simulator.
  • Flux simulator implements a computer pipeline simulation to mimic a RNA-Seq experiment. All component steps that influence RNA-Seq are taken into account (reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing) in the simulation. These steps present experimental attributes that can be measured, and the approximate experimental biases are captured. Flux Simulator allows joining each of these steps as modules to analyse different type of protocols.
  • PBSIM PacBio reads simulator - toward accurate genome assembly.
  • Polyester This bioconductor package can be used to simulate RNA-seq reads from differential expression experiments with replicates. The reads can then be aligned and used to perform comparisons of methods for differential expression.
  • RandomReads Generates synthetic reads from a genome with an Illumina or PacBio error model. The reads may be paired or unpaired, with arbitrary length and insert size, output in fasta or fastq, RandomReads has a wide selection of options for mutation rates, with individual settings for substitution, deletion, insertion, and N rates and length distributions, annotating reads with their original, unmutated genomic start and stop location. RandomReads is does not vary expression levels and thus is not designed to simulate RNA-seq experiments, but to test the sensitivity and specificity of RNA-seq aligners with de-novo introns. Includes a tool for grading and generating ROC curves from resultant sam files. Open-source, written in pure Java supports all platforms with no recompilation and no other dependencies. Distributed with BBMap.
  • rlsim is a software package for simulating RNA-seq library preparation with parameter estimation.
  • rnaseqbenchmark A Benchmark for RNA-seq Quantification Pipelines.
  • rnaseqcomp Benchmarks for RNA-seq Quantification Pipelines.
  • RSEM Read Simulator RSEM provides users the ‘rsem-simulate-reads’ program to simulate RNA-Seq data based on parameters learned from real data sets.
  • RNASeqReadSimulator contains a set of simple Python scripts, command line driven. It generates random expression levels of transcripts (single or paired-end), equally simulates reads with a specific positional bias pattern and generates random errors from sequencing platforms.
  • RNA Seq Simulator RSS takes SAM alignment files from RNA-Seq data and simulates over dispersed, multiple replica, differential, non-stranded RNA-Seq datasets.
  • SimSeq A Nonparametric Approach to Simulation of RNA-Sequence Datasets.
  • WGsim Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. It does not generate INDEL sequencing errors, but this can be partly compensated by simulating INDEL polymorphisms.

The transcriptome is the total population of RNAs expressed in one cell or group of cells, including non-coding and protein-coding RNAs. There are two types of approaches to assemble transcriptomes. Genome-guided methods use a reference genome (if possible a finished and high quality genome) as a template to align and assembling reads into transcripts. Genome-independent methods does not require a reference genome and are normally used when a genome is not available. In this case reads are assembled directly in transcripts.


Results

SNV calling from scRNA-seq data

We implemented a pipeline to identify SNVs directly from FASTQ files of scRNA-seq data, following the SNV guideline of GATK (Supplementary Figure 1). We applied this pipeline to five scRNA-seq cancer datasets (Kim 20 , Ting 21 , Miyamoto 22 , Patel 23 , and Chung 24 see Methods), and tested the efficiency of SNV features on retrieving single cell groups of interest. These datasets vary in tissue types, origins (Mouse or Human), read lengths and map-ability (Table 1). They all have pre-defined cell types (subclasses), providing useful references for assessing the performance of a variety of clustering methods used in this study.

We evaluated the GATK SNV calling pipeline using several approaches. First, we estimated the true positive rates of the SNV calling pipeline at different depths of scRNA-seq reads. For this we performed a simulation experiment by artificially introducing 50,000 random SNVs in the exonic regions of hg19 and measured recovery of these SNVs using our pipeline on Kim dataset. The true positive rates monotonically increase with read depth. For as few as 4 read-depth, the pipeline achieves on average over 50% true positive rate and increases to 68% true positive rate when read depth is more than 6 (Fig. 1a). This accuracy is in line with what was reported from bulk cell RNA-seq 25 . The false positive rate is consistently <0.1, and the median reaches below 0.05 when the read depth is >6 (Fig. 1b). We compared the SNV call results from GATK to those from another SNV caller FreeBayes 26 and obtained similar results (Supplementary Figure 2A, B). Additionally, we conducted simulation experiment on a new non-cancer 10X Genomic dataset, and obtained comparable true positive rates (Supplementary Figure 2C, D). Moreover GATK shows better performance than FreeBayes in the 10X dataset. We thus opted to use GATK to call SNVs for the remainder of the report, given its popularity and performance.

The performance measurements of GATK SNV calling and SSrGE pipelines. a, b Performance measurement of GATK SNV calling pipeline. Box plots of true positive rate (a) and false positive rate (b) with respect to the read depth at the called SNV position. The rates are calculated from GATK SNV calling pipeline, using hg19 reference genome to align modified scRNA-seq reads from a subset of 20 cells from the Kim dataset, which were introduced 50,000 random artificial mutations the exonic region of the reads. Error bars represent standard deviation. c, d Comparisons of importance the different types of features in SSrGE models, with respect to the ranking, in Miyamoto dataset (c) and Kim dataset (d). The scores of the SNVs and CNVs correspond to the sum of the coefficients inferred by the SSrGE models. The gene score is the sum of the SNVs scores for a particular gene. Blue: CNV feature Red: eeSNV feature Green: gene feature

Using SSrGE to detect eeSNVs in scRNA-seq data

To link the relationship between SNV and GE, we developed a method called Sparse SNV inference to reflect Gene Expression (SSrGE), as detailed in Methods. In addition to SNV, we also optionally considered the effect of CNVs on gene expression, since copy number variation (CNV) may contribute to gene expression variation as well. Similar to gene-based association method PrediXscan 17 , SSrGE uses SNVs and additionally optionally CNVs as predictors to fit a linear model for gene expression, under LASSO regularization and feature selection 27 . We choose LASSO rather than elasticNet for penalization, so that the list of resulting eeSNVs is short (Supplementary Figure 3). These eeSNVs serve as refined descriptive features for subsequent subpopulation identification. To directly pinpoint the contributions of SNVs relevant to protein-coding genes, we used the SNVs residing between transcription starting and ending sites of genes as the inputs. We further assessed the relative contributions of eeSNVs and CNVs to gene expression and found that the coefficients of the CNVs are significantly lower than those of eeSNVs(Fig. 1c, d). The ranks of the top genes with and without CNVs in the SSrGE models are not statistically different overall, as the Kendall-Tau correlation scores 28 are close to 1 with p-values = 0 (Kendall-Tau test).

Additionally, SNV genotypes and allelic specific level gene expression may also complicate the relationships between eeSNVs and gene expression. Therefore, we further calibrated SSrGE model by considering SNV genotype and allelic specific gene expression. We used QUASAR 29 to estimate the SNV genotypes (Supplementary Table 1), and the allelic specific gene expression using the SNV genotype. We rebuilt individual SSrGE models using only the SNVs from a particular genotype and allelic-specific gene expression, and then merged the eeSNV weights from related SSrGE models together to obtain a final ranking of eeSNVs. The new rankings are not statistically different compared to the previous approach (Supplementary Table 1). The Kendall-Tau scores, which evaluate the similarities between the re-calibrated model and the original model have p-values = 0 (Kendall-Tau test) in all data sets.

Lastly, to quantitatively evaluate if the eeSNVs obtained from SSrGE are truly significant, we designed a simulation pipeline (Methods). The pipeline creates random binary matrices of SNVs for n simulated cells, which are connected to the matrices of gene expression. The SNVs present in the simulated cell have probabilities to modify gene expression of the genes positively or negatively. We used various levels of noise to perturb the GE and the SNV matrices. We compared the ranks of top genes identified by SSrGE to the expected impact of each gene provided by the simulation. The inferred top ranked genes using SSrGE have monotonic and positive correlations with those set by the simulation (Supplementary Figure 4A). These correlations are all significant (p-value « 0.05, Kendall-Tau test), independently of the alpha and the level of noise used, confirming the value of SSrGE model. Moreover, to simulate the patterns of dropout in the data, we also introduced two other parameters, one for random dropout or biased dropout toward both lowly expressed genes, and other for dropout rate relative to cell, gene or reads (Methods). We observed that SSrGE performs well on all dropout models (Supplementary Figure 4B). Thus, SSrGE method is validated to generate eeSNVs that are truly important.

EeSNVs are better than gene counts at finding subpopulations

We measured the performance of SNVs and gene expression (GE) to identify subpopulations on the five datasets, using five clustering approaches (Fig. 2). These clustering approaches include two dimension reduction methods, namely principal component analysis (PCA) 30 and factor analysis (FA) 31 , followed by either K-means or the hierarchical agglomerative method (agglo) with WARD linkage 32 . We also used a recent algorithm SIMLR designed explicitly for scRNA-seq data clustering and visualization 33 . To evaluate the accuracy of obtained subpopulations in each dataset, we used the metric of adjusted mutual information (AMI) over 30 bootstrap runs, from the optimal a parameters (Supplementary Data 1). These optimal parameters were estimated by testing different a values for each dataset and each clustering approach (Supplementary Figure 5). As shown in Fig. 2, eeSNVs are better features to retrieve cancer cell subpopulations compared to GE, independent of the clustering methods used. Among the clustering algorithms, SIMLR tends to be a better choice using eeSNV features.

Comparison of clustering accuracy using eeSNV and gene expression (GE) features. ae Bar plots comparing the clustering performance using eeSNV vs. gene expression (GE) as features, over five different clustering strategies and five datasets, each with its own pre-defined classes as the truth measure: a Kim dataset, b Ting dataset, c Chung dataset, d Miyamoto dataset, and e Patel dataset. Y-axis is the adjusted mutual information (AMI) obtained across 30 bootstrap runs (mean ± s.d.). Error bars represent standard deviation. *P < 0.05, **P < 0.01, and ***P < 0.001 (paired t-test). f Heatmap of the rankings among different methods and datasets as shown in ae

Additionally, we computed the Adjusted Rand index (ARI) 34 and V-measure 35 , two other metrics for modularity measurements (Methods) and obtained similar trends (Supplementary Figure 6). Similar to AMI, ARI is a normalized metric against random chance, and evaluates the number of correct pairs obtained. On the other hand, V-measure combines the homogeneity score, which measures the homogeneity of reference classes in the obtained clusters, and the completeness score, which measures the homogeneity of obtained clusters within the reference classes. Due to the high number of small homogenous clusters obtained for the Miyamoto dataset, we observed higher V-measure scores, compared to AMI and ARI results (Supplementary Figure 6).

Visualization of subpopulations with bipartite graphs

Bipartite graphs are an efficient way to describe the binary relations between two different classes of objects. We next represented the presence of the eeSNVs into single cell genomes with bipartite graphs using ForceAtlas2 algorithm 36 . We drew an edge (link) between a cell node and a given eeSNV node whenever an eeSNV is detected. The results show that bipartite graph is a robust and more discriminative alternative (Fig. 3), comparing to PCA plots (using GE and eeSNVs) as well as SIMLR (using GE). For Kim dataset, bipartite graph separates the three classes perfectly. However, gene-based visualization approaches using either PCA or SIMLR have misclassified data points. For Ting data, the eeSNV-cell bipartite graph gives a clear visualization of all six different subgroups of single cells. Other three approaches have more exaggerated separations among the same mouse circulating tumor cells (CTC) subgroup MP (orange color), but mix some other subpopulations (e.g., GM, MP, and TuGMP groups). Miyamoto dataset is the most difficult one to visualize among the four datasets, due to its high number (24) of reference classes and heterogeneity among CTCs. Bipartite graphs are not only able to condense the whole populations, but also separate subpopulations (e.g., the orange colored PC subpopulation) much better than the other three methods.

Comparison of clustering visualization using eeSNV and gene expression (GE) features. a Bipartite graphs using eeSNVs and cells as two groups of nodes. An edge between a cell and an eeSNV represents the presence of the eeSNV within that cell. b Principle component analysis (PCA) results using GE as features of the cells. c PCA results using eeSNVs as features of the cells. d SIMILR results using GE as the input

Characteristics of eeSNVs

In SSrGE, regularization parameter a is the only tuning variable, controlling the sparsity of the linear models and influences the number of eeSNVs. We next explored the relationship between eeSNVs and a (Fig. 4). For every dataset, increasing the value of a decreases the number of selected eeSNVs overall (Fig. 4a), as well as the average number of eeSNVs associated with every expressed gene (Fig. 4b). The optimal a depends on the clustering algorithm and the dataset used (Supplementary Data 1 and Supplementary Figure 5). Increasing the value of a expands the proportion of eeSNVs that have annotations in human dbSNP138 database, indicating a higher true positive rate of SNVs compared to that prior to SSrGE filtering (Fig. 4c). Additionally, increasing a increases the average number of cells sharing the same eeSNVs (Fig. 4d), corresponding to the decreasing number of eeSNVs (Fig. 4b). Note the slight drop in the average number of cells sharing the same eeSNVs in Kim data when a > 0.6, this is due to overpenalization (e.g., a = 0.8 yields only 34 eeSNVs).

Characteristics of the eeSNVs. X-axis: the regularization parameter a values used by LASSO penalization in the SSrGE models. And the Y-axes are: a Log10 transformation of the number of eeSNVs. b The average number of eeSNVs per gene. c The proportion of SNVs with dbSNP138 annotations (human datasets). d The average number of cells sharing eeSNVs

Cancer relevance of eeSNVs

Following the simulation results, we ranked the different eeSNVs and the genes for the five datasets, from SSrGE models (Supplementary Data 2). We found that eeSNVs from multiple genes in human leukocyte antigen (HLA) complex, such as HLA-A, HLA-B, HLA-C, and HLA-DRA, are top ranked in all four human datasets (Table 2 and Supplementary Data 2). HLA is a family encoding the major histocompatibility complex (MHC) proteins in human. Beta-2-microglobulin (B2M), on the other hand, is ranked 7th and 45th in Ting and Patel datasets, respectively (Table 2). Unlike HLA that is present in human only, B2M encodes a serum protein involved in the histocompatibility complex MHC that is also present in mice. Other previously identified tumor driver genes are also ranked top by SSrGE, demonstrating the impact of mutations on cis-gene expression (Table 2 and Supplementary Data 2). Notably, KRAS, previously linked to tumor heterogeneity (Kim et al. 37 ), is ranked 13th among all eeSNV containing genes (Supplementary Data 2). AR and KLK3, two genes reported to show genomic heterogeneity in tumor development in the original study 22 , are ranked 6th and 19th, respectively. EGFR, the therapeutic target in Patel study with an important oncogenic variant EGFRvIII (Patel et al. 23 ), is ranked 88th out of 4225 genes. Therefore, genes top-ranked by their eeSNVs are empirically validated.

Next we conducted more systematic investigation to identify KEGG pathways enriched in each dataset, using these genes as the input for DAVID annotation tool 38 (Fig. 5a). The pathway-gene bipartite graph illustrates the relationships between these genes and enriched pathways (Fig. 5b). As expected, Antigen processing and presentation pathway stands out as the most enriched pathway, with the sum −log10 (p-value) of 15.80 (Fig. 5b). Phagosome is the second most enriched pathway in all four data sets, largely due to its members in HLA families (Fig. 5b). Additionally, pathways related to cell junctions and adhesion (focal adhesion and cell adhesion molecules CAMs), protein processing (protein processing in endoplasmic reticulum and proteasome), and PI3K-AKT signaling pathway are also highly enriched with eeSNVs (Fig. 5a).

Gene and KEGG pathways enriched with eeSNVs in the five scRNA-seq datasets. a Bipartite graph using significant KEGG pathways and genes enriched with eeSNVs as nodes. An edge exists between a significant pathway and a gene if this gene is part of the pathway. Genes of each data set is represented with a distinct color. The size of the nodes reflects the gene and the pathway scores. The gene scores are computed by SSrGE and the pathway scores are the sum of the gene scores linked for each pathway. b KEGG pathways enriched within the top 100 genes based on eeSNV contributions in the five datasets. Pathways are sorted by the sum of the −log10 (p-value) of each dataset, in the descending order

Heterogeneity markers using eeSNVs

We display the potential of eeSNV as heterogeneity markers via pseudo-time reconstruction and heatmap, using Kim dataset (Fig. 6a, b). We built a Minimum Spanning Tree, similarly to the Monocle algorithm 39 , to reconstruct the pseudo-time ordering of the single cells. The graphs beautifully capture the continuity among cells, from the primary to metastasized tumors (Fig. 6a). Moreover, it highlights ramifications inside each of the subgroups, demonstrating the intra-group heterogeneity. On the contrary, pseudo-time reconstruction using GE showed much less complexity and more singularity (Supplementary Figure 7). As a confirmation, hierarchical clustering of eeSNV heatmp also shows almost perfect separation of the three subgroups (Fig. 6b). Next, we used our method to identify eeSNVs specific to each single-cell subgroup and ranked the genes according to these eeSNVs. We compared the characteristics of the metastasis cells to primary tumor cells. Two top-ranked genes identified by the method, CD44 (1st) and LPP (2nd), are known to promote cancer cell dissemination and metastasis growth after genomic alteration 40,41,42,43 (Supplementary Data 2). Other top-ranked genes related to metastasis are also identified, including LAMPC2 (7th), HSP90B1 (14th), MET (44th), and FN1 (52th). As expected, Pathways in Cancer are the top-ranked pathway enriched with mutations (Fig. 6b). Additionally, Focal Adhesion, and Endocytosis pathways are among the other significantly mutated pathways, providing new insights on the mechanistic difference between primary and metastasized RCC tumors (Fig. 6c).

Heterogeneity revealed by Kim dataset. a Pseudo-time ordering reconstruction of the different subgroups: single cells from PDX primary tumor (green), patient metastasis (blue) and PDX metastasis (red). The eeSNVs are obtained with a = 0.6. The tree is inferred using the MST algorithm on Pearson’s correlation-based distance matrix. b Heatmap of the cells (row) and eeSNVs (column). c Bipartite graph using KEGG pathways (orange color) and genes enriched with significant eeSNVs (green color) as two set of nodes. The significant eeSNVs are inferred from the metastasized cells, compared to the primary tumor cells. The size of the nodes reflects the gene scores (given by SSrGE) and the pathway scores (sum of the gene scores). Lighter green indicates genes with a lower rank

Another application is to explore the potential of eeSNVs to separate different cell types within the same individual. Toward this, we extended the same analysis on the two patients BC03 and BC07 from the Chung dataset, who have primary and metastasized tumor cells as well as infiltrating immune cells. Again, bipartite graphs and minimum spanning trees-based visualization illustrate clear separations of tumor (primary and metastasized) cells from immune cells (Supplementary Figure 8). Furthermore, the top-ranked genes relative to the metastasis subgroups (BC03M and BC07M) present some similarities with those found in Kim dataset (Supplementary Data 3). Strikingly, CD44 is also top-ranked (23rd) among the significant genes of BC07M. Similarly, HSP90B1 is top-ranked as the 63rd and 51st most important genes, in BC03M and BC07M, respectively.

Integrating DNA- and RNA-seq data in the same single cells

Coupled DNA-seq and RNA-seq measurements from the same single cell are the new horizon of single-cell genomics. To demonstrate the potential of SSrGE in integrating DNA and RNA data, we downloaded a public single cells data, which have DNA methylation and RNA-seq records from the same hepatocellular carcinoma (HCC) single cells (Hou dataset) 44 . We then inferred SNVs from the aligned reduced representation bisulfite sequencing (RRBS) reads (see Methods), and used them to predict the scRNA-seq data from the same samples. Given the fact that SNVs are heterozygous among tumor and normal cells, and that a small fraction of genes harboring eeSNVs are subject to CNV, we included both the percentages of SNVs as well as CNVs as additional predictive variables in the SSrGE model besides SNV features. Interestingly, the identified eeSNVs can clearly separate normal hepatocellular cells from cancer cells and highlight the two cancer subtypes identified in the original study (Fig. 7). Pseudo-time ordering shows an early divergence between the two previously assumed subtypes (Fig. 7b). This observation is confirmed by hierarchical clustering of eeSNV based heatmap (Fig. 7c). A simplified version of SSrGE model, where only SNV features were considered as predictors for gene expression, shared 92% eeSNVs as those in Fig. 7a, and achieved almost identical separations between normal hepatocellular cells and cancer cells. This confirms the earlier observation that eeSNVs are much more important predictive features, compared to CNVs (Fig. 1c, d).

Heterogeneity revealed by eeSNVs from multi-omics single cell HCC (Hou) dataset. Normal cells are colored green and HCC tumor cells are colored bright (subpopulation I) or dark (subpopulation II) red. a Bipartite-graph representation using the single cells and eeSNVs from RRBS reads as two sets of nodes. b Pseudo-time ordering reconstruction of the HCC cells, using eeSNVs in from RRBS. c Heatmap of the cells (row) and eeSNVs (column)

We postulated that a considerable part of bisulfite reads was aligned with methylation islands associated with gene promoter regions. Thus, we annotated eeSNVs within 1500 bp upstream of the transcription starting codon, and obtained genes with these eeSNVs, which are significantly prevalent in certain groups. When comparing HCC vs. normal control cells, two genes PRMT2, SULF2 show statistically significant mutations in HCC cells (p-values < 0.05, Fisher’s exact test). Downregulation of PRMT2 was previously associated with breast cancer 45 , SULF2 was known to be upregulated in HCC and promotes HCC growth 46 .


FEATURES

SNP calling

QualitySNPng takes as input a sequence alignment file in SAM ( 3) or ACE ( 13) format with single-end or paired-end reads as produced by read mappers like Bowtie ( 14) and BWA ( 15) or de novo assemblers like CABOG ( 16) and PCAP ( 17). The QualitySNPng software uses three filtering steps to eliminate unreliable variations similar to the original QualitySNP ( 6). The first filter labels all nucleotide differences that occur in a minimum number of reads as potential SNPs. This minimum number can be adjusted by the user as an absolute number or a fraction of the total number of reads. The second filter takes into account the quality of the sequence containing the variant nucleotide and leaves only the high confidence SNPs. The base quality, characterized by the Phred score ( 18), is used for this when it is present in the input sequence alignment. If no Phred score is present, all nucleotides in the input reads are assumed to be of high quality. Additionally, the score can be modified based on specific sequence patterns. For instance, variations found in homopolymeric tracts can be set to low quality. This option is particularly useful when Roche/454 sequences are processed, as these are known to be prone to homopolymer-associated errors ( 19). Also a number of nucleotides at the 5′- or 3′-ends can be labelled as low quality, for instance to avoid false SNPs caused by incomplete adaptor trimming. The third filter involves predicting haplotypes based on the high confidence SNPs. Only if variation is supported by one or more haplotypes, it is considered as a reliable SNP. Compared with the original QualitySNP software, the second and third filters were reversed to make sure that the detected haplotypes are based on high confidence SNPs only. The run time largely depends on the size and nature of the input sequencing data, ranging from less than a minute for a set of ∼25 000 contigs (∼100 reads/contig) to 10 min for one large single contig of 7000 bp with 800 000 reads. Larger and more variable sequence alignments can take longer, also depending on the stringency of the settings: lowering the threshold for potential SNPs will result in more work for the second and third filters that are computationally the most expensive. For large input files that are expected to take several hours to process, one can use the command line ‘server mode’ option of the tool to perform the SNP calling on a compute server and subsequently analyse the results using the GUI.

Viewing results

The results of the SNP calling can be viewed directly using the GUI, and they are also saved in structured text files for later reference or further processing. The different contigs from the input sequence alignments are listed in a table showing the number of SNPs, the reads and the haplotypes. The haplotype count in the table is corrected for fragmented haplotypes by taking the maximal number of haplotypes that is found per SNP position. Fragmentation of haplotypes may occur and is caused by SNPs that are too far apart to be linked to one allele by a single-sequence read or a read pair, see Figure 1 for an example. The contig list can be filtered based on the numbers of reads, SNPs and haplotypes and (partial) contig name.

Screenshot of QualitySNPng output. Result of the SNP detection using Arabidopsis thaliana RNA-seq data set from two accessions that were mapped to Arabidopsis transcripts ( 20). In the left, the list of transcripts is shown, limited here using the filter options to only the ones with between 8 and 25 SNPs and between 1000 and 2000 reads. The details for the selected transcript are shown on the right: the top window shows the predicted haplotypes, the middle window shows the alleles per accession (Col-0 and Can-0) and the bottom window shows the reads aligned to the transcript sorted per haplotype (reads without SNP are not shown).

Screenshot of QualitySNPng output. Result of the SNP detection using Arabidopsis thaliana RNA-seq data set from two accessions that were mapped to Arabidopsis transcripts ( 20). In the left, the list of transcripts is shown, limited here using the filter options to only the ones with between 8 and 25 SNPs and between 1000 and 2000 reads. The details for the selected transcript are shown on the right: the top window shows the predicted haplotypes, the middle window shows the alleles per accession (Col-0 and Can-0) and the bottom window shows the reads aligned to the transcript sorted per haplotype (reads without SNP are not shown).

A selected contig will show a window with the aligned reads and the SNPs indicated, a table with the haplotypes and their alleles per SNP position and a table showing the alleles for the different samples in the input data ( Figure 1). For this last table to appear, the input sequence alignment file should be annotated with a ‘read group’ (see SAM format definition) per read, or alternatively, have group labels included in the read names. The overview per sample can for instance be used to compare alleles between different accessions, strains or ecotypes and for genotyping by sequencing.

Manual inspection of the read alignment together with the haplotype overview gives insight in the quality of the alignment, local read coverage and positions of the SNPs. Based on this visual inspection, one can decide to alter the stringency of the filter settings and rerun the SNP calling. The reads can be sorted on start position or per haplotype and can be viewed at different zoom levels.

For the creation of a SNP array, marker SNPs can be selected and exported with flanking sequence of a specified length as a structured text file that can be imported into a standard spreadsheet program or an assay design program.

To avoid problems in SNP scoring, we suggest selecting markers from contigs that have no more than the maximum expected number of haplotypes, i.e. two for diploid species, as contigs with more haplotypes may contain paralogous sequences. To further increase the chance of obtaining markers that will perform well on arrays, one could use the BLAST program ( 21) to eliminate marker sequences that show high similarity to other genes, as was shown previously ( 7).


RNA-SEQ SPECIFIC EFFECTS AND BLOCKING

As in microarray studies, RNA-seq experiments can be affected by the variability coming from nuisance factors, often called technical effects in the RNA-seq literature. Besides processing date, technician and reagent batch, which are commonly known to investigators, there are some recognized technical effects specific to the RNA-seq procedures. One of these technical effects comes from the generation of libraries of cDNA fragments, which involves various ligations of adaptors and PCR amplifications. Besides the library preparation effect, there are also other technology-specific effects. For example, the commonly used Illumina-sequencing technology can sequence eight samples simultaneously in the eight lanes in one flow cell, of which one lane is often used for the control sample. Thus, there is variation from one flow cell to another resulting in flow cell effect. In addition, there exits variation between the individual lanes within a flow cell due to systematic variation in sequencing cycling and/or base-calling. Among these sources of variation, the library preparation effect is the largest [ 40]. The flow cell and lane effects are relatively small [ 20, 41].

From the experimental design point of view, there are some steps that can be taken to properly handle these effects besides the technology improvement. For the library preparation effect, introducing replicates before this step (often biological replicates) provides a way to estimate this effect and to properly handle it in the statistical inference. Blocking design can be used to eliminate the flow cell and lane effects. Blocking is also an experimental design principle. It dictates comparisons within a block, a known uninteresting factor that causes variation, such as flow cell effect. Either the randomized complete block design (RCBD) or the balanced incomplete block design (BIBD) can be used to achieve this goal, depending on the number of treatments/groups to be compared. Sequencing lanes can also serve as blocks when bar-coding during library preparation (for the protocol for Illumina platform, see http://www.illumina.com/Documents/products/datasheets/datasheet_sequencing_multiplex.pdf) is used for multiplexing [ 17]. However, it has been shown that multiplexing reduces sensitivity and reproducibility in miRNA detection [ 42]. Therefore, caution needs to be taken when multiplexing is considered for the purpose of reducing flow cell and lane effects.


Discussion

We report evidence of extensive RNA editing in a human cell line, which underscores the need for robust methods to detect these events. We developed a pipeline for identifying RNA editing events by screening RNA-DNA differences in the same individual through successive quality control filters. The pipeline performed well on simulated data in terms of both sensitivity and specificity (Fig. 1c and Supplementary Tables 4,5).

False positives are a critical issue in the analysis of SNVs from RNA-Seq data, as evidenced by the simulation results (Fig. 1c) and our experience with an earlier version of our method that did not include strict quality control filters (data not shown). This problem, which is also evident in genomic SNP detection 24,35 and the recent reevaluation of the large-scale RNA-Seq study 36 , may be due to several factors. Some of the false positives are certainly a result of inaccuracy in read alignment, of which paralogous, highly similar sequences, and splice junctions represent major sources. In addition, mapping reliability can be further compromised in the presence of RNA transcript sequence variation. From a technical perspective, paired-end reads with greater length (75, 90 or 100 bp in our study) presumably should increase the fidelity of read alignment 37 . To address the inconsistency in read mapping, we implemented two independent filters in our calling method (the MES and BLAT steps) that remove false-positive results that were identified in a naive analysis of simulated reads. Finally, biases in calling editing sites may also stem from insufficient coverage and accuracy of the genome sequence, which becomes problematic in ascertaining potential edits at positions that correspond to genomic polymorphisms. However, the 36-fold average coverage of our genome-resequencing data, combined with the YH genome variants filter that addresses zygosity and copy number variation of the genome at the editing sites, reduces the likelihood of such errors.

Our method and results affirm that quality control filters are necessary to accurately identify RNA editing sites. This study also indicates the need to archive RNA edits for the development of more thorough statistical models that incorporate prior knowledge of sequence variation and the sequencing technologies used. Notably, recent computational approaches for detecting A→I(G) base changes in human mRNA databases also incorporated molecular features that underlie RNA editing, such as RNA folding characteristics or tissue-preferred distribution of editing events 38,39 . These filter criteria may thus be included as additional modules in our workflow to analyze more complex or functionally relevant data sets in future deep-sequencing studies.

As this manuscript was being prepared, two large-scale screens for RNA-DNA differences that used deep-sequencing approaches similar to ours were reported 20,21 . Several differences were notable between these studies, including the design of the site-calling pipelines, the extent of the sampled transcriptome, the number of sites identified and the distribution of editing types. Notably, these studies reported that ∼ 23% (ref. 20) and 62% (ref. 21) of editing sites were A→G changes, whereas the vast majority (>90%) of our candidate sites were A→G changes. Furthermore, our work complements previous and recent findings with in-depth information of editing across a broader sampling of the transcriptome, particularly the intergenic transcripts.

Amid the recent deep-sequencing studies of RNA editing, there has been substantial controversy centered on the technical drawbacks of this technology as well as related analysis algorithms and experimental design. We suggest that our overall methodology thoroughly addresses these concerns and minimizes errors when inferring editing sites from RNA-Seq data. The need for stringent criteria in identifying RNA-DNA differences is reinforced by a recent report showing that, after accounting for paralogous and genomic variant sequences, a considerable portion of the candidate sites that were identified in a previous study 20 might actually represent spurious results 36 . We therefore also did an independent assessment of these data 20 using our workflow (Supplementary Table 13). This analysis revealed that candidate editing or RNA-DNA difference sites were likely overestimated by their approach. In addition to the potential contribution of paralogous sequences and genomic variants as the error source, we also found that data quality and depth played a role in the possibly erroneous calling of variants. Almost 60% of the sites identified previously 20 could be removed by our “read parameter” filter based on their location within 8 bp from the ends of 50-bp reads. Moreover, because of the low depth of individual genome sequences sampled in that report, some of the putative editing sites did not fulfill the requirements defined by our “genome variants” filter and may actually represent polymorphic sequences encoded by the genome. Notably, however, features of the editing sites called by our pipeline from the data 20 are similar to features of sites called from our RNA-Seq data (Supplementary Table 13). This suggests that the discrepancies between the two studies might be attributed mainly to the different study designs rather than the underlying molecular biology.

In summary, our results support the accuracy of our multifilter modular pipeline to annotate an editome and to provide a global and quantitative catalog of nucleotide variants in a transcriptome. The next step is clearly to apply this methodology to larger-scale deep-sequencing studies involving additional physiologically relevant samples, so as to profile and compare editomes more comprehensively and accurately.


MATERIALS AND METHODS

Retrieval of empirical data

We used three types of empirical data (Figure 1, Supplementary Table S1 ). First, we used previously published RNA-seq data from four different studies that include transcriptome sequencing from a 17-member, three-generation family ( 26), transcriptome sequencing from a trio ( 3), transcriptome sequencing from a pair of first degree relatives and two unrelated individuals ( 27) as well as targeted transcript sequencing (Ion AmpliSeq, Life Technologies) from 7 unrelated individuals and a pair of siblings ( 28). Second, we retrieved genetic variation data from six pairs of first and second degree relatives from the CDX population of the 1000 Genomes Project ( 29). From this data we then simulated RNA-seq reads (the method of simulation of RNA-seq reads is described below). Third, we retrieved genetic variation data from unrelated individuals from the 1000 Genomes Project ( 29) from which we simulated families (the method of simulation of genotypes of family members is described below) and RNA-seq reads.

Overview of data workflow for kinship detection and pedigree reconstruction using RNA-seq data. GQ: genotype quality, DP:depth, IBD: identity by descent, MAF: minor allele frequency.


Dual RNA-seq

There is a wide range of interactions between species, such as parasitism, symbiosis, competition, etc. The conventional transcriptome sequencing can only study the information of a single species, which not only wastes part of the data, but also affects the sample itself during the separation of two species.

Dual RNA-seq has been shown to monitor the all classes of coding and noncoding transcripts of both host and pathogen simultaneously. CD Genomics is offering a high-resolution, affordable and straightforward dual RNA-seq service to provide direct insight into host–pathogen interplay.

By constructing only one transcriptome library, dual RNA-seq of total mixed RNA following double rRNA depletion or poly(A) capture allows sequencing and analyzing two (or more) species at the same time without the need to separate the species, thereby revealing the dynamic changes in gene expression between them. Meanwhile, through the interaction model diagram, to obtain the regulatory relationship of genes and the interaction mechanism between two species to investigate the regulatory network in the interaction process, the mechanism of pathogen infection and the host resistance to disease and to investigate the evolutionary relationship of pathogen among different species, and further explore the positive selection of related genes based on homologous genes.

CD Genomics could deal with various invasion models-- pathogens involving bacteria, fungi, protozoa, etc., host could be a mammal or plant. We are aiming to provide comprehensive dual RNA-seq services from experimental design to biocomputational analysis to support your research needs.

Key Advantages and Features

  • Available for various invasion models
  • Flexibility of sample type: total mixed RNA, infected host cells, etc.
  • UMI technology allows small amounts of input template

Dual RNA-seq Workflow


Generic Pipeline of Dual RNA-seq Data Analysis (figure from V. Arluison et al., 2018)

  1. Alexander J. Westermann, et al., Dual RNA-seq unveils noncoding RNA functions in host–pathogen interactions. Nature. 2016, vol. 000.
  2. Alexander J. Westermann, et al., Resolving host–pathogen interactions by dual RNA-seq. PLoS Pathog. 2017, 13(2).
  3. Véronique Arluison and Claudio Valverde (eds.), Bacterial Regulatory RNA: Methods and Protocols. Methods in Molecular Biology. 2018, vol. 1737.
  4. Pisu et al., Dual RNA-seq of Mtb-infected macrophages in vivo reveals ontologically distinct host-pathogen interactions. Cell Reports. 2020, vol. 30.

Methods

RNA and library preparation

Total RNA was isolated using Trizol from a freshly collected kidney tissue of an adult female mouse of 129S1 x Cast/Ei F1 background (F1 breeding was performed at the DFCI mouse facility, with parent animals obtained from the Jackson Laboratories. All animal work was performed under DFCI protocol 09-065, approved by the DFCI Institutional Animal Care and Use Committee. Animals were housed in accordance with Guide for the Care and Use of Laboratory Animals). RNA integrity was assessed using Bioanalyzer, and it was quantified using the Qubit device. Aliquots of this total RNA prep were used to prepare three sets of replicate libraries, all starting with polyA RNA isolation: six libraries with NEBNext kit, starting each with 100ng six libraries using SMARTseq v4 kit starting with 10 ng RNA and the same, with 0.1 ng RNA. All libraries were prepared at the DFCI sequencing facility according to the manufacturers’ instructions. All sequencing was done on HiSeq 2500 machine at the DFCI sequencing facility.

For data analysis example discussed in Use Case 2, Abelson lymphoblastoid clonal cell lines Abl.1 and Abl.2 of 129S1 × Cast/Ei F1 background 14 were cultured in RPMI medium (Gibco), containing 15% FBS (Sigma), 1X L-Glutamine (Gibco), 1X Penicillin/Streptomycin (Gibco) and 0.1% β-mercaptoethanol (Sigma). Total RNA was extracted from cells using a magnetic bead-based protocol using Sera-Mag SpeedBeads (GE Healthcare). Isolated total RNA was DNase-treated with RQ1 DNase (Promega). RNA sequencing libraries were prepared using SMARTseq v.4 kit (Takara) starting with 10 ng total RNA for each replicate. Sequencing was performed on HiSeq4000 platform at Novogene, Inc.

Additional data sources

Geuvadis dataset includes RNA-seq data on LCLs established from 462 individuals from five populations 16 . FASTQ files for paired-end reads (2 × 75 bp) for five individuals (HG00117, HG00355, NA06986, NA19095, NA20527), each with 7 replicates, were downloaded from 1000 Genomes project [ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/]. Allelic count data (processed using standard GTEx pipeline) for a randomly selected individual GTEX-11NUK from the Midpoint phase of the GTEx project were downloaded from dbGaP [https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v7.p2]. We also used RNA-seq data from mouse neuronal progenitor cells (GSE54016) [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54016].

AI estimation pipeline

AI estimation tools described here are implemented in two parts. Data processing steps from read alignment to allelic counts were based on the ASEReadCounter tool in the GATK pipeline 24 . It was re-implemented using in part Python scripts developed by S. Castel (github.com/secastel/allelecounter), and denoted as ASEReadCounter* (github.com/gimelbrantlab/asereadcounter_star). Calculation of QCC, estimation of confidence intervals and differential AI analysis are implemented in Qllelic tool set (github.com/gimelbrantlab/Qllelic).

Reference preparation

Two custom parental genomes (“pseudogenomes" 44,45 see ASEReadCounter* at github.com/gimelbrantlab/asereadcounter_star) were used for mapping as reference. For 129S1 × Cast/Ei F1 cross mouse samples, alleles are determined with maternal and paternal strain genomes and strain-specific variants for human data (Geuvadis project 16 ) phased SNP variant calls were used. Respective allelic variants from Single Nucleotide Polymorphism database 142 (dbSNP142 46 ) or 1000 Genomes Project phase 3 structural variant call-set were inserted into the reference genome (GRCm38.86 or hs37d5, 1000 genomes, phase 2), to obtain a pair of “parental" reference genomes for further analysis (for worked example see Supplementary Note S6). For each organism, we also created a vcf file with one allele considered as a reference (maternal 129S1 or first phased allele) and the other as an alternative allele. Only heterozygous sites were used in the downstream analysis.

Calculation of allelic counts

Alignment: RNA-seq reads were aligned with STAR aligner (v.2.5.4a) 47 on each of two pseudogenomes, with default threshold on quality of alignment. Only uniquely aligned reads were taken into further consideration ( –outFilterMultimapNmax 1 parameter was applied).

Allele assignment: Reads that were present in only one of the alignments, and reads that had better alignment quality for one of the alignments, were assigned to the corresponding allele read group and marked respectively. The remaining reads (not overlapping heterozygous SNP positions) were not used downstream. This procedure is based on Python scripts by S.Castel.

Read deduplication: When applied, Picard (v.2.8.0 broadinstitute.github.io/picard) MarkDuplicates was used.

Library subsampling: To ensure that all aligned counts belong to similar distributions, BAM files corresponding to the same experiment were subsampled to the same size using a custom bash script with randomness generated using the shuf command.

Allelic counting for SNPs: Given a vcf file with heterozygous positions (discussed under Reference preparation), coverage over each SNP was calculated using samtools mpileup (v.1.3.1) and parsed to obtain the table with allelic counts. This procedure is based on Python scripts by S.Castel.

Allelic counting for genes: All exons belonging to the same gene were merged into a single gene model based on GTF file (RefSeq GTF files, GRCm38.68 and GRCh37.63, were downloaded from Ensemble ftp://ftp.ensembl.org/pub/release-68/gtf/ 48 ), excluding overlapping regions that belong to multiple genes. Phased allelic counts for all SNPs within the whole gene model were summed:

Unless specified otherwise, only genes with ≥10 total counts were used for further analysis.

Allelic Imbalance estimates: Estimates for AI for a gene g were obtained as a proportion of maternal gene counts (Mg) to total allelic gene counts:

Additional tools for AI calculation

We used three tools in our comparative analyses: Qllelic (v0.3.2), MBASED (v1.20.0) and GeneiASE (v1.0.1). For uniformity, input for comparisons was pre-processed in the same way for all tools. In case of real data, the same genes were filtered so the data will satisfy all the tools requirements on SNP numbers and SNP coverages (see Supplementary Fig. S4). The default parameters of all the tools were used in analyses (see Supplementary Figs. S4, S7b):

One-sample analysis: For Qllelic: default parameters of PerformBinTestAIAnalysisForConditionNPoint() function. For MBASED: runMBASED function with isPhased = TRUE, numSim = 10000 , and the rest set to default values. For GeneiASE: default parameters of geneiase -t static

Two-sample analysis: For Qllelic: default parameters of PerformBinTestAIAnalysisForTwoConditions() function. For MBASED: runMBASED function with isPhased = TRUE, numSim = 10000 , and the rest set to default values. For GeneiASE: default parameters of geneiase -t icd

Calculation of quality correction constant for 2 replicates

As gene coverage is an essential parameter of proportional beta-binomial model of allelic imbalance, we started with the standard procedure of splitting genes into bins by coverage to discretize our model.

Bin boundaries were defined as rounded up powers of the base b = 1.05: (ar=^<1> ceil ,lceil ^<2> ceil ,lceil ^<3> ceil ,ldots >) . Note that QCC calculations do not strongly depend on the exact bin size, see Supplementary Fig. S7. Each gene g was assigned to a bin according to the mean of its counts C1g and C2g from two technical replicates:

then each bin Bi, containing set of genes Gi, was processed separately.

Fitting AI distribution as beta-binomial mixture

To fit the parameters of a mixture of two proportional beta-binomial distributions, representing observed AI from the pooled replicate in each coverage bin Bi: