RNA-Seq

RNA-Seq is a powerful technology that aims to reveal which RNAs are present or absent in a sample at any given time. The transcriptome, as this set of RNAs is called, is highly dynamic and constantly changing, in contrast to the static genome. Recent developments in next-generation sequencing (NGS) allow for increased base coverage of a DNA sequence, as well as higher sample throughput.

This facilitates sequencing of the RNA transcripts in a cell, making it possible to examine alternatively spliced gene transcripts, post-transcriptional changes, gene fusions, mutations/SNPs, and changes in gene expression. In addition to mRNA transcripts, RNA-Seq can look at different populations of RNA, including total RNA, small RNA such as miRNA, tRNA, and ribosomal profiling. RNA-Seq can also be used to determine exon/intron boundaries and to verify or amend previously annotated 5’ and 3’ gene boundaries.

RNA Sequencing

Ongoing RNA-Seq research includes observing cellular pathway alterations during infection and gene expression level changes in cancer studies. Prior to NGS, transcriptomics and gene expression studies were done with expression microarrays, which contain thousands of DNA sequences that probe for a match in the target sequence, providing a profile of all transcripts being expressed.

This was later done with Serial Analysis of Gene Expression (SAGE). One deficiency of microarrays that makes RNA-Seq more attractive is limited coverage: such arrays target known common alleles, which represent approximately 500,000 to 2,000,000 SNPs of the more than 10,000,000 in the genome. As a result, libraries aren’t usually available to detect and evaluate rare allele variant transcripts, and the arrays are only as good as the SNP databases they’re designed from, so they have limited application for research purposes. Many cancers, for example, are caused by rare (<1%) mutations that would go undetected.

Our team of geneticists, statisticians, and computational biologists has extensive collective experience in this field of genetic research. We provide cutting-edge sequencing, data analysis, and support to numerous researchers at UT Southwestern and beyond. The following is a basic workflow that we employ for the analysis of such data.

Please contact us if you would like more details about the workflow, including specific software parameters, genome versions used, etc. This publication is also a great place to start for those who want a basic introduction as well as analysis procedures for RNA-Seq data.

    Demultiplexing

    Usually, NGS is done on pools of samples. Hence a lane can contain a mixture of libraries whose reads need to be separated into their corresponding samples. We use Illumina's bcl2fastq software for most conversions of bcl files to fastq files.
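The idea behind demultiplexing can be illustrated with a minimal Python sketch. This is not how bcl2fastq is implemented; it only shows the principle of assigning each read to a sample by comparing its index (barcode) sequence with the known sample barcodes. The function names, the sample names, and the one-mismatch tolerance are assumptions chosen for illustration.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, sample_barcodes, max_mismatches=1):
    """Group reads by sample using their index sequences.

    reads           -- iterable of (index_sequence, read_sequence) pairs
    sample_barcodes -- dict mapping sample name -> expected barcode
    max_mismatches  -- hypothetical tolerance, chosen for illustration

    A read is assigned only when exactly one sample barcode is within the
    tolerance; everything else goes to an "Undetermined" bin.
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        hits = [
            name
            for name, barcode in sample_barcodes.items()
            if len(barcode) == len(index_seq)
            and hamming(barcode, index_seq) <= max_mismatches
        ]
        key = hits[0] if len(hits) == 1 else "Undetermined"
        bins[key].append(read_seq)
    return dict(bins)


# Toy example with made-up barcodes and reads.
barcodes = {"S1": "ACGTAC", "S2": "TTGCAA"}
pooled = [
    ("ACGTAC", "AAAA"),  # exact match to S1
    ("ACGTAA", "CCCC"),  # one mismatch from S1
    ("TTGCAA", "GGGG"),  # exact match to S2
    ("GGGGGG", "TTTT"),  # matches no sample
]
by_sample = demultiplex(pooled, barcodes)
```

Requiring a unique match keeps ambiguous reads out of every sample, at the cost of a larger undetermined bin.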

    Sequencing Quality Assessment

    At the McDermott Center, we are most careful about the quality of data generated by the sequencers. The first check is to make sure we have enough sequencing reads for the analysis. Other parameters include % Passing Filter, Mean Quality Score, and % of >=Q30 Bases to name a few.
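As a rough illustration of two of these metrics, the sketch below computes the mean quality score and the percentage of bases at or above Q30 from FASTQ quality strings. It assumes the Phred+33 encoding used in current Illumina FASTQ files. The function name is hypothetical, and in practice these numbers come from the sequencer's own reports rather than from a script like this.

```python
def quality_summary(quality_strings, phred_offset=33):
    """Return the mean Phred score and % of bases >= Q30.

    quality_strings -- iterable of FASTQ quality lines (one per read)
    phred_offset    -- ASCII offset of the encoding (33 assumed here)
    """
    total_bases = 0
    q30_bases = 0
    score_sum = 0
    for quality in quality_strings:
        for char in quality:
            score = ord(char) - phred_offset
            total_bases += 1
            score_sum += score
            if score >= 30:
                q30_bases += 1
    if total_bases == 0:
        raise ValueError("no bases to summarize")
    return {
        "mean_quality": score_sum / total_bases,
        "pct_q30": 100.0 * q30_bases / total_bases,
    }


# 'I' encodes Q40, '?' encodes Q30, '5' encodes Q20 under Phred+33.
summary = quality_summary(["II??", "5555"])
```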

    For an overview of how Illumina sequencing technology works, please visit the Massachusetts General Hospital Overview of Illumina Chemistry.

    Once samples pass the initial sequencing quality metrics generated by the sequencers, they are assessed with FASTQC, which checks per-base sequence quality, GC content, and N content, among others. If the data look sub-par, the samples are immediately re-prepped and re-sequenced. Data trimming is done if needed using any one of:

    Our quality control process also includes assessing per base coverage, mean coverage, and on-target percentages, among others; these will be discussed in the Useful Metrics section. We also check for contaminants across different genomes as well as different contaminants such as ribosomal, mitochondrial, adapters, vectors, etc.

    Ribosomal Contamination

    A very important part of our quality control is to quantify the amount of ribosomal content in each of the samples. For each sample, a random selection of reads is mapped to the ribosomal sequences of the pertinent species. If the percentage mapped is too high, the sample is flagged as contaminated and the sequencing core is notified that it needs re-prepping and re-sequencing.
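The bookkeeping around this check can be sketched as follows. The mapping step itself is done by an aligner and is not shown. The function names and the 10% cutoff are hypothetical; the actual threshold used by the core is not stated here.

```python
import random


def subsample_reads(read_ids, n, seed=0):
    """Pick up to n reads at random (seeded so the selection is reproducible)."""
    read_ids = list(read_ids)
    rng = random.Random(seed)
    return rng.sample(read_ids, min(n, len(read_ids)))


def ribosomal_check(n_sampled, n_mapped_to_rrna, max_pct=10.0):
    """Compute the % of sampled reads mapping to rRNA and flag the sample.

    max_pct is an illustrative cutoff, not a documented value.
    """
    if n_sampled <= 0:
        raise ValueError("n_sampled must be positive")
    pct = 100.0 * n_mapped_to_rrna / n_sampled
    return {"pct_ribosomal": pct, "flagged": pct > max_pct}


# Toy example: 1,000 reads sampled, 150 of them mapped to rRNA sequences.
picked = subsample_reads(range(50_000), 1_000)
result = ribosomal_check(len(picked), 150)
```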

    Mitochondrial Contamination

    It is necessary to remove mitochondrial sequences from the sample before sequencing so that they do not interfere with downstream analysis.

    Internal Metrics used for QC

    RNA-SeQC

    RNA-SeQC is a Java program that computes a series of quality control metrics for RNA-Seq data. The input can be one or more BAM files. The output consists of HTML reports and tab-delimited files of metrics data. This program can be valuable for comparing sequencing quality across different samples or experiments to evaluate different experimental parameters. It can also be run on individual samples as a means of quality control before continuing with downstream analysis.

    RSeQC

    The RSeQC package provides a number of useful modules that can comprehensively evaluate high-throughput sequence data, especially RNA-Seq data. “Basic modules” quickly inspect sequence quality, nucleotide composition bias, PCR bias, and GC bias, while “RNA-seq specific modules” investigate sequencing saturation status of both splicing junction detection and expression estimation, mapped reads clipping profile, mapped reads distribution, coverage uniformity over gene body, reproducibility, strand specificity, and splice junction annotation.

    Genomic mapping
    The flowchart employed in the processing of an RNA-Seq dataset.

    We use featureCounts to generate raw counts for RNA-Seq data. The raw counts are then normalized for sequencing depth and gene length. The raw counts are used as input for the differential expression analysis. We currently use edgeR for differential expression analysis.

    Transcript assembly and expression

    The raw counts for each of the samples are used in our edgeR analysis for differential expression and to output normalized TPM (Transcripts Per Million) values for both the GENCODE and igenomes classifications.

    Differential expression

    Differential expression analysis is carried out with edgeR (using both the igenomes and the GENCODE GTFs). The edgeR analysis produces fold change smear plots, tagwise dispersion plots, cluster plots, mean-variance plots, normalized counts, and, most importantly, a table of differentially expressed genes/transcripts. The different outputs are described in more detail in this edgeR user guide.

    Normalization in RNA sequencing

    Raw counts are normalized for sequencing depth and gene length, and also to account for technical and biological variation in the libraries. We provide users with TPM (Transcripts Per Million) values. For details about the various types of normalization applied to RNA sequencing datasets, please visit the National Library of Medicine.
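For readers unfamiliar with TPM, the standard calculation can be sketched in a few lines of Python: each gene's count is first divided by its length in kilobases, and the resulting rates are then scaled so that they sum to one million within the sample. The function name, gene names, and toy numbers are illustrative assumptions. This is a minimal sketch of the usual TPM definition, not the pipeline's actual code.

```python
def tpm(counts, lengths_bp):
    """Convert raw counts to TPM.

    counts     -- dict mapping gene -> raw read count
    lengths_bp -- dict mapping gene -> gene (or transcript) length in bases
    """
    # Step 1: correct for gene length (reads per kilobase).
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    # Step 2: correct for sequencing depth by scaling to one million.
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total * 1_000_000 for gene, rate in rates.items()}


# Toy example with made-up genes.
raw = {"geneA": 100, "geneB": 200, "geneC": 0}
lengths = {"geneA": 1000, "geneB": 4000, "geneC": 2000}
values = tpm(raw, lengths)
```

Here geneB has twice the raw count of geneA but is four times as long, so its TPM is half that of geneA. Because TPM values sum to the same total in every sample, they are convenient for comparing the relative abundance of genes within and across libraries.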