Exon capture, or whole exome sequencing, is an efficient approach to sequencing the coding regions of the human genome. Many researchers are interested only in the protein-coding regions, which make up roughly 1-2 percent of the genome, and targeting these regions during sequencing yields greater depth at a lower cost than whole genome sequencing, which aims to capture the entire genome. In general, knowing the complete DNA sequence of an individual's genome does not, on its own, provide useful clinical information, but this may change over time as a growing number of scientific studies detail clear associations between specific genetic variants and disease. Whole genome sequencing is also preferable when questions cannot be answered by looking only at the coding regions. Both approaches aim to detect many different types of mutations, including germline and somatic variants, insertions, deletions, and copy number variations, to name a few.
Most biologically significant events arise from variation at the DNA sequence level. These sequence variants can be correlated with phenotypes, disease states, experimental conditions, and more, and identifying them is the most fundamental task of modern genomic studies. Such changes can now be identified with confidence at a reasonable price using high-throughput sequencing. At the McDermott Sequencing Core, high-throughput genomic sequencing is performed on Illumina machines and is available in two flavors: Whole Genome Sequencing and Exome Sequencing. The choice between the two depends on the desired results.
Whole genome sequencing is done mostly when variation is expected across the whole genome: structural variations, copy number variations, variants in non-coding regions, and other regions not targeted by exome capture. It is also used when the organism to be sequenced has a relatively small genome or lacks an available exome capture kit. Whole genome sequencing can be expensive and takes longer than exome sequencing. Often, researchers are interested only in variants in exonic regions, and these can be sequenced at high coverage using exome sequencing, which involves targeted capture and sequencing of an organism's exome using one of several commercial kits.
Our team includes geneticists, statisticians, and computational biologists. We provide cutting-edge sequencing, data analysis, and support to numerous researchers at UT Southwestern and beyond. The following is the basic workflow we employ for the analysis of such data.
Please contact us if you would like more details about the workflow, including specific software parameters, genome versions used, etc. "Exome sequencing: the sweet spot before whole genomes" (Teer and Mullikin, 2010) is an accessible introduction to exome sequencing.
Files provided with Exon Capture/Whole Genome analysis
- Raw unprocessed gzipped FASTQ files
- FASTQC report with basic sequencing quality statistics
- Mapped BAM files (sorted -> low-qual filtered -> duplicates removed -> realigned -> recalibrated)
- Mapping statistics
- Picard metrics
- Coverage statistics (plots, target region coverage, mean coverage, etc.)
- Annotated variant calls in VCF and tab-delimited formats
Data is generated by the sequencing core using paired-end (PE) 100x100 sequencing, which is important for obtaining excellent mapping statistics. We also aim to provide at least 50x mean coverage, with 95–98% of targeted bases covered at 20x or higher, for each whole exome sample sequenced. This level of coverage is necessary for a statistically sound analysis when substantive variants are found.
Kits
The choice of kit pertains only to targeted sequencing, i.e. exome capture. Numerous kits are available, each with its own protocol and capture size. The kit we use predominantly is the Agilent SureSelect Human All Exon V4, which has a 51 Mb capture region defined by the CCDS, RefSeq, and GENCODE databases.
Demultiplexing
Multiple samples can be pooled in a lane using sample-specific barcode/index sequences. Illumina sequencing machines include a dedicated read cycle for these barcode sequences, which are usually 6 bp long. The data can then be accurately separated into its respective samples via demultiplexing. We use CASAVA to demultiplex a typical Illumina sequencing run. During demultiplexing, the adapters can be masked or left intact if the researcher chooses. Two FASTQ files are produced for each sample, one for each read of the pair.
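As a rough sketch of the CASAVA 1.8-era demultiplexing step (the run folder path, sample sheet name, and 1-mismatch barcode tolerance below are illustrative placeholders):

```bash
# Convert BCL basecalls into demultiplexed, gzipped FASTQ files (CASAVA 1.8).
configureBclToFastq.pl \
    --input-dir /path/to/RunFolder/Data/Intensities/BaseCalls \
    --output-dir Unaligned \
    --sample-sheet SampleSheet.csv \
    --mismatches 1

# configureBclToFastq.pl writes a Makefile; make performs the conversion.
cd Unaligned && make -j 8
```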
Sequence Quality Assessment
At the Eugene McDermott Center for Human Growth and Development, we pay close attention to the quality of data generated by the sequencers. The first check is to make sure there is enough sequencing for the analysis; samples that do not meet the required read-count threshold are resequenced. Other parameters include % passing filter, mean quality score, and % of >=Q30 bases, to name a few. A detailed summary of the demultiplexing statistics used for this initial quality assessment is available online.
Once samples pass the initial sequencing quality metrics generated by the sequencers, they are assessed with FastQC, which checks per-base sequence quality, GC content, and N content, among other metrics. If the data looks sub-par, the libraries are immediately re-prepped and resequenced. Data contamination is checked using FastQ Screen. Typical invocations of both tools are sketched below.
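Illustrative commands, assuming reports are written to a qc_reports directory and that a fastq_screen.conf file listing the screening genomes already exists:

```bash
# Per-sample read quality report (per-base quality, GC content, N content, ...).
mkdir -p qc_reports
fastqc --outdir qc_reports sample_R1.fastq.gz sample_R2.fastq.gz

# Screen reads against a panel of reference genomes to detect contamination;
# the genomes and aligner are configured in fastq_screen.conf.
fastq_screen --conf fastq_screen.conf sample_R1.fastq.gz
```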
Data trimming is performed if needed, using one of several standard trimming tools; an illustrative command is sketched below.
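The workflow does not prescribe a single trimming tool, so the following is only a hypothetical example using cutadapt on a paired-end sample; the adapter sequence (the common Illumina adapter prefix) and the quality/length thresholds are illustrative assumptions:

```bash
# Trim adapter read-through and low-quality 3' ends from both reads of a pair.
# -q 20 trims bases below Q20; -m 30 discards pairs shorter than 30 bp
# after trimming.
cutadapt -q 20 -m 30 \
    -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
    -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
    sample_R1.fastq.gz sample_R2.fastq.gz
```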
Our quality control process also includes assessing per-base coverage, mean coverage, and on-target percentages, among others. These are discussed in the Useful Metrics section.
Samples that pass QC are then ready to be mapped and analyzed. Researchers are consulted on which genome version they would like to map to; the default is the latest available version. On special request, we can use an older genome build or even a custom reference.
Align reads
Reads are mapped to the reference genome using the Burrows-Wheeler Alignment Tool (BWA), and the output is produced in BAM format. The mapping workflow mirrors the one employed by the 1000 Genomes Project. Using other software is not out of the question if the experiment requires it. Thanks to the high-quality sequencing at the McDermott Center, we achieve very high mapping rates, using almost all of the data produced.
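A minimal sketch of this step, using the modern bwa mem command piped into a samtools 1.x-style sort (the 1000 Genomes-era workflow used bwa aln/sampe instead); the reference name, sample name, and thread count are placeholders:

```bash
# Align paired-end reads, attach a read group for downstream GATK steps,
# and write a coordinate-sorted BAM.
bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    hg19.fa sample1_R1.fastq.gz sample1_R2.fastq.gz \
    | samtools sort -o sample1.sorted.bam -
samtools index sample1.sorted.bam
```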
Remove duplicates
Duplicates are removed from the mapped data using Picard, and reads with a mapping quality below a Phred score of 10 are filtered out. This leaves an alignment file with analysis-ready data.
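A sketch of the filtering and duplicate-removal step under those assumptions (file names are placeholders):

```bash
# Keep only reads with mapping quality >= 10 (-q 10 drops MAPQ < 10).
samtools view -b -q 10 sample1.sorted.bam > sample1.q10.bam

# Remove PCR/optical duplicates with Picard MarkDuplicates; the metrics file
# reports the duplication rate.
java -jar picard.jar MarkDuplicates \
    INPUT=sample1.q10.bam \
    OUTPUT=sample1.dedup.bam \
    METRICS_FILE=sample1.dup_metrics.txt \
    REMOVE_DUPLICATES=true
samtools index sample1.dedup.bam
```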
Once the data has been mapped and filtered, we follow the Best Practices workflow created by the Broad Institute for its GATK software. This covers everything from data pre-processing through variant discovery and preliminary analyses for cohorts of samples. To have enough power for exome sequencing analyses, we require at least 30 samples for joint variant calling; if an experiment has fewer than 30 samples, we add a random mix of samples from the 1000 Genomes Project to increase power during variant calling and variant recalibration.
GATK
The GATK toolkit is the industry standard for SNP and indel analyses and can be applied to a wide range of datasets and genomes. It comes with a host of tools for alignment refinement, coverage analysis, diagnostics, and, centrally, variant calling. GATK variant calling can be run in two modes, UnifiedGenotyper or HaplotypeCaller; by default, we run both modes and provide the union of the two results. The variants found are then refined using GATK's variant recalibration tools. Please visit the GATK page for more in-depth information about the software. Both the raw SNP/indel calls and the recalibrated files are provided.
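A minimal sketch of running both callers, using GATK 3-era command-line syntax; the reference and file names are placeholders, and the recalibration steps are omitted for brevity:

```bash
# Call variants with HaplotypeCaller.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R hg19.fa -I sample1.dedup.bam -o sample1.hc.raw.vcf

# Call variants with UnifiedGenotyper; -glm BOTH calls SNPs and indels together.
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -R hg19.fa -I sample1.dedup.bam -glm BOTH -o sample1.ug.raw.vcf
```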
Variants detected by GATK are then annotated using snpEff. Annotations are incorporated into the VCF file and also presented as user-friendly tab-delimited tables, one per sample plus a joint table for the whole experiment. Annotations include:
COLUMN IDENTIFIER | DESCRIPTION |
---|---|
CHROM | The chromosome of the variant |
POS | The genomic location of the variant |
REF | Reference base |
ALT | Alternate base |
VariantType | Type of variant (SNP or INDEL) |
ID | rsID of dbSNP, if present |
dbSNPBuildID | The version of dbSNP the variant was first found in |
SNPEFF_GENE_NAME | Gene name for the highest-impact effect resulting from the current variant |
Set | Caller that detected the variant (UNI: called by UnifiedGenotyper only; HC: called by HaplotypeCaller only; Intersection: called by both callers) |
SampleName.GT | Genotype |
SampleName.AD | Allelic Depth |
SampleName.GQ | Genotype Quality |
SNPEFF_EXON_ID | Exon ID for the highest-impact effect resulting from the current variant |
SNPEFF_FUNCTIONAL_CLASS | Functional class of the highest-impact effect resulting from the current variant (NONE, SILENT, MISSENSE, or NONSENSE) |
SNPEFF_AMINO_ACID_CHANGE | Old/New amino acid for the highest-impact effect resulting from the current variant |
dbNSFP_Uniprot_acc | Uniprot accession number. Multiple entries separated by ";" |
dbNSFP_SIFT_score | SIFT score (SIFTori); the smaller the score, the more damaging. Multiple scores separated by ";" |
dbNSFP_Polyphen2_HDIV_score | Polyphen2 score based on HumDiv, i.e. hdiv_prob. The score ranges from 0 to 1, and the corresponding prediction is "probably damaging" if it is in [0.957,1]; "possibly damaging" if it is in [0.453,0.956]; "benign" if it is in [0,0.452]. Score cutoff for binary classification is 0.5, i.e. the prediction is "neutral" if the score is smaller than 0.5 and "deleterious" if the score is larger than 0.5. Multiple entries separated by ";". |
dbNSFP_GERP_RS | GERP++ RS score, the larger the score, the more conserved the site. |
SNPEFF_IMPACT | Impact of the highest-impact effect resulting from the current variant (HIGH, MODERATE, LOW, or MODIFIER) |
GAF | Global Allele Frequency based on AC/AN (1000 Genomes) |
AFR | African Allele Frequency (1000 Genomes) |
AMR | American Allele Frequency (1000 Genomes) |
ASN | Asian Allele Frequency (1000 Genomes) |
EUR | European Allele Frequency (1000 Genomes) |
AA_AC | African American Allele Count in the order of AltAlleles,RefAllele. For INDELs, A1, A2, or An refers to the N-th alternate allele while R refers to the reference allele. (EVS) |
EA_AC | European American Allele Count in the order of AltAlleles,RefAllele. For INDELs, A1, A2, or An refers to the N-th alternate allele while R refers to the reference allele. (EVS) |
GWASCAT | Trait related to this chromosomal position, according to GWAS catalog |
CLOSEST | Distance to the closest splice site in bp (0 if the variant is exonic) |
CA | Clinical Association (EVS) |
snpEff
snpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes, and allows custom annotations to be added. The Broad Institute uses snpEff and, through collaboration, has made GATK compatible with this highly efficient annotator. Please see the snpEff webpage for additional details.
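A minimal annotation sketch, using the GRCh37.75 snpEff database as an example; the actual database depends on the genome version chosen for mapping, and the input VCF name is a placeholder:

```bash
# Annotate a recalibrated VCF; effect predictions are added to the INFO field
# and the annotated variants are written to stdout.
java -Xmx4g -jar snpEff.jar GRCh37.75 sample1.recal.vcf > sample1.annotated.vcf
```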
It is imperative that any sequencing or analysis done at the McDermott Center meet the highest standards of data quality. This includes not just sequencing quality statistics but also experiment quality statistics, which are equally important. For this purpose, we use a set of tools from the Picard toolkit as well as diagnostic tools from GATK.
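Illustrative Picard commands for two common classes of metrics (file and interval-list names are placeholders; in newer Picard releases, CalculateHsMetrics has been renamed CollectHsMetrics):

```bash
# General alignment statistics: reads aligned, error rates, etc.
java -jar picard.jar CollectAlignmentSummaryMetrics \
    R=hg19.fa I=sample1.dedup.bam O=sample1.alignment_metrics.txt

# Hybrid-selection metrics for capture experiments: on-target percentage,
# fold enrichment, mean target coverage, and more.
java -jar picard.jar CalculateHsMetrics \
    BAIT_INTERVALS=baits.interval_list \
    TARGET_INTERVALS=targets.interval_list \
    I=sample1.dedup.bam O=sample1.hs_metrics.txt
```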
Coverage
Good coverage is very important for confidence in the variants being called. We check each sample individually to make sure it meets our minimum thresholds of at least 50x mean coverage and 95–98% of bases covered at 20x or higher. A per-target coverage table is provided, along with a plot of the percentage of bases achieving each level of coverage.
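A sketch of coverage assessment with GATK 3's DepthOfCoverage walker, assuming a BED file of capture targets is available (file names are placeholders; the -ct thresholds correspond to the 20x and 50x criteria above):

```bash
# Per-base and per-target coverage over the capture regions; each -ct value
# adds a summary column reporting the fraction of bases at or above it.
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage \
    -R hg19.fa -I sample1.dedup.bam \
    -L SureSelect_V4_targets.bed \
    -ct 20 -ct 50 -o sample1.coverage
```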