Exon capture, or whole exome sequencing, is an efficient approach to sequencing the coding regions of the human genome. Many researchers are interested only in the protein-coding regions, which make up roughly 1-2 percent of the genome, and targeting these regions during sequencing yields greater depth at a lower cost than whole genome sequencing, which aims to capture the entire genome. In general, knowing the complete DNA sequence of an individual's genome does not, on its own, provide useful clinical information, but this may change over time as a growing number of scientific studies detail clear associations between specific genetic variants and disease. Whole genome sequencing is also preferable when questions cannot be answered by looking only at the coding regions. Both approaches aim to detect many different types of mutations, including germline and somatic variants, insertions, deletions, and copy number variations, to name a few.
Most biologically significant events arise from variation at the DNA sequence level. These sequence variants can be correlated with phenotypes, disease states, experimental conditions, and more, and identifying them is the most fundamental task of modern genomic studies. Such changes can now be identified with confidence at a reasonable price using high-throughput sequencing. At the McDermott Sequencing Core, high-throughput genomic sequencing is performed on Illumina machines and is available in two flavors: Whole Genome Sequencing and Exome Sequencing. The choice between the two depends on the desired results.
Whole genome sequencing is done mostly when variation is expected across the whole genome: structural variations, copy number variations, variants in non-coding regions, and other regions not targeted by exome capture. It is also used when the organism to be sequenced has a relatively small genome or lacks an available exome capture kit. Whole genome sequencing can be expensive and takes longer than exome sequencing. Often, researchers are interested only in variants in exonic regions, and these can be sequenced at high coverage using exome sequencing, which involves targeted capture and sequencing of an organism's exome using one of several commercial kits.
Our team includes geneticists, statisticians, and computational biologists. We provide cutting-edge sequencing, data analysis, and support to numerous researchers at UT Southwestern and beyond. The following is the basic workflow we employ for the analysis of such data.
Please contact us if you would like more details about the workflow, including specific software parameters, genome versions used, etc. "Exome sequencing: the sweet spot before whole genomes" (Teer and Mullikin, 2010) is an accessible introduction to exome sequencing.
Files provided with Exon Capture/Whole Genome analysis
- Raw unprocessed gzipped FASTQ files
- FASTQC report with basic sequencing quality statistics
- Mapped BAM files (sorted -> low-qual filtered -> duplicates removed -> realigned -> recalibrated)
- Mapping statistics
- Picard metrics
- Coverage statistics (plots, target region coverage, mean coverage, etc.)
- Annotated variant calls in VCF and tab-delimited formats
Data is generated by the sequencing core using paired-end (PE) 100x100 sequencing, which is important for obtaining excellent mapping statistics. We also aim to provide at least 50x mean coverage, with 95–98% of targeted bases covered at 20x or higher, for each whole exome sample sequenced. This level of coverage is necessary for a statistically sound analysis when substantive variants are found.
Kits
The choice of kit pertains only to targeted sequencing, i.e. exome capture. Numerous kits are available, each with its own protocol and capture size. The kit we use predominantly is the Agilent SureSelect Human All Exon V4, which has a 51 Mb capture region defined by the CCDS, RefSeq, and GENCODE databases.
Demultiplexing
Multiple samples can be pooled in a lane using sample-specific barcode/index sequences. Illumina sequencing machines include a dedicated read cycle for these barcode sequences, which are usually 6 bp long. The data can then be accurately separated into its respective samples via demultiplexing. We use CASAVA to demultiplex a typical Illumina sequencing run. During demultiplexing, the adapters can be masked or left intact if the researcher chooses. Two FASTQ files are produced for each sample, one for each read of the pair.
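As a rough sketch of the CASAVA 1.8-era demultiplexing step (the run folder path, sample sheet name, and 1-mismatch barcode tolerance below are illustrative placeholders):

```bash
# Convert BCL basecalls into demultiplexed, gzipped FASTQ files (CASAVA 1.8).
configureBclToFastq.pl \
    --input-dir /path/to/RunFolder/Data/Intensities/BaseCalls \
    --output-dir Unaligned \
    --sample-sheet SampleSheet.csv \
    --mismatches 1

# configureBclToFastq.pl writes a Makefile; make performs the conversion.
cd Unaligned && make -j 8
```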
Sequence Quality Assessment
At the Eugene McDermott Center for Human Growth and Development, we pay close attention to the quality of data generated by the sequencers. The first check is to make sure there is enough sequencing for the analysis; samples that do not meet the required read-count threshold are resequenced. Other parameters include % passing filter, mean quality score, and % of >=Q30 bases, to name a few. A detailed summary of the demultiplexing statistics used for this initial quality assessment is available online.
Once samples pass the initial sequencing quality metrics generated by the sequencers, they are assessed with FastQC, which checks per-base sequence quality, GC content, and N content, among other metrics. If the data looks sub-par, the libraries are immediately re-prepped and resequenced. Data contamination is checked using FastQ Screen. Typical invocations of both tools are sketched below.
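Illustrative commands, assuming reports are written to a qc_reports directory and that a fastq_screen.conf file listing the screening genomes already exists:

```bash
# Per-sample read quality report (per-base quality, GC content, N content, ...).
mkdir -p qc_reports
fastqc --outdir qc_reports sample_R1.fastq.gz sample_R2.fastq.gz

# Screen reads against a panel of reference genomes to detect contamination;
# the genomes and aligner are configured in fastq_screen.conf.
fastq_screen --conf fastq_screen.conf sample_R1.fastq.gz
```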
Data trimming is performed if needed, using one of several standard trimming tools; an illustrative command is sketched below.
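The workflow does not prescribe a single trimming tool, so the following is only a hypothetical example using cutadapt on a paired-end sample; the adapter sequence (the common Illumina adapter prefix) and the quality/length thresholds are illustrative assumptions:

```bash
# Trim adapter read-through and low-quality 3' ends from both reads of a pair.
# -q 20 trims bases below Q20; -m 30 discards pairs shorter than 30 bp
# after trimming.
cutadapt -q 20 -m 30 \
    -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
    -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
    sample_R1.fastq.gz sample_R2.fastq.gz
```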
Our quality control process also includes assessing per-base coverage, mean coverage, and on-target percentages, among others. These are discussed in the Useful Metrics section.
Samples that pass QC are then ready to be mapped and analyzed. Researchers are consulted on which genome version they would like to map to; the default is the latest available version. On special request, we can use an older genome build or even a custom reference.
Align reads
Reads are mapped to the reference genome using the Burrows-Wheeler Alignment Tool (BWA), and the output is produced in BAM format. The mapping workflow mirrors the one employed by the 1000 Genomes Project. Using other software is not out of the question if the experiment requires it. Thanks to the high-quality sequencing at the McDermott Center, we achieve very high mapping rates, using almost all of the data produced.
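A minimal sketch of this step, using the modern bwa mem command piped into a samtools 1.x-style sort (the 1000 Genomes-era workflow used bwa aln/sampe instead); the reference name, sample name, and thread count are placeholders:

```bash
# Align paired-end reads, attach a read group for downstream GATK steps,
# and write a coordinate-sorted BAM.
bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    hg19.fa sample1_R1.fastq.gz sample1_R2.fastq.gz \
    | samtools sort -o sample1.sorted.bam -
samtools index sample1.sorted.bam
```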
Remove duplicates
Duplicates are removed from the mapped data using Picard, and reads with a mapping quality below a Phred score of 10 are filtered out. This leaves an alignment file with analysis-ready data.
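A sketch of the filtering and duplicate-removal step under those assumptions (file names are placeholders):

```bash
# Keep only reads with mapping quality >= 10 (-q 10 drops MAPQ < 10).
samtools view -b -q 10 sample1.sorted.bam > sample1.q10.bam

# Remove PCR/optical duplicates with Picard MarkDuplicates; the metrics file
# reports the duplication rate.
java -jar picard.jar MarkDuplicates \
    INPUT=sample1.q10.bam \
    OUTPUT=sample1.dedup.bam \
    METRICS_FILE=sample1.dup_metrics.txt \
    REMOVE_DUPLICATES=true
samtools index sample1.dedup.bam
```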
Once the data has been mapped and filtered, we follow the Best Practices workflow created by the Broad Institute for its GATK software. This covers everything from data pre-processing through variant discovery and preliminary analyses for cohorts of samples. To have enough power for exome sequencing analyses, we require at least 30 samples for joint variant calling; if an experiment has fewer than 30 samples, we add a random mix of samples from the 1000 Genomes Project to increase power during variant calling and variant recalibration.
GATK
The GATK toolkit is the industry standard for SNP and indel analyses and can be applied to a wide range of datasets and genomes. It comes with a host of tools for alignment refinement, coverage analysis, diagnostics, and, centrally, variant calling. GATK variant calling can be run in two modes, UnifiedGenotyper or HaplotypeCaller; by default, we run both modes and provide the union of the two results. The variants found are then refined using GATK's variant recalibration tools. Please visit the GATK page for more in-depth information about the software. Both the raw SNP/indel calls and the recalibrated files are provided.
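A minimal sketch of running both callers, using GATK 3-era command-line syntax; the reference and file names are placeholders, and the recalibration steps are omitted for brevity:

```bash
# Call variants with HaplotypeCaller.
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R hg19.fa -I sample1.dedup.bam -o sample1.hc.raw.vcf

# Call variants with UnifiedGenotyper; -glm BOTH calls SNPs and indels together.
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -R hg19.fa -I sample1.dedup.bam -glm BOTH -o sample1.ug.raw.vcf
```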
Variants detected by GATK are then annotated using snpEff. Annotations are incorporated into the VCF file and also presented as user-friendly tab-delimited tables, one per sample plus a joint table for the whole experiment. Annotations include:
COLUMN IDENTIFIER | DESCRIPTION |
---|---|
CHROM | The chromosome of the variant |
POS | The genomic location of the variant |
REF | Reference base |
ALT | Alternate base |
VariantType | Type of variant (SNP or INDEL) |
ID | rsID of dbSNP, if present |
dbSNPBuildID | The version of dbSNP the variant was first found in |
SNPEFF_GENE_NAME | Gene name for the highest-impact effect resulting from the current variant |
Set | Caller that detected the variant (UNI: called by UnifiedGenotyper only; HC: called by HaplotypeCaller only; Intersection: called by both callers) |
SampleName.GT | Genotype |
SampleName.AD | Allelic Depth |
SampleName.GQ | Genotype Quality |
SNPEFF_EXON_ID | Exon ID for the highest-impact effect resulting from the current variant |
SNPEFF_FUNCTIONAL_CLASS | Functional class of the highest-impact effect resulting from the current variant (NONE, SILENT, MISSENSE, or NONSENSE) |
SNPEFF_AMINO_ACID_CHANGE | Old/New amino acid for the highest-impact effect resulting from the current variant |
dbNSFP_Uniprot_acc | Uniprot accession number. Multiple entries separated by ";" |
dbNSFP_SIFT_score | SIFT score (SIFTori); the smaller the score, the more damaging. Multiple scores separated by ";" |
dbNSFP_Polyphen2_HDIV_score | Polyphen2 score based on HumDiv, i.e. hdiv_prob. The score ranges from 0 to 1, and the corresponding prediction is "probably damaging" if it is in [0.957,1]; "possibly damaging" if it is in [0.453,0.956]; "benign" if it is in [0,0.452]. Score cutoff for binary classification is 0.5, i.e. the prediction is "neutral" if the score is smaller than 0.5 and "deleterious" if the score is larger than 0.5. Multiple entries separated by ";". |
dbNSFP_GERP_RS | GERP++ RS score, the larger the score, the more conserved the site. |
SNPEFF_IMPACT | Impact of the highest-impact effect resulting from the current variant (HIGH, MODERATE, LOW, or MODIFIER) |
GAF | Global Allele Frequency based on AC/AN (1000 Genomes) |
AFR | African Allele Frequency (1000 Genomes) |
AMR | American Allele Frequency (1000 Genomes) |
ASN | Asian Allele Frequency (1000 Genomes) |
EUR | European Allele Frequency (1000 Genomes) |
AA_AC | African American Allele Count in the order of AltAlleles,RefAllele. For INDELs, A1, A2, or An refers to the N-th alternate allele while R refers to the reference allele. (EVS) |
EA_AC | European American Allele Count in the order of AltAlleles,RefAllele. For INDELs, A1, A2, or An refers to the N-th alternate allele while R refers to the reference allele. (EVS) |
GWASCAT | Trait related to this chromosomal position, according to GWAS catalog |
CLOSEST | Distance to the closest splice site in bp (0 if the variant is exonic) |
CA | Clinical Association (EVS) |
snpEff
snpEff is a genetic variant annotation and effect prediction toolbox. It annotates and predicts the effects of variants on genes, and allows custom annotations to be added. The Broad Institute uses snpEff and, through collaboration, has made GATK compatible with this highly efficient annotator. Please see the snpEff webpage for additional details.
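A minimal annotation sketch, using the GRCh37.75 snpEff database as an example; the actual database depends on the genome version chosen for mapping, and the input VCF name is a placeholder:

```bash
# Annotate a recalibrated VCF; effect predictions are added to the INFO field
# and the annotated variants are written to stdout.
java -Xmx4g -jar snpEff.jar GRCh37.75 sample1.recal.vcf > sample1.annotated.vcf
```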
It is imperative that any sequencing or analysis done at the McDermott Center meet the highest standards of data quality. This includes not just sequencing quality statistics but also experiment quality statistics, which are equally important. For this purpose, we use a set of tools from the Picard toolkit as well as diagnostic tools from GATK.
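Illustrative Picard commands for two common classes of metrics (file and interval-list names are placeholders; in newer Picard releases, CalculateHsMetrics has been renamed CollectHsMetrics):

```bash
# General alignment statistics: reads aligned, error rates, etc.
java -jar picard.jar CollectAlignmentSummaryMetrics \
    R=hg19.fa I=sample1.dedup.bam O=sample1.alignment_metrics.txt

# Hybrid-selection metrics for capture experiments: on-target percentage,
# fold enrichment, mean target coverage, and more.
java -jar picard.jar CalculateHsMetrics \
    BAIT_INTERVALS=baits.interval_list \
    TARGET_INTERVALS=targets.interval_list \
    I=sample1.dedup.bam O=sample1.hs_metrics.txt
```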
Coverage
Good coverage is very important for confidence in the variants being called. We check each sample individually to make sure it meets our minimum thresholds of at least 50x mean coverage and 95–98% of bases covered at 20x or higher. A per-target coverage table is provided, along with a plot of the percentage of bases achieving each level of coverage.
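A sketch of coverage assessment with GATK 3's DepthOfCoverage walker, assuming a BED file of capture targets is available (file names are placeholders; the -ct thresholds correspond to the 20x and 50x criteria above):

```bash
# Per-base and per-target coverage over the capture regions; each -ct value
# adds a summary column reporting the fraction of bases at or above it.
java -jar GenomeAnalysisTK.jar -T DepthOfCoverage \
    -R hg19.fa -I sample1.dedup.bam \
    -L SureSelect_V4_targets.bed \
    -ct 20 -ct 50 -o sample1.coverage
```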