Chromatin immunoprecipitation followed by sequencing (ChIP-seq) allows in vivo determination of where proteins such as transcription factors, DNA-binding enzymes, histones, chaperones, or nucleosomes bind in the genome. ChIP-seq first cross-links bound proteins to chromatin, fragments the chromatin, captures the DNA fragments bound to one protein using an antibody specific to it, and sequences the ends of the captured fragments using next-generation sequencing (NGS).
Computational mapping of the sequenced DNA identifies the genomic locations of bound DNA-binding enzymes, modified histones, chaperones, nucleosomes, and transcription factors (TFs), thereby illuminating the role of these protein-DNA interactions in gene expression and other cellular processes. The use of NGS provides relatively high resolution, low noise, and high genomic coverage compared with ChIP-chip assays (ChIP followed by microarray hybridization).
ChIP-seq is now the most widely used procedure for genome-wide assays of protein-DNA interaction, and its use in mapping histone modifications has been seminal in epigenetics research.
The Eugene McDermott Center for Human Growth and Development has been very active in this field of genetic research, providing cutting-edge sequencing, data analysis, and support to a multitude of researchers at UT Southwestern and beyond. The following is a basic workflow that we employ for the analysis of such data.
Please contact us if you would like more details about the workflow, including specific software parameters, genome versions used, etc. This publication is also a great place to start for those who want an accessible introduction to ChIP-Seq analysis. Another great resource is the guidelines written by the ENCODE consortium.
Files Provided with ChIP-Seq Analysis
The following is a list of files provided with your ChIP-Seq analysis:
Raw unprocessed gzipped FASTQ files
FASTQC report with basic sequencing quality statistics
For human and mouse ChIP-Seq experiments, we highly recommend at least 10 million analysis-ready reads per sample (i.e. after low quality and duplicate removal). The read number may be less for smaller genomes.
Sequencing is usually performed on pools of samples, so a lane can contain a mixture of libraries that must be demultiplexed into their corresponding samples. We use Illumina's CASAVA software for most experiments, unless customized indexes and adapters are present. The default output is a set of gzipped files in FASTQ format, which are readily compatible with most current NGS data analysis software.
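For readers who want to inspect the demultiplexed files directly, the following is a minimal sketch of reading a gzipped FASTQ file and counting its reads. It assumes the standard four-line FASTQ record layout; the function names `read_fastq` and `count_reads` are illustrative and not part of our pipeline.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) for each 4-line FASTQ record."""
    # Transparently handle both gzipped and plain-text files.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading '@'


def count_reads(path: str) -> int:
    """Count the records in a FASTQ file."""
    return sum(1 for _ in read_fastq(path))
```

Calling `count_reads` on each per-sample file gives a quick check on raw sequencing depth before any filtering is applied.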
At the McDermott Center, we pay close attention to the quality of the data generated by the sequencers. The first check is to make sure we have enough sequencing for the analysis; samples that do not meet the required read-count threshold are resequenced. Other parameters include % Passing Filter, Mean Quality Score, and % of >=Q30 Bases, to name a few. A detailed summary of the demultiplexing statistics used for the initial quality assessment is available online.
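To make two of these metrics concrete, here is a small sketch that computes a mean quality score and the percentage of bases at or above Q30 from FASTQ quality strings. It assumes Phred+33 encoding (the `PHRED_OFFSET` constant), and `quality_summary` is a hypothetical helper; the sequencer's own reports may compute these values somewhat differently.

```python
from typing import Iterable, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def quality_summary(quality_strings: Iterable[str]) -> Tuple[float, float]:
    """Return (mean quality score, percent of bases with Q >= 30)."""
    total_bases = 0
    total_score = 0
    q30_bases = 0
    for qual in quality_strings:
        for char in qual:
            score = ord(char) - PHRED_OFFSET  # decode ASCII to Phred score
            total_bases += 1
            total_score += score
            if score >= 30:
                q30_bases += 1
    if total_bases == 0:
        return 0.0, 0.0  # no data: avoid division by zero
    return total_score / total_bases, 100.0 * q30_bases / total_bases
```

For example, the quality string `II55` decodes to scores 40, 40, 20, 20, giving a mean of 30 with half of the bases at or above Q30.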
Data is generated from the sequencing core using either Paired End (PE) or Single End (SE) protocols. Please view this excellent paper for a thorough comparison between the two protocols.
Once samples pass the initial sequencing quality metrics generated by the sequencers, they are assessed by FASTQC, which checks per-base sequence quality, GC content, and N content, among others. If the data look sub-par, the samples will be immediately re-prepped and resequenced. Data trimming is done if needed using any one of:
We also use tools such as phantompeakqualtools and HOMER to get highly informative enrichment and quality measures for ChIP-seq/DNase-seq/FAIRE-seq/MNase-seq data. These tools can also be used to obtain robust estimates of the predominant fragment length or characteristic tag shift values in these assays.
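The idea behind the tag shift estimate can be sketched as follows: reads on the forward and reverse strands pile up on either side of a binding site, so the offset at which the two strands overlap best approximates the fragment length. The code below is a deliberately simplified, unnormalized version of a strand cross-correlation. It is not the calculation performed by phantompeakqualtools or HOMER, and the function names and the `max_shift` default are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def strand_shift_profile(
    forward_starts: Iterable[int],
    reverse_starts: Iterable[int],
    max_shift: int,
) -> Dict[int, int]:
    """For each shift, count forward/reverse 5' end pairs separated by it."""
    fwd = Counter(forward_starts)
    rev = Counter(reverse_starts)
    profile = {}
    for shift in range(max_shift + 1):
        # Number of (forward, reverse) tag pairs exactly `shift` bases apart.
        profile[shift] = sum(
            n * rev.get(pos + shift, 0) for pos, n in fwd.items()
        )
    return profile


def estimate_shift(
    forward_starts: Iterable[int],
    reverse_starts: Iterable[int],
    max_shift: int = 500,
) -> int:
    """Return the shift with the most strand overlap (smallest wins ties)."""
    profile = strand_shift_profile(forward_starts, reverse_starts, max_shift)
    return max(profile, key=profile.get)
```

With forward tags at 100, 200, and 300 and reverse tags at 250, 350, and 450, every forward tag has a reverse partner 150 bases downstream, so the estimated shift is 150.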
Samples that pass QC are finally ready to be mapped and analyzed. The researchers will be consulted on what genome version they would like to map to, although the default would be to use the latest version available. On special request, we can try to use an older genome or even a custom one.
Reads are mapped using software that is pertinent to the type of experiment being done. Usually, we use Bowtie2 for single-end sequencing and BWA for paired-end sequencing, although using other software is not out of the question.
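The choice between the two aligners can be sketched as a small helper that builds the command line for a sample. The file names, index path, thread count, and the function `build_alignment_command` are placeholders for illustration only; the actual parameters we use vary by experiment.

```python
from typing import List, Optional


def build_alignment_command(
    index_prefix: str,
    reads_1: str,
    reads_2: Optional[str] = None,
    output_sam: str = "aligned.sam",
    threads: int = 4,
) -> List[str]:
    """Build an aligner command: Bowtie2 for single-end, BWA-MEM for paired-end."""
    if reads_2 is None:
        # Single-end: bowtie2 takes the index with -x, unpaired reads with -U,
        # and writes SAM output to the file given with -S.
        return [
            "bowtie2", "-p", str(threads),
            "-x", index_prefix,
            "-U", reads_1,
            "-S", output_sam,
        ]
    # Paired-end: bwa mem takes the index prefix followed by both read files
    # and writes SAM to standard output, so the caller must redirect it.
    return ["bwa", "mem", "-t", str(threads), index_prefix, reads_1, reads_2]
```

The returned list can be passed to `subprocess.run`; for the BWA case, standard output would need to be redirected to a file, since `bwa mem` does not take an output-file argument in this form.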
Duplicates are removed from the mapped data using Picard, and reads with a quality score below 10 are also filtered out. This leaves us with an alignment file containing analysis-ready data.
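The effect of these two filters can be illustrated with a toy example. The sketch below assumes that "quality" refers to mapping quality (MAPQ) and treats reads sharing a chromosome, 5' position, and strand as duplicates, keeping the one with the highest MAPQ. Picard's actual duplicate-marking logic is more involved; the `Alignment` record and `analysis_ready` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alignment:
    name: str
    chrom: str
    start: int   # 5' mapping position
    strand: str  # "+" or "-"
    mapq: int    # mapping quality


def analysis_ready(
    alignments: Iterable[Alignment], min_mapq: int = 10
) -> List[Alignment]:
    """Drop low-MAPQ reads, then keep one read per (chrom, start, strand)."""
    best: Dict[Tuple[str, int, str], Alignment] = {}
    for aln in alignments:
        if aln.mapq < min_mapq:
            continue  # filtered out: mapping quality below threshold
        key = (aln.chrom, aln.start, aln.strand)
        kept = best.get(key)
        # Among duplicates, keep the read with the highest mapping quality.
        if kept is None or aln.mapq > kept.mapq:
            best[key] = aln
    return sorted(best.values(), key=lambda a: (a.chrom, a.start, a.strand))
```

Counting the reads that survive this step gives the "analysis-ready" read number referred to in our depth recommendation.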
The analysis-ready alignment files are then used to identify transcription factor binding sites, histone modifications, enriched motifs, and other information typical of a ChIP-Seq experiment. We recommend that researchers include at least one control sample for their treatments. Peak calls made against a control are statistically more reliable than those made without one.
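To illustrate why a control helps, the sketch below scores a single genomic window by comparing its ChIP read count with the count expected from the control, scaled for sequencing depth, under a Poisson model. This is only loosely in the spirit of Poisson-based peak callers and is not the exact model used by HOMER or MACS2; the function names and the `pseudocount` parameter are assumptions.

```python
import math
from typing import Tuple


def poisson_sf(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_enrichment(
    chip_count: int,
    control_count: int,
    chip_total: int,
    control_total: int,
    pseudocount: float = 1.0,
) -> Tuple[float, float]:
    """Return (fold enrichment, Poisson p-value) for one window."""
    # Scale the control to the ChIP library's sequencing depth.
    scale = chip_total / control_total
    # Use a floor so windows with no control reads still have lambda > 0.
    expected = max(control_count, pseudocount) * scale
    return chip_count / expected, poisson_sf(chip_count, expected)
```

A window with 20 ChIP reads against 5 control reads at equal depth is strongly enriched, while a window with 5 reads in both is not. Without a control, the expected count would have to be guessed from the ChIP sample itself.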
Many software packages are available for ChIP-Seq analysis, each with its own merits and drawbacks (Excel spreadsheet). We use the currently popular tools HOMER and MACS2 in our pipeline. Some experiments produce clearly defined peaks of 100–200 base pairs, as typified by transcription factors such as ERα; others produce wider smears of a few to several hundred kilobases, such as H3K27me3; and still others produce a mix of clearly defined peaks and wider smears, such as RNA polymerase II. Most algorithms have been developed for the analysis of clearly defined peaks, as these present the opportunity to determine transcription factor binding at nucleotide resolution and to perform motif analysis. If you have any specific requests, please don't hesitate to contact us.
Called regions are annotated with HOMER by default; regions called by other tools are annotated with snpEff. By default, regions lying within 30 kb of the peak regions are annotated. These defaults do not suit every experiment, and the parameters can be adjusted as needed.
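The distance-based part of annotation can be sketched as a nearest-feature lookup. The example below assigns each peak to the closest gene transcription start site (TSS) within 30 kb of the peak center. Measuring from the peak center to a TSS is an assumption made for illustration, as are the `Gene` and `Peak` records and the `annotate_peak` function; HOMER and snpEff apply their own, richer rules.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    tss: int  # transcription start site


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int


def annotate_peak(
    peak: Peak, genes: Iterable[Gene], max_distance: int = 30_000
) -> Optional[Tuple[str, int]]:
    """Return (gene name, distance) for the nearest TSS, or None if too far."""
    center = (peak.start + peak.end) // 2
    best: Optional[Tuple[str, int]] = None
    for gene in genes:
        if gene.chrom != peak.chrom:
            continue  # only consider genes on the same chromosome
        distance = abs(gene.tss - center)
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (gene.name, distance)
    return best
```

Changing `max_distance` corresponds to adjusting the 30 kb default for experiments where regulatory elements are expected closer to, or farther from, their target genes.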