Welcome to the Bioinformatics Lab Skill Handbook — a comprehensive tutorial series designed to equip researchers with practical, hands-on skills for modern bioinformatics analysis.
These tutorials cover the entire research workflow — from environment setup and data management to pipeline development, quality control, visualization, and collaboration. Each guide emphasizes reproducibility, efficiency, and lab best practices, helping you build a strong foundation for both independent research and collaborative projects.
Using Mamba and Bioconda for Bioinformatic Research
This guide provides step-by-step instructions for setting up and using Mamba and Bioconda to manage bioinformatics software and environments efficiently.
Prerequisites
- Basic command-line knowledge (Linux/macOS/Windows WSL).
- Miniconda or Anaconda installed (download Miniconda if not already available).
Step 1: Install Mamba
Mamba is a faster alternative to Conda. Install it into the base environment:
- conda install mamba -n base -c conda-forge
Verify the installation:
- mamba --version
Step 2: Configure Bioconda
Bioconda provides thousands of bioinformatics tools. Configure Conda channels in the recommended order:
- Add channels:
- conda config --add channels defaults
- conda config --add channels bioconda
- conda config --add channels conda-forge
- Set strict channel priority:
- conda config --set channel_priority strict
Verify channel configuration:
- conda config --show channels
Step 3: Create a Bioinformatics Environment
Create and activate a new environment with commonly used tools (e.g., FastQC):
- mamba create -n bioenv fastqc
- conda activate bioenv
- fastqc --version
Step 4: Install Additional Tools
Install additional tools as needed (e.g., BWA, Samtools, BCFtools):
- mamba install bwa samtools bcftools
Verify installations:
- bwa                     # Prints usage and version information (bwa has no --version flag)
- samtools --version
- bcftools --version
Step 5: Search for Bioconda Packages
Find and install tools available on Bioconda:
- mamba search bowtie2
- mamba install bowtie2
Step 6: Update and Manage Environments
- Update all packages:
- mamba update --all
- Remove a package:
- mamba remove fastqc
- Deactivate the environment:
- conda deactivate
- Remove an environment:
- conda env remove -n bioenv
Step 7: Export and Share Environments
Export an environment to a YAML file for sharing or backup:
- conda activate bioenv
- conda env export > bioenv.yml
Recreate the environment elsewhere:
- conda env create -f bioenv.yml
Step 8: Use Mamba in Scripts
Example of incorporating Mamba into a workflow script:
- #!/bin/bash
- # Create environment
- mamba create -n workflow_env -y fastqc bwa samtools
- # Make `conda activate` available in a non-interactive script
- source "$(conda info --base)/etc/profile.d/conda.sh"
- # Activate environment
- conda activate workflow_env
- # Run analysis
- fastqc input.fastq
- bwa index reference.fasta
- bwa mem reference.fasta input.fastq > output.sam
- samtools view -b output.sam > output.bam
- # Deactivate environment
- conda deactivate
Best Practices
- Use Mamba for faster installations, especially on HPC systems.
- Keep environments modular — one per project or workflow stage.
- Regularly update packages to maintain performance and compatibility.
- Share YAML files for consistent environments across collaborators.
HPC: Build Environment
Building and managing environments on high-performance computing (HPC) clusters requires different practices than local machines. HPC systems often have limited internet access, use module systems, and enforce storage quotas, so environments must be isolated and reproducible.
Prerequisites
- Basic knowledge of Conda/Mamba (see previous tutorial).
- Access to an HPC cluster with a scheduler (e.g., SLURM) and module system.
- Familiarity with the cluster’s storage layout (home, scratch, project directories).
Step 1: Understand HPC Constraints
- HPC nodes may lack direct internet access; use login nodes or offline package downloads.
- System modules (e.g., pre-installed Python) can conflict with Conda/Mamba — load minimal modules.
- Home directories may have small quotas; store environments in project or scratch directories.
Step 2: Set Environment Storage Path
Store Conda/Mamba environments in a project-specific path for better control and sharing:
- mkdir -p /path/to/project/envs
- conda config --add envs_dirs /path/to/project/envs
Step 3: Create Environment with Mamba
Example: Create an environment for RNA-seq analysis:
- mamba create -n rnaseq fastqc star samtools
- conda activate rnaseq
Step 4: Handle Module Conflicts
- Unload unnecessary modules before activating Conda/Mamba environments:
- module purge
- Load only required compilers or libraries if needed by tools (e.g., `module load gcc/9.3.0`).
Step 5: Use Containers for Reproducibility
Singularity or Docker containers ensure consistent environments across nodes:
- singularity pull docker://biocontainers/fastqc:v0.11.9_cv8
- singularity exec fastqc_v0.11.9_cv8.sif fastqc --version
Step 6: Export and Share Environment
- conda env export > rnaseq_env.yml
- conda env create -f rnaseq_env.yml
Best Practices
- Keep environments in project directories to avoid personal quota limits.
- Document environment paths and YAML files in project READMEs for reproducibility.
- Use Mamba for faster resolution and installation on HPC systems.
- Prefer containers for complex pipelines or when sharing environments across clusters.
Command-Line Mastery for Bioinformatics
Mastering the Linux command line is a core skill for every bioinformatician. Most bioinformatics tools are command-line based, and efficient CLI usage improves productivity, reproducibility, and troubleshooting across all stages of data analysis.
Prerequisites
- Basic understanding of Linux file system structure (/, home, relative vs absolute paths).
- Access to a terminal (local machine or HPC cluster).
- Familiarity with basic navigation commands like `ls`, `cd`, and `pwd`.
Step 1: Navigating the File System
Efficient navigation is essential when working with large project directories.
- pwd # Show current directory
- ls -lh # List files with human-readable sizes
- cd /path/to/project # Change to project directory
- cd .. # Go up one directory
Step 2: Viewing and Inspecting Files
Quickly check file contents without opening them in an editor.
- head file.fastq              # View first 10 lines (uncompressed files; for .gz use zcat below)
- tail file.log # View last 10 lines
- less file.txt # Scroll through a file interactively
- zcat file.fastq.gz | head # View compressed FASTQ without decompression
Step 3: Searching and Filtering
Extract relevant lines or columns from large text files using grep, awk, and sed.
- grep "ATCG" file.fastq # Find lines containing 'ATCG'
- awk '{print $1, $2}' file.tsv # Print first two columns
- sed 's/foo/bar/g' file.txt # Replace 'foo' with 'bar'
Step 4: Sorting and Counting
Summarize data for sanity checks or quick statistics.
- sort file.txt | uniq # Unique sorted values
- wc -l file.txt # Count number of lines
- cut -f1 file.tsv | sort | uniq -c # Count unique values in column 1
Step 5: Combining Commands with Pipes
Chain commands together for efficient one-liners.
- zcat file.fastq.gz | grep "ATCG" | wc -l # Count lines with 'ATCG' in compressed file
- cut -f1 file.tsv | sort | uniq -c | sort -nr # Rank most frequent items
Step 6: Aliases and Shortcuts
Save time by creating command shortcuts in your shell configuration (e.g., `~/.bashrc`).
- alias ll='ls -lh'
- alias ..='cd ..'
- gzhead() { zcat "$1" | head; }   # Shell function (aliases cannot take arguments)
Best Practices
- Prefer one-liners for quick checks; scripts for repeatable workflows.
- Use pipes to minimize intermediate files and save disk space.
- Document frequently used commands in a personal cheat sheet.
- Learn core tools deeply (grep, awk, sed) — they apply to all data types.
Working with HPC Systems (SLURM & Modules)
High-performance computing (HPC) clusters allow bioinformaticians to run large-scale analyses efficiently. Understanding how to submit jobs with SLURM and manage software using environment modules is essential for effective use of shared resources.
Prerequisites
- Access to an HPC cluster with SLURM job scheduler and module system.
- Basic familiarity with the Linux command line (see previous tutorial).
- Knowledge of project storage locations (home, scratch, project directories).
Step 1: Understanding SLURM Basics
SLURM manages job scheduling and resource allocation. Key commands:
- sinfo # Show cluster partition and node availability
- squeue # View running/pending jobs
- scancel JOBID # Cancel a job by ID
- sacct # View job history and resource usage
Step 2: Writing a SLURM Job Script
Example script to run FastQC on an HPC cluster:
- #!/bin/bash
- #SBATCH --job-name=fastqc_job # Job name
- #SBATCH --output=fastqc_%j.log # Log file (with job ID)
- #SBATCH --time=02:00:00 # Wall time limit (2 hours)
- #SBATCH --mem=8G # Memory allocation
- #SBATCH --cpus-per-task=4 # Number of CPU cores
- #SBATCH --partition=general # Partition/queue name
- module load fastqc/0.11.9 # Load FastQC module
- fastqc raw_data/sample.fastq.gz -o analysis/qc/
Step 3: Submitting and Monitoring Jobs
Submit and check jobs:
- sbatch fastqc_job.sh # Submit job
- squeue -u $USER # Monitor jobs for current user
- tail -f fastqc_12345.log # Live view of log file
Step 4: Using Interactive Jobs
Request an interactive session for quick testing or debugging:
- srun --time=01:00:00 --mem=4G --cpus-per-task=2 --pty bash
Step 5: Working with Modules
Modules provide access to pre-installed software without modifying your environment permanently:
- module avail # List available modules
- module load samtools/1.15 # Load a specific version
- module list # Show currently loaded modules
- module purge # Unload all modules
Step 6: Combining Modules and Conda/Mamba
Best practice: load minimal system modules (e.g., compilers) and use Conda/Mamba for tool installation to avoid conflicts.
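A minimal sketch of how this can look at the top of a job script; the module name and environment name are examples and should match your cluster and project:
- #!/bin/bash
- # Start from a clean module environment to avoid library conflicts
- module purge
- # Load only what Conda cannot provide (example compiler module)
- module load gcc/9.3.0
- # Make `conda activate` usable in a non-interactive job, then activate the project environment
- source "$(conda info --base)/etc/profile.d/conda.sh"
- conda activate rnaseq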
Best Practices
- Test commands interactively before submitting large jobs.
- Request only the resources needed to reduce queue times.
- Organize logs and outputs into separate directories for clarity.
- Document job scripts with comments for reproducibility and sharing.
Data Handling and File Management
Effective organization and management of data is critical for reproducibility and collaboration in bioinformatics projects. This tutorial covers folder structures, data transfer methods, compression, and indexing to help maintain clean and scalable workflows.
Prerequisites
- Basic command-line knowledge (navigation and file operations).
- Access to local and/or HPC project storage.
- Familiarity with raw vs processed data (e.g., FASTQ vs BAM).
Step 1: Recommended Project Folder Structure
Organize project files into clearly separated directories:
- project_name/
├── raw_data/ # FASTQ, BAM, or VCF files from sequencing
├── metadata/ # Sample sheets, design files, README
├── analysis/ # Processed outputs (QC, counts, plots)
├── scripts/ # Analysis scripts, notebooks
└── logs/ # Log files from pipelines or scripts
Step 2: Creating Folder Structure
- mkdir -p project_name/{raw_data,metadata,analysis,scripts,logs}
Step 3: Data Transfer Between Local and HPC
Use `scp` for single transfers or `rsync` for large, resumable transfers:
- # Local → HPC
- scp file.fastq.gz user@cluster:/path/to/project/raw_data/
- # HPC → Local
- scp user@cluster:/path/to/file.fastq.gz ./raw_data/
- # Sync entire folder (resumable)
- rsync -avP raw_data/ user@cluster:/path/to/project/raw_data/
Step 4: Compression and Decompression
Sequencing files are typically compressed to save space:
- # Compress FASTQ
- gzip file.fastq
- # Decompress FASTQ
- gunzip file.fastq.gz
Step 5: Indexing Large Files
Index files for faster random access during analysis:
- # Index BAM
- samtools index file.bam
- # Index VCF (must be bgzip-compressed)
- tabix -p vcf file.vcf.gz
Step 6: Documentation
Include a README file in the project root to describe:
- Project description and objectives
- Folder structure and file naming conventions
- Data source and acquisition date
- Checksums or QC summaries for verification
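As a starting point, a README skeleton covering these sections can be generated from the command line (a sketch; adjust the headings and content to your project):
- cat > project_name/README.md << 'EOF'
- # Project: <project_name>
- ## Description and objectives
- ## Folder structure and file naming conventions
- ## Data source and acquisition date
- ## Checksums and QC summaries
- EOF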
Best Practices
- Separate raw and processed data; never overwrite raw files.
- Use descriptive file names (e.g., `Sample01_L1_R1.fastq.gz`).
- Document transfer methods and data provenance in README or metadata files.
- Prefer `rsync` for large datasets due to resumable transfers and speed.
Quick Data Sanity Checks Before Analysis
Before starting any bioinformatics workflow, it is essential to verify that input data is complete, correctly formatted, and consistent with associated metadata. Quick checks help prevent downstream errors and wasted computation time.
Prerequisites
- Basic familiarity with sequencing file formats (FASTQ, BAM, VCF).
- Command-line skills for viewing and counting file contents.
- Access to metadata files (e.g., sample sheets) for cross-checking.
Step 1: Verify File Completeness
Ensure all expected files are present and readable:
- ls raw_data/ # List files in raw data directory
- file sample.fastq.gz # Confirm file type (gzip-compressed FASTQ)
- zcat sample.fastq.gz | head # Inspect first lines of FASTQ
Step 2: Count Reads in FASTQ
Each read in FASTQ spans 4 lines. Divide line count by 4 to get total reads:
- NLINES=$(zcat sample.fastq.gz | wc -l)   # Count lines
- echo $((NLINES / 4))                     # Convert to number of reads
Step 3: Validate Paired-End FASTQ Files
Check that R1 and R2 files have the same number of reads:
- zcat sample_R1.fastq.gz | wc -l
- zcat sample_R2.fastq.gz | wc -l
- # The two counts should be equal
Step 4: Check BAM/VCF File Integrity
Use samtools and bcftools to confirm file validity:
- samtools quickcheck file.bam # Check BAM header and EOF marker
- bcftools view file.vcf.gz | head # Inspect VCF header lines
Step 5: Compare to Metadata
Ensure file names and counts match metadata entries (e.g., sample sheet):
- Check sample IDs: consistent naming between FASTQ files and metadata table.
- Confirm expected number of files (e.g., paired-end vs single-end).
- Verify read depth or total reads roughly align with sequencing plan.
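These comparisons can be scripted. The sketch below assumes a `metadata/samplesheet.csv` whose first column is `SampleID` and FASTQ names following the `Sample01_L1_R1.fastq.gz` convention; adjust paths and patterns to your project:
- # Sample IDs expected from the sample sheet (skip the header line)
- cut -d',' -f1 metadata/samplesheet.csv | tail -n +2 | sort -u > expected_ids.txt
- # Sample IDs actually present in raw_data/ (derived from R1 file names)
- ls raw_data/*_R1.fastq.gz | xargs -n1 basename | sed 's/_L[0-9]*_R1\.fastq\.gz//' | sort -u > found_ids.txt
- # IDs present in only one of the two lists indicate missing or unexpected files
- comm -3 expected_ids.txt found_ids.txt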
Step 6: Use Checksums for File Integrity
Generate and verify checksums to detect file corruption:
- md5sum sample.fastq.gz > sample.fastq.gz.md5
- md5sum -c sample.fastq.gz.md5
Best Practices
- Perform sanity checks immediately after data transfer or download.
- Record basic QC (read counts, file sizes) in a log or metadata sheet.
- Flag discrepancies (e.g., mismatched read counts) before starting pipelines.
- Integrate quick checks into automated workflows for consistency.
File and Data Management Standards
Consistent file naming, folder structure, and permissions are essential for collaboration and reproducibility in bioinformatics projects. This tutorial outlines lab-wide standards for organizing data, managing permissions, and documenting workflows.
Prerequisites
- Access to shared storage (HPC project directory or lab server).
- Basic understanding of file permissions and symbolic links.
- Familiarity with project folder structures (see previous tutorials).
Step 1: Standard Folder Structure
Use a consistent structure across all projects:
- project_name/
├── raw_data/ # Unmodified input data (FASTQ, BAM)
├── metadata/ # Sample sheets, design files, README
├── analysis/ # Processed results (QC, counts, plots)
├── scripts/ # Analysis scripts or notebooks
└── logs/ # Pipeline or job logs
Step 2: File Naming Conventions
Adopt clear, descriptive file names to avoid confusion and ensure traceability:
- Sample identifiers: Use alphanumeric IDs (e.g., `Sample01`).
- Include lane and read info: `Sample01_L1_R1.fastq.gz`
- Avoid spaces or special characters: Use underscores (`_`) or hyphens (`-`).
- Use lowercase where possible: Keeps scripts portable across systems.
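A quick check like the sketch below (assuming the `SampleXX_LX_RX.fastq.gz` pattern above) can flag files that do not follow the convention:
- for f in raw_data/*.fastq.gz; do
- name=$(basename "$f")
- [[ "$name" =~ ^[A-Za-z0-9]+_L[0-9]+_R[12]\.fastq\.gz$ ]] || echo "Non-conforming name: $name"
- done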
Step 3: Managing Permissions
Ensure collaborators can read/write without accidental deletion:
- # Set group ownership
- chgrp -R labgroup project_name/
- # Grant group read/write permissions
- chmod -R g+rwX project_name/
Step 4: Using Symbolic Links
Symbolic links reduce duplication and simplify organization:
- # Create symbolic link
- ln -s /path/to/raw_data/sample1.fastq.gz analysis/sample1.fastq.gz
- Always document link sources in a README for traceability.
Step 5: Documentation with README Files
Each project should include a README describing:
- Project purpose and description
- Folder layout and naming conventions
- Data source and acquisition details
- QC summaries and key commands used
Step 6: Versioning of Data and Scripts
Record versions of raw data, scripts, and results for reproducibility:
- Tag releases (e.g., `v1.0`) when datasets or analyses are finalized.
- Maintain changelogs for major updates (e.g., new reference genome).
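For example, a finalized dataset or analysis can be frozen with a Git tag (a sketch assuming the project is already under Git; file names are illustrative):
- git add metadata/samplesheet.csv analysis/
- git commit -m "Finalize dataset and analysis outputs for v1.0"
- git tag -a v1.0 -m "Frozen dataset and analysis for publication"
- git push origin v1.0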
Best Practices
- Apply consistent naming across all projects and pipelines.
- Separate raw, processed, and temporary data to prevent accidental overwrites.
- Use group permissions for collaborative work; avoid personal-only folders.
- Document everything (folder purpose, file naming) in a README for each project.
Version Control with Git
Git is essential for tracking changes to scripts, documenting analysis history, and collaborating on code. This tutorial introduces basic Git workflows used in bioinformatics labs, including repository setup, branching, and collaboration via GitHub/GitLab.
Prerequisites
- Git installed on local machine or HPC login node.
- Basic command-line knowledge (navigation, file editing).
- GitHub or GitLab account for remote repositories (optional but recommended).
Step 1: Initialize a Git Repository
Start tracking a new or existing project directory:
- cd project_name
- git init
- git add .
- git commit -m "Initial commit: add project structure"
Step 2: Connect to Remote Repository
Link local repository to GitHub/GitLab for collaboration:
- git remote add origin git@github.com:username/project.git
- git push -u origin main
Step 3: Tracking and Committing Changes
Add files and commit descriptive messages regularly:
- git status # Check changes
- git add script.py # Stage a file
- git commit -m "Add QC script for FASTQ processing"
Step 4: Branching Workflow
Use branches to develop features or fixes without affecting main code:
- git branch feature-qc
- git checkout feature-qc
- # Make changes, then commit
- git checkout main
- git merge feature-qc
Step 5: Ignoring Unnecessary Files
Exclude large data files and temporary outputs with `.gitignore`:
- echo "raw_data/" >> .gitignore
- echo "*.log" >> .gitignore
- git add .gitignore
- git commit -m "Add .gitignore for raw data and logs"
Step 6: Collaborating with Pull Requests
Best practice for lab work: submit changes via pull requests for review:
- Push branch to remote: `git push origin feature-qc`
- Open a pull request on GitHub/GitLab for code review.
- Merge after approval to maintain clean project history.
Step 7: Updating Local Repository
Keep local repository synchronized with remote:
- git pull origin main
Best Practices
- Commit early and often with meaningful messages.
- Use branches for features, bug fixes, or experiments.
- Do not commit large raw data files — track only scripts and metadata.
- Ensure all team members follow the same branching and review workflow.
Quality Control Across Workflows
Quality control (QC) is a critical first step in any bioinformatics analysis. This tutorial covers QC strategies for raw sequencing reads (FASTQ), alignment files (BAM), and specialized data types like single-cell RNA-seq or ATAC-seq.
Prerequisites
- Familiarity with basic command-line operations.
- FASTQ, BAM, or single-cell data files to analyze.
- Installed tools: FastQC, MultiQC, samtools, Picard (or available via Conda/Mamba).
Step 1: QC of FASTQ Files
Use FastQC to assess read quality, adapter contamination, and sequence content:
- fastqc raw_data/sample_R1.fastq.gz -o analysis/qc/
- fastqc raw_data/sample_R2.fastq.gz -o analysis/qc/
Aggregate results across multiple samples with MultiQC:
- multiqc analysis/qc/ -o analysis/qc/
Step 2: QC of BAM Files
After alignment, check BAM file integrity and mapping statistics:
- samtools quickcheck aligned.bam # Validate BAM file
- samtools flagstat aligned.bam # Alignment summary
- samtools idxstats aligned.bam # Per-chromosome mapping stats
Use Picard for detailed metrics (e.g., insert size):
- picard CollectInsertSizeMetrics I=aligned.bam O=insert_size.txt H=insert_size_histogram.pdf
Step 3: QC for Single-Cell and ATAC-seq
For single-cell RNA-seq, evaluate metrics like total counts, number of genes detected, and mitochondrial gene percentage:
- Seurat (R) or Scanpy (Python) can generate violin and scatter plots for QC.
- Typical thresholds: remove cells with low gene counts or high mitochondrial %.
For ATAC-seq, assess TSS enrichment and fragment size distribution:
- Tools like `ATACseqQC` or `deepTools` can compute these metrics.
Step 4: Summarizing QC Results
Use MultiQC to integrate all QC reports (FASTQ, BAM, etc.) into a single HTML report:
- multiqc analysis/qc/ -o analysis/qc_summary/
Best Practices
- Perform QC immediately after data generation or download.
- Establish lab-wide thresholds (e.g., minimum read count, mapping rate).
- Automate QC steps in pipelines to ensure consistent evaluation.
- Store raw QC reports alongside processed data for reproducibility.
Metadata Handling and Sample Sheets
Metadata and sample sheets provide essential context for sequencing data, including sample identifiers, conditions, and experimental design. Accurate and consistent metadata management ensures reproducibility and smooth pipeline execution.
Prerequisites
- Understanding of experiment design (e.g., biological replicates, treatment groups).
- Basic knowledge of spreadsheet formats (CSV, TSV).
- Access to command-line or scripting tools (Python/R) for validation.
Step 1: Standard Metadata Structure
A typical sample sheet includes these columns:
- SampleID: Unique identifier for each sample (e.g., `Sample01`).
- Condition/Group: Experimental group (e.g., Control, Treatment).
- Replicate: Biological replicate number.
- File Path: Absolute or relative path to raw data files (FASTQ, BAM).
- Notes: Optional field for comments or QC flags.
Step 2: File Naming and Metadata Consistency
Ensure sample names in metadata match raw data file names exactly:
- # Example raw FASTQ: Sample01_L1_R1.fastq.gz
- # Metadata SampleID: Sample01
- # Consistency prevents pipeline errors and mislabeling
Step 3: Manual Validation in Spreadsheet Tools
Use spreadsheet programs (Excel, Google Sheets) to check:
- No duplicate SampleIDs.
- All required fields are filled (no blank cells).
- File paths point to actual files in `raw_data/`.
Step 4: Automated Validation with Command-Line or Scripts
Example using Python (pandas) to validate a CSV file:
- python -c "import pandas as pd; df=pd.read_csv('samplesheet.csv'); print(df.isnull().sum())"
Example using R to check for duplicates:
- Rscript -e "data <- read.csv('samplesheet.csv'); any(duplicated(data\$SampleID))"
Step 5: Tracking Provenance
Record changes to metadata as the project progresses:
- Use version control (Git) for sample sheets.
- Add date-stamped versions (e.g., `samplesheet_2025-08-01.csv`).
- Document transformations (e.g., adding QC flags) in a CHANGELOG or README.
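One lightweight way to combine these steps (a sketch; file names and the commit message are illustrative):
- # Keep a date-stamped snapshot alongside the working copy
- cp metadata/samplesheet.csv metadata/samplesheet_$(date +%F).csv
- # Record the change in version control with a descriptive message
- git add metadata/
- git commit -m "Add QC flags to sample sheet ($(date +%F))"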
Best Practices
- Use consistent column names and formats across projects (e.g., always “SampleID”).
- Validate metadata before starting any analysis pipeline to prevent failures.
- Keep metadata under version control to track changes over time.
- Store metadata in the `metadata/` folder within the project structure.
Pipeline Basics with Nextflow
Nextflow is a workflow manager widely used in bioinformatics for reproducible and scalable analysis pipelines. This tutorial introduces basic DSL2 concepts, modular pipeline design, and running workflows on HPC systems with SLURM.
Prerequisites
- Basic knowledge of the Linux command line and file structures.
- Nextflow installed on your system (Nextflow installation guide).
- Familiarity with Conda/Mamba environments for software management.
Step 1: Understanding DSL2 Structure
Nextflow DSL2 pipelines are modular and consist of:
- Processes: Define a single computational step (e.g., QC, alignment).
- Channels: Pass data between processes (files, parameters).
- Workflows: Define pipeline execution order.
Step 2: Minimal DSL2 Example
A simple pipeline with one process for FastQC:
- nextflow.enable.dsl=2
- process FASTQC {
- input:
- path fastq
- output:
- path "fastqc_results"
- script:
- """
- mkdir -p fastqc_results
- fastqc $fastq -o fastqc_results
- """
- }
- workflow {
- Channel.fromPath('raw_data/*.fastq.gz') | FASTQC
- }
Step 3: Parameterizing Pipelines
Pass parameters (e.g., input folder) via `params`:
- params.input = "raw_data/*.fastq.gz"
- workflow {
- Channel.fromPath(params.input) | FASTQC
- }
Step 4: Using Profiles and Configs
Separate development and HPC configurations using `nextflow.config`:
- profiles {
- standard { process.executor = 'local' }
- slurm { process.executor = 'slurm'; executor.queueSize = 100 }
- }
Step 5: Running on HPC (SLURM)
Submit workflow to SLURM:
- nextflow run main.nf -profile slurm
Step 6: Managing Dependencies
Integrate Conda/Mamba or Singularity for reproducibility, e.g., with a per-process conda directive:
- conda 'bioconda::fastqc=0.11.9'
Or a containerized approach:
- container 'biocontainers/fastqc:v0.11.9_cv8'
Best Practices
- Keep processes modular — one tool per process.
- Use parameters for inputs and outputs instead of hardcoding paths.
- Test locally before scaling to HPC with SLURM.
- Version control pipeline code and configuration files.
Automating Routine Tasks with Bash
Bash scripting allows bioinformaticians to automate repetitive tasks, streamline pipelines, and minimize manual errors. This tutorial demonstrates common patterns for batch processing, job arrays, and logging in bioinformatics workflows.
Prerequisites
- Basic command-line knowledge (loops, variables, file redirection).
- Access to a Unix/Linux shell (local machine or HPC cluster).
- Understanding of raw vs processed data structures (see earlier tutorials).
Step 1: Writing a Simple Bash Script
Example script to compress all FASTQ files in a folder:
- #!/bin/bash
- for file in raw_data/*.fastq; do
- echo "Compressing $file"
- gzip "$file"
- done
Make the script executable:
- chmod +x compress_fastq.sh
Step 2: Using Variables and Arguments
Pass arguments to make scripts flexible:
- #!/bin/bash
- INPUT_DIR=$1
- for file in "$INPUT_DIR"/*.fastq.gz; do
- fastqc "$file" -o qc_results/
- done
Run the script:
- ./qc_fastq.sh raw_data
Step 3: Job Arrays for Multiple Samples (SLURM)
Efficiently submit multiple jobs for different samples:
- #!/bin/bash
- #SBATCH --job-name=fastqc_array
- #SBATCH --array=1-10
- #SBATCH --output=fastqc_%A_%a.log
- FILES=($(ls raw_data/*.fastq.gz))
- fastqc ${FILES[$SLURM_ARRAY_TASK_ID-1]} -o analysis/qc/
Step 4: Adding Logging and Error Handling
Capture stdout/stderr and exit on error:
- #!/bin/bash
- set -euo pipefail
- LOG="script.log"
- echo "Script started at $(date)" > $LOG
- for file in raw_data/*.fastq.gz; do
- echo "Processing $file" | tee -a $LOG
- fastqc "$file" -o qc_results/ 2>> $LOG
- done
- echo "Script finished at $(date)" >> $LOG
Step 5: Scheduling with Cron (Optional)
Automate periodic tasks (e.g., nightly backups) using cron:
- crontab -e
- # Run script every day at midnight
- 0 0 * * * /path/to/backup.sh
Best Practices
- Use variables and arguments to make scripts reusable across projects.
- Enable `set -euo pipefail` for safer script execution.
- Document each step with comments and maintain logs for reproducibility.
- Combine job arrays with pipelines for large-scale processing on HPC systems.
Benchmarking and Tool Comparison
Benchmarking helps evaluate and compare bioinformatics tools in terms of speed, memory usage, and accuracy. This tutorial demonstrates how to design fair benchmarks and document results to guide tool selection for the lab.
Prerequisites
- Basic command-line knowledge (timing commands, parsing logs).
- Access to multiple tools performing similar tasks (e.g., aligners or QC tools).
- Sample dataset representative of typical project workflows.
Step 1: Define Benchmarking Goals
Decide what you want to measure:
- Performance: Runtime and memory usage.
- Accuracy: Correctness of results (e.g., alignment rate, variant concordance).
- Scalability: Ability to handle larger datasets or parallel workloads.
Step 2: Prepare Input Data
Use a small representative dataset or a controlled subset of real data:
- Subset FASTQ files (e.g., first 1M reads) to reduce runtime during testing.
- Ensure all tools use the same reference genome and parameters for fair comparison.
Step 3: Measure Runtime and Memory
Use the `/usr/bin/time` command for resource tracking:
- /usr/bin/time -v bwa mem ref.fa reads.fq > output.sam
Key metrics to note:
- Elapsed (wall clock) time
- Maximum resident set size (memory usage)
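Both metrics can be pulled out of the `/usr/bin/time -v` output and collected into a CSV for later comparison; a sketch (tool, files, and column names are placeholders):
- echo "tool,elapsed,max_rss_kb" > benchmark.csv
- /usr/bin/time -v bwa mem ref.fa reads.fq > bwa.sam 2> bwa_time.log
- elapsed=$(grep "Elapsed (wall clock)" bwa_time.log | awk '{print $NF}')
- maxrss=$(grep "Maximum resident set size" bwa_time.log | awk '{print $NF}')
- echo "bwa,${elapsed},${maxrss}" >> benchmark.csv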
Step 4: Evaluate Accuracy
Compare tool outputs against known truth sets or between tools:
- Alignment rate (from `samtools flagstat`).
- Variant concordance with reference calls (using `bcftools isec` or precision/recall metrics).
- QC metrics (FastQC scores, coverage depth).
Step 5: Log and Visualize Results
Maintain structured records for benchmarking results:
- Record runtime, memory, accuracy metrics in CSV or TSV format.
- Visualize results with bar plots or tables for lab presentations.
- Store benchmarking scripts and results in version control for reproducibility.
Step 6: Compare Across Tools
Present results with clear comparisons:
- Table comparing runtime and memory (e.g., BWA vs Bowtie2 vs STAR).
- Charts showing accuracy metrics (e.g., alignment rate, variant calling precision).
Best Practices
- Benchmark tools on the same hardware and dataset for fair comparison.
- Document all versions, parameters, and environments used during testing.
- Update benchmarks when tools or datasets are upgraded.
- Use benchmarks to guide lab-wide tool adoption and pipeline updates.
Reproducibility and FAIR Principles
Reproducibility ensures that bioinformatics analyses can be replicated by others, while FAIR principles (Findable, Accessible, Interoperable, Reusable) promote data sharing and long-term usability. This tutorial outlines practical steps to make analyses both reproducible and FAIR-compliant.
Prerequisites
- Basic understanding of environment management (Conda/Mamba, containers).
- Familiarity with version control (Git) and structured project folders.
- Knowledge of metadata and documentation standards (see earlier tutorials).
Step 1: Capture Software Environments
Record exact software versions to ensure reproducibility:
- # Export Conda/Mamba environment
- conda env export > env_snapshot.yml
- # Recreate environment
- conda env create -f env_snapshot.yml
Alternatively, use container images (Docker/Singularity):
- singularity pull docker://biocontainers/fastqc:v0.11.9_cv8
- singularity exec fastqc_v0.11.9_cv8.sif fastqc --version
Step 2: Version Control for Code and Data
Track all analysis scripts and configuration files:
- Use Git for code and metadata (exclude raw data with `.gitignore`).
- Tag releases for key project milestones (e.g., `v1.0_analysis`).
- Store configuration and parameter files with pipelines for exact replication.
Step 3: Annotate Metadata for FAIR
Ensure metadata follows FAIR guidelines:
- Findable: Use unique identifiers (e.g., DOIs, accession numbers).
- Accessible: Provide clear access instructions (e.g., repository links).
- Interoperable: Use standard formats (CSV/TSV, JSON, YAML).
- Reusable: Include detailed descriptions (sample prep, QC thresholds).
Step 4: Document Analysis Steps
Maintain detailed records of workflows:
- README files describing data flow and processing steps.
- Workflow diagrams or schematic pipeline overviews.
- Logs and QC reports archived with results.
Step 5: Preparing Data for Sharing
Ensure datasets are ready for publication or repository upload:
- De-identify sensitive data (if human samples).
- Include checksums for file verification.
- Provide mapping of filenames to metadata for clarity.
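A simple manifest tying file names to checksums and sizes can support both points; a sketch using GNU coreutils (paths are illustrative):
- # TSV manifest: md5 checksum, size in bytes, file name
- for f in raw_data/*.fastq.gz; do
- printf "%s\t%s\t%s\n" "$(md5sum "$f" | cut -d' ' -f1)" "$(stat -c%s "$f")" "$(basename "$f")"
- done > metadata/file_manifest.tsv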
Best Practices
- Capture exact environment states (YAML or container images) at every major analysis step.
- Follow FAIR principles when storing and sharing data to maximize reuse and citation.
- Integrate reproducibility checks into pipeline development and lab reviews.
- Archive final results with complete metadata and QC summaries for future reference.
Secure Handling of Sensitive Data
Handling sensitive genomic data requires strict security measures to comply with regulations such as HIPAA and GDPR. This tutorial provides guidelines for encryption, controlled access, and de-identification to ensure data privacy and integrity.
Prerequisites
- Access to secure HPC or institutional storage systems.
- Basic knowledge of file permissions and encryption tools.
- Understanding of sensitive data policies (HIPAA/GDPR requirements).
Step 1: Identify Sensitive Data
Examples of sensitive data:
- Human genomic sequences (FASTQ, BAM, VCF) linked to identifiable individuals.
- Clinical metadata containing personal health information (PHI).
- Internal research data subject to embargo or collaboration agreements.
Step 2: Control Access with Permissions
Limit access to authorized users using Unix group permissions:
- # Set group ownership
- chgrp -R labgroup sensitive_project/
- # Restrict access to group members only
- chmod -R 770 sensitive_project/
Step 3: Encrypt Data at Rest and in Transit
Use encryption for both stored files and file transfers:
- # Encrypt a file with GPG
- gpg -c sensitive_data.fastq.gz
- # Decrypt the file
- gpg sensitive_data.fastq.gz.gpg
- # Secure copy with SSH (encrypted in transit)
- scp sensitive_data.fastq.gz user@secure-server:/path/
Step 4: De-Identification of Data
Remove or replace personally identifiable information (PII) in metadata:
- Replace patient names with anonymized IDs (e.g., `PAT001`).
- Strip dates of birth, addresses, and medical record numbers.
- Store ID-to-patient mappings in a separate encrypted file accessible only to authorized personnel.
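As an illustration only (the column layout and file names are hypothetical), an `awk` sketch that replaces names in the first column of a clinical metadata CSV with anonymized IDs taken from a separate mapping file:
- # id_map.csv format: PatientName,AnonID (store encrypted, restricted access)
- awk -F',' 'NR==FNR {map[$1]=$2; next} FNR==1 {print; next} {if ($1 in map) $1=map[$1]; print}' OFS=',' id_map.csv clinical_metadata.csv > deidentified_metadata.csv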
Step 5: Use Secure Storage and Transfer Services
Follow institutional or HPC policies for secure storage:
- Use encrypted network storage or approved cloud services (e.g., S3 with server-side encryption).
- Never use personal cloud accounts (Dropbox, Google Drive) for sensitive data unless explicitly approved.
Step 6: Monitor and Audit Access
Implement logging to track file access:
- Enable auditing tools (e.g., `aureport` on Linux) for sensitive directories.
- Review access logs regularly for unusual activity.
Best Practices
- Classify data sensitivity at project start and document handling policies.
- Use encryption by default for storage and transfer of human genomic data.
- Apply least-privilege principles: grant access only to required personnel.
- Periodically review permissions and revoke access for inactive users.
Building Publication-Quality Figures
High-quality figures are critical for communicating bioinformatics results in publications and presentations. This tutorial demonstrates best practices for creating clear, consistent, and reproducible plots using tools like R (ggplot2), Python (matplotlib), and MultiQC outputs.
Prerequisites
- Basic knowledge of data visualization in R or Python.
- Familiarity with bioinformatics outputs (UMAP, heatmaps, QC metrics).
- Installed plotting libraries (ggplot2, matplotlib, seaborn).
Step 1: Choose Appropriate Plot Types
Match figure type to data and message:
- UMAP/t-SNE: Visualize cell clustering in single-cell data.
- Heatmaps: Show gene expression or peak accessibility patterns.
- Bar/Box Plots: Summarize QC metrics or group comparisons.
- MultiQC: Combine multiple QC reports into one summary.
Step 2: Ensure Consistent Color Palettes
Use meaningful and colorblind-friendly palettes:
- Consistent cluster colors across all figures (e.g., same color for “Cluster 1”).
- Use palettes like `viridis` (Python/R) or ColorBrewer schemes.
- Avoid using red/green contrasts for accessibility.
Step 3: High-Resolution and Export Settings
Export figures at publication quality (300+ dpi):
- # In R (ggplot2)
- ggsave("figure1.png", dpi=300, width=6, height=4)
- # In Python (matplotlib)
- plt.savefig("figure1.png", dpi=300, bbox_inches="tight")
Step 4: Combining Multiple Plots
Create multi-panel figures for complex results:
- In R: use `patchwork` or `cowplot` for combining ggplots.
- In Python: use `matplotlib.pyplot.subplots()` or `gridspec`.
- Maintain consistent font sizes and axis labels across panels.
Step 5: Annotating Figures
Add informative labels and legends:
- Clearly label axes, groups, and statistical tests.
- Use panel labels (A, B, C) for multi-panel figures.
- Include descriptive figure captions when sharing internally or publishing.
Step 6: Reproducibility of Figures
Save figure scripts and data used for plotting:
- Commit plotting scripts to version control with the rest of the analysis.
- Document data preprocessing steps used for figure generation.
- Ensure figures can be regenerated by others in the lab.
Best Practices
- Use consistent styles (fonts, colors, axis labels) across all figures in a project.
- Generate vector graphics (SVG, PDF) when possible for scalable quality.
- Preview figures at final publication size to ensure readability of labels.
- Integrate figure generation into pipelines (RMarkdown, Jupyter) for full reproducibility.
Reproducible Reporting (RMarkdown/Quarto)
Reproducible reports combine code, results, and narrative text into a single document. Using RMarkdown or Quarto, bioinformaticians can generate dynamic reports (HTML/PDF) for collaborators and publications, ensuring transparency and easy updates.
Prerequisites
- R and RStudio installed (for RMarkdown) or Quarto installed (quarto.org).
- Basic knowledge of R or Python for data analysis and visualization.
- Familiarity with Markdown syntax for formatting text.
Step 1: RMarkdown Basics
RMarkdown files have three components:
- YAML header: Title, author, output format.
- Markdown text: Descriptive content, figures, and tables.
- Code chunks: Embedded R or Python code for analysis.
Example YAML header:
- ---
- title: "QC Report"
- author: "Bioinformatics Lab"
- output: html_document
- ---
Step 2: Quarto Basics
Quarto supports R, Python, and multi-language documents with extended publishing features:
- Use `quarto render report.qmd` to generate reports.
- Supports advanced layouts (dashboards, interactive plots).
- Can render directly to PDF, Word, or HTML from the same source file.
Step 3: Embedding Code and Results
Code chunks run during report rendering and insert results dynamically:
- ```{r}
- summary(read_counts)
- ```
For Python (Quarto or RMarkdown with `reticulate`):
- ```{python}
- import pandas as pd
- pd.read_csv("qc_metrics.csv").head()
- ```
Step 4: Adding Figures and Tables
Use plotting libraries (ggplot2, matplotlib) inside code chunks:
- ```{r fig.width=6, fig.height=4}
- library(ggplot2)
- ggplot(df, aes(x=Sample, y=Reads)) + geom_bar(stat="identity")
- ```
Step 5: Rendering Reports
Generate outputs in different formats:
- # RMarkdown (RStudio or command line)
- rmarkdown::render("report.Rmd")
- # Quarto
- quarto render report.qmd --to html
Step 6: Lab Standard Templates
Maintain reusable templates for consistency:
- Predefined YAML headers with lab branding (logo, author, date).
- Sections for QC summary, figures, and interpretation.
- Automatic inclusion of MultiQC or pipeline outputs.
Best Practices
- Integrate reports into pipelines to auto-generate after analyses.
- Version control reports and data used to produce them.
- Ensure all figures and tables are labeled and captioned for clarity.
- Use Quarto for multi-language or advanced layouts; RMarkdown for R-focused workflows.
Code Review and Collaboration Workflow
Code review ensures quality, reproducibility, and maintainability of bioinformatics scripts and pipelines. This tutorial outlines lab practices for collaborative development using Git branching, pull requests, and standardized coding styles.
Prerequisites
- Basic Git knowledge (init, commit, branch, push).
- Access to GitHub/GitLab repository for the lab.
- Familiarity with lab coding style and documentation standards.
Step 1: Branching Strategy
Use feature branches for development and keep `main` stable:
- git checkout -b feature-add-qc
- # Develop feature on branch
- git add qc_script.sh
- git commit -m "Add QC script"
- git push origin feature-add-qc
Step 2: Submitting Pull Requests (PRs)
PRs allow review before merging into main:
- Open PR on GitHub/GitLab describing changes and purpose.
- Link to related issues or tickets (if applicable).
- Assign reviewers (lab members responsible for this module).
Step 3: Review Checklist
Reviewers should check for:
- Correctness (does it solve the intended problem?).
- Readability (clear variable names, comments).
- Consistency (follows lab style guide and directory structure).
- Reproducibility (dependencies and parameters documented).
Step 4: Coding Style Standards
Adopt a consistent style across the lab:
- Use lowercase filenames with underscores (`sample_qc.sh`).
- Document scripts with header comments (author, purpose, date).
- Limit line length (e.g., 80-100 characters) for readability.
- Use consistent indentation (2 or 4 spaces, no tabs).
Step 5: Handling Comments and Changes
Incorporate reviewer feedback efficiently:
- Address comments in new commits (avoid force-pushing unless necessary).
- Mark resolved comments in the PR discussion.
- Communicate rationale for design choices in replies if not changing code.
Step 6: Merging and Cleanup
After approval:
- Squash commits (if preferred) to keep history clean.
- Merge PR into `main` using "merge commit" or "rebase and merge."
- Delete feature branch after merge to reduce clutter.
Best Practices
- Keep PRs small and focused for easier review.
- Review regularly to avoid blocking team workflows.
- Use templates for PR descriptions and review checklists.
- Document style guide and review process in the lab wiki for new members.
Lab Data Lifecycle and Archival
Managing the entire lifecycle of bioinformatics data—from raw acquisition to long-term archival—is essential for efficient storage usage, cost control, and reproducibility. This tutorial outlines best practices for organizing, cleaning, and archiving data in a lab setting.
Prerequisites
- Understanding of lab storage systems (scratch, project, and archival storage).
- Familiarity with data formats (FASTQ, BAM, VCF, metadata files).
- Knowledge of institutional policies for data retention and sharing.
Step 1: Define Data Stages
Classify data into stages for better handling:
- Raw data: Unmodified output from sequencing instruments.
- Processed data: Outputs from pipelines (QC, alignment, counts).
- Final results: Ready for publication or sharing (figures, tables).
Step 2: Storage Policies
Understand storage tiers and assign data accordingly:
- Scratch: Temporary high-speed storage for active computation; purge after job completion.
- Project storage: Medium-term storage for ongoing analyses.
- Archive: Long-term storage for completed projects or regulatory retention.
Step 3: Data Retention Timelines
Set clear retention periods for each data stage:
- Raw data: Retain at least until publication or per funder’s requirement.
- Processed data: Retain for re-analysis or method validation (e.g., 3–5 years).
- Final results: Retain indefinitely in lab archives or repositories.
Step 4: Preparing Data for Archival
Before archiving:
- Remove redundant intermediate files to save space.
- Compress large files (e.g., BAM to CRAM) when possible.
- Generate checksums (md5/sha256) for integrity verification.
- Include README with metadata, project summary, and software versions.
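A sketch of the compression and checksum steps with samtools (CRAM conversion requires the same reference FASTA used for alignment; file names are illustrative):
- # Convert BAM to CRAM against the alignment reference
- samtools view -C -T reference.fasta -o sample.cram sample.bam
- # Verify the CRAM is readable, then checksum it for the archive manifest
- samtools quickcheck sample.cram
- sha256sum sample.cram > sample.cram.sha256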
Step 5: Archival Systems and Tools
Options for archival:
- Institutional cold storage or tape systems (low-cost, slower access).
- Cloud archival (e.g., AWS Glacier, Google Coldline) with encryption.
- External repositories (NCBI SRA, GEO) for public datasets.
Step 6: Deletion and Cleanup
Implement regular cleanup cycles:
- Delete scratch data after jobs complete (avoid accidental buildup).
- Remove temporary or duplicate files before moving to archive.
- Log deletions in a project lifecycle document for accountability.
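A cautious sketch for finding stale scratch files before deleting them (the path, age threshold, and log location are placeholders; always review the list before removing anything):
- # List files in scratch untouched for more than 30 days
- find /scratch/$USER -type f -mtime +30 > stale_files.txt
- # After reviewing stale_files.txt, remove the files and log the action
- xargs -r -d '\n' -a stale_files.txt rm -v >> logs/cleanup_$(date +%F).log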
Best Practices
- Separate raw, processed, and final data clearly with dedicated folders.
- Document archival locations and retention timelines for each project.
- Perform periodic audits to ensure compliance with institutional policies.
- Store both data and metadata (QC reports, analysis logs) together for future reuse.