Welcome to the Bioinformatics Lab Skill Handbook — a comprehensive tutorial series designed to equip researchers with practical, hands-on skills for modern bioinformatics analysis.
These tutorials cover the entire research workflow — from environment setup and data management to pipeline development, quality control, visualization, and collaboration. Each guide emphasizes reproducibility, efficiency, and lab best practices, helping you build a strong foundation for both independent research and collaborative projects.
Using Mamba and Bioconda for Bioinformatic Research
This guide provides step-by-step instructions for setting up and using Mamba and Bioconda to manage bioinformatics software and environments efficiently.
Prerequisites
- Basic command-line knowledge (Linux/macOS/Windows WSL).
- Miniconda or Anaconda installed (download Miniconda if not already available).
Step 1: Install Mamba
Mamba is a faster alternative to Conda. Install it into the base environment:
- conda install mamba -n base -c conda-forge
Verify the installation:
- mamba --version
Step 2: Configure Bioconda
Bioconda provides thousands of bioinformatics tools. Configure Conda channels in the recommended order:
- Add channels:
- conda config --add channels defaults
- conda config --add channels bioconda
- conda config --add channels conda-forge
- Set strict channel priority:
- conda config --set channel_priority strict
Verify channel configuration:
- conda config --show channels
Step 3: Create a Bioinformatics Environment
Create and activate a new environment with commonly used tools (e.g., FastQC):
- mamba create -n bioenv fastqc
- conda activate bioenv
- fastqc --version
Step 4: Install Additional Tools
Install additional tools as needed (e.g., BWA, Samtools, BCFtools):
- mamba install bwa samtools bcftools
Verify installations:
- bwa                     # Prints usage and version information (bwa has no --version flag)
- samtools --version
- bcftools --version
Step 5: Search for Bioconda Packages
Find and install tools available on Bioconda:
- mamba search bowtie2
- mamba install bowtie2
Step 6: Update and Manage Environments
- Update all packages:
- mamba update --all
- Remove a package:
- mamba remove fastqc
- Deactivate the environment:
- conda deactivate
- Remove an environment:
- conda env remove -n bioenv
Step 7: Export and Share Environments
Export an environment to a YAML file for sharing or backup:
- conda activate bioenv
- conda env export > bioenv.yml
Recreate the environment elsewhere:
- conda env create -f bioenv.yml
Step 8: Use Mamba in Scripts
Example of incorporating Mamba into a workflow script:
- #!/bin/bash
- # Create environment
- mamba create -n workflow_env -y fastqc bwa samtools
- # Make `conda activate` available in a non-interactive script
- source "$(conda info --base)/etc/profile.d/conda.sh"
- # Activate environment
- conda activate workflow_env
- # Run analysis
- fastqc input.fastq
- bwa index reference.fasta
- bwa mem reference.fasta input.fastq > output.sam
- samtools view -b output.sam > output.bam
- # Deactivate environment
- conda deactivate
Best Practices
- Use Mamba for faster installations, especially on HPC systems.
- Keep environments modular — one per project or workflow stage.
- Regularly update packages to maintain performance and compatibility.
- Share YAML files for consistent environments across collaborators.
HPC: Build Environment
Building and managing environments on high-performance computing (HPC) clusters requires different practices than local machines. HPC systems often have limited internet access, use module systems, and enforce storage quotas, so environments must be isolated and reproducible.
Prerequisites
- Basic knowledge of Conda/Mamba (see previous tutorial).
- Access to an HPC cluster with a scheduler (e.g., SLURM) and module system.
- Familiarity with the cluster’s storage layout (home, scratch, project directories).
Step 1: Understand HPC Constraints
- HPC nodes may lack direct internet access; use login nodes or offline package downloads.
- System modules (e.g., pre-installed Python) can conflict with Conda/Mamba — load minimal modules.
- Home directories may have small quotas; store environments in project or scratch directories.
Step 2: Set Environment Storage Path
Store Conda/Mamba environments in a project-specific path for better control and sharing:
- mkdir -p /path/to/project/envs
- conda config --add envs_dirs /path/to/project/envs
Step 3: Create Environment with Mamba
Example: Create an environment for RNA-seq analysis:
- mamba create -n rnaseq fastqc star samtools
- conda activate rnaseq
Step 4: Handle Module Conflicts
- Unload unnecessary modules before activating Conda/Mamba environments:
- module purge
- Load only required compilers or libraries if needed by tools (e.g., `module load gcc/9.3.0`).
Step 5: Use Containers for Reproducibility
Singularity or Docker containers ensure consistent environments across nodes:
- singularity pull docker://biocontainers/fastqc:v0.11.9_cv8
- singularity exec fastqc_v0.11.9_cv8.sif fastqc --version
Step 6: Export and Share Environment
- conda env export > rnaseq_env.yml
- conda env create -f rnaseq_env.yml
Best Practices
- Keep environments in project directories to avoid personal quota limits.
- Document environment paths and YAML files in project READMEs for reproducibility.
- Use Mamba for faster resolution and installation on HPC systems.
- Prefer containers for complex pipelines or when sharing environments across clusters.
Command-Line Mastery for Bioinformatics
Mastering the Linux command line is a core skill for every bioinformatician. Most bioinformatics tools are command-line based, and efficient CLI usage improves productivity, reproducibility, and troubleshooting across all stages of data analysis.
Prerequisites
- Basic understanding of Linux file system structure (/, home, relative vs absolute paths).
- Access to a terminal (local machine or HPC cluster).
- Familiarity with basic navigation commands like `ls`, `cd`, and `pwd`.
Step 1: Navigating the File System
Efficient navigation is essential when working with large project directories.
- pwd # Show current directory
- ls -lh # List files with human-readable sizes
- cd /path/to/project # Change to project directory
- cd .. # Go up one directory
Step 2: Viewing and Inspecting Files
Quickly check file contents without opening them in an editor.
- head file.fastq              # View first 10 lines (uncompressed files; for .gz use zcat below)
- tail file.log # View last 10 lines
- less file.txt # Scroll through a file interactively
- zcat file.fastq.gz | head # View compressed FASTQ without decompression
Step 3: Searching and Filtering
Extract relevant lines or columns from large text files using grep, awk, and sed.
- grep "ATCG" file.fastq # Find lines containing 'ATCG'
- awk '{print $1, $2}' file.tsv # Print first two columns
- sed 's/foo/bar/g' file.txt # Replace 'foo' with 'bar'
Step 4: Sorting and Counting
Summarize data for sanity checks or quick statistics.
- sort file.txt | uniq # Unique sorted values
- wc -l file.txt # Count number of lines
- cut -f1 file.tsv | sort | uniq -c # Count unique values in column 1
Step 5: Combining Commands with Pipes
Chain commands together for efficient one-liners.
- zcat file.fastq.gz | grep "ATCG" | wc -l # Count lines with 'ATCG' in compressed file
- cut -f1 file.tsv | sort | uniq -c | sort -nr # Rank most frequent items
Step 6: Aliases and Shortcuts
Save time by creating command shortcuts in your shell configuration (e.g., `~/.bashrc`).
- alias ll='ls -lh'
- alias ..='cd ..'
- gzhead() { zcat "$1" | head; }   # Shell function (aliases cannot take arguments)
Best Practices
- Prefer one-liners for quick checks; scripts for repeatable workflows.
- Use pipes to minimize intermediate files and save disk space.
- Document frequently used commands in a personal cheat sheet.
- Learn core tools deeply (grep, awk, sed) — they apply to all data types.
Working with HPC Systems (SLURM & Modules)
High-performance computing (HPC) clusters allow bioinformaticians to run large-scale analyses efficiently. Understanding how to submit jobs with SLURM and manage software using environment modules is essential for effective use of shared resources.
Prerequisites
- Access to an HPC cluster with SLURM job scheduler and module system.
- Basic familiarity with the Linux command line (see previous tutorial).
- Knowledge of project storage locations (home, scratch, project directories).
Step 1: Understanding SLURM Basics
SLURM manages job scheduling and resource allocation. Key commands:
- sinfo # Show cluster partition and node availability
- squeue # View running/pending jobs
- scancel JOBID # Cancel a job by ID
- sacct # View job history and resource usage
Step 2: Writing a SLURM Job Script
Example script to run FastQC on an HPC cluster:
- #!/bin/bash
- #SBATCH --job-name=fastqc_job # Job name
- #SBATCH --output=fastqc_%j.log # Log file (with job ID)
- #SBATCH --time=02:00:00 # Wall time limit (2 hours)
- #SBATCH --mem=8G # Memory allocation
- #SBATCH --cpus-per-task=4 # Number of CPU cores
- #SBATCH --partition=general # Partition/queue name
- module load fastqc/0.11.9 # Load FastQC module
- fastqc raw_data/sample.fastq.gz -o analysis/qc/
Step 3: Submitting and Monitoring Jobs
Submit and check jobs:
- sbatch fastqc_job.sh # Submit job
- squeue -u $USER # Monitor jobs for current user
- tail -f fastqc_12345.log # Live view of log file
Step 4: Using Interactive Jobs
Request an interactive session for quick testing or debugging:
- srun --time=01:00:00 --mem=4G --cpus-per-task=2 --pty bash
Step 5: Working with Modules
Modules provide access to pre-installed software without modifying your environment permanently:
- module avail # List available modules
- module load samtools/1.15 # Load a specific version
- module list # Show currently loaded modules
- module purge # Unload all modules
Step 6: Combining Modules and Conda/Mamba
Best practice: load minimal system modules (e.g., compilers) and use Conda/Mamba for tool installation to avoid conflicts.
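A minimal sketch of how this can look at the top of a job script; the module name and environment name are examples and should match your cluster and project:
- #!/bin/bash
- # Start from a clean module environment to avoid library conflicts
- module purge
- # Load only what Conda cannot provide (example compiler module)
- module load gcc/9.3.0
- # Make `conda activate` usable in a non-interactive job, then activate the project environment
- source "$(conda info --base)/etc/profile.d/conda.sh"
- conda activate rnaseq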
Best Practices
- Test commands interactively before submitting large jobs.
- Request only the resources needed to reduce queue times.
- Organize logs and outputs into separate directories for clarity.
- Document job scripts with comments for reproducibility and sharing.
Data Handling and File Management
Effective organization and management of data is critical for reproducibility and collaboration in bioinformatics projects. This tutorial covers folder structures, data transfer methods, compression, and indexing to help maintain clean and scalable workflows.
Prerequisites
- Basic command-line knowledge (navigation and file operations).
- Access to local and/or HPC project storage.
- Familiarity with raw vs processed data (e.g., FASTQ vs BAM).
Step 1: Recommended Project Folder Structure
Organize project files into clearly separated directories:
- project_name/
├── raw_data/ # FASTQ, BAM, or VCF files from sequencing
├── metadata/ # Sample sheets, design files, README
├── analysis/ # Processed outputs (QC, counts, plots)
├── scripts/ # Analysis scripts, notebooks
└── logs/ # Log files from pipelines or scripts
Step 2: Creating Folder Structure
- mkdir -p project_name/{raw_data,metadata,analysis,scripts,logs}
Step 3: Data Transfer Between Local and HPC
Use `scp` for single transfers or `rsync` for large, resumable transfers:
- # Local → HPC
- scp file.fastq.gz user@cluster:/path/to/project/raw_data/
- # HPC → Local
- scp user@cluster:/path/to/file.fastq.gz ./raw_data/
- # Sync entire folder (resumable)
- rsync -avP raw_data/ user@cluster:/path/to/project/raw_data/
Step 4: Compression and Decompression
Sequencing files are typically compressed to save space:
- # Compress FASTQ
- gzip file.fastq
- # Decompress FASTQ
- gunzip file.fastq.gz
Step 5: Indexing Large Files
Index files for faster random access during analysis:
- # Index BAM
- samtools index file.bam
- # Index VCF (must be bgzip-compressed)
- tabix -p vcf file.vcf.gz
Step 6: Documentation
Include a README file in the project root to describe:
- Project description and objectives
- Folder structure and file naming conventions
- Data source and acquisition date
- Checksums or QC summaries for verification
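As a starting point, a README skeleton covering these sections can be generated from the command line (a sketch; adjust the headings and content to your project):
- cat > project_name/README.md << 'EOF'
- # Project: <project_name>
- ## Description and objectives
- ## Folder structure and file naming conventions
- ## Data source and acquisition date
- ## Checksums and QC summaries
- EOF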
Best Practices
- Separate raw and processed data; never overwrite raw files.
- Use descriptive file names (e.g., `Sample01_L1_R1.fastq.gz`).
- Document transfer methods and data provenance in README or metadata files.
- Prefer `rsync` for large datasets due to resumable transfers and speed.
Quick Data Sanity Checks Before Analysis
Before starting any bioinformatics workflow, it is essential to verify that input data is complete, correctly formatted, and consistent with associated metadata. Quick checks help prevent downstream errors and wasted computation time.
Prerequisites
- Basic familiarity with sequencing file formats (FASTQ, BAM, VCF).
- Command-line skills for viewing and counting file contents.
- Access to metadata files (e.g., sample sheets) for cross-checking.
Step 1: Verify File Completeness
Ensure all expected files are present and readable:
- ls raw_data/ # List files in raw data directory
- file sample.fastq.gz # Confirm file type (gzip-compressed FASTQ)
- zcat sample.fastq.gz | head # Inspect first lines of FASTQ
Step 2: Count Reads in FASTQ
Each read in FASTQ spans 4 lines. Divide line count by 4 to get total reads:
- NLINES=$(zcat sample.fastq.gz | wc -l)   # Count lines
- echo $((NLINES / 4))                     # Convert to number of reads
Step 3: Validate Paired-End FASTQ Files
Check that R1 and R2 files have the same number of reads:
- zcat sample_R1.fastq.gz | wc -l
- zcat sample_R2.fastq.gz | wc -l
- # The two counts should be equal
Step 4: Check BAM/VCF File Integrity
Use samtools and bcftools to confirm file validity:
- samtools quickcheck file.bam # Check BAM header and EOF marker
- bcftools view file.vcf.gz | head # Inspect VCF header lines
Step 5: Compare to Metadata
Ensure file names and counts match metadata entries (e.g., sample sheet):
- Check sample IDs: consistent naming between FASTQ files and metadata table.
- Confirm expected number of files (e.g., paired-end vs single-end).
- Verify read depth or total reads roughly align with sequencing plan.
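These comparisons can be scripted. The sketch below assumes a `metadata/samplesheet.csv` whose first column is `SampleID` and FASTQ names following the `Sample01_L1_R1.fastq.gz` convention; adjust paths and patterns to your project:
- # Sample IDs expected from the sample sheet (skip the header line)
- cut -d',' -f1 metadata/samplesheet.csv | tail -n +2 | sort -u > expected_ids.txt
- # Sample IDs actually present in raw_data/ (derived from R1 file names)
- ls raw_data/*_R1.fastq.gz | xargs -n1 basename | sed 's/_L[0-9]*_R1\.fastq\.gz//' | sort -u > found_ids.txt
- # IDs present in only one of the two lists indicate missing or unexpected files
- comm -3 expected_ids.txt found_ids.txt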
Step 6: Use Checksums for File Integrity
Generate and verify checksums to detect file corruption:
- md5sum sample.fastq.gz > sample.fastq.gz.md5
- md5sum -c sample.fastq.gz.md5
Best Practices
- Perform sanity checks immediately after data transfer or download.
- Record basic QC (read counts, file sizes) in a log or metadata sheet.
- Flag discrepancies (e.g., mismatched read counts) before starting pipelines.
- Integrate quick checks into automated workflows for consistency.
File and Data Management Standards
Consistent file naming, folder structure, and permissions are essential for collaboration and reproducibility in bioinformatics projects. This tutorial outlines lab-wide standards for organizing data, managing permissions, and documenting workflows.
Prerequisites
- Access to shared storage (HPC project directory or lab server).
- Basic understanding of file permissions and symbolic links.
- Familiarity with project folder structures (see previous tutorials).
Step 1: Standard Folder Structure
Use a consistent structure across all projects:
- project_name/
├── raw_data/ # Unmodified input data (FASTQ, BAM)
├── metadata/ # Sample sheets, design files, README
├── analysis/ # Processed results (QC, counts, plots)
├── scripts/ # Analysis scripts or notebooks
└── logs/ # Pipeline or job logs
Step 2: File Naming Conventions
Adopt clear, descriptive file names to avoid confusion and ensure traceability:
- Sample identifiers: Use alphanumeric IDs (e.g., `Sample01`).
- Include lane and read info: `Sample01_L1_R1.fastq.gz`
- Avoid spaces or special characters: Use underscores (`_`) or hyphens (`-`).
- Use lowercase where possible: Keeps scripts portable across systems.
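A quick check like the sketch below (assuming the `SampleXX_LX_RX.fastq.gz` pattern above) can flag files that do not follow the convention:
- for f in raw_data/*.fastq.gz; do
- name=$(basename "$f")
- [[ "$name" =~ ^[A-Za-z0-9]+_L[0-9]+_R[12]\.fastq\.gz$ ]] || echo "Non-conforming name: $name"
- done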
Step 3: Managing Permissions
Ensure collaborators can read/write without accidental deletion:
- # Set group ownership
- chgrp -R labgroup project_name/
- # Grant group read/write permissions
- chmod -R g+rwX project_name/
Step 4: Using Symbolic Links
Symbolic links reduce duplication and simplify organization:
- # Create symbolic link
- ln -s /path/to/raw_data/sample1.fastq.gz analysis/sample1.fastq.gz
- Always document link sources in a README for traceability.
Step 5: Documentation with README Files
Each project should include a README describing:
- Project purpose and description
- Folder layout and naming conventions
- Data source and acquisition details
- QC summaries and key commands used
Step 6: Versioning of Data and Scripts
Record versions of raw data, scripts, and results for reproducibility:
- Tag releases (e.g., `v1.0`) when datasets or analyses are finalized.
- Maintain changelogs for major updates (e.g., new reference genome).
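For example, a finalized dataset or analysis can be frozen with a Git tag (a sketch assuming the project is already under Git; file names are illustrative):
- git add metadata/samplesheet.csv analysis/
- git commit -m "Finalize dataset and analysis outputs for v1.0"
- git tag -a v1.0 -m "Frozen dataset and analysis for publication"
- git push origin v1.0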
Best Practices
- Apply consistent naming across all projects and pipelines.
- Separate raw, processed, and temporary data to prevent accidental overwrites.
- Use group permissions for collaborative work; avoid personal-only folders.
- Document everything (folder purpose, file naming) in a README for each project.
Version Control with Git
Git is essential for tracking changes to scripts, documenting analysis history, and collaborating on code. This tutorial introduces basic Git workflows used in bioinformatics labs, including repository setup, branching, and collaboration via GitHub/GitLab.
Prerequisites
- Git installed on local machine or HPC login node.
- Basic command-line knowledge (navigation, file editing).
- GitHub or GitLab account for remote repositories (optional but recommended).
Step 1: Initialize a Git Repository
Start tracking a new or existing project directory:
- cd project_name
- git init
- git add .
- git commit -m "Initial commit: add project structure"
Step 2: Connect to Remote Repository
Link local repository to GitHub/GitLab for collaboration:
- git remote add origin git@github.com:username/project.git
- git push -u origin main
Step 3: Tracking and Committing Changes
Add files and commit descriptive messages regularly:
- git status # Check changes
- git add script.py # Stage a file
- git commit -m "Add QC script for FASTQ processing"
Step 4: Branching Workflow
Use branches to develop features or fixes without affecting main code:
- git branch feature-qc
- git checkout feature-qc
- # Make changes, then commit
- git checkout main
- git merge feature-qc
Step 5: Ignoring Unnecessary Files
Exclude large data files and temporary outputs with `.gitignore`:
- echo "raw_data/" >> .gitignore
- echo "*.log" >> .gitignore
- git add .gitignore
- git commit -m "Add .gitignore for raw data and logs"
Step 6: Collaborating with Pull Requests
Best practice for lab work: submit changes via pull requests for review:
- Push branch to remote: `git push origin feature-qc`
- Open a pull request on GitHub/GitLab for code review.
- Merge after approval to maintain clean project history.
Step 7: Updating Local Repository
Keep local repository synchronized with remote:
- git pull origin main
Best Practices
- Commit early and often with meaningful messages.
- Use branches for features, bug fixes, or experiments.
- Do not commit large raw data files — track only scripts and metadata.
- Ensure all team members follow the same branching and review workflow.
Quality Control Across Workflows
Quality control (QC) is a critical first step in any bioinformatics analysis. This tutorial covers QC strategies for raw sequencing reads (FASTQ), alignment files (BAM), and specialized data types like single-cell RNA-seq or ATAC-seq.
Prerequisites
- Familiarity with basic command-line operations.
- FASTQ, BAM, or single-cell data files to analyze.
- Installed tools: FastQC, MultiQC, samtools, Picard (or available via Conda/Mamba).
Step 1: QC of FASTQ Files
Use FastQC to assess read quality, adapter contamination, and sequence content:
- fastqc raw_data/sample_R1.fastq.gz -o analysis/qc/
- fastqc raw_data/sample_R2.fastq.gz -o analysis/qc/
Aggregate results across multiple samples with MultiQC:
- multiqc analysis/qc/ -o analysis/qc/
Step 2: QC of BAM Files
After alignment, check BAM file integrity and mapping statistics:
- samtools quickcheck aligned.bam # Validate BAM file
- samtools flagstat aligned.bam # Alignment summary
- samtools idxstats aligned.bam # Per-chromosome mapping stats
Use Picard for detailed metrics (e.g., insert size):
- picard CollectInsertSizeMetrics I=aligned.bam O=insert_size.txt H=insert_size_histogram.pdf
Step 3: QC for Single-Cell and ATAC-seq
For single-cell RNA-seq, evaluate metrics like total counts, number of genes detected, and mitochondrial gene percentage:
- Seurat (R) or Scanpy (Python) can generate violin and scatter plots for QC.
- Typical thresholds: remove cells with low gene counts or high mitochondrial %.
For ATAC-seq, assess TSS enrichment and fragment size distribution:
- Tools like `ATACseqQC` or `deepTools` can compute these metrics.
Step 4: Summarizing QC Results
Use MultiQC to integrate all QC reports (FASTQ, BAM, etc.) into a single HTML report:
- multiqc analysis/qc/ -o analysis/qc_summary/
Best Practices
- Perform QC immediately after data generation or download.
- Establish lab-wide thresholds (e.g., minimum read count, mapping rate).
- Automate QC steps in pipelines to ensure consistent evaluation.
- Store raw QC reports alongside processed data for reproducibility.
Metadata Handling and Sample Sheets
Metadata and sample sheets provide essential context for sequencing data, including sample identifiers, conditions, and experimental design. Accurate and consistent metadata management ensures reproducibility and smooth pipeline execution.
Prerequisites
- Understanding of experiment design (e.g., biological replicates, treatment groups).
- Basic knowledge of spreadsheet formats (CSV, TSV).
- Access to command-line or scripting tools (Python/R) for validation.
Step 1: Standard Metadata Structure
A typical sample sheet includes these columns:
- SampleID: Unique identifier for each sample (e.g., `Sample01`).
- Condition/Group: Experimental group (e.g., Control, Treatment).
- Replicate: Biological replicate number.
- File Path: Absolute or relative path to raw data files (FASTQ, BAM).
- Notes: Optional field for comments or QC flags.
Step 2: File Naming and Metadata Consistency
Ensure sample names in metadata match raw data file names exactly:
- # Example raw FASTQ: Sample01_L1_R1.fastq.gz
- # Metadata SampleID: Sample01
- # Consistency prevents pipeline errors and mislabeling
Step 3: Manual Validation in Spreadsheet Tools
Use spreadsheet programs (Excel, Google Sheets) to check:
- No duplicate SampleIDs.
- All required fields are filled (no blank cells).
- File paths point to actual files in `raw_data/`.
Step 4: Automated Validation with Command-Line or Scripts
Example using Python (pandas) to validate a CSV file:
- python -c "import pandas as pd; df=pd.read_csv('samplesheet.csv'); print(df.isnull().sum())"
Example using R to check for duplicates:
- Rscript -e "data <- read.csv('samplesheet.csv'); any(duplicated(data\$SampleID))"
Step 5: Tracking Provenance
Record changes to metadata as the project progresses:
- Use version control (Git) for sample sheets.
- Add date-stamped versions (e.g., `samplesheet_2025-08-01.csv`).
- Document transformations (e.g., adding QC flags) in a CHANGELOG or README.
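One lightweight way to combine these steps (a sketch; file names and the commit message are illustrative):
- # Keep a date-stamped snapshot alongside the working copy
- cp metadata/samplesheet.csv metadata/samplesheet_$(date +%F).csv
- # Record the change in version control with a descriptive message
- git add metadata/
- git commit -m "Add QC flags to sample sheet ($(date +%F))"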
Best Practices
- Use consistent column names and formats across projects (e.g., always “SampleID”).
- Validate metadata before starting any analysis pipeline to prevent failures.
- Keep metadata under version control to track changes over time.
- Store metadata in the `metadata/` folder within the project structure.
Pipeline Basics with Nextflow
Nextflow is a workflow manager widely used in bioinformatics for reproducible and scalable analysis pipelines. This tutorial introduces basic DSL2 concepts, modular pipeline design, and running workflows on HPC systems with SLURM.
Prerequisites
- Basic knowledge of the Linux command line and file structures.
- Nextflow installed on your system (Nextflow installation guide).
- Familiarity with Conda/Mamba environments for software management.
Step 1: Understanding DSL2 Structure
Nextflow DSL2 pipelines are modular and consist of:
- Processes: Define a single computational step (e.g., QC, alignment).
- Channels: Pass data between processes (files, parameters).
- Workflows: Define pipeline execution order.
Step 2: Minimal DSL2 Example
A simple pipeline with one process for FastQC:
- nextflow.enable.dsl=2
- process FASTQC {
- input:
- path fastq
- output:
- path "fastqc_results"
- script:
- """
- mkdir -p fastqc_results
- fastqc $fastq -o fastqc_results
- """
- }
- workflow {
- Channel.fromPath('raw_data/*.fastq.gz') | FASTQC
- }
Step 3: Parameterizing Pipelines
Pass parameters (e.g., input folder) via `params`:
- params.input = "raw_data/*.fastq.gz"
- workflow {
- Channel.fromPath(params.input) | FASTQC
- }
Step 4: Using Profiles and Configs
Separate development and HPC configurations using `nextflow.config`:
- profiles {
- standard { process.executor = 'local' }
- slurm { process.executor = 'slurm'; executor.queueSize = 100 }
- }
Step 5: Running on HPC (SLURM)
Submit workflow to SLURM:
- nextflow run main.nf -profile slurm
Step 6: Managing Dependencies
Integrate Conda/Mamba or Singularity for reproducibility, e.g., with a per-process conda directive:
- conda 'bioconda::fastqc=0.11.9'
Or a containerized approach:
- container 'biocontainers/fastqc:v0.11.9_cv8'
Best Practices
- Keep processes modular — one tool per process.
- Use parameters for inputs and outputs instead of hardcoding paths.
- Test locally before scaling to HPC with SLURM.
- Version control pipeline code and configuration files.
Automating Routine Tasks with Bash
Bash scripting allows bioinformaticians to automate repetitive tasks, streamline pipelines, and minimize manual errors. This tutorial demonstrates common patterns for batch processing, job arrays, and logging in bioinformatics workflows.
Prerequisites
- Basic command-line knowledge (loops, variables, file redirection).
- Access to a Unix/Linux shell (local machine or HPC cluster).
- Understanding of raw vs processed data structures (see earlier tutorials).
Step 1: Writing a Simple Bash Script
Example script to compress all FASTQ files in a folder:
- #!/bin/bash
- for file in raw_data/*.fastq; do
- echo "Compressing $file"
- gzip "$file"
- done
Make the script executable:
- chmod +x compress_fastq.sh
Step 2: Using Variables and Arguments
Pass arguments to make scripts flexible:
- #!/bin/bash
- INPUT_DIR=$1
- for file in "$INPUT_DIR"/*.fastq.gz; do
- fastqc "$file" -o qc_results/
- done
Run the script:
- ./qc_fastq.sh raw_data
Step 3: Job Arrays for Multiple Samples (SLURM)
Efficiently submit multiple jobs for different samples:
- #!/bin/bash
- #SBATCH --job-name=fastqc_array
- #SBATCH --array=1-10
- #SBATCH --output=fastqc_%A_%a.log
- FILES=($(ls raw_data/*.fastq.gz))
- fastqc ${FILES[$SLURM_ARRAY_TASK_ID-1]} -o analysis/qc/
Step 4: Adding Logging and Error Handling
Capture stdout/stderr and exit on error:
- #!/bin/bash
- set -euo pipefail
- LOG="script.log"
- echo "Script started at $(date)" > $LOG
- for file in raw_data/*.fastq.gz; do
- echo "Processing $file" | tee -a $LOG
- fastqc "$file" -o qc_results/ 2>> $LOG
- done
- echo "Script finished at $(date)" >> $LOG
Step 5: Scheduling with Cron (Optional)
Automate periodic tasks (e.g., nightly backups) using cron:
- crontab -e
- # Run script every day at midnight
- 0 0 * * * /path/to/backup.sh
Best Practices
- Use variables and arguments to make scripts reusable across projects.
- Enable `set -euo pipefail` for safer script execution.
- Document each step with comments and maintain logs for reproducibility.
- Combine job arrays with pipelines for large-scale processing on HPC systems.
Benchmarking and Tool Comparison
Benchmarking helps evaluate and compare bioinformatics tools in terms of speed, memory usage, and accuracy. This tutorial demonstrates how to design fair benchmarks and document results to guide tool selection for the lab.
Prerequisites
- Basic command-line knowledge (timing commands, parsing logs).
- Access to multiple tools performing similar tasks (e.g., aligners or QC tools).
- Sample dataset representative of typical project workflows.
Step 1: Define Benchmarking Goals
Decide what you want to measure:
- Performance: Runtime and memory usage.
- Accuracy: Correctness of results (e.g., alignment rate, variant concordance).
- Scalability: Ability to handle larger datasets or parallel workloads.
Step 2: Prepare Input Data
Use a small representative dataset or a controlled subset of real data:
- Subset FASTQ files (e.g., first 1M reads) to reduce runtime during testing.
- Ensure all tools use the same reference genome and parameters for fair comparison.
Step 3: Measure Runtime and Memory
Use the `/usr/bin/time` command for resource tracking:
- /usr/bin/time -v bwa mem ref.fa reads.fq > output.sam
Key metrics to note:
- Elapsed (wall clock) time
- Maximum resident set size (memory usage)
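Both metrics can be pulled out of the `/usr/bin/time -v` output and collected into a CSV for later comparison; a sketch (tool, files, and column names are placeholders):
- echo "tool,elapsed,max_rss_kb" > benchmark.csv
- /usr/bin/time -v bwa mem ref.fa reads.fq > bwa.sam 2> bwa_time.log
- elapsed=$(grep "Elapsed (wall clock)" bwa_time.log | awk '{print $NF}')
- maxrss=$(grep "Maximum resident set size" bwa_time.log | awk '{print $NF}')
- echo "bwa,${elapsed},${maxrss}" >> benchmark.csv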
Step 4: Evaluate Accuracy
Compare tool outputs against known truth sets or between tools:
- Alignment rate (from `samtools flagstat`).
- Variant concordance with reference calls (using `bcftools isec` or precision/recall metrics).
- QC metrics (FastQC scores, coverage depth).
Step 5: Log and Visualize Results
Maintain structured records for benchmarking results:
- Record runtime, memory, accuracy metrics in CSV or TSV format.
- Visualize results with bar plots or tables for lab presentations.
- Store benchmarking scripts and results in version control for reproducibility.
Step 6: Compare Across Tools
Present results with clear comparisons:
- Table comparing runtime and memory (e.g., BWA vs Bowtie2 vs STAR).
- Charts showing accuracy metrics (e.g., alignment rate, variant calling precision).
Best Practices
- Benchmark tools on the same hardware and dataset for fair comparison.
- Document all versions, parameters, and environments used during testing.
- Update benchmarks when tools or datasets are upgraded.
- Use benchmarks to guide lab-wide tool adoption and pipeline updates.
Reproducibility and FAIR Principles
Reproducibility ensures that bioinformatics analyses can be replicated by others, while FAIR principles (Findable, Accessible, Interoperable, Reusable) promote data sharing and long-term usability. This tutorial outlines practical steps to make analyses both reproducible and FAIR-compliant.
Prerequisites
- Basic understanding of environment management (Conda/Mamba, containers).
- Familiarity with version control (Git) and structured project folders.
- Knowledge of metadata and documentation standards (see earlier tutorials).
Step 1: Capture Software Environments
Record exact software versions to ensure reproducibility:
- # Export Conda/Mamba environment
- conda env export > env_snapshot.yml
- # Recreate environment
- conda env create -f env_snapshot.yml
Alternatively, use container images (Docker/Singularity):
- singularity pull docker://biocontainers/fastqc:v0.11.9_cv8
- singularity exec fastqc_v0.11.9_cv8.sif fastqc --version
Step 2: Version Control for Code and Data
Track all analysis scripts and configuration files:
- Use Git for code and metadata (exclude raw data with `.gitignore`).
- Tag releases for key project milestones (e.g., `v1.0_analysis`).
- Store configuration and parameter files with pipelines for exact replication.
Step 3: Annotate Metadata for FAIR
Ensure metadata follows FAIR guidelines:
- Findable: Use unique identifiers (e.g., DOIs, accession numbers).
- Accessible: Provide clear access instructions (e.g., repository links).
- Interoperable: Use standard formats (CSV/TSV, JSON, YAML).
- Reusable: Include detailed descriptions (sample prep, QC thresholds).
Step 4: Document Analysis Steps
Maintain detailed records of workflows:
- README files describing data flow and processing steps.
- Workflow diagrams or schematic pipeline overviews.
- Logs and QC reports archived with results.
Step 5: Preparing Data for Sharing
Ensure datasets are ready for publication or repository upload:
- De-identify sensitive data (if human samples).
- Include checksums for file verification.
- Provide mapping of filenames to metadata for clarity.
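A simple manifest tying file names to checksums and sizes can support both points; a sketch using GNU coreutils (paths are illustrative):
- # TSV manifest: md5 checksum, size in bytes, file name
- for f in raw_data/*.fastq.gz; do
- printf "%s\t%s\t%s\n" "$(md5sum "$f" | cut -d' ' -f1)" "$(stat -c%s "$f")" "$(basename "$f")"
- done > metadata/file_manifest.tsv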
Best Practices
- Capture exact environment states (YAML or container images) at every major analysis step.
- Follow FAIR principles when storing and sharing data to maximize reuse and citation.
- Integrate reproducibility checks into pipeline development and lab reviews.
- Archive final results with complete metadata and QC summaries for future reference.
Secure Handling of Sensitive Data
Handling sensitive genomic data requires strict security measures to comply with regulations such as HIPAA and GDPR. This tutorial provides guidelines for encryption, controlled access, and de-identification to ensure data privacy and integrity.
Prerequisites
- Access to secure HPC or institutional storage systems.
- Basic knowledge of file permissions and encryption tools.
- Understanding of sensitive data policies (HIPAA/GDPR requirements).
Step 1: Identify Sensitive Data
Examples of sensitive data:
- Human genomic sequences (FASTQ, BAM, VCF) linked to identifiable individuals.
- Clinical metadata containing personal health information (PHI).
- Internal research data subject to embargo or collaboration agreements.
Step 2: Control Access with Permissions
Limit access to authorized users using Unix group permissions:
- # Set group ownership
- chgrp -R labgroup sensitive_project/
- # Restrict access to group members only
- chmod -R 770 sensitive_project/
Step 3: Encrypt Data at Rest and in Transit
Use encryption for both stored files and file transfers:
- # Encrypt a file with GPG
- gpg -c sensitive_data.fastq.gz
- # Decrypt the file
- gpg sensitive_data.fastq.gz.gpg
- # Secure copy with SSH (encrypted in transit)
- scp sensitive_data.fastq.gz user@secure-server:/path/
Step 4: De-Identification of Data
Remove or replace personally identifiable information (PII) in metadata:
- Replace patient names with anonymized IDs (e.g., `PAT001`).
- Strip dates of birth, addresses, and medical record numbers.
- Store ID-to-patient mappings in a separate encrypted file accessible only to authorized personnel.
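As an illustration only (the column layout and file names are hypothetical), an `awk` sketch that replaces names in the first column of a clinical metadata CSV with anonymized IDs taken from a separate mapping file:
- # id_map.csv format: PatientName,AnonID (store encrypted, restricted access)
- awk -F',' 'NR==FNR {map[$1]=$2; next} FNR==1 {print; next} {if ($1 in map) $1=map[$1]; print}' OFS=',' id_map.csv clinical_metadata.csv > deidentified_metadata.csv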
Step 5: Use Secure Storage and Transfer Services
Follow institutional or HPC policies for secure storage:
- Use encrypted network storage or approved cloud services (e.g., S3 with server-side encryption).
- Never use personal cloud accounts (Dropbox, Google Drive) for sensitive data unless explicitly approved.
Step 6: Monitor and Audit Access
Implement logging to track file access:
- Enable auditing tools (e.g., `aureport` on Linux) for sensitive directories.
- Review access logs regularly for unusual activity.
Best Practices
- Classify data sensitivity at project start and document handling policies.
- Use encryption by default for storage and transfer of human genomic data.
- Apply least-privilege principles: grant access only to required personnel.
- Periodically review permissions and revoke access for inactive users.
Building Publication-Quality Figures
High-quality figures are critical for communicating bioinformatics results in publications and presentations. This tutorial demonstrates best practices for creating clear, consistent, and reproducible plots using tools like R (ggplot2), Python (matplotlib), and MultiQC outputs.
Prerequisites
- Basic knowledge of data visualization in R or Python.
- Familiarity with bioinformatics outputs (UMAP, heatmaps, QC metrics).
- Installed plotting libraries (ggplot2, matplotlib, seaborn).
Step 1: Choose Appropriate Plot Types
Match figure type to data and message:
- UMAP/t-SNE: Visualize cell clustering in single-cell data.
- Heatmaps: Show gene expression or peak accessibility patterns.
- Bar/Box Plots: Summarize QC metrics or group comparisons.
- MultiQC: Combine multiple QC reports into one summary.
Step 2: Ensure Consistent Color Palettes
Use meaningful and colorblind-friendly palettes:
- Consistent cluster colors across all figures (e.g., same color for “Cluster 1”).
- Use palettes like `viridis` (Python/R) or ColorBrewer schemes.
- Avoid using red/green contrasts for accessibility.
Step 3: High-Resolution and Export Settings
Export figures at publication quality (300+ dpi):
- # In R (ggplot2)
- ggsave("figure1.png", dpi=300, width=6, height=4)
- # In Python (matplotlib)
- plt.savefig("figure1.png", dpi=300, bbox_inches="tight")
Step 4: Combining Multiple Plots
Create multi-panel figures for complex results:
- In R: use `patchwork` or `cowplot` for combining ggplots.
- In Python: use `matplotlib.pyplot.subplots()` or `gridspec`.
- Maintain consistent font sizes and axis labels across panels.
Step 5: Annotating Figures
Add informative labels and legends:
- Clearly label axes, groups, and statistical tests.
- Use panel labels (A, B, C) for multi-panel figures.
- Include descriptive figure captions when sharing internally or publishing.
Step 6: Reproducibility of Figures
Save figure scripts and data used for plotting:
- Commit plotting scripts to version control with the rest of the analysis.
- Document data preprocessing steps used for figure generation.
- Ensure figures can be regenerated by others in the lab.
Best Practices
- Use consistent styles (fonts, colors, axis labels) across all figures in a project.
- Generate vector graphics (SVG, PDF) when possible for scalable quality.
- Preview figures at final publication size to ensure readability of labels.
- Integrate figure generation into pipelines (RMarkdown, Jupyter) for full reproducibility.
Reproducible Reporting (RMarkdown/Quarto)
Reproducible reports combine code, results, and narrative text into a single document. Using RMarkdown or Quarto, bioinformaticians can generate dynamic reports (HTML/PDF) for collaborators and publications, ensuring transparency and easy updates.
Prerequisites
- R and RStudio installed (for RMarkdown) or Quarto installed (quarto.org).
- Basic knowledge of R or Python for data analysis and visualization.
- Familiarity with Markdown syntax for formatting text.
Step 1: RMarkdown Basics
RMarkdown files have three components:
- YAML header: Title, author, output format.
- Markdown text: Descriptive content, figures, and tables.
- Code chunks: Embedded R or Python code for analysis.
Example YAML header:
- ---
- title: "QC Report"
- author: "Bioinformatics Lab"
- output: html_document
- ---
Step 2: Quarto Basics
Quarto supports R, Python, and multi-language documents with extended publishing features:
- Use `quarto render report.qmd` to generate reports.
- Supports advanced layouts (dashboards, interactive plots).
- Can render directly to PDF, Word, or HTML from the same source file.
Step 3: Embedding Code and Results
Code chunks run during report rendering and insert results dynamically:
- ```{r}
- summary(read_counts)
- ```
For Python (Quarto or RMarkdown with `reticulate`):
- ```{python}
- import pandas as pd
- pd.read_csv("qc_metrics.csv").head()
- ```
Step 4: Adding Figures and Tables
Use plotting libraries (ggplot2, matplotlib) inside code chunks:
- ```{r fig.width=6, fig.height=4}
- library(ggplot2)
- ggplot(df, aes(x=Sample, y=Reads)) + geom_bar(stat="identity")
- ```
Step 5: Rendering Reports
Generate outputs in different formats:
- # RMarkdown (RStudio or command line)
- rmarkdown::render("report.Rmd")
- # Quarto
- quarto render report.qmd --to html
Step 6: Lab Standard Templates
Maintain reusable templates for consistency:
- Predefined YAML headers with lab branding (logo, author, date).
- Sections for QC summary, figures, and interpretation.
- Automatic inclusion of MultiQC or pipeline outputs.
Best Practices
- Integrate reports into pipelines to auto-generate after analyses.
- Version control reports and data used to produce them.
- Ensure all figures and tables are labeled and captioned for clarity.
- Use Quarto for multi-language or advanced layouts; RMarkdown for R-focused workflows.
Code Review and Collaboration Workflow
Code review ensures quality, reproducibility, and maintainability of bioinformatics scripts and pipelines. This tutorial outlines lab practices for collaborative development using Git branching, pull requests, and standardized coding styles.
Prerequisites
- Basic Git knowledge (init, commit, branch, push).
- Access to GitHub/GitLab repository for the lab.
- Familiarity with lab coding style and documentation standards.
Step 1: Branching Strategy
Use feature branches for development and keep `main` stable:
- git checkout -b feature-add-qc
- # Develop feature on branch
- git add qc_script.sh
- git commit -m "Add QC script"
- git push origin feature-add-qc
Step 2: Submitting Pull Requests (PRs)
PRs allow review before merging into main:
- Open PR on GitHub/GitLab describing changes and purpose.
- Link to related issues or tickets (if applicable).
- Assign reviewers (lab members responsible for this module).
Step 3: Review Checklist
Reviewers should check for:
- Correctness (does it solve the intended problem?).
- Readability (clear variable names, comments).
- Consistency (follows lab style guide and directory structure).
- Reproducibility (dependencies and parameters documented).
Step 4: Coding Style Standards
Adopt a consistent style across the lab:
- Use lowercase filenames with underscores (`sample_qc.sh`).
- Document scripts with header comments (author, purpose, date).
- Limit line length (e.g., 80-100 characters) for readability.
- Use consistent indentation (2 or 4 spaces, no tabs).
Step 5: Handling Comments and Changes
Incorporate reviewer feedback efficiently:
- Address comments in new commits (avoid force-pushing unless necessary).
- Mark resolved comments in the PR discussion.
- Communicate rationale for design choices in replies if not changing code.
Step 6: Merging and Cleanup
After approval:
- Squash commits (if preferred) to keep history clean.
- Merge PR into `main` using "merge commit" or "rebase and merge."
- Delete feature branch after merge to reduce clutter.
Best Practices
- Keep PRs small and focused for easier review.
- Review regularly to avoid blocking team workflows.
- Use templates for PR descriptions and review checklists.
- Document style guide and review process in the lab wiki for new members.
Lab Data Lifecycle and Archival
Managing the entire lifecycle of bioinformatics data—from raw acquisition to long-term archival—is essential for efficient storage usage, cost control, and reproducibility. This tutorial outlines best practices for organizing, cleaning, and archiving data in a lab setting.
Prerequisites
- Understanding of lab storage systems (scratch, project, and archival storage).
- Familiarity with data formats (FASTQ, BAM, VCF, metadata files).
- Knowledge of institutional policies for data retention and sharing.
Step 1: Define Data Stages
Classify data into stages for better handling:
- Raw data: Unmodified output from sequencing instruments.
- Processed data: Outputs from pipelines (QC, alignment, counts).
- Final results: Ready for publication or sharing (figures, tables).
Step 2: Storage Policies
Understand storage tiers and assign data accordingly:
- Scratch: Temporary high-speed storage for active computation; purge after job completion.
- Project storage: Medium-term storage for ongoing analyses.
- Archive: Long-term storage for completed projects or regulatory retention.
Step 3: Data Retention Timelines
Set clear retention periods for each data stage:
- Raw data: Retain at least until publication or per funder’s requirement.
- Processed data: Retain for re-analysis or method validation (e.g., 3–5 years).
- Final results: Retain indefinitely in lab archives or repositories.
Step 4: Preparing Data for Archival
Before archiving:
- Remove redundant intermediate files to save space.
- Compress large files (e.g., BAM to CRAM) when possible.
- Generate checksums (md5/sha256) for integrity verification.
- Include README with metadata, project summary, and software versions.
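A sketch of the compression and checksum steps with samtools (CRAM conversion requires the same reference FASTA used for alignment; file names are illustrative):
- # Convert BAM to CRAM against the alignment reference
- samtools view -C -T reference.fasta -o sample.cram sample.bam
- # Verify the CRAM is readable, then checksum it for the archive manifest
- samtools quickcheck sample.cram
- sha256sum sample.cram > sample.cram.sha256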
Step 5: Archival Systems and Tools
Options for archival:
- Institutional cold storage or tape systems (low-cost, slower access).
- Cloud archival (e.g., AWS Glacier, Google Coldline) with encryption.
- External repositories (NCBI SRA, GEO) for public datasets.
Step 6: Deletion and Cleanup
Implement regular cleanup cycles:
- Delete scratch data after jobs complete (avoid accidental buildup).
- Remove temporary or duplicate files before moving to archive.
- Log deletions in a project lifecycle document for accountability.
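A cautious sketch for finding stale scratch files before deleting them (the path, age threshold, and log location are placeholders; always review the list before removing anything):
- # List files in scratch untouched for more than 30 days
- find /scratch/$USER -type f -mtime +30 > stale_files.txt
- # After reviewing stale_files.txt, remove the files and log the action
- xargs -r -d '\n' -a stale_files.txt rm -v >> logs/cleanup_$(date +%F).log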
Best Practices
- Separate raw, processed, and final data clearly with dedicated folders.
- Document archival locations and retention timelines for each project.
- Perform periodic audits to ensure compliance with institutional policies.
- Store both data and metadata (QC reports, analysis logs) together for future reuse.