Variant calling sits at the heart of nearly every human genetics project — from rare disease diagnostics to cancer genomics to population studies. The tools have matured considerably, but the pipeline is still full of decisions that affect sensitivity, specificity, and reproducibility. This post walks through a complete variant calling workflow with code at every step.
Pipeline Architecture
A short-variant calling pipeline moves through five major phases:
Raw FASTQ reads
│
▼
┌─────────────────┐
│ Pre-alignment │ FastQC, Trimming (optional)
│ QC │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Alignment │ BWA-MEM2 → samtools sort/index
│ │ Picard MarkDuplicates
│ │ BQSR (BaseRecalibrator + ApplyBQSR)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Variant │ HaplotypeCaller → GenomicsDBImport
│ Calling │ → GenotypeGVCFs (GATK4 GVCF mode)
│ │ — or — DeepVariant (DL-based)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Variant │ VQSR / Hard Filtering
│ Filtering │ BCFtools view/filter
└────────┬────────┘
│
▼
┌─────────────────┐
│ Annotation │ VEP / ANNOVAR / SnpEff
│ & Reporting │ + custom prioritisation scripts
└─────────────────┘
Step 1: Alignment
BWA-MEM2 is the current standard for short-read alignment to large genomes — it’s significantly faster than BWA-MEM with identical results:
# Index the reference (once)
bwa-mem2 index \
-p /reference/hg38/hg38.bwa-mem2 \
/reference/hg38/Homo_sapiens_assembly38.fasta
# Align paired-end reads
bwa-mem2 mem \
-t 16 \
-R "@RG\tID:sample001\tSM:sample001\tPL:ILLUMINA\tLB:lib1\tPU:unit1" \
/reference/hg38/hg38.bwa-mem2 \
sample_R1.fastq.gz \
sample_R2.fastq.gz \
| samtools sort -@ 8 -m 4G -o sample.sorted.bam
samtools index sample.sorted.bam
The @RG read-group line is required by GATK: it encodes sample (SM), library (LB), and platform (PL) information that downstream tools use for error modelling and multi-sample handling.
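When scripting the pipeline, it helps to build the -R string programmatically rather than hand-editing it per sample. A minimal sketch (sample and library values are illustrative):

```python
# Build an @RG line the way bwa-mem2 -R expects it: literal "\t" separators,
# so the backslashes are escaped rather than interpreted by Python.
def make_read_group(sample, library, platform="ILLUMINA", unit="unit1", rg_id=None):
    rg_id = rg_id or sample
    fields = [f"ID:{rg_id}", f"SM:{sample}", f"PL:{platform}",
              f"LB:{library}", f"PU:{unit}"]
    return "@RG\\t" + "\\t".join(fields)

rg = make_read_group("sample001", "lib1")
print(rg)  # @RG\tID:sample001\tSM:sample001\tPL:ILLUMINA\tLB:lib1\tPU:unit1
```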
Step 2: Mark Duplicates
PCR duplicates are reads that arise from the same original DNA fragment. If not removed, they inflate coverage and artificially increase variant confidence:
# Mark duplicates with Picard MarkDuplicates (via the GATK4 wrapper);
# duplicates stay in the BAM, flagged with 0x400
gatk MarkDuplicates \
--INPUT sample.sorted.bam \
--OUTPUT sample.markdup.bam \
--METRICS_FILE sample.markdup.metrics \
--OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 \
--CREATE_INDEX true
Check the metrics file — a duplicate rate above 40–50% for WGS or above 60–70% for WES suggests a library quality issue.
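That check is easy to automate. A minimal sketch of a metrics parser, assuming the standard PERCENT_DUPLICATION column from Picard's DuplicationMetrics output (the embedded text is a trimmed, made-up example):

```python
# Pull the duplicate rate out of a MarkDuplicates metrics file:
# comment lines start with "#", then a header row, then one row per library.
def duplicate_rate(metrics_text):
    rows = [l for l in metrics_text.splitlines()
            if l.strip() and not l.startswith("#")]
    header = rows[0].split("\t")
    record = dict(zip(header, rows[1].split("\t")))
    return float(record["PERCENT_DUPLICATION"])

example = (
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics\n"
    "LIBRARY\tREAD_PAIRS_EXAMINED\tPERCENT_DUPLICATION\n"
    "lib1\t1000000\t0.13\n"
)
rate = duplicate_rate(example)
print(f"duplicate rate: {rate:.0%}")  # duplicate rate: 13%
if rate > 0.40:
    print("WARNING: possible library quality issue")
```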
Step 3: Base Quality Score Recalibration (BQSR)
BQSR corrects systematic errors in the base quality scores assigned by the sequencer. It masks out known variant sites (dbSNP, Mills indels), builds an error model from the remaining mismatches, which are assumed to be sequencing errors, and re-scores every base:
# Phase 1: Build the recalibration model
gatk BaseRecalibrator \
-I sample.markdup.bam \
-R /reference/hg38/Homo_sapiens_assembly38.fasta \
--known-sites /reference/resources/dbsnp_146.hg38.vcf.gz \
--known-sites /reference/resources/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz \
--known-sites /reference/resources/1000G_phase1.snps.high_confidence.hg38.vcf.gz \
-O sample.recal.table
# Phase 2: Apply the recalibration
gatk ApplyBQSR \
-I sample.markdup.bam \
-R /reference/hg38/Homo_sapiens_assembly38.fasta \
--bqsr-recal-file sample.recal.table \
-O sample.bqsr.bam
BQSR is worth doing for clinical-grade variant calling. For population genomics with many samples, it’s sometimes skipped in favour of GATK’s variant recalibration (VQSR) — the marginal gain is smaller when you have joint calling across thousands of samples.
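For intuition about what is being recalibrated: base qualities are Phred-scaled error probabilities, Q = -10 * log10(P). A quick sketch of the conversion:

```python
import math

# BQSR's goal is to make the reported Q match the empirically observed
# error rate within each (read group, cycle, context) bin.
def phred_to_error(q):
    return 10 ** (-q / 10)

def error_to_phred(p):
    return -10 * math.log10(p)

# A machine reporting Q30 claims a 1-in-1000 error rate; if the empirical
# rate in some bin is really 1-in-500, the recalibrated score is ~Q27.
print(phred_to_error(30))            # 0.001
print(round(error_to_phred(1/500)))  # 27
```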
Step 4: Variant Calling with GATK4 HaplotypeCaller
GATK recommends a GVCF-based joint calling workflow — call each sample to GVCF first, then genotype jointly across all samples. This enables efficient addition of new samples later:
# Per-sample GVCF generation
gatk HaplotypeCaller \
-I sample.bqsr.bam \
-R /reference/hg38/Homo_sapiens_assembly38.fasta \
-O sample.g.vcf.gz \
-ERC GVCF \
--dbsnp /reference/resources/dbsnp_146.hg38.vcf.gz \
--tmp-dir /tmp \
-L /reference/targets/exome_v7_hg38.interval_list # for WES
For cohort calling, consolidate GVCFs with GenomicsDBImport, then jointly genotype:
# Consolidate GVCFs per chromosome (parallelise across chromosomes)
gatk GenomicsDBImport \
--sample-name-map sample_map.txt \
--genomicsdb-workspace-path /data/genomicsdb/chr1 \
-L chr1 \
--reader-threads 4
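The sample_map.txt passed to --sample-name-map is plain tab-separated lines of sample name and GVCF path. A small generator, assuming GVCFs are named <sample>.g.vcf.gz (adjust the stem parsing to your naming scheme):

```python
import pathlib
import tempfile

# Write sample_name<TAB>gvcf_path lines, one per GVCF found in a directory.
def write_sample_map(gvcf_dir, out_path):
    rows = []
    for gvcf in sorted(pathlib.Path(gvcf_dir).glob("*.g.vcf.gz")):
        sample = gvcf.name.removesuffix(".g.vcf.gz")
        rows.append(f"{sample}\t{gvcf}")
    pathlib.Path(out_path).write_text("\n".join(rows) + "\n")
    return rows

# Demo against a temporary directory of empty placeholder files.
with tempfile.TemporaryDirectory() as d:
    for name in ("sampleB", "sampleA"):
        (pathlib.Path(d) / f"{name}.g.vcf.gz").touch()
    rows = write_sample_map(d, pathlib.Path(d) / "sample_map.txt")

print([r.split("\t")[0] for r in rows])  # ['sampleA', 'sampleB']
```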
# Joint genotyping
gatk GenotypeGVCFs \
-R /reference/hg38/Homo_sapiens_assembly38.fasta \
-V gendb:///data/genomicsdb/chr1 \
-O cohort.chr1.vcf.gz \
--dbsnp /reference/resources/dbsnp_146.hg38.vcf.gz
Step 4 (Alternative): DeepVariant
DeepVariant treats variant calling as an image classification problem — it converts pileup data into image tensors and classifies each candidate site with a CNN. It consistently matches or outperforms GATK on SNPs and is notably better on indels in repetitive regions:
# Run DeepVariant (Docker or Singularity)
docker run \
--rm \
-v /data:/data \
-v /reference:/reference \
google/deepvariant:1.6.1 \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WGS \
--ref=/reference/hg38/Homo_sapiens_assembly38.fasta \
--reads=/data/sample.bqsr.bam \
--output_vcf=/data/sample.deepvariant.vcf.gz \
--output_gvcf=/data/sample.deepvariant.g.vcf.gz \
--num_shards=16 \
--intermediate_results_dir=/data/tmp
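To make the "pileup as image" idea concrete, here is a toy encoding of the reads covering one candidate site: three hand-picked channels per read, not DeepVariant's actual feature set or window layout:

```python
# Each read contributes one row; each row has a few normalised channels.
# The real encoder builds multi-channel images over a window around the site.
BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def encode_pileup(reads):
    """reads: list of (base, phred_quality, is_reverse) at one site."""
    rows = []
    for base, qual, is_reverse in reads:
        rows.append([
            BASE_CODE.get(base, 0.0),    # channel 1: base identity
            min(qual, 40) / 40,          # channel 2: capped, scaled quality
            1.0 if is_reverse else 0.0,  # channel 3: strand
        ])
    return rows

tensor = encode_pileup([("A", 35, False), ("G", 22, True), ("A", 40, False)])
print(tensor[0])  # [0.25, 0.875, 0.0]
```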
For multi-sample calling with DeepVariant, use GLnexus to merge GVCFs:
glnexus_cli \
--config DeepVariantWGS \
--dir /data/glnexus_db \
/data/*.deepvariant.g.vcf.gz \
| bcftools view - | bgzip -@ 8 > cohort.deepvariant.vcf.gz
Step 5: Variant Filtering
VQSR (Variant Quality Score Recalibration) is the GATK-preferred method for large cohorts (>30 samples, or WGS): it trains a Gaussian mixture model on known variant sites to assign a quality score to each variant.
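The intuition can be shown in miniature with one annotation and one Gaussian; GATK's real model is a multivariate Gaussian mixture over several annotations, and the values below are made up:

```python
import math
import statistics

# Fit a single Gaussian to an annotation (here QD) at trusted training sites,
# then score candidates by log-likelihood under that model.
def fit(training_qd):
    return statistics.mean(training_qd), statistics.stdev(training_qd)

def log_likelihood(qd, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd ** 2) - (qd - mean) ** 2 / (2 * sd ** 2)

mean, sd = fit([18.2, 21.5, 19.8, 22.1, 20.4])  # QD at known good sites
good = log_likelihood(20.0, mean, sd)  # typical QD: plausible under the model
bad = log_likelihood(1.5, mean, sd)    # extreme QD: implausible, filter it
print(good > bad)  # True
```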
# SNP recalibration
gatk VariantRecalibrator \
-V cohort.vcf.gz \
-R /reference/hg38/Homo_sapiens_assembly38.fasta \
--resource:hapmap,known=false,training=true,truth=true,prior=15 \
/reference/resources/hapmap_3.3.hg38.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12 \
/reference/resources/1000G_omni2.5.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2 \
/reference/resources/dbsnp_146.hg38.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP \
-O cohort.snp.recal \
--tranches-file cohort.snp.tranches
gatk ApplyVQSR \
-V cohort.vcf.gz \
--recal-file cohort.snp.recal \
--tranches-file cohort.snp.tranches \
--truth-sensitivity-filter-level 99.0 \
-mode SNP \
-O cohort.snp_recal.vcf.gz
VQSR models SNPs and indels separately: repeat VariantRecalibrator and ApplyVQSR with -mode INDEL (the Mills indel resource as truth) to recalibrate indels.
For small cohorts or single samples, use hard filtering thresholds recommended by GATK:
# Hard filter SNPs
gatk VariantFiltration \
-V sample.raw_snps.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "SOR > 3.0" --filter-name "SOR3" \
-O sample.filtered_snps.vcf.gz
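The same thresholds written out as plain Python make the semantics explicit: a variant failing any expression is tagged with that filter's name, and a variant failing none is PASS:

```python
# Mirror of the VariantFiltration expressions above (annotation names are
# the standard GATK INFO fields).
FILTERS = [
    ("QD2",  lambda v: v["QD"] < 2.0),
    ("FS60", lambda v: v["FS"] > 60.0),
    ("MQ40", lambda v: v["MQ"] < 40.0),
    ("SOR3", lambda v: v["SOR"] > 3.0),
]

def apply_filters(annotations):
    failed = [name for name, fails in FILTERS if fails(annotations)]
    return ";".join(failed) if failed else "PASS"

print(apply_filters({"QD": 25.1, "FS": 1.2, "MQ": 60.0, "SOR": 0.9}))  # PASS
print(apply_filters({"QD": 1.4, "FS": 88.0, "MQ": 60.0, "SOR": 0.9}))  # QD2;FS60
```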
Step 6: Annotation with VEP
Ensembl VEP annotates variants with predicted functional consequences, allele frequencies, and clinical significance:
# Run VEP (Docker)
docker run \
--rm \
-v /data:/data \
-v /vep_cache:/vep_cache \
ensemblorg/ensembl-vep:release_111 \
vep \
--input_file /data/cohort.filtered.vcf.gz \
--output_file /data/cohort.vep.vcf.gz \
--format vcf \
--vcf \
--compress_output bgzip \
--cache \
--dir_cache /vep_cache \
--assembly GRCh38 \
--species homo_sapiens \
--af_gnomade \
--af_gnomadg \
--sift b \
--polyphen b \
--check_existing \
--fork 8
Extracting and Prioritising Variants
After annotation, extract rare, high-impact variants with bcftools. VEP writes its annotations into the CSQ INFO field, so expose the subfields with the +split-vep plugin before filtering:
# Extract rare (gnomAD AF < 0.01), high-impact variants from the CSQ annotation
bcftools +split-vep \
-c Consequence,gnomADe_AF:Float \
-s worst \
-i 'gnomADe_AF < 0.01 && Consequence ~ "stop_gained|frameshift|splice"' \
cohort.vep.vcf.gz \
| bcftools sort -Oz -o rare_high_impact.vcf.gz
# Summary statistics
bcftools stats cohort.vep.vcf.gz > cohort.stats.txt
plot-vcfstats -p /data/plots/ cohort.stats.txt
For clinical prioritisation, parse VEP annotations with Python:
import cyvcf2
import pandas as pd

vcf = cyvcf2.VCF("cohort.vep.vcf.gz")

# Derive CSQ subfield positions from the VCF header instead of hard-coding
# indices; the field order depends on the VEP version and flags used
csq_format = vcf.get_header_type("CSQ")["Description"].split("Format: ")[1]
idx = {name: i for i, name in enumerate(csq_format.rstrip('"').split("|"))}

records = []
for variant in vcf:
    csq = variant.INFO.get("CSQ")
    if csq is None:
        continue
    for transcript in csq.split(","):
        fields = transcript.split("|")
        impact = fields[idx["IMPACT"]]
        af = float(fields[idx["gnomADe_AF"]] or 0)
        if impact in ("HIGH", "MODERATE") and af < 0.01:
            records.append({
                "chrom": variant.CHROM, "pos": variant.POS,
                "ref": variant.REF, "alt": str(variant.ALT[0]),
                "gene": fields[idx["SYMBOL"]],
                "consequence": fields[idx["Consequence"]],
                "impact": impact,
                "sift": fields[idx["SIFT"]],
                "polyphen": fields[idx["PolyPhen"]],
                "gnomAD_AF": af,
            })

df = pd.DataFrame(records).sort_values(["impact", "gnomAD_AF"])
df.to_csv("prioritised_variants.tsv", sep="\t", index=False)
Key Considerations
Reference genome consistency. Every file in your pipeline must use the same reference genome build and contig naming convention. Mixing chr1 and 1 chromosome names between FASTQ, reference, and VCF files is a common source of silent errors.
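A small normaliser is handy when sanity-checking inputs; UCSC-style names carry a chr prefix and call the mitochondrial contig chrM, while Ensembl/GRC-style names drop the prefix and use MT:

```python
# Convert between the two common contig naming conventions.
def to_ucsc(contig):
    if contig in ("MT", "chrMT"):
        return "chrM"
    return contig if contig.startswith("chr") else f"chr{contig}"

def to_ensembl(contig):
    if contig in ("chrM", "M"):
        return "MT"
    return contig[3:] if contig.startswith("chr") else contig

print(to_ucsc("1"), to_ensembl("chr1"), to_ucsc("MT"))  # chr1 1 chrM
```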
BQSR and known sites. The quality of BQSR depends on the known-sites VCFs. Use the GATK resource bundle — don’t use in-house variant calls as known sites (circular reasoning).
Parallelisation strategy. Split by chromosome (or interval list) for HaplotypeCaller and VQSR — this is the standard approach in scalable Nextflow/Snakemake implementations.
Validation. For any new pipeline, validate sensitivity and specificity against a reference sample (NA12878/HG001 from NIST Genome in a Bottle) using hap.py:
hap.py \
/reference/giab/HG001_GRCh38_benchmark.vcf.gz \
sample.filtered.vcf.gz \
-r /reference/hg38/Homo_sapiens_assembly38.fasta \
-f /reference/giab/HG001_GRCh38_benchmark.bed \
-o validation/sample_vs_giab \
--engine=vcfeval \
--threads 16
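hap.py's headline numbers are per-type precision, recall, and F1; they reduce to a few lines of arithmetic (the counts below are made up, roughly WGS-SNP-sized):

```python
# Precision: fraction of called variants that are true.
# Recall: fraction of truth variants that were called.
def benchmark_summary(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = benchmark_summary(tp=3_950_000, fp=12_000, fn=35_000)
print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```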
A well-validated, well-documented variant calling pipeline is essential infrastructure for any genomics group. The tools above — GATK4, DeepVariant, VEP — represent current best practice, but the field evolves quickly; revisit your pipeline yearly.