Sangram Keshari Sahu

Variant Calling Pipelines: From Raw Reads to Annotated VCF

Wed, 15 Nov 2023 00:00:00 +0000

Variant calling sits at the heart of nearly every human genetics project — from rare disease diagnostics to cancer genomics to population studies. The tools have matured considerably, but the pipeline is still full of decisions that affect sensitivity, specificity, and reproducibility. This post walks through a complete variant calling workflow with code at every step.

Pipeline Architecture

A short-variant calling pipeline moves through five major phases:

Raw FASTQ reads
         │
         ▼
┌─────────────────┐
│ Pre-alignment   │  FastQC, Trimming (optional)
│ QC              │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Alignment       │  BWA-MEM2 → samtools sort/index
│                 │  Picard MarkDuplicates
│                 │  BQSR (BaseRecalibrator + ApplyBQSR)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Variant         │  HaplotypeCaller → GenomicsDBImport
│ Calling         │  → GenotypeGVCFs (GATK4 GVCF mode)
│                 │  or: DeepVariant (DL-based)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Variant         │  VQSR / Hard Filtering
│ Filtering       │  BCFtools view/filter
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Annotation      │  VEP / ANNOVAR / SnpEff
│ & Reporting     │  + custom prioritisation scripts
└─────────────────┘

Step 1: Alignment

BWA-MEM2 is the current standard for short-read alignment to large genomes. It runs significantly faster than the original BWA-MEM while producing identical alignments.
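As a rough sketch of the alignment step in the diagram (BWA-MEM2 piped into samtools sort, then samtools index), the helper below only builds the shell commands as strings so they can be inspected before anything runs. The function name, the output file naming, and the read-group values (including `PL:ILLUMINA`) are assumptions made for illustration, not settings taken from this post.

```python
import shlex


def build_alignment_commands(ref, fq1, fq2, sample, threads=8):
    """Return shell command strings to align, sort and index one paired-end sample.

    Hypothetical helper: file naming and read-group values are illustrative.
    """
    # bwa-mem2 expects the read group as one string with literal "\t" separators.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    bam = f"{sample}.sorted.bam"

    align = ["bwa-mem2", "mem", "-t", str(threads), "-R", read_group, ref, fq1, fq2]
    # "-" tells samtools sort to read the alignments from standard input.
    sort = ["samtools", "sort", "-@", str(threads), "-o", bam, "-"]
    index = ["samtools", "index", bam]

    # Quote each argument safely, then pipe the aligner straight into the sorter.
    align_and_sort = " | ".join(shlex.join(cmd) for cmd in (align, sort))
    return [align_and_sort, shlex.join(index)]


if __name__ == "__main__":
    for command in build_alignment_commands("ref.fa", "s1_R1.fq.gz", "s1_R2.fq.gz", "s1"):
        print(command)
```

Each returned string could be handed to `subprocess.run(command, shell=True, check=True)` once the reference has been indexed with `bwa-mem2 index`.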

Single-Cell Multi-omics Integration: Linking RNA, ATAC, and Protein Data

Fri, 22 Sep 2023 00:00:00 +0000

Measuring gene expression alone gives you a snapshot of cell state. Measuring chromatin accessibility alongside expression reveals the regulatory grammar underlying that state. Add surface proteins and you can link transcriptional identity to functional phenotype. Multi-omics integration is complex but increasingly tractable — this post shows you how to do it.

What Multi-omics Adds

Each single-cell modality captures a different layer of biology:

Modality      │ What it measures        │ Key insight
──────────────┼─────────────────────────┼─────────────────────────────────
scRNA-seq     │ Gene expression (mRNA)  │ Cell identity & state
scATAC-seq    │ Chromatin accessibility │ Regulatory landscape
CITE-seq      │ Surface proteins (ADT)  │ Phenotypic marker quantification
scMethyl-seq  │ DNA methylation         │ Epigenetic silencing
Spatial omics │ Expression + location   │ Tissue architecture

Combining modalities lets you ask questions no single layer can answer: Which open chromatin regions drive the transcription programmes that define this cell type? Which transcription factors are active based on both motif accessibility and their own expression?

Deep Learning for Genomics: Predicting Gene Expression with PyTorch

Tue, 27 Jun 2023 00:00:00 +0000

Deep learning has moved from hype to an essential tool in genomics. Models now routinely outperform hand-engineered features for tasks like predicting gene expression, transcription factor binding, and chromatin accessibility from raw sequence. This post walks through building one such model end-to-end using PyTorch.

The Problem Setup

We’ll predict gene expression levels (log-normalized counts) from the 2 kb promoter sequence upstream of each transcription start site (TSS). This is a well-studied proxy task: promoter sequence alone explains a substantial fraction of expression variance across tissues and conditions.
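Before a model can see a promoter, the sequence has to become numbers. A common choice is one-hot encoding, sketched here in plain Python. The function name, the A/C/G/T row order, and the padding rule are assumptions for illustration, and the sketch assumes each sequence is written 5'→3' and ends at the TSS.

```python
BASES = "ACGT"
PROMOTER_LEN = 2000  # 2 kb upstream of the TSS


def one_hot_encode(seq, length=PROMOTER_LEN):
    """Encode a DNA string as a 4 x length matrix with rows ordered A, C, G, T."""
    seq = seq.upper()[-length:]    # keep the bases closest to the TSS
    seq = seq.rjust(length, "N")   # left-pad short sequences with N
    matrix = [[0.0] * length for _ in BASES]
    for pos, base in enumerate(seq):
        row = BASES.find(base)
        if row >= 0:               # N and other ambiguity codes stay all-zero
            matrix[row][pos] = 1.0
    return matrix


if __name__ == "__main__":
    encoded = one_hot_encode("acgtn", length=6)
    for base, row in zip(BASES, encoded):
        print(base, row)
```

The nested list can be converted with `torch.tensor(encoded)` to obtain a 4 × 2000 input suitable for a 1D convolution.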

FAIR Data Principles in Practice: A Bioinformatics Engineer's Guide

Mon, 03 Apr 2023 00:00:00 +0000

Everyone in life sciences has heard about FAIR data: Findable, Accessible, Interoperable, Reusable. It has become one of those principles that appear in grant applications and data management plans without always translating into practice. This post is about the practical side: what FAIR actually means for a bioinformatics project and how to implement it without adding unnecessary overhead.

The Four Principles, Grounded

The FAIR Guiding Principles were published in 2016. Let’s translate each principle into concrete bioinformatics terms:
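One lightweight way to make the principles checkable is to map each one to a few metadata fields and report what is missing. The sketch below does that. The field names and their grouping under each principle are hypothetical choices for illustration, not part of the FAIR specification or of any metadata standard.

```python
# Hypothetical mapping from each FAIR principle to metadata fields that support it.
FAIR_FIELDS = {
    "Findable": ["identifier", "title", "keywords"],
    "Accessible": ["access_url", "access_protocol"],
    "Interoperable": ["file_format", "vocabulary"],
    "Reusable": ["license", "provenance"],
}


def missing_fair_fields(metadata):
    """Return {principle: [fields]} for fields that are absent or empty."""
    report = {}
    for principle, fields in FAIR_FIELDS.items():
        missing = [field for field in fields if not metadata.get(field)]
        if missing:
            report[principle] = missing
    return report


if __name__ == "__main__":
    dataset = {
        "identifier": "doi:10.0000/example",  # placeholder, not a real DOI
        "title": "Example scRNA-seq dataset",
        "keywords": ["scRNA-seq"],
        "access_url": "https://example.org/data",
        "access_protocol": "https",
        "file_format": "h5ad",
    }
    print(missing_fair_fields(dataset))
```

A check like this can run in continuous integration so that a dataset record without a license or identifier is flagged before release.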

Single-Cell RNA-seq: A Practical Overview

Tue, 14 Feb 2023 00:00:00 +0000

Single-cell RNA sequencing (scRNA-seq) has transformed our ability to study cellular heterogeneity. Instead of averaging gene expression across thousands of cells, we can profile each cell individually. That shift in resolution changes what questions we can ask.

The Core Workflow

A typical scRNA-seq analysis moves through these stages:

Alignment & quantification — map reads to a reference transcriptome (Cell Ranger, STARsolo, Salmon/Alevin)
Quality control — filter low-quality cells based on library size, gene count, and mitochondrial fraction
Normalization — correct for sequencing depth differences between cells
Dimensionality reduction — PCA, then UMAP or t-SNE for visualization
Clustering — identify groups of similar cells
Annotation — assign cell types to clusters using marker genes
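The quality-control and normalization stages above can be sketched for a single cell in plain Python. The thresholds, the target depth of 10,000 counts, and the function names are illustrative assumptions. Real analyses use dedicated toolkits and choose cutoffs per dataset.

```python
import math


def passes_qc(counts, mito_genes, min_counts=500, min_genes=200, max_mito_frac=0.2):
    """Decide whether one cell passes QC; counts maps gene name -> raw count.

    Default thresholds are placeholders, not recommendations.
    """
    total = sum(counts.values())                        # library size
    if total == 0:
        return False
    n_genes = sum(1 for c in counts.values() if c > 0)  # genes detected
    mito = sum(c for gene, c in counts.items() if gene in mito_genes)
    return (
        total >= min_counts
        and n_genes >= min_genes
        and mito / total <= max_mito_frac
    )


def normalize(counts, target_sum=10_000):
    """Scale a QC-passing cell to a common depth, then apply log(1 + x)."""
    total = sum(counts.values())
    return {gene: math.log1p(c / total * target_sum) for gene, c in counts.items()}


if __name__ == "__main__":
    cell = {"GeneA": 6, "GeneB": 3, "MT-CO1": 1}
    if passes_qc(cell, {"MT-CO1"}, min_counts=5, min_genes=2):
        print(normalize(cell))
```

Note that `normalize` assumes the cell has already passed QC, so its total count is non-zero.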

Tooling Landscape

The two dominant ecosystems are:

Gene Regulatory Network Inference: Methods, Tools, and Pitfalls

Mon, 19 Dec 2022 00:00:00 +0000

Gene regulatory networks (GRNs) describe how transcription factors (TFs) control the expression of target genes. Reconstructing these networks from transcriptomic data is one of the hardest problems in computational biology — and one of the most rewarding.

This post covers the current landscape of GRN inference methods, how to run them in practice, and the pitfalls that trip people up most often.

What Are We Actually Trying to Infer?

A GRN is a directed graph where:
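In code, a directed graph of this kind is often held as an adjacency mapping from each regulator to its targets. The class below is a minimal sketch of that idea. The class name, the gene names in the usage example, and the convention that a positive weight means activation and a negative one repression are all assumptions made for illustration.

```python
from collections import defaultdict


class RegulatoryNetwork:
    """Directed graph with edges pointing from a TF to a target gene."""

    def __init__(self):
        # tf -> {target: weight}; the weight convention is an assumption.
        self._targets = defaultdict(dict)

    def add_edge(self, tf, target, weight):
        """Record that tf regulates target with the given signed weight."""
        self._targets[tf][target] = weight

    def targets_of(self, tf):
        """Return a copy of the targets regulated by tf."""
        return dict(self._targets.get(tf, {}))

    def regulators_of(self, gene):
        """Return every TF with an edge into gene, with its weight."""
        return {
            tf: targets[gene]
            for tf, targets in self._targets.items()
            if gene in targets
        }


if __name__ == "__main__":
    grn = RegulatoryNetwork()
    grn.add_edge("TF1", "GeneA", 0.8)
    grn.add_edge("TF2", "GeneA", -0.5)
    print(grn.regulators_of("GeneA"))
```

Because edges are directed, `targets_of` and `regulators_of` answer different questions, which is the distinction a co-expression (undirected) network cannot make.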

Getting Started with Snakemake for Bioinformatics Workflows

Tue, 08 Nov 2022 00:00:00 +0000

Reproducibility is one of the biggest challenges in bioinformatics. Raw sequencing data goes through dozens of tools and parameters before you get results, and keeping track of every step is notoriously hard. Snakemake is a workflow management system that solves this elegantly.

Why Snakemake?

Unlike shell scripts, Snakemake:

Tracks dependencies between steps automatically
Re-runs only the parts of the workflow that have changed
Scales from a laptop to a cluster with minimal changes
Produces a readable, version-controllable workflow definition
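The "re-runs only what has changed" behaviour can be illustrated with a make-style timestamp check: a step is stale when its output is missing or older than any input. This is a deliberately simplified sketch with a hypothetical function name. Snakemake's actual rerun logic takes more than file timestamps into account.

```python
import os


def needs_rerun(inputs, output):
    """Return True if output is missing or older than at least one input."""
    if not os.path.exists(output):
        return True
    output_mtime = os.path.getmtime(output)
    # Any input modified after the output was written makes the step stale.
    return any(os.path.getmtime(path) > output_mtime for path in inputs)


if __name__ == "__main__":
    # Hypothetical file names; a missing output always triggers a rerun.
    print(needs_rerun(["sample.fastq.gz"], "sample_fastqc.html"))
```

A shell script has no such check built in, which is why it either redoes everything or relies on the user to comment out finished steps.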

A Minimal Example

Here’s a simple rule that runs FastQC on a set of FASTQ files:

Nextflow DSL2: Building Modular, Scalable Bioinformatics Pipelines

Mon, 05 Sep 2022 00:00:00 +0000

Workflow managers have become essential infrastructure in bioinformatics. If you’ve outgrown shell scripts and are finding Snakemake limiting for complex multi-step pipelines — especially those that need to run across different compute environments — Nextflow DSL2 is worth a serious look.

Why DSL2?

Nextflow’s original syntax (DSL1) worked, but it made reuse difficult. DSL2 introduces a modular system where:

Processes are atomic, container-isolated execution units
Workflows compose processes into directed acyclic graphs (DAGs)
Modules are shareable, versionable process definitions

The result is a pipeline architecture that separates what a tool does from how the pipeline orchestrates it — a clean separation of concerns.
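To make the DAG idea concrete, the sketch below declares which process depends on which and derives a valid execution order with Python's standard-library `graphlib` (Python 3.9+). This is an analogy only: Nextflow infers its graph from the channels connecting processes, and the process names here are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return process names ordered so each runs after its dependencies.

    dependencies maps a process name to the set of processes it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical pipeline: trimming feeds both a QC report and alignment.
    pipeline = {
        "TRIM": set(),
        "QC_REPORT": {"TRIM"},
        "ALIGN": {"TRIM"},
        "CALL_VARIANTS": {"ALIGN"},
    }
    print(execution_order(pipeline))
```

Processes with no path between them, such as `QC_REPORT` and `ALIGN` here, have no required relative order, which is what lets a workflow manager run them in parallel.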

Containerising Bioinformatics Tools: Docker and Singularity Best Practices

Mon, 11 Jul 2022 00:00:00 +0000

Containers have become the de facto standard for reproducible bioinformatics. The days of “it worked on my machine” are largely over — if your tool isn’t containerised, it’s harder to share, harder to cite, and harder to reproduce. This post covers building production-quality bioinformatics containers from scratch.

Why Containers Beat Conda Alone

Conda environments solve Python and R package versioning, but they don’t capture the full system environment — kernel version, system libraries, locale settings, or external tool binaries. A Docker image captures everything from the OS up:

About

I’m a Bioinformatics Research Engineer working at the intersection of computational biology, data science, and open science infrastructure. My work is driven by a simple belief: research should be accessible to everyone, and data should be easy to understand.

With a background that spans wet-lab biology, computational analysis, and software engineering, I bring a rare end-to-end perspective — I understand both where the data comes from and how to build the tools that make sense of it.

Projects

A collection of tools, packages, and resources I’ve built for the bioinformatics and data science community.

📦 R Packages

R package + Shiny app for doing significant biology on a set of genes. Functional enrichment, pathway analysis, and more in an intuitive interface.

Rmarkdown Templates — a collection of R Markdown templates for common scientific reporting use cases.

A personal R package — utility and wrapper functions, API calls, and daily-task helpers that I reuse across projects.