<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Sangram Keshari Sahu</title><link>https://sksahu.com/</link><description>Recent content on Sangram Keshari Sahu</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Wed, 15 Nov 2023 00:00:00 +0000</lastBuildDate><atom:link href="https://sksahu.com/index.xml" rel="self" type="application/rss+xml"/><item><title>Variant Calling Pipelines: From Raw Reads to Annotated VCF</title><link>https://sksahu.com/blog/variant-calling-gatk-deepvariant/</link><pubDate>Wed, 15 Nov 2023 00:00:00 +0000</pubDate><guid>https://sksahu.com/blog/variant-calling-gatk-deepvariant/</guid><description>&lt;p>Variant calling sits at the heart of nearly every human genetics project — from rare disease diagnostics to cancer genomics to population studies. The tools have matured considerably, but the pipeline is still full of decisions that affect sensitivity, specificity, and reproducibility. This post walks through a complete variant calling workflow with code at every step.&lt;/p>
&lt;h2 id="pipeline-architecture">Pipeline Architecture&lt;/h2>
&lt;p>A short-variant calling pipeline moves through five major phases:&lt;/p>
&lt;pre tabindex="0">&lt;code>Raw FASTQ reads
         │
         ▼
┌─────────────────┐
│ Pre-alignment   │  FastQC, Trimming (optional)
│ QC              │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Alignment       │  BWA-MEM2 → samtools sort/index
│                 │  Picard MarkDuplicates
│                 │  BQSR (BaseRecalibrator + ApplyBQSR)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Variant         │  HaplotypeCaller → GenomicsDBImport
│ Calling         │  → GenotypeGVCFs (GATK4 GVCF mode)
│                 │  or DeepVariant (DL-based)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Variant         │  VQSR / Hard Filtering
│ Filtering       │  BCFtools view/filter
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Annotation      │  VEP / ANNOVAR / SnpEff
│ &amp;amp; Reporting     │  + custom prioritisation scripts
└─────────────────┘
&lt;/code>&lt;/pre>&lt;h2 id="step-1-alignment">Step 1: Alignment&lt;/h2>
&lt;p>BWA-MEM2 is a widely used aligner for short reads against large genomes. It is a reimplementation of BWA-MEM designed to produce identical alignments, and its README reports it as roughly 1.3 to 3.1 times faster depending on the use case, dataset, and machine:&lt;/p></description></item><item><title>Single-Cell Multi-omics Integration: Linking RNA, ATAC, and Protein Data</title><link>https://sksahu.com/blog/single-cell-multi-omics-integration/</link><pubDate>Fri, 22 Sep 2023 00:00:00 +0000</pubDate><guid>https://sksahu.com/blog/single-cell-multi-omics-integration/</guid><description>&lt;p>Measuring gene expression alone gives you a snapshot of cell state. Measuring chromatin accessibility alongside expression reveals the regulatory grammar underlying that state. Add surface proteins and you can link transcriptional identity to functional phenotype. Multi-omics integration is complex but increasingly tractable, and this post shows you how to do it.&lt;/p>
&lt;h2 id="what-multi-omics-adds">What Multi-omics Adds&lt;/h2>
&lt;p>Each single-cell modality captures a different layer of biology:&lt;/p>
&lt;pre tabindex="0">&lt;code>Modality          │ What it measures          │ Key insight
──────────────────┼───────────────────────────┼──────────────────────────────
scRNA-seq         │ Gene expression (mRNA)    │ Cell identity &amp;amp; state
scATAC-seq        │ Chromatin accessibility   │ Regulatory landscape
CITE-seq          │ Surface proteins (ADT)    │ Phenotypic marker quantification
scMethyl-seq      │ DNA methylation           │ Epigenetic silencing
Spatial omics     │ Expression + location     │ Tissue architecture
&lt;/code>&lt;/pre>&lt;p>Combining modalities lets you ask questions no single layer can answer: &lt;em>Which open chromatin regions drive the transcription programmes that define this cell type? Which transcription factors are active based on both motif accessibility and their own expression?&lt;/em>&lt;/p></description></item><item><title>Deep Learning for Genomics: Predicting Gene Expression with PyTorch</title><link>https://sksahu.com/blog/deep-learning-for-genomics-pytorch/</link><pubDate>Tue, 27 Jun 2023 00:00:00 +0000</pubDate><guid>https://sksahu.com/blog/deep-learning-for-genomics-pytorch/</guid><description>&lt;p>Deep learning has moved from hype to essential tool in genomics. Models now routinely outperform hand-engineered features for tasks like predicting gene expression, transcription factor binding, and chromatin accessibility from raw sequence. This post walks through building one such model end-to-end using PyTorch.&lt;/p>
&lt;h2 id="the-problem-setup">The Problem Setup&lt;/h2>
&lt;p>We&amp;rsquo;ll predict gene expression levels (log-normalized counts) from the 2 kb promoter sequence upstream of each transcription start site (TSS). This is a well-studied proxy task — promoter sequence encodes a substantial fraction of expression variance across tissues and conditions.&lt;/p></description></item><item><title>FAIR Data Principles in Practice: A Bioinformatics Engineer's Guide</title><link>https://sksahu.com/blog/fair-data-principles-in-practice/</link><pubDate>Mon, 03 Apr 2023 00:00:00 +0000</pubDate><guid>https://sksahu.com/blog/fair-data-principles-in-practice/</guid><description>&lt;p>Everyone in life sciences has heard about FAIR data — Findable, Accessible, Interoperable, Reusable. It&amp;rsquo;s become one of those principles that appears in grant applications and data management plans without always translating into practice. This post is about the &lt;em>practical&lt;/em> side: what FAIR actually means for a bioinformatics project and how to implement it without adding unnecessary overhead.&lt;/p>
&lt;h2 id="the-four-principles-grounded">The Four Principles, Grounded&lt;/h2>
&lt;p>The &lt;a href="https://www.nature.com/articles/sdata201618">FAIR Guiding Principles&lt;/a> were published in 2016. Let&amp;rsquo;s translate each principle into concrete bioinformatics terms:&lt;/p></description></item><item><title>Single-Cell RNA-seq: A Practical Overview</title><link>https://sksahu.com/blog/single-cell-rna-seq-overview/</link><pubDate>Tue, 14 Feb 2023 00:00:00 +0000</pubDate><guid>https://sksahu.com/blog/single-cell-rna-seq-overview/</guid><description>&lt;p>Single-cell RNA sequencing (scRNA-seq) has transformed our ability to study cellular heterogeneity. Instead of averaging gene expression across thousands of cells, we can profile each cell individually. That shift in resolution changes what questions we can ask.&lt;/p>
&lt;h2 id="the-core-workflow">The Core Workflow&lt;/h2>
&lt;p>A typical scRNA-seq analysis moves through these stages:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Alignment &amp;amp; quantification&lt;/strong> — map reads to a reference genome or transcriptome and count them per gene and per cell (Cell Ranger, STARsolo, Salmon/Alevin)&lt;/li>
&lt;li>&lt;strong>Quality control&lt;/strong> — filter low-quality cells based on library size, gene count, and mitochondrial fraction&lt;/li>
&lt;li>&lt;strong>Normalization&lt;/strong> — correct for sequencing depth differences between cells&lt;/li>
&lt;li>&lt;strong>Dimensionality reduction&lt;/strong> — PCA, then UMAP or t-SNE for visualization&lt;/li>
&lt;li>&lt;strong>Clustering&lt;/strong> — identify groups of similar cells&lt;/li>
&lt;li>&lt;strong>Annotation&lt;/strong> — assign cell types to clusters using marker genes&lt;/li>
&lt;/ol>
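&lt;p>To make the normalization stage above concrete, here is a minimal Python sketch, not a substitute for a real toolkit. It assumes a small made-up count matrix (rows are cells, columns are genes), a hypothetical helper named &lt;code>normalize_counts&lt;/code>, and a target depth of 10,000 counts per cell, which is a common convention rather than something this post prescribes.&lt;/p>
&lt;pre tabindex="0">&lt;code>import math

def normalize_counts(counts, target_sum=10_000):
    """Scale each cell to target_sum total counts, then apply log1p."""
    normalized = []
    for cell in counts:
        depth = sum(cell)
        # Cells with no counts stay all zeros to avoid dividing by zero.
        scale = target_sum / depth if depth else 0.0
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized

# Toy matrix (made-up values): rows are cells, columns are genes.
counts = [
    [10, 0, 90],    # shallow cell, 100 counts in total
    [100, 0, 900],  # same expression profile at ten times the depth
]
normalized = normalize_counts(counts)
&lt;/code>&lt;/pre>&lt;p>Both toy cells have the same expression profile at different sequencing depths, so they end up with the same normalized values. Removing that depth effect is the purpose of this stage.&lt;/p>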
&lt;h2 id="tooling-landscape">Tooling Landscape&lt;/h2>
&lt;p>The two dominant ecosystems are:&lt;/p></description></item><item><title>Gene Regulatory Network Inference: Methods, Tools, and Pitfalls</title><link>https://sksahu.com/blog/gene-regulatory-network-inference/</link><pubDate>Mon, 19 Dec 2022 00:00:00 +0000</pubDate><guid>https://sksahu.com/blog/gene-regulatory-network-inference/</guid><description>&lt;p>Gene regulatory networks (GRNs) describe how transcription factors (TFs) control the expression of target genes. Reconstructing these networks from transcriptomic data is one of the hardest problems in computational biology — and one of the most rewarding.&lt;/p>
&lt;p>This post covers the current landscape of GRN inference methods, how to run them in practice, and the pitfalls that trip people up most often.&lt;/p>
&lt;h2 id="what-are-we-actually-trying-to-infer">What Are We Actually Trying to Infer?&lt;/h2>
&lt;p>A GRN is a directed graph where:&lt;/p></description></item><item><title>Getting Started with Snakemake for Bioinformatics Workflows</title><link>https://sksahu.com/blog/getting-started-with-snakemake/</link><pubDate>Tue, 08 Nov 2022 00:00:00 +0000</pubDate><guid>https://sksahu.com/blog/getting-started-with-snakemake/</guid><description>&lt;p>Reproducibility is one of the biggest challenges in bioinformatics. Raw sequencing data passes through dozens of tools, each with its own parameters, before you get results, and keeping track of every step is notoriously hard. &lt;a href="https://snakemake.readthedocs.io/">Snakemake&lt;/a> is a workflow management system that addresses this by describing each step as a rule with declared inputs and outputs.&lt;/p>
&lt;h2 id="why-snakemake">Why Snakemake?&lt;/h2>
&lt;p>Unlike shell scripts, Snakemake:&lt;/p>
&lt;ul>
&lt;li>Tracks dependencies between steps automatically&lt;/li>
&lt;li>Re-runs only the steps affected by a change, not the entire workflow&lt;/li>
&lt;li>Scales from a laptop to a cluster with minimal changes&lt;/li>
&lt;li>Produces a readable, version-controllable workflow definition&lt;/li>
&lt;/ul>
&lt;h2 id="a-minimal-example">A Minimal Example&lt;/h2>
&lt;p>Here&amp;rsquo;s a simple rule that runs FastQC on a set of FASTQ files:&lt;/p></description></item><item><title>Nextflow DSL2: Building Modular, Scalable Bioinformatics Pipelines</title><link>https://sksahu.com/blog/nextflow-dsl2-modular-pipelines/</link><pubDate>Mon, 05 Sep 2022 00:00:00 +0000</pubDate><guid>https://sksahu.com/blog/nextflow-dsl2-modular-pipelines/</guid><description>&lt;p>Workflow managers have become essential infrastructure in bioinformatics. If you&amp;rsquo;ve outgrown shell scripts and are finding Snakemake limiting for complex multi-step pipelines — especially those that need to run across different compute environments — Nextflow DSL2 is worth a serious look.&lt;/p>
&lt;h2 id="why-dsl2">Why DSL2?&lt;/h2>
&lt;p>Nextflow&amp;rsquo;s original syntax (DSL1) worked, but it made reuse difficult. DSL2 introduces a modular system where:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Processes&lt;/strong> are atomic, container-isolated execution units&lt;/li>
&lt;li>&lt;strong>Workflows&lt;/strong> compose processes into directed acyclic graphs (DAGs)&lt;/li>
&lt;li>&lt;strong>Modules&lt;/strong> are shareable, versionable process definitions&lt;/li>
&lt;/ul>
&lt;p>The result is a pipeline architecture that separates &lt;em>what a tool does&lt;/em> from &lt;em>how the pipeline orchestrates it&lt;/em> — a clean separation of concerns.&lt;/p></description></item><item><title>Containerising Bioinformatics Tools: Docker and Singularity Best Practices</title><link>https://sksahu.com/blog/containerising-bioinformatics-tools/</link><pubDate>Mon, 11 Jul 2022 00:00:00 +0000</pubDate><guid>https://sksahu.com/blog/containerising-bioinformatics-tools/</guid><description>&lt;p>Containers have become the de facto standard for reproducible bioinformatics. The days of &amp;ldquo;it worked on my machine&amp;rdquo; are largely over — if your tool isn&amp;rsquo;t containerised, it&amp;rsquo;s harder to share, harder to cite, and harder to reproduce. This post covers building production-quality bioinformatics containers from scratch.&lt;/p>
&lt;h2 id="why-containers-beat-conda-alone">Why Containers Beat Conda Alone&lt;/h2>
&lt;p>Conda environments pin Python and R package versions, but they don&amp;rsquo;t capture the rest of the system environment: system libraries, locale settings, or tool binaries installed outside Conda. A Docker image captures the whole user-space stack from the base OS image up (the kernel itself still comes from the host):&lt;/p></description></item><item><title>About</title><link>https://sksahu.com/about/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://sksahu.com/about/</guid><description>&lt;p>I&amp;rsquo;m a &lt;strong>Bioinformatics Research Engineer&lt;/strong> working at the intersection of computational biology, data science, and open science infrastructure. My work is driven by a simple belief: &lt;em>research should be accessible to everyone, and data should be easy to understand.&lt;/em>&lt;/p>
&lt;p>With a background that spans wet-lab biology, computational analysis, and software engineering, I bring a rare end-to-end perspective — I understand both where the data comes from and how to build the tools that make sense of it.&lt;/p></description></item><item><title>Projects</title><link>https://sksahu.com/projects/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://sksahu.com/projects/</guid><description>&lt;p>A collection of tools, packages, and resources I&amp;rsquo;ve built for the bioinformatics and data science community.&lt;/p>
&lt;div class="project-section-title">📦 R Packages&lt;/div>
&lt;div class="project-grid">
 &lt;div class="project-card">
 &lt;div class="project-card__header">
 &lt;span class="project-card__name">&lt;a href="https://github.com/sk-sahu/sig-bio-shiny" target="_blank" rel="noopener">Sig-Bio-Shiny&lt;/a>&lt;/span>
 &lt;span class="tag tag--active">Active&lt;/span>
 &lt;/div>
 &lt;p class="project-card__desc">R package + Shiny app for doing significant biology on a set of genes. Functional enrichment, pathway analysis, and more in an intuitive interface.&lt;/p>
 &lt;div class="project-card__footer">
 &lt;div class="tags">
 &lt;span class="tag tag--default">R&lt;/span>
 &lt;span class="tag tag--default">Shiny&lt;/span>
 &lt;span class="tag tag--default">Gene Ontology&lt;/span>
 &lt;/div>
 &lt;/div>
 &lt;/div>
 &lt;div class="project-card">
 &lt;div class="project-card__header">
 &lt;span class="project-card__name">&lt;a href="https://sk-sahu.github.io/Rmplates/" target="_blank" rel="noopener">Rmplates&lt;/a>&lt;/span>
 &lt;span class="tag tag--accent">Package&lt;/span>
 &lt;/div>
 &lt;p class="project-card__desc">&lt;strong>R&lt;/strong>markdown Te&lt;strong>mplates&lt;/strong> — a collection of R Markdown templates for common scientific reporting use cases.&lt;/p>
 &lt;div class="project-card__footer">
 &lt;div class="tags">
 &lt;span class="tag tag--default">R&lt;/span>
 &lt;span class="tag tag--default">R Markdown&lt;/span>
 &lt;/div>
 &lt;/div>
 &lt;/div>
 &lt;div class="project-card">
 &lt;div class="project-card__header">
 &lt;span class="project-card__name">&lt;a href="https://sksahu.net/sahu/" target="_blank" rel="noopener">sahu&lt;/a>&lt;/span>
 &lt;span class="tag tag--accent">Package&lt;/span>
 &lt;/div>
 &lt;p class="project-card__desc">A personal R package — utility and wrapper functions, API calls, and daily-task helpers that I reuse across projects.&lt;/p></description></item></channel></rss>