Containers have become the de facto standard for reproducible bioinformatics. The days of “it worked on my machine” are largely over — if your tool isn’t containerised, it’s harder to share, harder to cite, and harder to reproduce. This post covers building production-quality bioinformatics containers from scratch.
## Why Containers Beat Conda Alone
Conda environments solve Python and R package versioning, but they don’t capture the full system environment — kernel version, system libraries, locale settings, or external tool binaries. A Docker image captures everything from the OS up:
```
Your analysis reproducibility stack
────────────────────────────────────────────────────
Application layer │ Your pipeline code
Tool layer        │ STAR, featureCounts, FastQC...
Language layer    │ Python 3.11, R 4.3, Julia 1.9
Library layer     │ glibc, libhts, zlib, openssl
OS layer          │ Debian 12 "bookworm" (pinned)
────────────────────────────────────────────────────
Hardware          │ Your laptop / HPC / cloud VM
```
Without containerisation, the tool, language, library, and OS layers vary between collaborators and compute environments, silently changing results.
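One way to see this drift concretely is to fingerprint the lower layers on each machine and diff the outputs; any line that differs is a candidate explanation for divergent results. A minimal sketch using standard Linux utilities (`getconf GNU_LIBC_VERSION` is glibc-specific and is skipped elsewhere):

```shell
# Environment fingerprint — run on each machine, then diff the outputs.
uname -sr                                       # kernel name and version
getconf GNU_LIBC_VERSION 2>/dev/null || true    # glibc version (glibc systems only)
locale 2>/dev/null | sort                       # locale affects sorting and number parsing
command -v python3 >/dev/null && python3 --version || true   # interpreter, if present
```

Even this small list regularly surfaces differences (locale-dependent sort order is a classic source of "same pipeline, different output").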
## Anatomy of a Good Bioinformatics Dockerfile
Let’s build a real example: a container for a bulk RNA-seq pipeline using STAR + featureCounts.
```dockerfile
# ──────────────────────────────────────────────────────────────────────
# Stage 1: build STAR from source (produces a smaller final image)
# ──────────────────────────────────────────────────────────────────────
FROM ubuntu:22.04 AS star_builder

ARG STAR_VERSION=2.7.11b
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cmake \
        libzstd-dev \
        zlib1g-dev \
        ca-certificates \
        wget \
    && rm -rf /var/lib/apt/lists/*

RUN wget -qO- \
        https://github.com/alexdobin/STAR/archive/${STAR_VERSION}.tar.gz \
    | tar xz \
    && cd STAR-${STAR_VERSION}/source \
    && make -j$(nproc) STARstatic \
    && mv STAR /usr/local/bin/STAR

# ──────────────────────────────────────────────────────────────────────
# Stage 2: final image — copy only the binary
# ──────────────────────────────────────────────────────────────────────
FROM ubuntu:22.04

LABEL org.opencontainers.image.source="https://github.com/sk-sahu/rnaseq-container"
LABEL org.opencontainers.image.version="1.2.0"
LABEL org.opencontainers.image.licenses="MIT"

ENV DEBIAN_FRONTEND=noninteractive

# System dependencies (pinned via apt-cache show)
RUN apt-get update && apt-get install -y --no-install-recommends \
        subread=2.0.3+dfsg-2 \
        samtools=1.16.1-1 \
        python3=3.10.6-1 \
        python3-pip=22.0.2+dfsg-1 \
    && rm -rf /var/lib/apt/lists/*

# Copy the STAR binary from the build stage
COPY --from=star_builder /usr/local/bin/STAR /usr/local/bin/STAR

# Python dependencies (pinned)
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt \
    && rm /tmp/requirements.txt

# Create a non-root user — critical for HPC compatibility
RUN useradd --uid 1000 --create-home biouser
USER biouser
WORKDIR /data

CMD ["bash"]
```
Key practices embedded here:
- Multi-stage build keeps the final image small (only the binary, not the build toolchain)
- Package version pinning in `apt-get install` locks system library versions
- OCI labels add provenance metadata readable by registries and Nextflow
- Non-root user is essential for security and HPC compatibility (Singularity maps the UID)
- `--no-install-recommends` avoids installing unnecessary packages
## Python/R Containers
For Python-heavy tools, always pin with a lockfile:
```dockerfile
FROM python:3.11.9-slim-bookworm

WORKDIR /app

# Copy the lockfile first — the Docker layer cache means pip only reruns
# when the lockfile changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

USER 1000
ENTRYPOINT ["python", "-m", "my_tool"]
```
Generate a fully pinned `requirements.txt` with:

```bash
pip install pip-tools
pip-compile --generate-hashes pyproject.toml --output-file requirements.txt
```
For R containers, Bioconductor Docker images are the best starting point:
```dockerfile
FROM bioconductor/bioconductor_docker:RELEASE_3_19

# Install specific Bioconductor packages with version pinning via renv
COPY renv.lock .
RUN R -e "install.packages('renv'); renv::restore(lockfile='renv.lock')"

USER 1000
```
Generate `renv.lock` inside R:

```r
# In your R project
renv::init()
renv::install(c("Seurat", "SingleCellExperiment", "DESeq2"))
renv::snapshot()  # writes renv.lock
```
## Singularity for HPC
Most HPC clusters don’t allow Docker, because its daemon runs with root privileges. Singularity (now Apptainer) runs containers without root and is the standard on HPC:
```bash
# Build a Singularity image from a Docker image in a registry
singularity pull rnaseq.sif docker://ghcr.io/sk-sahu/rnaseq-container:1.2.0

# Run a command in the container (mounts $HOME automatically)
singularity exec rnaseq.sif STAR --version

# Bind additional directories explicitly
singularity exec \
    --bind /scratch/data:/data \
    --bind /reference:/reference \
    rnaseq.sif \
    STAR \
    --runThreadN 16 \
    --genomeDir /reference/star_index \
    --readFilesIn /data/sample_R1.fastq.gz /data/sample_R2.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix /data/results/sample.
```
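On a cluster this command usually lives inside a batch script submitted to the scheduler. A minimal sketch for SLURM, assuming your site provides a `singularity` module and the paths shown (job name, resources, and module name are illustrative, not a prescription):

```bash
#!/bin/bash
#SBATCH --job-name=star-align
#SBATCH --cpus-per-task=16
#SBATCH --mem=40G
#SBATCH --time=04:00:00

# Load the site's Singularity/Apptainer module (name varies by cluster)
module load singularity

singularity exec \
    --bind /scratch/data:/data \
    --bind /reference:/reference \
    rnaseq.sif \
    STAR --runThreadN "$SLURM_CPUS_PER_TASK" \
         --genomeDir /reference/star_index \
         --readFilesIn /data/sample_R1.fastq.gz /data/sample_R2.fastq.gz \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix /data/results/sample.
```

Using `$SLURM_CPUS_PER_TASK` for `--runThreadN` keeps the thread count in sync with the resources the scheduler actually granted.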
In Nextflow, you can support both Docker and Singularity from the same config:
```groovy
profiles {
    docker {
        docker.enabled = true
    }
    singularity {
        singularity.enabled = true
        singularity.autoMounts = true
        // Cache pulled images to avoid re-downloading
        singularity.cacheDir = "/scratch/singularity-cache"
    }
}
```
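Each process then names its image once, and whichever profile is active decides how the image is pulled and run. A sketch of what that looks like in Nextflow DSL2 (the process name and script body are illustrative):

```groovy
process ALIGN {
    container 'ghcr.io/sk-sahu/rnaseq-container:1.2.0'
    cpus 16

    input:
    tuple val(sample_id), path(reads)

    output:
    path "${sample_id}.Aligned.sortedByCoord.out.bam"

    script:
    """
    STAR --runThreadN ${task.cpus} \\
         --readFilesIn ${reads} \\
         --outSAMtype BAM SortedByCoordinate \\
         --outFileNamePrefix ${sample_id}.
    """
}
```

With `-profile docker` Nextflow runs this via Docker; with `-profile singularity` it converts and caches the same image as a `.sif` file automatically.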
## CI/CD: Automated Image Building and Scanning
Manually building images is error-prone. Here’s a GitHub Actions workflow that builds, scans, and pushes your image on every release:
```yaml
# .github/workflows/docker-publish.yml
name: Build and Publish Docker Image

on:
  push:
    tags: ['v*']

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata (tags, labels)
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix=git-

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Scan for vulnerabilities with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ steps.meta.outputs.version }}
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Upload Trivy scan results
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'
```
This workflow:
- Triggers on version tags
- Builds with layer caching (significantly faster CI)
- Pushes to GitHub Container Registry (free for public repos)
- Scans the image for CVEs with Trivy
- Uploads results to GitHub Security tab
## Image Size Optimisation
Large images are slow to pull on HPC and cloud. Measure and optimise:
```bash
# Check layer sizes
docker history ghcr.io/sk-sahu/rnaseq-container:1.2.0 \
    --human --format "table {{.Size}}\t{{.CreatedBy}}"

# Use dive for interactive exploration
docker run --rm -it \
    -v /var/run/docker.sock:/var/run/docker.sock \
    wagoodman/dive:latest \
    ghcr.io/sk-sahu/rnaseq-container:1.2.0
```
Common size wins:
```dockerfile
# Bad: separate RUN commands leave intermediate layers
RUN apt-get update
RUN apt-get install -y build-essential
RUN apt-get clean

# Good: single RUN command, clean up in the same layer
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
```
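A related, easy win is keeping the build context small with a `.dockerignore` file, so reference data, results, and git history never get sent to the Docker daemon or accidentally copied by `COPY . .` (the entries below are typical examples for a bioinformatics repo, not a prescription):

```
# .dockerignore
.git
data/
results/
*.fastq.gz
*.bam
```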
## Registries for Bioinformatics Images
Before building from scratch, check these community registries:
| Registry | URL | Notes |
|---|---|---|
| BioContainers | biocontainers.pro | Automated builds from Bioconda packages |
| Quay.io | quay.io/biocontainers | BioContainers mirror |
| Docker Hub | hub.docker.com | Broad ecosystem, rate-limited |
| GHCR | ghcr.io | GitHub-integrated, no rate limits for public |
For most standard tools (FastQC, STAR, Salmon, DESeq2), a BioContainers image already exists. Use it — don’t maintain your own version.
Containers are not optional infrastructure anymore. They’re the minimum standard for reproducible research, and the tooling has matured to the point where the overhead of building a good image is small relative to the scientific value it adds.