Containers have become the de facto standard for reproducible bioinformatics. The days of “it worked on my machine” are largely over — if your tool isn’t containerised, it’s harder to share, harder to cite, and harder to reproduce. This post covers building production-quality bioinformatics containers from scratch.

Why Containers Beat Conda Alone

Conda environments solve Python and R package versioning, but they don’t capture the full system environment — kernel version, system libraries, locale settings, or external tool binaries. A Docker image captures everything from the OS up:

Your analysis reproducibility stack
────────────────────────────────────────────────────
Application layer  │  Your pipeline code
Tool layer         │  STAR, featureCounts, FastQC...
Language layer     │  Python 3.11, R 4.3, Julia 1.9
Library layer      │  glibc, libhts, zlib, openssl
OS layer           │  Debian 12 "bookworm" (pinned)
────────────────────────────────────────────────────
Hardware           │  Your laptop / HPC / cloud VM

Without containerisation, the bottom four layers vary between collaborators and compute environments, silently changing results.
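To see what a pinned base image fixes in practice, compare the OS release and libc version on your host against the same checks inside the container (requires Docker; the two outputs will differ unless your host happens to run Ubuntu 22.04):

```shell
# OS and libc on the host
grep PRETTY_NAME /etc/os-release
ldd --version | head -n1

# Same checks inside the pinned container, identical for every collaborator
docker run --rm ubuntu:22.04 bash -c \
    'grep PRETTY_NAME /etc/os-release && ldd --version | head -n1'
```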

Anatomy of a Good Bioinformatics Dockerfile

Let’s build a real example: a container for a bulk RNA-seq pipeline using STAR + featureCounts.

# ──────────────────────────────────────────────────────────────────────
# Stage 1: build STAR from source (produces a smaller final image)
# ──────────────────────────────────────────────────────────────────────
FROM ubuntu:22.04 AS star_builder

ARG STAR_VERSION=2.7.11b
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cmake \
        libzstd-dev \
        zlib1g-dev \
        ca-certificates \
        wget \
    && rm -rf /var/lib/apt/lists/*

RUN wget -qO- \
        https://github.com/alexdobin/STAR/archive/${STAR_VERSION}.tar.gz \
    | tar xz \
    && cd STAR-${STAR_VERSION}/source \
    && make -j$(nproc) STARstatic \
    && mv STAR /usr/local/bin/STAR

# ──────────────────────────────────────────────────────────────────────
# Stage 2: final image — copy only the binary
# ──────────────────────────────────────────────────────────────────────
FROM ubuntu:22.04

LABEL org.opencontainers.image.source="https://github.com/sk-sahu/rnaseq-container"
LABEL org.opencontainers.image.version="1.2.0"
LABEL org.opencontainers.image.licenses="MIT"

ENV DEBIAN_FRONTEND=noninteractive

# System dependencies (pinned via apt-cache show)
RUN apt-get update && apt-get install -y --no-install-recommends \
        subread=2.0.3+dfsg-2 \
        samtools=1.16.1-1 \
        python3=3.10.6-1 \
        python3-pip=22.0.2+dfsg-1 \
    && rm -rf /var/lib/apt/lists/*

# Copy STAR binary from build stage
COPY --from=star_builder /usr/local/bin/STAR /usr/local/bin/STAR

# Python dependencies (pinned)
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install --no-cache-dir -r /tmp/requirements.txt \
    && rm /tmp/requirements.txt

# Create a non-root user — critical for HPC compatibility
RUN useradd --uid 1000 --create-home biouser
USER biouser
WORKDIR /data

CMD ["bash"]

Key practices embedded here:

  • Multi-stage build keeps the final image small (only the binary, not the build toolchain)
  • Package version pinning in apt-get install locks system library versions
  • OCI labels add provenance metadata readable by registries and Nextflow
  • Non-root user improves security; Singularity runs containers as the invoking host user, so images that assume root at runtime break on HPC
  • --no-install-recommends avoids installing unnecessary packages
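To build and sanity-check the image locally (the tag mirrors the OCI labels above; adjust for your own registry):

```shell
# Build, overriding the STAR version build arg if needed
docker build --build-arg STAR_VERSION=2.7.11b \
    -t ghcr.io/sk-sahu/rnaseq-container:1.2.0 .

# Smoke-test: every tool should report its version as the non-root user
docker run --rm ghcr.io/sk-sahu/rnaseq-container:1.2.0 STAR --version
docker run --rm ghcr.io/sk-sahu/rnaseq-container:1.2.0 featureCounts -v
docker run --rm ghcr.io/sk-sahu/rnaseq-container:1.2.0 samtools --version
```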

Python/R Containers

For Python-heavy tools, always pin with a lockfile:

FROM python:3.11.9-slim-bookworm

WORKDIR /app

# Copy lockfile first — Docker layer cache means pip only reruns when lockfile changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

USER 1000
ENTRYPOINT ["python", "-m", "my_tool"]

Generate a fully pinned, hash-locked requirements.txt with:

pip install pip-tools
pip-compile --generate-hashes pyproject.toml --output-file requirements.txt
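Because the generated file contains hashes, pip verifies every downloaded artifact against them; you can make that explicit so the install fails loudly if any hash is missing:

```shell
# Refuse any package whose hash is absent or does not match the lockfile
pip install --no-cache-dir --require-hashes -r requirements.txt
```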

For R containers, Bioconductor Docker images are the best starting point:

FROM bioconductor/bioconductor_docker:RELEASE_3_19

# Install specific Bioconductor packages with version pinning via renv
COPY renv.lock .
RUN R -e "install.packages('renv'); renv::restore(lockfile='renv.lock')"

USER 1000

Generate renv.lock inside R:

# In your R project
renv::init()
renv::install(c("Seurat", "SingleCellExperiment", "DESeq2"))
renv::snapshot()   # writes renv.lock

Singularity for HPC

Most HPC clusters don’t allow Docker because its daemon runs as root. Singularity (now Apptainer) runs containers without elevated privileges and is the standard on HPC:

# Build a Singularity image from a Docker Hub image
singularity pull rnaseq.sif docker://ghcr.io/sk-sahu/rnaseq-container:1.2.0

# Run a command in the container (mounts $HOME automatically)
singularity exec rnaseq.sif STAR --version

# Bind additional directories explicitly
singularity exec \
    --bind /scratch/data:/data \
    --bind /reference:/reference \
    rnaseq.sif \
    STAR \
        --runThreadN 16 \
        --genomeDir /reference/star_index \
        --readFilesIn /data/sample_R1.fastq.gz /data/sample_R2.fastq.gz \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix /data/results/sample.
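On a Slurm cluster, that exec call usually lives in a batch script. A hypothetical sketch; the module name, resource flags, and paths all depend on your site:

```shell
#!/bin/bash
#SBATCH --job-name=star-align
#SBATCH --cpus-per-task=16
#SBATCH --mem=40G
#SBATCH --time=04:00:00

# Site-specific: some clusters expose 'apptainer' instead of 'singularity'
module load singularity

singularity exec \
    --bind /scratch/data:/data \
    --bind /reference:/reference \
    rnaseq.sif \
    STAR --runThreadN "${SLURM_CPUS_PER_TASK}" \
         --genomeDir /reference/star_index \
         --readFilesIn /data/sample_R1.fastq.gz /data/sample_R2.fastq.gz \
         --readFilesCommand zcat \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix /data/results/sample.
```

Tying --runThreadN to ${SLURM_CPUS_PER_TASK} keeps STAR from oversubscribing the allocation when you resize the job.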

In Nextflow, you can set both Docker and Singularity in the same config:

profiles {
    docker {
        docker.enabled = true
    }

    singularity {
        singularity.enabled    = true
        singularity.autoMounts = true
        // Cache pulled images to avoid re-downloading
        singularity.cacheDir   = "/scratch/singularity-cache"
    }
}
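With those profiles in place, the image can be attached to processes centrally in the same config, keeping pipeline code engine-agnostic. A minimal sketch; the withName pattern and the FastQC tag are illustrative:

```groovy
process {
    // Default image for every process
    container = 'ghcr.io/sk-sahu/rnaseq-container:1.2.0'

    // Override for processes that need a different tool stack
    withName: 'FASTQC.*' {
        container = 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'
    }
}
```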

CI/CD: Automated Image Building and Scanning

Manually building images is error-prone. Here’s a GitHub Actions workflow that builds, scans, and pushes your image on every release:

# .github/workflows/docker-publish.yml
name: Build and Publish Docker Image

on:
  push:
    tags: ['v*']

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - uses: actions/checkout@v4

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata (tags, labels)
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha,prefix=git-

      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Scan for vulnerabilities with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ steps.meta.outputs.version }}
          format: 'sarif'
          output: 'trivy-results.sarif'

      - name: Upload Trivy scan results
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'

This workflow:

  1. Triggers on version tags
  2. Builds with layer caching (significantly faster CI)
  3. Pushes to GitHub Container Registry (free for public repos)
  4. Scans the image for CVEs with Trivy
  5. Uploads results to GitHub Security tab
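The same scan can run locally before you tag a release, so CI is never the first place you learn about a critical CVE (requires the trivy CLI):

```shell
# Exit non-zero if HIGH or CRITICAL vulnerabilities are found
trivy image \
    --severity HIGH,CRITICAL \
    --exit-code 1 \
    ghcr.io/sk-sahu/rnaseq-container:1.2.0
```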

Image Size Optimisation

Large images are slow to pull on HPC and cloud. Measure and optimise:

# Check layer sizes
docker history ghcr.io/sk-sahu/rnaseq-container:1.2.0 --human --format "table {{.Size}}\t{{.CreatedBy}}"

# Use dive for interactive exploration
docker run --rm -it \
    -v /var/run/docker.sock:/var/run/docker.sock \
    wagoodman/dive:latest \
    ghcr.io/sk-sahu/rnaseq-container:1.2.0

Common size wins:

# Bad: separate RUN commands leave intermediate layers
RUN apt-get update
RUN apt-get install -y build-essential
RUN apt-get clean

# Good: single RUN command, clean up in the same layer
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
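A .dockerignore file is another easy win: everything in the build context is sent to the daemon and can be accidentally COPYed into the image. The patterns below are illustrative:

```
# .dockerignore: keep large data and VCS metadata out of the build context
.git
*.fastq.gz
*.bam
results/
```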

Registries for Bioinformatics Images

Before building from scratch, check these community registries:

Registry        URL                     Notes
────────────────────────────────────────────────────────────────────────
BioContainers   biocontainers.pro       Automated builds from Bioconda packages
Quay.io         quay.io/biocontainers   BioContainers mirror
Docker Hub      hub.docker.com          Broad ecosystem, rate-limited
GHCR            ghcr.io                 GitHub-integrated, no rate limits for public images

For most standard tools (FastQC, STAR, Salmon, DESeq2), a BioContainers image already exists. Use it — don’t maintain your own version.
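For example, pulling a prebuilt FastQC image takes one command. The exact tag below is illustrative (BioContainers tags embed the Bioconda build string), so check the registry for the current one:

```shell
# Tag format: <version>--<build-hash>_<build-number>
docker pull quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0
docker run --rm quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0 fastqc --version
```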

Containers are not optional infrastructure anymore. They’re the minimum standard for reproducible research, and the tooling has matured to the point where the overhead of building a good image is small relative to the scientific value it adds.