Workflow managers have become essential infrastructure in bioinformatics. If you’ve outgrown shell scripts and are finding Snakemake limiting for complex multi-step pipelines — especially those that need to run across different compute environments — Nextflow DSL2 is worth a serious look.

Why DSL2?

Nextflow’s original syntax (DSL1) worked, but it made reuse difficult. DSL2 introduces a modular system where:

  • Processes are atomic, container-isolated execution units
  • Workflows compose processes into directed acyclic graphs (DAGs)
  • Modules are shareable, versionable process definitions

The result is a pipeline architecture that separates what a tool does from how the pipeline orchestrates it — a clean separation of concerns.

Pipeline Architecture

A well-structured DSL2 project looks like this:

my-pipeline/
├── main.nf                  # Entry workflow
├── nextflow.config          # Executor, container, and resource config
├── modules/
│   ├── local/               # Pipeline-specific modules
│   │   └── custom_filter/
│   │       └── main.nf
│   └── nf-core/             # Community modules (via nf-core/modules)
│       ├── fastqc/main.nf
│       ├── trimgalore/main.nf
│       └── star/align/main.nf
├── subworkflows/
│   └── local/
│       └── align_and_qc/
│           └── main.nf
└── conf/
    ├── base.config          # Default resource profiles
    ├── hpc.config           # HPC-specific settings
    └── cloud.config         # AWS/GCP settings

Data flows through the pipeline as channels — lazy, asynchronous streams that Nextflow schedules automatically. Here’s a simplified ASCII diagram of a typical RNA-seq pipeline DAG:

FASTQ files
    │
    ▼
[FASTQC] ──────────────────────────┐
    │                              │
    ▼                              ▼
[TRIM_GALORE]               [MULTIQC report]
    │
    ▼
[STAR_ALIGN]
    │
    ├──────────────────────────────┐
    ▼                              ▼
[FEATURECOUNTS]           [SAMTOOLS_SORT]
    │                              │
    ▼                              ▼
[DESEQ2]                  [BAMCOVERAGE (bigWig)]
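The "FASTQ files" entry point at the top of this DAG is just a channel. A minimal sketch of building one from a samplesheet (the column names `sample`, `fastq_1`, and `fastq_2` are illustrative, not a fixed schema):

```nextflow
// Build the input channel from a CSV samplesheet: each element takes
// the shape [ val(meta), [ path(r1), path(r2) ] ] expected downstream.
ch_fastq = Channel
    .fromPath('samplesheet.csv')
    .splitCsv(header: true)
    .map { row -> [ [ id: row.sample, single_end: false ],
                    [ file(row.fastq_1), file(row.fastq_2) ] ] }
```

Because channels are lazy, nothing runs until a process consumes `ch_fastq`; Nextflow then schedules one task per element, in parallel where the DAG allows.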

Defining a Process

Processes are the atomic units of work. Each process runs in its own container and has explicit input/output declarations:

process STAR_ALIGN {
    tag "$meta.id"
    label 'process_high'

    container 'quay.io/biocontainers/star:2.7.10b--h9ee0642_0'

    input:
    tuple val(meta), path(reads)
    path  index
    path  gtf

    output:
    tuple val(meta), path("*Aligned.out.bam"), emit: bam
    tuple val(meta), path("*Log.final.out"),   emit: log_final
    tuple val(meta), path("*SJ.out.tab"),      emit: sj

    script:
    def prefix = task.ext.prefix ?: "${meta.id}"
    // Emit an unsorted BAM; coordinate sorting is delegated to the
    // dedicated SAMTOOLS_SORT process downstream, so STAR doesn't sort twice.
    """
    STAR \\
        --runThreadN $task.cpus \\
        --genomeDir $index \\
        --sjdbGTFfile $gtf \\
        --readFilesIn $reads \\
        --readFilesCommand zcat \\
        --outSAMtype BAM Unsorted \\
        --outFileNamePrefix ${prefix}. \\
        --outSAMattributes NH HI AS NM MD
    """
}

A few things worth noticing:

  • tag adds a per-task label to logs — invaluable when debugging
  • label ties to resource profiles defined in nextflow.config
  • meta is a map carrying sample metadata (id, strandedness, etc.) — a DSL2 convention popularised by nf-core
  • emit gives named handles to outputs, making downstream composition explicit
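The `task.ext.prefix` referenced in the script block is usually supplied from configuration rather than hard-coded. A sketch of the nf-core convention (the `.star` suffix is illustrative):

```nextflow
// conf/modules.config (sketch): override ext values per process name.
// The closure is evaluated per task, so it can read that task's meta map.
process {
    withName: 'STAR_ALIGN' {
        ext.prefix = { "${meta.id}.star" }
    }
}
```

This keeps tool-specific naming and arguments out of the module itself, so the same module works unchanged across pipelines.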

Composing a Subworkflow

Subworkflows group related processes. Here’s an alignment subworkflow:

include { STAR_ALIGN      } from '../../modules/nf-core/star/align/main'
include { SAMTOOLS_SORT   } from '../../modules/nf-core/samtools/sort/main'
include { SAMTOOLS_INDEX  } from '../../modules/nf-core/samtools/index/main'

workflow ALIGN_AND_QC {
    take:
    reads  // channel: [ val(meta), [ path(fastq) ] ]
    index  // path: STAR genome index
    gtf    // path: annotation GTF

    main:
    STAR_ALIGN ( reads, index, gtf )
    SAMTOOLS_SORT ( STAR_ALIGN.out.bam )
    SAMTOOLS_INDEX ( SAMTOOLS_SORT.out.bam )

    emit:
    bam       = SAMTOOLS_SORT.out.bam
    bai       = SAMTOOLS_INDEX.out.bai
    log_final = STAR_ALIGN.out.log_final
}

The take / main / emit blocks make the data contract of each subworkflow explicit — critical when you want to swap implementations later.
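A sketch of how the entry workflow in main.nf might consume that contract (the parameter names `params.reads`, `params.star_index`, and `params.gtf` are illustrative):

```nextflow
include { ALIGN_AND_QC } from './subworkflows/local/align_and_qc/main'

workflow {
    // Shape the reads channel as [ val(meta), [ path(fastq) ] ],
    // matching the subworkflow's declared take block.
    ch_reads = Channel
        .fromFilePairs(params.reads)
        .map { id, fastqs -> [ [ id: id, single_end: false ], fastqs ] }

    ALIGN_AND_QC ( ch_reads, file(params.star_index), file(params.gtf) )
}
```

Because the subworkflow only exposes `take` and `emit`, main.nf never touches its internals; a different aligner can be dropped in behind the same contract.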

Configuration and Portability

One of Nextflow’s biggest strengths is executor portability. The same pipeline can run on a laptop, SLURM cluster, or AWS Batch by changing a config profile:

// nextflow.config
profiles {
    local {
        process.executor = 'local'
        docker.enabled   = true
    }

    slurm {
        process.executor = 'slurm'
        singularity.enabled    = true
        singularity.autoMounts = true
        process {
            withLabel: 'process_high' {
                cpus   = 16
                memory = 64.GB
                time   = 12.h
                queue  = 'long'
            }
        }
    }

    awsbatch {
        process.executor       = 'awsbatch'
        process.queue          = 'nextflow-queue'
        aws.region             = 'eu-west-1'
        docker.enabled         = true
    }
}

This configuration-driven portability is what makes Nextflow a practical choice for shared research infrastructure.
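In practice, switching compute targets is a single flag (the S3 bucket below is a placeholder; AWS Batch requires an S3 work directory):

# Same pipeline, three executors
nextflow run main.nf -profile local
nextflow run main.nf -profile slurm
nextflow run main.nf -profile awsbatch -work-dir s3://my-bucket/work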

Consuming nf-core Modules

The nf-core/modules repository contains over 1,000 community-maintained process definitions, each backed by pinned Docker/Singularity containers. You can pull them directly:

# Install nf-core tools
pip install nf-core

# Add a module to your pipeline
nf-core modules install fastqc
nf-core modules install trimgalore
nf-core modules install star/align

This command downloads the module into modules/nf-core/ and pins a specific version — making your pipeline auditable and reproducible.

Testing with nf-test

Untested pipelines accumulate hidden bugs. nf-test brings unit and integration testing to Nextflow:

// tests/modules/star_align.nf.test
nextflow_process {
    name "Test STAR_ALIGN"
    script "../../../modules/nf-core/star/align/main.nf"
    process "STAR_ALIGN"

    test("human - paired-end reads") {
        when {
            process {
                """
                input[0] = [
                    [ id:'test', single_end:false ],
                    [ file(params.test_data['homo_sapiens']['illumina']['test_paired_end_1_fastq_gz']),
                      file(params.test_data['homo_sapiens']['illumina']['test_paired_end_2_fastq_gz']) ]
                ]
                input[1] = file(params.test_data['homo_sapiens']['genome']['star_index'])
                input[2] = file(params.test_data['homo_sapiens']['genome']['gtf'])
                """
            }
        }

        then {
            assert process.success
            assert process.out.bam.size() == 1
            assert snapshot(process.out.log_final).match()
        }
    }
}

Run tests with:

nf-test test tests/modules/star_align.nf.test

Key Takeaways

Nextflow DSL2 gives you a principled way to build bioinformatics pipelines that are:

  • Modular — swap tools without rewriting the orchestration layer
  • Portable — one codebase, multiple compute environments
  • Reproducible — containers + version-pinned modules
  • Testable — nf-test enables process-level unit tests

For new projects I strongly recommend starting from the nf-core pipeline template — it gives you CI/CD, linting, docs, and test infrastructure out of the box.