Single-Cell Multi-omics Integration: Linking RNA, ATAC, and Protein Data

Measuring gene expression alone gives you a snapshot of cell state. Measuring chromatin accessibility alongside expression reveals the regulatory grammar underlying that state. Add surface proteins and you can link transcriptional identity to functional phenotype. Multi-omics integration is complex but increasingly tractable — this post shows you how to do it.

What Multi-omics Adds

Each single-cell modality captures a different layer of biology:

Modality          │ What it measures          │ Key insight
──────────────────┼───────────────────────────┼──────────────────────────────
scRNA-seq         │ Gene expression (mRNA)    │ Cell identity & state
scATAC-seq        │ Chromatin accessibility   │ Regulatory landscape
CITE-seq          │ Surface proteins (ADT)    │ Phenotypic marker quantification
scMethyl-seq      │ DNA methylation           │ Epigenetic silencing
Spatial omics     │ Expression + location     │ Tissue architecture

Combining modalities lets you ask questions no single layer can answer: Which open chromatin regions drive the transcription programmes that define this cell type? Which transcription factors are active based on both motif accessibility and their own expression?

Technology Overview

Three main experimental strategies for multi-omics:

Simultaneous measurement (same cell):

10x Multiome — scRNA + scATAC from the same nucleus
CITE-seq — scRNA + surface proteins (antibody-derived tags)
ASAP-seq — scATAC + surface proteins

Computational integration (different cells, matched populations):

Workflow: 10x Multiome (RNA + ATAC)

The 10x Multiome assay produces paired RNA and ATAC libraries from the same nuclei. Cell Ranger ARC processes both:

cellranger-arc count \
    --id=sample_multiome \
    --reference=/reference/refdata-cellranger-arc-GRCh38-2020-A-2.0.0 \
    --libraries=libraries.csv \
    --localcores=16 \
    --localmem=64

Where libraries.csv looks like:

fastqs,sample,library_type
/data/fastqs/rna,sample001,Gene Expression
/data/fastqs/atac,sample001,Chromatin Accessibility

The output includes:

filtered_feature_bc_matrix/ — RNA count matrix
atac_fragments.tsv.gz — ATAC fragment file (indexed)
per_barcode_metrics.csv — per-cell QC for both modalities

Downstream Analysis with Seurat v5

library(Seurat)
library(Signac)
library(GenomicRanges)

# Load the 10x Multiome output
counts <- Read10X_h5("sample_multiome/outs/filtered_feature_bc_matrix.h5")

# Create Seurat object with both assays
seurat_obj <- CreateSeuratObject(
    counts = counts$`Gene Expression`,
    assay  = "RNA"
)

# Add ATAC assay using Signac
frag_path <- "sample_multiome/outs/atac_fragments.tsv.gz"
seurat_obj[["ATAC"]] <- CreateChromatinAssay(
    counts    = counts$Peaks,
    sep       = c(":", "-"),
    fragments = frag_path,
    min.cells = 10,
    min.features = 200,
    genome    = "hg38"
)

Quality Control for Both Modalities

# RNA QC metrics
seurat_obj <- PercentageFeatureSet(seurat_obj, "^MT-", col.name = "pct_mito")

# ATAC QC metrics (requires Signac)
DefaultAssay(seurat_obj) <- "ATAC"
seurat_obj <- NucleosomeSignal(seurat_obj)   # nucleosome banding score
seurat_obj <- TSSEnrichment(seurat_obj)       # TSS enrichment score

# Filter cells that pass QC in both modalities
seurat_obj <- subset(
    seurat_obj,
    nFeature_RNA   > 500   & nFeature_RNA   < 8000 &
    pct_mito       < 20   &
    nFeature_ATAC  > 1000 & nFeature_ATAC  < 40000 &
    nucleosome_signal < 2 &
    TSS.enrichment    > 1
)

Weighted Nearest Neighbour (WNN) Joint Embedding

The key innovation in Seurat v5 for multi-omics is WNN: it learns a cell-specific weighting of each modality, so cells where one modality is less informative rely more on the other.

# Process RNA
DefaultAssay(seurat_obj) <- "RNA"
seurat_obj <- NormalizeData(seurat_obj) |>
    FindVariableFeatures() |>
    ScaleData() |>
    RunPCA(npcs = 50, reduction.name = "pca_rna")

# Process ATAC (using LSI)
DefaultAssay(seurat_obj) <- "ATAC"
seurat_obj <- RunTFIDF(seurat_obj) |>
    FindTopFeatures(min.cutoff = "q0") |>
    RunSVD(reduction.name = "lsi_atac")

# Jointly embed using WNN
seurat_obj <- FindMultiModalNeighbors(
    seurat_obj,
    reduction.list = list("pca_rna", "lsi_atac"),
    dims.list      = list(1:50, 2:40),   # exclude LSI dim 1 (correlated with depth)
    modality.weight.name = "RNA.weight"
)

seurat_obj <- RunUMAP(
    seurat_obj,
    nn.name      = "weighted.nn",
    reduction.name = "wnn.umap",
    reduction.key  = "wnnUMAP_"
)

seurat_obj <- FindClusters(seurat_obj, graph.name = "wsnn", resolution = 0.5)

Visualise the relative modality weights — this tells you which modality is more informative per cell:

library(ggplot2)

VlnPlot(
    seurat_obj,
    features = "RNA.weight",
    group.by = "seurat_clusters",
    pt.size  = 0
) + ggtitle("RNA contribution weight per cluster (1 = RNA-dominant)")

Linking ATAC Peaks to Genes

One of the most powerful analyses from Multiome data: finding which accessible chromatin regions correlate with gene expression across cells — a data-driven alternative to simple promoter-peak association.

# Find peaks correlated with gene expression
DefaultAssay(seurat_obj) <- "ATAC"
seurat_obj <- RegionStats(seurat_obj, genome = BSgenome.Hsapiens.UCSC.hg38)

links <- LinkPeaks(
    seurat_obj,
    peak.assay = "ATAC",
    expression.assay = "RNA",
    genes.use = c("CD34", "GATA2", "TAL1", "KIT"),  # genes of interest
    min.cells = 10,
    distance = 5e5   # search within 500 kb of TSS
)

# Visualise linked peaks around a gene
CoveragePlot(
    object    = links,
    region    = "GATA2",
    features  = "GATA2",
    expression.assay = "RNA",
    extend.upstream   = 10000,
    extend.downstream = 5000
)

CITE-seq: Adding Protein Data

For CITE-seq data (RNA + surface protein ADTs), the workflow is similar but uses different normalisation for the protein modality:

# Assuming seurat_obj already has RNA; add ADT (antibody-derived tags)
adt_counts <- Read10X("adt_count_matrix/")

seurat_obj[["ADT"]] <- CreateAssayObject(adt_counts[, colnames(seurat_obj)])

# Normalise ADT using centred log-ratio (CLR) — designed for compositional data
DefaultAssay(seurat_obj) <- "ADT"
seurat_obj <- NormalizeData(seurat_obj, normalization.method = "CLR", margin = 2)
seurat_obj <- ScaleData(seurat_obj)

# Add ADT to the WNN integration
seurat_obj <- RunPCA(seurat_obj, reduction.name = "pca_adt", npcs = 20)

seurat_obj <- FindMultiModalNeighbors(
    seurat_obj,
    reduction.list = list("pca_rna", "pca_adt"),
    dims.list      = list(1:30, 1:20),
    modality.weight.name = "RNA.weight"
)

Python Alternative: Muon

Muon is the Python counterpart — a multi-modal extension of AnnData that stores each modality as a separate AnnData object within a MuData container:

import muon as mu
import scanpy as sc

# Create MuData from separate AnnData objects
rna  = sc.read_h5ad("rna.h5ad")
atac = sc.read_h5ad("atac.h5ad")

mdata = mu.MuData({"rna": rna, "atac": atac})

# Process each modality
mu.pp.filter_obs(mdata, "rna:n_genes_by_counts", lambda x: (x > 500) & (x < 8000))
mu.pp.filter_obs(mdata, "atac:n_features_by_counts", lambda x: (x > 1000) & (x < 40000))
mu.pp.intersect_obs(mdata)   # keep only cells passing QC in both modalities

# Normalise RNA
sc.pp.normalize_total(mdata["rna"])
sc.pp.log1p(mdata["rna"])
sc.pp.highly_variable_genes(mdata["rna"])
sc.tl.pca(mdata["rna"])

# Normalise ATAC (TF-IDF + LSI)
mu.atac.pp.tfidf(mdata["atac"])
mu.atac.tl.lsi(mdata["atac"])

# Multi-modal neighbour graph and UMAP
mu.pp.neighbors(mdata, key_added="multi")
mu.tl.umap(mdata)
mu.pl.umap(mdata, color=["rna:CD34", "atac:n_features_by_counts"])

Common Integration Pitfalls

Don’t use LSI dimension 1. The first LSI component from scATAC-seq is almost always correlated with sequencing depth rather than biology. Always start from dimension 2 (dims.list = list(2:40)).

QC both modalities independently. A cell that passes RNA QC but not ATAC QC (or vice versa) should be removed — a low-quality ATAC profile will degrade your joint embedding.

Cell type annotation from RNA, regulatory interpretation from ATAC. RNA gives cleaner cell type clusters. Once you have annotations, use the ATAC data to understand what regulatory programs are active in each cell type — don’t try to do both simultaneously in an unsupervised way.

Batch effects compound across modalities. If you have batches, correct them in each modality independently before integration. Harmony works well for RNA; batch-correction for ATAC is more complex (consider ArchR’s iterative LSI approach).

Multi-omics data is expensive to generate and rich with information. Investing in careful QC and thoughtful integration pays dividends in biological insight that no single modality could provide alone.