Everyone in life sciences has heard about FAIR data — Findable, Accessible, Interoperable, Reusable. It’s become one of those principles that appears in grant applications and data management plans without always translating into practice. This post is about the practical side: what FAIR actually means for a bioinformatics project and how to implement it without adding unnecessary overhead.

The Four Principles, Grounded

The FAIR Guiding Principles were published in 2016. Let’s translate each principle into concrete bioinformatics terms:

FAIR Principle         │ What it means in practice
───────────────────────┼────────────────────────────────────────────────────
Findable               │ Data has a persistent identifier (DOI/accession)
                       │ + rich, searchable metadata
───────────────────────┼────────────────────────────────────────────────────
Accessible             │ Data retrievable via open, standard protocol (HTTPS,
                       │ FTP, SRA API) with clear authentication rules
───────────────────────┼────────────────────────────────────────────────────
Interoperable          │ Uses community-standard formats (BAM, VCF, AnnData)
                       │ and controlled vocabularies (ENCODE, OBI, EFO)
───────────────────────┼────────────────────────────────────────────────────
Reusable               │ Has a clear licence, provenance trail, and enough
                       │ context for someone else to reproduce your analysis

Findability: Persistent Identifiers and Rich Metadata

Repository Deposition

Every dataset generated in a project should be deposited in a domain-appropriate repository before publication:

Data type                 Repository           Accession format
──────────────────────────────────────────────────────────────────
Raw sequencing reads      NCBI SRA / EBI ENA   SRR*, ERR*
Processed omics data      GEO                  GSE*, GSM*
Protein structures        PDB                  4-character PDB ID
Computational workflows   WorkflowHub          wfhub:*
All other data            Zenodo               DOI

Zenodo is the catch-all: free, GitHub-integrated, and issues DOIs automatically for every version.

Connecting a GitHub Release to Zenodo

# 1. Link your repo at zenodo.org/account/settings/github
# 2. Enable the repository toggle
# 3. Create a GitHub release — Zenodo auto-mints a DOI

# Then add the badge to your README:
# [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.XXXXXXX.svg)](https://doi.org/10.5281/zenodo.XXXXXXX)
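Once a DOI exists, record metadata can be pulled back programmatically from Zenodo's REST API (`GET https://zenodo.org/api/records/<id>`). A minimal sketch of extracting citation fields from the response — the `record` dict below is an abbreviated, hypothetical example of the API's JSON shape, not a real record:

```python
import json

def extract_citation_fields(record: dict) -> dict:
    """Pull the DOI, title, and version out of a Zenodo record payload."""
    meta = record.get("metadata", {})
    return {
        "doi":     record.get("doi"),
        "title":   meta.get("title"),
        "version": meta.get("version"),
    }

# Abbreviated stand-in for what the API returns for one record
record = {
    "doi": "10.5281/zenodo.XXXXXXX",
    "metadata": {"title": "My bioinformatics pipeline", "version": "2.1.0"},
}
print(json.dumps(extract_citation_fields(record)))
```

This is handy for keeping a README badge or a CITATION file in sync with the latest minted version.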

Metadata Quality

Rich metadata is what makes your data findable by search engines and databases. For sequencing experiments, the ISA framework and MINSEQE standard define the minimum information required:

# Example: generating SRA-compatible metadata using pandas
import pandas as pd

metadata = pd.DataFrame({
    "sample_name":      ["sample_001", "sample_002"],
    "organism":         ["Homo sapiens", "Homo sapiens"],
    "tissue":           ["peripheral blood", "bone marrow"],
    "cell_type":        ["CD34+ HSC", "GMP"],
    "treatment":        ["untreated", "LPS 100ng/mL 4h"],
    "library_strategy": ["RNA-Seq", "RNA-Seq"],
    "library_source":   ["TRANSCRIPTOMIC SINGLE CELL", "TRANSCRIPTOMIC SINGLE CELL"],
    "instrument":       ["Illumina NovaSeq 6000", "Illumina NovaSeq 6000"],
    "sequencing_depth": [50000, 48000],  # reads per cell
    "orcid_submitter":  ["0000-0002-XXXX-XXXX", "0000-0002-XXXX-XXXX"],
})

metadata.to_csv("sra_metadata.tsv", sep="\t", index=False)
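Before submission it's worth a quick completeness check, since repositories reject rows with empty required fields. A small sketch — the required-field list here is illustrative, not SRA's official schema:

```python
import pandas as pd

# Illustrative subset of fields a submission portal typically requires
REQUIRED = ["sample_name", "organism", "library_strategy", "library_source", "instrument"]

def check_required(df: pd.DataFrame, required=REQUIRED) -> list[str]:
    """Return a list of problems: missing columns or empty cells in required fields."""
    problems = []
    for col in required:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif df[col].isna().any() or (df[col].astype(str).str.strip() == "").any():
            problems.append(f"empty values in: {col}")
    return problems

df = pd.DataFrame({"sample_name": ["s1", "s2"], "organism": ["Homo sapiens", ""]})
print(check_required(df))
```

Run it as a pre-submission gate in CI so metadata problems surface before the upload, not after.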

Accessibility: Standard Protocols and Programmatic Access

Data should be retrievable programmatically through open, standard protocols. Here’s how to access common repositories via API:

# Fetch metadata and download from GEO using GEOparse
import GEOparse

gse = GEOparse.get_GEO("GSE185224", destdir="/tmp/geo/")

for gsm_name, gsm in gse.gsms.items():
    print(f"Sample: {gsm_name}")
    print(f"  Title: {gsm.metadata['title'][0]}")
    print(f"  Organism: {gsm.metadata['organism_ch1'][0]}")
    print(f"  Source: {gsm.metadata['source_name_ch1'][0]}")

# Download SRA data programmatically with the SRA Toolkit
import subprocess

def download_sra(accession, output_dir, threads=8):
    """Download and convert SRA data to FASTQ."""
    cmd = [
        "prefetch", accession, "--output-directory", output_dir,
    ]
    subprocess.run(cmd, check=True)

    cmd = [
        "fasterq-dump", accession,
        "--outdir", output_dir,
        "--split-files",
        "--threads", str(threads),
    ]
    subprocess.run(cmd, check=True)
    print(f"Downloaded {accession} to {output_dir}")
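ENA offers an equivalent, toolkit-free route: the Portal API's `filereport` endpoint returns direct FASTQ download links for a run accession as TSV. A sketch of building the query URL, using ENA's documented `read_run` result fields:

```python
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"

def ena_fastq_report_url(accession: str) -> str:
    """Build a filereport query returning TSV with FASTQ links and checksums."""
    params = {
        "accession": accession,
        "result":    "read_run",
        "fields":    "run_accession,fastq_ftp,fastq_md5",
        "format":    "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"

print(ena_fastq_report_url("SRR000001"))
```

Fetching that URL with any HTTP client gives per-run FTP paths you can pass straight to `wget` or `aria2c`.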

For cloud-native access, NCBI’s open data programme hosts SRA runs on AWS S3, so cloud analyses can read the data in place instead of downloading it first:

# Stream a CRAM file directly from S3 without downloading
# (requires samtools/htslib built with network and S3 support)
samtools view -c \
    --reference hg38.fa \
    s3://sra-pub-run-odp/sra/SRR12345678/SRR12345678

Interoperability: Standard Formats and Controlled Vocabularies

File Formats

Choose formats that are widely readable across ecosystems:

Data type           Preferred format     Avoid
──────────────────────────────────────────────────────────────────
Raw reads           FASTQ.gz             Custom binary formats
Aligned reads       CRAM (or BAM)        SAM (too large), custom
Variant calls       VCF/BCF              Excel, CSV
Gene counts         MEX (10x) / AnnData  Custom TSV, proprietary
Single-cell data    AnnData (h5ad)       Seurat RDS (R-only)
Annotation          GFF3/GTF             Custom formats

AnnData is worth highlighting: it’s the lingua franca for single-cell data, readable in both Python (Scanpy) and R (via zellkonverter), and the underlying HDF5 format is self-describing and efficient for large datasets.

import scanpy as sc

# Save in AnnData format for maximum interoperability
adata.write_h5ad("dataset_processed.h5ad", compression="gzip")

# Embed provenance in the object itself
adata.uns["provenance"] = {
    "source_accession":   "GSE185224",
    "processing_date":    "2025-07-22",
    "pipeline_version":   "v2.1.0",
    "pipeline_doi":       "10.5281/zenodo.XXXXXXX",
    "genome_reference":   "GRCh38.p14",
    "annotation":         "GENCODE v44",
    "normalisation":      "scran pooling-based",
}

Controlled Vocabularies

Metadata fields should reference ontology terms rather than free text. This is what makes your data machine-readable by databases and AI systems:

# Bad: free text cell type annotation
adata.obs["cell_type"] = "monocyte"

# Good: ontology-backed annotation
adata.obs["cell_type"] = "classical monocyte"
adata.obs["cell_type_ontology_term_id"] = "CL:0000860"  # Cell Ontology

adata.obs["tissue"] = "peripheral blood"
adata.obs["tissue_ontology_term_id"] = "UBERON:0013756"

adata.obs["disease"] = "acute myeloid leukaemia"
adata.obs["disease_ontology_term_id"] = "MONDO:0018874"
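Ontology term IDs are easy to typo, so it pays to validate their CURIE shape before release. A rough sketch — the patterns below cover only the prefixes used here, not every ontology's ID scheme:

```python
import re

# Expected CURIE shapes for the ontology fields used above
CURIE_PATTERNS = {
    "cell_type_ontology_term_id": re.compile(r"^CL:\d{7}$"),
    "tissue_ontology_term_id":    re.compile(r"^UBERON:\d{7}$"),
    "disease_ontology_term_id":   re.compile(r"^MONDO:\d{7}$"),
}

def invalid_curies(obs: dict) -> list[str]:
    """Return the fields whose ontology term IDs don't match the expected shape."""
    return [
        field for field, pattern in CURIE_PATTERNS.items()
        if field in obs and not pattern.match(obs[field])
    ]

obs = {"cell_type_ontology_term_id": "CL:0000860",
       "disease_ontology_term_id":   "monocyte"}  # wrong: free text, not a CURIE
print(invalid_curies(obs))  # → ['disease_ontology_term_id']
```

A stricter check would resolve each ID against the ontology itself (e.g. via the OLS API) to catch well-formed but nonexistent terms.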

The CZ CELLxGENE schema formalises these requirements for single-cell data — follow it even if you’re not submitting to their portal.

Reusability: Licences, Provenance, and Reproducibility

Licensing

Data is not software — apply an appropriate data licence:

  • CC BY 4.0 — attribution required; good default for most research data
  • CC0 — no restrictions; preferred for reference databases
  • CC BY-NC 4.0 — non-commercial only; use with caution (limits reuse)

Add a LICENSE file to every repository and include a CITATION.cff for software citations:

# CITATION.cff
cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "My bioinformatics pipeline"
authors:
  - family-names: Sahu
    given-names: Sangram Keshari
    orcid: "https://orcid.org/0000-0002-XXXX-XXXX"
version: 2.1.0
date-released: 2025-07-22
doi: 10.5281/zenodo.XXXXXXX
repository-code: "https://github.com/sk-sahu/my-pipeline"
license: MIT
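A quick way to keep CITATION.cff honest in CI is a sanity check on its required top-level keys. This is a rough sketch, not a YAML parser — for real validation use a proper tool such as cffconvert:

```python
# Required keys per the CFF spec (illustrative subset)
REQUIRED_CFF_KEYS = {"cff-version", "title", "authors", "version"}

def missing_cff_keys(cff_text: str) -> set[str]:
    """Naive check: which required top-level keys are absent from the file text?"""
    present = {
        line.split(":", 1)[0].strip()
        for line in cff_text.splitlines()
        if line and not line[0].isspace() and ":" in line
    }
    return REQUIRED_CFF_KEYS - present

cff = 'cff-version: 1.2.0\ntitle: "My pipeline"\nauthors:\n  - family-names: Sahu\n'
print(missing_cff_keys(cff))  # the 'version' key is missing
```

Wiring this into a pre-release check means a tagged release can never ship with an incomplete citation file.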

Provenance Tracking

Every output file should record where it came from. There are several ways to achieve this:

Via Nextflow reports:

nextflow run main.nf -profile docker \
    -with-report execution_report.html \
    -with-trace trace.txt \
    -with-dag pipeline_dag.svg

The trace.txt file records every task, its inputs, outputs, runtime, and container — a complete audit trail.
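Because the trace file is plain tab-separated text, summarising a run takes only a few lines. A sketch assuming Nextflow's default trace columns (`name`, `status`, among others) — the `trace` string below is a synthetic, abbreviated example:

```python
import csv
import io
from collections import Counter

def task_status_counts(trace_tsv: str) -> Counter:
    """Count Nextflow task outcomes from a trace file's TSV content."""
    reader = csv.DictReader(io.StringIO(trace_tsv), delimiter="\t")
    return Counter(row["status"] for row in reader)

# Synthetic trace with two of the default columns
trace = "name\tstatus\nFASTQC (sample_001)\tCOMPLETED\nALIGN (sample_001)\tFAILED\n"
print(task_status_counts(trace))
```

In practice you'd read `trace.txt` from disk and fail the archive step if any task finished in a non-`COMPLETED` state.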

Via Python code using attrs or dataclasses:

from dataclasses import dataclass, field
from datetime import datetime, timezone
import json
import os

@dataclass
class AnalysisProvenance:
    tool:         str
    version:      str
    input_files:  list[str]
    output_files: list[str]
    parameters:   dict
    timestamp:    str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    user:         str = field(default_factory=lambda: os.environ.get("USER", "unknown"))

    def save(self, path):
        with open(path, 'w') as f:
            json.dump(self.__dict__, f, indent=2)


prov = AnalysisProvenance(
    tool="scanpy",
    version=sc.__version__,
    input_files=["raw_matrix.h5ad"],
    output_files=["processed.h5ad"],
    parameters={"n_pcs": 50, "n_neighbors": 15, "resolution": 0.5},
)
prov.save("provenance.json")

A Practical Checklist

Before submitting a paper, run through this:

FAIR Checklist for a Bioinformatics Paper
─────────────────────────────────────────
[ ] Raw data deposited in domain repository (SRA/ENA/GEO) with accession
[ ] Processed data available (Zenodo / Figshare) with DOI
[ ] Code in a public repository tagged with a release + DOI
[ ] Metadata uses controlled vocabulary (OBI, CL, UBERON, EFO)
[ ] File formats are open and community-standard (AnnData, VCF, FASTQ)
[ ] Pipeline is versioned and containerised (Docker/Singularity)
[ ] Execution report archived alongside the data
[ ] CITATION.cff present in all software repositories
[ ] Data licence explicitly stated (CC BY 4.0 or CC0)
[ ] README explains how to reproduce the analysis from raw data

FAIR is not an ideal to aspire to at the end of a project — it’s an engineering discipline to build in from the start. The overhead is real, but so is the payoff: reproducibility, citability, and datasets that future researchers (and future AI systems) can actually use.