aubreybailey/ffq_mcp

FFQ MCP Server

A Model Context Protocol (MCP) server that fetches sequencing metadata from public genomics databases with ffq and runs kallisto quantification on the resulting data.

Quick Start (5 minutes)

For Claude Code (CLI)

1. Clone and setup environment:

git clone https://github.com/gmtrash/ffq_mcp.git
cd ffq_mcp
conda env create -f environment.yml
conda activate ffq-mcp

2. Get the paths:

which python
# Copy this output (e.g., /home/user/miniconda3/envs/ffq-mcp/bin/python)

pwd
# Copy this output (e.g., /home/user/ffq_mcp)

3. Add to Claude Code:

# General syntax:
claude mcp add --transport stdio --scope user ffq -- \
  <PYTHON_PATH> \
  <REPO_PATH>/src/ffq_server.py

# Example (replace with YOUR actual paths):
claude mcp add --transport stdio --scope user ffq -- \
  /home/aubreybailey/miniforge3/envs/ffq-mcp/bin/python \
  /home/aubreybailey/code/ffq_mcp/src/ffq_server.py

Important flags:

  • --transport stdio - Required: Specifies stdio transport for MCP
  • --scope user - Makes it available in all your projects
  • -- - Required separator between flags and command/args
  • Use --scope project if you only want it in this specific project

4. Verify:

claude mcp list  # Should show "ffq"

5. Test it: In any Claude Code session, ask: "What tools do you have from the ffq server?"

You should see 10 tools including fetch_metadata, summarize_experimental_design, and run_kallisto_quantification.


For Claude Desktop (GUI)

1. Clone and install:

git clone https://github.com/gmtrash/ffq_mcp.git
cd ffq_mcp
conda env create -f environment.yml
conda activate ffq-mcp

2. Find your Python path:

which python
# Copy this output (e.g., /Users/you/miniconda3/envs/ffq-mcp/bin/python)

3. Get your repo path:

pwd
# Copy this output (e.g., /Users/you/ffq_mcp)

4. Configure Claude Desktop:

Open config file:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

Add this (replace paths with your actual paths from steps 2 and 3):

{
  "mcpServers": {
    "ffq": {
      "command": "/Users/you/miniconda3/envs/ffq-mcp/bin/python",
      "args": ["/Users/you/ffq_mcp/src/ffq_server.py"]
    }
  }
}
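
If you would rather script this step than edit the file by hand, the following is a minimal sketch that merges an "ffq" entry into an existing configuration dictionary without touching other servers. The function name add_ffq_server and the example paths are illustrative assumptions, not part of this repository.

```python
import json


def add_ffq_server(config: dict, python_path: str, script_path: str) -> dict:
    """Return a copy of `config` with an "ffq" entry under "mcpServers".

    Servers already present in the config are preserved.
    add_ffq_server is a hypothetical helper, not shipped with ffq_mcp.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["ffq"] = {"command": python_path, "args": [script_path]}
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # Example input: a config that already defines another server.
    existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
    updated = add_ffq_server(
        existing,
        "/Users/you/miniconda3/envs/ffq-mcp/bin/python",  # replace with your path
        "/Users/you/ffq_mcp/src/ffq_server.py",           # replace with your path
    )
    print(json.dumps(updated, indent=2))
```

Load the real file with json.load, pass the result through the helper, and write it back with json.dump to apply the change.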

5. Restart Claude Desktop

6. Test it: Ask Claude: "What tools do you have from the ffq server?"

You should see 10 tools including fetch_metadata, summarize_experimental_design, and run_kallisto_quantification.


Troubleshooting

"error: unknown option '--command'"

  • You're using old syntax. Use --transport stdio and -- separator instead.
  • Correct: claude mcp add --transport stdio --scope user ffq -- <python> <script>

Server shows "disconnected" in /mcp

  • Check your Python path is correct: which python (with conda env activated)
  • Make sure the script path points to src/ffq_server.py
  • Try removing and re-adding: claude mcp remove ffq then add again

"ffq not found" or "kallisto not found"

  • Make sure conda environment is created: conda env create -f environment.yml
  • The MCP configuration stores the full path to the environment's Python, so the conda environment doesn't need to be activated when running Claude Code
  • Verify installation: Run <your-python-path> -m pip list | grep ffq

Need to update after git pull?

  • No need to re-add the MCP server
  • Just restart Claude Code or use /mcp to reconnect
  • The server automatically uses the latest code from your repo

Overview

This MCP server enables Claude and other MCP clients to:

  1. Fetch metadata and download links for sequencing data from:

    • GEO (Gene Expression Omnibus)
    • SRA (Sequence Read Archive)
    • EMBL-EBI (European Bioinformatics Institute)
    • DDBJ (DNA Data Bank of Japan)
    • NIH Biosample
    • ENCODE (Encyclopedia of DNA Elements)
  2. Run kallisto quantification to create analysis-ready transcript-level quantitation matrices from RNA-seq data

Features

The server provides 10 main tools:

1. fetch_metadata

Fetch comprehensive metadata for sequencing data using accession numbers or DOIs.

Supported accession formats:

  • SRA: SRR, SRX, SRS, SRP
  • GEO: GSE, GSM
  • ENCODE accessions
  • Bioproject codes
  • Biosample IDs
  • DOIs

Example:

{
  "accession": "GSE123456",
  "level": 2
}

2. fetch_download_links

Retrieve direct download URLs for sequencing data files from various providers.

Supported providers:

  • ftp - FTP servers
  • aws - Amazon Web Services
  • gcp - Google Cloud Platform
  • ncbi - NCBI servers

Example:

{
  "accession": "SRR123456",
  "provider": "aws"
}
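
For orientation, the level and provider arguments shown so far correspond to options of the ffq command line (-l, --ftp, --aws, --gcp, --ncbi). The sketch below only builds such a command; it assumes the server shells out to the ffq CLI, which this README does not confirm, and build_ffq_command is a hypothetical name.

```python
from typing import List, Optional, Sequence

# ffq CLI flags that restrict output to download links from one provider.
PROVIDER_FLAGS = {"ftp": "--ftp", "aws": "--aws", "gcp": "--gcp", "ncbi": "--ncbi"}


def build_ffq_command(
    accessions: Sequence[str],
    level: Optional[int] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Build an ffq command line as an argument list (hypothetical helper)."""
    if not accessions:
        raise ValueError("at least one accession is required")
    cmd = ["ffq"]
    if level is not None:
        cmd += ["-l", str(level)]
    if provider is not None:
        if provider not in PROVIDER_FLAGS:
            raise ValueError(f"unknown provider: {provider!r}")
        cmd.append(PROVIDER_FLAGS[provider])
    cmd += list(accessions)
    return cmd


if __name__ == "__main__":
    print(build_ffq_command(["GSE123456"], level=2))
    print(build_ffq_command(["SRR123456"], provider="aws"))
```

Passing the returned list to subprocess.run(cmd, capture_output=True, text=True) would execute it; that step is left out here because it needs ffq installed and network access.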

3. batch_fetch_metadata

Fetch metadata for multiple accessions simultaneously.

Example:

{
  "accessions": ["GSM123456", "GSM123457", "GSM123458"],
  "level": 1
}

4. extract_sra_gsm_numbers

Extract all SRA and GSM accession numbers from a study or publication.

Example:

{
  "accession": "GSE123456"
}

Output:

{
  "query": "GSE123456",
  "sra_accessions": ["SRR123456", "SRR123457"],
  "gsm_accessions": ["GSM123456", "GSM123457"],
  "gse_accessions": ["GSE123456"],
  "all_accessions": ["GSE123456", "GSM123456", "GSM123457", "SRR123456", "SRR123457"]
}

5. extract_accessions_from_publication

Extract sequencing data accession numbers from a publication URL or text.

This tool automatically scans publications and extracts:

  • SRA: SRR, SRX, SRS, SRP (NCBI Sequence Read Archive)
  • GEO: GSE, GSM, GPL (Gene Expression Omnibus)
  • ENA: ERR, ERX, ERS, ERP (European Nucleotide Archive)
  • BioProject: PRJNA, PRJEB, PRJDB

Works with:

  • PubMed Central articles
  • Journal websites (Nature, Science, Cell, etc.)
  • Preprint servers (bioRxiv, medRxiv)
  • DOI links
  • Plain text

Example (from URL):

{
  "source": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1234567/",
  "source_type": "url"
}

Example (from text):

{
  "source": "The RNA-seq data are available at GEO under accession GSE123456. Individual samples are SRR123456, SRR123457, and SRR123458.",
  "source_type": "text"
}

Output:

{
  "source": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1234567/",
  "source_type": "url",
  "accessions_found": ["GSE123456", "SRR123456", "SRR123457", "SRR123458"],
  "count": 4,
  "by_database": {
    "sra": ["SRR123456", "SRR123457", "SRR123458"],
    "geo": ["GSE123456"],
    "ena": [],
    "bioproject": []
  },
  "by_type": {
    "SRR": ["SRR123456", "SRR123457", "SRR123458"],
    "GSE": ["GSE123456"]
  }
}
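
The output structure above can be reproduced for plain text with a single regular expression over the listed prefixes. This is a minimal sketch of the idea, not the server's actual implementation; extract_accessions and the pattern are assumptions, and fetching a URL is left out.

```python
import re

# Accession prefixes grouped by database, as listed in this section.
PREFIXES = {
    "sra": ["SRR", "SRX", "SRS", "SRP"],
    "geo": ["GSE", "GSM", "GPL"],
    "ena": ["ERR", "ERX", "ERS", "ERP"],
    "bioproject": ["PRJNA", "PRJEB", "PRJDB"],
}
PREFIX_TO_DB = {prefix: db for db, prefixes in PREFIXES.items() for prefix in prefixes}
ACCESSION_RE = re.compile(r"\b(" + "|".join(PREFIX_TO_DB) + r")(\d+)\b")


def extract_accessions(text: str) -> dict:
    """Find accessions in `text`, de-duplicated in order of first appearance."""
    found = {}  # accession -> prefix; dicts preserve insertion order
    for match in ACCESSION_RE.finditer(text):
        found.setdefault(match.group(0), match.group(1))

    by_database = {db: [] for db in PREFIXES}
    by_type = {}
    for accession, prefix in found.items():
        by_database[PREFIX_TO_DB[prefix]].append(accession)
        by_type.setdefault(prefix, []).append(accession)

    return {
        "accessions_found": list(found),
        "count": len(found),
        "by_database": by_database,
        "by_type": by_type,
    }


if __name__ == "__main__":
    sample = (
        "The RNA-seq data are available at GEO under accession GSE123456. "
        "Individual samples are SRR123456, SRR123457, and SRR123458."
    )
    print(extract_accessions(sample))
```

The word boundaries keep the pattern from matching inside longer tokens, and matching is case-sensitive, so lowercase mentions such as "gse123456" are ignored in this sketch.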

6. check_environment

Check if all required tools and dependencies are installed for the FFQ MCP server.

This tool validates:

  • Command-line tools: ffq, kallisto, wget, AWS CLI
  • Python packages: mcp, h5py, scipy, numpy
  • Provides OS-specific installation instructions for missing dependencies

Example:

{
  "verbose": true
}

Output:

{
  "overall_status": "ready",
  "operating_system": {
    "system": "Linux",
    "release": "5.15.0",
    "machine": "x86_64"
  },
  "command_line_tools": {
    "ffq": {"installed": true, "version": "0.3.1"},
    "kallisto": {"installed": true, "version": "0.50.1"},
    "wget": {"installed": true, "version": "1.21.3"},
    "aws": {"installed": true, "version": "2.13.0"}
  },
  "python_packages": {
    "mcp": {"installed": true, "version": "1.0.0"},
    "h5py": {"installed": true, "version": "3.9.0"},
    "scipy": {"installed": true, "version": "1.11.0"},
    "numpy": {"installed": true, "version": "1.24.3"}
  },
  "missing": []
}

If dependencies are missing, the response includes detailed installation instructions tailored to your operating system.
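
A check of this kind can be approximated with the standard library alone. The sketch below reports only whether each tool or package can be found; version detection and installation hints are omitted, and the "missing_dependencies" status string is an assumption (only "ready" appears in this README).

```python
import importlib.util
import shutil


def check_tools(names):
    """Report whether each command-line tool is on PATH."""
    return {name: {"installed": shutil.which(name) is not None} for name in names}


def check_packages(names):
    """Report whether each top-level Python package can be imported."""
    return {
        name: {"installed": importlib.util.find_spec(name) is not None}
        for name in names
    }


def summarize(tools, packages):
    """Combine both reports into a structure similar to the output above."""
    missing = [
        name
        for name, info in {**tools, **packages}.items()
        if not info["installed"]
    ]
    return {
        # "missing_dependencies" is a hypothetical status value.
        "overall_status": "ready" if not missing else "missing_dependencies",
        "command_line_tools": tools,
        "python_packages": packages,
        "missing": missing,
    }


if __name__ == "__main__":
    report = summarize(
        check_tools(["ffq", "kallisto", "wget", "aws"]),
        check_packages(["mcp", "h5py", "scipy", "numpy"]),
    )
    print(report["overall_status"], report["missing"])
```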

7. fetch_reference_transcriptome

Download reference transcriptome FASTA and GTF annotation files from Ensembl or GENCODE.

Supports:

  • Organisms: Human (GRCh38), Mouse (GRCm39/GRCm38)
  • Sources: Ensembl (comprehensive), GENCODE (high-quality human/mouse annotations)
  • Automatic latest release detection
  • Optional GTF annotation download

Example:

{
  "organism": "human",
  "source": "ensembl",
  "output_dir": "./references",
  "include_gtf": true
}

Output:

{
  "status": "success",
  "organism": "human",
  "source": "ensembl",
  "release": "110",
  "output_directory": "./references",
  "files_downloaded": {
    "fasta": "./references/Homo_sapiens.GRCh38.cdna.all.fa.gz",
    "gtf": "./references/Homo_sapiens.GRCh38.110.gtf.gz"
  }
}
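
To show where such files come from, here is a sketch that builds Ensembl download URLs for a given release. The cDNA path matches the manual wget example later in this README; the GTF path follows Ensembl's usual FTP layout and should be verified before use. ensembl_urls is a hypothetical helper, the release number is supplied by the caller (no automatic detection), and the mouse entry assumes a release that ships GRCm39.

```python
ENSEMBL_BASE = "http://ftp.ensembl.org/pub"

# organism -> (FTP directory, file name stem, assembly)
SPECIES = {
    "human": ("homo_sapiens", "Homo_sapiens", "GRCh38"),
    "mouse": ("mus_musculus", "Mus_musculus", "GRCm39"),
}


def ensembl_urls(organism: str, release: int, include_gtf: bool = True) -> dict:
    """Return download URLs for the cDNA FASTA and, optionally, the GTF."""
    if organism not in SPECIES:
        raise ValueError(f"unsupported organism: {organism!r}")
    directory, stem, assembly = SPECIES[organism]
    root = f"{ENSEMBL_BASE}/release-{release}"
    urls = {
        "fasta": f"{root}/fasta/{directory}/cdna/{stem}.{assembly}.cdna.all.fa.gz"
    }
    if include_gtf:
        urls["gtf"] = f"{root}/gtf/{directory}/{stem}.{assembly}.{release}.gtf.gz"
    return urls


if __name__ == "__main__":
    for kind, url in ensembl_urls("human", 110).items():
        print(kind, url)
```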

8. build_kallisto_index

Build a kallisto index from a reference transcriptome FASTA file.

The index is required for quantification and needs to be built once per reference. Index building typically takes 5-10 minutes for human/mouse transcriptomes.

Example:

{
  "fasta_path": "./references/Homo_sapiens.GRCh38.cdna.all.fa.gz",
  "index_path": "./references/Homo_sapiens.GRCh38.idx",
  "kmer_size": 31
}

Parameters:

  • fasta_path (required): Path to transcriptome FASTA file (gzipped or uncompressed)
  • index_path (optional): Output path for index (auto-generated if not specified)
  • kmer_size (optional): K-mer size for index (default: 31)
    • Use 31 for reads ≥50bp (standard RNA-seq)
    • Use 15-21 for short reads <50bp

Output:

{
  "status": "success",
  "index_path": "./references/Homo_sapiens.GRCh38.idx",
  "fasta_path": "./references/Homo_sapiens.GRCh38.cdna.all.fa.gz",
  "kmer_size": 31,
  "index_size_mb": 3842.5,
  "kallisto_version": "kallisto 0.50.1"
}
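
The three parameters map directly onto kallisto index options (-i for the index path, -k for the k-mer size). The sketch below builds that command and checks the k-mer size, which kallisto requires to be odd and at most 31. The helper names and the rule used to auto-generate an index path are assumptions; this README does not say how the server derives the default.

```python
from pathlib import Path


def default_index_path(fasta_path: str) -> str:
    """Derive an index path next to the FASTA file (hypothetical naming rule)."""
    name = Path(fasta_path).name
    for suffix in (".gz", ".fa", ".fasta"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return str(Path(fasta_path).with_name(name + ".idx"))


def build_index_command(fasta_path: str, index_path: str = None, kmer_size: int = 31):
    """Return the `kallisto index` argument list for the given inputs."""
    if kmer_size % 2 == 0 or not 1 <= kmer_size <= 31:
        raise ValueError("kmer_size must be odd and no larger than 31")
    if index_path is None:
        index_path = default_index_path(fasta_path)
    return ["kallisto", "index", "-i", index_path, "-k", str(kmer_size), fasta_path]


if __name__ == "__main__":
    print(build_index_command("references/Homo_sapiens.GRCh38.cdna.all.fa.gz"))
```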

9. run_kallisto_quantification

Run kallisto quantification on RNA-seq data and create transcript-level quantitation matrices.

This tool automates the complete workflow:

  1. Fetches FASTQ files using ffq metadata
  2. Downloads files from cloud storage (AWS S3, GCP, FTP)
  3. Runs kallisto quant on each sample
  4. Aggregates results into TPM and counts matrices

Example:

{
  "accessions": ["SRR123456", "SRR123457", "SRR123458"],
  "index_path": "/path/to/transcriptome.idx",
  "output_dir": "/path/to/output",
  "threads": 8,
  "provider": "aws"
}

Parameters:

  • accessions (required): List of SRA run accessions
  • index_path (required): Path to kallisto index file
  • output_dir (optional): Output directory (temp dir if not specified)
  • threads (optional): Number of threads (default: 4)
  • bootstrap_samples (optional): Bootstrap samples for uncertainty (default: 0)
  • provider (optional): Download provider - aws, gcp, ftp, or ncbi (default: aws)
  • output_format (optional): Output format - tsv, hdf5, or 10x (default: tsv)
    • tsv: Tab-separated text files
    • hdf5: HDF5 binary format (compatible with HDF5Array in R/Bioconductor)
    • 10x: 10x Genomics sparse matrix format (matrix.mtx, features.tsv, barcodes.tsv)

Output:

{
  "status": "success",
  "summary": {
    "samples_processed": 3,
    "num_transcripts": 178123,
    "num_samples": 3
  },
  "output_directory": "/path/to/output",
  "matrices": {
    "tpm_matrix": "/path/to/output/tpm_matrix.tsv",
    "counts_matrix": "/path/to/output/counts_matrix.tsv"
  },
  "samples": ["SRR123456", "SRR123457", "SRR123458"]
}
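
The aggregation step (step 4 of the workflow) can be sketched with the standard library. Each kallisto quant run writes an abundance.tsv with the columns target_id, length, eff_length, est_counts and tpm; the code below combines those files into the two TSV matrices. The function names and the layout (transcripts as rows, samples as columns) are assumptions, not the server's confirmed behaviour.

```python
import csv
import tempfile
from pathlib import Path


def read_abundance(path):
    """Return (transcript_ids, est_counts, tpm) from a kallisto abundance.tsv."""
    ids, counts, tpm = [], [], []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            ids.append(row["target_id"])
            counts.append(float(row["est_counts"]))
            tpm.append(float(row["tpm"]))
    return ids, counts, tpm


def write_matrices(sample_dirs, output_dir):
    """Combine per-sample results ({sample: kallisto output dir}) into matrices."""
    if not sample_dirs:
        raise ValueError("no samples to aggregate")
    samples = list(sample_dirs)
    transcript_ids = None
    columns = {"tpm": [], "counts": []}
    for sample in samples:
        ids, counts, tpm = read_abundance(Path(sample_dirs[sample]) / "abundance.tsv")
        if transcript_ids is None:
            transcript_ids = ids
        elif ids != transcript_ids:
            raise ValueError(f"{sample} was quantified against a different index")
        columns["tpm"].append(tpm)
        columns["counts"].append(counts)

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for kind, per_sample in columns.items():
        path = out / f"{kind}_matrix.tsv"
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
            writer.writerow(["transcript_id", *samples])
            for i, transcript in enumerate(transcript_ids):
                writer.writerow([transcript, *(values[i] for values in per_sample)])
        paths[f"{kind}_matrix"] = str(path)
    return paths


if __name__ == "__main__":
    # Demo with two tiny, made-up abundance files.
    header = "target_id\tlength\teff_length\test_counts\ttpm\n"
    with tempfile.TemporaryDirectory() as tmp:
        for name, tpm in (("SRR1", "10.0"), ("SRR2", "20.0")):
            run_dir = Path(tmp) / name
            run_dir.mkdir()
            (run_dir / "abundance.tsv").write_text(header + f"T1\t100\t50\t5\t{tpm}\n")
        dirs = {name: Path(tmp) / name for name in ("SRR1", "SRR2")}
        print(Path(write_matrices(dirs, Path(tmp) / "out")["tpm_matrix"]).read_text())
```

Refusing to merge samples whose transcript lists differ guards against mixing runs quantified against different indices.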

Safety Features:

  • Detects UMI (Unique Molecular Identifier) usage and blocks quantification if UMIs are detected
  • UMI protocols (10x Genomics, Drop-seq, etc.) require deduplication before quantification
  • Provides clear error messages suggesting appropriate preprocessing tools (UMI-tools, zUMIs)

10. summarize_experimental_design

Parse experimental design from study metadata to identify sample groupings.

This tool extracts sample characteristics and groups samples by experimental conditions, making it easy for the LLM to understand the experimental structure.

Example:

{
  "accession": "GSE123456"
}

Output:

{
  "design_summary": "6 samples with treatment: 3 control vs 3 dexamethasone",
  "total_samples": 6,
  "variables": {
    "treatment": {
      "control": ["GSM001", "GSM002", "GSM003"],
      "dexamethasone": ["GSM004", "GSM005", "GSM006"]
    },
    "replicate": {
      "1": ["GSM001", "GSM004"],
      "2": ["GSM002", "GSM005"],
      "3": ["GSM003", "GSM006"]
    }
  }
}

Detects:

  • Treatment/control groups
  • Genotype differences
  • Tissue types, cell types
  • Time series experiments
  • Replicates
  • Other experimental factors (age, sex, disease, etc.)
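
The grouping shown in the output above amounts to inverting a per-sample table of characteristics. Here is a minimal sketch, assuming the characteristics have already been parsed into a dictionary; group_samples and describe are hypothetical helpers, and real GEO metadata needs more cleaning than this.

```python
from collections import defaultdict


def group_samples(characteristics):
    """Invert {sample: {variable: value}} into {variable: {value: [samples]}}."""
    variables = defaultdict(lambda: defaultdict(list))
    for sample, fields in characteristics.items():
        for variable, value in fields.items():
            variables[variable][str(value)].append(sample)
    return {variable: dict(groups) for variable, groups in variables.items()}


def describe(variables, variable):
    """Build a one-line summary such as '6 samples with treatment: 3 a vs 3 b'."""
    groups = variables[variable]
    total = sum(len(samples) for samples in groups.values())
    parts = " vs ".join(f"{len(samples)} {value}" for value, samples in groups.items())
    return f"{total} samples with {variable}: {parts}"


if __name__ == "__main__":
    # Made-up characteristics matching the example output above.
    samples = {
        "GSM001": {"treatment": "control", "replicate": 1},
        "GSM002": {"treatment": "control", "replicate": 2},
        "GSM003": {"treatment": "control", "replicate": 3},
        "GSM004": {"treatment": "dexamethasone", "replicate": 1},
        "GSM005": {"treatment": "dexamethasone", "replicate": 2},
        "GSM006": {"treatment": "dexamethasone", "replicate": 3},
    }
    variables = group_samples(samples)
    print(describe(variables, "treatment"))
```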

Quick Start

New users: Run the interactive setup wizard:

python setup.py

This wizard will:

  • Check your system for installed dependencies
  • Recommend the best installation method for your OS
  • Provide step-by-step installation instructions
  • Validate your setup

Existing users: Use the check_environment tool within Claude to verify your setup:

"Check my environment for the FFQ MCP server"

Installation

You can install and run this server in four ways:

Option 1: Conda Environment (Recommended)

Using conda ensures all dependencies are correctly installed:

# Clone the repository
git clone https://github.com/gmtrash/ffq_mcp.git
cd ffq_mcp

# Create conda environment
conda env create -f environment.yml

# Activate environment
conda activate ffq-mcp

# Test that it works
python src/ffq_server.py
# Press Ctrl+C to exit

Option 2: Docker Container

Use Docker for isolated, reproducible deployments:

# Build the Docker image
docker build -t ffq-mcp .

# Run the container
docker run -i ffq-mcp

# Or with volume mounts for data
docker run -i -v /path/to/data:/data -v /path/to/index:/index ffq-mcp

For Claude Desktop with Docker:

{
  "mcpServers": {
    "ffq": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "ffq-mcp"]
    }
  }
}

Option 3: Singularity Container (HPC Environments)

Singularity is ideal for HPC clusters that don't support Docker:

# Build Singularity image
sudo singularity build ffq_mcp.sif ffq_mcp.def

# Run the container
singularity run ffq_mcp.sif

For Claude Desktop with Singularity:

{
  "mcpServers": {
    "ffq": {
      "command": "singularity",
      "args": ["run", "/path/to/ffq_mcp.sif"]
    }
  }
}

Option 4: Manual Installation (pip)

If you prefer manual installation:

Prerequisites:

  • Python 3.10 or higher
  • ffq (for metadata fetching)
  • kallisto (for quantification)
  • wget or AWS CLI (for downloading FASTQ files)

# Run from a clone of the repository
cd ffq_mcp
pip install -r requirements.txt
pip install ffq
conda install -c bioconda kallisto  # or install from source

Prepare a kallisto index

You can prepare a kallisto index in two ways:

Option 1: Using the MCP tools (Recommended)

Ask Claude to download and index the reference:

"Download the human reference transcriptome from Ensembl and build a kallisto index"

Claude will call:

  1. fetch_reference_transcriptome to download the FASTA file
  2. build_kallisto_index to create the index

Option 2: Manual preparation

# Download a reference transcriptome (e.g., human from Ensembl)
wget http://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

# Build kallisto index
kallisto index -i Homo_sapiens.GRCh38.idx Homo_sapiens.GRCh38.cdna.all.fa.gz

Configuration

For Claude Desktop

Add the following to your Claude Desktop configuration file:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json

Using Conda (Recommended):

{
  "mcpServers": {
    "ffq": {
      "command": "/path/to/conda/envs/ffq-mcp/bin/python",
      "args": [
        "/absolute/path/to/ffq_mcp/src/ffq_server.py"
      ]
    }
  }
}

Using System Python:

{
  "mcpServers": {
    "ffq": {
      "command": "python3",
      "args": [
        "/absolute/path/to/ffq_mcp/src/ffq_server.py"
      ]
    }
  }
}

Finding your conda path:

conda activate ffq-mcp
which python  # Use this full path in the config

For Other MCP Clients

Use the server via stdio transport:

python /path/to/ffq_mcp/src/ffq_server.py

Usage Examples

Understand experimental design automatically

User: "What's the experimental design for GSE63525?"

Claude: Uses summarize_experimental_design and responds:

"This study has 6 samples with treatment groups:

  • Control: 3 samples (GSM001, GSM002, GSM003)
  • Dexamethasone: 3 samples (GSM004, GSM005, GSM006)

This is a classic 3v3 treatment vs control design with 3 biological replicates per condition."

💡 Why this is useful: Claude can now automatically understand experimental structure without you having to manually explain it. Perfect for downstream analysis planning!


Extract SRA numbers from a GEO study

User: "What are the SRA run accessions for GSE123456?"

Claude (using the MCP server):

I'll extract the SRA accessions from GSE123456.

[Calls extract_sra_gsm_numbers with accession: GSE123456]

The study GSE123456 contains the following SRA runs:
- SRR123456
- SRR123457
- SRR123458

Fetch download links for sequencing data

User: "Get me the AWS download links for SRR123456"

Claude:

[Calls fetch_download_links with accession: SRR123456, provider: aws]

Here are the AWS S3 download links for SRR123456:
- s3://sra-pub-run-odp/sra/SRR123456/SRR123456_1.fastq.gz
- s3://sra-pub-run-odp/sra/SRR123456/SRR123456_2.fastq.gz

Fetch metadata from a DOI

User: "What sequencing data is associated with DOI 10.1038/example.2024?"

Claude:

[Calls fetch_metadata with accession: 10.1038/example.2024]

This publication includes sequencing data from:
- Study: SRP123456
- Samples: 12 biological replicates
- Runs: 24 paired-end RNA-seq runs
- Total size: 156.3 GB

Analyze data from a publication

User: "Analyze the RNA-seq data from this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1234567/"

Claude:

[Step 1: Extract accessions from paper]
Calls extract_accessions_from_publication

Found accessions:
- GSE123456 (GEO Series)
- SRR123456, SRR123457, SRR123458 (SRA Runs)

[Step 2: Get detailed metadata]
Calls extract_sra_gsm_numbers with GSE123456

[Step 3: Downloads reference and builds index]
[Step 4: Runs quantification]

Analysis complete! Created expression matrices from the published data.

Complete workflow: Reference download to quantification

User: "Download the human reference from Ensembl, build an index, and quantify GSE123456"

Claude:

[Step 1: Downloads reference]
Calls fetch_reference_transcriptome with organism: human, source: ensembl

Downloaded:
- Homo_sapiens.GRCh38.cdna.all.fa.gz (170 MB)
- Homo_sapiens.GRCh38.110.gtf.gz (52 MB)

[Step 2: Builds index]
Calls build_kallisto_index with fasta_path: Homo_sapiens.GRCh38.cdna.all.fa.gz

Created kallisto index: Homo_sapiens.GRCh38.idx (3.8 GB)

[Step 3: Extracts SRA accessions]
Calls extract_sra_gsm_numbers with accession: GSE123456

Found 6 SRA runs: SRR123456-SRR123461

[Step 4: Runs quantification]
Calls run_kallisto_quantification with:
  accessions: ["SRR123456", "SRR123457", "SRR123458", "SRR123459", "SRR123460", "SRR123461"],
  index_path: "Homo_sapiens.GRCh38.idx",
  threads: 8,
  output_format: "hdf5"

Processing complete! Created quantitation matrices:
- TPM matrix: 6 samples x 178,123 transcripts
- Counts matrix: 6 samples x 178,123 transcripts

Output saved in HDF5 format:
- tpm_matrix.h5
- counts_matrix.h5

Output format examples

TSV format (default):

tpm_matrix.tsv
counts_matrix.tsv

HDF5 format (for R/Bioconductor HDF5Array):

tpm_matrix.h5
counts_matrix.h5

Each HDF5 file contains:

  • matrix: Expression values
  • features: Transcript IDs
  • barcodes: Sample IDs

10x Genomics format (for Seurat, Scanpy):

tpm_10x/
  ├── matrix.mtx
  ├── features.tsv
  └── barcodes.tsv
counts_10x/
  ├── matrix.mtx
  ├── features.tsv
  └── barcodes.tsv

Downstream Analysis

Reading HDF5 format in R

library(HDF5Array)
library(rhdf5)  # provides h5read()

# Read TPM matrix
tpm_h5 <- HDF5Array("tpm_matrix.h5", name = "matrix")
features <- as.character(h5read("tpm_matrix.h5", "features"))
samples <- as.character(h5read("tpm_matrix.h5", "barcodes"))

# Convert to SummarizedExperiment
library(SummarizedExperiment)
se <- SummarizedExperiment(
  assays = list(tpm = tpm_h5),
  rowData = DataFrame(transcript_id = features),
  colData = DataFrame(sample_id = samples)
)

Reading 10x format in R (Seurat)

library(Seurat)

# Read counts matrix
counts <- Read10X(data.dir = "counts_10x")
seurat_obj <- CreateSeuratObject(counts = counts)

Reading 10x format in Python (Scanpy)

import scanpy as sc

# Read counts matrix
adata = sc.read_10x_mtx("counts_10x")

Reading HDF5 format in Python

import h5py

# Read TPM matrix
with h5py.File("tpm_matrix.h5", "r") as f:
    tpm = f["matrix"][:]
    # asstr() (h5py >= 3.0) returns Python str values instead of bytes
    features = f["features"].asstr()[:]
    samples = f["barcodes"].asstr()[:]

Development

Project Structure

ffq_mcp/
├── src/
│   └── ffq_server.py      # Main MCP server implementation
├── requirements.txt        # Python dependencies
├── pyproject.toml         # Project configuration
└── README.md              # This file

Running Tests

pip install -e ".[dev]"
pytest

Adding New Tools

To add new tools to the server:

  1. Define the tool in list_tools() with appropriate input schema
  2. Implement the handler in call_tool()
  3. Update this README with usage examples

Metadata Depth Levels

The level parameter controls how much metadata to fetch:

  • Level 0: Only the requested accession
  • Level 1: Direct children (e.g., GSE → GSM)
  • Level 2: Grandchildren (e.g., GSE → GSM → SRX)
  • Level 3: Great-grandchildren (e.g., GSE → GSM → SRX → SRR)
  • Level 4: All downstream accessions (complete hierarchy)

Default: Fetches all levels (equivalent to level 4)
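
The effect of the level parameter can be pictured as trimming a tree of accessions to a fixed depth. The sketch below works on a hypothetical nested structure with "accession" and "children" keys, which is not the JSON that ffq actually returns; it only illustrates the depth rule described above, with None standing for "no limit".

```python
def prune(node, level=None):
    """Return a copy of `node` keeping at most `level` generations of children."""
    if level is not None and level < 0:
        raise ValueError("level must be non-negative")
    if level == 0:
        kept = []  # level 0: only the requested accession itself
    else:
        next_level = None if level is None else level - 1
        kept = [prune(child, next_level) for child in node.get("children", [])]
    return {"accession": node["accession"], "children": kept}


def depth(node):
    """Number of generations below `node`."""
    if not node["children"]:
        return 0
    return 1 + max(depth(child) for child in node["children"])


if __name__ == "__main__":
    # Made-up hierarchy: GSE -> GSM -> SRX -> SRR
    tree = {
        "accession": "GSE1",
        "children": [
            {
                "accession": "GSM1",
                "children": [
                    {
                        "accession": "SRX1",
                        "children": [{"accession": "SRR1", "children": []}],
                    }
                ],
            }
        ],
    }
    for level in (0, 1, 2, 3, None):
        print(level, depth(prune(tree, level)))
```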

Error Handling

The server provides detailed error messages for:

  • Invalid accession numbers
  • Network failures when accessing databases
  • Malformed responses from ffq
  • Missing dependencies

License

This MCP server is provided as-is for use with ffq. Please refer to the ffq repository for licensing information about the underlying tool.

Contributing

Contributions are welcome! Please ensure:

  • Code follows Python best practices
  • New tools include input validation
  • README is updated with examples
  • Error handling is comprehensive

Troubleshooting

"ffq command not found"

Ensure ffq is installed and in your PATH:

which ffq
pip install ffq

Server won't start

Check Python version:

python --version  # Should be 3.10+

Install dependencies:

pip install -r requirements.txt

Metadata fetch fails

  • Verify the accession number is correct
  • Check your internet connection
  • Try a different metadata level
  • Check if the database is accessible (GEO/SRA may have downtime)

"kallisto command not found"

Ensure kallisto is installed and in your PATH:

which kallisto
conda install -c bioconda kallisto  # or install from source

Download failures during kallisto quantification

  • For AWS S3 downloads: Ensure AWS CLI is installed (aws --version)
  • For GCP downloads: Ensure gsutil is installed
  • For FTP/HTTP downloads: Ensure wget is installed
  • Try a different provider (e.g., switch from aws to ftp)
  • Check your internet connection and firewall settings

Kallisto index not found

  • Verify the index path is correct and accessible
  • Build an index from a reference transcriptome:
    kallisto index -i transcriptome.idx transcriptome.fa.gz
