PERSEO Automated model selection and differential expression analysis for omics data using GAMLSS (Generalized Additive Models for Location, Scale and Shape). It supports overdispersed, skewed, or otherwise non-standard distributions, allowing for better model fit and more accurate inference.
- Why PERSEO?
- Installation
- Core Concepts
- Quick Start
- Input Specifications
- Advanced Features
- Interpretation Guide
- FAQ
- License
Classical differential expression pipelines assume a single likelihood family for all features:
- RNA-seq tools: Negative Binomial (DESeq2, edgeR)
- Microarray tools: Normal distribution (limma)
- Generic tools: Log-normal, Poisson, etc.
This is brittle for high-dimensional omics data where:
- Features vary in support (counts vs. positive continuous vs. real-valued)
- Features vary in shape (overdispersed, zero-inflated, skewed, heavy-tailed)
- A single family cannot fit all features well
PERSEO solves this by:
- Testing multiple GAMLSS families per feature (Negative Binomial, Gamma, Log-Normal, Beta, etc.)
- Using strict transformations + common masking so all families are compared on the same observations
- Applying Jacobian correction so information criteria (BIC/AIC/GAIC) are comparable on the original scale
- Selecting the best family per feature and extracting coefficient statistics
- Scaling efficiently via parallel processing
# Install from GitHub
remotes::install_github("OPSergio/Perseo")Dependencies: R ≥ 4.2.0, gamlss, gamlss.dist, dplyr, tibble, future, future.apply, progressr
PERSEO uses bootstrap sampling to identify commonly selected distribution families across a representative subset of features. This avoids the computational cost of testing all families on all features during exploration.
- Bootstrap mode (
bootstrap = TRUE): Fast, samplesn_genesfeatures per pull forn_bootiterations - Full evaluation mode (
bootstrap = FALSE): Comprehensive, evaluates all families on all features
For each feature, PERSEO:
- Fits all candidate families on the same observations (common mask)
- Applies family-specific transformations with Jacobian correction
- Compares models using BIC/AIC/GAIC (on original scale)
- Selects the best family and extracts per-term statistics from
summary(fit, what = "mu")
PERSEO supports two approaches for computing linear contrasts:
- Automatic: Specify a categorical variable (
contrast_variable) → all pairwise contrasts generated - Manual: Provide a custom contrast matrix (
contrast_matrix) → full control over comparisons
For multi-level factors (3+ levels), PERSEO offers omnibus testing:
- Stage 1 (Omnibus): Test if factor has any effect (multi-df test)
- Stage 2 (Contrasts): Only compute pairwise contrasts for significant features
This mirrors ANOVA + post-hoc workflow and reduces multiple testing burden.
The simplest way to use PERSEO is via run_perseo(), which handles family selection, differential expression, and contrast computation in one call.
library(PERSEO)
library(dplyr)
# Prepare metadata
metadata <- data.frame(
sample_id = colnames(counts_matrix),
condition = factor(c(rep("Control", 50), rep("Treatment", 50))),
age = rnorm(100, 50, 10),
batch = factor(rep(1:4, each = 25)),
row.names = colnames(counts_matrix)
)
# Set reference level
metadata$condition <- relevel(metadata$condition, ref = "Control")
# Run complete pipeline
results <- run_perseo(
counts_matrix = counts_matrix,
design_matrix = "~ condition + age + batch", # formula string
metadata = metadata, # required for formula
n_genes = 200, # features to sample
n_boot = 10, # bootstrap iterations
top_n = 4, # top families to select
criterion = "BIC", # model selection criterion
contrast_variable = "condition", # auto-generate contrasts
p_adjust_method = "BH", # FDR correction
parallel = TRUE, # enable parallelization
workers = 4, # number of cores
verbose = TRUE,
seed = 123
)
# View results
print(results)
# Access components
head(results$differential_expression$results) # Coefficient statistics
head(results$differential_expression$selection) # Best family per feature
head(results$differential_expression$contrasts) # Pairwise contrasts
results$family_selection$top_families_overall # Selected families
# Filter significant results (FDR < 0.05)
sig_de <- results$differential_expression$results %>%
filter(p_adj < 0.05, grepl("^condition", term))
sig_contrasts <- results$differential_expression$contrasts %>%
filter(p_adj < 0.05)For fine-grained control, call functions separately. See docs/find_families.md for detailed family selection documentation.
# Step 1: Family Selection
family_results <- find_families(
counts_matrix = counts_matrix,
n_genes = 300,
n_boot = 10,
top_n = 6,
criterion = "BIC",
seed = 123
)
selected_families <- family_results$top_families_overall
print(selected_families)
#> [1] "GG" "LOGNO" "NBI" "GA" "TF" "NO"
# Step 2: Differential Expression with selected families
de_results <- fit_gamlss_models(
counts_matrix = counts_matrix,
design_matrix = "~ condition + age + batch",
metadata = metadata,
candidate_families = selected_families,
contrast_variable = "condition",
criterion = "BIC",
parallel = TRUE,
workers = 4
)
# Access results
head(de_results$results) # Per-term coefficients
head(de_results$selection) # Best family per feature
head(de_results$contrasts) # Pairwise contrastscounts_matrix: Numeric matrix (features × samples)
- Rows = features (genes, proteins, metabolites, etc.)
- Columns = samples
- Row names = feature identifiers
- Column names = sample identifiers
- Values = numeric (finite)
design_matrix: Model specification (two options)
- Option 1: Formula string (e.g.,
"~ condition + batch") - Option 2: Pre-built design matrix from
model.matrix()
metadata: Data frame with sample metadata (required for formula input or automatic contrasts)
- Row names must match column names of
counts_matrix - Must have columns referenced in formula or
contrast_variable
When to use: You want PERSEO to build the design matrix and automatically generate all pairwise contrasts for a categorical variable.
Requirements:
design_matrixis a formula string or formula objectmetadatais provided (data.frame)contrast_variablenames a column in metadata
Example:
# Prepare metadata
metadata <- data.frame(
sample_id = colnames(counts_matrix),
tissue_type = factor(c(rep("Normal", 40), rep("Tumor_A", 30), rep("Tumor_B", 30))),
age = rnorm(100, 60, 10),
batch = factor(rep(1:4, each = 25)),
row.names = colnames(counts_matrix)
)
# Set reference level
metadata$tissue_type <- relevel(metadata$tissue_type, ref = "Normal")
# Run with automatic contrast generation
results <- fit_gamlss_models(
counts_matrix = counts_matrix,
design_matrix = "~ tissue_type + age + batch", # formula string
metadata = metadata, # REQUIRED
candidate_families = c("NBI", "GG", "LOGNO"),
contrast_variable = "tissue_type", # auto-generate all pairwise
criterion = "BIC",
parallel = TRUE,
workers = 4
)
# Automatically generates contrasts:
# - Tumor_A_vs_Normal
# - Tumor_B_vs_Normal
# - Tumor_B_vs_Tumor_A
head(results$contrasts)Benefits:
- Simple and concise
- No need to manually build design matrix
- All pairwise contrasts generated automatically
- Coefficient names match metadata column names
When to use: You need explicit control over the design matrix and want custom contrasts (e.g., average of groups, specific linear combinations).
Requirements:
design_matrixis a numeric matrix frommodel.matrix()contrast_matrixis provided (numeric matrix)- Column names of
contrast_matrixmatch coefficient names fromdesign_matrix
Example:
# Build design matrix manually
metadata$tissue_type <- relevel(factor(metadata$tissue_type), ref = "Normal")
design <- model.matrix(~ tissue_type + age + batch, data = metadata)
# Inspect coefficient names
colnames(design)
#> [1] "(Intercept)" "tissue_typeTumor_A" "tissue_typeTumor_B" "age" "batch2" "batch3" "batch4"
# Define custom contrast matrix
C <- matrix(0, nrow = 3, ncol = ncol(design))
colnames(C) <- colnames(design)
rownames(C) <- c("Tumor_A_vs_Normal", "Tumor_B_vs_Tumor_A", "AvgTumor_vs_Normal")
# Tumor_A vs Normal (just the Tumor_A coefficient)
C["Tumor_A_vs_Normal", "tissue_typeTumor_A"] <- 1
# Tumor_B vs Tumor_A (Tumor_B - Tumor_A)
C["Tumor_B_vs_Tumor_A", "tissue_typeTumor_B"] <- 1
C["Tumor_B_vs_Tumor_A", "tissue_typeTumor_A"] <- -1
# Average of tumors vs Normal: (Tumor_A + Tumor_B) / 2
C["AvgTumor_vs_Normal", "tissue_typeTumor_A"] <- 0.5
C["AvgTumor_vs_Normal", "tissue_typeTumor_B"] <- 0.5
# Run with explicit contrast matrix
results <- fit_gamlss_models(
counts_matrix = counts_matrix,
design_matrix = design, # pre-built matrix
candidate_families = c("NBI", "GG", "LOGNO"),
contrast_matrix = C, # explicit contrasts
criterion = "BIC",
parallel = TRUE,
workers = 4
)
head(results$contrasts)Benefits:
- Full control over design matrix construction
- Custom contrasts (averages, complex combinations)
- Explicit reference level handling
- Useful for non-standard designs
DO NOT:
# INVALID: Design matrix + contrast_variable without metadata
design <- model.matrix(~ condition + batch, data = metadata)
fit_gamlss_models(
counts_matrix = counts_matrix,
design_matrix = design, # matrix (not formula)
contrast_variable = "condition", # ERROR: automatic generation not supported
candidate_families = c("NBI", "GG")
)
# ERROR: When contrast_variable is provided and contrast_matrix is NULL,
# design_matrix must be a formula + metadata must be provided.DO:
# VALID: Use formula + metadata + contrast_variable
fit_gamlss_models(
counts_matrix = counts_matrix,
design_matrix = "~ condition + batch", # formula
metadata = metadata, # required
contrast_variable = "condition", # OK
candidate_families = c("NBI", "GG")
)
# OR: Use design matrix + explicit contrast_matrix
C <- matrix(...)
fit_gamlss_models(
counts_matrix = counts_matrix,
design_matrix = design, # matrix
contrast_matrix = C, # explicit
candidate_families = c("NBI", "GG")
)For multi-level factors (3+ levels), hierarchical testing reduces multiple testing burden:
- Omnibus test: Does the factor have any effect? (multi-df test)
- Contrasts: Only for features passing omnibus → compute pairwise comparisons
When to use:
- Factor has 3+ levels
- Want stricter false positive control
- Large datasets (1000+ features)
When to skip:
- Factor has only 2 levels (omnibus = single pairwise test)
- Exploratory analysis (want to see all contrasts)
- Pre-specified contrasts of interest
Example:
# Multi-level factor
metadata$tissue_type <- factor(rep(c("Normal", "Luminal", "Basal"), length.out = 100))
results <- fit_gamlss_models(
counts_matrix = counts_matrix,
design_matrix = "~ tissue_type + age + batch",
metadata = metadata,
candidate_families = c("NBI", "GG"),
contrast_variable = "tissue_type",
omnibus = TRUE, # Enable hierarchical testing
omnibus_threshold = 0.05, # Filter threshold
omnibus_test = "Wald", # "Wald" (fast) or "LRT" (robust)
parallel = TRUE,
workers = 4
)
# View omnibus results
head(results$omnibus)
#> # A tibble: 6 × 7
#> feature family test_type statistic df p_value pass
#> <chr> <chr> <chr> <dbl> <int> <dbl> <lgl>
#> 1 Gene_001 NBI Wald 12.5 2 0.0019 TRUE
#> 2 Gene_002 GG Wald 3.2 2 0.201 FALSE
#> 3 Gene_003 NBI Wald 18.7 2 0.0001 TRUE
# Contrasts only for features passing omnibus
head(results$contrasts)
# Filter significant contrasts
sig_contrasts <- results$contrasts %>%
filter(p_adj < 0.05)Omnibus test comparison:
| Test | Speed | Best For | Notes |
|---|---|---|---|
| Wald | Fast | Large samples (n > 30/group), standard families | Uses vcov from fitted model; asymptotic approximation |
| LRT | ~2× slower | Small samples (n < 30/group), complex families | Refits reduced model; more robust |
PERSEO supports two transformation strategies:
Behavior:
- Domain-preserving transformations
- Invalid observations excluded via masking
- No data modification
- Conservative, support-consistent approach
When to use:
- Default for most analyses
- Domain violations should exclude observations
- Support-aware family selection
Example:
results <- find_families(
counts_matrix = counts_matrix,
transform_mode = "strict", # default
group_by_support = TRUE,
criterion = "BIC"
)Behavior:
- Global affine transformations: y* = a·y + b (a > 0)
- Invertible with Jacobian correction
- No observation exclusion
- All observations retained after transformation
When to use:
- Want more flexibility
- Testing families across support boundaries
- Want to avoid observation exclusion
Example:
results <- find_families(
counts_matrix = counts_matrix,
transform_mode = "safe",
group_by_support = FALSE, # flexible family exploration
criterion = "BIC"
)Key differences:
| Aspect | Strict | Safe |
|---|---|---|
| Data modification | None | Global affine transformation |
| Invalid observations | Excluded | Included (after transform) |
| Jacobian correction | Yes | Yes |
| Invertibility | N/A | Yes (via metadata) |
| Use case | Conservative | Flexibility |
See docs/utils_transformations.md for technical details.
PERSEO includes built-in parallel processing support via future backend.
Automatic mode (recommended):
results <- run_perseo(
counts_matrix = counts_matrix,
design_matrix = design,
parallel = TRUE, # Auto-configure parallel backend
workers = 8, # Number of cores (auto-detected if NULL)
verbose = TRUE
)
# Automatically:
# 1. Sets up future::plan(multisession)
# 2. Runs analysis in parallel
# 3. Resets to sequential on completionManual mode (advanced):
library(future)
plan(multisession, workers = 8)
results <- run_perseo(
counts_matrix = counts_matrix,
design_matrix = design
)
plan(sequential)Recommended settings:
| Dataset Size | Parallel | Workers |
|---|---|---|
| < 1000 features | FALSE |
N/A |
| 1000–10,000 features | TRUE |
4–8 |
| > 10,000 features | TRUE |
8–16 |
Memory considerations:
- Each worker loads a copy of the data
- More workers = more memory usage
- If memory issues occur, reduce
workers
Family Selection (family_selection):
top_families_overall: Most frequently selected families across bootstrap samplesfreq_table_overall: Selection frequency per family- Higher frequency → more robust/versatile for your dataset
Differential Expression (differential_expression$results):
feature: Feature identifierterm: Coefficient name (e.g.,conditionTreatment,age,batch2)effect: Coefficient estimate (on link scale)se: Standard errorstat: Wald statistic (typically t or z)pval: Raw p-valuepadj: FDR-adjusted p-value (by term across features)
Selection (differential_expression$selection):
feature: Feature identifierbest_family: Winning family per featuren_valid_obs: Number of observations used (after common mask)ic_value: Jacobian-corrected information criterion value
Contrasts (differential_expression$contrasts):
feature: Feature identifierfamily: Best family for this featurecontrast: Contrast name (e.g.,Treatment_vs_Control)estimate: Point estimate of linear combinationse: Standard errorz: z-statisticp_value: Raw p-valuep_adj: FDR-adjusted p-value (by contrast across features)
Omnibus (differential_expression$omnibus, when enabled):
feature: Feature identifierfamily: Best familytest_type:"Wald"or"LRT"statistic: Test statisticdf: Degrees of freedomp_value: Omnibus p-valuepass: Logical;TRUEif p_value < omnibus_threshold
Standard workflow (omnibus = FALSE):
- Coefficient p-values adjusted by term across features
- Contrast p-values adjusted by contrast across features
- Each term/contrast treated independently
Hierarchical workflow (omnibus = TRUE):
- Stage 1 (Omnibus): Test each feature for factor effect (1 test per feature)
- Stage 2 (Contrasts): Only for significant features, compute pairwise tests
Example: 500 features, 3-level factor
| Approach | Stage 1 Tests | Stage 2 Tests | Total Tests | Power |
|---|---|---|---|---|
| Standard | 0 | 500 × 3 = 1500 | 1500 | Lower |
| Hierarchical | 500 | 150 × 3 = 450 | 950 | Higher |
Note: Omnibus test acts only as a gatekeeper. Contrast p-values are still adjusted by contrast across features (unchanged from standard workflow).
Filter significant DE results:
sig_results <- results$differential_expression$results %>%
filter(p_adj < 0.05, grepl("^condition", term)) %>%
arrange(p_adj)Filter significant contrasts:
sig_contrasts <- results$differential_expression$contrasts %>%
filter(p_adj < 0.05) %>%
arrange(p_adj)Check family distribution:
table(results$differential_expression$selection$best_family)
#> GG LOGNO NBI GA TF NO
#> 142 91 64 28 7 2Omnibus pass rate:
sum(results$differential_expression$omnibus$pass, na.rm = TRUE) /
nrow(results$differential_expression$omnibus) * 100
#> 32.5% passed omnibus filterQ: When should I use bootstrap = TRUE vs. bootstrap = FALSE?
A: Use bootstrap = TRUE (default) for fast exploratory analysis. The bootstrap samples a subset of features to identify common families, then applies them to all features. Use bootstrap = FALSE for comprehensive analysis where you want to evaluate all families on all features (slower but thorough).
Q: How do I choose between Workflow A and Workflow B?
A: Use Workflow A (formula + automatic contrasts) for simple designs with standard pairwise comparisons. Use Workflow B (design matrix + manual contrasts) when you need custom contrasts (e.g., averages, complex linear combinations) or explicit control over reference levels.
Q: What's the difference between Wald and LRT omnibus tests?
A: Wald is faster (uses vcov from fitted model) and suitable for large samples (n > 30/group). LRT is more robust (refits reduced model) and better for small samples (n < 30/group) or complex families. LRT is ~2× slower.
Q: Why do I get NA values in contrast results?
A: NAs occur when:
- Variance-covariance matrix extraction fails
- Vcov contains non-finite values
- Model did not converge properly
This is normal for some features; check selection$n_valid_obs to ensure sufficient data.
Q: How do I interpret effect values in results?
A: effect is the coefficient estimate on the link scale (not directly interpretable as fold-change). For count families (NBI, PO), the link is typically log. For positive families (GG, LOGNO), also log. For interpretation, focus on:
- Sign (positive/negative effect)
- Statistical significance (padj)
- Contrast estimates for direct comparisons
Q: Can I use PERSEO with non-count data?
A: Yes! PERSEO supports:
- Count data: RNA-seq counts → NBI, PO, ZIP, ZINBI
- Positive continuous: Proteomics intensities → GG, GA, LOGNO, IG
- Unit interval: Beta values, proportions → BE, BEINF, BEO
- Real-valued: Normalized/transformed data → NO, TF
Use group_by_support = TRUE to restrict families by empirical support.
Q: How do I speed up analysis?
A:
- Enable parallel processing:
parallel = TRUE, workers = 8 - Use
criterion = "BIC"(faster than GAIC) - Reduce
n_genesandn_bootfor family selection - Use
omnibus_test = "Wald"(faster than LRT) - Skip omnibus filtering if not needed (
omnibus = FALSE)
Q: What if all features are skipped?
A: Check min_n parameter. You may need fewer valid observations. Also check for:
- Extreme outliers
- Batch effects
- Data quality issues
Q: How are contrast p-values adjusted?
A: Contrast p-values are adjusted by contrast across all features using the specified FDR method (default: BH). Each contrast is treated as an independent hypothesis family. Example:
Treatment_vs_Control: All features with this contrast are adjusted togetherHigh_vs_Low: All features with this contrast are adjusted together
This is true regardless of omnibus setting. The omnibus test serves only as a gatekeeper, not as part of the adjustment procedure.
This project is licensed under the GNU General Public License v3.0 (GPL-3.0) — see the LICENSE file for details.
If you use PERSEO in your research, please cite:
Olmos-Piñero, S. & Fernández-Lanza Val, F.
PERSEO: Flexible differential expression analysis using GAMLSS.
Manuscript in preparation.
Contributions are welcome! Please open an issue or pull request on GitHub.
For questions or support:
- Issues: GitHub Issues
- Email: solmos97@gmail.com
Note: This package is under active development. API and functionality may change in future versions.
