semseeker detects Stochastic Epigenetic Mutations (SEMs) — aberrant, sample-specific methylation changes that deviate from the expected population signal. Starting from a normalised beta-value (or M-value) matrix, semseeker identifies hypomethylated and hypermethylated mutations at single-probe resolution, aggregates them into lesions (genomic clusters of co-occurring mutations), and computes population-level delta metrics (ΔBeta, ΔIV, ΔEV, ΔRV) that summarise stochastic epigenetic variability per region.
Supported input technologies:
- Illumina microarrays: EPIC (850k), 450k, 27k — detected automatically
- Bisulfite sequencing (WGBS / RRBS) and long-read sequencing (ONT / PacBio): planned (see documents/requirements.md)
Install from GitHub using devtools:
install.packages("devtools")
library(devtools)
install_github("drake69/semseeker")semseeker relies on several CRAN packages. Install them before the first use:
install.packages(c(
"doFuture", "FactoMineR", "factoextra", "FSA", "fst",
"future", "future.apply", "ggplot2", "gridExtra", "gtools",
"Hmisc", "lqmm", "progressr", "reshape2", "zoo",
"VennDiagram", "effsize", "pwr", "combinat", "philentropy",
"parallelly", "knitr", "lmtest", "usethis", "wordcloud",
"WebGestaltR", "sqldf", "caret", "openai", "glmnet"
))library(ChAMP)
idat_folder <- "~/source_idat/"
result_folder <- "~/result/"
myLoad <- champ.load(
directory = idat_folder,
method = "minfi",
methValue = "B",
autoimpute = TRUE,
filterDetP = TRUE,
ProbeCutoff = 0,
SampleCutoff = 0.1,
detPcut = 0.01,
filterBeads = TRUE,
beadCutoff = 0.05,
filterNoCG = TRUE,
filterSNPs = TRUE,
filterMultiHit = TRUE,
filterXY = TRUE,
arraytype = "EPIC" # or "450K"
)
myNorm <- champ.norm(
beta = myLoad$beta,
rgSet = myLoad$rgSet,
mset = myLoad$mset,
resultsDir = result_folder,
method = "SWAN",
arraytype = "EPIC",
cores = parallel::detectCores() - 1
)
saveRDS(myNorm, "~/normalizedData.rds")library(semseeker)
normalizedData <- readRDS("~/normalizedData.rds")
sample_sheet <- read.csv2("~/sample_sheet.csv")
semseeker(
sample_sheet = sample_sheet,
signal_data = normalizedData,
result_folder = "~/semseeker_result/"
)Results (BED files, pivot tables, delta-metric summaries) are written to
result_folder.
A fully worked example using public data from Gene Expression Omnibus (GEO) is
available in the repository's example/ folder.
- Rows = probes (Illumina probe IDs, e.g.
cg...) - Columns = samples
- Values = beta values in [0, 1] or M-values (detected automatically)
A data frame (or CSV readable with read.csv2) with at minimum these columns:
| Column | Description |
|---|---|
Sample_ID |
Unique sample identifier (must match matrix column names) |
Sample_Name |
Human-readable label |
Sample_Group |
Must be one of "Case", "Control", "Reference" |
Note: If no Reference population is available, duplicate the Control rows and assign
Sample_Group = "Reference"to the duplicates.
Note:
Sample_IDvalues are uppercased internally by semseeker (name_cleaning()); output file names and pivot column headers reflect the uppercased form.
For each sample semseeker writes to result_folder/Data/:
*_MUTATIONS_HYPO.bed/*_MUTATIONS_HYPER.bed— probe-level mutations*_LESIONS_HYPO.bed/*_LESIONS_HYPER.bed— genomic lesion clusters*_DELTA*.bed— ΔBeta, ΔIV, ΔEV, ΔRV per region- Pivot tables aggregating all samples per marker
If you use semseeker in published work, please cite the package:
citation("semseeker")A Zenodo DOI will be registered for each release tag (see documents/requirements.md).