Port UniProt evidence parser from Java to Python/PySpark #128

Draft
d0choa wants to merge 25 commits into main from feat/uniprot-evidence-parser

@d0choa d0choa commented May 11, 2026

Summary

Ports the legacy Java uniprot-evidence-parser to three native PTS steps:

  • uniprot_evidence_parse (Polars transformer) — parses Swiss-Prot flat file → intermediate parquet with per-entry diseases and variants.
  • uniprot_literature (PySpark) — emits literature evidence rows from disease CC blocks.
  • uniprot_variants (PySpark) — emits variant evidence rows from FT VARIANT features.

The two existing evidence_postprocess_uniprot_* blocks are retargeted from input/evidence/uniprot_*.json.gz (externally provided) to the new intermediate/evidence/uniprot_*.parquet outputs (format flipped json → parquet).

Schema reconciliation vs GCS reference

Compared against gs://open-targets-pipeline-runs/ds/26.03-test5/input/evidence/uniprot_*.json.gz:

| dataset | reference rows | new rows | medium-confidence fraction |
|---|---|---|---|
| literature | 7,697 | 9,141 | ref 8.2% / new 8.4% |
| variants | 47,867 | 50,370 | ref 2.1% / new 2.3% |

Row-count drift (+19% / +5%) is the expected delta from a newer Swiss-Prot release. Columns match exactly for literature; variants matches except variantId (chr_pos_ref_alt), which requires an rsID→genomic resolver that is not wired up here.

Notes for reviewers

  • The Java pipeline's INDEFINITE_DISEASE_NOTE_ASSOCIATIONS phrases were stale: UniProt now uses "variants" rather than "mutations" / "variations". The updated phrase list matches live data.
  • The /db_snp FT qualifier no longer exists in modern Swiss-Prot; rsIDs are extracted from inline dbSNP:rsNNN mentions inside /note text.
  • No somatic/germline split in the reference data — alleleOrigins, the somatic-census join, and the datatypeId branching were all dropped.
  • com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3 is wired into both pyspark step properties so ontoma.OnToma() works locally; other PTS evidence pipelines using add_efo_mapping (clingen, orphanet) likely need the same treatment for local-run support — out of scope here.
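The inline rsID extraction mentioned in the notes above can be illustrated with a short regex. This is a minimal sketch, not the code in this PR: the helper name `extract_rsids` and the exact pattern are assumptions, based only on the `dbSNP:rsNNN` form quoted in the description.

```python
import re

# Assumed pattern: an inline "dbSNP:rs<digits>" token inside the /note text.
_DBSNP_RE = re.compile(r"dbSNP:(rs\d+)")


def extract_rsids(note: str) -> list[str]:
    """Return rsIDs mentioned in a /note string, in order of first appearance."""
    # dict.fromkeys de-duplicates while preserving insertion order.
    return list(dict.fromkeys(_DBSNP_RE.findall(note)))


# Note text taken from the P40692 VAR_022665 example in this PR.
print(extract_rsids("decreased but not abolished ATPase activity; dbSNP:rs28930073"))
# -> ['rs28930073']
```

A note with no `dbSNP:` token yields an empty list, which is how a variant would end up in the rsID-less group.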

Known gap: variant→disease linkage without an explicit text marker

Our parser links each FT VARIANT to the entry's diseases by matching the disease acronym in the variant's /note (e.g. "R -> Q (in BROVCA1; ...)" → BROVCA1 → OMIM:604370). The Java pipeline used japi's structured object graph, which exposed the variant↔disease relationship even when the /note text mentioned no disease at all. Concrete example from our data: P40692 VAR_022665 has the description 'decreased but not abolished ATPase activity; dbSNP:rs28930073' and no disease tag, yet the Java reference links it to OMIM:114500 (Colorectal cancer), one of the entry's 5 diseases. That information isn't in the flat file.

Result: we ship ~34% fewer rsID-less variant rows than the reference (8,991 vs 13,593). The gap is concentrated in variants that the curator did not tag with a disease acronym in the first place, so the most precision-sensitive associations are unaffected. The acronym-tagged subset matches the reference at 99.4% key overlap (28,101 / 28,274). Closing the gap fully would require either bootstrapping from the legacy Java pipeline's output, or emitting the cross-product of variants × entry-diseases for variants whose /note names no disease (high recall, low precision). Both are out of scope here.
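To make the linkage rule in this section concrete, here is a minimal sketch of acronym matching against a variant's /note. The function name `link_variant_to_diseases`, the `acronym_to_omim` mapping, and the word-boundary rule are illustrative assumptions and may differ from the parser in this PR.

```python
import re


def link_variant_to_diseases(note: str, acronym_to_omim: dict[str, str]) -> list[str]:
    """Return the OMIM IDs of entry diseases whose acronym appears in the note.

    acronym_to_omim is a hypothetical per-entry mapping built from the
    entry's CC DISEASE blocks.
    """
    linked = []
    for acronym, omim_id in acronym_to_omim.items():
        # Require non-word, non-hyphen neighbours so that "BROVCA1" does not
        # match inside a longer token such as "BROVCA10".
        pattern = rf"(?<![\w-]){re.escape(acronym)}(?![\w-])"
        if re.search(pattern, note):
            linked.append(omim_id)
    return linked


entry_diseases = {"BROVCA1": "OMIM:604370"}
print(link_variant_to_diseases("R -> Q (in BROVCA1; ...)", entry_diseases))
# -> ['OMIM:604370']
print(link_variant_to_diseases(
    "decreased but not abolished ATPase activity; dbSNP:rs28930073", entry_diseases))
# -> []
```

The second call shows the gap: with no acronym in the text there is nothing to match, so the variant stays unlinked.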

Test plan

  • All tests pass: JAVA_HOME=<jdk17> uv run pytest test — 348 passing (319 baseline + 29 new).
  • Lint clean: uv run ruff check on the new modules.
  • End-to-end locally with Swiss-Prot from gs://open-targets-pipeline-runs/ds/26.03-test5/:
    • uv run pts --step uniprot_evidence_parse — ~7 s
    • uv run pts --step uniprot_literature — ~25 s
    • uv run pts --step uniprot_variants — ~30 s
  • Schema and confidence-distribution diff against the GCS reference (see table above and the semantic comparison comment below).
  • Reviewer: run evidence_postprocess_uniprot_literature and evidence_postprocess_uniprot_variants against the new parquets and confirm downstream consumers are happy.

Out of scope (follow-ups)

  • variantId field — requires an rsID→genomic-coordinate resolver.
  • Closing the variant→disease linkage gap described above.
  • The need for spark-nlp in clingen / orphanet properties for local runs.

d0choa added 24 commits May 11, 2026 19:20
…k-nlp jar

UniProt has moved from 'mutations'/'variations' to 'variants' in the six
disease-note phrases that drive the medium-confidence classification; the
legacy Java strings produced a near-zero match rate against the current
release. Updated phrases + tests now yield medium fractions within ~10%
of the GCS reference (2.3% vs 2.1% for variants, 8.4% vs 8.2% for
literature).

Also adds the johnsnowlabs spark-nlp jar to both pyspark steps'
properties block, since ontoma's OnToma() constructor requires it locally.

Originally ported from the Java repo to feed the somatic/germline split
in uniprot_variants. The GCS reference revealed no somatic split exists
in the current output, so the file was disconnected from the pipeline
in d2160e2. Removing it now since it serves no purpose.

The brainstorming spec and implementation plan were committed early as
working artifacts; they should not ship with the PR. Removing from the
branch; the files remain in the local working tree as untracked.

import polars as pl
from loguru import logger
from otter.config.model import Config
from otter.storage.synchronous.handle import StorageHandle

I'm not sure that relying on otter is the best way to do this.

d0choa commented May 11, 2026

Semantic comparison vs GCS reference

Ran a row-level semantic diff of the generated parquets against gs://open-targets-pipeline-runs/ds/26.03-test5/input/evidence/uniprot_*.json.gz. Both pipelines fan out rows when a disease has multiple EFO matches, so I aggregated by natural key first, then compared.
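The aggregate-then-compare approach described above can be sketched in plain Python. This is an illustration under assumptions: the column names `target`, `disease` and `efo`, the helper names, and the toy rows are all hypothetical, and the real diff may have been run with Polars or Spark instead.

```python
from collections import defaultdict


def aggregate_by_key(rows, key_fields, value_field):
    """Collapse fanned-out rows into one set of values per natural key."""
    grouped = defaultdict(set)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        grouped[key].add(row[value_field])
    return dict(grouped)


def key_overlap(ref, new):
    """Count keys shared by both aggregates and keys unique to each side."""
    ref_keys, new_keys = set(ref), set(new)
    return {
        "both": len(ref_keys & new_keys),
        "ref_only": len(ref_keys - new_keys),
        "new_only": len(new_keys - ref_keys),
    }


# Toy rows: the first reference key fans out to two EFO matches.
ref_rows = [
    {"target": "T1", "disease": "D1", "efo": "EFO_1"},
    {"target": "T1", "disease": "D1", "efo": "EFO_2"},
    {"target": "T2", "disease": "D2", "efo": "EFO_3"},
]
new_rows = [
    {"target": "T1", "disease": "D1", "efo": "EFO_1"},
    {"target": "T3", "disease": "D3", "efo": "EFO_4"},
]
ref = aggregate_by_key(ref_rows, ("target", "disease"), "efo")
new = aggregate_by_key(new_rows, ("target", "disease"), "efo")
print(key_overlap(ref, new))
# -> {'both': 1, 'ref_only': 1, 'new_only': 1}
```

Aggregating first keeps a disease with several EFO matches from being counted once per fanned-out row.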

Literature — key = (target, disease)

| metric | count | % |
|---|---|---|
| ref unique keys | 7,039 | |
| new unique keys | 7,067 | |
| keys in both | 6,953 | 98.8% of ref |
| ref-only (likely removed from newer Swiss-Prot) | 86 | 1.2% |
| new-only (added since reference snapshot) | 114 | 1.6% |

Field agreement across the 6,953 overlapping keys:

| field | agreement |
|---|---|
| datatypeId | 100% |
| targetModulation | 100% |
| confidence | 99.9% (6 keys flip: 4 medium→high, 2 high→medium) |
| diseaseFromSource | 98.0% (curator name updates) |
| literature PMID set | 90.0% identical · 2.6% we have more · 6.9% we have fewer · 0.5% disjoint |
| mappedEfoSet | 67.4% identical · 26.8% we have more · 4.5% disjoint (ontoma LUT version drift) |
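For set-valued fields such as the PMID set and mappedEfoSet, the identical / more / fewer / disjoint buckets correspond to ordinary set relations. The sketch below is one assumed way to compute them; the function names are hypothetical, and the extra "partial" bucket (overlapping sets where neither contains the other) is an assumption, since the comparison above reports only four labels.

```python
from collections import Counter


def classify_sets(ref: set, new: set) -> str:
    """Label how the new value set relates to the reference value set."""
    if new == ref:
        return "identical"
    if new > ref:  # strict superset: we have everything in ref, plus extras
        return "more"
    if new < ref:  # strict subset: we are missing some reference values
        return "fewer"
    if not (new & ref):
        return "disjoint"
    return "partial"  # assumed bucket, not reported in the comparison above


def agreement_breakdown(ref_by_key: dict, new_by_key: dict) -> dict:
    """Return the fraction of shared keys that fall into each label."""
    shared = ref_by_key.keys() & new_by_key.keys()
    counts = Counter(classify_sets(ref_by_key[k], new_by_key[k]) for k in shared)
    return {label: n / len(shared) for label, n in counts.items()}


# Toy data: two shared keys, one identical and one where the new side has more.
ref = {("T1", "D1"): {"1", "2"}, ("T2", "D2"): {"3"}}
new = {("T1", "D1"): {"1", "2"}, ("T2", "D2"): {"3", "4"}}
print(agreement_breakdown(ref, new))
```

Only keys present on both sides enter the breakdown, which mirrors computing field agreement over the overlapping keys.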

Variants with rsID — key = (target, disease, rsID)

| metric | count | % |
|---|---|---|
| ref unique keys | 28,274 | |
| new unique keys | 29,200 | |
| keys in both | 28,101 | 99.4% of ref |
| ref-only | 173 | 0.6% |
| new-only | 1,099 | 3.8% |

Field agreement across the 28,101 overlapping keys:

| field | agreement |
|---|---|
| datatypeId | 100% |
| targetModulation | 100% |
| confidence | 99.9% (22 keys flip) |
| diseaseFromSource | 96.6% (curator wording changes) |
| literature PMID set | 96.3% identical · 1.2% we have more · 0.7% we have fewer · 1.7% disjoint |
| mappedEfoSet | 66.6% identical · 28.3% we have more (ontoma version drift) |

Variants without rsID

| dataset | rows | unique (target, disease) pairs |
|---|---|---|
| reference | 13,593 | 2,771 |
| new | 8,991 | 2,912 |

~34% fewer rows. This is the variant→disease linkage gap discussed in the PR description: when a variant's /note has no disease acronym, our parser can't link it, while the legacy Java pipeline did via japi's structured object graph.

What's japi? UniProt's official Java client library (uk.ac.ebi.uniprot:japi, on EBI Artifactory). Instead of parsing the flat file, it queries UniProt's servers and returns a typed Java object model — UniProtEntry, DiseaseCommentStructured, VariantFeature, etc. — with cross-references already resolved. In particular, DiseaseCommentStructured.getVariants() returns the variants the curators have associated with that disease, a relationship that doesn't appear anywhere in the rendered flat-file text. Our Python port reads the flat file (the only public artefact on the EBI FTP), so it can only recover the textual breadcrumbs the curator left behind.

Headlines

  • Shared schema fields agree: datatypeId and targetModulation match 100%; confidence matches 99.9%.
  • Coverage is high: 98.8% / 99.4% key overlap.
  • Most disagreements are upstream drift — ontoma LUT version, UniProt curator note updates, newer Swiss-Prot release content.
  • One real bug surfaced: spotted a "B- cell" (extra space after the hyphen) in one of our diseaseFromSource values vs "B-cell" in the reference. Suggests the FT continuation-line joiner is occasionally preserving an unwanted space across a hyphenated word. Worth a follow-up regex fix.
  • One documented gap confirmed: rsID-less variants are under-represented by ~34% — see the japi explanation above for the structural reason.

Swiss-Prot occasionally wraps mid-compound-word (e.g. a CC line ending
'...T-cell-negative/B-' continued by 'cell-positive...'). The naive
' '.join() of continuation lines produced 'B- cell-positive', breaking
substring matches and inflating diseaseFromSource disagreements against
the GCS reference. The new joiner elides the space when the previous
part ends with '-'.

Applied to both the CC DISEASE accumulator and the FT VARIANT
qualifier accumulator. Recovers ~130 diseaseFromSource matches in
the row-level diff and lifts variant-key overlap from 28,101 to 28,120
against the reference snapshot.
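A hyphen-aware joiner of the kind this commit message describes could look like the sketch below. It is an assumption-labelled illustration: the function name `join_continuation` and its handling of blank parts are hypothetical, not the code from this PR.

```python
def join_continuation(parts: list[str]) -> str:
    """Join wrapped continuation lines, eliding the space after a trailing '-'."""
    joined = ""
    for part in parts:
        part = part.strip()
        if not part:
            continue  # skip blank continuation lines
        # Insert a separating space unless the text so far ends with a hyphen,
        # i.e. the line was wrapped in the middle of a compound word.
        if joined and not joined.endswith("-"):
            joined += " "
        joined += part
    return joined


print(join_continuation(["T-cell-negative/B-", "cell-positive"]))
# -> T-cell-negative/B-cell-positive
print(join_continuation(["severe combined", "immunodeficiency"]))
# -> severe combined immunodeficiency
```

A naive `" ".join(parts)` on the first input would give 'T-cell-negative/B- cell-positive', which is the mismatch the commit fixes.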