Port UniProt evidence parser from Java to Python/PySpark #128
…cription fallback
…k-nlp jar
UniProt has moved from 'mutations'/'variations' to 'variants' in the six disease-note phrases that drive the medium-confidence classification, so the legacy Java strings produced a near-zero match rate against the current release. With the updated phrases and tests, medium fractions land within ~10% (relative) of the GCS reference (2.3% vs 2.1% for variants, 8.4% vs 8.2% for literature). Also adds the johnsnowlabs spark-nlp jar to the properties block of both pyspark steps, because ontoma's OnToma() constructor requires it locally.
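A minimal sketch of why a stale phrase list drives the medium fraction toward zero, assuming the classification is a case-insensitive substring match on the disease note. The phrases, the sample notes, and the helper names here are illustrative stand-ins, not the six phrases or the functions from this PR.

```python
# Illustrative stand-ins only: the real six phrases are not reproduced here.
LEGACY_PHRASES = ("may be caused by mutations affecting",)
UPDATED_PHRASES = ("may be caused by variants affecting",)


def is_medium_confidence(note: str, phrases: tuple[str, ...]) -> bool:
    """Return True when the note contains any indefinite-association phrase."""
    lowered = note.lower()
    return any(phrase in lowered for phrase in phrases)


def medium_fraction(notes: list[str], phrases: tuple[str, ...]) -> float:
    """Fraction of notes classified as medium confidence."""
    if not notes:
        return 0.0
    return sum(is_medium_confidence(n, phrases) for n in notes) / len(notes)


# Hypothetical notes written in the current "variants" wording.
sample_notes = [
    "The disease may be caused by variants affecting the gene represented in this entry.",
    "The disease is caused by variants affecting the gene represented in this entry.",
]

print(medium_fraction(sample_notes, LEGACY_PHRASES))
print(medium_fraction(sample_notes, UPDATED_PHRASES))
```

With notes in the current wording, the legacy list matches nothing, while the updated list picks up the indefinite note.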
Originally ported from the Java repo to feed the somatic/germline split in uniprot_variants. The GCS reference revealed no somatic split exists in the current output, so the file was disconnected from the pipeline in d2160e2. Removing it now since it serves no purpose.
The brainstorming spec and implementation plan were committed early as working artifacts; they should not ship with the PR. Removing from the branch — files remain in the local working tree as untracked.
import polars as pl
from loguru import logger
from otter.config.model import Config
from otter.storage.synchronous.handle import StorageHandle
I'm not sure whether relying on otter is the best way to do this.
Semantic comparison vs GCS reference
Ran a row-level semantic diff of the generated parquets against
Literature: key = (target, disease)
Field agreement across the 6,953 overlapping keys:
Variants with rsID — key = (target, disease, rsID)
Field agreement across the 28,101 overlapping keys:
Variants without rsID
~34% fewer rows. This is the variant→disease linkage gap discussed in the PR description: when a variant's
Headlines
Swiss-Prot occasionally wraps mid-compound-word (e.g. a CC line ending '...T-cell-negative/B-' continued by 'cell-positive...'). The naive ' '.join() of continuation lines produced 'B- cell-positive', breaking substring matches and inflating diseaseFromSource disagreements against the GCS reference. The new joiner elides the space when the previous part ends with '-'. Applied to both the CC DISEASE accumulator and the FT VARIANT qualifier accumulator. Recovers ~130 diseaseFromSource matches in the row-level diff and lifts variant-key overlap from 28,101 to 28,120 against the reference snapshot.
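The hyphen-aware join described above can be sketched as follows. The function name is hypothetical and the PR's actual joiner may differ in detail; the sketch only shows the rule of dropping the separator when the accumulated text ends with '-'.

```python
def join_continuation_lines(parts: list[str]) -> str:
    """Join wrapped flat-file lines, eliding the space after a trailing hyphen."""
    joined = ""
    for part in parts:
        part = part.strip()
        if not part:
            continue  # skip empty continuation lines
        if not joined:
            joined = part
        elif joined.endswith("-"):
            joined += part  # mid-compound-word wrap: no space
        else:
            joined += " " + part
    return joined


wrapped = ["...T-cell-negative/B-", "cell-positive..."]
print(" ".join(wrapped))                 # naive join keeps the stray space
print(join_continuation_lines(wrapped))  # hyphen-aware join removes it
```

One trade-off of this rule is that a line ending in a genuine trailing hyphen followed by a separate word would also be joined without a space.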
Summary
Ports the legacy Java uniprot-evidence-parser to three native PTS steps:
- uniprot_evidence_parse (Polars transformer): parses the Swiss-Prot flat file into an intermediate parquet with per-entry diseases and variants.
- uniprot_literature (PySpark): emits literature evidence rows from disease CC blocks.
- uniprot_variants (PySpark): emits variant evidence rows from FT VARIANT features.
The two existing evidence_postprocess_uniprot_* blocks are retargeted from input/evidence/uniprot_*.json.gz (externally provided) to the new intermediate/evidence/uniprot_*.parquet outputs (format flipped json → parquet).
Schema reconciliation vs GCS reference
Compared against gs://open-targets-pipeline-runs/ds/26.03-test5/input/evidence/uniprot_*.json.gz:
Row-count drift (+18% / +5%) is the expected delta from a newer Swiss-Prot release. Columns match exactly for literature; the variants columns match except for variantId (chr_pos_ref_alt), which requires an rsID→genomic resolver that is not wired up here.
Notes for reviewers
- INDEFINITE_DISEASE_NOTE_ASSOCIATIONS phrases were stale: UniProt now uses "variants" rather than "mutations" / "variations". The updated phrase list matches live data.
- The /db_snp FT qualifier no longer exists in modern Swiss-Prot; rsIDs are extracted from inline dbSNP:rsNNN mentions inside /note text.
- alleleOrigins, the somatic-census join, and the datatypeId branching were all dropped.
- com.johnsnowlabs.nlp:spark-nlp_2.12:6.1.3 is wired into both pyspark step properties so ontoma.OnToma() works locally; other PTS evidence pipelines using add_efo_mapping (clingen, orphanet) likely need the same treatment for local-run support, which is out of scope here.
Known gap: variant→disease linkage without an explicit text marker
Our parser links each FT VARIANT to the entry's diseases by matching the disease acronym in the variant's /note (e.g. "R -> Q (in BROVCA1; ...)" → BROVCA1 → OMIM:604370). The Java pipeline used japi's structured object graph, which exposed the variant↔disease relationship even when the /note text mentioned no disease at all. Concrete example from our data: P40692 VAR_022665 has the description 'decreased but not abolished ATPase activity; dbSNP:rs28930073' and no disease tag, yet the Java reference links it to OMIM:114500 (Colorectal cancer), one of the entry's 5 diseases. That information isn't in the flat file.
Result: we ship ~34% fewer rsID-less variant rows than the reference (8,991 vs 13,593). The gap is in less-clinically-curated variants (those that the curator didn't tag with a disease acronym in the first place), so the most precision-sensitive associations are unaffected. The acronym-tagged subset matches the reference at 99.4% key overlap (28,101 / 28,274). Closing this gap fully would require either bootstrapping from the legacy Java pipeline's output, or emitting the cross-product of variants × entry-diseases for description-less variants (high recall, low precision). Both are out of scope here.
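The acronym-based linkage and the inline rsID extraction can be illustrated with a short sketch. The regular expressions, function names, and the acronym-to-OMIM mapping shape are assumptions made for illustration (including the "CRC" acronym used for the P40692 case); the parser in this PR may tokenise the /note text differently.

```python
import re

# Assumed patterns: inline "dbSNP:rsNNN" mentions and a parenthesised "(in ...)" clause.
RSID_RE = re.compile(r"dbSNP:(rs\d+)")
IN_CLAUSE_RE = re.compile(r"\(in ([^)]*)\)")


def extract_rsids(note: str) -> list[str]:
    """Collect rsIDs mentioned inline in a variant /note."""
    return RSID_RE.findall(note)


def link_diseases(note: str, acronym_to_omim: dict[str, str]) -> list[str]:
    """Map disease acronyms found in the '(in ...)' clause to the entry's OMIM ids."""
    linked: list[str] = []
    for clause in IN_CLAUSE_RE.findall(note):
        for token in re.split(r"[;,\s]+", clause):
            omim = acronym_to_omim.get(token)
            if omim is not None and omim not in linked:
                linked.append(omim)
    return linked


# Acronym-tagged variant: the linkage is recoverable from the text.
print(link_diseases("R -> Q (in BROVCA1; ...)", {"BROVCA1": "OMIM:604370"}))

# Untagged variant (the P40692 VAR_022665 description): rsID found, no disease link.
untagged = "decreased but not abolished ATPase activity; dbSNP:rs28930073"
print(extract_rsids(untagged))
print(link_diseases(untagged, {"CRC": "OMIM:114500"}))  # "CRC" is an assumed acronym
```

The second case shows the gap directly: with no acronym in the note, a text-only matcher has nothing to link, whatever the entry's disease list contains.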
Test plan
- JAVA_HOME=<jdk17> uv run pytest test (348 passing: 319 baseline + 29 new).
- uv run ruff check on the new modules.
- gs://open-targets-pipeline-runs/ds/26.03-test5/:
  - uv run pts --step uniprot_evidence_parse (~7 s)
  - uv run pts --step uniprot_literature (~25 s)
  - uv run pts --step uniprot_variants (~30 s)
- Run evidence_postprocess_uniprot_literature and evidence_postprocess_uniprot_variants against the new parquets and confirm downstream consumers are happy.
Out of scope (follow-ups)
- variantId field: requires an rsID→genomic-coordinate resolver.
- spark-nlp in clingen / orphanet properties for local runs.