Skip to content

perf: replace jmzml JAXB parser with StAX-based mzML reader#6

Open
ypriverol wants to merge 9 commits intodevfrom
feature/stax-mzml-reader
Open

perf: replace jmzml JAXB parser with StAX-based mzML reader#6
ypriverol wants to merge 9 commits intodevfrom
feature/stax-mzml-reader

Conversation

@ypriverol
Copy link
Copy Markdown
Member

Summary

  • Replace jmzml's JAXB-based MzMLUnmarshaller with a custom StAX streaming parser (StaxMzMLParser)
  • Extracts only the 11 fields MSGF+ needs from mzML (spectrum ID, MS level, polarity, centroided, RT, ion mobility, precursor m/z/charge/intensity, activation method, peak arrays)
  • Single-pass index build on file open, full preload into memory on first random access
  • Eliminates repeated XML parsing during the database search phase

Architecture

StaxMzMLParser (core parser, index + preload + cache)
├── StaxMzMLSpectraMap (implements SpectrumAccessorBySpecIndex)
└── StaxMzMLSpectraIterator (implements Iterator<Spectrum>)

Wired into SpectraAccessor.java as drop-in replacement for MzMLSpectraMap/MzMLSpectraIterator.

Files Changed (5 files)

Source (2 new, 1 modified):

  • StaxMzMLParser.java — Core StAX parser with index, preload, base64/zlib decoding
  • StaxMzMLSpectraMap.java — SpectrumAccessorBySpecIndex implementation
  • StaxMzMLSpectraIterator.java — Sequential iterator with MS level filtering
  • SpectraAccessor.java — Wire StAX classes, remove jmzml dependency for spectrum access

Tests (1 new):

  • TestStaxMzMLParser.java — 18 tests covering parsing, indexing, caching, iteration, binary decoding

Benchmark (TMT 1.1GB mzML, target-decoy, 4 threads)

Metric Baseline (original master) PR2 (msLevel) StAX (this PR) Delta vs PR2 Cumulative
Wall time 1,245s 957s 853s -10.9% -31.4%
PSMs at 1% FDR 6,654 6,936 6,936 same +4.2%
Total SII 58,335 46,878 46,878 same same
Score completeness 100% 100% 100% same same

Test plan

  • mvn test passes (155 tests, 0 failures)
  • TestStaxMzMLParser: 18 tests covering index, parse, cache, iterator, binary decode
  • TestParsers.testMzML: existing mzML test passes with new backend
  • TestMZIdentMLGen: mzid output tests pass (score completeness verified)
  • Benchmark: identical PSMs and scores, 10.9% faster

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 13, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da609828-2539-461e-951f-18571a3cc0b0

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/stax-mzml-reader

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@timosachsenberg
Copy link
Copy Markdown

Why more hits?

@ypriverol
Copy link
Copy Markdown
Member Author

What do you mean by more hits? I think when you pass ms2 level filter you manage to search only on the spectra that is ms2; this dataset is a TMT MS3 which means that if you don't pass the threshold you get also some hits in ms3.

@timosachsenberg
Copy link
Copy Markdown

I misread the table

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the previous jmzML/JAXB-based mzML reading path with a custom StAX streaming parser and wires it through SpectraAccessor as the default mzML backend, alongside broad removal of mzXML-related code paths and dependencies.

Changes:

  • Added StaxMzMLParser plus mzML iterator/map adapters and integrated them into SpectraAccessor for mzML access.
  • Removed jmzML/jmzreader (and related converter/adapter classes) from the build and codebase.
  • Removed/disabled mzXML support across CLI/scripts/utilities and tests (format constants, parameter validation, helper tools).

Reviewed changes

Copilot reviewed 71 out of 72 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/test/java/msgfplus/TestStaxMzMLParser.java New unit tests validating StAX mzML parsing, indexing, caching, and decoding.
src/test/java/msgfplus/TestScoring.java Switch log suppression from MzMLAdapter to StaxMzMLParser.
src/test/java/msgfplus/TestParsers.java Removes mzXML reading test case.
src/test/java/msgfplus/TestMSGFPlus.java Switch log suppression from MzMLAdapter to StaxMzMLParser.
src/main/java/edu/ucsd/msjava/mzml/StaxMzMLParser.java New StAX mzML parser with index build + preload/cache + binary decoding.
src/main/java/edu/ucsd/msjava/mzml/StaxMzMLSpectraMap.java New SpectrumAccessorBySpecIndex implementation backed by StaxMzMLParser.
src/main/java/edu/ucsd/msjava/mzml/StaxMzMLSpectraIterator.java New sequential iterator wrapper with MS-level filtering + polarity warnings.
src/main/java/edu/ucsd/msjava/msutil/SpectraAccessor.java Wires new StAX mzML backend; removes old jmzML adapter usage.
src/main/java/edu/ucsd/msjava/msutil/Spectrum.java Updates spectrum format inference to no longer map .mzXML to a supported format.
src/main/java/edu/ucsd/msjava/msutil/SpecFileFormat.java Removes MZXML constant and drops it from supported format list.
src/main/java/edu/ucsd/msjava/params/ParamManager.java Stops accepting mzXML in spectrum file parameters.
src/main/java/edu/ucsd/msjava/ui/ScoringParamGen.java Switch log suppression from MzMLAdapter to StaxMzMLParser.
src/main/java/edu/ucsd/msjava/ui/MzIDToTsv.java Switch log suppression from MzMLAdapter to StaxMzMLParser.
src/main/java/edu/ucsd/msjava/ui/MSGFPlus.java Switch log suppression from MzMLAdapter to StaxMzMLParser.
src/main/java/edu/ucsd/msjava/mzid/MzIDParser.java Switch log suppression from MzMLAdapter to StaxMzMLParser.
src/main/java/edu/ucsd/msjava/ui/PRMSpecGen.java Removes mzXML option handling in argument parsing / spectrum reading.
src/main/java/edu/ucsd/msjava/ui/MSProfile.java Removes mzXML option handling in format detection and iteration.
src/main/java/edu/ucsd/msjava/ui/MSDictionary.java Removes mzXML branch when selecting spectrum iterator backend.
src/main/java/edu/ucsd/msjava/scripts/SpecFileValidator.java Removes mzXML branch when selecting spectrum iterator backend.
src/main/java/edu/ucsd/msjava/scripts/SelectSpectra.java Switches from MzXMLSpectraMap to SpectraAccessor (still targets .mzXML).
src/main/java/edu/ucsd/msjava/scripts/MergeSpectra.java Disables mzXML merge entry point; refactors internal (disabled) path to SpectraAccessor.
src/main/java/edu/ucsd/msjava/scripts/CountSpectra.java Replaces mzXML counting implementation with UnsupportedOperationException.
src/main/java/edu/ucsd/msjava/misc/PreprocessSpec.java Removes mzXML option handling and mzXML iterator path.
src/main/java/edu/ucsd/msjava/misc/PhosAnalysis.java Removes mzXML branch and disables mzXML-dependent method body.
src/main/java/edu/ucsd/msjava/misc/IPRGStudy.java Disables mzXML-dependent logic path.
src/main/java/edu/ucsd/msjava/misc/HeckWhole.java Replaces some mzXML access with SpectraAccessor; disables mzXML conversion method.
src/main/java/edu/ucsd/msjava/misc/HeckRevision.java Replaces mzXML iterator/map usage with SpectraAccessor.
src/main/java/edu/ucsd/msjava/misc/HeckPercolator.java Disables mzXML-dependent logic path.
src/main/java/edu/ucsd/msjava/misc/HCDCIDETD.java Disables mzXML-dependent logic paths.
src/main/java/edu/ucsd/msjava/misc/ControlNew.java Disables mzXML-dependent logic path.
src/main/java/edu/ucsd/msjava/misc/Clauser.java Disables mzXML-dependent logic path.
src/main/java/edu/ucsd/msjava/misc/Chores.java Disables mzXML-dependent logic paths.
src/main/java/edu/ucsd/msjava/misc/CIDETDPairs.java Removes mzXML counting branch (mgf-only).
src/main/java/edu/ucsd/msjava/misc/AnnotatedSpecGenerator.java Removes mzXML map branch (mgf-only).
src/main/java/edu/ucsd/msjava/ui/MSGFLib.java Replaces mzXML handling with “no longer supported” message + continue.
src/main/java/edu/ucsd/msjava/ui/MSGF.java Replaces mzXML handling with “no longer supported” message + continue.
src/main/java/edu/ucsd/msjava/parser/PSMList.java Removes mzXML branch when selecting SpectrumAccessorBySpecIndex backend.
src/main/java/edu/ucsd/msjava/parser/SortedSpectraIterator.java Removes mzXML demo main() content.
src/main/java/edu/ucsd/msjava/msutil/SpecKey.java Removes mzXML demo main() content and mzXML iterator import.
src/main/java/edu/ucsd/msjava/mzml/MzMLAdapter.java Deleted (jmzML-based mzML adapter removed).
src/main/java/edu/ucsd/msjava/mzml/MzMLSpectraMap.java Deleted (jmzML-based spectra map removed).
src/main/java/edu/ucsd/msjava/mzml/MzMLSpectraIterator.java Deleted (jmzML-based iterator removed).
src/main/java/edu/ucsd/msjava/mzml/SpectrumConverter.java Deleted (jmzML Spectrum conversion removed).
src/main/java/edu/ucsd/msjava/parser/MzXMLSpectraMap.java Deleted (mzXML random access removed).
src/main/java/edu/ucsd/msjava/parser/MzXMLSpectraIterator.java Deleted (mzXML iterator removed).
src/main/java/edu/ucsd/msjava/parser/MzXMLToMgfConverter.java Deleted (mzXML->mgf conversion removed).
src/main/java/edu/ucsd/msjava/misc/MzXMLToMgf.java Deleted (mzXML->mgf tool removed).
src/main/java/edu/ucsd/msjava/msutil/test/MzXMLSpectraMapTest.java Deleted (mzXML test removed).
src/main/java/edu/ucsd/msjava/msutil/test/MzXMLSpectraIteratorTest.java Deleted (mzXML test removed).
src/main/java/edu/ucsd/msjava/scripts/AgilentCyclicSpecPreProcess.java Replaces mzXML logic with UnsupportedOperationException.
src/main/java/org/systemsbiology/jrap/stax/* Deleted legacy jrap stax mzXML/mzML parsing/indexing infrastructure.
pom.xml Removes jmzml/jmzreader dependencies and updates shading notes accordingly.
Comments suppressed due to low confidence (2)

src/main/java/edu/ucsd/msjava/msutil/SpectraAccessor.java:95

  • SpectraAccessor can be constructed with an unrecognized extension (e.g., now that .mzXML is no longer in SpecFileFormat), which makes specFormat null. getSpecMap()/getSpecItr() then dereference specFormat (e.g., specFormat.getPSIName()) and can throw a NullPointerException. Consider validating specFormat in the constructor and throwing an IllegalArgumentException with a clear message, or providing a safe fallback/error path before using specFormat.
    public SpectraAccessor(File specFile) {
        this(specFile, SpecFileFormat.getSpecFileFormat(specFile.getName()));
    }

    /**
     * Constructor that accepts a file and a file format
     *
     * @param specFile
     * @param specFormat
     */
    public SpectraAccessor(File specFile, SpecFileFormat specFormat) {
        this.specFile = specFile;
        this.specFormat = specFormat;
        this.spectrumParser = null;
    }

    /**
     * Set the MS level range for spectrum filtering (both inclusive).
     *
     * @param minMSLevel minimum MS level to consider (inclusive).
     * @param maxMSLevel maximum MS level to consider (inclusive).
     */
    public void setMSLevelRange(int minMSLevel, int maxMSLevel) {
        this.minMSLevel = minMSLevel;
        this.maxMSLevel = maxMSLevel;
    }

    public SpectrumAccessorBySpecIndex getSpecMap() {
        if (specMap == null) {
            if (specFormat == SpecFileFormat.MZML) {
                if (staxParser == null) {
                    try {
                        staxParser = new StaxMzMLParser(specFile);
                    } catch (Exception e) {
                        throw new RuntimeException("Failed to parse mzML file: " + specFile.getAbsolutePath(), e);
                    }
                }
                specMap = new StaxMzMLSpectraMap(staxParser, minMSLevel, maxMSLevel);
            } else if (specFormat == SpecFileFormat.DTA_TXT)
                specMap = new PNNLSpectraMap(specFile.getPath());
            else {
                SpectrumParser parser = null;
                if (specFormat == SpecFileFormat.MGF)
                    parser = new MgfSpectrumParser();
                else if (specFormat == SpecFileFormat.MS2)
                    parser = new MS2SpectrumParser();
                else if (specFormat == SpecFileFormat.PKL)
                    parser = new PklSpectrumParser();
                else
                    return null;

                spectrumParser = parser;
                specMap = new SpectraMap(specFile.getPath(), parser);
            }
        }

        if (specMap == null) {
            System.out.println("No spectra were found");
            System.out.println("File: " + specFile.getAbsolutePath());
            System.out.println("Format: " + specFormat.getPSIName());
        }

src/main/java/edu/ucsd/msjava/params/ParamManager.java:406

  • This change removes mzXML from the accepted spectrum file formats for CLI parameter parsing. The PR description focuses on replacing the mzML parser; if mzXML removal is intended, it should be called out explicitly (and ideally split into a separate PR) since it’s a breaking change for existing workflows.
    public void addSpecFileParam(boolean isOptional) {
        FileParameter specFileParam = new FileParameter(ParamNameEnum.SPECTRUM_FILE);
        if (isOptional) {
            specFileParam.setAsOptional();
        }
        specFileParam.addFileFormat(SpecFileFormat.MZML);
        specFileParam.addFileFormat(SpecFileFormat.MGF);
        specFileParam.addFileFormat(SpecFileFormat.MS2);
        specFileParam.addFileFormat(SpecFileFormat.PKL);
        specFileParam.addFileFormat(SpecFileFormat.DTA_TXT);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +29 to +38
@Override
public Spectrum next() {
Spectrum cur = currentSpectrum;
currentSpectrum = delegate.hasNext() ? delegate.next() : null;
if (currentSpectrum == null) hasNext = false;

if (cur.getScanPolarity() == Spectrum.Polarity.NEGATIVE) {
warnNegativePolarity(cur);
}
return cur;
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StaxMzMLSpectraIterator.next() does not guard against being called when hasNext() is false; currentSpectrum can be null and cur.getScanPolarity() will throw a NullPointerException. Iterator.next() should throw NoSuchElementException when exhausted.

Copilot uses AI. Check for mistakes.
Comment on lines +142 to +154
public Spectrum getSpectrumBySpecIndex(int specIndex) {
Spectrum cached = cache.get(specIndex);
if (cached != null) return cached;

if (!allLoaded) {
try {
preloadAllSpectra();
} catch (Exception e) {
System.err.println("Error preloading spectra: " + e.getMessage());
return null;
}
return cache.get(specIndex);
}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StaxMzMLParser.getSpectrumBySpecIndex() triggers a full-file preload on the first cache miss, even when specIndex is invalid/not present in the index. This can turn a simple “not found” lookup into an expensive full parse. Consider checking indexBySpecIdx.containsKey(specIndex) (or range) up front and returning null immediately for unknown indices.

Copilot uses AI. Check for mistakes.
@daichengxin
Copy link
Copy Markdown

So when the MS2 level is specified, does the number of identifications increase than original codes for the TMT MS3 datasets? Interesting

@ypriverol
Copy link
Copy Markdown
Member Author

ypriverol commented Apr 14, 2026

When the msgf+ search by mistake MS3, then it inflates the PSMs and reduce Ids. This is a big problem we have when we are searching because by default comet is in quantms charge 2; and for msgf+ we have to provide the fragmentation mode.

ypriverol and others added 9 commits April 14, 2026 21:18
Replace the jmzml JAXB-based MzMLUnmarshaller with a lightweight StAX
streaming parser that extracts only the 11 fields MSGF+ needs. The new
parser builds a spectrum index in a single pass, then preloads all
spectra into memory on first random access, eliminating repeated XML
parsing during the search phase.

Benchmark (TMT 1.1GB mzML, target-decoy, 4 threads):
- Wall time: 957s -> 853s (-10.9%)
- PSMs at 1% FDR: 6,936 (unchanged)
- Score completeness: 100% (unchanged)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port

- Remove jmzml (JAXB-based mzML parser) dependency from pom.xml
- Delete old jmzml-dependent classes: MzMLAdapter, MzMLSpectraMap,
  MzMLSpectraIterator, SpectrumConverter
- Add referenceableParamGroupRef resolution to StaxMzMLParser: builds
  a map of param groups during index pass, resolves refs during
  spectrum parsing (critical for files that define polarity, MS level,
  etc. in referenceable groups)
- Move turnOffLogs() utility to StaxMzMLParser, update all callers
- Keep fastutil dependency (needed by jmzidml at runtime)

JAR size reduced from 39.5MB to 38MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jmzReader (uk.ac.ebi.pride.tools:jmzreader:2.0.6) had zero imports
anywhere in the codebase — a dead dependency from earlier development.
All spectrum file format parsing uses custom implementations:
mzML (StaxMzMLParser), mzXML (embedded jrap/stax), MGF/MS2/PKL
(custom parsers).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove mzXML file format support entirely:
- Delete embedded jrap/stax library (20 files, ~5,800 lines)
- Delete MzXMLSpectraMap, MzXMLSpectraIterator, MzXMLToMgfConverter
- Delete MzXMLToMgf utility and mzXML test resources (38MB)
- Remove MZXML from SpecFileFormat enum, SpectraAccessor, ParamManager
- Update misc/scripts/ui classes to remove mzXML code paths

mzXML is a legacy format superseded by mzML. Users with mzXML files
can convert to mzML using msconvert (ProteoWizard).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- StaxMzMLParser: use ConcurrentHashMap for thread-safe spectrum cache,
  fix class-level doc (preload-all, not bounded LRU), check index before
  preloading, propagate exceptions instead of returning null
- StaxMzMLSpectraIterator: throw NoSuchElementException when exhausted
- SpectraAccessor: throw exception instead of System.exit(-1),
  validate specFormat is non-null in constructor
- SelectSpectra: update stale .mzXML reference to .mzML
- pom.xml: fix duplicate <manifest>, remove stale comments, note
  fastutil is required by jmzidentml at runtime

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Write search results directly to TSV from in-memory objects, bypassing
mzIdentML serialization. Output is column-identical to MzIDToTsv
(verified by diff on test.mgf search). This avoids generating large
.mzid files when only TSV is needed downstream (e.g. OpenMS
MSGFPlusAdapter, Percolator).

- New DirectTSVWriter class with same score/protein/mod logic as
  MZIdentMLGen but streaming tab-delimited output
- New -outputFormat parameter: 0=mzid (default), 1=tsv, 2=both
- Includes fixed + variable mods, MGF Title column, decoy filtering
- Backwards compatible: default remains mzid

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When -addFeatures 1 is used with -outputFormat tsv, the TSV now
includes all PSMFeatureFinder columns needed for Percolator:
ExplainedIonCurrentRatio, NTermIonCurrentRatio, CTermIonCurrentRatio,
MS2IonCurrent, MS1IonCurrent, IsolationWindowEfficiency,
NumMatchedMainIons, and all error statistics (MeanError/StdevError
for All and Top7, both absolute and relative).

These features were previously only available as UserParams in mzid
and were not extracted by OpenMS's addMSGFFeatures() — now they are
directly accessible as TSV columns.

The peptide modification format (M+15.995) is already compatible with
OpenMS MSGFPlusAdapter's modifySequence_() converter which transforms
it to bracket notation M[+15.995] for AASequence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace ConvertToMgf-based tests (class removed in PR #7) with
StaxMzMLParser and SpectraAccessor mzML parsing tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ypriverol ypriverol force-pushed the feature/stax-mzml-reader branch from abe1e38 to 3eb2966 Compare April 14, 2026 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants