
Optimize parser and generator for fewer allocations and less regex work #70

Open
dduugg wants to merge 1 commit into patsplat:master from dduugg:performance-optimizations

dduugg commented Apr 27, 2026

Summary

  • Cache tag-matching regex patterns in StreamParser as memoized class methods. The start/end tag patterns were previously recompiled from PTag.mappings on every call to parse().
  • Replace gsub(/\s+/, '') with delete(" \t\n\r") for base64 whitespace stripping in PData#to_ruby and data_tag. This avoids the regex engine for a character-deletion task.
  • Use pack("m0") in data_tag instead of pack("m").gsub(/\s+/, ''). The m0 directive produces base64 without line breaks directly, eliminating an intermediate string allocation and a regex traversal over potentially large data.
  • Remove redundant .to_s in indent(): @indent_str is already converted to String in initialize.
  • Avoid double contents.to_s in tag() by extracting to a local variable.
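
For reviewers who want to see the first two changes in isolation, here is a minimal sketch. `SketchParser` and `TAG_NAMES` are hypothetical stand-ins and not the library's actual code; in the real parser the tag list is derived from PTag.mappings.

```ruby
# Hypothetical stand-in for StreamParser; names are illustrative only.
class SketchParser
  # Assumed tag list (the real one comes from PTag.mappings).
  TAG_NAMES = %w[string integer dict array data].freeze

  # An interpolated regex literal is rebuilt each time it is evaluated,
  # so memoizing it on the class means it is built once per process.
  def self.start_tag_pattern
    @start_tag_pattern ||= /<(#{Regexp.union(TAG_NAMES).source})([^>]*)>/
  end

  def self.end_tag_pattern
    @end_tag_pattern ||= /<\/(#{Regexp.union(TAG_NAMES).source})\s*>/
  end
end

# Whitespace stripping without the regex engine.
encoded  = "aGVs\nbG8g d29y\tbGQ=\r\n"
stripped = encoded.delete(" \t\n\r")
puts stripped
puts SketchParser.start_tag_pattern.match("<dict>")[1]
```

One difference worth keeping in mind: Ruby's `\s` also matches form feed and vertical tab, which `delete(" \t\n\r")` leaves in place. That should not matter for well-formed base64, but it is a small behavioral change for unusual input.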

Benchmarks

2000 iterations each, measured with Benchmark.bm (Ruby 3.x, macOS arm64):

| Benchmark | Before (user) | After (user) | Δ |
|---|---|---|---|
| Parse small plist | 0.131 s | 0.087 s | −34% |
| Parse file plist (AlbumData.xml) | 2.302 s | 2.315 s | ≈ same |
| Emit nested Hash | 0.027 s | 0.028 s | ≈ same |
| Emit IO/data element (JPEG) | 2.697 s | 0.371 s | −86% |
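
A harness along these lines can be used to compare the two base64 paths on its own. This is a sketch and not the script behind the table above: the payload is random bytes standing in for a JPEG, and the payload size and lambda names are assumptions.

```ruby
require "benchmark"

ITERATIONS = 2000
# Assumption: 16 KiB of seeded random bytes stands in for image data.
payload = Random.new(42).bytes(16 * 1024)

# Old path: wrapped base64, then a regex pass to strip the line breaks.
old_way = -> { [payload].pack("m").gsub(/\s+/, "") }
# New path: unwrapped base64 straight from pack.
new_way = -> { [payload].pack("m0") }

Benchmark.bm(16) do |bm|
  bm.report("pack(m) + gsub") { ITERATIONS.times { old_way.call } }
  bm.report("pack(m0)")       { ITERATIONS.times { new_way.call } }
end
```

Absolute timings will vary by machine and Ruby version, so only the ratio between the two rows is meaningful.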

The regex caching pays off most when parsing many small documents in a single process (each parse previously rebuilt and compiled two regexes from scratch). The pack("m0") change is the largest absolute win — the old gsub traversed the entire base64 string, which is proportionally large for binary payloads like images.
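
The pack("m0") substitution is only safe if both paths yield the same string. A quick check, using a made-up binary payload, might look like this:

```ruby
# Hypothetical payload: every byte value, repeated, as binary data.
payload = (0..255).to_a.pack("C*") * 4

wrapped   = [payload].pack("m")        # base64 with a newline every 60 characters
old_style = wrapped.gsub(/\s+/, "")    # previous approach: strip with a regex
new_style = [payload].pack("m0")       # base64 with no line breaks

puts old_style == new_style
# "m0" on the decoding side is strict and rejects embedded whitespace.
puts new_style.unpack1("m0") == payload
```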

- Cache start/end tag regex patterns as memoized class methods on
  StreamParser instead of recompiling them on every parse() call
- Replace gsub(/\s+/, '') with delete(" \t\n\r") for base64 whitespace
  stripping in PData and data_tag, avoiding regex engine overhead
- Use pack("m0") in data_tag to produce base64 without line breaks
  directly, eliminating an intermediate gsub allocation
- Remove redundant .to_s call in indent() since @indent_str is already
  converted in initialize
- Avoid calling contents.to_s twice in tag() by extracting to a local
  variable
@dduugg dduugg force-pushed the performance-optimizations branch from b74090f to 770b7fa Compare April 27, 2026 03:30
@dduugg dduugg marked this pull request as ready for review April 27, 2026 03:34