bench: compare UnicodeSegmenterTokenizer vs alyze UAX#29 tokenizer #2946
fmassot wants to merge 2 commits into
… Wikipedia

Adds a new criterion benchmark (`tokenizer_compare`) that measures throughput (MiB/s) of two UAX#29 tokenizer implementations on 64 MiB of English Wikipedia, matching alyze's own benchmark methodology.

Implementations compared:
- UnicodeSegmenterTokenizer: unicode_segmentation::unicode_word_indices() wrapped in tantivy's Tokenizer trait, with LowerCaser + RemoveLongFilter(255)
- alyze: hand-rolled DFA with ASCII fast-path, via its Analyzer API

Results on this machine:
- unicode_seg/tokenize_only ~88 MiB/s
- unicode_seg/full_pipeline ~74 MiB/s
- alyze/tokenize_only ~359 MiB/s
- alyze/full_pipeline ~225 MiB/s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
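For readers comparing the numbers, throughput in MiB/s is bytes processed divided by elapsed wall-clock time. The sketch below shows only that arithmetic; `mib_per_sec` is a hypothetical helper and is not part of the PR, which presumably relies on criterion's own throughput reporting.

```rust
use std::time::Duration;

/// Hypothetical helper (not from the PR): converts a byte count and an
/// elapsed wall-clock time into a throughput figure in MiB/s.
fn mib_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    const MIB: f64 = 1024.0 * 1024.0;
    (bytes as f64 / MIB) / elapsed.as_secs_f64()
}
```

Read the other way, a figure of ~88 MiB/s means one pass over the 64 MiB corpus takes roughly 0.73 s, and ~359 MiB/s means roughly 0.18 s.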
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3da16fa7b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    let mut count = 0u64;
    for text in texts {
        uax29::word::tokenize(text, uax29::word::Options::default(), |_, props| {
            if props.is_word_like() {
Align word-like filtering between tokenizers
When the benchmark corpus contains emoji or other Extended_Pictographic symbols, this props.is_word_like() predicate makes the alyze path count/analyze tokens that unicode_word_indices() does not produce; the unicode-segmentation iterator only yields spans containing Alphabetic or Number characters. That means the tokenize_only and full_pipeline comparisons can run different workloads and report misleading throughput/counts on non-alphanumeric Wikipedia text, so the alyze side should use a predicate matching unicode_word_indices() or both sides should benchmark raw boundaries instead.
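One possible shape for such a matching predicate, sketched with the standard library only. `has_alphanumeric` is a hypothetical helper, and treating `char::is_alphanumeric` as equivalent to unicode-segmentation's internal Alphabetic/Number check is an assumption; the two may differ on edge cases across Unicode versions.

```rust
/// Hypothetical helper (not from the PR): keeps a span only if it contains
/// at least one alphabetic or numeric character. This approximates the
/// filter the review attributes to `unicode_word_indices()`; it is an
/// assumption that the two agree on every code point.
fn has_alphanumeric(span: &str) -> bool {
    span.chars().any(char::is_alphanumeric)
}
```

Under this predicate an emoji-only or punctuation-only span is skipped, while spans such as `hello`, `42`, or `naïve` are kept, so both sides would count the same kind of token.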
- ASCII variant: strips non-ASCII chars from Wikipedia corpus to isolate the ASCII fast-path in both tokenizers
- Loghub variant: downloads real-world logs (Apache, Zookeeper, Linux, Mac, SSH) from zenodo.org/records/8196385 and caches them locally

Results (64 MiB each):
- unicode_seg_ascii/tokenize_only ~434 MiB/s (vs alyze ~365 MiB/s)
- unicode_seg_loghub/tokenize_only ~634 MiB/s (vs alyze ~545 MiB/s)
- alyze_loghub/full_pipeline ~315 MiB/s (vs unicode_seg ~250 MiB/s)

Key finding: unicode_segmentation's ASCII fast-path matches or beats alyze on ASCII-heavy corpora at the tokenize-only level; alyze's ReusableBuffer allocation strategy recovers the lead in the full pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
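The ASCII variant's preprocessing can be illustrated as follows. This is a minimal sketch, not the PR's code: `strip_non_ascii` is a hypothetical name, and it assumes non-ASCII characters are dropped outright rather than replaced or transliterated.

```rust
/// Hypothetical helper (not from the PR): returns a copy of `text` with
/// every non-ASCII character removed, leaving ASCII characters untouched.
fn strip_non_ascii(text: &str) -> String {
    text.chars().filter(|c| c.is_ascii()).collect()
}
```

Dropping characters rather than replacing them changes token boundaries (for example, `café` becomes `caf`), which is acceptable for a throughput benchmark but would not be for a correctness test.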
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fb302231a7
    };
    let full = Analyzer::new(AnalysisOptions {
        case_sensitive: false,
        maximum_token_length: Some(MAX_TOKEN_LEN),
Match Tantivy’s effective 254-byte length cutoff
When the corpus contains a word-like token that is exactly 255 bytes, this Alyze pipeline keeps it while the Tantivy side drops it: RemoveLongFilter::limit(255) only accepts tokens with token.text.len() < 255 (src/tokenizer/remove_long.rs:35-36), whereas Alyze’s maximum_token_length: Some(MAX_TOKEN_LEN) treats 255 as the maximum allowed byte length. That makes the full_pipeline variants process different token counts for long log/base64-like tokens, so the comparison can be skewed unless Alyze uses 254 here or the Tantivy limit is raised to 256.
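The off-by-one can be made concrete with two stand-in predicates. Both functions are hypothetical models written for illustration: `passes_strict` mirrors the `len() < limit` check quoted above for `RemoveLongFilter`, and `passes_inclusive` models the inclusive maximum the review attributes to alyze (an assumption taken from the comment, not verified against alyze's source).

```rust
/// Models the strict check quoted in the review for Tantivy's
/// RemoveLongFilter: a token passes only if its byte length is below `limit`.
fn passes_strict(token: &str, limit: usize) -> bool {
    token.len() < limit
}

/// Models an inclusive maximum, as the review describes alyze's
/// `maximum_token_length` (assumption): a token passes if its byte
/// length is at most `max_len`.
fn passes_inclusive(token: &str, max_len: usize) -> bool {
    token.len() <= max_len
}
```

With both sides set to 255, a token of exactly 255 bytes is dropped by the strict check and kept by the inclusive one. Setting the inclusive side to 254, or the strict side to 256, makes the two agree at that boundary.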
Summary
Adds `benches/tokenizer_compare.rs`, a criterion benchmark comparing two UAX#29 word-breaking implementations across three corpora, matching alyze's own benchmark methodology.
Implementations compared:
Corpora:
Results (Apple M-series)
Key findings:
Running
```
cargo bench --bench tokenizer_compare
```
First run downloads data and caches it under `benches/.cache/`:
🤖 Generated with Claude Code