bench: compare UnicodeSegmenterTokenizer vs alyze UAX#29 tokenizer #2946

Open: fmassot wants to merge 2 commits into main from bench/alyze-vs-unicode-seg-tokenizer

@fmassot fmassot commented Jun 2, 2026

Summary

Adds `benches/tokenizer_compare.rs`, a criterion benchmark comparing two UAX#29 word-breaking implementations across three corpora, matching alyze's own benchmark methodology.

Implementations compared:

  • `UnicodeSegmenterTokenizer`: tantivy `Tokenizer` wrapper around `unicode_segmentation::unicode_word_indices()`, with `LowerCaser` + `RemoveLongFilter(255)`
  • `alyze`: hand-rolled DFA with ASCII fast-path + ICU for non-ASCII, via its zero-allocation `Analyzer` API

Corpora:

  • Wikipedia (64 MiB, mixed Unicode) — same dataset as alyze's benchmark, downloaded from HuggingFace parquet
  • Wikipedia ASCII — same articles with non-ASCII chars stripped, isolates the ASCII fast-path
  • Loghub (64 MiB) — real-world logs from Apache, Zookeeper, Linux, Mac, SSH; downloaded from zenodo.org/records/8196385
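
The ASCII corpus is derived from the mixed one by dropping non-ASCII characters. A minimal sketch of that step, assuming a plain per-character filter (the helper name `strip_non_ascii` is hypothetical and is not taken from `benches/tokenizer_compare.rs`):

```rust
/// Hypothetical helper: drop every non-ASCII character from `text`.
/// ASCII characters, including whitespace and punctuation, are kept as-is.
fn strip_non_ascii(text: &str) -> String {
    text.chars().filter(|c| c.is_ascii()).collect()
}
```

Note that stripping shortens words instead of removing them (`"café"` becomes `"caf"`), so the ASCII corpus is a different token stream, not a subset of the mixed one.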

Results (Apple M-series)

| Corpus | Variant | `unicode_seg` | `alyze` |
|---|---|---|---|
| Wikipedia (mixed) | tokenize_only | ~91 MiB/s | ~367 MiB/s |
| Wikipedia (mixed) | full_pipeline | ~76 MiB/s | ~236 MiB/s |
| Wikipedia (ASCII) | tokenize_only | ~434 MiB/s | ~365 MiB/s |
| Wikipedia (ASCII) | full_pipeline | ~231 MiB/s | ~241 MiB/s |
| Loghub | tokenize_only | ~634 MiB/s | ~545 MiB/s |
| Loghub | full_pipeline | ~250 MiB/s | ~315 MiB/s |
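
The MiB/s figures above are throughput values: bytes processed divided by elapsed time. The conversion can be sketched as follows, assuming 1 MiB = 1024 × 1024 bytes; the function name `mib_per_sec` is illustrative, and the benchmark itself presumably relies on criterion's own throughput reporting instead of hand-rolled timing:

```rust
use std::time::Duration;

const MIB: f64 = 1024.0 * 1024.0;

/// Convert a byte count and an elapsed wall-clock time into MiB/s.
fn mib_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    bytes as f64 / MIB / elapsed.as_secs_f64()
}
```

For a 64 MiB corpus, one second per pass corresponds to 64 MiB/s and half a second to 128 MiB/s.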

Key findings:

  • On mixed Unicode (Wikipedia), alyze is ~4× faster at tokenization — its hand-rolled DFA handles non-ASCII without a slow fallback
  • On ASCII-only input, `unicode_segmentation`'s fast-path catches up: it leads at `tokenize_only` (~434 vs ~365 MiB/s), and the two are essentially equivalent on `full_pipeline` (~231 vs ~241 MiB/s)
  • On logs (nearly all ASCII, short lines), `unicode_seg` is faster at tokenization (~634 vs ~545 MiB/s), but alyze's `ReusableBuffer` zero-allocation pipeline wins end-to-end (~315 vs ~250 MiB/s)
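
The ratios behind these findings follow directly from the reported throughputs. A small sketch of the arithmetic (the helper `speedup` is illustrative only; the inputs are the approximate figures reported above):

```rust
/// Ratio of two throughputs in the same unit.
/// A result above 1.0 means the first implementation is faster.
fn speedup(a_mib_s: f64, b_mib_s: f64) -> f64 {
    a_mib_s / b_mib_s
}
```

With the reported numbers, `speedup(367.0, 91.0)` is about 4.03 (alyze on mixed Wikipedia, `tokenize_only`), `speedup(634.0, 545.0)` is about 1.16 (`unicode_seg` on Loghub, `tokenize_only`), and `speedup(315.0, 250.0)` is 1.26 (alyze on Loghub, `full_pipeline`).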

Running

```
cargo bench --bench tokenizer_compare
```

First run downloads data and caches it under `benches/.cache/`:

  • Wikipedia: parquet shards from HuggingFace (~500 MB for the first shard)
  • Loghub: TAR.GZ archives from Zenodo (~100 MB total for the 5 datasets)

🤖 Generated with Claude Code

… Wikipedia

Adds a new criterion benchmark (`tokenizer_compare`) that measures throughput
(MiB/s) of two UAX#29 tokenizer implementations on 64 MiB of English Wikipedia,
matching alyze's own benchmark methodology.

Implementations compared:
- UnicodeSegmenterTokenizer: unicode_segmentation::unicode_word_indices() wrapped
  in tantivy's Tokenizer trait, with LowerCaser + RemoveLongFilter(255)
- alyze: hand-rolled DFA with ASCII fast-path, via its Analyzer API

Results on this machine:
  unicode_seg/tokenize_only  ~88 MiB/s
  unicode_seg/full_pipeline  ~74 MiB/s
  alyze/tokenize_only       ~359 MiB/s
  alyze/full_pipeline       ~225 MiB/s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3da16fa7b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

    let mut count = 0u64;
    for text in texts {
        uax29::word::tokenize(text, uax29::word::Options::default(), |_, props| {
            if props.is_word_like() {


P2: Align word-like filtering between tokenizers

When the benchmark corpus contains emoji or other Extended_Pictographic symbols, this `props.is_word_like()` predicate makes the alyze path count and analyze tokens that `unicode_word_indices()` does not produce: the unicode-segmentation iterator only yields spans containing Alphabetic or Number characters. As a result, the `tokenize_only` and `full_pipeline` comparisons can run different workloads and report misleading throughput and counts on non-alphanumeric Wikipedia text. The alyze side should use a predicate matching `unicode_word_indices()`, or both sides should benchmark raw boundaries instead.
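
One way to picture the suggested alignment is a filter that keeps a segment only when it contains at least one alphabetic or numeric character. The sketch below is an approximation under stated assumptions: it uses the standard library's `char::is_alphanumeric`, whose Unicode tables may differ in edge cases from the ones `unicode_segmentation` uses, and the function name `has_alphanumeric` is hypothetical, not part of either crate's API:

```rust
/// Hypothetical filter: keep a segment only if it contains at least one
/// alphabetic or numeric character (approximated with std's Unicode tables).
fn has_alphanumeric(segment: &str) -> bool {
    segment.chars().any(char::is_alphanumeric)
}
```

Under this filter, segments such as `"hello"`, `"3.14"`, or `"日本語"` are kept, while an emoji-only or punctuation-only segment is dropped.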


- ASCII variant: strips non-ASCII chars from Wikipedia corpus to isolate
  the ASCII fast-path in both tokenizers
- Loghub variant: downloads real-world logs (Apache, Zookeeper, Linux,
  Mac, SSH) from zenodo.org/records/8196385 and caches them locally

Results (64 MiB each):
  unicode_seg_ascii/tokenize_only  ~434 MiB/s  (vs alyze ~365 MiB/s)
  unicode_seg_loghub/tokenize_only ~634 MiB/s  (vs alyze ~545 MiB/s)
  alyze_loghub/full_pipeline       ~315 MiB/s  (vs unicode_seg ~250 MiB/s)

Key finding: unicode_segmentation's ASCII fast-path matches or beats
alyze on ASCII-heavy corpora at the tokenize-only level; alyze's
ReusableBuffer allocation strategy recovers the lead in the full pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb302231a7


    };
    let full = Analyzer::new(AnalysisOptions {
        case_sensitive: false,
        maximum_token_length: Some(MAX_TOKEN_LEN),


P2: Match Tantivy’s effective 254-byte length cutoff

When the corpus contains a word-like token that is exactly 255 bytes, this Alyze pipeline keeps it while the Tantivy side drops it: `RemoveLongFilter::limit(255)` only accepts tokens with `token.text.len() < 255` (`src/tokenizer/remove_long.rs:35-36`), whereas Alyze’s `maximum_token_length: Some(MAX_TOKEN_LEN)` treats 255 as the maximum allowed byte length. That makes the `full_pipeline` variants process different token counts for long log/base64-like tokens, so the comparison can be skewed unless Alyze uses 254 here or the Tantivy limit is raised to 256.
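
The off-by-one can be modeled without either library. The sketch below encodes the two bounds exactly as this comment describes them (exclusive `<` on the Tantivy side, inclusive `<=` on the Alyze side); the function names are hypothetical, and `MAX_TOKEN_LEN = 255` is assumed from the `RemoveLongFilter(255)` setting in the PR description:

```rust
/// Assumed value, taken from `RemoveLongFilter(255)` in the PR description.
const MAX_TOKEN_LEN: usize = 255;

/// Exclusive bound: a token is kept only if its byte length is below `limit`.
fn kept_by_exclusive_limit(token_len: usize, limit: usize) -> bool {
    token_len < limit
}

/// Inclusive bound: a token is kept if its byte length is at most `max`.
fn kept_by_inclusive_max(token_len: usize, max: usize) -> bool {
    token_len <= max
}

/// True when both filters make the same keep/drop decision for this length.
fn filters_agree(token_len: usize, exclusive_limit: usize, inclusive_max: usize) -> bool {
    kept_by_exclusive_limit(token_len, exclusive_limit)
        == kept_by_inclusive_max(token_len, inclusive_max)
}
```

With both sides set to 255, the filters disagree only for a token of exactly 255 bytes. Either proposed fix (an inclusive maximum of 254, or an exclusive limit of 256) makes `filters_agree` hold for every length.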

