bench: compare UnicodeSegmenterTokenizer vs alyze UAX#29 tokenizer by fmassot · Pull Request #2946 · quickwit-oss/tantivy

fmassot · 2026-06-02T03:13:39Z

Summary

Adds `benches/tokenizer_compare.rs`, a criterion benchmark comparing two UAX#29 word-breaking implementations across three corpora, matching alyze's own benchmark methodology.

Implementations compared:

- `UnicodeSegmenterTokenizer`: tantivy `Tokenizer` wrapper around `unicode_segmentation::unicode_word_indices()`, with `LowerCaser` + `RemoveLongFilter(255)`
- `alyze`: hand-rolled DFA with ASCII fast-path + ICU for non-ASCII, via its zero-allocation `Analyzer` API

Corpora:

- Wikipedia (64 MiB, mixed Unicode): same dataset as alyze's benchmark, downloaded from HuggingFace parquet
- Wikipedia ASCII: same articles with non-ASCII chars stripped, which isolates the ASCII fast-path
- Loghub (64 MiB): real-world logs from Apache, Zookeeper, Linux, Mac, SSH; downloaded from zenodo.org/records/8196385
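The "Wikipedia ASCII" corpus can be pictured with a small sketch. This is only an illustration of stripping non-ASCII characters; the function name `strip_non_ascii` is hypothetical and is not taken from `benches/tokenizer_compare.rs`.

```rust
/// Returns a copy of `text` with every non-ASCII character removed.
/// Hypothetical helper for illustration; the benchmark may strip
/// characters differently (for example, byte-wise).
fn strip_non_ascii(text: &str) -> String {
    text.chars().filter(|c| c.is_ascii()).collect()
}
```

Note that removing characters this way can merge or shorten words (for example `café` becomes `caf`), so the ASCII corpus is not the same token stream as the mixed one, only the same articles.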

Results (Apple M-series)

| Corpus | Variant | `unicode_seg` | `alyze` |
|---|---|---|---|
| Wikipedia (mixed) | tokenize_only | ~91 MiB/s | ~367 MiB/s |
| Wikipedia (mixed) | full_pipeline | ~76 MiB/s | ~236 MiB/s |
| Wikipedia (ASCII) | tokenize_only | ~434 MiB/s | ~365 MiB/s |
| Wikipedia (ASCII) | full_pipeline | ~231 MiB/s | ~241 MiB/s |
| Loghub | tokenize_only | ~634 MiB/s | ~545 MiB/s |
| Loghub | full_pipeline | ~250 MiB/s | ~315 MiB/s |
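For readers unfamiliar with the unit, the MiB/s figures are bytes processed divided by elapsed time. The sketch below shows that arithmetic with the standard library only; `mib_per_sec` is a hypothetical helper, and the benchmark itself presumably lets criterion compute throughput.

```rust
use std::time::Duration;

/// Bytes in one mebibyte.
const MIB: f64 = 1024.0 * 1024.0;

/// Converts a byte count and an elapsed time into MiB/s.
/// Hypothetical helper for illustration only.
fn mib_per_sec(bytes: u64, elapsed: Duration) -> f64 {
    (bytes as f64 / MIB) / elapsed.as_secs_f64()
}
```

Under this definition, a 64 MiB corpus processed in half a second corresponds to 128 MiB/s.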

Key findings:

- On mixed Unicode (Wikipedia), alyze is ~4× faster at tokenization (~367 vs ~91 MiB/s) and ~3× faster end-to-end (~236 vs ~76 MiB/s): its hand-rolled DFA handles non-ASCII without a slow fallback
- On ASCII-only input, `unicode_segmentation`'s fast-path catches up: `unicode_seg` is faster at tokenization (~434 vs ~365 MiB/s), and the two full pipelines are essentially equivalent (~231 vs ~241 MiB/s)
- On logs (nearly all ASCII, short lines), `unicode_seg` is faster at tokenization (~634 vs ~545 MiB/s), but alyze's `ReusableBuffer` zero-allocation pipeline wins end-to-end (~315 vs ~250 MiB/s)
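The ratios behind these findings can be recomputed from the results table. The sketch below hard-codes the approximate figures reported above; the names `RESULTS`, `alyze_speedup`, and `report` are hypothetical, and the inputs are rounded, so the ratios are only indicative.

```rust
/// Rows of the results table: (label, unicode_seg MiB/s, alyze MiB/s).
/// Values are the approximate figures reported in this PR.
const RESULTS: [(&str, f64, f64); 6] = [
    ("Wikipedia (mixed) / tokenize_only", 91.0, 367.0),
    ("Wikipedia (mixed) / full_pipeline", 76.0, 236.0),
    ("Wikipedia (ASCII) / tokenize_only", 434.0, 365.0),
    ("Wikipedia (ASCII) / full_pipeline", 231.0, 241.0),
    ("Loghub / tokenize_only", 634.0, 545.0),
    ("Loghub / full_pipeline", 250.0, 315.0),
];

/// Ratio of alyze throughput to unicode_seg throughput (> 1 means alyze is faster).
fn alyze_speedup(unicode_seg: f64, alyze: f64) -> f64 {
    alyze / unicode_seg
}

/// Formats one line per table row, e.g. "Loghub / full_pipeline: 1.26x".
fn report() -> Vec<String> {
    RESULTS
        .iter()
        .map(|(label, u, a)| format!("{label}: {:.2}x", alyze_speedup(*u, *a)))
        .collect()
}
```

A ratio below 1 in this report means `unicode_seg` was the faster side for that row.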

Running

```
cargo bench --bench tokenizer_compare
```

First run downloads data and caches it under `benches/.cache/`:

- Wikipedia: parquet shards from HuggingFace (~500 MB for the first shard)
- Loghub: TAR.GZ archives from Zenodo (~100 MB total for the 5 datasets)
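The download-once behaviour described above follows a common cache-or-fetch pattern. The sketch below shows that pattern with the standard library only; `load_cached` and its `fetch` closure are hypothetical and stand in for whatever download code the benchmark actually uses.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns the bytes stored at `cache_path`. If the file does not exist yet,
/// calls `fetch`, writes the result to `cache_path`, and returns it.
/// Hypothetical helper: `fetch` stands in for the real download step.
fn load_cached<F>(cache_path: &Path, fetch: F) -> io::Result<Vec<u8>>
where
    F: FnOnce() -> io::Result<Vec<u8>>,
{
    if cache_path.exists() {
        // Cache hit: skip the download entirely.
        return fs::read(cache_path);
    }
    let bytes = fetch()?;
    if let Some(parent) = cache_path.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::write(cache_path, &bytes)?;
    Ok(bytes)
}
```

With this shape, only the first run pays the download cost; deleting the cache directory forces a fresh fetch.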

🤖 Generated with Claude Code

… Wikipedia

Adds a new criterion benchmark (`tokenizer_compare`) that measures throughput (MiB/s) of two UAX#29 tokenizer implementations on 64 MiB of English Wikipedia, matching alyze's own benchmark methodology.

Implementations compared:
- UnicodeSegmenterTokenizer: unicode_segmentation::unicode_word_indices() wrapped in tantivy's Tokenizer trait, with LowerCaser + RemoveLongFilter(255)
- alyze: hand-rolled DFA with ASCII fast-path, via its Analyzer API

Results on this machine:
- unicode_seg/tokenize_only ~88 MiB/s
- unicode_seg/full_pipeline ~74 MiB/s
- alyze/tokenize_only ~359 MiB/s
- alyze/full_pipeline ~225 MiB/s

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3da16fa7b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-02T03:19:12Z

+            let mut count = 0u64;
+            for text in texts {
+                uax29::word::tokenize(text, uax29::word::Options::default(), |_, props| {
+                    if props.is_word_like() {


Align word-like filtering between tokenizers

When the benchmark corpus contains emoji or other Extended_Pictographic symbols, the `props.is_word_like()` predicate makes the alyze path count and analyze tokens that `unicode_word_indices()` does not produce: the unicode-segmentation iterator only yields spans containing Alphabetic or Number characters. As a result, the `tokenize_only` and `full_pipeline` comparisons can run different workloads and report misleading throughput and counts on non-alphanumeric Wikipedia text. The alyze side should use a predicate matching `unicode_word_indices()`, or both sides should benchmark raw boundaries instead.
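One way to picture the suggested alignment is a filter that keeps a span only if it contains at least one alphanumeric character. The sketch below uses the standard library's `char::is_alphanumeric` as an approximation; it is an assumption that this matches the crate's filter closely enough, and `has_alphanumeric` and `count_word_spans` are hypothetical names, not part of alyze or unicode-segmentation.

```rust
/// True if `span` contains at least one alphanumeric character.
/// Approximates the "Alphabetic or Number" rule described in the review
/// using the standard library's Unicode tables.
fn has_alphanumeric(span: &str) -> bool {
    span.chars().any(char::is_alphanumeric)
}

/// Counts only the spans that pass the alphanumeric filter.
fn count_word_spans<'a>(spans: impl IntoIterator<Item = &'a str>) -> usize {
    spans.into_iter().filter(|s| has_alphanumeric(s)).count()
}
```

Applied to the alyze callback, a filter of this kind would drop emoji-only and punctuation-only spans so that both sides count comparable tokens.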

Useful? React with 👍 / 👎.

- ASCII variant: strips non-ASCII chars from Wikipedia corpus to isolate the ASCII fast-path in both tokenizers
- Loghub variant: downloads real-world logs (Apache, Zookeeper, Linux, Mac, SSH) from zenodo.org/records/8196385 and caches them locally

Results (64 MiB each):
- unicode_seg_ascii/tokenize_only ~434 MiB/s (vs alyze ~365 MiB/s)
- unicode_seg_loghub/tokenize_only ~634 MiB/s (vs alyze ~545 MiB/s)
- alyze_loghub/full_pipeline ~315 MiB/s (vs unicode_seg ~250 MiB/s)

Key finding: unicode_segmentation's ASCII fast-path matches or beats alyze on ASCII-heavy corpora at the tokenize-only level; alyze's ReusableBuffer allocation strategy recovers the lead in the full pipeline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
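The allocation point in the commit message above can be illustrated with a generic buffer-reuse sketch. This is not alyze's `ReusableBuffer` implementation, whose internals are not shown in this PR; both function names are hypothetical, and the code only contrasts per-token allocation with reusing one caller-owned `String`.

```rust
/// Lowercases each token into a fresh `String`, allocating once per token.
/// Returns the total number of lowercased bytes.
fn lowercase_allocating(tokens: &[&str]) -> usize {
    tokens.iter().map(|t| t.to_lowercase().len()).sum()
}

/// Lowercases each token into one caller-owned buffer that is cleared and
/// reused, so the buffer's capacity is kept between tokens.
/// Returns the total number of lowercased bytes.
fn lowercase_reusing(tokens: &[&str], buf: &mut String) -> usize {
    let mut total = 0;
    for t in tokens {
        buf.clear(); // drops the contents but keeps the allocated capacity
        buf.extend(t.chars().flat_map(char::to_lowercase));
        total += buf.len();
    }
    total
}
```

On short tokens such as log fields, avoiding the per-token allocation is plausibly where a full pipeline gains time even when raw boundary detection is slower.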

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fb302231a7


chatgpt-codex-connector · 2026-06-02T16:29:02Z

+    };
+    let full = Analyzer::new(AnalysisOptions {
+        case_sensitive: false,
+        maximum_token_length: Some(MAX_TOKEN_LEN),


Match Tantivy’s effective 254-byte length cutoff

When the corpus contains a word-like token that is exactly 255 bytes, this Alyze pipeline keeps it while the Tantivy side drops it: `RemoveLongFilter::limit(255)` only accepts tokens with `token.text.len() < 255` (src/tokenizer/remove_long.rs:35-36), whereas Alyze's `maximum_token_length: Some(MAX_TOKEN_LEN)` treats 255 as the maximum allowed byte length. The `full_pipeline` variants therefore process different token counts for long log or base64-like tokens, and the comparison can be skewed unless Alyze uses 254 here or the Tantivy limit is raised to 256.
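The off-by-one described above comes down to an exclusive versus an inclusive bound. The sketch below models the two checks as the review describes them; the function names are hypothetical and neither is the actual tantivy or alyze code.

```rust
/// Exclusive bound, as the review describes `RemoveLongFilter::limit(n)`:
/// a token is kept only if its byte length is strictly below `limit`.
fn kept_by_exclusive_limit(token: &str, limit: usize) -> bool {
    token.len() < limit
}

/// Inclusive bound, as the review describes `maximum_token_length: Some(n)`:
/// a token is kept if its byte length is at most `max`.
fn kept_by_inclusive_max(token: &str, max: usize) -> bool {
    token.len() <= max
}
```

With both sides set to 255, a 255-byte token is dropped by the exclusive check and kept by the inclusive one; the pairs 255/254 or 256/255 make the two checks agree.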


fmassot wants to merge 2 commits into main from bench/alyze-vs-unicode-seg-tokenizer