Fix ANKH tokenizer to load from checkpoint hub id #31

Open
avivko wants to merge 1 commit into Synthyra:main from avivko:fix/ankh-tokenizer-from-checkpoint
avivko commented May 18, 2026

FAST_ANKH_ENCODER always attached the ElnaggarLab/ankh-base tokenizer, so ANKH3 checkpoints (256-token vocab) were tokenized with the wrong ids in embed.py. Load the tokenizer from config._name_or_path instead (same pattern as DPLM), with an ankh-base fallback for bare configs.

Add slow tests that check the fast tokenizer ids match those from the official repo for ANKH_base, ANKH3_large, and ANKH3_xl.

Summary

  • FAST_ANKH_ENCODER no longer hardcodes ElnaggarLab/ankh-base; it loads the tokenizer from config._name_or_path (e.g. Synthyra/ANKH3_large), matching the checkpoint vocab.
  • Fixes wrong tokenization / fp16 NaNs in embed.py for ANKH3 large/XL.
  • Adds slow tests comparing fast vs official token ids for ANKH_base, ANKH3_large, and ANKH3_xl.
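
The lookup described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the helper name `resolve_tokenizer_id` and the constant `DEFAULT_ANKH_TOKENIZER` are hypothetical, and the sketch assumes a "bare" config is one whose `_name_or_path` is missing or empty.

```python
from types import SimpleNamespace

# Fallback repo named in this PR; the constant name itself is hypothetical.
DEFAULT_ANKH_TOKENIZER = "ElnaggarLab/ankh-base"


def resolve_tokenizer_id(config, default=DEFAULT_ANKH_TOKENIZER):
    """Return the hub id (or local path) the tokenizer should be loaded from.

    Hypothetical helper: prefers config._name_or_path and falls back to
    `default` when the attribute is missing, empty, or not a string.
    """
    name_or_path = getattr(config, "_name_or_path", None)
    if isinstance(name_or_path, str) and name_or_path.strip():
        return name_or_path
    return default


# Stand-in config objects, used here so the sketch runs without downloads.
checkpoint_config = SimpleNamespace(_name_or_path="Synthyra/ANKH3_large")
bare_config = SimpleNamespace(_name_or_path="")

print(resolve_tokenizer_id(checkpoint_config))
print(resolve_tokenizer_id(bare_config))
```

In a real model class the returned id would presumably be handed to something like `AutoTokenizer.from_pretrained(...)`, so the tokenizer vocab follows the checkpoint rather than a fixed repo.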

Hub follow-up

After merge, re-push modeling_ankh.py to Synthyra ANKH checkpoints (e.g. via get_weights.py) so trust_remote_code=True users pick up the fix without a local checkout.

Test plan

  • pytest testing/test_ankh_tokenizer.py -v (Docker, 6 passed)
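
For readers without the test file at hand, the kind of comparison these tests make can be sketched like this. It is an assumption-laden illustration, not the contents of testing/test_ankh_tokenizer.py: `first_id_mismatch` is a hypothetical helper, and the toy encoders stand in for the fast and official tokenizers so the example runs offline.

```python
def first_id_mismatch(fast_encode, reference_encode, sequences):
    """Return (sequence, fast_ids, reference_ids) for the first disagreement.

    Hypothetical helper: both arguments are callables mapping a sequence
    string to a list of token ids. Returns None when all ids agree.
    """
    for seq in sequences:
        fast_ids = list(fast_encode(seq))
        ref_ids = list(reference_encode(seq))
        if fast_ids != ref_ids:
            return seq, fast_ids, ref_ids
    return None


# Toy stand-ins for real tokenizers (illustrative vocab, not ANKH's).
toy_vocab = {ch: i for i, ch in enumerate("ACDEFGHIKLMNPQRSTVWY", start=3)}


def toy_encode(seq):
    # Map each residue to an id and append a made-up end-of-sequence id.
    return [toy_vocab[ch] for ch in seq] + [1]


def shifted_encode(seq):
    # Simulates a tokenizer built from the wrong vocab: every id is off by one.
    return [i + 1 for i in toy_encode(seq)]


sequences = ["MKT", "ACDEFG"]
print(first_id_mismatch(toy_encode, toy_encode, sequences))
print(first_id_mismatch(toy_encode, shifted_encode, sequences))
```

With real tokenizers, each callable could be a small wrapper such as `lambda s: tok(s)["input_ids"]`, and a test would assert that the helper returns `None`.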

FAST_ANKH_ENCODER always attached ElnaggarLab/ankh-base, so ANKH3
checkpoints (256-token vocab) were tokenized with the wrong ids in
embed.py. Load from config._name_or_path (same pattern as DPLM) with
ankh-base fallback for bare configs.

Add slow tests that fast tokenizer ids match the official repo for
ANKH_base, ANKH3_large, and ANKH3_xl.

Co-authored-by: Cursor <cursoragent@cursor.com>