Commit eb747fd

chore: improved tokenizer vocabulary warning
1 parent 85388cd commit eb747fd

2 files changed

Lines changed: 7 additions & 2 deletions

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -171,4 +171,8 @@ tutorials/instruction_tuning/prepared_data
 config_files/instruction_tuning
 data/lorem_ipsum_instruct.jsonl
 tutorials/scaling_up/logs*
-tutorials/scaling_up/experiments_old/*
+tutorials/scaling_up/experiments_old/*
+
+results/*
+tutorials/einsum_transformer/experiments/*
+tutorials/warmstart/experiments/*

src/modalities/tokenization/tokenizer_wrapper.py

Lines changed: 2 additions & 1 deletion
@@ -118,7 +118,8 @@ def __init__(
             if len(self.tokenizer.get_vocab()) > old_vocab_size:
                 raise NotImplementedError(
                     "Currently only tokens already known to the tokenizers vocabulary can be added,"
-                    + " as resizing the embedding matrix is not yet supported!"
+                    + " as resizing the embedding matrix is not yet supported! "
+                    f"Before: {old_vocab_size}, after: {len(self.tokenizer.get_vocab())}"
                 )
         self.max_length = max_length
         self.truncation = truncation
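
For context, the check touched by this hunk can be sketched in isolation. This is a minimal illustration only, not the wrapper's actual code: `ToyTokenizer` and `add_known_special_tokens` are hypothetical names introduced here, and the stub merely imitates a tokenizer that exposes `get_vocab()`. It shows how comparing the vocabulary size before and after adding tokens yields the "Before/after" numbers that the improved message reports.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer (assumption: not the real wrapper or library API)."""

    def __init__(self, vocab):
        # Map each token to a consecutive integer id.
        self._vocab = {tok: i for i, tok in enumerate(vocab)}

    def get_vocab(self):
        # Return a copy so callers cannot mutate the internal mapping.
        return dict(self._vocab)

    def add_special_tokens(self, tokens):
        # Tokens that already exist keep their id; unknown ones get a new id.
        for tok in tokens:
            self._vocab.setdefault(tok, len(self._vocab))


def add_known_special_tokens(tokenizer, tokens):
    """Add tokens, then fail if the vocabulary grew (hypothetical helper)."""
    old_vocab_size = len(tokenizer.get_vocab())
    tokenizer.add_special_tokens(tokens)
    new_vocab_size = len(tokenizer.get_vocab())
    # As in the diff, the check runs after the tokens were added.
    if new_vocab_size > old_vocab_size:
        raise NotImplementedError(
            "Currently only tokens already known to the tokenizers vocabulary can be added,"
            " as resizing the embedding matrix is not yet supported! "
            f"Before: {old_vocab_size}, after: {new_vocab_size}"
        )


if __name__ == "__main__":
    tok = ToyTokenizer(["<pad>", "<eos>", "hello"])
    add_known_special_tokens(tok, ["<eos>"])  # already known: no error
    try:
        add_known_special_tokens(tok, ["<new>"])  # grows the vocabulary
    except NotImplementedError as err:
        print(err)
```

Note that in the diff the f-string follows the `+` operand without another `+`. Adjacent string literals are concatenated before `+` is applied, so the resulting message is the same as in the sketch.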
