Open
Labels
bug: Something isn't working
Description
Describe the issue as clearly as possible:
The ERNIE 4.5 tokenizer's vocabulary cannot be reduced: reduced_vocabulary raises a RuntimeError.
Steps/code to reproduce the bug:
from outlines_core.fsm.regex import reduced_vocabulary
from outlines.models.transformers import TransformerTokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("baidu/ERNIE-4.5-0.3B-PT")
vocabulary = reduced_vocabulary(TransformerTokenizer(tokenizer))
Expected result:
No error expected.
Error message:
Traceback (most recent call last):
  File "/Users/neil/workspace/amphibian-apps/apps/mlx-engine/../../outlines_test.py", line 6, in <module>
    vocabulary = reduced_vocabulary(TransformerTokenizer(tokenizer))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/neil/workspace/amphibian-apps/apps/mlx-engine/.venv/lib/python3.11/site-packages/outlines_core/fsm/regex.py", line 426, in reduced_vocabulary
    raise RuntimeError(
RuntimeError: Cannot convert token `�@` (36865) to bytes: �@
Outlines/Python version information:
Version information
Details
```
python -c "from outlines import _version; print(_version.version)"
1.1.0
python -c "import sys; print('Python', sys.version)"
Python 3.11.11 (main, Dec 3 2024, 17:20:40) [Clang 16.0.0 (clang-1600.0.26.4)]
uv pip freeze
addict==2.4.0
aiofiles==24.1.0
aiohappyeyeballs==2.6.1
aiohttp==3.11.18
aioice==0.10.1
aiortc==1.13.0
aiosignal==1.3.2
airportsdata==20250622
annotated-types==0.7.0
antlr4-python3-runtime==4.9.3
anyio==4.9.0
attrs==25.3.0
audioread==3.0.1
av==14.4.0
babel==2.17.0
blis==1.3.0
brotli==1.1.0
catalogue==2.0.10
certifi==2025.6.15
cffi==1.17.1
charset-normalizer==3.4.2
click==8.2.1
cloudpathlib==0.21.1
cloudpickle==3.1.1
colorama==0.4.6
coloredlogs==15.0.1
confection==0.1.5
cryptography==45.0.5
csvw==3.5.1
curated-tokenizers==0.0.9
curated-transformers==0.1.1
cymem==2.0.11
dacite==1.9.2
datasets==4.0.0
decorator==5.2.1
dill==0.3.8
diskcache==5.6.3
dlinfo==2.0.0
dnspython==2.7.0
docopt==0.6.2
einops==0.8.1
einx==0.3.0
espeakng-loader==0.2.4
fastapi==0.115.14
fastrtc==0.0.29
fastrtc-moonshine-onnx==20241016
ffmpy==0.6.0
filelock==3.18.0
flatbuffers==25.2.10
frozendict==2.4.6
frozenlist==1.6.0
fsspec==2024.12.0
future==1.0.0
genson==1.3.0
google-crc32c==1.7.1
gradio==5.38.0
gradio-client==1.11.0
groovy==0.1.2
h11==0.16.0
hf-xet==1.1.5
httpcore==1.0.9
httpx==0.28.1
huggingface-hub==0.33.1
humanfriendly==10.0
idna==3.10
ifaddr==0.2.0
iniconfig==2.1.0
interegular==0.3.3
iso3166==2.1.1
isodate==0.7.2
jinja2==3.1.6
joblib==1.5.1
jsonpath-ng==1.7.0
jsonschema==4.24.0
jsonschema-specifications==2025.4.1
langcodes==3.5.0
language-data==1.3.0
language-tags==1.2.0
lark==1.2.2
lazy-loader==0.4
librosa==0.11.0
llvmlite==0.44.0
loguru==0.7.3
marisa-trie==1.2.1
markdown-it-py==3.0.0
markupsafe==2.1.5
mdurl==0.1.2
misaki==0.9.4
mlx==0.26.3
mlx-audio==0.2.3
mlx-lm==0.26.0
mlx-vlm @ git+https://github.com/neilmehta24/mlx-vlm.git@73523d6538ef31ee13d62ce5391e67d8754a93e8
mpmath==1.3.0
msgpack==1.1.1
multidict==6.4.3
multiprocess==0.70.16
murmurhash==1.0.13
nest-asyncio==1.6.0
networkx==3.4.2
num2words==0.5.14
numba==0.61.2
numpy==2.1.3
omegaconf==2.3.0
onnxruntime==1.22.1
opencv-python==4.10.0.84
orjson==3.11.0
outlines==1.1.0
outlines-core==0.1.26
packaging==25.0
pandas==2.3.1
phonemizer-fork==3.3.2
pillow==11.2.1
platformdirs==4.3.8
pluggy==1.6.0
ply==3.11
pooch==1.8.2
preshed==3.0.10
propcache==0.3.1
protobuf==6.31.1
pyarrow==21.0.0
pycparser==2.22
pydantic==2.11.7
pydantic-core==2.33.2
pydub==0.25.1
pyee==13.0.0
pygments==2.19.2
pylibsrtp==0.12.0
pyloudnorm==0.1.1
pyopenssl==25.1.0
pyparsing==3.2.3
pytest==8.4.1
python-dateutil==2.9.0.post0
python-multipart==0.0.20
pytz==2025.2
pyyaml==6.0.2
rdflib==7.1.4
referencing==0.36.2
regex==2024.11.6
requests==2.32.4
rfc3986==1.5.0
rich==14.0.0
rpds-py==0.25.1
ruff==0.12.4
safehttpx==0.1.6
safetensors==0.5.3
scikit-learn==1.7.1
scipy==1.16.0
segments==2.3.0
semantic-version==2.10.0
sentencepiece==0.2.0
setuptools==80.0.0
shellingham==1.5.4
six==1.17.0
smart-open==7.3.0.post1
sniffio==1.3.1
sounddevice==0.5.2
soundfile==0.13.1
soxr==0.5.0.post1
spacy==3.8.7
spacy-curated-transformers==0.3.1
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.5.1
starlette==0.46.2
sympy==1.14.0
thinc==8.3.6
threadpoolctl==3.6.0
tiktoken==0.9.0
timm==1.0.16
tokenizers==0.21.2
tomlkit==0.13.3
torch==2.7.0
torchvision==0.22.0
tqdm==4.67.1
transformers==4.53.0
typer==0.16.0
typing-extensions==4.13.2
typing-inspection==0.4.1
tzdata==2025.2
uritemplate==4.2.0
urllib3==2.5.0
uvicorn==0.35.0
wasabi==1.1.3
weasel==0.4.1
webrtcvad==2.0.10
websockets==15.0.1
wrapt==1.17.2
xxhash==3.5.0
yarl==1.20.0
```
Context for the issue:
I came across this issue when attempting to use Outlines with an ERNIE 4.5 model.
This is the temporary fix I came up with:
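import re
import outlines_core.fsm.regex
# Widen the replacement-sequence pattern so that it also accepts `�@`.
outlines_core.fsm.regex.re_replacement_seq = re.compile(r"^▁*\.*>*�+\.*s*@*(�@)*$")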
Hopefully a more general fix can be made that better handles occurrences of this character (�) in tokenizers.
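To make the effect of the widened pattern concrete, here is a minimal sketch that uses only the standard `re` module and does not import outlines_core. The pattern is copied from the workaround above, and `�@` is the decoded string from the error message. The other sample strings and the names `patched_seq`, `samples` and `results` are hypothetical, chosen only for illustration. It also assumes, based on the workaround rather than on anything confirmed here, that `reduced_vocabulary` skips the failing byte conversion for strings this pattern matches.

```python
import re

# Pattern copied verbatim from the workaround in this issue.
patched_seq = re.compile(r"^▁*\.*>*�+\.*s*@*(�@)*$")

# "�@" is the decoded token string from the error message; every other
# sample is hypothetical and only illustrates what the pattern accepts.
samples = ["�@", "�", "▁�", "�@�@", "a�b", "@�"]

# Map each sample to whether the patched pattern matches it.
results = {s: bool(patched_seq.match(s)) for s in samples}

for s, ok in results.items():
    print(repr(s), "matches" if ok else "does not match")
```

The pattern stays anchored at both ends, so a replacement character embedded in ordinary text (such as the hypothetical `a�b`) is still rejected. Only the specific `@` suffix forms are newly accepted, which is why this is a narrow workaround rather than a general fix.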