Vocabulary / GPT2: incorrect interpretation of tokenId = 216 #190

@agourdel

Describe the issue as clearly as possible:

TokenId(216) in the GPT2 alphabet, whose value is "\u011c", has only the single byte 28 in its Vec in the Vocabulary.
Byte 28 is '\x1C', so the alphabet may be handled incorrectly when it is loaded.
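To make the two candidate byte sequences concrete, here is a minimal sketch that computes both. `utf8_bytes` returns the UTF-8 encoding of the token string, which is the value this report expects. `gpt2_char_to_byte` is a hypothetical helper, not part of the library: it inverts the byte-to-character table used by GPT-2's byte-level BPE, assuming the usual construction (printable bytes map to themselves, the remaining bytes are assigned code points from 256 upward in ascending order). Under that assumption "\u011c" corresponds to the single byte 0x1C, which may be where the observed value comes from.

```rust
/// UTF-8 encoding of a token's string form.
fn utf8_bytes(token: &str) -> Vec<u8> {
    token.as_bytes().to_vec()
}

/// Hypothetical helper: inverse of the GPT-2 byte-level alphabet.
/// Assumption: bytes 33..=126, 161..=172 and 174..=255 map to the code
/// point with the same value; every other byte is assigned 256, 257, ...
/// in ascending byte order.
fn gpt2_char_to_byte(c: char) -> Option<u8> {
    let is_direct = |b: u8| matches!(b, 33..=126 | 161..=172 | 174..=255);
    let mut next_code_point = 256u32;
    for b in 0u8..=255 {
        let mapped = if is_direct(b) {
            b as u32
        } else {
            let code_point = next_code_point;
            next_code_point += 1;
            code_point
        };
        if mapped == c as u32 {
            return Some(b);
        }
    }
    // The character is not part of the byte-level alphabet.
    None
}
```

Calling `utf8_bytes("\u{011c}")` gives the two-byte sequence the report expects, while `gpt2_char_to_byte('\u{011c}')` gives the single byte currently stored. Which of the two the Vocabulary should hold is the question this issue raises.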

Steps/code to reproduce the bug:

//

Expected result:

TokenId(216) = vec![0xC4, 0x9C];

Error message:

Outlines/Python version information:

Context for the issue:

No response

Labels: bug
