Proposal: Native char type for fast char ops in mypyc #21418

@VaggelisD

Feature

Add mypy_extensions.char as a mypyc native type that holds a single Unicode codepoint, stored unboxed as an int32. It follows the existing i64 / i32 / i16 / u8 pattern, applied to the characters of a str.

Pitch

CPython already has codepoint-classification macros (Py_UNICODE_ISALNUM, Py_UNICODE_ISSPACE, Py_UNICODE_TOUPPER, etc.) that take an integer codepoint. The current str-method path goes through a 1-character PyUnicode object only to dispatch into those same macros; char skips the round trip. As a result, the PyObject operations, refcount pairs, etc. collapse into inlined integer ops.

Consider this function, which scans forward while the current character is part of a word (it could be part of a lexer):

def find_word_end(s: str, start: int) -> int:
    i = start
    while i < len(s):
        c = s[i]
        if not (c.isalnum() or c == '_'):
            return i
        i += 1
    return i

Today, with c: str inferred, the loop body lowers to roughly:

PyObject *t = CPyStr_GetItem(s, i);    // cache lookup, INCREF
int alnum = CPyStr_IsAlnum(t);         // method dispatch on the 1-char str
int eq = CPyStr_Equal(t, underscore);  // string compare
Py_DECREF(t);                          // refcount op

With c: char = s[i] (the proposed type), the same body lowers to:

int32_t c = CPyStr_GetCharAt(s, i);  // inlined PyUnicode_READ
int alnum = Py_UNICODE_ISALNUM(c);   // inline macro on int32
int eq = (c == '_');                 // int compare
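
For reference, here is a minimal sketch of how the source might look under the proposal. `mypy_extensions.char` is hypothetical (it is the type this issue proposes and does not exist today), so the import falls back to a `str` alias so that the snippet still runs as plain Python.

```python
# Hypothetical: `mypy_extensions.char` is the type proposed in this issue and
# does not exist today. The fallback alias keeps the sketch runnable as-is.
try:
    from mypy_extensions import char  # type: ignore[attr-defined]
except ImportError:
    char = str  # stand-in so the name exists when run as plain Python


def find_word_end(s: str, start: int) -> int:
    """Return the index of the first character at or after `start` that is
    neither alphanumeric nor an underscore, or len(s) if there is none."""
    i = start
    while i < len(s):
        # Under the proposal this would be an unboxed int32 codepoint;
        # with the fallback alias it is an ordinary 1-character str.
        c: char = s[i]
        if not (c.isalnum() or c == '_'):
            return i
        i += 1
    return i


print(find_word_end("foo_bar1 = 2", 0))  # prints 8
```

The only source-level change from the original function is the annotation on `c`; the method calls and the comparison against `'_'` are written the same way.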

Speedups from a working prototype, with both the str baseline and the char version compiled by mypyc:

| Workload | Speedup |
| --- | --- |
| c.isspace() per char | 3.27x |
| c.isdigit() per char | 1.63x |
| ASCII c.upper() | 10.92x |
| Identifier validator (tokenizer-shaped) | 1.57x |
| Token classifier (lexer inner loop) | 2.12x |
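
To make the shape of a "per char" workload concrete, here is a rough sketch of what such a loop and its timing harness could look like. This is an assumption about the workload shape, not the benchmark used for the numbers above; `count_spaces` and the sample text are made up for illustration, and run as plain Python it measures the interpreter, not mypyc-compiled code.

```python
import timeit


def count_spaces(s: str) -> int:
    """Count whitespace characters by indexing one character at a time."""
    n = 0
    for i in range(len(s)):
        c = s[i]          # the 1-character str that char would replace
        if c.isspace():
            n += 1
    return n


# Illustrative input only; any reasonably long text would do.
text = "lorem ipsum dolor sit amet " * 1000

passes = 20
elapsed = timeit.timeit(lambda: count_spaces(text), number=passes)
print(f"{elapsed / passes * 1e3:.3f} ms per pass")
```

A meaningful comparison would compile the module with mypyc and time the str and char variants on the same input.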

Note: On Latin-1 text (most source-code-like input), s[i] does not actually allocate, since CPython caches single-character strings below U+0100. The savings come from refcount traffic, function-call overhead, and the codepoint-to-PyObject-and-back round trip that s[i].method() forces. For non-Latin-1 text, s[i] has to create a new one-character string, so char would pull further ahead.
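
The caching behaviour can be observed from Python with a small identity check. This is a sketch that relies on a CPython implementation detail rather than a language guarantee, and `same_object_on_reindex` is a helper invented for this illustration.

```python
import platform


def same_object_on_reindex(s: str, i: int) -> bool:
    """Return True if indexing s[i] twice yields the very same object."""
    a = s[i]
    b = s[i]
    return a is b


print(platform.python_implementation())
# Codepoint below U+0100: CPython is expected to hand back a cached object.
print(same_object_on_reindex("hello", 1))
# Codepoint at or above U+0100: CPython is expected to build a new str each time.
print(same_object_on_reindex("γειά", 1))
```

Other Python implementations may behave differently, so the result is a way to see the cache, not something to depend on.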

Implementation would follow the existing i64 template: new rprimitive, the same type-promotion plumbing that makes i64 and int interoperate, codepoint-taking method ops for the common str methods, and one IR pass to fold s[i] -> char into a direct int32 read. There are a couple of design choices to settle along the way (promotion direction, runtime class shape), but the question I'm asking here is whether the team is interested in the direction at all.
