Feature
Add `mypy_extensions.char` as a mypyc native type for one Unicode codepoint, stored unboxed as `int32`. Models the existing `i64` / `i32` / `i16` / `u8` pattern, applied to `str`.
Pitch
CPython already has codepoint-classification macros (`Py_UNICODE_ISALNUM`, `Py_UNICODE_ISSPACE`, `Py_UNICODE_TOUPPER`, etc.) that take an integer codepoint. The current `str`-method path goes through a 1-character `PyUnicode` only to dispatch into those same macros; `char` skips the round trip. The result is that the `PyObject` operations, refcount pairs, and so on collapse to inlined integer ops.
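The "classify by integer codepoint" idea can be sketched in plain Python via `ord()` and `unicodedata` (the helper name `is_word_codepoint` and the exact category set are illustrative assumptions, not part of the proposal):

```python
import unicodedata

def is_word_codepoint(cp: int) -> bool:
    # Hypothetical helper mirroring what a macro like Py_UNICODE_ISALNUM
    # does: classify by raw integer codepoint. Letters (L*) and numbers
    # (N*) count as word characters, plus underscore (U+005F).
    return unicodedata.category(chr(cp)).startswith(("L", "N")) or cp == 0x5F
```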
Consider this function, which scans forward while the character is part of a word (it could be part of a lexer):
```python
def find_word_end(s: str, start: int) -> int:
    i = start
    while i < len(s):
        c = s[i]
        if not (c.isalnum() or c == '_'):
            return i
        i += 1
    return i
```
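As a quick behavioral check, which any `char`-typed variant would have to preserve (the definition is repeated so the snippet runs standalone):

```python
def find_word_end(s: str, start: int) -> int:
    i = start
    while i < len(s):
        c = s[i]
        if not (c.isalnum() or c == '_'):
            return i
        i += 1
    return i

assert find_word_end("foo_bar baz", 0) == 7   # stops at the space
assert find_word_end("hello world", 6) == 11  # "world" runs to end of string
assert find_word_end("  x", 0) == 0           # non-word char at start
```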
Today, with `c: str` inferred, the loop body lowers to roughly:
```c
PyObject *t = CPyStr_GetItem(s, i);    // cache lookup, INCREF
int alnum = CPyStr_IsAlnum(t);         // method dispatch on the 1-char str
int eq = CPyStr_Equal(t, underscore);  // string compare
Py_DECREF(t);                          // refcount op
```
With `c: char = s[i]` (the proposed type), the same body lowers to:
```c
int32_t c = CPyStr_GetCharAt(s, i);  // inlined PyUnicode_READ
int alnum = Py_UNICODE_ISALNUM(c);   // inline macro on int32
int eq = (c == '_');                 // int compare
```
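The integer-compare style this lowering reaches can be approximated in plain Python today for the ASCII subset, at some cost in readability (a sketch; the name `find_word_end_ascii` is made up here):

```python
def find_word_end_ascii(s: str, start: int) -> int:
    # ASCII-only approximation: classify by raw codepoint, the way the
    # lowered char version does. '_' is codepoint 0x5F.
    i = start
    n = len(s)
    while i < n:
        c = ord(s[i])  # still materializes a 1-char str before ord()
        if not (0x30 <= c <= 0x39     # '0'-'9'
                or 0x41 <= c <= 0x5A  # 'A'-'Z'
                or 0x61 <= c <= 0x7A  # 'a'-'z'
                or c == 0x5F):        # '_'
            return i
        i += 1
    return i
```

Note that interpreted Python gains little from this rewrite; the point of `char` is that the compiled form gets the integer comparisons without the `ord()` detour.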
Speedups from a working prototype (baseline and `char` versions both mypyc-compiled):
| Workload | Speedup |
| --- | --- |
| `c.isspace()` per char | 3.27x |
| `c.isdigit()` per char | 1.63x |
| ASCII `c.upper()` | 10.92x |
| Identifier validator (tokenizer-shaped) | 1.57x |
| Token classifier (lexer inner loop) | 2.12x |
Note: on Latin-1 text (most source-code-like input), `s[i]` does not actually allocate, since CPython caches single-character strings below U+0100. The savings are refcount traffic, function-call overhead, and the codepoint-to-`PyObject`-and-back round trip that `s[i].method()` forces. For non-Latin-1 text the picture changes, and `char` would pull further ahead.
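That single-character cache is easy to observe from Python (a CPython implementation detail, not a language guarantee, so this demo assumes CPython):

```python
# Latin-1 (< U+0100) single chars come from CPython's cache, so repeated
# indexing yields the same object. Strings are built at runtime to avoid
# literal interning skewing the check.
s = chr(97) + chr(98)        # "ab"
assert s[0] is s[0]          # cached: identical object each time

t = chr(0x100) + chr(0x101)  # non-Latin-1 codepoints
assert t[0] is not t[0]      # a fresh 1-char str per index: allocation
```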
Implementation would follow the existing `i64` template: a new rprimitive, the same type-promotion plumbing that makes `i64` and `int` interoperate, codepoint-taking method ops for the common `str` methods, and one IR pass to fold `s[i] -> char` into a direct `int32` read. There are a couple of design choices to settle along the way (promotion direction, runtime class shape), but the question I'm asking here is whether the team is interested in the direction at all.