Proposal: Native `char` type for fast char ops in mypyc

### **Feature**

Add `mypy_extensions.char` as a mypyc native type for one Unicode codepoint, stored unboxed as an `int32`. It follows the existing `i64` / `i32` / `i16` / `u8` pattern, applied to `str`.


### **Pitch**

CPython already has codepoint classification and conversion macros (`Py_UNICODE_ISALNUM`, `Py_UNICODE_ISSPACE`, `Py_UNICODE_TOUPPER`, etc.) that take an integer codepoint. The current str-method path goes through a 1-character `PyUnicode` object only to dispatch into those same macros; `char` skips the round trip. As a result, `PyObject` operations, refcount pairs, and similar overhead collapse to inlined integer ops.


Consider this function, which scans forward while the current character is part of a word (as a lexer might):

```python3
def find_word_end(s: str, start: int) -> int:
    i = start
    while i < len(s):
        c = s[i]
        if not (c.isalnum() or c == '_'):
            return i
        i += 1
    return i
```

<br />

**Today**, with `c: str` inferred, the loop body lowers to roughly:

```c
PyObject *t = CPyStr_GetItem(s, i);    // cache lookup, INCREF
int alnum = CPyStr_IsAlnum(t);         // method dispatch on the 1-char str
int eq = CPyStr_Equal(t, underscore);  // string compare
Py_DECREF(t);                          // refcount op
```

<br />


**With `c: char = s[i]`** (the proposed type), the same body lowers to:

```c
int32_t c = CPyStr_GetCharAt(s, i);  // inlined PyUnicode_READ
int alnum = Py_UNICODE_ISALNUM(c);   // inline macro on int32
int eq = (c == '_');                 // int compare
```
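For reference, the source-level change might look like the sketch below. `mypy_extensions.char` is the proposed type and does not exist today, so the import falls back to a plain `str` alias (an assumption made purely so the snippet runs under an unmodified interpreter). Only the annotation on `c` differs from the original function.

```python
try:
    # Proposed type; not available in any released mypy_extensions.
    from mypy_extensions import char  # type: ignore[attr-defined]
except ImportError:
    # Stand-in so the sketch runs uncompiled; an assumption, not part of the proposal.
    char = str


def find_word_end(s: str, start: int) -> int:
    i = start
    while i < len(s):
        c: char = s[i]  # under the proposal: an unboxed int32 codepoint
        if not (c.isalnum() or c == '_'):
            return i
        i += 1
    return i
```

With the fallback alias the function behaves exactly like the `str` version, so existing tests would keep passing whether or not the module is compiled.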

<br />

Speedups from a [working prototype](https://github.com/VaggelisD/sqlglot-mypy/commit/b3fbc77600e53c51dfcd3a9cb5addacd835fc72b); both the baseline and the `char` variant are mypyc-compiled:

| Workload | Speedup |
|---|---:|
| `c.isspace()` per char | 3.27x |
| `c.isdigit()` per char | 1.63x |
| ASCII `c.upper()` | 10.92x |
| Identifier validator (tokenizer-shaped) | 1.57x |
| Token classifier (lexer inner loop) | 2.12x |

<br />

Note: On Latin-1 text (most source-code-like input), `s[i]` does not actually allocate, since CPython caches single-character strings below `U+0100`. The savings come from refcount traffic, function-call overhead, and the codepoint-to-`PyObject`-and-back round trip that `s[i].method()` forces. For non-Latin-1 text the picture changes and `char` would pull further ahead.
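The caching behaviour can be observed from Python with an identity check. This is a small sketch that relies on a CPython implementation detail rather than a language guarantee; `same_object_twice` is a hypothetical helper name used only for illustration.

```python
def same_object_twice(s: str, i: int) -> bool:
    """Index the same position twice and report whether both results
    are the very same object (no new allocation between them)."""
    a = s[i]
    b = s[i]
    return a is b


# Codepoints below U+0100 come from CPython's single-character cache.
print(same_object_twice("hello", 1))
print(same_object_twice("café", 3))   # U+00E9 is still below U+0100

# Above U+0100 there is no such cache, so each index builds a new object.
print(same_object_twice("γειά", 1))
```

On CPython the first two calls report a shared object, which is why the Latin-1 numbers above measure dispatch and refcount overhead rather than allocation.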

Implementation would follow the existing `i64` template: new rprimitive, the same type-promotion plumbing that makes `i64` and `int` interoperate, codepoint-taking method ops for the common `str` methods, and one IR pass to fold `s[i] -> char` into a direct int32 read. There are a couple of design choices to settle along the way (promotion direction, runtime class shape), but the question I'm asking here is whether the team is interested in the direction at all.
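One way to pin down what the fold has to preserve is a pure-Python reference model in which a character is just its integer codepoint. The names `char_at` and `char_isalnum` below are hypothetical and are not taken from the prototype; the sketch only models the semantics, not the generated C.

```python
def char_at(s: str, i: int) -> int:
    """Model of the direct read: the codepoint at index i, as an int."""
    return ord(s[i])


def char_isalnum(cp: int) -> bool:
    """Model of a codepoint-taking isalnum op."""
    return chr(cp).isalnum()


def find_word_end_str(s: str, start: int) -> int:
    # Today's path: every character is a 1-character str.
    i = start
    while i < len(s):
        c = s[i]
        if not (c.isalnum() or c == '_'):
            return i
        i += 1
    return i


def find_word_end_cp(s: str, start: int) -> int:
    # Modelled char path: every character is an integer codepoint.
    underscore = ord('_')
    i = start
    while i < len(s):
        cp = char_at(s, i)
        if not (char_isalnum(cp) or cp == underscore):
            return i
        i += 1
    return i


samples = ["foo_bar baz", "", "αβγ_1 δ", "___"]
for text in samples:
    assert find_word_end_str(text, 0) == find_word_end_cp(text, 0)
```

A check of this shape could serve as a differential test: any input where the two functions disagree would indicate that a codepoint-taking op diverges from its `str` method.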

