Of course we can't achieve that, but we can try to minimize the intrusion we have to introduce into algorithm code. This reminds me of the monkey-patching in the `gevent` library: it patches (primarily) the `socket` module, replacing it with `gevent.socket`, which can switch to other greenlets when IO would block, much like a goroutine (actually, `gevent` is older than Golang!).
Since we were only using HuggingFace libs (`transformers`, `diffusers`) to load models at the time, the target became clear: we only introduce a monkey-patch call, the rest of the code remains unchanged, and `XXXPipeline.from_pretrained(...)` should be much faster.
## Some Facts, Obvious Decisions and Assumptions

**Overmind is a caching library: it caches model loading call results into system memory and later reconstructs them fast.**

We skip discussing how monkey-patching is implemented; that's a not-so-interesting detail. All we need to know is that it redirects every `XXXPipeline.from_pretrained(...)` call to `overmind.api.load(XXXPipeline.from_pretrained, ...)`.
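In spirit, the patch is just a thin wrapper around the original classmethod (a minimal sketch, not the real implementation; `patch_pipeline` is an illustrative name):

```python
import functools

import overmind.api

def patch_pipeline(cls):
    original = cls.from_pretrained  # bound classmethod of e.g. a diffusers pipeline class
    @functools.wraps(original)
    def patched(*args, **kwargs):
        # Route the call through overmind's caching entry point.
        return overmind.api.load(original, *args, **kwargs)
    cls.from_pretrained = patched
```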
We use `pickle` to serialize our cache results since... we don't really have a choice, and `torch.save` itself uses `pickle`; it would be weird not to use it.

We use a client/server architecture since we don't want to invalidate our cache when a process terminates. There are many subprocess calls that could benefit from it.

We assume the `XXXPipeline.from_pretrained` parameters are simple hashable things (`str` and the like) or other models loaded by overmind (explained later).

The name `overmind` is borrowed from StarCraft, as you may have guessed.
## Reconstruct it fast!

We can't naively keep the pickled bytes in memory, `pickle.loads` them back, and call it a day. After all, in a warmed-up scenario the Linux page cache has already done its job caching the on-disk models, and we still see loading times measured in tens of seconds.

The inefficiency comes from memory copying. In Python, even creating millions of objects costs no more than a few hundred milliseconds; a single 10GiB memory copy, however, costs about half a second. We must avoid memory copies as much as possible.
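You can see the copy cost for yourself (a standalone snippet; at this size it needs roughly 20GiB of free RAM, so scale it down if necessary):

```python
import time

size = 10 * 1024**3                # 10 GiB
blob = bytearray(size)             # stands in for the model weights

t0 = time.perf_counter()
copy = bytes(blob)                 # one full memory copy
elapsed = time.perf_counter() - t0
print(f"copied {size / 2**30:.0f} GiB in {elapsed:.2f}s")
```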
Fortunately, most of the big memory chunks are Torch tensors, so we can safely address only them and ignore the rest.

Actually, I learned the internal structure of a Torch tensor from the reduction code while researching the tensor sharing mechanism:

```python
# Copied from torch.multiprocessing.reductions, most of the code is removed
def reduce_tensor(tensor):
    storage = tensor._typed_storage()
    metadata = (
        tensor.storage_offset(),
        tensor.size(),
        tensor.stride(),
        tensor.requires_grad,
    )
    return (rebuild_tensor, (type(tensor), storage, metadata))


def rebuild_tensor(cls, storage, metadata):
    storage_offset, size, stride, requires_grad = metadata
    t = torch._utils._rebuild_tensor(storage, storage_offset, size, stride)
    t.requires_grad = requires_grad
    return t
```

Quite simple: a tensor is its type, its metadata and its underlying storage. Here `storage` is of type `TypedStorage`, but `TypedStorage` is just a thin wrapper around `UntypedStorage`; `UntypedStorage` is the class that actually holds all the tensor data.

Our task becomes more specific now: how do we avoid copying `UntypedStorage`? Can we manage this tensor memory ourselves and construct `UntypedStorage`s that point to the memory we manage?

The answer is yes!

Skimming through the C++ code where `UntypedStorage` is constructed, we can easily find the constructor we need.
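Simplified from `c10/core/StorageImpl.h` and `c10/core/Allocator.h` (a sketch, not verbatim; exact signatures vary across PyTorch versions), the relevant pieces look like this:

```cpp
// A StorageImpl is built from an at::DataPtr -- any pointer we hand it.
StorageImpl(
    use_byte_size_t /*use_byte_size*/,
    size_t size_bytes,
    at::DataPtr data_ptr,      // raw pointer + context + deleter + device
    at::Allocator* allocator,  // may be null when the memory is externally owned
    bool resizable);

// DataPtr bundles the pointer with a deleter that runs when the storage dies.
DataPtr(void* data, void* ctx, DeleterFnPtr ctx_deleter, Device device);
```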
Not only can we use a pointer, but the `at::DataPtr` class can also handle destruction, making lifetime management much simpler.

On the Python side, a pointer to a memory region is represented by a `memoryview` object; these objects support the buffer protocol. We can get a `memoryview` from many other things; `bytes` and `mmap` are the two major ones supporting it, and also the ones we care about.

Finally, we know what we should do: create a function that accepts a `memoryview` object and turns it into an `UntypedStorage` without copying. With the ability to reconstruct an `UntypedStorage` from a `memoryview`, the actual tensor data doesn't have to be in the pickle stream, which greatly reduces the amount of data we have to copy around.
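As a Python-level illustration of the idea (not the helper overmind actually uses, which is described later; here `torch.frombuffer` plays the zero-copy role, and `untyped_storage()` needs torch >= 2.0):

```python
import torch

buf = bytearray(1024 * 1024)        # pretend this came from shared memory
view = memoryview(buf)

# torch.frombuffer shares memory with the buffer -- no copy is made.
t = torch.frombuffer(view, dtype=torch.uint8)
storage = t.untyped_storage()       # an UntypedStorage backed by `buf`

t[0] = 42
assert buf[0] == 42                 # same underlying memory
```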
There's already a tensor sharing mechanism in PyTorch, but it doesn't fit our needs. More on this later.

When we see 'share' and 'memory' come together, we all have an urge to reach for `shmget` and its friends. It is "designed" to be a memory sharing mechanism, right, so why not? But it has 2 major flaws:

- POSIX shm is a scarce resource; what you can use is determined by how the sysadmin configures the system. A most extreme but ubiquitous
3. **Shared Memory (`shmem.py`)**: Manages memory arenas using `memfd_create` (Linux) or named shared memory (Windows). Fragments are content-addressed by hash, enabling deduplication.
## The Engineering Tricks

### Trick #1: Shared Memory Without POSIX shm Limits

PyTorch's built-in tensor sharing uses POSIX shared memory, but this hits practical limits quickly:

- Docker defaults to 64MB of `/dev/shm`
- Each `UntypedStorage` gets its own shm segment, even for tiny buffers
- Each segment consumes a file descriptor
- Reference counting prevents pickle reuse

Our solution: use `memfd_create` to create anonymous memory-backed file descriptors, then pass them across processes via `/proc/{pid}/fd/{fd}`.
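A minimal sketch of the mechanism (names are illustrative; in the real system the pid and fd numbers travel over the daemon's IPC channel rather than living in one script):

```python
import mmap
import os

ARENA_SIZE = 1 << 30                      # 1 GiB, for illustration

# Daemon side: an anonymous, memory-backed fd -- no /dev/shm entry involved.
fd = os.memfd_create("overmind-arena")
os.ftruncate(fd, ARENA_SIZE)
daemon_view = mmap.mmap(fd, ARENA_SIZE)

# Client side: open the daemon's fd through procfs and map the same pages.
path = f"/proc/{os.getpid()}/fd/{fd}"     # normally another process's pid
client_fd = os.open(path, os.O_RDWR)
client_view = mmap.mmap(client_fd, ARENA_SIZE)

daemon_view[:1] = b"\x42"
assert client_view[:1] == b"\x42"         # both views see the same memory
```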
This bypasses the `SCM_RIGHTS` API entirely—no wrestling with ancillary messages on Unix sockets. The fd remains valid as long as the daemon is alive.

We allocate exponentially-growing arenas (starting at 8GB, doubling each time) and pack tensor data into them with content-based deduplication. A 64-bit MetroHash identifies fragments:
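conceptually, it behaves like this (a toy sketch; a cryptographic hash stands in for MetroHash and a `bytearray` stands in for the memfd-backed arena):

```python
import hashlib

class Arena:
    def __init__(self, size: int):
        self.buf = bytearray(size)   # stand-in for the mmap'd arena
        self.top = 0
        self.fragments = {}          # 64-bit digest -> (offset, length)

    def put(self, data: bytes):
        digest = hashlib.blake2b(data, digest_size=8).digest()  # 64-bit id
        if digest in self.fragments:          # same content -> same fragment
            return self.fragments[digest]
        start = self.top
        self.buf[start:start + len(data)] = data
        self.top += len(data)
        self.fragments[digest] = (start, len(data))
        return self.fragments[digest]
```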
Same tensor content? Same fragment. No redundant storage.

### Trick #2: Custom Pickle Reducers for Tensors

Standard pickle serializes tensors as bytes: 12GB for an SDXL pipeline, taking 14s to dump and 6s to load. We register custom reducers that store tensor data in shared memory and only pickle a small `Fragment` reference:
```python
def _reduce_storage(storage):
    if storage.size() == 0:
        return (rebuild_storage_empty, (type(storage),))
    else:
        device = storage.device
        storage = storage.cpu()
        frag = hoarder.put(storage)  # Store in shared memory
        # ...and return a rebuild function that carries only the small
        # Fragment reference (plus the original device), not the raw bytes.
```
### Trick #3: Zero-Copy Construction from C++

A small C++ extension builds Torch objects directly from Python buffers. The binding that loads a TorchScript module, for example, reads it from an in-memory stream instead of a file:

```cpp
    imemstream in((char*)info.ptr, info.size); // No copy!
    return import_ir_module(std::move(cu), in, ...);
});
```
Similarly, `_make_untyped_storage` creates a `torch.UntypedStorage` that directly wraps a `memoryview`, with the buffer's lifetime tied to the Python object via a custom destructor.
### Trick #4: Surviving the HuggingFace Ecosystem

Real-world usage exposed edge cases in diffusers, transformers, accelerate, and bitsandbytes:

**diffusers dynamic modules**: Model repos can include Python files that get imported at runtime into a `diffusers_modules` namespace. The client doesn't have these in `sys.path`, breaking unpickling. Fix: pre-import the module on the client:
```python
def diffusers_dyn_module_workaround():
    from diffusers.utils.constants import HF_MODULES_CACHE
    import sys

    # Make the runtime-generated `diffusers_modules` package importable on
    # the client too, so unpickling can resolve classes defined inside it.
    if HF_MODULES_CACHE not in sys.path:
        sys.path.append(HF_MODULES_CACHE)
```
**The `vae=vae` pattern**: Users often load a VAE separately and pass it to a pipeline. If we naively pickle this, we lose the caching benefit. Solution: attach an `_overmind_ref` marker to loaded models and resolve it server-side:
```python
def replace_ref(obj):
    if (ref := getattr(obj, '_overmind_ref', None)):
        return False, ref
    return True, obj
```
**accelerate hooks**: Quantized models via bitsandbytes come with accelerate's `AlignDevicesHook` hooks attached, and those don't pickle. We strip them:
```python
from accelerate.hooks import remove_hook_from_module

# Strip the device-alignment hooks from the whole module tree before pickling.
remove_hook_from_module(model, recurse=True)
```
**CUDA tensors**: Quantization happens on GPU, but we can't keep CUDA tensors in the daemon (it would tie up the GPU). We move them to CPU, pickle, then restore them to the original device on the client.
### The Trade-offs

- **Single GPU only**: We normalize all `device_map` configurations to `cuda:0`. Multi-GPU would require tracking device placement.
- **Cold load overhead**: Using `dill` for closures adds ~14s to cold loads (pure Python serialization). This is a one-time cost.
- **No training**: We force `requires_grad=False`. Overmind is for inference.
## Performance & Results

| Scenario | Time |
|----------|------|
| Vanilla `from_pretrained` | ~15s |
| Overmind cold load | ~14s (dill overhead) |
| Overmind warm load | **0.15s** |

Real workload (Image3D pipeline):

| Configuration | Total Runtime |
|---------------|---------------|
| Without Overmind | 123.5s |
| Overmind cold | 137.8s |
| Overmind warm | **109.0s** |

The warm case saves 14.5s per run—multiply that by hundreds of daily iterations during research, and you get hours back.
### Bonus: CPU Memory Savings

Since all processes share the same tensor backing store, you're not duplicating multi-GB models across workers. A single SDXL pipeline in memory serves all clients.

## Summary & Getting Started

Overmind makes model loading boring—in the best way. One import, one function call, and your 15-second loads become 0.2-second cache hits.
```python
import overmind.api

overmind.api.monkey_patch_all()

# That's it. Your from_pretrained calls are now cached.
```