# Overmind

**Cut PyTorch model loading time from 15s to 0.2s with zero-copy shared memory caching.**

[![PyPI version](https://badge.fury.io/py/overmind-cache.svg)](https://badge.fury.io/py/overmind-cache)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)

Overmind is a non-intrusive caching library that dramatically speeds up PyTorch model loading by storing serialized models in shared memory. Once a model is loaded, subsequent loads from any process take milliseconds instead of seconds.

Named after the [Overmind from StarCraft](https://starcraft.fandom.com/wiki/Overmind), it coordinates model caching across processes like the Overmind coordinates the Zerg Swarm.

Note that the package name on PyPI is `overmind-cache`, since `overmind` is taken.

## Features

- **Fast model loading** - First load caches to shared memory; subsequent loads are ~5x faster
- **Process-agnostic** - Cache persists across process restarts via a background server
- **Non-intrusive** - Just add one line of code; no changes to model loading logic
- **Memory efficient** - Multiple processes share the same cached tensors in memory
- **Broad compatibility** - Works with `diffusers`, `transformers`, `bitsandbytes` quantization, and vanilla `torch.load`

## Installation

```bash
pip install overmind-cache
```

Or install from source:

```bash
git clone https://github.com/taichi-dev/overmind.git
cd overmind
pip install -e .
```

## Quick Start

### Option 1: Monkey Patching (Recommended)

Add a single call at the top of your script to automatically accelerate all supported model loading:

```python
import torch

import overmind.api
overmind.api.monkey_patch_all()

# Your existing code works unchanged!
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
)
pipeline.to('cuda')
# First run: ~24s
# Subsequent runs: ~1s (mostly spent in .to('cuda'))
```

### Option 2: Explicit API

If you prefer not to monkey-patch, pass the loading function and its arguments to `load` directly:

```python
import torch

from overmind.api import load
from diffusers import DiffusionPipeline

pipeline = load(
    DiffusionPipeline.from_pretrained,
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16
)
```
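
`load` takes the loading callable followed by whatever arguments you would normally pass to it, so the same pattern works for any supported loader. A minimal sketch using plain `torch.load` (the checkpoint path is a hypothetical placeholder):

```python
import torch
from overmind.api import load

# Arguments after the callable are forwarded to torch.load unchanged;
# the deserialized object is cached in shared memory for later loads.
state_dict = load(torch.load, "checkpoint.pt", map_location="cpu")
```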

## Supported Libraries

Overmind automatically patches these loading functions:

| Library          | Functions                                                                                                                                   |
|------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| **Diffusers**    | `DiffusionPipeline.from_pretrained`, `ModelMixin.from_pretrained`, `SchedulerMixin.from_pretrained`, `FromSingleFileMixin.from_single_file` |
| **Transformers** | `PreTrainedModel.from_pretrained`, `PreTrainedTokenizerBase.from_pretrained`, `AutoProcessor.from_pretrained`, `pipeline`                   |
| **PyTorch**      | `torch.load`, `torch.jit.load`                                                                                                              |
| **Safetensors**  | `safetensors.torch.load_file`                                                                                                               |
| **TorchVision**  | `vgg16`, `vgg19`                                                                                                                            |
| **OpenCLIP**     | `create_model_and_transforms`                                                                                                               |

### Custom Patch Points

Create an `overmind.cfg` file in your package root to add custom patch points, one `module.path::attribute.path` entry per line:

```
# overmind.cfg
mylib.models::MyModel.from_pretrained
mylib.utils::load_checkpoint
```
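
Once `monkey_patch_all()` runs, the configured functions are cached just like the built-in patch points. A minimal sketch, assuming the hypothetical `mylib.utils.load_checkpoint` from the config above and that `overmind.cfg` is picked up automatically:

```python
import overmind.api
overmind.api.monkey_patch_all()  # applies built-in and configured patch points

from mylib.utils import load_checkpoint  # hypothetical loader from overmind.cfg

# The first call populates the shared-memory cache; repeat calls from this
# or any other process return the cached result almost instantly.
model = load_checkpoint("weights/model.ckpt")
```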

## CLI Commands

```bash
# Start the server manually (usually auto-started)
overmind-server

# Start as daemon
overmind-server --daemon

# List cached models
overmind-list

# Shut down the server (clears the cache)
overmind-shutdown
```

## Environment Variables

| Variable                  | Description                                                          |
|---------------------------|-----------------------------------------------------------------------|
| `OVERMIND_DISABLE`        | Set to any value to disable Overmind and fall back to a local cache  |
| `OVERMIND_NO_LOCAL_CACHE` | Set to any value to disable the local fallback cache as well         |
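
For example, to run a script with caching fully disabled (`my_script.py` is a placeholder):

```bash
# Disable both the shared-memory cache and the local fallback cache
OVERMIND_DISABLE=1 OVERMIND_NO_LOCAL_CACHE=1 python my_script.py
```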

## Benchmarks

Loading a Stable Diffusion ControlNet pipeline with VAE, on Linux + Intel i9-11900K + RTX 4090, using [demo-vae.py](demo-vae.py):

| Run               | `vae` | `depth` | `edge` | `pipeline` | `to('cuda')` | Total  |
|-------------------|-------|---------|--------|------------|--------------|--------|
| w/o Overmind      | 1.18s | 0.98s   | 1.41s  | 1.65s      | 0.91s        | 6.16s  |
| w/ Overmind (1st) | 5.44s | 5.17s   | 5.41s  | 7.29s      | 0.86s        | 24.20s |
| w/ Overmind (2nd) | 0.00s | 0.01s   | 0.01s  | 0.20s      | 0.87s        | 1.12s  |

The first load with Overmind is slower due to pickling overhead. Subsequent loads are **5-6x faster** than without Overmind, with the only remaining cost being the `to('cuda')` transfer.

## License

Apache 2.0

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Acknowledgments

Developed by [Taichi Graphics](https://github.com/taichi-dev) for production AI inference workloads.