[Invitation for a discussion] Much improved CPU memory management #11748
Conversation
Looks like exactly what I've wanted for a long time, but unfortunately I can't get it to work. Running a basic SDXL workflow, I'm seeing a message indicating some kind of read failure coming from clip_text_transformers_convert. I guess that might be a fastsafetensors problem.
I figured it out. All the pop implementations add the key to self._deleted before calling self.get_tensor(key), so they always throw a KeyError. After fixing that I can at least run SDXL and bf16 Chroma without errors. It still won't work with quantized models, though, which is unfortunate. The disk loader does seem to keep RAM usage lower than stock Comfy, but it doesn't seem to be completely problem-free. At least for workflows where everything does fit into RAM, it seems to slow things down; it's quite noticeable in workflows that run the text encoder many times (i.e. when doing prompt scheduling).
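The fix described above amounts to swapping the order of the two operations. A minimal sketch, assuming the PR wraps the safetensors reader in something like the class below; the names _deleted and get_tensor come from the comment, everything else is guessed:

```python
# Illustrative sketch only, not the PR's actual lazy state-dict wrapper.

class LazyStateDict:
    def __init__(self, reader):
        self._reader = reader      # e.g. an open (fast)safetensors file
        self._deleted = set()      # keys that have already been popped

    def get_tensor(self, key):
        if key in self._deleted:
            raise KeyError(key)    # popped keys must look absent
        return self._reader.get_tensor(key)

    # Buggy order: marking the key deleted first makes get_tensor raise.
    # def pop(self, key):
    #     self._deleted.add(key)
    #     return self.get_tensor(key)   # always KeyError

    def pop(self, key):
        tensor = self.get_tensor(key)  # fetch while the key is still "live"
        self._deleted.add(key)         # only then mark it as removed
        return tensor
```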
One of the simpler fixes would be an accurate VAE decode/encode memory calculation: currently, before the VAE decode occurs, ComfyUI just removes the whole model from VRAM even though the decode fits within about 4 GB of VRAM when using tiled decoding.
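As a rough sketch of what that simpler fix could look like (a decode budget rather than a full unload), assuming the ~4 GB tiled-decode ceiling mentioned above; `evict_lru_weights` and `vae.decode_tiled` are placeholders, not actual ComfyUI APIs:

```python
import torch

def decode_with_budget(vae, latents, evict_lru_weights, device=None):
    # The comment above claims tiled decoding peaks at roughly 4 GB,
    # so use that as the budget instead of evicting the whole model.
    needed = 4 * 1024**3
    free, _total = torch.cuda.mem_get_info(device)
    if free < needed:
        evict_lru_weights(needed - free)   # free only the shortfall, not everything
    return vae.decode_tiled(latents)       # assumed tiled-decode entry point
```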
The goal here is to be able to run generations while filling RAM and VRAM to the brim, not just to fix the VAE case.
Yeah, still trying to figure it out. I tried running FP8 Flux and fixed weight loading in a few places, but there's something wrong with the dtypes, and it's proving quite difficult to debug without knowing the backend.
Hi there,
I've been painfully trying to run LTX-2 in Colab on an L4. It could barely fit, only to produce a 20s 720p video in 18 minutes. A horrible result.
The dumbest thing about it was that the model and the video would fit just fine in its 24 + 53 GB of memory, but because Comfy cannot partially unload models from RAM, it would spend 10 of those 18 minutes unloading and reloading the text encoder and UNet with --cache-none or the pressure cache.
This doesn't make any sense whatsoever, especially given that with 53 GB of RAM you're only a couple of GB short. Unloading 27 GB of weights to save 2 GB is insane.
So I went on to research what it would take to implement proper RAM memory management and came up with this. Not much.
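To make the idea concrete, here is a hedged sketch of partial RAM eviction, not the code in this PR; `source_map`, `evicted`, and the meta-tensor trick are all assumptions about how one could do it:

```python
import torch

def partially_unload(model, bytes_to_free, source_map, evicted):
    """Drop roughly `bytes_to_free` of CPU weight storage from `model`.

    source_map: parameter name -> (safetensors path, tensor key), built at
    load time, so an evicted weight can be re-read lazily before the next use.
    evicted: dict this function fills in for that reload step.
    Both structures are illustrative, not this PR's actual design.
    """
    freed = 0
    for name, param in model.named_parameters():
        if freed >= bytes_to_free:
            break
        if param.device.type != "cpu":
            continue                                   # only CPU RAM is targeted here
        freed += param.numel() * param.element_size()
        evicted[name] = source_map[name]               # remember where to reload from
        param.data = torch.empty_like(param.data, device="meta")  # keep shape/dtype, drop storage
    return freed
```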
What it does:
Benchmarks:
I know nothing about your architecture, but this exercise tells me that proper RAM memory management is entirely possible. And you probably won't even have to rely on fastsafetensors, since regular safetensors are also designed to allow partial weight loading from disk.
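For what it's worth, stock safetensors already exposes lazy, per-tensor reads through safe_open; a minimal example (the file path is a placeholder):

```python
from safetensors import safe_open

# Opening the file parses only the header; each tensor is read from disk
# on request instead of loading the whole checkpoint into RAM.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    names = list(f.keys())          # metadata only, no tensor data loaded yet
    w = f.get_tensor(names[0])      # reads just this one tensor from the file
```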
Y'all are very welcome to clone it and try it yourself.