
Conversation

@ifilipis commented Jan 9, 2026

Hi there,

I've been painfully trying to run LTX-2 in Colab on an L4. It could barely fit, only to produce a 20 s 720p video in 18 minutes. A horrible result.

The dumbest thing about it was that the model and video would fit just fine in the instance's 24+53 GB of memory, but because Comfy cannot partially unload models from RAM, it would spend 10 of those 18 minutes unloading and reloading the text encoder and UNet with --cache-none or the pressure cache.

This doesn't make any sense whatsoever, especially given that with 53 GB of RAM you're only missing a couple of GB. Unloading 27 GB of weights to save 2 is insane.

So I went on to research what it would take to implement proper RAM memory management and came up with this. Not much, as it turns out.

What it does:

  • New pipeline for loading weights from disk to GPU. Presently, you have to fully materialize weights in RAM before uploading them to GPU.
    • I have two pipelines: GDS (which you all previously complained about two months ago), and disk → RAM → GPU, which loads weights to the GPU in chunks and avoids storing the full state dict in RAM (allegedly similar to what --gpu-only does); see the sketch after this list. It should also be able to balance the memory load between disk, RAM and GPU simultaneously.
    • In the T4 Colab scenario (i.e. 12 GB RAM, 15 GB VRAM), it enabled me to load a BF16 UNet that quite literally won't fit in RAM. The LTX test is pending.
  • New disk memory tier, which integrates with native memory management and allows partial offload from RAM. That is, when you need to unload something from VRAM to RAM, but there's still not enough space, it will partially offload weights to disk.
    • It is also able to retrieve them MUCH faster than reloading from zero.
    • And obviously, no writing to disk - just reading. That's the best part of it.
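
To make the second pipeline concrete, here is a minimal sketch of the disk → RAM → GPU idea using only stock safetensors and PyTorch. It is not the PR's code: the "chunk" here is a single tensor, whereas the actual loader presumably batches reads and balances the load across tiers.

```python
# Minimal sketch (not the PR's code) of the disk -> RAM -> GPU idea:
# open the checkpoint lazily and move one tensor at a time to VRAM,
# so the full state dict is never materialized in system RAM.
import torch
from safetensors import safe_open

def stream_checkpoint_to_gpu(path: str, device: str = "cuda") -> dict[str, torch.Tensor]:
    state_dict: dict[str, torch.Tensor] = {}
    with safe_open(path, framework="pt", device="cpu") as f:
        for key in f.keys():
            # get_tensor() reads only this tensor's bytes from the file;
            # the temporary CPU copy is dropped once it lands on the GPU.
            state_dict[key] = f.get_tensor(key).to(device, non_blocking=True)
    return state_dict

# Hypothetical usage:
# model.load_state_dict(stream_checkpoint_to_gpu("unet.safetensors"), assign=True)
```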

Benchmarks:

I know nothing about your architecture, but this exercise tells me that proper RAM memory management is entirely possible. And you probably won't even have to rely on fastsafetensors, since regular safetensors are also designed to allow partial weight loading from disk.
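
To illustrate what partial loading from plain safetensors could look like: the sketch below is hypothetical (the class and method names are made up for illustration), but it shows why a disk tier never has to write anything, since the checkpoint file itself is the backing store.

```python
# Hypothetical sketch of a read-only disk tier. Evicting a tensor from RAM
# just drops the in-memory copy, because the bytes already live in the
# .safetensors file; retrieval re-reads only that one key (no writes).
import torch
from safetensors import safe_open

class ReadOnlyDiskTier:
    def __init__(self, checkpoint_path: str):
        self.path = checkpoint_path
        self.resident: dict[str, torch.Tensor] = {}  # tensors currently held in RAM

    def evict(self, key: str) -> None:
        # "Offload to disk" is just forgetting the RAM copy.
        self.resident.pop(key, None)

    def fetch(self, key: str) -> torch.Tensor:
        if key not in self.resident:
            with safe_open(self.path, framework="pt", device="cpu") as f:
                self.resident[key] = f.get_tensor(key)  # reads only this tensor
        return self.resident[key]
```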

Y'all are very welcome to clone it and try it yourself.

@socket-security

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff  | Package               | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|-------|-----------------------|-----------------------|---------------|---------|-------------|---------|
| Added | fastsafetensors@0.2.0 | 100                   | 100           | 100     | 100         | 100     |

View full report

@asagi4 (Contributor) commented Jan 9, 2026

Looks like exactly what I've wanted for a long time, but unfortunately I can't get it to work.

Running a basic SDXL workflow, I'm seeing a message indicating some kind of read failure:

nogds_file_reader._thread failed: pread(fd=23, buffer=0x17364000, offset=7102070784, count=3145728, l=135168), c=133446

and after that the workflow throws `KeyError: 'clip_g.positional_embedding'` from clip_text_transformers_convert.

Looks like it possibly failed to load the checkpoint and just didn't stop properly, so the state dict only got half populated?

I guess that might be a fastsafetensors problem.

@asagi4 (Contributor) commented Jan 9, 2026

I figured it out. All the pop implementations add the key to `self._deleted` before calling `self.get_tensor(key)`, so they always throw a `KeyError`.
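
Schematically (a reconstruction for illustration, not the actual diff; the surrounding class is a stand-in), the ordering issue and its fix look like this:

```python
# Schematic reconstruction of the ordering bug described above; the class
# is a stand-in, not code copied from the PR.
class LazyStateDict:
    def __init__(self, tensors):
        self._tensors = tensors        # key -> tensor, backed by the file reader
        self._deleted = set()

    def get_tensor(self, key):
        if key in self._deleted:       # deleted keys are treated as missing
            raise KeyError(key)
        return self._tensors[key]

    def pop_buggy(self, key):
        self._deleted.add(key)         # key is marked deleted first...
        return self.get_tensor(key)    # ...so this lookup always raises KeyError

    def pop_fixed(self, key):
        tensor = self.get_tensor(key)  # read the tensor while the key is still live
        self._deleted.add(key)         # only then hide it from later lookups
        return tensor
```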

After fixing that I can at least run SDXL and bf16 Chroma without errors. Looks like it still won't work with quantized models though, which is unfortunate.

The disk loader does seem to keep RAM usage lower than normal Comfy, but it doesn't seem to be completely problem-free. At least for workflows where everything fits into RAM, it appears to slow things down; it's quite noticeable in workflows that run the TE many times (i.e. when doing prompt scheduling).

@MeiYi-dev

One of the simpler fixes would be an accurate VAE decode/encode memory calculation: currently, before the VAE decode happens, ComfyUI just removes the whole model from VRAM, even though the decode stays within 4 GB of VRAM at most when tiled decoding is used.
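
For context, tiled decoding runs the VAE decoder over the latent in spatial tiles, so peak VRAM is bounded by the tile size rather than the output resolution. A rough, generic sketch of the idea follows; it is not ComfyUI's implementation (which additionally overlaps and blends tiles to hide seams), and `vae.decode` is an assumed interface.

```python
# Rough, generic sketch of tiled VAE decoding; peak VRAM scales with the
# tile size instead of the full latent resolution. `vae.decode` is an
# assumed interface, and seam blending is omitted for brevity.
import torch

@torch.no_grad()
def decode_tiled(vae, latent: torch.Tensor, tile: int = 64, scale: int = 8) -> torch.Tensor:
    # latent: (B, C, H, W) in latent space; the decoder upsamples by `scale`.
    b, _, h, w = latent.shape
    out = torch.empty((b, 3, h * scale, w * scale))
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            piece = latent[:, :, y:y + tile, x:x + tile]
            decoded = vae.decode(piece)  # only this tile's activations sit in VRAM
            out[:, :, y * scale:(y + piece.shape[2]) * scale,
                      x * scale:(x + piece.shape[3]) * scale] = decoded.float().cpu()
    return out
```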

@ifilipis (Author) commented Jan 10, 2026

> One of the simpler fixes would be an accurate VAE decode/encode memory calculation: currently, before the VAE decode happens, ComfyUI just removes the whole model from VRAM, even though the decode stays within 4 GB of VRAM at most when tiled decoding is used.

The goal here is to be able to run generations while filling RAM and VRAM to the brim, not just to handle the VAE.

> Looks like it still won't work with quantized models though, which is unfortunate.

Yeah, still trying to figure that out. I tried running FP8 Flux and fixed weight loading in a few places, but there's something wrong with the dtypes, and it's proving quite difficult to debug without knowing the backend.
