
Conversation

@ifilipis commented Jan 9, 2026

Hi there,

I've been painfully trying to run LTX-2 in Colab on an L4. It could barely fit, only to produce a 20 s 720p video in 18 minutes. A horrible result.

The dumbest thing about it was that the model and video would fit just fine in the instance's 24+53 GB of memory, but because Comfy cannot partially unload models from RAM, it would spend 10 of those 18 minutes unloading and reloading the text encoder and UNet with --cache-none or the pressure cache.

This doesn't make any sense whatsoever, especially given that with 53 GB of RAM you're only missing a couple of GB. Unloading 27 GB of weights to save 2 is insane.

So I went on to research what it would take to implement proper RAM memory management and came up with this. Not much, as it turns out.

What it does:

  • New pipeline for loading weights from disk to GPU. Presently, you have to fully materialize weights in RAM before uploading them to GPU.
    • I have two pipelines: GDS (which you all previously complained about two months ago), and disk → RAM → GPU, which loads weights to the GPU in chunks and avoids storing the full state dict in RAM (allegedly similar to what --gpu-only does); see the sketch after this list. It should also be able to balance the memory load between disk, RAM and GPU simultaneously.
    • In the T4 Colab scenario (i.e. 12 GB RAM, 15 GB VRAM), it enabled me to load a BF16 UNet that quite literally won't fit in RAM. The LTX test is pending.
  • New disk memory tier, which integrates with native memory management and allows partial offload from RAM. That is, when you need to unload something from VRAM to RAM, but there's still not enough space, it will partially offload weights to disk.
    • It is also able to retrieve them MUCH faster than reloading from zero.
    • And obviously, no writing to disk - just reading. That's the best part of it.
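
To make the second pipeline concrete, here is a minimal sketch of the disk → RAM → GPU idea using only stock safetensors and PyTorch. It is not the PR's code: the "chunk" here is a single tensor, whereas the actual loader presumably batches reads and balances the load across tiers.

```python
# Minimal sketch (not the PR's code) of the disk -> RAM -> GPU idea:
# open the checkpoint lazily and move one tensor at a time to VRAM,
# so the full state dict is never materialized in system RAM.
import torch
from safetensors import safe_open

def stream_checkpoint_to_gpu(path: str, device: str = "cuda") -> dict[str, torch.Tensor]:
    state_dict: dict[str, torch.Tensor] = {}
    with safe_open(path, framework="pt", device="cpu") as f:
        for key in f.keys():
            # get_tensor() reads only this tensor's bytes from the file;
            # the temporary CPU copy is dropped once it lands on the GPU.
            state_dict[key] = f.get_tensor(key).to(device, non_blocking=True)
    return state_dict

# Hypothetical usage:
# model.load_state_dict(stream_checkpoint_to_gpu("unet.safetensors"), assign=True)
```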

Benchmarks:

I know nothing about your architecture, but this exercise tells me that proper RAM memory management is entirely possible. And you probably won't even have to rely on fastsafetensors, since regular safetensors are also designed to allow partial weight loading from disk.
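
To illustrate what partial loading from plain safetensors could look like: the sketch below is hypothetical (the class and method names are made up for illustration), but it shows why a disk tier never has to write anything, since the checkpoint file itself is the backing store.

```python
# Hypothetical sketch of a read-only disk tier. Evicting a tensor from RAM
# just drops the in-memory copy, because the bytes already live in the
# .safetensors file; retrieval re-reads only that one key (no writes).
import torch
from safetensors import safe_open

class ReadOnlyDiskTier:
    def __init__(self, checkpoint_path: str):
        self.path = checkpoint_path
        self.resident: dict[str, torch.Tensor] = {}  # tensors currently held in RAM

    def evict(self, key: str) -> None:
        # "Offload to disk" is just forgetting the RAM copy.
        self.resident.pop(key, None)

    def fetch(self, key: str) -> torch.Tensor:
        if key not in self.resident:
            with safe_open(self.path, framework="pt", device="cpu") as f:
                self.resident[key] = f.get_tensor(key)  # reads only this tensor
        return self.resident[key]
```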

Y'all are very welcome to clone it and try it yourself.

@socket-security

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff  | Package               | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|-------|-----------------------|-----------------------|---------------|---------|-------------|---------|
| Added | fastsafetensors@0.2.0 | 100                   | 100           | 100     | 100         | 100     |

View full report

@asagi4 (Contributor) commented Jan 9, 2026

Looks like exactly what I've wanted for a long time, but unfortunately I can't get it to work.

Running a basic SDXL workflow, I'm seeing a message indicating some kind of read failure:

nogds_file_reader._thread failed: pread(fd=23, buffer=0x17364000, offset=7102070784, count=3145728, l=135168), c=133446

and after that the workflow throws `KeyError: 'clip_g.positional_embedding'` from clip_text_transformers_convert.

Looks like it possibly failed to load the checkpoint and just didn't stop properly, so the state dict only got half populated?

I guess that might be a fastsafetensors problem.

@asagi4 (Contributor) commented Jan 9, 2026

I figured it out. All the pop implementations add the key to `self._deleted` before calling `self.get_tensor(key)`, so they always throw a `KeyError`.
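
Schematically (a reconstruction for illustration, not the actual diff; the surrounding class is a stand-in), the ordering issue and its fix look like this:

```python
# Schematic reconstruction of the ordering bug described above; the class
# is a stand-in, not code copied from the PR.
class LazyStateDict:
    def __init__(self, tensors):
        self._tensors = tensors        # key -> tensor, backed by the file reader
        self._deleted = set()

    def get_tensor(self, key):
        if key in self._deleted:       # deleted keys are treated as missing
            raise KeyError(key)
        return self._tensors[key]

    def pop_buggy(self, key):
        self._deleted.add(key)         # key is marked deleted first...
        return self.get_tensor(key)    # ...so this lookup always raises KeyError

    def pop_fixed(self, key):
        tensor = self.get_tensor(key)  # read the tensor while the key is still live
        self._deleted.add(key)         # only then hide it from later lookups
        return tensor
```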

After fixing that I can at least run SDXL and bf16 Chroma without errors. Looks like it still won't work with quantized models though, which is unfortunate.

The disk loader does seem to keep RAM usage lower than normal Comfy, but it doesn't seem to be completely problem-free. At least for workflows where everything fits into RAM, it appears to slow things down; it's quite noticeable in workflows that run the TE many times (i.e. when doing prompt scheduling).

@MeiYi-dev

One of the simpler fixes would be an accurate VAE decode/encode memory calculation: currently, before the VAE decode happens, ComfyUI just removes the whole model from VRAM, even though the decode stays within 4 GB of VRAM at most when tiled decoding is used.
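
For context, tiled decoding runs the VAE decoder over the latent in spatial tiles, so peak VRAM is bounded by the tile size rather than the output resolution. A rough, generic sketch of the idea follows; it is not ComfyUI's implementation (which additionally overlaps and blends tiles to hide seams), and `vae.decode` is an assumed interface.

```python
# Rough, generic sketch of tiled VAE decoding; peak VRAM scales with the
# tile size instead of the full latent resolution. `vae.decode` is an
# assumed interface, and seam blending is omitted for brevity.
import torch

@torch.no_grad()
def decode_tiled(vae, latent: torch.Tensor, tile: int = 64, scale: int = 8) -> torch.Tensor:
    # latent: (B, C, H, W) in latent space; the decoder upsamples by `scale`.
    b, _, h, w = latent.shape
    out = torch.empty((b, 3, h * scale, w * scale))
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            piece = latent[:, :, y:y + tile, x:x + tile]
            decoded = vae.decode(piece)  # only this tile's activations sit in VRAM
            out[:, :, y * scale:(y + piece.shape[2]) * scale,
                      x * scale:(x + piece.shape[3]) * scale] = decoded.float().cpu()
    return out
```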

@ifilipis (Author) commented Jan 10, 2026

> One of the simpler fixes would be an accurate VAE decode/encode memory calculation: currently, before the VAE decode happens, ComfyUI just removes the whole model from VRAM, even though the decode stays within 4 GB of VRAM at most when tiled decoding is used.

The goal here is to be able to run generations while filling RAM and VRAM to the brim, not just to handle the VAE.

> Looks like it still won't work with quantized models though, which is unfortunate.

Yeah, still trying to figure that out. I tried running FP8 Flux and fixed weight loading in a few places, but there's something wrong with the dtypes, and it's proving quite difficult to debug without knowing the backend.
