From 4d6ac81ed0375b2ccab6c40cc6166d8eddbdf957 Mon Sep 17 00:00:00 2001
From: Valtteri Valo <valtterimiikovalo@gmail.com>
Date: Tue, 17 Feb 2026 22:16:27 +0200
Subject: [PATCH] fix division by zero in anneal_beta when total_timesteps <
 batch_size

When total_timesteps is smaller than one batch (total_agents * horizon),
total_epochs = total_timesteps / batch_size truncates to 0. The
anneal_beta formula then computes current_epoch / total_epochs = 0 / 0,
which is NaN in IEEE 754 floating point. This NaN propagates through the
priority replay weights (mb_prio) into the loss, so every loss silently
becomes NaN.

cosine_annealing() already guards against this (line 714: if T == 0
return lr_base), but the anneal_beta computation that follows it does
not. Clamping total_epochs to at least 1 where it is computed removes
the zero denominator for anneal_beta and keeps cosine_annealing() safe
as well.
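
A minimal standalone sketch of the failure and the clamp, for
illustration only. The helper names (progress, epochs_unclamped,
epochs_clamped) are hypothetical and do not exist in pufferlib.cpp, and
the sketch assumes the anneal_beta ratio is evaluated in float:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: fraction of training completed. Assumes the ratio
// is computed in float, which is what makes 0 / 0 a NaN rather than a trap.
inline float progress(int current_epoch, int total_epochs) {
    return static_cast<float>(current_epoch) / static_cast<float>(total_epochs);
}

// Integer division truncates toward zero, so the result is 0 whenever
// total_timesteps < batch_size.
inline int epochs_unclamped(std::int64_t total_timesteps, std::int64_t batch_size) {
    return static_cast<int>(total_timesteps / batch_size);
}

// Same computation with the clamp this patch adds.
inline int epochs_clamped(std::int64_t total_timesteps, std::int64_t batch_size) {
    int total_epochs = epochs_unclamped(total_timesteps, batch_size);
    if (total_epochs < 1) total_epochs = 1;
    return total_epochs;
}
```

With total_timesteps = 1000 and batch_size = 4096, the unclamped count is
0 and progress(0, 0) is NaN under default floating-point settings; the
clamped count is 1 and progress(0, 1) is 0.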
---
 pufferlib/extensions/pufferlib.cpp | 1 +
 1 file changed, 1 insertion(+)

diff --git a/pufferlib/extensions/pufferlib.cpp b/pufferlib/extensions/pufferlib.cpp
index fadd2858e..675f4d75d 100644
--- a/pufferlib/extensions/pufferlib.cpp
+++ b/pufferlib/extensions/pufferlib.cpp
@@ -366,6 +366,7 @@ void train_impl(PuffeRL& pufferl) {
     Muon* muon = pufferl.muon;
 
     int total_epochs = hypers.total_timesteps / batch_size;
+    if (total_epochs < 1) total_epochs = 1;
 
     if (anneal_lr) {
         float lr_min = hypers.min_lr_ratio * hypers.lr;