feat(pt): log pre-clip gradient total_norm and per-param norm to tensorboard #5252
OutisLi wants to merge 2 commits into deepmodeling:master from
Conversation
📝 Walkthrough
Adds pre-clip gradient diagnostic collection and conditional TensorBoard logging during training when gradient clipping is enabled; collects per-parameter norms, computes the total norm via the clipping call, checks for non-finite values, and logs total_norm, a histogram of per-parameter norms, and the top-10 parameter norms.
Changes
Sequence Diagram(s): mermaid
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
🧹 Nitpick comments (2)
deepmd/pt/train/training.py (2)
994-1000: `pre_clip_named_norms` collected on every step instead of only on logging steps
The collection iterates all model parameters and calls `.norm()` per gradient on every training step, but the data is only consumed at `tensorboard_freq` intervals. For large models with `tensorboard_freq > 1` this is wasted work on every non-logging step. Move the collection inside (or co-located with) the tensorboard frequency check, e.g., guard it with the same condition used at line 1344:
♻️ Proposed fix
```diff
-        if self.gradient_max_norm > 0.0:
-            # Collect per-parameter gradient norms before clipping.
-            if self.enable_tensorboard:
-                pre_clip_named_norms = [
-                    (name, p.grad.detach().norm().item())
-                    for name, p in self.wrapper.named_parameters()
-                    if p.grad is not None
-                ]
-            # FSDP2 sharded DTensor gradients don't support error_if_nonfinite; use manual isfinite check instead.
-            total_norm = torch.nn.utils.clip_grad_norm_(
+        if self.gradient_max_norm > 0.0:
+            # Collect per-parameter gradient norms before clipping (only on TB log steps).
+            _tb_log_this_step = self.enable_tensorboard and (
+                display_step_id % self.tensorboard_freq == 0
+                or display_step_id == 1
+            )
+            if _tb_log_this_step:
+                pre_clip_named_norms = [
+                    (name, p.grad.detach().norm().item())
+                    for name, p in self.wrapper.named_parameters()
+                    if p.grad is not None
+                ]
+            # FSDP2 sharded DTensor gradients don't support error_if_nonfinite; use manual isfinite check instead.
+            total_norm = torch.nn.utils.clip_grad_norm_(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt/train/training.py` around lines 994 - 1000, pre_clip_named_norms is being computed every training step even when TensorBoard logging doesn't occur; move the comprehension that builds pre_clip_named_norms (which iterates self.wrapper.named_parameters() and calls p.grad.detach().norm().item()) so it only runs when self.enable_tensorboard is true AND the tensorboard logging condition (the same check used for tensorboard_freq elsewhere) is met; in practice, remove the current unconditional block and recreate pre_clip_named_norms inside the tensorboard logging branch where other TensorBoard metrics are prepared so the expensive per-parameter norm computation only happens on logging steps.
1353-1370: Gradient norm tags are not namespaced by `task_key` in multi-task training
In multi-task mode each call to `step()` uses a randomly selected `task_key`. The tags `"grad/total_norm"`, `"grad/param_norms"`, and `"grad_top10/{name}"` carry no task identifier, so TensorBoard will interleave readings from different tasks onto the same scalar/histogram series, making them uninterpretable.
♻️ Proposed fix
```diff
-                    writer.add_scalar(
-                        "grad/total_norm", total_norm.item(), display_step_id
-                    )
+                    writer.add_scalar(
+                        f"grad/{task_key}/total_norm", total_norm.item(), display_step_id
+                    )
                     norms = torch.tensor(...)
-                    writer.add_histogram("grad/param_norms", norms, display_step_id)
+                    writer.add_histogram(f"grad/{task_key}/param_norms", norms, display_step_id)
                     ...
-                        writer.add_scalar(f"grad_top10/{name}", gn, display_step_id)
+                        writer.add_scalar(f"grad_top10/{task_key}/{name}", gn, display_step_id)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt/train/training.py` around lines 1353 - 1370, The grad logging block under the self.gradient_max_norm check is missing the per-task namespace; modify the TensorBoard tag strings used in the writer calls (the writer.add_scalar and writer.add_histogram invocations inside the block that references pre_clip_named_norms, total_norm, display_step_id) to include the current task identifier (task_key) as a prefix or namespace (e.g. f"{task_key}/grad/total_norm", f"{task_key}/grad/param_norms", and f"{task_key}/grad_top10/{name}" after you shorten name as done now) so that all three logged series are namespaced by task; keep the existing name-shortening logic for per-parameter entries and only change the tag strings passed to writer.add_scalar/add_histogram.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepmd/pt/train/training.py`:
- Around line 996-1000: The per-parameter norm collection stored in
pre_clip_named_norms uses p.grad.detach().norm() which yields only shard-local
norms under FSDP2/ZeRO stage >= 2, producing misleading TensorBoard diagnostics;
update the code in the block that builds pre_clip_named_norms (referencing
pre_clip_named_norms, self.wrapper.named_parameters(), and self.zero_stage) to
either skip collecting per-parameter norms when self.zero_stage >= 2 or add a
clear comment/log entry that these norms are shard-local and not representative
of full-parameter norms; implement the skip by gating the comprehension on
self.zero_stage < 2 (or add the explicit documentation next to the existing
FSDP2 comment) so distributed runs do not report understated per-param norms.
- Around line 1352-1370: The TensorBoard logging block references total_norm and
pre_clip_named_norms but those are only set inside the optimizer-specific branch
for self.opt_type, causing UnboundLocalError for LKF; to fix, initialize
total_norm = None and pre_clip_named_norms = [] before the optimizer/step()
branch so they exist on all code paths, and change the TB logging to only emit
grad/total_norm and iterate pre_clip_named_norms when total_norm is not None (or
pre_clip_named_norms is non-empty) so the block gated by gradient_max_norm > 0.0
won't reference undefined variables.
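The hoisting fix above can be illustrated with a minimal, self-contained sketch (plain Python, no torch); the function names and the `1.23` placeholder norm are hypothetical, standing in for the optimizer branches in `training.py`:

```python
# Buggy shape: the variable exists only on the Adam path, so the LKF path
# raises UnboundLocalError when the logging condition evaluates it.
def step_buggy(opt_type: str, gradient_max_norm: float) -> str:
    if opt_type in ("Adam", "AdamW"):
        total_norm = 1.23  # only this branch defines the local
    if gradient_max_norm > 0.0 and total_norm is not None:  # lookup fails for "LKF"
        return f"logged {total_norm}"
    return "skipped"


# Fixed shape: initialization is hoisted before the branch, so the name
# exists on every code path and the `is not None` guard works as intended.
def step_fixed(opt_type: str, gradient_max_norm: float) -> str:
    total_norm = None
    if opt_type in ("Adam", "AdamW"):
        total_norm = 1.23
    if gradient_max_norm > 0.0 and total_norm is not None:
        return f"logged {total_norm}"
    return "skipped"
```

With clipping enabled, `step_buggy("LKF", 1.0)` raises `UnboundLocalError` while `step_fixed("LKF", 1.0)` safely skips logging.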
---
Nitpick comments:
In `@deepmd/pt/train/training.py`:
- Around line 994-1000: pre_clip_named_norms is being computed every training
step even when TensorBoard logging doesn't occur; move the comprehension that
builds pre_clip_named_norms (which iterates self.wrapper.named_parameters() and
calls p.grad.detach().norm().item()) so it only runs when
self.enable_tensorboard is true AND the tensorboard logging condition (the same
check used for tensorboard_freq elsewhere) is met; in practice, remove the
current unconditional block and recreate pre_clip_named_norms inside the
tensorboard logging branch where other TensorBoard metrics are prepared so the
expensive per-parameter norm computation only happens on logging steps.
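The guard the comment asks for can be sketched as a small predicate; `tensorboard_freq` and `display_step_id` mirror the trainer attributes named in the review, and the exact condition is the one quoted in the proposed fix, not taken from merged code:

```python
def is_tb_log_step(display_step_id: int, tensorboard_freq: int) -> bool:
    # Log on every tensorboard_freq-th step, plus the first step.
    return display_step_id % tensorboard_freq == 0 or display_step_id == 1


# Collection would then be gated like (hypothetical call site):
#   if self.enable_tensorboard and is_tb_log_step(display_step_id, self.tensorboard_freq):
#       pre_clip_named_norms = [...]
```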
- Around line 1353-1370: The grad logging block under the self.gradient_max_norm
check is missing the per-task namespace; modify the TensorBoard tag strings used
in the writer calls (the writer.add_scalar and writer.add_histogram invocations
inside the block that references pre_clip_named_norms, total_norm,
display_step_id) to include the current task identifier (task_key) as a prefix
or namespace (e.g. f"{task_key}/grad/total_norm",
f"{task_key}/grad/param_norms", and f"{task_key}/grad_top10/{name}" after you
shorten name as done now) so that all three logged series are namespaced by
task; keep the existing name-shortening logic for per-parameter entries and only
change the tag strings passed to writer.add_scalar/add_histogram.
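The namespacing described above amounts to prefixing each tag with the task identifier; a sketch of the tag layout, with `task_key` and the tag scheme taken from the review suggestion (the helper itself is hypothetical):

```python
def grad_tags(task_key: str, param_name: str) -> dict:
    # Build task-namespaced TensorBoard tags so multi-task runs do not
    # interleave different tasks onto the same scalar/histogram series.
    return {
        "total": f"grad/{task_key}/total_norm",
        "hist": f"grad/{task_key}/param_norms",
        "top10": f"grad_top10/{task_key}/{param_name}",
    }
```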
Pull request overview
Adds TensorBoard diagnostics for gradient norms to help debug/monitor training stability when gradient clipping is enabled.
Changes:
- Collect per-parameter gradient norms before clipping.
- Log pre-clip total gradient norm and a histogram of per-parameter norms to TensorBoard.
- Log the top-10 largest per-parameter gradient norms with shortened parameter names.
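The top-10 step above can be sketched in plain Python; the name-shortening rule (keeping the last three dotted components) is an assumption for illustration, not necessarily the exact rule in the patch:

```python
def top_k_norms(named_norms: list, k: int = 10) -> list:
    # Shorten long dotted parameter names for readable TensorBoard tags,
    # then keep the k largest pre-clip gradient norms.
    shortened = [
        (".".join(name.split(".")[-3:]), norm) for name, norm in named_norms
    ]
    return sorted(shortened, key=lambda item: item[1], reverse=True)[:k]
```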
Codecov Report
❌ Patch coverage is
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #5252      +/-   ##
==========================================
- Coverage   82.00%   81.99%   -0.01%
==========================================
  Files         750      750
  Lines       75082    75092      +10
  Branches     3615     3616       +1
==========================================
+ Hits        61571    61574       +3
- Misses      12347    12354       +7
  Partials     1164     1164
```
☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
♻️ Duplicate comments (1)
deepmd/pt/train/training.py (1)
1058-1064: ⚠️ Potential issue | 🟠 Major
UnboundLocalError still possible for LKF with gradient clipping enabled.
`total_norm`/`pre_clip_named_norms` are initialized only inside the Adam/AdaMuon branch, but the TensorBoard block runs for all optimizers. If `opt_type == "LKF"` and `gradient_max_norm > 0`, `total_norm` is referenced before assignment. This is the same issue raised previously and still applies.
🐛 Proposed fix
```diff
-        if self.opt_type in ["Adam", "AdamW", "AdaMuon", "HybridMuon"]:
+        # Initialize gradient diagnostics for all optimizer paths
+        total_norm: torch.Tensor | None = None
+        pre_clip_named_norms: list[tuple[str, float]] = []
+
+        if self.opt_type in ["Adam", "AdamW", "AdaMuon", "HybridMuon"]:
             cur_lr = self.scheduler.get_last_lr()[0]
             ...
-            # === Initialize gradient diagnostics variables ===
-            total_norm: torch.Tensor | None = None
-            pre_clip_named_norms: list[tuple[str, float]] = []
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt/train/training.py` around lines 1058 - 1064, The TensorBoard logging reads total_norm and pre_clip_named_norms regardless of optimizer, but those variables are only created inside the Adam/AdaMuon branch causing UnboundLocalError when opt_type == "LKF" and gradient_max_norm > 0; fix by initializing total_norm (e.g., to None) and pre_clip_named_norms (empty list) before the optimizer-specific branches (or explicitly set them in the LKF branch) so they always exist when the TensorBoard/logging block runs; reference the symbols total_norm, pre_clip_named_norms, opt_type, LKF, and gradient_max_norm to locate the relevant logic in the training loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepmd/pt/train/training.py`:
- Around line 1069-1073: The pre-clip norm collection uses _step_id %
tensorboard_freq while the TensorBoard logging condition uses display_step_id,
causing an off-by-one mismatch; update the pre-clip norm collection condition to
use display_step_id (or the same expression used for logging) instead of
_step_id so both diagnostics collection and TensorBoard logging run on the exact
same steps—look for the code that checks (_step_id % self.tensorboard_freq == 0
or _step_id == 1) and change it to mirror the logging check (e.g.,
(display_step_id % self.tensorboard_freq == 0 or display_step_id == 1) or
otherwise use display_step_id in that block), ensuring you reference
enable_tensorboard, zero_stage, tensorboard_freq, _step_id, and display_step_id
when making the change.
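The off-by-one described above is easy to see with a toy model; assuming the common convention `display_step_id = _step_id + 1` (an assumption, not confirmed by the source), the collection and logging conditions select different step sets:

```python
def steps_where(predicate, freq: int, n: int) -> list:
    # Return the step indices in [0, n) for which the condition holds.
    return [s for s in range(n) if predicate(s, freq)]


# Collection keyed on the raw counter _step_id.
collect = steps_where(lambda s, f: s % f == 0 or s == 1, 10, 25)
# Logging keyed on display_step_id == _step_id + 1.
log = steps_where(lambda s, f: (s + 1) % f == 0 or (s + 1) == 1, 10, 25)
```

The two sets disagree, so diagnostics can be collected on steps that never log (and vice versa); using the same counter in both conditions aligns them.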
- Around line 1430-1433: Run ruff to get exact locations, then fix the 19 lint
issues: for the 15 TRY003 instances (long exception messages) replace patterns
that build or pass long messages outside the exception by putting the message
directly into the exception constructor (e.g., change separate message variables
or concatenations used before raise into raise SomeError("full message") inside
the functions/methods where those raises occur); for the B905 warning add
strict=True to the zip(...) call found in the batching/iterator code; for the
B007 unused loop variable (loop in the training/validation iteration) rename the
unused variable to _ or remove it if not needed; and for the RUF059
unpacked-but-unused variable replace the unused unpacked name with _ or drop the
unpacking and only extract needed values. Ensure all edits target the raise
sites, the zip(...) call, the specific loop, and the unpacking in training.py
and re-run ruff until clean.
---
Duplicate comments:
In `@deepmd/pt/train/training.py`:
- Around line 1058-1064: The TensorBoard logging reads total_norm and
pre_clip_named_norms regardless of optimizer, but those variables are only
created inside the Adam/AdaMuon branch causing UnboundLocalError when opt_type
== "LKF" and gradient_max_norm > 0; fix by initializing total_norm (e.g., to
None) and pre_clip_named_norms (empty list) before the optimizer-specific
branches (or explicitly set them in the LKF branch) so they always exist when
the TensorBoard/logging block runs; reference the symbols total_norm,
pre_clip_named_norms, opt_type, LKF, and gradient_max_norm to locate the
relevant logic in the training loop.
♻️ Duplicate comments (2)
deepmd/pt/train/training.py (2)
1062-1064: ⚠️ Potential issue | 🔴 Critical
`UnboundLocalError` when `opt_type == "LKF"` with gradient clipping and TensorBoard enabled; still unresolved.
`total_norm` and `pre_clip_named_norms` are initialised at lines 1063-1064 inside the `if self.opt_type in ["Adam", "AdamW", "AdaMuon", "HybridMuon"]:` block. The TensorBoard logging block (line 1435) is outside that block and references both variables:

```python
if self.gradient_max_norm > 0.0 and total_norm is not None:  # UnboundLocalError for LKF
```

Python's short-circuit on `and` only protects when `gradient_max_norm == 0.0`; with clipping enabled under LKF the name lookup raises `UnboundLocalError` before `is not None` is ever evaluated.
🐛 Proposed fix: hoist initialisation before the optimizer branch
```diff
+        total_norm: torch.Tensor | None = None
+        pre_clip_named_norms: list[tuple[str, float]] = []
         if self.opt_type in ["Adam", "AdamW", "AdaMuon", "HybridMuon"]:
             cur_lr = self.scheduler.get_last_lr()[0]
             ...
             loss.backward()
-            # === Initialize gradient diagnostics variables ===
-            total_norm: torch.Tensor | None = None
-            pre_clip_named_norms: list[tuple[str, float]] = []
             if self.gradient_max_norm > 0.0:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt/train/training.py` around lines 1062 - 1064, The variables total_norm and pre_clip_named_norms must be defined before the optimizer-type conditional so they exist regardless of self.opt_type; move or hoist their initialization (total_norm: torch.Tensor | None = None and pre_clip_named_norms: list[tuple[str, float]] = []) out of the if self.opt_type in ["Adam", "AdamW", "AdaMuon", "HybridMuon"] block so the later TensorBoard/gradient-logging code that checks self.gradient_max_norm and total_norm can safely reference them even when opt_type == "LKF".
1455-1458: ⚠️ Potential issue | 🟡 Minor
Multi-task: shared-parameter norms logged under the active task key only.
`self.wrapper.named_parameters()` yields parameters from all model branches. In multi-task training the backward pass only computes gradients for the currently selected branch (plus any shared backbone). Because `task_key` rotates per step, the same shared-backbone parameter will appear under whichever task happened to be sampled (e.g. `water/grad_top10/model.Default.descriptor.layers.0.weight` one step and `ethanol/grad_top10/model.Default.descriptor.layers.0.weight` the next), making per-parameter histories fragmented and comparison across branches unreliable.
Consider either (a) using a fixed prefix (e.g. `grad_top10/...`) independent of `task_key`, or (b) filtering `pre_clip_named_norms` to only include parameters whose names contain `task_key`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepmd/pt/train/training.py` around lines 1455 - 1458, The current logging uses task_key in the scalar name so shared parameters from self.wrapper.named_parameters() get logged under rotating task-specific prefixes (task_key), fragmenting histories; change writer.add_scalar calls for the top-10 pre_clip_named_norms to use a fixed prefix (e.g. "grad_top10/{name}") instead of f"{task_key}/grad_top10/{name}" so shared-backbone params have a stable series, or if you prefer per-task isolation, filter pre_clip_named_norms to only include parameter names containing the current task_key before logging (use the variable pre_clip_named_norms and the task_key string to implement the filter) and keep writer.add_scalar naming consistent with that choice.
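Option (b) from the comment above can be sketched as a simple filter; the substring-matching rule is an assumption for illustration, since the actual parameter-naming scheme is not shown in this excerpt:

```python
def filter_by_task(named_norms: list, task_key: str) -> list:
    # Keep only (name, norm) pairs belonging to the current task so each
    # TensorBoard series stays consistent across steps.
    return [(name, norm) for name, norm in named_norms if task_key in name]
```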
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@deepmd/pt/train/training.py`:
- Around line 1062-1064: The variables total_norm and pre_clip_named_norms must
be defined before the optimizer-type conditional so they exist regardless of
self.opt_type; move or hoist their initialization (total_norm: torch.Tensor |
None = None and pre_clip_named_norms: list[tuple[str, float]] = []) out of the
if self.opt_type in ["Adam", "AdamW", "AdaMuon", "HybridMuon"] block so the
later TensorBoard/gradient-logging code that checks self.gradient_max_norm and
total_norm can safely reference them even when opt_type == "LKF".
- Around line 1455-1458: The current logging uses task_key in the scalar name so
shared parameters from self.wrapper.named_parameters() get logged under rotating
task-specific prefixes (task_key), fragmenting histories; change
writer.add_scalar calls for the top-10 pre_clip_named_norms to use a fixed
prefix (e.g. "grad_top10/{name}") instead of f"{task_key}/grad_top10/{name}" so
shared-backbone params have a stable series, or if you prefer per-task
isolation, filter pre_clip_named_norms to only include parameter names
containing the current task_key before logging (use the variable
pre_clip_named_norms and the task_key string to implement the filter) and keep
writer.add_scalar naming consistent with that choice.
Summary by CodeRabbit