
feat(pt): log pre-clip gradient total_norm and per-param norm to tensorboard#5252

Open
OutisLi wants to merge 2 commits into deepmodeling:master from OutisLi:pr/tboard

Conversation

@OutisLi
Collaborator

@OutisLi OutisLi commented Feb 21, 2026

Summary by CodeRabbit

  • New Features
    • Enhanced gradient diagnostics during training: when TensorBoard logging is enabled, training now records total gradient norm, per-parameter gradient norms, and a histogram of per-parameter norms. The system also highlights the top-10 largest parameter gradients as individual scalars and surfaces total-norm values in training logs to improve analysis and monitoring.

Copilot AI review requested due to automatic review settings February 21, 2026 12:19
@dosubot

dosubot bot commented Feb 21, 2026

Related Documentation

Checked 0 published document(s) in 1 knowledge base(s). No updates required.


@OutisLi OutisLi changed the title from "feat: log pre-clip gradient total_norm and per-param norm to tensorboard" to "feat(pt): log pre-clip gradient total_norm and per-param norm to tensorboard" on Feb 21, 2026
@dosubot dosubot bot added the new feature label Feb 21, 2026
@OutisLi OutisLi requested review from iProzd and njzjz February 21, 2026 12:22
@coderabbitai
Contributor

coderabbitai bot commented Feb 21, 2026

📝 Walkthrough

Walkthrough

Adds pre-clip gradient diagnostic collection and conditional TensorBoard logging during training when gradient clipping is enabled: collects per-parameter norms, computes the total norm via the clipping call, checks for non-finite values, and logs total_norm, a histogram of per-parameter norms, and the top-10 parameter norms.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Gradient diagnostic & logging: `deepmd/pt/train/training.py` | Adds scaffolding to collect per-parameter gradient norms before clipping when `gradient_max_norm > 0.0`. Computes `total_norm` via `clip_grad_norm_`, preserves non-finite checks, and conditionally logs `total_norm`, a CPU FP32 histogram (`task_key/grad/param_norms`), and the top-10 parameter norms (`task_key/grad_top10/{name}`). Collection/logging is gated by `enable_tensorboard`, `zero_stage < 2`, and the logging frequency; strips name prefixes when present. |
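For intuition, the aggregation described above can be sketched without PyTorch. Parameter names and gradient values below are invented for illustration; by default, `clip_grad_norm_` reports the L2 norm over all gradients, i.e. the root of the sum of squared per-parameter norms:

```python
import math

def grad_diagnostics(named_grads):
    """Per-parameter L2 norms plus the total norm that
    torch.nn.utils.clip_grad_norm_ would report (p=2 default)."""
    per_param = [
        (name, math.sqrt(sum(g * g for g in grad)))
        for name, grad in named_grads
        if grad is not None  # mirrors the `p.grad is not None` guard
    ]
    total = math.sqrt(sum(n * n for _, n in per_param))
    top10 = sorted(per_param, key=lambda kv: kv[1], reverse=True)[:10]
    return total, per_param, top10

# Hypothetical gradients for two parameters
total, per_param, top10 = grad_diagnostics(
    [("fitting.weight", [3.0, 4.0]), ("descriptor.bias", [0.0])]
)
```

The `total` scalar, the list feeding the histogram, and the `top10` slice correspond to the three TensorBoard series the PR adds.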

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Trainer
    participant Backward as Backward Pass
    participant GradCollector as Grad Collector
    participant Clipper as clip_grad_norm_
    participant TB as TensorBoard Writer
    Trainer->>Backward: trigger backward()
    Backward->>GradCollector: gather per-param gradients (if enabled)
    GradCollector->>GradCollector: compute per-parameter norms list
    GradCollector->>Clipper: call clip_grad_norm_ (compute total_norm)
    Clipper->>GradCollector: return total_norm
    GradCollector->>GradCollector: check non-finite total_norm
    GradCollector->>TB: log total_norm, histogram, top-10 norms (if collected)
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding logging of pre-clip gradient metrics (total_norm and per-parameter norms) to tensorboard in the PyTorch training module. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
deepmd/pt/train/training.py (2)

994-1000: pre_clip_named_norms collected on every step instead of only on logging steps

The collection iterates all model parameters and calls .norm() per gradient on every training step, but the data is only consumed at tensorboard_freq intervals. For large models with tensorboard_freq > 1 this is wasted work on every non-logging step.

Move the collection inside (or co-located with) the tensorboard frequency check, e.g., guard it with the same condition used at line 1344:

♻️ Proposed fix
```diff
-                if self.gradient_max_norm > 0.0:
-                    # Collect per-parameter gradient norms before clipping.
-                    if self.enable_tensorboard:
-                        pre_clip_named_norms = [
-                            (name, p.grad.detach().norm().item())
-                            for name, p in self.wrapper.named_parameters()
-                            if p.grad is not None
-                        ]
-                    # FSDP2 sharded DTensor gradients don't support error_if_nonfinite; use manual isfinite check instead.
-                    total_norm = torch.nn.utils.clip_grad_norm_(
+                if self.gradient_max_norm > 0.0:
+                    # Collect per-parameter gradient norms before clipping (only on TB log steps).
+                    _tb_log_this_step = self.enable_tensorboard and (
+                        display_step_id % self.tensorboard_freq == 0
+                        or display_step_id == 1
+                    )
+                    if _tb_log_this_step:
+                        pre_clip_named_norms = [
+                            (name, p.grad.detach().norm().item())
+                            for name, p in self.wrapper.named_parameters()
+                            if p.grad is not None
+                        ]
+                    # FSDP2 sharded DTensor gradients don't support error_if_nonfinite; use manual isfinite check instead.
+                    total_norm = torch.nn.utils.clip_grad_norm_(
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 994 - 1000, pre_clip_named_norms is
being computed every training step even when TensorBoard logging doesn't occur;
move the comprehension that builds pre_clip_named_norms (which iterates
self.wrapper.named_parameters() and calls p.grad.detach().norm().item()) so it
only runs when self.enable_tensorboard is true AND the tensorboard logging
condition (the same check used for tensorboard_freq elsewhere) is met; in
practice, remove the current unconditional block and recreate
pre_clip_named_norms inside the tensorboard logging branch where other
TensorBoard metrics are prepared so the expensive per-parameter norm computation
only happens on logging steps.

1353-1370: Gradient norm tags are not namespaced by task_key in multi-task training

In multi-task mode each call to step() uses a randomly selected task_key. The tags "grad/total_norm", "grad/param_norms", and "grad_top10/{name}" carry no task identifier, so TensorBoard will interleave readings from different tasks onto the same scalar/histogram series, making them uninterpretable.

♻️ Proposed fix
```diff
-                    writer.add_scalar(
-                        "grad/total_norm", total_norm.item(), display_step_id
-                    )
+                    writer.add_scalar(
+                        f"grad/{task_key}/total_norm", total_norm.item(), display_step_id
+                    )
                     norms = torch.tensor(...)
-                    writer.add_histogram("grad/param_norms", norms, display_step_id)
+                    writer.add_histogram(f"grad/{task_key}/param_norms", norms, display_step_id)
                     ...
-                        writer.add_scalar(f"grad_top10/{name}", gn, display_step_id)
+                        writer.add_scalar(f"grad_top10/{task_key}/{name}", gn, display_step_id)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 1353 - 1370, The grad logging block
under the self.gradient_max_norm check is missing the per-task namespace; modify
the TensorBoard tag strings used in the writer calls (the writer.add_scalar and
writer.add_histogram invocations inside the block that references
pre_clip_named_norms, total_norm, display_step_id) to include the current task
identifier (task_key) as a prefix or namespace (e.g.
f"{task_key}/grad/total_norm", f"{task_key}/grad/param_norms", and
f"{task_key}/grad_top10/{name}" after you shorten name as done now) so that all
three logged series are namespaced by task; keep the existing name-shortening
logic for per-parameter entries and only change the tag strings passed to
writer.add_scalar/add_histogram.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/pt/train/training.py`:
- Around line 996-1000: The per-parameter norm collection stored in
pre_clip_named_norms uses p.grad.detach().norm() which yields only shard-local
norms under FSDP2/ZeRO stage >= 2, producing misleading TensorBoard diagnostics;
update the code in the block that builds pre_clip_named_norms (referencing
pre_clip_named_norms, self.wrapper.named_parameters(), and self.zero_stage) to
either skip collecting per-parameter norms when self.zero_stage >= 2 or add a
clear comment/log entry that these norms are shard-local and not representative
of full-parameter norms; implement the skip by gating the comprehension on
self.zero_stage < 2 (or add the explicit documentation next to the existing
FSDP2 comment) so distributed runs do not report understated per-param norms.
- Around line 1352-1370: The TensorBoard logging block references total_norm and
pre_clip_named_norms but those are only set inside the optimizer-specific branch
for self.opt_type, causing UnboundLocalError for LKF; to fix, initialize
total_norm = None and pre_clip_named_norms = [] before the optimizer/step()
branch so they exist on all code paths, and change the TB logging to only emit
grad/total_norm and iterate pre_clip_named_norms when total_norm is not None (or
pre_clip_named_norms is non-empty) so the block gated by gradient_max_norm > 0.0
won't reference undefined variables.
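The shard-local norm caveat in the first inline comment above is easy to see numerically: under ZeRO stage >= 2 each rank holds only a shard of a gradient, so a locally computed norm is a lower bound on the full-parameter norm. The shard layout below is invented for illustration:

```python
import math

def l2(xs):
    return math.sqrt(sum(x * x for x in xs))

full_grad = [3.0, 4.0, 12.0]     # full-parameter gradient
shards = [[3.0, 4.0], [12.0]]    # hypothetical 2-rank sharding

full_norm = l2(full_grad)              # norm of the complete gradient
local_norms = [l2(s) for s in shards]  # each understates full_norm
# A correct reduction combines squared shard norms across ranks:
reduced = math.sqrt(sum(n * n for n in local_norms))
```

This is why the comment suggests either skipping per-parameter collection when `self.zero_stage >= 2` or documenting that the logged values are shard-local.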


Contributor

Copilot AI left a comment


Pull request overview

Adds TensorBoard diagnostics for gradient norms to help debug/monitor training stability when gradient clipping is enabled.

Changes:

  • Collect per-parameter gradient norms before clipping.
  • Log pre-clip total gradient norm and a histogram of per-parameter norms to TensorBoard.
  • Log the top-10 largest per-parameter gradient norms with shortened parameter names.
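A sketch of the name shortening plus top-10 selection described in the last bullet. The specific prefix list (`module.`, `model.Default.`) is an assumption for illustration, not the PR's exact stripping rule:

```python
def shorten(name, prefixes=("module.", "model.Default.")):
    # Strip a known wrapper prefix, if present (hypothetical prefix list)
    for p in prefixes:
        if name.startswith(p):
            return name[len(p):]
    return name

def top_k(named_norms, k=10):
    # Rank by norm descending, keep the k largest, shorten names for TB tags
    ranked = sorted(named_norms, key=lambda kv: kv[1], reverse=True)
    return [(shorten(name), norm) for name, norm in ranked[:k]]

tags = top_k([("module.descriptor.w", 2.5), ("model.Default.fitting.b", 7.0)])
```

Each resulting `(name, norm)` pair would become one `grad_top10/{name}` scalar series.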


@codecov

codecov bot commented Feb 21, 2026

Codecov Report

❌ Patch coverage is 30.76923% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.99%. Comparing base (65eea4b) to head (a932822).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/pt/train/training.py | 30.76% | 9 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #5252      +/-   ##
==========================================
- Coverage   82.00%   81.99%   -0.01%     
==========================================
  Files         750      750              
  Lines       75082    75092      +10     
  Branches     3615     3616       +1     
==========================================
+ Hits        61571    61574       +3     
- Misses      12347    12354       +7     
  Partials     1164     1164              
```

☔ View full report in Codecov by Sentry.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
deepmd/pt/train/training.py (1)

1058-1064: ⚠️ Potential issue | 🟠 Major

UnboundLocalError still possible for LKF with gradient clipping enabled.

total_norm / pre_clip_named_norms are initialized only inside the Adam/AdaMuon branch, but the TensorBoard block runs for all optimizers. If opt_type == "LKF" and gradient_max_norm > 0, total_norm is referenced before assignment. This is the same issue raised previously and still applies.

🐛 Proposed fix
```diff
-            if self.opt_type in ["Adam", "AdamW", "AdaMuon", "HybridMuon"]:
+            # Initialize gradient diagnostics for all optimizer paths
+            total_norm: torch.Tensor | None = None
+            pre_clip_named_norms: list[tuple[str, float]] = []
+
+            if self.opt_type in ["Adam", "AdamW", "AdaMuon", "HybridMuon"]:
                 cur_lr = self.scheduler.get_last_lr()[0]
                 ...
-                # === Initialize gradient diagnostics variables ===
-                total_norm: torch.Tensor | None = None
-                pre_clip_named_norms: list[tuple[str, float]] = []
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 1058 - 1064, The TensorBoard
logging reads total_norm and pre_clip_named_norms regardless of optimizer, but
those variables are only created inside the Adam/AdaMuon branch causing
UnboundLocalError when opt_type == "LKF" and gradient_max_norm > 0; fix by
initializing total_norm (e.g., to None) and pre_clip_named_norms (empty list)
before the optimizer-specific branches (or explicitly set them in the LKF
branch) so they always exist when the TensorBoard/logging block runs; reference
the symbols total_norm, pre_clip_named_norms, opt_type, LKF, and
gradient_max_norm to locate the relevant logic in the training loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deepmd/pt/train/training.py`:
- Around line 1069-1073: The pre-clip norm collection uses _step_id %
tensorboard_freq while the TensorBoard logging condition uses display_step_id,
causing an off-by-one mismatch; update the pre-clip norm collection condition to
use display_step_id (or the same expression used for logging) instead of
_step_id so both diagnostics collection and TensorBoard logging run on the exact
same steps—look for the code that checks (_step_id % self.tensorboard_freq == 0
or _step_id == 1) and change it to mirror the logging check (e.g.,
(display_step_id % self.tensorboard_freq == 0 or display_step_id == 1) or
otherwise use display_step_id in that block), ensuring you reference
enable_tensorboard, zero_stage, tensorboard_freq, _step_id, and display_step_id
when making the change.
- Around line 1430-1433: Run ruff to get exact locations, then fix the 19 lint
issues: for the 15 TRY003 instances (long exception messages) replace patterns
that build or pass long messages outside the exception by putting the message
directly into the exception constructor (e.g., change separate message variables
or concatenations used before raise into raise SomeError("full message") inside
the functions/methods where those raises occur); for the B905 warning add
strict=True to the zip(...) call found in the batching/iterator code; for the
B007 unused loop variable (loop in the training/validation iteration) rename the
unused variable to _ or remove it if not needed; and for the RUF059
unpacked-but-unused variable replace the unused unpacked name with _ or drop the
unpacking and only extract needed values. Ensure all edits target the raise
sites, the zip(...) call, the specific loop, and the unpacking in training.py
and re-run ruff until clean.
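The off-by-one described in the first inline comment can be illustrated directly. Assuming `display_step_id = _step_id + 1` (a 1-based display counter, which is a guess about this codebase), gating collection and logging on different counters makes them fire on almost disjoint steps:

```python
freq = 5  # hypothetical tensorboard_freq

def fires_raw(step_id):
    # Gate as written for collection, using the 0-based _step_id
    return step_id % freq == 0 or step_id == 1

def fires_display(step_id):
    # Gate as written for TB logging, using the 1-based display_step_id
    display = step_id + 1
    return display % freq == 0 or display == 1

collect = {s for s in range(10) if fires_raw(s)}
log = {s for s in range(10) if fires_display(s)}
```

With both gates keyed to the same counter, `collect` and `log` would be identical; here they barely overlap.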


ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6ef654c and 7b3c306.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
deepmd/pt/train/training.py (2)

1062-1064: ⚠️ Potential issue | 🔴 Critical

UnboundLocalError when opt_type == "LKF" with gradient clipping and TensorBoard enabled — still unresolved

total_norm and pre_clip_named_norms are initialised at lines 1063–1064 inside the if self.opt_type in ["Adam", "AdamW", "AdaMuon", "HybridMuon"]: block. The TensorBoard logging block (line 1435) is outside that block and references both variables:

```python
if self.gradient_max_norm > 0.0 and total_norm is not None:  # UnboundLocalError for LKF
```

Python's short-circuit on and only protects when gradient_max_norm == 0.0; with clipping enabled under LKF the name lookup raises UnboundLocalError before is not None is ever evaluated.
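The control flow can be reproduced in isolation (optimizer names mirror the report; the rest of the body is invented):

```python
def step(opt_type, gradient_max_norm):
    if opt_type in ("Adam", "AdamW", "AdaMuon", "HybridMuon"):
        total_norm = 1.0  # only defined on this branch
    # The logging gate runs for every optimizer type
    if gradient_max_norm > 0.0 and total_norm is not None:
        return total_norm
    return None
```

`step("LKF", 0.0)` is saved by the short-circuit, but `step("LKF", 1.0)` raises `UnboundLocalError` because `total_norm` is a local name that was never bound on that path.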

🐛 Proposed fix — hoist initialisation before the optimizer branch
```diff
+                total_norm: torch.Tensor | None = None
+                pre_clip_named_norms: list[tuple[str, float]] = []
                 if self.opt_type in ["Adam", "AdamW", "AdaMuon", "HybridMuon"]:
                     cur_lr = self.scheduler.get_last_lr()[0]
                     ...
                     loss.backward()
-                    # === Initialize gradient diagnostics variables ===
-                    total_norm: torch.Tensor | None = None
-                    pre_clip_named_norms: list[tuple[str, float]] = []
                     if self.gradient_max_norm > 0.0:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 1062 - 1064, The variables
total_norm and pre_clip_named_norms must be defined before the optimizer-type
conditional so they exist regardless of self.opt_type; move or hoist their
initialization (total_norm: torch.Tensor | None = None and pre_clip_named_norms:
list[tuple[str, float]] = []) out of the if self.opt_type in ["Adam", "AdamW",
"AdaMuon", "HybridMuon"] block so the later TensorBoard/gradient-logging code
that checks self.gradient_max_norm and total_norm can safely reference them even
when opt_type == "LKF".

1455-1458: ⚠️ Potential issue | 🟡 Minor

Multi-task: shared-parameter norms logged under the active task key only

self.wrapper.named_parameters() yields parameters from all model branches. In multi-task training the backward pass only computes gradients for the currently selected branch (plus any shared backbone). Because task_key rotates per step, the same shared-backbone parameter will appear under whichever task happened to be sampled — e.g. water/grad_top10/model.Default.descriptor.layers.0.weight one step and ethanol/grad_top10/model.Default.descriptor.layers.0.weight the next — making per-parameter histories fragmented and comparison across branches unreliable.

Consider either (a) using a fixed prefix (e.g. grad_top10/...) independent of task_key, or (b) filtering pre_clip_named_norms to only include parameters whose names contain task_key.
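Option (b) could be sketched as follows; the substring test on the qualified name is an assumption about how task-specific parameters are named in this codebase:

```python
def per_task_norms(pre_clip_named_norms, task_key):
    # Keep only parameters whose qualified name mentions the active task;
    # shared-backbone parameters (no task in their name) are dropped.
    return [(n, v) for n, v in pre_clip_named_norms if task_key in n]

norms = [
    ("model.water.fitting.w", 3.0),
    ("model.ethanol.fitting.w", 2.0),
    ("model.shared.descriptor.w", 9.0),  # shared backbone: filtered out
]
water = per_task_norms(norms, "water")
```

Option (a), a fixed `grad_top10/` prefix, instead keeps shared parameters in one stable series at the cost of mixing contributions from different task batches.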

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deepmd/pt/train/training.py` around lines 1455 - 1458, The current logging
uses task_key in the scalar name so shared parameters from
self.wrapper.named_parameters() get logged under rotating task-specific prefixes
(task_key), fragmenting histories; change writer.add_scalar calls for the top-10
pre_clip_named_norms to use a fixed prefix (e.g. "grad_top10/{name}") instead of
f"{task_key}/grad_top10/{name}" so shared-backbone params have a stable series,
or if you prefer per-task isolation, filter pre_clip_named_norms to only include
parameter names containing the current task_key before logging (use the variable
pre_clip_named_norms and the task_key string to implement the filter) and keep
writer.add_scalar naming consistent with that choice.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7b3c306 and a932822.

📒 Files selected for processing (1)
  • deepmd/pt/train/training.py

@OutisLi OutisLi requested a review from njzjz February 24, 2026 09:23
