Pr1 reward #40

Merged
rbx merged 2 commits into FairRootGroup:master from enlorenz:pr1_reward
Mar 24, 2026
Conversation

@enlorenz (Contributor)

No description provided.

rbx and others added 2 commits March 24, 2026 12:52
Adds per-core power tracking alongside the existing flat node-level model,
without changing the reward signal. The flat model charges COST_USED_MW for
any used node regardless of utilization; the proportional model scales the
compute portion linearly from COST_IDLE_MW (0 cores) to COST_USED_MW (96
cores), i.e. COST_IDLE + (COST_USED - COST_IDLE) * (cores_used / CORES_PER_NODE).
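The flat and proportional models described above can be sketched as follows. This is an illustrative sketch only: the constant values are placeholders, not the project's actual config, and the function names are hypothetical.

```python
# Illustrative constants; the real values come from the project's config.
COST_IDLE_MW = 0.5     # power draw of an idle (powered-on) node, MW
COST_USED_MW = 1.0     # power draw of a fully loaded node, MW
CORES_PER_NODE = 96

def flat_node_power_mw(nodes_on: int) -> float:
    """Flat model: every powered node is charged the full COST_USED_MW,
    regardless of how many of its cores are in use."""
    return COST_USED_MW * nodes_on

def proportional_node_power_mw(nodes_on: int, total_cores_used: int) -> float:
    """Proportional model: idle base per powered node, plus a compute
    portion that scales linearly with total core utilization."""
    delta = COST_USED_MW - COST_IDLE_MW
    return COST_IDLE_MW * nodes_on + delta * (total_cores_used / CORES_PER_NODE)
```

Under these placeholder constants, two fully idle nodes cost 1.0 MW proportionally versus 2.0 MW flat, and the two models agree only when every powered node is fully loaded.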

New metrics (eval log + TensorBoard):
- PropPower (MWh): agent/base/base_off under proportional model
- metrics/prop_power_mwh, baseline_prop_power_mwh, baseline_off_prop_power_mwh
- metrics/savings_prop_power_vs_baseline_off (MWh gap vs baseline_off)
- metrics/savings_prop_cost_vs_baseline_off (€ gap vs baseline_off, directly
  comparable to metrics/savings_off)

All proportional values are computed post-hoc from already-tracked per-step
lists (episode_on_nodes, episode_used_cores, episode_price_stats), so no
per-step changes to power_cost() or power_consumption_mwh() were needed.
- introduce new reward helpers that use per-node core utilization instead of node-count proxies
- compute efficiency from delivered core-hours per MWh with smooth target-based normalization
- compute price timing reward from total used cores so it depends on useful work, not packing shape
- keep legacy reward functions untouched and switch calculate() to the new terms when node state is available
- add regression tests covering packing-sensitive efficiency and packing-invariant price reward
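The efficiency term in the bullets above (delivered core-hours per MWh with smooth target-based normalization) might look roughly like the following sketch. The target value, the tanh shaping, and the function name are all assumptions for illustration; the repository's actual normalization may differ.

```python
import math

# Hypothetical target: core-hours delivered per MWh considered "good".
EFFICIENCY_TARGET = 96.0

def efficiency_reward(core_hours: float, energy_mwh: float) -> float:
    """Smoothly map delivered core-hours per MWh into [0, 1),
    approaching 1 as efficiency approaches or exceeds the target."""
    if energy_mwh <= 0.0:
        return 0.0
    efficiency = core_hours / energy_mwh
    return math.tanh(efficiency / EFFICIENCY_TARGET)
```

Because the score depends only on total delivered core-hours and total energy, it is sensitive to wasted idle capacity but, combined with a proportional power model, far less sensitive to how jobs are packed across nodes, which is what the regression tests above are checking.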

coderabbitai bot commented Mar 24, 2026

📝 Walkthrough

Walkthrough

The pull request introduces proportional (per-core) power and cost metrics throughout the codebase. New metric calculations are added to the metrics tracker, logged in callbacks during training, and appended to episode summary output. Additionally, the reward calculation system is extended with utilization-aware efficiency rewards and cores-based price reward functions.

Changes

Cohort / File(s) / Summary

- Metrics Computation (src/metrics_tracker.py): Extended record_episode_completion() to compute per-core proportional power and cost metrics for agent, baseline, and "baseline off" variants using imported config values (COST_IDLE_MW, COST_USED_MW, CORES_PER_NODE, MAX_NODES). New metrics added to the returned episode data dict.
- Metrics Logging and Display (src/callbacks.py, src/evaluation_summary.py): Extended imports in callbacks to include the new config constants. Added a metrics logging block in _on_step that computes and records proportional power and savings metrics. Extended the episode summary output in evaluation_summary to include a formatted PropPower field with agent and baseline values.
- Reward Calculation (src/reward_calculation.py): Added utility function _used_cores_from_state() to derive used cores from per-node state. Introduced new reward functions: _reward_energy_efficiency_utilization_normalized() for efficiency scoring via delivered core-hours per MWh, and _reward_price_utilization(), a cores-volume-scaled variant replacing the node-count-based price reward. Updated calculate() control flow to use the new utilization-aware functions when state arrays are provided.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • RewardCalculator #6: Refactors RewardCalculator and modifies reward function implementations in reward_calculation.py, directly related to the new reward functions introduced in this PR.
  • Pr1 price reward #31: Modifies reward_calculation.py's price and efficiency reward logic to use alternative usage measurement approaches, overlapping with the cores-based reward functions added here.
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 inconclusive)

- Title check: ❓ Inconclusive. The title is vague and does not clearly describe the main changes; 'Pr1 reward' lacks specificity and does not convey meaningful information about what was changed. Resolution: use a more descriptive title that clearly explains the main change, such as 'Add proportional power metrics and improve reward calculation with utilization-awareness'.
- Description check: ❓ Inconclusive. No pull request description was provided by the author, making it impossible to assess whether it relates to the changeset. Resolution: add a description explaining the purpose of the changes, the new metrics and reward functions introduced, and how they improve the system.
✅ Passed checks (1 passed)
- Docstring Coverage: ✅ Passed. Docstring coverage is 83.33%, which meets the required threshold of 80.00%.




@coderabbitai bot left a comment

🧹 Nitpick comments (4)
src/reward_calculation.py (2)

42-51: Consider extracting shared core-counting logic.

_used_cores_from_state duplicates the core calculation logic from _power_consumption_from_state_mwh (lines 22-28). Consider extracting the common mask/clip pattern into a shared helper to avoid drift if the calculation changes.

♻️ Optional: Extract shared logic
+def _compute_used_cores(nodes: np.ndarray, cores_available: np.ndarray) -> np.ndarray:
+    """Compute used cores per node with masking and clipping."""
+    on_mask = nodes >= 0
+    used_cores = np.zeros_like(cores_available, dtype=float)
+    if np.any(on_mask):
+        used_cores[on_mask] = CORES_PER_NODE - cores_available[on_mask]
+        used_cores = np.clip(used_cores, 0.0, float(CORES_PER_NODE))
+    return used_cores
+
+
 def _used_cores_from_state(nodes: np.ndarray, cores_available: np.ndarray) -> float:
     """Calculate the total number of used cores across all powered nodes."""
-    on_mask = nodes >= 0
-    if not np.any(on_mask):
-        return 0.0
-
-    used_cores = np.zeros_like(cores_available, dtype=float)
-    used_cores[on_mask] = CORES_PER_NODE - cores_available[on_mask]
-    used_cores = np.clip(used_cores, 0.0, float(CORES_PER_NODE))
-    return float(np.sum(used_cores))
+    return float(np.sum(_compute_used_cores(nodes, cores_available)))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/reward_calculation.py` around lines 42 - 51, Both _used_cores_from_state
and _power_consumption_from_state_mwh duplicate the mask/clip core-counting
logic; extract a small helper (e.g., _compute_used_cores_mask or similar) that
accepts nodes, cores_available and CORES_PER_NODE and returns the per-node used
cores (already masked and clipped) so both _used_cores_from_state and
_power_consumption_from_state_mwh call that helper and then sum/convert as
needed; ensure the helper signature and returned dtype match existing uses and
replace the duplicated lines in both functions (referencing nodes,
cores_available, CORES_PER_NODE).

333-374: Consider consolidating with _reward_price.

_reward_price_utilization shares substantial logic with _reward_price (lines 213-258). Both implement the same saturation curve, overdrive handling, and floor logic. A unified helper accepting a "work volume" parameter would reduce duplication.
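A hedged sketch of the suggested consolidation: one helper parameterized by a generic "work volume" in cores, so the node-based and cores-based price rewards share one saturation curve. Every name, default, and the simplified shaping below is an assumption drawn from the review text, not the actual implementation; in particular, the real functions also handle the negative-price overdrive path, which is omitted here.

```python
import math

CORES_PER_NODE = 96  # illustrative; the real value comes from config

def price_reward_with_volume(relative_advantage: float,
                             work_volume_cores: float,
                             tau_pos: float = 4.0,
                             tau_neg: float = 2.0) -> float:
    """Price-advantage component scaled by a load component that
    saturates with the equivalent number of fully used nodes."""
    equivalent_nodes = work_volume_cores / CORES_PER_NODE
    # Different time constants for favorable vs unfavorable prices.
    tau = tau_pos if relative_advantage >= 0 else tau_neg
    load_component = 1.0 - math.exp(-equivalent_nodes / tau)
    # Saturate so the reward stays bounded.
    return math.tanh(relative_advantage * load_component)
```

With this shape, _reward_price_utilization would pass total used cores as the work volume, while the legacy _reward_price would pass its node-count proxy converted to cores, keeping both behaviors behind one curve.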

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/reward_calculation.py` around lines 333 - 374, Extract the shared
saturation/overdrive logic from _reward_price_utilization and _reward_price into
a single helper (e.g. _price_reward_with_volume(self, current_price: float,
average_future_price: float, work_volume: float) -> float) that: computes
context_avg via self._price_context_average, uses self.prices.MAX_PRICE /
MIN_PRICE and price_span, computes relative_advantage and advantage_component
with self.PRICE_ADVANTAGE_GAIN, converts work_volume to equivalent_used_nodes
using CORES_PER_NODE, selects tau using self.PRICE_NODE_TAU_POS /
PRICE_NODE_TAU_NEG, computes load_component = 1 - exp(-equivalent_used_nodes /
tau) and raw_reward = advantage_component * load_component, then applies the
negative-price overdrive path and tanh saturation exactly as in both functions
(preserving NEGATIVE_PRICE_TAU, NEGATIVE_PRICE_NODE_TAU,
NEGATIVE_PRICE_OVERDRIVE_ALLOW_ABOVE_ONE, NEGATIVE_PRICE_OVERDRIVE_GAIN,
NEGATIVE_PRICE_OVERDRIVE_MAX_REWARD, NEGATIVE_PRICE_OVERDRIVE_FLOOR) and returns
float. Replace _reward_price_utilization and _reward_price to call this helper:
pass used_cores for work_volume in _reward_price_utilization and the appropriate
work volume (1.0 or equivalent) in _reward_price, keeping all existing behavior
and types.
src/metrics_tracker.py (1)

180-207: Add strict=True to zip() calls for robustness.

The episode lists (episode_on_nodes, episode_used_cores, episode_price_stats, etc.) should always have the same length. Adding strict=True would catch any length mismatch bugs early rather than silently truncating the sum.

♻️ Add strict=True
         agent_prop_power_mwh: float = sum(
             COST_IDLE_MW * on + _compute_delta_mw * (cores / CORES_PER_NODE)
-            for on, cores in zip(self.episode_on_nodes, self.episode_used_cores)
+            for on, cores in zip(self.episode_on_nodes, self.episode_used_cores, strict=True)
         )

Apply similarly to lines 196, 201, and 205.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/metrics_tracker.py` around lines 180 - 207, The zip() calls that
aggregate episode lists (used in computing agent_prop_power_mwh,
baseline_prop_power_mwh, baseline_off_prop_power_mwh, agent_prop_cost, and
baseline_off_prop_cost) should use strict=True to surface length mismatches
instead of silently truncating; update each zip(...) to zip(..., strict=True)
for the comprehensions that iterate over self.episode_on_nodes,
self.episode_used_cores, self.episode_baseline_used_cores,
self.episode_baseline_used_nodes, and self.episode_price_stats so mismatched
lengths raise an error early.
src/callbacks.py (1)

61-87: Code duplication with metrics_tracker.py.

These proportional power formulas (lines 63-74, 79-86) are duplicated verbatim in metrics_tracker.py (lines 184-206). While currently consistent, any future formula change must be applied in both locations. Consider extracting to a shared utility or computing only in metrics_tracker.py and accessing the result here.

Additionally, the static analyzer flags zip() calls without strict=True. Since these lists should always have equal length, adding strict=True would catch length mismatches early.

♻️ Add strict=True to zip calls
             agent_prop_power = sum(
                 COST_IDLE_MW * on + _delta * (cores / CORES_PER_NODE)
-                for on, cores in zip(env.metrics.episode_on_nodes, env.metrics.episode_used_cores)
+                for on, cores in zip(env.metrics.episode_on_nodes, env.metrics.episode_used_cores, strict=True)
             )
             baseline_prop_power = sum(
                 COST_IDLE_MW * MAX_NODES + _delta * (cores / CORES_PER_NODE)
                 for cores in env.metrics.episode_baseline_used_cores
             )
             baseline_off_prop_power = sum(
                 COST_IDLE_MW * used + _delta * (cores / CORES_PER_NODE)
-                for used, cores in zip(env.metrics.episode_baseline_used_nodes, env.metrics.episode_baseline_used_cores)
+                for used, cores in zip(env.metrics.episode_baseline_used_nodes, env.metrics.episode_baseline_used_cores, strict=True)
             )

Apply similar changes to lines 81 and 85.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/callbacks.py` around lines 61 - 87, The proportional power/cost
calculations (agent_prop_power, baseline_prop_power, baseline_off_prop_power,
agent_prop_cost, baseline_off_prop_cost) are duplicated with metrics_tracker.py;
extract them into a single shared utility (e.g., a function named
compute_proportional_power_metrics or a method on the metrics object) and call
that from both locations instead of duplicating logic, and when consuming
parallel lists (episode_on_nodes, episode_used_cores, episode_price_stats,
episode_baseline_used_nodes, episode_baseline_used_cores) change zip(...) calls
to zip(..., strict=True) to enforce equal-length lists and catch mismatches
early. Ensure the shared function returns the same named results and update the
logger calls (metrics/prop_power_mwh, metrics/baseline_prop_power_mwh,
metrics/baseline_off_prop_power_mwh, metrics/savings_prop_power_vs_baseline_off,
metrics/savings_prop_cost_vs_baseline_off) to use those returned values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 45f0f9ca-363e-4608-beec-aa03db755f76

📥 Commits

Reviewing files that changed from the base of the PR and between 937cb2d and e1e8b55.

📒 Files selected for processing (4)
  • src/callbacks.py
  • src/evaluation_summary.py
  • src/metrics_tracker.py
  • src/reward_calculation.py

class RewardCalculator:
    """Calculates rewards with pre-computed normalization bounds."""

    EFFICIENCY_TARGET_RATIO = 0.70
Member
Why .7 here?

@rbx rbx merged commit c31e453 into FairRootGroup:master Mar 24, 2026
4 checks passed
This was referenced Mar 24, 2026