
Explore per-core power calculation#38

Closed
rbx wants to merge 3 commits into master from exploration

Conversation

@rbx
Member

@rbx rbx commented Mar 20, 2026

No description provided.

rbx added 3 commits March 20, 2026 13:27
Adds per-core power tracking alongside the existing flat node-level model,
without changing the reward signal. The flat model charges COST_USED_MW for
any used node regardless of utilization; the proportional model scales the
compute portion linearly from COST_IDLE_MW (0 cores) to COST_USED_MW (96
cores), i.e. COST_IDLE + (COST_USED - COST_IDLE) * (cores_used / CORES_PER_NODE).

New metrics (eval log + TensorBoard):
- PropPower (MWh): agent/base/base_off under proportional model
- metrics/prop_power_mwh, baseline_prop_power_mwh, baseline_off_prop_power_mwh
- metrics/savings_prop_power_vs_baseline_off (MWh gap vs baseline_off)
- metrics/savings_prop_cost_vs_baseline_off (€ gap vs baseline_off, directly
  comparable to metrics/savings_off)

All proportional values are computed post-hoc from already-tracked per-step
lists (episode_on_nodes, episode_used_cores, episode_price_stats), so no
per-step changes to power_cost() or power_consumption_mwh() were needed.
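The model described above can be sketched as follows. This is an illustrative reconstruction from the commit message, not the project's actual code: the constant values are made up, and the per-node averaging in prop_power_mwh is an assumption about how the post-hoc computation combines the tracked per-step lists.

```python
# Illustrative constants; the real values live in the project's source.
COST_IDLE_MW = 0.2    # assumed per-node idle draw (MW)
COST_USED_MW = 0.5    # assumed per-node full-load draw (MW)
CORES_PER_NODE = 96

def prop_node_power_mw(cores_used: float) -> float:
    """Power of one on-node, scaled linearly from idle (0 cores) to full (96 cores)."""
    frac = cores_used / CORES_PER_NODE
    return COST_IDLE_MW + (COST_USED_MW - COST_IDLE_MW) * frac

def prop_power_mwh(on_nodes, used_cores, hours_per_step=1.0):
    """Post-hoc episode energy from already-tracked per-step lists.

    Assumes cores are spread evenly across on-nodes within a step.
    """
    total = 0.0
    for nodes, cores in zip(on_nodes, used_cores):
        if nodes <= 0:
            continue
        avg_cores = cores / nodes  # average utilization per on-node
        total += nodes * prop_node_power_mw(avg_cores) * hours_per_step
    return total
```

Because the proportional values are derived entirely from the per-step lists after the episode, the flat model's power_cost() path stays untouched, matching the "post-hoc" claim above.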
coderabbitai bot commented Mar 20, 2026

📝 Walkthrough

The changes refactor power and energy cost calculations from node-count-based to core-utilization-based metrics across the reward system. Job completion tracking is relocated from launch-time to completion-time in job management. Multiple function signatures are updated to pass metrics objects and new core-usage parameters throughout baseline, environment, and reward calculation modules. Episode summary output is enhanced with power consumption metrics.

Changes

Power & Cost Model Refactoring (src/reward_calculation.py, src/baseline.py):
Replaced the node-idle-based power cost model with a core-utilization-based approach. power_cost() and power_consumption_mwh() now accept num_on_nodes and cores_used instead of num_used_nodes and num_idle_nodes. The energy-efficiency reward calculation now uses a core-scaled delta above the idle baseline.

Job Completion Tracking Relocation (src/job_management.py):
Moved completion-time metric updates from assign_jobs_to_available_nodes to process_ongoing_jobs. Added per-job wait_time storage at launch. Updated the function signature to accept metrics: MetricsTracker and is_baseline: bool parameters for baseline-specific tracking.

Metrics & Function Call Updates (src/environment.py):
Updated the process_ongoing_jobs() invocation to pass self.metrics and set is_baseline=False. Changed the reward calculator input from num_idle_nodes to num_used_cores. Adjusted power consumption calculation parameters to use num_on_nodes and num_used_cores.

Baseline Processing (src/baseline.py):
Updated the process_ongoing_jobs() call to pass metrics and is_baseline=True. Refactored power/cost computation to use core-based parameters in the power_cost() and power_consumption_mwh() calls.

Metrics & Output Formatting (src/evaluation_summary.py, src/metrics_tracker.py):
Added formatted power consumption metrics to the episode summary, showing agent_power_consumption_mwh alongside the baseline values with an (agent/base/base_off) label. Minor blank-line formatting adjustment in MetricsTracker.record_episode_completion().

Testing Enhancements (test/test_sanity_env.py):
Added --check-gym and --carry-over-state CLI flags to conditionally run Stable-Baselines3 environment validation and carry-over state continuity tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 inconclusive

❌ Inconclusive (1)
- Description check: No description was provided by the author, making it impossible to assess whether it relates to the changeset. Resolution: add a description explaining the rationale and impact of the per-core power calculation changes.

✅ Passed (2)
- Title check: The title clearly and accurately summarizes the main change: refactoring power calculation from node-based to per-core metrics across multiple files.
- Docstring coverage: 81.82%, above the required 80.00% threshold.




@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 3

🧹 Nitpick comments (1)
src/reward_calculation.py (1)

267-301: Consider moving the price signal to core occupancy too.

Lines 294-295 switched cost and efficiency to num_used_cores, but Lines 299-301 still scale the price term with num_used_nodes. With that mix, a one-core workload and a fully packed node get the same price reward if they touch the same number of nodes, even though their power exposure is now very different. Passing core occupancy into _reward_price() would keep the reward components aligned with the new per-core model.

♻️ Possible direction
-    def _reward_price(self, current_price: float, average_future_price: float, num_used_nodes: int) -> float:
+    def _reward_price(self, current_price: float, average_future_price: float, num_used_cores: int) -> float:
         """
         Active signed price reward with fast saturation and negative-price overdrive.
 
         - Saturates quickly with better-than-context prices and used nodes.
         - Always applies overdrive when current price is negative.
         """
-        if num_used_nodes <= 0:
+        if num_used_cores <= 0:
             return 0.0
+        used_node_equiv = num_used_cores / CORES_PER_NODE
 
         context_avg = self._price_context_average(average_future_price)
         price_span = max(self.prices.MAX_PRICE - self.prices.MIN_PRICE, 1e-6)
         relative_advantage = (context_avg - current_price) / price_span
 
         advantage_component = self.PRICE_ADVANTAGE_GAIN * relative_advantage
         tau = self.PRICE_NODE_TAU_POS if advantage_component >= 0.0 else self.PRICE_NODE_TAU_NEG
-        node_component = 1.0 - np.exp(-num_used_nodes / tau)
+        node_component = 1.0 - np.exp(-used_node_equiv / tau)
         raw_reward = advantage_component * node_component
 
         if current_price < 0.0:
             negative_strength = (1.0 - np.exp(-abs(current_price) / self.NEGATIVE_PRICE_TAU))
-            negative_node_component = (1.0 - np.exp(-num_used_nodes / self.NEGATIVE_PRICE_NODE_TAU))
+            negative_node_component = (1.0 - np.exp(-used_node_equiv / self.NEGATIVE_PRICE_NODE_TAU))
             overdrive = negative_node_component * negative_strength
             ...
 
-        price_reward = self._reward_price(
-            current_price, average_future_price, num_used_nodes
-        )
+        price_reward = self._reward_price(
+            current_price, average_future_price, num_used_cores
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/reward_calculation.py` around lines 267 - 301, The price-term is still
scaled by node count while cost/efficiency moved to per-core, causing mismatched
signals; update calculate to pass num_used_cores (or a derived core occupancy)
into _reward_price instead of num_used_nodes, and update _reward_price's
signature/implementation to interpret that parameter as core-occupancy (adjust
scaling/normalization inside _reward_price accordingly); ensure any docstrings
and callers reflect the new parameter name (e.g., core_occupancy or
num_used_cores) so price_reward aligns with the per-core cost/efficiency model.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a7e10be4-9d08-4e37-8880-42306e668ce1

📥 Commits

Reviewing files that changed from the base of the PR and between 95c564e and 7e5ecb6.

📒 Files selected for processing (7)
  • src/baseline.py
  • src/environment.py
  • src/evaluation_summary.py
  • src/job_management.py
  • src/metrics_tracker.py
  • src/reward_calculation.py
  • test/test_sanity_env.py

Comment on lines +66 to +70
f"Baseline Occupancy (Nodes)={baseline_occupancy_nodes_pct:.2f}%, "
f"Power={float(episode_data['agent_power_consumption_mwh']):.1f}/"
f"{float(episode_data['baseline_power_consumption_mwh']):.1f}/"
f"{float(episode_data['baseline_power_consumption_off_mwh']):.1f} MWh "
f"(agent/base/base_off)"

⚠️ Potential issue | 🟡 Minor

Drop the second Power= field.

Line 67 duplicates the same metric already emitted at Lines 40-43, so the episode line now contains two Power= sections. That makes the output noisy and brittle for any downstream log parsing.

✂️ Proposed fix
-        f"Baseline Occupancy (Nodes)={baseline_occupancy_nodes_pct:.2f}%, "
-        f"Power={float(episode_data['agent_power_consumption_mwh']):.1f}/"
-        f"{float(episode_data['baseline_power_consumption_mwh']):.1f}/"
-        f"{float(episode_data['baseline_power_consumption_off_mwh']):.1f} MWh "
-        f"(agent/base/base_off)"
+        f"Baseline Occupancy (Nodes)={baseline_occupancy_nodes_pct:.2f}%"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluation_summary.py` around lines 66 - 70, The episode summary f-string
is emitting a duplicated "Power=" field; update the f-string that builds the
line using baseline_occupancy_nodes_pct and episode_data (keys
agent_power_consumption_mwh, baseline_power_consumption_mwh,
baseline_power_consumption_off_mwh) to remove the redundant "Power=" segment so
only the single Power metric (already emitted earlier) remains; locate the
formatted string construction in evaluation_summary.py and delete the second
Power=/... group while preserving the remaining text and numeric formatting.

Comment on lines +81 to +83
Completion is counted here (when duration hits zero), not at launch time.
'wait_time' on each job is the time spent in the queue before being launched,
which is the standard HPC metric for scheduler responsiveness.

⚠️ Potential issue | 🟠 Major

Completion-time episode counters now mix different job cohorts.

src/environment.py Lines 294-304 only reset episode metrics between truncations; they do not clear running_jobs, the queue, or backlog. After Lines 117-125 moved these increments to completion time, a job submitted in episode N and finishing in episode N+1 contributes to episode N+1's jobs_completed and avg_wait_time, while record_episode_completion() still divides by that episode's jobs_submitted. That makes the per-episode completion stats drift and can push completion_rate past 100%. Track submission cohort/episode on the job, or keep separate episode-local "started/launched" counters for the summary.

Also applies to: 114-125
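The fix suggested above can be sketched as episode-local counters combined with a submission-cohort check. All names here (EpisodeCounters, on_submit, on_complete) are hypothetical, not the project's MetricsTracker API; the point is only that completions are counted against the episode's own submissions, so completion_rate cannot exceed 100%.

```python
class EpisodeCounters:
    """Episode-local job accounting; reset at each episode boundary."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Called wherever the environment resets episode metrics.
        self.jobs_submitted = 0
        self.jobs_completed = 0
        self.wait_accumulator = 0.0

    def on_submit(self):
        self.jobs_submitted += 1

    def on_complete(self, wait_time: float, submitted_this_episode: bool):
        # Only count completions for jobs from the current cohort; a job
        # submitted in episode N but finishing in N+1 is excluded here.
        if submitted_this_episode:
            self.jobs_completed += 1
            self.wait_accumulator += wait_time

    def summary(self):
        rate = self.jobs_completed / self.jobs_submitted if self.jobs_submitted else 0.0
        avg_wait = self.wait_accumulator / self.jobs_completed if self.jobs_completed else 0.0
        return rate, avg_wait
```

The cohort flag would come from a submission_episode tag stored on each Job at enqueue time, as option (a) in the review prompt describes.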

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/job_management.py` around lines 81 - 83, Episode metrics are being
polluted because completion-time increments (in job completion handling around
running_jobs and updates to jobs_completed and avg_wait_time) count jobs that
were submitted in previous episodes; update the logic so episode boundaries are
respected by either: (a) tagging each Job object with its submission_episode
(e.g., set submission_episode when the job is enqueued) and only increment
per-episode counters (jobs_completed, avg_wait_time used by
record_episode_completion()) when job.submission_episode == current_episode, or
(b) maintain separate episode-local launch/completion counters (e.g.,
episode_jobs_launched, episode_jobs_completed, episode_wait_accumulator) that
are reset by record_episode_completion() and incremented at launch/completion
time; ensure record_episode_completion() uses those episode-local counters (not
global running_jobs/queue/backlog) so completion_rate cannot exceed 100%.

Comment on lines +387 to +389
from stable_baselines3.common.env_checker import check_env
check_env(make_env_with_carry(), warn=True)
print("[OK] gym check passed")

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify check_env usage and whether close() is paired in the same control block.
rg -n -C4 'if args\.check_gym|check_env\(|\.close\(' test/test_sanity_env.py

Repository: FairRootGroup/powersched



🏁 Script executed:

#!/bin/bash
# Search for other usages of check_env in the repository
rg -n 'check_env' --type py



🌐 Web query:

stable-baselines3 check_env closes environment cleanup

💡 Result:

Yes, the stable-baselines3 check_env function calls env.close on the environment during execution, specifically as part of the render check when skip_render_check=False (default is True, so it skips by default). If env.render_mode is set, it calls env.render followed by env.close. For environments wrapped in vectorized environments like DummyVecEnv (used internally in _check_nan), the VecEnv.close method is not explicitly called by check_env, but VecEnvs have a close method for cleanup that closes underlying envs (e.g., SubprocVecEnv sends "close" command and joins processes). check_env does not perform full environment cleanup beyond potential render-related close; it leaves the environment in a stepped state after calling reset and step for validation. Users should handle further cleanup if needed after calling check_env(env). To avoid render close: pass skip_render_check=True (default). Official source code confirms: in check_env, if not skipping render check: _check_render(env), which does env.render; env.close if render_mode present.


🏁 Script executed:

#!/bin/bash
# Check test_checkenv.py for check_env usage and cleanup
cat -n test/test_checkenv.py



Explicitly close the environment instance passed to check_env().

The environment created at line 388 is not explicitly closed. The check_env() function from stable-baselines3 does not automatically close the passed environment instance, leaving it in a stepped state. In this long-running sanity script that may be invoked repeatedly, explicit cleanup prevents resource/descriptor leaks, consistent with the pattern used elsewhere in this file (lines 115, 136, 198, 378).

♻️ Proposed fix
     if args.check_gym:
         from stable_baselines3.common.env_checker import check_env
-        check_env(make_env_with_carry(), warn=True)
+        gym_env = make_env_with_carry()
+        try:
+            check_env(gym_env, warn=True)
+        finally:
+            gym_env.close()
         print("[OK] gym check passed")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/test_sanity_env.py` around lines 387 - 389, The environment returned by
make_env_with_carry() passed into check_env() isn't explicitly closed; update
the snippet that calls check_env(make_env_with_carry(), warn=True) to capture
the created env (e.g., env = make_env_with_carry()), pass that env to
check_env(env, warn=True), and then call env.close() (or use a try/finally to
ensure env.close() runs) after the check to match the cleanup pattern used
elsewhere.

@rbx rbx closed this Mar 20, 2026
