
Explore per-core power calculation#38

Closed
rbx wants to merge 3 commits into master from exploration

Conversation

@rbx
Member

@rbx rbx commented Mar 20, 2026

No description provided.

rbx added 3 commits March 20, 2026 13:27
Adds per-core power tracking alongside the existing flat node-level model,
without changing the reward signal. The flat model charges COST_USED_MW for
any used node regardless of utilization; the proportional model scales the
compute portion linearly from COST_IDLE_MW (0 cores) to COST_USED_MW (96
cores), i.e. COST_IDLE + (COST_USED - COST_IDLE) * (cores_used / CORES_PER_NODE).

New metrics (eval log + TensorBoard):
- PropPower (MWh): agent/base/base_off under proportional model
- metrics/prop_power_mwh, baseline_prop_power_mwh, baseline_off_prop_power_mwh
- metrics/savings_prop_power_vs_baseline_off (MWh gap vs baseline_off)
- metrics/savings_prop_cost_vs_baseline_off (€ gap vs baseline_off, directly
  comparable to metrics/savings_off)

All proportional values are computed post-hoc from already-tracked per-step
lists (episode_on_nodes, episode_used_cores, episode_price_stats), so no
per-step changes to power_cost() or power_consumption_mwh() were needed.
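The model described above can be sketched as follows. This is an illustrative reconstruction from the commit message, not the project's actual code: the constant values are made up, and the per-node averaging in prop_power_mwh is an assumption about how the post-hoc computation combines the tracked per-step lists.

```python
# Illustrative constants; the real values live in the project's source.
COST_IDLE_MW = 0.2    # assumed per-node idle draw (MW)
COST_USED_MW = 0.5    # assumed per-node full-load draw (MW)
CORES_PER_NODE = 96

def prop_node_power_mw(cores_used: float) -> float:
    """Power of one on-node, scaled linearly from idle (0 cores) to full (96 cores)."""
    frac = cores_used / CORES_PER_NODE
    return COST_IDLE_MW + (COST_USED_MW - COST_IDLE_MW) * frac

def prop_power_mwh(on_nodes, used_cores, hours_per_step=1.0):
    """Post-hoc episode energy from already-tracked per-step lists.

    Assumes cores are spread evenly across on-nodes within a step.
    """
    total = 0.0
    for nodes, cores in zip(on_nodes, used_cores):
        if nodes <= 0:
            continue
        avg_cores = cores / nodes  # average utilization per on-node
        total += nodes * prop_node_power_mw(avg_cores) * hours_per_step
    return total
```

Because the proportional values are derived entirely from the per-step lists after the episode, the flat model's power_cost() path stays untouched, matching the "post-hoc" claim above.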
coderabbitai bot commented Mar 20, 2026

📝 Walkthrough

The changes refactor power and energy cost calculations from node-count-based to core-utilization-based metrics across the reward system. Job completion tracking is relocated from launch-time to completion-time in job management. Multiple function signatures are updated to pass metrics objects and new core-usage parameters throughout baseline, environment, and reward calculation modules. Episode summary output is enhanced with power consumption metrics.

Changes

Power & Cost Model Refactoring (src/reward_calculation.py, src/baseline.py):
Replaced the node-idle-based power cost model with a core-utilization-based approach. power_cost() and power_consumption_mwh() now accept num_on_nodes and cores_used instead of num_used_nodes and num_idle_nodes. The energy-efficiency reward calculation now uses a core-scaled delta above the idle baseline.

Job Completion Tracking Relocation (src/job_management.py):
Moved completion-time metric updates from assign_jobs_to_available_nodes to process_ongoing_jobs. Added per-job wait_time storage at launch. Updated the function signature to accept metrics: MetricsTracker and is_baseline: bool parameters for baseline-specific tracking.

Metrics & Function Call Updates (src/environment.py):
Updated the process_ongoing_jobs() invocation to pass self.metrics and set is_baseline=False. Changed the reward calculator input from num_idle_nodes to num_used_cores. Adjusted power consumption calculation parameters to use num_on_nodes and num_used_cores.

Baseline Processing (src/baseline.py):
Updated the process_ongoing_jobs() call to pass metrics and is_baseline=True. Refactored power/cost computation to use core-based parameters in the power_cost() and power_consumption_mwh() calls.

Metrics & Output Formatting (src/evaluation_summary.py, src/metrics_tracker.py):
Added formatted power consumption metrics to the episode summary, showing agent_power_consumption_mwh alongside the baseline values with an (agent/base/base_off) label. Minor blank-line formatting adjustment in MetricsTracker.record_episode_completion().

Testing Enhancements (test/test_sanity_env.py):
Added --check-gym and --carry-over-state CLI flags to conditionally run Stable-Baselines3 environment validation and carry-over state continuity tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 inconclusive

❌ Inconclusive (1)
- Description check: No description was provided by the author, making it impossible to assess whether it relates to the changeset. Resolution: add a description explaining the rationale and impact of the per-core power calculation changes.

✅ Passed (2)
- Title check: The title clearly and accurately summarizes the main change: refactoring power calculation from node-based to per-core metrics across multiple files.
- Docstring coverage: 81.82%, above the required 80.00% threshold.




@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 3

🧹 Nitpick comments (1)
src/reward_calculation.py (1)

267-301: Consider moving the price signal to core occupancy too.

Lines 294-295 switched cost and efficiency to num_used_cores, but Lines 299-301 still scale the price term with num_used_nodes. With that mix, a one-core workload and a fully packed node get the same price reward if they touch the same number of nodes, even though their power exposure is now very different. Passing core occupancy into _reward_price() would keep the reward components aligned with the new per-core model.

♻️ Possible direction
-    def _reward_price(self, current_price: float, average_future_price: float, num_used_nodes: int) -> float:
+    def _reward_price(self, current_price: float, average_future_price: float, num_used_cores: int) -> float:
         """
         Active signed price reward with fast saturation and negative-price overdrive.
 
         - Saturates quickly with better-than-context prices and used nodes.
         - Always applies overdrive when current price is negative.
         """
-        if num_used_nodes <= 0:
+        if num_used_cores <= 0:
             return 0.0
+        used_node_equiv = num_used_cores / CORES_PER_NODE
 
         context_avg = self._price_context_average(average_future_price)
         price_span = max(self.prices.MAX_PRICE - self.prices.MIN_PRICE, 1e-6)
         relative_advantage = (context_avg - current_price) / price_span
 
         advantage_component = self.PRICE_ADVANTAGE_GAIN * relative_advantage
         tau = self.PRICE_NODE_TAU_POS if advantage_component >= 0.0 else self.PRICE_NODE_TAU_NEG
-        node_component = 1.0 - np.exp(-num_used_nodes / tau)
+        node_component = 1.0 - np.exp(-used_node_equiv / tau)
         raw_reward = advantage_component * node_component
 
         if current_price < 0.0:
             negative_strength = (1.0 - np.exp(-abs(current_price) / self.NEGATIVE_PRICE_TAU))
-            negative_node_component = (1.0 - np.exp(-num_used_nodes / self.NEGATIVE_PRICE_NODE_TAU))
+            negative_node_component = (1.0 - np.exp(-used_node_equiv / self.NEGATIVE_PRICE_NODE_TAU))
             overdrive = negative_node_component * negative_strength
             ...
 
-        price_reward = self._reward_price(
-            current_price, average_future_price, num_used_nodes
-        )
+        price_reward = self._reward_price(
+            current_price, average_future_price, num_used_cores
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/reward_calculation.py` around lines 267 - 301, The price-term is still
scaled by node count while cost/efficiency moved to per-core, causing mismatched
signals; update calculate to pass num_used_cores (or a derived core occupancy)
into _reward_price instead of num_used_nodes, and update _reward_price's
signature/implementation to interpret that parameter as core-occupancy (adjust
scaling/normalization inside _reward_price accordingly); ensure any docstrings
and callers reflect the new parameter name (e.g., core_occupancy or
num_used_cores) so price_reward aligns with the per-core cost/efficiency model.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a7e10be4-9d08-4e37-8880-42306e668ce1

📥 Commits

Reviewing files that changed from the base of the PR and between 95c564e and 7e5ecb6.

📒 Files selected for processing (7)
  • src/baseline.py
  • src/environment.py
  • src/evaluation_summary.py
  • src/job_management.py
  • src/metrics_tracker.py
  • src/reward_calculation.py
  • test/test_sanity_env.py

Comment on lines +66 to +70
f"Baseline Occupancy (Nodes)={baseline_occupancy_nodes_pct:.2f}%, "
f"Power={float(episode_data['agent_power_consumption_mwh']):.1f}/"
f"{float(episode_data['baseline_power_consumption_mwh']):.1f}/"
f"{float(episode_data['baseline_power_consumption_off_mwh']):.1f} MWh "
f"(agent/base/base_off)"

⚠️ Potential issue | 🟡 Minor

Drop the second Power= field.

Line 67 duplicates the same metric already emitted at Lines 40-43, so the episode line now contains two Power= sections. That makes the output noisy and brittle for any downstream log parsing.

✂️ Proposed fix
-        f"Baseline Occupancy (Nodes)={baseline_occupancy_nodes_pct:.2f}%, "
-        f"Power={float(episode_data['agent_power_consumption_mwh']):.1f}/"
-        f"{float(episode_data['baseline_power_consumption_mwh']):.1f}/"
-        f"{float(episode_data['baseline_power_consumption_off_mwh']):.1f} MWh "
-        f"(agent/base/base_off)"
+        f"Baseline Occupancy (Nodes)={baseline_occupancy_nodes_pct:.2f}%"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/evaluation_summary.py` around lines 66 - 70, The episode summary f-string
is emitting a duplicated "Power=" field; update the f-string that builds the
line using baseline_occupancy_nodes_pct and episode_data (keys
agent_power_consumption_mwh, baseline_power_consumption_mwh,
baseline_power_consumption_off_mwh) to remove the redundant "Power=" segment so
only the single Power metric (already emitted earlier) remains; locate the
formatted string construction in evaluation_summary.py and delete the second
Power=/... group while preserving the remaining text and numeric formatting.

Comment on lines +81 to +83
Completion is counted here (when duration hits zero), not at launch time.
'wait_time' on each job is the time spent in the queue before being launched,
which is the standard HPC metric for scheduler responsiveness.

⚠️ Potential issue | 🟠 Major

Completion-time episode counters now mix different job cohorts.

src/environment.py Lines 294-304 only reset episode metrics between truncations; they do not clear running_jobs, the queue, or backlog. After Lines 117-125 moved these increments to completion time, a job submitted in episode N and finishing in episode N+1 contributes to episode N+1's jobs_completed and avg_wait_time, while record_episode_completion() still divides by that episode's jobs_submitted. That makes the per-episode completion stats drift and can push completion_rate past 100%. Track submission cohort/episode on the job, or keep separate episode-local "started/launched" counters for the summary.

Also applies to: 114-125
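The fix suggested above can be sketched as episode-local counters combined with a submission-cohort check. All names here (EpisodeCounters, on_submit, on_complete) are hypothetical, not the project's MetricsTracker API; the point is only that completions are counted against the episode's own submissions, so completion_rate cannot exceed 100%.

```python
class EpisodeCounters:
    """Episode-local job accounting; reset at each episode boundary."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Called wherever the environment resets episode metrics.
        self.jobs_submitted = 0
        self.jobs_completed = 0
        self.wait_accumulator = 0.0

    def on_submit(self):
        self.jobs_submitted += 1

    def on_complete(self, wait_time: float, submitted_this_episode: bool):
        # Only count completions for jobs from the current cohort; a job
        # submitted in episode N but finishing in N+1 is excluded here.
        if submitted_this_episode:
            self.jobs_completed += 1
            self.wait_accumulator += wait_time

    def summary(self):
        rate = self.jobs_completed / self.jobs_submitted if self.jobs_submitted else 0.0
        avg_wait = self.wait_accumulator / self.jobs_completed if self.jobs_completed else 0.0
        return rate, avg_wait
```

The cohort flag would come from a submission_episode tag stored on each Job at enqueue time, as option (a) in the review prompt describes.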

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/job_management.py` around lines 81 - 83, Episode metrics are being
polluted because completion-time increments (in job completion handling around
running_jobs and updates to jobs_completed and avg_wait_time) count jobs that
were submitted in previous episodes; update the logic so episode boundaries are
respected by either: (a) tagging each Job object with its submission_episode
(e.g., set submission_episode when the job is enqueued) and only increment
per-episode counters (jobs_completed, avg_wait_time used by
record_episode_completion()) when job.submission_episode == current_episode, or
(b) maintain separate episode-local launch/completion counters (e.g.,
episode_jobs_launched, episode_jobs_completed, episode_wait_accumulator) that
are reset by record_episode_completion() and incremented at launch/completion
time; ensure record_episode_completion() uses those episode-local counters (not
global running_jobs/queue/backlog) so completion_rate cannot exceed 100%.

Comment on lines +387 to +389
from stable_baselines3.common.env_checker import check_env
check_env(make_env_with_carry(), warn=True)
print("[OK] gym check passed")

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify check_env usage and whether close() is paired in the same control block.
rg -n -C4 'if args\.check_gym|check_env\(|\.close\(' test/test_sanity_env.py

Repository: FairRootGroup/powersched



🏁 Script executed:

#!/bin/bash
# Search for other usages of check_env in the repository
rg -n 'check_env' --type py



🌐 Web query:

stable-baselines3 check_env closes environment cleanup

💡 Result:

Yes, the stable-baselines3 check_env function calls env.close on the environment during execution, specifically as part of the render check when skip_render_check=False (default is True, so it skips by default). If env.render_mode is set, it calls env.render followed by env.close. For environments wrapped in vectorized environments like DummyVecEnv (used internally in _check_nan), the VecEnv.close method is not explicitly called by check_env, but VecEnvs have a close method for cleanup that closes underlying envs (e.g., SubprocVecEnv sends "close" command and joins processes). check_env does not perform full environment cleanup beyond potential render-related close; it leaves the environment in a stepped state after calling reset and step for validation. Users should handle further cleanup if needed after calling check_env(env). To avoid render close: pass skip_render_check=True (default). Official source code confirms: in check_env, if not skipping render check: _check_render(env), which does env.render; env.close if render_mode present.


🏁 Script executed:

#!/bin/bash
# Check test_checkenv.py for check_env usage and cleanup
cat -n test/test_checkenv.py



Explicitly close the environment instance passed to check_env().

The environment created at line 388 is not explicitly closed. The check_env() function from stable-baselines3 does not automatically close the passed environment instance, leaving it in a stepped state. In this long-running sanity script that may be invoked repeatedly, explicit cleanup prevents resource/descriptor leaks, consistent with the pattern used elsewhere in this file (lines 115, 136, 198, 378).

♻️ Proposed fix
     if args.check_gym:
         from stable_baselines3.common.env_checker import check_env
-        check_env(make_env_with_carry(), warn=True)
+        gym_env = make_env_with_carry()
+        try:
+            check_env(gym_env, warn=True)
+        finally:
+            gym_env.close()
         print("[OK] gym check passed")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/test_sanity_env.py` around lines 387 - 389, The environment returned by
make_env_with_carry() passed into check_env() isn't explicitly closed; update
the snippet that calls check_env(make_env_with_carry(), warn=True) to capture
the created env (e.g., env = make_env_with_carry()), pass that env to
check_env(env, warn=True), and then call env.close() (or use a try/finally to
ensure env.close() runs) after the check to match the cleanup pattern used
elsewhere.

@rbx rbx closed this Mar 20, 2026
