
WIP: core-based power calc exploration #39

Closed
rbx wants to merge 3 commits into master from exploration

Conversation

@rbx
Member

@rbx rbx commented Mar 20, 2026

Switch power model to proportional per-core.

Replace binary node-level power (idle or fully-used per node) with a proportional model: all on-nodes draw idle baseline, compute delta scales linearly with cores_used/CORES_PER_NODE.

Updates power_cost(), power_consumption_mwh(), energy efficiency reward, and all call sites in environment.py and baseline.py.
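A minimal sketch of this proportional model (the constants and function name here are illustrative; the real computation lives in power_cost() and power_consumption_mwh() in src/reward_calculation.py and may differ in names and units):

```python
# Hypothetical constants -- the real values live in src/reward_calculation.py.
CORES_PER_NODE = 128
IDLE_POWER_W = 100.0     # baseline draw of any powered-on node
COMPUTE_POWER_W = 300.0  # additional draw of a fully loaded node

def node_power_watts(num_on_nodes: int, cores_used: int) -> float:
    """Proportional per-core power model: every on-node pays the idle
    baseline, and the compute delta scales linearly with core usage."""
    idle = num_on_nodes * IDLE_POWER_W
    capacity = num_on_nodes * CORES_PER_NODE
    utilization = cores_used / capacity if capacity else 0.0
    compute = num_on_nodes * COMPUTE_POWER_W * utilization
    return idle + compute
```

Under these illustrative numbers, a single node at half load draws the idle baseline plus half the compute delta, rather than the full node power of the old binary model.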

rbx added 3 commits March 20, 2026 13:27
Replace binary node-level power (idle or fully-used per node) with a
proportional model: all on-nodes draw idle baseline, compute delta
scales linearly with cores_used/CORES_PER_NODE.

Updates power_cost(), power_consumption_mwh(), energy efficiency reward,
and all call sites in environment.py and baseline.py.
@coderabbitai
Contributor

coderabbitai Bot commented Mar 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This pull request refactors the metrics tracking and power computation system across the job scheduling pipeline. Changes shift from node-count-based to core-utilization-based power calculations, introduce a shared MetricsTracker for baseline and environment steps, update job wait-time tracking at assignment, add gymnasium environment validation, implement log merging utilities, and enhance training subprocess logging with per-run log files.

Changes

• Core Metrics & Job Processing (src/baseline.py, src/environment.py, src/job_management.py): Updated process_ongoing_jobs signature to accept metrics and is_baseline parameters; shifted job completion accounting to when duration ≤ 0 and now updates metrics; assign_jobs_to_available_nodes records wait_time on job assignment; both baseline and environment steps pass MetricsTracker and updated node/core arguments to job processing functions; explicit int casting on aggregated node/job counts in baseline.

• Power & Reward Calculation (src/reward_calculation.py): Transitioned power model from node-split (num_used_nodes vs num_idle_nodes) to core-utilization-based (num_on_nodes, cores_used); power_cost() and power_consumption_mwh() now compute idle baseline across all on-nodes plus linear compute scaling by core fraction; _reward_energy_efficiency_normalized() refactored to derive efficiency from compute power ratio; RewardCalculator.calculate() updated to accept num_used_cores and derive num_on_nodes internally.

• Testing Infrastructure (test/test_sanity_env.py, test/run_all.py): Added --check-gym CLI flag and optional gymnasium environment validation block in test_sanity_env.py; removed test_sanity_env execution from main test suite in run_all.py.

• Data Utilities & Logging (data/merge_logs.py, train_iter.py): Introduced new merge_logs.py script for chronologically ordering and merging multiple Slurm partition logs with header preservation and timestamp-based sorting; enhanced train_iter.py with make_log_dir() and label_to_filename() helpers; updated run_all_parallel() to manage per-run log files with explicit close on completion and progress counter display.

Sequence Diagram(s)

sequenceDiagram
    participant Env as ComputeClusterEnv
    participant JM as job_management
    participant MT as MetricsTracker
    participant RC as RewardCalculator

    Env->>JM: process_ongoing_jobs(nodes, cores, running_jobs, metrics, is_baseline=False)
    JM->>JM: Decrement job duration
    alt Job completed (duration ≤ 0)
        JM->>JM: Extract wait_time from job
        JM->>MT: Update non-baseline completion counter<br/>and wait_time sums
        JM->>JM: Remove from running_jobs
    end
    JM-->>Env: Return completed job IDs

    Env->>Env: Calculate num_on_nodes, num_used_cores
    Env->>RC: calculate(num_used_nodes, num_idle_nodes,<br/>num_used_cores, prices, metrics, ...)
    RC->>RC: power_cost(num_on_nodes, num_used_cores, price)
    RC->>RC: _reward_energy_efficiency_normalized<br/>(num_on_nodes, num_used_cores)
    RC-->>Env: Return reward signal
    Env-->>Env: Update env state and metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

• Docstring Coverage ⚠️ Warning: Docstring coverage is 59.09%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

• Title check ✅ Passed: The title 'WIP: core-based power calc exploration' accurately reflects the main change of switching from a binary node-level power model to a core-based proportional power calculation model.
• Description check ✅ Passed: The description clearly explains the power model switch from binary node-level to proportional per-core, details the key components updated, and aligns with the changeset modifications across multiple files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@data/merge_logs.py`:
- Around line 43-44: The output path can be one of the input files and will be
overwritten; add a guard after input_files is computed (where input_files is
set) to check args.output against the resolved input file paths (use
absolute/normalized paths) and error out or exit if they match any entry in
input_files; reference the variables input_files and args.output (and the main
function or script entry) and ensure you compare normalized paths
(os.path.abspath/os.path.realpath) to prevent accidental overlap before
proceeding to the write step around the block that handles writing the merged
output.
- Around line 18-27: parse_submit currently returns an empty string for
non-data/malformed lines which sorts before valid timestamps; change
parse_submit to return None instead of '' for malformed/no-timestamp lines
(function: parse_submit) and update the sorting key that consumes it to treat
None as greater than any real timestamp (e.g., sort by a tuple like
(parse_submit(line) is None, parse_submit(line)) or explicitly map None to a max
sentinel) so malformed rows appear after valid chronological entries.
- Around line 56-58: The loop over input_files in merge_logs.py currently opens
each path directly and will crash on unreadable/missing files; wrap the
open/read operations (the with open(path, 'r') as fh: block and the
fh.readlines() call) in a try/except catching OSError, and on failure write a
clear error to stderr (including the failing path and exception message) and
exit with a non-zero status (sys.exit(1)); reference the variables path,
input_files and lines so the error message is informative and the rest of the
script does not continue after a failure.
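The first and third findings might be addressed along these lines (a sketch under assumed names: read_inputs_safely is hypothetical, and input_files plus the output path are taken from the script's argument parsing as described above):

```python
import os
import sys

def read_inputs_safely(input_files, output_path):
    """Refuse to clobber an input file, then read all inputs,
    failing loudly with a non-zero exit on any unreadable path."""
    resolved_output = os.path.realpath(output_path)
    for path in input_files:
        # Compare normalized paths so symlinks/relative paths can't slip through.
        if os.path.realpath(path) == resolved_output:
            sys.exit(f"error: output {output_path!r} would overwrite input {path!r}")
    all_lines = []
    for path in input_files:
        try:
            with open(path, "r") as fh:
                all_lines.extend(fh.readlines())
        except OSError as exc:
            sys.exit(f"error: cannot read {path!r}: {exc}")
    return all_lines
```

Normalizing with os.path.realpath (rather than comparing raw strings) is what makes the overwrite guard robust against relative paths and symlinks pointing at the same file.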

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 86012ad3-df56-4be8-881b-c04959ed4f34

📥 Commits

Reviewing files that changed from the base of the PR and between 8454317 and a1fe782.

📒 Files selected for processing (1)
  • data/merge_logs.py

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
data/merge_logs.py (1)

82-82: Avoid parsing the same timestamp twice in sort key.

At Line 82, parse_submit(line) is called twice per element during sort. Compute once in a small helper key function for clarity and lower overhead.

Proposed refactor
-    data_lines.sort(key=lambda line: (parse_submit(line) is None, parse_submit(line)))
+    def submit_sort_key(line: str):
+        ts = parse_submit(line)
+        return (ts is None, ts or "")
+
+    data_lines.sort(key=submit_sort_key)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data/merge_logs.py` at line 82, The sort key calls parse_submit(line) twice
which is wasteful and unclear; replace the lambda with a small helper key
function (e.g., def _submit_key(line): ts = parse_submit(line); return (ts is
None, ts)) and pass that to data_lines.sort so each line's timestamp is parsed
once; update the call at data_lines.sort(...) to use _submit_key and keep
parse_submit as the timestamp extractor.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@data/merge_logs.py`:
- Around line 84-86: Wrap the output file write in a try/except consistent with
the read-side handling: surround the with open(args.output, 'w') as out: block
that writes header and data_lines (variables header and data_lines) with a try,
catch exceptions (Exception as e), print a clear error message to sys.stderr
including args.output and the exception, and call sys.exit(1) to exit with
non-zero status; ensure you reference args.output, header and data_lines and
mirror the same formatting used by the read-side error handling.

---

Nitpick comments:
In `@data/merge_logs.py`:
- Line 82: The sort key calls parse_submit(line) twice which is wasteful and
unclear; replace the lambda with a small helper key function (e.g., def
_submit_key(line): ts = parse_submit(line); return (ts is None, ts)) and pass
that to data_lines.sort so each line's timestamp is parsed once; update the call
at data_lines.sort(...) to use _submit_key and keep parse_submit as the
timestamp extractor.
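Mirroring the read-side handling, the write step could be wrapped like this (a sketch; write_merged is a hypothetical helper, and header and data_lines are assumed to come from the script's earlier stages):

```python
import sys

def write_merged(output_path, header, data_lines):
    """Write the preserved header followed by the sorted data lines,
    exiting with a non-zero status on any write failure."""
    try:
        with open(output_path, "w") as out:
            out.writelines(header)
            out.writelines(data_lines)
    except OSError as exc:
        print(f"error: cannot write {output_path!r}: {exc}", file=sys.stderr)
        sys.exit(1)
```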

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5d8fcf81-443f-444d-9ac5-42c00c501590

📥 Commits

Reviewing files that changed from the base of the PR and between a1fe782 and feda016.

📒 Files selected for processing (1)
  • data/merge_logs.py

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
powersched.def (1)

36-45: Pin remaining Python dependencies for deterministic training images.

These package installs lack version pinning; without exact pins or a lock file, future resolver changes can alter behavior and metrics between rebuilds. Use a lock file (via pip-tools, Poetry, or uv) or generate exact pinned versions with pip freeze to ensure reproducible training runs across different build times.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@powersched.def` around lines 36 - 45, The install commands in powersched.def
install multiple Python packages (stable-baselines3, gymnasium, tensorboard,
matplotlib, colorama, numpy, pandas, torchinfo, setuptools) without pinned
versions, which risks non-deterministic builds; replace this by adding an exact,
committed lock of dependencies and reference it from powersched.def (e.g.,
generate a requirements.txt or requirements.lock via pip freeze or
pip-tools/Poetry/venv lock and change the pip install call to use that pinned
file), or alternatively pin each package to exact versions in the installation
list so builds are reproducible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@powersched.def`:
- Around line 32-33: The pip install line pins only torch to 2.5.1 causing
potential incompatibilities; update the install invocation that currently lists
"torch==2.5.1 torchvision torchaudio --index-url
https://download.pytorch.org/whl/rocm6.2" to explicitly pin torchvision and
torchaudio to the ROCm-compatible versions (use torchvision==0.20.1 and
torchaudio==2.5.1) while keeping the same index URL and --no-cache-dir flag so
the installed packages match torch==2.5.1.
- Line 29: The line that pipes a remote get-pip.py script into Python ("curl -sS
https://bootstrap.pypa.io/get-pip.py | python3.12") should be replaced with
using Python's built-in ensurepip module: invoke Python 3.12 to run the
ensurepip module to bootstrap pip from the interpreter's bundled wheels (and
then run the pip module to upgrade pip if desired); remove the curl|python
pattern and call ensurepip/pip instead to avoid remote script execution.
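Taken together, the %post fragment of powersched.def could avoid the curl | python pattern and pin the stack roughly as follows (a sketch only: the torchvision/torchaudio pins come from the comment above, the remaining version choice is delegated to a hypothetical committed requirements.lock, and none of it is verified against the ROCm wheel index):

```shell
# Sketch of a %post fragment for powersched.def.

# Bootstrap pip from the interpreter's bundled wheels instead of curl|python.
python3.12 -m ensurepip --upgrade
python3.12 -m pip install --upgrade pip

# Pin the ROCm torch stack to mutually compatible versions.
python3.12 -m pip install --no-cache-dir \
    torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 \
    --index-url https://download.pytorch.org/whl/rocm6.2

# Install the remaining dependencies from a committed lock file
# (e.g. generated with pip freeze or pip-tools) for reproducible builds.
python3.12 -m pip install --no-cache-dir -r requirements.lock
```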

---

Nitpick comments:
In `@powersched.def`:
- Around line 36-45: The install commands in powersched.def install multiple
Python packages (stable-baselines3, gymnasium, tensorboard, matplotlib,
colorama, numpy, pandas, torchinfo, setuptools) without pinned versions, which
risks non-deterministic builds; replace this by adding an exact, committed lock
of dependencies and reference it from powersched.def (e.g., generate a
requirements.txt or requirements.lock via pip freeze or pip-tools/Poetry/venv
lock and change the pip install call to use that pinned file), or alternatively
pin each package to exact versions in the installation list so builds are
reproducible.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 493a226f-a848-4ca3-acdf-e94d6386f4df

📥 Commits

Reviewing files that changed from the base of the PR and between feda016 and 511c104.

📒 Files selected for processing (1)
  • powersched.def

@FairRootGroup FairRootGroup deleted a comment from coderabbitai Bot Mar 23, 2026
@FairRootGroup FairRootGroup deleted a comment from coderabbitai Bot Mar 23, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@train_iter.py`:
- Around line 213-217: The code opens a file handle (log_fh) then calls
subprocess.Popen and appends (proc, label, log_fh) to active, but if
subprocess.Popen raises an exception the open file will leak; wrap the Popen
call in a try/except/finally (or try/except) so that if
subprocess.Popen(command, env=current_env, stdout=log_fh,
stderr=subprocess.STDOUT) raises, you close log_fh before re-raising or handling
the error, and only append (proc, label, log_fh) to the active list after Popen
succeeds (identify the block around log_path, label_to_filename, log_fh,
subprocess.Popen, and active.append).
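One way to close the leak described above (a sketch; launch_logged is hypothetical, while command, label, log_fh, and active mirror the names in the comment):

```python
import subprocess

def launch_logged(command, log_path, label, active, env=None):
    """Open the per-run log, start the subprocess, and only register the
    (proc, label, log_fh) triple once Popen has succeeded; if the launch
    fails, close the file handle before re-raising so it cannot leak."""
    log_fh = open(log_path, "w")
    try:
        proc = subprocess.Popen(command, env=env,
                                stdout=log_fh, stderr=subprocess.STDOUT)
    except Exception:
        log_fh.close()
        raise
    active.append((proc, label, log_fh))
    return proc
```

Appending to active only after Popen succeeds also keeps the cleanup loop in run_all_parallel() from ever seeing a half-initialized entry.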

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ca49c92-e50d-4dc4-a8de-238eeb05b053

📥 Commits

Reviewing files that changed from the base of the PR and between 511c104 and c2dfb8b.

📒 Files selected for processing (2)
  • data/merge_logs.py
  • train_iter.py
✅ Files skipped from review due to trivial changes (1)
  • data/merge_logs.py

@rbx rbx closed this Mar 24, 2026