
WIP: core-based power calc exploration #39

Closed
rbx wants to merge 3 commits into master from exploration

Conversation

@rbx
Member

@rbx rbx commented Mar 20, 2026

Switch power model to proportional per-core.

Replace binary node-level power (idle or fully-used per node) with a proportional model: all on-nodes draw idle baseline, compute delta scales linearly with cores_used/CORES_PER_NODE.

Updates power_cost(), power_consumption_mwh(), energy efficiency reward, and all call sites in environment.py and baseline.py.
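A minimal sketch of this proportional model (the constants and function name here are illustrative; the real computation lives in power_cost() and power_consumption_mwh() in src/reward_calculation.py and may differ in names and units):

```python
# Hypothetical constants -- the real values live in src/reward_calculation.py.
CORES_PER_NODE = 128
IDLE_POWER_W = 100.0     # baseline draw of any powered-on node
COMPUTE_POWER_W = 300.0  # additional draw of a fully loaded node

def node_power_watts(num_on_nodes: int, cores_used: int) -> float:
    """Proportional per-core power model: every on-node pays the idle
    baseline, and the compute delta scales linearly with core usage."""
    idle = num_on_nodes * IDLE_POWER_W
    capacity = num_on_nodes * CORES_PER_NODE
    utilization = cores_used / capacity if capacity else 0.0
    compute = num_on_nodes * COMPUTE_POWER_W * utilization
    return idle + compute
```

Under these illustrative numbers, a single node at half load draws the idle baseline plus half the compute delta, rather than the full node power of the old binary model.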

rbx added 3 commits March 20, 2026 13:27
Replace binary node-level power (idle or fully-used per node) with a
proportional model: all on-nodes draw idle baseline, compute delta
scales linearly with cores_used/CORES_PER_NODE.

Updates power_cost(), power_consumption_mwh(), energy efficiency reward,
and all call sites in environment.py and baseline.py.
@coderabbitai
Contributor

coderabbitai Bot commented Mar 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This pull request refactors the metrics tracking and power computation system across the job scheduling pipeline. Changes shift from node-count-based to core-utilization-based power calculations, introduce a shared MetricsTracker for baseline and environment steps, update job wait-time tracking at assignment, add gymnasium environment validation, implement log merging utilities, and enhance training subprocess logging with per-run log files.

Changes

• Core Metrics & Job Processing (src/baseline.py, src/environment.py, src/job_management.py): Updated process_ongoing_jobs signature to accept metrics and is_baseline parameters; shifted job completion accounting to when duration ≤ 0 and now updates metrics; assign_jobs_to_available_nodes records wait_time on job assignment; both baseline and environment steps pass MetricsTracker and updated node/core arguments to job processing functions; explicit int casting on aggregated node/job counts in baseline.

• Power & Reward Calculation (src/reward_calculation.py): Transitioned power model from node-split (num_used_nodes vs num_idle_nodes) to core-utilization-based (num_on_nodes, cores_used); power_cost() and power_consumption_mwh() now compute idle baseline across all on-nodes plus linear compute scaling by core fraction; _reward_energy_efficiency_normalized() refactored to derive efficiency from compute power ratio; RewardCalculator.calculate() updated to accept num_used_cores and derive num_on_nodes internally.

• Testing Infrastructure (test/test_sanity_env.py, test/run_all.py): Added --check-gym CLI flag and optional gymnasium environment validation block in test_sanity_env.py; removed test_sanity_env execution from main test suite in run_all.py.

• Data Utilities & Logging (data/merge_logs.py, train_iter.py): Introduced new merge_logs.py script for chronologically ordering and merging multiple Slurm partition logs with header preservation and timestamp-based sorting; enhanced train_iter.py with make_log_dir() and label_to_filename() helpers; updated run_all_parallel() to manage per-run log files with explicit close on completion and progress counter display.

Sequence Diagram(s)

sequenceDiagram
    participant Env as ComputeClusterEnv
    participant JM as job_management
    participant MT as MetricsTracker
    participant RC as RewardCalculator

    Env->>JM: process_ongoing_jobs(nodes, cores, running_jobs, metrics, is_baseline=False)
    JM->>JM: Decrement job duration
    alt Job completed (duration ≤ 0)
        JM->>JM: Extract wait_time from job
        JM->>MT: Update non-baseline completion counter<br/>and wait_time sums
        JM->>JM: Remove from running_jobs
    end
    JM-->>Env: Return completed job IDs

    Env->>Env: Calculate num_on_nodes, num_used_cores
    Env->>RC: calculate(num_used_nodes, num_idle_nodes,<br/>num_used_cores, prices, metrics, ...)
    RC->>RC: power_cost(num_on_nodes, num_used_cores, price)
    RC->>RC: _reward_energy_efficiency_normalized<br/>(num_on_nodes, num_used_cores)
    RC-->>Env: Return reward signal
    Env-->>Env: Update env state and metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

• Docstring Coverage ⚠️ Warning: Docstring coverage is 59.09%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

• Title check ✅ Passed: The title 'WIP: core-based power calc exploration' accurately reflects the main change of switching from a binary node-level power model to a core-based proportional power calculation model.
• Description check ✅ Passed: The description clearly explains the power model switch from binary node-level to proportional per-core, details the key components updated, and aligns with the changeset modifications across multiple files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@data/merge_logs.py`:
- Around line 43-44: The output path can be one of the input files and will be
overwritten; add a guard after input_files is computed (where input_files is
set) to check args.output against the resolved input file paths (use
absolute/normalized paths) and error out or exit if they match any entry in
input_files; reference the variables input_files and args.output (and the main
function or script entry) and ensure you compare normalized paths
(os.path.abspath/os.path.realpath) to prevent accidental overlap before
proceeding to the write step around the block that handles writing the merged
output.
- Around line 18-27: parse_submit currently returns an empty string for
non-data/malformed lines which sorts before valid timestamps; change
parse_submit to return None instead of '' for malformed/no-timestamp lines
(function: parse_submit) and update the sorting key that consumes it to treat
None as greater than any real timestamp (e.g., sort by a tuple like
(parse_submit(line) is None, parse_submit(line)) or explicitly map None to a max
sentinel) so malformed rows appear after valid chronological entries.
- Around line 56-58: The loop over input_files in merge_logs.py currently opens
each path directly and will crash on unreadable/missing files; wrap the
open/read operations (the with open(path, 'r') as fh: block and the
fh.readlines() call) in a try/except catching OSError, and on failure write a
clear error to stderr (including the failing path and exception message) and
exit with a non-zero status (sys.exit(1)); reference the variables path,
input_files and lines so the error message is informative and the rest of the
script does not continue after a failure.
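The first and third findings might be addressed along these lines (a sketch under assumed names: read_inputs_safely is hypothetical, and input_files plus the output path are taken from the script's argument parsing as described above):

```python
import os
import sys

def read_inputs_safely(input_files, output_path):
    """Refuse to clobber an input file, then read all inputs,
    failing loudly with a non-zero exit on any unreadable path."""
    resolved_output = os.path.realpath(output_path)
    for path in input_files:
        # Compare normalized paths so symlinks/relative paths can't slip through.
        if os.path.realpath(path) == resolved_output:
            sys.exit(f"error: output {output_path!r} would overwrite input {path!r}")
    all_lines = []
    for path in input_files:
        try:
            with open(path, "r") as fh:
                all_lines.extend(fh.readlines())
        except OSError as exc:
            sys.exit(f"error: cannot read {path!r}: {exc}")
    return all_lines
```

Normalizing with os.path.realpath (rather than comparing raw strings) is what makes the overwrite guard robust against relative paths and symlinks pointing at the same file.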

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 86012ad3-df56-4be8-881b-c04959ed4f34

📥 Commits

Reviewing files that changed from the base of the PR and between 8454317 and a1fe782.

📒 Files selected for processing (1)
  • data/merge_logs.py

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
data/merge_logs.py (1)

82-82: Avoid parsing the same timestamp twice in sort key.

At Line 82, parse_submit(line) is called twice per element during sort. Compute once in a small helper key function for clarity and lower overhead.

Proposed refactor
-    data_lines.sort(key=lambda line: (parse_submit(line) is None, parse_submit(line)))
+    def submit_sort_key(line: str):
+        ts = parse_submit(line)
+        return (ts is None, ts or "")
+
+    data_lines.sort(key=submit_sort_key)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@data/merge_logs.py` at line 82, The sort key calls parse_submit(line) twice
which is wasteful and unclear; replace the lambda with a small helper key
function (e.g., def _submit_key(line): ts = parse_submit(line); return (ts is
None, ts)) and pass that to data_lines.sort so each line's timestamp is parsed
once; update the call at data_lines.sort(...) to use _submit_key and keep
parse_submit as the timestamp extractor.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@data/merge_logs.py`:
- Around line 84-86: Wrap the output file write in a try/except consistent with
the read-side handling: surround the with open(args.output, 'w') as out: block
that writes header and data_lines (variables header and data_lines) with a try,
catch exceptions (Exception as e), print a clear error message to sys.stderr
including args.output and the exception, and call sys.exit(1) to exit with
non-zero status; ensure you reference args.output, header and data_lines and
mirror the same formatting used by the read-side error handling.

---

Nitpick comments:
In `@data/merge_logs.py`:
- Line 82: The sort key calls parse_submit(line) twice which is wasteful and
unclear; replace the lambda with a small helper key function (e.g., def
_submit_key(line): ts = parse_submit(line); return (ts is None, ts)) and pass
that to data_lines.sort so each line's timestamp is parsed once; update the call
at data_lines.sort(...) to use _submit_key and keep parse_submit as the
timestamp extractor.
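Mirroring the read-side handling, the write step could be wrapped like this (a sketch; write_merged is a hypothetical helper, and header and data_lines are assumed to come from the script's earlier stages):

```python
import sys

def write_merged(output_path, header, data_lines):
    """Write the preserved header followed by the sorted data lines,
    exiting with a non-zero status on any write failure."""
    try:
        with open(output_path, "w") as out:
            out.writelines(header)
            out.writelines(data_lines)
    except OSError as exc:
        print(f"error: cannot write {output_path!r}: {exc}", file=sys.stderr)
        sys.exit(1)
```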

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5d8fcf81-443f-444d-9ac5-42c00c501590

📥 Commits

Reviewing files that changed from the base of the PR and between a1fe782 and feda016.

📒 Files selected for processing (1)
  • data/merge_logs.py

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
powersched.def (1)

36-45: Pin remaining Python dependencies for deterministic training images.

These package installs lack version pinning; without exact pins or a lock file, future resolver changes can alter behavior and metrics between rebuilds. Use a lock file (via pip-tools, Poetry, or uv) or generate exact pinned versions with pip freeze to ensure reproducible training runs across different build times.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@powersched.def` around lines 36 - 45, The install commands in powersched.def
install multiple Python packages (stable-baselines3, gymnasium, tensorboard,
matplotlib, colorama, numpy, pandas, torchinfo, setuptools) without pinned
versions, which risks non-deterministic builds; replace this by adding an exact,
committed lock of dependencies and reference it from powersched.def (e.g.,
generate a requirements.txt or requirements.lock via pip freeze or
pip-tools/Poetry/venv lock and change the pip install call to use that pinned
file), or alternatively pin each package to exact versions in the installation
list so builds are reproducible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@powersched.def`:
- Around line 32-33: The pip install line pins only torch to 2.5.1 causing
potential incompatibilities; update the install invocation that currently lists
"torch==2.5.1 torchvision torchaudio --index-url
https://download.pytorch.org/whl/rocm6.2" to explicitly pin torchvision and
torchaudio to the ROCm-compatible versions (use torchvision==0.20.1 and
torchaudio==2.5.1) while keeping the same index URL and --no-cache-dir flag so
the installed packages match torch==2.5.1.
- Line 29: The line that pipes a remote get-pip.py script into Python ("curl -sS
https://bootstrap.pypa.io/get-pip.py | python3.12") should be replaced with
using Python's built-in ensurepip module: invoke Python 3.12 to run the
ensurepip module to bootstrap pip from the interpreter's bundled wheels (and
then run the pip module to upgrade pip if desired); remove the curl|python
pattern and call ensurepip/pip instead to avoid remote script execution.
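Taken together, the %post fragment of powersched.def could avoid the curl | python pattern and pin the stack roughly as follows (a sketch only: the torchvision/torchaudio pins come from the comment above, the remaining version choice is delegated to a hypothetical committed requirements.lock, and none of it is verified against the ROCm wheel index):

```shell
# Sketch of a %post fragment for powersched.def.

# Bootstrap pip from the interpreter's bundled wheels instead of curl|python.
python3.12 -m ensurepip --upgrade
python3.12 -m pip install --upgrade pip

# Pin the ROCm torch stack to mutually compatible versions.
python3.12 -m pip install --no-cache-dir \
    torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 \
    --index-url https://download.pytorch.org/whl/rocm6.2

# Install the remaining dependencies from a committed lock file
# (e.g. generated with pip freeze or pip-tools) for reproducible builds.
python3.12 -m pip install --no-cache-dir -r requirements.lock
```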

---

Nitpick comments:
In `@powersched.def`:
- Around line 36-45: The install commands in powersched.def install multiple
Python packages (stable-baselines3, gymnasium, tensorboard, matplotlib,
colorama, numpy, pandas, torchinfo, setuptools) without pinned versions, which
risks non-deterministic builds; replace this by adding an exact, committed lock
of dependencies and reference it from powersched.def (e.g., generate a
requirements.txt or requirements.lock via pip freeze or pip-tools/Poetry/venv
lock and change the pip install call to use that pinned file), or alternatively
pin each package to exact versions in the installation list so builds are
reproducible.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 493a226f-a848-4ca3-acdf-e94d6386f4df

📥 Commits

Reviewing files that changed from the base of the PR and between feda016 and 511c104.

📒 Files selected for processing (1)
  • powersched.def

@FairRootGroup FairRootGroup deleted a comment from coderabbitai Bot Mar 23, 2026
@FairRootGroup FairRootGroup deleted a comment from coderabbitai Bot Mar 23, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@train_iter.py`:
- Around line 213-217: The code opens a file handle (log_fh) then calls
subprocess.Popen and appends (proc, label, log_fh) to active, but if
subprocess.Popen raises an exception the open file will leak; wrap the Popen
call in a try/except/finally (or try/except) so that if
subprocess.Popen(command, env=current_env, stdout=log_fh,
stderr=subprocess.STDOUT) raises, you close log_fh before re-raising or handling
the error, and only append (proc, label, log_fh) to the active list after Popen
succeeds (identify the block around log_path, label_to_filename, log_fh,
subprocess.Popen, and active.append).
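One way to close the leak described above (a sketch; launch_logged is hypothetical, while command, label, log_fh, and active mirror the names in the comment):

```python
import subprocess

def launch_logged(command, log_path, label, active, env=None):
    """Open the per-run log, start the subprocess, and only register the
    (proc, label, log_fh) triple once Popen has succeeded; if the launch
    fails, close the file handle before re-raising so it cannot leak."""
    log_fh = open(log_path, "w")
    try:
        proc = subprocess.Popen(command, env=env,
                                stdout=log_fh, stderr=subprocess.STDOUT)
    except Exception:
        log_fh.close()
        raise
    active.append((proc, label, log_fh))
    return proc
```

Appending to active only after Popen succeeds also keeps the cleanup loop in run_all_parallel() from ever seeing a half-initialized entry.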

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9ca49c92-e50d-4dc4-a8de-238eeb05b053

📥 Commits

Reviewing files that changed from the base of the PR and between 511c104 and c2dfb8b.

📒 Files selected for processing (2)
  • data/merge_logs.py
  • train_iter.py
✅ Files skipped from review due to trivial changes (1)
  • data/merge_logs.py

@rbx rbx closed this Mar 24, 2026