Replace binary node-level power (idle or fully-used per node) with a proportional model: all on-nodes draw idle baseline, compute delta scales linearly with cores_used/CORES_PER_NODE. Updates power_cost(), power_consumption_mwh(), energy efficiency reward, and all call sites in environment.py and baseline.py.
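The proportional model described above can be sketched as follows. The constants, the `hours` parameter, and the exact signatures are illustrative, not the repository's actual values; the source only names `power_cost()`, `power_consumption_mwh()`, and `CORES_PER_NODE`.

```python
import math

# Illustrative constants; the real values live in the repo.
CORES_PER_NODE = 128
IDLE_POWER_KW = 0.2      # baseline draw of any powered-on node
COMPUTE_DELTA_KW = 0.3   # additional draw of a fully loaded node

def power_kw(num_on_nodes: int, num_used_cores: int) -> float:
    """All on-nodes draw the idle baseline; the compute delta scales
    linearly with cores_used / (on_nodes * CORES_PER_NODE) instead of
    charging each busy node as fully used."""
    if num_on_nodes == 0:
        return 0.0
    utilization = num_used_cores / (num_on_nodes * CORES_PER_NODE)
    return num_on_nodes * (IDLE_POWER_KW + COMPUTE_DELTA_KW * utilization)

def power_cost(num_on_nodes: int, num_used_cores: int,
               price_per_mwh: float, hours: float = 1.0) -> float:
    """Energy cost over `hours` at the given electricity price."""
    mwh = power_kw(num_on_nodes, num_used_cores) * hours / 1000.0
    return mwh * price_per_mwh
```

For example, 10 on-nodes with all 1280 cores busy draw 10 × (0.2 + 0.3) = 5 kW, while the same 10 nodes idling draw only 2 kW; the old binary model would have charged either extreme for any partially loaded node.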
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings.
📝 Walkthrough

This pull request refactors the metrics tracking and power computation system across the job scheduling pipeline. Changes shift from node-count-based to core-utilization-based power calculations, introduce a shared MetricsTracker for baseline and environment steps, update job wait-time tracking at assignment, add gymnasium environment validation, implement log merging utilities, and enhance training subprocess logging with per-run log files.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Env as ComputeClusterEnv
    participant JM as job_management
    participant MT as MetricsTracker
    participant RC as RewardCalculator
    Env->>JM: process_ongoing_jobs(nodes, cores, running_jobs, metrics, is_baseline=False)
    JM->>JM: Decrement job duration
    alt Job completed (duration ≤ 0)
        JM->>JM: Extract wait_time from job
        JM->>MT: Update non-baseline completion counter<br/>and wait_time sums
        JM->>JM: Remove from running_jobs
    end
    JM-->>Env: Return completed job IDs
    Env->>Env: Calculate num_on_nodes, num_used_cores
    Env->>RC: calculate(num_used_nodes, num_idle_nodes,<br/>num_used_cores, prices, metrics, ...)
    RC->>RC: power_cost(num_on_nodes, num_used_cores, price)
    RC->>RC: _reward_energy_efficiency_normalized<br/>(num_on_nodes, num_used_cores)
    RC-->>Env: Return reward signal
    Env-->>Env: Update env state and metrics
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@data/merge_logs.py`:
- Around line 43-44: The output path can be one of the input files and will be
overwritten; add a guard after input_files is computed (where input_files is
set) to check args.output against the resolved input file paths (use
absolute/normalized paths) and error out or exit if they match any entry in
input_files; reference the variables input_files and args.output (and the main
function or script entry) and ensure you compare normalized paths
(os.path.abspath/os.path.realpath) to prevent accidental overlap before
proceeding to the write step around the block that handles writing the merged
output.
- Around line 18-27: parse_submit currently returns an empty string for
non-data/malformed lines which sorts before valid timestamps; change
parse_submit to return None instead of '' for malformed/no-timestamp lines
(function: parse_submit) and update the sorting key that consumes it to treat
None as greater than any real timestamp (e.g., sort by a tuple like
(parse_submit(line) is None, parse_submit(line)) or explicitly map None to a max
sentinel) so malformed rows appear after valid chronological entries.
- Around line 56-58: The loop over input_files in merge_logs.py currently opens
each path directly and will crash on unreadable/missing files; wrap the
open/read (the with open(path, 'r') as fh: and fh.readlines() operations inside
a try/except catching OSError (or Exception) and on failure write a clear error
to stderr (including the failing path and exception message) and exit with a
non-zero status (sys.exit(1)); ensure you reference the variables path,
input_files and lines so the error message is informative and the rest of the
script does not continue after a failure.
Actionable comments posted: 1
🧹 Nitpick comments (1)
data/merge_logs.py (1)
82-82: Avoid parsing the same timestamp twice in the sort key. At line 82, `parse_submit(line)` is called twice per element during sort. Compute it once in a small helper key function for clarity and lower overhead.

Proposed refactor:

```diff
- data_lines.sort(key=lambda line: (parse_submit(line) is None, parse_submit(line)))
+ def submit_sort_key(line: str):
+     ts = parse_submit(line)
+     return (ts is None, ts or "")
+
+ data_lines.sort(key=submit_sort_key)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@data/merge_logs.py` at line 82, The sort key calls parse_submit(line) twice which is wasteful and unclear; replace the lambda with a small helper key function (e.g., def _submit_key(line): ts = parse_submit(line); return (ts is None, ts)) and pass that to data_lines.sort so each line's timestamp is parsed once; update the call at data_lines.sort(...) to use _submit_key and keep parse_submit as the timestamp extractor.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@data/merge_logs.py`:
- Around line 84-86: Wrap the output file write in a try/except consistent with
the read-side handling: surround the with open(args.output, 'w') as out: block
that writes header and data_lines (variables header and data_lines) with a try,
catch exceptions (Exception as e), print a clear error message to sys.stderr
including args.output and the exception, and call sys.exit(1) to exit with
non-zero status; ensure you reference args.output, header and data_lines and
mirror the same formatting used by the read-side error handling.
---
Nitpick comments:
In `@data/merge_logs.py`:
- Line 82: The sort key calls parse_submit(line) twice which is wasteful and
unclear; replace the lambda with a small helper key function (e.g., def
_submit_key(line): ts = parse_submit(line); return (ts is None, ts)) and pass
that to data_lines.sort so each line's timestamp is parsed once; update the call
at data_lines.sort(...) to use _submit_key and keep parse_submit as the
timestamp extractor.
Actionable comments posted: 2
🧹 Nitpick comments (1)
powersched.def (1)
36-45: Pin remaining Python dependencies for deterministic training images. These package installs lack version pinning; without exact pins or a lock file, future resolver changes can alter behavior and metrics between rebuilds. Use a lock file (via pip-tools, Poetry, or uv) or generate exact pinned versions with `pip freeze` to ensure reproducible training runs across different build times.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@powersched.def` around lines 36 - 45, The install commands in powersched.def install multiple Python packages (stable-baselines3, gymnasium, tensorboard, matplotlib, colorama, numpy, pandas, torchinfo, setuptools) without pinned versions, which risks non-deterministic builds; replace this by adding an exact, committed lock of dependencies and reference it from powersched.def (e.g., generate a requirements.txt or requirements.lock via pip freeze or pip-tools/Poetry/venv lock and change the pip install call to use that pinned file), or alternatively pin each package to exact versions in the installation list so builds are reproducible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@powersched.def`:
- Around line 32-33: The pip install line pins only torch to 2.5.1 causing
potential incompatibilities; update the install invocation that currently lists
"torch==2.5.1 torchvision torchaudio --index-url
https://download.pytorch.org/whl/rocm6.2" to explicitly pin torchvision and
torchaudio to the ROCm-compatible versions (use torchvision==0.20.1 and
torchaudio==2.5.1) while keeping the same index URL and --no-cache-dir flag so
the installed packages match torch==2.5.1.
- Line 29: The line that pipes a remote get-pip.py script into Python ("curl -sS
https://bootstrap.pypa.io/get-pip.py | python3.12") should be replaced with
using Python's built-in ensurepip module: invoke Python 3.12 to run the
ensurepip module to bootstrap pip from the interpreter's bundled wheels (and
then run the pip module to upgrade pip if desired); remove the curl|python
pattern and call ensurepip/pip instead to avoid remote script execution.
---
Nitpick comments:
In `@powersched.def`:
- Around line 36-45: The install commands in powersched.def install multiple
Python packages (stable-baselines3, gymnasium, tensorboard, matplotlib,
colorama, numpy, pandas, torchinfo, setuptools) without pinned versions, which
risks non-deterministic builds; replace this by adding an exact, committed lock
of dependencies and reference it from powersched.def (e.g., generate a
requirements.txt or requirements.lock via pip freeze or pip-tools/Poetry/venv
lock and change the pip install call to use that pinned file), or alternatively
pin each package to exact versions in the installation list so builds are
reproducible.
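A `%post` fragment addressing all three powersched.def findings could look like this sketch. The torch/torchvision/torchaudio versions and the ROCm index URL come from the review comments; `requirements.lock` is a hypothetical committed lock file, not an existing repo artifact.

```shell
# Bootstrap pip from the interpreter's bundled wheels instead of piping
# a remote get-pip.py script into python3.12.
python3.12 -m ensurepip --upgrade
python3.12 -m pip install --upgrade pip

# Pin the full torch stack to mutually compatible ROCm builds.
python3.12 -m pip install --no-cache-dir \
    torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 \
    --index-url https://download.pytorch.org/whl/rocm6.2

# Install the remaining dependencies from a committed lock file
# (hypothetical; generate it with pip freeze or pip-compile).
python3.12 -m pip install --no-cache-dir -r requirements.lock
```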
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@train_iter.py`:
- Around line 213-217: The code opens a file handle (log_fh) then calls
subprocess.Popen and appends (proc, label, log_fh) to active, but if
subprocess.Popen raises an exception the open file will leak; wrap the Popen
call in a try/except/finally (or try/except) so that if
subprocess.Popen(command, env=current_env, stdout=log_fh,
stderr=subprocess.STDOUT) raises, you close log_fh before re-raising or handling
the error, and only append (proc, label, log_fh) to the active list after Popen
succeeds (identify the block around log_path, label_to_filename, log_fh,
subprocess.Popen, and active.append).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9ca49c92-e50d-4dc4-a8de-238eeb05b053
📒 Files selected for processing (2)
- data/merge_logs.py
- train_iter.py
✅ Files skipped from review due to trivial changes (1)
- data/merge_logs.py
Switch power model to proportional per-core.