Skip to content

add weighted evaluation aggregation#243

Open
janrth wants to merge 6 commits into
Nixtla:mainfrom
janrth:241/weights_for_eval
Open

add weighted evaluation aggregation#243
janrth wants to merge 6 commits into
Nixtla:mainfrom
janrth:241/weights_for_eval

Conversation

@janrth
Copy link
Copy Markdown
Contributor

@janrth janrth commented May 14, 2026

Summary

Adds weighted aggregation support to evaluate while keeping the core loss functions unchanged.

The new agg_fn="weighted_mean" option computes metrics at the existing per-series level first, then aggregates them with user-provided or automatically computed weights. This preserves the current loss semantics and avoids pushing aggregation logic into every loss function.

Changes

  • Added weights argument to evaluate.
  • Added agg_fn="weighted_mean".
  • Supports explicit weights with:
    • unique_id, weight
    • unique_id, cutoff, weight
  • Supports weights="auto", which uses validation target volume:
    • sum(y) per unique_id
    • sum(y) per (cutoff, unique_id) when cross-validation cutoffs are present
  • Added validation for missing, duplicated, or malformed weights.
  • Added pandas and polars test coverage.

Notes

Weighted aggregation is applied after each metric has been computed per series. For AutoML tuning in MLForecast, this means each CV fold can compute a weighted score, and MLForecast then averages those fold-level scores.

Solves #241

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7be2207ce5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread utilsforecast/evaluation.py
@nasaul
Copy link
Copy Markdown
Contributor

nasaul commented May 21, 2026

Review Summary

Overall the design is clean and the narwhals-based implementation fits the project well. A few items to address before merge:

Correctness

  • Distributed path silently drops weights. _distributed_evaluate (evaluation.py:336) isn't updated to forward weights. Users will hit the generic "agg_fn is not supported in distributed" error without realizing weights were ignored. Please raise an explicit NotImplementedError for agg_fn=\"weighted_mean\" on the distributed branch.
  • Undocumented cutoff broadcast. When df has cutoff_col but the weights dataframe doesn't, the same weight is broadcast across all cutoffs (evaluation.py:99-100, 143-144). Either document this in the docstring or require an opt-in.
  • Auto-weights semantics. The docstring says "sum of the target over each series forecast horizon" — please emphasize this uses the forecast/validation window (not training), since small true values in the eval window produce small weights.

Code quality

  • Redundant finite check after the join (evaluation.py:149-150) — _get_weights already calls _check_weights_are_finite; only the null check is needed.
  • Rename weights_df_source (evaluation.py:353, 467) → forecasts_df / original_df.
  • Replace the hand-rolled group_by(...).agg(len) > 1 duplicate check (evaluation.py:106-112) with unique().shape[0] != shape[0].
  • Fold is_null().any() and (~is_finite()).any() into a single boolean expression to scan the column once.
  • PEP 8 nit: id_cols[1:] # exclude id_col → two spaces before #.

Performance

  • Collapse _weighted_mean_agg into one group_by. The intermediate __utilsforecast_weighted_{i} columns (evaluation.py:152-158) and the post-aggregation divide step (evaluation.py:169-175) can be merged:
    ```python
    df_nw = df_nw.group_by(*group_cols).agg(
    nw.col(_WEIGHT_COL).sum().alias(_WEIGHT_SUM_COL),
    *[(nw.col(model) * nw.col(_WEIGHT_COL)).sum().alias(model) for model in model_cols],
    )
    ```
    Then divide once — saves two with_columns passes over wide model sets.
  • The repeated .any() scans for null/finite/zero-sum can be reduced; consider folding the zero-sum check into the divide step.

Tests

Coverage is good, but please add:

  • Weights without cutoff_col while df has cutoff_col (broadcast case).
  • Zero-sum-weights case (the raise at evaluation.py:166 is uncovered).
  • Cross-engine smoke test (polars df + pandas weights and vice versa).

Suggested order

  1. Explicit error for weights on distributed path.
  2. Single-pass _weighted_mean_agg.
  3. Docstring updates (broadcast + auto-weights semantics).
  4. Drop redundant finite check; merge null/finite scans.
  5. Add missing tests.

@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@janrth
Copy link
Copy Markdown
Contributor Author

janrth commented May 21, 2026

@codex

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 52edbdeb2d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread utilsforecast/evaluation.py
@janrth
Copy link
Copy Markdown
Contributor Author

janrth commented May 21, 2026

@codex

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 586db014df

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +152 to +154
df_nw = df_nw.group_by(*group_cols).agg(
nw.col(_WEIGHT_COL).sum().alias(_WEIGHT_SUM_COL),
*[
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Exclude missing metric rows from weighted denominator

When any metric output contains missing values (a common case for metrics like mape on zero targets), this aggregation computes the denominator from sum(weight) across all series but computes each numerator from sum(weight * metric). That means rows with missing metric values still contribute to the denominator, so the weighted score is biased downward (and can become 0 instead of NaN for all-missing groups). This differs from the existing agg_fn='mean' behavior, which excludes missing values per column; the weighted path should similarly use per-model effective weight sums over non-missing metric rows.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants