
Use robust centrality dispersion measures #348

Draft
oleksandr-pavlyk wants to merge 4 commits into NVIDIA:main from oleksandr-pavlyk:use-robust-centrality-dispersion-measures

Conversation

@oleksandr-pavlyk
Collaborator

@oleksandr-pavlyk oleksandr-pavlyk commented May 4, 2026

This PR closes #342

  1. Introduces std::array<ValueType, N> nvbench::detail::statistics::compute_percentiles(Iter begin, Iter end, std::array<int, N> percentiles), which finds the requested percentile levels using the "zero-based nearest endpoint-scaled rank" method.
  2. Reports the 25th, 50th, and 75th percentiles (first quartile $Q_1$, median $Q_2$, third quartile $Q_3$), as well as the absolute interquartile range $Q_3 - Q_1$ and the relative interquartile range $(Q_3 - Q_1) / Q_2$, for GPU and CPU times in measure_cold.
  3. Makes "nv/cold/time/gpu/mean" and "nv/cold/time/gpu/stdev/relative" hidden, replacing them with "nv/cold/time/gpu/median" and "nv/cold/time/gpu/ir/relative", respectively.
  4. Makes the same change for "nv/cold/time/cpu*".
  5. Changes stdrel_criterion to track the "ir/relative" noise, i.e. $(Q_3 - Q_1) / Q_2$. This noise value is far less susceptible to outliers.
  6. Modifies nvbench_compare to reflect the changes above:
    a. It now queries the median for the time value and the relative interquartile range for the noise.
    b. The comparison logic now takes the reference noise as max(ref_noise, cmp_noise) instead of min(ref_noise, cmp_noise): the relative difference between durations should be permitted to be wider if one measurement is significantly noisier than the other.

As a consequence of the change to stdrel_criterion, the number of samples collected by default has become significantly smaller, especially with a sufficient number of warm-up iterations.
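The "zero-based nearest endpoint-scaled rank" rule from item 1 can be sketched in a few lines. This is a hypothetical Python model of the C++ API, not the actual implementation; it assumes the C++ code rounds half away from zero (as std::round does) and clamps out-of-range percentiles:

```python
import math

def compute_percentiles(samples, percentiles):
    # For each requested percentile p, return the sorted value at
    # zero-based rank round(p / 100 * (S - 1)), where S is the sample
    # count. Rounding is half-away-from-zero, mirroring C++ std::round.
    data = sorted(samples)
    s = len(data)
    if s == 0:
        # Empty input yields NaN for every requested percentile.
        return [math.nan] * len(percentiles)
    result = []
    for p in percentiles:
        p = min(max(p, 0), 100)  # clamp out-of-range percentiles
        rank = int(math.floor(p / 100.0 * (s - 1) + 0.5))
        result.append(data[rank])
    return result
```

For five samples, the ranks for {25, 50, 75} land exactly on the second, third, and fourth sorted values, which matches the quartile reporting described above.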

@oleksandr-pavlyk oleksandr-pavlyk self-assigned this May 5, 2026
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 5, 2026
@oleksandr-pavlyk oleksandr-pavlyk added the release: breaking change Include in "Breaking Changes" section of release notes. label May 5, 2026
Percentiles on empty dataset are NaN, not infinity

Add Robust statistics of CPU times to summary

Fixed name for nv/cold/time/gpu/q3, corrected value reported for
nv/cold/time/gpu/ir/relative

Use median and IR to compute location and noise in measure_cold

Also in stdrel_criterion, compute noise as IR / median.
1. For JSON files that contain repeated measurements of run-time
   axis values, make sure that the script compares corresponding
   reference entries.

   If cmp had two states with the same name and ref also had two, we
   would previously compare the measurements for each state in cmp
   against the first state in ref.

   The change here introduces counters tracking how many times each
   particular state name has been seen, and retrieves the
   corresponding occurrence in ref.

Previously, I had

```

|  BlockSize  |  NumBlocks  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|-------------|-------------|------------|-------------|------------|-------------|-----------|---------|----------|
|     2^8     |     64      |   1.776 ms |       0.46% |   1.777 ms |       0.40% |  1.024 us |   0.06% |   SAME   |
|     2^8     |     64      |   1.776 ms |       0.46% |   1.774 ms |       0.52% | -2.048 us |  -0.12% |   SAME   |
|     2^8     |     64      |   1.776 ms |       0.46% |   1.773 ms |       0.52% | -3.072 us |  -0.17% |   SAME   |
|     2^8     |     64      |   1.776 ms |       0.46% |   1.774 ms |       0.58% | -2.048 us |  -0.12% |   SAME   |
|     2^8     |     64      |   1.776 ms |       0.46% |   1.773 ms |       0.58% | -3.072 us |  -0.17% |   SAME   |
```

and now it becomes

```

|  BlockSize  |  NumBlocks  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|-------------|-------------|------------|-------------|------------|-------------|-----------|---------|----------|
|     2^8     |     64      |   1.776 ms |       0.46% |   1.777 ms |       0.40% |  1.024 us |   0.06% |   SAME   |
|     2^8     |     64      |   1.773 ms |       0.64% |   1.774 ms |       0.52% |  1.024 us |   0.06% |   SAME   |
|     2^8     |     64      |   1.774 ms |       0.46% |   1.773 ms |       0.52% | -1.024 us |  -0.06% |   SAME   |
|     2^8     |     64      |   1.773 ms |       0.46% |   1.774 ms |       0.58% |  1.024 us |   0.06% |   SAME   |
|     2^8     |     64      |   1.774 ms |       0.52% |   1.773 ms |       0.58% | -1.024 us |  -0.06% |   SAME   |
```

with the following raw data, as expected:

```
(py313) opavlyk@NV-22T4X34:~/repos/nvbench$ jq '. | .benchmarks[] | .states[] | .summaries[] | select(.tag == "nv/cold/time/gpu/median") | .data[] | .value' base.json
"0.0017756160497665405"
"0.0017725440263748169"
"0.001773568034172058"
"0.0017725440263748169"
"0.001773568034172058"

(py313) opavlyk@NV-22T4X34:~/repos/nvbench$ jq '. | .benchmarks[] | .states[] | .summaries[] | select(.tag == "nv/cold/time/gpu/median") | .data[] | .value' test.json
"0.0017766400575637818"
"0.001773568034172058"
"0.0017725440263748169"
"0.001773568034172058"
"0.0017725440263748169"
```

2. nvbench_compare changes from using min_noise = min(ref_noise, cmp_noise) to using max_noise = max(ref_noise, cmp_noise).
   Using the larger of the ref and cmp noise levels as the reference against which to gauge the timing-difference ratio makes more sense.
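The occurrence-counting scheme from item 1 can be sketched as a standalone helper (the function name is hypothetical; the real script inlines this logic):

```python
from itertools import islice

def match_reference_state(cmp_state_name, ref_states, counters):
    # Bump the occurrence counter for this state name, then take the
    # matching occurrence (not always the first) from ref_states.
    # Returns None when ref has fewer occurrences than cmp.
    counters[cmp_state_name] = counters.get(cmp_state_name, 0) + 1
    matches = filter(lambda st: st["name"] == cmp_state_name, ref_states)
    return next(islice(matches, counters[cmp_state_name] - 1, None), None)
```

Calling this repeatedly with the same name walks through the reference occurrences in order, which is what fixes the tables above.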
@oleksandr-pavlyk oleksandr-pavlyk force-pushed the use-robust-centrality-dispersion-measures branch from 8d1b316 to 6e45072 Compare May 15, 2026 14:21
@oleksandr-pavlyk
Collaborator Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 15, 2026

✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai Bot commented May 15, 2026


📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Added quartile and percentile-based statistics (Q1, median, Q3) to timing measurements.
    • Introduced interquartile range (IQR) metric for robust noise calculation.
  • Improvements

    • Enhanced noise measurement methodology using IQR-over-median instead of standard deviation.
    • Refined benchmark comparison logic for more robust pass/fail decisions.
    • Improved summary descriptions for CPU and GPU timing statistics.

Walkthrough

The PR replaces conventional mean/standard-deviation statistics with robust quartile and interquartile-range (IQR) metrics across NVBench CPU and GPU timing measurements. A new compute_percentiles API computes exact percentile values; this integrates into CPU-only and cold measurement summaries, updates the criterion's noise tracking to use IQR/median, and refactors the comparison script to use median and relative-IQR for pass/fail tolerance decisions.

Changes

Robust statistics implementation

| Layer / File(s) | Summary |
|-----------------|---------|
| Percentile computation API (nvbench/detail/statistics.cuh, testing/statistics.cu) | Added compute_percentiles template overloads accepting std::array<int, N> or C-style percentile arrays; implementation sorts input, computes exact ranks via round(quantile * (S - 1)), and returns NaN for empty input. Unit tests cover multiple dataset sizes, out-of-range percentile clamping, and empty-input behavior. |
| CPU-only measurement percentiles (nvbench/detail/measure_cpu_only.cxx) | measure_cpu_only_base::generate_summaries() computes Q1, median, Q3 from m_cpu_times and records new summary keys for CPU quartiles, absolute interquartile range, and relative IR-over-median noise metric. |
| Criterion noise and cold measurement summaries (nvbench/detail/stdrel_criterion.cxx, nvbench/detail/measure_cold.cu) | stdrel_criterion::do_add_measurement replaces standard-deviation-based noise with IQR-over-median computed from CUDA timing percentiles (requires 5+ samples, recorded only if finite). CPU and GPU cold measurement summaries now include Q1/median/Q3 and IR absolute/relative metadata; mean summaries marked hidden with refined descriptions referencing isolated kernel execution times. |
| Benchmark comparison metric updates (python/scripts/nvbench_compare.py) | Reference-state matching now tracks occurrence counts per state name and selects the Nth reference state instead of the first match. Benchmark summary extraction uses nv/cold/time/gpu/median (instead of mean) and nv/cold/time/gpu/ir/relative (instead of stdev/relative). Pass/fail tolerance uses maximum of reference/compare noise (max_noise) instead of minimum (min_noise); "unknown" classification when max_noise is None. |

Assessment against linked issues

| Objective | Addressed | Explanation |
|-----------|-----------|-------------|
| Replace mean with median and stdev with IQR metrics in JSON summaries (#342) | | |
| Mark mean and stdev/relative as hidden in cold measurements (#342) | | |
| Implement quartile and IQR computation with exact percentile method (#342) | | |
| Update comparison script to use new robust metrics (#342) | | |

Suggested labels

type: enhancement

Suggested reviewers

  • alliepiper
  • gevtushenko

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (2)
testing/statistics.cu (2)

90-97: ⚡ Quick win

suggestion: Lines 94-96 only assert non-finite values; this would still pass if behavior regresses to infinity. Assert std::isnan(actual[i]) to lock the empty-input contract explicitly.

As per coding guidelines, "testing/**/*: Focus on whether tests cover observable behavior, remain deterministic, handle GPU availability and CUDA version differences correctly, avoid excessive runtime, and exercise install/export/package boundaries where relevant."


56-80: ⚡ Quick win

suggestion: Add a 2-element dataset case (for example, {10, 20} with percentile 50) to pin the rank-rounding rule at half steps and prevent future semantic drift.

As per coding guidelines, "testing/**/*: Focus on whether tests cover observable behavior, remain deterministic, handle GPU availability and CUDA version differences correctly, avoid excessive runtime, and exercise install/export/package boundaries where relevant."


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d1769301-edcf-4efd-9f48-b076a0f8a23e

📥 Commits

Reviewing files that changed from the base of the PR and between 44ec7de and 6e45072.

📒 Files selected for processing (6)
  • nvbench/detail/measure_cold.cu
  • nvbench/detail/measure_cpu_only.cxx
  • nvbench/detail/statistics.cuh
  • nvbench/detail/stdrel_criterion.cxx
  • python/scripts/nvbench_compare.py
  • testing/statistics.cu

Comment on lines +306 to +314
```cpp
  auto &summ = m_state.add_summary("nv/cold/time/cpu/ir/relative");
  summ.set_string("name", "Noise");
  summ.set_string("hint", "percentage");
  summ.set_string("description",
                  "Relative interquartile range of isolated kernel execution CPU times");
  const auto cpu_time_ir = cpu_time_third_quartile - cpu_time_first_quartile;
  const auto cpu_robust_noise = cpu_time_ir / cpu_time_median;
  summ.set_float64("value", cpu_robust_noise);
}
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Relative IR is computed as IQR / median at Line 312 and Line 417 with no zero/finite guard. For very short or quantized timings, median can be zero and produce inf/nan summaries. Guard both calculations and publish only finite values.

As per coding guidelines, "nvbench/**/*: Focus on benchmark correctness, ... measurement semantics, statistical summaries, and test coverage."

Also applies to: 411-419

Comment on lines +215 to +223
```cpp
auto &summ = m_state.add_summary("nv/cpu_only/time/cpu/ir/relative");
summ.set_string("name", "IR");
summ.set_string("hint", "percentage");
summ.set_string("description",
                "Relative interquartile range of CPU times of isolated kernel executions");
summ.set_string("hide", "Hidden by default.");
const auto cpu_ir = cpu_third_quartile - cpu_first_quartile;
const auto cpu_robust_noise = cpu_ir / cpu_median;
summ.set_float64("value", cpu_robust_noise);
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: The relative IR metric at Line 222 divides by cpu_median without a zero/finite guard. This can emit inf/nan in nv/cpu_only/time/cpu/ir/relative and break downstream consumers that expect finite numeric summaries. Guard cpu_median and only publish finite values.

As per coding guidelines, "nvbench/**/*: Focus on benchmark correctness, ... measurement semantics, statistical summaries, and test coverage."
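The zero/finite guard suggested in the two comments above can be sketched as follows. This is an illustrative Python helper with a hypothetical name (the real fix would live in the C++ summary code); it returns None in place of "do not publish":

```python
import math

def relative_iqr(q1, median, q3):
    # Guarded relative interquartile range: refuse to divide by a zero
    # median, and suppress any non-finite result rather than emitting
    # inf/nan into downstream summaries.
    if median == 0.0:
        return None
    value = (q3 - q1) / median
    return value if math.isfinite(value) else None
```

A caller would then skip set_float64 entirely when the guard returns None, so consumers only ever see finite numeric summaries.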

Comment on lines +99 to +104
```cpp
/**
 * Computes exact percentile values using rank round(p / 100 * (S - 1)).
 *
 * The input range is copied before sorting, so const iterators are supported.
 * If the input has fewer than 1 sample, all percentiles are returned as infinity.
 */
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

suggestion: Line 103 says empty input returns infinity, but Lines 115-116 return quiet_NaN(). Update the doc comment to match the implementation contract.

Also applies to: 115-116

Comment on lines +44 to +57
// require at least 5 samples for meaningful noise estimate
if (m_total_samples > 4)
{
m_noise_tracker.push_back(cuda_rel_stdev);
// Compute convergence statistics using CUDA timings:
const auto [cuda_first_quartile, cuda_median, cuda_third_quartile] =
nvbench::detail::statistics::compute_percentiles(m_cuda_times.cbegin(),
m_cuda_times.cend(),
{25, 50, 75});
const auto cuda_noise = (cuda_third_quartile - cuda_first_quartile) / cuda_median;

if (std::isfinite(cuda_noise))
{
m_noise_tracker.push_back(cuda_noise);
}

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

critical: With the new gate at Line 45, m_noise_tracker can stay empty while do_is_finished() still reaches m_noise_tracker.back() at Line 69 (e.g., min-time reached before 5 samples). This is undefined behavior and can crash. Add an explicit empty check before any back() access.

As per coding guidelines, "nvbench/**/*: Focus on benchmark correctness, CUDA stream/event ordering, synchronization behavior, error handling, ... measurement semantics, statistical summaries, and test coverage."

Comment on lines 348 to 363
```python
for cmp_device_id in cmp_device_ids:
    rows = []
    plot_data = {"cmp": {}, "ref": {}, "cmp_noise": {}, "ref_noise": {}}
    counters = {}

    for cmp_state in cmp_states:
        cmp_state_name = cmp_state["name"]
        counters[cmp_state_name] = counters.get(cmp_state_name, 0) + 1
        ref_state = next(
            islice(
                filter(lambda st: st["name"] == cmp_state_name, ref_states),
                counters[cmp_state_name] - 1,
                None,
            ),
            None,
        )
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Scope occurrence matching by device. The new counters map resets once per cmp_device_id, but this loop still walks every cmp_state and the reference lookup only filters on name. On benchmarks with multiple devices, later device sections can reuse the first reference occurrence and compare against the wrong state. Filter cmp_states by cmp_device_id before incrementing the counter, and constrain the reference lookup to the corresponding ref device as well.

Comment on lines 432 to +443
```python
if ref_noise and cmp_noise:
    ref_noise = float(ref_noise)
    cmp_noise = float(cmp_noise)
    max_noise = max(ref_noise, cmp_noise)
elif ref_noise:
    ref_noise = float(ref_noise)
    max_noise = ref_noise
elif cmp_noise:
    cmp_noise = float(cmp_noise)
    max_noise = cmp_noise
else:
    max_noise = None  # Noise is inf
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Zero noise is treated as infinite noise. These branches use truthiness, so a valid 0.0 relative IQR falls through to max_noise = None. That turns perfectly stable measurements into Unknown instead of applying a zero tolerance. Test ref_noise is not None / cmp_noise is not None before calling max().
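The suggested `is not None` guard can be sketched as a small helper (the function name is hypothetical; the script currently inlines this branching):

```python
def select_max_noise(ref_noise, cmp_noise):
    # Test against None explicitly so a valid 0.0 noise level is kept
    # as a zero tolerance instead of being misread as "infinite noise".
    values = [float(n) for n in (ref_noise, cmp_noise) if n is not None]
    return max(values) if values else None
```

With this shape, a perfectly stable measurement (relative IQR of exactly 0.0) yields a max_noise of 0.0 rather than None.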

Comment on lines +706 to +708
```python
print(" - Pass (diff <= max_noise): %d" % pass_count)
print(" - Unknown (infinite noise): %d" % unknown_count)
print(" - Failure (diff > max_noise): %d" % failure_count)
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

suggestion: The summary text does not match the actual rule. Classification uses abs(frac_diff) <= max_noise, but the footer says diff <= max_noise / diff > max_noise, which is misleading for FAST results and drops the absolute-value part. Updating the text to reflect abs(%Diff) would make the report consistent with the table logic.


Labels

release: breaking change Include in "Breaking Changes" section of release notes.

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Use robust statistic in NVBench summary

1 participant