Use robust centrality dispersion measures#348
Conversation
Percentiles on empty dataset are NaN, not infinity Add Robust statistics of CPU times to summary Fixed name for nv/cold/time/gpu/q3, corrected value reported for nv/cold/time/gpu/ir/relative Use median and IR to compute location and noise in measure_cold Also in stdrel_criterion, compute noise as IR / median.
1. For JSON files that contains repeated measurements of run-time axis values, make sure that scripts compares corresponding reference entries. If cmp had two states with the same name and ref had two, we would compare measurements for each state in cmp against the first state in ref. Change here introduces counters tracking how many times each particular axis value, and retrieve corresponding entry in ref. Previously, I had ``` | BlockSize | NumBlocks | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status | |-------------|-------------|------------|-------------|------------|-------------|-----------|---------|----------| | 2^8 | 64 | 1.776 ms | 0.46% | 1.777 ms | 0.40% | 1.024 us | 0.06% | SAME | | 2^8 | 64 | 1.776 ms | 0.46% | 1.774 ms | 0.52% | -2.048 us | -0.12% | SAME | | 2^8 | 64 | 1.776 ms | 0.46% | 1.773 ms | 0.52% | -3.072 us | -0.17% | SAME | | 2^8 | 64 | 1.776 ms | 0.46% | 1.774 ms | 0.58% | -2.048 us | -0.12% | SAME | | 2^8 | 64 | 1.776 ms | 0.46% | 1.773 ms | 0.58% | -3.072 us | -0.17% | SAME | ``` and now it becomes ``` | BlockSize | NumBlocks | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status | |-------------|-------------|------------|-------------|------------|-------------|-----------|---------|----------| | 2^8 | 64 | 1.776 ms | 0.46% | 1.777 ms | 0.40% | 1.024 us | 0.06% | SAME | | 2^8 | 64 | 1.773 ms | 0.64% | 1.774 ms | 0.52% | 1.024 us | 0.06% | SAME | | 2^8 | 64 | 1.774 ms | 0.46% | 1.773 ms | 0.52% | -1.024 us | -0.06% | SAME | | 2^8 | 64 | 1.773 ms | 0.46% | 1.774 ms | 0.58% | 1.024 us | 0.06% | SAME | | 2^8 | 64 | 1.774 ms | 0.52% | 1.773 ms | 0.58% | -1.024 us | -0.06% | SAME | ``` With the following raw data expected ``` (py313) opavlyk@NV-22T4X34:~/repos/nvbench$ jq '. | .benchmarks[] | .states[] | .summaries[] | select(.tag == "nv/cold/time/gpu/median") | .data[] | .value' base.json "0.0017756160497665405" "0.0017725440263748169" "0.001773568034172058" "0.0017725440263748169" "0.001773568034172058" (py313) opavlyk@NV-22T4X34:~/repos/nvbench$ jq '. | .benchmarks[] | .states[] | .summaries[] | select(.tag == "nv/cold/time/gpu/median") | .data[] | .value' test.json "0.0017766400575637818" "0.001773568034172058" "0.0017725440263748169" "0.001773568034172058" "0.0017725440263748169" ``` 2. nvbench_compare changes from using min_noise = min(ref_noise, cmp_noise) to using max_noise = max(ref_noise, cmp_noise) Using larger of ref and cmp noise level as a reference against which to gauge timing difference ratio makes more sense.
These measures are less sensitive to outliers
8d1b316 to
6e45072
Compare
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
📝 WalkthroughSummary by CodeRabbitRelease Notes
WalkthroughThe PR replaces conventional mean/standard-deviation statistics with robust quartile and interquartile-range (IQR) metrics across NVBench CPU and GPU timing measurements. A new ChangesRobust statistics implementation
Assessment against linked issues
Suggested labels
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (2)
testing/statistics.cu (2)
90-97: ⚡ Quick winsuggestion: Lines 94-96 only assert non-finite values; this would still pass if behavior regresses to infinity. Assert
std::isnan(actual[i])to lock the empty-input contract explicitly.As per coding guidelines, "testing/**/*: Focus on whether tests cover observable behavior, remain deterministic, handle GPU availability and CUDA version differences correctly, avoid excessive runtime, and exercise install/export/package boundaries where relevant."
56-80: ⚡ Quick winsuggestion: Add a 2-element dataset case (for example,
{10, 20}with percentile50) to pin the rank-rounding rule at half steps and prevent future semantic drift.As per coding guidelines, "testing/**/*: Focus on whether tests cover observable behavior, remain deterministic, handle GPU availability and CUDA version differences correctly, avoid excessive runtime, and exercise install/export/package boundaries where relevant."
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d1769301-edcf-4efd-9f48-b076a0f8a23e
📒 Files selected for processing (6)
nvbench/detail/measure_cold.cunvbench/detail/measure_cpu_only.cxxnvbench/detail/statistics.cuhnvbench/detail/stdrel_criterion.cxxpython/scripts/nvbench_compare.pytesting/statistics.cu
| auto &summ = m_state.add_summary("nv/cold/time/cpu/ir/relative"); | ||
| summ.set_string("name", "Noise"); | ||
| summ.set_string("hint", "percentage"); | ||
| summ.set_string("description", | ||
| "Relative interquartile range of isolated kernel execution CPU times"); | ||
| const auto cpu_time_ir = cpu_time_third_quartile - cpu_time_first_quartile; | ||
| const auto cpu_robust_noise = cpu_time_ir / cpu_time_median; | ||
| summ.set_float64("value", cpu_robust_noise); | ||
| } |
There was a problem hiding this comment.
important: Relative IR is computed as IQR / median at Line 312 and Line 417 with no zero/finite guard. For very short or quantized timings, median can be zero and produce inf/nan summaries. Guard both calculations and publish only finite values.
As per coding guidelines, "nvbench/**/*: Focus on benchmark correctness, ... measurement semantics, statistical summaries, and test coverage."
Also applies to: 411-419
| auto &summ = m_state.add_summary("nv/cpu_only/time/cpu/ir/relative"); | ||
| summ.set_string("name", "IR"); | ||
| summ.set_string("hint", "percentage"); | ||
| summ.set_string("description", | ||
| "Relative interquartile range of CPU times of isolated kernel executions"); | ||
| summ.set_string("hide", "Hidden by default."); | ||
| const auto cpu_ir = cpu_third_quartile - cpu_first_quartile; | ||
| const auto cpu_robust_noise = cpu_ir / cpu_median; | ||
| summ.set_float64("value", cpu_robust_noise); |
There was a problem hiding this comment.
important: The relative IR metric at Line 222 divides by cpu_median without a zero/finite guard. This can emit inf/nan in nv/cpu_only/time/cpu/ir/relative and break downstream consumers that expect finite numeric summaries. Guard cpu_median and only publish finite values.
As per coding guidelines, "nvbench/**/*: Focus on benchmark correctness, ... measurement semantics, statistical summaries, and test coverage."
| /** | ||
| * Computes exact percentile values using rank round(p / 100 * (S - 1)). | ||
| * | ||
| * The input range is copied before sorting, so const iterators are supported. | ||
| * If the input has fewer than 1 sample, all percentiles are returned as infinity. | ||
| */ |
There was a problem hiding this comment.
suggestion: Line 103 says empty input returns infinity, but Lines 115-116 return quiet_NaN(). Update the doc comment to match the implementation contract.
Also applies to: 115-116
| // require at least 5 samples for meaningful noise estimate | ||
| if (m_total_samples > 4) | ||
| { | ||
| m_noise_tracker.push_back(cuda_rel_stdev); | ||
| // Compute convergence statistics using CUDA timings: | ||
| const auto [cuda_first_quartile, cuda_median, cuda_third_quartile] = | ||
| nvbench::detail::statistics::compute_percentiles(m_cuda_times.cbegin(), | ||
| m_cuda_times.cend(), | ||
| {25, 50, 75}); | ||
| const auto cuda_noise = (cuda_third_quartile - cuda_first_quartile) / cuda_median; | ||
|
|
||
| if (std::isfinite(cuda_noise)) | ||
| { | ||
| m_noise_tracker.push_back(cuda_noise); | ||
| } |
There was a problem hiding this comment.
critical: With the new gate at Line 45, m_noise_tracker can stay empty while do_is_finished() still reaches m_noise_tracker.back() at Line 69 (e.g., min-time reached before 5 samples). This is undefined behavior and can crash. Add an explicit empty check before any back() access.
As per coding guidelines, "nvbench/**/*: Focus on benchmark correctness, CUDA stream/event ordering, synchronization behavior, error handling, ... measurement semantics, statistical summaries, and test coverage."
| for cmp_device_id in cmp_device_ids: | ||
| rows = [] | ||
| plot_data = {"cmp": {}, "ref": {}, "cmp_noise": {}, "ref_noise": {}} | ||
| counters = {} | ||
|
|
||
| for cmp_state in cmp_states: | ||
| cmp_state_name = cmp_state["name"] | ||
| counters[cmp_state_name] = counters.get(cmp_state_name, 0) + 1 | ||
| ref_state = next( | ||
| filter(lambda st: st["name"] == cmp_state_name, ref_states), None | ||
| islice( | ||
| filter(lambda st: st["name"] == cmp_state_name, ref_states), | ||
| counters[cmp_state_name] - 1, | ||
| None, | ||
| ), | ||
| None, | ||
| ) |
There was a problem hiding this comment.
important: Scope occurrence matching by device. The new counters map resets once per cmp_device_id, but this loop still walks every cmp_state and the reference lookup only filters on name. On benchmarks with multiple devices, later device sections can reuse the first reference occurrence and compare against the wrong state. Filter cmp_states by cmp_device_id before incrementing the counter, and constrain the reference lookup to the corresponding ref device as well.
| if ref_noise and cmp_noise: | ||
| ref_noise = float(ref_noise) | ||
| cmp_noise = float(cmp_noise) | ||
| min_noise = min(ref_noise, cmp_noise) | ||
| max_noise = max(ref_noise, cmp_noise) | ||
| elif ref_noise: | ||
| ref_noise = float(ref_noise) | ||
| min_noise = ref_noise | ||
| max_noise = ref_noise | ||
| elif cmp_noise: | ||
| cmp_noise = float(cmp_noise) | ||
| min_noise = cmp_noise | ||
| max_noise = cmp_noise | ||
| else: | ||
| min_noise = None # Noise is inf | ||
| max_noise = None # Noise is inf |
There was a problem hiding this comment.
important: Zero noise is treated as infinite noise. These branches use truthiness, so a valid 0.0 relative IQR falls through to max_noise = None. That turns perfectly stable measurements into Unknown instead of applying a zero tolerance. Test ref_noise is not None / cmp_noise is not None before calling max().
| print(" - Pass (diff <= max_noise): %d" % pass_count) | ||
| print(" - Unknown (infinite noise): %d" % unknown_count) | ||
| print(" - Failure (diff > min_noise): %d" % failure_count) | ||
| print(" - Failure (diff > max_noise): %d" % failure_count) |
There was a problem hiding this comment.
suggestion: The summary text does not match the actual rule. Classification uses abs(frac_diff) <= max_noise, but the footer says diff <= max_noise / diff > max_noise, which is misleading for FAST results and drops the absolute-value part. Updating the text to reflect abs(%Diff) would make the report consistent with the table logic.
This PR closes #342
std::array<ValueType, N> nvbench::detail::statistics::compute_percentiles(Iter begin, Iter end, std::array<int, N> percentiles)which finds requested percentile levels using "zero-based nearest endpoint-scaled rank".measure_cold."nv/cold/time/gpu/mean"and"nv/cold/time/gpu/stdev/relative"hidden, and replace them with"nv/cold/time/gpu/median"and"nv/cold/time/gpu/ir/relative", respectively."nv/cold/time/cpu*"stdrel_criterionto track"ir/relative"noise, i.e.nvbench_compareto reflect changes abovea. It now queries median for time value, and relative interquartile range as noise
b. Comparison logic has been changed to get reference noise from
min(ref_noise, cmp_noise)tomax(ref_noise, cmp_noise). The relative difference between durations should be permitted to be wider if one measurement is significantly noisier than another.Consequence of the change to
stdrel_criterionthe number of samples collected by default has become significantly smaller, especially with use of sufficient count of warm-up iterations.