
[CUB] Add DeviceFind lower/upper bound for sorted values via merge-path #8780

Open

AneeshGidda wants to merge 3 commits into NVIDIA:main from AneeshGidda:cub/device-find-bound-sorted-values

Conversation


AneeshGidda commented on May 1, 2026

Description

closes #7964

Adds two new cub::DeviceFind algorithms that exploit an additional precondition: the values (the needles) are sorted, not just the range (the haystack):

  • cub::DeviceFind::LowerBoundSortedValues
  • cub::DeviceFind::UpperBoundSortedValues

The implementation uses the Merge-Path algorithm to partition the combined traversal across thread blocks, achieving O(N + M) total device work versus the O(M log N) of the existing LowerBound/UpperBound (which perform M independent binary searches).
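
For intuition, here is a minimal sequential sketch of the coupled traversal that the merge-path partitioning parallelizes. It only illustrates why sorted needles admit O(N + M) work; it is not the PR's device code:

```cpp
#include <cstddef>
#include <vector>

// Both cursors only ever move forward, so every haystack element and every
// needle is visited at most once: O(N + M) total work.
template <class T, class Compare>
std::vector<std::size_t>
lower_bound_sorted_values(const std::vector<T>& haystack, const std::vector<T>& needles, Compare comp)
{
  std::vector<std::size_t> out(needles.size());
  std::size_t i = 0; // haystack cursor, never rewinds
  for (std::size_t j = 0; j < needles.size(); ++j)
  {
    while (i < haystack.size() && comp(haystack[i], needles[j]))
    {
      ++i; // skip elements strictly less than the current needle
    }
    out[j] = i; // first position whose element is not less than needles[j]
  }
  return out;
}
```

The upper-bound variant differs only in the advance condition: it skips while !comp(needles[j], haystack[i]), i.e. while the haystack element is not greater than the needle.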

Public API

```cpp
cub::DeviceFind::LowerBoundSortedValues(
  void* d_temp_storage, size_t& temp_storage_bytes,
  RangeIteratorT d_range, RangeNumItemsT range_num_items,
  ValuesIteratorT d_values, ValuesNumItemsT values_num_items,
  OutputIteratorT d_output, CompareOpT comp,
  cudaStream_t stream = 0);

cub::DeviceFind::UpperBoundSortedValues(/* same signature */);
```

Both [d_range, d_range + range_num_items) and [d_values, d_values + values_num_items) must be sorted consistently with comp.
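
For reference, a usage sketch following CUB's usual two-phase temporary-storage pattern, with the signature as described above; the surrounding setup (function name, int offsets, error handling) is illustrative only:

```cpp
#include <cub/device/device_find.cuh>

#include <cuda/std/functional>

#include <cuda_runtime.h>

// d_haystack: n sorted values, d_needles: m sorted values, d_out: m offsets.
// Error handling omitted for brevity.
void lower_bound_example(const int* d_haystack, int n, const int* d_needles, int m,
                         int* d_out, cudaStream_t stream)
{
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;

  // First call: query the required temporary storage size.
  cub::DeviceFind::LowerBoundSortedValues(
    d_temp_storage, temp_storage_bytes, d_haystack, n, d_needles, m, d_out,
    cuda::std::less<int>{}, stream);

  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);

  // Second call: run the search.
  cub::DeviceFind::LowerBoundSortedValues(
    d_temp_storage, temp_storage_bytes, d_haystack, n, d_needles, m, d_out,
    cuda::std::less<int>{}, stream);

  cudaFreeAsync(d_temp_storage, stream);
}
```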

Files added

  • Interface: cub/device/device_find.cuh
  • Implementation:
    • cub/agent/agent_find_bound_sorted_values.cuh
    • cub/device/dispatch/dispatch_find_bound_sorted_values.cuh
    • cub/device/dispatch/tuning/tuning_find_bound_sorted_values.cuh
  • Tests: cub/test/catch2_test_device_find_bound_sorted_values.cu
  • Benchmarks: cub/benchmarks/bench/find_bound/{lower_bound,upper_bound,lower_bound_sorted_values,upper_bound_sorted_values}.cu + shared find_bound_common.cuh

Benchmarks

Tested on an RTX 4090. Speedup = LowerBound time ÷ LowerBoundSortedValues time (and analogously for the upper-bound variant).

LowerBoundSortedValues vs LowerBound

| Elements | Needles % | I8 | I16 | I32 | I64 |
|---|---|---|---|---|---|
| 64K | 1 | 0.79× | 0.75× | 0.85× | 0.82× |
| 64K | 25 | 0.64× | 0.64× | 0.71× | 0.77× |
| 64K | 50 | 0.58× | 0.64× | 0.74× | 0.78× |
| 1M | 1 | 0.55× | 0.85× | 0.93× | 0.75× |
| 1M | 25 | 0.80× | 1.75× | 1.96× | 1.55× |
| 1M | 50 | 0.82× | 2.31× | 2.42× | 2.54× |
| 16M | 1 | 0.29× | 0.95× | 0.94× | 0.70× |
| 16M | 25 | 5.62× | 5.38× | 9.23× | 7.11× |
| 16M | 50 | 6.83× | 5.97× | 14.26× | 10.74× |
| 256M | 1 | 1.14× | 2.35× | 1.98× | 1.13× |
| 256M | 25 | 16.30× | 23.03× | 27.55× | 17.87× |
| 256M | 50 | 17.94× | 24.67× | 38.36× | 26.26× |

UpperBoundSortedValues vs UpperBound

| Elements | Needles % | I8 | I16 | I32 | I64 |
|---|---|---|---|---|---|
| 64K | 1 | 0.81× | 0.80× | 0.83× | 0.78× |
| 64K | 25 | 0.66× | 0.65× | 0.79× | 0.74× |
| 64K | 50 | 0.63× | 0.65× | 0.69× | 0.78× |
| 1M | 1 | 0.61× | 0.90× | 0.88× | 0.77× |
| 1M | 25 | 0.92× | 1.77× | 1.95× | 1.60× |
| 1M | 50 | 0.87× | 2.28× | 2.44× | 2.50× |
| 16M | 1 | 0.30× | 0.96× | 0.93× | 0.71× |
| 16M | 25 | 5.77× | 5.39× | 9.24× | 7.14× |
| 16M | 50 | 6.99× | 6.03× | 14.28× | 10.76× |
| 256M | 1 | 2.04× | 2.36× | 1.98× | 1.13× |
| 256M | 25 | 17.14× | 23.07× | 27.54× | 17.87× |
| 256M | 50 | 18.16× | 24.74× | 38.33× | 26.23× |

Takeaways

  • 256M elements, ≥25% needles: 17–38× faster.
  • 16M elements, ≥25% needles: 5–14× faster.
  • Small inputs or 1% needles: merge-path setup overhead dominates and the sorted-values variant is slower.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot (bot) commented on May 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cccl-authenticator-app (bot) moved this from Todo to In Progress in CCCL on May 1, 2026
AneeshGidda force-pushed the cub/device-find-bound-sorted-values branch 2 times, most recently from f0ce49d to 2818850 on May 1, 2026 at 20:43
AneeshGidda force-pushed the cub/device-find-bound-sorted-values branch from 2818850 to ed7170f on May 1, 2026 at 20:52
AneeshGidda marked this pull request as ready for review on May 1, 2026 at 21:23
AneeshGidda requested review from a team as code owners on May 1, 2026 at 21:23
AneeshGidda requested review from NaderAlAwar and shwina on May 1, 2026 at 21:23
cccl-authenticator-app (bot) moved this from In Progress to In Review in CCCL on May 1, 2026

fbusato left a comment


Thanks a lot for the contribution, @AneeshGidda, and sorry for the delay! The results look amazing. I started the review and added some comments.

```cpp
const int needles_count = total_in_tile - haystack_count;

{
  auto d_range_cm = try_make_cache_modified_iterator<LoadModifier>(d_range + range_beg);
```

Suggested change:

```diff
-auto d_range_cm = try_make_cache_modified_iterator<LoadModifier>(d_range + range_beg);
+const auto d_range_cm = try_make_cache_modified_iterator<LoadModifier>(d_range + range_beg);
```

Comment on lines +121 to +124
```cpp
for (int i = threadIdx.x; i < haystack_count; i += BlockThreads)
{
  storage.haystack[i] = d_range_cm[i];
}
```

Minor: we could rewrite this loop with a fixed number of iterations plus an unroll, and check if (index < haystack_count). See the sketch below.
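
One possible shape for that rewrite, assuming ItemsPerThread is the agent policy's per-thread item count so that BlockThreads * ItemsPerThread covers the tile (a sketch, not the PR's code):

```cpp
// Fixed trip count lets the compiler fully unroll; the bounds check replaces
// the data-dependent loop condition.
#pragma unroll
for (int j = 0; j < ItemsPerThread; ++j)
{
  const int index = static_cast<int>(threadIdx.x) + j * BlockThreads;
  if (index < haystack_count)
  {
    storage.haystack[index] = d_range_cm[index];
  }
}
```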


```cpp
{
  auto d_values_cm = try_make_cache_modified_iterator<LoadModifier>(d_values + values_beg);
  for (int i = threadIdx.x; i < needles_count; i += BlockThreads)
```

Same here.


```cpp
const auto partition_comp = Mode::make_partition_comp(compare_op);

const int d0_thread =
```

Suggestion: move ItemsPerThread * static_cast<int>(threadIdx.x) into a variable, since it appears in both branches.


I don't think (::cuda::std::min) is still needed:

```cpp
IsFullTile ? (ItemsPerThread * static_cast<int>(threadIdx.x))
           : (::cuda::std::min) (ItemsPerThread * static_cast<int>(threadIdx.x), total_in_tile);
```
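
Combining this with the hoisting suggestion in the earlier comment, the expression could read (sketch):

```cpp
// Hoist the shared subexpression; keep the clamp until the redundancy of the
// min for partial tiles is confirmed.
const int thread_offset = ItemsPerThread * static_cast<int>(threadIdx.x);
const int d0_thread =
  IsFullTile ? thread_offset : (::cuda::std::min) (thread_offset, total_in_tile);
```

If the min is indeed redundant for partial tiles, the whole ternary collapses to just thread_offset.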

```cpp
const int i0 = static_cast<int>(
```

Question: is static_cast<int>( required here? cub::MergePath should return the right type.

Comment on lines +58 to +64
```cpp
HaystackIt d_range,
Offset range_count,
NeedlesIt d_values,
Offset values_count,
Offset num_diagonals,
Offset* range_beg_offsets,
PartitionCompOp partition_comp)
```

Suggestion: use _CCCL_GRID_CONSTANT where possible.
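
For illustration, the parameter list above with those annotations applied; the kernel name is hypothetical, and __grid_constant__ (which _CCCL_GRID_CONSTANT maps to where supported) requires const-qualified parameters:

```cpp
// Sketch only: annotations per the suggestion above.
__global__ void find_bound_partition_kernel(
  _CCCL_GRID_CONSTANT const HaystackIt d_range,
  _CCCL_GRID_CONSTANT const Offset range_count,
  _CCCL_GRID_CONSTANT const NeedlesIt d_values,
  _CCCL_GRID_CONSTANT const Offset values_count,
  _CCCL_GRID_CONSTANT const Offset num_diagonals,
  Offset* range_beg_offsets, // output pointer; the pointee is written through
  _CCCL_GRID_CONSTANT const PartitionCompOp partition_comp);
```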

```cpp
  return error;
}

return dispatch_compute_cap(policy_selector, cc, [&](auto policy_getter) -> cudaError_t {
```

I think this should have a fully qualified namespace

```cpp
using traits_t = policy_traits<decltype(policy_getter)>;

const Offset total_items = range_count + values_count;
const Offset num_tiles = ::cuda::ceil_div(total_items, static_cast<Offset>(traits_t::tile_size));
```

Suggested change:

```diff
-const Offset num_tiles = ::cuda::ceil_div(total_items, static_cast<Offset>(traits_t::tile_size));
+const Offset num_tiles = ::cuda::ceil_div(total_items, Offset{traits_t::tile_size});
```

```cpp
// Lightweight pass; not worth exposing through the tuning system.
constexpr int threads_per_partition_block = 256;
const int partition_grid_size =
  static_cast<int>(::cuda::ceil_div(num_diagonals, static_cast<Offset>(threads_per_partition_block)));
```

Suggested change:

```diff
-static_cast<int>(::cuda::ceil_div(num_diagonals, static_cast<Offset>(threads_per_partition_block)));
+static_cast<int>(::cuda::ceil_div(num_diagonals, Offset{threads_per_partition_block}));
```


```cpp
#include <thrust/system/cuda/detail/core/triple_chevron_launch.h>

#include <cuda/std/__algorithm/min.h>
```

Missing headers:

```cpp
#include <cuda/__cmath/ceil_div.h>
#include <cuda/std/__type_traits/is_empty.h>
```

github-project-automation (bot) moved this from In Review to In Progress in CCCL on May 8, 2026


Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[FEA]: Add batched binary search algorithm for sorted needles and haystack

2 participants