
[CUB] Add DeviceFind lower/upper bound for sorted values via merge-path #8780

Open

AneeshGidda wants to merge 3 commits into NVIDIA:main from AneeshGidda:cub/device-find-bound-sorted-values

Conversation


AneeshGidda commented on May 1, 2026

Description

closes #7964

Adds two new cub::DeviceFind algorithms that exploit an additional precondition: the values (the needles) are sorted, not just the range (the haystack):

  • cub::DeviceFind::LowerBoundSortedValues
  • cub::DeviceFind::UpperBoundSortedValues

The implementation uses the Merge-Path algorithm to partition the combined traversal across thread blocks, achieving O(N + M) total device work versus the O(M log N) of the existing LowerBound/UpperBound (which perform M independent binary searches).
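
For intuition, here is a minimal sequential sketch of the coupled traversal that the merge-path partitioning parallelizes. It only illustrates why sorted needles admit O(N + M) work; it is not the PR's device code:

```cpp
#include <cstddef>
#include <vector>

// Both cursors only ever move forward, so every haystack element and every
// needle is visited at most once: O(N + M) total work.
template <class T, class Compare>
std::vector<std::size_t>
lower_bound_sorted_values(const std::vector<T>& haystack, const std::vector<T>& needles, Compare comp)
{
  std::vector<std::size_t> out(needles.size());
  std::size_t i = 0; // haystack cursor, never rewinds
  for (std::size_t j = 0; j < needles.size(); ++j)
  {
    while (i < haystack.size() && comp(haystack[i], needles[j]))
    {
      ++i; // skip elements strictly less than the current needle
    }
    out[j] = i; // first position whose element is not less than needles[j]
  }
  return out;
}
```

The upper-bound variant differs only in the advance condition: it skips while !comp(needles[j], haystack[i]), i.e. while the haystack element is not greater than the needle.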

Public API

```cpp
cub::DeviceFind::LowerBoundSortedValues(
  void* d_temp_storage, size_t& temp_storage_bytes,
  RangeIteratorT d_range, RangeNumItemsT range_num_items,
  ValuesIteratorT d_values, ValuesNumItemsT values_num_items,
  OutputIteratorT d_output, CompareOpT comp,
  cudaStream_t stream = 0);

cub::DeviceFind::UpperBoundSortedValues(/* same signature */);
```

Both [d_range, d_range + range_num_items) and [d_values, d_values + values_num_items) must be sorted consistently with comp.
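
For reference, a usage sketch following CUB's usual two-phase temporary-storage pattern, with the signature as described above; the surrounding setup (function name, int offsets, error handling) is illustrative only:

```cpp
#include <cub/device/device_find.cuh>

#include <cuda/std/functional>

#include <cuda_runtime.h>

// d_haystack: n sorted values, d_needles: m sorted values, d_out: m offsets.
// Error handling omitted for brevity.
void lower_bound_example(const int* d_haystack, int n, const int* d_needles, int m,
                         int* d_out, cudaStream_t stream)
{
  void* d_temp_storage      = nullptr;
  size_t temp_storage_bytes = 0;

  // First call: query the required temporary storage size.
  cub::DeviceFind::LowerBoundSortedValues(
    d_temp_storage, temp_storage_bytes, d_haystack, n, d_needles, m, d_out,
    cuda::std::less<int>{}, stream);

  cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);

  // Second call: run the search.
  cub::DeviceFind::LowerBoundSortedValues(
    d_temp_storage, temp_storage_bytes, d_haystack, n, d_needles, m, d_out,
    cuda::std::less<int>{}, stream);

  cudaFreeAsync(d_temp_storage, stream);
}
```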

Files added

  • Interface: cub/device/device_find.cuh
  • Implementation:
    • cub/agent/agent_find_bound_sorted_values.cuh
    • cub/device/dispatch/dispatch_find_bound_sorted_values.cuh
    • cub/device/dispatch/tuning/tuning_find_bound_sorted_values.cuh
  • Tests: cub/test/catch2_test_device_find_bound_sorted_values.cu
  • Benchmarks: cub/benchmarks/bench/find_bound/{lower_bound,upper_bound,lower_bound_sorted_values,upper_bound_sorted_values}.cu + shared find_bound_common.cuh

Benchmarks

Tested on an RTX 4090. Speedup = LowerBound time ÷ LowerBoundSortedValues time (and analogously for the upper-bound variant).

LowerBoundSortedValues vs LowerBound

| Elements | Needles % | I8 | I16 | I32 | I64 |
|---|---|---|---|---|---|
| 64K | 1 | 0.79× | 0.75× | 0.85× | 0.82× |
| 64K | 25 | 0.64× | 0.64× | 0.71× | 0.77× |
| 64K | 50 | 0.58× | 0.64× | 0.74× | 0.78× |
| 1M | 1 | 0.55× | 0.85× | 0.93× | 0.75× |
| 1M | 25 | 0.80× | 1.75× | 1.96× | 1.55× |
| 1M | 50 | 0.82× | 2.31× | 2.42× | 2.54× |
| 16M | 1 | 0.29× | 0.95× | 0.94× | 0.70× |
| 16M | 25 | 5.62× | 5.38× | 9.23× | 7.11× |
| 16M | 50 | 6.83× | 5.97× | 14.26× | 10.74× |
| 256M | 1 | 1.14× | 2.35× | 1.98× | 1.13× |
| 256M | 25 | 16.30× | 23.03× | 27.55× | 17.87× |
| 256M | 50 | 17.94× | 24.67× | 38.36× | 26.26× |

UpperBoundSortedValues vs UpperBound

| Elements | Needles % | I8 | I16 | I32 | I64 |
|---|---|---|---|---|---|
| 64K | 1 | 0.81× | 0.80× | 0.83× | 0.78× |
| 64K | 25 | 0.66× | 0.65× | 0.79× | 0.74× |
| 64K | 50 | 0.63× | 0.65× | 0.69× | 0.78× |
| 1M | 1 | 0.61× | 0.90× | 0.88× | 0.77× |
| 1M | 25 | 0.92× | 1.77× | 1.95× | 1.60× |
| 1M | 50 | 0.87× | 2.28× | 2.44× | 2.50× |
| 16M | 1 | 0.30× | 0.96× | 0.93× | 0.71× |
| 16M | 25 | 5.77× | 5.39× | 9.24× | 7.14× |
| 16M | 50 | 6.99× | 6.03× | 14.28× | 10.76× |
| 256M | 1 | 2.04× | 2.36× | 1.98× | 1.13× |
| 256M | 25 | 17.14× | 23.07× | 27.54× | 17.87× |
| 256M | 50 | 18.16× | 24.74× | 38.33× | 26.23× |

Takeaways

  • 256M elements, ≥25% needles: 17–38× faster.
  • 16M elements, ≥25% needles: 5–14× faster.
  • Small inputs or 1% needles: merge-path setup overhead dominates and the sorted-values variant is slower.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot (bot) commented on May 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cccl-authenticator-app (bot) moved this from Todo to In Progress in CCCL on May 1, 2026
AneeshGidda force-pushed the cub/device-find-bound-sorted-values branch 2 times, most recently from f0ce49d to 2818850 on May 1, 2026 at 20:43
AneeshGidda force-pushed the cub/device-find-bound-sorted-values branch from 2818850 to ed7170f on May 1, 2026 at 20:52
AneeshGidda marked this pull request as ready for review on May 1, 2026 at 21:23
AneeshGidda requested review from a team as code owners on May 1, 2026 at 21:23
AneeshGidda requested review from NaderAlAwar and shwina on May 1, 2026 at 21:23
cccl-authenticator-app (bot) moved this from In Progress to In Review in CCCL on May 1, 2026

fbusato left a comment


Thanks a lot for the contribution, @AneeshGidda, and sorry for the delay! The results look amazing. I started the review and added some comments.

```cpp
const int needles_count = total_in_tile - haystack_count;

{
  auto d_range_cm = try_make_cache_modified_iterator<LoadModifier>(d_range + range_beg);
```

Suggested change:

```diff
-auto d_range_cm = try_make_cache_modified_iterator<LoadModifier>(d_range + range_beg);
+const auto d_range_cm = try_make_cache_modified_iterator<LoadModifier>(d_range + range_beg);
```

Comment on lines +121 to +124
```cpp
for (int i = threadIdx.x; i < haystack_count; i += BlockThreads)
{
  storage.haystack[i] = d_range_cm[i];
}
```

Minor: we could rewrite this loop with a fixed number of iterations plus an unroll, and check if (index < haystack_count). See the sketch below.
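
One possible shape for that rewrite, assuming ItemsPerThread is the agent policy's per-thread item count so that BlockThreads * ItemsPerThread covers the tile (a sketch, not the PR's code):

```cpp
// Fixed trip count lets the compiler fully unroll; the bounds check replaces
// the data-dependent loop condition.
#pragma unroll
for (int j = 0; j < ItemsPerThread; ++j)
{
  const int index = static_cast<int>(threadIdx.x) + j * BlockThreads;
  if (index < haystack_count)
  {
    storage.haystack[index] = d_range_cm[index];
  }
}
```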


```cpp
{
  auto d_values_cm = try_make_cache_modified_iterator<LoadModifier>(d_values + values_beg);
  for (int i = threadIdx.x; i < needles_count; i += BlockThreads)
```

Same here.


```cpp
const auto partition_comp = Mode::make_partition_comp(compare_op);

const int d0_thread =
```

Suggestion: move ItemsPerThread * static_cast<int>(threadIdx.x) into a variable, since it appears in both branches.


I don't think (::cuda::std::min) is still needed:

```cpp
IsFullTile ? (ItemsPerThread * static_cast<int>(threadIdx.x))
           : (::cuda::std::min) (ItemsPerThread * static_cast<int>(threadIdx.x), total_in_tile);
```
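
Combining this with the hoisting suggestion in the earlier comment, the expression could read (sketch):

```cpp
// Hoist the shared subexpression; keep the clamp until the redundancy of the
// min for partial tiles is confirmed.
const int thread_offset = ItemsPerThread * static_cast<int>(threadIdx.x);
const int d0_thread =
  IsFullTile ? thread_offset : (::cuda::std::min) (thread_offset, total_in_tile);
```

If the min is indeed redundant for partial tiles, the whole ternary collapses to just thread_offset.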

```cpp
const int i0 = static_cast<int>(
```

Question: is static_cast<int>( required here? cub::MergePath should return the right type.

Comment on lines +58 to +64
```cpp
HaystackIt d_range,
Offset range_count,
NeedlesIt d_values,
Offset values_count,
Offset num_diagonals,
Offset* range_beg_offsets,
PartitionCompOp partition_comp)
```

Suggestion: use _CCCL_GRID_CONSTANT where possible.
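
For illustration, the parameter list above with those annotations applied; the kernel name is hypothetical, and __grid_constant__ (which _CCCL_GRID_CONSTANT maps to where supported) requires const-qualified parameters:

```cpp
// Sketch only: annotations per the suggestion above.
__global__ void find_bound_partition_kernel(
  _CCCL_GRID_CONSTANT const HaystackIt d_range,
  _CCCL_GRID_CONSTANT const Offset range_count,
  _CCCL_GRID_CONSTANT const NeedlesIt d_values,
  _CCCL_GRID_CONSTANT const Offset values_count,
  _CCCL_GRID_CONSTANT const Offset num_diagonals,
  Offset* range_beg_offsets, // output pointer; the pointee is written through
  _CCCL_GRID_CONSTANT const PartitionCompOp partition_comp);
```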

```cpp
  return error;
}

return dispatch_compute_cap(policy_selector, cc, [&](auto policy_getter) -> cudaError_t {
```

I think this should have a fully qualified namespace

```cpp
using traits_t = policy_traits<decltype(policy_getter)>;

const Offset total_items = range_count + values_count;
const Offset num_tiles = ::cuda::ceil_div(total_items, static_cast<Offset>(traits_t::tile_size));
```

Suggested change:

```diff
-const Offset num_tiles = ::cuda::ceil_div(total_items, static_cast<Offset>(traits_t::tile_size));
+const Offset num_tiles = ::cuda::ceil_div(total_items, Offset{traits_t::tile_size});
```

```cpp
// Lightweight pass; not worth exposing through the tuning system.
constexpr int threads_per_partition_block = 256;
const int partition_grid_size =
  static_cast<int>(::cuda::ceil_div(num_diagonals, static_cast<Offset>(threads_per_partition_block)));
```

Suggested change:

```diff
-static_cast<int>(::cuda::ceil_div(num_diagonals, static_cast<Offset>(threads_per_partition_block)));
+static_cast<int>(::cuda::ceil_div(num_diagonals, Offset{threads_per_partition_block}));
```


```cpp
#include <thrust/system/cuda/detail/core/triple_chevron_launch.h>

#include <cuda/std/__algorithm/min.h>
```

Missing headers:

```cpp
#include <cuda/__cmath/ceil_div.h>
#include <cuda/std/__type_traits/is_empty.h>
```

github-project-automation (bot) moved this from In Review to In Progress in CCCL on May 8, 2026


Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[FEA]: Add batched binary search algorithm for sorted needles and haystack

2 participants