A GPU-accelerated library of parallel primitives implemented in CUDA, with C++ host code and Python bindings via pybind11. Use it from Python with NumPy arrays for reduce, scan, histogram, and radix sort—without writing CUDA yourself.
gpuprims provides a small set of high-performance, single-GPU primitives commonly used as building blocks in parallel algorithms:
| Primitive | Description | Supported types |
|---|---|---|
| reduce_sum | Sum of a 1D array → scalar | int32, float32 |
| exclusive_scan | Exclusive prefix sum → 1D array | int32, float32 |
| histogram | Fixed-bin counts → 1D array of counts | uint32, int32 (non-negative) |
| radix_sort | Stable ascending sort → 1D array | uint32 |
All operations run on the GPU. Inputs are 1D contiguous NumPy arrays; the library copies data to the device, runs the kernel, and returns results to Python.
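Because inputs must be 1D and contiguous, it can help to normalize arrays before passing them in. The following is a minimal sketch using only NumPy; `prepare_input` is a hypothetical helper for illustration and is not part of gpuprims.

```python
import numpy as np


def prepare_input(values, dtype):
    """Return a 1D, C-contiguous array of the requested dtype.

    Hypothetical helper, not part of gpuprims. Note that the dtype
    conversion casts silently (e.g. float64 -> float32 loses precision).
    """
    arr = np.ascontiguousarray(values, dtype=dtype)
    if arr.ndim != 1:
        raise ValueError(f"expected a 1D array, got {arr.ndim} dimensions")
    return arr


# A strided view is 1D but not contiguous; prepare_input makes a compact copy.
strided = np.arange(10, dtype=np.int32)[::2]
ready = prepare_input(strided, np.int32)
print(strided.flags["C_CONTIGUOUS"], ready.flags["C_CONTIGUOUS"])
```

The extra copy happens on the host, so for large arrays it is usually cheaper to allocate them contiguously in the first place.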
- CUDA Toolkit 11.x or 12.x
- C++17-capable compiler (e.g. GCC 9+, Clang 10+, or MSVC with CUDA on Windows)
- Python 3.9 or newer
- NumPy 1.20+
- CMake 3.18+ (used when building the Python extension)
To build and install from source:

- Clone the repository and enter the project directory:

      git clone <repository-url>
      cd gpu-parallel-primitives-library

- Ensure CUDA is available. Set `CUDA_PATH` if your toolkit is not in the default location.

- Install the package (builds the CUDA extension via CMake):

      pip install .

  For editable/development installs (recommended if you change code):

      pip install -e .

  Optional development dependencies (tests, formatters):

      pip install -e ".[dev]"
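After installing, a quick sanity check can confirm that Python can locate the extension and show what `CUDA_PATH` is set to. This is a rough sketch using only the standard library; `installation_report` is a hypothetical helper, and it only checks that the module can be found, not that a GPU is usable.

```python
import importlib.util
import os


def installation_report(module_name="gpuprims"):
    """Report whether a module is importable and what CUDA_PATH is set to.

    Hypothetical helper for illustration; it does not load the module
    or touch the GPU.
    """
    spec = importlib.util.find_spec(module_name)
    return {
        "module_found": spec is not None,
        "module_location": getattr(spec, "origin", None),
        "cuda_path": os.environ.get("CUDA_PATH"),
    }


print(installation_report())
```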
Basic usage from Python:

    import gpuprims
    import numpy as np

    # Reduce: sum of array → scalar
    x = np.array([1, 2, 3, 4, 5], dtype=np.int32)
    print(gpuprims.reduce_sum(x))  # 15

    # Exclusive scan: prefix sum (first element is 0)
    y = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
    print(gpuprims.exclusive_scan(y))  # [0. 1. 3. 6.]

    # Histogram: count values in [0, bins)
    z = np.array([0, 1, 1, 2, 2, 2], dtype=np.uint32)
    print(gpuprims.histogram(z, 3))  # [1 2 3]

    # Radix sort: stable ascending sort (uint32 only)
    w = np.array([3, 1, 4, 1, 5], dtype=np.uint32)
    print(gpuprims.radix_sort(w))  # [1 1 3 4 5]

Run the full example script from the repo root:

    python examples/example_usage.py

- `gpuprims.reduce_sum(x)`: returns the sum of `x`. `x` is a 1D `int32` or `float32` array.
- `gpuprims.exclusive_scan(x)`: returns the exclusive prefix sum, with the same shape and dtype as `x`. `x` is a 1D `int32` or `float32` array.
- `gpuprims.histogram(x, bins)`: returns a 1D array of length `bins` with counts for values in `[0, bins)`. `x` is a 1D `uint32` or non-negative `int32` array.
- `gpuprims.radix_sort(x)`: returns a new 1D array with the values sorted in ascending order (stable). `x` is a 1D `uint32` array.
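To make the semantics of the four primitives concrete, here is a CPU-only sketch of equivalent NumPy computations. The `*_ref` functions are hypothetical reference implementations written for illustration; they are not part of gpuprims. The dtype of the histogram counts and the handling of values outside `[0, bins)` (dropped here) are assumptions.

```python
import numpy as np


def reduce_sum_ref(x):
    # Sum of all elements, accumulated in the input dtype.
    return x.sum(dtype=x.dtype)


def exclusive_scan_ref(x):
    # out[i] = x[0] + ... + x[i-1]; out[0] is 0.
    out = np.zeros_like(x)
    out[1:] = np.cumsum(x[:-1], dtype=x.dtype)
    return out


def histogram_ref(x, bins):
    # Count occurrences of each value in [0, bins).
    # Assumption: values outside that range are ignored.
    values = x.astype(np.int64)
    in_range = values[(values >= 0) & (values < bins)]
    return np.bincount(in_range, minlength=bins)


def radix_sort_ref(x):
    # Any stable ascending sort gives the same result as a stable radix sort.
    return np.sort(x, kind="stable")


x = np.array([1, 2, 3, 4, 5], dtype=np.int32)
y = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
z = np.array([0, 1, 1, 2, 2, 2], dtype=np.uint32)
w = np.array([3, 1, 4, 1, 5], dtype=np.uint32)

print(reduce_sum_ref(x))
print(exclusive_scan_ref(y))
print(histogram_ref(z, 3))
print(radix_sort_ref(w))
```

When comparing GPU results against such a reference, integer outputs can be checked with `np.array_equal`. For `float32` sums and scans, a tolerance-based check such as `np.allclose` is safer, because floating-point addition is not associative and a parallel reduction generally adds elements in a different order.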
Input arrays must be 1D, contiguous, and of a supported dtype. For any other input, the library may raise an exception, or the behavior is undefined.
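Since unsupported inputs are not guaranteed to raise, one option is to validate on the Python side before calling into the library. The sketch below encodes the constraints listed above; `check_input` and `SUPPORTED_DTYPES` are hypothetical names for illustration, not part of gpuprims.

```python
import numpy as np

# Supported dtype names per primitive, taken from the table above.
SUPPORTED_DTYPES = {
    "reduce_sum": ("int32", "float32"),
    "exclusive_scan": ("int32", "float32"),
    "histogram": ("uint32", "int32"),
    "radix_sort": ("uint32",),
}


def check_input(x, primitive):
    """Raise if x does not meet the documented input constraints.

    Hypothetical helper, not part of gpuprims.
    """
    if not isinstance(x, np.ndarray):
        raise TypeError("expected a NumPy array")
    if x.ndim != 1:
        raise ValueError(f"expected a 1D array, got {x.ndim} dimensions")
    if not x.flags["C_CONTIGUOUS"]:
        raise ValueError("array must be contiguous")
    allowed = SUPPORTED_DTYPES[primitive]
    if x.dtype.name not in allowed:
        raise TypeError(f"{primitive} supports {allowed}, got {x.dtype.name}")
    # histogram accepts int32 only when every value is non-negative.
    if primitive == "histogram" and x.dtype.name == "int32":
        if x.size and x.min() < 0:
            raise ValueError("histogram requires non-negative values")


check_input(np.array([0, 1, 2], dtype=np.int32), "histogram")  # passes silently
```

The non-negativity check scans the whole array on the host, so it may be worth skipping when inputs are already known to be valid.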
- Python API → pybind11 bindings → C++ wrappers → CUDA kernels
- Build: CMake (reduce, scan, histogram, radix_sort, wrappers, bindings). The Python wheel is built with scikit-build-core.

Run the tests from the project root:

    pytest tests/

This requires the package to be installed (e.g. `pip install -e ".[dev]"`).