AneeshGidda/gpu_parallel_primitives_library

GPU Parallel Primitives Library

A GPU-accelerated library of parallel primitives implemented in CUDA, with C++ host code and Python bindings via pybind11. Use it from Python with NumPy arrays for reduce, scan, histogram, and radix sort—without writing CUDA yourself.


What is this?

gpuprims provides a small set of high-performance, single-GPU primitives commonly used as building blocks in parallel algorithms:

Primitive        Description                              Supported types
reduce_sum       Sum of a 1D array → scalar               int32, float32
exclusive_scan   Exclusive prefix sum → 1D array          int32, float32
histogram        Fixed-bin counts → 1D array of counts    uint32, int32 (non-negative)
radix_sort       Stable ascending sort → 1D array         uint32

All operations run on the GPU. Inputs are 1D contiguous NumPy arrays; the library copies data to the device, runs the kernel, and returns results to Python.
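
To make the semantics in the table concrete, here is a CPU-only NumPy sketch of what each primitive computes. The `ref_*` functions are hypothetical helpers written for illustration and are not part of gpuprims. The handling of out-of-range histogram values and the dtype of the returned counts are assumptions, since this README does not specify them.

```python
import numpy as np

def ref_reduce_sum(x):
    """Sum of a 1D array, accumulated in the input dtype."""
    return x.sum(dtype=x.dtype)

def ref_exclusive_scan(x):
    """out[i] = x[0] + ... + x[i-1]; out[0] is always 0."""
    out = np.zeros_like(x)
    out[1:] = np.cumsum(x[:-1], dtype=x.dtype)
    return out

def ref_histogram(x, bins):
    """counts[b] = number of elements equal to b, for b in [0, bins).

    Assumption: values >= bins are ignored here; the library's behavior
    for such values is not documented in this README.
    """
    return np.bincount(x.astype(np.int64), minlength=bins)[:bins]

def ref_radix_sort(x):
    """Ascending stable sort; returns a new array."""
    return np.sort(x, kind="stable")
```

These references can be used to check GPU results on small inputs, for example by comparing `ref_exclusive_scan(x)` with `gpuprims.exclusive_scan(x)` once the package is installed.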


Requirements

  • CUDA Toolkit 11.x or 12.x
  • C++17-capable compiler (e.g. GCC 9+, Clang 10+, or MSVC with CUDA on Windows)
  • Python 3.9 or newer
  • NumPy 1.20+
  • CMake 3.18+ (used when building the Python extension)

Installation

  1. Clone the repository and enter the project directory:

    git clone <repository-url>
    cd gpu_parallel_primitives_library
  2. Ensure CUDA is available. Set CUDA_PATH if your toolkit is not in the default location.

  3. Install the package (builds the CUDA extension via CMake):

    pip install .

    For editable/development installs (recommended if you change code):

    pip install -e .

    Optional development dependencies (tests, formatters):

    pip install -e ".[dev]"

Quick Start

import gpuprims
import numpy as np

# Reduce: sum of array → scalar
x = np.array([1, 2, 3, 4, 5], dtype=np.int32)
print(gpuprims.reduce_sum(x))   # 15

# Exclusive scan: prefix sum (first element is 0)
y = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
print(gpuprims.exclusive_scan(y))   # [0. 1. 3. 6.]

# Histogram: count values in [0, bins)
z = np.array([0, 1, 1, 2, 2, 2], dtype=np.uint32)
print(gpuprims.histogram(z, 3))   # [1 2 3]

# Radix sort: stable ascending sort (uint32 only)
w = np.array([3, 1, 4, 1, 5], dtype=np.uint32)
print(gpuprims.radix_sort(w))   # [1 1 3 4 5]

Run the full example script from the repo root:

python examples/example_usage.py
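
As an illustration of how these primitives act as building blocks, the sketch below combines a histogram with an exclusive scan to find where each key's bucket starts in sorted output, which is the core of a counting sort. It uses NumPy stand-ins so it runs without a GPU. With the package installed, `gpuprims.histogram` and `gpuprims.exclusive_scan` could be substituted, assuming the histogram output is first cast to a dtype that `exclusive_scan` accepts (the output dtype is not stated in this README). `bucket_offsets` and `counting_sort` are hypothetical names, and this is not a description of how the library's `radix_sort` is implemented.

```python
import numpy as np

# NumPy stand-ins for gpuprims.histogram and gpuprims.exclusive_scan.
def histogram(x, bins):
    return np.bincount(x.astype(np.int64), minlength=bins)[:bins].astype(np.int32)

def exclusive_scan(x):
    out = np.zeros_like(x)
    out[1:] = np.cumsum(x[:-1], dtype=x.dtype)
    return out

def bucket_offsets(keys, bins):
    """Start index of each key's bucket in the sorted output."""
    counts = histogram(keys, bins)      # how many of each key
    return exclusive_scan(counts)       # where each bucket begins

def counting_sort(keys, bins):
    """Stable sort of keys in [0, bins) using the bucket offsets."""
    next_slot = bucket_offsets(keys, bins).copy()
    out = np.empty_like(keys)
    for k in keys:                      # scatter in input order keeps it stable
        out[next_slot[k]] = k
        next_slot[k] += 1
    return out

keys = np.array([2, 0, 1, 2, 1, 2], dtype=np.uint32)
print(bucket_offsets(keys, 3))   # [0 1 3]
print(counting_sort(keys, 3))    # [0 1 1 2 2 2]
```

The scatter loop is written serially for clarity; on a GPU that step would itself be parallelized.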

API Summary

  • gpuprims.reduce_sum(x) — Returns the sum of x. x: 1D int32 or float32 array.
  • gpuprims.exclusive_scan(x) — Returns exclusive prefix sum; same shape and dtype as x. x: 1D int32 or float32 array.
  • gpuprims.histogram(x, bins) — Returns a 1D array of length bins with counts for values in [0, bins). x: 1D uint32 or non-negative int32 array.
  • gpuprims.radix_sort(x) — Returns a new 1D array with values sorted in ascending order (stable). x: 1D uint32 array.

Input arrays must be 1D, contiguous, and of a supported dtype. For any other input, the library may raise an exception, or its behavior may be undefined.
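
Because unsupported inputs are not guaranteed to be rejected, it may be worth validating arrays before calling the library. The following is a minimal sketch of such a check. `prepare_input` and the `SUPPORTED` table are hypothetical and not part of gpuprims; the dtype lists are copied from the API summary above.

```python
import numpy as np

# Supported dtypes per primitive, taken from the API summary.
SUPPORTED = {
    "reduce_sum": (np.int32, np.float32),
    "exclusive_scan": (np.int32, np.float32),
    "histogram": (np.uint32, np.int32),
    "radix_sort": (np.uint32,),
}

def prepare_input(x, primitive):
    """Return a 1D C-contiguous array, or raise if x is unsupported."""
    arr = np.asarray(x)
    if arr.ndim != 1:
        raise ValueError(f"{primitive} expects a 1D array, got {arr.ndim}D")
    allowed = SUPPORTED[primitive]
    if not any(arr.dtype == np.dtype(t) for t in allowed):
        names = ", ".join(np.dtype(t).name for t in allowed)
        raise TypeError(f"{primitive} expects dtype {names}, got {arr.dtype}")
    if primitive == "histogram" and arr.size and arr.min() < 0:
        raise ValueError("histogram expects non-negative values")
    # Copies only when the input is not already contiguous (e.g. a strided view).
    return np.ascontiguousarray(arr)
```

A call would then look like `gpuprims.reduce_sum(prepare_input(x, "reduce_sum"))`. Raising on a dtype mismatch rather than casting silently avoids hiding overflow or precision loss.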


Project layout

  • Python API → pybind11 bindings → C++ wrappers → CUDA kernels
  • Build: CMake (reduce, scan, histogram, radix_sort, wrappers, bindings). The Python wheel is built with scikit-build-core.

Running tests

From the project root:

pytest tests/

Requires the package to be installed (e.g. pip install -e ".[dev]").
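
When comparing GPU results with a CPU reference, note that a parallel float32 reduction or scan generally adds values in a different order than NumPy does, so results can differ in the last bits. The sketch below shows one way a test might compare outputs: exact equality for integer results and a relative tolerance for floating-point results. `matches_reference` and its default tolerance are assumptions for illustration and are not taken from this repository's test suite.

```python
import numpy as np

def matches_reference(got, expected, rtol=1e-5):
    """Exact comparison for integers, tolerance-based for floats."""
    got = np.asarray(got)
    expected = np.asarray(expected)
    if got.shape != expected.shape:
        return False
    if np.issubdtype(expected.dtype, np.floating):
        return bool(np.allclose(got, expected, rtol=rtol))
    return bool(np.array_equal(got, expected))
```

In a test this could be used as `assert matches_reference(gpuprims.reduce_sum(x), x.sum())`.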
