
Conversation

@sh3ll3x3c

Description

Add CUDA 12.8 support for NVIDIA RTX 50-series (Blackwell/sm_120) GPUs.

The current Dockerfile.onnx.gpu uses CUDA 12.4, which doesn't support the new RTX 50-series architecture (sm_120). This PR adds a new Dockerfile that enables GPU inference on RTX 5090, 5080, 5070 Ti, and 5070 cards.

Key changes:

  • CUDA 12.8.1 base image (required for sm_120 architecture)
  • PyTorch nightly with cu128 support (stable PyTorch doesn't support sm_120 yet)
  • onnxruntime-gpu from Microsoft's CUDA 12 index (default PyPI package lacks CUDAExecutionProvider for CUDA 12)
  • flash_attn build skipped by default (optional; skipping it significantly reduces build time)

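For reference, the relevant install steps in the new Dockerfile look roughly like the sketch below. This is an illustration only: the exact base image tag, package pins, and index URLs used in Dockerfile.onnx.gpu.cuda128 may differ.

    FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04

    # PyTorch nightly built against CUDA 12.8 (stable wheels do not yet ship sm_120 kernels)
    RUN pip install --pre torch torchvision \
        --index-url https://download.pytorch.org/whl/nightly/cu128

    # onnxruntime-gpu from Microsoft's CUDA 12 feed so CUDAExecutionProvider is available
    RUN pip install onnxruntime-gpu \
        --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/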
Related issue: Users with RTX 50-series GPUs cannot use GPU acceleration with the current Docker images.

Type of change

  • New feature (non-breaking change which adds functionality)

How has this change been tested? Please provide a testcase or example of how you tested the change.

Tested on NVIDIA GeForce RTX 5090:

  1. Built the image:

    docker build -f docker/dockerfiles/Dockerfile.onnx.gpu.cuda128 \
      -t roboflow/roboflow-inference-server-gpu-cuda128 .
  2. Verified CUDA provider is available:

    docker exec <container> python -c "import onnxruntime as ort; print(ort.get_available_providers())"
    # Output: ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
  3. Tested inference speed (1080p image, object detection):

    • First request: ~54s (model download + load)
    • Subsequent requests: 65-100ms (GPU inference)
  4. Verified GPU memory allocation via nvidia-smi
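For anyone reproducing the test, an additional check (not part of the PR itself) is to confirm the nightly PyTorch wheel sees the Blackwell card and reports the sm_120 compute capability; the exact version string will vary:

    docker exec <container> python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_capability(0))"
    # Expected output along the lines of: 2.x.0.devYYYYMMDD+cu128 True (12, 0)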

Any specific deployment considerations

  • This Dockerfile uses PyTorch nightly builds (cu128) since stable PyTorch doesn't yet support sm_120
  • Once PyTorch stable releases with CUDA 12.8/sm_120 support, the Dockerfile can be updated to use stable builds
  • Users who need Paligemma/Florence2 support can uncomment the flash_attn build step (adds significant build time)
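On the first two points, the later switch should amount to swapping the nightly wheel index for the stable CUDA 12.8 one. Assuming PyTorch's usual index layout, that would look roughly like:

    # current: nightly wheels
    pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
    # later: stable wheels, once a release ships sm_120 support
    pip install torch --index-url https://download.pytorch.org/whl/cu128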

Docs

  • Docs updated? What were the changes:

Documentation update suggested: Add a note in the Docker deployment docs mentioning Dockerfile.onnx.gpu.cuda128 for RTX 50-series GPU users.

Add new Dockerfile.onnx.gpu.cuda128 to enable GPU inference on
NVIDIA RTX 50-series (Blackwell/sm_120) GPUs including RTX 5090,
5080, 5070 Ti, and 5070.

Key changes:
- Use CUDA 12.8.1 base image (required for sm_120 architecture)
- Install PyTorch nightly with cu128 support
- Install onnxruntime-gpu from Microsoft's CUDA 12 index to enable
  CUDAExecutionProvider (default PyPI package lacks CUDA 12 support)
- Skip flash_attn build by default (optional, reduces build time)

Build:
  docker build -f docker/dockerfiles/Dockerfile.onnx.gpu.cuda128 \
    -t roboflow/roboflow-inference-server-gpu-cuda128 .

Run:
  docker run --gpus all -p 9001:9001 \
    roboflow/roboflow-inference-server-gpu-cuda128

Tested on RTX 5090 with ~65-100ms inference times.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@CLAassistant

CLAassistant commented Dec 28, 2025

CLA assistant check
All committers have signed the CLA.
