
Conversation

@mikaylagawarecki commented Jan 6, 2026

Purpose

Stacked on #31547

Test Plan

pytest tests/kernels/core/test_pos_encoding.py -v
pytest tests/kernels/core/test_rotary_embedding.py -v
pytest tests/kernels/core/test_apply_rotary_emb.py -v
pytest tests/kernels/core/test_fused_qk_norm_rope.py -v
pytest tests/kernels/test_apply_repetition_penalties.py -v
pytest tests/kernels/test_top_k_per_row.py -v

Test Result



@mikaylagawarecki changed the title from "Stable abi phase3" to "[4/n] Migrate pos_encoding sampler and fused_qknorm_rope" Jan 6, 2026
@mergify bot added the ci/build, nvidia, and cpu (Related to CPU backends) labels Jan 6, 2026
@mikaylagawarecki changed the title from "[4/n] Migrate pos_encoding sampler and fused_qknorm_rope" to "[4/n] Migrate pos_encoding sampler and fused_qknorm_rope to libtorch stable ABI" Jan 6, 2026
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request is a significant and well-executed effort to migrate numerous CUDA kernels to PyTorch's stable ABI, which will improve forward compatibility. The refactoring is extensive, touching many files to adopt stable tensor APIs, dispatch macros, and header-only includes. However, I've identified a recurring critical issue across several files: input.get_device() is incorrectly used to obtain the CUDA stream. This will lead to a compilation failure because the stable ABI function get_current_cuda_stream requires a device index (int32_t), which should be retrieved with input.get_device_index(). I have provided specific comments and suggestions for each occurrence. Addressing these compilation errors should put the PR in excellent shape.
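To make the recurring fix concrete, here is the failing pattern next to the corrected one (a minimal sketch; it assumes, as the comments below state, that get_current_cuda_stream is this repo's helper taking an int32_t device index):

  // Fails to compile: input.get_device() returns a torch::stable::Device,
  // but get_current_cuda_stream expects an int32_t device index.
  torch::stable::accelerator::DeviceGuard device_guard(input.get_device());
  cudaStream_t stream = get_current_cuda_stream(input.get_device());

  // Compiles: input.get_device_index() returns the int32_t index that
  // both the guard and the stream helper accept.
  torch::stable::accelerator::DeviceGuard device_guard(input.get_device_index());
  cudaStream_t stream = get_current_cuda_stream(input.get_device_index());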

Comment on lines +123 to +124
torch::stable::accelerator::DeviceGuard device_guard(input.get_device()); \
cudaStream_t stream = get_current_cuda_stream(input.get_device()); \

critical

The function get_current_cuda_stream expects a device index of type int32_t, but input.get_device() returns a torch::stable::Device object. This will cause a compilation error. You should use input.get_device_index() instead. While DeviceGuard can accept a Device object, using get_device_index() for both is more consistent.

  torch::stable::accelerator::DeviceGuard device_guard(input.get_device_index());
  cudaStream_t stream = get_current_cuda_stream(input.get_device_index());
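
For context, the type mismatch can be spelled out as declarations; these signatures are assumptions inferred from this comment, not copied from the stable ABI headers:

  // Hypothetical declarations illustrating the mismatch described above.
  torch::stable::Device get_device() const;         // returns a Device object
  int32_t get_device_index() const;                 // returns the raw device index
  cudaStream_t get_current_cuda_stream(int32_t device_index);  // needs the index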

Comment on lines +289 to +290
torch::stable::accelerator::DeviceGuard device_guard(input.get_device()); \
cudaStream_t stream = get_current_cuda_stream(input.get_device()); \

critical

The function get_current_cuda_stream expects a device index of type int32_t, but input.get_device() returns a torch::stable::Device object. This will cause a compilation error. You should use input.get_device_index() instead. While DeviceGuard can accept a Device object, using get_device_index() for both is more consistent.

  torch::stable::accelerator::DeviceGuard device_guard(input.get_device_index());
  cudaStream_t stream = get_current_cuda_stream(input.get_device_index());

Comment on lines +304 to +305
torch::stable::accelerator::DeviceGuard device_guard(input.get_device()); \
cudaStream_t stream = get_current_cuda_stream(input.get_device()); \

critical

The function get_current_cuda_stream expects a device index of type int32_t, but input.get_device() returns a torch::stable::Device object. This will cause a compilation error. You should use input.get_device_index() instead. While DeviceGuard can accept a Device object, using get_device_index() for both is more consistent.

  torch::stable::accelerator::DeviceGuard device_guard(input.get_device_index());
  cudaStream_t stream = get_current_cuda_stream(input.get_device_index());

Comment on lines +380 to +381
torch::stable::accelerator::DeviceGuard device_guard(input.get_device()); \
cudaStream_t stream = get_current_cuda_stream(input.get_device()); \

critical

The function get_current_cuda_stream expects a device index of type int32_t, but input.get_device() returns a torch::stable::Device object. This will cause a compilation error. You should use input.get_device_index() instead. While DeviceGuard can accept a Device object, using get_device_index() for both is more consistent.

  torch::stable::accelerator::DeviceGuard device_guard(input.get_device_index());
  cudaStream_t stream = get_current_cuda_stream(input.get_device_index());

Comment on lines +271 to +273
const torch::stable::accelerator::DeviceGuard device_guard(
input.get_device());
const cudaStream_t stream = get_current_cuda_stream(input.get_device());

critical

The function get_current_cuda_stream expects a device index of type int32_t, but input.get_device() returns a torch::stable::Device object. This will cause a compilation error. You should use input.get_device_index() instead. While DeviceGuard can accept a Device object, using get_device_index() for both is more consistent.

  const torch::stable::accelerator::DeviceGuard device_guard(
      input.get_device_index());
  const cudaStream_t stream = get_current_cuda_stream(input.get_device_index());

Comment on lines +146 to +148
const torch::stable::accelerator::DeviceGuard device_guard(
input.get_device());
const cudaStream_t stream = get_current_cuda_stream(input.get_device());

critical

The function get_current_cuda_stream expects a device index of type int32_t, but input.get_device() returns a torch::stable::Device object. This will cause a compilation error. You should use input.get_device_index() instead. While DeviceGuard can accept a Device object, using get_device_index() for both is more consistent.

  const torch::stable::accelerator::DeviceGuard device_guard(
      input.get_device_index());
  const cudaStream_t stream = get_current_cuda_stream(input.get_device_index());

Comment on lines +219 to +221
const torch::stable::accelerator::DeviceGuard device_guard(
input.get_device());
const cudaStream_t stream = get_current_cuda_stream(input.get_device());

critical

The function get_current_cuda_stream expects a device index of type int32_t, but input.get_device() returns a torch::stable::Device object. This will cause a compilation error. You should use input.get_device_index() instead. While DeviceGuard can accept a Device object, using get_device_index() for both is more consistent.

  const torch::stable::accelerator::DeviceGuard device_guard(
      input.get_device_index());
  const cudaStream_t stream = get_current_cuda_stream(input.get_device_index());

