[Bugfix] Fix Var Length Batched Padding in Granite Speech #31906
Purpose
Fixes a bug in Granite Speech padding. The audio features are variable length, so we pad the tensors to
[bsz, longest_feature, 160]; however, when the multimodal inputs are batched, they are provided as a list of 2D tensors of shape [feat_len, 160], which breaks the pad call that expects a 3D tensor. This PR unsqueezes the features if they're 2D so the padding always operates on a 3D tensor.
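A minimal sketch of the kind of guard this change adds (the function and argument names here are illustrative, not the exact vLLM code):

```python
import torch
import torch.nn.functional as F

def pad_audio_features(features: list[torch.Tensor], longest_feature: int) -> torch.Tensor:
    """Illustrative only: pad variable-length features to [bsz, longest_feature, 160]."""
    padded = []
    for feat in features:
        # When inputs are batched, each feature may arrive as a 2D [feat_len, 160]
        # tensor; add a leading batch dim so the pad below always sees a 3D tensor.
        if feat.ndim == 2:
            feat = feat.unsqueeze(0)
        # Pad the feature-length (second-to-last) dimension up to the longest feature.
        padded.append(F.pad(feat, (0, 0, 0, longest_feature - feat.shape[1])))
    return torch.cat(padded, dim=0)
```

With the unsqueeze in place, both the single-request path (already 3D) and the batched path (list of 2D tensors) take the same padding code path.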
Test Plan
The bug can be reproduced with a script like the following:
```python
import io
import wave

import librosa
import numpy as np
import soundfile as sf
from numpy import ndarray as NDArray
from tqdm import tqdm
from transformers import AutoTokenizer
from vllm import LLM
from vllm.lora.request import LoRARequest

TEST_FILES = [
    "1.wav",
    "2.wav",
    "3.wav",
    "4.wav",
    "5.wav",
    "6.wav",
]


def audio_to_wav(audio_frame: bytes) -> bytes:
    """Convert mono PCM 16000-sample-rate audio bytes to WAV content returned as bytes.

    Args:
        audio_frame (bytes): audio bytes

    Returns:
        bytes: WAV bytes
    """
    wav_buffer = io.BytesIO()
    with wave.open(wav_buffer, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(audio_frame)
    wav_buffer.seek(0)
    return wav_buffer.read()


def float2int(sound: NDArray) -> NDArray:
    """Convert sound from floats in [-1.0, 1.0] to int16 in [-32768, 32767].

    Args:
        sound (NDArray): array of float32 sound values

    Returns:
        NDArray: sound as int16 values in [-32768, 32767]
    """
    return (sound * 32768.0).astype(np.int16)


def read_float32_wav_to_wav(filepath: str) -> bytes | None:
    """Read a 32-bit floating point PCM WAV file and return its raw audio data as bytes.

    Args:
        filepath (str): The path to the WAV file.

    Returns:
        bytes: The raw audio data as a bytes object.
    """
    try:
        # Read the audio data and sample rate
        data, samplerate = sf.read(filepath, dtype="float32")
        data = float2int(data)
        audio_bytes = data.tobytes()
        return audio_to_wav(audio_bytes)
    except Exception as e:
        print(f"Error reading WAV file: {e}")
        return None


MODEL = "ibm-granite/granite-speech-3.3-8b"
tokenizer = AutoTokenizer.from_pretrained(MODEL)


def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(chat, tokenize=False)


question = "can you transcribe the speech into a written format?"

model = LLM(
    model=MODEL,
    enable_lora=True,
    max_lora_rank=64,
    limit_mm_per_prompt={"audio": 1},
    enforce_eager=True,
)

prompt_with_audio = "<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.\n Today's Date: November 21, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|><|audio|>\ncan you transcribe the speech into a written format?<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>"

batch = []
for test_file in tqdm(TEST_FILES):
    audio_array, sampling_rate = librosa.load(test_file, sr=None)
    inputs = {
        "prompt": prompt_with_audio,
        "multi_modal_data": {
            "audio": audio_array,
        },
    }
    batch.append(inputs)

outputs = model.generate(
    batch,
    lora_request=LoRARequest("speech", 1, MODEL),
    use_tqdm=False,
)

for output in outputs:
    print(output.outputs[0].text)
```

Test Result
It pads correctly and doesn't crash.
@DarkLight1337 could you please take a look?