Conversation
@alex-jw-brooks alex-jw-brooks commented Jan 7, 2026

Purpose

Fixes a bug in Granite Speech padding. The audio features are variable length, so we pad them to shape [bsz, longest_feature, 160]; however, when the multimodal inputs are batched, they arrive as a list of 2D tensors of shape [feat_len, 160], which breaks the pad call, since it expects 3D tensors.

(EngineCore_DP0 pid=2881328)   File "/u/abrooks9944/vllm/vllm/model_executor/models/granite_speech.py", line 675, in _parse_and_validate_audio_input
(EngineCore_DP0 pid=2881328)     input_features = self._pad_and_stack_input_features(
(EngineCore_DP0 pid=2881328)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2881328)   File "/u/abrooks9944/vllm/vllm/model_executor/models/granite_speech.py", line 738, in _pad_and_stack_input_features
(EngineCore_DP0 pid=2881328)     torch.nn.functional.pad(feats, (0, 0, 0, pad, 0, 0))
(EngineCore_DP0 pid=2881328)   File "/u/abrooks9944/miniforge3/envs/inference/lib/python3.12/site-packages/torch/nn/functional.py", line 5294, in pad
(EngineCore_DP0 pid=2881328)     return torch._C._nn.pad(input, pad, mode, value)
(EngineCore_DP0 pid=2881328) RuntimeError: Padding length should be less than or equal to two times the input dimension but got padding length 6 and input of dimension 2

This PR unsqueezes the features if they're 2D so that the tensors passed to the pad call are always 3D.
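
To illustrate the failure mode, here is a minimal standalone sketch (the shapes are made up for the example; only torch is required). F.pad takes one (left, right) pair per dimension, starting from the last, so a 6-element pad tuple requires at least a 3D input:

import torch
import torch.nn.functional as F

feats = torch.randn(12, 160)  # one request's features: [feat_len, 160]
pad = 8  # amount needed to reach the longest feature in the batch

# 2D input + a 6-element pad tuple -> the RuntimeError in the traceback above
try:
    F.pad(feats, (0, 0, 0, pad, 0, 0))
except RuntimeError as e:
    print(e)

# unsqueezing to [1, feat_len, 160] first, as this PR does, makes the pad succeed
padded = F.pad(feats.unsqueeze(0), (0, 0, 0, pad, 0, 0))
print(padded.shape)  # torch.Size([1, 20, 160])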

Test Plan

Can be reproduced with something like this:

import io
import librosa
import numpy as np
from numpy import ndarray as NDArray
import soundfile as sf
import wave
from transformers import AutoTokenizer
from tqdm import tqdm
from vllm import LLM
from vllm.lora.request import LoRARequest

TEST_FILES = [
    "1.wav",
    "2.wav",
    "3.wav",
    "4.wav",
    "5.wav",
    "6.wav",
]


def audio_to_wav(audio_frame: bytes) -> bytes:
    """Convert mono pcm 16000 sample rate audio bytes to WAV content returned as bytes

    Args:
        audio_frame (bytes): audio bytes

    Returns:
        bytes: WAV bytes
    """
    wav_buffer = io.BytesIO()
    with wave.open(wav_buffer, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(16000)
        wf.writeframes(audio_frame)

    wav_buffer.seek(0)
    return wav_buffer.read()


def float2int(sound: NDArray) -> NDArray:
    """Convert the NDArray containing sound from floats in the range [-1.0,1.0] to int16 in the range [-32768,32767]

    Args:
        sound (NDArray): array of float32 sound values

    Returns:
        NDArray: sound as int16 sound values between [-32768,32767]
    """
    return (sound * 32768.0).astype(np.int16)


def read_float32_wav_to_wav(filepath: str) -> bytes | None:
    """
    Reads a 32-bit floating point PCM WAV file and returns its raw audio data as bytes.

    Args:
        filepath (str): The path to the WAV file.

    Returns:
        bytes: The raw audio data as a bytes object.
    """
    try:
        # Read the audio data and sample rate
        data, samplerate = sf.read(filepath, dtype="float32")
        data = float2int(data)
        audio_bytes = data.tobytes()
        return audio_to_wav(audio_bytes)
    except Exception as e:
        print(f"Error reading WAV file: {e}")
        return None

MODEL = "ibm-granite/granite-speech-3.3-8b"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

question = "can you transcribe the speech into a written format?"

model = LLM(
    model=MODEL,
    enable_lora=True,
    max_lora_rank=64,
    limit_mm_per_prompt={"audio": 1},
    enforce_eager=True,
)

prompt_with_audio = "<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024.\n Today's Date: November 21, 2025. You are Granite, developed by IBM. You are a helpful AI assistant.<|end_of_text|>\n<|start_of_role|>user<|end_of_role|><|audio|>\ncan you transcribe the speech into a written format?<|end_of_text|>\n<|start_of_role|>assistant<|end_of_role|>"


batch = []
for test_file in tqdm(TEST_FILES):
    audio_array, sampling_rate = librosa.load(test_file, sr=None)

    inputs = {
        "prompt": prompt_with_audio,
        "multi_modal_data": {
            "audio": audio_array,
        },
    }
    batch.append(inputs)

outputs = model.generate(
    batch,
    lora_request=LoRARequest("speech", 1, MODEL),
    use_tqdm=False,
)

for output in outputs:
    print(output.outputs[0].text)

Test Result

It pads correctly and doesn't crash.

@DarkLight1337 could you please take a look?

@gemini-code-assist bot left a comment

Code Review

This pull request addresses a bug in Granite Speech padding for batched variable-length audio features, which caused a crash due to a dimension mismatch. The fix correctly unsqueezes 2D tensors to 3D before padding. My review includes a suggestion to enhance the robustness of this fix by handling mixed-dimension tensors within a batch, which will prevent potential future runtime errors.

Comment on lines 676 to 679
            if len(input_features) and input_features[0].ndim == 2:
                input_features = [
                    inp_features.unsqueeze(dim=0) for inp_features in input_features
                ]

high

The current implementation checks only the first tensor in input_features to decide whether to unsqueeze all tensors. This is brittle and assumes all tensors in the list have the same number of dimensions. If the list contains a mix of 2D and 3D tensors, or tensors with other dimensions, this could lead to unexpected RuntimeErrors downstream. A more robust approach would be to handle each tensor individually and validate its dimensions.

            if len(input_features):
                processed_features = []
                for inp_features in input_features:
                    if inp_features.ndim == 2:
                        processed_features.append(inp_features.unsqueeze(0))
                    elif inp_features.ndim == 3:
                        processed_features.append(inp_features)
                    else:
                        raise ValueError(
                            f'Expected 2D or 3D tensor in input_features, '
                            f'but got a {inp_features.ndim}D tensor.')
                input_features = processed_features
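
As a quick standalone check of the suggested logic (hypothetical shapes, not taken from the PR), a mixed batch is normalized correctly, whereas the first-element check alone would miss it:

import torch

# hypothetical mixed batch: one 2D [feat_len, 160] and one already-3D [1, feat_len, 160]
input_features = [torch.randn(5, 160), torch.randn(1, 7, 160)]
processed = [f.unsqueeze(0) if f.ndim == 2 else f for f in input_features]
assert all(f.ndim == 3 for f in processed)  # every tensor is now [1, feat_len, 160]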

Signed-off-by: Alex-Brooks <[email protected]>
@alex-jw-brooks alex-jw-brooks changed the title from "[Bugfix] Fix Batched Padding in Granite Speech" to "[Bugfix] Fix Var Length Batched Padding in Granite Speech" on Jan 7, 2026