feat: Add Support of Qwen2.5Omni Model and MiniCPM-o-4_5 Model #612
KKkai0315 wants to merge 18 commits into UbiquitousLearning:main from
Conversation
…fixed quantization parameters, updated ActivationQDQ to use MovingAverageMinMaxObserver, and adjusted eps values for better precision. Modified Qwen3 model to utilize FixedActivationQDQ for sigmoid output and ensured dtype consistency in attention calculations.
… debug print statements from Qwen3DecoderLayer
…ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.
…improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.
… zeros, ones, specific values, arange, and random fills. Introduce a new fill-inl.hpp file for optimized implementations and update kernel dispatch to include these operations. Enhance CPUFillOp to utilize the new fill functions for better performance and maintainability.
…d error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs.
…es; ensure position-independent code for flatbuffers. Enhance context creation with existing context checks and improve weight quantization specifications.
… input layer normalization handling in Qwen3DecoderLayer. Update weight conversion logic in training script to address model compatibility issues.
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid overwhelming you with comments from an influx of new commits. This behavior can be configured in the CodeRabbit settings.
📝 Walkthrough
Adds two example projects (Qwen2.5-Omni and MiniCPM-o45), large multimodal model implementations (configs, tokenizers, preprocessors, modeling, token2wav), new ops/layers (ConvTranspose1D, Tanh) with CPU implementations and tests, build updates, and Python tooling for conversion and prompt-cache export.
Sequence Diagram(s)

sequenceDiagram
actor User
participant CLI as Runner
participant Tokenizer as Qwen2_5OmniTokenizer
participant Preproc as Audio/Image Preprocessor
participant Model as Qwen2_5OmniForCausalLM
User->>CLI: provide input (text / image / audio)
CLI->>Tokenizer: load tokenizer & config
CLI->>Preproc: (if media) process file -> features
CLI->>Tokenizer: convertMessage / convertVisionMessage / convertAudioMessage
Tokenizer-->>CLI: token ids + feature tensors
CLI->>Model: forward(input_ids, feature_tensors)
Model->>Model: thinker/encoders fuse modalities
Model-->>CLI: logits / streamed token outputs
loop streaming
CLI->>Tokenizer: detokenize(next_token)
Tokenizer-->>CLI: text chunk
CLI->>User: stream output
end
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In `@examples/qwen2_5omni/audio_infer.cpp`:
- Around line 24-29: The code currently silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is not "v1" or "v2";
instead validate model_version.get() explicitly: examine the string returned by
model_version.get(), set file_version to mllm::ModelFileVersion::kV1 or ::kV2
for "v1" and "v2" respectively, and for any other value print a clear error
mentioning the allowed values and terminate (non-zero exit) to avoid loading the
wrong format; update the block that assigns file_version (and any related usage)
to enforce this validation and fail fast on unsupported values.
- Around line 48-60: The example currently sets audio_path and prompt_text to
empty strings causing processAudioFile()/convertAudioMessage() to get empty
audio via mllm::audio::readWAV("") and fail; restore interactive input by
re-enabling the std::getline calls (or alternatively accept paths via CLI args)
so audio_path and prompt_text are populated before calling
processAudioFile()/convertAudioMessage(); ensure the check for "exit"/"quit"
remains and that prompt_text falls back to a default if empty to avoid empty
prompt usage.
In `@examples/qwen2_5omni/config_qwen2_5omni_7B.json`:
- Around line 5-7: The config contains unused flags enable_audio_output and
enable_talker (alongside model_type "qwen2_5_omni"); update the JSON to avoid
misleading users by either removing the keys enable_audio_output and
enable_talker or setting both to false. Locate the entries for
"enable_audio_output" and "enable_talker" in the config_qwen2_5omni_7B.json and
change their values to false (or delete those lines) so only supported
functionality (the thinker/text output) is advertised.
In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 24-29: The current logic silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is unknown; update the
branch around model_version.get() so you explicitly accept only "v1" and "v2",
set file_version to mllm::ModelFileVersion::kV1 or kV2 accordingly, and
otherwise report an error (include the invalid value and allowed values) and
terminate with a non‑zero exit or throw an exception; reference the
model_version variable and the file_version / mllm::ModelFileVersion symbols
when making this validation change.
- Around line 52-60: Uncomment the two std::getline calls so the program reads
user input instead of using empty strings: after the fmt::print("Image path (or
'exit/quit'): "); call restore std::getline(std::cin, image_path) to populate
image_path (so the exit/quit check works), and after fmt::print("Prompt text:
"); restore std::getline(std::cin, prompt_text) to populate prompt_text; ensure
the variables image_path and prompt_text are used as before and that included
<iostream> usage is consistent with the surrounding code.
In `@examples/qwen2_5omni/text_infer.cpp`:
- Around line 24-29: The code silently defaults to mllm::ModelFileVersion::kV1
when model_version is unknown; change this to validate the input from
model_version.get() and reject unsupported values instead of defaulting. In the
block that sets file_version (referencing model_version.get(), file_version, and
mllm::ModelFileVersion), check for "v1" and "v2" explicitly and if neither
matches, print a clear error mentioning the unsupported model_version and
terminate (e.g., return/exit with non-zero or throw) so the program fails fast
rather than loading the wrong format.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 942-958: The code leaves attn uninitialized when
key_states.dtype() is neither kFloat32 nor kFloat16; add a fallback else branch
in the attention block (around the key_states.dtype() checks) that handles other
dtypes by casting query_states and key_states to kFloat32, computing attn =
nn::functional::matmul(...)* (1.f / sqrtf(head_dim_)), applying mask_ and
softmax_, and then casting attn back to the original key_states.dtype() if
needed (or alternatively throw a clear runtime_error mentioning unsupported
dtype) so that attn is always initialized before using
nn::functional::matmul(attn, value_states).
In `@mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp`:
- Around line 303-316: The code computes img_token_nums from grid_thw and then
inserts tokens without validating it; add a check (similar to
convertAudioMessage's audio_token_nums > 0) after computing img_token_nums and
before ids.insert to ensure img_token_nums > 0, and on failure call
MLLM_ERROR_EXIT with a clear message (e.g., "Invalid image token count") so you
don't call ids.insert(img_token_nums - 1, ...) with zero/negative counts; locate
symbols grid_thw, img_token_nums, image_token_id, ids.insert and the surrounding
convertVisionMessage logic to make the change.
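For illustration, the proposed guard can be sketched as a self-contained function. The names mirror the review comment, but the grid handling and merge factor here are assumptions for the sketch, not the project's actual code (which would use MLLM_ERROR_EXIT rather than an exception):

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical standalone version of the proposed check: derive the image
// token count from a [t, h, w] grid and fail loudly on a non-positive result
// before any ids.insert(img_token_nums - 1, ...) style call can run.
int64_t checkedImageTokenCount(const std::vector<int64_t>& grid_thw, int64_t merge_size) {
  int64_t img_token_nums = 1;
  for (int64_t d : grid_thw) img_token_nums *= d;
  img_token_nums /= (merge_size * merge_size);  // spatial merge of patches
  if (img_token_nums <= 0) {
    // The real code would call MLLM_ERROR_EXIT; an exception keeps this sketch portable.
    throw std::runtime_error("Invalid image token count derived from grid_thw");
  }
  return img_token_nums;
}
```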
🧹 Nitpick comments (7)
mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp (2)
98-114: Redundant condition in whitespace handling logic. After the while loop on line 100 consumes all whitespace, str[pos] is guaranteed to be a non-whitespace character (when pos < str.size()). Therefore, the condition std::iswspace(str[pos]) on line 101 is never true, making the second half of the condition unreachable. This also makes lines 109-114 redundant, since the only way to reach them is if the first block restored pos (line 105), which only happens when the condition on line 101 is false, which it always is unless we're at end of string. Consider clarifying the intended behavior or simplifying this logic.
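A minimal standalone illustration of the point (the function name and signature are invented for this sketch, not taken from the tokenizer):

```cpp
#include <cwctype>
#include <string>

// After this loop exits with pos < str.size(), str[pos] is by construction
// non-whitespace, so a follow-up std::iswspace(str[pos]) check can never be true.
size_t skipWhitespace(const std::wstring& str, size_t pos) {
  while (pos < str.size() && std::iswspace(static_cast<wint_t>(str[pos]))) { ++pos; }
  return pos;
}
```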
189-193: Clarify the technical reason for the max_pixels constraint. The comment "interestingly, the answer went bad when setting max_pixels higher" is informal and doesn't explain the root cause. Consider documenting the technical reason (e.g., memory constraints, model training resolution, numerical stability) to help future maintainers understand why this limit exists.
mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp (2)
19-37: Consider adding existence checks or documenting required fields. The config parser accesses required fields like text_config, hidden_size, intermediate_size, etc. directly without .contains() checks. If the config file is missing these required fields, the parser will throw an exception. This fail-fast behavior may be intentional, but consider either:
- Adding explicit existence checks with descriptive error messages
- Documenting the required config structure in a header comment
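The first option could look like the following standalone sketch. A plain string-to-int map stands in for the project's JSON type, and requiredInt is a hypothetical helper name:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Fetch a required config field, turning a silent missing-key failure into
// an error message that names the field.
int requiredInt(const std::map<std::string, int>& cfg, const std::string& key) {
  auto it = cfg.find(key);
  if (it == cfg.end()) {
    throw std::runtime_error("Missing required config field: " + key);
  }
  return it->second;
}
```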
80-118: Consider extracting common parsing logic to reduce duplication.The root-level config parsing (lines 80-118) largely duplicates the thinker_config parsing (lines 23-79). This is understandable for supporting two config formats, but could be refactored into helper methods if the config format evolves.
mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (2)
958-961: Consider passing layer_idx as a constructor parameter instead of a public member. layer_idx_ is a public member that's set externally (line 1003). This breaks encapsulation and could lead to bugs if the member is not set before use.

🛠️ Suggested refactor

 class Qwen2_5OmniAttention final : public nn::Module {
+  int layer_idx_;
   // ... other members ...
  public:
-  Qwen2_5OmniAttention(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {
+  Qwen2_5OmniAttention(const std::string& name, const Qwen2_5OmniConfig& cfg, int layer_idx)
+      : nn::Module(name), layer_idx_(layer_idx) {
     // ...
   }
-
-  int layer_idx_;
 };

Then update Qwen2_5OmniDecoder and the module registration accordingly.
853-858: Add the explicit keyword for consistency with other constructors. Qwen2_5OmniMLP, Qwen2_5OmniAttention (line 889), Qwen2_5OmniDecoder (line 972), and Qwen2_5OmniText (line 1001) are missing the explicit keyword that other similar classes use (e.g., Qwen2_5OmniVisionMLP at line 421).

🛠️ Suggested fix

-  Qwen2_5OmniMLP(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {
+  explicit Qwen2_5OmniMLP(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {

Apply similar changes to Qwen2_5OmniAttention, Qwen2_5OmniDecoder, and Qwen2_5OmniText.

mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp (1)
43-55: Add an M_PI definition guard for portability. M_PI is not part of the C++ standard and is only a POSIX extension. While commonly available on Unix-like systems, it may be undefined on other platforms (notably Windows with MSVC without _USE_MATH_DEFINES). The codebase already uses this pattern in third_party/wenet_audio/fft.h and mllm/backends/cpu/ops/STFTOp.cpp.

🛠️ Suggested fix

+#ifndef M_PI
+#define M_PI 3.14159265358979323846
+#endif
+
 inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
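As a standalone reference for what the guarded constant feeds into, here is a hedged sketch of a Hann window with the same periodic flag. This mirrors torch.hann_window semantics and is not the project's actual implementation, which returns a Tensor:

```cpp
#include <cmath>
#include <vector>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

// periodic=true divides by N (matching torch.hann_window's default),
// periodic=false divides by N-1 (the symmetric window).
std::vector<float> hannWindow(int window_length, bool periodic = true) {
  std::vector<float> w(window_length);
  const double denom = periodic ? window_length : window_length - 1;
  for (int i = 0; i < window_length; ++i) {
    w[i] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * M_PI * i / denom)));
  }
  return w;
}
```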
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail in confusing ways. Consider rejecting unsupported values explicitly.
Suggested fix
- mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
- if (model_version.get() == "v1") {
- file_version = mllm::ModelFileVersion::kV1;
- } else if (model_version.get() == "v2") {
- file_version = mllm::ModelFileVersion::kV2;
- }
+ mllm::ModelFileVersion file_version;
+ if (model_version.get() == "v1") {
+ file_version = mllm::ModelFileVersion::kV1;
+ } else if (model_version.get() == "v2") {
+ file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print("Unsupported --model_version: {}\n", model_version.get());
+ return 1;
+ }
  std::string audio_path;
  std::string prompt_text;

  fmt::print("Audio path (or 'exit/quit'): ");
  //std::getline(std::cin, audio_path);
  //if (audio_path == "exit" || audio_path == "quit") { return 0; }
  audio_path = "";

  fmt::print("Prompt text: ");
  //std::getline(std::cin, prompt_text);
  //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
  prompt_text = "";
Enable interactive input for the example to work as intended.
The example currently has input hardcoded to empty strings, causing convertAudioMessage() to fail. When an empty audio_path is passed to processAudioFile(), it attempts to read from an empty path via mllm::audio::readWAV(""), which returns empty audio data. This triggers the error check if (audio_result.input_features.isNil() || audio_result.feature_length <= 0) and exits with MLLM_ERROR_EXIT(ExitCode::kIOError, "Failed to extract audio features for Qwen2.5-Omni.").
Restore the commented std::getline() calls to enable interactive input, or wire the paths to CLI arguments.
Suggested fix (restore interactive input)
- fmt::print("Audio path (or 'exit/quit'): ");
- //std::getline(std::cin, audio_path);
- //if (audio_path == "exit" || audio_path == "quit") { return 0; }
- audio_path = "";
-
- fmt::print("Prompt text: ");
- //std::getline(std::cin, prompt_text);
- //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
- prompt_text = "";
+ fmt::print("Audio path (or 'exit/quit'): ");
+ std::getline(std::cin, audio_path);
+ if (audio_path == "exit" || audio_path == "quit") { return 0; }
+
+ fmt::print("Prompt text: ");
+ std::getline(std::cin, prompt_text);
+ if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
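The "wire the paths to CLI arguments" alternative could be sketched like this. The argument positions, the struct, and the default prompt are assumptions for illustration, not part of the example's current interface:

```cpp
#include <string>

struct ExampleArgs {
  std::string audio_path;
  std::string prompt_text;
};

// Read <audio_path> [prompt_text] from argv so the example also works in
// non-interactive, scripted runs.
ExampleArgs readArgs(int argc, char** argv) {
  ExampleArgs args;
  if (argc > 1) args.audio_path = argv[1];
  if (argc > 2) args.prompt_text = argv[2];
  if (args.prompt_text.empty()) args.prompt_text = "Please describe the audio.";
  return args;
}
```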
| "enable_audio_output": true, | ||
| "enable_talker": true, | ||
| "model_type": "qwen2_5_omni", |
Remove or set enable_talker and enable_audio_output to false—these flags are currently unused.
These flags are defined in the config but never parsed or checked in the model code. Setting them to true misleads users about available functionality since only the thinker (text output) is currently loaded and used. To avoid confusion, either remove these flags or set them to false until talker and audio output support are actually implemented.
Suggested change
- "enable_audio_output": true,
- "enable_talker": true,
+ "enable_audio_output": false,
+ "enable_talker": false,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "enable_audio_output": true, | |
| "enable_talker": true, | |
| "model_type": "qwen2_5_omni", | |
| "enable_audio_output": false, | |
| "enable_talker": false, | |
| "model_type": "qwen2_5_omni", |
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail unexpectedly. Consider rejecting unsupported values.
Suggested fix
- mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
- if (model_version.get() == "v1") {
- file_version = mllm::ModelFileVersion::kV1;
- } else if (model_version.get() == "v2") {
- file_version = mllm::ModelFileVersion::kV2;
- }
+ mllm::ModelFileVersion file_version;
+ if (model_version.get() == "v1") {
+ file_version = mllm::ModelFileVersion::kV1;
+ } else if (model_version.get() == "v2") {
+ file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print("Unsupported --model_version: {}\n", model_version.get());
+ return 1;
+ }
| fmt::print("Image path (or 'exit/quit'): "); | ||
| image_path = ""; | ||
| //std::getline(std::cin, image_path); | ||
| if (image_path == "exit" || image_path == "quit") { return 0; } | ||
|
|
||
| fmt::print("Prompt text: "); | ||
| prompt_text = ""; | ||
| //std::getline(std::cin, prompt_text); | ||
|
|
Uncomment the std::getline() calls to restore interactive input.
The example prints prompts for user input but then hardcodes empty strings and has the std::getline() calls commented out. This makes the example non-functional since it will always attempt to process empty image paths and prompts. Restore the interactive input handling by uncommenting lines 54 and 59 (the std::getline() calls).
Suggested fix
- fmt::print("Image path (or 'exit/quit'): ");
- image_path = "";
- //std::getline(std::cin, image_path);
- if (image_path == "exit" || image_path == "quit") { return 0; }
-
- fmt::print("Prompt text: ");
- prompt_text = "";
- //std::getline(std::cin, prompt_text);
+ fmt::print("Image path (or 'exit/quit'): ");
+ std::getline(std::cin, image_path);
+ if (image_path == "exit" || image_path == "quit") { return 0; }
+
+ fmt::print("Prompt text: ");
+ std::getline(std::cin, prompt_text);
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail unexpectedly. Consider rejecting unsupported values.
Suggested fix

```diff
-  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
-  if (model_version.get() == "v1") {
-    file_version = mllm::ModelFileVersion::kV1;
-  } else if (model_version.get() == "v2") {
-    file_version = mllm::ModelFileVersion::kV2;
-  }
+  mllm::ModelFileVersion file_version;
+  if (model_version.get() == "v1") {
+    file_version = mllm::ModelFileVersion::kV1;
+  } else if (model_version.get() == "v2") {
+    file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported --model_version: {}\n", model_version.get());
+    return 1;
+  }
```
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/text_infer.cpp` around lines 24 - 29, The code silently
defaults to mllm::ModelFileVersion::kV1 when model_version is unknown; change
this to validate the input from model_version.get() and reject unsupported
values instead of defaulting. In the block that sets file_version (referencing
model_version.get(), file_version, and mllm::ModelFileVersion), check for "v1"
and "v2" explicitly and if neither matches, print a clear error mentioning the
unsupported model_version and terminate (e.g., return/exit with non-zero or
throw) so the program fails fast rather than loading the wrong format.
```cpp
Tensor attn;
if (key_states.dtype() == kFloat32) {
  attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
  attn = mask_(attn);
  attn = softmax_(attn);
} else if (key_states.dtype() == kFloat16) {
  attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
  attn = mask_(attn);
  attn = softmax_(attn);
  attn = attn.to(kFloat16);
}

auto output = nn::functional::matmul(attn, value_states);
output = output.transpose(1, 2).view({B, S, num_attention_heads_ * head_dim_});
output = o_proj_(output);
return {output};
}
```
Missing fallback for unsupported dtype in attention computation.
If key_states.dtype() is neither kFloat32 nor kFloat16, the attn tensor remains uninitialized (line 942 declares it as Tensor attn;), and the subsequent nn::functional::matmul(attn, value_states) on line 954 would cause undefined behavior.
🛠️ Suggested fix

```diff
 if (key_states.dtype() == kFloat32) {
   attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
   attn = mask_(attn);
   attn = softmax_(attn);
 } else if (key_states.dtype() == kFloat16) {
   attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
   attn = mask_(attn);
   attn = softmax_(attn);
   attn = attn.to(kFloat16);
+} else {
+  MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unsupported dtype for attention computation.");
 }
```
🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 942 - 958, The
code leaves attn uninitialized when key_states.dtype() is neither kFloat32 nor
kFloat16; add a fallback else branch in the attention block (around the
key_states.dtype() checks) that handles other dtypes by casting query_states and
key_states to kFloat32, computing attn = nn::functional::matmul(...)* (1.f /
sqrtf(head_dim_)), applying mask_ and softmax_, and then casting attn back to
the original key_states.dtype() if needed (or alternatively throw a clear
runtime_error mentioning unsupported dtype) so that attn is always initialized
before using nn::functional::matmul(attn, value_states).
```cpp
auto grid_t = grid_thw.ptr<int32_t>()[0];
auto grid_h = grid_thw.ptr<int32_t>()[1];
auto grid_w = grid_thw.ptr<int32_t>()[2];
int32_t img_token_nums = grid_t * grid_h * grid_w;
img_token_nums /= (spatial_merge_size_ * spatial_merge_size_);

auto image_token_id = bpe_._lookup_vocab(L"<|IMAGE|>");
{
  auto it = std::find(ids.begin(), ids.end(), image_token_id);
  if (it == ids.end()) {
    MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing <|IMAGE|> token in Qwen2.5-Omni prompt template.");
  }
  ids.insert(it + 1, img_token_nums - 1, image_token_id);
}
```
Add validation for image token count before insertion.
Unlike convertAudioMessage which validates audio_token_nums > 0 (lines 347-349), convertVisionMessage doesn't validate img_token_nums before the ids.insert() call. If grid dimensions somehow result in zero or negative values, inserting img_token_nums - 1 tokens would cause undefined behavior.
🛠️ Suggested fix

```diff
 int32_t img_token_nums = grid_t * grid_h * grid_w;
 img_token_nums /= (spatial_merge_size_ * spatial_merge_size_);
+if (img_token_nums <= 0) {
+  MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid image token count for Qwen2.5-Omni.");
+}
 auto image_token_id = bpe_._lookup_vocab(L"<|IMAGE|>");
```
🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp` around lines 303 - 316,
The code computes img_token_nums from grid_thw and then inserts tokens without
validating it; add a check (similar to convertAudioMessage's audio_token_nums >
0) after computing img_token_nums and before ids.insert to ensure img_token_nums
> 0, and on failure call MLLM_ERROR_EXIT with a clear message (e.g., "Invalid
image token count") so you don't call ids.insert(img_token_nums - 1, ...) with
zero/negative counts; locate symbols grid_thw, img_token_nums, image_token_id,
ids.insert and the surrounding convertVisionMessage logic to make the change.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@mllm/backends/cpu/ops/ConvTranspose1DOp.hpp`:
- Around line 11-22: Add concise doc comments above the CPUConvTranspose1DOp
class and CPUConvTranspose1DOpFactory to document their roles and expected
tensor shapes/semantics: describe that CPUConvTranspose1DOp implements the CPU
backend of aops::ConvTranspose1DOp, list expected input tensors (e.g., input,
weight, optional bias), expected output shape behavior (how output length is
computed from stride/padding/dilation/output_padding), and any preconditions
(memory layout, dtype). Also add a brief comment for CPUConvTranspose1DOpFactory
explaining it constructs CPUConvTranspose1DOp instances from
aops::ConvTranspose1DOpOptions so readers know this factory ties the OpOptions
to the CPU implementation.
In `@mllm/core/aops/ConvTranspose1DOp.cpp`:
- Around line 17-23: Validate options_.groups before any division/modulo in
ConvTranspose1DOp::load and ConvTranspose1DOp::reshape by checking it is > 0 and
that it cleanly divides the relevant channel counts (e.g., options_.out_channels
% options_.groups == 0 and any other channel dimension used with groups is
divisible). If a check fails, return/raise a clear error (e.g., throw
std::invalid_argument or use the existing error/reporting mechanism) with a
message referencing the op name and invalid group value. Update the logic around
weight_ = weight_.view(...) in load() and the corresponding shape calculations
in reshape() to assume groups is validated so the divisions/modulo are safe.
Ensure tests or callers expecting validation get a deterministic error rather
than UB/crash.
- Around line 72-81: The code must validate computed seq_out and option
constraints before allocating the output tensor: compute seq_out as currently
done, then check that seq_out > 0 and that options_.output_padding <
options_.stride and that options_.padding >= 0, options_.dilation >= 1,
options_.kernel_size > 0 and options_.stride > 0; if any check fails,
return/raise a clear configuration error (e.g., throw std::invalid_argument or
use the project’s error/reporting helper) instead of calling
outputs.emplace_back(Tensor::empty(...)); place these checks immediately after
the seq_out calculation and before the outputs.emplace_back line in
ConvTranspose1DOp (referencing seq_out, options_.output_padding,
options_.stride, options_.padding, options_.dilation, options_.kernel_size).
In `@mllm/core/aops/ConvTranspose1DOp.hpp`:
- Around line 12-49: Add API documentation comments for the new public types:
place a brief docstring above the struct ConvTranspose1DOpOptions describing its
purpose (options for 1D transposed convolution), and document each field
(in_channels, out_channels, kernel_size, stride, padding, output_padding,
dilation, groups, bias) including units, valid ranges/constraints (e.g.,
positive ints, output_padding < max(stride, dilation), groups divides
in_channels/out_channels, kernel_size > 0), and default meanings; also add a
short doc comment above the ConvTranspose1DOp class describing its role,
lifecycle methods (load, trace, forward, reshape, setup, getParams), and
invariants about weight_ and bias_ (shapes derived from options, bias present
only if options.bias is true). Ensure comments reference the exact symbols
ConvTranspose1DOpOptions, ConvTranspose1DOp, weight_, bias_ so callers can
locate the documented behavior.
In `@mllm/core/aops/TanhOp.hpp`:
- Around line 11-30: Add API doc comments for TanhOp and TanhOpOptions: place a
brief class-level comment above struct TanhOpOptions and class TanhOp describing
the operation's purpose (element-wise hyperbolic tangent), expected inputs
(single tensor of arbitrary shape), outputs (single tensor with same shape and
dtype), and any error/shape expectations (e.g., require one input tensor,
matching output count; throw or assert on incorrect counts). Also add short
comments on the public methods load, trace, forward, reshape, setup and the
options() accessor indicating their roles (e.g., load: load parameters from
ParameterFile; forward: compute element-wise tanh; reshape/setup: validate and
set output shapes; trace: record shapes/metadata) and mention any preconditions
or postconditions (shape invariants), following existing doc style in the
codebase.
🧹 Nitpick comments (4)
mllm/backends/cpu/ops/ConvTranspose1DOp.cpp (2)
48-48: Unnecessary zero-initialization of output. The output buffer is zero-filled here, but the subsequent loop at line 86 directly assigns `output_ptr[output_idx] = sum;` rather than accumulating. Each output element is computed completely before being written, making this zero-fill redundant overhead.

♻️ Suggested fix

```diff
-  std::fill_n(output_ptr, output.numel(), 0.0f);
+  // No zero-fill needed: each output element is fully computed before assignment
```
52-90: Redundant dtype switch after assertion. The `switch (output.dtype())` at line 52 is redundant since `MLLM_RT_ASSERT_EQ(output.dtype(), kFloat32)` at line 39 already guarantees the dtype. Consider removing the switch or, if future dtype support is planned, removing the assertion.

♻️ Suggested simplification

```diff
-  switch (output.dtype()) {
-    case kFloat32:
-      MLLM_CONDITIONAL_PARALLEL_FOR(options_.getThreads() > 1, 4, idx, 0, total_iterations, 1, {
+  MLLM_CONDITIONAL_PARALLEL_FOR(options_.getThreads() > 1, 4, idx, 0, total_iterations, 1, {
     // ... loop body unchanged ...
-      });
-      break;
-    default: NYI("ConvTranspose1D: unsupported data type");
-  }
+  });
```

tests/cpu/ConvTranspose1DKernelTest.hpp (1)

114-133: Redundant variable extraction in failure path. Lines 117-128 extract all config values into local variables, but these are only used in the `print()` call. Consider simplifying by passing the config directly or just printing the failing config map.

♻️ Simplified failure handling

```diff
 bool testConvTranspose1D(const std::vector<std::unordered_map<std::string, int32_t>>& cfgs) {
   for (auto& cfg : cfgs) {
     if (!testConvTranspose1DOnce(cfg)) {
-      auto batch = cfg.at("batch");
-      auto in_channel = cfg.at("in_channel");
-      auto out_channel = cfg.at("out_channel");
-      auto sequence = cfg.at("sequence");
-      auto kernel_size = cfg.at("kernel_size");
-      auto stride = cfg.at("stride");
-      auto padding = cfg.at("padding");
-      auto output_padding = cfg.at("output_padding");
-      auto dilation = cfg.at("dilation");
-      auto groups = cfg.at("groups");
-      auto bias = cfg.at("bias");
-      print(batch, in_channel, out_channel, sequence, kernel_size, stride, padding, output_padding, dilation, groups, bias);
+      // Print failing configuration for debugging
+      for (const auto& [key, value] : cfg) {
+        print(key, "=", value);
+      }
       return false;
     }
   }
   return true;
 }
```

mllm/nn/layers/Tanh.hpp (1)
11-18: Add a brief class doc comment. Public APIs should have a short description of purpose/inputs/outputs. As per coding guidelines, please add a brief doc comment for the Tanh layer.
```cpp
class CPUConvTranspose1DOp final : public aops::ConvTranspose1DOp {
 public:
  explicit CPUConvTranspose1DOp(const aops::ConvTranspose1DOpOptions& options);

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
};

class CPUConvTranspose1DOpFactory : public TypedOpFactory<OpTypes::kConvTranspose1D, aops::ConvTranspose1DOpOptions> {
 public:
  std::shared_ptr<BaseOp> createOpImpl(const aops::ConvTranspose1DOpOptions& options) override {
    return std::make_shared<CPUConvTranspose1DOp>(options);
  }
```
Add brief docs for CPU op/factory.
Please add short comments describing the CPU op’s role and any expectations about inputs/outputs to keep backend APIs self-documenting. As per coding guidelines, public APIs should include clear docstrings/comments.
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/ops/ConvTranspose1DOp.hpp` around lines 11 - 22, Add
concise doc comments above the CPUConvTranspose1DOp class and
CPUConvTranspose1DOpFactory to document their roles and expected tensor
shapes/semantics: describe that CPUConvTranspose1DOp implements the CPU backend
of aops::ConvTranspose1DOp, list expected input tensors (e.g., input, weight,
optional bias), expected output shape behavior (how output length is computed
from stride/padding/dilation/output_padding), and any preconditions (memory
layout, dtype). Also add a brief comment for CPUConvTranspose1DOpFactory
explaining it constructs CPUConvTranspose1DOp instances from
aops::ConvTranspose1DOpOptions so readers know this factory ties the OpOptions
to the CPU implementation.
```cpp
void ConvTranspose1DOp::load(const ParameterFile::ptr_t& ploader) {
  switch (ploader->version()) {
    case ModelFileVersion::kV1: {
      weight_ = ploader->pull(getName() + ".weight");
      if (options_.bias) { bias_ = ploader->pull(getName() + ".bias"); }
      weight_ = weight_.view({options_.in_channels, options_.out_channels / options_.groups, options_.kernel_size});
      if (options_.bias) { bias_ = bias_.view({options_.out_channels}); }
```
Guard against invalid groups before division/modulo.
options_.groups is used in division/modulo in both load() and reshape(). If it’s 0 (or incompatible with channels), this will crash or corrupt shapes. Add explicit validation before use. As per coding guidelines, validate inputs for public APIs.
🔧 Suggested fix

```diff
 void ConvTranspose1DOp::load(const ParameterFile::ptr_t& ploader) {
+  if (options_.groups <= 0) {
+    MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp groups must be > 0");
+  }
+  MLLM_RT_ASSERT_EQ(options_.out_channels % options_.groups, 0);
   switch (ploader->version()) {
     case ModelFileVersion::kV1: {
       weight_ = ploader->pull(getName() + ".weight");
       if (options_.bias) { bias_ = ploader->pull(getName() + ".bias"); }
       weight_ = weight_.view({options_.in_channels, options_.out_channels / options_.groups, options_.kernel_size});
       if (options_.bias) { bias_ = bias_.view({options_.out_channels}); }
       break;
     }
@@
 void ConvTranspose1DOp::reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) {
   const auto& i = inputs[0];
   const auto& ishape = i.shape();
   if (ishape.size() != 3) {
     MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp expects 3D input, got {} D", ishape.size());
     outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
     return;
   }
+  if (options_.groups <= 0) {
+    MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp groups must be > 0");
+    outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+    return;
+  }
   const int batch = ishape[0];
   const int in_channels = ishape[1];
   const int sequence = ishape[2];
   MLLM_RT_ASSERT_EQ(in_channels, options_.in_channels);
   MLLM_RT_ASSERT_EQ(in_channels % options_.groups, 0);
   MLLM_RT_ASSERT_EQ(options_.out_channels % options_.groups, 0);
```

Also applies to: 68-71
🤖 Prompt for AI Agents
In `@mllm/core/aops/ConvTranspose1DOp.cpp` around lines 17 - 23, Validate
options_.groups before any division/modulo in ConvTranspose1DOp::load and
ConvTranspose1DOp::reshape by checking it is > 0 and that it cleanly divides the
relevant channel counts (e.g., options_.out_channels % options_.groups == 0 and
any other channel dimension used with groups is divisible). If a check fails,
return/raise a clear error (e.g., throw std::invalid_argument or use the
existing error/reporting mechanism) with a message referencing the op name and
invalid group value. Update the logic around weight_ = weight_.view(...) in
load() and the corresponding shape calculations in reshape() to assume groups is
validated so the divisions/modulo are safe. Ensure tests or callers expecting
validation get a deterministic error rather than UB/crash.
```cpp
const int kernel_size = options_.kernel_size;
const int stride = options_.stride;
const int dilation = options_.dilation;
const int padding = options_.padding;
const int output_padding = options_.output_padding;

const int seq_out = (sequence - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1;

auto new_shape = std::vector<int32_t>{batch, options_.out_channels, seq_out};
outputs.emplace_back(Tensor::empty(new_shape, i.dtype(), i.device()));
```
Validate output size and output_padding constraints.
seq_out can become <= 0 for invalid parameters (e.g., negative padding/dilation, output_padding ≥ stride). Right now this creates an invalid tensor. Add explicit checks before allocation to fail fast and surface configuration errors. As per coding guidelines, validate inputs for public APIs.
🔧 Suggested fix

```diff
 const int kernel_size = options_.kernel_size;
 const int stride = options_.stride;
 const int dilation = options_.dilation;
 const int padding = options_.padding;
 const int output_padding = options_.output_padding;
+if (kernel_size <= 0 || stride <= 0 || dilation <= 0 || padding < 0 || output_padding < 0 || output_padding >= stride) {
+  MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid ConvTranspose1D options");
+  outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+  return;
+}
 const int seq_out = (sequence - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1;
+if (seq_out <= 0) {
+  MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1D produces non-positive output length");
+  outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+  return;
+}
 auto new_shape = std::vector<int32_t>{batch, options_.out_channels, seq_out};
 outputs.emplace_back(Tensor::empty(new_shape, i.dtype(), i.device()));
```

🤖 Prompt for AI Agents
In `@mllm/core/aops/ConvTranspose1DOp.cpp` around lines 72 - 81, The code must
validate computed seq_out and option constraints before allocating the output
tensor: compute seq_out as currently done, then check that seq_out > 0 and that
options_.output_padding < options_.stride and that options_.padding >= 0,
options_.dilation >= 1, options_.kernel_size > 0 and options_.stride > 0; if any
check fails, return/raise a clear configuration error (e.g., throw
std::invalid_argument or use the project’s error/reporting helper) instead of
calling outputs.emplace_back(Tensor::empty(...)); place these checks immediately
after the seq_out calculation and before the outputs.emplace_back line in
ConvTranspose1DOp (referencing seq_out, options_.output_padding,
options_.stride, options_.padding, options_.dilation, options_.kernel_size).
```cpp
struct ConvTranspose1DOpOptions : public BaseOpOptions<ConvTranspose1DOpOptions> {
  int32_t in_channels;
  int32_t out_channels;
  int32_t kernel_size;
  int32_t stride = 1;
  int32_t padding = 0;
  int32_t output_padding = 0;
  int32_t dilation = 1;
  int32_t groups = 1;
  bool bias = true;
};

class ConvTranspose1DOp : public BaseOp {
 public:
  explicit ConvTranspose1DOp(const ConvTranspose1DOpOptions& options);

  void load(const ParameterFile::ptr_t& ploader) override;

  void trace(void* trace_context, const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  ParameterFile::ptr_t getParams() override;

  inline Tensor& weight() { return weight_; }

  inline Tensor& bias() { return bias_; }

  inline ConvTranspose1DOpOptions& options() { return options_; }

 protected:
  Tensor weight_;
  Tensor bias_;
  ConvTranspose1DOpOptions options_;
```
Add API docs for ConvTranspose1D options and op.
These are new public types; please add brief comments covering purpose, parameter meaning (channels/stride/padding/output_padding/groups/bias), and invariants so callers know expected constraints. As per coding guidelines, public APIs should include clear docstrings/comments.
🤖 Prompt for AI Agents
In `@mllm/core/aops/ConvTranspose1DOp.hpp` around lines 12 - 49, Add API
documentation comments for the new public types: place a brief docstring above
the struct ConvTranspose1DOpOptions describing its purpose (options for 1D
transposed convolution), and document each field (in_channels, out_channels,
kernel_size, stride, padding, output_padding, dilation, groups, bias) including
units, valid ranges/constraints (e.g., positive ints, output_padding <
max(stride, dilation), groups divides in_channels/out_channels, kernel_size >
0), and default meanings; also add a short doc comment above the
ConvTranspose1DOp class describing its role, lifecycle methods (load, trace,
forward, reshape, setup, getParams), and invariants about weight_ and bias_
(shapes derived from options, bias present only if options.bias is true). Ensure
comments reference the exact symbols ConvTranspose1DOpOptions,
ConvTranspose1DOp, weight_, bias_ so callers can locate the documented behavior.
```cpp
struct TanhOpOptions : public BaseOpOptions<TanhOpOptions> {};

class TanhOp : public BaseOp {
 public:
  explicit TanhOp(const TanhOpOptions& options);

  void load(const ParameterFile::ptr_t& ploader) override;

  void trace(void* trace_context, const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  inline TanhOpOptions& options() { return options_; }

 protected:
  TanhOpOptions options_;
```
Add API docs for TanhOp.
Please add brief doc comments describing purpose, inputs/outputs, and any error/shape expectations. As per coding guidelines, public APIs should include clear docstrings/comments.
🤖 Prompt for AI Agents
In `@mllm/core/aops/TanhOp.hpp` around lines 11 - 30, Add API doc comments for
TanhOp and TanhOpOptions: place a brief class-level comment above struct
TanhOpOptions and class TanhOp describing the operation's purpose (element-wise
hyperbolic tangent), expected inputs (single tensor of arbitrary shape), outputs
(single tensor with same shape and dtype), and any error/shape expectations
(e.g., require one input tensor, matching output count; throw or assert on
incorrect counts). Also add short comments on the public methods load, trace,
forward, reshape, setup and the options() accessor indicating their roles (e.g.,
load: load parameters from ParameterFile; forward: compute element-wise tanh;
reshape/setup: validate and set output shapes; trace: record shapes/metadata)
and mention any preconditions or postconditions (shape invariants), following
existing doc style in the codebase.
Actionable comments posted: 13
♻️ Duplicate comments (3)
examples/qwen2_5omni/image_infer.cpp (2)
24-29: ⚠️ Potential issue | 🟡 Minor

Silent fallback to v1 on unknown `--model_version` value. Unknown values are silently treated as v1, which could load models with the wrong format. Add validation.

🐛 Proposed fix

```diff
 mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
 if (model_version.get() == "v1") {
   file_version = mllm::ModelFileVersion::kV1;
 } else if (model_version.get() == "v2") {
   file_version = mllm::ModelFileVersion::kV2;
+} else {
+  MLLM_ERROR_EXIT(mllm::ExitCode::kCoreError, "Unsupported --model_version: '{}'. Use 'v1' or 'v2'.", model_version.get());
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen2_5omni/image_infer.cpp` around lines 24 - 29, The current logic silently treats unknown model_version values as mllm::ModelFileVersion::kV1; update the validation around model_version.get() so only "v1" and "v2" are accepted: if the value is "v1" or "v2" set file_version accordingly (mllm::ModelFileVersion::kV1/kV2), otherwise log an explicit error mentioning the invalid model_version and terminate (or return a non-zero code) so we don't accidentally load the wrong format; reference the variables model_version and file_version and the enum mllm::ModelFileVersion in your change.
52-59: ⚠️ Potential issue | 🟠 Major

Hardcoded paths and commented-out `std::getline()` make this example non-interactive. The user-input prompts are printed but then ignored — `image_path` and `prompt_text` are hardcoded. Uncomment the `std::getline` calls (or make the hardcoded values CLI arguments) before merging.

🐛 Proposed fix

```diff
   fmt::print("Image path (or 'exit/quit'): ");
-  image_path = "../../rsc/pics.jpg";
-  //std::getline(std::cin, image_path);
+  std::getline(std::cin, image_path);
   if (image_path == "exit" || image_path == "quit") { return 0; }
   fmt::print("Prompt text: ");
-  prompt_text = "描述图片中物体";
-  //std::getline(std::cin, prompt_text);
+  std::getline(std::cin, prompt_text);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen2_5omni/image_infer.cpp` around lines 52 - 59, The example currently prints prompts but ignores user input because image_path and prompt_text are hardcoded; revert to interactive mode by uncommenting the std::getline(std::cin, image_path) and std::getline(std::cin, prompt_text) calls (or alternatively parse them from CLI args) so that the printed prompts actually collect user input; ensure the check using image_path == "exit" || image_path == "quit" still runs after reading input and that prompt_text is used when calling the inference routines in this file.

mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (1)

**1014-1030:** ⚠️ Potential issue | 🔴 Critical

**Missing fallback for unsupported dtype in attention computation.**

If `key_states.dtype()` is neither `kFloat32` nor `kFloat16`, `attn` remains uninitialized and `nn::functional::matmul(attn, value_states)` on line 1026 invokes undefined behavior.

🐛 Proposed fix

```diff
   } else if (key_states.dtype() == kFloat16) {
     attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
     attn = mask_(attn);
     attn = softmax_(attn);
     attn = attn.to(kFloat16);
+  } else {
+    MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unsupported dtype for attention computation.");
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 1014 - 1030, The attention path lacks a fallback when key_states.dtype() is neither kFloat32 nor kFloat16, leaving attn uninitialized and causing undefined behavior in nn::functional::matmul(attn, value_states); add an else branch in the attention block that handles other dtypes by converting query_states and key_states to kFloat32, computing attn the same way (mask_ and softmax_ applied) and then converting back if needed, or alternatively throw a clear runtime_error; update the code around key_states, query_states, attn, head_dim_, mask_, softmax_, and the subsequent matmul/output steps so attn is always initialized before use.
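The uninitialized-`attn` hazard above comes down to a non-exhaustive dtype dispatch. The defensive pattern the fix calls for can be sketched independently of the mllm API — the enum and function below are illustrative stand-ins, not the project's actual types:

```cpp
#include <stdexcept>
#include <string>

// Illustrative stand-in for mllm's dtype enum.
enum class DType { kFloat32, kFloat16, kInt8 };

// Dispatch that can never fall through with state left uninitialized:
// every dtype either selects a compute path or raises a clear error.
std::string attentionPathFor(DType dt) {
  switch (dt) {
    case DType::kFloat32: return "fp32";
    case DType::kFloat16: return "fp16-upcast";  // compute in fp32, cast back
    default:
      throw std::runtime_error("Unsupported dtype for attention computation.");
  }
}
```

The point is the `default` branch: unknown dtypes fail loudly at the dispatch site instead of propagating garbage through the matmul.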
🧹 Nitpick comments (15)
mllm/backends/cpu/ops/EmbeddingOp.cpp (1)
**29-29:** `static` warning flag is process-wide — all embedding instances share a single suppression flag.

`static std::atomic<bool>` lives for the entire process lifetime and is shared across every `CPUEmbeddingOp` instance. In a multimodal model like Qwen2.5Omni, there are multiple independent embedding layers (text, audio, vision). Once any one of them fires an OOB, all subsequent OOB events in every other embedding layer are silently zero-filled with no diagnostic output. Moving the flag to an instance member variable gives each embedding layer its own one-time warning, preserving the "no spam" intent while retaining full observability per layer.

♻️ Proposed refactor: instance-level warning flag

In `EmbeddingOp.hpp` (or wherever `CPUEmbeddingOp` is declared), add the member:

```diff
+  std::atomic<bool> warned_token_oob_{false};
```

Then in `EmbeddingOp.cpp`, replace the static local:

```diff
-  static std::atomic<bool> warned_token_oob{false};
```

and update the CAS site:

```diff
-  if (warned_token_oob.compare_exchange_strong(expected, true)) {
+  if (warned_token_oob_.compare_exchange_strong(expected, true)) {
```

As per coding guidelines: "Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors" — a shared static flag can suppress warnings from distinct layers, reducing the signal quality of those logs.
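The per-instance behavior the refactor aims for can be demonstrated in isolation; the class and method names below are illustrative, not the actual `CPUEmbeddingOp` interface:

```cpp
#include <atomic>
#include <cstdio>

// Sketch of a per-instance one-shot warning flag, mirroring the suggested
// warned_token_oob_ member.
class EmbeddingLayer {
 public:
  // Returns true only the first time an OOB event is reported on *this* layer.
  bool reportOutOfBounds() {
    bool expected = false;
    if (warned_token_oob_.compare_exchange_strong(expected, true)) {
      std::fprintf(stderr, "warning: token id out of bounds (reported once per layer)\n");
      return true;
    }
    return false;  // already warned on this instance; stay quiet
  }

 private:
  std::atomic<bool> warned_token_oob_{false};
};
```

With the member flag, a warning fired by the text embedding no longer silences the audio and vision embeddings.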
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/backends/cpu/ops/EmbeddingOp.cpp` at line 29, The process-wide static warned_token_oob suppresses OOB warnings across all embedding layers; make it an instance member on CPUEmbeddingOp (e.g., add std::atomic<bool> warned_token_oob_{false}; to the CPUEmbeddingOp declaration in EmbeddingOp.hpp) and update uses in EmbeddingOp.cpp to reference the instance field instead of the static local (replace the static std::atomic<bool> warned_token_oob and the CAS site to use this->warned_token_oob_ or equivalent), so each embedding instance emits its own one-time OOB warning.

mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py (2)
**112-136:** Each checkpoint is loaded twice — doubles I/O for large models

`_collect_source_meta` loads the full checkpoint to gather metadata and then discards the state dict. `_stream_source_tensors` reloads the same file from disk during the write pass. For multi-GB model files this can double overall runtime.

A straightforward mitigation is to return the state dict from `_collect_source_meta` (or accept it as an optional parameter) and pass it directly to `_stream_source_tensors`, trading memory for I/O only when both checkpoints can fit in RAM. Alternatively, a single-pass approach can collect metadata lazily the first time each tensor is encountered during streaming.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py` around lines 112 - 136, _collect_source_meta currently loads the entire checkpoint then _stream_source_tensors reloads it, doubling I/O for large files; modify _collect_source_meta to optionally return the loaded state dict (or accept a preloaded state dict param) and update its callers to pass that state into _stream_source_tensors so the file is only read once, or alternatively refactor to a single-pass lazy stream where _stream_source_tensors queries metadata on first access; update the function signatures for _collect_source_meta and _stream_source_tensors and all call sites to propagate the in-memory state dict when available and ensure gc.collect() only runs when the state is truly discarded.
**266-268:** Pre-allocation creates a large in-memory zero buffer

`b"\x00" * reserved_bytes` allocates up to several MB on the Python heap (532 + `max_params` × 352 bytes). A seek-then-write approach avoids this allocation and produces a sparse file on most filesystems:

♻️ Proposed refactor

```diff
-        reserved_bytes = MODEL_FILE_V2_DESC_SIZE + max_params * MODEL_FILE_V2_PARAM_DESC_SIZE
-        self.f.write(b"\x00" * reserved_bytes)
-        self.f.flush()
+        reserved_bytes = MODEL_FILE_V2_DESC_SIZE + max_params * MODEL_FILE_V2_PARAM_DESC_SIZE
+        self.f.seek(reserved_bytes - 1, os.SEEK_SET)
+        self.f.write(b"\x00")
+        self.f.flush()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py` around lines 266 - 268, The current code builds a large in-memory zero buffer via b"\x00" * reserved_bytes and writes it (reserved_bytes is computed from MODEL_FILE_V2_DESC_SIZE and MODEL_FILE_V2_PARAM_DESC_SIZE), which can allocate megabytes; instead, replace the allocation+write with a seek-based sparse-file creation: use self.f.seek(reserved_bytes - 1, os.SEEK_SET) and then write a single b"\x00" and flush — this avoids allocating the full zero buffer and produces a sparse file while preserving the reserved region semantics used around reserved_bytes, MODEL_FILE_V2_DESC_SIZE, and MODEL_FILE_V2_PARAM_DESC_SIZE.

mllm/models/minicpm_o45/modeling_minicpm_o45.hpp (1)
**150-187:** Duplicated weight-norm materialization — `token2wav_weight_norm.hpp` already provides `_materializeWeightNorm`.

`MiniCPMO45TTS::loadFromParameter` reimplements the same per-row normalization logic as `_materializeWeightNorm` in `token2wav_weight_norm.hpp`. Additionally, the epsilon handling differs slightly: here it's `if (norm < kEps) { norm = kEps; }` (clamps the norm), while the other uses `std::sqrt(std::max(norm_sq, kEps))` (clamps before sqrt). This inconsistency could cause subtle numerical differences.

Consider reusing `_materializeWeightNorm` or extracting a shared helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/modeling_minicpm_o45.hpp` around lines 150 - 187, MiniCPMO45TTS::loadFromParameter duplicates per-row weight-norm materialization already implemented as _materializeWeightNorm in token2wav_weight_norm.hpp; replace the manual loop that builds head_code_weight_ with a call to the shared _materializeWeightNorm (or factor out a small helper used by both places) so normalization and epsilon handling are identical, i.e., consume g and v tensors, produce the per-row scaled weight Tensor and push into head_code_weight_, ensuring the same kEps/clamping semantics as _materializeWeightNorm.

mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp (3)
**1340-1351:** `std::pow(2, i)` for integer powers — prefer bit shifts for exact results.

`std::pow` returns `double` and is subject to floating-point precision issues. For computing powers of 2, bit shifts are exact and faster.

Proposed fix (example for line 1347-1348)

```diff
-    const int32_t in_ch = cfg.base_channels / static_cast<int32_t>(std::pow(2, i));
-    const int32_t out_ch = cfg.base_channels / static_cast<int32_t>(std::pow(2, i + 1));
+    const int32_t in_ch = cfg.base_channels >> i;
+    const int32_t out_ch = cfg.base_channels >> (i + 1);
```

Apply the same pattern to the other `std::pow(2, ...)` usages in lines 1359, 1372, and 1379.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp` around lines 1340 - 1351, The use of std::pow(2, i) in HiFTGenerator constructor (e.g., computing in_ch and out_ch) should be replaced with integer bit-shifts to avoid floating-point imprecision and improve performance: compute powers of two as (1 << i) (and (1 << (i+1))) and ensure the result is cast to int32_t where needed; apply the same replacement for the other occurrences referenced in the diff (the uses at the spots corresponding to lines around 1359, 1372, and 1379) so all exponent-of-two calculations use bit shifts rather than std::pow.
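The shift form of the division above is exact integer arithmetic, while `std::pow` rounds through `double`. A minimal sketch of the replacement (`channelsAt` is a hypothetical helper name for illustration):

```cpp
#include <cstdint>

// base_channels / 2^i computed with an exact right shift (valid for
// non-negative i smaller than the bit width of int32_t).
int32_t channelsAt(int32_t base_channels, int32_t i) {
  return base_channels >> i;
}
```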
**1207-1213:** New RNG created on every `f02sine` call — inconsistent with the `thread_local` pattern used elsewhere.

`randomNormalLike` (line 341) and `SourceModuleHnNSF2::randomLike` (line 1251) both use `static thread_local std::mt19937` for stable, efficient random generation. Here, a new `std::mt19937` is constructed and seeded from `std::random_device` on every call, which is both slower and may degrade randomness quality on platforms with weak `random_device` entropy.

Proposed fix

```diff
-  std::mt19937 rng(std::random_device{}());
+  static thread_local std::mt19937 rng(std::random_device{}());
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp` around lines 1207 - 1213, Replace the per-call std::mt19937 construction in f02sine with a persistent thread-local RNG to match the pattern used by randomNormalLike and SourceModuleHnNSF2::randomLike: declare static thread_local std::mt19937 rng seeded once (e.g., from std::random_device() at initialization) and reuse it for the std::uniform_real_distribution<float> uni in the loop that sets rp[(b*T + 0)*H + h] += phase0; to avoid reseeding on every call and ensure consistent, efficient randomness.
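A self-contained sketch of the `static thread_local` pattern the fix proposes — the engine is seeded once per thread and reused, instead of paying for a `std::random_device` seed on every call (the function names here are illustrative):

```cpp
#include <random>

// One engine per thread, seeded once on first use.
std::mt19937& threadLocalRng() {
  static thread_local std::mt19937 rng(std::random_device{}());
  return rng;
}

// Mirrors f02sine's per-call draw: a uniform phase in [0, 2*pi).
float randomPhase() {
  std::uniform_real_distribution<float> uni(0.0f, 6.2831853f);
  return uni(threadLocalRng());
}
```

Distribution objects are cheap to construct per call; it is the engine (and its seeding) that should persist.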
**1-1522:** ~1500 lines in a single header — consider splitting into separate files.

This header contains the full token2wav pipeline: utility functions, Conformer encoder, DiT estimator, CFM decoder, HiFT vocoder, and the top-level model. Splitting into focused headers (e.g., `token2wav_utils.hpp`, `token2wav_flow.hpp`, `token2wav_hift.hpp`) would improve maintainability and compile times.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp` around lines 1 - 1522, The header is too large and should be split into focused headers to improve maintainability and compile times: extract low-level helpers and pure functions (e.g., repeatInterleaveSeq, concatInt64Seq, l2NormalizeRow, tensorMish/LeakyRelu, makeHannWindow, relShift, addHeadBiasInplace, randomNormalLike, SineGen2 helpers) into token2wav_utils.hpp; move the flow/encoder/DiT/CFM-related classes (EspnetRelPositionalEncoding, LinearNoSubsampling, ConformerEncoderLayer, UpsampleConformerEncoderV2, DiTAttention, DiTBlock, DiTEstimator, CausalConditionalCFM, CausalMaskedDiffWithXvec, MiniCPMO45FlowConfig and usages) into token2wav_flow.hpp; and move HiFT/vocoder classes (ResBlock, SnakeActivation, ConvRNNF0Predictor, SourceModuleHnNSF2, HiFTGenerator and MiniCPMO45HiFTConfig) into token2wav_hift.hpp; update MiniCPMO45Token2WavModel to include these headers and adjust include guards/namespaces and any cross-file forward declarations (e.g., ensure SineGen2 and SourceModuleHnNSF2 visibility for HiFTGenerator, and materializeWeightNormParameters/loadFromParameter usages remain accessible). Ensure each new header keeps only the symbols it needs and include the appropriate original dependencies (nn headers, Tensor utilities) and update the single original header to include the three new headers and re-export MiniCPMO45Token2WavModel.

mllm/models/minicpm_o45/export_prompt_cache.py (1)
**189-196:** Bare `except Exception` swallows the original error — consider logging it.

When quantization fails on the primary device, the fallback to CPU is a good pattern, but the original exception is silently discarded. This makes debugging difficult if the CPU fallback also fails for a related reason.

Proposed fix

```diff
     try:
         prompt_tokens, prompt_tokens_lens = audio_tokenizer.quantize(mels.to(quantize_device), mels_lens.to(quantize_device))
-    except Exception:
+    except Exception as e:
+        logger.log(f"Quantize failed on {quantize_device}, falling back to CPU: {e}")
         quantize_device = torch.device("cpu")
         audio_tokenizer = _move_model(audio_tokenizer, quantize_device)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/export_prompt_cache.py` around lines 189 - 196, The try/except around audio_tokenizer.quantize currently swallows the original Exception; change it to catch Exception as e and log or attach e before falling back to CPU so the original error is preserved for debugging: when calling audio_tokenizer.quantize(...) catch Exception as e, record the exception (e) via logging or processLogger (or re-raise with context) before calling _move_model(audio_tokenizer, torch.device("cpu")) and audio_tokenizer.eval(), then retry audio_tokenizer.quantize(...) on the cpu quantize_device.

mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp (1)
**396-400:** Silent `catch (...)` swallows JSON parse errors.

If the tokenizer JSON is malformed, this fails silently without any diagnostic output. Consider at minimum logging a warning so users aren't confused by missing special tokens.
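The replacement pattern — catch `std::exception` rather than `(...)` and surface `e.what()` — in standalone form; `tryParse` and the injected parser are hypothetical stand-ins for the `nlohmann::json::parse` call site:

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Returns false after logging a diagnostic, instead of failing silently.
bool tryParse(const std::string& input, void (*parse_fn)(const std::string&)) {
  try {
    parse_fn(input);
    return true;
  } catch (const std::exception& e) {
    std::cerr << "warning: failed to parse tokenizer JSON, "
                 "special tokens may be missing: " << e.what() << '\n';
    return false;
  }
}
```

Catching `const std::exception&` still covers nlohmann's parse errors (they derive from `std::exception`) while preserving the message that `catch (...)` throws away.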
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp` around lines 396 - 400, Replace the silent catch(...) around nlohmann::json::parse(in) so parse failures aren’t swallowed: catch a std::exception (e.g. catch (const std::exception& e)) and emit a diagnostic (std::cerr or the module logger) that includes the exception.what() and context (e.g. "failed to parse tokenizer JSON, special tokens may be missing"), then return; update the code around json_data / nlohmann::json::parse to use this new catch so users see a warning when the JSON is malformed.

mllm/models/minicpm_o45/token2wav_weight_norm.hpp (1)
**61-68:** `key_g` is always equal to `key` — minor redundancy.

On line 66, `key_g` is reconstructed as `prefix + ".parametrizations.weight.original0"`, but `key` is already that exact string (it matched the `_endsWith(key, marker)` check and `marker == ".parametrizations.weight.original0"`). This is functionally correct and arguably clearer for readability, so just noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/token2wav_weight_norm.hpp` around lines 61 - 68, The variable key_g is redundant because within the loop when _endsWith(key, marker) is true (marker == ".parametrizations.weight.original0"), key already equals prefix + marker; update the loop in token2wav_weight_norm.hpp to either use key directly instead of key_g or set key_g = key (remove reconstructing the same string) and remove any unnecessary allocations; adjust references to key_g accordingly (look for key_g, key_v, key_w and the loop over keys) to keep semantics identical but eliminate the redundant string build.

examples/minicpm_o45/main.cpp (2)
**133-134:** Misleading variable name `wav_i16` — still holds `float` data.

`wav_i16` suggests int16 samples, but it's the result of `wav * 32767.0f`, which is still a float tensor. The `WavWriter` receives `wav_i16.ptr<float>()`. Consider renaming to `wav_scaled` or similar to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/minicpm_o45/main.cpp` around lines 133 - 134, Rename the misleading variable wav_i16 which still contains float data (result of wav * 32767.0f) to something like wav_scaled or wav_scaled_f to avoid implying int16; update its declaration and all uses (the multiplication expression and the pointer passed into wenet::WavWriter) so that wav_scaled.ptr<float>() is passed and the symbol names (wav, wav_i16, WavWriter) match the new identifier.
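For contrast with the float-only scaling above, an actual int16 conversion would look something like the sketch below — clamping before the narrowing cast. `floatToInt16` is a hypothetical helper for illustration; the example program instead keeps the scaled data as float:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Scale a normalized float sample to int16 range, clamping so that
// out-of-range inputs cannot overflow the narrowing cast.
int16_t floatToInt16(float sample) {
  float scaled = sample * 32767.0f;
  scaled = std::clamp(scaled, -32768.0f, 32767.0f);
  return static_cast<int16_t>(std::lrintf(scaled));
}
```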
**278-294:** Duplicated token2wav inference block.

Lines 278–294 are nearly identical to lines 123–137 (the `--tts_tokens_in` path). Extract a helper function to avoid this duplication and reduce future maintenance burden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/minicpm_o45/main.cpp` around lines 278 - 294, The token2wav inference block is duplicated; extract a helper (e.g., runToken2Wav) that encapsulates loading the token2wav parameter via mllm::load, loading the prompt cache via mllm::models::minicpm_o45::loadMiniCPMO45Token2WavPromptCache, constructing mllm::models::minicpm_o45::MiniCPMO45Token2WavModel, calling loadFromParameter, running infer with token_ids and tts_token2wav_n_timesteps, converting to int16 and writing via wenet::WavWriter, and printing the saved path; replace both duplicated blocks (the `--tts_tokens_in` path and the `tts_wav_out` path) with calls to this helper to remove duplication and centralize logic.

examples/qwen2_5omni/CMakeLists.txt (1)
**17-19:** Consider whether the debug executable should be included in the default build.

`mllm-qwen2_5-omni-image-runner-dbg` is a debug artifact. If it's only for development, consider gating it behind a CMake option (e.g., `if(MLLM_BUILD_DEBUG_EXAMPLES)`) to keep the default build clean.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen2_5omni/CMakeLists.txt` around lines 17 - 19, The CMakeLists currently always adds the debug executable mllm-qwen2_5-omni-image-runner-dbg; gate its creation behind a CMake option so the debug artifact is not built by default. Add a boolean option (e.g., MLLM_BUILD_DEBUG_EXAMPLES) and wrap the add_executable/target_link_libraries/target_include_directories for mllm-qwen2_5-omni-image-runner-dbg inside an if(MLLM_BUILD_DEBUG_EXAMPLES) ... endif() block so the target (and its links to MllmRT and MllmCPUBackend and include dir ${MLLM_INCLUDE_DIR}) is only created when the option is enabled.

mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (2)
**1641-1646:** RNG engine re-created on every sampling call.

`std::random_device` + `std::mt19937` are instantiated fresh in `sampleFromDistribution` and `categoricalSample` (line 1656). In a tight generation loop, this is wasteful and may yield poor randomness on platforms where `std::random_device` is deterministic. Consider making the RNG a class member initialized once.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 1641 - 1646, sampleFromDistribution and categoricalSample recreate std::random_device and std::mt19937 on every call which is wasteful and can produce poor randomness; make a single RNG instance a persistent member (e.g., add a std::mt19937 rng; to the model class and initialize it once in the constructor, seeding from std::random_device there), then update sampleFromDistribution and categoricalSample to use that class member rng instead of constructing local rd/gen objects and construct std::discrete_distribution using the provided probs as before.
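The member-RNG refactor suggested above, in minimal form — seed once in the constructor, reuse the engine across draws. `Sampler` is an illustrative name, not the model's actual class:

```cpp
#include <cstddef>
#include <random>
#include <vector>

class Sampler {
 public:
  // Seed exactly once; every subsequent sample reuses the same engine.
  Sampler() : rng_(std::random_device{}()) {}

  // Categorical draw over non-negative weights (need not sum to 1).
  std::size_t categoricalSample(const std::vector<double>& probs) {
    std::discrete_distribution<std::size_t> dist(probs.begin(), probs.end());
    return dist(rng_);
  }

 private:
  std::mt19937 rng_;  // constructed once, not per call
};
```

`std::discrete_distribution` normalizes the weights internally, so the caller can pass raw probabilities or unnormalized scores.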
**421-450:** Public classes and key utility functions lack docstrings.

Per coding guidelines, public APIs should have comments explaining purpose, parameters, and returns. Classes like `Qwen2_5OmniPatchEmbed`, `Qwen2_5OmniVisionEncoder`, `Qwen2_5OmniAudioEncoder`, `Qwen2_5OmniForCausalLM`, `Qwen2_5OmniForConditionalGeneration`, and utility functions like `makeMultimodalPositionEmbedding`, `makeWindowIndex`, etc., would benefit from brief doc comments describing their role in the pipeline and expected input/output shapes. As per coding guidelines: "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 421 - 450, Add concise doc comments above each public class/function: Qwen2_5OmniPatchEmbed, Qwen2_5OmniVisionEncoder, Qwen2_5OmniAudioEncoder, Qwen2_5OmniForCausalLM, Qwen2_5OmniForConditionalGeneration, and utility functions like makeMultimodalPositionEmbedding and makeWindowIndex. For each comment include a one-line purpose, constructor/parameter descriptions (e.g., cfg fields used: visual_in_chans, visual_hidden_size, patch sizes), expected input and output tensor shapes (e.g., input tensor shape, view/reshape results, output embed dim), return values, and any error/edge conditions; place them immediately above the class or function definition using the project’s C++ comment style (Doxygen or ///) and keep them short and focused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/minicpm_o45/main.cpp`:
- Around line 111-119: The token-parsing inner loop can hang when a non-numeric
token sets failbit and never advances the stream; replace the while(!ss.eof())
pattern with a robust parse that advances on each iteration — for example, use a
loop that extracts into an integer (while (ss >> token)) to terminate on parse
failure, or read into a temporary string (e.g., tokenStr) and use std::stoll in
a try/catch to skip malformed tokens and continue; update the code using the
existing variables ss, token, and token_ids so malformed entries no longer cause
an infinite loop.
In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 22-35: The help flag is unreachable because Argparse::parse
enforces required args before returning; fix by detecting help manually from
argv before calling Argparse::parse(argc, argv): scan argc/argv for "--help" or
"-h" and if present call Argparse::printHelp(), mllm::shutdownContext(), and
return 0 immediately, then only call Argparse::parse(...) afterward; this
preserves the existing help.isSet()/printHelp() usage and avoids triggering the
required-args check in Argparse::parse.
In `@mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py`:
- Around line 76-78: The try/except around torch.load that sets obj (obj =
torch.load(path, map_location="cpu", weights_only=True) and falls back to obj =
torch.load(path, map_location="cpu") silently is insecure; modify the except
TypeError block in convert_token2wav_pt_to_mllm.py to emit a warnings.warn (or
logging.warning) explaining that weights_only is unsupported on this PyTorch
build and that the fallback will deserialize arbitrary Python objects, then
proceed only after the warning; reference the existing symbols obj, path, and
the torch.load calls so you add the warning immediately before the fallback load
and keep the behavior otherwise.
In `@mllm/models/minicpm_o45/export_prompt_cache.py`:
- Around line 219-225: The code only pads prompt_mels when target_len >
prompt_mels.shape[1] but doesn't truncate when target_len <
prompt_mels.shape[1], which can leave extra mel frames and cause shape
mismatches; update the block around target_len, prompt_tokens, prompt_mels and
up_rate to handle both cases by slicing/truncating prompt_mels to target_len
(e.g., keep first target_len frames along dim 1) when prompt_mels.shape[1] >
target_len, and keep the existing torch.nn.functional.pad logic for the > case
so prompt_mels always ends up with length equal to target_len.
In `@mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp`:
- Around line 1076-1086: The ConvRNNF0Predictor constructor is implicitly
convertible from std::string; mark the constructor explicit by adding the
explicit keyword to the ConvRNNF0Predictor(const std::string& name, int32_t
in_channels = 80, int32_t cond_channels = 512) declaration to prevent unintended
implicit conversions while leaving defaults and body unchanged (identify the
constructor in class ConvRNNF0Predictor and update its signature).
In `@mllm/models/minicpm_o45/modeling_minicpm_o45.hpp`:
- Around line 490-500: The index arithmetic using int32_t (base = b * S * D + s
* D) can overflow for large tensors; change the index calculation and any offset
variables to int64_t (e.g., declare base as int64_t and compute using int64_t
temporaries) and use that 64-bit offset when indexing ptr inside the loops
(while keeping loop counters b, s, d as int32_t if desired). Update references
in this normalization block (kEps, b, s, d, base, ptr, B, S, D) so all
multiplications/additions promoting to int64_t to avoid overflow. Ensure final
division ptr[base + d] uses the 64-bit index or is safely cast for pointer
arithmetic.
- Line 849: The creation of position_ids uses Tensor::empty({1, seq_len},
kInt64) without an explicit device; update the call that initializes the
variable position_ids (where Tensor::empty is invoked with seq_len and kInt64)
to include the device parameter , kCPU so it matches other Tensor::empty calls
in this file and maintains consistent device specification.
In `@mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp`:
- Around line 337-344: The loop in tokenization_minicpm_o45.hpp uses int32_t pos
to track positions in applied_string which can overflow; change the type of pos
to size_t and update the cast when updating pos (replace
static_cast<int32_t>(found + tag.size()) with static_cast<size_t>(found +
tag.size()) or simply rely on size_t arithmetic) inside the loop that iterates
over image_tags and uses applied_string.find to build text_chunks so string
positions use the correct unsigned size type.
- Around line 472-483: The code uses float_t which can vary by platform; replace
uses of float_t with explicit float to match the kFloat32 tensor allocation:
change pixel_values.ptr<float_t>() to pixel_values.ptr<float>(), src_ptr to
img_tensors[b].ptr<float>(), and use sizeof(float) in the std::memset and
std::memcpy length calculations (ensure all occurrences inside the block that
use float_t are updated to float and sizeof(float) so no size mismatch or buffer
overrun occurs for pixel_values_ptr, src_ptr, dst_offset calculations).
In `@mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp`:
- Around line 218-231: The code reads thinker_cfg["text_config"] and then
accesses fields via text_cfg[...] without guarding for absence, which will throw
nlohmann::json::out_of_range if "text_config" is missing; update the block that
defines thinker_cfg/text_cfg to first check thinker_cfg.contains("text_config")
and either throw a clear runtime_error (or log and return) mentioning the
missing "thinker_config.text_config" key, or use safe accessors like
text_cfg.value("hidden_size", default) for each field (hidden_size,
intermediate_size, num_attention_heads, num_key_value_heads, num_hidden_layers,
max_position_embeddings, rms_norm_eps, vocab_size, rope_theta,
tie_word_embeddings) so the code in configuration_qwen2_5omni.hpp handles a
missing "text_config" deterministically.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 397-403: The computation in makeAudioSinusoidalPosEmb can divide
by zero when channels==2 because half = channels/2 -> 1 and you compute
log_timescale_increment = std::log(max_timescale)/static_cast<float>(half - 1);
change the logic to guard when half <= 1: if half > 1 compute
log_timescale_increment as before, otherwise set log_timescale_increment = 0
(and subsequently handle inv_timescales/scale computation accordingly so you
don't divide by zero or produce NaNs). Update references in
makeAudioSinusoidalPosEmb to use this conditional denominator and ensure the
generated pos_emb for the half==1 case produces a valid (e.g., zero or constant)
embedding rather than crashing.
- Around line 2026-2030: The code currently unconditionally pops the last
element of generated_codes which removes a valid token when generation stops due
to max_new_tokens; change it to only pop when the last token is an EOS token (or
when the talker signaled it ended on EOS). Locate where generated_codes is
modified and returned (around generated_codes, makeTokenTensor and the loop that
calls the talker), and replace the unconditional pop_back() with a check like:
if (!generated_codes.empty() && generated_codes.back() ==
<EOS_ID_or_codec.eos_token>) { generated_codes.pop_back(); } or use the loop’s
ended_on_eos flag if available to decide whether to remove the trailing EOS
before calling makeTokenTensor; leave the MLLM_ERROR_EXIT path intact for empty
results.
---
Duplicate comments:
In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 24-29: The current logic silently treats unknown model_version
values as mllm::ModelFileVersion::kV1; update the validation around
model_version.get() so only "v1" and "v2" are accepted: if the value is "v1" or
"v2" set file_version accordingly (mllm::ModelFileVersion::kV1/kV2), otherwise
log an explicit error mentioning the invalid model_version and terminate (or
return a non-zero code) so we don't accidentally load the wrong format;
reference the variables model_version and file_version and the enum
mllm::ModelFileVersion in your change.
- Around line 52-59: The example currently prints prompts but ignores user input
because image_path and prompt_text are hardcoded; revert to interactive mode by
uncommenting the std::getline(std::cin, image_path) and std::getline(std::cin,
prompt_text) calls (or alternatively parse them from CLI args) so that the
printed prompts actually collect user input; ensure the check using image_path
== "exit" || image_path == "quit" still runs after reading input and that
prompt_text is used when calling the inference routines in this file.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 1014-1030: The attention path lacks a fallback when
key_states.dtype() is neither kFloat32 nor kFloat16, leaving attn uninitialized
and causing undefined behavior in nn::functional::matmul(attn, value_states);
add an else branch in the attention block that handles other dtypes by
converting query_states and key_states to kFloat32, computing attn the same way
(mask_ and softmax_ applied) and then converting back if needed, or
alternatively throw a clear runtime_error; update the code around key_states,
query_states, attn, head_dim_, mask_, softmax_, and the subsequent matmul/output
steps so attn is always initialized before use.
---
Nitpick comments:
In `@examples/minicpm_o45/main.cpp`:
- Around line 133-134: Rename the misleading variable wav_i16 which still
contains float data (result of wav * 32767.0f) to something like wav_scaled or
wav_scaled_f to avoid implying int16; update its declaration and all uses (the
multiplication expression and the pointer passed into wenet::WavWriter) so that
wav_scaled.ptr<float>() is passed and the symbol names (wav, wav_i16, WavWriter)
match the new identifier.
- Around line 278-294: The token2wav inference block is duplicated; extract a
helper (e.g., runToken2Wav) that encapsulates loading the token2wav parameter
via mllm::load, loading the prompt cache via
mllm::models::minicpm_o45::loadMiniCPMO45Token2WavPromptCache, constructing
mllm::models::minicpm_o45::MiniCPMO45Token2WavModel, calling loadFromParameter,
running infer with token_ids and tts_token2wav_n_timesteps, converting to int16
and writing via wenet::WavWriter, and printing the saved path; replace both
duplicated blocks (the `--tts_tokens_in` path and the `tts_wav_out` path) with
calls to this helper to remove duplication and centralize logic.
In `@examples/qwen2_5omni/CMakeLists.txt`:
- Around line 17-19: The CMakeLists currently always adds the debug executable
mllm-qwen2_5-omni-image-runner-dbg; gate its creation behind a CMake option so
the debug artifact is not built by default. Add a boolean option (e.g.,
MLLM_BUILD_DEBUG_EXAMPLES) and wrap the
add_executable/target_link_libraries/target_include_directories for
mllm-qwen2_5-omni-image-runner-dbg inside an if(MLLM_BUILD_DEBUG_EXAMPLES) ...
endif() block so the target (and its links to MllmRT and MllmCPUBackend and
include dir ${MLLM_INCLUDE_DIR}) is only created when the option is enabled.
In `@mllm/backends/cpu/ops/EmbeddingOp.cpp`:
- Line 29: The process-wide static warned_token_oob suppresses OOB warnings
across all embedding layers; make it an instance member on CPUEmbeddingOp (e.g.,
add std::atomic<bool> warned_token_oob_{false}; to the CPUEmbeddingOp
declaration in EmbeddingOp.hpp) and update uses in EmbeddingOp.cpp to reference
the instance field instead of the static local (replace the static
std::atomic<bool> warned_token_oob and the CAS site to use
this->warned_token_oob_ or equivalent), so each embedding instance emits its own
one-time OOB warning.
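As a sketch, the per-instance flag the comment asks for can look like this (the class and method names are illustrative stand-ins, not the real CPUEmbeddingOp interface):

```cpp
#include <atomic>
#include <cstdio>

// Illustrative stand-in for CPUEmbeddingOp: each instance owns its own
// one-shot flag, so two embedding layers each warn once independently.
class EmbeddingLayer {
 public:
  void lookup(long token, long vocab_size) {
    if (token < 0 || token >= vocab_size) {
      bool expected = false;
      // compare_exchange_strong flips the flag exactly once per instance,
      // even if several threads hit an OOB token concurrently.
      if (warned_token_oob_.compare_exchange_strong(expected, true)) {
        std::fprintf(stderr, "embedding: token %ld out of range [0, %ld)\n",
                     token, vocab_size);
      }
      return;  // real code would clamp or zero-fill the row here
    }
    // ... normal row copy elided ...
  }

  bool hasWarned() const { return warned_token_oob_.load(); }

 private:
  std::atomic<bool> warned_token_oob_{false};
};
```

With a function-local `static`, the second instance below would stay silent after the first one warned; with the member flag, each warns independently.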
In `@mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py`:
- Around line 112-136: _collect_source_meta currently loads the entire
checkpoint then _stream_source_tensors reloads it, doubling I/O for large files;
modify _collect_source_meta to optionally return the loaded state dict (or
accept a preloaded state dict param) and update its callers to pass that state
into _stream_source_tensors so the file is only read once, or alternatively
refactor to a single-pass lazy stream where _stream_source_tensors queries
metadata on first access; update the function signatures for
_collect_source_meta and _stream_source_tensors and all call sites to propagate
the in-memory state dict when available and ensure gc.collect() only runs when
the state is truly discarded.
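A minimal sketch of the single-read refactor, with torch-free stand-ins for the checkpoint dict (the real _collect_source_meta/_stream_source_tensors signatures differ):

```python
# Sketch: load the checkpoint once, derive metadata from the in-memory
# state dict, then stream tensors from that same dict instead of re-reading.
def collect_source_meta(state_dict):
    """Return (ordered names, state_dict) so callers can reuse the loaded dict."""
    names = list(state_dict.keys())
    return names, state_dict


def stream_source_tensors(state_dict, names):
    """Yield (name, tensor) pairs from the already-loaded dict."""
    for name in names:
        yield name, state_dict[name]


# usage: one torch.load() instead of two
# state = torch.load(path, map_location="cpu", weights_only=True)
# names, state = collect_source_meta(state)
# for name, tensor in stream_source_tensors(state, names): ...
```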
- Around line 266-268: The current code builds a large in-memory zero buffer via
b"\x00" * reserved_bytes and writes it (reserved_bytes is computed from
MODEL_FILE_V2_DESC_SIZE and MODEL_FILE_V2_PARAM_DESC_SIZE), which can allocate
megabytes; instead, replace the allocation+write with a seek-based sparse-file
creation: use self.f.seek(reserved_bytes - 1, 1) (or io.SEEK_CUR) and then write
a single b"\x00" and flush — this avoids allocating the full zero buffer and
produces a sparse file while preserving the reserved region semantics used
around reserved_bytes, MODEL_FILE_V2_DESC_SIZE, and
MODEL_FILE_V2_PARAM_DESC_SIZE.
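The seek-based reservation is a two-liner; a self-contained sketch (reserve_region is a hypothetical helper, not a function in the converter):

```python
import io
import os
import tempfile


def reserve_region(f, reserved_bytes):
    """Reserve `reserved_bytes` from the current position without
    materializing a zero buffer: seek to the last byte and write one NUL.
    On filesystems with sparse-file support the hole costs no disk blocks."""
    if reserved_bytes <= 0:
        return
    f.seek(reserved_bytes - 1, io.SEEK_CUR)
    f.write(b"\x00")
    f.flush()


# usage sketch: reserve a 4 KiB descriptor region at the start of a file
with tempfile.NamedTemporaryFile(delete=False) as f:
    reserve_region(f, 4096)
    path = f.name
```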
In `@mllm/models/minicpm_o45/export_prompt_cache.py`:
- Around line 189-196: The try/except around audio_tokenizer.quantize currently
swallows the original Exception; change it to catch Exception as e and log or
attach e before falling back to CPU so the original error is preserved for
debugging: when calling audio_tokenizer.quantize(...) catch Exception as e,
record the exception (e) via logging or processLogger (or re-raise with context)
before calling _move_model(audio_tokenizer, torch.device("cpu")) and
audio_tokenizer.eval(), then retry audio_tokenizer.quantize(...) on the cpu
quantize_device.
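The shape of that fix, with the model-specific calls abstracted into callables (quantize_with_cpu_fallback is a hypothetical helper, not the script's real API):

```python
import logging

logger = logging.getLogger("export_prompt_cache")


def quantize_with_cpu_fallback(quantize, move_to_cpu, *args):
    """Run `quantize`; if it fails, log the original error before
    moving to CPU and retrying, so the root cause is not swallowed."""
    try:
        return quantize(*args)
    except Exception as e:  # noqa: BLE001 - deliberate broad fallback
        logger.warning("quantize failed on primary device (%s); retrying on CPU", e)
        move_to_cpu()
        return quantize(*args)
```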
In `@mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp`:
- Around line 1340-1351: The use of std::pow(2, i) in HiFTGenerator constructor
(e.g., computing in_ch and out_ch) should be replaced with integer bit-shifts to
avoid floating-point imprecision and improve performance: compute powers of two
as (1 << i) (and (1 << (i+1))) and ensure the result is cast to int32_t where
needed; apply the same replacement for the other occurrences referenced in the
diff (the uses at the spots corresponding to lines around 1359, 1372, and 1379)
so all exponent-of-two calculations use bit shifts rather than std::pow.
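The replacement itself is mechanical; for instance (the helper names are illustrative, not the constructor's actual variables):

```cpp
#include <cstdint>

// Powers of two via shifts: exact integer arithmetic, no float rounding,
// no std::pow call in a hot constructor loop.
inline int32_t channelsAt(int32_t base, int32_t i) {
  // was: static_cast<int32_t>(base / std::pow(2, i))
  return base / (1 << i);
}

inline int32_t nextChannelsAt(int32_t base, int32_t i) {
  // was: static_cast<int32_t>(base / std::pow(2, i + 1))
  return base / (1 << (i + 1));
}
```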
- Around line 1207-1213: Replace the per-call std::mt19937 construction in
f02sine with a persistent thread-local RNG to match the pattern used by
randomNormalLike and SourceModuleHnNSF2::randomLike: declare static thread_local
std::mt19937 rng seeded once (e.g., from std::random_device() at initialization)
and reuse it for the std::uniform_real_distribution<float> uni in the loop that
sets rp[(b*T + 0)*H + h] += phase0; to avoid reseeding on every call and ensure
consistent, efficient randomness.
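The thread-local pattern in isolation (uniformPhase is a stand-in for the f02sine sampling site):

```cpp
#include <random>

// One engine per thread, seeded once from std::random_device; repeated
// calls reuse it instead of reconstructing and reseeding std::mt19937,
// which is both faster and statistically sounder.
inline float uniformPhase() {
  static thread_local std::mt19937 rng{std::random_device{}()};
  std::uniform_real_distribution<float> uni(-1.0f, 1.0f);
  return uni(rng);
}
```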
- Around line 1-1522: The header is too large and should be split into focused
headers to improve maintainability and compile times: extract low-level helpers
and pure functions (e.g., repeatInterleaveSeq, concatInt64Seq, l2NormalizeRow,
tensorMish/LeakyRelu, makeHannWindow, relShift, addHeadBiasInplace,
randomNormalLike, SineGen2 helpers) into token2wav_utils.hpp; move the
flow/encoder/DiT/CFM-related classes (EspnetRelPositionalEncoding,
LinearNoSubsampling, ConformerEncoderLayer, UpsampleConformerEncoderV2,
DiTAttention, DiTBlock, DiTEstimator, CausalConditionalCFM,
CausalMaskedDiffWithXvec, MiniCPMO45FlowConfig and usages) into
token2wav_flow.hpp; and move HiFT/vocoder classes (ResBlock, SnakeActivation,
ConvRNNF0Predictor, SourceModuleHnNSF2, HiFTGenerator and MiniCPMO45HiFTConfig)
into token2wav_hift.hpp; update MiniCPMO45Token2WavModel to include these
headers and adjust include guards/namespaces and any cross-file forward
declarations (e.g., ensure SineGen2 and SourceModuleHnNSF2 visibility for
HiFTGenerator, and materializeWeightNormParameters/loadFromParameter usages
remain accessible). Ensure each new header keeps only the symbols it needs and
include the appropriate original dependencies (nn headers, Tensor utilities) and
update the single original header to include the three new headers and re-export
MiniCPMO45Token2WavModel.
In `@mllm/models/minicpm_o45/modeling_minicpm_o45.hpp`:
- Around line 150-187: MiniCPMO45TTS::loadFromParameter duplicates per-row
weight-norm materialization already implemented as _materializeWeightNorm in
token2wav_weight_norm.hpp; replace the manual loop that builds head_code_weight_
with a call to the shared _materializeWeightNorm (or factor out a small helper
used by both places) so normalization and epsilon handling are identical, i.e.,
consume g and v tensors, produce the per-row scaled weight Tensor and push into
head_code_weight_, ensuring the same kEps/clamping semantics as
_materializeWeightNorm.
In `@mllm/models/minicpm_o45/token2wav_weight_norm.hpp`:
- Around line 61-68: The variable key_g is redundant because within the loop
when _endsWith(key, marker) is true (marker ==
".parametrizations.weight.original0"), key already equals prefix + marker;
update the loop in token2wav_weight_norm.hpp to either use key directly instead
of key_g or set key_g = key (remove reconstructing the same string) and remove
any unnecessary allocations; adjust references to key_g accordingly (look for
key_g, key_v, key_w and the loop over keys) to keep semantics identical but
eliminate the redundant string build.
In `@mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp`:
- Around line 396-400: Replace the silent catch(...) around
nlohmann::json::parse(in) so parse failures aren’t swallowed: catch a
std::exception (e.g. catch (const std::exception& e)) and emit a diagnostic
(std::cerr or the module logger) that includes the exception.what() and context
(e.g. "failed to parse tokenizer JSON, special tokens may be missing"), then
return; update the code around json_data / nlohmann::json::parse to use this new
catch so users see a warning when the JSON is malformed.
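The general shape of the fix, with a function-pointer stand-in for nlohmann::json::parse so the sketch stays self-contained:

```cpp
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>

// Stand-in for the tokenizer-JSON load: report parse failures with context
// instead of silently returning from a catch(...).
bool loadSpecialTokens(const std::string& text,
                       void (*parseJson)(const std::string&)) {
  try {
    parseJson(text);  // real code: nlohmann::json::parse(in)
    return true;
  } catch (const std::exception& e) {
    std::cerr << "failed to parse tokenizer JSON, special tokens may be "
                 "missing: " << e.what() << "\n";
    return false;
  }
}
```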
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 1641-1646: sampleFromDistribution and categoricalSample recreate
std::random_device and std::mt19937 on every call which is wasteful and can
produce poor randomness; make a single RNG instance a persistent member (e.g.,
add a std::mt19937 rng; to the model class and initialize it once in the
constructor, seeding from std::random_device there), then update
sampleFromDistribution and categoricalSample to use that class member rng
instead of constructing local rd/gen objects and construct
std::discrete_distribution using the provided probs as before.
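A self-contained sketch of the member-RNG pattern (Sampler stands in for the model class):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// The sampler owns one engine, seeded once in the constructor, instead of
// constructing std::random_device + std::mt19937 on every sampling call.
class Sampler {
 public:
  Sampler() : rng_(std::random_device{}()) {}

  std::size_t categoricalSample(const std::vector<double>& probs) {
    std::discrete_distribution<std::size_t> dist(probs.begin(), probs.end());
    return dist(rng_);
  }

 private:
  std::mt19937 rng_;
};
```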
- Around line 421-450: Add concise doc comments above each public
class/function: Qwen2_5OmniPatchEmbed, Qwen2_5OmniVisionEncoder,
Qwen2_5OmniAudioEncoder, Qwen2_5OmniForCausalLM,
Qwen2_5OmniForConditionalGeneration, and utility functions like
makeMultimodalPositionEmbedding and makeWindowIndex. For each comment include a
one-line purpose, constructor/parameter descriptions (e.g., cfg fields used:
visual_in_chans, visual_hidden_size, patch sizes), expected input and output
tensor shapes (e.g., input tensor shape, view/reshape results, output embed
dim), return values, and any error/edge conditions; place them immediately above
the class or function definition using the project’s C++ comment style (Doxygen
or ///) and keep them short and focused.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- examples/CMakeLists.txt
- examples/minicpm_o45/CMakeLists.txt
- examples/minicpm_o45/config_minicpm_o45.json
- examples/minicpm_o45/main.cpp
- examples/qwen2_5omni/CMakeLists.txt
- examples/qwen2_5omni/image_infer.cpp
- mllm/backends/cpu/ops/EmbeddingOp.cpp
- mllm/models/minicpm_o2_6/modeling_chattts.hpp
- mllm/models/minicpm_o45/configuration_minicpm_o45.hpp
- mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py
- mllm/models/minicpm_o45/export_prompt_cache.py
- mllm/models/minicpm_o45/modeling_minicpm_o45.hpp
- mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp
- mllm/models/minicpm_o45/token2wav_prompt_cache.hpp
- mllm/models/minicpm_o45/token2wav_weight_norm.hpp
- mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp
- mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp
- mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp
💤 Files with no reviewable changes (1)
- mllm/models/minicpm_o2_6/modeling_chattts.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/CMakeLists.txt
for (std::string line; std::getline(ifs, line);) {
  if (line.empty()) { continue; }
  std::stringstream ss(line);
  while (!ss.eof()) {
    int64_t token = 0;
    ss >> token;
    if (!ss.fail()) { token_ids.push_back(token); }
  }
}
Potential infinite loop when parsing malformed token file.
If a line contains non-numeric text, ss >> token sets failbit without advancing the stream position, so ss.eof() never becomes true and this loop spins forever.
🐛 Proposed fix
  std::stringstream ss(line);
- while (!ss.eof()) {
+ while (ss >> std::ws && !ss.eof()) {
    int64_t token = 0;
-   ss >> token;
-   if (!ss.fail()) { token_ids.push_back(token); }
+   if (ss >> token) {
+     token_ids.push_back(token);
+   } else {
+     break; // stop on non-numeric content
+   }
  }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/minicpm_o45/main.cpp` around lines 111 - 119, The token-parsing
inner loop can hang when a non-numeric token sets failbit and never advances the
stream; replace the while(!ss.eof()) pattern with a robust parse that advances
on each iteration — for example, use a loop that extracts into an integer (while
(ss >> token)) to terminate on parse failure, or read into a temporary string
(e.g., tokenStr) and use std::stoll in a try/catch to skip malformed tokens and
continue; update the code using the existing variables ss, token, and token_ids
so malformed entries no longer cause an infinite loop.
Argparse::parse(argc, argv);

mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}

if (help.isSet()) {
  Argparse::printHelp();
  mllm::shutdownContext();
  return 0;
}
🧩 Analysis chain
Repository: UbiquitousLearning/mllm

🏁 Script executed:
# Find Argparse implementation
find . -name "*.h" -o -name "*.cpp" | xargs grep -l "class Argparse\|Argparse::parse" | head -5
Length of output: 1728

🏁 Script executed:
# Check the Argparse header to understand parse() behavior
fd -t f -e h "Argparse" -o "argparse" | head -10
Length of output: 202

🏁 Script executed:
# Look at the image_infer.cpp file to confirm lines and flow
cat examples/qwen2_5omni/image_infer.cpp | head -50
Length of output: 1992

🏁 Script executed:
# Check how other examples handle required args and help
grep -rn "\.required(true)" examples/ --include="*.cpp" | head -20
Length of output: 2961

🏁 Script executed:
# Search for Argparse::parse in other examples to see the pattern
grep -rn "Argparse::parse" examples/ --include="*.cpp" -A 10 | head -80
Length of output: 4753

🏁 Script executed:
# Find the Argparse header/implementation
find . -type f \( -name "*.h" -o -name "*.hpp" \) | xargs grep -l "class Argparse\|struct Argparse" | grep -v test | head -10
Length of output: 92

🏁 Script executed:
# Look for parse method implementation
rg -n "Argparse::parse" --type=cpp --type=h -B 2 -A 10 | head -100
Length of output: 6821

🏁 Script executed:
# Search for the Argparse class definition
rg -n "class Argparse|struct Argparse" --type=h --type=cpp -A 20 | head -150
Length of output: 1277

🏁 Script executed:
# Read the Argparse.hpp file to see the full parse() implementation
sed -n '149,250p' mllm/utils/Argparse.hpp
Length of output: 3218

🏁 Script executed:
# Check if there's any help handling in parse()
rg -n "help|required|printHelp" mllm/utils/Argparse.hpp -B 2 -A 2 | head -80
Length of output: 2471
Help flag is unreachable when required arguments are missing.
Argparse::parse() enforces required arguments before returning (see Argparse.hpp lines 200–205). When run with --help alone, the missing required arguments (model_path, model_version, tokenizer_path, config_path) trigger MLLM_ERROR_EXIT inside parse(), terminating the program before the help check at line 31 executes. Check for the help flag before calling parse():
🐛 Proposed fix — check help before parse
+ if (help.isSet()) {
+   Argparse::printHelp();
+   mllm::shutdownContext();
+   return 0;
+ }
+
  Argparse::parse(argc, argv);
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
-
- if (help.isSet()) {
-   Argparse::printHelp();
-   mllm::shutdownContext();
-   return 0;
- }

🤖 Prompt for AI Agents
- }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/image_infer.cpp` around lines 22 - 35, The help flag is
unreachable because Argparse::parse enforces required args before returning; fix
by detecting help manually from argv before calling Argparse::parse(argc, argv):
scan argc/argv for "--help" or "-h" and if present call Argparse::printHelp(),
mllm::shutdownContext(), and return 0 immediately, then only call
Argparse::parse(...) afterward; this preserves the existing
help.isSet()/printHelp() usage and avoids triggering the required-args check in
Argparse::parse.
    obj = torch.load(path, map_location="cpu", weights_only=True)
except TypeError:
    obj = torch.load(path, map_location="cpu")
Silent fallback to insecure torch.load (arbitrary pickle execution)
When weights_only=True raises TypeError on older PyTorch builds, the code silently falls back to weights_only=False, which allows arbitrary Python objects to be deserialized via pickle. Running this tool against a maliciously crafted checkpoint would execute attacker-controlled code.
At minimum, emit a warnings.warn before the fallback load:
🛡️ Proposed fix
+import warnings
 try:
     obj = torch.load(path, map_location="cpu", weights_only=True)
 except TypeError:
+    warnings.warn(
+        f"torch.load() does not support 'weights_only' on this PyTorch version; "
+        f"loading {path} without safety restrictions. Ensure the checkpoint is trusted.",
+        stacklevel=2,
+    )
     obj = torch.load(path, map_location="cpu")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py` around lines 76 -
78, The try/except around torch.load that sets obj (obj = torch.load(path,
map_location="cpu", weights_only=True)) and silently falls back to obj =
torch.load(path, map_location="cpu") is insecure; modify the except TypeError block in
convert_token2wav_pt_to_mllm.py to emit a warnings.warn (or logging.warning)
explaining that weights_only is unsupported on this PyTorch build and that the
fallback will deserialize arbitrary Python objects, then proceed only after the
warning; reference the existing symbols obj, path, and the torch.load calls so
you add the warning immediately before the fallback load and keep the behavior
otherwise.
writer = _V2StreamingWriter(output=output, model_name=model_name, max_params=max_param_desc)
written = 0
try:
    for m, t in _stream_source_tensors(flow_pt, flow_metas):
        writer.write_tensor(m.full_name, t)
        written += 1
    for m, t in _stream_source_tensors(hift_pt, hift_metas):
        writer.write_tensor(m.full_name, t)
        written += 1
    writer.finalize()
finally:
    writer.close()
_V2StreamingWriter construction outside try block causes file-handle leak on init failure
If _V2StreamingWriter.__init__ raises (e.g., disk full during the zero-byte pre-allocation at line 267), the try block is never entered, so finally: writer.close() is never reached. The file opened on line 264 relies on CPython's reference-counting GC to be closed, and a partially-written output file is silently left on disk.
🛡️ Proposed fix
- output.parent.mkdir(parents=True, exist_ok=True)
- writer = _V2StreamingWriter(output=output, model_name=model_name, max_params=max_param_desc)
- written = 0
- try:
+ output.parent.mkdir(parents=True, exist_ok=True)
+ writer = None
+ written = 0
+ try:
+ writer = _V2StreamingWriter(output=output, model_name=model_name, max_params=max_param_desc)
for m, t in _stream_source_tensors(flow_pt, flow_metas):
writer.write_tensor(m.full_name, t)
written += 1
for m, t in _stream_source_tensors(hift_pt, hift_metas):
writer.write_tensor(m.full_name, t)
written += 1
writer.finalize()
finally:
- writer.close()
+ if writer is not None:
+            writer.close()
target_len = int(prompt_tokens.shape[1]) * up_rate
if target_len > prompt_mels.shape[1]:
    prompt_mels = torch.nn.functional.pad(
        prompt_mels,
        (0, 0, 0, target_len - prompt_mels.shape[1]),
        mode="replicate",
    )
Missing truncation when target_len < prompt_mels.shape[1].
Only the case where target_len > prompt_mels.shape[1] is handled (padding). If the mel spectrogram is longer than prompt_tokens.shape[1] * up_rate, the extra frames will be written to the cache and may cause a shape mismatch during inference.
Proposed fix
target_len = int(prompt_tokens.shape[1]) * up_rate
if target_len > prompt_mels.shape[1]:
prompt_mels = torch.nn.functional.pad(
prompt_mels,
(0, 0, 0, target_len - prompt_mels.shape[1]),
mode="replicate",
)
+ elif target_len < prompt_mels.shape[1]:
+    prompt_mels = prompt_mels[:, :target_len, :]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/minicpm_o45/export_prompt_cache.py` around lines 219 - 225, The
code only pads prompt_mels when target_len > prompt_mels.shape[1] but doesn't
truncate when target_len < prompt_mels.shape[1], which can leave extra mel
frames and cause shape mismatches; update the block around target_len,
prompt_tokens, prompt_mels and up_rate to handle both cases by
slicing/truncating prompt_mels to target_len (e.g., keep first target_len frames
along dim 1) when prompt_mels.shape[1] > target_len, and keep the existing
torch.nn.functional.pad logic for the > case so prompt_mels always ends up with
length equal to target_len.
int32_t pos = 0;
for (const auto& tag : image_tags) {
  auto found = applied_string.find(tag, pos);
  if (found != std::string::npos) {
    text_chunks.push_back(applied_string.substr(pos, found - pos));
    pos = static_cast<int32_t>(found + tag.size());
  }
}
int32_t pos can overflow for long strings — use size_t.
pos is used to track position in applied_string but is declared as int32_t. If found + tag.size() exceeds INT32_MAX, the static_cast<int32_t> truncates the value. While unlikely for typical chat messages, using size_t is the idiomatic and safe choice for string positions.
Proposed fix
- int32_t pos = 0;
+ size_t pos = 0;
for (const auto& tag : image_tags) {
auto found = applied_string.find(tag, pos);
if (found != std::string::npos) {
text_chunks.push_back(applied_string.substr(pos, found - pos));
- pos = static_cast<int32_t>(found + tag.size());
+ pos = found + tag.size();
}
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp` around lines 337 - 344,
The loop in tokenization_minicpm_o45.hpp uses int32_t pos to track positions in
applied_string which can overflow; change the type of pos to size_t and update
the cast when updating pos (replace static_cast<int32_t>(found + tag.size())
with static_cast<size_t>(found + tag.size()) or simply rely on size_t
arithmetic) inside the loop that iterates over image_tags and uses
applied_string.find to build text_chunks so string positions use the correct
unsigned size type.
auto* pixel_values_ptr = pixel_values.ptr<float_t>();
std::memset(pixel_values_ptr, 0, static_cast<size_t>(img_tensors.size()) * channels * patch_size * hw_patch_size * sizeof(float_t));

for (int32_t b = 0; b < static_cast<int32_t>(img_tensors.size()); ++b) {
  int32_t src_hw = img_tensors[b].shape()[2];
  const auto* src_ptr = img_tensors[b].ptr<float>();

  for (int32_t c = 0; c < channels; ++c) {
    for (int32_t p = 0; p < patch_size; ++p) {
      int32_t src_offset = c * patch_size * src_hw + p * src_hw;
      int32_t dst_offset = b * channels * patch_size * hw_patch_size + c * patch_size * hw_patch_size + p * hw_patch_size;
      std::memcpy(pixel_values_ptr + dst_offset, src_ptr + src_offset, src_hw * sizeof(float_t));
float_t may not be float — use float consistently.
float_t is defined by <cmath> and its actual type depends on FLT_EVAL_METHOD. On some platforms/compilers it can be double or long double, causing a size mismatch with the kFloat32 tensor allocation. The tensor is allocated as kFloat32 (4 bytes per element), but sizeof(float_t) could be 8 bytes, leading to buffer overruns in the memset and memcpy calls.
🛡️ Proposed fix
- auto* pixel_values_ptr = pixel_values.ptr<float_t>();
- std::memset(pixel_values_ptr, 0, static_cast<size_t>(img_tensors.size()) * channels * patch_size * hw_patch_size * sizeof(float_t));
+ auto* pixel_values_ptr = pixel_values.ptr<float>();
+ std::memset(pixel_values_ptr, 0, static_cast<size_t>(img_tensors.size()) * channels * patch_size * hw_patch_size * sizeof(float));

And on line 483:
- std::memcpy(pixel_values_ptr + dst_offset, src_ptr + src_offset, src_hw * sizeof(float_t));
+ std::memcpy(pixel_values_ptr + dst_offset, src_ptr + src_offset, src_hw * sizeof(float));

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp` around lines 472 - 483,
The code uses float_t which can vary by platform; replace uses of float_t with
explicit float to match the kFloat32 tensor allocation: change
pixel_values.ptr<float_t>() to pixel_values.ptr<float>(), src_ptr to
img_tensors[b].ptr<float>(), and use sizeof(float) in the std::memset and
std::memcpy length calculations (ensure all occurrences inside the block that
use float_t are updated to float and sizeof(float) so no size mismatch or buffer
overrun occurs for pixel_values_ptr, src_ptr, dst_offset calculations).
if (root.contains("thinker_config")) {
  auto& thinker_cfg = root["thinker_config"];
  auto& text_cfg = thinker_cfg["text_config"];

  hidden_size = text_cfg["hidden_size"];
  intermediate_size = text_cfg["intermediate_size"];
  num_attention_heads = text_cfg["num_attention_heads"];
  num_key_value_heads = text_cfg["num_key_value_heads"];
  num_hidden_layers = text_cfg["num_hidden_layers"];
  max_position_embeddings = text_cfg["max_position_embeddings"];
  rms_norm_eps = text_cfg["rms_norm_eps"];
  vocab_size = text_cfg["vocab_size"];
  rope_theta = text_cfg["rope_theta"];
  tie_word_embeddings = text_cfg.value("tie_word_embeddings", false);
Unguarded access to thinker_config["text_config"] — will throw if key is missing.
Line 220 accesses thinker_cfg["text_config"] without checking thinker_cfg.contains("text_config") first. If thinker_config is present but lacks text_config, this throws nlohmann::json::out_of_range. The same applies to lines 222-230 which use direct [] access on text_cfg.
If text_config is a required field when thinker_config is present, consider adding an explicit check with a descriptive error message to help users diagnose config issues.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp` around lines 218 -
231, The code reads thinker_cfg["text_config"] and then accesses fields via
text_cfg[...] without guarding for absence, which will throw
nlohmann::json::out_of_range if "text_config" is missing; update the block that
defines thinker_cfg/text_cfg to first check thinker_cfg.contains("text_config")
and either throw a clear runtime_error (or log and return) mentioning the
missing "thinker_config.text_config" key, or use safe accessors like
text_cfg.value("hidden_size", default) for each field (hidden_size,
intermediate_size, num_attention_heads, num_key_value_heads, num_hidden_layers,
max_position_embeddings, rms_norm_eps, vocab_size, rope_theta,
tie_word_embeddings) so the code in configuration_qwen2_5omni.hpp handles a
missing "text_config" deterministically.
inline auto makeAudioSinusoidalPosEmb(int32_t length, int32_t channels, float max_timescale = 10000.0f) -> Tensor {
  MLLM_RT_ASSERT(channels % 2 == 0);
  auto pos_emb = Tensor::empty({length, channels}, kFloat32, kCPU).alloc();
  auto pos_ptr = pos_emb.ptr<float>();

  const int half = channels / 2;
  const float log_timescale_increment = std::log(max_timescale) / static_cast<float>(half - 1);
Potential division by zero when channels == 2.
Line 403 computes std::log(max_timescale) / static_cast<float>(half - 1) where half = channels / 2. If channels == 2, half is 1 and this divides by zero. While unlikely given typical model configs (e.g. audio_d_model = 1024), a guard would be defensive.
🛡️ Proposed fix
const int half = channels / 2;
+ MLLM_RT_ASSERT(half > 1);
  const float log_timescale_increment = std::log(max_timescale) / static_cast<float>(half - 1);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 397 - 403, The
computation in makeAudioSinusoidalPosEmb can divide by zero when channels==2
because half = channels/2 -> 1 and you compute log_timescale_increment =
std::log(max_timescale)/static_cast<float>(half - 1); change the logic to guard
when half <= 1: if half > 1 compute log_timescale_increment as before, otherwise
set log_timescale_increment = 0 (and subsequently handle inv_timescales/scale
computation accordingly so you don't divide by zero or produce NaNs). Update
references in makeAudioSinusoidalPosEmb to use this conditional denominator and
ensure the generated pos_emb for the half==1 case produces a valid (e.g., zero
or constant) embedding rather than crashing.
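The guarded denominator can be exercised in isolation. Below is a minimal standalone sketch of the sinusoidal position-embedding fill, using a plain `std::vector` and `assert` in place of the repo's `Tensor` and `MLLM_RT_ASSERT` (the helper name and layout are assumptions, not the repo's API); when `half == 1` the increment falls back to zero instead of dividing by zero:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: sinusoidal position embeddings with a guard for the
// half == 1 edge case (channels == 2), where (half - 1) would be zero.
std::vector<float> sinusoidalPosEmb(int length, int channels, float max_timescale = 10000.0f) {
  assert(channels % 2 == 0);
  const int half = channels / 2;
  // Conditional denominator: with a single timescale there is nothing to
  // interpolate over, so the log increment is simply 0 (inv_timescale == 1).
  const float log_inc =
      (half > 1) ? std::log(max_timescale) / static_cast<float>(half - 1) : 0.0f;
  std::vector<float> emb(static_cast<size_t>(length) * channels);
  for (int pos = 0; pos < length; ++pos) {
    for (int i = 0; i < half; ++i) {
      const float inv_timescale = std::exp(-log_inc * static_cast<float>(i));
      emb[pos * channels + i] = std::sin(pos * inv_timescale);          // sin block
      emb[pos * channels + half + i] = std::cos(pos * inv_timescale);   // cos block
    }
  }
  return emb;
}
```

With this guard, `channels == 2` yields plain `sin(pos)` / `cos(pos)` pairs rather than NaNs.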
```cpp
if (!generated_codes.empty()) { generated_codes.pop_back(); }
if (generated_codes.empty()) {
  MLLM_ERROR_EXIT(ExitCode::kCoreError, "Talker produced no codec tokens.");
}
return makeTokenTensor(generated_codes);
```
Last valid codec token is dropped when the talker reaches max_new_tokens without EOS.
Line 2026 unconditionally pops the last element. This correctly strips a trailing EOS token when the loop breaks at line 2017, but if the loop exits naturally (max iterations reached), the last real codec token is removed instead.
🐛 Proposed fix — only pop when the loop ended on EOS

```diff
+bool ended_with_eos = !generated_codes.empty() &&
+    std::find(eos_ids.begin(), eos_ids.end(), generated_codes.back()) != eos_ids.end();
-if (!generated_codes.empty()) { generated_codes.pop_back(); }
+if (ended_with_eos) { generated_codes.pop_back(); }
 if (generated_codes.empty()) {
   MLLM_ERROR_EXIT(ExitCode::kCoreError, "Talker produced no codec tokens.");
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 2026 - 2030,
The code currently unconditionally pops the last element of generated_codes
which removes a valid token when generation stops due to max_new_tokens; change
it to only pop when the last token is an EOS token (or when the talker signaled
it ended on EOS). Locate where generated_codes is modified and returned (around
generated_codes, makeTokenTensor and the loop that calls the talker), and
replace the unconditional pop_back() with a check like: if
(!generated_codes.empty() && generated_codes.back() ==
<EOS_ID_or_codec.eos_token>) { generated_codes.pop_back(); } or use the loop’s
ended_on_eos flag if available to decide whether to remove the trailing EOS
before calling makeTokenTensor; leave the MLLM_ERROR_EXIT path intact for empty
results.
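The "only pop on EOS" logic is easy to check in isolation. Here is a minimal standalone sketch (the helper name and the plain `std::vector<long>` token representation are assumptions for illustration, not the repo's types):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch: strip a trailing EOS token only when generation
// actually ended on EOS, so a sequence truncated at max_new_tokens keeps
// its final (valid) codec token.
std::vector<long> stripTrailingEos(std::vector<long> codes, const std::vector<long>& eos_ids) {
  const bool ended_with_eos = !codes.empty()
      && std::find(eos_ids.begin(), eos_ids.end(), codes.back()) != eos_ids.end();
  if (ended_with_eos) { codes.pop_back(); }
  return codes;
}
```

A sequence ending in an EOS id loses only that token; a sequence cut off by the iteration limit is returned unchanged.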
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm/backends/cpu/ops/EmbeddingOp.cpp (1)
60-67: ⚠️ Potential issue | 🟠 Major

The `kGGUF_Q4_0` case was not updated during the refactor — shadow variable and dead check. Two problems introduced by the surrounding restructuring:

Line 61 — shadowed variable: `auto token_idx` re-fetches the same value already available as `const auto token_idx` from line 35. This is inside the `else` branch (valid token range guaranteed), so the re-declaration is dead computation and will trigger `-Wshadow` compiler warnings.

Line 62 — dead guard: `if (token_idx >= 0)` is now unreachable-false because the `else` block at line 44 already ensures `token_idx >= 0 && token_idx < options_.vocab_size`. It can never be false here, so the entire `if` is dead code and silently drops the upper-bound guard implied by the original pre-refactor logic.

🐛 Proposed fix — align `kGGUF_Q4_0` with the refactored pattern

```diff
 case kGGUF_Q4_0: {
-  auto token_idx = *ins.coffsettedPtr<mllm_int64_t>({b, (int)s});
-  if (token_idx >= 0) {
-    dequantize_row_q4_0(weight_.ptr<block_q4_0>() + token_idx * options_.hidden_size / QK4_0,
-                        ous.coffsettedPtr<float>({b, (int)s, 0}), options_.hidden_size);
-  }
+  dequantize_row_q4_0(weight_.ptr<block_q4_0>() + token_idx * options_.hidden_size / QK4_0,
+                      reinterpret_cast<float*>(out_ptr), options_.hidden_size);
   break;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/backends/cpu/ops/EmbeddingOp.cpp` around lines 60 - 67, in the kGGUF_Q4_0 branch remove the shadowed re-declaration and the redundant if-check: use the existing const auto token_idx (already computed earlier from ins.coffsettedPtr) instead of creating a new auto token_idx, and drop the unreachable "if (token_idx >= 0)" guard so you directly call dequantize_row_q4_0(weight_.ptr<block_q4_0>() + token_idx * options_.hidden_size / QK4_0, ous.coffsettedPtr<float>({b, (int)s, 0}), options_.hidden_size); ensuring the token_idx type and usage match the earlier declaration and preserve the upper-bound assumptions from the surrounding else branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mllm/backends/cpu/ops/EmbeddingOp.cpp`:
- Line 59: Remove the trailing whitespace on the blank line at line 59 in
mllm/backends/cpu/ops/EmbeddingOp.cpp (the file containing the EmbeddingOp
implementation); replace the whitespace-only line with either an empty newline
(no spaces) or remove the line entirely so no line ends with trailing spaces,
then run the repository linter/format step to ensure no other lines violate the
"no trailing whitespace" rule.
---
Outside diff comments:
In `@mllm/backends/cpu/ops/EmbeddingOp.cpp`:
- Around line 60-67: In the kGGUF_Q4_0 branch of EmbeddingOp.cpp remove the
shadowed re-declaration and the redundant if-check: use the existing const auto
token_idx (already computed earlier from ins.coffsettedPtr) instead of creating
a new auto token_idx, and drop the unreachable "if (token_idx >= 0)" guard so
you directly call dequantize_row_q4_0(weight_.ptr<block_q4_0>() + token_idx *
options_.hidden_size / QK4_0, ous.coffsettedPtr<float>({b, (int)s, 0}),
options_.hidden_size); ensuring the token_idx type and usage match the earlier
declaration and preserve the upper-bound assumptions from the surrounding else
branch.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/CMakeLists.txt
- mllm/backends/cpu/ops/EmbeddingOp.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/CMakeLists.txt
```cpp
  }
  break;
}

```
Trailing whitespace on line 59 — coding guidelines violation.
Line 59 contains only whitespace characters. As per coding guidelines, "No line may end with trailing whitespace."
🛠️ Proposed fix: delete the whitespace-only line, or replace it with an empty line containing no spaces.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/backends/cpu/ops/EmbeddingOp.cpp` at line 59, Remove the trailing
whitespace on the blank line at line 59 in mllm/backends/cpu/ops/EmbeddingOp.cpp
(the file containing the EmbeddingOp implementation); replace the
whitespace-only line with either an empty newline (no spaces) or remove the line
entirely so no line ends with trailing spaces, then run the repository
linter/format step to ensure no other lines violate the "no trailing whitespace"
rule.
Actionable comments posted: 9
🧹 Nitpick comments (3)
examples/minicpm_o45/CMakeLists.txt (1)

5-7: Consider removing the commented-out Python runner block.

The `mllm-minicpm-o45-runner-python` target and the corresponding `main_python.cpp` source don't appear to be part of this PR. Committed dead CMake code adds noise; prefer tracking the future work in an issue rather than leaving it commented out.

🧹 Proposed cleanup

```diff
 add_executable(mllm-minicpm-o45-runner main.cpp)
 target_link_libraries(mllm-minicpm-o45-runner PRIVATE MllmRT MllmCPUBackend)
 target_include_directories(mllm-minicpm-o45-runner PRIVATE ${MLLM_INCLUDE_DIR})
-
-# add_executable(mllm-minicpm-o45-runner-python main_python.cpp)
-# target_link_libraries(mllm-minicpm-o45-runner-python PRIVATE MllmRT MllmCPUBackend)
-# target_include_directories(mllm-minicpm-o45-runner-python PRIVATE ${MLLM_INCLUDE_DIR})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/minicpm_o45/CMakeLists.txt` around lines 5 - 7, remove the commented-out Python runner CMake block to avoid dead code: delete the three commented lines that reference the mllm-minicpm-o45-runner-python target, its link libraries (MllmRT, MllmCPUBackend), and the target_include_directories entry for ${MLLM_INCLUDE_DIR}; if this functionality is desired later, track it as an issue rather than leaving commented CMake in the repo (identify the block by the target name mllm-minicpm-o45-runner-python and the source main_python.cpp).

examples/qwen2_5omni/image_infer_dbg.cpp (1)
13-13: Remove the commented-out `MLLM_MAIN` macro.

This is dead code left over from a previous iteration. Since `int main(...)` is used directly, the comment adds noise.

```diff
-//MLLM_MAIN({
 int main(int argc, char** argv) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen2_5omni/image_infer_dbg.cpp` at line 13, remove the dead commented-out macro invocation by deleting the line containing "//MLLM_MAIN({" so the file uses the explicit int main(...) implementation only; locate the commented token "MLLM_MAIN" in examples/qwen2_5omni/image_infer_dbg.cpp and remove that commented-out macro line to avoid noise.

examples/qwen2_5omni/audio_out_infer.cpp (1)
89-89: Add a `perfSummary()` delegation method to `Qwen2_5OmniForConditionalGeneration`.

Calling `qwen_omni.thinker_.perfSummary()` directly exposes an internal member. Unlike other model examples (text_infer, image_infer) that call `.perfSummary()` directly on the model instance, `Qwen2_5OmniForConditionalGeneration` doesn't expose this method. Add a public `void perfSummary() { thinker_.perfSummary(); }` wrapper for API consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen2_5omni/audio_out_infer.cpp` at line 89, Add a public delegation method to Qwen2_5OmniForConditionalGeneration so callers can invoke perfSummary() without accessing the internal thinker_ member; implement a public void perfSummary() that simply calls thinker_.perfSummary(); place it in the class public section (matching other examples) so examples like audio_out_infer.cpp can call model.perfSummary() instead of qwen_omni.thinker_.perfSummary().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/qwen2_5omni/audio_out_infer.cpp`:
- Around line 31-36: The code silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is neither "v1" nor "v2";
update the conditional in the block that sets file_version (which currently
checks model_version.get() and assigns to file_version) to include a final else
branch that rejects invalid values—e.g., log or print an error mentioning the
invalid model_version.get() value and exit/throw so the program fails fast
rather than proceeding with a wrong mllm::ModelFileVersion.
- Around line 75-76: Check for the presence of the "sequence" key and verify
tensor rank before indexing: avoid using operator[] on inputs (which may
default-insert) and instead use find/contains or at(), and ensure
inputs["sequence"].shape().size() >= 2 before reading shape()[1]; do the same
for output.sequences (ensure output.sequences.shape().size() >= 2) before
computing input_len and total_len, and return or throw a clear error if the
checks fail so you don't perform an out-of-bounds access when computing
input_len and total_len.
- Around line 77-81: The code mixes narrow (fmt::print) and wide (std::wcout)
output which can cause UB and ordering issues; change to a single stream: either
(A) use narrow output only by making qwen_tokenizer.detokenize return or be
converted to std::string/UTF-8 and replace std::wcout with fmt::print or
std::cout, or (B) use wide output only by replacing fmt::print calls with
std::wcout and ensure detokenize returns std::wstring; also fix the loop index
narrowing by changing "int i = input_len" to a matching unsigned type (e.g.,
size_t or the type of input_len) or use auto with an explicit cast to avoid
signed/unsigned conversion. Reference symbols: fmt::print, std::wcout,
qwen_tokenizer.detokenize, output.sequences.at<mllm::mllm_int64_t>, and the loop
variable "i".
- Around line 83-85: The code is using wav.shape().back() and wav.ptr<float>()
unsafely when output.wav may be empty; before constructing wenet::WavWriter or
calling wav_writer.Write, check that output.wav is non-empty (e.g., verify
output.wav.numel() > 0 or !output.wav.shape().empty() &&
output.wav.shape().back() > 0), and only then compute the sample count and get
wav.ptr<float>(); if empty, skip creating wenet::WavWriter and avoid calling
Write (or write an empty file/return an error), referencing the
variables/functions wav, output.wav, wenet::WavWriter, wav_writer.Write, and
output_path to locate the change. Ensure you convert the sample count to the
integer type expected by WavWriter and do not dereference wav.ptr<float>() when
numel == 0.
- Around line 17-42: The help flag check is executed too late (after
required-arg validation) so running with only -h fails; after calling
Argparse::parse(argc, argv) immediately check help.isSet() and handle it (call
Argparse::printHelp() and mllm::shutdownContext() then return) before any
required-argument-dependent logic such as the model_version parsing; update the
code around Argparse::parse, help.isSet(), Argparse::printHelp(), and
mllm::shutdownContext() accordingly.
In `@examples/qwen2_5omni/image_infer_dbg.cpp`:
- Around line 28-33: The code silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is not "v1" or "v2"; change
the logic in the block that sets file_version (using model_version.get()) to
validate the value and reject unknown values instead of falling back: if
model_version is "v1" or "v2" set file_version accordingly, otherwise log an
error (or throw/exit) indicating the unsupported model_version value and list
accepted values, so callers cannot accidentally load the wrong layout when
model_version is mistyped.
- Around line 53-63: The code currently hardcodes image_path and prompt_text and
leaves the interactive prompts unused; restore interactivity by re-enabling the
std::getline calls for image_path and prompt_text (remove or guard the debug
assignments "../../rsc/pics.jpg" and "描述图片中物体") so the prompts actually read
from stdin, ensuring the exit/quit check on image_path and subsequent logic work
as intended; update the block around the variables image_path, prompt_text, the
std::getline calls, and the exit check so the program behaves interactively
again.
- Around line 20-39: The help flag check must run before Argparse::parse()
because parse() enforces required arguments and will exit before help.isSet() is
evaluated; move the current help parsing logic so you call Argparse::add for
-h/--help, then inspect help.isSet() and call Argparse::printHelp() and
mllm::shutdownContext()/return before invoking Argparse::parse(), or
alternatively update Argparse::parse() to short-circuit when it detects
-h/--help (exempting it from the required-argument validation) so the existing
if (help.isSet()) { Argparse::printHelp(); mllm::shutdownContext(); return 0; }
can run.
- Around line 69-78: The callback mixes fmt::print (narrow stdout) and
std::wcout (wide stdout) causing unsafe interleaving: update the
qwen2_5omni.streamGenerate callback to write via a single narrow output path
instead of std::wcout — call qwen2_5omni_tokenizer.detokenize and ensure you
get/convert a UTF‑8 std::string, then emit it with fmt::print (or std::cout) so
all output uses the same channel; if you truly need wide output, flush
fmt::print before the response and switch to using std::wcout exclusively for
the whole response block.
---
Nitpick comments:
In `@examples/minicpm_o45/CMakeLists.txt`:
- Around line 5-7: Remove the commented-out Python runner CMake block to avoid
dead code: delete the three commented lines that reference the
mllm-minicpm-o45-runner-python target, its link libraries (MllmRT,
MllmCPUBackend), and the target_include_directories entry for
${MLLM_INCLUDE_DIR}; if this functionality is desired later, track it as an
issue rather than leaving commented CMake in the repo (identify the block by the
target name mllm-minicpm-o45-runner-python and the source main_python.cpp).
In `@examples/qwen2_5omni/audio_out_infer.cpp`:
- Line 89: Add a public delegation method to Qwen2_5OmniForConditionalGeneration
so callers can invoke perfSummary() without accessing the internal thinker_
member; implement a public void perfSummary() that simply calls
thinker_.perfSummary(); place it in the class public section (matching other
examples) so examples like audio_out_infer.cpp can call model.perfSummary()
instead of qwen_omni.thinker_.perfSummary().
In `@examples/qwen2_5omni/image_infer_dbg.cpp`:
- Line 13: Remove the dead commented-out macro invocation by deleting the line
containing "//MLLM_MAIN({" so the file uses the explicit int main(...)
implementation only; locate the commented token "MLLM_MAIN" in
examples/qwen2_5omni/image_infer_dbg.cpp and remove that commented-out macro
line to avoid noise.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/minicpm_o45/CMakeLists.txt
- examples/qwen2_5omni/audio_out_infer.cpp
- examples/qwen2_5omni/image_infer_dbg.cpp
```cpp
auto& help = Argparse::add<bool>("-h|--help").help("Show help message");
auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model path").required(true);
auto& model_version = Argparse::add<std::string>("-mv|--model_version").help("Model version").required(true);
auto& tokenizer_path = Argparse::add<std::string>("-t|--tokenizer_path").help("Tokenizer directory").required(true);
auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);
auto& spk_dict_path = Argparse::add<std::string>("-s|--spk_dict_path").help("Speaker json path").required(true);
auto& prompt = Argparse::add<std::string>("-p|--prompt").help("Prompt text").def("");
auto& image_path = Argparse::add<std::string>("-i|--image_path").help("Image path").def("");
auto& audio_path = Argparse::add<std::string>("-a|--audio_path").help("Audio path").def("");
auto& speaker = Argparse::add<std::string>("-sp|--speaker").help("Speaker name (default: first entry)").def("");
auto& output_path = Argparse::add<std::string>("-o|--output_path").help("Output wav path").def("./qwen2_5omni.wav");

Argparse::parse(argc, argv);

mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}

if (help.isSet()) {
  Argparse::printHelp();
  mllm::shutdownContext();
  return 0;
}
```
❓ Verification inconclusive
Script executed:
#!/bin/bash
# Find the Argparse implementation
find . -type f \( -name "*.h" -o -name "*.hpp" -o -name "*.cpp" \) | xargs grep -l "class Argparse" | head -5

Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
./mllm/utils/Argparse.hpp
Script executed:
#!/bin/bash
# Check if Argparse is in a well-known location (include directory)
fd -e h -e hpp "Argparse" | head -10

Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
mllm/utils/Argparse.hpp
Script executed:
#!/bin/bash
# Search for Argparse::parse implementation
rg -n "void Argparse::parse|class Argparse" --type=cpp --type=hpp -A 5 | head -50

Repository: UbiquitousLearning/mllm
Exit code: 0
stderr:
unrecognized file type: hpp
Script executed:
#!/bin/bash
# Read the Argparse.hpp file to understand parse() implementation
wc -l ./mllm/utils/Argparse.hpp

Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
279 ./mllm/utils/Argparse.hpp
Script executed:
#!/bin/bash
# Look at parse() method in detail
rg -n "parse\(" ./mllm/utils/Argparse.hpp -A 30 -B 2

Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
23- public:
24- virtual ~ArgumentBase() = default;
25: virtual void parse(const std::string& value) = 0;
26- virtual void handleFlag() = 0;
27- [[nodiscard]] virtual const std::vector<std::string>& flags() const = 0;
28- [[nodiscard]] virtual bool isPositional() const = 0;
29- [[nodiscard]] virtual bool isRequired() const = 0;
30- [[nodiscard]] virtual bool isSet() const = 0;
31- [[nodiscard]] virtual std::string help() const = 0;
32- [[nodiscard]] virtual std::string meta() const = 0;
33- [[nodiscard]] virtual bool isBoolean() const = 0;
34- [[nodiscard]] virtual bool needsValue() const = 0;
35-};
36-
37-template<typename T>
38-class Argument : public ArgumentBase, public ArgArgument<Argument<T>> {
39- public:
40- Argument& flag(const std::string& f) {
41- flags_.push_back(f);
42- return *this;
43- }
44-
45- Argument& help(const std::string& text) {
46- help_ = text;
47- return *this;
48- }
49-
50- Argument& required(bool val = true) {
51- required_ = val;
52- return *this;
53- }
54-
55- Argument& positional(bool val = true) {
--
68- }
69-
70: void parse(const std::string& value) override {
71- std::istringstream iss(value);
72- if constexpr (std::is_same_v<T, bool>) {
73- // Handle boolean values specially
74- if (value == "true" || value == "1") {
75- value_ = true;
76- } else if (value == "false" || value == "0") {
77- value_ = false;
78- } else {
79- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid boolean value");
80- }
81- } else {
82- if constexpr (std::is_same<T, std::string>()) {
83- value_ = iss.str();
84- } else {
85- iss >> value_;
86- }
87- if (iss.fail()) { MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid value type"); }
88- }
89- is_set_ = true;
90- }
91-
92- void handleFlag() override {
93- if constexpr (std::is_same_v<T, bool>) {
94- value_ = true;
95- is_set_ = true;
96- } else {
97- throw std::runtime_error("Non-boolean flag not supported");
98- }
99- }
100-
--
133-class Argparse {
134- public:
135: Argparse() = default;
136-
137- template<typename T>
138- static Argument<T>& add(const std::string& flags) {
139- auto& inst = instance();
140- auto arg = std::make_unique<Argument<T>>();
141- auto flag_list = splitFlags(flags);
142- for (const auto& f : flag_list) { arg->flag(f); }
143- if (flag_list.empty()) arg->positional();
144- Argument<T>* ptr = arg.get();
145- inst.args_.push_back(std::move(arg));
146- return *ptr;
147- }
148-
149: static void parse(int argc, char* argv[]) {
150- auto& inst = instance();
151- std::vector<std::string> args(argv + 1, argv + argc);
152- std::vector<std::string> positional_args;
153-
154- // First parse options
155- for (size_t i = 0; i < args.size(); ++i) {
156- const auto& arg = args[i];
157- if (arg[0] == '-') {
158- bool found = false;
159- for (auto& param : inst.args_) {
160- const auto& flags = param->flags();
161- if (std::find(flags.begin(), flags.end(), arg) != flags.end()) {
162- if (param->isBoolean()) {
163- param->handleFlag();
164- } else {
165- if (i + 1 >= args.size()) {
166- printHelp();
167- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing value for {}", arg);
168- }
169: param->parse(args[++i]);
170- }
171- found = true;
172- break;
173- }
174- }
175- if (!found) {
176- printHelp();
177- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unknown option: {}", arg);
178- }
179- } else {
180- positional_args.push_back(arg);
181- }
182- }
183-
184- // Then parse positional arguments
185- size_t pos_idx = 0;
186- for (auto& param : inst.args_) {
187- if (param->isPositional()) {
188- if (pos_idx >= positional_args.size()) {
189- if (param->isRequired()) {
190- printHelp();
191- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing positional argument");
192- }
193- continue;
194- }
195: param->parse(positional_args[pos_idx++]);
196- }
197- }
198-
199- // Check requirements
200- for (auto& param : inst.args_) {
201- if (param->isRequired() && !param->isSet()) {
202- printHelp();
203- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing required argument: {}", param->meta());
204- }
205- }
206- }
207-
208- template<typename T>
209- static T get(const std::string& flag) {
210- auto& inst = instance();
211- for (auto& param : inst.args_) {
212- const auto& flags = param->flags();
213- if (std::find(flags.begin(), flags.end(), flag) != flags.end()) {
214- auto* p = dynamic_cast<Argument<T>*>(param.get());
215- if (p) { return p->get(); }
216- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Type mismatch for {}", flag);
217- }
218- }
219- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Argument not found: {}", flag);
220- }
221-
222- static void printHelp() {
223- auto& inst = instance();
224- fmt::println("Usage:");
225- for (const auto& arg : inst.args_) {
Script executed:
#!/bin/bash
# Check for any help flag special handling
rg -n "help" ./mllm/utils/Argparse.hpp -i -B 2 -A 2

Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
29- [[nodiscard]] virtual bool isRequired() const = 0;
30- [[nodiscard]] virtual bool isSet() const = 0;
31: [[nodiscard]] virtual std::string help() const = 0;
32- [[nodiscard]] virtual std::string meta() const = 0;
33- [[nodiscard]] virtual bool isBoolean() const = 0;
--
43- }
44-
45: Argument& help(const std::string& text) {
46: help_ = text;
47- return *this;
48- }
--
107- [[nodiscard]] bool isSet() const override { return is_set_; }
108-
109: [[nodiscard]] std::string help() const override { return help_; }
110-
111- [[nodiscard]] std::string meta() const override { return meta_; }
--
122- private:
123- std::vector<std::string> flags_;
124: std::string help_;
125- std::string meta_;
126- T value_;
--
164- } else {
165- if (i + 1 >= args.size()) {
166: printHelp();
167- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing value for {}", arg);
168- }
--
174- }
175- if (!found) {
176: printHelp();
177- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unknown option: {}", arg);
178- }
--
188- if (pos_idx >= positional_args.size()) {
189- if (param->isRequired()) {
190: printHelp();
191- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing positional argument");
192- }
--
200- for (auto& param : inst.args_) {
201- if (param->isRequired() && !param->isSet()) {
202: printHelp();
203- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing required argument: {}", param->meta());
204- }
--
220- }
221-
222: static void printHelp() {
223- auto& inst = instance();
224- fmt::println("Usage:");
--
238- for (const auto& arg : inst.args_) {
239- if (arg->isPositional()) {
240: fmt::println(" <{}>\t{}", arg->meta(), arg->help());
241- } else {
242- fmt::print(" ");
--
245- fmt::print("{}", arg->flags()[i]);
246- }
247: fmt::println("\t{}", arg->help());
248- }
249- }
--help won't work in isolation due to required-arg enforcement.
Argparse::parse(argc, argv) (line 29) is called before help.isSet() is checked (line 38), while five arguments are declared required(true). The parse() method validates required fields at the end (checking all required arguments before returning); when invoked with only -h, it will print help and exit with an error about missing required arguments before the help flag check in your code is ever reached. Move the help check to immediately after Argparse::parse(), or modify Argparse to skip required-field validation when the help flag is set.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/audio_out_infer.cpp` around lines 17 - 42, The help flag
check is executed too late (after required-arg validation) so running with only
-h fails; after calling Argparse::parse(argc, argv) immediately check
help.isSet() and handle it (call Argparse::printHelp() and
mllm::shutdownContext() then return) before any required-argument-dependent
logic such as the model_version parsing; update the code around Argparse::parse,
help.isSet(), Argparse::printHelp(), and mllm::shutdownContext() accordingly.
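Independent of how `Argparse` itself might be patched, a pre-scan of `argv` is one way to honor `-h` before required-argument validation ever runs. A minimal sketch (the helper name is hypothetical, not part of the repo's `Argparse` API):

```cpp
#include <cassert>
#include <cstring>

// Hypothetical sketch: detect -h/--help before calling the parser, so help
// can be printed even when required arguments are absent.
bool wantsHelp(int argc, char* argv[]) {
  for (int i = 1; i < argc; ++i) {
    if (std::strcmp(argv[i], "-h") == 0 || std::strcmp(argv[i], "--help") == 0) {
      return true;
    }
  }
  return false;
}
```

The caller would run `if (wantsHelp(argc, argv)) { Argparse::printHelp(); mllm::shutdownContext(); return 0; }` before `Argparse::parse(argc, argv)`, sidestepping the required-field check entirely.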
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Add an else branch to reject invalid model_version values.
An unrecognized version string (anything other than "v1" or "v2") silently falls through and defaults to kV1 with no diagnostic, which will produce a hard-to-debug mismatch between the loaded weights and the expected model file format.
🐛 Proposed fix

```diff
 mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
 if (model_version.get() == "v1") {
   file_version = mllm::ModelFileVersion::kV1;
 } else if (model_version.get() == "v2") {
   file_version = mllm::ModelFileVersion::kV2;
+} else {
+  MLLM_ERROR_EXIT(mllm::ExitCode::kCoreError,
+                  fmt::format("Unknown model_version '{}'. Expected 'v1' or 'v2'.", model_version.get()).c_str());
 }
```

📝 Committable suggestion
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
} else {
  MLLM_ERROR_EXIT(mllm::ExitCode::kCoreError,
                  fmt::format("Unknown model_version '{}'. Expected 'v1' or 'v2'.", model_version.get()).c_str());
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/audio_out_infer.cpp` around lines 31 - 36, The code
silently defaults file_version to mllm::ModelFileVersion::kV1 when
model_version.get() is neither "v1" nor "v2"; update the conditional in the
block that sets file_version (which currently checks model_version.get() and
assigns to file_version) to include a final else branch that rejects invalid
values—e.g., log or print an error mentioning the invalid model_version.get()
value and exit/throw so the program fails fast rather than proceeding with a
wrong mllm::ModelFileVersion.
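The fail-fast parse can also be expressed as a small total function that the caller checks. A sketch with a stand-in enum and `std::optional` in place of the repo's `mllm::ModelFileVersion` and `MLLM_ERROR_EXIT` (both names here are illustrative assumptions):

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class FileVersion { kV1, kV2 };  // stand-in for mllm::ModelFileVersion

// Hypothetical sketch: strict version parsing — anything other than "v1"/"v2"
// yields nullopt so the caller must handle it explicitly instead of silently
// defaulting to kV1.
std::optional<FileVersion> parseFileVersion(const std::string& s) {
  if (s == "v1") return FileVersion::kV1;
  if (s == "v2") return FileVersion::kV2;
  return std::nullopt;  // unknown value: caller should error out
}
```

The call site then becomes `auto v = parseFileVersion(model_version.get()); if (!v) { /* report and exit */ }`, which makes the unsupported-value path impossible to overlook.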
```cpp
auto input_len = inputs["sequence"].shape()[1];
auto total_len = output.sequences.shape()[1];
```
Unchecked shape()[1] indexing — Cppcheck flags out-of-bounds access.
Both inputs["sequence"].shape()[1] and output.sequences.shape()[1] index position 1 without asserting the shape has at least 2 dimensions. If the tokenizer returns a scalar/1-D tensor, or if the "sequence" key is absent in inputs (which would default-insert an empty tensor on a std::map-like container), this is a runtime out-of-bounds read.
🛡️ Proposed defensive fix
+ MLLM_CHECK(inputs.count("sequence") && inputs["sequence"].shape().size() >= 2,
+ "inputs[\"sequence\"] must be a 2-D tensor");
auto input_len = inputs["sequence"].shape()[1];
+ MLLM_CHECK(output.sequences.shape().size() >= 2,
+ "output.sequences must be a 2-D tensor");
  auto total_len = output.sequences.shape()[1];

📝 Committable suggestion
| auto input_len = inputs["sequence"].shape()[1]; | |
| auto total_len = output.sequences.shape()[1]; | |
| MLLM_CHECK(inputs.count("sequence") && inputs["sequence"].shape().size() >= 2, | |
| "inputs[\"sequence\"] must be a 2-D tensor"); | |
| auto input_len = inputs["sequence"].shape()[1]; | |
| MLLM_CHECK(output.sequences.shape().size() >= 2, | |
| "output.sequences must be a 2-D tensor"); | |
| auto total_len = output.sequences.shape()[1]; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/audio_out_infer.cpp` around lines 75 - 76, Check for the
presence of the "sequence" key and verify tensor rank before indexing: avoid
using operator[] on inputs (which may default-insert) and instead use
find/contains or at(), and ensure inputs["sequence"].shape().size() >= 2 before
reading shape()[1]; do the same for output.sequences (ensure
output.sequences.shape().size() >= 2) before computing input_len and total_len,
and return or throw a clear error if the checks fail so you don't perform an
out-of-bounds access when computing input_len and total_len.
| fmt::print("\nResponse: "); | ||
| for (int i = input_len; i < total_len; ++i) { | ||
| std::wcout << qwen_tokenizer.detokenize(output.sequences.at<mllm::mllm_int64_t>({0, i})) << std::flush; | ||
| } | ||
| fmt::print("\n"); |
Mixing std::wcout (wide) with fmt::print (narrow) on stdout can produce garbled output or UB.
fmt::print writes narrow bytes to the C FILE* stdout. Once std::wcout is first used on the same underlying stream, POSIX locks the stream orientation to wide; subsequent narrow writes (from fmt::print) then invoke undefined behavior per the C standard. Even before orientation is locked, the two streams have independent buffers, so "\nResponse: " (line 77) and the "\n" (line 81) may appear out of order relative to the std::wcout tokens.
Additionally, int i = input_len on line 78 is a signed/unsigned narrowing conversion if input_len is size_t/int64_t.
♻️ Proposed fix: use a single output stream
- fmt::print("\nResponse: ");
for (int i = input_len; i < total_len; ++i) {
- std::wcout << qwen_tokenizer.detokenize(output.sequences.at<mllm::mllm_int64_t>({0, i})) << std::flush;
+ fmt::print("{}", mllm::to_utf8(qwen_tokenizer.detokenize(
+ output.sequences.at<mllm::mllm_int64_t>({0, i}))));
+ std::fflush(stdout);
}
- fmt::print("\n");
+ fmt::print("\nResponse: ");
+ // (move print before loop, or restructure as shown)

(Exact fix depends on whether detokenize can return a std::string/UTF-8 directly. If it must return std::wstring, consider std::wcout exclusively or convert at the call site.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/audio_out_infer.cpp` around lines 77 - 81, The code
mixes narrow (fmt::print) and wide (std::wcout) output which can cause UB and
ordering issues; change to a single stream: either (A) use narrow output only by
making qwen_tokenizer.detokenize return or be converted to std::string/UTF-8 and
replace std::wcout with fmt::print or std::cout, or (B) use wide output only by
replacing fmt::print calls with std::wcout and ensure detokenize returns
std::wstring; also fix the loop index narrowing by changing "int i = input_len"
to a matching unsigned type (e.g., size_t or the type of input_len) or use auto
with an explicit cast to avoid signed/unsigned conversion. Reference symbols:
fmt::print, std::wcout, qwen_tokenizer.detokenize,
output.sequences.at<mllm::mllm_int64_t>, and the loop variable "i".
| auto wav = output.wav * 32767.0f; | ||
| wenet::WavWriter wav_writer(wav.ptr<float>(), wav.shape().back(), 1, 24000, 16); | ||
| wav_writer.Write(output_path.get()); |
wav.shape().back() is UB and wav.ptr<float>() is unsafe when output.wav is empty.
If generateAudio produces no audio (e.g., because the talker component isn't fully functional — as the PR description notes), output.wav may be a zero-dimensional or empty tensor. Calling .shape().back() on an empty shape vector is undefined behavior, and passing an invalid pointer to WavWriter will corrupt or crash. Cppcheck flags the nearby buffer out-of-bounds at line 82.
🛡️ Proposed guard
+ if (output.wav.shape().empty() || output.wav.shape().back() == 0) {
+ fmt::print("Warning: no audio was generated.\n");
+ } else {
auto wav = output.wav * 32767.0f;
wenet::WavWriter wav_writer(wav.ptr<float>(), wav.shape().back(), 1, 24000, 16);
wav_writer.Write(output_path.get());
fmt::print("Saved audio to {}\n", output_path.get());
+ }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/audio_out_infer.cpp` around lines 83 - 85, The code is
using wav.shape().back() and wav.ptr<float>() unsafely when output.wav may be
empty; before constructing wenet::WavWriter or calling wav_writer.Write, check
that output.wav is non-empty (e.g., verify output.wav.numel() > 0 or
!output.wav.shape().empty() && output.wav.shape().back() > 0), and only then
compute the sample count and get wav.ptr<float>(); if empty, skip creating
wenet::WavWriter and avoid calling Write (or write an empty file/return an
error), referencing the variables/functions wav, output.wav, wenet::WavWriter,
wav_writer.Write, and output_path to locate the change. Ensure you convert the
sample count to the integer type expected by WavWriter and do not dereference
wav.ptr<float>() when numel == 0.
| auto& help = Argparse::add<bool>("-h|--help").help("Show help message"); | ||
| auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model path").required(true); | ||
| auto& model_version = Argparse::add<std::string>("-mv|--model_version").help("Model version").required(true); | ||
| auto& tokenizer_path = Argparse::add<std::string>("-t|--tokenizer_path").help("Tokenizer directory").required(true); | ||
| auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true); | ||
|
| Argparse::parse(argc, argv); | ||
|
| mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1; | ||
| if (model_version.get() == "v1") { | ||
| file_version = mllm::ModelFileVersion::kV1; | ||
| } else if (model_version.get() == "v2") { | ||
| file_version = mllm::ModelFileVersion::kV2; | ||
| } | ||
|
| if (help.isSet()) { | ||
| Argparse::printHelp(); | ||
| mllm::shutdownContext(); | ||
| return 0; | ||
| } |
🧩 Analysis chain
🏁 Script executed:

#!/bin/bash
# Find Argparse header file
fd -e hpp -e h "Argparse" 2>/dev/null | head -20

Repository: UbiquitousLearning/mllm
Length of output: 90

🏁 Script executed:

#!/bin/bash
# Search for Argparse files more broadly
fd "Argparse" 2>/dev/null | grep -v ".git"

Length of output: 90

🏁 Script executed:

#!/bin/bash
# Find where Argparse is defined/included
git ls-files | grep -i argparse

Length of output: 90

🏁 Script executed:

cat mllm/utils/Argparse.hpp

Length of output: 7702
The --help flag cannot work as intended because required argument validation happens inside parse().
When invoked with only --help, the Argparse::parse() method processes the flag and sets help.isSet() to true, but then performs a final validation loop (lines 200–206 in Argparse.hpp) that enforces all required arguments. Since -m, -mv, -t, and -c are not provided, parse() calls MLLM_ERROR_EXIT before returning, and the if (help.isSet()) check on line 35 is never reached.
To fix this, move the help check before parse() or modify parse() to detect and handle --help specially by exempting it from required-argument validation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/image_infer_dbg.cpp` around lines 20 - 39, The help flag
check must run before Argparse::parse() because parse() enforces required
arguments and will exit before help.isSet() is evaluated; move the current help
parsing logic so you call Argparse::add for -h/--help, then inspect help.isSet()
and call Argparse::printHelp() and mllm::shutdownContext()/return before
invoking Argparse::parse(), or alternatively update Argparse::parse() to
short-circuit when it detects -h/--help (exempting it from the required-argument
validation) so the existing if (help.isSet()) { Argparse::printHelp();
mllm::shutdownContext(); return 0; } can run.
| mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1; | ||
| if (model_version.get() == "v1") { | ||
| file_version = mllm::ModelFileVersion::kV1; | ||
| } else if (model_version.get() == "v2") { | ||
| file_version = mllm::ModelFileVersion::kV2; | ||
| } |
Unrecognised --model_version silently falls back to kV1.
Any value other than "v1" or "v2" (e.g. a typo) silently uses kV1, which can load the wrong file layout without any indication of the mistake.
🛡️ Proposed fix: reject invalid values
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print(stderr, "Error: unknown model_version '{}'. Expected 'v1' or 'v2'.\n",
+ model_version.get());
+ mllm::shutdownContext();
+ return 1;
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/image_infer_dbg.cpp` around lines 28 - 33, The code
silently defaults file_version to mllm::ModelFileVersion::kV1 when
model_version.get() is not "v1" or "v2"; change the logic in the block that sets
file_version (using model_version.get()) to validate the value and reject
unknown values instead of falling back: if model_version is "v1" or "v2" set
file_version accordingly, otherwise log an error (or throw/exit) indicating the
unsupported model_version value and list accepted values, so callers cannot
accidentally load the wrong layout when model_version is mistyped.
| std::string image_path; | ||
| std::string prompt_text; | ||
|
| fmt::print("Image path (or 'exit/quit'): "); | ||
| image_path = "../../rsc/pics.jpg"; | ||
| //std::getline(std::cin, image_path); | ||
| if (image_path == "exit" || image_path == "quit") { return 0; } | ||
|
| fmt::print("Prompt text: "); | ||
| prompt_text = "描述图片中物体"; | ||
| //std::getline(std::cin, prompt_text); |
Hardcoded debug inputs make the program non-interactive and misleading.
image_path and prompt_text are hardcoded while the corresponding std::getline calls are commented out. The program prints interactive prompts ("Image path (or 'exit/quit'): ", "Prompt text: ") but never reads from stdin, so users get no opportunity to provide their own inputs. The exit/quit branch on Line 59 is dead code because image_path is always "../../rsc/pics.jpg". This is clearly a leftover debug shortcut that should be reverted before merging.
🐛 Proposed fix: restore interactive input
- std::string image_path;
- std::string prompt_text;
-
- fmt::print("Image path (or 'exit/quit'): ");
- image_path = "../../rsc/pics.jpg";
- //std::getline(std::cin, image_path);
- if (image_path == "exit" || image_path == "quit") { return 0; }
-
- fmt::print("Prompt text: ");
- prompt_text = "描述图片中物体";
- //std::getline(std::cin, prompt_text);
+ std::string image_path;
+ std::string prompt_text;
+
+ fmt::print("Image path (or 'exit/quit'): ");
+ std::getline(std::cin, image_path);
+ if (image_path == "exit" || image_path == "quit") { return 0; }
+
+ fmt::print("Prompt text: ");
+ std::getline(std::cin, prompt_text);

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/image_infer_dbg.cpp` around lines 53 - 63, The code
currently hardcodes image_path and prompt_text and leaves the interactive
prompts unused; restore interactivity by re-enabling the std::getline calls for
image_path and prompt_text (remove or guard the debug assignments
"../../rsc/pics.jpg" and "描述图片中物体") so the prompts actually read from stdin,
ensuring the exit/quit check on image_path and subsequent logic work as
intended; update the block around the variables image_path, prompt_text, the
std::getline calls, and the exit check so the program behaves interactively
again.
| fmt::print("\nResponse: "); | ||
| qwen2_5omni.streamGenerate(inputs, | ||
| { | ||
| {"do_sample", mllm::AnyValue(false)}, | ||
| {"max_length", mllm::AnyValue(qwen2_5omni_cfg.max_cache_length)}, | ||
| }, | ||
| [&](int64_t token_id) { | ||
| auto str = qwen2_5omni_tokenizer.detokenize(token_id); | ||
| std::wcout << str << std::flush; | ||
| }); |
Mixing fmt::print (narrow stdout) and std::wcout (wide stdout) is unsafe inside the streaming callback.
fmt::print writes narrow bytes directly through the C FILE* for stdout, while std::wcout writes through the C++ wide-character stream layer, both ultimately targeting the same file descriptor. Interleaving them — especially in a hot per-token callback — risks garbled output, broken multibyte sequences, and buffering inconsistencies. The detokenized string should be emitted through a single output channel.
🛠️ Proposed fix: use a single narrow output path
If detokenize returns a UTF-8 std::string, prefer:
- fmt::print("\nResponse: ");
qwen2_5omni.streamGenerate(inputs,
{
{"do_sample", mllm::AnyValue(false)},
{"max_length", mllm::AnyValue(qwen2_5omni_cfg.max_cache_length)},
},
[&](int64_t token_id) {
auto str = qwen2_5omni_tokenizer.detokenize(token_id);
- std::wcout << str << std::flush;
+ fmt::print("{}", str);
+ std::fflush(stdout);
});

If detokenize returns std::wstring and wide output is intentional, flush fmt::print output before switching and use std::wcout exclusively for the response section.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/image_infer_dbg.cpp` around lines 69 - 78, The callback
mixes fmt::print (narrow stdout) and std::wcout (wide stdout) causing unsafe
interleaving: update the qwen2_5omni.streamGenerate callback to write via a
single narrow output path instead of std::wcout — call
qwen2_5omni_tokenizer.detokenize and ensure you get/convert a UTF‑8 std::string,
then emit it with fmt::print (or std::cout) so all output uses the same channel;
if you truly need wide output, flush fmt::print before the response and switch
to using std::wcout exclusively for the whole response block.
Summary by CodeRabbit
New Features
Backends & Layers
Tools
Tests