feat: Add Support of Qwen2.5Omni Model and MiniCPM-o-4_5 Model #612
KKkai0315 wants to merge 18 commits into UbiquitousLearning:main from
Conversation
…fixed quantization parameters, updated ActivationQDQ to use MovingAverageMinMaxObserver, and adjusted eps values for better precision. Modified Qwen3 model to utilize FixedActivationQDQ for sigmoid output and ensured dtype consistency in attention calculations.
… debug print statements from Qwen3DecoderLayer
…ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.
…improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.
… zeros, ones, specific values, arange, and random fills. Introduce a new fill-inl.hpp file for optimized implementations and update kernel dispatch to include these operations. Enhance CPUFillOp to utilize the new fill functions for better performance and maintainability.
…d error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs.
…es; ensure position-independent code for flatbuffers. Enhance context creation with existing context checks and improve weight quantization specifications.
… input layer normalization handling in Qwen3DecoderLayer. Update weight conversion logic in training script to address model compatibility issues.
Note: Reviews paused. This branch is under active development, so CodeRabbit has automatically paused this review to avoid overwhelming you with comments from an influx of new commits. This behavior can be configured in the CodeRabbit settings.
📝 Walkthrough
Adds two example projects (Qwen2.5-Omni and MiniCPM-o45), large multimodal model implementations (configs, tokenizers, preprocessors, modeling, token2wav), new ops/layers (ConvTranspose1D, Tanh) with CPU implementations and tests, build updates, and Python tooling for conversion and prompt-cache export.
Sequence Diagram(s)

sequenceDiagram
actor User
participant CLI as Runner
participant Tokenizer as Qwen2_5OmniTokenizer
participant Preproc as Audio/Image Preprocessor
participant Model as Qwen2_5OmniForCausalLM
User->>CLI: provide input (text / image / audio)
CLI->>Tokenizer: load tokenizer & config
CLI->>Preproc: (if media) process file -> features
CLI->>Tokenizer: convertMessage / convertVisionMessage / convertAudioMessage
Tokenizer-->>CLI: token ids + feature tensors
CLI->>Model: forward(input_ids, feature_tensors)
Model->>Model: thinker/encoders fuse modalities
Model-->>CLI: logits / streamed token outputs
loop streaming
CLI->>Tokenizer: detokenize(next_token)
Tokenizer-->>CLI: text chunk
CLI->>User: stream output
end
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In `@examples/qwen2_5omni/audio_infer.cpp`:
- Around line 24-29: The code currently silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is not "v1" or "v2";
instead validate model_version.get() explicitly: examine the string returned by
model_version.get(), set file_version to mllm::ModelFileVersion::kV1 or ::kV2
for "v1" and "v2" respectively, and for any other value print a clear error
mentioning the allowed values and terminate (non-zero exit) to avoid loading the
wrong format; update the block that assigns file_version (and any related usage)
to enforce this validation and fail fast on unsupported values.
- Around line 48-60: The example currently sets audio_path and prompt_text to
empty strings causing processAudioFile()/convertAudioMessage() to get empty
audio via mllm::audio::readWAV("") and fail; restore interactive input by
re-enabling the std::getline calls (or alternatively accept paths via CLI args)
so audio_path and prompt_text are populated before calling
processAudioFile()/convertAudioMessage(); ensure the check for "exit"/"quit"
remains and that prompt_text falls back to a default if empty to avoid empty
prompt usage.
In `@examples/qwen2_5omni/config_qwen2_5omni_7B.json`:
- Around line 5-7: The config contains unused flags enable_audio_output and
enable_talker (alongside model_type "qwen2_5_omni"); update the JSON to avoid
misleading users by either removing the keys enable_audio_output and
enable_talker or setting both to false. Locate the entries for
"enable_audio_output" and "enable_talker" in the config_qwen2_5omni_7B.json and
change their values to false (or delete those lines) so only supported
functionality (the thinker/text output) is advertised.
In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 24-29: The current logic silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is unknown; update the
branch around model_version.get() so you explicitly accept only "v1" and "v2",
set file_version to mllm::ModelFileVersion::kV1 or kV2 accordingly, and
otherwise report an error (include the invalid value and allowed values) and
terminate with a non‑zero exit or throw an exception; reference the
model_version variable and the file_version / mllm::ModelFileVersion symbols
when making this validation change.
- Around line 52-60: Uncomment the two std::getline calls so the program reads
user input instead of using empty strings: after the fmt::print("Image path (or
'exit/quit'): "); call restore std::getline(std::cin, image_path) to populate
image_path (so the exit/quit check works), and after fmt::print("Prompt text:
"); restore std::getline(std::cin, prompt_text) to populate prompt_text; ensure
the variables image_path and prompt_text are used as before and that included
<iostream> usage is consistent with the surrounding code.
In `@examples/qwen2_5omni/text_infer.cpp`:
- Around line 24-29: The code silently defaults to mllm::ModelFileVersion::kV1
when model_version is unknown; change this to validate the input from
model_version.get() and reject unsupported values instead of defaulting. In the
block that sets file_version (referencing model_version.get(), file_version, and
mllm::ModelFileVersion), check for "v1" and "v2" explicitly and if neither
matches, print a clear error mentioning the unsupported model_version and
terminate (e.g., return/exit with non-zero or throw) so the program fails fast
rather than loading the wrong format.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 942-958: The code leaves attn uninitialized when
key_states.dtype() is neither kFloat32 nor kFloat16; add a fallback else branch
in the attention block (around the key_states.dtype() checks) that handles other
dtypes by casting query_states and key_states to kFloat32, computing attn =
nn::functional::matmul(...)* (1.f / sqrtf(head_dim_)), applying mask_ and
softmax_, and then casting attn back to the original key_states.dtype() if
needed (or alternatively throw a clear runtime_error mentioning unsupported
dtype) so that attn is always initialized before using
nn::functional::matmul(attn, value_states).
In `@mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp`:
- Around line 303-316: The code computes img_token_nums from grid_thw and then
inserts tokens without validating it; add a check (similar to
convertAudioMessage's audio_token_nums > 0) after computing img_token_nums and
before ids.insert to ensure img_token_nums > 0, and on failure call
MLLM_ERROR_EXIT with a clear message (e.g., "Invalid image token count") so you
don't call ids.insert(img_token_nums - 1, ...) with zero/negative counts; locate
symbols grid_thw, img_token_nums, image_token_id, ids.insert and the surrounding
convertVisionMessage logic to make the change.
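For illustration, the proposed guard can be sketched as a self-contained function. The names mirror the review comment, but the grid handling and merge factor here are assumptions for the sketch, not the project's actual code (which would use MLLM_ERROR_EXIT rather than an exception):

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical standalone version of the proposed check: derive the image
// token count from a [t, h, w] grid and fail loudly on a non-positive result
// before any ids.insert(img_token_nums - 1, ...) style call can run.
int64_t checkedImageTokenCount(const std::vector<int64_t>& grid_thw, int64_t merge_size) {
  int64_t img_token_nums = 1;
  for (int64_t d : grid_thw) img_token_nums *= d;
  img_token_nums /= (merge_size * merge_size);  // spatial merge of patches
  if (img_token_nums <= 0) {
    // The real code would call MLLM_ERROR_EXIT; an exception keeps this sketch portable.
    throw std::runtime_error("Invalid image token count derived from grid_thw");
  }
  return img_token_nums;
}
```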
🧹 Nitpick comments (7)
mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp (2)
98-114: Redundant condition in whitespace handling logic. After the while loop on line 100 consumes all whitespace, str[pos] is guaranteed to be a non-whitespace character (when pos < str.size()). Therefore, the condition std::iswspace(str[pos]) on line 101 is never true, making the second half of the condition unreachable. This also makes lines 109-114 redundant, since the only way to reach them is if the first block restored pos (line 105), which only happens when the condition on line 101 is false, which it always is unless we're at end of string. Consider clarifying the intended behavior or simplifying this logic.
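A minimal standalone illustration of the point (the function name and signature are invented for this sketch, not taken from the tokenizer):

```cpp
#include <cwctype>
#include <string>

// After this loop exits with pos < str.size(), str[pos] is by construction
// non-whitespace, so a follow-up std::iswspace(str[pos]) check can never be true.
size_t skipWhitespace(const std::wstring& str, size_t pos) {
  while (pos < str.size() && std::iswspace(static_cast<wint_t>(str[pos]))) { ++pos; }
  return pos;
}
```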
189-193: Clarify the technical reason for the max_pixels constraint. The comment "interestingly, the answer went bad when setting max_pixels higher" is informal and doesn't explain the root cause. Consider documenting the technical reason (e.g., memory constraints, model training resolution, numerical stability) to help future maintainers understand why this limit exists.
mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp (2)
19-37: Consider adding existence checks or documenting required fields. The config parser accesses required fields like text_config, hidden_size, intermediate_size, etc. directly without .contains() checks. If the config file is missing these required fields, the parser will throw an exception. This fail-fast behavior may be intentional, but consider either:
- Adding explicit existence checks with descriptive error messages
- Documenting the required config structure in a header comment
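The first option could look like the following standalone sketch. A plain string-to-int map stands in for the project's JSON type, and requiredInt is a hypothetical helper name:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Fetch a required config field, turning a silent missing-key failure into
// an error message that names the field.
int requiredInt(const std::map<std::string, int>& cfg, const std::string& key) {
  auto it = cfg.find(key);
  if (it == cfg.end()) {
    throw std::runtime_error("Missing required config field: " + key);
  }
  return it->second;
}
```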
80-118: Consider extracting common parsing logic to reduce duplication.The root-level config parsing (lines 80-118) largely duplicates the thinker_config parsing (lines 23-79). This is understandable for supporting two config formats, but could be refactored into helper methods if the config format evolves.
mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (2)
958-961: Consider passing layer_idx as a constructor parameter instead of a public member. layer_idx_ is a public member that's set externally (line 1003). This breaks encapsulation and could lead to bugs if the member is not set before use.

🛠️ Suggested refactor

 class Qwen2_5OmniAttention final : public nn::Module {
+  int layer_idx_;
   // ... other members ...
  public:
-  Qwen2_5OmniAttention(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {
+  Qwen2_5OmniAttention(const std::string& name, const Qwen2_5OmniConfig& cfg, int layer_idx)
+      : nn::Module(name), layer_idx_(layer_idx) {
     // ...
   }
-
-  int layer_idx_;
 };

Then update Qwen2_5OmniDecoder and the module registration accordingly.
853-858: Add the explicit keyword for consistency with other constructors. Qwen2_5OmniMLP, Qwen2_5OmniAttention (line 889), Qwen2_5OmniDecoder (line 972), and Qwen2_5OmniText (line 1001) are missing the explicit keyword that other similar classes use (e.g., Qwen2_5OmniVisionMLP at line 421).

🛠️ Suggested fix

-  Qwen2_5OmniMLP(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {
+  explicit Qwen2_5OmniMLP(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {

Apply similar changes to Qwen2_5OmniAttention, Qwen2_5OmniDecoder, and Qwen2_5OmniText.

mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp (1)
43-55: Add an M_PI definition guard for portability. M_PI is not part of the C++ standard and is only a POSIX extension. While commonly available on Unix-like systems, it may be undefined on other platforms (notably Windows with MSVC without _USE_MATH_DEFINES). The codebase already uses this pattern in third_party/wenet_audio/fft.h and mllm/backends/cpu/ops/STFTOp.cpp.

🛠️ Suggested fix

+#ifndef M_PI
+#define M_PI 3.14159265358979323846
+#endif
+
 inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
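As a standalone reference for what the guarded constant feeds into, here is a hedged sketch of a Hann window with the same periodic flag. This mirrors torch.hann_window semantics and is not the project's actual implementation, which returns a Tensor:

```cpp
#include <cmath>
#include <vector>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

// periodic=true divides by N (matching torch.hann_window's default),
// periodic=false divides by N-1 (the symmetric window).
std::vector<float> hannWindow(int window_length, bool periodic = true) {
  std::vector<float> w(window_length);
  const double denom = periodic ? window_length : window_length - 1;
  for (int i = 0; i < window_length; ++i) {
    w[i] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * M_PI * i / denom)));
  }
  return w;
}
```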
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail in confusing ways. Consider rejecting unsupported values explicitly.
Suggested fix
- mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
- if (model_version.get() == "v1") {
- file_version = mllm::ModelFileVersion::kV1;
- } else if (model_version.get() == "v2") {
- file_version = mllm::ModelFileVersion::kV2;
- }
+ mllm::ModelFileVersion file_version;
+ if (model_version.get() == "v1") {
+ file_version = mllm::ModelFileVersion::kV1;
+ } else if (model_version.get() == "v2") {
+ file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print("Unsupported --model_version: {}\n", model_version.get());
+ return 1;
+ }
  std::string audio_path;
  std::string prompt_text;

  fmt::print("Audio path (or 'exit/quit'): ");
  //std::getline(std::cin, audio_path);
  //if (audio_path == "exit" || audio_path == "quit") { return 0; }
  audio_path = "";

  fmt::print("Prompt text: ");
  //std::getline(std::cin, prompt_text);
  //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
  prompt_text = "";
Enable interactive input for the example to work as intended.
The example currently has input hardcoded to empty strings, causing convertAudioMessage() to fail. When an empty audio_path is passed to processAudioFile(), it attempts to read from an empty path via mllm::audio::readWAV(""), which returns empty audio data. This triggers the error check if (audio_result.input_features.isNil() || audio_result.feature_length <= 0) and exits with MLLM_ERROR_EXIT(ExitCode::kIOError, "Failed to extract audio features for Qwen2.5-Omni.").
Restore the commented std::getline() calls to enable interactive input, or wire the paths to CLI arguments.
Suggested fix (restore interactive input)
- fmt::print("Audio path (or 'exit/quit'): ");
- //std::getline(std::cin, audio_path);
- //if (audio_path == "exit" || audio_path == "quit") { return 0; }
- audio_path = "";
-
- fmt::print("Prompt text: ");
- //std::getline(std::cin, prompt_text);
- //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
- prompt_text = "";
+ fmt::print("Audio path (or 'exit/quit'): ");
+ std::getline(std::cin, audio_path);
+ if (audio_path == "exit" || audio_path == "quit") { return 0; }
+
+ fmt::print("Prompt text: ");
+ std::getline(std::cin, prompt_text);
+ if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
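The "wire the paths to CLI arguments" alternative could be sketched like this. The argument positions, the struct, and the default prompt are assumptions for illustration, not part of the example's current interface:

```cpp
#include <string>

struct ExampleArgs {
  std::string audio_path;
  std::string prompt_text;
};

// Read <audio_path> [prompt_text] from argv so the example also works in
// non-interactive, scripted runs.
ExampleArgs readArgs(int argc, char** argv) {
  ExampleArgs args;
  if (argc > 1) args.audio_path = argv[1];
  if (argc > 2) args.prompt_text = argv[2];
  if (args.prompt_text.empty()) args.prompt_text = "Please describe the audio.";
  return args;
}
```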
| "enable_audio_output": true, | ||
| "enable_talker": true, | ||
| "model_type": "qwen2_5_omni", |
Remove or set enable_talker and enable_audio_output to false—these flags are currently unused.
These flags are defined in the config but never parsed or checked in the model code. Setting them to true misleads users about available functionality since only the thinker (text output) is currently loaded and used. To avoid confusion, either remove these flags or set them to false until talker and audio output support are actually implemented.
Suggested change
- "enable_audio_output": true,
- "enable_talker": true,
+ "enable_audio_output": false,
+ "enable_talker": false,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "enable_audio_output": true, | |
| "enable_talker": true, | |
| "model_type": "qwen2_5_omni", | |
| "enable_audio_output": false, | |
| "enable_talker": false, | |
| "model_type": "qwen2_5_omni", |
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail unexpectedly. Consider rejecting unsupported values.
Suggested fix
- mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
- if (model_version.get() == "v1") {
- file_version = mllm::ModelFileVersion::kV1;
- } else if (model_version.get() == "v2") {
- file_version = mllm::ModelFileVersion::kV2;
- }
+ mllm::ModelFileVersion file_version;
+ if (model_version.get() == "v1") {
+ file_version = mllm::ModelFileVersion::kV1;
+ } else if (model_version.get() == "v2") {
+ file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print("Unsupported --model_version: {}\n", model_version.get());
+ return 1;
+ }
| fmt::print("Image path (or 'exit/quit'): "); | ||
| image_path = ""; | ||
| //std::getline(std::cin, image_path); | ||
| if (image_path == "exit" || image_path == "quit") { return 0; } | ||
|
|
||
| fmt::print("Prompt text: "); | ||
| prompt_text = ""; | ||
| //std::getline(std::cin, prompt_text); | ||
|
|
Uncomment the std::getline() calls to restore interactive input.
The example prints prompts for user input but then hardcodes empty strings and has the std::getline() calls commented out. This makes the example non-functional since it will always attempt to process empty image paths and prompts. Restore the interactive input handling by uncommenting lines 54 and 59 (the std::getline() calls).
Suggested fix
- fmt::print("Image path (or 'exit/quit'): ");
- image_path = "";
- //std::getline(std::cin, image_path);
- if (image_path == "exit" || image_path == "quit") { return 0; }
-
- fmt::print("Prompt text: ");
- prompt_text = "";
- //std::getline(std::cin, prompt_text);
+ fmt::print("Image path (or 'exit/quit'): ");
+ std::getline(std::cin, image_path);
+ if (image_path == "exit" || image_path == "quit") { return 0; }
+
+ fmt::print("Prompt text: ");
+ std::getline(std::cin, prompt_text);
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail unexpectedly. Consider rejecting unsupported values.
Suggested fix

```diff
-  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
-  if (model_version.get() == "v1") {
-    file_version = mllm::ModelFileVersion::kV1;
-  } else if (model_version.get() == "v2") {
-    file_version = mllm::ModelFileVersion::kV2;
-  }
+  mllm::ModelFileVersion file_version;
+  if (model_version.get() == "v1") {
+    file_version = mllm::ModelFileVersion::kV1;
+  } else if (model_version.get() == "v2") {
+    file_version = mllm::ModelFileVersion::kV2;
+  } else {
+    fmt::print("Unsupported --model_version: {}\n", model_version.get());
+    return 1;
+  }
```
🤖 Prompt for AI Agents
In `@examples/qwen2_5omni/text_infer.cpp` around lines 24 - 29, The code silently
defaults to mllm::ModelFileVersion::kV1 when model_version is unknown; change
this to validate the input from model_version.get() and reject unsupported
values instead of defaulting. In the block that sets file_version (referencing
model_version.get(), file_version, and mllm::ModelFileVersion), check for "v1"
and "v2" explicitly and if neither matches, print a clear error mentioning the
unsupported model_version and terminate (e.g., return/exit with non-zero or
throw) so the program fails fast rather than loading the wrong format.
```cpp
Tensor attn;
if (key_states.dtype() == kFloat32) {
  attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
  attn = mask_(attn);
  attn = softmax_(attn);
} else if (key_states.dtype() == kFloat16) {
  attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
  attn = mask_(attn);
  attn = softmax_(attn);
  attn = attn.to(kFloat16);
}

auto output = nn::functional::matmul(attn, value_states);
output = output.transpose(1, 2).view({B, S, num_attention_heads_ * head_dim_});
output = o_proj_(output);
return {output};
}
```
Missing fallback for unsupported dtype in attention computation.
If key_states.dtype() is neither kFloat32 nor kFloat16, the attn tensor remains uninitialized (line 942 declares it as Tensor attn;), and the subsequent nn::functional::matmul(attn, value_states) on line 954 would cause undefined behavior.
🛠️ Suggested fix

```diff
 if (key_states.dtype() == kFloat32) {
   attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
   attn = mask_(attn);
   attn = softmax_(attn);
 } else if (key_states.dtype() == kFloat16) {
   attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
   attn = mask_(attn);
   attn = softmax_(attn);
   attn = attn.to(kFloat16);
+} else {
+  MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unsupported dtype for attention computation.");
 }
```
🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 942 - 958, The
code leaves attn uninitialized when key_states.dtype() is neither kFloat32 nor
kFloat16; add a fallback else branch in the attention block (around the
key_states.dtype() checks) that handles other dtypes by casting query_states and
key_states to kFloat32, computing attn = nn::functional::matmul(...)* (1.f /
sqrtf(head_dim_)), applying mask_ and softmax_, and then casting attn back to
the original key_states.dtype() if needed (or alternatively throw a clear
runtime_error mentioning unsupported dtype) so that attn is always initialized
before using nn::functional::matmul(attn, value_states).
```cpp
auto grid_t = grid_thw.ptr<int32_t>()[0];
auto grid_h = grid_thw.ptr<int32_t>()[1];
auto grid_w = grid_thw.ptr<int32_t>()[2];
int32_t img_token_nums = grid_t * grid_h * grid_w;
img_token_nums /= (spatial_merge_size_ * spatial_merge_size_);

auto image_token_id = bpe_._lookup_vocab(L"<|IMAGE|>");
{
  auto it = std::find(ids.begin(), ids.end(), image_token_id);
  if (it == ids.end()) {
    MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing <|IMAGE|> token in Qwen2.5-Omni prompt template.");
  }
  ids.insert(it + 1, img_token_nums - 1, image_token_id);
}
```
Add validation for image token count before insertion.
Unlike convertAudioMessage which validates audio_token_nums > 0 (lines 347-349), convertVisionMessage doesn't validate img_token_nums before the ids.insert() call. If grid dimensions somehow result in zero or negative values, inserting img_token_nums - 1 tokens would cause undefined behavior.
🛠️ Suggested fix

```diff
 int32_t img_token_nums = grid_t * grid_h * grid_w;
 img_token_nums /= (spatial_merge_size_ * spatial_merge_size_);
+if (img_token_nums <= 0) {
+  MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid image token count for Qwen2.5-Omni.");
+}
 auto image_token_id = bpe_._lookup_vocab(L"<|IMAGE|>");
```
🤖 Prompt for AI Agents
In `@mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp` around lines 303 - 316,
The code computes img_token_nums from grid_thw and then inserts tokens without
validating it; add a check (similar to convertAudioMessage's audio_token_nums >
0) after computing img_token_nums and before ids.insert to ensure img_token_nums
> 0, and on failure call MLLM_ERROR_EXIT with a clear message (e.g., "Invalid
image token count") so you don't call ids.insert(img_token_nums - 1, ...) with
zero/negative counts; locate symbols grid_thw, img_token_nums, image_token_id,
ids.insert and the surrounding convertVisionMessage logic to make the change.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@mllm/backends/cpu/ops/ConvTranspose1DOp.hpp`:
- Around line 11-22: Add concise doc comments above the CPUConvTranspose1DOp
class and CPUConvTranspose1DOpFactory to document their roles and expected
tensor shapes/semantics: describe that CPUConvTranspose1DOp implements the CPU
backend of aops::ConvTranspose1DOp, list expected input tensors (e.g., input,
weight, optional bias), expected output shape behavior (how output length is
computed from stride/padding/dilation/output_padding), and any preconditions
(memory layout, dtype). Also add a brief comment for CPUConvTranspose1DOpFactory
explaining it constructs CPUConvTranspose1DOp instances from
aops::ConvTranspose1DOpOptions so readers know this factory ties the OpOptions
to the CPU implementation.
In `@mllm/core/aops/ConvTranspose1DOp.cpp`:
- Around line 17-23: Validate options_.groups before any division/modulo in
ConvTranspose1DOp::load and ConvTranspose1DOp::reshape by checking it is > 0 and
that it cleanly divides the relevant channel counts (e.g., options_.out_channels
% options_.groups == 0 and any other channel dimension used with groups is
divisible). If a check fails, return/raise a clear error (e.g., throw
std::invalid_argument or use the existing error/reporting mechanism) with a
message referencing the op name and invalid group value. Update the logic around
weight_ = weight_.view(...) in load() and the corresponding shape calculations
in reshape() to assume groups is validated so the divisions/modulo are safe.
Ensure tests or callers expecting validation get a deterministic error rather
than UB/crash.
- Around line 72-81: The code must validate computed seq_out and option
constraints before allocating the output tensor: compute seq_out as currently
done, then check that seq_out > 0 and that options_.output_padding <
options_.stride and that options_.padding >= 0, options_.dilation >= 1,
options_.kernel_size > 0 and options_.stride > 0; if any check fails,
return/raise a clear configuration error (e.g., throw std::invalid_argument or
use the project’s error/reporting helper) instead of calling
outputs.emplace_back(Tensor::empty(...)); place these checks immediately after
the seq_out calculation and before the outputs.emplace_back line in
ConvTranspose1DOp (referencing seq_out, options_.output_padding,
options_.stride, options_.padding, options_.dilation, options_.kernel_size).
In `@mllm/core/aops/ConvTranspose1DOp.hpp`:
- Around line 12-49: Add API documentation comments for the new public types:
place a brief docstring above the struct ConvTranspose1DOpOptions describing its
purpose (options for 1D transposed convolution), and document each field
(in_channels, out_channels, kernel_size, stride, padding, output_padding,
dilation, groups, bias) including units, valid ranges/constraints (e.g.,
positive ints, output_padding < max(stride, dilation), groups divides
in_channels/out_channels, kernel_size > 0), and default meanings; also add a
short doc comment above the ConvTranspose1DOp class describing its role,
lifecycle methods (load, trace, forward, reshape, setup, getParams), and
invariants about weight_ and bias_ (shapes derived from options, bias present
only if options.bias is true). Ensure comments reference the exact symbols
ConvTranspose1DOpOptions, ConvTranspose1DOp, weight_, bias_ so callers can
locate the documented behavior.
In `@mllm/core/aops/TanhOp.hpp`:
- Around line 11-30: Add API doc comments for TanhOp and TanhOpOptions: place a
brief class-level comment above struct TanhOpOptions and class TanhOp describing
the operation's purpose (element-wise hyperbolic tangent), expected inputs
(single tensor of arbitrary shape), outputs (single tensor with same shape and
dtype), and any error/shape expectations (e.g., require one input tensor,
matching output count; throw or assert on incorrect counts). Also add short
comments on the public methods load, trace, forward, reshape, setup and the
options() accessor indicating their roles (e.g., load: load parameters from
ParameterFile; forward: compute element-wise tanh; reshape/setup: validate and
set output shapes; trace: record shapes/metadata) and mention any preconditions
or postconditions (shape invariants), following existing doc style in the
codebase.
🧹 Nitpick comments (4)
mllm/backends/cpu/ops/ConvTranspose1DOp.cpp (2)
48-48: Unnecessary zero-initialization of output. The output buffer is zero-filled here, but the subsequent loop at line 86 directly assigns `output_ptr[output_idx] = sum;` rather than accumulating. Each output element is computed completely before being written, making this zero-fill redundant overhead.

♻️ Suggested fix

```diff
-  std::fill_n(output_ptr, output.numel(), 0.0f);
+  // No zero-fill needed: each output element is fully computed before assignment
```
52-90: Redundant dtype switch after assertion. The `switch (output.dtype())` at line 52 is redundant since `MLLM_RT_ASSERT_EQ(output.dtype(), kFloat32)` at line 39 already guarantees the dtype. Consider removing the switch or, if future dtype support is planned, removing the assertion.

♻️ Suggested simplification

```diff
-  switch (output.dtype()) {
-    case kFloat32:
-      MLLM_CONDITIONAL_PARALLEL_FOR(options_.getThreads() > 1, 4, idx, 0, total_iterations, 1, {
+  MLLM_CONDITIONAL_PARALLEL_FOR(options_.getThreads() > 1, 4, idx, 0, total_iterations, 1, {
     // ... loop body unchanged ...
-      });
-      break;
-    default: NYI("ConvTranspose1D: unsupported data type");
-  }
+  });
```

tests/cpu/ConvTranspose1DKernelTest.hpp (1)

114-133: Redundant variable extraction in failure path. Lines 117-128 extract all config values into local variables, but these are only used in the `print()` call. Consider simplifying by passing the config directly or just printing the failing config map.

♻️ Simplified failure handling

```diff
 bool testConvTranspose1D(const std::vector<std::unordered_map<std::string, int32_t>>& cfgs) {
   for (auto& cfg : cfgs) {
     if (!testConvTranspose1DOnce(cfg)) {
-      auto batch = cfg.at("batch");
-      auto in_channel = cfg.at("in_channel");
-      auto out_channel = cfg.at("out_channel");
-      auto sequence = cfg.at("sequence");
-      auto kernel_size = cfg.at("kernel_size");
-      auto stride = cfg.at("stride");
-      auto padding = cfg.at("padding");
-      auto output_padding = cfg.at("output_padding");
-      auto dilation = cfg.at("dilation");
-      auto groups = cfg.at("groups");
-      auto bias = cfg.at("bias");
-      print(batch, in_channel, out_channel, sequence, kernel_size, stride, padding, output_padding, dilation, groups, bias);
+      // Print failing configuration for debugging
+      for (const auto& [key, value] : cfg) {
+        print(key, "=", value);
+      }
       return false;
     }
   }
   return true;
 }
```

mllm/nn/layers/Tanh.hpp (1)
11-18: Add a brief class doc comment. Public APIs should have a short description of purpose/inputs/outputs. As per coding guidelines, please add a brief doc comment for the Tanh layer.
```cpp
class CPUConvTranspose1DOp final : public aops::ConvTranspose1DOp {
 public:
  explicit CPUConvTranspose1DOp(const aops::ConvTranspose1DOpOptions& options);

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
};

class CPUConvTranspose1DOpFactory : public TypedOpFactory<OpTypes::kConvTranspose1D, aops::ConvTranspose1DOpOptions> {
 public:
  std::shared_ptr<BaseOp> createOpImpl(const aops::ConvTranspose1DOpOptions& options) override {
    return std::make_shared<CPUConvTranspose1DOp>(options);
  }
```
Add brief docs for CPU op/factory.
Please add short comments describing the CPU op’s role and any expectations about inputs/outputs to keep backend APIs self-documenting. As per coding guidelines, public APIs should include clear docstrings/comments.
🤖 Prompt for AI Agents
In `@mllm/backends/cpu/ops/ConvTranspose1DOp.hpp` around lines 11 - 22, Add
concise doc comments above the CPUConvTranspose1DOp class and
CPUConvTranspose1DOpFactory to document their roles and expected tensor
shapes/semantics: describe that CPUConvTranspose1DOp implements the CPU backend
of aops::ConvTranspose1DOp, list expected input tensors (e.g., input, weight,
optional bias), expected output shape behavior (how output length is computed
from stride/padding/dilation/output_padding), and any preconditions (memory
layout, dtype). Also add a brief comment for CPUConvTranspose1DOpFactory
explaining it constructs CPUConvTranspose1DOp instances from
aops::ConvTranspose1DOpOptions so readers know this factory ties the OpOptions
to the CPU implementation.
```cpp
void ConvTranspose1DOp::load(const ParameterFile::ptr_t& ploader) {
  switch (ploader->version()) {
    case ModelFileVersion::kV1: {
      weight_ = ploader->pull(getName() + ".weight");
      if (options_.bias) { bias_ = ploader->pull(getName() + ".bias"); }
      weight_ = weight_.view({options_.in_channels, options_.out_channels / options_.groups, options_.kernel_size});
      if (options_.bias) { bias_ = bias_.view({options_.out_channels}); }
```
Guard against invalid groups before division/modulo.
options_.groups is used in division/modulo in both load() and reshape(). If it’s 0 (or incompatible with channels), this will crash or corrupt shapes. Add explicit validation before use. As per coding guidelines, validate inputs for public APIs.
🔧 Suggested fix

```diff
 void ConvTranspose1DOp::load(const ParameterFile::ptr_t& ploader) {
+  if (options_.groups <= 0) {
+    MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp groups must be > 0");
+  }
+  MLLM_RT_ASSERT_EQ(options_.out_channels % options_.groups, 0);
   switch (ploader->version()) {
     case ModelFileVersion::kV1: {
       weight_ = ploader->pull(getName() + ".weight");
       if (options_.bias) { bias_ = ploader->pull(getName() + ".bias"); }
       weight_ = weight_.view({options_.in_channels, options_.out_channels / options_.groups, options_.kernel_size});
       if (options_.bias) { bias_ = bias_.view({options_.out_channels}); }
       break;
     }
@@
 void ConvTranspose1DOp::reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) {
   const auto& i = inputs[0];
   const auto& ishape = i.shape();
   if (ishape.size() != 3) {
     MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp expects 3D input, got {} D", ishape.size());
     outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
     return;
   }
+  if (options_.groups <= 0) {
+    MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp groups must be > 0");
+    outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+    return;
+  }
   const int batch = ishape[0];
   const int in_channels = ishape[1];
   const int sequence = ishape[2];
   MLLM_RT_ASSERT_EQ(in_channels, options_.in_channels);
   MLLM_RT_ASSERT_EQ(in_channels % options_.groups, 0);
   MLLM_RT_ASSERT_EQ(options_.out_channels % options_.groups, 0);
```

Also applies to: 68-71
🤖 Prompt for AI Agents
In `@mllm/core/aops/ConvTranspose1DOp.cpp` around lines 17 - 23, Validate
options_.groups before any division/modulo in ConvTranspose1DOp::load and
ConvTranspose1DOp::reshape by checking it is > 0 and that it cleanly divides the
relevant channel counts (e.g., options_.out_channels % options_.groups == 0 and
any other channel dimension used with groups is divisible). If a check fails,
return/raise a clear error (e.g., throw std::invalid_argument or use the
existing error/reporting mechanism) with a message referencing the op name and
invalid group value. Update the logic around weight_ = weight_.view(...) in
load() and the corresponding shape calculations in reshape() to assume groups is
validated so the divisions/modulo are safe. Ensure tests or callers expecting
validation get a deterministic error rather than UB/crash.
```cpp
const int kernel_size = options_.kernel_size;
const int stride = options_.stride;
const int dilation = options_.dilation;
const int padding = options_.padding;
const int output_padding = options_.output_padding;

const int seq_out = (sequence - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1;

auto new_shape = std::vector<int32_t>{batch, options_.out_channels, seq_out};
outputs.emplace_back(Tensor::empty(new_shape, i.dtype(), i.device()));
```
Validate output size and output_padding constraints.
seq_out can become <= 0 for invalid parameters (e.g., negative padding/dilation, output_padding ≥ stride). Right now this creates an invalid tensor. Add explicit checks before allocation to fail fast and surface configuration errors. As per coding guidelines, validate inputs for public APIs.
🔧 Suggested fix

```diff
 const int kernel_size = options_.kernel_size;
 const int stride = options_.stride;
 const int dilation = options_.dilation;
 const int padding = options_.padding;
 const int output_padding = options_.output_padding;
+if (kernel_size <= 0 || stride <= 0 || dilation <= 0 || padding < 0 || output_padding < 0 || output_padding >= stride) {
+  MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid ConvTranspose1D options");
+  outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+  return;
+}
 const int seq_out = (sequence - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1;
+if (seq_out <= 0) {
+  MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1D produces non-positive output length");
+  outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+  return;
+}
 auto new_shape = std::vector<int32_t>{batch, options_.out_channels, seq_out};
 outputs.emplace_back(Tensor::empty(new_shape, i.dtype(), i.device()));
```

🤖 Prompt for AI Agents
In `@mllm/core/aops/ConvTranspose1DOp.cpp` around lines 72 - 81, The code must
validate computed seq_out and option constraints before allocating the output
tensor: compute seq_out as currently done, then check that seq_out > 0 and that
options_.output_padding < options_.stride and that options_.padding >= 0,
options_.dilation >= 1, options_.kernel_size > 0 and options_.stride > 0; if any
check fails, return/raise a clear configuration error (e.g., throw
std::invalid_argument or use the project’s error/reporting helper) instead of
calling outputs.emplace_back(Tensor::empty(...)); place these checks immediately
after the seq_out calculation and before the outputs.emplace_back line in
ConvTranspose1DOp (referencing seq_out, options_.output_padding,
options_.stride, options_.padding, options_.dilation, options_.kernel_size).
```cpp
struct ConvTranspose1DOpOptions : public BaseOpOptions<ConvTranspose1DOpOptions> {
  int32_t in_channels;
  int32_t out_channels;
  int32_t kernel_size;
  int32_t stride = 1;
  int32_t padding = 0;
  int32_t output_padding = 0;
  int32_t dilation = 1;
  int32_t groups = 1;
  bool bias = true;
};

class ConvTranspose1DOp : public BaseOp {
 public:
  explicit ConvTranspose1DOp(const ConvTranspose1DOpOptions& options);

  void load(const ParameterFile::ptr_t& ploader) override;

  void trace(void* trace_context, const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  ParameterFile::ptr_t getParams() override;

  inline Tensor& weight() { return weight_; }

  inline Tensor& bias() { return bias_; }

  inline ConvTranspose1DOpOptions& options() { return options_; }

 protected:
  Tensor weight_;
  Tensor bias_;
  ConvTranspose1DOpOptions options_;
```
Add API docs for ConvTranspose1D options and op.
These are new public types; please add brief comments covering purpose, parameter meaning (channels/stride/padding/output_padding/groups/bias), and invariants so callers know expected constraints. As per coding guidelines, public APIs should include clear docstrings/comments.
🤖 Prompt for AI Agents
In `@mllm/core/aops/ConvTranspose1DOp.hpp` around lines 12 - 49, Add API
documentation comments for the new public types: place a brief docstring above
the struct ConvTranspose1DOpOptions describing its purpose (options for 1D
transposed convolution), and document each field (in_channels, out_channels,
kernel_size, stride, padding, output_padding, dilation, groups, bias) including
units, valid ranges/constraints (e.g., positive ints, output_padding <
max(stride, dilation), groups divides in_channels/out_channels, kernel_size >
0), and default meanings; also add a short doc comment above the
ConvTranspose1DOp class describing its role, lifecycle methods (load, trace,
forward, reshape, setup, getParams), and invariants about weight_ and bias_
(shapes derived from options, bias present only if options.bias is true). Ensure
comments reference the exact symbols ConvTranspose1DOpOptions,
ConvTranspose1DOp, weight_, bias_ so callers can locate the documented behavior.
```cpp
struct TanhOpOptions : public BaseOpOptions<TanhOpOptions> {};

class TanhOp : public BaseOp {
 public:
  explicit TanhOp(const TanhOpOptions& options);

  void load(const ParameterFile::ptr_t& ploader) override;

  void trace(void* trace_context, const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  inline TanhOpOptions& options() { return options_; }

 protected:
  TanhOpOptions options_;
```
Add API docs for TanhOp.
Please add brief doc comments describing purpose, inputs/outputs, and any error/shape expectations. As per coding guidelines, public APIs should include clear docstrings/comments.
🤖 Prompt for AI Agents
In `@mllm/core/aops/TanhOp.hpp` around lines 11 - 30, Add API doc comments for
TanhOp and TanhOpOptions: place a brief class-level comment above struct
TanhOpOptions and class TanhOp describing the operation's purpose (element-wise
hyperbolic tangent), expected inputs (single tensor of arbitrary shape), outputs
(single tensor with same shape and dtype), and any error/shape expectations
(e.g., require one input tensor, matching output count; throw or assert on
incorrect counts). Also add short comments on the public methods load, trace,
forward, reshape, setup and the options() accessor indicating their roles (e.g.,
load: load parameters from ParameterFile; forward: compute element-wise tanh;
reshape/setup: validate and set output shapes; trace: record shapes/metadata)
and mention any preconditions or postconditions (shape invariants), following
existing doc style in the codebase.
Actionable comments posted: 13
♻️ Duplicate comments (3)
examples/qwen2_5omni/image_infer.cpp (2)
24-29: ⚠️ Potential issue | 🟡 Minor

Silent fallback to v1 on unknown `--model_version` value. Unknown values are silently treated as v1, which could load models with the wrong format. Add validation.

🐛 Proposed fix

```diff
 mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
 if (model_version.get() == "v1") {
   file_version = mllm::ModelFileVersion::kV1;
 } else if (model_version.get() == "v2") {
   file_version = mllm::ModelFileVersion::kV2;
+} else {
+  MLLM_ERROR_EXIT(mllm::ExitCode::kCoreError, "Unsupported --model_version: '{}'. Use 'v1' or 'v2'.", model_version.get());
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen2_5omni/image_infer.cpp` around lines 24 - 29, The current logic silently treats unknown model_version values as mllm::ModelFileVersion::kV1; update the validation around model_version.get() so only "v1" and "v2" are accepted: if the value is "v1" or "v2" set file_version accordingly (mllm::ModelFileVersion::kV1/kV2), otherwise log an explicit error mentioning the invalid model_version and terminate (or return a non-zero code) so we don't accidentally load the wrong format; reference the variables model_version and file_version and the enum mllm::ModelFileVersion in your change.
52-59: ⚠️ Potential issue | 🟠 Major

Hardcoded paths and commented-out `std::getline()` make this example non-interactive. The user-input prompts are printed but then ignored — `image_path` and `prompt_text` are hardcoded. Uncomment the `std::getline` calls (or make the hardcoded values CLI arguments) before merging.

🐛 Proposed fix

```diff
   fmt::print("Image path (or 'exit/quit'): ");
-  image_path = "../../rsc/pics.jpg";
-  //std::getline(std::cin, image_path);
+  std::getline(std::cin, image_path);
   if (image_path == "exit" || image_path == "quit") { return 0; }
   fmt::print("Prompt text: ");
-  prompt_text = "描述图片中物体";
-  //std::getline(std::cin, prompt_text);
+  std::getline(std::cin, prompt_text);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen2_5omni/image_infer.cpp` around lines 52 - 59, The example currently prints prompts but ignores user input because image_path and prompt_text are hardcoded; revert to interactive mode by uncommenting the std::getline(std::cin, image_path) and std::getline(std::cin, prompt_text) calls (or alternatively parse them from CLI args) so that the printed prompts actually collect user input; ensure the check using image_path == "exit" || image_path == "quit" still runs after reading input and that prompt_text is used when calling the inference routines in this file.

mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (1)

**1014-1030:** ⚠️ Potential issue | 🔴 Critical

**Missing fallback for unsupported dtype in attention computation.**

If `key_states.dtype()` is neither `kFloat32` nor `kFloat16`, `attn` remains uninitialized and `nn::functional::matmul(attn, value_states)` on line 1026 invokes undefined behavior.

🐛 Proposed fix

```diff
   } else if (key_states.dtype() == kFloat16) {
     attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
     attn = mask_(attn);
     attn = softmax_(attn);
     attn = attn.to(kFloat16);
+  } else {
+    MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unsupported dtype for attention computation.");
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 1014 - 1030, The attention path lacks a fallback when key_states.dtype() is neither kFloat32 nor kFloat16, leaving attn uninitialized and causing undefined behavior in nn::functional::matmul(attn, value_states); add an else branch in the attention block that handles other dtypes by converting query_states and key_states to kFloat32, computing attn the same way (mask_ and softmax_ applied) and then converting back if needed, or alternatively throw a clear runtime_error; update the code around key_states, query_states, attn, head_dim_, mask_, softmax_, and the subsequent matmul/output steps so attn is always initialized before use.
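The uninitialized-`attn` hazard above comes down to a non-exhaustive dtype dispatch. The defensive pattern the fix calls for can be sketched independently of the mllm API — the enum and function below are illustrative stand-ins, not the project's actual types:

```cpp
#include <stdexcept>
#include <string>

// Illustrative stand-in for mllm's dtype enum.
enum class DType { kFloat32, kFloat16, kInt8 };

// Dispatch that can never fall through with state left uninitialized:
// every dtype either selects a compute path or raises a clear error.
std::string attentionPathFor(DType dt) {
  switch (dt) {
    case DType::kFloat32: return "fp32";
    case DType::kFloat16: return "fp16-upcast";  // compute in fp32, cast back
    default:
      throw std::runtime_error("Unsupported dtype for attention computation.");
  }
}
```

The point is the `default` branch: unknown dtypes fail loudly at the dispatch site instead of propagating garbage through the matmul.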
🧹 Nitpick comments (15)
mllm/backends/cpu/ops/EmbeddingOp.cpp (1)
**29-29:** `static` warning flag is process-wide — all embedding instances share a single suppression flag.

`static std::atomic<bool>` lives for the entire process lifetime and is shared across every `CPUEmbeddingOp` instance. In a multimodal model like Qwen2.5Omni, there are multiple independent embedding layers (text, audio, vision). Once any one of them fires an OOB, all subsequent OOB events in every other embedding layer are silently zero-filled with no diagnostic output. Moving the flag to an instance member variable gives each embedding layer its own one-time warning, preserving the "no spam" intent while retaining full observability per layer.

♻️ Proposed refactor: instance-level warning flag

In `EmbeddingOp.hpp` (or wherever `CPUEmbeddingOp` is declared), add the member:

```diff
+  std::atomic<bool> warned_token_oob_{false};
```

Then in `EmbeddingOp.cpp`, replace the static local:

```diff
-  static std::atomic<bool> warned_token_oob{false};
```

and update the CAS site:

```diff
-  if (warned_token_oob.compare_exchange_strong(expected, true)) {
+  if (warned_token_oob_.compare_exchange_strong(expected, true)) {
```

As per coding guidelines: "Add appropriate logging (e.g., debug, info, warning, error) for significant events and errors" — a shared static flag can suppress warnings from distinct layers, reducing the signal quality of those logs.
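The per-instance behavior the refactor aims for can be demonstrated in isolation; the class and method names below are illustrative, not the actual `CPUEmbeddingOp` interface:

```cpp
#include <atomic>
#include <cstdio>

// Sketch of a per-instance one-shot warning flag, mirroring the suggested
// warned_token_oob_ member.
class EmbeddingLayer {
 public:
  // Returns true only the first time an OOB event is reported on *this* layer.
  bool reportOutOfBounds() {
    bool expected = false;
    if (warned_token_oob_.compare_exchange_strong(expected, true)) {
      std::fprintf(stderr, "warning: token id out of bounds (reported once per layer)\n");
      return true;
    }
    return false;  // already warned on this instance; stay quiet
  }

 private:
  std::atomic<bool> warned_token_oob_{false};
};
```

With the member flag, a warning fired by the text embedding no longer silences the audio and vision embeddings.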
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/backends/cpu/ops/EmbeddingOp.cpp` at line 29, The process-wide static warned_token_oob suppresses OOB warnings across all embedding layers; make it an instance member on CPUEmbeddingOp (e.g., add std::atomic<bool> warned_token_oob_{false}; to the CPUEmbeddingOp declaration in EmbeddingOp.hpp) and update uses in EmbeddingOp.cpp to reference the instance field instead of the static local (replace the static std::atomic<bool> warned_token_oob and the CAS site to use this->warned_token_oob_ or equivalent), so each embedding instance emits its own one-time OOB warning.

mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py (2)
**112-136:** Each checkpoint is loaded twice — doubles I/O for large models

`_collect_source_meta` loads the full checkpoint to gather metadata and then discards the state dict. `_stream_source_tensors` reloads the same file from disk during the write pass. For multi-GB model files this can double overall runtime.

A straightforward mitigation is to return the state dict from `_collect_source_meta` (or accept it as an optional parameter) and pass it directly to `_stream_source_tensors`, trading memory for I/O only when both checkpoints can fit in RAM. Alternatively, a single-pass approach can collect metadata lazily the first time each tensor is encountered during streaming.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py` around lines 112 - 136, _collect_source_meta currently loads the entire checkpoint then _stream_source_tensors reloads it, doubling I/O for large files; modify _collect_source_meta to optionally return the loaded state dict (or accept a preloaded state dict param) and update its callers to pass that state into _stream_source_tensors so the file is only read once, or alternatively refactor to a single-pass lazy stream where _stream_source_tensors queries metadata on first access; update the function signatures for _collect_source_meta and _stream_source_tensors and all call sites to propagate the in-memory state dict when available and ensure gc.collect() only runs when the state is truly discarded.
**266-268:** Pre-allocation creates a large in-memory zero buffer

`b"\x00" * reserved_bytes` allocates up to several MB on the Python heap (532 + `max_params` × 352 bytes). A seek-then-write approach avoids this allocation and produces a sparse file on most filesystems:

♻️ Proposed refactor

```diff
-        reserved_bytes = MODEL_FILE_V2_DESC_SIZE + max_params * MODEL_FILE_V2_PARAM_DESC_SIZE
-        self.f.write(b"\x00" * reserved_bytes)
-        self.f.flush()
+        reserved_bytes = MODEL_FILE_V2_DESC_SIZE + max_params * MODEL_FILE_V2_PARAM_DESC_SIZE
+        self.f.seek(reserved_bytes - 1, os.SEEK_SET)
+        self.f.write(b"\x00")
+        self.f.flush()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py` around lines 266 - 268, The current code builds a large in-memory zero buffer via b"\x00" * reserved_bytes and writes it (reserved_bytes is computed from MODEL_FILE_V2_DESC_SIZE and MODEL_FILE_V2_PARAM_DESC_SIZE), which can allocate megabytes; instead, replace the allocation+write with a seek-based sparse-file creation: use self.f.seek(reserved_bytes - 1, os.SEEK_SET) and then write a single b"\x00" and flush — this avoids allocating the full zero buffer and produces a sparse file while preserving the reserved region semantics used around reserved_bytes, MODEL_FILE_V2_DESC_SIZE, and MODEL_FILE_V2_PARAM_DESC_SIZE.

mllm/models/minicpm_o45/modeling_minicpm_o45.hpp (1)
**150-187:** Duplicated weight-norm materialization — `token2wav_weight_norm.hpp` already provides `_materializeWeightNorm`.

`MiniCPMO45TTS::loadFromParameter` reimplements the same per-row normalization logic as `_materializeWeightNorm` in `token2wav_weight_norm.hpp`. Additionally, the epsilon handling differs slightly: here it's `if (norm < kEps) { norm = kEps; }` (clamps the norm), while the other uses `std::sqrt(std::max(norm_sq, kEps))` (clamps before sqrt). This inconsistency could cause subtle numerical differences.

Consider reusing `_materializeWeightNorm` or extracting a shared helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/modeling_minicpm_o45.hpp` around lines 150 - 187, MiniCPMO45TTS::loadFromParameter duplicates per-row weight-norm materialization already implemented as _materializeWeightNorm in token2wav_weight_norm.hpp; replace the manual loop that builds head_code_weight_ with a call to the shared _materializeWeightNorm (or factor out a small helper used by both places) so normalization and epsilon handling are identical, i.e., consume g and v tensors, produce the per-row scaled weight Tensor and push into head_code_weight_, ensuring the same kEps/clamping semantics as _materializeWeightNorm.

mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp (3)
**1340-1351:** `std::pow(2, i)` for integer powers — prefer bit shifts for exact results.

`std::pow` returns `double` and is subject to floating-point precision issues. For computing powers of 2, bit shifts are exact and faster.

Proposed fix (example for line 1347-1348)

```diff
-    const int32_t in_ch = cfg.base_channels / static_cast<int32_t>(std::pow(2, i));
-    const int32_t out_ch = cfg.base_channels / static_cast<int32_t>(std::pow(2, i + 1));
+    const int32_t in_ch = cfg.base_channels >> i;
+    const int32_t out_ch = cfg.base_channels >> (i + 1);
```

Apply the same pattern to the other `std::pow(2, ...)` usages in lines 1359, 1372, and 1379.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp` around lines 1340 - 1351, The use of std::pow(2, i) in HiFTGenerator constructor (e.g., computing in_ch and out_ch) should be replaced with integer bit-shifts to avoid floating-point imprecision and improve performance: compute powers of two as (1 << i) (and (1 << (i+1))) and ensure the result is cast to int32_t where needed; apply the same replacement for the other occurrences referenced in the diff (the uses at the spots corresponding to lines around 1359, 1372, and 1379) so all exponent-of-two calculations use bit shifts rather than std::pow.
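The shift form of the division above is exact integer arithmetic, while `std::pow` rounds through `double`. A minimal sketch of the replacement (`channelsAt` is a hypothetical helper name for illustration):

```cpp
#include <cstdint>

// base_channels / 2^i computed with an exact right shift (valid for
// non-negative i smaller than the bit width of int32_t).
int32_t channelsAt(int32_t base_channels, int32_t i) {
  return base_channels >> i;
}
```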
**1207-1213:** New RNG created on every `f02sine` call — inconsistent with the `thread_local` pattern used elsewhere.

`randomNormalLike` (line 341) and `SourceModuleHnNSF2::randomLike` (line 1251) both use `static thread_local std::mt19937` for stable, efficient random generation. Here, a new `std::mt19937` is constructed and seeded from `std::random_device` on every call, which is both slower and may degrade randomness quality on platforms with weak `random_device` entropy.

Proposed fix

```diff
-  std::mt19937 rng(std::random_device{}());
+  static thread_local std::mt19937 rng(std::random_device{}());
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp` around lines 1207 - 1213, Replace the per-call std::mt19937 construction in f02sine with a persistent thread-local RNG to match the pattern used by randomNormalLike and SourceModuleHnNSF2::randomLike: declare static thread_local std::mt19937 rng seeded once (e.g., from std::random_device() at initialization) and reuse it for the std::uniform_real_distribution<float> uni in the loop that sets rp[(b*T + 0)*H + h] += phase0; to avoid reseeding on every call and ensure consistent, efficient randomness.
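A self-contained sketch of the `static thread_local` pattern the fix proposes — the engine is seeded once per thread and reused, instead of paying for a `std::random_device` seed on every call (the function names here are illustrative):

```cpp
#include <random>

// One engine per thread, seeded once on first use.
std::mt19937& threadLocalRng() {
  static thread_local std::mt19937 rng(std::random_device{}());
  return rng;
}

// Mirrors f02sine's per-call draw: a uniform phase in [0, 2*pi).
float randomPhase() {
  std::uniform_real_distribution<float> uni(0.0f, 6.2831853f);
  return uni(threadLocalRng());
}
```

Distribution objects are cheap to construct per call; it is the engine (and its seeding) that should persist.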
**1-1522:** ~1500 lines in a single header — consider splitting into separate files.

This header contains the full token2wav pipeline: utility functions, Conformer encoder, DiT estimator, CFM decoder, HiFT vocoder, and the top-level model. Splitting into focused headers (e.g., `token2wav_utils.hpp`, `token2wav_flow.hpp`, `token2wav_hift.hpp`) would improve maintainability and compile times.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp` around lines 1 - 1522, The header is too large and should be split into focused headers to improve maintainability and compile times: extract low-level helpers and pure functions (e.g., repeatInterleaveSeq, concatInt64Seq, l2NormalizeRow, tensorMish/LeakyRelu, makeHannWindow, relShift, addHeadBiasInplace, randomNormalLike, SineGen2 helpers) into token2wav_utils.hpp; move the flow/encoder/DiT/CFM-related classes (EspnetRelPositionalEncoding, LinearNoSubsampling, ConformerEncoderLayer, UpsampleConformerEncoderV2, DiTAttention, DiTBlock, DiTEstimator, CausalConditionalCFM, CausalMaskedDiffWithXvec, MiniCPMO45FlowConfig and usages) into token2wav_flow.hpp; and move HiFT/vocoder classes (ResBlock, SnakeActivation, ConvRNNF0Predictor, SourceModuleHnNSF2, HiFTGenerator and MiniCPMO45HiFTConfig) into token2wav_hift.hpp; update MiniCPMO45Token2WavModel to include these headers and adjust include guards/namespaces and any cross-file forward declarations (e.g., ensure SineGen2 and SourceModuleHnNSF2 visibility for HiFTGenerator, and materializeWeightNormParameters/loadFromParameter usages remain accessible). Ensure each new header keeps only the symbols it needs and include the appropriate original dependencies (nn headers, Tensor utilities) and update the single original header to include the three new headers and re-export MiniCPMO45Token2WavModel.

mllm/models/minicpm_o45/export_prompt_cache.py (1)
**189-196:** Bare `except Exception` swallows the original error — consider logging it.

When quantization fails on the primary device, the fallback to CPU is a good pattern, but the original exception is silently discarded. This makes debugging difficult if the CPU fallback also fails for a related reason.

Proposed fix

```diff
     try:
         prompt_tokens, prompt_tokens_lens = audio_tokenizer.quantize(mels.to(quantize_device), mels_lens.to(quantize_device))
-    except Exception:
+    except Exception as e:
+        logger.log(f"Quantize failed on {quantize_device}, falling back to CPU: {e}")
         quantize_device = torch.device("cpu")
         audio_tokenizer = _move_model(audio_tokenizer, quantize_device)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/export_prompt_cache.py` around lines 189 - 196, The try/except around audio_tokenizer.quantize currently swallows the original Exception; change it to catch Exception as e and log or attach e before falling back to CPU so the original error is preserved for debugging: when calling audio_tokenizer.quantize(...) catch Exception as e, record the exception (e) via logging or processLogger (or re-raise with context) before calling _move_model(audio_tokenizer, torch.device("cpu")) and audio_tokenizer.eval(), then retry audio_tokenizer.quantize(...) on the cpu quantize_device.

mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp (1)
**396-400:** Silent `catch (...)` swallows JSON parse errors.

If the tokenizer JSON is malformed, this fails silently without any diagnostic output. Consider at minimum logging a warning so users aren't confused by missing special tokens.
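The replacement pattern — catch `std::exception` rather than `(...)` and surface `e.what()` — in standalone form; `tryParse` and the injected parser are hypothetical stand-ins for the `nlohmann::json::parse` call site:

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Returns false after logging a diagnostic, instead of failing silently.
bool tryParse(const std::string& input, void (*parse_fn)(const std::string&)) {
  try {
    parse_fn(input);
    return true;
  } catch (const std::exception& e) {
    std::cerr << "warning: failed to parse tokenizer JSON, "
                 "special tokens may be missing: " << e.what() << '\n';
    return false;
  }
}
```

Catching `const std::exception&` still covers nlohmann's parse errors (they derive from `std::exception`) while preserving the message that `catch (...)` throws away.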
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp` around lines 396 - 400, Replace the silent catch(...) around nlohmann::json::parse(in) so parse failures aren’t swallowed: catch a std::exception (e.g. catch (const std::exception& e)) and emit a diagnostic (std::cerr or the module logger) that includes the exception.what() and context (e.g. "failed to parse tokenizer JSON, special tokens may be missing"), then return; update the code around json_data / nlohmann::json::parse to use this new catch so users see a warning when the JSON is malformed.

mllm/models/minicpm_o45/token2wav_weight_norm.hpp (1)
**61-68:** `key_g` is always equal to `key` — minor redundancy.

On line 66, `key_g` is reconstructed as `prefix + ".parametrizations.weight.original0"`, but `key` is already that exact string (it matched the `_endsWith(key, marker)` check and `marker == ".parametrizations.weight.original0"`). This is functionally correct and arguably clearer for readability, so just noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/minicpm_o45/token2wav_weight_norm.hpp` around lines 61 - 68, The variable key_g is redundant because within the loop when _endsWith(key, marker) is true (marker == ".parametrizations.weight.original0"), key already equals prefix + marker; update the loop in token2wav_weight_norm.hpp to either use key directly instead of key_g or set key_g = key (remove reconstructing the same string) and remove any unnecessary allocations; adjust references to key_g accordingly (look for key_g, key_v, key_w and the loop over keys) to keep semantics identical but eliminate the redundant string build.

examples/minicpm_o45/main.cpp (2)
**133-134:** Misleading variable name `wav_i16` — still holds `float` data.

`wav_i16` suggests int16 samples, but it's the result of `wav * 32767.0f`, which is still a float tensor. The `WavWriter` receives `wav_i16.ptr<float>()`. Consider renaming to `wav_scaled` or similar to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/minicpm_o45/main.cpp` around lines 133 - 134, Rename the misleading variable wav_i16 which still contains float data (result of wav * 32767.0f) to something like wav_scaled or wav_scaled_f to avoid implying int16; update its declaration and all uses (the multiplication expression and the pointer passed into wenet::WavWriter) so that wav_scaled.ptr<float>() is passed and the symbol names (wav, wav_i16, WavWriter) match the new identifier.
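For contrast with the float-only scaling above, an actual int16 conversion would look something like the sketch below — clamping before the narrowing cast. `floatToInt16` is a hypothetical helper for illustration; the example program instead keeps the scaled data as float:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Scale a normalized float sample to int16 range, clamping so that
// out-of-range inputs cannot overflow the narrowing cast.
int16_t floatToInt16(float sample) {
  float scaled = sample * 32767.0f;
  scaled = std::clamp(scaled, -32768.0f, 32767.0f);
  return static_cast<int16_t>(std::lrintf(scaled));
}
```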
**278-294:** Duplicated token2wav inference block.

Lines 278–294 are nearly identical to lines 123–137 (the `--tts_tokens_in` path). Extract a helper function to avoid this duplication and reduce future maintenance burden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/minicpm_o45/main.cpp` around lines 278 - 294, The token2wav inference block is duplicated; extract a helper (e.g., runToken2Wav) that encapsulates loading the token2wav parameter via mllm::load, loading the prompt cache via mllm::models::minicpm_o45::loadMiniCPMO45Token2WavPromptCache, constructing mllm::models::minicpm_o45::MiniCPMO45Token2WavModel, calling loadFromParameter, running infer with token_ids and tts_token2wav_n_timesteps, converting to int16 and writing via wenet::WavWriter, and printing the saved path; replace both duplicated blocks (the `--tts_tokens_in` path and the `tts_wav_out` path) with calls to this helper to remove duplication and centralize logic.

examples/qwen2_5omni/CMakeLists.txt (1)
**17-19:** Consider whether the debug executable should be included in the default build.

`mllm-qwen2_5-omni-image-runner-dbg` is a debug artifact. If it's only for development, consider gating it behind a CMake option (e.g., `if(MLLM_BUILD_DEBUG_EXAMPLES)`) to keep the default build clean.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen2_5omni/CMakeLists.txt` around lines 17 - 19, The CMakeLists currently always adds the debug executable mllm-qwen2_5-omni-image-runner-dbg; gate its creation behind a CMake option so the debug artifact is not built by default. Add a boolean option (e.g., MLLM_BUILD_DEBUG_EXAMPLES) and wrap the add_executable/target_link_libraries/target_include_directories for mllm-qwen2_5-omni-image-runner-dbg inside an if(MLLM_BUILD_DEBUG_EXAMPLES) ... endif() block so the target (and its links to MllmRT and MllmCPUBackend and include dir ${MLLM_INCLUDE_DIR}) is only created when the option is enabled.

mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (2)
**1641-1646:** RNG engine re-created on every sampling call.

`std::random_device` + `std::mt19937` are instantiated fresh in `sampleFromDistribution` and `categoricalSample` (line 1656). In a tight generation loop, this is wasteful and may yield poor randomness on platforms where `std::random_device` is deterministic. Consider making the RNG a class member initialized once.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 1641 - 1646, sampleFromDistribution and categoricalSample recreate std::random_device and std::mt19937 on every call which is wasteful and can produce poor randomness; make a single RNG instance a persistent member (e.g., add a std::mt19937 rng; to the model class and initialize it once in the constructor, seeding from std::random_device there), then update sampleFromDistribution and categoricalSample to use that class member rng instead of constructing local rd/gen objects and construct std::discrete_distribution using the provided probs as before.
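The member-RNG refactor suggested above, in minimal form — seed once in the constructor, reuse the engine across draws. `Sampler` is an illustrative name, not the model's actual class:

```cpp
#include <cstddef>
#include <random>
#include <vector>

class Sampler {
 public:
  // Seed exactly once; every subsequent sample reuses the same engine.
  Sampler() : rng_(std::random_device{}()) {}

  // Categorical draw over non-negative weights (need not sum to 1).
  std::size_t categoricalSample(const std::vector<double>& probs) {
    std::discrete_distribution<std::size_t> dist(probs.begin(), probs.end());
    return dist(rng_);
  }

 private:
  std::mt19937 rng_;  // constructed once, not per call
};
```

`std::discrete_distribution` normalizes the weights internally, so the caller can pass raw probabilities or unnormalized scores.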
**421-450:** Public classes and key utility functions lack docstrings.

Per coding guidelines, public APIs should have comments explaining purpose, parameters, and returns. Classes like `Qwen2_5OmniPatchEmbed`, `Qwen2_5OmniVisionEncoder`, `Qwen2_5OmniAudioEncoder`, `Qwen2_5OmniForCausalLM`, `Qwen2_5OmniForConditionalGeneration`, and utility functions like `makeMultimodalPositionEmbedding`, `makeWindowIndex`, etc., would benefit from brief doc comments describing their role in the pipeline and expected input/output shapes. As per coding guidelines: "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 421 - 450, Add concise doc comments above each public class/function: Qwen2_5OmniPatchEmbed, Qwen2_5OmniVisionEncoder, Qwen2_5OmniAudioEncoder, Qwen2_5OmniForCausalLM, Qwen2_5OmniForConditionalGeneration, and utility functions like makeMultimodalPositionEmbedding and makeWindowIndex. For each comment include a one-line purpose, constructor/parameter descriptions (e.g., cfg fields used: visual_in_chans, visual_hidden_size, patch sizes), expected input and output tensor shapes (e.g., input tensor shape, view/reshape results, output embed dim), return values, and any error/edge conditions; place them immediately above the class or function definition using the project’s C++ comment style (Doxygen or ///) and keep them short and focused.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/minicpm_o45/main.cpp`:
- Around line 111-119: The token-parsing inner loop can hang when a non-numeric
token sets failbit and never advances the stream; replace the while(!ss.eof())
pattern with a robust parse that advances on each iteration — for example, use a
loop that extracts into an integer (while (ss >> token)) to terminate on parse
failure, or read into a temporary string (e.g., tokenStr) and use std::stoll in
a try/catch to skip malformed tokens and continue; update the code using the
existing variables ss, token, and token_ids so malformed entries no longer cause
an infinite loop.
In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 22-35: The help flag is unreachable because Argparse::parse
enforces required args before returning; fix by detecting help manually from
argv before calling Argparse::parse(argc, argv): scan argc/argv for "--help" or
"-h" and if present call Argparse::printHelp(), mllm::shutdownContext(), and
return 0 immediately, then only call Argparse::parse(...) afterward; this
preserves the existing help.isSet()/printHelp() usage and avoids triggering the
required-args check in Argparse::parse.
In `@mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py`:
- Around line 76-78: The try/except around torch.load that sets obj (obj =
torch.load(path, map_location="cpu", weights_only=True) and falls back to obj =
torch.load(path, map_location="cpu") silently is insecure; modify the except
TypeError block in convert_token2wav_pt_to_mllm.py to emit a warnings.warn (or
logging.warning) explaining that weights_only is unsupported on this PyTorch
build and that the fallback will deserialize arbitrary Python objects, then
proceed only after the warning; reference the existing symbols obj, path, and
the torch.load calls so you add the warning immediately before the fallback load
and keep the behavior otherwise.
In `@mllm/models/minicpm_o45/export_prompt_cache.py`:
- Around line 219-225: The code only pads prompt_mels when target_len >
prompt_mels.shape[1] but doesn't truncate when target_len <
prompt_mels.shape[1], which can leave extra mel frames and cause shape
mismatches; update the block around target_len, prompt_tokens, prompt_mels and
up_rate to handle both cases by slicing/truncating prompt_mels to target_len
(e.g., keep first target_len frames along dim 1) when prompt_mels.shape[1] >
target_len, and keep the existing torch.nn.functional.pad logic for the > case
so prompt_mels always ends up with length equal to target_len.
In `@mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp`:
- Around line 1076-1086: The ConvRNNF0Predictor constructor is implicitly
convertible from std::string; mark the constructor explicit by adding the
explicit keyword to the ConvRNNF0Predictor(const std::string& name, int32_t
in_channels = 80, int32_t cond_channels = 512) declaration to prevent unintended
implicit conversions while leaving defaults and body unchanged (identify the
constructor in class ConvRNNF0Predictor and update its signature).
In `@mllm/models/minicpm_o45/modeling_minicpm_o45.hpp`:
- Around line 490-500: The index arithmetic using int32_t (base = b * S * D + s
* D) can overflow for large tensors; change the index calculation and any offset
variables to int64_t (e.g., declare base as int64_t and compute using int64_t
temporaries) and use that 64-bit offset when indexing ptr inside the loops
(while keeping loop counters b, s, d as int32_t if desired). Update references
in this normalization block (kEps, b, s, d, base, ptr, B, S, D) so all
multiplications/additions promoting to int64_t to avoid overflow. Ensure final
division ptr[base + d] uses the 64-bit index or is safely cast for pointer
arithmetic.
- Line 849: The creation of position_ids uses Tensor::empty({1, seq_len},
kInt64) without an explicit device; update the call that initializes the
variable position_ids (where Tensor::empty is invoked with seq_len and kInt64)
to include the device parameter , kCPU so it matches other Tensor::empty calls
in this file and maintains consistent device specification.
In `@mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp`:
- Around line 337-344: The loop in tokenization_minicpm_o45.hpp uses int32_t pos
to track positions in applied_string which can overflow; change the type of pos
to size_t and update the cast when updating pos (replace
static_cast<int32_t>(found + tag.size()) with static_cast<size_t>(found +
tag.size()) or simply rely on size_t arithmetic) inside the loop that iterates
over image_tags and uses applied_string.find to build text_chunks so string
positions use the correct unsigned size type.
- Around line 472-483: The code uses float_t which can vary by platform; replace
uses of float_t with explicit float to match the kFloat32 tensor allocation:
change pixel_values.ptr<float_t>() to pixel_values.ptr<float>(), src_ptr to
img_tensors[b].ptr<float>(), and use sizeof(float) in the std::memset and
std::memcpy length calculations (ensure all occurrences inside the block that
use float_t are updated to float and sizeof(float) so no size mismatch or buffer
overrun occurs for pixel_values_ptr, src_ptr, dst_offset calculations).
In `@mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp`:
- Around line 218-231: The code reads thinker_cfg["text_config"] and then
accesses fields via text_cfg[...] without guarding for absence, which will throw
nlohmann::json::out_of_range if "text_config" is missing; update the block that
defines thinker_cfg/text_cfg to first check thinker_cfg.contains("text_config")
and either throw a clear runtime_error (or log and return) mentioning the
missing "thinker_config.text_config" key, or use safe accessors like
text_cfg.value("hidden_size", default) for each field (hidden_size,
intermediate_size, num_attention_heads, num_key_value_heads, num_hidden_layers,
max_position_embeddings, rms_norm_eps, vocab_size, rope_theta,
tie_word_embeddings) so the code in configuration_qwen2_5omni.hpp handles a
missing "text_config" deterministically.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 397-403: The computation in makeAudioSinusoidalPosEmb can divide
by zero when channels==2 because half = channels/2 -> 1 and you compute
log_timescale_increment = std::log(max_timescale)/static_cast<float>(half - 1);
change the logic to guard when half <= 1: if half > 1 compute
log_timescale_increment as before, otherwise set log_timescale_increment = 0
(and subsequently handle inv_timescales/scale computation accordingly so you
don't divide by zero or produce NaNs). Update references in
makeAudioSinusoidalPosEmb to use this conditional denominator and ensure the
generated pos_emb for the half==1 case produces a valid (e.g., zero or constant)
embedding rather than crashing.
- Around line 2026-2030: The code currently unconditionally pops the last
element of generated_codes which removes a valid token when generation stops due
to max_new_tokens; change it to only pop when the last token is an EOS token (or
when the talker signaled it ended on EOS). Locate where generated_codes is
modified and returned (around generated_codes, makeTokenTensor and the loop that
calls the talker), and replace the unconditional pop_back() with a check like:
if (!generated_codes.empty() && generated_codes.back() ==
<EOS_ID_or_codec.eos_token>) { generated_codes.pop_back(); } or use the loop’s
ended_on_eos flag if available to decide whether to remove the trailing EOS
before calling makeTokenTensor; leave the MLLM_ERROR_EXIT path intact for empty
results.
---
Duplicate comments:
In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 24-29: The current logic silently treats unknown model_version
values as mllm::ModelFileVersion::kV1; update the validation around
model_version.get() so only "v1" and "v2" are accepted: if the value is "v1" or
"v2" set file_version accordingly (mllm::ModelFileVersion::kV1/kV2), otherwise
log an explicit error mentioning the invalid model_version and terminate (or
return a non-zero code) so we don't accidentally load the wrong format;
reference the variables model_version and file_version and the enum
mllm::ModelFileVersion in your change.
- Around line 52-59: The example currently prints prompts but ignores user input
because image_path and prompt_text are hardcoded; revert to interactive mode by
uncommenting the std::getline(std::cin, image_path) and std::getline(std::cin,
prompt_text) calls (or alternatively parse them from CLI args) so that the
printed prompts actually collect user input; ensure the check using image_path
== "exit" || image_path == "quit" still runs after reading input and that
prompt_text is used when calling the inference routines in this file.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 1014-1030: The attention path lacks a fallback when
key_states.dtype() is neither kFloat32 nor kFloat16, leaving attn uninitialized
and causing undefined behavior in nn::functional::matmul(attn, value_states);
add an else branch in the attention block that handles other dtypes by
converting query_states and key_states to kFloat32, computing attn the same way
(mask_ and softmax_ applied) and then converting back if needed, or
alternatively throw a clear runtime_error; update the code around key_states,
query_states, attn, head_dim_, mask_, softmax_, and the subsequent matmul/output
steps so attn is always initialized before use.
---
Nitpick comments:
In `@examples/minicpm_o45/main.cpp`:
- Around line 133-134: Rename the misleading variable wav_i16 which still
contains float data (result of wav * 32767.0f) to something like wav_scaled or
wav_scaled_f to avoid implying int16; update its declaration and all uses (the
multiplication expression and the pointer passed into wenet::WavWriter) so that
wav_scaled.ptr<float>() is passed and the symbol names (wav, wav_i16, WavWriter)
match the new identifier.
- Around line 278-294: The token2wav inference block is duplicated; extract a
helper (e.g., runToken2Wav) that encapsulates loading the token2wav parameter
via mllm::load, loading the prompt cache via
mllm::models::minicpm_o45::loadMiniCPMO45Token2WavPromptCache, constructing
mllm::models::minicpm_o45::MiniCPMO45Token2WavModel, calling loadFromParameter,
running infer with token_ids and tts_token2wav_n_timesteps, converting to int16
and writing via wenet::WavWriter, and printing the saved path; replace both
duplicated blocks (the `--tts_tokens_in` path and the `tts_wav_out` path) with
calls to this helper to remove duplication and centralize logic.
In `@examples/qwen2_5omni/CMakeLists.txt`:
- Around line 17-19: The CMakeLists currently always adds the debug executable
mllm-qwen2_5-omni-image-runner-dbg; gate its creation behind a CMake option so
the debug artifact is not built by default. Add a boolean option (e.g.,
MLLM_BUILD_DEBUG_EXAMPLES) and wrap the
add_executable/target_link_libraries/target_include_directories for
mllm-qwen2_5-omni-image-runner-dbg inside an if(MLLM_BUILD_DEBUG_EXAMPLES) ...
endif() block so the target (and its links to MllmRT and MllmCPUBackend and
include dir ${MLLM_INCLUDE_DIR}) is only created when the option is enabled.
In `@mllm/backends/cpu/ops/EmbeddingOp.cpp`:
- Line 29: The process-wide static warned_token_oob suppresses OOB warnings
across all embedding layers; make it an instance member on CPUEmbeddingOp (e.g.,
add std::atomic<bool> warned_token_oob_{false}; to the CPUEmbeddingOp
declaration in EmbeddingOp.hpp) and update uses in EmbeddingOp.cpp to reference
the instance field instead of the static local (replace the static
std::atomic<bool> warned_token_oob and the CAS site to use
this->warned_token_oob_ or equivalent), so each embedding instance emits its own
one-time OOB warning.
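As a sketch, the per-instance flag the comment asks for can look like this (the class and method names are illustrative stand-ins, not the real CPUEmbeddingOp interface):

```cpp
#include <atomic>
#include <cstdio>

// Illustrative stand-in for CPUEmbeddingOp: each instance owns its own
// one-shot flag, so two embedding layers each warn once independently.
class EmbeddingLayer {
 public:
  void lookup(long token, long vocab_size) {
    if (token < 0 || token >= vocab_size) {
      bool expected = false;
      // compare_exchange_strong flips the flag exactly once per instance,
      // even if several threads hit an OOB token concurrently.
      if (warned_token_oob_.compare_exchange_strong(expected, true)) {
        std::fprintf(stderr, "embedding: token %ld out of range [0, %ld)\n",
                     token, vocab_size);
      }
      return;  // real code would clamp or zero-fill the row here
    }
    // ... normal row copy elided ...
  }

  bool hasWarned() const { return warned_token_oob_.load(); }

 private:
  std::atomic<bool> warned_token_oob_{false};
};
```

With a function-local `static`, the second instance below would stay silent after the first one warned; with the member flag, each warns independently.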
In `@mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py`:
- Around line 112-136: _collect_source_meta currently loads the entire
checkpoint then _stream_source_tensors reloads it, doubling I/O for large files;
modify _collect_source_meta to optionally return the loaded state dict (or
accept a preloaded state dict param) and update its callers to pass that state
into _stream_source_tensors so the file is only read once, or alternatively
refactor to a single-pass lazy stream where _stream_source_tensors queries
metadata on first access; update the function signatures for
_collect_source_meta and _stream_source_tensors and all call sites to propagate
the in-memory state dict when available and ensure gc.collect() only runs when
the state is truly discarded.
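A minimal sketch of the single-read refactor, with torch-free stand-ins for the checkpoint dict (the real _collect_source_meta/_stream_source_tensors signatures differ):

```python
# Sketch: load the checkpoint once, derive metadata from the in-memory
# state dict, then stream tensors from that same dict instead of re-reading.
def collect_source_meta(state_dict):
    """Return (ordered names, state_dict) so callers can reuse the loaded dict."""
    names = list(state_dict.keys())
    return names, state_dict


def stream_source_tensors(state_dict, names):
    """Yield (name, tensor) pairs from the already-loaded dict."""
    for name in names:
        yield name, state_dict[name]


# usage: one torch.load() instead of two
# state = torch.load(path, map_location="cpu", weights_only=True)
# names, state = collect_source_meta(state)
# for name, tensor in stream_source_tensors(state, names): ...
```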
- Around line 266-268: The current code builds a large in-memory zero buffer via
b"\x00" * reserved_bytes and writes it (reserved_bytes is computed from
MODEL_FILE_V2_DESC_SIZE and MODEL_FILE_V2_PARAM_DESC_SIZE), which can allocate
megabytes; instead, replace the allocation+write with a seek-based sparse-file
creation: use self.f.seek(reserved_bytes - 1, 1) (or io.SEEK_CUR) and then write
a single b"\x00" and flush — this avoids allocating the full zero buffer and
produces a sparse file while preserving the reserved region semantics used
around reserved_bytes, MODEL_FILE_V2_DESC_SIZE, and
MODEL_FILE_V2_PARAM_DESC_SIZE.
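The seek-based reservation is a two-liner; a self-contained sketch (reserve_region is a hypothetical helper, not a function in the converter):

```python
import io
import os
import tempfile


def reserve_region(f, reserved_bytes):
    """Reserve `reserved_bytes` from the current position without
    materializing a zero buffer: seek to the last byte and write one NUL.
    On filesystems with sparse-file support the hole costs no disk blocks."""
    if reserved_bytes <= 0:
        return
    f.seek(reserved_bytes - 1, io.SEEK_CUR)
    f.write(b"\x00")
    f.flush()


# usage sketch: reserve a 4 KiB descriptor region at the start of a file
with tempfile.NamedTemporaryFile(delete=False) as f:
    reserve_region(f, 4096)
    path = f.name
```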
In `@mllm/models/minicpm_o45/export_prompt_cache.py`:
- Around line 189-196: The try/except around audio_tokenizer.quantize currently
swallows the original Exception; change it to catch Exception as e and log or
attach e before falling back to CPU so the original error is preserved for
debugging: when calling audio_tokenizer.quantize(...) catch Exception as e,
record the exception (e) via logging or processLogger (or re-raise with context)
before calling _move_model(audio_tokenizer, torch.device("cpu")) and
audio_tokenizer.eval(), then retry audio_tokenizer.quantize(...) on the cpu
quantize_device.
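The shape of that fix, with the model-specific calls abstracted into callables (quantize_with_cpu_fallback is a hypothetical helper, not the script's real API):

```python
import logging

logger = logging.getLogger("export_prompt_cache")


def quantize_with_cpu_fallback(quantize, move_to_cpu, *args):
    """Run `quantize`; if it fails, log the original error before
    moving to CPU and retrying, so the root cause is not swallowed."""
    try:
        return quantize(*args)
    except Exception as e:  # noqa: BLE001 - deliberate broad fallback
        logger.warning("quantize failed on primary device (%s); retrying on CPU", e)
        move_to_cpu()
        return quantize(*args)
```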
In `@mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp`:
- Around line 1340-1351: The use of std::pow(2, i) in HiFTGenerator constructor
(e.g., computing in_ch and out_ch) should be replaced with integer bit-shifts to
avoid floating-point imprecision and improve performance: compute powers of two
as (1 << i) (and (1 << (i+1))) and ensure the result is cast to int32_t where
needed; apply the same replacement for the other occurrences referenced in the
diff (the uses at the spots corresponding to lines around 1359, 1372, and 1379)
so all exponent-of-two calculations use bit shifts rather than std::pow.
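The replacement itself is mechanical; for instance (the helper names are illustrative, not the constructor's actual variables):

```cpp
#include <cstdint>

// Powers of two via shifts: exact integer arithmetic, no float rounding,
// no std::pow call in a hot constructor loop.
inline int32_t channelsAt(int32_t base, int32_t i) {
  // was: static_cast<int32_t>(base / std::pow(2, i))
  return base / (1 << i);
}

inline int32_t nextChannelsAt(int32_t base, int32_t i) {
  // was: static_cast<int32_t>(base / std::pow(2, i + 1))
  return base / (1 << (i + 1));
}
```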
- Around line 1207-1213: Replace the per-call std::mt19937 construction in
f02sine with a persistent thread-local RNG to match the pattern used by
randomNormalLike and SourceModuleHnNSF2::randomLike: declare static thread_local
std::mt19937 rng seeded once (e.g., from std::random_device() at initialization)
and reuse it for the std::uniform_real_distribution<float> uni in the loop that
sets rp[(b*T + 0)*H + h] += phase0; to avoid reseeding on every call and ensure
consistent, efficient randomness.
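The thread-local pattern in isolation (uniformPhase is a stand-in for the f02sine sampling site):

```cpp
#include <random>

// One engine per thread, seeded once from std::random_device; repeated
// calls reuse it instead of reconstructing and reseeding std::mt19937,
// which is both faster and statistically sounder.
inline float uniformPhase() {
  static thread_local std::mt19937 rng{std::random_device{}()};
  std::uniform_real_distribution<float> uni(-1.0f, 1.0f);
  return uni(rng);
}
```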
- Around line 1-1522: The header is too large and should be split into focused
headers to improve maintainability and compile times: extract low-level helpers
and pure functions (e.g., repeatInterleaveSeq, concatInt64Seq, l2NormalizeRow,
tensorMish/LeakyRelu, makeHannWindow, relShift, addHeadBiasInplace,
randomNormalLike, SineGen2 helpers) into token2wav_utils.hpp; move the
flow/encoder/DiT/CFM-related classes (EspnetRelPositionalEncoding,
LinearNoSubsampling, ConformerEncoderLayer, UpsampleConformerEncoderV2,
DiTAttention, DiTBlock, DiTEstimator, CausalConditionalCFM,
CausalMaskedDiffWithXvec, MiniCPMO45FlowConfig and usages) into
token2wav_flow.hpp; and move HiFT/vocoder classes (ResBlock, SnakeActivation,
ConvRNNF0Predictor, SourceModuleHnNSF2, HiFTGenerator and MiniCPMO45HiFTConfig)
into token2wav_hift.hpp; update MiniCPMO45Token2WavModel to include these
headers and adjust include guards/namespaces and any cross-file forward
declarations (e.g., ensure SineGen2 and SourceModuleHnNSF2 visibility for
HiFTGenerator, and materializeWeightNormParameters/loadFromParameter usages
remain accessible). Ensure each new header keeps only the symbols it needs and
include the appropriate original dependencies (nn headers, Tensor utilities) and
update the single original header to include the three new headers and re-export
MiniCPMO45Token2WavModel.
In `@mllm/models/minicpm_o45/modeling_minicpm_o45.hpp`:
- Around line 150-187: MiniCPMO45TTS::loadFromParameter duplicates per-row
weight-norm materialization already implemented as _materializeWeightNorm in
token2wav_weight_norm.hpp; replace the manual loop that builds head_code_weight_
with a call to the shared _materializeWeightNorm (or factor out a small helper
used by both places) so normalization and epsilon handling are identical, i.e.,
consume g and v tensors, produce the per-row scaled weight Tensor and push into
head_code_weight_, ensuring the same kEps/clamping semantics as
_materializeWeightNorm.
In `@mllm/models/minicpm_o45/token2wav_weight_norm.hpp`:
- Around line 61-68: The variable key_g is redundant because within the loop
when _endsWith(key, marker) is true (marker ==
".parametrizations.weight.original0"), key already equals prefix + marker;
update the loop in token2wav_weight_norm.hpp to either use key directly instead
of key_g or set key_g = key (remove reconstructing the same string) and remove
any unnecessary allocations; adjust references to key_g accordingly (look for
key_g, key_v, key_w and the loop over keys) to keep semantics identical but
eliminate the redundant string build.
In `@mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp`:
- Around line 396-400: Replace the silent catch(...) around
nlohmann::json::parse(in) so parse failures aren’t swallowed: catch a
std::exception (e.g. catch (const std::exception& e)) and emit a diagnostic
(std::cerr or the module logger) that includes the exception.what() and context
(e.g. "failed to parse tokenizer JSON, special tokens may be missing"), then
return; update the code around json_data / nlohmann::json::parse to use this new
catch so users see a warning when the JSON is malformed.
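The general shape of the fix, with a function-pointer stand-in for nlohmann::json::parse so the sketch stays self-contained:

```cpp
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>

// Stand-in for the tokenizer-JSON load: report parse failures with context
// instead of silently returning from a catch(...).
bool loadSpecialTokens(const std::string& text,
                       void (*parseJson)(const std::string&)) {
  try {
    parseJson(text);  // real code: nlohmann::json::parse(in)
    return true;
  } catch (const std::exception& e) {
    std::cerr << "failed to parse tokenizer JSON, special tokens may be "
                 "missing: " << e.what() << "\n";
    return false;
  }
}
```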
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 1641-1646: sampleFromDistribution and categoricalSample recreate
std::random_device and std::mt19937 on every call which is wasteful and can
produce poor randomness; make a single RNG instance a persistent member (e.g.,
add a std::mt19937 rng; to the model class and initialize it once in the
constructor, seeding from std::random_device there), then update
sampleFromDistribution and categoricalSample to use that class member rng
instead of constructing local rd/gen objects and construct
std::discrete_distribution using the provided probs as before.
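A self-contained sketch of the member-RNG pattern (Sampler stands in for the model class):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// The sampler owns one engine, seeded once in the constructor, instead of
// constructing std::random_device + std::mt19937 on every sampling call.
class Sampler {
 public:
  Sampler() : rng_(std::random_device{}()) {}

  std::size_t categoricalSample(const std::vector<double>& probs) {
    std::discrete_distribution<std::size_t> dist(probs.begin(), probs.end());
    return dist(rng_);
  }

 private:
  std::mt19937 rng_;
};
```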
- Around line 421-450: Add concise doc comments above each public
class/function: Qwen2_5OmniPatchEmbed, Qwen2_5OmniVisionEncoder,
Qwen2_5OmniAudioEncoder, Qwen2_5OmniForCausalLM,
Qwen2_5OmniForConditionalGeneration, and utility functions like
makeMultimodalPositionEmbedding and makeWindowIndex. For each comment include a
one-line purpose, constructor/parameter descriptions (e.g., cfg fields used:
visual_in_chans, visual_hidden_size, patch sizes), expected input and output
tensor shapes (e.g., input tensor shape, view/reshape results, output embed
dim), return values, and any error/edge conditions; place them immediately above
the class or function definition using the project’s C++ comment style (Doxygen
or ///) and keep them short and focused.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- examples/CMakeLists.txt
- examples/minicpm_o45/CMakeLists.txt
- examples/minicpm_o45/config_minicpm_o45.json
- examples/minicpm_o45/main.cpp
- examples/qwen2_5omni/CMakeLists.txt
- examples/qwen2_5omni/image_infer.cpp
- mllm/backends/cpu/ops/EmbeddingOp.cpp
- mllm/models/minicpm_o2_6/modeling_chattts.hpp
- mllm/models/minicpm_o45/configuration_minicpm_o45.hpp
- mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py
- mllm/models/minicpm_o45/export_prompt_cache.py
- mllm/models/minicpm_o45/modeling_minicpm_o45.hpp
- mllm/models/minicpm_o45/modeling_minicpm_o45_token2wav.hpp
- mllm/models/minicpm_o45/token2wav_prompt_cache.hpp
- mllm/models/minicpm_o45/token2wav_weight_norm.hpp
- mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp
- mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp
- mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp
💤 Files with no reviewable changes (1)
- mllm/models/minicpm_o2_6/modeling_chattts.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/CMakeLists.txt
for (std::string line; std::getline(ifs, line);) {
  if (line.empty()) { continue; }
  std::stringstream ss(line);
  while (!ss.eof()) {
    int64_t token = 0;
    ss >> token;
    if (!ss.fail()) { token_ids.push_back(token); }
  }
}
Potential infinite loop when parsing malformed token file.
If a line contains non-numeric text, ss >> token sets failbit without advancing the stream position, so ss.eof() never becomes true and this loop spins forever.
🐛 Proposed fix
  std::stringstream ss(line);
- while (!ss.eof()) {
+ while (ss >> std::ws && !ss.eof()) {
    int64_t token = 0;
-   ss >> token;
-   if (!ss.fail()) { token_ids.push_back(token); }
+   if (ss >> token) {
+     token_ids.push_back(token);
+   } else {
+     break; // stop on non-numeric content
+   }
  }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/minicpm_o45/main.cpp` around lines 111 - 119, The token-parsing
inner loop can hang when a non-numeric token sets failbit and never advances the
stream; replace the while(!ss.eof()) pattern with a robust parse that advances
on each iteration — for example, use a loop that extracts into an integer (while
(ss >> token)) to terminate on parse failure, or read into a temporary string
(e.g., tokenStr) and use std::stoll in a try/catch to skip malformed tokens and
continue; update the code using the existing variables ss, token, and token_ids
so malformed entries no longer cause an infinite loop.
Argparse::parse(argc, argv);

mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}

if (help.isSet()) {
  Argparse::printHelp();
  mllm::shutdownContext();
  return 0;
}
🧩 Analysis chain
Repository: UbiquitousLearning/mllm

🏁 Script executed:
# Find Argparse implementation
find . -name "*.h" -o -name "*.cpp" | xargs grep -l "class Argparse\|Argparse::parse" | head -5
Length of output: 1728

🏁 Script executed:
# Check the Argparse header to understand parse() behavior
fd -t f -e h "Argparse" -o "argparse" | head -10
Length of output: 202

🏁 Script executed:
# Look at the image_infer.cpp file to confirm lines and flow
cat examples/qwen2_5omni/image_infer.cpp | head -50
Length of output: 1992

🏁 Script executed:
# Check how other examples handle required args and help
grep -rn "\.required(true)" examples/ --include="*.cpp" | head -20
Length of output: 2961

🏁 Script executed:
# Search for Argparse::parse in other examples to see the pattern
grep -rn "Argparse::parse" examples/ --include="*.cpp" -A 10 | head -80
Length of output: 4753

🏁 Script executed:
# Find the Argparse header/implementation
find . -type f \( -name "*.h" -o -name "*.hpp" \) | xargs grep -l "class Argparse\|struct Argparse" | grep -v test | head -10
Length of output: 92

🏁 Script executed:
# Look for parse method implementation
rg -n "Argparse::parse" --type=cpp --type=h -B 2 -A 10 | head -100
Length of output: 6821

🏁 Script executed:
# Search for the Argparse class definition
rg -n "class Argparse|struct Argparse" --type=h --type=cpp -A 20 | head -150
Length of output: 1277

🏁 Script executed:
# Read the Argparse.hpp file to see the full parse() implementation
sed -n '149,250p' mllm/utils/Argparse.hpp
Length of output: 3218

🏁 Script executed:
# Check if there's any help handling in parse()
rg -n "help|required|printHelp" mllm/utils/Argparse.hpp -B 2 -A 2 | head -80
Length of output: 2471
Help flag is unreachable when required arguments are missing.
Argparse::parse() enforces required arguments before returning (see Argparse.hpp lines 200–205). When run with --help alone, the missing required arguments (model_path, model_version, tokenizer_path, config_path) trigger MLLM_ERROR_EXIT inside parse(), terminating the program before the help check at line 31 executes. Check for the help flag before calling parse():
🐛 Proposed fix — check help before parse
+ if (help.isSet()) {
+   Argparse::printHelp();
+   mllm::shutdownContext();
+   return 0;
+ }
+
  Argparse::parse(argc, argv);
  mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
  if (model_version.get() == "v1") {
    file_version = mllm::ModelFileVersion::kV1;
  } else if (model_version.get() == "v2") {
    file_version = mllm::ModelFileVersion::kV2;
  }
-
- if (help.isSet()) {
-   Argparse::printHelp();
-   mllm::shutdownContext();
-   return 0;
- }

🤖 Prompt for AI Agents
- }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/image_infer.cpp` around lines 22 - 35, The help flag is
unreachable because Argparse::parse enforces required args before returning; fix
by detecting help manually from argv before calling Argparse::parse(argc, argv):
scan argc/argv for "--help" or "-h" and if present call Argparse::printHelp(),
mllm::shutdownContext(), and return 0 immediately, then only call
Argparse::parse(...) afterward; this preserves the existing
help.isSet()/printHelp() usage and avoids triggering the required-args check in
Argparse::parse.
    obj = torch.load(path, map_location="cpu", weights_only=True)
except TypeError:
    obj = torch.load(path, map_location="cpu")
Silent fallback to insecure torch.load (arbitrary pickle execution)
When weights_only=True raises TypeError on older PyTorch builds, the code silently falls back to weights_only=False, which allows arbitrary Python objects to be deserialized via pickle. Running this tool against a maliciously crafted checkpoint would execute attacker-controlled code.
At minimum, emit a warnings.warn before the fallback load:
🛡️ Proposed fix
+import warnings
 try:
     obj = torch.load(path, map_location="cpu", weights_only=True)
 except TypeError:
+    warnings.warn(
+        f"torch.load() does not support 'weights_only' on this PyTorch version; "
+        f"loading {path} without safety restrictions. Ensure the checkpoint is trusted.",
+        stacklevel=2,
+    )
     obj = torch.load(path, map_location="cpu")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/minicpm_o45/convert_token2wav_pt_to_mllm.py` around lines 76 -
78, The try/except around torch.load that sets obj (obj = torch.load(path,
map_location="cpu", weights_only=True)) and silently falls back to obj =
torch.load(path, map_location="cpu") is insecure; modify the except TypeError block in
convert_token2wav_pt_to_mllm.py to emit a warnings.warn (or logging.warning)
explaining that weights_only is unsupported on this PyTorch build and that the
fallback will deserialize arbitrary Python objects, then proceed only after the
warning; reference the existing symbols obj, path, and the torch.load calls so
you add the warning immediately before the fallback load and keep the behavior
otherwise.
writer = _V2StreamingWriter(output=output, model_name=model_name, max_params=max_param_desc)
written = 0
try:
    for m, t in _stream_source_tensors(flow_pt, flow_metas):
        writer.write_tensor(m.full_name, t)
        written += 1
    for m, t in _stream_source_tensors(hift_pt, hift_metas):
        writer.write_tensor(m.full_name, t)
        written += 1
    writer.finalize()
finally:
    writer.close()
_V2StreamingWriter construction outside try block causes file-handle leak on init failure
If _V2StreamingWriter.__init__ raises (e.g., disk full during the zero-byte pre-allocation at line 267), the try block is never entered, so finally: writer.close() is never reached. The file opened on line 264 relies on CPython's reference-counting GC to be closed, and a partially-written output file is silently left on disk.
🛡️ Proposed fix
- output.parent.mkdir(parents=True, exist_ok=True)
- writer = _V2StreamingWriter(output=output, model_name=model_name, max_params=max_param_desc)
- written = 0
- try:
+ output.parent.mkdir(parents=True, exist_ok=True)
+ writer = None
+ written = 0
+ try:
+ writer = _V2StreamingWriter(output=output, model_name=model_name, max_params=max_param_desc)
for m, t in _stream_source_tensors(flow_pt, flow_metas):
writer.write_tensor(m.full_name, t)
written += 1
for m, t in _stream_source_tensors(hift_pt, hift_metas):
writer.write_tensor(m.full_name, t)
written += 1
writer.finalize()
finally:
- writer.close()
+ if writer is not None:
+            writer.close()
target_len = int(prompt_tokens.shape[1]) * up_rate
if target_len > prompt_mels.shape[1]:
    prompt_mels = torch.nn.functional.pad(
        prompt_mels,
        (0, 0, 0, target_len - prompt_mels.shape[1]),
        mode="replicate",
    )
Missing truncation when target_len < prompt_mels.shape[1].
Only the case where target_len > prompt_mels.shape[1] is handled (padding). If the mel spectrogram is longer than prompt_tokens.shape[1] * up_rate, the extra frames will be written to the cache and may cause a shape mismatch during inference.
Proposed fix
target_len = int(prompt_tokens.shape[1]) * up_rate
if target_len > prompt_mels.shape[1]:
prompt_mels = torch.nn.functional.pad(
prompt_mels,
(0, 0, 0, target_len - prompt_mels.shape[1]),
mode="replicate",
)
+ elif target_len < prompt_mels.shape[1]:
+    prompt_mels = prompt_mels[:, :target_len, :]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/minicpm_o45/export_prompt_cache.py` around lines 219 - 225, The
code only pads prompt_mels when target_len > prompt_mels.shape[1] but doesn't
truncate when target_len < prompt_mels.shape[1], which can leave extra mel
frames and cause shape mismatches; update the block around target_len,
prompt_tokens, prompt_mels and up_rate to handle both cases by
slicing/truncating prompt_mels to target_len (e.g., keep first target_len frames
along dim 1) when prompt_mels.shape[1] > target_len, and keep the existing
torch.nn.functional.pad logic for the > case so prompt_mels always ends up with
length equal to target_len.
int32_t pos = 0;
for (const auto& tag : image_tags) {
  auto found = applied_string.find(tag, pos);
  if (found != std::string::npos) {
    text_chunks.push_back(applied_string.substr(pos, found - pos));
    pos = static_cast<int32_t>(found + tag.size());
  }
}
int32_t pos can overflow for long strings — use size_t.
pos is used to track position in applied_string but is declared as int32_t. If found + tag.size() exceeds INT32_MAX, the static_cast<int32_t> truncates the value. While unlikely for typical chat messages, using size_t is the idiomatic and safe choice for string positions.
Proposed fix
- int32_t pos = 0;
+ size_t pos = 0;
for (const auto& tag : image_tags) {
auto found = applied_string.find(tag, pos);
if (found != std::string::npos) {
text_chunks.push_back(applied_string.substr(pos, found - pos));
- pos = static_cast<int32_t>(found + tag.size());
+ pos = found + tag.size();
}
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp` around lines 337 - 344,
The loop in tokenization_minicpm_o45.hpp uses int32_t pos to track positions in
applied_string which can overflow; change the type of pos to size_t and update
the cast when updating pos (replace static_cast<int32_t>(found + tag.size())
with static_cast<size_t>(found + tag.size()) or simply rely on size_t
arithmetic) inside the loop that iterates over image_tags and uses
applied_string.find to build text_chunks so string positions use the correct
unsigned size type.
auto* pixel_values_ptr = pixel_values.ptr<float_t>();
std::memset(pixel_values_ptr, 0, static_cast<size_t>(img_tensors.size()) * channels * patch_size * hw_patch_size * sizeof(float_t));

for (int32_t b = 0; b < static_cast<int32_t>(img_tensors.size()); ++b) {
  int32_t src_hw = img_tensors[b].shape()[2];
  const auto* src_ptr = img_tensors[b].ptr<float>();

  for (int32_t c = 0; c < channels; ++c) {
    for (int32_t p = 0; p < patch_size; ++p) {
      int32_t src_offset = c * patch_size * src_hw + p * src_hw;
      int32_t dst_offset = b * channels * patch_size * hw_patch_size + c * patch_size * hw_patch_size + p * hw_patch_size;
      std::memcpy(pixel_values_ptr + dst_offset, src_ptr + src_offset, src_hw * sizeof(float_t));
float_t may not be float — use float consistently.
float_t is defined by <cmath> and its actual type depends on FLT_EVAL_METHOD. On some platforms/compilers it can be double or long double, causing a size mismatch with the kFloat32 tensor allocation. The tensor is allocated as kFloat32 (4 bytes per element), but sizeof(float_t) could be 8 bytes, leading to buffer overruns in the memset and memcpy calls.
🛡️ Proposed fix
- auto* pixel_values_ptr = pixel_values.ptr<float_t>();
- std::memset(pixel_values_ptr, 0, static_cast<size_t>(img_tensors.size()) * channels * patch_size * hw_patch_size * sizeof(float_t));
+ auto* pixel_values_ptr = pixel_values.ptr<float>();
+ std::memset(pixel_values_ptr, 0, static_cast<size_t>(img_tensors.size()) * channels * patch_size * hw_patch_size * sizeof(float));

And on line 483:
- std::memcpy(pixel_values_ptr + dst_offset, src_ptr + src_offset, src_hw * sizeof(float_t));
+ std::memcpy(pixel_values_ptr + dst_offset, src_ptr + src_offset, src_hw * sizeof(float));

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/minicpm_o45/tokenization_minicpm_o45.hpp` around lines 472 - 483,
The code uses float_t which can vary by platform; replace uses of float_t with
explicit float to match the kFloat32 tensor allocation: change
pixel_values.ptr<float_t>() to pixel_values.ptr<float>(), src_ptr to
img_tensors[b].ptr<float>(), and use sizeof(float) in the std::memset and
std::memcpy length calculations (ensure all occurrences inside the block that
use float_t are updated to float and sizeof(float) so no size mismatch or buffer
overrun occurs for pixel_values_ptr, src_ptr, dst_offset calculations).
if (root.contains("thinker_config")) {
  auto& thinker_cfg = root["thinker_config"];
  auto& text_cfg = thinker_cfg["text_config"];

  hidden_size = text_cfg["hidden_size"];
  intermediate_size = text_cfg["intermediate_size"];
  num_attention_heads = text_cfg["num_attention_heads"];
  num_key_value_heads = text_cfg["num_key_value_heads"];
  num_hidden_layers = text_cfg["num_hidden_layers"];
  max_position_embeddings = text_cfg["max_position_embeddings"];
  rms_norm_eps = text_cfg["rms_norm_eps"];
  vocab_size = text_cfg["vocab_size"];
  rope_theta = text_cfg["rope_theta"];
  tie_word_embeddings = text_cfg.value("tie_word_embeddings", false);
Unguarded access to thinker_config["text_config"] — will throw if key is missing.
Line 220 accesses thinker_cfg["text_config"] without checking thinker_cfg.contains("text_config") first. If thinker_config is present but lacks text_config, this throws nlohmann::json::out_of_range. The same applies to lines 222-230 which use direct [] access on text_cfg.
If text_config is a required field when thinker_config is present, consider adding an explicit check with a descriptive error message to help users diagnose config issues.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp` around lines 218 -
231, The code reads thinker_cfg["text_config"] and then accesses fields via
text_cfg[...] without guarding for absence, which will throw
nlohmann::json::out_of_range if "text_config" is missing; update the block that
defines thinker_cfg/text_cfg to first check thinker_cfg.contains("text_config")
and either throw a clear runtime_error (or log and return) mentioning the
missing "thinker_config.text_config" key, or use safe accessors like
text_cfg.value("hidden_size", default) for each field (hidden_size,
intermediate_size, num_attention_heads, num_key_value_heads, num_hidden_layers,
max_position_embeddings, rms_norm_eps, vocab_size, rope_theta,
tie_word_embeddings) so the code in configuration_qwen2_5omni.hpp handles a
missing "text_config" deterministically.
inline auto makeAudioSinusoidalPosEmb(int32_t length, int32_t channels, float max_timescale = 10000.0f) -> Tensor {
  MLLM_RT_ASSERT(channels % 2 == 0);
  auto pos_emb = Tensor::empty({length, channels}, kFloat32, kCPU).alloc();
  auto pos_ptr = pos_emb.ptr<float>();

  const int half = channels / 2;
  const float log_timescale_increment = std::log(max_timescale) / static_cast<float>(half - 1);
Potential division by zero when channels == 2.
Line 403 computes std::log(max_timescale) / static_cast<float>(half - 1) where half = channels / 2. If channels == 2, half is 1 and this divides by zero. While unlikely given typical model configs (e.g. audio_d_model = 1024), a guard would be defensive.
🛡️ Proposed fix
const int half = channels / 2;
+ MLLM_RT_ASSERT(half > 1);
  const float log_timescale_increment = std::log(max_timescale) / static_cast<float>(half - 1);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 397 - 403, The
computation in makeAudioSinusoidalPosEmb can divide by zero when channels==2
because half = channels/2 -> 1 and you compute log_timescale_increment =
std::log(max_timescale)/static_cast<float>(half - 1); change the logic to guard
when half <= 1: if half > 1 compute log_timescale_increment as before, otherwise
set log_timescale_increment = 0 (and subsequently handle inv_timescales/scale
computation accordingly so you don't divide by zero or produce NaNs). Update
references in makeAudioSinusoidalPosEmb to use this conditional denominator and
ensure the generated pos_emb for the half==1 case produces a valid (e.g., zero
or constant) embedding rather than crashing.
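The guarded denominator can be exercised in isolation. Below is a minimal standalone sketch of the sinusoidal position-embedding fill, using a plain `std::vector` and `assert` in place of the repo's `Tensor` and `MLLM_RT_ASSERT` (the helper name and layout are assumptions, not the repo's API); when `half == 1` the increment falls back to zero instead of dividing by zero:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: sinusoidal position embeddings with a guard for the
// half == 1 edge case (channels == 2), where (half - 1) would be zero.
std::vector<float> sinusoidalPosEmb(int length, int channels, float max_timescale = 10000.0f) {
  assert(channels % 2 == 0);
  const int half = channels / 2;
  // Conditional denominator: with a single timescale there is nothing to
  // interpolate over, so the log increment is simply 0 (inv_timescale == 1).
  const float log_inc =
      (half > 1) ? std::log(max_timescale) / static_cast<float>(half - 1) : 0.0f;
  std::vector<float> emb(static_cast<size_t>(length) * channels);
  for (int pos = 0; pos < length; ++pos) {
    for (int i = 0; i < half; ++i) {
      const float inv_timescale = std::exp(-log_inc * static_cast<float>(i));
      emb[pos * channels + i] = std::sin(pos * inv_timescale);          // sin block
      emb[pos * channels + half + i] = std::cos(pos * inv_timescale);   // cos block
    }
  }
  return emb;
}
```

With this guard, `channels == 2` yields plain `sin(pos)` / `cos(pos)` pairs rather than NaNs.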
```cpp
if (!generated_codes.empty()) { generated_codes.pop_back(); }
if (generated_codes.empty()) {
  MLLM_ERROR_EXIT(ExitCode::kCoreError, "Talker produced no codec tokens.");
}
return makeTokenTensor(generated_codes);
```
Last valid codec token is dropped when the talker reaches max_new_tokens without EOS.
Line 2026 unconditionally pops the last element. This correctly strips a trailing EOS token when the loop breaks at line 2017, but if the loop exits naturally (max iterations reached), the last real codec token is removed instead.
🐛 Proposed fix — only pop when the loop ended on EOS

```diff
+bool ended_with_eos = !generated_codes.empty() &&
+    std::find(eos_ids.begin(), eos_ids.end(), generated_codes.back()) != eos_ids.end();
-if (!generated_codes.empty()) { generated_codes.pop_back(); }
+if (ended_with_eos) { generated_codes.pop_back(); }
 if (generated_codes.empty()) {
   MLLM_ERROR_EXIT(ExitCode::kCoreError, "Talker produced no codec tokens.");
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp` around lines 2026 - 2030,
The code currently unconditionally pops the last element of generated_codes
which removes a valid token when generation stops due to max_new_tokens; change
it to only pop when the last token is an EOS token (or when the talker signaled
it ended on EOS). Locate where generated_codes is modified and returned (around
generated_codes, makeTokenTensor and the loop that calls the talker), and
replace the unconditional pop_back() with a check like: if
(!generated_codes.empty() && generated_codes.back() ==
<EOS_ID_or_codec.eos_token>) { generated_codes.pop_back(); } or use the loop’s
ended_on_eos flag if available to decide whether to remove the trailing EOS
before calling makeTokenTensor; leave the MLLM_ERROR_EXIT path intact for empty
results.
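The "only pop on EOS" logic is easy to check in isolation. Here is a minimal standalone sketch (the helper name and the plain `std::vector<long>` token representation are assumptions for illustration, not the repo's types):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch: strip a trailing EOS token only when generation
// actually ended on EOS, so a sequence truncated at max_new_tokens keeps
// its final (valid) codec token.
std::vector<long> stripTrailingEos(std::vector<long> codes, const std::vector<long>& eos_ids) {
  const bool ended_with_eos = !codes.empty()
      && std::find(eos_ids.begin(), eos_ids.end(), codes.back()) != eos_ids.end();
  if (ended_with_eos) { codes.pop_back(); }
  return codes;
}
```

A sequence ending in an EOS id loses only that token; a sequence cut off by the iteration limit is returned unchanged.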
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
mllm/backends/cpu/ops/EmbeddingOp.cpp (1)
60-67: ⚠️ Potential issue | 🟠 Major

The `kGGUF_Q4_0` case was not updated during the refactor — shadow variable and dead check. Two problems introduced by the surrounding restructuring:

Line 61 — shadowed variable: `auto token_idx` re-fetches the same value already available as `const auto token_idx` from line 35. This is inside the `else` branch (valid token range guaranteed), so the re-declaration is dead computation and will trigger `-Wshadow` compiler warnings.

Line 62 — dead guard: `if (token_idx >= 0)` is now unreachable-false because the `else` block at line 44 already ensures `token_idx >= 0 && token_idx < options_.vocab_size`. It can never be false here, so the entire `if` is dead code and silently drops the upper-bound guard implied by the original pre-refactor logic.

🐛 Proposed fix — align `kGGUF_Q4_0` with the refactored pattern

```diff
 case kGGUF_Q4_0: {
-  auto token_idx = *ins.coffsettedPtr<mllm_int64_t>({b, (int)s});
-  if (token_idx >= 0) {
-    dequantize_row_q4_0(weight_.ptr<block_q4_0>() + token_idx * options_.hidden_size / QK4_0,
-                        ous.coffsettedPtr<float>({b, (int)s, 0}), options_.hidden_size);
-  }
+  dequantize_row_q4_0(weight_.ptr<block_q4_0>() + token_idx * options_.hidden_size / QK4_0,
+                      reinterpret_cast<float*>(out_ptr), options_.hidden_size);
   break;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@mllm/backends/cpu/ops/EmbeddingOp.cpp` around lines 60 - 67, in the kGGUF_Q4_0 branch remove the shadowed re-declaration and the redundant if-check: use the existing const auto token_idx (already computed earlier from ins.coffsettedPtr) instead of creating a new auto token_idx, and drop the unreachable "if (token_idx >= 0)" guard so you directly call dequantize_row_q4_0(weight_.ptr<block_q4_0>() + token_idx * options_.hidden_size / QK4_0, ous.coffsettedPtr<float>({b, (int)s, 0}), options_.hidden_size); ensuring the token_idx type and usage match the earlier declaration and preserve the upper-bound assumptions from the surrounding else branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mllm/backends/cpu/ops/EmbeddingOp.cpp`:
- Line 59: Remove the trailing whitespace on the blank line at line 59 in
mllm/backends/cpu/ops/EmbeddingOp.cpp (the file containing the EmbeddingOp
implementation); replace the whitespace-only line with either an empty newline
(no spaces) or remove the line entirely so no line ends with trailing spaces,
then run the repository linter/format step to ensure no other lines violate the
"no trailing whitespace" rule.
---
Outside diff comments:
In `@mllm/backends/cpu/ops/EmbeddingOp.cpp`:
- Around line 60-67: In the kGGUF_Q4_0 branch of EmbeddingOp.cpp remove the
shadowed re-declaration and the redundant if-check: use the existing const auto
token_idx (already computed earlier from ins.coffsettedPtr) instead of creating
a new auto token_idx, and drop the unreachable "if (token_idx >= 0)" guard so
you directly call dequantize_row_q4_0(weight_.ptr<block_q4_0>() + token_idx *
options_.hidden_size / QK4_0, ous.coffsettedPtr<float>({b, (int)s, 0}),
options_.hidden_size); ensuring the token_idx type and usage match the earlier
declaration and preserve the upper-bound assumptions from the surrounding else
branch.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/CMakeLists.txt
- mllm/backends/cpu/ops/EmbeddingOp.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/CMakeLists.txt
```cpp
  }
  break;
}

```
Trailing whitespace on line 59 — coding guidelines violation.
Line 59 contains only whitespace characters. As per coding guidelines, "No line may end with trailing whitespace."
🛠️ Proposed fix: delete the whitespace-only line, or replace it with an empty line containing no spaces.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@mllm/backends/cpu/ops/EmbeddingOp.cpp` at line 59, Remove the trailing
whitespace on the blank line at line 59 in mllm/backends/cpu/ops/EmbeddingOp.cpp
(the file containing the EmbeddingOp implementation); replace the
whitespace-only line with either an empty newline (no spaces) or remove the line
entirely so no line ends with trailing spaces, then run the repository
linter/format step to ensure no other lines violate the "no trailing whitespace"
rule.
Actionable comments posted: 9
🧹 Nitpick comments (3)
examples/minicpm_o45/CMakeLists.txt (1)

5-7: Consider removing the commented-out Python runner block.

The `mllm-minicpm-o45-runner-python` target and the corresponding `main_python.cpp` source don't appear to be part of this PR. Committed dead CMake code adds noise; prefer tracking the future work in an issue rather than leaving it commented out.

🧹 Proposed cleanup

```diff
 add_executable(mllm-minicpm-o45-runner main.cpp)
 target_link_libraries(mllm-minicpm-o45-runner PRIVATE MllmRT MllmCPUBackend)
 target_include_directories(mllm-minicpm-o45-runner PRIVATE ${MLLM_INCLUDE_DIR})
-
-# add_executable(mllm-minicpm-o45-runner-python main_python.cpp)
-# target_link_libraries(mllm-minicpm-o45-runner-python PRIVATE MllmRT MllmCPUBackend)
-# target_include_directories(mllm-minicpm-o45-runner-python PRIVATE ${MLLM_INCLUDE_DIR})
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/minicpm_o45/CMakeLists.txt` around lines 5 - 7, remove the commented-out Python runner CMake block to avoid dead code: delete the three commented lines that reference the mllm-minicpm-o45-runner-python target, its link libraries (MllmRT, MllmCPUBackend), and the target_include_directories entry for ${MLLM_INCLUDE_DIR}; if this functionality is desired later, track it as an issue rather than leaving commented CMake in the repo (identify the block by the target name mllm-minicpm-o45-runner-python and the source main_python.cpp).

examples/qwen2_5omni/image_infer_dbg.cpp (1)
13-13: Remove the commented-out `MLLM_MAIN` macro.

This is dead code left over from a previous iteration. Since `int main(...)` is used directly, the comment adds noise.

```diff
-//MLLM_MAIN({
 int main(int argc, char** argv) {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen2_5omni/image_infer_dbg.cpp` at line 13, remove the dead commented-out macro invocation by deleting the line containing "//MLLM_MAIN({" so the file uses the explicit int main(...) implementation only; locate the commented token "MLLM_MAIN" in examples/qwen2_5omni/image_infer_dbg.cpp and remove that commented-out macro line to avoid noise.

examples/qwen2_5omni/audio_out_infer.cpp (1)
89-89: Add a `perfSummary()` delegation method to `Qwen2_5OmniForConditionalGeneration`.

Calling `qwen_omni.thinker_.perfSummary()` directly exposes an internal member. Unlike other model examples (text_infer, image_infer) that call `.perfSummary()` directly on the model instance, `Qwen2_5OmniForConditionalGeneration` doesn't expose this method. Add a public `void perfSummary() { thinker_.perfSummary(); }` wrapper for API consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qwen2_5omni/audio_out_infer.cpp` at line 89, Add a public delegation method to Qwen2_5OmniForConditionalGeneration so callers can invoke perfSummary() without accessing the internal thinker_ member; implement a public void perfSummary() that simply calls thinker_.perfSummary(); place it in the class public section (matching other examples) so examples like audio_out_infer.cpp can call model.perfSummary() instead of qwen_omni.thinker_.perfSummary().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/qwen2_5omni/audio_out_infer.cpp`:
- Around line 31-36: The code silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is neither "v1" nor "v2";
update the conditional in the block that sets file_version (which currently
checks model_version.get() and assigns to file_version) to include a final else
branch that rejects invalid values—e.g., log or print an error mentioning the
invalid model_version.get() value and exit/throw so the program fails fast
rather than proceeding with a wrong mllm::ModelFileVersion.
- Around line 75-76: Check for the presence of the "sequence" key and verify
tensor rank before indexing: avoid using operator[] on inputs (which may
default-insert) and instead use find/contains or at(), and ensure
inputs["sequence"].shape().size() >= 2 before reading shape()[1]; do the same
for output.sequences (ensure output.sequences.shape().size() >= 2) before
computing input_len and total_len, and return or throw a clear error if the
checks fail so you don't perform an out-of-bounds access when computing
input_len and total_len.
- Around line 77-81: The code mixes narrow (fmt::print) and wide (std::wcout)
output which can cause UB and ordering issues; change to a single stream: either
(A) use narrow output only by making qwen_tokenizer.detokenize return or be
converted to std::string/UTF-8 and replace std::wcout with fmt::print or
std::cout, or (B) use wide output only by replacing fmt::print calls with
std::wcout and ensure detokenize returns std::wstring; also fix the loop index
narrowing by changing "int i = input_len" to a matching unsigned type (e.g.,
size_t or the type of input_len) or use auto with an explicit cast to avoid
signed/unsigned conversion. Reference symbols: fmt::print, std::wcout,
qwen_tokenizer.detokenize, output.sequences.at<mllm::mllm_int64_t>, and the loop
variable "i".
- Around line 83-85: The code is using wav.shape().back() and wav.ptr<float>()
unsafely when output.wav may be empty; before constructing wenet::WavWriter or
calling wav_writer.Write, check that output.wav is non-empty (e.g., verify
output.wav.numel() > 0 or !output.wav.shape().empty() &&
output.wav.shape().back() > 0), and only then compute the sample count and get
wav.ptr<float>(); if empty, skip creating wenet::WavWriter and avoid calling
Write (or write an empty file/return an error), referencing the
variables/functions wav, output.wav, wenet::WavWriter, wav_writer.Write, and
output_path to locate the change. Ensure you convert the sample count to the
integer type expected by WavWriter and do not dereference wav.ptr<float>() when
numel == 0.
- Around line 17-42: The help flag check is executed too late (after
required-arg validation) so running with only -h fails; after calling
Argparse::parse(argc, argv) immediately check help.isSet() and handle it (call
Argparse::printHelp() and mllm::shutdownContext() then return) before any
required-argument-dependent logic such as the model_version parsing; update the
code around Argparse::parse, help.isSet(), Argparse::printHelp(), and
mllm::shutdownContext() accordingly.
In `@examples/qwen2_5omni/image_infer_dbg.cpp`:
- Around line 28-33: The code silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is not "v1" or "v2"; change
the logic in the block that sets file_version (using model_version.get()) to
validate the value and reject unknown values instead of falling back: if
model_version is "v1" or "v2" set file_version accordingly, otherwise log an
error (or throw/exit) indicating the unsupported model_version value and list
accepted values, so callers cannot accidentally load the wrong layout when
model_version is mistyped.
- Around line 53-63: The code currently hardcodes image_path and prompt_text and
leaves the interactive prompts unused; restore interactivity by re-enabling the
std::getline calls for image_path and prompt_text (remove or guard the debug
assignments "../../rsc/pics.jpg" and "描述图片中物体") so the prompts actually read
from stdin, ensuring the exit/quit check on image_path and subsequent logic work
as intended; update the block around the variables image_path, prompt_text, the
std::getline calls, and the exit check so the program behaves interactively
again.
- Around line 20-39: The help flag check must run before Argparse::parse()
because parse() enforces required arguments and will exit before help.isSet() is
evaluated; move the current help parsing logic so you call Argparse::add for
-h/--help, then inspect help.isSet() and call Argparse::printHelp() and
mllm::shutdownContext()/return before invoking Argparse::parse(), or
alternatively update Argparse::parse() to short-circuit when it detects
-h/--help (exempting it from the required-argument validation) so the existing
if (help.isSet()) { Argparse::printHelp(); mllm::shutdownContext(); return 0; }
can run.
- Around line 69-78: The callback mixes fmt::print (narrow stdout) and
std::wcout (wide stdout) causing unsafe interleaving: update the
qwen2_5omni.streamGenerate callback to write via a single narrow output path
instead of std::wcout — call qwen2_5omni_tokenizer.detokenize and ensure you
get/convert a UTF‑8 std::string, then emit it with fmt::print (or std::cout) so
all output uses the same channel; if you truly need wide output, flush
fmt::print before the response and switch to using std::wcout exclusively for
the whole response block.
---
Nitpick comments:
In `@examples/minicpm_o45/CMakeLists.txt`:
- Around line 5-7: Remove the commented-out Python runner CMake block to avoid
dead code: delete the three commented lines that reference the
mllm-minicpm-o45-runner-python target, its link libraries (MllmRT,
MllmCPUBackend), and the target_include_directories entry for
${MLLM_INCLUDE_DIR}; if this functionality is desired later, track it as an
issue rather than leaving commented CMake in the repo (identify the block by the
target name mllm-minicpm-o45-runner-python and the source main_python.cpp).
In `@examples/qwen2_5omni/audio_out_infer.cpp`:
- Line 89: Add a public delegation method to Qwen2_5OmniForConditionalGeneration
so callers can invoke perfSummary() without accessing the internal thinker_
member; implement a public void perfSummary() that simply calls
thinker_.perfSummary(); place it in the class public section (matching other
examples) so examples like audio_out_infer.cpp can call model.perfSummary()
instead of qwen_omni.thinker_.perfSummary().
In `@examples/qwen2_5omni/image_infer_dbg.cpp`:
- Line 13: Remove the dead commented-out macro invocation by deleting the line
containing "//MLLM_MAIN({" so the file uses the explicit int main(...)
implementation only; locate the commented token "MLLM_MAIN" in
examples/qwen2_5omni/image_infer_dbg.cpp and remove that commented-out macro
line to avoid noise.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/minicpm_o45/CMakeLists.txt
- examples/qwen2_5omni/audio_out_infer.cpp
- examples/qwen2_5omni/image_infer_dbg.cpp
```cpp
auto& help = Argparse::add<bool>("-h|--help").help("Show help message");
auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model path").required(true);
auto& model_version = Argparse::add<std::string>("-mv|--model_version").help("Model version").required(true);
auto& tokenizer_path = Argparse::add<std::string>("-t|--tokenizer_path").help("Tokenizer directory").required(true);
auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true);
auto& spk_dict_path = Argparse::add<std::string>("-s|--spk_dict_path").help("Speaker json path").required(true);
auto& prompt = Argparse::add<std::string>("-p|--prompt").help("Prompt text").def("");
auto& image_path = Argparse::add<std::string>("-i|--image_path").help("Image path").def("");
auto& audio_path = Argparse::add<std::string>("-a|--audio_path").help("Audio path").def("");
auto& speaker = Argparse::add<std::string>("-sp|--speaker").help("Speaker name (default: first entry)").def("");
auto& output_path = Argparse::add<std::string>("-o|--output_path").help("Output wav path").def("./qwen2_5omni.wav");

Argparse::parse(argc, argv);

mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}

if (help.isSet()) {
  Argparse::printHelp();
  mllm::shutdownContext();
  return 0;
}
```
❓ Verification inconclusive
Script executed:
#!/bin/bash
# Find the Argparse implementation
find . -type f \( -name "*.h" -o -name "*.hpp" -o -name "*.cpp" \) | xargs grep -l "class Argparse" | head -5

Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
./mllm/utils/Argparse.hpp
Script executed:
#!/bin/bash
# Check if Argparse is in a well-known location (include directory)
fd -e h -e hpp "Argparse" | head -10

Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
mllm/utils/Argparse.hpp
Script executed:
#!/bin/bash
# Search for Argparse::parse implementation
rg -n "void Argparse::parse|class Argparse" --type=cpp --type=hpp -A 5 | head -50

Repository: UbiquitousLearning/mllm
Exit code: 0
stderr:
unrecognized file type: hpp
Script executed:
#!/bin/bash
# Read the Argparse.hpp file to understand parse() implementation
wc -l ./mllm/utils/Argparse.hpp

Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
279 ./mllm/utils/Argparse.hpp
Script executed:
#!/bin/bash
# Look at parse() method in detail
rg -n "parse\(" ./mllm/utils/Argparse.hpp -A 30 -B 2

Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
23- public:
24- virtual ~ArgumentBase() = default;
25: virtual void parse(const std::string& value) = 0;
26- virtual void handleFlag() = 0;
27- [[nodiscard]] virtual const std::vector<std::string>& flags() const = 0;
28- [[nodiscard]] virtual bool isPositional() const = 0;
29- [[nodiscard]] virtual bool isRequired() const = 0;
30- [[nodiscard]] virtual bool isSet() const = 0;
31- [[nodiscard]] virtual std::string help() const = 0;
32- [[nodiscard]] virtual std::string meta() const = 0;
33- [[nodiscard]] virtual bool isBoolean() const = 0;
34- [[nodiscard]] virtual bool needsValue() const = 0;
35-};
36-
37-template<typename T>
38-class Argument : public ArgumentBase, public ArgArgument<Argument<T>> {
39- public:
40- Argument& flag(const std::string& f) {
41- flags_.push_back(f);
42- return *this;
43- }
44-
45- Argument& help(const std::string& text) {
46- help_ = text;
47- return *this;
48- }
49-
50- Argument& required(bool val = true) {
51- required_ = val;
52- return *this;
53- }
54-
55- Argument& positional(bool val = true) {
--
68- }
69-
70: void parse(const std::string& value) override {
71- std::istringstream iss(value);
72- if constexpr (std::is_same_v<T, bool>) {
73- // Handle boolean values specially
74- if (value == "true" || value == "1") {
75- value_ = true;
76- } else if (value == "false" || value == "0") {
77- value_ = false;
78- } else {
79- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid boolean value");
80- }
81- } else {
82- if constexpr (std::is_same<T, std::string>()) {
83- value_ = iss.str();
84- } else {
85- iss >> value_;
86- }
87- if (iss.fail()) { MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid value type"); }
88- }
89- is_set_ = true;
90- }
91-
92- void handleFlag() override {
93- if constexpr (std::is_same_v<T, bool>) {
94- value_ = true;
95- is_set_ = true;
96- } else {
97- throw std::runtime_error("Non-boolean flag not supported");
98- }
99- }
100-
--
133-class Argparse {
134- public:
135: Argparse() = default;
136-
137- template<typename T>
138- static Argument<T>& add(const std::string& flags) {
139- auto& inst = instance();
140- auto arg = std::make_unique<Argument<T>>();
141- auto flag_list = splitFlags(flags);
142- for (const auto& f : flag_list) { arg->flag(f); }
143- if (flag_list.empty()) arg->positional();
144- Argument<T>* ptr = arg.get();
145- inst.args_.push_back(std::move(arg));
146- return *ptr;
147- }
148-
149: static void parse(int argc, char* argv[]) {
150- auto& inst = instance();
151- std::vector<std::string> args(argv + 1, argv + argc);
152- std::vector<std::string> positional_args;
153-
154- // First parse options
155- for (size_t i = 0; i < args.size(); ++i) {
156- const auto& arg = args[i];
157- if (arg[0] == '-') {
158- bool found = false;
159- for (auto& param : inst.args_) {
160- const auto& flags = param->flags();
161- if (std::find(flags.begin(), flags.end(), arg) != flags.end()) {
162- if (param->isBoolean()) {
163- param->handleFlag();
164- } else {
165- if (i + 1 >= args.size()) {
166- printHelp();
167- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing value for {}", arg);
168- }
169: param->parse(args[++i]);
170- }
171- found = true;
172- break;
173- }
174- }
175- if (!found) {
176- printHelp();
177- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unknown option: {}", arg);
178- }
179- } else {
180- positional_args.push_back(arg);
181- }
182- }
183-
184- // Then parse positional arguments
185- size_t pos_idx = 0;
186- for (auto& param : inst.args_) {
187- if (param->isPositional()) {
188- if (pos_idx >= positional_args.size()) {
189- if (param->isRequired()) {
190- printHelp();
191- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing positional argument");
192- }
193- continue;
194- }
195: param->parse(positional_args[pos_idx++]);
196- }
197- }
198-
199- // Check requirements
200- for (auto& param : inst.args_) {
201- if (param->isRequired() && !param->isSet()) {
202- printHelp();
203- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing required argument: {}", param->meta());
204- }
205- }
206- }
207-
208- template<typename T>
209- static T get(const std::string& flag) {
210- auto& inst = instance();
211- for (auto& param : inst.args_) {
212- const auto& flags = param->flags();
213- if (std::find(flags.begin(), flags.end(), flag) != flags.end()) {
214- auto* p = dynamic_cast<Argument<T>*>(param.get());
215- if (p) { return p->get(); }
216- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Type mismatch for {}", flag);
217- }
218- }
219- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Argument not found: {}", flag);
220- }
221-
222- static void printHelp() {
223- auto& inst = instance();
224- fmt::println("Usage:");
225- for (const auto& arg : inst.args_) {
Script executed:
#!/bin/bash
# Check for any help flag special handling
rg -n "help" ./mllm/utils/Argparse.hpp -i -B 2 -A 2

Repository: UbiquitousLearning/mllm
Exit code: 0
stdout:
29- [[nodiscard]] virtual bool isRequired() const = 0;
30- [[nodiscard]] virtual bool isSet() const = 0;
31: [[nodiscard]] virtual std::string help() const = 0;
32- [[nodiscard]] virtual std::string meta() const = 0;
33- [[nodiscard]] virtual bool isBoolean() const = 0;
--
43- }
44-
45: Argument& help(const std::string& text) {
46: help_ = text;
47- return *this;
48- }
--
107- [[nodiscard]] bool isSet() const override { return is_set_; }
108-
109: [[nodiscard]] std::string help() const override { return help_; }
110-
111- [[nodiscard]] std::string meta() const override { return meta_; }
--
122- private:
123- std::vector<std::string> flags_;
124: std::string help_;
125- std::string meta_;
126- T value_;
--
164- } else {
165- if (i + 1 >= args.size()) {
166: printHelp();
167- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing value for {}", arg);
168- }
--
174- }
175- if (!found) {
176: printHelp();
177- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unknown option: {}", arg);
178- }
--
188- if (pos_idx >= positional_args.size()) {
189- if (param->isRequired()) {
190: printHelp();
191- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing positional argument");
192- }
--
200- for (auto& param : inst.args_) {
201- if (param->isRequired() && !param->isSet()) {
202: printHelp();
203- MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing required argument: {}", param->meta());
204- }
--
220- }
221-
222: static void printHelp() {
223- auto& inst = instance();
224- fmt::println("Usage:");
--
238- for (const auto& arg : inst.args_) {
239- if (arg->isPositional()) {
240: fmt::println(" <{}>\t{}", arg->meta(), arg->help());
241- } else {
242- fmt::print(" ");
--
245- fmt::print("{}", arg->flags()[i]);
246- }
247: fmt::println("\t{}", arg->help());
248- }
249- }
--help won't work in isolation due to required-arg enforcement.
Argparse::parse(argc, argv) (line 29) is called before help.isSet() is checked (line 38), while five arguments are declared required(true). The parse() method validates required fields at the end (checking all required arguments before returning); when invoked with only -h, it will print help and exit with an error about missing required arguments before the help flag check in your code is ever reached. Move the help check to immediately after Argparse::parse(), or modify Argparse to skip required-field validation when the help flag is set.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/audio_out_infer.cpp` around lines 17 - 42, The help flag
check is executed too late (after required-arg validation) so running with only
-h fails; after calling Argparse::parse(argc, argv) immediately check
help.isSet() and handle it (call Argparse::printHelp() and
mllm::shutdownContext() then return) before any required-argument-dependent
logic such as the model_version parsing; update the code around Argparse::parse,
help.isSet(), Argparse::printHelp(), and mllm::shutdownContext() accordingly.
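Independent of how `Argparse` itself might be patched, a pre-scan of `argv` is one way to honor `-h` before required-argument validation ever runs. A minimal sketch (the helper name is hypothetical, not part of the repo's `Argparse` API):

```cpp
#include <cassert>
#include <cstring>

// Hypothetical sketch: detect -h/--help before calling the parser, so help
// can be printed even when required arguments are absent.
bool wantsHelp(int argc, char* argv[]) {
  for (int i = 1; i < argc; ++i) {
    if (std::strcmp(argv[i], "-h") == 0 || std::strcmp(argv[i], "--help") == 0) {
      return true;
    }
  }
  return false;
}
```

The caller would run `if (wantsHelp(argc, argv)) { Argparse::printHelp(); mllm::shutdownContext(); return 0; }` before `Argparse::parse(argc, argv)`, sidestepping the required-field check entirely.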
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Add an else branch to reject invalid model_version values.
An unrecognized version string (anything other than "v1" or "v2") silently falls through and defaults to kV1 with no diagnostic, which will produce a hard-to-debug mismatch between the loaded weights and the expected model file format.
🐛 Proposed fix

```diff
 mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
 if (model_version.get() == "v1") {
   file_version = mllm::ModelFileVersion::kV1;
 } else if (model_version.get() == "v2") {
   file_version = mllm::ModelFileVersion::kV2;
+} else {
+  MLLM_ERROR_EXIT(mllm::ExitCode::kCoreError,
+                  fmt::format("Unknown model_version '{}'. Expected 'v1' or 'v2'.", model_version.get()).c_str());
 }
```

📝 Committable suggestion
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
} else {
  MLLM_ERROR_EXIT(mllm::ExitCode::kCoreError,
                  fmt::format("Unknown model_version '{}'. Expected 'v1' or 'v2'.", model_version.get()).c_str());
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/audio_out_infer.cpp` around lines 31 - 36, The code
silently defaults file_version to mllm::ModelFileVersion::kV1 when
model_version.get() is neither "v1" nor "v2"; update the conditional in the
block that sets file_version (which currently checks model_version.get() and
assigns to file_version) to include a final else branch that rejects invalid
values—e.g., log or print an error mentioning the invalid model_version.get()
value and exit/throw so the program fails fast rather than proceeding with a
wrong mllm::ModelFileVersion.
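The fail-fast parse can also be expressed as a small total function that the caller checks. A sketch with a stand-in enum and `std::optional` in place of the repo's `mllm::ModelFileVersion` and `MLLM_ERROR_EXIT` (both names here are illustrative assumptions):

```cpp
#include <cassert>
#include <optional>
#include <string>

enum class FileVersion { kV1, kV2 };  // stand-in for mllm::ModelFileVersion

// Hypothetical sketch: strict version parsing — anything other than "v1"/"v2"
// yields nullopt so the caller must handle it explicitly instead of silently
// defaulting to kV1.
std::optional<FileVersion> parseFileVersion(const std::string& s) {
  if (s == "v1") return FileVersion::kV1;
  if (s == "v2") return FileVersion::kV2;
  return std::nullopt;  // unknown value: caller should error out
}
```

The call site then becomes `auto v = parseFileVersion(model_version.get()); if (!v) { /* report and exit */ }`, which makes the unsupported-value path impossible to overlook.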
```cpp
auto input_len = inputs["sequence"].shape()[1];
auto total_len = output.sequences.shape()[1];
```
Unchecked shape()[1] indexing — Cppcheck flags out-of-bounds access.
Both inputs["sequence"].shape()[1] and output.sequences.shape()[1] index position 1 without asserting the shape has at least 2 dimensions. If the tokenizer returns a scalar/1-D tensor, or if the "sequence" key is absent in inputs (which would default-insert an empty tensor on a std::map-like container), this is a runtime out-of-bounds read.
🛡️ Proposed defensive fix
+ MLLM_CHECK(inputs.count("sequence") && inputs["sequence"].shape().size() >= 2,
+ "inputs[\"sequence\"] must be a 2-D tensor");
auto input_len = inputs["sequence"].shape()[1];
+ MLLM_CHECK(output.sequences.shape().size() >= 2,
+ "output.sequences must be a 2-D tensor");
  auto total_len = output.sequences.shape()[1];

📝 Committable suggestion
| auto input_len = inputs["sequence"].shape()[1]; | |
| auto total_len = output.sequences.shape()[1]; | |
| MLLM_CHECK(inputs.count("sequence") && inputs["sequence"].shape().size() >= 2, | |
| "inputs[\"sequence\"] must be a 2-D tensor"); | |
| auto input_len = inputs["sequence"].shape()[1]; | |
| MLLM_CHECK(output.sequences.shape().size() >= 2, | |
| "output.sequences must be a 2-D tensor"); | |
| auto total_len = output.sequences.shape()[1]; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/audio_out_infer.cpp` around lines 75 - 76, Check for the
presence of the "sequence" key and verify tensor rank before indexing: avoid
using operator[] on inputs (which may default-insert) and instead use
find/contains or at(), and ensure inputs["sequence"].shape().size() >= 2 before
reading shape()[1]; do the same for output.sequences (ensure
output.sequences.shape().size() >= 2) before computing input_len and total_len,
and return or throw a clear error if the checks fail so you don't perform an
out-of-bounds access when computing input_len and total_len.
| fmt::print("\nResponse: "); | ||
| for (int i = input_len; i < total_len; ++i) { | ||
| std::wcout << qwen_tokenizer.detokenize(output.sequences.at<mllm::mllm_int64_t>({0, i})) << std::flush; | ||
| } | ||
| fmt::print("\n"); |
Mixing std::wcout (wide) with fmt::print (narrow) on stdout can produce garbled output or UB.
fmt::print writes narrow bytes to the C FILE* stdout. Once std::wcout is first used on the same underlying stream, POSIX locks the stream orientation to wide; subsequent narrow writes (from fmt::print) then invoke undefined behavior per the C standard. Even before orientation is locked, the two streams have independent buffers, so "\nResponse: " (line 77) and the "\n" (line 81) may appear out of order relative to the std::wcout tokens.
Additionally, int i = input_len on line 78 is a signed/unsigned narrowing conversion if input_len is size_t/int64_t.
♻️ Proposed fix: use a single output stream
- fmt::print("\nResponse: ");
for (int i = input_len; i < total_len; ++i) {
- std::wcout << qwen_tokenizer.detokenize(output.sequences.at<mllm::mllm_int64_t>({0, i})) << std::flush;
+ fmt::print("{}", mllm::to_utf8(qwen_tokenizer.detokenize(
+ output.sequences.at<mllm::mllm_int64_t>({0, i}))));
+ std::fflush(stdout);
}
- fmt::print("\n");
+ fmt::print("\nResponse: ");
+ // (move print before loop, or restructure as shown)

(Exact fix depends on whether detokenize can return a std::string/UTF-8 directly. If it must return std::wstring, consider std::wcout exclusively or convert at the call site.)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/audio_out_infer.cpp` around lines 77 - 81, The code
mixes narrow (fmt::print) and wide (std::wcout) output which can cause UB and
ordering issues; change to a single stream: either (A) use narrow output only by
making qwen_tokenizer.detokenize return or be converted to std::string/UTF-8 and
replace std::wcout with fmt::print or std::cout, or (B) use wide output only by
replacing fmt::print calls with std::wcout and ensure detokenize returns
std::wstring; also fix the loop index narrowing by changing "int i = input_len"
to a matching unsigned type (e.g., size_t or the type of input_len) or use auto
with an explicit cast to avoid signed/unsigned conversion. Reference symbols:
fmt::print, std::wcout, qwen_tokenizer.detokenize,
output.sequences.at<mllm::mllm_int64_t>, and the loop variable "i".
| auto wav = output.wav * 32767.0f; | ||
| wenet::WavWriter wav_writer(wav.ptr<float>(), wav.shape().back(), 1, 24000, 16); | ||
| wav_writer.Write(output_path.get()); |
wav.shape().back() is UB and wav.ptr<float>() is unsafe when output.wav is empty.
If generateAudio produces no audio (e.g., because the talker component isn't fully functional — as the PR description notes), output.wav may be a zero-dimensional or empty tensor. Calling .shape().back() on an empty shape vector is undefined behavior, and passing an invalid pointer to WavWriter will corrupt or crash. Cppcheck flags the nearby buffer out-of-bounds at line 82.
🛡️ Proposed guard
+ if (output.wav.shape().empty() || output.wav.shape().back() == 0) {
+ fmt::print("Warning: no audio was generated.\n");
+ } else {
auto wav = output.wav * 32767.0f;
wenet::WavWriter wav_writer(wav.ptr<float>(), wav.shape().back(), 1, 24000, 16);
wav_writer.Write(output_path.get());
fmt::print("Saved audio to {}\n", output_path.get());
+ }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/audio_out_infer.cpp` around lines 83 - 85, The code is
using wav.shape().back() and wav.ptr<float>() unsafely when output.wav may be
empty; before constructing wenet::WavWriter or calling wav_writer.Write, check
that output.wav is non-empty (e.g., verify output.wav.numel() > 0 or
!output.wav.shape().empty() && output.wav.shape().back() > 0), and only then
compute the sample count and get wav.ptr<float>(); if empty, skip creating
wenet::WavWriter and avoid calling Write (or write an empty file/return an
error), referencing the variables/functions wav, output.wav, wenet::WavWriter,
wav_writer.Write, and output_path to locate the change. Ensure you convert the
sample count to the integer type expected by WavWriter and do not dereference
wav.ptr<float>() when numel == 0.
| auto& help = Argparse::add<bool>("-h|--help").help("Show help message"); | ||
| auto& model_path = Argparse::add<std::string>("-m|--model_path").help("Model path").required(true); | ||
| auto& model_version = Argparse::add<std::string>("-mv|--model_version").help("Model version").required(true); | ||
| auto& tokenizer_path = Argparse::add<std::string>("-t|--tokenizer_path").help("Tokenizer directory").required(true); | ||
| auto& config_path = Argparse::add<std::string>("-c|--config_path").help("Config path").required(true); | ||
|
| Argparse::parse(argc, argv); | ||
|
| mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1; | ||
| if (model_version.get() == "v1") { | ||
| file_version = mllm::ModelFileVersion::kV1; | ||
| } else if (model_version.get() == "v2") { | ||
| file_version = mllm::ModelFileVersion::kV2; | ||
| } | ||
|
| if (help.isSet()) { | ||
| Argparse::printHelp(); | ||
| mllm::shutdownContext(); | ||
| return 0; | ||
| } |
🧩 Analysis chain
🏁 Script executed:

#!/bin/bash
# Find Argparse header file
fd -e hpp -e h "Argparse" 2>/dev/null | head -20

Repository: UbiquitousLearning/mllm
Length of output: 90

🏁 Script executed:

#!/bin/bash
# Search for Argparse files more broadly
fd "Argparse" 2>/dev/null | grep -v ".git"

Length of output: 90

🏁 Script executed:

#!/bin/bash
# Find where Argparse is defined/included
git ls-files | grep -i argparse

Length of output: 90

🏁 Script executed:

cat mllm/utils/Argparse.hpp

Length of output: 7702
The --help flag cannot work as intended because required argument validation happens inside parse().
When invoked with only --help, the Argparse::parse() method processes the flag and sets help.isSet() to true, but then performs a final validation loop (lines 200–206 in Argparse.hpp) that enforces all required arguments. Since -m, -mv, -t, and -c are not provided, parse() calls MLLM_ERROR_EXIT before returning, and the if (help.isSet()) check on line 35 is never reached.
To fix this, move the help check before parse() or modify parse() to detect and handle --help specially by exempting it from required-argument validation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/image_infer_dbg.cpp` around lines 20 - 39, The help flag
check must run before Argparse::parse() because parse() enforces required
arguments and will exit before help.isSet() is evaluated; move the current help
parsing logic so you call Argparse::add for -h/--help, then inspect help.isSet()
and call Argparse::printHelp() and mllm::shutdownContext()/return before
invoking Argparse::parse(), or alternatively update Argparse::parse() to
short-circuit when it detects -h/--help (exempting it from the required-argument
validation) so the existing if (help.isSet()) { Argparse::printHelp();
mllm::shutdownContext(); return 0; } can run.
| mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1; | ||
| if (model_version.get() == "v1") { | ||
| file_version = mllm::ModelFileVersion::kV1; | ||
| } else if (model_version.get() == "v2") { | ||
| file_version = mllm::ModelFileVersion::kV2; | ||
| } |
Unrecognised --model_version silently falls back to kV1.
Any value other than "v1" or "v2" (e.g. a typo) silently uses kV1, which can load the wrong file layout without any indication of the mistake.
🛡️ Proposed fix: reject invalid values
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print(stderr, "Error: unknown model_version '{}'. Expected 'v1' or 'v2'.\n",
+ model_version.get());
+ mllm::shutdownContext();
+ return 1;
}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/image_infer_dbg.cpp` around lines 28 - 33, The code
silently defaults file_version to mllm::ModelFileVersion::kV1 when
model_version.get() is not "v1" or "v2"; change the logic in the block that sets
file_version (using model_version.get()) to validate the value and reject
unknown values instead of falling back: if model_version is "v1" or "v2" set
file_version accordingly, otherwise log an error (or throw/exit) indicating the
unsupported model_version value and list accepted values, so callers cannot
accidentally load the wrong layout when model_version is mistyped.
| std::string image_path; | ||
| std::string prompt_text; | ||
|
| fmt::print("Image path (or 'exit/quit'): "); | ||
| image_path = "../../rsc/pics.jpg"; | ||
| //std::getline(std::cin, image_path); | ||
| if (image_path == "exit" || image_path == "quit") { return 0; } | ||
|
| fmt::print("Prompt text: "); | ||
| prompt_text = "描述图片中物体"; | ||
| //std::getline(std::cin, prompt_text); |
Hardcoded debug inputs make the program non-interactive and misleading.
image_path and prompt_text are hardcoded while the corresponding std::getline calls are commented out. The program prints interactive prompts ("Image path (or 'exit/quit'): ", "Prompt text: ") but never reads from stdin, so users get no opportunity to provide their own inputs. The exit/quit branch on Line 59 is dead code because image_path is always "../../rsc/pics.jpg". This is clearly a leftover debug shortcut that should be reverted before merging.
🐛 Proposed fix: restore interactive input
- std::string image_path;
- std::string prompt_text;
-
- fmt::print("Image path (or 'exit/quit'): ");
- image_path = "../../rsc/pics.jpg";
- //std::getline(std::cin, image_path);
- if (image_path == "exit" || image_path == "quit") { return 0; }
-
- fmt::print("Prompt text: ");
- prompt_text = "描述图片中物体";
- //std::getline(std::cin, prompt_text);
+ std::string image_path;
+ std::string prompt_text;
+
+ fmt::print("Image path (or 'exit/quit'): ");
+ std::getline(std::cin, image_path);
+ if (image_path == "exit" || image_path == "quit") { return 0; }
+
+ fmt::print("Prompt text: ");
+ std::getline(std::cin, prompt_text);

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/image_infer_dbg.cpp` around lines 53 - 63, The code
currently hardcodes image_path and prompt_text and leaves the interactive
prompts unused; restore interactivity by re-enabling the std::getline calls for
image_path and prompt_text (remove or guard the debug assignments
"../../rsc/pics.jpg" and "描述图片中物体") so the prompts actually read from stdin,
ensuring the exit/quit check on image_path and subsequent logic work as
intended; update the block around the variables image_path, prompt_text, the
std::getline calls, and the exit check so the program behaves interactively
again.
| fmt::print("\nResponse: "); | ||
| qwen2_5omni.streamGenerate(inputs, | ||
| { | ||
| {"do_sample", mllm::AnyValue(false)}, | ||
| {"max_length", mllm::AnyValue(qwen2_5omni_cfg.max_cache_length)}, | ||
| }, | ||
| [&](int64_t token_id) { | ||
| auto str = qwen2_5omni_tokenizer.detokenize(token_id); | ||
| std::wcout << str << std::flush; | ||
| }); |
Mixing fmt::print (narrow stdout) and std::wcout (wide stdout) is unsafe inside the streaming callback.
fmt::print writes narrow bytes directly through the C FILE* for stdout, while std::wcout writes through the C++ wide-character stream layer, both ultimately targeting the same file descriptor. Interleaving them — especially in a hot per-token callback — risks garbled output, broken multibyte sequences, and buffering inconsistencies. The detokenized string should be emitted through a single output channel.
🛠️ Proposed fix: use a single narrow output path
If detokenize returns a UTF-8 std::string, prefer:
- fmt::print("\nResponse: ");
qwen2_5omni.streamGenerate(inputs,
{
{"do_sample", mllm::AnyValue(false)},
{"max_length", mllm::AnyValue(qwen2_5omni_cfg.max_cache_length)},
},
[&](int64_t token_id) {
auto str = qwen2_5omni_tokenizer.detokenize(token_id);
- std::wcout << str << std::flush;
+ fmt::print("{}", str);
+ std::fflush(stdout);
});

If detokenize returns std::wstring and wide output is intentional, flush fmt::print output before switching and use std::wcout exclusively for the response section.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qwen2_5omni/image_infer_dbg.cpp` around lines 69 - 78, The callback
mixes fmt::print (narrow stdout) and std::wcout (wide stdout) causing unsafe
interleaving: update the qwen2_5omni.streamGenerate callback to write via a
single narrow output path instead of std::wcout — call
qwen2_5omni_tokenizer.detokenize and ensure you get/convert a UTF‑8 std::string,
then emit it with fmt::print (or std::cout) so all output uses the same channel;
if you truly need wide output, flush fmt::print before the response and switch
to using std::wcout exclusively for the whole response block.
Summary by CodeRabbit
New Features
Backends & Layers
Tools
Tests