Skip to content

feat (calculator): Added a hybrid model tab and support V4#996

Open
Potterluo wants to merge 3 commits into
ModelEngine-Group:developfrom
Potterluo:feature_v4_kvcache
Open

feat (calculator): Added a hybrid model tab and support V4#996
Potterluo wants to merge 3 commits into
ModelEngine-Group:developfrom
Potterluo:feature_v4_kvcache

Conversation

@Potterluo
Copy link
Copy Markdown
Contributor

feat (calculator): Added a hybrid model tab and optimized the KV caching forex interface

  • Added two tabs, Standard Model and Hybrid Model, for model selection
  • Implemented dedicated configuration options for DeepSeek V4 series hybrid models
image

Potterluo added 2 commits June 5, 2026 10:03
…ing forex interface

- Added two tabs, Standard Model and Hybrid Model, for model selection

- Implemented dedicated configuration options for DeepSeek V4 series hybrid models
@Potterluo Potterluo force-pushed the feature_v4_kvcache branch from f617388 to bd86a2c Compare June 5, 2026 02:03
'MQA': {
title: 'MQA (Multi-Query Attention)',
icon: '🔹',
color: '#10b981',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Warning: The MQA formula appears incomplete. It shows head_dim but the calculation logic in performCalculation uses hidden_size * (kvHeads / num_attention_heads) for the else branch. The formula description should clarify that kv_heads = 1 means the head_dim is hidden_size / attn_heads, and the formula should be consistent with the actual calculation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MQA formula is correct. When kv_heads = 1, head_dim = hidden_size / attn_heads. The formula 2 × layers × tokens × batch × head_dim / tp is mathematically equivalent to 2 × layers × tokens × batch × hidden_size × (1/attn_heads) / tp. The parameter description already clarifies: head_dim: "Head dimension (hidden_size / attn_heads)". No changes needed.

'DSA': {
title: 'DSA (DeepSeek Sparse Attention)',
icon: '🔮',
color: '#9333ea',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Warning: The DSA formula is missing batch_size and / tensor_parallelism. The formula shown is KV = layers × tokens × ... but the actual calculation in performCalculation includes batch_size and divides by tp. Please update the formula display to match the actual calculation logic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Updated DSA formula to: Single-GPU KV Cache = num_hidden_layers × num_tokens × batch_size × (kv_lora_rank + qk_rope_head_dim + index_head_dim) / tensor_parallelism × dtype_bytes

Comment thread docs/source/_static/calculator.js Outdated

const modelName = document.getElementById('hybrid-model-select').value;
const deployment = document.getElementById('hybrid-deployment').value;
const tokens = parseInt(document.getElementById('hybrid-token-input').value) || 4096;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: Add validation for the parsed tokens value. If tokens <= 0 or isNaN(tokens), the calculation will produce invalid results. Consider adding input validation similar to calculateKVCache() function:

if (!tokens || isNaN(tokens) || tokens <= 0) {
    displayError('Invalid Input', 'Please enter a valid positive number for tokens.');
    return;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added input validation for tokens in calculateHybrid() and calculateHybridMaxTokens().

Comment thread docs/source/_static/calculator.js Outdated
const modelName = document.getElementById('hybrid-model-select').value;
const deployment = document.getElementById('hybrid-deployment').value;
const tokens = parseInt(document.getElementById('hybrid-token-input').value) || 4096;
const batchSize = parseInt(document.getElementById('hybrid-batch-size').value) || 1;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: Add validation for batch_size and tp similar to the standard model calculation. Invalid values (e.g., tp = 0) would cause division by zero or incorrect results:

if (tp <= 0) {
    displayError('Invalid Input', 'Tensor Parallelism must be at least 1.');
    return;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added input validation for batch_size and tp to prevent division by zero or invalid results.

c4aLayers: 30,
c128aLayers: 31,
// vllm-ascend (512 tokens/block)
vllmAscend: {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: The hardcoded byte-per-block values (e.g., c4aCompressor: 131072) are magic numbers. Consider adding comments explaining how these values were derived or calculated, to make the code more maintainable and help future readers understand the data source.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added calculation source comments explaining how values like c4aCompressor: 131072 are derived (e.g., 512 tokens × 256 KV heads × 1 byte = 131072 B).

Copy link
Copy Markdown
Contributor Author

@Potterluo Potterluo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed review! Here are my responses:


Comment 1: MQA formula - The MQA formula is correct. When kv_heads = 1, head_dim = hidden_size / attn_heads. The formula 2 × layers × tokens × batch × head_dim / tp is mathematically equivalent to the actual calculation 2 × layers × tokens × batch × hidden_size × (1/attn_heads) / tp. The parameter description already clarifies: head_dim: 'Head dimension (hidden_size / attn_heads)'. No changes needed.


Comment 2: DSA formula missing batch_size/tp ✅ Fixed. Updated formula to: Single-GPU KV Cache = num_hidden_layers × num_tokens × batch_size × (kv_lora_rank + qk_rope_head_dim + index_head_dim) / tensor_parallelism × dtype_bytes


Comment 3: Input validation for tokens ✅ Fixed. Added validation in calculateHybrid() and calculateHybridMaxTokens().


Comment 4: Input validation for batch_size/tp ✅ Fixed. Added complete validation for all input parameters.


Comment 5: Magic numbers documentation ✅ Fixed. Added calculation source comments explaining how values like c4aCompressor: 131072 are derived.


Please review the latest commit for the fixes.

- Fix DSA formula to include batch_size and tensor_parallelism
- Add input validation for tokens, batch_size, tp in calculateHybrid()
- Add input validation in calculateHybridMaxTokens()
- Add calculation source comments for magic numbers (c4aCompressor etc)
- Rename archType to modelArch to fix codespell warning
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants