Skip to content

[Feat] Add vLLM-native UCM connector metrics#995

Open
dante159753 wants to merge 7 commits into
ModelEngine-Group:developfrom
dante159753:ucm-vllm-connector-metrics
Open

[Feat] Add vLLM-native UCM connector metrics#995
dante159753 wants to merge 7 commits into
ModelEngine-Group:developfrom
dante159753:ucm-vllm-connector-metrics

Conversation

@dante159753
Copy link
Copy Markdown
Contributor

Summary

This PR wires UCM metrics into vLLM's KV connector metrics path without changing the C++ metrics core semantics.

Changes

  • Add a Python-side UCM metrics dispatcher that drains the existing C++ metrics core once and fans out snapshots to multiproc and vllm_connector consumers.
  • Add YAML config support for multiproc_prefix, vllm_connector_prefix, and fixed consumer switches.
  • Expose vLLM connector Prometheus metrics with vllm:ucm_* names and model_name, engine, worker_rank labels.
  • Keep the existing multiprocess metrics path compatible through get_all_stats_and_clear() and ucm: output when multiproc is enabled.
  • Update Grafana dashboards to query vllm:ucm_* metrics and add a KV cache hit-rate breakdown panel for GPU prefix cache vs external connector cache.

Why

The old UCM metrics path could only be scraped reliably from the master-side multiprocess logger in multi-node deployments. vLLM now supports KV connector metrics aggregation, so UCM connector metrics can be reported through the vLLM-native stats path and include worker-side connector observations.

Impact

  • Existing metrics_config_path remains the single configuration entry point.
  • Deployments can keep multiprocess UCM metrics, enable vLLM connector metrics, or run both via the config consumers.
  • vLLM-native metrics use the vllm: prefix required by vLLM's metrics snapshot path.
  • Duration metrics are exported as seconds on the vLLM side while dashboards preserve the existing millisecond display where appropriate.

Verification

  • uv run --no-project --with pytest python -m pytest --noconftest test\test_ucm_connector_metrics.py -q
  • python -m py_compile ucm\metrics_config.py ucm\metrics_dispatcher.py ucm\integration\vllm\metrics.py ucm\integration\vllm\ucm_connector.py ucm\observability.py
  • git diff --check
  • Parsed all modified Grafana dashboard JSON files and checked duplicate panel ids / grid overlap.
  • Live verified on a2 with vLLM 0.18.0 + vLLM-Ascend 0.18.0rc1 using NPU 0:
    • UCM branch installed and service started successfully with /health and /metrics returning 200.
    • /metrics exposed vllm:ucm_* series with worker_rank labels.
    • vLLM prefix cache counters used by the new dashboard panel were present: vllm:prefix_cache_* and vllm:external_prefix_cache_*.

Add YAML-controlled multiproc and vLLM connector metrics consumers.

Fan out a single UCM metrics core drain into Python-side consumer buffers.

Expose vLLM connector metrics with vllm: prefixes and worker_rank labels.
Switch UCM dashboard queries to vLLM connector metric names.

Add a KV cache hit-rate breakdown panel to the vLLM dashboard.
Comment thread ucm/metrics_config.py
if not config_path:
return {}
try:
import yaml
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: The yaml import is inside a try-except block in load_metrics_config, but the ImportError is logged and returns empty dict. This means the function silently fails if PyYAML is not installed. Consider raising an exception when PyYAML is required (if a YAML config path is provided) to make the failure more explicit, or document this dependency clearly.

Comment thread ucm/metrics_dispatcher.py
if len(current_counts) < len(bucket_counts):
current_counts.extend([0] * (len(bucket_counts) - len(current_counts)))
for index, count in enumerate(bucket_counts):
current_counts[index] += int(count)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: When count is very large, int(count) could be problematic if count is a float that represents a very large number. Consider validating that count is within reasonable bounds before accumulating, or using int(round(count)) to avoid floating-point precision issues when converting large floats to integers.

Comment thread ucm/metrics_dispatcher.py
getattr(value, "sum", 0.0)
)

def _empty_stats(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Suggestion: The _histogram_tuple function returns empty list for bucket_counts if the value doesn't match expected formats. This could lead to silently dropped histogram data. Consider logging a warning when histogram data format is unrecognized:

if not bucket_counts:
    logger.warning(f"Unrecognized histogram format: {type(value)}")

def reduce(self) -> dict[str, int | float]:
return {
"ucm_load_requests_total": self._histogram_sum("load_requests_num"),
"ucm_load_blocks_total": self._histogram_sum("load_blocks_num"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Warning: The reduce() method returns a hardcoded dict of specific metric names (ucm_load_requests_total, etc.). This means if new metrics are added to the config, they won be included in the reduced output. Consider making this method configurable or iterating over all histogram metrics in self.data to provide a complete summary.

Comment thread ucm/metrics_dispatcher.py

CONSUMERS = (MULTIPROC_CONSUMER, VLLM_CONNECTOR_CONSUMER)
_DISPATCHER: "MetricsDispatcher | None" = None
_DISPATCHER_LOCK = threading.Lock()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Warning: The singleton pattern in get_metrics_dispatcher does not provide a way to reset the dispatcher. If the config changes (e.g., during hot-reload or testing), the old dispatcher will continue to be used. Consider adding a reset_metrics_dispatcher() function or passing a force_new parameter for scenarios where config needs to be reloaded:

def reset_metrics_dispatcher():
    global _DISPATCHER
    with _DISPATCHER_LOCK:
        _DISPATCHER = None

@dante159753 dante159753 requested a review from flesher0813 as a code owner June 5, 2026 06:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants