Skip to content

Conversation

@contrueCT
Copy link
Contributor

@contrueCT contrueCT commented Dec 20, 2025

Ⅰ. Describe what this PR did

Add metrics for NamingServer:

  • seata_namingserver_cluster_node_count (Gauge)
  • seata_namingserver_watcher_count (Gauge)
  • seata_namingserver_cluster_change_push_total (Counter)
  • JVM and HTTP-related metrics implemented via Micrometer and Spring Boot.

Ⅱ. Does this pull request fix one issue?

fixes #7852

Ⅲ. Why don't you add test cases (unit test/integration test)?

Ⅳ. Describe how to verify it

Ⅴ. Special notes for reviews

@codecov
Copy link

codecov bot commented Dec 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.62%. Comparing base (6160c7d) to head (1307fdc).

Additional details and impacted files
@@             Coverage Diff              @@
##                2.x    #7882      +/-   ##
============================================
+ Coverage     71.58%   71.62%   +0.03%     
  Complexity      883      883              
============================================
  Files          1294     1294              
  Lines         49554    49554              
  Branches       5884     5884              
============================================
+ Hits          35475    35493      +18     
+ Misses        11155    11142      -13     
+ Partials       2924     2919       -5     

see 12 files with indirect coverage changes

Impacted file tree graph

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@funky-eyes funky-eyes changed the title feat: add metrics support for NamingServer feature: add metrics support for NamingServer Dec 22, 2025
Copy link
Contributor

@funky-eyes funky-eyes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, I suggest adding monitoring metrics for the response time (avg, p50, p99, p999) and QPS of each interface. You can implement this in the filter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we're missing some common tags, such as the host of the current Naming Server. For example, this would make it easier to query metrics from specific Naming Server instances.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds comprehensive Prometheus metrics support to the NamingServer module. The implementation enables monitoring of critical server operations through Micrometer and Spring Boot Actuator.

Key Changes:

  • Introduces three core metrics: cluster node count (Gauge), watcher count (Gauge), and cluster change push notifications (Counter)
  • Adds configurable metrics support with a no-op implementation for when metrics are disabled
  • Integrates metrics collection into existing server operations for real-time monitoring

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
namingserver/pom.xml Adds Spring Boot Actuator and Micrometer Prometheus registry dependencies
namingserver/src/main/resources/application.yml Configures metrics endpoint exposure and percentile distribution settings
namingserver/src/test/resources/application.yml Configures test environment metrics settings
namingserver/src/main/java/org/apache/seata/namingserver/metrics/NamingServerMetricsManager.java Defines the metrics manager interface with metric names and tag constants
namingserver/src/main/java/org/apache/seata/namingserver/metrics/PrometheusNamingMetricsManager.java Implements Prometheus metrics collection using MultiGauge and Counter
namingserver/src/main/java/org/apache/seata/namingserver/metrics/NoOpNamingMetricsManager.java Provides no-op implementation when metrics are disabled
namingserver/src/main/java/org/apache/seata/namingserver/metrics/NamingServerTagsContributor.java Adds custom tags to HTTP request metrics for business dimensions
namingserver/src/main/java/org/apache/seata/namingserver/manager/NamingManager.java Integrates cluster node count metrics refresh on instance registration/unregistration
namingserver/src/main/java/org/apache/seata/namingserver/manager/ClusterWatcherManager.java Integrates watcher count metrics and cluster change push counter
namingserver/src/test/java/org/apache/seata/namingserver/NamingServerMetricsManagerTest.java Adds comprehensive unit tests for metrics manager functionality
namingserver/src/test/java/org/apache/seata/namingserver/ClusterWatcherManagerTest.java Updates tests to inject NoOpNamingMetricsManager
changes/en-us/2.x.md Documents the feature addition in English changelog
changes/zh-cn/2.x.md Documents the feature addition in Chinese changelog
Comments suppressed due to low confidence (1)

namingserver/src/main/java/org/apache/seata/namingserver/manager/ClusterWatcherManager.java:78

  • The scheduled task re-registers watchers within its execution which calls registryWatcher, and registryWatcher immediately triggers refreshWatcherCountMetrics. This creates a potential for frequent metrics refreshes every second as the scheduled task runs. This could cause unnecessary overhead when combined with the refresh on line 102.
        scheduledThreadPoolExecutor.scheduleAtFixedRate(
                () -> {
                    for (String group : WATCHERS.keySet()) {
                        Optional.ofNullable(WATCHERS.remove(group))
                                .ifPresent(watchers -> watchers.parallelStream().forEach(watcher -> {
                                    if (System.currentTimeMillis() >= watcher.getTimeout()) {
                                        notify(watcher, HttpStatus.NOT_MODIFIED.value());
                                    }
                                    if (!watcher.isDone()) {
                                        // Re-register
                                        registryWatcher(watcher);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +44 to +60
// Add namespace tag
String namespace = context.getCarrier().getParameter(TAG_NAMESPACE);
keyValues = keyValues.and(KeyValue.of(TAG_NAMESPACE, namespace != null ? namespace : UNKNOWN));

// Add cluster tag
String cluster = context.getCarrier().getParameter(TAG_CLUSTER);
if (cluster == null) {
cluster = context.getCarrier().getParameter("clusterName");
}
keyValues = keyValues.and(KeyValue.of(TAG_CLUSTER, cluster != null ? cluster : UNKNOWN));

// Add vgroup tag
String vgroup = context.getCarrier().getParameter(TAG_VGROUP);
if (vgroup == null) {
vgroup = context.getCarrier().getParameter("group");
}
keyValues = keyValues.and(KeyValue.of(TAG_VGROUP, vgroup != null ? vgroup : UNKNOWN));
Copy link

Copilot AI Dec 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding tags with value "unknown" for every HTTP request will create high cardinality metrics. When these parameters are not present in requests, it would be better to either not add the tag at all, or use a constant shared "unknown" value. This could lead to performance issues and excessive memory usage in the metrics registry as each unique combination of tag values creates a separate time series. Consider only adding tags when they have meaningful values.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "unknown" value is a fixed string constant, not a dynamic value that changes per request. This won't cause cardinality explosion. The namespace, cluster, and vgroup parameters are also low-cardinality by nature in naming server scenarios (limited to dozens at most). Using "unknown" as a fallback is a standard practice that helps operators identify requests with missing parameters.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@contrueCT
Copy link
Contributor Author

It seems we're missing some common tags, such as the host of the current Naming Server. For example, this would make it easier to query metrics from specific Naming Server instances.

Additionally, I suggest adding monitoring metrics for the response time (avg, p50, p99, p999) and QPS of each interface. You can implement this in the filter.

Both implemented. Response time metrics (P50/P90/P95/P99/P999) use Spring Boot Actuator's http.server.requests with custom business tags via NamingServerTagsContributor. Common tag application added in application.yml.


@Override
public KeyValues getLowCardinalityKeyValues(ServerRequestObservationContext context) {
KeyValues keyValues = super.getLowCardinalityKeyValues(context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only applicable to requests that include a namespace and vgroup? If so, how should the tags be added in the PrometheusNamingMetricsManager class? For example, when a vgroup under a specific namespace's cluster performs a watch operation, or when fetching cluster information.

watchers.parallelStream().forEach(this::notify);
// Increment cluster change push counter
if (!watchers.isEmpty()) {
metricsManager.incrementClusterChangePushCount(event.getGroup());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, only the "group" is included. Is it possible to also include the namespace and cluster information as tags?

Comment on lines +72 to +75
percentiles:
http.server.requests: 0.5, 0.9, 0.95, 0.99, 0.999
percentiles-histogram:
http.server.requests: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
percentiles:
http.server.requests: 0.5, 0.9, 0.95, 0.99, 0.999
percentiles-histogram:
http.server.requests: true
percentiles:
http.server.requests: 0.5, 0.99, 0.999

I believe that for now, these three percentiles should suffice, and we should not enable percentiles-histogram, as it would result in excessive data being generated in Prometheus.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add metrics for NamingServer

2 participants