
Adds periodic cleanup job to delete doc level monitor percolate query indices#2024

Open
eirsep wants to merge 3 commits into opensearch-project:main from eirsep:query-index-cleanup

Conversation

Member

@eirsep eirsep commented Feb 19, 2026

Change

Implement a scheduled cleanup job that:

1. Identifies query indices whose source data indices no longer exist
2. Verifies that no active monitors reference the query index
3. Deletes orphaned query indices older than a configurable threshold (e.g., 7 days)
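The three steps above can be sketched as pure filtering logic. This is a minimal sketch, not the PR's implementation: the names `QueryIndexInfo`, `sourceIndexExists`, and `isReferencedByMonitor` are hypothetical stand-ins for the actual cluster-state and config-index lookups.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical descriptor for a query index; the real job would read this
// information from cluster state rather than a data class.
data class QueryIndexInfo(
    val name: String,
    val sourceIndex: String,
    val creationTime: Instant,
)

// A query index is deletable only if its source data index is gone (step 1),
// no active monitor references it (step 2), and it is older than the
// configurable retention threshold (step 3).
fun findOrphanedQueryIndices(
    queryIndices: List<QueryIndexInfo>,
    sourceIndexExists: (String) -> Boolean,      // step 1: cluster-state lookup
    isReferencedByMonitor: (String) -> Boolean,  // step 2: config-index search
    threshold: Duration = Duration.ofDays(7),    // step 3: age cutoff
    now: Instant = Instant.now(),
): List<String> =
    queryIndices
        .filter { !sourceIndexExists(it.sourceIndex) }
        .filter { !isReferencedByMonitor(it.name) }
        .filter { Duration.between(it.creationTime, now) >= threshold }
        .map { it.name }
```

Keeping the decision logic separate from the OpenSearch client calls like this would also make the age/reference checks unit-testable without a cluster.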

Background

Alerting doc-level monitors leave clusters with a gradually growing number of query indices, leading to:
- 100+ query indices per detector accumulating over time
- Increased cluster storage costs
- Degraded cluster performance and high CPU usage
- Manual intervention being required to delete old indices

Doc-level monitors (detectors) use a specialized index called a "query index" to efficiently match incoming documents against detection rules, using a percolate query. Think of it as a reverse index: instead of searching documents for queries, we search queries for matching documents.
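As a toy illustration of the "reverse index" idea (not the actual percolator, which stores parsed Lucene queries in an index): queries are registered up front, and each incoming document is matched against the stored queries.

```kotlin
// Toy model of percolation: the search direction is inverted. Instead of
// running one query over many documents, we run one document over many
// stored queries and return the IDs of the queries that match.
data class StoredQuery(val id: String, val field: String, val term: String)

fun percolate(doc: Map<String, String>, queries: List<StoredQuery>): List<String> =
    queries
        .filter { q -> doc[q.field]?.contains(q.term) == true }
        .map { it.id }
```

For example, percolating the document `{"message": "login failure from host-a"}` against a stored query matching `failure` in `message` returns that query's ID.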

Key characteristics:
- Each monitor maintains query indices under an alias (e.g., .opensearch-sap-cloudtrail-detectors-queries-optimized-*)
- Query indices mirror field mappings from the source data indices to enable query execution
- When field limits are exceeded, the system automatically creates a new query index (rollover)
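The rollover-on-field-limit behavior can be sketched as follows. This is an assumption-laden illustration: the function name, the default limit of 1000, and the `-000001` suffix convention are stand-ins, not the plugin's actual settings or code.

```kotlin
// Sketch: keep writing to the current concrete query index until adding the
// new source-index mappings would exceed the total-fields limit, then roll
// over to the next numbered concrete index behind the same alias.
fun nextQueryIndex(
    current: String,       // e.g. "…-queries-optimized-000001" (hypothetical name)
    fieldCount: Int,       // fields already mapped in the current index
    newFields: Int,        // fields the new source index would add
    fieldLimit: Int = 1000 // assumed limit; the real setting is configurable
): String {
    if (fieldCount + newFields <= fieldLimit) return current
    val num = current.substringAfterLast("-").toIntOrNull() ?: 0
    return current.substringBeforeLast("-") + "-" + "%06d".format(num + 1)
}
```

Note this suffix-parsing approach has the same brittleness a reviewer flags later for `extractIndexNumber`: it assumes the concrete index name always ends in `-<number>`.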

Problem:
- Time-series data indices (e.g., log-aws-cloudtrail-000001, 000002, etc.) are regularly rolled over, eventually moved to the UltraWarm tier, and deleted
- Doc-level monitors do not track when source indices are deleted or moved out of the hot tier
- Query indices created for those source indices remain indefinitely
- No automated cleanup job exists to remove unused query indices

Impact:
- Query indices accumulate at the rate of data index rollovers
- With daily index rollovers over 3 years: 1000+ query indices
- Each query index consumes cluster resources even when no longer needed

Example Timeline:

Day 1: log-aws-cloudtrail-000001 created → query index 000001 created
Day 5: log-aws-cloudtrail-000010 created → query index 000002 created
...
Day 90: log-aws-cloudtrail-000001 deleted (by ISM retention policy)
→ query index 000001 still exists (orphaned)

Day 1000: 100 query indices exist, 98 are orphaned

    deleteIndexRequest,
    object : ActionListener<AcknowledgedResponse> {
        override fun onResponse(response: AcknowledgedResponse) {
            logger.info("Successfully deleted query indices: $indicesToDelete")
Collaborator

Nitpick:
Would it make sense to log existingIndices here instead of indicesToDelete since they could potentially be different?

search(searchRequest, it)
}

logger.info("Metadata query returned ${response.hits.hits.size} documents")
Collaborator

Clarification question:
Is there any realistic risk of there being more than 10,000 metadata docs? I'm wondering whether we would need to log response.hits.totalHits here, and then perform the subsequent steps in batches of 10,000.

Or would the plan be for subsequent executions of the cleanup job to catch any lingering metadata?

}

private suspend fun fetchAllMonitorMetadata(): List<MonitorMetadata> {
val configIndex = ".opendistro-alerting-config"
Collaborator

Nitpick:
We probably have a constant somewhere in the package we could reference here.

}

private suspend fun getMonitor(monitorId: String): Monitor? {
val getRequest = GetRequest(".opendistro-alerting-config", monitorId)
Collaborator

Nitpick:
We probably have a constant somewhere in the package we could reference here.


for (metadata in allMetadata) {
val monitorId = extractMonitorId(metadata.id)
val monitor = getMonitor(monitorId)
Collaborator

Nitpick:
Rather than calling getMonitor for each metadata, would it make sense to call searchMonitor and query using all of the monitorIDs?


private suspend fun cleanupMetadataMappings(allMetadata: List<MonitorMetadata>, indicesToDelete: List<String>) {
for (metadata in allMetadata) {
val monitorId = extractMonitorId(metadata.id)
Collaborator

Clarification question:
It looks like the MonitorMetadata object already has a monitorId attribute. Is there a reason to not use that attribute here?
https://github.com/opensearch-project/common-utils/blob/main/src/main/kotlin/org/opensearch/commons/alerting/model/MonitorMetadata.kt#L25

@engechas
Collaborator

A couple of high-level thoughts on this:

  1. Do we need a query index per data index? I recall one single query index was used before, and it was moved to an index-per-data-index model to avoid mapping conflicts. Is there a better middle ground here where shared query indices are used so long as there are no conflicts, and a new index is created only on conflicts?
  2. Can we hook into the index rollover, deletion, or transition to warm flows to trigger deletion of the query index rather than relying on a sweeper pattern?
  3. If we delete the query index for a warm index, then the index is migrated back to hot - will monitors created against the index continue to function?


private val logger = LogManager.getLogger()

const val FQDN_REGEX =
Collaborator

nit: for complex regex like this it's helpful to add a comment explaining what it matches and a few examples

import java.net.InetAddress
import java.net.URL

object ValidationHelpers {
Collaborator

Is this class related to the sweeper logic?

Let's add unit tests for these methods

Member Author

Removing this.
This is a junk file from my git stash for an older issue. It's not used.

Comment on lines +105 to +110
scheduledCleanup?.cancel()
scheduledCleanup = threadPool.scheduleWithFixedDelay(
    { cleanupQueryIndices() },
    queryIndexCleanupPeriod,
    ThreadPool.Names.MANAGEMENT
)
Collaborator

nit: can call offClusterManager() then onClusterManager() instead of redefining the logic here

Comment on lines +417 to +419
internal fun extractIndexNumber(concreteQueryIndex: String): Int {
    return concreteQueryIndex.substringAfterLast("-").toIntOrNull() ?: 0
}
Collaborator

This seems brittle. It relies on the rollover always following the `-<number>` suffix pattern. Things like a daily rollover would not follow this pattern. Is there a more deterministic way to get this information? Maybe the write index of the alias?


override fun onFailure(e: Exception) {
    logger.error("Batch delete failed for: $indicesToDelete. Retrying individually.", e)
    deleteQueryIndicesOneByOne(indicesToDelete)
Collaborator

If the bulk delete fails, would we reasonably expect 1-by-1 deletion to succeed?

}

internal fun extractAliasFromConcreteIndex(concreteQueryIndex: String): String {
    return concreteQueryIndex.substringBeforeLast("-")
Collaborator

Same comment as below


private fun indexExists(indexName: String): Boolean {
    return try {
        clusterService.state().metadata().hasIndex(indexName)
Collaborator

Are these state calls to the cached local copy or do they request the current state from the master node?

… indices

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
@eirsep eirsep force-pushed the query-index-cleanup branch from d625b23 to 237bccd on February 25, 2026 at 17:35

3 participants