
Adds periodic cleanup job to delete doc level monitor percolate query indices#2024

Open
eirsep wants to merge 3 commits into opensearch-project:main from eirsep:query-index-cleanup

Conversation

Member

@eirsep eirsep commented Feb 19, 2026

Change

Implement a scheduled cleanup job that:

1. Identifies query indices whose source data indices no longer exist
2. Verifies that no active monitors reference the query index
3. Deletes orphaned query indices older than a configurable threshold (e.g., 7 days)
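The three steps above can be sketched as pure filtering logic. This is a minimal sketch, not the PR's implementation: the names `QueryIndexInfo`, `sourceIndexExists`, and `isReferencedByMonitor` are hypothetical stand-ins for the actual cluster-state and config-index lookups.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical descriptor for a query index; the real job would read this
// information from cluster state rather than a data class.
data class QueryIndexInfo(
    val name: String,
    val sourceIndex: String,
    val creationTime: Instant,
)

// A query index is deletable only if its source data index is gone (step 1),
// no active monitor references it (step 2), and it is older than the
// configurable retention threshold (step 3).
fun findOrphanedQueryIndices(
    queryIndices: List<QueryIndexInfo>,
    sourceIndexExists: (String) -> Boolean,      // step 1: cluster-state lookup
    isReferencedByMonitor: (String) -> Boolean,  // step 2: config-index search
    threshold: Duration = Duration.ofDays(7),    // step 3: age cutoff
    now: Instant = Instant.now(),
): List<String> =
    queryIndices
        .filter { !sourceIndexExists(it.sourceIndex) }
        .filter { !isReferencedByMonitor(it.name) }
        .filter { Duration.between(it.creationTime, now) >= threshold }
        .map { it.name }
```

Keeping the decision logic separate from the OpenSearch client calls like this would also make the age/reference checks unit-testable without a cluster.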

Background

Alerting doc-level monitors leave clusters with a gradually growing number of query indices, leading to:
- 100+ query indices per detector accumulating over time
- Increased cluster storage costs
- Degraded cluster performance and high CPU usage
- Manual intervention being required to delete old indices

Doc-level monitors (detectors) use a specialized index called a "query index" to efficiently match incoming documents against detection rules, using a percolate query. Think of it as a reverse index: instead of searching documents for queries, we search queries for matching documents.
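As a toy illustration of the "reverse index" idea (not the actual percolator, which stores parsed Lucene queries in an index): queries are registered up front, and each incoming document is matched against the stored queries.

```kotlin
// Toy model of percolation: the search direction is inverted. Instead of
// running one query over many documents, we run one document over many
// stored queries and return the IDs of the queries that match.
data class StoredQuery(val id: String, val field: String, val term: String)

fun percolate(doc: Map<String, String>, queries: List<StoredQuery>): List<String> =
    queries
        .filter { q -> doc[q.field]?.contains(q.term) == true }
        .map { it.id }
```

For example, percolating the document `{"message": "login failure from host-a"}` against a stored query matching `failure` in `message` returns that query's ID.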

Key characteristics:
- Each monitor maintains query indices under an alias (e.g., .opensearch-sap-cloudtrail-detectors-queries-optimized-*)
- Query indices mirror field mappings from the source data indices to enable query execution
- When field limits are exceeded, the system automatically creates a new query index (rollover)
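The rollover-on-field-limit behavior can be sketched as follows. This is an assumption-laden illustration: the function name, the default limit of 1000, and the `-000001` suffix convention are stand-ins, not the plugin's actual settings or code.

```kotlin
// Sketch: keep writing to the current concrete query index until adding the
// new source-index mappings would exceed the total-fields limit, then roll
// over to the next numbered concrete index behind the same alias.
fun nextQueryIndex(
    current: String,       // e.g. "…-queries-optimized-000001" (hypothetical name)
    fieldCount: Int,       // fields already mapped in the current index
    newFields: Int,        // fields the new source index would add
    fieldLimit: Int = 1000 // assumed limit; the real setting is configurable
): String {
    if (fieldCount + newFields <= fieldLimit) return current
    val num = current.substringAfterLast("-").toIntOrNull() ?: 0
    return current.substringBeforeLast("-") + "-" + "%06d".format(num + 1)
}
```

Note this suffix-parsing approach has the same brittleness a reviewer flags later for `extractIndexNumber`: it assumes the concrete index name always ends in `-<number>`.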

Problem:
- Time-series data indices (e.g., log-aws-cloudtrail-000001, 000002, etc.) are regularly rolled over, eventually moved to the UltraWarm tier, and deleted
- Doc-level monitors do not track when source indices are deleted or moved out of the hot tier
- Query indices created for those source indices remain indefinitely
- No automated cleanup job exists to remove unused query indices

Impact:
- Query indices accumulate at the rate of data index rollovers
- With daily index rollovers over 3 years: 1000+ query indices
- Each query index consumes cluster resources even when no longer needed

Example Timeline:

Day 1: log-aws-cloudtrail-000001 created → query index 000001 created
Day 5: log-aws-cloudtrail-000010 created → query index 000002 created
...
Day 90: log-aws-cloudtrail-000001 deleted (by ISM retention policy)
→ query index 000001 still exists (orphaned)

Day 1000: 100 query indices exist, 98 are orphaned

    deleteIndexRequest,
    object : ActionListener<AcknowledgedResponse> {
        override fun onResponse(response: AcknowledgedResponse) {
            logger.info("Successfully deleted query indices: $indicesToDelete")
Collaborator

Nitpick:
Would it make sense to log existingIndices here instead of indicesToDelete since they could potentially be different?

search(searchRequest, it)
}

logger.info("Metadata query returned ${response.hits.hits.size} documents")
Collaborator

Clarification question:
Is there any realistic risk of there being more than 10,000 metadata docs? I'm wondering whether we would need to log response.hits.totalHits here, and then perform the subsequent steps in batches of 10,000.

Or would the plan be for subsequent executions of the cleanup job to catch any lingering metadata?

}

private suspend fun fetchAllMonitorMetadata(): List<MonitorMetadata> {
val configIndex = ".opendistro-alerting-config"
Collaborator

Nitpick:
We probably have a constant somewhere in the package we could reference here.

}

private suspend fun getMonitor(monitorId: String): Monitor? {
val getRequest = GetRequest(".opendistro-alerting-config", monitorId)
Collaborator

Nitpick:
We probably have a constant somewhere in the package we could reference here.


for (metadata in allMetadata) {
val monitorId = extractMonitorId(metadata.id)
val monitor = getMonitor(monitorId)
Collaborator

Nitpick:
Rather than calling getMonitor for each metadata, would it make sense to call searchMonitor and query using all of the monitorIDs?


private suspend fun cleanupMetadataMappings(allMetadata: List<MonitorMetadata>, indicesToDelete: List<String>) {
for (metadata in allMetadata) {
val monitorId = extractMonitorId(metadata.id)
Collaborator

Clarification question:
It looks like the MonitorMetadata object already has a monitorId attribute. Is there a reason to not use that attribute here?
https://github.com/opensearch-project/common-utils/blob/main/src/main/kotlin/org/opensearch/commons/alerting/model/MonitorMetadata.kt#L25

@engechas
Collaborator

A couple of high-level thoughts on this:

  1. Do we need a query index per data index? I recall one single query index was used before, and it was moved to an index-per-data-index model to avoid mapping conflicts. Is there a better middle ground here where shared query indices are used so long as there are no conflicts, and a new index is created only on conflicts?
  2. Can we hook into the index rollover, deletion, or transition to warm flows to trigger deletion of the query index rather than relying on a sweeper pattern?
  3. If we delete the query index for a warm index, then the index is migrated back to hot - will monitors created against the index continue to function?


private val logger = LogManager.getLogger()

const val FQDN_REGEX =
Collaborator

nit: for complex regex like this it's helpful to add a comment explaining what it matches and a few examples

import java.net.InetAddress
import java.net.URL

object ValidationHelpers {
Collaborator

Is this class related to the sweeper logic?

Let's add unit tests for these methods

Member Author

Removing this.
This is a junk file from my git stash for an older issue. It's not used.

Comment on lines +105 to +110
scheduledCleanup?.cancel()
scheduledCleanup = threadPool.scheduleWithFixedDelay(
    { cleanupQueryIndices() },
    queryIndexCleanupPeriod,
    ThreadPool.Names.MANAGEMENT
)
Collaborator

nit: can call offClusterManager() then onClusterManager() instead of redefining the logic here

Comment on lines +417 to +419
internal fun extractIndexNumber(concreteQueryIndex: String): Int {
    return concreteQueryIndex.substringAfterLast("-").toIntOrNull() ?: 0
}
Collaborator

This seems brittle. It relies on the rollover always following the `-<number>` suffix pattern. Things like a daily rollover would not follow this pattern. Is there a more deterministic way to get this information? Maybe the write index of the alias?


override fun onFailure(e: Exception) {
    logger.error("Batch delete failed for: $indicesToDelete. Retrying individually.", e)
    deleteQueryIndicesOneByOne(indicesToDelete)
Collaborator

If the bulk delete fails, would we reasonably expect 1-by-1 deletion to succeed?

}

internal fun extractAliasFromConcreteIndex(concreteQueryIndex: String): String {
    return concreteQueryIndex.substringBeforeLast("-")
Collaborator

Same comment as below


private fun indexExists(indexName: String): Boolean {
    return try {
        clusterService.state().metadata().hasIndex(indexName)
Collaborator

Are these state calls to the cached local copy or do they request the current state from the master node?

… indices

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
@eirsep eirsep force-pushed the query-index-cleanup branch from d625b23 to 237bccd on February 25, 2026 at 17:35

3 participants