Adds periodic cleanup job to delete doc level monitor percolate query indices#2024
eirsep wants to merge 3 commits into opensearch-project:main
Conversation
alerting/src/main/kotlin/org/opensearch/alerting/util/ValidationHelpers.kt
```kotlin
    deleteIndexRequest,
    object : ActionListener<AcknowledgedResponse> {
        override fun onResponse(response: AcknowledgedResponse) {
            logger.info("Successfully deleted query indices: $indicesToDelete")
```
Nitpick:
Would it make sense to log existingIndices here instead of indicesToDelete since they could potentially be different?
```kotlin
        search(searchRequest, it)
    }
```

```kotlin
logger.info("Metadata query returned ${response.hits.hits.size} documents")
```
Clarification question:
Is there any realistic risk of there being more than 10,000 metadata docs? I'm wondering whether we would need to log response.hits.totalHits here, and then perform the subsequent steps in batches of 10,000.
Or would the plan be for subsequent executions of the cleanup job to catch any lingering metadata?
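If the 10,000-hit ceiling is a real concern, one option is to page with `search_after`. A minimal sketch, assuming the plugin's `suspendUntil` coroutine extension and a `parseMonitorMetadata` helper (both hypothetical here, as is the batch size):

```kotlin
// Hypothetical sketch: page through all metadata docs with search_after instead of
// relying on a single capped response. Helper names and batch size are assumptions.
import org.opensearch.action.search.SearchRequest
import org.opensearch.search.builder.SearchSourceBuilder
import org.opensearch.search.sort.SortOrder

private suspend fun fetchAllMonitorMetadataPaged(batchSize: Int = 1000): List<MonitorMetadata> {
    val results = mutableListOf<MonitorMetadata>()
    var searchAfter: Array<Any>? = null
    while (true) {
        val source = SearchSourceBuilder()
            .size(batchSize)
            .sort("_id", SortOrder.ASC) // search_after requires a stable sort key
        searchAfter?.let { source.searchAfter(it) }
        val response = client.suspendUntil { search(SearchRequest(configIndex).source(source), it) }
        val hits = response.hits.hits
        if (hits.isEmpty()) break
        results += hits.map { parseMonitorMetadata(it) } // parsing helper assumed
        searchAfter = hits.last().sortValues // resume point for the next page
    }
    return results
}
```

Subsequent executions of the cleanup job would then only need to catch metadata created mid-run, not overflow from a single capped search.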
```kotlin
    }

    private suspend fun fetchAllMonitorMetadata(): List<MonitorMetadata> {
        val configIndex = ".opendistro-alerting-config"
```
Nitpick:
We probably have a constant somewhere in the package we could reference here.
```kotlin
    }

    private suspend fun getMonitor(monitorId: String): Monitor? {
        val getRequest = GetRequest(".opendistro-alerting-config", monitorId)
```
Nitpick:
We probably have a constant somewhere in the package we could reference here.
```kotlin
    for (metadata in allMetadata) {
        val monitorId = extractMonitorId(metadata.id)
        val monitor = getMonitor(monitorId)
```
Nitpick:
Rather than calling getMonitor for each metadata, would it make sense to call searchMonitor and query using all of the monitorIDs?
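The suggestion above could look something like this. A sketch only: `suspendUntil`, `configIndex`, and `parseMonitor` are assumed helpers, not the PR's actual code:

```kotlin
// Hypothetical sketch: fetch all referenced monitors in one search using an ids query,
// instead of one GetRequest per metadata entry.
import org.opensearch.action.search.SearchRequest
import org.opensearch.index.query.QueryBuilders
import org.opensearch.search.builder.SearchSourceBuilder

private suspend fun getMonitors(monitorIds: Collection<String>): Map<String, Monitor> {
    if (monitorIds.isEmpty()) return emptyMap()
    val source = SearchSourceBuilder()
        .query(QueryBuilders.idsQuery().addIds(*monitorIds.toTypedArray()))
        .size(monitorIds.size)
    val response = client.suspendUntil { search(SearchRequest(configIndex).source(source), it) }
    // Map monitor id -> parsed Monitor; ids absent from the result were deleted.
    return response.hits.hits.associate { it.id to parseMonitor(it) }
}
```

One search round-trip also makes "monitor no longer exists" visible as a missing key, rather than N null results.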
```kotlin
    private suspend fun cleanupMetadataMappings(allMetadata: List<MonitorMetadata>, indicesToDelete: List<String>) {
        for (metadata in allMetadata) {
            val monitorId = extractMonitorId(metadata.id)
```
Clarification question:
It looks like the MonitorMetadata object already has a monitorId attribute. Is there a reason to not use that attribute here?
https://github.com/opensearch-project/common-utils/blob/main/src/main/kotlin/org/opensearch/commons/alerting/model/MonitorMetadata.kt#L25
A couple of high-level thoughts on this:
```kotlin
    private val logger = LogManager.getLogger()
```

```kotlin
    const val FQDN_REGEX =
```
nit: for complex regex like this it's helpful to add a comment explaining what it matches and a few examples
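To illustrate the nit, here is one way to document a complex regex inline. The pattern below is a simple DNS-label check chosen purely for illustration; it is not the `FQDN_REGEX` from this PR:

```kotlin
// Illustrative only (not the pattern from this PR): document what the regex matches
// and give positive/negative examples right next to the definition.
//
// Matches a single DNS label: 1-63 chars, alphanumeric, hyphens allowed internally.
//   matches:     "example", "my-host1"
//   non-matches: "-leading", "trailing-"
val DNS_LABEL_REGEX = Regex("^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")

fun isValidDnsLabel(label: String): Boolean = DNS_LABEL_REGEX.matches(label)
```

Keeping the examples in the comment makes it cheap to turn them into unit tests later.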
```kotlin
import java.net.InetAddress
import java.net.URL

object ValidationHelpers {
```
Is this class related to the sweeper logic?
Let's add unit tests for these methods
Removing this.
This is a junk file from my git stash for an older issue. It's not used.
```kotlin
    scheduledCleanup?.cancel()
    scheduledCleanup = threadPool.scheduleWithFixedDelay(
        { cleanupQueryIndices() },
        queryIndexCleanupPeriod,
        ThreadPool.Names.MANAGEMENT
    )
```
nit: can call offClusterManager() then onClusterManager() instead of redefining the logic here
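The suggested refactor would roughly be the following (method names are taken from the review comment; their exact signatures in this class are an assumption):

```kotlin
// Sketch of the reviewer's suggestion: reuse the existing cluster-manager lifecycle
// hooks instead of duplicating the cancel-and-reschedule logic here.
private fun rescheduleCleanup() {
    offClusterManager() // assumed to cancel scheduledCleanup
    onClusterManager()  // assumed to call scheduleWithFixedDelay with the current period
}
```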
```kotlin
    internal fun extractIndexNumber(concreteQueryIndex: String): Int {
        return concreteQueryIndex.substringAfterLast("-").toIntOrNull() ?: 0
    }
```
This seems brittle. It relies on the rollover always producing a `-<number>` suffix. Things like a daily rollover would not follow this pattern. Is there a more deterministic way to get this information? Maybe the write index of the alias?
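The more deterministic option mentioned here, sketched under the assumption that the query indices sit behind an alias with a designated write index (the cluster-state API usage may vary by OpenSearch version):

```kotlin
// Hypothetical sketch: resolve the current write index of the query-index alias from
// cluster state instead of parsing a numeric suffix out of the index name.
private fun writeIndexForAlias(alias: String): String? {
    val abstraction = clusterService.state().metadata().indicesLookup[alias] ?: return null
    // getWriteIndex() returns the alias's write target, if one is designated.
    return abstraction.writeIndex?.index?.name
}
```

This works regardless of the rollover naming scheme, since the write index is tracked explicitly by the alias.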
```kotlin
    override fun onFailure(e: Exception) {
        logger.error("Batch delete failed for: $indicesToDelete. Retrying individually.", e)
        deleteQueryIndicesOneByOne(indicesToDelete)
```
If the bulk delete fails, would we reasonably expect 1-by-1 deletion to succeed?
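For context on when a 1-by-1 fallback can help: a multi-index `DeleteIndexRequest` fails as a unit, so one problematic index name (already deleted, blocked, etc.) can sink the whole batch, while individual requests isolate the failure. A minimal sketch; the helper name comes from the diff, but the listener shape and client wiring are assumptions:

```kotlin
// Hypothetical sketch: delete indices individually so a single failure does not
// prevent the remaining orphaned indices from being cleaned up.
private fun deleteQueryIndicesOneByOne(indices: List<String>) {
    for (index in indices) {
        client.admin().indices().delete(
            DeleteIndexRequest(index),
            object : ActionListener<AcknowledgedResponse> {
                override fun onResponse(response: AcknowledgedResponse) =
                    logger.info("Deleted query index: $index")
                override fun onFailure(e: Exception) =
                    logger.error("Failed to delete query index: $index", e)
            }
        )
    }
}
```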
```kotlin
    }

    internal fun extractAliasFromConcreteIndex(concreteQueryIndex: String): String {
        return concreteQueryIndex.substringBeforeLast("-")
```

```kotlin
    private fun indexExists(indexName: String): Boolean {
        return try {
            clusterService.state().metadata().hasIndex(indexName)
```
Are these state calls to the cached local copy or do they request the current state from the master node?
… indices Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Change: Implement a scheduled cleanup job that:
1. Identifies query indices whose source data indices no longer exist
2. Verifies no active monitors reference the query index
3. Deletes orphaned query indices older than a configurable threshold (e.g., 7 days)
Background
Alerting doc-level monitors leave clusters with a gradually growing number of query indices, leading to:
- 100+ query indices per detector accumulating over time
- Increased cluster storage costs
- Degraded cluster performance and high CPU usage
- Manual intervention required to delete old indices
Doc-level monitors (detectors) use a specialized index called a "query index" to efficiently match incoming documents against detection rules. This is called a percolate search query. Think of it as a reverse index: instead of searching documents for queries, we search queries for matching documents.
Key characteristics:
- Each monitor maintains query indices under an alias (e.g., `.opensearch-sap-cloudtrail-detectors-queries-optimized-*`)
- Query indices mirror field mappings from source data indices to enable query execution
- When field limits are exceeded, the system automatically creates a new query index (rollover)

Problem:
- Time-series data indices (e.g., `log-aws-cloudtrail-000001`, `000002`, etc.) are regularly rolled over, eventually moved to the UltraWarm tier, and deleted
- Doc-level monitors do not track when source indices are deleted or moved out of the hot tier
- Query indices created for these indices remain indefinitely
- No automated cleanup job exists to remove unused query indices

Impact:
- Query indices accumulate at the rate of data index rollovers
- For daily index rollovers over 3 years: 1000+ query indices
- Each query index consumes cluster resources even when no longer needed
Example timeline:

```
Day 1:    log-aws-cloudtrail-000001 created → query index 000001 created
Day 5:    log-aws-cloudtrail-000010 created → query index 000002 created
...
Day 90:   log-aws-cloudtrail-000001 deleted (by ISM retention policy)
          → query index 000001 still exists (orphaned)
Day 1000: 100 query indices exist, 98 are orphaned
```
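Putting the three steps of the change together, a cleanup pass could be sketched as follows. All helper names here are hypothetical, not the PR's actual implementation:

```kotlin
// Hypothetical sketch of the scheduled cleanup flow described above.
private suspend fun cleanupQueryIndices() {
    val allMetadata = fetchAllMonitorMetadata()          // load all monitor metadata
    val indicesToDelete = mutableListOf<String>()
    for (metadata in allMetadata) {
        val monitor = getMonitor(metadata.monitorId) ?: continue
        for (queryIndex in queryIndicesFor(metadata)) {
            val orphaned =
                !sourceIndicesExist(monitor, queryIndex) &&   // 1. source data indices gone
                !referencedByActiveMonitor(queryIndex) &&     // 2. no active monitor references it
                olderThanThreshold(queryIndex)                // 3. past retention (e.g., 7 days)
            if (orphaned) indicesToDelete += queryIndex
        }
    }
    if (indicesToDelete.isNotEmpty()) deleteQueryIndices(indicesToDelete)
}
```

The age threshold in step 3 acts as a safety margin, so indices that are only transiently unreferenced (e.g., during a monitor update) are not deleted prematurely.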