perf(graphql): batched timeseries aggregation API and Dashboard.statsSummary updates#18131
perf(graphql): batched timeseries aggregation API and Dashboard.statsSummary updates#18131chakru-r wants to merge 3 commits into
Conversation
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (72.93%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage. 📢 Thoughts on this report? Let us know! |
27967ce to
0c3507d
Compare
…gation API
Adds `TimeseriesAspectService.batchGetAggregatedStats()` and the backing
`ESAggregatedStatsDAO.getBatchAggregatedStats()` to execute a single
OpenSearch request for up to N URNs at once, using an outer
`terms("batch_urn_outer")` aggregation keyed by URN.
Previously the DAO had no batch path; every caller fired one ES request
per entity, and per-entity aggregation queries had two hardcoded limits:
- Bucket sort order was always ascending `_key` for STRING grouping
buckets — there was no way to request top-N by metric value.
- Bucket size was always `MAX_TERM_BUCKETS` (1,440 for 24 h × 60 min),
forcing ES to materialise all user buckets so the caller could sort
and trim client-side.
New additions:
- `GroupingBucket.pdl`: adds optional `size`, `orderByMetric`, and
`ascending` fields so callers can express "top-5 by metric DESC"
directly in the query rather than fetching everything and discarding.
- `ESAggregatedStatsDAO.makeGroupingAggregationBuilder`: respects the
new PDL fields — STRING buckets use `_key` order by default but switch
to metric-ordered `terms` when `orderByMetric=true`; `size` caps the
bucket count at the call site instead of always defaulting to
MAX_TERM_BUCKETS.
- `ElasticSearchTimeseriesAspectService.batchGetAggregatedStats`:
sub-batches the URN list by `TimeseriesAspectServiceConfig.BatchAggConfig
.maxUrnsPerBatch` (default 50) and delegates each sub-batch to the DAO.
When the feature flag is off the method falls back to the existing
per-URN `getAggregatedStats` default interface method, making the
change backward-compatible.
`DashboardStatsSummaryResolver` previously shared the same utility path
as the `DashboardUsageStatsResolver` detail view (`getUserUsageCounts`
from `DashboardUsageStatsUtils`):
- It fired two separate ES queries per dashboard (one `getAspectValues`
for viewCount, one `getAggregatedStats` for users).
- The user query fetched all users with six aggregation specs
(SUM + CARDINALITY for usageCount, viewsCount, executionsCount) using
a STRING grouping bucket with no size cap — returning every user ever
seen for that dashboard.
- `uniqueUserCountLast30Days` was computed as `userUsageCounts.size()`,
a client-side count of the returned buckets.
- Top-5 users were selected by client-side sort of the full list, with
the rest discarded.
- On a search results page showing N dashboards this meant 2 × N
concurrent ES requests, each potentially materialising thousands of
user buckets.
The new `DashboardStatsSummaryBatchLoader` (DataLoader pattern) fires
exactly three query types for all N dashboards in a single GraphQL
request, each sub-batched into groups of ≤ 50 URNs:
A. `batchGetAspectValues` (no time window, limit=1) → viewCount
B. `batchGetAggregatedStats` CARDINALITY on `userCounts.user` →
uniqueUserCountLast30Days (ES-native cardinality, exact regardless
of user volume)
C. `batchGetAggregatedStats` SUM on `userCounts.usageCount`, STRING
grouping size=5 orderByMetric=true ascending=false →
topUsersLast30Days already ranked by ES, no client sort needed
`DashboardStatsSummaryResolver` is gated by the existing
`timeseriesAspectAggBatchLoadEnabled` feature flag (env var
`TIMESERIES_ASPECT_AGG_BATCH_LOAD_ENABLED`, default true); when disabled
it falls back to the old per-URN path unchanged.
Detail pages that fetch statsSummary for a single dashboard at a time
are unaffected — the old path is still used there because there is no
fan-out to batch. The fan-out reduction applies to search and browse
result pages that render many dashboards simultaneously.
A smoke test (`test_stats_summary_graphql.py`) is added that seeds 3
dashboards and asserts result correctness. Parity between the batch path
and the per-URN fallback path was verified manually by running the test
with the feature flag both enabled and disabled.
Note: the unbatched fallback path in this resolver and the feature flag
gating it will be removed in a follow-up PR once the batch path has
been validated in production, making the batch loader the permanent
implementation.
Note: a further follow-up PR will apply the same batch-loader pattern to
DatasetStatsSummaryResolver and ChartStatsSummaryResolver. The timeseries
service and DAO changes in Part 1 require no further modification to
support those resolvers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
0c3507d to
68c3118
Compare
| Criterion startTimeCriterion = | ||
| criteria.add( | ||
| buildCriterion( | ||
| ES_FIELD_TIMESTAMP, Condition.GREATER_THAN_OR_EQUAL_TO, Long.toString(startTime)); |
There was a problem hiding this comment.
We really need to add something like a floor(), either in the service layer or the frontend. Are we breaking OS/ES caching by always sending incrementing startTime/endTime? For example, if the same query is run within the same say hour, this means OS/ES can cache the result but if we keep incrementing timestamps it busts this cache.
david-leifker
left a comment
There was a problem hiding this comment.
I don't seen any blockers here. I would ask whether we have search caching on the service layer or are optimizing the queries (at least from the frontend) to leverage OS/ES caching.
Part 1 — Timeseries service & DAO: batchGetAggregatedStats
Adds
TimeseriesAspectService.batchGetAggregatedStats()and the backingESAggregatedStatsDAO.getBatchAggregatedStats()to execute a single OpenSearch request for up to N URNs at once, using an outerterms("batch_urn_outer")aggregation keyed by URN.Previously the DAO had no batch path; every caller fired one ES request per entity, and per-entity aggregation queries had two hardcoded limits:
_keyfor STRING grouping buckets — there was no way to request top-N by metric value.MAX_TERM_BUCKETS(1,440 for 24 h × 60 min), forcing ES to materialise all user buckets so the caller could sort and trim client-side.New additions:
GroupingBucket.pdl: adds optionalsize,orderByMetric, andascendingfields so callers can express "top-5 by metric DESC" directly in the query rather than fetching everything and discarding.ESAggregatedStatsDAO.makeGroupingAggregationBuilder: respects the new PDL fields — STRING buckets use_keyorder by default but switch to metric-orderedtermswhenorderByMetric=true;sizecaps the bucket count at the call site instead of always defaulting to MAX_TERM_BUCKETS.ElasticSearchTimeseriesAspectService.batchGetAggregatedStats: sub-batches the URN list byTimeseriesAspectServiceConfig.BatchAggConfig.maxUrnsPerBatch(default 50) and delegates each sub-batch to the DAO. When the feature flag is off the method falls back to the existing per-URNgetAggregatedStatsdefault interface method, making the change backward-compatible.Part 2 — Dashboard.statsSummary resolver: replace per-URN fan-out with DataLoader batch
DashboardStatsSummaryResolverpreviously shared the same utility path as theDashboardUsageStatsResolverdetail view (getUserUsageCountsfromDashboardUsageStatsUtils):getAspectValuesfor viewCount, onegetAggregatedStatsfor users).uniqueUserCountLast30Dayswas computed asuserUsageCounts.size(), a client-side count of the returned buckets.The new
DashboardStatsSummaryBatchLoader(DataLoader pattern) fires exactly three query types for all N dashboards in a single GraphQL request, each sub-batched into groups of ≤ 50 URNs:batchGetAspectValues(no time window, limit=1) → viewCountbatchGetAggregatedStatsCARDINALITY onuserCounts.user→ uniqueUserCountLast30Days (ES-native cardinality, exact regardless of user volume)batchGetAggregatedStatsSUM onuserCounts.usageCount, STRING grouping size=5 orderByMetric=true ascending=false → topUsersLast30Days already ranked by ES, no client sort neededDashboardStatsSummaryResolveris gated by the existingtimeseriesAspectAggBatchLoadEnabledfeature flag (env varTIMESERIES_ASPECT_AGG_BATCH_LOAD_ENABLED, default true); when disabled it falls back to the old per-URN path unchanged.Detail pages that fetch statsSummary for a single dashboard at a time are unaffected — the old path is still used there because there is no fan-out to batch. The fan-out reduction applies to search and browse result pages that render many dashboards simultaneously.
A smoke test (
test_stats_summary_graphql.py) is added that seeds 3 dashboards and asserts result correctness. Parity between the batch path and the per-URN fallback path was verified manually by running the test with the feature flag both enabled and disabled.Note: the unbatched fallback path in this resolver and the feature flag gating it will be removed in a follow-up PR once the batch path has been validated in production, making the batch loader the permanent implementation.
Note: a further follow-up PR will apply the same batch-loader pattern to DatasetStatsSummaryResolver and ChartStatsSummaryResolver. The timeseries service and DAO changes in Part 1 require no further modification to support those resolvers.
Checklist