Skip to content

perf(graphql): batched timeseries aggregation API and Dashboard.statsSummary updates#18131

Open
chakru-r wants to merge 3 commits into
masterfrom
cr-timerseries-agg-batch
Open

perf(graphql): batched timeseries aggregation API and Dashboard.statsSummary updates#18131
chakru-r wants to merge 3 commits into
masterfrom
cr-timerseries-agg-batch

Conversation

@chakru-r

@chakru-r chakru-r commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Part 1 — Timeseries service & DAO: batchGetAggregatedStats

Adds TimeseriesAspectService.batchGetAggregatedStats() and the backing ESAggregatedStatsDAO.getBatchAggregatedStats() to execute a single OpenSearch request for up to N URNs at once, using an outer terms("batch_urn_outer") aggregation keyed by URN.

Previously the DAO had no batch path; every caller fired one ES request per entity, and per-entity aggregation queries had two hardcoded limits:

  • Bucket sort order was always ascending _key for STRING grouping buckets — there was no way to request top-N by metric value.
  • Bucket size was always MAX_TERM_BUCKETS (1,440 for 24 h × 60 min), forcing ES to materialise all user buckets so the caller could sort and trim client-side.

New additions:

  • GroupingBucket.pdl: adds optional size, orderByMetric, and ascending fields so callers can express "top-5 by metric DESC" directly in the query rather than fetching everything and discarding.
  • ESAggregatedStatsDAO.makeGroupingAggregationBuilder: respects the new PDL fields — STRING buckets use _key order by default but switch to metric-ordered terms when orderByMetric=true; size caps the bucket count at the call site instead of always defaulting to MAX_TERM_BUCKETS.
  • ElasticSearchTimeseriesAspectService.batchGetAggregatedStats: sub-batches the URN list by TimeseriesAspectServiceConfig.BatchAggConfig.maxUrnsPerBatch (default 50) and delegates each sub-batch to the DAO. When the feature flag is off the method falls back to the existing per-URN getAggregatedStats default interface method, making the change backward-compatible.

Part 2 — Dashboard.statsSummary resolver: replace per-URN fan-out with DataLoader batch

DashboardStatsSummaryResolver previously shared the same utility path as the DashboardUsageStatsResolver detail view (getUserUsageCounts from DashboardUsageStatsUtils):

  • It fired two separate ES queries per dashboard (one getAspectValues for viewCount, one getAggregatedStats for users).
  • The user query fetched all users with six aggregation specs (SUM + CARDINALITY for usageCount, viewsCount, executionsCount) using a STRING grouping bucket with no size cap — returning every user ever seen for that dashboard.
  • uniqueUserCountLast30Days was computed as userUsageCounts.size(), a client-side count of the returned buckets.
  • Top-5 users were selected by client-side sort of the full list, with the rest discarded.
  • On a search results page showing N dashboards this meant 2 × N concurrent ES requests, each potentially materialising thousands of user buckets.

The new DashboardStatsSummaryBatchLoader (DataLoader pattern) fires exactly three query types for all N dashboards in a single GraphQL request, each sub-batched into groups of ≤ 50 URNs:

  • A. batchGetAspectValues (no time window, limit=1) → viewCount
  • B. batchGetAggregatedStats CARDINALITY on userCounts.user → uniqueUserCountLast30Days (ES-native cardinality, exact regardless of user volume)
  • C. batchGetAggregatedStats SUM on userCounts.usageCount, STRING grouping size=5 orderByMetric=true ascending=false → topUsersLast30Days already ranked by ES, no client sort needed

DashboardStatsSummaryResolver is gated by the existing timeseriesAspectAggBatchLoadEnabled feature flag (env var TIMESERIES_ASPECT_AGG_BATCH_LOAD_ENABLED, default true); when disabled it falls back to the old per-URN path unchanged.

Detail pages that fetch statsSummary for a single dashboard at a time are unaffected — the old path is still used there because there is no fan-out to batch. The fan-out reduction applies to search and browse result pages that render many dashboards simultaneously.

A smoke test (test_stats_summary_graphql.py) is added that seeds 3 dashboards and asserts result correctness. Parity between the batch path and the per-URN fallback path was verified manually by running the test with the feature flag both enabled and disabled.

Note: the unbatched fallback path in this resolver and the feature flag gating it will be removed in a follow-up PR once the batch path has been validated in production, making the batch loader the permanent implementation.

Note: a further follow-up PR will apply the same batch-loader pattern to DatasetStatsSummaryResolver and ChartStatsSummaryResolver. The timeseries service and DAO changes in Part 1 require no further modification to support those resolvers.

Checklist

  • PR conforms to Contributing Guidelines
  • Tests added/updated
  • Docs added/updated (n/a)
  • Breaking changes documented (n/a)

@github-actions github-actions Bot added product PR or Issue related to the DataHub UI/UX devops PR or Issue related to DataHub backend & deployment smoke_test Contains changes related to smoke tests labels Jul 1, 2026
@chakru-r chakru-r changed the title Cr timerseries agg batch perf(graphql): timeseries aggregateStats improvements Jul 1, 2026
@codecov

codecov Bot commented Jul 1, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 72.93578% with 59 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...n/metadata/timeseries/TimeseriesAspectService.java 0.00% 19 Missing ⚠️
...timeseries/elastic/query/ESAggregatedStatsDAO.java 80.48% 9 Missing and 7 partials ⚠️
...solvers/load/DashboardStatsSummaryBatchLoader.java 79.24% 8 Missing and 3 partials ⚠️
.../resolvers/dashboard/DashboardUsageStatsUtils.java 84.21% 5 Missing and 1 partial ⚠️
...lvers/dashboard/DashboardStatsSummaryResolver.java 63.63% 3 Missing and 1 partial ⚠️
.../elastic/ElasticSearchTimeseriesAspectService.java 84.61% 1 Missing and 1 partial ⚠️
...com/linkedin/datahub/graphql/GmsGraphQLEngine.java 50.00% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (72.93%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

📢 Thoughts on this report? Let us know!

@chakru-r chakru-r force-pushed the cr-timerseries-agg-batch branch from 27967ce to 0c3507d Compare July 2, 2026 09:15
@chakru-r chakru-r changed the title perf(graphql): timeseries aggregateStats improvements perf(graphql): batch-load Dashboard.statsSummary via timeseries aggregation API Jul 2, 2026
…gation API

Adds `TimeseriesAspectService.batchGetAggregatedStats()` and the backing
`ESAggregatedStatsDAO.getBatchAggregatedStats()` to execute a single
OpenSearch request for up to N URNs at once, using an outer
`terms("batch_urn_outer")` aggregation keyed by URN.

Previously the DAO had no batch path; every caller fired one ES request
per entity, and per-entity aggregation queries had two hardcoded limits:

- Bucket sort order was always ascending `_key` for STRING grouping
  buckets — there was no way to request top-N by metric value.
- Bucket size was always `MAX_TERM_BUCKETS` (1,440 for 24 h × 60 min),
  forcing ES to materialise all user buckets so the caller could sort
  and trim client-side.

New additions:

- `GroupingBucket.pdl`: adds optional `size`, `orderByMetric`, and
  `ascending` fields so callers can express "top-5 by metric DESC"
  directly in the query rather than fetching everything and discarding.
- `ESAggregatedStatsDAO.makeGroupingAggregationBuilder`: respects the
  new PDL fields — STRING buckets use `_key` order by default but switch
  to metric-ordered `terms` when `orderByMetric=true`; `size` caps the
  bucket count at the call site instead of always defaulting to
  MAX_TERM_BUCKETS.
- `ElasticSearchTimeseriesAspectService.batchGetAggregatedStats`:
  sub-batches the URN list by `TimeseriesAspectServiceConfig.BatchAggConfig
  .maxUrnsPerBatch` (default 50) and delegates each sub-batch to the DAO.
  When the feature flag is off the method falls back to the existing
  per-URN `getAggregatedStats` default interface method, making the
  change backward-compatible.

`DashboardStatsSummaryResolver` previously shared the same utility path
as the `DashboardUsageStatsResolver` detail view (`getUserUsageCounts`
from `DashboardUsageStatsUtils`):

- It fired two separate ES queries per dashboard (one `getAspectValues`
  for viewCount, one `getAggregatedStats` for users).
- The user query fetched all users with six aggregation specs
  (SUM + CARDINALITY for usageCount, viewsCount, executionsCount) using
  a STRING grouping bucket with no size cap — returning every user ever
  seen for that dashboard.
- `uniqueUserCountLast30Days` was computed as `userUsageCounts.size()`,
  a client-side count of the returned buckets.
- Top-5 users were selected by client-side sort of the full list, with
  the rest discarded.
- On a search results page showing N dashboards this meant 2 × N
  concurrent ES requests, each potentially materialising thousands of
  user buckets.

The new `DashboardStatsSummaryBatchLoader` (DataLoader pattern) fires
exactly three query types for all N dashboards in a single GraphQL
request, each sub-batched into groups of ≤ 50 URNs:

  A. `batchGetAspectValues` (no time window, limit=1) → viewCount
  B. `batchGetAggregatedStats` CARDINALITY on `userCounts.user` →
     uniqueUserCountLast30Days (ES-native cardinality, exact regardless
     of user volume)
  C. `batchGetAggregatedStats` SUM on `userCounts.usageCount`, STRING
     grouping size=5 orderByMetric=true ascending=false →
     topUsersLast30Days already ranked by ES, no client sort needed

`DashboardStatsSummaryResolver` is gated by the existing
`timeseriesAspectAggBatchLoadEnabled` feature flag (env var
`TIMESERIES_ASPECT_AGG_BATCH_LOAD_ENABLED`, default true); when disabled
it falls back to the old per-URN path unchanged.

Detail pages that fetch statsSummary for a single dashboard at a time
are unaffected — the old path is still used there because there is no
fan-out to batch. The fan-out reduction applies to search and browse
result pages that render many dashboards simultaneously.

A smoke test (`test_stats_summary_graphql.py`) is added that seeds 3
dashboards and asserts result correctness. Parity between the batch path
and the per-URN fallback path was verified manually by running the test
with the feature flag both enabled and disabled.

Note: the unbatched fallback path in this resolver and the feature flag
gating it will be removed in a follow-up PR once the batch path has
been validated in production, making the batch loader the permanent
implementation.

Note: a further follow-up PR will apply the same batch-loader pattern to
DatasetStatsSummaryResolver and ChartStatsSummaryResolver. The timeseries
service and DAO changes in Part 1 require no further modification to
support those resolvers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@chakru-r chakru-r force-pushed the cr-timerseries-agg-batch branch from 0c3507d to 68c3118 Compare July 2, 2026 13:23
@chakru-r chakru-r marked this pull request as ready for review July 2, 2026 13:27
@chakru-r chakru-r changed the title perf(graphql): batch-load Dashboard.statsSummary via timeseries aggregation API perf(graphql): batched timeseries aggregation API and Dashboard.statsSummary updates Jul 2, 2026
Criterion startTimeCriterion =
criteria.add(
buildCriterion(
ES_FIELD_TIMESTAMP, Condition.GREATER_THAN_OR_EQUAL_TO, Long.toString(startTime));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really need to add something like a floor(), either in the service layer or the frontend. Are we breaking OS/ES caching by always sending incrementing startTime/endTime? For example, if the same query is run within the same say hour, this means OS/ES can cache the result but if we keep incrementing timestamps it busts this cache.

@david-leifker david-leifker left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't seen any blockers here. I would ask whether we have search caching on the service layer or are optimizing the queries (at least from the frontend) to leverage OS/ES caching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

devops PR or Issue related to DataHub backend & deployment pending-submitter-merge product PR or Issue related to the DataHub UI/UX smoke_test Contains changes related to smoke tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants