perf(graphql): batched timeseries aggregation API and Dashboard.statsSummary updates by chakru-r · Pull Request #18131 · datahub-project/datahub

chakru-r · 2026-07-01T22:08:31Z

Part 1 — Timeseries service & DAO: batchGetAggregatedStats

Adds TimeseriesAspectService.batchGetAggregatedStats() and the backing ESAggregatedStatsDAO.getBatchAggregatedStats() to execute a single OpenSearch request for up to N URNs at once, using an outer terms("batch_urn_outer") aggregation keyed by URN.

Previously the DAO had no batch path; every caller fired one ES request per entity, and per-entity aggregation queries had two hardcoded limits:

Bucket sort order was always ascending _key for STRING grouping buckets — there was no way to request top-N by metric value.
Bucket size was always MAX_TERM_BUCKETS (1,440 for 24 h × 60 min), forcing ES to materialise all user buckets so the caller could sort and trim client-side.

New additions:

GroupingBucket.pdl: adds optional size, orderByMetric, and ascending fields so callers can express "top-5 by metric DESC" directly in the query rather than fetching everything and discarding.
ESAggregatedStatsDAO.makeGroupingAggregationBuilder: respects the new PDL fields — STRING buckets use _key order by default but switch to metric-ordered terms when orderByMetric=true; size caps the bucket count at the call site instead of always defaulting to MAX_TERM_BUCKETS.
ElasticSearchTimeseriesAspectService.batchGetAggregatedStats: sub-batches the URN list by TimeseriesAspectServiceConfig.BatchAggConfig.maxUrnsPerBatch (default 50) and delegates each sub-batch to the DAO. When the feature flag is off the method falls back to the existing per-URN getAggregatedStats default interface method, making the change backward-compatible.

Part 2 — Dashboard.statsSummary resolver: replace per-URN fan-out with DataLoader batch

DashboardStatsSummaryResolver previously shared the same utility path as the DashboardUsageStatsResolver detail view (getUserUsageCounts from DashboardUsageStatsUtils):

It fired two separate ES queries per dashboard (one getAspectValues for viewCount, one getAggregatedStats for users).
The user query fetched all users with six aggregation specs (SUM + CARDINALITY for usageCount, viewsCount, executionsCount) using a STRING grouping bucket with no size cap — returning every user ever seen for that dashboard.
uniqueUserCountLast30Days was computed as userUsageCounts.size(), a client-side count of the returned buckets.
Top-5 users were selected by client-side sort of the full list, with the rest discarded.
On a search results page showing N dashboards this meant 2 × N concurrent ES requests, each potentially materialising thousands of user buckets.

The new DashboardStatsSummaryBatchLoader (DataLoader pattern) fires exactly three query types for all N dashboards in a single GraphQL request, each sub-batched into groups of ≤ 50 URNs:

A. batchGetAspectValues (no time window, limit=1) → viewCount
B. batchGetAggregatedStats CARDINALITY on userCounts.user → uniqueUserCountLast30Days (ES-native cardinality, exact regardless of user volume)
C. batchGetAggregatedStats SUM on userCounts.usageCount, STRING grouping size=5 orderByMetric=true ascending=false → topUsersLast30Days already ranked by ES, no client sort needed

DashboardStatsSummaryResolver is gated by the existing timeseriesAspectAggBatchLoadEnabled feature flag (env var TIMESERIES_ASPECT_AGG_BATCH_LOAD_ENABLED, default true); when disabled it falls back to the old per-URN path unchanged.

Detail pages that fetch statsSummary for a single dashboard at a time are unaffected — the old path is still used there because there is no fan-out to batch. The fan-out reduction applies to search and browse result pages that render many dashboards simultaneously.

A smoke test (test_stats_summary_graphql.py) is added that seeds 3 dashboards and asserts result correctness. Parity between the batch path and the per-URN fallback path was verified manually by running the test with the feature flag both enabled and disabled.

Note: the unbatched fallback path in this resolver and the feature flag gating it will be removed in a follow-up PR once the batch path has been validated in production, making the batch loader the permanent implementation.

Note: a further follow-up PR will apply the same batch-loader pattern to DatasetStatsSummaryResolver and ChartStatsSummaryResolver. The timeseries service and DAO changes in Part 1 require no further modification to support those resolvers.

Checklist

PR conforms to Contributing Guidelines
Tests added/updated
Docs added/updated (n/a)
Breaking changes documented (n/a)

codecov · 2026-07-01T22:11:49Z

Codecov Report

❌ Patch coverage is 72.93578% with 59 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
...n/metadata/timeseries/TimeseriesAspectService.java	0.00%	19 Missing ⚠️
...timeseries/elastic/query/ESAggregatedStatsDAO.java	80.48%	9 Missing and 7 partials ⚠️
...solvers/load/DashboardStatsSummaryBatchLoader.java	79.24%	8 Missing and 3 partials ⚠️
.../resolvers/dashboard/DashboardUsageStatsUtils.java	84.21%	5 Missing and 1 partial ⚠️
...lvers/dashboard/DashboardStatsSummaryResolver.java	63.63%	3 Missing and 1 partial ⚠️
.../elastic/ElasticSearchTimeseriesAspectService.java	84.61%	1 Missing and 1 partial ⚠️
...com/linkedin/datahub/graphql/GmsGraphQLEngine.java	50.00%	1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (72.93%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

📢 Thoughts on this report? Let us know!

…gation API Adds `TimeseriesAspectService.batchGetAggregatedStats()` and the backing `ESAggregatedStatsDAO.getBatchAggregatedStats()` to execute a single OpenSearch request for up to N URNs at once, using an outer `terms("batch_urn_outer")` aggregation keyed by URN. Previously the DAO had no batch path; every caller fired one ES request per entity, and per-entity aggregation queries had two hardcoded limits: - Bucket sort order was always ascending `_key` for STRING grouping buckets — there was no way to request top-N by metric value. - Bucket size was always `MAX_TERM_BUCKETS` (1,440 for 24 h × 60 min), forcing ES to materialise all user buckets so the caller could sort and trim client-side. New additions: - `GroupingBucket.pdl`: adds optional `size`, `orderByMetric`, and `ascending` fields so callers can express "top-5 by metric DESC" directly in the query rather than fetching everything and discarding. - `ESAggregatedStatsDAO.makeGroupingAggregationBuilder`: respects the new PDL fields — STRING buckets use `_key` order by default but switch to metric-ordered `terms` when `orderByMetric=true`; `size` caps the bucket count at the call site instead of always defaulting to MAX_TERM_BUCKETS. - `ElasticSearchTimeseriesAspectService.batchGetAggregatedStats`: sub-batches the URN list by `TimeseriesAspectServiceConfig.BatchAggConfig .maxUrnsPerBatch` (default 50) and delegates each sub-batch to the DAO. When the feature flag is off the method falls back to the existing per-URN `getAggregatedStats` default interface method, making the change backward-compatible. `DashboardStatsSummaryResolver` previously shared the same utility path as the `DashboardUsageStatsResolver` detail view (`getUserUsageCounts` from `DashboardUsageStatsUtils`): - It fired two separate ES queries per dashboard (one `getAspectValues` for viewCount, one `getAggregatedStats` for users). - The user query fetched all users with six aggregation specs (SUM + CARDINALITY for usageCount, viewsCount, executionsCount) using a STRING grouping bucket with no size cap — returning every user ever seen for that dashboard. - `uniqueUserCountLast30Days` was computed as `userUsageCounts.size()`, a client-side count of the returned buckets. - Top-5 users were selected by client-side sort of the full list, with the rest discarded. - On a search results page showing N dashboards this meant 2 × N concurrent ES requests, each potentially materialising thousands of user buckets. The new `DashboardStatsSummaryBatchLoader` (DataLoader pattern) fires exactly three query types for all N dashboards in a single GraphQL request, each sub-batched into groups of ≤ 50 URNs: A. `batchGetAspectValues` (no time window, limit=1) → viewCount B. `batchGetAggregatedStats` CARDINALITY on `userCounts.user` → uniqueUserCountLast30Days (ES-native cardinality, exact regardless of user volume) C. `batchGetAggregatedStats` SUM on `userCounts.usageCount`, STRING grouping size=5 orderByMetric=true ascending=false → topUsersLast30Days already ranked by ES, no client sort needed `DashboardStatsSummaryResolver` is gated by the existing `timeseriesAspectAggBatchLoadEnabled` feature flag (env var `TIMESERIES_ASPECT_AGG_BATCH_LOAD_ENABLED`, default true); when disabled it falls back to the old per-URN path unchanged. Detail pages that fetch statsSummary for a single dashboard at a time are unaffected — the old path is still used there because there is no fan-out to batch. The fan-out reduction applies to search and browse result pages that render many dashboards simultaneously. A smoke test (`test_stats_summary_graphql.py`) is added that seeds 3 dashboards and asserts result correctness. Parity between the batch path and the per-URN fallback path was verified manually by running the test with the feature flag both enabled and disabled. Note: the unbatched fallback path in this resolver and the feature flag gating it will be removed in a follow-up PR once the batch path has been validated in production, making the batch loader the permanent implementation. Note: a further follow-up PR will apply the same batch-loader pattern to DatasetStatsSummaryResolver and ChartStatsSummaryResolver. The timeseries service and DAO changes in Part 1 require no further modification to support those resolvers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

david-leifker · 2026-07-02T15:00:18Z

-      Criterion startTimeCriterion =
+      criteria.add(
          buildCriterion(
-              ES_FIELD_TIMESTAMP, Condition.GREATER_THAN_OR_EQUAL_TO, Long.toString(startTime));


We really need to add something like a floor(), either in the service layer or the frontend. Are we breaking OS/ES caching by always sending incrementing startTime/endTime? For example, if the same query is run within the same say hour, this means OS/ES can cache the result but if we keep incrementing timestamps it busts this cache.

david-leifker

I don't seen any blockers here. I would ask whether we have search caching on the service layer or are optimizing the queries (at least from the frontend) to leverage OS/ES caching.

github-actions Bot added product PR or Issue related to the DataHub UI/UX devops PR or Issue related to DataHub backend & deployment smoke_test Contains changes related to smoke tests labels Jul 1, 2026

chakru-r changed the title ~~Cr timerseries agg batch~~ perf(graphql): timeseries aggregateStats improvements Jul 1, 2026

github-actions Bot deployed to datahub-wheels (Preview) July 1, 2026 22:10 View deployment

vercel Bot deployed to Preview July 1, 2026 22:28 View deployment

chakru-r force-pushed the cr-timerseries-agg-batch branch from 27967ce to 0c3507d Compare July 2, 2026 09:15

github-actions Bot deployed to datahub-wheels (Preview) July 2, 2026 09:18 View deployment

chakru-r changed the title ~~perf(graphql): timeseries aggregateStats improvements~~ perf(graphql): batch-load Dashboard.statsSummary via timeseries aggregation API Jul 2, 2026

vercel Bot deployed to Preview July 2, 2026 09:30 View deployment

chakru-r force-pushed the cr-timerseries-agg-batch branch from 0c3507d to 68c3118 Compare July 2, 2026 13:23

github-actions Bot deployed to datahub-wheels (Preview) July 2, 2026 13:25 View deployment

chakru-r marked this pull request as ready for review July 2, 2026 13:27

chakru-r changed the title ~~perf(graphql): batch-load Dashboard.statsSummary via timeseries aggregation API~~ perf(graphql): batched timeseries aggregation API and Dashboard.statsSummary updates Jul 2, 2026

vercel Bot deployed to Preview July 2, 2026 13:38 View deployment

chakru-r added 2 commits July 2, 2026 19:10

additional tests

8bee90c

update snapshot

7931835

github-actions Bot deployed to datahub-wheels (Preview) July 2, 2026 14:07 View deployment

vercel Bot deployed to Preview July 2, 2026 14:24 View deployment

david-leifker reviewed Jul 2, 2026

View reviewed changes

david-leifker approved these changes Jul 2, 2026

View reviewed changes

maggiehays added the pending-submitter-merge label Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(graphql): batched timeseries aggregation API and Dashboard.statsSummary updates#18131

perf(graphql): batched timeseries aggregation API and Dashboard.statsSummary updates#18131
chakru-r wants to merge 3 commits into
masterfrom
cr-timerseries-agg-batch

chakru-r commented Jul 1, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

david-leifker Jul 2, 2026

Uh oh!

david-leifker left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

chakru-r commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Part 1 — Timeseries service & DAO: batchGetAggregatedStats

Part 2 — Dashboard.statsSummary resolver: replace per-URN fan-out with DataLoader batch

Checklist

Uh oh!

codecov Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

david-leifker Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

david-leifker left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chakru-r commented Jul 1, 2026 •

edited

Loading

codecov Bot commented Jul 1, 2026 •

edited

Loading