feat: disambiguate metadata for better scans #20245
base: main
Conversation
pkg/dataobj/metastore/iter.go (Outdated)
```go
		}
		buf[rIdx].UncompressedSize = values.Value(rIdx)
	}
case pointers.ColumnTypeColumnName:
```
Unfortunately, while this isn't supposed to get any results, looking closer I think it will, because we're missing a predicate for "index type" on this read.
The rows containing stream stats don't have the column name populated in the pointers section.
It's only populated for bloom filter rows, to identify which column the bloom filter applies to, and bloom filter rows are mutually exclusive with stream stats rows in this section. This means you will likely get results, but you'll also be reading a lot of rows that are usually only read for bloom filters.
This is a side effect of the pointers section being a wide section with many types of columns used for different purposes, and it definitely isn't clear from the code.
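As a minimal sketch of the kind of guard being described, assuming `pointers.SectionPointer` exposes the column name that the decode loop above populates (the helper name and slice handling are purely illustrative; a predicate on the index type pushed into the read itself would be the better fix):

```go
// filterStreamStatsRows is an illustrative helper, not part of the PR: it
// drops bloom-filter pointer rows (the only rows with ColumnName set) so
// that only stream-stats rows are used for label collection.
func filterStreamStatsRows(rows []pointers.SectionPointer) []pointers.SectionPointer {
	out := make([]pointers.SectionPointer, 0, len(rows))
	for _, row := range rows {
		if row.ColumnName != "" {
			// Bloom-filter pointer: ColumnName identifies which column the
			// bloom filter applies to; skip it for stream stats.
			continue
		}
		out = append(out, row)
	}
	return out
}
```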
```go
// resolve ambiguous predicates to metadata type where applicable to improve scanning
unambiguousPredicates := make([]Expression, 0, len(predicates))
for _, predicate := range predicates {
	p, changed := disambiguateExpression(predicate, metadataColumns)
```
I was originally thinking to do this via a negative check: can we decide if this is definitely not a stream label, and therefore must be metadata or parsed?
The advantage of that approach is that there are likely a lot more metadata labels than stream labels, so passing around the stream labels has lower data requirements. That said, in either case we only need to find labels actually mentioned in the query, which should be a pretty small set to start with.
Force-pushed from 303641a to bfffd19
Force-pushed from 563d0b3 to 7babef1
Pull request overview
This PR enhances query performance by collecting metadata column information during section retrieval and using it to disambiguate ambiguous column predicates, allowing them to be pushed down to the scan level for more efficient filtering.
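As a rough sketch of what the base case of that disambiguation can look like, assuming the `ColumnExpr`/`types.ColumnRef` shapes referenced in the diffs and tests below (the `Column` field name and the `ColumnTypeAmbiguous` constant are assumptions, not confirmed by this page):

```go
// disambiguateColumn is an illustrative sketch, not the PR's exact code:
// an ambiguous column reference is resolved to a label if the section's
// streams define it as a label, otherwise to structured metadata.
func disambiguateColumn(e *ColumnExpr, labels []string) (Expression, bool) {
	if e.Ref.Type != types.ColumnTypeAmbiguous {
		// Already resolved; nothing to do.
		return e, false
	}
	for _, label := range labels {
		if label == e.Ref.Column {
			return &ColumnExpr{Ref: types.ColumnRef{Column: e.Ref.Column, Type: types.ColumnTypeLabel}}, true
		}
	}
	return &ColumnExpr{Ref: types.ColumnRef{Column: e.Ref.Column, Type: types.ColumnTypeMetadata}}, true
}
```

The composite cases (`BinaryExpr`, `UnaryExpr`, `VariadicExpr`) shown in the review threads below recurse into this base case.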
Key Changes
- Added label collection to the metastore to track which label columns are present in each section
- Implemented disambiguation logic to resolve ambiguous column references to either labels or metadata based on available columns
- Enhanced predicate pushdown to apply disambiguated predicates directly to scan targets while preserving the original ambiguous predicates in filters for correctness
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/dataobj/metastore/object.go | Added Labels field to DataobjSectionDescriptor and implemented label collection logic in getSectionsForStreams to track available label columns per stream |
| pkg/engine/internal/planner/physical/catalog.go | Propagated Labels field from metastore section descriptors to DataObjSections for use in the planner |
| pkg/engine/internal/planner/physical/planner.go | Implemented disambiguateExpression to resolve ambiguous column references and push disambiguated predicates to scan targets |
| pkg/engine/internal/planner/physical/planner_test.go | Added comprehensive tests for disambiguation logic and metadata column resolution |
| pkg/engine/internal/planner/planner_test.go | Updated test expectations to verify disambiguated predicates are pushed to scan targets |
```go
return &UnaryExpr{
	Left: leftExpr,
	Op:   e.Op,
}, leftChanged
```
Copilot AI · Dec 18, 2025
The disambiguateExpression function creates a new UnaryExpr even when leftChanged is false. This creates unnecessary allocations and could impact performance when processing many predicates. Consider returning the original expression when changed is false to avoid unnecessary allocations.
Suggested change:

```diff
-return &UnaryExpr{
-	Left: leftExpr,
-	Op:   e.Op,
-}, leftChanged
+if !leftChanged {
+	return e, false
+}
+return &UnaryExpr{
+	Left: leftExpr,
+	Op:   e.Op,
+}, true
```
done
```go
newExprs := make([]Expression, len(e.Expressions))
for i, subExpr := range e.Expressions {
	exp, changed := disambiguateExpression(subExpr, labels)
	newExprs[i] = exp
	anyChanged = anyChanged || changed
}
return &VariadicExpr{
	Op:          e.Op,
	Expressions: newExprs,
}, anyChanged
```
Copilot AI · Dec 18, 2025
The disambiguateExpression function creates a new VariadicExpr even when anyChanged is false. This creates unnecessary allocations and could impact performance when processing many predicates. Consider returning the original expression when changed is false to avoid unnecessary allocations.
Suggested change:

```diff
-newExprs := make([]Expression, len(e.Expressions))
-for i, subExpr := range e.Expressions {
-	exp, changed := disambiguateExpression(subExpr, labels)
-	newExprs[i] = exp
-	anyChanged = anyChanged || changed
-}
-return &VariadicExpr{
-	Op:          e.Op,
-	Expressions: newExprs,
-}, anyChanged
+var newExprs []Expression
+for i, subExpr := range e.Expressions {
+	exp, changed := disambiguateExpression(subExpr, labels)
+	if changed {
+		if !anyChanged {
+			// First change detected: allocate new slice and copy previous unchanged expressions.
+			newExprs = make([]Expression, len(e.Expressions))
+			copy(newExprs, e.Expressions[:i])
+			anyChanged = true
+		}
+		newExprs[i] = exp
+	} else if anyChanged {
+		// Some previous expression has already changed; keep slice in sync.
+		newExprs[i] = e.Expressions[i]
+	}
+}
+if !anyChanged {
+	return e, false
+}
+return &VariadicExpr{
+	Op:          e.Op,
+	Expressions: newExprs,
+}, true
```
I don't think the added complexity here is worth it. I've made the change to not create a new expr when nothing changed, but the allocations for newExprs seem pretty negligible to me.
```go
func (d *DataobjSectionDescriptor) MergeWithLabels(pointer pointers.SectionPointer, lbls []string) {
	// Union of the incoming labels with any labels already tracked on the descriptor.
	lblMap := map[string]struct{}{}
	for _, label := range lbls {
		lblMap[label] = struct{}{}
	}
	for _, label := range d.Labels {
		lblMap[label] = struct{}{}
	}
	newLabels := make([]string, 0, len(lblMap))
	for label := range lblMap {
		newLabels = append(newLabels, label)
	}

	d.Merge(pointer)
	d.Labels = newLabels
}
```
Copilot AI · Dec 18, 2025
The MergeWithLabels method produces non-deterministic output because iterating over a map (lblMap) doesn't guarantee any order. The resulting newLabels slice will have labels in unpredictable order across different invocations, which could lead to inconsistent behavior when these labels are used for comparison or display purposes. Consider sorting the newLabels slice before assigning it to d.Labels to ensure deterministic output.
The order of the labels should never matter; all that matters is set membership for disambiguating the column type.
```go
	require.Equal(t, types.ColumnTypeMetadata, colExpr.Ref.Type)
})

t.Run("resolves ambiguous column to label when not in set", func(t *testing.T) {
```
Copilot AI · Dec 18, 2025
The test name says "resolves ambiguous column to label when not in set", but it should say "when IN set" because the test is checking that "app" (which IS in the labels set) gets resolved to ColumnTypeLabel. The current name contradicts the test behavior.
Suggested change:

```diff
-t.Run("resolves ambiguous column to label when not in set", func(t *testing.T) {
+t.Run("resolves ambiguous column to label when in set", func(t *testing.T) {
```
done
```go
case *BinaryExpr:
	leftExpr, leftChanged := disambiguateExpression(e.Left, labels)
	rightExpr, rightChanged := disambiguateExpression(e.Right, labels)
	return &BinaryExpr{
		Left:  leftExpr,
		Right: rightExpr,
		Op:    e.Op,
	}, leftChanged || rightChanged
```
Copilot AI · Dec 18, 2025
The disambiguateExpression function creates new BinaryExpr, UnaryExpr, and VariadicExpr objects even when no changes were made (i.e., when leftChanged and rightChanged are both false). This creates unnecessary allocations and could impact performance when processing many predicates. Consider returning the original expression when changed is false to avoid unnecessary allocations.
good feedback, done
Force-pushed from 14025c2 to 7f01558
```go
predicatesToPushdown := make([]Expression, 0, len(predicates))
if len(dataObj.Labels) > 0 {
	for _, predicate := range predicates {
		p, disambiguated := disambiguateExpression(predicate, dataObj.Labels)
```
I think the disambiguation has to be at the stream level, not section-wide.
Assume the section has logs from 2 streams:
- stream A with labels foo, bar
- stream B with labels bar, buzz

We would now assume these 3 are stream labels for all rows in the section. A predicate buzz=something, which now operates on labels, would filter out all rows from stream A even though stream A logs could contain buzz as structured metadata.
I feel the only safe decision we can make here, looking at the labels of the section, is:
- if a predicate column ref is not present in the labels set, consider it structured metadata
- otherwise, leave it ambiguous
> I think the disambiguation has to be at the stream level, not section-wide.

Yeah, this is a really good point that I missed. I'm worried we'd need a compound predicate, as different streams can have different labels in the same section, so we would need to look at both the stream ID and the label set.
Maybe a simpler solution is keeping the labels on the section, having it be the superset of all labels for all streams in the section, and only disambiguating if the label isn't found in any of the streams. It would mean less disambiguation but more straightforward filtering/predicate matching.
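A small sketch of that conservative rule, assuming the section descriptor carries the superset of labels across all of its streams; the function and field names are illustrative rather than the PR's final code:

```go
// resolveAgainstSectionLabels only commits to a decision when it is safe
// for every stream in the section: a column that is not a stream label
// anywhere in the section must be structured metadata (or parsed), while
// a column that is a label for at least one stream stays ambiguous.
func resolveAgainstSectionLabels(ref types.ColumnRef, sectionLabels []string) (types.ColumnRef, bool) {
	if ref.Type != types.ColumnTypeAmbiguous {
		return ref, false
	}
	for _, label := range sectionLabels {
		if label == ref.Column {
			// Some stream uses this as a label; resolving it section-wide
			// could drop rows from streams that carry it as metadata.
			return ref, false
		}
	}
	ref.Type = types.ColumnTypeMetadata
	return ref, true
}
```

Compared to resolving in both directions, this never commits an ambiguous column to being a label, which stays correct for streams that carry the same key as structured metadata, at the cost of less pushdown.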
Force-pushed from 7f01558 to 9830233
```go
reader := readerResp.Reader
defer reader.Close()
sectionsResp, err := m.CollectSections(ctx, CollectSectionsRequest{reader})
labels := readerResp.LabelsByStreamID
```
This won't work when distributed, as it relies on in-memory state to pass the LabelsByStreamID. We'll need to add this to the pointer arrow record somehow.
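One possible shape for that, sketched purely under assumptions: the schema, field names, and Arrow Go module version below are illustrative and may not match the repo's pointer record. The idea is simply to carry the per-stream labels as a list-of-string column next to the stream ID so they survive serialization between components.

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v15/arrow"
	"github.com/apache/arrow/go/v15/arrow/array"
	"github.com/apache/arrow/go/v15/arrow/memory"
)

func main() {
	// Hypothetical schema: stream ID plus the labels present for that stream.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "stream_id", Type: arrow.PrimitiveTypes.Int64},
		{Name: "labels", Type: arrow.ListOf(arrow.BinaryTypes.String)},
	}, nil)

	b := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	defer b.Release()

	ids := b.Field(0).(*array.Int64Builder)
	labels := b.Field(1).(*array.ListBuilder)
	labelValues := labels.ValueBuilder().(*array.StringBuilder)

	// One row per stream: stream 7 has labels app and env.
	ids.Append(7)
	labels.Append(true)
	labelValues.Append("app")
	labelValues.Append("env")

	rec := b.NewRecord()
	defer rec.Release()
	fmt.Println(rec)
}
```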
Commits:
- 14025c2 chore: copilot feedback (Dec 18, 2025)
- d41f091 chore: fix it (Dec 17, 2025)
- ce4705d chore: revert conditional setting of predicates (Dec 17, 2025)
- 9e11dfb fix: planner tests (Dec 17, 2025)
- b8ac21b Merge branch 'main' into twhitney/structured-metadata-push-down (Dec 17, 2025)
- 3968ffd refactor: conditioinally add predicates to pushdown (Dec 17, 2025)
- 9603815 refactor: resolve labels during section resolution (Dec 17, 2025)
- 76aeec1 refactor: move label discovery into section resolution (Dec 17, 2025)
- 7babef1 fix: only push down predicates when we get labels (Dec 17, 2025)
- 91ea17d fix: only modify metadata columns (Dec 16, 2025)
- 35b5968 fix: don't drop existing predicates (Dec 16, 2025)
- 7c42ef9 fix: fix failing tests (Dec 16, 2025)
- 5dba18c refactor: resolve labels in the planner (Dec 16, 2025)
- f65db2a chore: linting (Dec 15, 2025)
- a332ea6 feat: disambiguate metadata for better scans (Dec 15, 2025)

All commits authored by Trevor Whitney.
Force-pushed from 458a822 to 2fd2f02
What this PR does / why we need it:
This PR collects metadata columns during section retrieval in the metastore, and returns them to the planner so they can be used to disambiguate predicates on ambiguous columns. Ambiguous predicates that reference metadata are resolved and added to the ScanSet's predicates for more efficient scanning.
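As an illustrative sketch of that flow (the `scan.Predicates` field and surrounding names are assumptions based on the review threads above, not the exact code):

```go
// Illustrative sketch: when the section reported any stream labels, try to
// disambiguate each predicate and push the resolved copies down to the scan.
// The original (still ambiguous) predicates remain in the Filter node, so
// correctness does not depend on the pushdown.
predicatesToPushdown := make([]Expression, 0, len(predicates))
if len(dataObj.Labels) > 0 {
	for _, predicate := range predicates {
		if p, disambiguated := disambiguateExpression(predicate, dataObj.Labels); disambiguated {
			predicatesToPushdown = append(predicatesToPushdown, p)
		}
	}
}
scan.Predicates = append(scan.Predicates, predicatesToPushdown...)
```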
Note: this only puts the disambiguated column predicates on the ScanSet; the projection pushdown has not been changed to better handle `Select` statements. As a result, there will still be some logic in the executor that could be further disambiguated but has not been. However, that seems less important than improving scans, so it is something we can take up later if deemed worth it.

Checklist
- Reviewed the `CONTRIBUTING.md` guide (required)
- Title matches the conventional commits format; `feat` PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
- Changes that require user attention or interaction to upgrade are documented in `docs/sources/setup/upgrade/_index.md`
- If the change is deprecating or removing a configuration option, update the `deprecated-config.yaml` and `deleted-config.yaml` files respectively in the `tools/deprecated-config-checker` directory. Example PR