Skip to content

Conversation

@trevorwhitney
Copy link
Collaborator

@trevorwhitney trevorwhitney commented Dec 15, 2025

What this PR does / why we need it:

This PR collects metadata columns during section retrieval in the metastore, and returns them to the planner so they can be used to disambiguate predicates on ambiguous columns. Ambiguous predicates that reference metadata are resolved and added to the ScanSet's predicates for more efficient scanning.

Note, this only puts the disambiguated column predicates on the ScanSet, and the projection pushdown has not been changed to better handle Select statements. As a result, there will still be some logic in the executor that could be further disambiguated but has not been. However, that seems less important than improving scans, so it is something we can take up later if deemed worth it.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
    • Note that Promtail is considered to be feature complete, and future development for logs collection will be in Grafana Alloy. As such, feat PRs are unlikely to be accepted unless a case can be made for the feature actually being a bug fix to existing behavior.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

}
buf[rIdx].UncompressedSize = values.Value(rIdx)
}
case pointers.ColumnTypeColumnName:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I don't think this will get any results. This isn't supposed to get any results, but, looking closer, I think it will because we're missing a predicate for "index type" on this read

The rows containing stream stats don't have column name populated in the pointers section.
They're only populated for bloom filter rows to identify which column the bloom filter applies to, and bloom filter rows are mutually exclusive to stream stats rows in this section. This means you will likely get results but you'll also be reading a lot of rows usually only read for bloom filters.
This is a side effect of the pointers section being a wide section with many types of columns used for different purposes, and definitely isn't clear from the code

// resolve ambiguous predicates to metadata type where applicable to improve scanning
unambiguousPredicates := make([]Expression, 0, len(predicates))
for _, predicate := range predicates {
p, changed := disambiguateExpression(predicate, metadataColumns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was originally thinking to do this via a negative check: Can we decide if this is definitely not a stream label, and therefore must be metadata or parsed?

The advantage of that way is that there is likely a lot more metadata labels than stream labels, so there are lower data requirements to pass around the stream labels. That said, in either case we only need to find labels actually mentioned in the query which should be a pretty small set to start with.

@trevorwhitney trevorwhitney force-pushed the twhitney/structured-metadata-push-down branch from 303641a to bfffd19 Compare December 16, 2025 18:02
@trevorwhitney trevorwhitney force-pushed the twhitney/structured-metadata-push-down branch from 563d0b3 to 7babef1 Compare December 17, 2025 20:00
@trevorwhitney trevorwhitney marked this pull request as ready for review December 18, 2025 00:27
@trevorwhitney trevorwhitney requested a review from a team as a code owner December 18, 2025 00:27
Copilot AI review requested due to automatic review settings December 18, 2025 00:27
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances query performance by collecting metadata column information during section retrieval and using it to disambiguate ambiguous column predicates, allowing them to be pushed down to the scan level for more efficient filtering.

Key Changes

  • Added label collection to the metastore to track which label columns are present in each section
  • Implemented disambiguation logic to resolve ambiguous column references to either labels or metadata based on available columns
  • Enhanced predicate pushdown to apply disambiguated predicates directly to scan targets while preserving the original ambiguous predicates in filters for correctness

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pkg/dataobj/metastore/object.go Added Labels field to DataobjSectionDescriptor and implemented label collection logic in getSectionsForStreams to track available label columns per stream
pkg/engine/internal/planner/physical/catalog.go Propagated Labels field from metastore section descriptors to DataObjSections for use in the planner
pkg/engine/internal/planner/physical/planner.go Implemented disambiguateExpression to resolve ambiguous column references and push disambiguated predicates to scan targets
pkg/engine/internal/planner/physical/planner_test.go Added comprehensive tests for disambiguation logic and metadata column resolution
pkg/engine/internal/planner/planner_test.go Updated test expectations to verify disambiguated predicates are pushed to scan targets

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 696 to 699
return &UnaryExpr{
Left: leftExpr,
Op: e.Op,
}, leftChanged
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The disambiguateExpression function creates a new UnaryExpr even when leftChanged is false. This creates unnecessary allocations and could impact performance when processing many predicates. Consider returning the original expression when changed is false to avoid unnecessary allocations.

Suggested change
return &UnaryExpr{
Left: leftExpr,
Op: e.Op,
}, leftChanged
if !leftChanged {
return e, false
}
return &UnaryExpr{
Left: leftExpr,
Op: e.Op,
}, true

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 702 to 711
newExprs := make([]Expression, len(e.Expressions))
for i, subExpr := range e.Expressions {
exp, changed := disambiguateExpression(subExpr, labels)
newExprs[i] = exp
anyChanged = anyChanged || changed
}
return &VariadicExpr{
Op: e.Op,
Expressions: newExprs,
}, anyChanged
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The disambiguateExpression function creates a new VariadicExpr even when anyChanged is false. This creates unnecessary allocations and could impact performance when processing many predicates. Consider returning the original expression when changed is false to avoid unnecessary allocations.

Suggested change
newExprs := make([]Expression, len(e.Expressions))
for i, subExpr := range e.Expressions {
exp, changed := disambiguateExpression(subExpr, labels)
newExprs[i] = exp
anyChanged = anyChanged || changed
}
return &VariadicExpr{
Op: e.Op,
Expressions: newExprs,
}, anyChanged
var newExprs []Expression
for i, subExpr := range e.Expressions {
exp, changed := disambiguateExpression(subExpr, labels)
if changed {
if !anyChanged {
// First change detected: allocate new slice and copy previous unchanged expressions.
newExprs = make([]Expression, len(e.Expressions))
copy(newExprs, e.Expressions[:i])
anyChanged = true
}
newExprs[i] = exp
} else if anyChanged {
// Some previous expression has already changed; keep slice in sync.
newExprs[i] = e.Expressions[i]
}
}
if !anyChanged {
return e, false
}
return &VariadicExpr{
Op: e.Op,
Expressions: newExprs,
}, true

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the added complexity here is worth it, I've made the change to not create a new expr when nothing changed, but the allocations for the newExprs seems pretty negligible to me.

Comment on lines +96 to +114
func (d *DataobjSectionDescriptor) MergeWithLabels(pointer pointers.SectionPointer, lbls []string) {
lblMap := map[string]struct{}{}
for _, label := range lbls {
lblMap[label] = struct{}{}
}
for _, label := range d.Labels {
lblMap[label] = struct{}{}
}
newLabels := make([]string, 0, len(lblMap))
for label := range lblMap {
newLabels = append(newLabels, label)
}

d.Merge(pointer)
d.Labels = newLabels
}
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MergeWithLabels method produces non-deterministic output because iterating over a map (lblMap) doesn't guarantee any order. The resulting newLabels slice will have labels in unpredictable order across different invocations, which could lead to inconsistent behavior when these labels are used for comparison or display purposes. Consider sorting the newLabels slice before assigning it to d.Labels to ensure deterministic output.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

order of the labels should never matter, all that matters is set membership for disambiguating column type.

require.Equal(t, types.ColumnTypeMetadata, colExpr.Ref.Type)
})

t.Run("resolves ambiguous column to label when not in set", func(t *testing.T) {
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test name says "resolves ambiguous column to label when not in set", but it should say "when IN set" because the test is checking that "app" (which IS in the labels set) gets resolved to ColumnTypeLabel. The current name contradicts the test behavior.

Suggested change
t.Run("resolves ambiguous column to label when not in set", func(t *testing.T) {
t.Run("resolves ambiguous column to label when in set", func(t *testing.T) {

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 667 to 674
case *BinaryExpr:
leftExpr, leftChanged := disambiguateExpression(e.Left, labels)
rightExpr, rightChanged := disambiguateExpression(e.Right, labels)
return &BinaryExpr{
Left: leftExpr,
Right: rightExpr,
Op: e.Op,
}, leftChanged || rightChanged
Copy link

Copilot AI Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The disambiguateExpression function creates new BinaryExpr, UnaryExpr, and VariadicExpr objects even when no changes were made (i.e., when leftChanged and rightChanged are both false). This creates unnecessary allocations and could impact performance when processing many predicates. Consider returning the original expression when changed is false to avoid unnecessary allocations.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good feedback, done

@trevorwhitney trevorwhitney force-pushed the twhitney/structured-metadata-push-down branch from 14025c2 to 7f01558 Compare December 18, 2025 13:56
predicatesToPushdown := make([]Expression, 0, len(predicates))
if len(dataObj.Labels) > 0 {
for _, predicate := range predicates {
p, disambiguated := disambiguateExpression(predicate, dataObj.Labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the disambiguity has to be at a stream-level, not section-wide.

assume the section has logs from 2 streams:
stream A with labels foo, bar
stream B with labels bar, buzz

we now assume these 3 are stream labels for all rows in the section. predicate buzz=something which now operates on labels would filter out all rows from stream A even though Stream A logs could contain buzz as structured metadata

Copy link
Contributor

@ashwanthgoli ashwanthgoli Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i feel the only safe decision we can make here looking at labels of the section is:

  • if a predicate column ref is not present in the labels set, consider it structured metadata
  • otherwise leave it ambiguous

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the disambiguity has to be at a stream-level, not section-wide.

yeah, this is a really good point that I missed. I'm worried about need a compound predicate, as different streams can have different labels in the same section, so we need to look at the stream ID and the label set.

maybe a simpler solution is keeping the labels on the section, having it be the super set of all labels for all streams in the section, and only disambiguating if the label isn't found in any of the streams. it would mean less disambiguation but more straight forward filtering/predicate matching.

@trevorwhitney trevorwhitney marked this pull request as draft December 18, 2025 19:01
@trevorwhitney trevorwhitney force-pushed the twhitney/structured-metadata-push-down branch from 7f01558 to 9830233 Compare December 18, 2025 23:40
reader := readerResp.Reader
defer reader.Close()
sectionsResp, err := m.CollectSections(ctx, CollectSectionsRequest{reader})
labels := readerResp.LabelsByStreamID
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this won't work when distributed as it's relying on memory to pass the LabelsByStreamID. will need to add this to the pointer arrow record somehow

commit 14025c2
Author: Trevor Whitney <[email protected]>
Date:   Thu Dec 18 06:46:26 2025 -0700

    chore: copilot feedback

commit d41f091
Author: Trevor Whitney <[email protected]>
Date:   Wed Dec 17 17:26:53 2025 -0700

    chore: fix it

commit ce4705d
Author: Trevor Whitney <[email protected]>
Date:   Wed Dec 17 17:21:51 2025 -0700

    chore: revert conditional setting of predicates

commit 9e11dfb
Author: Trevor Whitney <[email protected]>
Date:   Wed Dec 17 17:14:56 2025 -0700

    fix: planner tests

commit b8ac21b
Merge: 3968ffd 0460481
Author: Trevor Whitney <[email protected]>
Date:   Wed Dec 17 16:41:58 2025 -0700

    Merge branch 'main' into twhitney/structured-metadata-push-down

commit 3968ffd
Author: Trevor Whitney <[email protected]>
Date:   Wed Dec 17 16:41:30 2025 -0700

    refactor: conditioinally add predicates to pushdown

commit 9603815
Author: Trevor Whitney <[email protected]>
Date:   Wed Dec 17 16:36:28 2025 -0700

    refactor: resolve labels during section resolution

commit 76aeec1
Author: Trevor Whitney <[email protected]>
Date:   Wed Dec 17 14:29:19 2025 -0700

    refactor: move label discovery into section resolution

commit 7babef1
Author: Trevor Whitney <[email protected]>
Date:   Wed Dec 17 12:22:14 2025 -0700

    fix: only push down predicates when we get labels

commit 91ea17d
Author: Trevor Whitney <[email protected]>
Date:   Tue Dec 16 16:58:26 2025 -0700

    fix: only modify metadata columns

commit 35b5968
Author: Trevor Whitney <[email protected]>
Date:   Tue Dec 16 16:54:08 2025 -0700

    fix: don't drop existing predicates

commit 7c42ef9
Author: Trevor Whitney <[email protected]>
Date:   Tue Dec 16 14:54:37 2025 -0700

    fix: fix failing tests

commit 5dba18c
Author: Trevor Whitney <[email protected]>
Date:   Tue Dec 16 14:25:39 2025 -0700

    refactor: resolve labels in the planner

commit f65db2a
Author: Trevor Whitney <[email protected]>
Date:   Mon Dec 15 16:39:31 2025 -0700

    chore: linting

commit a332ea6
Author: Trevor Whitney <[email protected]>
Date:   Mon Dec 15 16:22:04 2025 -0700

    feat: disambiguate metadata for better scans
@trevorwhitney trevorwhitney force-pushed the twhitney/structured-metadata-push-down branch from 458a822 to 2fd2f02 Compare January 5, 2026 18:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants