Description
When executing aggregation queries with GROUP BY, the order of columns in the result set does not match the order specified in the SELECT clause. The behavior depends on whether the aggregation has an explicit alias or not, but in both cases the result order is wrong.
Examples
Case 1: Aggregation without alias — buckets appear first
SELECT count(*), category FROM ecommerce GROUP BY category
Expected: count_* | category
Actual: category | count_*
The GROUP BY bucket column is placed first, ignoring the SELECT order.
Case 2: Aggregation with alias — aggregation appears first
SELECT category, count(*) as cnt FROM ecommerce GROUP BY category
Expected: category | cnt
Actual: cnt | category
The aliased aggregation is placed first, ignoring the SELECT order.
Case 3: Multiple columns
SELECT country, count(*) as cnt, category FROM ecommerce GROUP BY country, category
Expected: country | cnt | category
Actual: order is unpredictable
Expected Behaviour
Result columns MUST appear in the exact order specified in the SELECT clause, regardless of whether columns are bucket columns (GROUP BY), aliased aggregations, or unaliased aggregations. This is mandated by the SQL standard.
Impact
- JDBC/Arrow clients: consumers accessing columns by index get unexpected data
- User expectations: the SQL standard guarantees SELECT order is preserved
- Downstream tools: Superset, DuckDB, and other tools assume column order matches the query
Resolution
Fixed as a side-effect of Issue #41 (fieldsWithComputedAliases). The root cause was a mismatch between the names used in ES aggregations and the names returned by extractOutputFieldNames, which prevented normalizeRow from correctly reordering columns.
Root cause
normalizeRow(row, requestedFields) reorders row entries to match requestedFields (SELECT clause order). However, for unnamed expression columns (e.g., COUNT(*) without AS), the ES aggregation was named count_* (via SQLAggregation.fromField) while extractOutputFieldNames returned a different name (e.g., * via sourceField). This mismatch meant normalizeRow couldn't find the field in the row, resulting in wrong column order.
How Issue #41 fixes this
fieldsWithComputedAliases assigns consistent positional aliases (__c1, __c2, ...) to expression columns without explicit aliases. This ensures:
- ES aggregation name =
__c1 (from SQLAggregation.fromField using fieldAlias)
extractOutputFieldNames returns __c1 (from fieldsWithComputedAliases)
normalizeRow matches __c1 in the row → reorders correctly to SELECT clause order
For explicitly aliased columns (Case 2), the naming was already consistent — the alias is used both in the ES aggregation and in extractOutputFieldNames. normalizeRow enforces the correct order in all cases.
Tests
- Integration tests:
ReplGatewayIntegrationSpec — 3 new tests verifying column order:
- Aggregation first, bucket second (
SELECT COUNT(*), profile.city)
- Bucket first, aliased aggregation second (
SELECT profile.city, COUNT(*) AS cnt)
- Mixed: bucket, aliased aggregation, unnamed aggregation (
SELECT profile.city, COUNT(*) AS cnt, AVG(age))
Description
When executing aggregation queries with
GROUP BY, the order of columns in the result set does not match the order specified in theSELECTclause. The behavior depends on whether the aggregation has an explicit alias or not, but in both cases the result order is wrong.Examples
Case 1: Aggregation without alias — buckets appear first
Expected:
count_*|categoryActual:
category|count_*The GROUP BY bucket column is placed first, ignoring the SELECT order.
Case 2: Aggregation with alias — aggregation appears first
Expected:
category|cntActual:
cnt|categoryThe aliased aggregation is placed first, ignoring the SELECT order.
Case 3: Multiple columns
Expected:
country|cnt|categoryActual: order is unpredictable
Expected Behaviour
Result columns MUST appear in the exact order specified in the SELECT clause, regardless of whether columns are bucket columns (GROUP BY), aliased aggregations, or unaliased aggregations. This is mandated by the SQL standard.
Impact
Resolution
Fixed as a side-effect of Issue #41 (
fieldsWithComputedAliases). The root cause was a mismatch between the names used in ES aggregations and the names returned byextractOutputFieldNames, which preventednormalizeRowfrom correctly reordering columns.Root cause
normalizeRow(row, requestedFields)reorders row entries to matchrequestedFields(SELECT clause order). However, for unnamed expression columns (e.g.,COUNT(*)withoutAS), the ES aggregation was namedcount_*(viaSQLAggregation.fromField) whileextractOutputFieldNamesreturned a different name (e.g.,*viasourceField). This mismatch meantnormalizeRowcouldn't find the field in the row, resulting in wrong column order.How Issue #41 fixes this
fieldsWithComputedAliasesassigns consistent positional aliases (__c1,__c2, ...) to expression columns without explicit aliases. This ensures:__c1(fromSQLAggregation.fromFieldusingfieldAlias)extractOutputFieldNamesreturns__c1(fromfieldsWithComputedAliases)normalizeRowmatches__c1in the row → reorders correctly to SELECT clause orderFor explicitly aliased columns (Case 2), the naming was already consistent — the alias is used both in the ES aggregation and in
extractOutputFieldNames.normalizeRowenforces the correct order in all cases.Tests
ReplGatewayIntegrationSpec— 3 new tests verifying column order:SELECT COUNT(*), profile.city)SELECT profile.city, COUNT(*) AS cnt)SELECT profile.city, COUNT(*) AS cnt, AVG(age))