Feat: Add StarRocks engine support #5658

Open
jaogoy wants to merge 15 commits into SQLMesh:main from jaogoy:feat.support_sr

jaogoy commented Jan 13, 2026

What

  • Add StarRocks engine support to SQLMesh via StarRocks’ MySQL-compatible protocol.
  • Ship engine adapter + docs + real integration tests to ensure generated SQL works on StarRocks.

Why

  • User demand / adoption: StarRocks is a common OLAP choice; SQLMesh users want to run the same model lifecycle (build, incremental maintenance, views/MVs) on StarRocks without bespoke SQL.
  • Engine-specific semantics: StarRocks differs from vanilla MySQL in DDL/DML constraints (e.g., key types, delete behavior, rename caveats). An adapter is needed to produce correct and predictable SQL.
  • Confidence & maintainability: Documenting config patterns + codifying behavior with integration tests prevents regressions and makes support “real” (not just “it parses”).

Scope (what’s supported)

  • Connectivity: Connect through MySQL protocol (e.g., pymysql).
  • Table creation / DDL:
    • Key table types via physical_properties: DUPLICATE KEY (default), PRIMARY KEY (recommended for incremental), UNIQUE KEY
    • Partitioning: simple partitioned_by and advanced partition_by (complex expression partitioning) + optional initial partitions
    • Distribution: distributed_by structured form or string fallback (HASH / RANDOM; buckets required)
    • Ordering: order_by / clustered_by
    • Generic PROPERTIES passthrough (string key/value)
  • Views:
    • Regular views
    • Materialized views via kind VIEW(materialized true) with StarRocks-specific notes/constraints
  • DML / maintenance:
    • Insert/select/update basics
    • Delete behavior handled with StarRocks compatibility constraints (PRIMARY KEY tables recommended for robust deletes)
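
As a rough illustration of how the table properties listed above could map onto StarRocks DDL clauses, here is a minimal sketch. The helper `render_table_clauses` and its parameters are hypothetical and are not the adapter's API; only the clause shapes (key clause, DISTRIBUTED BY ... BUCKETS, ORDER BY, PROPERTIES) follow StarRocks' CREATE TABLE syntax. Requiring a bucket count mirrors the "buckets required" note in this PR.

```python
from typing import Dict, List, Optional, Sequence

KEY_TYPES = {"DUPLICATE", "PRIMARY", "UNIQUE"}


def render_table_clauses(
    key_type: str,
    key_columns: Sequence[str],
    buckets: int,
    hash_columns: Optional[Sequence[str]] = None,
    order_by: Optional[Sequence[str]] = None,
    properties: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Hypothetical helper: build the clauses that follow the column list."""
    key_type = key_type.upper()
    if key_type not in KEY_TYPES:
        raise ValueError(f"unsupported key type: {key_type}")
    if buckets <= 0:
        raise ValueError("buckets must be a positive integer")

    clauses = [f"{key_type} KEY ({', '.join(key_columns)})"]
    # HASH distribution needs columns; RANDOM distribution takes none.
    if hash_columns:
        clauses.append(
            f"DISTRIBUTED BY HASH ({', '.join(hash_columns)}) BUCKETS {buckets}"
        )
    else:
        clauses.append(f"DISTRIBUTED BY RANDOM BUCKETS {buckets}")
    if order_by:
        clauses.append(f"ORDER BY ({', '.join(order_by)})")
    if properties:
        rendered = ", ".join(f'"{k}" = "{v}"' for k, v in properties.items())
        clauses.append(f"PROPERTIES ({rendered})")
    return clauses
```

For example, `render_table_clauses("primary", ["id"], 8, hash_columns=["id"])` yields a PRIMARY KEY clause followed by a hash distribution over 8 buckets.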

Changes

  • Engine adapter: sqlmesh/core/engine_adapter/starrocks.py
  • Docs: docs/integrations/engines/starrocks.md
  • Integration tests: tests/core/engine_adapter/integration/test_integration_starrocks.py, and tests/core/engine_adapter/test_starrocks.py

Verification

  • Integration tests require a running StarRocks instance.
  • Ran:
    • set STARROCKS_HOST/PORT/USER/PASSWORD
    • pytest -m "starrocks and docker" tests/core/engine_adapter/integration/test_integration_starrocks.py
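
Before running the integration tests, it can help to confirm that the instance is reachable over the MySQL protocol. The sketch below is an assumption-laden example, not part of the PR: it expands the shorthand above into `STARROCKS_HOST`, `STARROCKS_PORT`, `STARROCKS_USER` and `STARROCKS_PASSWORD`, and the fallback values (local host, port 9030, user `root`, empty password) are guesses at a typical local setup. The helper names are hypothetical.

```python
import os
from typing import Any, Dict, Mapping


def starrocks_connect_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build pymysql.connect() keyword arguments from STARROCKS_* variables."""
    return {
        "host": env.get("STARROCKS_HOST", "127.0.0.1"),
        # 9030 is the default MySQL-protocol query port of a StarRocks FE.
        "port": int(env.get("STARROCKS_PORT", "9030")),
        "user": env.get("STARROCKS_USER", "root"),
        "password": env.get("STARROCKS_PASSWORD", ""),
    }


def smoke_test() -> Any:
    """Run SELECT 1 against the configured instance (needs pymysql installed)."""
    import pymysql  # imported lazily so the helper above has no hard dependency

    conn = pymysql.connect(**starrocks_connect_kwargs())
    try:
        with conn.cursor() as cursor:
            cursor.execute("SELECT 1")
            return cursor.fetchone()
    finally:
        conn.close()
```

Calling `smoke_test()` with the variables exported should return a single row if the connection works.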

Acknowledgement

This implementation was largely inspired by #5033 — thanks to @xinge-ji for the solid groundwork.

### Known limitations / caveats

- **No sync MV support (currently)**
- **No tuple IN**: `(c1, c2) IN ((v1, v2), ...)`
- **No `SELECT ... FOR UPDATE`**
- **RENAME caveat**: rename target can’t be qualified with a database
name
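
A common workaround for the missing tuple `IN` is to expand it into an OR of per-row AND conditions. The sketch below illustrates that general technique only; it is not taken from the adapter, and `expand_tuple_in` is a hypothetical name. It builds SQL text for simple numeric and string literals and does not quote column names.

```python
from typing import Sequence, Union

Scalar = Union[int, float, str]


def _render_literal(value: Scalar) -> str:
    """Render a number as-is and a string as a single-quoted SQL literal."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def expand_tuple_in(columns: Sequence[str], rows: Sequence[Sequence[Scalar]]) -> str:
    """Rewrite (c1, c2) IN ((v1, v2), ...) as (c1 = v1 AND c2 = v2) OR ..."""
    if not rows:
        # An empty IN list matches nothing.
        return "FALSE"
    groups = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("each row must have one value per column")
        parts = [f"{col} = {_render_literal(val)}" for col, val in zip(columns, row)]
        groups.append("(" + " AND ".join(parts) + ")")
    return " OR ".join(groups)
```

The expansion grows linearly with the number of rows, so for large value lists a join against a temporary table may be a better fit.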

### Notes on compatibility

- **Changes are StarRocks-scoped** (adapter/docs/tests) and should not
impact other engines.

Signed-off-by: jaogoy <jaogoy@gmail.com>

CLAassistant commented Jan 13, 2026

CLA assistant check
All committers have signed the CLA.

jaogoy commented Jan 13, 2026

@erindru Hi Erin, would you be willing to review this PR? It is similar to #5033, but adds StarRocks support to SQLMesh.

I'd be very glad to see your comments.

I'm trying to fix the CI problem and some test cases.
Also, I have a question: regarding the dependency on tobymao/sqlglot#6737, do I need to update the pinned sqlglot version once the PR is merged?
(The previous PR, tobymao/sqlglot#6827, was closed by mistake.)

jaogoy added 10 commits January 13, 2026 17:15
And optimize some test cases.

Signed-off-by: jaogoy <jaogoy@gmail.com>
Signed-off-by: jaogoy <jaogoy@gmail.com>

jaogoy commented Jan 30, 2026

> @erindru Hi Erin, would you be willing to review this PR? It is similar to #5033, but adds StarRocks support to SQLMesh.
>
> I'd be very glad to see your comments.
>
> I'm trying to fix the CI problem and some test cases. Also, I have a question: regarding the dependency on tobymao/sqlglot#6737, do I need to update the pinned sqlglot version once the PR is merged? (The previous PR, tobymao/sqlglot#6827, was closed by mistake.)

@erindru Hi Erin, tobymao/sqlglot#6827 in SQLGlot has been merged.
Do I need to wait for a new SQLGlot version to be published?
Now that it's merged, what do I need to do? Update the sqlglot dependency version?

Signed-off-by: jaogoy <jaogoy@gmail.com>
@StuffbyYuki
Collaborator

@erindru it would be awesome if we could get your final look at this!

@StuffbyYuki
Collaborator

@jaogoy Can you take a look at the conflicts that might need to be resolved? Also, it looks like the engine_starrocks test failed.

jukiewiczm commented May 28, 2026

@jaogoy
I'm interested in having this in SQLMesh and might contribute to it in the future if you need any help.

I have a question about async materialized views though and this part of the docs in the PR

> If you create materialized views with replace=true, SQLMesh may drop and recreate the MV. When an MV is dropped, its data is removed and the MV must be refreshed/rebuilt again.

I actually don't want this to happen, and currently SQLMesh does indeed drop and recreate the MV on every sqlmesh run. How do I prevent that from happening?

EDIT:
Ok, I think I see where this is coming from. The problem is, setting something like this is currently not possible:

kind VIEW (
    materialized TRUE,
    replace FALSE
)

EDIT 2:
Looking at the code, on the other hand, I'm not exactly sure why it recreates the MV during a run. My guess is that this line is responsible:
https://github.com/jaogoy/sqlmesh/blob/f6e106ef51f627e509347f0d427302081ec498d2/sqlmesh/core/engine_adapter/starrocks.py#L2268

@jukiewiczm

I found another issue with async materialized views (I haven't checked other model types), related to audits.
For audits to work, data needs to exist in the materialized view. Neither IMMEDIATE nor DEFERRED guarantees this, though: the refresh might be triggered immediately, but it is still an async job.

It could potentially be achieved with some pre/post statement macros (once this PR points to the newer sqlglot version, because right now it fails), though that would be a bit inconvenient.
My proposal for handling this would be:
If the materialized view has audits, fall back to DEFERRED refresh, but emit a REFRESH MATERIALIZED VIEW view_name WITH SYNC MODE before the audits run. Document this accordingly and/or emit a warning.
Optionally, add another option such as "emit_sync_refresh", and fail if it's false and there are audits, or if it's true and the refresh_moment is IMMEDIATE (so people are not surprised).
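
The optional variant of this proposal could be sketched roughly as follows. This is an illustration of the suggested rules only, not code from the PR: `pre_audit_statements` and its parameters are hypothetical, and `emit_sync_refresh` is the option name floated above rather than an existing setting.

```python
from typing import List


def pre_audit_statements(
    view_name: str,
    refresh_moment: str,
    has_audits: bool,
    emit_sync_refresh: bool,
) -> List[str]:
    """Return the statements to run before audits, or raise on a bad combination."""
    moment = refresh_moment.upper()
    if moment not in ("IMMEDIATE", "DEFERRED"):
        raise ValueError(f"unknown refresh_moment: {refresh_moment}")
    # A sync refresh on top of an IMMEDIATE refresh would surprise users.
    if emit_sync_refresh and moment == "IMMEDIATE":
        raise ValueError("emit_sync_refresh requires refresh_moment DEFERRED")
    # Audits need data, which only a blocking refresh guarantees here.
    if has_audits and not emit_sync_refresh:
        raise ValueError("audits on an async materialized view need emit_sync_refresh")
    if not emit_sync_refresh:
        return []
    return [f"REFRESH MATERIALIZED VIEW {view_name} WITH SYNC MODE"]
```

With `emit_sync_refresh` enabled and a DEFERRED view, the function returns the single blocking refresh statement that would run before the audits.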

jaogoy commented May 29, 2026

OK, I'll take some time later to get the test cases passing.

As for having the MV emit REFRESH MATERIALIZED VIEW view_name WITH SYNC MODE, I don't think that's a good approach. This command is also asynchronous: the job runs in the background, so the data is still being filled when the audits run. Is there a way to trigger the audits only once a flag is set, for example one that is set after the MV job finishes? Otherwise, we need to tell users about this problem when they use audits on MVs.

@jukiewiczm

> OK, I'll take some time later to get the test cases passing.
>
> As for having the MV emit REFRESH MATERIALIZED VIEW view_name WITH SYNC MODE, I don't think that's a good approach. This command is also asynchronous: the job runs in the background, so the data is still being filled when the audits run. Is there a way to trigger the audits only once a flag is set, for example one that is set after the MV job finishes? Otherwise, we need to tell users about this problem when they use audits on MVs.

Are you sure? I'm pretty sure the WITH SYNC MODE part of REFRESH MATERIALIZED VIEW is built for exactly this purpose: it blocks until the data is loaded.

https://docs.starrocks.io/docs/sql-reference/sql-statements/materialized_view/REFRESH_MATERIALIZED_VIEW/#parameters

> SYNC indicates making a synchronous call of the refresh task, and StarRocks returns the task result only when the task succeeds or fails

@jukiewiczm

There's one more thing that's currently slightly inconvenient and that I believe should be handled directly in the engine.
I'm talking about the excluded_trigger_tables and excluded_refresh_tables options. They're currently treated like any other field, which is problematic if you intend to reference another SQLMesh model there: SQLMesh exposes everything through regular views, so you'd get an "xxx is not the base table of this view" error.
I have temporarily solved it with:

physical_properties (
    refresh_scheme = 'ASYNC',
    excluded_trigger_tables = @resolve_physical(starrocks.test_1_model),
    excluded_refresh_tables = @resolve_physical(starrocks.test_1_model)
)

but IMO those should be resolved in the engine.

Here's the code of the macro:

import typing as t

from sqlglot import exp
from sqlmesh import macro


@macro()
def resolve_physical(evaluator, *models: exp.Expression) -> exp.Expression:
    """Emit a single-quoted, comma-separated list of physical table names."""
    names: t.List[str] = []

    for model in models:
        snapshot = evaluator.get_snapshot(model)

        if snapshot is not None:
            # table_name() is the fully-qualified, quoted physical name, e.g.
            #   "catalog"."sqlmesh__starrocks"."starrocks__test_1_model__1455206902"
            table = exp.to_table(snapshot.table_name())
            # StarRocks wants db.table only: drop the catalog and the quotes.
            names.append(f"{table.db}.{table.name}" if table.db else table.name)
        else:
            # Not a SQLMesh-managed model (e.g. a raw source) -> keep as written.
            # sql() does not quote identifiers by default (identify=False), so
            # no quotes end up in the property string.
            names.append(model.sql(dialect=evaluator.dialect))

    # Returning a string literal makes the property render as '...': a quoted
    # value, exactly like a hand-written excluded_trigger_tables string.
    return exp.Literal.string(",".join(names))

@jukiewiczm

One last question for you, @jaogoy:
Would you mind sharing what your availability is for working on this, and whether you plan to get it done anytime soon? As far as I can see, you're a StarRocks engineer, so I would rather not get in your way while you're working on it.

On the other hand, I'm eager to get it merged soon, ideally within a week, so if you lack the capacity to work on it, I might contribute to your PR.
