Tilt local env: Expand the set of tests that can run locally by adding Toxiproxy service and separating PG connections in tests #4156

Merged
pfcoperez merged 23 commits into main from DBI-640/local-dev-env/e2e/toxy+split_pgcon
Apr 13, 2026

@pfcoperez
Contributor

@pfcoperez pfcoperez commented Apr 8, 2026

This PR:

  1. Adds Toxiproxy as a Tilt resource, optionally initialized like the other ancillary services.
  2. Separates PG access in several tests that assumed the PeerDB catalog database and the source database were the same.
  3. Changes error matching from == to errors.Is().
  4. Cleans up the Toxiproxy configuration when the tests using it complete.
  5. Sets PEERDB_VERSION_SHA_SHORT to a non-empty string in the Tilt environment.
  6. Improves reliability of TestResyncFailed by:
    a. Cancelling the pipeline before tearing down PSQL.
    b. Limiting the backend PID kill to the PeerDB application, since the goal is to test how the pipeline fails without affecting the client pool on the test-code side.
    c. Separating the application used by the catalog-check connection from the one being killed.
  7. Adds ssh as a Tilt resource, optionally initialized like the other ancillary services.

This makes the following e2e tests pass locally:

  • env -f ../.env go test ./e2e/... -v -test.run TestApiPg
  • env -f ../.env go test ./e2e/... -v -test.run TestApiMy
  • env -f ../.env go test ./e2e/... -v -test.run TestApiMongo
  • env -f ../.env go test ./e2e/... -v -test.run ^TestPeerFlowE2ETestSuitePG_CH$
  • env -f ../.env go test ./e2e/... -v -test.run ^TestPeerFlowE2ETestSuiteMySQL_CH$
  • env -f ../.env go test ./e2e/... -v -test.run TestGenericCH_PG
  • env -f ../.env go test ./e2e/... -v -test.run TestGenericCH_MySQL
  • env -f ../.env go test ./e2e/... -v -test.run TestMongoClickhouseSuite
  • env -f ../.env go test ./e2e/... -v -test.run TestSwitchboardMongo
  • env -f ../.env go test ./e2e/... -v -test.run TestSwitchboardMySQL
  • env -f ../.env go test ./e2e/... -v -test.run TestSwitchboardPostgres
  • env -f ../.env go test ./connectors/postgres/... -v
  • env -f ../.env go test ./connectors/clickhouse/... -v -> ⚠️ Passing all but:
    • TestIAMRoleCanIssueSelectFromS3: this test uses AWS access credentials and I don't think it matches MinIO features, so it is skipped when FLOW_TESTS_AWS_S3_BUCKET_NAME is not present.
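
The skip condition for TestIAMRoleCanIssueSelectFromS3 could look roughly like the sketch below. Only the variable name FLOW_TESTS_AWS_S3_BUCKET_NAME comes from this PR; the helper skipReason and its injected lookup function are assumptions made so the check can be run outside a test binary.

```go
package main

import (
	"fmt"
	"os"
)

// bucketEnvVar is the variable named in the PR description.
const bucketEnvVar = "FLOW_TESTS_AWS_S3_BUCKET_NAME"

// skipReason (hypothetical helper) returns a non-empty message when the
// S3-backed test should be skipped. The lookup function is injected so the
// check can be exercised without touching the real environment.
func skipReason(lookup func(string) (string, bool)) string {
	if v, ok := lookup(bucketEnvVar); !ok || v == "" {
		return bucketEnvVar + " is not set; skipping test that needs real AWS credentials"
	}
	return ""
}

func main() {
	// Inside a real test this would read:
	//   if r := skipReason(os.LookupEnv); r != "" { t.Skip(r) }
	if r := skipReason(os.LookupEnv); r != "" {
		fmt.Println("SKIP:", r)
		return
	}
	fmt.Println("RUN: bucket configured")
}
```

Skipping rather than failing keeps the local run green while leaving the test active wherever the bucket variable is provided.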

📝 This PR doesn't address env var injection with godotenv, which would make env -f unnecessary. It also doesn't extend the "Run through Tilt UI clicks" addition to all the supported cases. I am adding both in a follow-up PR.

Closes: https://linear.app/clickhouse/issue/DBI-640

@codecov

codecov bot commented Apr 8, 2026

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 2152 | 2 | 2150 | 196 |
View the top 2 failed test(s) by shortest run time
github.com/PeerDB-io/peerdb/flow/e2e::TestPeerFlowE2ETestSuitePG_CH_Cluster
Stack Traces | 0.01s run time
=== RUN   TestPeerFlowE2ETestSuitePG_CH_Cluster
=== PAUSE TestPeerFlowE2ETestSuitePG_CH_Cluster
=== CONT  TestPeerFlowE2ETestSuitePG_CH_Cluster
--- FAIL: TestPeerFlowE2ETestSuitePG_CH_Cluster (0.01s)
github.com/PeerDB-io/peerdb/flow/e2e::TestPeerFlowE2ETestSuitePG_CH_Cluster/Test_Binary_Format_Raw
Stack Traces | 140s run time
=== RUN   TestPeerFlowE2ETestSuitePG_CH_Cluster/Test_Binary_Format_Raw
=== PAUSE TestPeerFlowE2ETestSuitePG_CH_Cluster/Test_Binary_Format_Raw
=== CONT  TestPeerFlowE2ETestSuitePG_CH_Cluster/Test_Binary_Format_Raw
2026/04/13 14:49:46 INFO Received AWS credentials from peer for connector: ci x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/04/13 14:49:47 INFO Received AWS credentials from peer for connector: clickhouse x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/04/13 14:49:47 INFO Executing and processing query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id"
2026/04/13 14:49:47 INFO Executing and processing query stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id"
2026/04/13 14:49:47 INFO [pg_query_executor] declared cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursorQuery="DECLARE peerdb_cursor_8232150594539241952 CURSOR FOR SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id" args=[]
2026/04/13 14:49:47 INFO [pg_query_executor] fetching rows start x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id" channelLen=0
2026/04/13 14:49:47 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_8232150594539241952
2026/04/13 14:49:47 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_8232150594539241952 records=2 bytes=24 channelLen=1
2026/04/13 14:49:47 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id" rows=2 bytes=24 channelLen=1
2026/04/13 14:49:47 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_8232150594539241952
2026/04/13 14:49:47 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_8232150594539241952 records=0 bytes=0 channelLen=0
2026/04/13 14:49:47 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id" rows=0 bytes=0 channelLen=0
2026/04/13 14:49:47 INFO [pg_query_executor] committing transaction x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart
2026/04/13 14:49:47 INFO [pg_query_executor] committed transaction for query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id" rows=2 bytes=24 channelLen=0
    clickhouse_test.go:1064: WaitFor waiting for CDC count 2026-04-13 14:49:52.128212556 +0000 UTC m=+380.253130605
    clickhouse_test.go:1068: WaitFor waiting for CDC count 2026-04-13 14:49:52.137501052 +0000 UTC m=+380.262419131
2026/04/13 14:49:52 INFO Executing and processing query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id"
2026/04/13 14:49:52 INFO Executing and processing query stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id"
2026/04/13 14:49:52 INFO [pg_query_executor] declared cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursorQuery="DECLARE peerdb_cursor_3000482916563691618 CURSOR FOR SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id" args=[]
2026/04/13 14:49:52 INFO [pg_query_executor] fetching rows start x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id" channelLen=0
2026/04/13 14:49:52 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_3000482916563691618
2026/04/13 14:49:52 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_3000482916563691618 records=2 bytes=24 channelLen=1
2026/04/13 14:49:52 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id" rows=2 bytes=24 channelLen=1
2026/04/13 14:49:52 INFO [pg_query_executor] fetching from cursor x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_3000482916563691618
2026/04/13 14:49:52 INFO processed row stream x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart cursor=peerdb_cursor_3000482916563691618 records=0 bytes=0 channelLen=0
2026/04/13 14:49:52 INFO [pg_query_executor] fetched rows x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id" rows=0 bytes=0 channelLen=0
2026/04/13 14:49:52 INFO [pg_query_executor] committing transaction x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart
2026/04/13 14:49:52 INFO [pg_query_executor] committed transaction for query x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} partitionId=testpart query="SELECT b, a, c FROM e2e_test_pgchcl_tpiwztjo.\"test_composite_pkey_ordering\" ORDER BY id" rows=2 bytes=24 channelLen=0
    clickhouse.go:94: 
        	Error Trace:	.../flow/e2e/congen.go:43
        	            				.../flow/e2e/clickhouse.go:94
        	            				.../flow/e2e/clickhouse.go:172
        	            				.../flow/e2e/test_utils.go:240
        	            				.../flow/e2e/test_utils.go:852
        	            				.../flow/e2e/test_utils.go:237
        	            				.../flow/e2e/clickhouse_test.go:1068
        	            				.../flow/e2e/clickhouse_test.go:1083
        	Error:      	Received unexpected error:
        	            	unable to establish connection with catalog: FATAL: terminating connection due to administrator command (SQLSTATE 57P01)
        	Test:       	TestPeerFlowE2ETestSuitePG_CH_Cluster/Test_Binary_Format_Raw
    clickhouse.go:125: begin tearing down postgres schema pgchcl_zyxklqxw
    clickhouse.go:125: 
        	Error Trace:	.../flow/e2e/pg.go:148
        	            				.../flow/e2e/clickhouse.go:125
        	            				.../flow/e2eshared/e2eshared.go:38
        	            				.../hostedtoolcache/go/1.26.1.../src/testing/testing.go:1317
        	            				.../hostedtoolcache/go/1.26.1.../src/testing/testing.go:1667
        	            				.../hostedtoolcache/go/1.26.1.../src/testing/testing.go:2030
        	            				.../hostedtoolcache/go/1.26.1.../src/runtime/panic.go:694
        	            				.../hostedtoolcache/go/1.26.1.../src/testing/testing.go:1022
        	            				.../flow/e2e/congen.go:43
        	            				.../flow/e2e/clickhouse.go:94
        	            				.../flow/e2e/clickhouse.go:172
        	            				.../flow/e2e/test_utils.go:240
        	            				.../flow/e2e/test_utils.go:852
        	            				.../flow/e2e/test_utils.go:237
        	            				.../flow/e2e/clickhouse_test.go:1068
        	            				.../flow/e2e/clickhouse_test.go:1083
        	            				.../hostedtoolcache/go/1.26.1.../src/reflect/value.go:586
        	            				.../hostedtoolcache/go/1.26.1.../src/reflect/value.go:369
        	            				.../flow/e2eshared/e2eshared.go:40
        	Error:      	failed to teardown postgres schema
        	Test:       	TestPeerFlowE2ETestSuitePG_CH_Cluster/Test_Binary_Format_Raw
        	Messages:   	pgchcl_zyxklqxw: failed to drop replication slots: ERROR: replication slot "peerflow_slot_ch_binary_format_raw_pgchcl_zyxklqxw" is active for PID 15417 (SQLSTATE 55006)
--- FAIL: TestPeerFlowE2ETestSuitePG_CH_Cluster/Test_Binary_Format_Raw (139.54s)


@github-actions
Contributor

github-actions bot commented Apr 8, 2026

❌ Test Failure

Analysis: Deterministic regression: every matrix run hits identical UNEXPECTED TIMEOUT wait for avro stage dropped and UNEXPECTED TIMEOUT wait for qrep flow dropped failures, indicating drop/cleanup operations are no longer completing rather than intermittent flakiness.
Confidence: 0.92

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Flaky Test Detected

Analysis: e2e TestApiPg/TestResyncFailed failed with a transient PostgreSQL SQLSTATE 57P01 (admin_shutdown) error — a catalog connection was forcibly terminated by the system mid-test in the highly concurrent (32-worker) CI environment, unrelated to the test's actual logic.
Confidence: 0.9

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Flaky Test Detected

Analysis: TestApiPg/TestResyncFailed fails across all matrix variants with a PostgreSQL catalog connection terminated by admin command (SQLSTATE 57P01), a race condition where the test's own DB-termination action bleeds into the catalog connection check — not caused by the renovate config commit.
Confidence: 0.85

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

❌ Test Failure

Analysis: TestApiPg/TestResyncFailed deterministically fails across all 3 matrix configs because the test kills all pg backends matching the flow suffix (api_test.go:1423-1426), which inadvertently terminates the connection that GetFlowStatus subsequently uses to poll the catalog, producing "UNEXPECTED ERROR: FATAL: terminating connection due to administrator command".
Confidence: 0.88

⚠️ This appears to be a real bug - manual intervention needed

View workflow run
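
One way to keep a deliberate backend kill from hitting the test's own catalog connection is to filter on application_name, sketched below. This is an assumption about the general approach, not the code merged in this PR: pg_stat_activity, application_name, pg_terminate_backend and pg_backend_pid are standard PostgreSQL, while terminateQuery and the application name "peerdb" are hypothetical.

```go
package main

import "fmt"

// terminateQuery (hypothetical helper) builds a statement that terminates only
// backends whose application_name matches appName, excluding the session that
// runs the statement. It returns the SQL text and its positional arguments.
func terminateQuery(appName string) (string, []any) {
	const q = `SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE application_name = $1
  AND pid <> pg_backend_pid()`
	return q, []any{appName}
}

func main() {
	// A connection opened by the test with a different application_name
	// is not matched by this filter and therefore survives the kill.
	q, args := terminateQuery("peerdb") // "peerdb" is a placeholder value
	fmt.Println(q)
	fmt.Println("args:", args)
}
```

The connection that polls flow status would then need to set a distinct application_name in its connection string for the filter to spare it.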

@pfcoperez pfcoperez force-pushed the DBI-640/local-dev-env/e2e/toxy+split_pgcon branch from 625cf5b to 51abd2b Compare April 8, 2026 12:54
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Flaky Test Detected

Analysis: Tests failed due to the PostgreSQL catalog database connection being terminated mid-run (FATAL: terminating connection due to administrator command, SQLSTATE 57P01), a transient CI infrastructure issue unrelated to the test logic.
Confidence: 0.97

✅ Automatically retrying the workflow

View workflow run

…ERDB_VERSION_SHA_SHORT` to the expected value in the test
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Flaky Test Detected

Analysis: TestApiPg/TestResyncFailed failed with SQLSTATE 57P01 (admin-terminated connection) mid-WaitFor loop — a race condition in the shared test Postgres catalog unrelated to the triggering Renovate config commit.
Confidence: 0.87

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Flaky Test Detected

Analysis: TestApiPg/TestResyncFailed fails across all matrix runs with SQLSTATE 57P01 (admin_shutdown), indicating a race condition where the test's intentional pg_terminate_backend call disrupts catalog connections of concurrently running tests under -p 32 parallelism.
Confidence: 0.75

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Flaky Test Detected

Analysis: The e2e test suite timed out at exactly 900s (the configured timeout), indicating CI runner slowness or non-deterministic test duration rather than a code regression.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Flaky Test Detected

Analysis: TestApiPg/TestResyncFailed fails across all matrix configurations with a PostgreSQL admin_shutdown connection error (SQLSTATE 57P01) during a WaitFor timing loop, and TestGenericBQ/Test_Schema_Change_Lost_Column_Bug fails due to BigQuery schema propagation timing — both are characteristic of infrastructure/timing race conditions with no related code changes in the triggering commit.
Confidence: 0.78

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Flaky Test Detected

Analysis: The e2e test TestApiPg/TestResyncFailed failed due to a PostgreSQL catalog connection being forcefully terminated by an administrator command (SQLSTATE 57P01 / admin_shutdown), which is a transient CI infrastructure issue unrelated to the code changes.
Confidence: 0.95

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

❌ Test Failure

Analysis: TestApiPg/TestResyncFailed fails consistently across all 3 matrix configurations with the same error — the catalog connection is being terminated (SQLSTATE 57P01) at an unexpected point in the resync failure test, indicating a real bug rather than a flaky environment issue.
Confidence: 0.88

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

❌ Test Failure

Analysis: TestApiPg/TestResyncFailed failed consistently across two independent CI matrix configurations with the same PostgreSQL SQLSTATE 57P01 (admin_shutdown) error, indicating the catalog connection is being unexpectedly terminated during resync failure testing rather than a transient network/timing issue.
Confidence: 0.75

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@ilidemi
Contributor

ilidemi commented Apr 8, 2026

Straight for stretch goals 💪

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Flaky Test Detected

Analysis: The e2e ClickHouse CDC tests failed because SetupCDCFlowStatusQuery got stuck waiting in the snapshot phase (a timing-dependent state transition), which is a known flaky pattern in distributed integration tests rather than a deterministic bug.
Confidence: 0.9

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

❌ Test Failure

Analysis: TestApiPg/TestResyncFailed fails consistently across all 3 matrix configurations with SQLSTATE 57P01 (admin_shutdown), indicating pg_terminate_backend is hitting the catalog connection unexpectedly during resync failure simulation rather than being a timing/environment flake.
Confidence: 0.82

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@pfcoperez
Contributor Author

@claude Fix linting issues

@claude

claude bot commented Apr 8, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

@github-actions
Contributor

github-actions bot commented Apr 8, 2026

❌ Test Failure

Analysis: TestApiPg/TestResyncFailed fails consistently across all matrix variants with a catalog connection being administratively terminated (SQLSTATE 57P01), pointing to a real bug introduced by the PR's toxiproxy/split-pgcon changes rather than intermittent flakiness.
Confidence: 0.9

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: SSH keepalive tests failed with "Connection refused" when connecting through toxiproxy, indicating the test infrastructure service was unavailable rather than a code regression.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: TestPostgresSSHKeepaliveWithToxiproxy and TestPostgresSSHKeepaliveLatency failed with "ssh: rejected: connect failed (Connection refused)" — a transient network error indicating the SSH/Toxiproxy service wasn't ready when the tests executed.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: Three SSH keepalive tests (TestPostgresSSHKeepaliveWithToxiproxy, TestPostgresSSHKeepaliveLatency, TestPostgresSSHResetPeer) failed across all matrix jobs with "ssh: rejected: connect failed (Connection refused)" errors from Toxiproxy, indicating a transient network connectivity issue in the test environment rather than a code bug.
Confidence: 0.9

✅ Automatically retrying the workflow

View workflow run

@pfcoperez pfcoperez force-pushed the DBI-640/local-dev-env/e2e/toxy+split_pgcon branch from 8ed0a95 to 8840ea9 Compare April 10, 2026 21:17
@github-actions
Contributor

❌ Test Failure

Analysis: SSH keepalive tests fail deterministically across all matrix configurations with "Connection refused" errors on initial connection, and e2e tests show data mismatch errors, indicating a real regression rather than intermittent flakiness.
Confidence: 0.75

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: The entire e2e test suite timed out after 900 seconds (panic: test timed out after 15m0s), killing all in-flight tests — a classic flaky CI failure from resource contention or slow infrastructure, not a code logic bug.
Confidence: 0.93

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: Two e2e tests failed due to transient "conn closed" and "context deadline exceeded" errors — Test_PubSub/TestSimple hit a DeadlineExceeded RPC error and TestGenericBQ/Test_Inheritance_Table_With_Dynamic_Setting failed during schema teardown with "conn closed", both indicating intermittent connection drops rather than a code regression.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@pfcoperez
Contributor Author

pfcoperez commented Apr 13, 2026

Thanks for finding these @ilidemi !

These two hardcode the port and don't run against mysql-pos (this shows up when mysql-gtid is stopped):

Fix: 23bebc3
Checked:

[Screenshot From 2026-04-13 12-17-43] [Screenshot From 2026-04-13 12-20-36]

Port: 3306,

Fix: 846b320
Checked:

[Screenshot From 2026-04-13 12-25-06]

This one runs against catalog:

Fix: 55ca562
Checked:

  • With dedicated source resource enabled: [image]
  • With resource disabled: [image]

This family of problems stems partly from different tests obtaining their connection parameters with their own logic. I've opened a Linear ticket, https://linear.app/clickhouse/issue/DBI-674/unify-connection-parameters-in-tests, to unify connection parameters across tests.
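
As a rough idea of what unified connection parameters could look like, the sketch below routes every lookup through one helper instead of hardcoding values such as the port per test. Everything here is hypothetical: the ConnParams type, the FromEnv helper, and the EXAMPLE_MYSQL_HOST / EXAMPLE_MYSQL_PORT variable names are illustrative and not PeerDB's actual configuration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// ConnParams is a hypothetical shared shape for test connection settings.
type ConnParams struct {
	Host string
	Port uint16
}

// FromEnv (hypothetical helper) reads host and port from the named variables,
// falling back to def for anything unset. An unparsable port is an error
// rather than a silent fallback.
func FromEnv(getenv func(string) string, hostVar, portVar string, def ConnParams) (ConnParams, error) {
	p := def
	if h := getenv(hostVar); h != "" {
		p.Host = h
	}
	if raw := getenv(portVar); raw != "" {
		n, err := strconv.ParseUint(raw, 10, 16)
		if err != nil {
			return ConnParams{}, fmt.Errorf("invalid %s=%q: %w", portVar, raw, err)
		}
		p.Port = uint16(n)
	}
	return p, nil
}

func main() {
	// The variable names below are placeholders for this example.
	p, err := FromEnv(os.Getenv, "EXAMPLE_MYSQL_HOST", "EXAMPLE_MYSQL_PORT",
		ConnParams{Host: "localhost", Port: 3306})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s:%d\n", p.Host, p.Port)
}
```

With a single entry point, switching the target instance (for example between MySQL variants on different ports) becomes an environment change rather than an edit in each test.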

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: Three SSH/Toxiproxy tests (TestPostgresSSHKeepaliveWithToxiproxy, TestPostgresSSHKeepaliveLatency, TestPostgresSSHResetPeer) failed with "Connection refused" errors on SSH tunnel ports, indicating CI infrastructure wasn't ready rather than a code bug.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: Three SSH Toxiproxy tests failed with "Connection refused" (likely Postgres not ready at test start) across all matrix jobs, and one BQ e2e test hit a STATUS_SNAPSHOT timeout in only one of three matrix jobs — both are classic infrastructure-readiness/timing flakes with no deterministic assertion failures.
Confidence: 0.85

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: All failures are infrastructure/network errors: SSH-over-Toxiproxy connection refused and catalog PostgreSQL terminated by admin command (SQLSTATE 57P01), both consistent with CI environment instability rather than a code regression.
Confidence: 0.82

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: Three SSH/Toxiproxy tests (TestPostgresSSHKeepaliveWithToxiproxy, TestPostgresSSHKeepaliveLatency, TestPostgresSSHResetPeer) failed with transient "Connection refused" network errors through the Toxiproxy tunnel, indicating a race condition in CI network setup rather than a real bug.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: All three failing SSH keepalive tests (TestPostgresSSHKeepaliveWithToxiproxy, TestPostgresSSHKeepaliveLatency, TestPostgresSSHResetPeer) fail with identical "Connection refused" network errors across every matrix configuration simultaneously, indicating a Toxiproxy/SSH tunnel timing race in CI infrastructure rather than a code regression.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: The e2e test TestApiPg/TestTableAdditionWithoutInitialLoad failed due to a transient PostgreSQL SQLSTATE 57P01 (admin_shutdown) error — the CI catalog DB connection was forcibly terminated mid-test, not a code bug.
Confidence: 0.95

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: TestGenericBQ/Test_Simple_Flow timed out with STATUS_SNAPSHOT — a transient timeout waiting for BigQuery snapshot completion, indicative of CI environment latency rather than a code defect.
Confidence: 0.88

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: TestGenericBQ/Test_Simple_Flow timed out waiting for STATUS_SNAPSHOT to transition, a classic timing-dependent failure in E2E tests running under high concurrency (-p 32) on the mysql-gtid CI matrix.
Confidence: 0.9

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: The e2e MySQL-GTID test suite failed after 722s (near the 900s timeout) with no specific error messages, consistent with a flaky timing/waiting issue rather than a deterministic bug introduced by the recent normalize race fix.
Confidence: 0.72

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: TestGenericBQ/Test_Simple_Flow failed with a snapshot status timeout ("UNEXPECTED STATUS TIMEOUT STATUS_SNAPSHOT"), consistent with known BigQuery test flakiness under high concurrency — not a code regression.
Confidence: 0.92

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: Two unrelated failures: a teardown race condition (active replication slot during cleanup in CH cluster test) and BigQuery timing issues (destination columns not propagated before validation), both unrelated to the PG-to-PG MERGE commit that triggered this run.
Confidence: 0.88

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: Both e2e failures are transient infrastructure issues: a PostgreSQL catalog connection killed by an admin command (SQLSTATE 57P01) and a snapshot phase timeout, neither related to the code change.
Confidence: 0.95

✅ Automatically retrying the workflow

View workflow run

@github-actions
Contributor

❌ Test Failure

Analysis: CI run 24349427840 is still in progress; logs are not yet available to assess flakiness.
Confidence: 0

⚠️ This appears to be a real bug - manual intervention needed

View workflow run

@github-actions
Contributor

🔄 Flaky Test Detected

Analysis: The e2e test suite failed only on the mysql-pos/7.0/stable matrix variant (744s runtime) while identical code passed on mysql-gtid and maria variants, strongly suggesting a timing-sensitive flake in the async CDC/Temporal workflow tests rather than a real regression.
Confidence: 0.82

✅ Automatically retrying the workflow

View workflow run

@pfcoperez pfcoperez merged commit cd8fe6d into main Apr 13, 2026
20 of 26 checks passed
@pfcoperez pfcoperez deleted the DBI-640/local-dev-env/e2e/toxy+split_pgcon branch April 13, 2026 16:31

4 participants