Conversation
❌ 2 Tests Failed:
❌ Test Failure. Analysis: Deterministic regression: every matrix run hits identical …
…operly matching the expected error.
… function applied in the previous commit
🔄 Flaky Test Detected. Analysis: e2e TestApiPg/TestResyncFailed failed with a transient PostgreSQL SQLSTATE 57P01 (admin_shutdown) error: a catalog connection was forcibly terminated by the system mid-test in the highly concurrent (32-worker) CI environment, unrelated to the test's actual logic. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: TestApiPg/TestResyncFailed fails across all matrix variants with a PostgreSQL catalog connection terminated by admin command (SQLSTATE 57P01), a race condition where the test's own DB-termination action bleeds into the catalog connection check, not caused by the renovate config commit. ✅ Automatically retrying the workflow
❌ Test Failure. Analysis: TestApiPg/TestResyncFailed deterministically fails across all 3 matrix configs because the test kills all pg backends matching the flow suffix (api_test.go:1423-1426), which inadvertently terminates the connection that GetFlowStatus subsequently uses to poll the catalog, producing "UNEXPECTED ERROR: FATAL: terminating connection due to administrator command".
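The failure mode described in this analysis (a broad backend kill that also takes out the catalog polling connection) can be avoided by filtering the kill on `application_name`. The following is only a minimal sketch of that idea, not the code in this PR: `terminateBackendsSQL` and the `"peerdb"` argument are assumptions for illustration, while `pg_stat_activity`, `pg_terminate_backend` and `pg_backend_pid` are standard PostgreSQL.

```go
package main

import "fmt"

// terminateBackendsSQL returns a query (and its arguments) that terminates
// only the backends whose application_name equals appName. The function name
// is hypothetical; the SQL uses standard PostgreSQL catalog views/functions.
// Excluding pg_backend_pid() keeps the session that issues the kill alive.
func terminateBackendsSQL(appName string) (string, []any) {
	const query = `SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE application_name = $1
  AND pid <> pg_backend_pid()`
	return query, []any{appName}
}

func main() {
	// "peerdb" is an assumed application name for the pipeline connections.
	query, args := terminateBackendsSQL("peerdb")
	fmt.Println(query)
	fmt.Println(args...)
}
```

With a filter like this, a connection that the test opens under a different application name (for example the one used to poll the catalog) is not matched by the kill.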
625cf5b to 51abd2b
🔄 Flaky Test Detected. Analysis: Tests failed due to the PostgreSQL catalog database connection being terminated mid-run (… ✅ Automatically retrying the workflow
…ERDB_VERSION_SHA_SHORT` to the expected value in the test
🔄 Flaky Test Detected. Analysis: TestApiPg/TestResyncFailed failed with SQLSTATE 57P01 (admin-terminated connection) mid-WaitFor loop, a race condition in the shared test Postgres catalog unrelated to the triggering Renovate config commit. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: TestApiPg/TestResyncFailed fails across all matrix runs with SQLSTATE 57P01 (admin_shutdown), indicating a race condition where the test's intentional pg_terminate_backend call disrupts catalog connections of concurrently running tests under -p 32 parallelism. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: The e2e test suite timed out at exactly 900s (the configured timeout), indicating CI runner slowness or non-deterministic test duration rather than a code regression. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: TestApiPg/TestResyncFailed fails across all matrix configurations with a PostgreSQL admin_shutdown connection error (SQLSTATE 57P01) during a WaitFor timing loop, and TestGenericBQ/Test_Schema_Change_Lost_Column_Bug fails due to BigQuery schema propagation timing; both are characteristic of infrastructure/timing race conditions with no related code changes in the triggering commit. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: The e2e test … ✅ Automatically retrying the workflow
❌ Test Failure. Analysis: TestApiPg/TestResyncFailed fails consistently across all 3 matrix configurations with the same error: the catalog connection is being terminated (SQLSTATE 57P01) at an unexpected point in the resync failure test, indicating a real bug rather than a flaky environment issue.
❌ Test Failure. Analysis: TestApiPg/TestResyncFailed failed consistently across two independent CI matrix configurations with the same PostgreSQL SQLSTATE 57P01 (admin_shutdown) error, indicating the catalog connection is being unexpectedly terminated during resync failure testing rather than a transient network/timing issue.
Straight for stretch goals 💪
🔄 Flaky Test Detected. Analysis: The e2e ClickHouse CDC tests failed because … ✅ Automatically retrying the workflow
❌ Test Failure. Analysis: TestApiPg/TestResyncFailed fails consistently across all 3 matrix configurations with SQLSTATE 57P01 (admin_shutdown), indicating pg_terminate_backend is hitting the catalog connection unexpectedly during resync failure simulation rather than being a timing/environment flake.
@claude Fix linting issues
I'll analyze this and get back to you.
❌ Test Failure. Analysis: TestApiPg/TestResyncFailed fails consistently across all matrix variants with a catalog connection being administratively terminated (SQLSTATE 57P01), pointing to a real bug introduced by the PR's toxiproxy/split-pgcon changes rather than intermittent flakiness.
🔄 Flaky Test Detected. Analysis: SSH keepalive tests failed with "Connection refused" when connecting through toxiproxy, indicating the test infrastructure service was unavailable rather than a code regression. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: TestPostgresSSHKeepaliveWithToxiproxy and TestPostgresSSHKeepaliveLatency failed with "ssh: rejected: connect failed (Connection refused)", a transient network error indicating the SSH/Toxiproxy service wasn't ready when the tests executed. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: Three SSH keepalive tests (TestPostgresSSHKeepaliveWithToxiproxy, TestPostgresSSHKeepaliveLatency, TestPostgresSSHResetPeer) failed across all matrix jobs with "ssh: rejected: connect failed (Connection refused)" errors from Toxiproxy, indicating a transient network connectivity issue in the test environment rather than a code bug. ✅ Automatically retrying the workflow
8ed0a95 to 8840ea9
❌ Test Failure. Analysis: SSH keepalive tests fail deterministically across all matrix configurations with "Connection refused" errors on initial connection, and e2e tests show data mismatch errors, indicating a real regression rather than intermittent flakiness.
🔄 Flaky Test Detected. Analysis: The entire e2e test suite timed out after 900 seconds (… ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: Two e2e tests failed due to transient "conn closed" and "context deadline exceeded" errors: Test_PubSub/TestSimple hit a DeadlineExceeded RPC error and TestGenericBQ/Test_Inheritance_Table_With_Dynamic_Setting failed during schema teardown with "conn closed", both indicating intermittent connection drops rather than a code regression. ✅ Automatically retrying the workflow
Thanks for finding these @ilidemi!
Fix: 23bebc3
Fix: 846b320
Fix: 55ca562
This family of problems is partly caused by different tests obtaining their connection parameters with their own logic. I've opened a Linear ticket, https://linear.app/clickhouse/issue/DBI-674/unify-connection-parameters-in-tests, to unify connection parameters across tests.
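As a rough illustration of what unified connection parameters could look like, here is a minimal sketch of a single loader that every test would call instead of reading the environment on its own. Everything in it is hypothetical: the `pgParams` type, the `loadPGParams` helper, the `EXAMPLE_PG_*` variable names and the defaults are placeholders, not names from this repository or from the ticket.

```go
package main

import (
	"fmt"
	"os"
)

// pgParams is a hypothetical bundle of the connection parameters a test needs.
type pgParams struct {
	Host     string
	Port     string
	User     string
	Database string
}

// lookupFunc matches the signature of os.LookupEnv so a fake can be injected.
type lookupFunc func(key string) (string, bool)

// valueOr returns the looked-up value, or fallback when it is unset or empty.
func valueOr(lookup lookupFunc, key, fallback string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return fallback
}

// loadPGParams is the single place where parameters are resolved.
// The EXAMPLE_PG_* names and the defaults are placeholders.
func loadPGParams(lookup lookupFunc) pgParams {
	return pgParams{
		Host:     valueOr(lookup, "EXAMPLE_PG_HOST", "localhost"),
		Port:     valueOr(lookup, "EXAMPLE_PG_PORT", "5432"),
		User:     valueOr(lookup, "EXAMPLE_PG_USER", "postgres"),
		Database: valueOr(lookup, "EXAMPLE_PG_DATABASE", "postgres"),
	}
}

func main() {
	p := loadPGParams(os.LookupEnv)
	fmt.Printf("%s@%s:%s/%s\n", p.User, p.Host, p.Port, p.Database)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the loader easy to exercise without touching the process environment.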
🔄 Flaky Test Detected. Analysis: Three SSH/Toxiproxy tests (TestPostgresSSHKeepaliveWithToxiproxy, TestPostgresSSHKeepaliveLatency, TestPostgresSSHResetPeer) failed with "Connection refused" errors on SSH tunnel ports, indicating CI infrastructure wasn't ready rather than a code bug. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: Three SSH Toxiproxy tests failed with "Connection refused" (likely Postgres not ready at test start) across all matrix jobs, and one BQ e2e test hit a STATUS_SNAPSHOT timeout in only one of three matrix jobs; both are classic infrastructure-readiness/timing flakes with no deterministic assertion failures. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: All failures are infrastructure/network errors: SSH-over-Toxiproxy connection refused and catalog PostgreSQL terminated by admin command (SQLSTATE 57P01), both consistent with CI environment instability rather than a code regression. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: Three SSH/Toxiproxy tests (TestPostgresSSHKeepaliveWithToxiproxy, TestPostgresSSHKeepaliveLatency, TestPostgresSSHResetPeer) failed with transient "Connection refused" network errors through the Toxiproxy tunnel, indicating a race condition in CI network setup rather than a real bug. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: All three failing SSH keepalive tests (TestPostgresSSHKeepaliveWithToxiproxy, TestPostgresSSHKeepaliveLatency, TestPostgresSSHResetPeer) fail with identical "Connection refused" network errors across every matrix configuration simultaneously, indicating a Toxiproxy/SSH tunnel timing race in CI infrastructure rather than a code regression. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: The e2e test TestApiPg/TestTableAdditionWithoutInitialLoad failed due to a transient PostgreSQL … ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: TestGenericBQ/Test_Simple_Flow timed out with STATUS_SNAPSHOT, a transient timeout waiting for BigQuery snapshot completion, indicative of CI environment latency rather than a code defect. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: TestGenericBQ/Test_Simple_Flow timed out waiting for STATUS_SNAPSHOT to transition, a classic timing-dependent failure in E2E tests running under high concurrency (-p 32) on the mysql-gtid CI matrix. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: The e2e MySQL-GTID test suite failed after 722s (near the 900s timeout) with no specific error messages, consistent with a flaky timing/waiting issue rather than a deterministic bug introduced by the recent normalize race fix. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: TestGenericBQ/Test_Simple_Flow failed with a snapshot status timeout ("UNEXPECTED STATUS TIMEOUT STATUS_SNAPSHOT"), consistent with known BigQuery test flakiness under high concurrency, not a code regression. ✅ Automatically retrying the workflow
…nv/e2e/toxy+split_pgcon
🔄 Flaky Test Detected. Analysis: Two unrelated failures: a teardown race condition (active replication slot during cleanup in CH cluster test) and BigQuery timing issues (destination columns not propagated before validation), both unrelated to the PG-to-PG MERGE commit that triggered this run. ✅ Automatically retrying the workflow
🔄 Flaky Test Detected. Analysis: Both e2e failures are transient infrastructure issues: a PostgreSQL catalog connection killed by an admin command (SQLSTATE 57P01) and a snapshot phase timeout, neither related to the code change. ✅ Automatically retrying the workflow
❌ Test Failure. Analysis: CI run 24349427840 is still in progress; logs are not yet available to assess flakiness.
🔄 Flaky Test Detected. Analysis: The e2e test suite failed only on the mysql-pos/7.0/stable matrix variant (744s runtime) while identical code passed on mysql-gtid and maria variants, strongly suggesting a timing-sensitive flake in the async CDC/Temporal workflow tests rather than a real regression. ✅ Automatically retrying the workflow






This PR:

- Adds `Toxiproxy` as Tilt resource, optionally initialized as other ancillary services.
- Changes `==` to `errors.Is()`.
- Sets `PEERDB_VERSION_SHA_SHORT` to non-empty strings in Tilt environment.
- Fixes `TestResyncFailed` by:
  a. Cancelling the pipeline before tearing down PSQL.
  b. Limiting the backend PID kill to the `peerdb` application, as we want to test how the pipeline fails without affecting the client pool on the test code side.
  c. Separating the catalog checks connection application from the one being killed.
- Adds `ssh` as Tilt resource, optionally initialized as other ancillary services.

This makes the following e2e tests pass locally:

- `env -f ../.env go test ./e2e/... -v -test.run TestApiPg`
- `env -f ../.env go test ./e2e/... -v -test.run TestApiMy`
- `env -f ../.env go test ./e2e/... -v -test.run TestApiMongo`
- `env -f ../.env go test ./e2e/... -v -test.run ^TestPeerFlowE2ETestSuitePG_CH$`
- `env -f ../.env go test ./e2e/... -v -test.run ^TestPeerFlowE2ETestSuiteMySQL_CH$`
- `env -f ../.env go test ./e2e/... -v -test.run TestGenericCH_PG`
- `env -f ../.env go test ./e2e/... -v -test.run TestGenericCH_MySQL`
- `env -f ../.env go test ./e2e/... -v -test.run TestMongoClickhouseSuite`
- `env -f ../.env go test ./e2e/... -v -test.run TestSwitchboardMongo`
- `env -f ../.env go test ./e2e/... -v -test.run TestSwitchboardMySQL`
- `env -f ../.env go test ./e2e/... -v -test.run TestSwitchboardPostgres`
- `env -f ../.env go test ./connectors/postgres/... -v`
- `env -f ../.env go test ./connectors/clickhouse/... -v` -> `env -f ../.env` `TestIAMRoleCanIssueSelectFromS3`: this test uses AWS access credentials and I don't think it matches minio features, so it is skipped when `FLOW_TESTS_AWS_S3_BUCKET_NAME` is not present.

📝 This PR doesn't address env var injection with `godotenv` (which would make `env -f` unnecessary). It also doesn't expand the "Run through Tilt UI clicks" addition to all the supported cases. I am adding both in a follow-up PR.

Closes: https://linear.app/clickhouse/issue/DBI-640
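For context on the `==` to `errors.Is()` change mentioned in this description: a direct comparison stops matching as soon as an error is wrapped with `%w`, while `errors.Is` walks the wrap chain. The sketch below only illustrates that standard-library behavior; `errFlowNotFound`, `lookupFlow` and `isFlowNotFound` are hypothetical names, not identifiers from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errFlowNotFound is a hypothetical sentinel error used for illustration.
var errFlowNotFound = errors.New("flow not found")

// lookupFlow always fails and wraps the sentinel with extra context via %w.
func lookupFlow(name string) error {
	return fmt.Errorf("looking up %q: %w", name, errFlowNotFound)
}

// isFlowNotFound reports whether err is, or wraps, errFlowNotFound.
func isFlowNotFound(err error) bool {
	return errors.Is(err, errFlowNotFound)
}

func main() {
	err := lookupFlow("example_flow")
	// The wrapper is a different error value, so == no longer matches.
	fmt.Println(err == errFlowNotFound)
	// errors.Is unwraps the chain and finds the sentinel.
	fmt.Println(isFlowNotFound(err))
}
```

A test that asserts on the expected error with `==` would therefore start failing once any layer adds context to the error, which is the case `errors.Is` handles.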