Description
The prestissimo-worker:dev Docker image crashes with SIGSEGV at address 0x0 immediately after connecting to the coordinator. The crash occurs in ResponseHandler::setTransaction() when it calls txn_->getTransport().getCodec().getProtocol():
*** Signal 11 (SIGSEGV) (0x0) received ***
@ 0x0 (null pointer dereference)
@ ResponseHandler::setTransaction(HTTPTransaction*)
@ HTTPTransaction::setHandler(HTTPTransactionHandler*)
@ HTTPUpstreamSession::newTransactionWithError(HTTPTransactionHandler*)
@ HTTPUpstreamSession::newTransaction(HTTPTransactionHandler*)
@ ConnectionHandler::connectSuccess(HTTPUpstreamSession*)
Root Cause
This is the same class of bug as prestodb#22995. GCC (tested with versions 11 and 13) generates incorrect code in Release mode (-O3 -DNDEBUG) for proxygen's HTTP session/transport virtual function dispatch. Clang does not exhibit this issue, which is why Meta (who use Clang internally) never encountered it.
The upstream fix (prestodb#23531) guarded only the createTransaction() code path with #if defined(__clang__). The setTransaction() code path, which was added later in the 0.297 integration, makes the same vulnerable call but was left unguarded.
Reproduction
| Build | Compiler | Mode | Result |
|---|---|---|---|
| Local | GCC 11 | Debug (-O0) | Works |
| Local | GCC 11 | Release (-O3) | SIGSEGV |
| Local | GCC 13 | Release (-O3) | SIGSEGV |
| Local | Clang 17 | Release (-O3) | Works |
| CI (CentOS) | GCC 12 | Release (-O3) | Works (e2e passes) |
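Because the result depends on both the compiler and the build mode, it can help to confirm which combination a given binary was actually built with. The following is a minimal sketch, not code from the Presto tree: the function names are hypothetical, and it relies only on the predefined macros __clang__, __GNUC__, __VERSION__, __OPTIMIZE__ and NDEBUG. Clang also defines __GNUC__, so __clang__ has to be checked first.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: compiler family this translation unit was built with.
// Clang defines __GNUC__ too, so __clang__ must be tested first.
std::string compilerFamily() {
#if defined(__clang__)
  return "clang";
#elif defined(__GNUC__)
  return "gcc";
#else
  return "unknown";
#endif
}

// True when the compiler was invoked with optimization enabled (e.g. -O3).
// __OPTIMIZE__ is predefined by GCC and Clang in that case.
bool isOptimizedBuild() {
#if defined(__OPTIMIZE__)
  return true;
#else
  return false;
#endif
}

// True when assertions are compiled out (-DNDEBUG), as in Release mode.
bool assertsDisabled() {
#if defined(NDEBUG)
  return true;
#else
  return false;
#endif
}

// One-line summary suitable for a startup log message.
std::string buildSummary() {
  const std::string family = compilerFamily();
  assert(!family.empty());
  std::string summary = family;
#if defined(__VERSION__)
  summary += " (" + std::string(__VERSION__) + ")";
#endif
  summary += isOptimizedBuild() ? ", optimized" : ", unoptimized";
  summary += assertsDisabled() ? ", NDEBUG" : ", asserts enabled";
  return summary;
}
```

Logging buildSummary() once at worker startup would make it easy to match a crash report to a row of the table above.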
Possible Solution
All of the following options were tried locally and worked end to end. Below we discuss their trade-offs and explain why we picked Option 1 as the immediate action item.
Option 1: #if defined(__clang__) guard (selected)
Guard the getTransport().getCodec().getProtocol() call in setTransaction() with #if defined(__clang__), matching the existing upstream pattern from PR prestodb#23531.
- Pros: Minimal change (4 lines, 1 file). Follows established upstream precedent. No build infrastructure changes.
- Cons: The protocol_ field won't be set on GCC builds, but it is std::optional and only used for a VLOG(2) debug log, so there is no functional impact.
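As a rough illustration of Option 1, the sketch below shows the shape of the guard. It is not the actual patch: FakeCodec, FakeTransport, FakeTransaction, ResponseHandlerSketch and describeProtocol are hypothetical stand-ins for the proxygen and Presto types, and the protocol string is made up. Only the guard pattern and the std::optional protocol_ field come from the description above.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-ins for the proxygen types; the real classes differ.
struct FakeCodec {
  std::string getProtocol() const { return "http/1.1"; }
};
struct FakeTransport {
  FakeCodec codec;
  const FakeCodec& getCodec() const { return codec; }
};
struct FakeTransaction {
  FakeTransport transport;
  const FakeTransport& getTransport() const { return transport; }
};

// Hypothetical handler mirroring the guarded call in setTransaction().
class ResponseHandlerSketch {
 public:
  void setTransaction(FakeTransaction* txn) {
    assert(txn != nullptr);
    txn_ = txn;
#if defined(__clang__)
    // Only Clang builds read the protocol; GCC builds skip the call chain
    // that crashes in Release mode and leave protocol_ empty.
    protocol_ = txn_->getTransport().getCodec().getProtocol();
#endif
  }

  const std::optional<std::string>& protocol() const { return protocol_; }

 private:
  FakeTransaction* txn_{nullptr};
  std::optional<std::string> protocol_;  // unset on non-Clang builds
};

// Debug-log style consumer: tolerates an unset protocol.
std::string describeProtocol(const ResponseHandlerSketch& handler) {
  return handler.protocol().value_or("<unset>");
}
```

Because the only consumer is a debug log, any reader of protocol_ has to tolerate the empty case, which is what describeProtocol() does here with value_or.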
Option 2: Switch worker image build to Clang
Install clang-17 and lld-17 in the dependency image and set CC/CXX to Clang in the runtime Dockerfile.
- Pros: Eliminates the entire class of GCC codegen bugs. Clang is Meta's internal compiler for this codebase, which is the very reason why they didn't encounter this bug.
- Cons:
- Everything, including dependencies, would need to be compiled with Clang so that the build is consistent, but many dependencies are currently built with GCC. This is a fairly radical change that would take time to deploy fully.
Option 3: Use upstream CentOS dependency image (GCC 12)
Currently, our e2e tests spin up both the Presto coordinator and the Velox worker and pass without this crash. This is because the e2e test builds against a different base image (prestodb/presto-native-dependency:0.297-*, CentOS with GCC 12) than our published worker image (custom Ubuntu with GCC 11). This mismatch between the tested and published environments is itself a problem worth fixing.
This approach replaces our custom Ubuntu dependency image with the upstream CentOS one, aligning the published image with the tested configuration. Note that although GCC 12 does not trigger this specific codegen bug, the root cause is not fully understood.
- Pros: Exact same toolchain as the proven e2e tests. No custom dep image to maintain.
- Cons:
- GCC toolset sourcing: The CentOS image has GCC 12 installed via gcc-toolset-12, but it must be explicitly activated with source /opt/rh/gcc-toolset-12/enable before building. The runtime Dockerfile currently only does this for CUDF builds — the build step needs modification to source it for all CentOS builds.
- Runtime library mismatch: The libraries bundled from a CentOS build environment (glibc, libstdc++, etc.) may not be compatible with an Ubuntu base image for the final runtime container. The BASE_IMAGE may also need to switch from ubuntu:22.04 to a CentOS-based image, which changes the runtime environment for customers.