Worker image crashes with SIGSEGV in Release mode due to GCC codegen bug with proxygen #152

@20001020ycx

Description

The prestissimo-worker:dev Docker image crashes with SIGSEGV at address 0x0 immediately after connecting to the coordinator. The crash occurs in ResponseHandler::setTransaction() at the call to txn_->getTransport().getCodec().getProtocol().

  *** Signal 11 (SIGSEGV) (0x0) received ***
  @ 0x0                    (null pointer dereference)
  @ ResponseHandler::setTransaction(HTTPTransaction*)
  @ HTTPTransaction::setHandler(HTTPTransactionHandler*)
  @ HTTPUpstreamSession::newTransactionWithError(HTTPTransactionHandler*)
  @ HTTPUpstreamSession::newTransaction(HTTPTransactionHandler*)
  @ ConnectionHandler::connectSuccess(HTTPUpstreamSession*)

Root Cause

This is the same class of bug as prestodb#22995. GCC (tested with versions 11 and 13) generates incorrect code in Release mode (-O3 -DNDEBUG) for proxygen's HTTP session/transport virtual function dispatch. Clang does not exhibit this issue, which is why Meta (who use Clang internally) never encountered it.

The upstream fix (prestodb#23531) only guarded the createTransaction() code path with #if defined(__clang__). The setTransaction() code path, which was added later in the 0.297 integration, contains the same vulnerable call but was not guarded.

Reproduction

Build       | Compiler | Mode          | Result
Local       | GCC 11   | Debug (-O0)   | Works
Local       | GCC 11   | Release (-O3) | SIGSEGV
Local       | GCC 13   | Release (-O3) | SIGSEGV
Local       | Clang 17 | Release (-O3) | Works
CI (CentOS) | GCC 12   | Release (-O3) | Works (e2e passes)
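
When comparing builds like the ones listed above, it can help to confirm which compiler and optimization mode actually produced a given binary. The following is a minimal sketch, not code from the Presto or proxygen sources. The function name buildToolchainSummary is hypothetical; the predefined macros it checks (__clang__, __GNUC__, __OPTIMIZE__, NDEBUG) are standard for GCC and Clang.

```cpp
#include <string>

// Hypothetical helper: summarizes the compiler and build mode of this
// translation unit using predefined macros.
std::string buildToolchainSummary() {
  std::string summary;
  // Check __clang__ first: Clang also defines __GNUC__ for compatibility.
#if defined(__clang__)
  summary = "clang " + std::to_string(__clang_major__) + "." +
            std::to_string(__clang_minor__);
#elif defined(__GNUC__)
  summary = "gcc " + std::to_string(__GNUC__) + "." +
            std::to_string(__GNUC_MINOR__);
#else
  summary = "unknown compiler";
#endif
  // __OPTIMIZE__ is defined by GCC and Clang when optimization is enabled.
#if defined(__OPTIMIZE__)
  summary += ", optimized";
#else
  summary += ", unoptimized";
#endif
  // NDEBUG is normally passed explicitly by Release builds (-DNDEBUG).
#if defined(NDEBUG)
  summary += ", NDEBUG";
#endif
  return summary;
}
```

Logging this string once at worker startup would make a mismatch between the tested and the published toolchain visible in the logs.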

Possible Solution

All of the following options were tried locally and work end to end. Below we discuss their trade-offs and explain why we pick Option 1 as the immediate action item.

Option 1: #if defined(__clang__) guard (selected)

Guard the getTransport().getCodec().getProtocol() call in setTransaction() with #if defined(__clang__), matching the existing upstream pattern from PR prestodb#23531.

  • Pros: Minimal change (4 lines, 1 file). Follows established upstream precedent. No build infrastructure changes.
  • Cons: The protocol_ field won't be set on GCC builds, but it is std::optional and only used for a VLOG(2) debug log, so there is no functional impact.
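
The guard in Option 1 can be sketched as follows. This is an illustrative, self-contained mock rather than the actual patch: Codec, Transport, Transaction, and ResponseHandlerSketch are hypothetical stand-ins for the proxygen and Presto types, and the string returned by getProtocol() is a placeholder. Only the shape of the guard, using the compiler's predefined __clang__ macro, reflects what the option describes.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-ins for the proxygen types involved in the crash.
struct Codec {
  std::string getProtocol() const { return "http/1.1"; }
};
struct Transport {
  Codec codec;
  const Codec& getCodec() const { return codec; }
};
struct Transaction {
  Transport transport;
  const Transport& getTransport() const { return transport; }
};

// Hypothetical handler mirroring the guarded setTransaction() pattern.
class ResponseHandlerSketch {
 public:
  void setTransaction(Transaction* txn) {
    txn_ = txn;
    // Only query the protocol when built with Clang. On GCC builds the
    // optional stays empty, which only affects a debug log message.
#if defined(__clang__)
    protocol_ = txn_->getTransport().getCodec().getProtocol();
#endif
  }

  const std::optional<std::string>& protocol() const { return protocol_; }

 private:
  Transaction* txn_{nullptr};
  std::optional<std::string> protocol_;
};
```

Because protocol_ is a std::optional, any code that logs it has to handle the empty case on GCC builds, for example by printing a placeholder value.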

Option 2: Switch worker image build to Clang

Install clang-17 and lld-17 in the dependency image and set CC/CXX to Clang in the runtime Dockerfile.

  • Pros: Eliminates the entire class of GCC codegen bugs. Clang is Meta's internal compiler for this codebase, which is the very reason why they didn't encounter this bug.
  • Cons:
    • Everything would need to be compiled with Clang so that the binaries match, but many dependencies are currently built with GCC. This is a fairly radical change that would take time to deploy fully.

Option 3: Use upstream CentOS dependency image (GCC 12)

Currently, our e2e tests spin up both the Presto coordinator and the Velox worker and pass without this crash. This is because the e2e test builds against a different base image (prestodb/presto-native-dependency:0.297-*, CentOS with GCC 12) than our published worker image (custom Ubuntu with GCC 11). This mismatch between the tested and published environments is itself a problem worth fixing.

This approach replaces our custom Ubuntu dependency image with the upstream CentOS one, aligning the published image with the tested configuration. Note that while GCC 12 does not trigger this specific codegen bug, the root cause is not fully understood.

  • Pros: Exact same toolchain as the proven e2e tests. No custom dep image to maintain.
  • Cons:
    • GCC toolset sourcing: The CentOS image has GCC 12 installed via gcc-toolset-12, but it must be explicitly activated with source /opt/rh/gcc-toolset-12/enable before building. The runtime Dockerfile currently does this only for CUDF builds, so the build step needs to be modified to source it for all CentOS builds.
    • Runtime library mismatch: The libraries bundled from a CentOS build environment (glibc, libstdc++, etc.) may not be compatible with an Ubuntu base image for the final runtime container. The BASE_IMAGE may also need to switch from ubuntu:22.04 to a CentOS-based image, which changes the runtime environment for customers.
