proxy-client: fix TSan data race in clientDestroy#286
Conversation
clientDestroy() reads m_context.connection to decide whether to use MP_LOG vs KJ_LOG, but Connection::~Connection() can set it to null concurrently from the event loop thread (via the disconnect_cb sync cleanup callback) while the destructor runs on an async cleanup thread, causing a TSan-reported data race. The race is exposed by the test added in commit 90be835 ("test: regression for ~ProxyClient destroy after peer disconnect"). The KJ_LOG fallback was only needed before commit 315ff53 ("refactor: Add ProxyContext EventLoop* member"), when logging required going through connection->m_loop. Since that commit, m_context.loop is a direct EventLoopRef that is always valid regardless of whether m_context.connection is null. The KJ_LOG branch is now dead code, so drop it and the connection check entirely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please copy-paste ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
gen.cpp uses IWYU pragma: keep for include which appears to be necessary for newer versions of capnproto (1.4.0 but not 1.1.0)
|
Added 1 commits 25bb3e6 -> 0963634 ( Updated 0963634 -> 039e5ac ( |
|
Concept ACK, |
There was a problem hiding this comment.
ACK 039e5ac
Built and ran mptest with -fsanitize=thread, all 11 tests pass cleanly.
The fix makes sense to me. Instead of adding synchronization for what constitutes a log line, we drop the branch and unconditionally log via MP_LOG.
|
Thanks for the reviews. I also went back and tried to debug why CI did not catch the TSAN error. It turns out to be because an interaction between #273 and #218. The new test added in #273 does not trigger the TSAN error deterministically. When it is run standalone without #218 there's usually no error, although I did see a TSAN error once running the test maybe 10 times. But when #273 and #218 are combined, the test triggers TSAN failures most of the time. This seems to happen because #218 changes |
|
Started to merge this but I saw some more IWYU errors locally that seem better to fix. I think the errors don't happen in CI because nixpkgs are older and use an older version of capnproto. I also ran IWYU locally before my last push, but didn't see these errors because I was using a combined branch with #287 with example files removed. Errors look like: Updated 039e5ac -> 90982f7 ( |
|
lgtm ACK 90982f7 A bit tedious, that different capnp versions require different set of includes, but I guess there is nothing that can be done about that in this repo. |
Yes, this repo is taking a different strategy for IWYU than the Bitcoin core repo so it is easier to run IWYU locally. With Bitcoin core AFAIK you need to run IWYU in a container and can't run it as part of your normal development process because headers must be changed to match what a single CI machine enforces, and pragmas are discouraged. With this repo there is no special IWYU configuration. IWYU can be run in any configuration and "pragma: keeps" can be added freely to turn off complaints from other configurations. From my perspective this is good because there's no need for a container to run IWYU, and you can waste less time debugging IWYU errors and figuring out what "correct" includes are. You get the main benefits of IWYU (faster compiles, cleaner headers) at the cost of needing a few mysterious pragmas. I think this is about as much as you can hope for with IWYU. |
|
Right, it is obviously more correct to run iwyu for the full range of all supported third-party dependency versions (especially if they have different include requirements). Maybe it could make sense to expand the CI to cover both oldeps-iwyu and nightly-iwyu (or so) here? I think for Bitcoin Core this doesn't matter, because the dependencies don't change the include requirements across versions. (E.g. boost multi index is just a single include header) If that were to change, we should do the same. But no strong opinion, and this seems to be trailing off a bit 😅 |
ViniciusCestarii
left a comment
There was a problem hiding this comment.
ACK, 90982f7
This change removes differentiation on the logging when the connection was torn down "IPC interrupted client destroy" but I believe this is fine and simpler than locking a mutex.
|
tACK 90982f7 |
|
post-merge ACK |
8412fcdc65 Merge bitcoin-core/libmultiprocess#295: Mark Waiter m_cv as guarded by m_mutex 1593ee2d18 Merge bitcoin-core/libmultiprocess#294: test: Add passDouble smoke test 9885d7dd33 Merge bitcoin-core/libmultiprocess#286: proxy-client: fix TSan data race in clientDestroy fa35501c4f Mark Waiter m_cv as guarded by m_mutex faaedb11f8 test: Add passDouble smoke test 733c64318d Merge bitcoin-core/libmultiprocess#292: type-number: fix clang-tidy modernize-use-nullptr 9cc3479ab3 Merge bitcoin-core/libmultiprocess#291: cmake: Add `mp_headers` custom target 201abd9e3a Merge bitcoin-core/libmultiprocess#289: cmake: make target_capnp_sources use CURRENT dirs 99820c8aec Merge bitcoin-core/libmultiprocess#279: doc: Add comments to FIELD_* constants in proxy.h 73b985540c Merge bitcoin-core/libmultiprocess#278: doc: Fix and expand design.md e7e91b2e23 Merge bitcoin-core/libmultiprocess#277: Add std::unordered_set support and a helper BuildList to dedup list build handlers 91a951f59a tidy fix: modernize-use-nullptr 16362f42d0 cmake: Add `mp_headers` custom target 615a94fe3a cmake: document ONLY_CAPNP option in target_capnp_sources 90982f75c6 mpgen: iwyu changes required by previous commit 25bb3e67f3 proxy-client: fix TSan data race in clientDestroy 620f297f31 cmake: make target_capnp_sources use CURRENT dirs 9de4b885aa test: use camelCase + $Proxy.name for FooStruct fields 011b91793d type: add std::unordered_set support 20d19b9644 proxy: add BuildList helper and dedup map/set/vector build handlers e863c6cdf6 doc: Add comments to FIELD_* constants in proxy.h 18db0ab957 doc: Fix and expand design.md 61de697536 Merge bitcoin-core/libmultiprocess#273: proxy-client: tolerate exceptions from remote destroy during cleanup 9cec9d6ca5 Merge bitcoin-core/libmultiprocess#243: mpgen: support primitive std::optional struct fields 4aaff11374 Merge bitcoin-core/libmultiprocess#238: cmake, ci: updates for recent nixpkgs 2ac55a56b5 Merge bitcoin-core/libmultiprocess#218: Better error and log messages 6de92e1c73 proxy-client: tolerate exceptions from remote destroy during cleanup 90be8354d4 test: regression for ~ProxyClient destroy after peer disconnect 3c69d125a1 Merge bitcoin-core/libmultiprocess#260: event loop: tolerate unexpected exceptions in `post()` callbacks b8a48c65e6 event loop: tolerate unexpected exceptions in `post()` callbacks f787863d2c Merge bitcoin-core/libmultiprocess#270: doc: Bump version 10 > 11 a22f602910 doc: Bump version 10 > 11 4eae445d6d debug: Add TypeName() function and log statements for Proxy objects being created and destroyed f326c5b1b7 logging: Add better logging on IPC server-side failures 6dbfa56a04 mpgen: support primitive std::optional struct fields 8d1277deb5 mpgen refactor: add AccessorType function db716bbcba mpgen refactor: Move field handling code to FieldList class db7acb3ce2 ci: Fix shell.nix compatibility with CMake 4.0 91a7759a9a cmake: Fix IWYU in nix by adding CMAKE_CXX_IMPLICIT_INCLUDE_DIRECTORIES git-subtree-dir: src/ipc/libmultiprocess git-subtree-split: 8412fcdc659e1379f9b4dea896c26bc1c5f3afa8
#273 added a new test which exposed a longstanding race condition in
clientDestroylogging code, causing the sanitize CI job to now fail with a TSAN error. Fix the race by simplifying the logging code.Details about the fix are in the commit message. TSAN error looked like
https://github.com/bitcoin-core/libmultiprocess/actions/runs/26851297744/job/79183796152
Note: #273 made it possible for this TSAN error to trigger, but it didn't trigger most of the time, so it was not detected in CI. There is also an interaction with #218 and #273 + #218 together make it much more likely to trigger