sv2-tp: reconnect to bitcoin core IPC after disconnect#101
Conversation
79e0653 to
8f0ba30
Compare
| // capnp interface does not define a destroy method, this will just call | ||
| // an empty stub defined in the ProxyClientBase class and do nothing. | ||
| Sub::destroy(*this); | ||
| if (m_context.connection && !m_context.connection->m_disconnected.load()) { |
There was a problem hiding this comment.
In commit "ipc: skip proxy destroy calls after disconnect starts" (fa010b9)
FWIW, there's already a PR on upstream that fixes this issue: bitcoin-core/libmultiprocess#273
There was a problem hiding this comment.
Thanks for pointing that out @xyzconstant. I was not aware that there was already an upstream libmultiprocess PR addressing this.
I have reworked the branch to remove the local libmultiprocess change and replace it with a proper subtree update based on bitcoin-core/libmultiprocess#273. The sv2-tp reconnect commits are now rebased on top of that.
I reran the regtest stack on the updated branch and the reconnect path is still working as expected.
8f0ba30 to
451b144
Compare
|
I'm marking this draft until bitcoin-core/libmultiprocess#273 lands. I also plan to review this. |
|
FYI: bitcoin-core/libmultiprocess#273 just went through |
61de697 Merge bitcoin-core/libmultiprocess#273: proxy-client: tolerate exceptions from remote destroy during cleanup 9cec9d6 Merge bitcoin-core/libmultiprocess#243: mpgen: support primitive std::optional struct fields 4aaff11 Merge bitcoin-core/libmultiprocess#238: cmake, ci: updates for recent nixpkgs 2ac55a5 Merge bitcoin-core/libmultiprocess#218: Better error and log messages 6de92e1 proxy-client: tolerate exceptions from remote destroy during cleanup 90be835 test: regression for ~ProxyClient destroy after peer disconnect 3c69d12 Merge bitcoin-core/libmultiprocess#260: event loop: tolerate unexpected exceptions in `post()` callbacks b8a48c6 event loop: tolerate unexpected exceptions in `post()` callbacks f787863 Merge bitcoin-core/libmultiprocess#270: doc: Bump version 10 > 11 a22f602 doc: Bump version 10 > 11 4eae445 debug: Add TypeName() function and log statements for Proxy objects being created and destroyed f326c5b logging: Add better logging on IPC server-side failures 6dbfa56 mpgen: support primitive std::optional struct fields 8d1277d mpgen refactor: add AccessorType function db716bb mpgen refactor: Move field handling code to FieldList class db7acb3 ci: Fix shell.nix compatibility with CMake 4.0 91a7759 cmake: Fix IWYU in nix by adding CMAKE_CXX_IMPLICIT_INCLUDE_DIRECTORIES git-subtree-dir: src/ipc/libmultiprocess git-subtree-split: 61de6975362a7070276da47cc2aa2c2b8909ed49
…t-subtree-clean-series
451b144 to
f8b5fd5
Compare
Add a reconnect loop around the initial Bitcoin Core IPC setup. When the IPC connection cannot be established, retry with exponential backoff instead of exiting immediately. This provides the basis for recovering sv2-tp after backend loss.
62f92cd to
41cbc55
Compare
Thanks for the heads up @xyzconstant. @Sjors, I have updated the subtree to use the latest changes. I think this should be ready for review now. |
|
The test each commit CI job is taking unexpectedly long. My agent thinks a9f45af is hanging in |
41cbc55 to
decadbd
Compare
Decouple the template provider lifetime from the Bitcoin Core IPC backend. Keep the Stratum v2 listener and connected clients alive when the backend disconnects, wait for a replacement backend, and resume serving templates once a new IPC connection is installed. Also harden the template provider shutdown path, catch disconnect exceptions from IPC calls to trigger a backend reconnect, and ensure proper destruction ordering to avoid hanging threads.
decadbd to
99a1bed
Compare
Thanks for giving this a look @Sjors. It seems, in a9f45af, resetting |
|
I did the subtree update in #109 to reduce the scope here. Still reviewing the changes, but you'll need a trivial rebase. |
There was a problem hiding this comment.
When I shut the node down gracefully (ctrl + c) I get a flood of waitNext() calls. Probably because the node cancels them, but then sv2-tp tries another one immediately. We need to back-off mechanism if waitNext returns null.
I left inline suggestions for some other issues I ran into while testing (on mainnet, since it has more mempool churn, and using the sv2-apps pool role).
| continue; | ||
| } | ||
|
|
||
| LogPrintf("Restarting sv2-tp after Bitcoin Core IPC disconnect\n"); |
There was a problem hiding this comment.
99a1bed: this wording suggests it's restarting the whole application, but we're just resuming template creation?
LogPrintf("Bitcoin Core IPC disconnected; reconnecting before resuming template creation\n");| */ | ||
| [[nodiscard]] bool Start(const Sv2TemplateProviderOptions& options = {}); | ||
|
|
||
| bool BackendDisconnected() const { return m_backend_disconnected.load(); } |
There was a problem hiding this comment.
nit: light preference for renaming this BackendConnected and checking the inverse.
| m_backend_disconnected = true; | ||
| { | ||
| LOCK(m_tp_mutex); | ||
| ClearTemplateCache(); |
There was a problem hiding this comment.
Losing these templates is consequential; if a stratum client sends a solution we can't process it. So maybe add warning log:
diff --git a/src/sv2/template_provider.cpp b/src/sv2/template_provider.cpp
index 2d007e5fba..4315ee8525 100644
--- a/src/sv2/template_provider.cpp
+++ b/src/sv2/template_provider.cpp
@@ -277,4 +277,9 @@ void Sv2TemplateProvider::ClearTemplateCache()
{
AssertLockHeld(m_tp_mutex);
+ if (!m_block_template_cache.empty()) {
+ LogPrintLevel(BCLog::SV2, BCLog::Level::Warning,
+ "Dropping %zu cached block templates after Bitcoin Core IPC backend reset\n",
+ m_block_template_cache.size());
+ }
m_block_template_cache.clear();
m_best_prev_hash = uint256::ZERO;|
|
||
| if (backend && gArgs.GetChainType() != ChainType::SIGNET) { | ||
| try { | ||
| if (backend->Mining().isInitialBlockDownload()) { |
There was a problem hiding this comment.
This is now hammering the node even well after IBD, making the logs hard to read. If a node crashes, it may need to IBD again, so it makes sense to check again, but only until it's out of IBD.
Something like this should help:
diff --git a/src/sv2/template_provider.cpp b/src/sv2/template_provider.cpp
index 2d007e5fba..62c793e240 100644
--- a/src/sv2/template_provider.cpp
+++ b/src/sv2/template_provider.cpp
@@ -290,4 +290,5 @@ void Sv2TemplateProvider::ThreadSv2Handler()
std::map<size_t, std::thread> client_threads;
+ std::shared_ptr<BackendSession> checked_ibd_backend;
while (!m_flag_interrupt_sv2) {
@@ -298,7 +299,7 @@ void Sv2TemplateProvider::ThreadSv2Handler()
}
- if (backend && gArgs.GetChainType() != ChainType::SIGNET) {
+ if (backend != checked_ibd_backend && gArgs.GetChainType() != ChainType::SIGNET) {
try {
- if (backend->Mining().isInitialBlockDownload()) {
+ if (backend && backend->Mining().isInitialBlockDownload()) {
std::this_thread::sleep_for(1000ms);
continue;
@@ -308,4 +309,5 @@ void Sv2TemplateProvider::ThreadSv2Handler()
continue;
}
+ checked_ibd_backend = backend;
}
diff --git a/src/test/sv2_mock_mining.cpp b/src/test/sv2_mock_mining.cpp
index 17a544e358..0aaacbd524 100644
--- a/src/test/sv2_mock_mining.cpp
+++ b/src/test/sv2_mock_mining.cpp
@@ -139,5 +139,11 @@ void MockBlockTemplate::interruptWait()
MockMining::MockMining(std::shared_ptr<MockState> st) : state(std::move(st)) {}
bool MockMining::isTestChain() { return true; }
-bool MockMining::isInitialBlockDownload() { return false; }
+bool MockMining::isInitialBlockDownload()
+{
+ LOCK(state->m);
+ ++state->initial_block_download_checks;
+ state->cv.notify_all();
+ return false;
+}
std::optional<interfaces::BlockRef> MockMining::getTip() { return std::nullopt; }
std::optional<interfaces::BlockRef> MockMining::waitTipChanged(uint256, MillisecondsDouble) { return std::nullopt; }
@@ -151,4 +157,10 @@ void MockMining::interrupt() { LogPrintLevel(BCLog::SV2, BCLog::Level::Trace, "m
bool MockMining::checkBlock(const CBlock&, const node::BlockCheckOptions&, std::string&, std::string&) { return true; }
+uint64_t MockMining::GetInitialBlockDownloadChecks()
+{
+ LOCK(state->m);
+ return state->initial_block_download_checks;
+}
+
uint64_t MockMining::GetTemplateSeq()
{
@@ -163,4 +175,13 @@ uint64_t MockMining::GetHeight()
}
+bool MockMining::WaitForInitialBlockDownloadChecks(uint64_t target, std::chrono::milliseconds timeout)
+{
+ std::unique_lock<Mutex> lk(state->m);
+ auto deadline = std::chrono::steady_clock::now() + timeout;
+ return state->cv.wait_until(lk, deadline, [&] {
+ return state->shutdown || state->initial_block_download_checks >= target;
+ }) && !state->shutdown && state->initial_block_download_checks >= target;
+}
+
bool MockMining::WaitForTemplateSeq(uint64_t target, std::chrono::milliseconds timeout)
{
diff --git a/src/test/sv2_mock_mining.h b/src/test/sv2_mock_mining.h
index 3865922669..ec2349a2be 100644
--- a/src/test/sv2_mock_mining.h
+++ b/src/test/sv2_mock_mining.h
@@ -47,4 +47,5 @@ struct MockState {
std::condition_variable_any cv;
int wait_next_waiters{0};
+ uint64_t initial_block_download_checks{0};
bool shutdown{false};
uint64_t wait_interrupt_generation{0};
@@ -87,4 +88,5 @@ public:
// Accessors for tests (thread-safe)
+ uint64_t GetInitialBlockDownloadChecks();
uint64_t GetTemplateSeq();
uint64_t GetHeight();
@@ -96,4 +98,5 @@ public:
// Wait until internal template sequence reaches at least target (returns false on timeout/shutdown)
+ bool WaitForInitialBlockDownloadChecks(uint64_t target, std::chrono::milliseconds timeout = std::chrono::milliseconds{2000});
bool WaitForTemplateSeq(uint64_t target, std::chrono::milliseconds timeout = std::chrono::milliseconds{2000});
bool WaitForWaitNext(std::chrono::milliseconds timeout = std::chrono::milliseconds{2000});
diff --git a/src/test/sv2_template_provider_tests.cpp b/src/test/sv2_template_provider_tests.cpp
index 29c039c309..9331243ba8 100644
--- a/src/test/sv2_template_provider_tests.cpp
+++ b/src/test/sv2_template_provider_tests.cpp
@@ -39,4 +39,15 @@ BOOST_AUTO_TEST_CASE(block_reserved_weight_floor)
}
+BOOST_AUTO_TEST_CASE(ibd_check_once_per_backend)
+{
+ TPTester tester{};
+
+ BOOST_REQUIRE(tester.m_mining_control->WaitForInitialBlockDownloadChecks(1));
+ const uint64_t checks{tester.m_mining_control->GetInitialBlockDownloadChecks()};
+
+ UninterruptibleSleep(std::chrono::milliseconds{350});
+ BOOST_REQUIRE_EQUAL(tester.m_mining_control->GetInitialBlockDownloadChecks(), checks);
+}
+
BOOST_AUTO_TEST_CASE(client_tests)
{
This PR makes sv2-tp recover from a lost IPC connection without exiting.
The first commits update the libmultiprocess subtree to include the upstream fix from bitcoin-core/libmultiprocess#273. That fix makes proxy client destruction tolerate remote
destroy()failures during teardown. Without it, stale IPC-backed objects can abort the process after Bitcoin Core disconnects.On top of that, sv2-tp keeps the template provider and pool-facing connection manager alive across backend loss, reconnects to Bitcoin Core with backoff, installs a fresh IPC backend, and resumes serving templates on the existing pool connection.
The last commit drops the reconnect-time cleanup workarounds that were only needed before the IPC fix and returns the backend to normal
unique_ptrownership. It also catches disconnect exceptions from template-provider IPC calls and uses them to trigger backend reconnect instead of crashing.