Skip to content

sv2-tp: reconnect to bitcoin core IPC after disconnect#101

Open
enirox001 wants to merge 4 commits into
stratum-mining:masterfrom
enirox001:reconnect-subtree-clean-series
Open

sv2-tp: reconnect to bitcoin core IPC after disconnect#101
enirox001 wants to merge 4 commits into
stratum-mining:masterfrom
enirox001:reconnect-subtree-clean-series

Conversation

@enirox001

@enirox001 enirox001 commented May 4, 2026

Copy link
Copy Markdown
Contributor

This PR makes sv2-tp recover from a lost IPC connection without exiting.

The first commits update the libmultiprocess subtree to include the upstream fix from bitcoin-core/libmultiprocess#273. That fix makes proxy client destruction tolerate remote destroy() failures during teardown. Without it, stale IPC-backed objects can abort the process after Bitcoin Core disconnects.

On top of that, sv2-tp keeps the template provider and pool-facing connection manager alive across backend loss, reconnects to Bitcoin Core with backoff, installs a fresh IPC backend, and resumes serving templates on the existing pool connection.

The last commit drops the reconnect-time cleanup workarounds that were only needed before the IPC fix and returns the backend to normal unique_ptr ownership. It also catches disconnect exceptions from template-provider IPC calls and uses them to trigger backend reconnect instead of crashing.

@enirox001 enirox001 changed the title sv2-tp: reconnect to Bitcoin Core IPC after disconnect sv2-tp: reconnect to bitcoin core IPC after disconnect May 4, 2026
@enirox001 enirox001 force-pushed the reconnect-subtree-clean-series branch 5 times, most recently from 79e0653 to 8f0ba30 Compare May 6, 2026 09:57
// capnp interface does not define a destroy method, this will just call
// an empty stub defined in the ProxyClientBase class and do nothing.
Sub::destroy(*this);
if (m_context.connection && !m_context.connection->m_disconnected.load()) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In commit "ipc: skip proxy destroy calls after disconnect starts" (fa010b9)

FWIW, there's already a PR on upstream that fixes this issue: bitcoin-core/libmultiprocess#273

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out @xyzconstant. I was not aware that there was already an upstream libmultiprocess PR addressing this.

I have reworked the branch to remove the local libmultiprocess change and replace it with a proper subtree update based on bitcoin-core/libmultiprocess#273. The sv2-tp reconnect commits are now rebased on top of that.

I reran the regtest stack on the updated branch and the reconnect path is still working as expected.

@enirox001 enirox001 force-pushed the reconnect-subtree-clean-series branch from 8f0ba30 to 451b144 Compare May 8, 2026 14:31
@Sjors

Sjors commented May 11, 2026

Copy link
Copy Markdown
Collaborator

I'm marking this draft until bitcoin-core/libmultiprocess#273 lands.

I also plan to review this.

@Sjors Sjors marked this pull request as draft May 11, 2026 09:14
@xyzconstant

Copy link
Copy Markdown

FYI: bitcoin-core/libmultiprocess#273 just went through

enirox001 added 2 commits June 3, 2026 09:41
61de697 Merge bitcoin-core/libmultiprocess#273: proxy-client: tolerate exceptions from remote destroy during cleanup
9cec9d6 Merge bitcoin-core/libmultiprocess#243: mpgen: support primitive std::optional struct fields
4aaff11 Merge bitcoin-core/libmultiprocess#238: cmake, ci: updates for recent nixpkgs
2ac55a5 Merge bitcoin-core/libmultiprocess#218: Better error and log messages
6de92e1 proxy-client: tolerate exceptions from remote destroy during cleanup
90be835 test: regression for ~ProxyClient destroy after peer disconnect
3c69d12 Merge bitcoin-core/libmultiprocess#260: event loop: tolerate unexpected exceptions in `post()` callbacks
b8a48c6 event loop: tolerate unexpected exceptions in `post()` callbacks
f787863 Merge bitcoin-core/libmultiprocess#270: doc: Bump version 10 > 11
a22f602 doc: Bump version 10 > 11
4eae445 debug: Add TypeName() function and log statements for Proxy objects being created and destroyed
f326c5b logging: Add better logging on IPC server-side failures
6dbfa56 mpgen: support primitive std::optional struct fields
8d1277d mpgen refactor: add AccessorType function
db716bb mpgen refactor: Move field handling code to FieldList class
db7acb3 ci: Fix shell.nix compatibility with CMake 4.0
91a7759 cmake: Fix IWYU in nix by adding CMAKE_CXX_IMPLICIT_INCLUDE_DIRECTORIES

git-subtree-dir: src/ipc/libmultiprocess
git-subtree-split: 61de6975362a7070276da47cc2aa2c2b8909ed49
@enirox001 enirox001 force-pushed the reconnect-subtree-clean-series branch from 451b144 to f8b5fd5 Compare June 3, 2026 08:42
Add a reconnect loop around the initial Bitcoin Core IPC setup.

When the IPC connection cannot be established, retry with exponential
backoff instead of exiting immediately. This provides the basis for
recovering sv2-tp after backend loss.
@enirox001 enirox001 force-pushed the reconnect-subtree-clean-series branch 2 times, most recently from 62f92cd to 41cbc55 Compare June 3, 2026 09:07
@enirox001

Copy link
Copy Markdown
Contributor Author

FYI: bitcoin-core/libmultiprocess#273 just went through

Thanks for the heads up @xyzconstant.

@Sjors, I have updated the subtree to use the latest changes. I think this should be ready for review now.

@Sjors Sjors marked this pull request as ready for review June 6, 2026 07:24
@Sjors

Sjors commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

The test each commit CI job is taking unexpectedly long. My agent thinks a9f45af is hanging in sv2_tester_lifecycle_tests and sv2_template_provider_tests.

@enirox001 enirox001 force-pushed the reconnect-subtree-clean-series branch from 41cbc55 to decadbd Compare June 8, 2026 13:13
Decouple the template provider lifetime from the Bitcoin Core IPC
backend.

Keep the Stratum v2 listener and connected clients alive when the
backend disconnects, wait for a replacement backend, and resume
serving templates once a new IPC connection is installed.

Also harden the template provider shutdown path, catch disconnect
exceptions from IPC calls to trigger a backend reconnect, and ensure
proper destruction ordering to avoid hanging threads.
@enirox001 enirox001 force-pushed the reconnect-subtree-clean-series branch from decadbd to 99a1bed Compare June 8, 2026 13:30
@enirox001

Copy link
Copy Markdown
Contributor Author

The test each commit CI job is taking unexpectedly long. My agent thinks a9f45af is hanging in sv2_tester_lifecycle_tests and sv2_template_provider_tests.

Thanks for giving this a look @Sjors. It seems, in a9f45af, resetting m_backend before calling Interrupt() made InterruptBackend() a no-op, leaving client threads hung in waitNext() during teardown.

@Sjors

Sjors commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

I did the subtree update in #109 to reduce the scope here. Still reviewing the changes, but you'll need a trivial rebase.

@Sjors Sjors left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I shut the node down gracefully (ctrl + c) I get a flood of waitNext() calls. Probably because the node cancels them, but then sv2-tp tries another one immediately. We need to back-off mechanism if waitNext returns null.

I left inline suggestions for some other issues I ran into while testing (on mainnet, since it has more mempool churn, and using the sv2-apps pool role).

Comment thread src/sv2-tp.cpp
continue;
}

LogPrintf("Restarting sv2-tp after Bitcoin Core IPC disconnect\n");

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

99a1bed: this wording suggests it's restarting the whole application, but we're just resuming template creation?

LogPrintf("Bitcoin Core IPC disconnected; reconnecting before resuming template creation\n");

*/
[[nodiscard]] bool Start(const Sv2TemplateProviderOptions& options = {});

bool BackendDisconnected() const { return m_backend_disconnected.load(); }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: light preference for renaming this BackendConnected and checking the inverse.

m_backend_disconnected = true;
{
LOCK(m_tp_mutex);
ClearTemplateCache();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Losing these templates is consequential; if a stratum client sends a solution we can't process it. So maybe add warning log:

diff --git a/src/sv2/template_provider.cpp b/src/sv2/template_provider.cpp
index 2d007e5fba..4315ee8525 100644
--- a/src/sv2/template_provider.cpp
+++ b/src/sv2/template_provider.cpp
@@ -277,4 +277,9 @@ void Sv2TemplateProvider::ClearTemplateCache()
 {
     AssertLockHeld(m_tp_mutex);
+    if (!m_block_template_cache.empty()) {
+        LogPrintLevel(BCLog::SV2, BCLog::Level::Warning,
+                      "Dropping %zu cached block templates after Bitcoin Core IPC backend reset\n",
+                      m_block_template_cache.size());
+    }
     m_block_template_cache.clear();
     m_best_prev_hash = uint256::ZERO;


if (backend && gArgs.GetChainType() != ChainType::SIGNET) {
try {
if (backend->Mining().isInitialBlockDownload()) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now hammering the node even well after IBD, making the logs hard to read. If a node crashes, it may need to IBD again, so it makes sense to check again, but only until it's out of IBD.

Something like this should help:

diff --git a/src/sv2/template_provider.cpp b/src/sv2/template_provider.cpp
index 2d007e5fba..62c793e240 100644
--- a/src/sv2/template_provider.cpp
+++ b/src/sv2/template_provider.cpp
@@ -290,4 +290,5 @@ void Sv2TemplateProvider::ThreadSv2Handler()

     std::map<size_t, std::thread> client_threads;
+    std::shared_ptr<BackendSession> checked_ibd_backend;

     while (!m_flag_interrupt_sv2) {
@@ -298,7 +299,7 @@ void Sv2TemplateProvider::ThreadSv2Handler()
         }

-        if (backend && gArgs.GetChainType() != ChainType::SIGNET) {
+        if (backend != checked_ibd_backend && gArgs.GetChainType() != ChainType::SIGNET) {
             try {
-                if (backend->Mining().isInitialBlockDownload()) {
+                if (backend && backend->Mining().isInitialBlockDownload()) {
                     std::this_thread::sleep_for(1000ms);
                     continue;
@@ -308,4 +309,5 @@ void Sv2TemplateProvider::ThreadSv2Handler()
                 continue;
             }
+            checked_ibd_backend = backend;
         }

diff --git a/src/test/sv2_mock_mining.cpp b/src/test/sv2_mock_mining.cpp
index 17a544e358..0aaacbd524 100644
--- a/src/test/sv2_mock_mining.cpp
+++ b/src/test/sv2_mock_mining.cpp
@@ -139,5 +139,11 @@ void MockBlockTemplate::interruptWait()
 MockMining::MockMining(std::shared_ptr<MockState> st) : state(std::move(st)) {}
 bool MockMining::isTestChain() { return true; }
-bool MockMining::isInitialBlockDownload() { return false; }
+bool MockMining::isInitialBlockDownload()
+{
+    LOCK(state->m);
+    ++state->initial_block_download_checks;
+    state->cv.notify_all();
+    return false;
+}
 std::optional<interfaces::BlockRef> MockMining::getTip() { return std::nullopt; }
 std::optional<interfaces::BlockRef> MockMining::waitTipChanged(uint256, MillisecondsDouble) { return std::nullopt; }
@@ -151,4 +157,10 @@ void MockMining::interrupt() { LogPrintLevel(BCLog::SV2, BCLog::Level::Trace, "m
 bool MockMining::checkBlock(const CBlock&, const node::BlockCheckOptions&, std::string&, std::string&) { return true; }

+uint64_t MockMining::GetInitialBlockDownloadChecks()
+{
+    LOCK(state->m);
+    return state->initial_block_download_checks;
+}
+
 uint64_t MockMining::GetTemplateSeq()
 {
@@ -163,4 +175,13 @@ uint64_t MockMining::GetHeight()
 }

+bool MockMining::WaitForInitialBlockDownloadChecks(uint64_t target, std::chrono::milliseconds timeout)
+{
+    std::unique_lock<Mutex> lk(state->m);
+    auto deadline = std::chrono::steady_clock::now() + timeout;
+    return state->cv.wait_until(lk, deadline, [&] {
+        return state->shutdown || state->initial_block_download_checks >= target;
+    }) && !state->shutdown && state->initial_block_download_checks >= target;
+}
+
 bool MockMining::WaitForTemplateSeq(uint64_t target, std::chrono::milliseconds timeout)
 {
diff --git a/src/test/sv2_mock_mining.h b/src/test/sv2_mock_mining.h
index 3865922669..ec2349a2be 100644
--- a/src/test/sv2_mock_mining.h
+++ b/src/test/sv2_mock_mining.h
@@ -47,4 +47,5 @@ struct MockState {
     std::condition_variable_any cv;
     int wait_next_waiters{0};
+    uint64_t initial_block_download_checks{0};
     bool shutdown{false};
     uint64_t wait_interrupt_generation{0};
@@ -87,4 +88,5 @@ public:

     // Accessors for tests (thread-safe)
+    uint64_t GetInitialBlockDownloadChecks();
     uint64_t GetTemplateSeq();
     uint64_t GetHeight();
@@ -96,4 +98,5 @@ public:

     // Wait until internal template sequence reaches at least target (returns false on timeout/shutdown)
+    bool WaitForInitialBlockDownloadChecks(uint64_t target, std::chrono::milliseconds timeout = std::chrono::milliseconds{2000});
     bool WaitForTemplateSeq(uint64_t target, std::chrono::milliseconds timeout = std::chrono::milliseconds{2000});
     bool WaitForWaitNext(std::chrono::milliseconds timeout = std::chrono::milliseconds{2000});
diff --git a/src/test/sv2_template_provider_tests.cpp b/src/test/sv2_template_provider_tests.cpp
index 29c039c309..9331243ba8 100644
--- a/src/test/sv2_template_provider_tests.cpp
+++ b/src/test/sv2_template_provider_tests.cpp
@@ -39,4 +39,15 @@ BOOST_AUTO_TEST_CASE(block_reserved_weight_floor)
 }

+BOOST_AUTO_TEST_CASE(ibd_check_once_per_backend)
+{
+    TPTester tester{};
+
+    BOOST_REQUIRE(tester.m_mining_control->WaitForInitialBlockDownloadChecks(1));
+    const uint64_t checks{tester.m_mining_control->GetInitialBlockDownloadChecks()};
+
+    UninterruptibleSleep(std::chrono::milliseconds{350});
+    BOOST_REQUIRE_EQUAL(tester.m_mining_control->GetInitialBlockDownloadChecks(), checks);
+}
+
 BOOST_AUTO_TEST_CASE(client_tests)
 {

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants