Skip to content

Migrate pool to new synchronous API#531

Draft
Shourya742 wants to merge 3 commits into
stratum-mining:mainfrom
Shourya742:2026-05-27-migrate-pool-to-new-synchronous-apis
Draft

Migrate pool to new synchronous API#531
Shourya742 wants to merge 3 commits into
stratum-mining:mainfrom
Shourya742:2026-05-27-migrate-pool-to-new-synchronous-apis

Conversation

@Shourya742

Copy link
Copy Markdown
Member

closes: #205

@Shourya742 Shourya742 force-pushed the 2026-05-27-migrate-pool-to-new-synchronous-apis branch 3 times, most recently from acdb471 to f9e19b3 Compare June 3, 2026 12:01
@Shourya742 Shourya742 marked this pull request as ready for review June 3, 2026 12:28
@Shourya742

Shourya742 commented Jun 3, 2026

Copy link
Copy Markdown
Member Author

Opening this up for review. Let me know what you think of API surface and if anything can be improved or is not clear

@Shourya742 Shourya742 force-pushed the 2026-05-27-migrate-pool-to-new-synchronous-apis branch 2 times, most recently from 76f7eb1 to 99ed182 Compare June 4, 2026 08:27
Comment thread pool-apps/pool/src/lib/channel_manager/mod.rs
@GitGab19

GitGab19 commented Jun 8, 2026

Copy link
Copy Markdown
Member

My clanker found a real concern about the usage of self.downstream.get_cloned() in the channel openings handlers:

Possible interleaving:

1. Open-channel handler clones the `Downstream`.
2. The downstream disconnects.
3. `remove_downstream()` removes the downstream and retains away existing `vardiff` entries.
4. The open-channel handler continues with the cloned handle.
5. It inserts a channel into `downstream.standard_channels` / `extended_channels`.
6. It inserts a new `(downstream_id, channel_id)` entry into `self.vardiff`.

At that point, `self.vardiff` can contain an entry for a downstream that no longer exists in `self.downstream`. `run_vardiff()` will then skip it forever because `self.downstream.get_cloned(&downstream_id)` returns `None`.

This is not a memory-safety issue, but it does break the manager invariant that `vardiff` entries correspond to live downstream/channel state. I think the final channel insertion and `vardiff` insertion should happen while holding/rechecking the downstream map entry, or otherwise verify membership at commit time.

@plebhash plebhash self-requested a review June 9, 2026 15:38
@Shourya742

Copy link
Copy Markdown
Member Author

This is not a memory-safety issue, but it does break the manager invariant that vardiff entries correspond to live downstream/channel state. I think the final channel insertion and vardiff insertion should happen while holding/rechecking the downstream map entry, or otherwise verify membership at commit time.

Added here: b3dcf5d, thanks.

Comment thread pool-apps/pool/src/lib/channel_manager/mod.rs Outdated
@plebhash

plebhash commented Jun 11, 2026

Copy link
Copy Markdown
Member

I'm a bit concerned with the usage of get_cloned on this PR.

I assume it's being introduced as a strategy to optimize against contention from long-held entry guards in with and with_mut

more specifically, the point raised by @GitGab19 above has not been fully addressed

there are other potential TOCTOU race windows related to similar usage of get_cloned in several places in this PR:

  • handle_open_standard_mining_channel (as already stated)
  • handle_open_extended_mining_channel (as already stated)
  • handle_update_channel
  • handle_set_custom_mining_job
  • handle_new_template
  • handle_set_new_prev_hash
  • run_vardiff

b3dcf5d partially mitigates the issue, but open-channel handlers still allow stale cloned downstream handles to commit channel/vardiff state after disconnect... while I think this could be ok to live with as a deliberate tradeoff for this specific scenario, we're still violating invariants and I also don't think we can trivially generalize this pattern to solve all potential TOCTOU race windows listed above


overall, we're already swapping the monolithic/nested-lock model for fine-grained SharedLock and SharedMap, and we're yet to analyze the performance gains coming from this improvement

let's not forget that the goal of #368 was to introduce safe DashMap-based primitives that will help us deprecate the unsafe custom Mutex... the fact that now we're bypassing those primitives via get_cloned feels like we're moving the goal post in a precipitated way

moreover, #529 still needs to be fixed, so even if de-nesting locks + DashMap sharding only brings mild/moderate performance improvements, we still lack empirical proof of broader contention issues

so IMHO we should do things one step at a time:

for now, avoid aggressively using get_cloned as an over-optimized strategy to reduce contention on this PR, since that is introducing concurrency side-effects for which mitigation is non-trivial and will likely blow up scope

for now, get_cloned usage should be restricted to snapshot-style usage only (and the function should be documented accordingly)

if we later see that contention remains an issue, then we should leverage the foundations established by #368 + this PR to address it with a proper strategy

@Shourya742

Copy link
Copy Markdown
Member Author

The argument that the above suggestion is only partially valid is not entirely sound.

If you look closely, the concern raised above is that we are checking liveness only at the end rather than at every commit point. This is intentional, as I wanted to keep the logic simple and easier to reason about. These intermediate commit points do not materially affect the outcome. If a downstream disconnect occurs during channel opening, performing the cleanup at the end is sufficient. Otherwise, we would need to introduce cleanup logic after every commit step, which would significantly increase complexity and reduce maintainability.

Regarding the TOCTOU concern, where it is mentioned that a race could occur in-between the channel manager methods and the vardiff task, I agree that the concern is partially valid(only for vardiff and select task). However, all channel manager methods execute sequentially, so a race condition cannot occur between those methods themselves. The only concurrency involved is with the vardiff task. A side note the partial commits are ok if the object eventually gonna be dropped

It will be useful, if you can mention the TOCTOU scenarios?

Thanks for review though, would be awesome discuss more on this.

@plebhash

plebhash commented Jun 11, 2026

Copy link
Copy Markdown
Member

so a race condition cannot occur between those methods themselves.

that is not the claim here, it's about handle_* execution and remove_downstream

The argument that the above suggestion is only partially valid is not entirely sound.

is_alive introduced via b3dcf5d is only partial mitigation because it could return true and allow execution to continue, only for remove_downstream to be executed immediately afterwards and break invariants again

end of the day, the truth is that with this extensive usage of get_cloned, the removal of a Downstream object is not atomic, and is_alive only gives us loose guarantees but no sound atomicity

if we decided to live with these loose guarantees, we should consciously accept the following tradeoffs:

  • potential wasted CPU cycles
  • potential misleading logs with concurrent contradictory signals about the application state
  • temporary ghost vardiff entries from TOCTOU races on handle_open_*

(maybe more, these are the ones I can identify right now)

but the question is: do we want/need to make that decision now?


overall I'm not going to die on this hill, and I don't want to spend more time debating all the details of this rabbit hole

I'd rather focus on my broader point, which I'm happy to clarify in case it's not already sufficiently clear:

the extensive usage of get_cloned as a strategy for contention optimization feels precipitated (given scopes of #368 + #529), and it seems we don't have a good grasp of the second-order consequences of it, so I don't see the point in rushing things at the expense of a new category of unknown problems to be solved in the future

IMHO we should stick to the originally planned scope and measure the performance gains before we jump into new strategies for contention optimization... if we later find that we're still facing performance issues, we can evaluate get_cloned as a potential strategy and its inherent tradeoffs more carefully

@Shourya742

Shourya742 commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Hmm, the idea behind introducing the clone was not because of any technical limitation or API issue . It was mainly about making the code easier to write and follow. If you look at the code now, it is just a simple flow: get the object, perform the operation, and move on, let the system follow eventual consistency.

Even if the object becomes invalid, the operation can still happen, and the system will eventually recover and become consistent again. That was the idea behind this. The introduction of is_alive was also along the same lines: making the system faster by relying on eventual consistency instead of always enforcing strong consistency. The side effects of operating on an invalid object were not expected to be serious. At least, that was my thinking.

That said, nothing is stopping us from providing atomic access to objects and ensuring they cannot become invalid while they are being used. That would give us stronger consistency guarantees (like before).

I agree that we should get some numbers before deciding. I will add a commit that prevents operations on invalid objects. Cool.

@Shourya742 Shourya742 force-pushed the 2026-05-27-migrate-pool-to-new-synchronous-apis branch 2 times, most recently from 8ecf55a to ce528a0 Compare June 12, 2026 10:34
@Shourya742

Copy link
Copy Markdown
Member Author

Removed get_cloned for downstream, here: ce528a0.

I will squash commit later, would make it easy to review.

@plebhash

Copy link
Copy Markdown
Member

Even if the object becomes invalid, the operation can still happen, and the system will eventually recover and become consistent again. That was the idea behind this. The introduction of is_alive was also along the same lines: making the system faster by relying on eventual consistency instead of always enforcing strong consistency. The side effects of operating on an invalid object were not expected to be serious. At least, that was my thinking.

I think this is a reasonable rationale, and I wouldn't completely rule it out of the design space yet.

but tbh conflicting logs and ghost vardiff entries that could last up to 60s in memory worry me about the UX implications

now that sv2-ui is a first-class citizen on the SRI stack, we must always put ourselves in the shoes of the end user before we make key architectural/design decisions

Hmm, the idea behind introducing the clone was not because of any technical limitation or API issue . It was mainly about making the code easier to write and follow. If you look at the code now, it is just a simple flow: get the object, perform the operation, and move on, let the system follow eventual consistency.

I empathize with the urge to make devX better. It's been painful to live with closure complexity of the soon-to-be-deprecated custom Mutex, and I also wish it went away.

But we made it this far, and migrating away from it is a delicate operation. So I feel it's wiser to move slow, evaluate consequences abroad the entire SRI stack, and optimize things gradually.

We laid some solid foundation on #368. Let's take some time and explore it. How will granularly locked sharded maps perform in face of curtailed loads (with granular instrumentation of the source code)? I'm genuinely curious to look at those numbers!

Removed get_cloned for downstream, here: ce528a0.

I will squash commit later, would make it easy to review.

awesome, thanks!

@plebhash

plebhash commented Jun 15, 2026

Copy link
Copy Markdown
Member

can we clean up commit history?

e0fe2cb and ce528a0 are a bit redundant now (add get_cloned then remove get_cloned)

@Shourya742 Shourya742 force-pushed the 2026-05-27-migrate-pool-to-new-synchronous-apis branch 2 times, most recently from c7c485a to 7b361d0 Compare June 15, 2026 16:22
@Shourya742

Copy link
Copy Markdown
Member Author

I will squash commit later, would make it easy to review.

I assume you had looked into the final commit earlier.

can we clean up commit history?

e0fe2cb and ce528a0 are a bit redundant now (add get_cloned then remove get_cloned)

Squashed. Awesome, thanks.

@Shourya742 Shourya742 requested a review from GitGab19 June 16, 2026 09:12
Comment thread pool-apps/pool/src/lib/monitoring.rs Outdated
@Shourya742 Shourya742 force-pushed the 2026-05-27-migrate-pool-to-new-synchronous-apis branch from 7b361d0 to f5a3a8f Compare June 16, 2026 14:52
remove channel_id from group_channel when we receive close channel message
@Shourya742 Shourya742 force-pushed the 2026-05-27-migrate-pool-to-new-synchronous-apis branch from f5a3a8f to 8452be1 Compare June 16, 2026 14:52
Comment thread stratum-apps/src/sync.rs
Comment on lines +126 to +128
/// Get an owned clone of a value.
///
/// Prefer [`Self::with`] when only part of a large value is needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make docs explicitly clear about the appropriate vs inappropriate usage of this method

@plebhash plebhash marked this pull request as draft June 16, 2026 16:28
Comment thread stratum-apps/src/sync.rs
Comment on lines +226 to 234
pub fn try_for_each<F, E>(&self, mut f: F) -> Result<(), E>
where
F: Fn(K, &V) -> Result<(), E>,
F: FnMut(K, &V) -> Result<(), E>,
{
for entry in self.0.iter() {
f(entry.key().clone(), entry.value())?;
}
Ok(())
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question about the for_each APIs. In cases like this, if an error occurs, we simply return. Since we're iterating over a collection, multiple items may fail, and we generally don't want a single error to cause an early exit and prevent the remaining items from being processed.

At the same time, I would prefer not to make the API surface significantly more complex. However, I think we do need some form of batching or error aggregation here to ensure that iteration isn't aborted prematurely.

All ears for new design.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My clanker gave me this example from tokio, which could be taken into consideration as a possible approach: https://docs.rs/tokio/latest/tokio/task/struct.JoinSet.html#method.join_all

In the join_all docs, they explicitly say: "If any tasks on the JoinSet fail with an JoinError, then this call to join_all will panic and all remaining tasks on the JoinSet are cancelled. To handle errors in any other way, manually call join_next in a loop.`

The join_next docs can be found here: https://docs.rs/tokio/latest/tokio/task/struct.JoinSet.html#method.join_next


In our case here, the method corresponding to the tokio's join_next() is the for_each() that we already have, right?

So I guess we can use that when we know there's a fallible operation, and collect the eventual errors at the caller site?

Maybe I'm missing something?

@plebhash plebhash Jun 17, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Shourya742 is there some part of this PR that you feel is blocked by this?

I'm asking because we already have #182 (batch error flow) and #257 (batched disconnect)

I'm not against brainstorming here, this is a good opportunity because in a way we're still trimming the rough edges of #368

but since we already have trackers for this specific issue, it might make sense to tackle that in a separate PR?

unless there's a concrete blocker for this PR ofc


btw, from a relatively superficial analysis I think next seems to make sense... instead of looping inside the function body, do it on the call site and handle errors individually (e.g.: disconnect) before moving on to the next iteration

maybe we can take note of next pattern as a potential solution on #182 and #257 and proceed with this PR without addressing this specific issue?

@jokeez

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor Pool to Reduce Nested Locking

4 participants