Migrate pool to new synchronous API#531
acdb471 to f9e19b3
Opening this up for review. Let me know what you think of the API surface and whether anything can be improved or is unclear.
76f7eb1 to 99ed182
My clanker found a real concern about the usage of |
Added here: b3dcf5d, thanks. |
|
I'm a bit concerned with the usage of
I assume it's being introduced as a strategy to optimize against contention from long-held entry guards in
more specifically, the point raised by @GitGab19 above has not been fully addressed
there are other potential TOCTOU race windows related to similar usage of
b3dcf5d partially mitigates the issue, but open-channel handlers still allow stale cloned downstream handles to commit channel/vardiff state after disconnect...
while I think this could be ok to live with as a deliberate tradeoff for this specific scenario, we're still violating invariants
and I also don't think we can trivially generalize this pattern to solve all potential TOCTOU race windows listed above
overall, we're already swapping the monolithic/nested-lock model for fine-grained
let's not forget that the goal of #368 was to introduce safe
moreover, #529 still needs to be fixed, so even if de-nesting locks +
so IMHO we should do things one step at a time: for now, avoid aggressively using for now, if we later see that contention remains an issue, then we should leverage the foundations established by #368 + this PR to address it with a proper strategy
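To make the stale-handle window concrete, here is a minimal sketch of the pattern being discussed. `Registry`, `Downstream`, and the `get_cloned` signature are hypothetical stand-ins built on `std` types, not the PR's actual API. The "disconnect" runs on the same thread so the interleaving is deterministic.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

/// Hypothetical downstream handle: clones share the same inner state.
#[derive(Clone, Default)]
struct Downstream {
    channels: Arc<Mutex<Vec<u32>>>,
}

/// Hypothetical stand-in for the pool's map wrapper.
#[derive(Default)]
struct Registry(RwLock<HashMap<u32, Downstream>>);

impl Registry {
    fn insert(&self, id: u32, d: Downstream) {
        self.0.write().unwrap().insert(id, d);
    }

    fn remove(&self, id: u32) -> Option<Downstream> {
        self.0.write().unwrap().remove(&id)
    }

    fn contains(&self, id: u32) -> bool {
        self.0.read().unwrap().contains_key(&id)
    }

    /// Returns an owned clone; the map lock is released before the caller uses it.
    fn get_cloned(&self, id: u32) -> Option<Downstream> {
        self.0.read().unwrap().get(&id).cloned()
    }
}

/// Returns (downstream still registered?, channels committed via the stale handle).
fn stale_commit_demo() -> (bool, usize) {
    let registry = Registry::default();
    registry.insert(1, Downstream::default());

    // Time of check: the handle is cloned out of the map.
    let handle = registry.get_cloned(1).expect("downstream 1 was just inserted");

    // A disconnect removes the downstream before the handler finishes.
    registry.remove(1);

    // Time of use: the commit still succeeds on the stale clone.
    handle.channels.lock().unwrap().push(42);

    let committed = handle.channels.lock().unwrap().len();
    (registry.contains(1), committed)
}
```

In this sketch the write lands on state that is no longer reachable through the registry, which is the invariant violation described above.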
|
The argument that the above suggestion is only partially valid is not entirely sound. If you look closely, the concern raised above is that we are checking liveness only at the end rather than at every commit point. This is intentional, as I wanted to keep the logic simple and easier to reason about. These intermediate commit points do not materially affect the outcome. If a downstream disconnect occurs during channel opening, performing the cleanup at the end is sufficient. Otherwise, we would need to introduce cleanup logic after every commit step, which would significantly increase complexity and reduce maintainability.
Regarding the TOCTOU concern, where it is mentioned that a race could occur between the channel manager methods and the vardiff task, I agree that the concern is partially valid (only for the vardiff and select tasks). However, all channel manager methods execute sequentially, so a race condition cannot occur between those methods themselves. The only concurrency involved is with the vardiff task.
A side note: partial commits are OK if the object is eventually going to be dropped.
It would be useful if you could list the TOCTOU scenarios. Thanks for the review though; it would be awesome to discuss this more.
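For reference, the "check liveness once at the end, then clean up" flow described above could look roughly like this. It is only a sketch: `PoolState`, its fields, and `open_channel` are hypothetical names, and the real handlers have more commit points than the single one shown.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

/// Hypothetical state: live downstream ids and a channel_id -> downstream_id map.
#[derive(Default)]
struct PoolState {
    live_downstreams: Mutex<HashSet<u32>>,
    channels: Mutex<HashMap<u32, u32>>,
}

impl PoolState {
    /// Commits channel state first and checks liveness only once, at the end.
    /// Returns `true` if the channel stays open.
    fn open_channel(&self, downstream_id: u32, channel_id: u32) -> bool {
        // Intermediate commit point: no liveness check here.
        self.channels.lock().unwrap().insert(channel_id, downstream_id);

        // Single liveness check after all commit points.
        let alive = self
            .live_downstreams
            .lock()
            .unwrap()
            .contains(&downstream_id);

        if !alive {
            // The downstream disconnected in the meantime: undo the partial commit.
            self.channels.lock().unwrap().remove(&channel_id);
        }
        alive
    }
}
```

Note that between the commit and the final check, other tasks can still observe the partially committed channel, which is the window the review comments are pointing at.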
that is not the claim here, it's about
end of the day, the truth is that with this extensive usage of
if we decided to live with these loose guarantees, we should consciously accept the following tradeoffs:
(maybe more, these are the ones I can identify right now)
but the question is: do we want/need to make that decision now?
overall I'm not going to die on this hill, and I don't want to spend more time debating all the details of this rabbit hole
I'd rather focus on my broader point, which I'm happy to clarify in case it's not already sufficiently clear:
the extensive usage of
IMHO we should stick to the originally planned scope and measure the performance gains before we jump into new strategies for contention optimization...
if we later find that we're still facing performance issues, we can evaluate
|
Hmm, the idea behind introducing the clone was not because of any technical limitation or API issue. It was mainly about making the code easier to write and follow. If you look at the code now, it is just a simple flow: get the object, perform the operation, and move on, letting the system follow eventual consistency. Even if the object becomes invalid, the operation can still happen, and the system will eventually recover and become consistent again. That was the idea behind this. The introduction of
That said, nothing is stopping us from providing atomic access to objects and ensuring they cannot become invalid while they are being used. That would give us stronger consistency guarantees (like before). I agree that we should get some numbers before deciding. I will add a commit that prevents operations on invalid objects. Cool.
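As a rough illustration of the "atomic access" alternative, the sketch below runs the operation inside a closure while the map lock is held, so a removal cannot interleave with it. `Registry`, `with_mut`, and `open_channel` are hypothetical names on top of `std::sync::RwLock`; the PR's own wrapper may expose this differently.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical registry: downstream id -> list of open channel ids.
#[derive(Default)]
struct Registry(RwLock<HashMap<u32, Vec<u32>>>);

impl Registry {
    fn insert(&self, id: u32) {
        self.0.write().unwrap().insert(id, Vec::new());
    }

    fn remove(&self, id: u32) {
        self.0.write().unwrap().remove(&id);
    }

    /// Runs `f` on the entry while the lock is held. Returns `None` (and never
    /// calls `f`) if the entry has already been removed.
    fn with_mut<R>(&self, id: u32, f: impl FnOnce(&mut Vec<u32>) -> R) -> Option<R> {
        self.0.write().unwrap().get_mut(&id).map(f)
    }
}

/// Returns `true` only if the channel was recorded on a still-registered downstream.
fn open_channel(registry: &Registry, downstream_id: u32, channel_id: u32) -> bool {
    registry
        .with_mut(downstream_id, |channels| channels.push(channel_id))
        .is_some()
}
```

The tradeoff is the one debated in this thread: the guard is held for the duration of the closure, which gives the stronger guarantee at the cost of potential contention.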
8ecf55a to ce528a0
Removed get_cloned for downstream, here: ce528a0. I will squash commit later, would make it easy to review. |
I think this is a reasonable rationale, and I wouldn't completely rule it out of the design space yet. but tbh conflicting logs and ghost vardiff entries that could last up to 60s in memory worry me about the UX implications now that
I empathize with the urge to make devX better. It's been painful to live with closure complexity of the soon-to-be-deprecated custom But we made it this far, and migrating away from it is a delicate operation. So I feel it's wiser to move slowly, evaluate consequences across the entire SRI stack, and optimize things gradually. We laid a solid foundation in #368. Let's take some time and explore it. How will granularly locked sharded maps perform in the face of curtailed loads (with granular instrumentation of the source code)? I'm genuinely curious to look at those numbers!
awesome, thanks! |
c7c485a to 7b361d0
7b361d0 to f5a3a8f
remove channel_id from group_channel when we receive close channel message
f5a3a8f to 8452be1
/// Get an owned clone of a value.
///
/// Prefer [`Self::with`] when only part of a large value is needed.
let's make docs explicitly clear about the appropriate vs inappropriate usage of this method
  pub fn try_for_each<F, E>(&self, mut f: F) -> Result<(), E>
  where
-     F: Fn(K, &V) -> Result<(), E>,
+     F: FnMut(K, &V) -> Result<(), E>,
  {
      for entry in self.0.iter() {
          f(entry.key().clone(), entry.value())?;
      }
      Ok(())
  }
I have a question about the for_each APIs. In cases like this, if an error occurs, we simply return. Since we're iterating over a collection, multiple items may fail, and we generally don't want a single error to cause an early exit and prevent the remaining items from being processed.
At the same time, I would prefer not to make the API surface significantly more complex. However, I think we do need some form of batching or error aggregation here to ensure that iteration isn't aborted prematurely.
All ears for new design.
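One possible shape for the aggregation idea, as a sketch only: keep iterating on failure and return every `(key, error)` pair at the end. `SharedMap` and `for_each_collect_errors` are hypothetical names, and a `std` `RwLock<HashMap>` stands in for the PR's actual map type.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::RwLock;

/// Hypothetical stand-in for the PR's map wrapper.
struct SharedMap<K, V>(RwLock<HashMap<K, V>>);

impl<K: Clone + Eq + Hash, V> SharedMap<K, V> {
    /// Visits every entry; failures are collected instead of aborting the iteration.
    fn for_each_collect_errors<F, E>(&self, mut f: F) -> Vec<(K, E)>
    where
        F: FnMut(K, &V) -> Result<(), E>,
    {
        let guard = self.0.read().unwrap();
        let mut errors = Vec::new();
        for (key, value) in guard.iter() {
            if let Err(e) = f(key.clone(), value) {
                errors.push((key.clone(), e));
            }
        }
        errors
    }
}

/// Returns (entries visited, sorted keys that failed).
fn demo() -> (usize, Vec<u32>) {
    let map = SharedMap(RwLock::new(HashMap::from([(1u32, 10i32), (2, -5), (3, -7)])));
    let mut visited = 0;
    let errors = map.for_each_collect_errors(|_id, value| {
        visited += 1;
        if *value < 0 { Err("negative value") } else { Ok(()) }
    });
    let mut failed: Vec<u32> = errors.into_iter().map(|(id, _)| id).collect();
    failed.sort_unstable();
    (visited, failed)
}
```

This adds one method to the API surface; whether that is preferable to handling errors at the call site is exactly the open question here.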
My clanker gave me this example from tokio, which could be taken into consideration as a possible approach: https://docs.rs/tokio/latest/tokio/task/struct.JoinSet.html#method.join_all
In the join_all docs, they explicitly say: "If any tasks on the JoinSet fail with an JoinError, then this call to join_all will panic and all remaining tasks on the JoinSet are cancelled. To handle errors in any other way, manually call join_next in a loop."
The join_next docs can be found here: https://docs.rs/tokio/latest/tokio/task/struct.JoinSet.html#method.join_next
In our case here, the method corresponding to tokio's join_next() is the for_each() that we already have, right?
So I guess we can use that when we know there's a fallible operation, and collect the eventual errors at the caller site?
Maybe I'm missing something?
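A sketch of what collecting at the caller site might look like, assuming an infallible `for_each` that takes an `FnMut` closure. `SharedMap`, `notify`, and `notify_all` are hypothetical names for illustration, and the map is again a plain `std` `RwLock<HashMap>`.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::RwLock;

/// Hypothetical stand-in for the PR's map wrapper.
struct SharedMap<K, V>(RwLock<HashMap<K, V>>);

impl<K: Clone + Eq + Hash, V> SharedMap<K, V> {
    /// Infallible iteration: the closure decides what to do with failures.
    fn for_each<F: FnMut(K, &V)>(&self, mut f: F) {
        for (key, value) in self.0.read().unwrap().iter() {
            f(key.clone(), value);
        }
    }

    fn remove(&self, key: &K) -> Option<V> {
        self.0.write().unwrap().remove(key)
    }

    fn len(&self) -> usize {
        self.0.read().unwrap().len()
    }
}

/// Hypothetical fallible operation on one entry.
fn notify(healthy: &bool) -> Result<(), &'static str> {
    if *healthy { Ok(()) } else { Err("send failed") }
}

/// Notifies every entry, then disconnects the ones that failed.
fn notify_all(map: &SharedMap<u32, bool>) -> Vec<u32> {
    let mut failed = Vec::new();
    map.for_each(|id, healthy| {
        if notify(healthy).is_err() {
            failed.push(id); // record the failure and keep iterating
        }
    });
    // The read lock is released here, so taking the write lock is safe.
    for id in &failed {
        map.remove(id);
    }
    failed.sort_unstable();
    failed
}
```

With this particular stand-in, removing entries from inside the closure would try to take the write lock while the read lock is held, so the cleanup is deferred until after the iteration.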
@Shourya742 is there some part of this PR that you feel is blocked by this?
I'm asking because we already have #182 (batch error flow) and #257 (batched disconnect)
I'm not against brainstorming here, this is a good opportunity because in a way we're still trimming the rough edges of #368
but since we already have trackers for this specific issue, it might make sense to tackle that in a separate PR?
unless there's a concrete blocker for this PR ofc
btw, from a relatively superficial analysis I think next seems to make sense... instead of looping inside the function body, do it on the call site and handle errors individually (e.g.: disconnect) before moving on to the next iteration
maybe we can take note of next pattern as a potential solution on #182 and #257 and proceed with this PR without addressing this specific issue?
closes: #205