Conversation
84ef7ea to
a0ce7ec
Compare
|
/review |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
Co-Authored-By: Timothy Coleman <timothy.coleman@gmail.com>
… the consumer strategy skips events * When a consumer strategy skips an event, don't keep it in the buffer but add it to the retry list instead. The checkpointing mechanism considers the lowest message in `_retry` - not the lowest message in `_buffer`. Thus leaving it in the buffer may result in a wrong checkpoint being produced. * Always update the last known event when a new sequence number has been assigned A new sequence number being assigned means that we are seeing an event that was not processed before. In other words, it's the last event we've read from the source stream, so we update the last known sequence number / last known event position accordingly
the newSequenceNumberAssigned check isn't enough, because in the NoMoreCapacity case we discard the message that the _lastKnown vars are based on. then next time we call OutstandingMessage.ForPushedEvent the _lastKnownMessage we pass in will be wrong, potentially leading to an incorrect checkpoint (see test)
- safer because Scan() says we shouldn't add to each linked list while scanning it. before this commit we could add to _retry while scanning through it. now we only add to _retry while scanning _buffer - more efficient because before this commit adding to _retry involved searching through it, now we only add to _retry when skipping an event in _buffer in which case we know the event needs to go to the back of retry.
a0ce7ec to
ad80b7e
Compare
Originally (before pinned strategies)
With the addition of pinned strategies
Problems (these only affect pinned strategies):
This PRTo solve this, we
The easiest way to achieve both these things is, when we skip an event we give it a sequence number and move it to the _retry list immediately. This is natural because conceptually the Alternatively, when skipping an event we could give it a sequence number and leave it in the Note that:
|
There was a problem hiding this comment.
🚨 @shaan1337 Failed to create cherry Pick PR due to error:
RequestError [HttpError]: Merge conflict
at /home/runner/work/_actions/kurrent-io/Automations/master/cherry-pick-pr-for-label/node_modules/@octokit/request/dist-node/index.js:66:23
at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
status: '409',
headers: {
'access-control-allow-origin': '*',
'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset',
'content-length': '127',
'content-security-policy': "default-src 'none'",
'content-type': 'application/json; charset=utf-8',
date: 'Tue, 14 Apr 2026 06:37:10 GMT',
'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin',
server: 'github.com',
'strict-transport-security': 'max-age=31536000; includeSubdomains; preload',
vary: 'Accept-Encoding, Accept, X-Requested-With',
'x-accepted-github-permissions': 'contents=write',
'x-content-type-options': 'nosniff',
'x-frame-options': 'deny',
'x-github-api-version-selected': '2022-11-28',
'x-github-media-type': 'github.v3; format=json',
'x-github-request-id': 'F401:230BC3:2136B40:88AA66D:69DDE095',
'x-ratelimit-limit': '5000',
'x-ratelimit-remaining': '4990',
'x-ratelimit-reset': '1776152226',
'x-ratelimit-resource': 'core',
'x-ratelimit-used': '10',
'x-xss-protection': '0'
},
request: {
method: 'POST',
url: 'https://api.github.com/repos/kurrent-io/KurrentDB/merges',
headers: {
accept: 'application/vnd.github.v3+json',
'user-agent': 'octokit-core.js/3.3.2 Node.js/20.20.1 (linux; x64)',
authorization: 'token [REDACTED]',
'content-type': 'application/json; charset=utf-8'
},
body: '{"base":"cherry-pick-cherry-pick/5497/shaan1337/fix-wrong-ps-checkpoint-release/v24.10-01e7b37d-b44d-41d0-89ef-e650209d5b0a","commit_message":"Merge c07492e966e97aff3cf4fff0f27e51c960d21509 into cherry-pick-cherry-pick/5497/shaan1337/fix-wrong-ps-checkpoint-release/v24.10-01e7b37d-b44d-41d0-89ef-e650209d5b0a [skip ci]\\n\\n\\nskip-checks: true\\n","head":"c07492e966e97aff3cf4fff0f27e51c960d21509"}',
request: { agent: [Agent], hook: [Function: bound bound register] }
},
documentation_url: 'https://docs.github.com/rest/branches/branches#merge-a-branch'
}
🚨👉 Check https://github.com/kurrent-io/KurrentDB/actions/runs/24384744335
There was a problem hiding this comment.
@shaan1337 👉 Created pull request targeting release/v26.0: #5578
* [DB-1964]: Added migration engine for secondary index tables (#5561) * Initial version of the index DB migration engine * Fixed regression * Row must by passed by ref because its instance method mutates the struct fields * Create schema from script only when fresh setup * Removed version setup from the script file * Fixed regression * Removed redundant migration action * Fixed occasional NRE * Align guide with the source code * Fixed typo * Review feedback * Fixed indentation * Fixed identation * [DB-1951] Add MSA idempotency tests that correspond to the existing behaviour (#5565) * Add MSA idempotency tests that correspond to the existing behaviour * chore: restructure CLAUDE.md with progressive disclosure and add architect review agent (#5572) Restructure AI documentation for better context efficiency: - CLAUDE.md reduced from 812 to 150 lines, keeping only rules that affect every coding decision (threading model, C# conventions, log levels, parameter design, naming) - Extracted reference docs to .claude/docs/ with clear "when to fetch" pointers: architecture.md, api-v2-patterns.md, testing.md, protocol-v2.md, patterns-and-conventions.md - New content from Tim Coleman's projections-v2 review (PR #5562): - TcsEnvelope<> vs CallbackEnvelope threading rule (was showing anti-pattern as correct) - Non-optional parameters / no-fallbacks convention - ISystemClient preference over raw IPublisher - ClaimsPrincipal threading-through requirement - Projections V2 engine architecture and threading model - Add architect-review agent (.claude/agents/architect-review.md) encoding Tim Coleman's review methodology for future agentic code reviews Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Treat writes as idempotent if all events were previously written (#5547) An idempotent multi-stream write retry should succeed even if a check-only stream's expected version no longer matches. The events are already written and the check-only stream's state may have changed since, ultimately we resend the same response to the client as they should have received the first time they sent the request. Previously, a check-only stream (failing or not) caused the entire retry to fail with WrongExpectedVersion. * Fix incorrect result stream name for root partition in V2 projections The double-dash `$projections-{name}--result` was produced when the partition key was empty. Now uses `ProjectionNamesBuilder.MakeResultStreamName` — consistent with V1's `MakePartitionResultStreamName`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [DB-1912] Persistent subscriptions: Fix wrong checkpoint when using pinned strategy (#5497) * Add failing test Co-Authored-By: Timothy Coleman <timothy.coleman@gmail.com> * Fixed: Wrong persistent subscription checkpoints can be produced when the consumer strategy skips events * When a consumer strategy skips an event, don't keep it in the buffer but add it to the retry list instead. The checkpointing mechanism considers the lowest message in `_retry` - not the lowest message in `_buffer`. Thus leaving it in the buffer may result in a wrong checkpoint being produced. * Always update the last known event when a new sequence number has been assigned A new sequence number being assigned means that we are seeing an event that was not processed before. In other words, it's the last event we've read from the source stream, so we update the last known sequence number / last known event position accordingly * fix: only update the _lastKnown vars when event leaves the buffer the newSequenceNumberAssigned check isn't enough, because in the NoMoreCapacity case we discard the message that the _lastKnown vars are based on. then next time we call OutstandingMessage.ForPushedEvent the _lastKnownMessage we pass in will be wrong, potentially leading to an incorrect checkpoint (see test) * safer and more efficient - safer because Scan() says we shouldn't add to each linked list while scanning it. before this commit we could add to _retry while scanning through it. now we only add to _retry while scanning _buffer - more efficient because before this commit adding to _retry involved searching through it, now we only add to _retry when skipping an event in _buffer in which case we know the event needs to go to the back of retry. --------- Co-authored-by: Timothy Coleman <timothy.coleman@gmail.com> * Use $ProjectionState event type for V2 state checkpoints instead of Result because what we are doing here is checkpointing not outputting results. "Result" event type can be used when we implement outputState() which would emit the state after every event processed (at least, as long as the state changed) * [DB-1872] Improve pinned persistent subscription performance under burst load (#5576) * Improve persistent subscription performance under burst load When events arrive in a burst, each NotifyLiveSubscriptionMessage call was scanning the entire buffer to push events to clients. Client acks are queued behind the event flood in the same FIFO queue, so clients appear full and every scan is fruitless. With N events, this results in O(N²) total work. Instead of pushing synchronously on event arrival, we now schedule a push message on the PS queue. This lets events accumulate in the buffer cheaply, gives acks a chance to interleave, and batches the buffer scan. All PushToClients call sites use the deferred path for consistency, since the buffer scan cost applies equally to acks, nacks, timeouts, and client changes. * update package ref for cve CVE-2026-26171 * [DB-2006]: Migration to major .NEXT version (#5581) * Migration to major .NEXT version * Fixed dependency * Fixed regression * Fixed regression * Removed item index * review: fix regression * restore previous order --------- Co-authored-by: Timothy Coleman <timothy.coleman@gmail.com> * Write V2 state checkpoints to -state streams instead of -result streams Result streams are for user-visible output (outputState()), not internal state persistence. V2 was writing $ProjectionState events to -result streams, conflating checkpointing with output. State now goes to dedicated -state streams: - $projections-{name}-state (root partition) - $projections-{name}-{partition}-state (per partition) This keeps -result streams clean for future outputState() support and avoids collisions with the -checkpoint stream used for position. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Rename V2 state event type to $ProjectionState.V2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Rename V2 checkpoint event type to $ProjectionCheckpoint.V2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Remove V1-to-V2 migration test that was silently passing V2 can't parse V1's checkpoint events (V1 stores state in data, position in metadata; V2 expects position in data), so it falls back to starting from the beginning. The test passed because V2 reprocessed everything from scratch, not because migration worked. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Rename ProjectionsStateStreamSuffix to ProjectionsResultStreamSuffix The constant was confusingly named — it holds "-result", not "-state". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * better name... --------- Co-authored-by: Roman Sakno <roman.sakno@kurrent.io> Co-authored-by: Alexey Zimarev <alex@zimarev.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Shaan Nobee <sniper111@gmail.com>
Persistent subscription consumer strategies that can skip events can produce wrong checkpoints when events are skipped. The pinned consumer strategy is currently the only strategy that can skip events.
This PR fixes the following issues:
OutstandingMessagedid not take skipped events into consideration - this could result into wrongly assigned sequence numbers / previous event positions and subsequently to wrong checkpoints too.