perf(database): per-thread mysql connections, async player load, thread-safe saves#3989
perf(database): per-thread mysql connections, async player load, thread-safe saves#3989beats-dh wants to merge 6 commits into
Conversation
…ad-safe saves replace the single global-mutex mysql connection with one lazily-created connection per worker thread, making the query hot path lock-free. this removes the global serialization that froze the game loop during mass login. - database: per-thread ConnectionContext, lazy getContext, lock-free hot path - database: mysql_library_init once at startup; transaction deadlock/lock-wait retry (1213/1205); query capture (record-replay) for saves; connection logs - login: dispatch the heavy loadPlayerById to the thread pool; finishLogin runs the two global-state steps (guild, system init) on the dispatcher; reserveLogin guards against duplicate concurrent logins - save: serialize the player on the dispatcher and flush sql on the pool; per-player flush ordering; saveAll builds on dispatcher, flushes on pool - player: cache the players.save flag to avoid a select in the save build - docs: add database-connection-model.md
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughPer-thread lazy MySQL ConnectionContext replaces a global locked handle; query capture supports record/replay for thread-safe saves; transactions retry on transient deadlocks; heavy login DB work is moved to the pool with dispatcher-only deferred world-data and login reservation; saves are split into dispatcher build + pool flush. ChangesDatabase Concurrency Refactor with Async Player Operations
Sequence Diagram(s)sequenceDiagram
participant Dispatcher
participant Pool as Pool Thread
participant Conn as ConnectionContext
participant MySQL
Dispatcher->>Dispatcher: reserveLogin(guid)
Dispatcher->>Pool: detach loadPlayerById(deferWorldData=true)
Pool->>Conn: getContext() (thread-local, lazy)
Conn->>MySQL: mysql_thread_init + mysql_real_connect
Pool->>MySQL: SELECT player basic data
Pool->>Pool: skip guild/world-data steps (deferred)
Pool-->>Dispatcher: schedule finishLogin(loaded=true)
Dispatcher->>Dispatcher: finishLogin() continuation
Dispatcher->>Dispatcher: loadPlayerWorldData() (guild, exiva)
Dispatcher->>Dispatcher: place creature, enable packets
Dispatcher->>Dispatcher: releaseLogin(guid)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a per-thread MySQL connection model to resolve mass-login stalls by replacing the single global connection lock with lazily-created, thread-local connections. It also moves heavy player loading off the dispatcher thread and implements a thread-safe player saving mechanism. The review feedback highlights several critical issues: a potential transaction commit on failure in flushPlayerSave which should use rollback-on-failure; a data loss vulnerability where pending saves can be dropped if a player logs out while a save is in flight; a memory leak from not calling mysql_thread_end() on thread exit; and potential crashes due to missing null-pointer checks on connection handles in getLastInsertId and escapeBlob.
… cleanup - flushPlayerSave: use executeWithinTransactionRollbackOnFailure so a failing query rolls back the whole save (no partial commit) and triggers deadlock retry - save_manager: queue the already-built sql of an in-flight player's next save (m_pendingFlushes map) instead of a guid to re-look-up, so a final save (e.g. logout) is never dropped when the player object is gone as the flush completes - database: call mysql_thread_end() via a thread_local cleanup on thread exit - database: null-check the connection handle in getLastInsertId and escapeBlob
- ConnectionContext: delete copy ops (rule of 5) to prevent a double mysql_close - mark ignored return values from mysql_thread_init/mysql_options as (void) - use push_back instead of emplace_back where the returned reference is unused - drop redundant cast on mysql_insert_id (already returns uint64_t)
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (4)
src/database/database.cpp (1)
29-35: ⚡ Quick winReject nested query capture on the same thread.
beginQueryCapture()overwritestlsQueryCaptureunconditionally. A nestedQueryCaptureScopewill silently stop recording into the outer buffer, andendQueryCapture()then disables capture for the rest of the outer scope. Add a guard/assert here so misuse fails loudly instead of truncating the replay batch.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/database/database.cpp` around lines 29 - 35, beginQueryCapture currently overwrites the thread-local tlsQueryCapture allowing nested QueryCaptureScope to silently break the outer capture; modify Database::beginQueryCapture to check tlsQueryCapture and fail loudly (e.g., assert or throw) if it is already non-null, and add a complementary check in Database::endQueryCapture to assert that tlsQueryCapture is non-null before setting it to nullptr so improper nesting is detected immediately; reference Database::beginQueryCapture, Database::endQueryCapture, tlsQueryCapture and QueryCaptureScope when making the change.src/game/scheduling/save_manager.hpp (1)
56-62: ⚡ Quick winAdd direct coverage for the in-flight/resave ordering logic.
m_flushInFlightandm_resavePendingare the guardrails that keep same-player flushes ordered. Please add a focused test, or the smallest practical concurrency check, for “save requested during an in-flight flush triggers exactly one rebuild after the flush completes”.As per coding guidelines, "For C++ changes, prefer focused tests or the smallest practical build/check that validates the touched code".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/game/scheduling/save_manager.hpp` around lines 56 - 62, Add a focused unit test that simulates a player save being requested while a flush is in-flight and asserts exactly one rebuild occurs after the flush completes: create a test harness around the SaveManager (the class using m_flushInFlight and m_resavePending), start a simulated async flush for a test GUID by inserting into or triggering the code path that sets m_flushInFlight, then concurrently call the public save request API (e.g., requestSave/savePlayer or the method that enqueues resaves) to trigger the m_resavePending path, complete the in-flight flush, and verify that m_resavePending results in exactly one rebuild invocation (mock or spy the rebuild/flush callback) and that m_flushInFlight is cleared appropriately; keep the test minimal, deterministic (use condition variables/promises or injected schedulers to control timing), and only reference m_flushInFlight, m_resavePending, and the public request/flush/rebuild entry points.src/io/functions/iologindata_save_player.cpp (1)
176-178: ⚡ Quick winAdd a focused regression test for the cached
savefast path.This branch now trusts
Player::getSaveFlag()to skip the full player update. Please cover theplayers.save = 0case directly, including thatlastlogin/lastipstill persist, so a cache-init regression does not silently change persistence behavior.As per coding guidelines, "For C++ changes, prefer focused tests or the smallest practical build/check that validates the touched code".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/io/functions/iologindata_save_player.cpp` around lines 176 - 178, Add a focused regression test that exercises the cached save fast-path used in iologindata_save_player.cpp: simulate a player with players.save = 0 by initializing the cache via loadPlayerBasicInfo (or the test helper that mirrors login cache init), call the code path that leads to Player::getSaveFlag() returning false so the DB SELECT is skipped, and assert that lastlogin and lastip are still persisted to the DB despite the fast-path; locate relevant symbols Player::getSaveFlag(), loadPlayerBasicInfo, and the function containing the fast-path in iologindata_save_player.cpp to hook the test and verify persistence for the players.save = 0 case.src/game/game.hpp (1)
178-184: ⚡ Quick winAdd a focused regression check for the login reservation contract.
This dispatcher-only guard now sits on the duplicate-login path, so it deserves a small validation that covers reserve → reject duplicate → release → allow again before merge.
As per coding guidelines,
**/*.{cpp,hpp,h,cc,cxx}: For C++ changes, prefer focused tests or the smallest practical build/check that validates the touched code.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/game/game.hpp` around lines 178 - 184, Add a focused regression unit test that verifies the dispatcher-only login reservation contract: call Game::reserveLogin(guid) and assert it returns true, then call reserveLogin(guid) again and assert it returns false (duplicate rejected), then call Game::releaseLogin(guid) and finally assert reserveLogin(guid) returns true again; put the test in the smallest relevant test target (a new or existing unit test for src/game, e.g. game_reservation_test) and ensure it runs in single-threaded/dispatcher context so no locking is required and uses the public methods reserveLogin, releaseLogin, and isLoginPending for assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/database/database.cpp`:
- Around line 131-154: Move publishing the thread-local cache until after a
successful connection: in Database::getContext(), create the ConnectionContext
and emplace it into connections under connectionsMutex as you do now, but call
establishConnection(*contextPtr) before assigning tlsContext = contextPtr; if
establishConnection fails, under connectionsMutex erase the just-added
unique_ptr from connections (so the failed context is destroyed and not cached)
and do not set tlsContext; only set tlsContext and log success when
establishConnection returns true. Ensure you reference ConnectionContext,
tlsContext, establishConnection, connections and connectionsMutex when making
these changes.
- Around line 23-27: The code calls mysql_library_init() in Database::connect()
but never calls mysql_library_end(); add a single global shutdown call to
mysql_library_end() executed after all worker threads are joined and after
per-thread mysql_thread_end() (ThreadCleanup) has run—either by invoking
mysql_library_end() from the Database destructor (ensure destructor runs after
thread join logic) or by installing a std::call_once/atexit hook that runs once
at process shutdown; guard the call so it only runs if mysql_library_init()
previously succeeded (use an init flag or the same call_once token).
In `@src/game/game.cpp`:
- Around line 11090-11103: Add a runtime dispatcher-thread guard to the three
functions that touch the dispatcher-only m_pendingLogins: reserveLogin,
releaseLogin, and isLoginPending. At the start of each method assert or
otherwise check DispatcherContext::isOn() (or use g_dispatcher().context()
helper) to fail fast if called off the dispatcher, then proceed with the
existing logic; this enforces the documented dispatcher-only contract and
prevents unsynchronized access to m_pendingLogins.
In `@src/game/scheduling/save_manager.cpp`:
- Around line 29-49: The full-save path in SaveManager::buildAllPlayers builds
PlayerSaveBatch entries without GUIDs and bypasses the m_flushInFlight /
m_resavePending ordering used by schedulePlayer(), allowing older full-save
snapshots to overwrite newer per-player saves; fix by including each player's
GUID in PlayerSaveBatch (use player->getGUID() or equivalent) and before
dispatching the full-save batch mark those GUIDs in m_flushInFlight (or push
each player through the same schedulePlayer() path) so the same per-GUID
ordering/in-flight tracking is used; update IOLoginData::buildPlayerSave usage
if needed to carry the GUID and ensure the batch entries are marked
in-flight/resavePending consistent with schedulePlayer() prior to dispatching
the DB flush.
In `@src/io/iologindata.cpp`:
- Around line 231-240: The lambda passed to
DBTransaction::executeWithinTransaction in IOLoginData::flushPlayerSave
currently returns false on a failed Database::executeQuery, but
executeWithinTransaction commits on boolean returns; change the logic to throw
an exception when db.executeQuery(query) fails (e.g., std::runtime_error with
context including the failed query) so the transaction mechanism will detect the
exception and roll back; keep returning true at the end of the lambda for the
success path.
In `@src/server/network/protocol/protocolgame.cpp`:
- Around line 692-697: The pool task dereferences self->player which may be
cleared by release() if the client disconnects; capture the reserved player's
GUID (and if needed a copy of the minimal player identity) into a local variable
before calling g_threadPool().detach_task and pass that captured guid into both
the worker lambda and the continuation so finishLogin and any releaseLogin(guid)
use the reserved GUID rather than reading self->player; update the detach_task
lambda and the g_dispatcher().addEvent continuation to accept the captured guid
(instead of relying on self->player), and ensure finishLogin(signature) or its
call site uses that captured guid for reservation cleanup.
---
Nitpick comments:
In `@src/database/database.cpp`:
- Around line 29-35: beginQueryCapture currently overwrites the thread-local
tlsQueryCapture allowing nested QueryCaptureScope to silently break the outer
capture; modify Database::beginQueryCapture to check tlsQueryCapture and fail
loudly (e.g., assert or throw) if it is already non-null, and add a
complementary check in Database::endQueryCapture to assert that tlsQueryCapture
is non-null before setting it to nullptr so improper nesting is detected
immediately; reference Database::beginQueryCapture, Database::endQueryCapture,
tlsQueryCapture and QueryCaptureScope when making the change.
In `@src/game/game.hpp`:
- Around line 178-184: Add a focused regression unit test that verifies the
dispatcher-only login reservation contract: call Game::reserveLogin(guid) and
assert it returns true, then call reserveLogin(guid) again and assert it returns
false (duplicate rejected), then call Game::releaseLogin(guid) and finally
assert reserveLogin(guid) returns true again; put the test in the smallest
relevant test target (a new or existing unit test for src/game, e.g.
game_reservation_test) and ensure it runs in single-threaded/dispatcher context
so no locking is required and uses the public methods reserveLogin,
releaseLogin, and isLoginPending for assertions.
In `@src/game/scheduling/save_manager.hpp`:
- Around line 56-62: Add a focused unit test that simulates a player save being
requested while a flush is in-flight and asserts exactly one rebuild occurs
after the flush completes: create a test harness around the SaveManager (the
class using m_flushInFlight and m_resavePending), start a simulated async flush
for a test GUID by inserting into or triggering the code path that sets
m_flushInFlight, then concurrently call the public save request API (e.g.,
requestSave/savePlayer or the method that enqueues resaves) to trigger the
m_resavePending path, complete the in-flight flush, and verify that
m_resavePending results in exactly one rebuild invocation (mock or spy the
rebuild/flush callback) and that m_flushInFlight is cleared appropriately; keep
the test minimal, deterministic (use condition variables/promises or injected
schedulers to control timing), and only reference m_flushInFlight,
m_resavePending, and the public request/flush/rebuild entry points.
In `@src/io/functions/iologindata_save_player.cpp`:
- Around line 176-178: Add a focused regression test that exercises the cached
save fast-path used in iologindata_save_player.cpp: simulate a player with
players.save = 0 by initializing the cache via loadPlayerBasicInfo (or the test
helper that mirrors login cache init), call the code path that leads to
Player::getSaveFlag() returning false so the DB SELECT is skipped, and assert
that lastlogin and lastip are still persisted to the DB despite the fast-path;
locate relevant symbols Player::getSaveFlag(), loadPlayerBasicInfo, and the
function containing the fast-path in iologindata_save_player.cpp to hook the
test and verify persistence for the players.save = 0 case.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0125f01a-d637-46af-88c8-da290ec5e6d6
📒 Files selected for processing (15)
docs/database-connection-model.mdsrc/creatures/players/player.hppsrc/database/database.cppsrc/database/database.hppsrc/game/game.cppsrc/game/game.hppsrc/game/scheduling/save_manager.cppsrc/game/scheduling/save_manager.hppsrc/io/functions/iologindata_load_player.cppsrc/io/functions/iologindata_load_player.hppsrc/io/functions/iologindata_save_player.cppsrc/io/iologindata.cppsrc/io/iologindata.hppsrc/server/network/protocol/protocolgame.cppsrc/server/network/protocol/protocolgame.hpp
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/game/scheduling/save_manager.hpp`:
- Around line 57-64: The full-save path (scheduleAll()/flushBuiltPlayers())
currently bypasses the per-player ordering guarded by m_flushInFlight and
m_pendingFlushes, allowing stale full-save batches to overwrite newer per-player
saves; modify scheduleAll()/flushBuiltPlayers() to route each player flush
through the same per-guid coordinator used by schedulePlayer() (i.e.,
check/insert into m_flushInFlight and enqueue into m_pendingFlushes instead of
directly writing), or alternatively have scheduleAll() mark per-guid
batch-in-flight (and block/merge schedulePlayer() for those GUIDs) so all
flushes for a given guid go through the single serialized path implemented by
m_flushInFlight and m_pendingFlushes. Ensure you update flushBuiltPlayers() to
use the same enqueue/flush logic as schedulePlayer() so older full-save SQL
cannot clobber newer per-player saves.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5cead72b-cfdb-4c44-b5cf-d18d7ec0c7ee
📒 Files selected for processing (6)
docs/database-connection-model.mdsrc/database/database.cppsrc/database/database.hppsrc/game/scheduling/save_manager.cppsrc/game/scheduling/save_manager.hppsrc/io/iologindata.cpp
✅ Files skipped from review due to trivial changes (1)
- docs/database-connection-model.md
🚧 Files skipped from review as they are similar to previous changes (4)
- src/database/database.hpp
- src/io/iologindata.cpp
- src/game/scheduling/save_manager.cpp
- src/database/database.cpp
- login: capture the reserved guid + a local shared_ptr before the pool load, so a mid-load disconnect can't null-deref self->player or leak the reservation - save_manager: route full-save (scheduleAll) players through schedulePlayer so the per-guid flush ordering also covers the global save — no older snapshot overwriting a newer per-player save - database: don't cache a failed connection in the thread_local (retry next call) - database: pair mysql_library_init with mysql_library_end on shutdown - database: assert against nested query capture - game: assert dispatcher thread for reserveLogin/releaseLogin/isLoginPending
Replace the global std::atomic init flag with a Database member bool, avoiding the global-mutable and CTAD code smells. The singleton that runs mysql_library_init() is the one that pairs mysql_library_end() on shutdown (set/read single-threaded at startup/teardown, so no atomic is needed).
|



Problem
Every DB query in the server went through a single
MYSQL*handle behind one globalstd::recursive_mutex(Database::databaseLock), so all queries — from every thread — were serialized.ProtocolGame::login()runs on the dispatcher (game-loop) thread and callsloadPlayerById, which issues ~20 sequential queries per character. Under mass login the dispatcher sat blocked on serialized DB I/O and the whole world stopped updating (movement, combat,SERVER_BEAT) while CPU stayed mostly idle.What changed
1. One MySQL connection per worker thread (lock-free hot path)
Databaseno longer holds a single handle. Each thread that touches the DB gets its ownConnectionContext(handle +maxPacketSize+lastErrno), created lazily on first query and cached in athread_localpointer. A registry vector owns the contexts for shutdown; its mutex is taken once per thread, never on the query path. Because a context belongs to a single thread, the handle needs no lock —storeQuery/executeQuery/escapeBlobusegetContext().handledirectly. PublicDatabaseAPI is unchanged (~150 call sites untouched); connection params are captured inconnect()and reused for each new per-thread connection.mysql_library_init()is now called explicitly once (viacall_onceinconnect(), single-threaded at startup) —mysql_init()'s implicit init is not thread-safe when two threads race their first connection.mysql_thread_init()per connection, andmysql_thread_end()via athread_localcleanup on thread exit.DBTransaction::executeWithinTransaction*now retries the whole transaction up to 3x.2. Player load off the dispatcher (login)
login()runs the cheap checks + waiting list, reserves the login (Game::reserveLogin, a dispatcher-only set), then dispatchesloadPlayerByIdto the thread pool. Audit found only 2 of ~27 load steps mutate shared global state and are unsafe on a pool thread:loadPlayerGuild→ writes the global guild map and mutates a sharedGuild.loadPlayerInitializeSystem→ mutates a sharedPartyvia the wheel.These (plus
updateSystem/exiva, which must run afterinitializeSystem) are skipped on the pool (deferWorldData = true) and run on the dispatcher infinishLoginvialoadPlayerWorldData.finishLoginthen does the PZ check,placeCreature,acceptPackets, andreleaseLoginon every exit path.3. Thread-safe player save
The save already ran on pool threads but read the live
Playerwhile the dispatcher mutated it — a data race.savePlayeris now split intobuildPlayerSave(serialize the player to SQL — runs on the dispatcher, consistent read) andflushPlayerSave(execute the SQL in a transaction — pool). It uses aDatabasequery-capture mode: during the build,executeQueryrecords SQL into a buffer instead of running it. Theplayers.saveflag is cached on thePlayerat login so the build needs no DB round-trip (pure CPU). Per-player flushes are serialized/ordered (m_flushInFlight/m_pendingFlushes) so an older flush can never overwrite a newer one, and a save queued behind an in-flight one keeps its built SQL so a final (logout) save is never dropped.saveAllbuilds on the dispatcher and flushes on the pool.Review notes
DEFAULT_NUMBER_OF_THREADS, default 4), bounded by MySQLmax_connections. Reducing the ~20 queries/load is the next lever.initializeSystem → updateSystem → exivaorder is preserved. No load step readsplayer->guild, so this is safe — worth a second look.executeQueryduring a build (only known case: a KV cache eviction) would be captured into the save buffer. Rare and bounded; documented in docs/database-connection-model.md.mysql_thread_end()is released on thread exit via athread_localcleanup.Not done
Mass-login load test with bots is pending. Validated so far: build, boot (per-thread connection logs appear as threads reach the DB), and login/logout/relog.
📄 Full rationale and trade-offs: docs/database-connection-model.md
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Documentation