Once #7630 and #7648 are merged, this is where we will stand (benchmarked on a combined branch):
4 threads with value size = 128, average of three measurements
And a profile of the top-performing case:
pprof top 50
Showing nodes accounting for 39.11s, 65.78% of 59.46s total
Dropped 567 nodes (cum <= 0.30s)
Showing top 70 nodes out of 244
flat flat% sum% cum cum%
5.55s 9.33% 9.33% 5.55s 9.33% __memcpy_sve
5.16s 8.68% 18.01% 6.59s 11.08% dfly::detail::Segment::Bucket::FindByFp
2.24s 3.77% 21.78% 9.78s 16.45% dfly::detail::Segment::FindIt (inline)
1.29s 2.17% 23.95% 1.29s 2.17% _mi_page_malloc_zero (partial-inline)
1.19s 2.00% 25.95% 1.26s 2.12% dfly::CompactKey::operator== (inline)
1.08s 1.82% 27.77% 1.08s 1.82% dfly::CompactObj::ObjType
0.95s 1.60% 29.36% 38.87s 65.37% dfly::MultiCommandSquasher::SquashedHopCb
0.91s 1.53% 30.89% 0.91s 1.53% dfly::detail::Segment::HomeIndex (inline)
0.78s 1.31% 32.21% 0.78s 1.31% dfly::detail::ascii_unpack
0.77s 1.29% 33.50% 0.77s 1.29% io_uring_submit_and_get_events
0.69s 1.16% 34.66% 3.31s 5.57% std::__do_visit (partial-inline)
0.68s 1.14% 35.81% 0.68s 1.14% std::pair::pair (inline)
0.67s 1.13% 36.93% 0.67s 1.13% base::it::Range::end (inline)
0.56s 0.94% 37.87% 0.56s 0.94% XXH64
0.56s 0.94% 38.82% 0.64s 1.08% std::__detail::__variant::_Uninitialized::_Uninitialized (inline)
0.53s 0.89% 39.71% 0.53s 0.89% [[vdso]]
0.52s 0.87% 40.58% 3.50s 5.89% dfly::Transaction::InitByKeys
0.47s 0.79% 41.37% 0.47s 0.79% dfly::Transaction::GetDbSlice
0.47s 0.79% 42.16% 22.86s 38.45% dfly::Transaction::ScheduleInShard
0.47s 0.79% 42.95% 58.70s 98.72% operator() (inline)
0.43s 0.72% 43.68% 4.81s 8.09% dfly::MultiCommandSquasher::TrySquash
0.41s 0.69% 44.37% 0.50s 0.84% facade::OpResult::OpResult (inline)
0.40s 0.67% 45.04% 0.40s 0.67% base::it::Range::begin (inline)
0.37s 0.62% 45.66% 0.40s 0.67% dfly::Namespace::GetDbSlice
0.37s 0.62% 46.28% 1.38s 2.32% facade::RespSrvParser::ParseInline
0.37s 0.62% 46.91% 0.57s 0.96% std::__upper_bound (inline)
0.36s 0.61% 47.51% 0.36s 0.61% dfly::KeyIndex::KeyIndex (inline)
0.35s 0.59% 48.10% 0.36s 0.61% dfly::Transaction::MultiSwitchCmd
0.34s 0.57% 48.67% 0.37s 0.62% dfly::AllocationTracker::ProcessDelete
0.33s 0.55% 49.23% 0.33s 0.55% dfly::AllocationTracker::Get
0.33s 0.55% 49.78% 0.87s 1.46% util::fb2::EventCount::NotifyInternal (inline)
0.32s 0.54% 50.32% 0.48s 0.81% dfly::DbSlice::Acquire
0.32s 0.54% 50.86% 0.37s 0.62% dfly::LockTagOptions::instance
0.32s 0.54% 51.40% 0.93s 1.56% dfly::Shard
0.31s 0.52% 51.92% 0.37s 0.62% dfly::CompactObj::Size
0.30s 0.5% 52.42% 4.31s 7.25% facade::Connection::ParseRedis
0.30s 0.5% 52.93% 0.48s 0.81% std::__cxx11::basic_string::_M_is_local (inline)
0.30s 0.5% 53.43% 0.46s 0.77% std::construct_at (inline)
0.29s 0.49% 53.92% 7.61s 12.80% dfly::CompactObj::GetString
0.29s 0.49% 54.41% 30.60s 51.46% dfly::Service::InvokeCmd
0.29s 0.49% 54.89% 0.47s 0.79% dfly::intrusive_ptr_release (partial-inline)
0.28s 0.47% 55.36% 28.09s 47.24% dfly::(anonymous namespace)::CmdGet
0.28s 0.47% 55.84% 24.18s 40.67% dfly::Transaction::ScheduleInternal
0.27s 0.45% 56.29% 21.35s 35.91% absl::lts_20250512::functional_internal::InvokeObject (partial-inline)
0.27s 0.45% 56.74% 0.35s 0.59% facade::CmdArgParser::Next
0.27s 0.45% 57.20% 0.61s 1.03% util::fb2::EmbeddedBlockingCounter::Dec
0.26s 0.44% 57.64% 0.34s 0.57% __time
0.26s 0.44% 58.07% 0.87s 1.46% dfly::Service::VerifyCommandState
0.26s 0.44% 58.51% 0.43s 0.72% dfly::Transaction::CanRunInlined
0.26s 0.44% 58.95% 11.86s 19.95% facade::Connection::SquashPipeline
0.25s 0.42% 59.37% 0.32s 0.54% dfly::acl::IsUserAllowedToInvokeCommandGeneric
0.24s 0.4% 59.77% 0.55s 0.92% cmn::HeapSize (inline)
0.24s 0.4% 60.17% 0.34s 0.57% facade::ParsedCommand::Resolve
0.24s 0.4% 60.58% 0.34s 0.57% util::fb2::EventCount::await
0.23s 0.39% 60.97% 0.45s 0.76% absl::lts_20250512::inlined_vector_internal::Storage::Resize
0.23s 0.39% 61.35% 0.31s 0.52% facade::CapturingReplyBuilder::Take[abi:cxx11]
0.21s 0.35% 61.71% 0.51s 0.86% dfly::CommandContext::RecordLatency
0.20s 0.34% 62.04% 0.30s 0.5% facade::Connection::DispatchSingle
0.20s 0.34% 62.38% 0.51s 0.86% facade::Connection::ReleaseParsedCommand
0.20s 0.34% 62.71% 0.93s 1.56% facade::SinkReplyBuilder::WritePieces
0.20s 0.34% 63.05% 0.48s 0.81% std::operator!=()::{lambda(auto:1&&, auto:2)#1}::operator() (inline)
0.19s 0.32% 63.37% 6.12s 10.29% dfly::MultiCommandSquasher::Run
0.19s 0.32% 63.69% 2.08s 3.50% dfly::Transaction::StoreKeysInArgs
0.19s 0.32% 64.01% 0.31s 0.52% facade::ParsedCommand::ResetForReuse
0.18s 0.3% 64.31% 0.31s 0.52% [dragonfly] (inline)
0.18s 0.3% 64.61% 0.70s 1.18% operator delete
0.18s 0.3% 64.92% 0.92s 1.55% std::__detail::__variant::_Variant_storage::_M_reset (inline)
0.17s 0.29% 65.20% 11.51s 19.36% dfly::DbSlice::FindReadOnly
0.17s 0.29% 65.49% 4.46s 7.50% dfly::Transaction::InitByArgs
0.17s 0.29% 65.78% 0.51s 0.86% mi_free (partial-inline)
(pprof)
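To see at a glance how the flat time in the listing above splits between namespaces, the rows can be grouped by their leading `::` component. This is only a rough sketch and not part of the benchmark tooling: the helper names `parse_top` and `flat_by_prefix` are made up for illustration, the regex assumes every row reports times in seconds (as in this profile), and the sample holds just four rows copied from the listing.

```python
import re

# One row of `pprof top` text output: flat flat% sum% cum cum% name
# Assumption: times carry an "s" suffix, as in the profile above.
ROW = re.compile(
    r"^\s*([\d.]+)s\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)s\s+([\d.]+)%\s+(.+?)\s*$"
)

def parse_top(text):
    """Return one dict per row that matches the expected column layout."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            flat, _flat_pct, _sum_pct, cum, _cum_pct, name = m.groups()
            rows.append({"flat": float(flat), "cum": float(cum), "name": name})
    return rows

def flat_by_prefix(rows):
    """Sum flat seconds per top-level namespace; unqualified names go to '(other)'."""
    totals = {}
    for r in rows:
        name = r["name"]
        prefix = name.split("::", 1)[0] if "::" in name else "(other)"
        totals[prefix] = totals.get(prefix, 0.0) + r["flat"]
    return totals

# Four rows copied from the listing above.
sample = """\
5.55s 9.33% 9.33% 5.55s 9.33% __memcpy_sve
5.16s 8.68% 18.01% 6.59s 11.08% dfly::detail::Segment::Bucket::FindByFp
2.24s 3.77% 21.78% 9.78s 16.45% dfly::detail::Segment::FindIt (inline)
0.37s 0.62% 46.28% 1.38s 2.32% facade::RespSrvParser::ParseInline
"""

rows = parse_top(sample)
totals = flat_by_prefix(rows)
for prefix, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{prefix:10s} {seconds:.2f}s")
```

Feeding the full listing into `parse_top` instead of `sample` gives the same breakdown over all 70 rows. Note that only flat time is summed: cum values overlap between callers and callees, so adding them up would double count.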