A low-latency limit order book and matching engine written in C++17, designed from the ground up with cache-friendly, zero-allocation data structures. No STL containers on the hot path.
git clone https://github.com/Sahil0912/NanoMatch.git
cd NanoMatch
mkdir -p build && cd build
cmake ..
make -j$(nproc)To run the basic engine:
./OrderBookEngineTo run the performance suite (Google Benchmark):
./bench_engineRequires: CMake ≥ 3.13, a C++17 compiler (GCC/Clang).
- Custom Memory Pool - Pre-allocates a contiguous block of
Orderslots at startup. Allocation and deallocation are O(1) via an intrusive free-list. Zeronew/malloccalls on the hot path. - Intrusive Doubly-Linked Lists -
prev/nextpointers live inside theOrderstruct itself. No external container nodes, nostd::deque, no heap-allocated iterators.
- Replaced
std::map<Price, PriceLevel>(red-black tree with pointer-chasing cache misses) with a fixed-size contiguous array indexed directly by price tick. Price level lookup is strictly O(1). - Algorithmic Bitmap Acceleration - Eliminated catastrophic O(N) scanning latency by wrapping the array in an L1-resident 12KB bitmap. Leverages CPU bit-manipulation intrinsics (
tzcnt/lzcntvia__builtin_ctzll) to jump directly to the next active price level in a single CPU cycle (O(N/64)).
- Flat sparse lookup table mapping
OrderID → Order*for O(1) find. No hash maps, no heap allocation. - Full cancel lifecycle: lookup → DLL unlink → best-price bitmap recalc → pool deallocation. All O(1).
MatchingEngineowns the order book, memory pool, and lookup table — single point of ownership for the entire order lifecycle.- Filled passive orders are deallocated back to the pool immediately (no memory leaks).
- Supports Limit and Market order types with strict price-time priority.
- Wait-free SPSC ring buffer (
SPSCRing<T, CAPACITY>) usingstd::atomicwith explicitacquire/releasememory ordering — zero mutexes, zero syscalls, zero allocations on the hot path. alignas(64)cache-line isolation onhead_/tail_indices to eliminate false sharing (MESI ping-pong) between producer and consumer cores.- Compile-time safety:
static_assertenforces power-of-2 capacity (bitmask indexing) andstd::is_trivially_copyable(safe lock-free copying). - Trade events are pushed into the ring buffer immediately after each fill (~5 ns overhead per fill), then a dedicated background thread spin-polls and writes to disk asynchronously.
- Non-blocking design:
push()returnsfalseon a full buffer rather than blocking, preserving deterministic matching latency.
Compiled with -O3 -march=native and rigorously tested using asm volatile assembly memory clobbers to prevent compiler Dead Code Elimination (DCE). All numbers include lock-free SPSC ring buffer push on every fill.
- Throughput: 23.81 million orders/sec
- Matching Latency (p50): 200 cycles (117.6 ns)
- Matching Latency (p90): 214 cycles (125.9 ns)
- Matching Latency (p99): 252 cycles (148.2 ns)
- Tail Latency (p99.9): 296 cycles (174.1 ns)
- Passive Order Add: 15.2 ns
- Order Cancellation: 16.0 ns
- 5-Level Sweep: 589 ns
- Sweep Optimization: Empty-level scans are ~250x faster due to bitmap hardware intrinsics.
- Built custom cycle-accurate
rdtscandrdtscpfences to measure absolute cycle latency directly from the CPU, bypassing slow OS clock boundaries (std::chrono).
- Order Modification - Cancel-replace with correct price-time priority semantics (keep priority on qty decrease, lose priority on price change).
- Zero-Copy Data Ingestion -
mmap-based market data parser bypassingstd::ifstreamentirely. Read directly from the kernel page cache.