[Performance][Critical] Session cache is ineffective — cache key uses current block height instead of session start height, causing full node overload #509

@jorgecuesta

Problem Statement

Every relay request triggers gRPC GetSession calls to the full node despite sessions lasting ~50 blocks. The in-memory cache (SturdyC) is effectively bypassed because the cache key changes on every new block, turning the cache into a write-only buffer.

Production Impact (PNF Mainnet)

| Metric | Value |
| --- | --- |
| seed-one restarts (2 days) | 29 |
| seed-two restarts (2 days) | 28 |
| seed-three restarts (2 days) | 36 |
| Supplier failure rate | 100% for affected services |
| Failure point | Pre-routing (session fetch) |

All suppliers show 100% relay failure because every BuildHTTPRequestContextForEndpoint call fails when GetSession returns errors from the overwhelmed full node.

Error Signatures

Full node restarted, not yet in check/finalize state:

rpc error: code = Unknown desc = codespace sdk code 26: invalid height:
context did not contain latest block height in either check state or finalize block state (653615)

Connection reset by overwhelmed node:

rpc error: code = Unavailable desc = error reading from server:
read tcp 10.42.12.164:46910->10.43.16.84:9090: read: connection reset by peer

Node crashed mid-stream:

rpc error: code = Unavailable desc = error reading from server: EOF

Root Cause

1. Cache key includes current block height, not session start height

fullnode_cache.go:230-237:

height, err := cfn.GetCurrentBlockHeight(ctx)  // latest block height
// ...
sessionKey := getSessionCacheKey(serviceID, appAddr, height)  // key changes every block

The key function (fullnode_cache.go:352-355):

func getSessionCacheKey(serviceID protocol.ServiceID, appAddr string, height int64) string {
    return fmt.Sprintf("%s:%s:%s:%d", sessionCacheKeyPrefix, serviceID, appAddr, height)
}

But lazyFullNode.GetSession() queries with height=0 (latest session), so the returned session is identical across all ~50 blocks within a session window. Every new block → new key → cache miss → redundant gRPC call for the same session. The 30s TTL is irrelevant because the key changes before expiry.
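
The key churn described above can be reproduced with a small simulation. This is only a sketch: the `cacheKey` helper copies the shape of `getSessionCacheKey`, while the prefix, service ID, app address, and the `countMisses` helper are hypothetical names chosen for illustration. It assumes one lookup per block and ignores TTL expiry.

```go
package main

import "fmt"

// cacheKey mirrors the format of getSessionCacheKey from the issue.
// The "session" prefix is a placeholder, not the real sessionCacheKeyPrefix value.
func cacheKey(serviceID, appAddr string, height int64) string {
	return fmt.Sprintf("%s:%s:%s:%d", "session", serviceID, appAddr, height)
}

// countMisses performs one lookup per block in [from, to] and counts cache misses.
// keyHeight decides which height is baked into the key for a given block.
func countMisses(from, to int64, keyHeight func(int64) int64) int {
	cache := map[string]struct{}{}
	misses := 0
	for h := from; h <= to; h++ {
		k := cacheKey("eth", "pokt1app", keyHeight(h)) // hypothetical service and app
		if _, ok := cache[k]; !ok {
			misses++ // each miss stands for one GetSession gRPC call
			cache[k] = struct{}{}
		}
	}
	return misses
}

func main() {
	perBlock := func(h int64) int64 { return h }   // current behavior: key uses latest height
	stable := func(int64) int64 { return 653601 } // any value that is constant within the window

	fmt.Println("per-block key misses:", countMisses(653601, 653650, perBlock))
	fmt.Println("stable key misses:   ", countMisses(653601, 653650, stable))
}
```

Over a 50-block window the per-block key misses on every lookup, while a key that is constant within the window misses once.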

2. getActiveGatewaySessions() called per-endpoint with no protocol-level cache

protocol.go:677 calls getActiveGatewaySessions() fresh on every BuildHTTPRequestContextForEndpoint invocation. This is triggered from 6 code paths:

| Call Site | File | Line |
| --- | --- | --- |
| Initial endpoint selection | http_request_context.go | 534 |
| Retry (path 1) | http_request_context_handle_request.go | 405 |
| Fallback endpoint loop | http_request_context_handle_request.go | 963 |
| Retry (path 2) | http_request_context_handle_request.go | 1298 |
| Hedge request | hedge.go | 163 |
| Health check | health_check_executor.go | 963 |

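A single relay with hedging and one retry invokes BuildHTTPRequestContextForEndpoint 3 times. Each invocation fetches a session for every one of the N owned apps, so with 10 apps that is 3 × 10 = 30 GetSession gRPC calls for one user request.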
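
The amplification is simple multiplication, sketched below. The function name `getSessionCallsPerRelay` is hypothetical, and the estimate assumes every context build misses the cache, which is the situation described in root cause 1.

```go
package main

import "fmt"

// getSessionCallsPerRelay estimates GetSession gRPC calls for one user request
// when every lookup misses the cache.
//   contextBuilds: number of BuildHTTPRequestContextForEndpoint invocations
//   ownedApps:     number of apps whose sessions are fetched per invocation
func getSessionCallsPerRelay(contextBuilds, ownedApps int) int {
	return contextBuilds * ownedApps
}

func main() {
	// Initial selection + hedge request + one retry = 3 context builds.
	fmt.Println(getSessionCallsPerRelay(3, 10)) // 3 * 10
}
```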

3. No cross-replica session sharing

Each PATH replica has an independent in-memory SturdyC cache. With 8 mainnet replicas, all session fetches are multiplied 8×. Cold starts (deploys, restarts) cause a thundering herd.

Compounding this, Sauron routes all GetSession gRPC to whichever seed node has the highest block height, concentrating 100% of session query load on a single node:

{"msg":"gRPC routing decision made","selected_node":"seed-mainnet-three",
 "method":"/pocket.session.Query/GetSession"}

This explains why seed-three had the most restarts (36 vs 28-29).


Suggested Fix Direction

The core fix is straightforward: use session start height instead of current block height in the cache key.

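params, err := cfn.GetSharedParams(ctx)  // already cached with 2min TTL
// ... (handle err, as for GetCurrentBlockHeight above)
// Round down to the start of the current session window. This assumes windows begin at
// multiples of NumBlocksPerSession; if sessions are counted from height 1, shift the boundary to match.
sessionStartHeight := height - (height % int64(params.NumBlocksPerSession))
sessionKey := getSessionCacheKey(serviceID, appAddr, sessionStartHeight)  // stable for the whole session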

This makes the cache key stable for the entire session window (~50 blocks). SharedParams is already cached — no additional gRPC calls needed.
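
A standalone version of the rounding step, for checking the boundaries by hand. This is a sketch under assumptions: the helper name `sessionStartHeight` and the zero-length guard are not from the codebase, and the formula is the one proposed in this issue, which treats session windows as starting at multiples of the session length. If the protocol counts sessions from height 1, the boundary would be one block later and the key should follow the protocol's own session-height calculation instead.

```go
package main

import "fmt"

// sessionStartHeight rounds height down to a multiple of numBlocksPerSession,
// following the formula proposed in the issue.
func sessionStartHeight(height, numBlocksPerSession int64) int64 {
	if numBlocksPerSession <= 0 {
		// Defensive fallback (an assumption): avoid a modulo-by-zero panic
		// and degrade to per-block keys if params are missing or invalid.
		return height
	}
	return height - (height % numBlocksPerSession)
}

func main() {
	const blocksPerSession = 50
	for _, h := range []int64{653600, 653615, 653649, 653650} {
		fmt.Printf("height=%d start=%d\n", h, sessionStartHeight(h, blocksPerSession))
	}
}
```

Heights 653600 through 653649 all map to 653600, and 653650 opens the next window, so the cache key changes once per window rather than once per block.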

Further improvements (protocol-level caching, Redis cross-replica sharing, proactive prefetch at session boundaries) can build on this incrementally.

Expected Impact

  • Current: ~1 GetSession call per block per (serviceID, appAddr) per replica
  • After fix: ~1 GetSession call per session rotation per (serviceID, appAddr) per replica
  • ~98% reduction in GetSession gRPC calls to full nodes
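
The ~98% figure follows from going from 50 calls per window to 1, and it holds only if a cached entry survives the whole session window. The sketch below makes that dependency explicit. The function names are hypothetical, and the block time in the example is a made-up input, not a measured mainnet value. If the 30s TTL mentioned earlier still applies to the stable key, entries would be refetched on expiry and the reduction would be smaller.

```go
package main

import "fmt"

// callsBefore: with a per-block key, every block produces one fetch
// per (serviceID, appAddr) per replica.
func callsBefore(blocksPerSession int) int {
	return blocksPerSession
}

// callsAfter: with a stable key, fetches are driven by TTL expiry only.
// Returns ceil(window / ttl), with a minimum of one fetch per window.
func callsAfter(blocksPerSession, blockTimeSec, ttlSec int) int {
	window := blocksPerSession * blockTimeSec
	if ttlSec <= 0 || ttlSec >= window {
		return 1 // entry outlives the window (or TTL disabled): one fetch per session
	}
	return (window + ttlSec - 1) / ttlSec // integer ceiling
}

// reduction returns the fraction of calls eliminated.
func reduction(before, after int) float64 {
	return 1 - float64(after)/float64(before)
}

func main() {
	before := callsBefore(50)

	// TTL at least as long as the window: the case behind the ~98% estimate.
	fmt.Printf("%.0f%%\n", 100*reduction(before, callsAfter(50, 10, 3600)))

	// Hypothetical 10s blocks with a 30s TTL: entries expire mid-session.
	fmt.Printf("%.0f%%\n", 100*reduction(before, callsAfter(50, 10, 30)))
}
```

One reading of this is that the cache TTL for session entries may need to be revisited alongside the key change to get the full benefit.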
