
refactor: reuse P2P client across node modes#3065

Merged
alpe merged 3 commits into main from alex/p2p_exec_ai1 on Feb 16, 2026

Conversation


@alpe alpe commented Feb 12, 2026

Refactor P2P client reuse and pending block storage

This PR improves P2P client lifecycle management and pending block storage:

Key Changes:

  • P2P Client Reuse: Refactored the P2P client to be created once and reused across node restarts and mode switches instead of instantiating new clients each time, improving resource efficiency (see the sketch after this list)
  • Failover Improvements: Streamlined failover logic in the node package for better reliability
  • Test Updates: Updated integration and unit tests to reflect the new P2P client wiring
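
A minimal sketch of the new wiring, matching pkg/cmd/run_node.go and node/node.go from this PR (argument lists abbreviated; the real constructors take more parameters):

p2pClient, err := p2p.NewClient(/* P2P config, signing key, datastore, logger, ... */)
if err != nil {
    return fmt.Errorf("create p2p client: %w", err)
}
// Created once here; mode switches (aggregator ↔ sync) and restarts reuse
// this instance instead of tearing down and rebuilding the libp2p stack.
node, err := node.NewNode(/* config, genesis, ..., */ p2pClient /* , ... */)
if err != nil {
    return fmt.Errorf("create node: %w", err)
}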


github-actions bot commented Feb 12, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed | Format: ⏩ skipped | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Feb 16, 2026, 12:07 PM


claude bot commented Feb 12, 2026

Claude finished @alpe's task in 2m 3s.


Code Review Complete

I've performed a comprehensive code review of this P2P client reuse and pending block storage refactoring PR. Below are my findings organized by category.


1. Architecture & Design Quality ⚠️

Strengths:

  • Centralized P2P lifecycle management: Moving P2P client creation to the top-level node constructors is a solid architectural improvement that eliminates redundant initialization across mode switches
  • Clean dependency injection: The refactoring properly passes *p2p.Client down through constructors, maintaining clear ownership
  • Concurrent sync startup: The parallel initialization of the header and data sync services in failover.go:192-208 reduces cumulative startup delays

Issues:

CRITICAL - Missing P2P Client Reconfiguration (node/failover.go:104-163)
The PR description and Gemini's review mention that Reconfigure() is called to update P2P configuration during mode switches, but this call does not exist in setupFailoverState. The P2P client is passed through but never reconfigured when transitioning between aggregator and sync modes. This could lead to:

  • Stale P2P configuration after mode switches
  • Incorrect PubSub topics or DHT parameters for the new mode
  • Resource leaks if certain P2P resources aren't updated

Recommendation: Either:

  1. Add the missing p2pClient.Reconfigure(nodeConfig.P2P) call in setupFailoverState (sketched below), OR
  2. Update documentation/comments to clarify that reconfiguration is not needed (with justification)

Reference: node/failover.go:104-163
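
If option 1 is taken, the change is a single call using the Reconfigure(conf config.P2PConfig) method added in this PR; placement and the nodeConfig name are assumptions:

// Hypothetical addition inside setupFailoverState:
if err := p2pClient.Reconfigure(nodeConfig.P2P); err != nil {
    return fmt.Errorf("reconfigure p2p client for mode switch: %w", err)
}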


MEDIUM - P2P Client Start/Stop Lifecycle Issue (node/full.go:284-292)
The P2P client is started once in FullNode.Run() and closed in the defer, but the failoverState previously managed its own P2P lifecycle. This change means:

  • If leaderElection.Run() fails early, the P2P client will still be running until the defer cleanup
  • There's a potential race between mode switches and P2P operations if the client is shared across concurrent mode transitions
  • The error handling at lines 289-291 logs errors but doesn't propagate them

Recommendation:

  • Consider whether P2P client errors during close should be propagated or just logged
  • Document the thread-safety guarantees of the shared P2P client across mode switches

2. Concurrency & Resource Management ⚠️

Issues:

MEDIUM - Concurrent Sync Service Startup Error Handling (node/failover.go:192-208)
The concurrent startup of sync services uses errgroup, which is good, but:

syncWg, syncCtx := errgroup.WithContext(ctx)
syncWg.Go(func() error {
    if err := f.headerSyncService.Start(syncCtx); err != nil {
        return fmt.Errorf("header sync service: %w", err)
    }
    return nil
})

Problems:

  • If one service fails during startup, the other continues running but may not be properly stopped
  • The deferred stopService calls at lines 209-210 might try to stop a service that never fully started
  • The order of cleanup (defer statements) means data sync stops before header sync, but they started concurrently

Recommendation:

  • Add explicit cleanup for partially-started services when syncWg.Wait() returns an error
  • Document the expected behavior when one sync service fails during concurrent startup

LOW - P2P Client Thread Safety (pkg/p2p/client.go:117-174)
The started field (line 57) is a simple bool, not atomic. While Start() checks if c.started { return nil }, concurrent calls to Start() could race:

func (c *Client) Start(ctx context.Context) error {
    if c.started {  // ← Race condition
        return nil
    }
    // ... initialization
    c.started = true
}

Recommendation: Use atomic.Bool or add mutex protection to prevent potential double-initialization if Start() is called concurrently.


3. Error Handling & Resilience ✅

Strengths:

  • Proper use of errors.Join for aggregating multiple shutdown errors (failover.go:172, full.go:314)
  • Timeout contexts for shutdown operations (3s in failover, 9s in full node)
  • Retry logic in sync service P2P initialization with exponential backoff (pkg/sync/sync_service.go:383-413)
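
For reference, the backoff in the last point follows the standard pattern; a generic sketch with illustrative values (not the actual ones in sync_service.go), where initP2P stands in for the real initialization step:

backoff := 100 * time.Millisecond
const maxBackoff = 5 * time.Second
for {
    if err := initP2P(ctx); err == nil {
        return nil
    }
    // Wait with exponential growth, bailing out if the context is canceled.
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-time.After(backoff):
    }
    if backoff *= 2; backoff > maxBackoff {
        backoff = maxBackoff
    }
}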

Minor Issues:

LOW - Inconsistent Error Context (pkg/cmd/run_node.go:167-170)
When P2P client creation fails, the error message is generic:

p2pClient, err := p2p.NewClient(...)
if err != nil {
    return fmt.Errorf("create p2p client: %w", err)
}

This doesn't indicate where in the startup process this occurred or what the implications are.

Recommendation: Add more context, e.g., "failed to create P2P client for node initialization: %w"


4. Testing & Reliability ⚠️

Coverage Concerns:

CRITICAL - Test Coverage is 40.57% (Codecov Report)
The PR has 82 lines missing coverage, with major gaps in:

  • block/internal/executing/pending.go: 40 missing, 12 partial
  • block/internal/executing/executor.go: 2 missing, 4 partial
  • node/failover.go: 3 missing, 3 partial
  • pkg/cmd/run_node.go: 4 missing (0% coverage)

Recommendation: Add tests for:

  1. P2P client reuse across multiple node restarts
  2. Failover between aggregator and sync modes with shared P2P client
  3. Concurrent sync service startup failure scenarios
  4. The new concurrent initialization path in failover.go
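
As a starting point, a hypothetical test exercising the shared client under concurrent Start calls (this also backs the thread-safety concern in section 2); newTestP2PClient is the helper this PR adds in node/helpers_test.go, the rest is assumed scaffolding:

func TestP2PClientConcurrentStart(t *testing.T) {
    client := newTestP2PClient(t)
    g, gCtx := errgroup.WithContext(t.Context())
    for i := 0; i < 8; i++ {
        // All goroutines race on Start; with a proper guard exactly one
        // should initialize and the rest should no-op.
        g.Go(func() error { return client.Start(gCtx) })
    }
    require.NoError(t, g.Wait())
    require.NoError(t, client.Close())
}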

Test Quality Issues:

MEDIUM - Test Helper Inconsistency (node/helpers_test.go:71-83)
The createTestComponents function now returns crypto.PrivKey directly, but the function signature comment wasn't updated. Also, the returned private key is always the genesis validator key, which might not be appropriate for multi-node tests with different identities.

Recommendation: Document that the same key is reused and consider whether test isolation requires unique keys.


LOW - Missing Test for P2P Interface (pkg/sync/sync_service.go:40-47)
The new P2PClient interface is a good abstraction, but there's no test verifying that *p2p.Client actually implements it. If the P2P client API changes, this could break at runtime.

Recommendation: Add a compile-time interface check:

var _ P2PClient = (*p2p.Client)(nil)

5. Security Considerations ✅

Strengths:

  • Private keys are properly passed through as crypto.PrivKey without exposure
  • No hard-coded credentials or secrets
  • Proper separation of concerns between P2P identity and application logic

Minor Notes:

  • The P2P client stores the private key (pkg/p2p/client.go:82) but only exposes public methods
  • Connection gater properly manages peer blocking/allowing (client.go:74-76)

6. Performance Considerations ⚡

Improvements:

  • Concurrent sync service startup reduces initialization time - Good optimization for nodes syncing from peers
  • P2P client reuse eliminates redundant libp2p initialization - Significant improvement for HA scenarios with frequent mode switches

Potential Issues:

MEDIUM - Store Adapter Pending Cache Growth (pkg/store/store_adapter.go:24, 614-677)
The pending cache has a hard limit of 10,000 items with backpressure blocking:

const maxPendingCacheSize = 10_000
// ...
func (a *StoreAdapter[H]) waitForSpace(ctx context.Context) error {
    for {
        if a.pending.len() < maxPendingCacheSize {
            return nil
        }
        // ... prune and retry
    }
}

Problems:

  • If the executor falls significantly behind P2P gossip, this will block incoming P2P messages
  • The pruneRetryInterval of 50ms could cause head-of-line blocking in the P2P layer
  • No metrics/logging when backpressure is applied

Recommendation:

  • Add logging/metrics when cache reaches capacity
  • Consider dynamic backpressure based on executor lag
  • Document the expected behavior under high load
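
A sketch of the first two recommendations layered onto the loop above (the logger field, prunePending behavior, and pruneRetryInterval usage beyond the snippet are assumptions):

// Hypothetical waitForSpace variant that surfaces backpressure instead
// of spinning silently.
func (a *StoreAdapter[H]) waitForSpace(ctx context.Context) error {
    warned := false
    for {
        if a.pending.len() < maxPendingCacheSize {
            return nil
        }
        if !warned {
            // Log once per backpressure episode; a counter metric could go here too.
            a.logger.Warn().Int("pending", a.pending.len()).
                Msg("pending cache full; backpressuring P2P deliveries")
            warned = true
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(pruneRetryInterval):
            // prune and retry, as in the current implementation
        }
    }
}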

7. Documentation & Code Quality ⚠️

Issues:

MEDIUM - Misleading Comment (node/failover.go:104-163)
As mentioned in the Gemini review, there's no Reconfigure() call in setupFailoverState despite documentation suggesting it exists. This is confusing for maintainers.


LOW - Inconsistent Import Style (multiple files)
Some files use:

import (
    "github.com/evstack/ev-node/pkg/p2p"
)

While others use:

import (
    p2pkey "github.com/evstack/ev-node/pkg/p2p/key"
)

The removal of the p2p/key package references is correct, but ensure consistency across the codebase.


LOW - Test Comment Clarity (block/components_test.go:100-106)
The comments say "nil arguments" but don't explain why nil is acceptable:

// Create noop DAHintAppenders for testing
headerHintAppender := noopDAHintAppender{}  // ← Not nil
dataHintAppender := noopDAHintAppender{}    // ← Not nil

Recommendation: Update comments to match actual code.


Summary & Recommendations

Must Fix Before Merge:

  1. Investigate missing p2pClient.Reconfigure() call - Either add it or document why it's not needed
  2. ⚠️ Improve test coverage - 40.57% is too low for core infrastructure changes
  3. ⚠️ Add P2P client thread-safety protection - Use atomic.Bool for started field

Should Fix:

  1. Add explicit cleanup for partial sync service startup failures
  2. Add logging/metrics for pending cache backpressure
  3. Fix misleading documentation about Reconfiguration
  4. Add tests for P2P client reuse across restarts

Nice to Have:

  1. Improve error messages with more context
  2. Add compile-time interface checks for P2PClient
  3. Document thread-safety guarantees for shared P2P client

Overall Assessment

This is a solid architectural improvement that addresses real performance issues with P2P client initialization. The core design is sound, but there are several gaps in error handling, testing, and documentation that should be addressed before merging. The missing reconfiguration logic is particularly concerning and needs clarification.

Recommendation: Request changes focused on the critical issues, particularly the P2P reconfiguration question and test coverage improvements.


@gemini-code-assist

Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the P2P client's lifecycle management within the node architecture. By centralizing the P2P client's creation and making it persistent across node mode switches, it aims to improve startup efficiency and reduce resource overhead. Additionally, it introduces concurrent initialization for synchronization services and strengthens data consistency checks in the store getters, ensuring that only fully committed data is exposed via the P2P layer.

Highlights

  • Centralized P2P Client Management: The P2P client is now instantiated once at the top-level FullNode or LightNode creation and passed down to sub-components. This ensures the P2P client's lifecycle is managed independently of node mode switches (e.g., between aggregator and sync modes), preventing costly teardown and re-initialization of the libp2p stack.
  • Decoupled P2P Client Lifecycle: The p2p.Client now includes a Reconfigure method, allowing its internal configuration to be updated without restarting the underlying libp2p host, PubSub, or DHT. This is crucial for seamless transitions between node operational modes.
  • Concurrent Sync Service Startup: The header and data synchronization services in failover.go now start concurrently using errgroup.WithContext. This optimization reduces cumulative startup delays, especially in scenarios where peers might initially lack blocks.
  • Enhanced Header and Data Store Guards: New height guards have been implemented in HeaderStoreGetter and DataStoreGetter. These guards ensure that only headers and data at or below the committed store height are returned, preventing the P2P layer from serving unsigned or uncommitted data (e.g., from an executor's early save during crash recovery); see the sketch after this list.
  • Refactored Node Constructors: Node constructors (newFullNode, newLightNode, newSyncMode, newAggregatorMode) and related test helpers have been updated to accept an initialized p2p.Client or crypto.PrivKey directly, simplifying their signatures and reinforcing the centralized P2P client management.
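
An illustrative shape of the height guard highlighted above (the store accessor, return types, and error value are assumptions; the real guards live in pkg/store/store_adapter.go):

// Hypothetical HeaderStoreGetter.GetByHeight: only data at or below the
// committed store height is served to the P2P layer.
func (g *HeaderStoreGetter) GetByHeight(ctx context.Context, height uint64) (*types.SignedHeader, error) {
    committed, err := g.store.Height(ctx) // committed chain height (assumed accessor)
    if err != nil {
        return nil, err
    }
    if height > committed {
        // Treat anything above the committed height as not found so peers
        // never see unsigned data, e.g. from an executor's early save
        // during crash recovery.
        return nil, header.ErrNotFound
    }
    return g.get(ctx, height)
}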


Changelog
  • apps/testapp/go.mod
    • Uncommented the local replace directive for github.com/evstack/ev-node.
  • apps/testapp/go.sum
    • Removed specific github.com/evstack/ev-node entries, reflecting dependency changes.
  • block/components_test.go
    • Added clarifying comments for nil arguments in test component creation functions.
  • node/failover.go
    • Removed unused imports for p2p/key and go-datastore.
    • Updated newSyncMode and newAggregatorMode to accept a p2p.Client instead of nodeKey, rootDB, and daStore.
    • Modified setupFailoverState to receive p2p.Client and use rktStore for sync services.
    • Removed p2pClient.Start() and p2pClient.Close() from failoverState.Run as the client is now managed at a higher level.
    • Implemented concurrent startup for header and data sync services in failoverState.Run.
  • node/full.go
    • Updated import from p2p/key to p2p.
    • Added p2pClient field to the FullNode struct.
    • Modified newFullNode to accept p2p.Client instead of nodeKey.
    • Updated leaderFactory and followerFactory to pass the p2pClient to mode constructors.
    • Added logic to FullNode.Run to start and close the p2pClient once for the node's entire lifecycle.
  • node/helpers_test.go
    • Updated imports to use github.com/evstack/ev-node/pkg/p2p and github.com/libp2p/go-libp2p/core/crypto.
    • Modified createTestComponents to return crypto.PrivKey directly and removed key.NodeKey struct creation.
    • Updated newTestNode and related test helper functions to handle crypto.PrivKey and p2p.Client parameters.
    • Added newTestP2PClient helper function for creating P2P clients in tests.
  • node/light.go
    • Removed unused import for p2p/key.
    • Modified newLightNode to accept p2p.Client instead of nodeKey.
    • Removed p2p.NewClient call from newLightNode.
    • Updated error handling for ln.P2P.Close() in LightNode.Run.
  • node/light_test.go
    • Added import for github.com/evstack/ev-node/pkg/p2p.
    • Modified TestLightNodeLifecycle to explicitly create and pass a p2p.Client.
  • node/node.go
    • Updated import from p2p/key to p2p.
    • Modified NewNode to accept p2p.Client instead of nodeKey.
    • Updated calls to newLightNode and newFullNode to pass the p2pClient.
  • node/single_sequencer_integration_test.go
    • Updated calls to createTestComponents and createNodeWithCustomComponents to align with privKey parameter changes.
  • pkg/cmd/run_node.go
    • Added import for github.com/evstack/ev-node/pkg/p2p.
    • Instantiated p2p.NewClient before node.NewNode and passed the client to the node creation function.
  • pkg/cmd/run_node_test.go
    • Added import for github.com/stretchr/testify/require.
    • Modified createTestComponents to generate and return a dummy nodeKey.
  • pkg/p2p/client.go
    • Added rawHost and started fields to Client struct for internal state management.
    • Modified Start method to prevent re-starting and track the rawHost.
    • Updated Close method to use errors.Join for aggregating errors.
    • Added PrivKey() method to expose the node's private key.
    • Added Reconfigure(conf config.P2PConfig) method for in-place configuration updates.
  • pkg/p2p/utils_test.go
    • Removed errors import and modified testNet.Close() to ignore errors for test simplicity.
  • pkg/store/header_store_adapter_test.go
    • Updated context.Background() to t.Context() in TestHeaderStoreAdapter_GetFromPendingByHash.
    • Added TestHeaderStoreGetter_HeightGuard to verify committed height checks.
  • pkg/store/store_adapter.go
    • Implemented height guards in HeaderStoreGetter.GetByHeight and HasAt to ensure data consistency.
    • Implemented height guards in DataStoreGetter.GetByHeight and HasAt for similar data consistency checks.
  • pkg/sync/sync_service.go
    • Removed direct import of github.com/evstack/ev-node/pkg/p2p.
    • Introduced P2PClient interface to abstract P2P client dependencies.
    • Updated SyncService and its constructors (NewDataSyncService, NewHeaderSyncService, newSyncService) to use the P2PClient interface.
  • test/e2e/failover_e2e_test.go
    • Added a Kill() call for old processes in TestHASequencerRollingRestartE2E to ensure clean state before restarting nodes.
Activity
  • The pull request title indicates '[DO NOT MERGRE]', suggesting this is a work-in-progress or experimental branch not yet ready for merging.


codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 51.28205% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.03%. Comparing base (d210055) to head (f854686).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
node/failover.go 60.00% 3 Missing and 3 partials ⚠️
node/full.go 50.00% 2 Missing and 2 partials ⚠️
pkg/cmd/run_node.go 0.00% 4 Missing ⚠️
node/light.go 0.00% 1 Missing and 1 partial ⚠️
pkg/p2p/client.go 75.00% 1 Missing and 1 partial ⚠️
node/node.go 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3065      +/-   ##
==========================================
- Coverage   61.07%   61.03%   -0.05%     
==========================================
  Files         113      113              
  Lines       11431    11440       +9     
==========================================
  Hits         6982     6982              
- Misses       3669     3676       +7     
- Partials      780      782       +2     
Flag Coverage Δ
combined 61.03% <51.28%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the P2P client lifecycle and integration within the node architecture. The primary change involves centralizing the creation and management of the P2P client in the FullNode and LightNode constructors, passing an already initialized p2p.Client instance rather than a nodeKey for internal P2P client creation. This ensures the P2P client is long-lived and persists across mode switches (e.g., between aggregator and sync modes), avoiding costly re-initialization of the libp2p stack. Consequently, imports related to p2p/key and ipfs/go-datastore were removed from several files, and failover.go now runs header and data sync services concurrently to improve startup performance. Test helper functions were updated to reflect these changes, now returning crypto.PrivKey and creating p2p.Client instances. Additionally, pkg/store/store_adapter.go was updated to include a height guard in HeaderStoreGetter and DataStoreGetter to prevent serving uncommitted or future block data, and a new test TestHeaderStoreGetter_HeightGuard was added to verify this behavior. A review comment highlighted an inaccuracy in node/failover.go regarding the Reconfigure() call, which should be addressed for clarity.

node/failover.go Outdated
}
defer f.p2pClient.Close() // nolint: errcheck
// P2P client persists across mode switches (started/closed by FullNode.Run).
// Reconfigure() was already called in setupFailoverState to re-bootstrap DHT.

medium

This comment states that Reconfigure() is called in setupFailoverState, but this call does not appear to be present in the setupFailoverState function. This is misleading and could cause confusion about how the P2P client is reconfigured during mode switches. Please remove or update this comment to accurately reflect the code's behavior.

@alpe alpe force-pushed the alex/p2p_exec_ai1 branch from 09f4a76 to 24c7d76 on February 16, 2026 11:32
@alpe alpe force-pushed the alex/p2p_exec_ai1 branch from 24c7d76 to f854686 on February 16, 2026 12:07
@alpe alpe changed the title from "[DO NOT MERGRE] Re-use p2p-client" to "refactor: reuse P2P client across node modes" on Feb 16, 2026

alpe commented Feb 16, 2026

@claude review again


claude bot commented Feb 16, 2026

Claude finished @alpe's task in 2m 21s.


Code Review Complete

I've performed a comprehensive code review of this P2P client refactoring PR. Here are my findings:


Overview

This PR successfully refactors the P2P client lifecycle to be centralized and reused across node mode switches (aggregator ↔ sync). The core architectural improvement is solid, moving P2P client creation from individual mode constructors to the top-level node initialization. However, there are several important issues that should be addressed.


Critical Issues

1. Thread-Safety Race Condition in started Flag

Location: pkg/p2p/client.go:57, pkg/p2p/client.go:125-127

Issue: The started boolean field is not thread-safe and can race if Start() is called concurrently:

func (c *Client) Start(ctx context.Context) error {
    if c.started {  // ← Race: read without synchronization
        return nil
    }
    // ... initialization code ...
    c.started = true  // ← Race: write without synchronization
}

Impact: If Start() is called concurrently (which could happen during rapid mode switches or in tests), the P2P client could be initialized twice, leading to resource leaks (duplicate DHT, multiple PubSub instances, etc.).

Recommendation: Use atomic.Bool instead:

// In Client struct:
started atomic.Bool

// In Start():
if !c.started.CompareAndSwap(false, true) {
    return nil  // Already started
}

2. Missing Test Coverage (51.28% patch coverage)

Location: Multiple files, see Codecov report

Issue: The PR has significant test coverage gaps:

  • pkg/cmd/run_node.go: 0% coverage (4 missing lines)
  • node/failover.go: 60% coverage (3 missing, 3 partial)
  • node/full.go: 50% coverage (2 missing, 2 partial)
  • node/light.go: 0% coverage for new code (1 missing, 1 partial)

Missing test scenarios:

  1. P2P client reuse across node restarts
  2. Concurrent Start() calls on the same P2P client
  3. Mode switch from aggregator to sync and back
  4. P2P client lifecycle during failover scenarios
  5. Concurrent sync service startup failure handling (lines 192-208 in failover.go)

Recommendation: Add integration tests covering:

  • Multiple node restarts reusing the same P2P client
  • Concurrent mode switches with shared P2P client
  • Failure scenarios during concurrent sync service startup

Major Issues

3. Unclear P2P Client Reconfiguration Strategy

Location: node/failover.go:104-163

Issue: Previous reviewers (Gemini) mentioned that Reconfigure() should be called during mode switches to update P2P configuration, but no such call exists in setupFailoverState. Looking at the code:

  1. The P2P client is created once in pkg/cmd/run_node.go:167-170
  2. It's passed through to both aggregator and sync modes
  3. But no reconfiguration happens when switching modes

Questions:

  • Does the P2P configuration need to change between aggregator and sync modes?
  • Are there different PubSub topics, DHT parameters, or peer lists per mode?
  • If not, why was this mentioned in previous reviews?

Current behavior: The P2P client retains the configuration it was initialized with, even after mode switches.

Recommendation: Either:

  1. Add p2pClient.Reconfigure(nodeConfig.P2P) in setupFailoverState if reconfiguration is needed, OR
  2. Document clearly that P2P configuration is static across mode switches and explain why this is acceptable

4. Concurrent Sync Service Startup - Partial Failure Handling

Location: node/failover.go:192-210

Issue: The concurrent startup of header and data sync services has incomplete error handling:

syncWg, syncCtx := errgroup.WithContext(ctx)
syncWg.Go(func() error {
    if err := f.headerSyncService.Start(syncCtx); err != nil {
        return fmt.Errorf("header sync service: %w", err)
    }
    return nil
})
syncWg.Go(func() error {
    if err := f.dataSyncService.Start(syncCtx); err != nil {
        return fmt.Errorf("data sync service: %w", err)
    }
    return nil
})
if err := syncWg.Wait(); err != nil {
    return err  // ← Returns immediately on error
}
defer stopService(f.headerSyncService.Stop, "header sync")
defer stopService(f.dataSyncService.Stop, "data sync")

Problem: If one service fails during startup:

  • The errgroup context is canceled, which may stop the other service
  • But we immediately return err before reaching the deferred stopService calls
  • This means if headerSyncService.Start() succeeded but dataSyncService.Start() failed, the header sync service keeps running and is never stopped

Recommendation: Add explicit cleanup when syncWg.Wait() returns an error:

if err := syncWg.Wait(); err != nil {
    // Clean up any partially started services
    cleanupCtx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()
    _ = f.headerSyncService.Stop(cleanupCtx)
    _ = f.dataSyncService.Stop(cleanupCtx)
    return err
}

Medium Priority Issues

5. P2P Client Error Handling in FullNode.Run()

Location: node/full.go:288-292

Issue: When P2P client fails to close during shutdown, the error is only logged but not propagated:

defer func() {
    if err := n.p2pClient.Close(); err != nil {
        n.Logger.Error().Err(err).Msg("error closing p2p client")
        // ← Error is not added to shutdownMultiErr
    }
}()

Impact: P2P shutdown errors are silently swallowed, making it hard to diagnose resource leaks.

Recommendation: Consider whether P2P close errors should be added to shutdownMultiErr for consistency with other shutdown operations.
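
One way to fold the close error into the aggregate, assuming shutdownMultiErr is the variable used for the other shutdown paths:

// Hypothetical: propagate the close error instead of only logging it.
defer func() {
    if err := n.p2pClient.Close(); err != nil {
        n.Logger.Error().Err(err).Msg("error closing p2p client")
        shutdownMultiErr = errors.Join(shutdownMultiErr, fmt.Errorf("close p2p client: %w", err))
    }
}()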


6. P2PClient Interface Missing Compile-Time Check

Location: pkg/sync/sync_service.go:40-47

Issue: The new P2PClient interface is a good abstraction, but there's no compile-time verification that *p2p.Client implements it. If the P2P client API changes, this could break at runtime.

Recommendation: Add a compile-time interface assertion:

// In pkg/p2p/client.go or pkg/sync/sync_service.go
var _ sync.P2PClient = (*p2p.Client)(nil)

7. Misleading Comment About daStore Usage

Location: node/failover.go:113-121 (in the diff context)

Observation: The previous code used a separate daStore parameter, but now both sync services use rktStore. This is correct based on the refactoring, but ensure this doesn't cause issues if the store implementations differ.

Action: Verify that using rktStore for both header and data sync services is intentional and correct.


Minor Issues

8. Inconsistent Error Context Messages

Location: pkg/cmd/run_node.go:167-170

Issue: The error message when P2P client creation fails is generic:

p2pClient, err := p2p.NewClient(...)
if err != nil {
    return fmt.Errorf("create p2p client: %w", err)
}

Recommendation: Add more context about the stage and implications:

return fmt.Errorf("failed to create P2P client during node initialization: %w", err)

9. Documentation Gap

Location: pkg/p2p/client.go:117-174

Issue: The Start() method now has idempotency (returns nil if already started), but this behavior isn't documented in the method's godoc comment.

Recommendation: Update the godoc:

// Start establishes Client's P2P connectivity. This method is idempotent and
// safe to call multiple times - subsequent calls return nil if already started.
//
// Following steps are taken:
// ...

Positive Aspects

  1. Solid architectural improvement: Centralizing P2P client lifecycle is the right approach
  2. Resource efficiency: Reusing the P2P client across mode switches avoids costly libp2p re-initialization
  3. Concurrent sync startup: Good optimization to reduce cumulative startup delays (lines 192-208)
  4. Clean dependency injection: P2P client is properly passed through constructors
  5. Error aggregation: Good use of errors.Join for combining shutdown errors

Test Coverage Breakdown

According to Codecov, the patch coverage is 51.28% with 19 lines missing coverage:

File Coverage Missing Lines
node/failover.go 60.00% 3 missing, 3 partial
node/full.go 50.00% 2 missing, 2 partial
pkg/cmd/run_node.go 0.00% 4 missing
node/light.go 0.00% 1 missing, 1 partial
pkg/p2p/client.go 75.00% 1 missing, 1 partial
node/node.go 50.00% 1 missing

Recommendations Summary

Must Fix (Before Merge):

  1. Fix thread-safety race in started flag - Use atomic.Bool
  2. ⚠️ Add explicit cleanup for partial sync service startup failures
  3. ⚠️ Clarify P2P reconfiguration strategy - Document or implement

Should Fix:

  1. 📊 Improve test coverage - Add integration tests for P2P client reuse and mode switches
  2. 🔍 Add compile-time interface check for P2PClient
  3. 📝 Document idempotent Start() behavior

Nice to Have:

  1. Consider propagating P2P close errors in shutdown
  2. Improve error messages with more context
  3. Verify rktStore usage is correct for both sync services

Overall Assessment

This is a solid architectural refactoring that addresses real performance and resource management issues. The core design is sound, but there are critical thread-safety and error handling gaps that must be addressed before merging.

Recommendation: Request changes to address the thread-safety issue (#1), partial failure cleanup (#4), and P2P reconfiguration clarity (#3). Test coverage improvements (#2) are also highly recommended.

The concurrent sync service startup optimization is excellent, but needs better error handling. Once the critical issues are addressed, this will be a valuable improvement to the codebase.


@alpe alpe requested a review from julienrbrt February 16, 2026 12:40

@julienrbrt julienrbrt left a comment

utACK! I saw CI green on ev-abci too, nice!

@alpe alpe marked this pull request as ready for review February 16, 2026 13:10
@alpe alpe added this pull request to the merge queue Feb 16, 2026
Merged via the queue into main with commit 1da7634 Feb 16, 2026
31 of 33 checks passed
@alpe alpe deleted the alex/p2p_exec_ai1 branch February 16, 2026 13:26