Feature request: Boot loop detection #1370

@mattzzw

In solar-powered mesh networks like Meshcore, the transition between "low battery" and "operational" states can be treacherous. When a solar panel provides just enough current to start the processor but not enough to sustain the surge of a LoRa transmission or a full boot sequence, the device undergoes a brownout. This creates a cycle where the device boots, attempts to communicate, crashes due to voltage drop, and restarts immediately.

The Problem: Flooding the Mesh

By default, Meshcore repeaters announce their presence to the network. If a repeater is caught in a rapid boot loop, it may get as far as sending a flood advert (broadcast to all nodes) before it crashes.

  • Airtime Saturation: A looping node can send dozens of high-priority flood packets per minute.
  • Cascading Congestion: Since other repeaters are programmed to repeat these adverts, a single unstable node can trigger a "broadcast storm," effectively silencing legitimate traffic across the entire mesh.
  • Battery Drain on Neighbors: Healthy nodes waste power receiving and re-transmitting the garbage packets from the looping node.

Proposal: Boot Loop Mitigation for Meshcore

This proposal introduces a persistent Boot Counter and a Cooldown Timer to distinguish between a healthy reboot and a power-instability loop. This would fix #1091.

1. Mechanism Design

We use a small segment of non-volatile memory (NVM) or EEPROM to track the boot state.

  • The Boot Counter: Incremented immediately upon CPU initialization.
  • The Stability Threshold: Set to 10 minutes. If the device stays alive longer than this, the power is considered "stable."
  • The Limit: Set to 3-10 boots. (tbd)
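One possible layout for the persistent counter is sketched below. It is an assumption, not existing Meshcore code: `boot_record_t`, `BOOT_RECORD_MAGIC`, `boot_record_get` and `boot_record_increment` are hypothetical names. The marker field lets the firmware treat a never-written cell (erased flash typically reads as all ones) as a count of zero, and the counter saturates so a very long loop cannot wrap back to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOOT_RECORD_MAGIC 0xB007u  /* hypothetical marker; any fixed value works */

/* Hypothetical on-NVM record for the boot counter. */
typedef struct {
    uint16_t magic;       /* equals BOOT_RECORD_MAGIC once the record was written */
    uint16_t boot_count;  /* boots since the last stable period */
} boot_record_t;

/* Returns the stored count, or 0 if the record was never initialised. */
static uint16_t boot_record_get(const boot_record_t *rec) {
    assert(rec != NULL);
    return (rec->magic == BOOT_RECORD_MAGIC) ? rec->boot_count : 0u;
}

/* Increments the count in place, stamping the marker and saturating at UINT16_MAX. */
static void boot_record_increment(boot_record_t *rec) {
    assert(rec != NULL);
    uint16_t count = boot_record_get(rec);
    if (count < UINT16_MAX) {
        count++;
    }
    rec->magic = BOOT_RECORD_MAGIC;
    rec->boot_count = count;
}
```

The caller would still have to copy the record to and from the actual storage; that part depends on the board and is left out here.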

2. Logic Flow

  1. On Boot:
     • Read boot_count from NVM.
     • Increment boot_count and write it back to NVM.
     • Start a timer for the stability threshold (10 minutes).
  2. Check Condition: If boot_count > BOOT_LIMIT:
     • Action: Set flood.advert.interval to 0 (disabled).
     • Action: Potentially do the same for zero-hop adverts by setting advert.interval to 0.
     • Action: (Optional) Enter "Low Power Mode" or increase RX-only time to allow the battery to charge.
  3. On Timer Expiry (Stable Operation):
     • If the device reaches 10 minutes without crashing, reset boot_count to 0 in NVM.
     • Restore the original advert.interval if it was previously suppressed.
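The flow above can be tried out on a desktop machine before touching firmware. The following is a minimal sketch under stated assumptions: `sim_node_t`, `sim_boot` and `sim_uptime` are hypothetical names, NVM is modelled as a plain struct field, and the limit of 10 is just the upper end of the proposed range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BOOT_LIMIT 10u             /* upper end of the proposed 3-10 range (tbd) */
#define STABILITY_TIME_MS 600000u  /* 10 minutes */

/* Hypothetical stand-in for a node: one persistent cell plus a runtime flag. */
typedef struct {
    uint32_t nvm_boot_count;  /* survives resets (simulated NVM) */
    bool flood_adverts_on;    /* runtime setting, re-evaluated on every boot */
} sim_node_t;

/* Steps 1 and 2: count the boot, then decide whether flood adverts are allowed.
 * Returns true if flood adverts stay enabled for this boot. */
static bool sim_boot(sim_node_t *node) {
    assert(node != NULL);
    if (node->nvm_boot_count < UINT32_MAX) {
        node->nvm_boot_count++;  /* saturate instead of wrapping to 0 */
    }
    node->flood_adverts_on = (node->nvm_boot_count <= BOOT_LIMIT);
    return node->flood_adverts_on;
}

/* Step 3: report how long the node stayed up before its next reset. */
static void sim_uptime(sim_node_t *node, uint32_t uptime_ms) {
    assert(node != NULL);
    if (uptime_ms >= STABILITY_TIME_MS) {
        node->nvm_boot_count = 0;       /* power is considered stable */
        node->flood_adverts_on = true;  /* restore suppressed adverts */
    }
}
```

In this sketch, ten short-lived boots in a row still advertise, the eleventh does not, and a single run of at least STABILITY_TIME_MS clears the counter so the next boot advertises again.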

3. Pseudo-Code Implementation

// Define storage and constants
#define BOOT_LIMIT 10            // boots tolerated before adverts are suppressed (tbd)
#define STABILITY_TIME_MS 600000 // 10 minutes

void reset_boot_counter(void);   // forward declaration for the timer callback

void setup() {
    // Count this boot first, before anything power-hungry can crash the device
    int boot_count = nvm_read_boot_count();
    boot_count++;
    nvm_write_boot_count(boot_count);

    if (boot_count > BOOT_LIMIT) {
        // Suppress flood adverts to protect the mesh
        meshcore_set_flood_advert_interval(0);
        serial_println("Boot loop detected! Adverts disabled.");
    }

    // Schedule the reset of the counter
    timer_run_after(STABILITY_TIME_MS, reset_boot_counter);
}

// Runs only if the device stayed up for STABILITY_TIME_MS
void reset_boot_counter(void) {
    nvm_write_boot_count(0);
    // Optionally restore default advert interval here
}

Impact Assessment

With this "circuit breaker" in place, a failing solar node can "hit" the mesh at most BOOT_LIMIT times (10 in the pseudo-code above) before it is silenced. The node can still recover: once the sun provides enough peak current to keep the system running for more than 10 minutes, the counter is reset and the node automatically resumes its role in the network.
