Commit 813251c

Add RedPanda cluster-failure detection rule (prequel-dev#63)
* new file: rules/cre-2025-0102/redpanda-test-error.yml
* new file: rules/cre-2025-0102/test.log
* modified: rules/tags/categories.yaml
* modified: rules/tags/tags.yaml
* modified: rules/cre-2025-0102/redpanda-test-error.yml
* modified: rules/cre-2025-0102/redpanda-test-error.yml
* renamed: rules/cre-2025-0102/redpanda-test-error.yml -> rules/cre-2025-0102/redpanda-test-error.yaml
1 parent e8b1f82 commit 813251c

4 files changed: 118 additions & 0 deletions

rules/cre-2025-0102/redpanda-test-error.yaml

Lines changed: 74 additions & 0 deletions
rules:
  - metadata:
      kind: prequel
      id: JQAAVPJSiVrsWxtj3jqzBy
      generation: 1
    cre:
      id: CRE-2025-0102
      severity: 1
      title: Redpanda Cluster Critical Failure - Node Loss, Quorum Lost, and Data Availability Impacted
      category: "distributed-systems"
      author: Prequel
      description: |
        - The Redpanda streaming data platform is experiencing a severe, cascading failure.
        - This typically involves critical errors on one or more nodes (e.g., storage failures), leading to nodes becoming unresponsive or shutting down.
        - Subsequently, this can cause loss of controller quorum, leadership election problems for partitions, and a significant degradation in overall cluster health and data availability.
      cause: |
        - Persistent hardware failures on a broker node (e.g., disk I/O errors, failing disk, NIC issues).
        - A Redpanda broker process crashing repeatedly due to software bugs, resource exhaustion (OOM), or critical misconfiguration.
        - Severe network partitioning that isolates nodes or groups of nodes from each other, preventing Raft consensus.
        - Failure of critical system resources (e.g., full disk preventing writes, insufficient memory).
        - Cascading effects where an initial failure on one node triggers instability across other nodes.
      impact: |
        - Significant disruption or complete loss of data production and consumption capabilities for multiple topics/partitions.
        - High likelihood of data unavailability for partitions whose leaders or sufficient replicas are on the failed/unreachable nodes.
        - Inability to perform cluster management operations (e.g., creating/deleting topics, reconfiguring partitions) if controller quorum is lost.
        - Drastically increased error rates and latencies for client applications.
        - Potential for prolonged service outage requiring manual intervention and complex recovery procedures.
        - In worst-case scenarios with multiple simultaneous failures, risk of data loss for under-replicated partitions.
      tags:
        - redpanda
        - streaming-data
        - cluster-failure
        - node-down
        - quorum-loss
        - data-availability
        - critical-error
        - distributed-system
      mitigation: |
        - **Initial Triage & Isolation:**
          - Identify the specific node(s) reporting critical errors (e.g., I/O errors, shutdown messages) from Redpanda logs.
          - Check basic system health on affected nodes: `dmesg`, disk space (`df -h`), memory usage (`free -m`), CPU load (`top` or `htop`), network connectivity (`ping`, `ip addr`).
        - **Address Node-Specific Failures:**
          - **Disk Issues:** If I/O errors or "No space left on device" occur, check disk health (e.g., `smartctl`), free up space, or prepare for disk replacement.
          - **Node Shutdowns:** Investigate Redpanda logs on the failed node for the root cause of the shutdown.
          - Attempt to restart the Redpanda service on the affected node if the underlying issue is resolved.
        - **Cluster Stability:**
          - **Controller Quorum:** If controller quorum is lost, prioritize bringing controller nodes back online. This may require resolving issues on those specific nodes.
          - **Network Issues:** Verify robust network connectivity and low latency between all Redpanda brokers. Check switches, firewalls, and MTU settings.
        - **Recovery Procedures (Consult Redpanda Documentation):**
          - If a node is permanently lost, follow official Redpanda procedures for removing the dead node from the cluster and replacing it. This will trigger data re-replication.
          - Monitor partition health and re-replication progress (`rpk cluster partitions list`, `rpk cluster health`).
          - If leadership elections are failing, ensure a majority of replicas for those partitions are online and healthy.
        - **Preventative Measures:**
          - Implement comprehensive monitoring for Redpanda metrics and system-level metrics (disk, CPU, memory, network).
          - Regularly review Redpanda logs for warnings or errors.
          - Ensure sufficient disk space and I/O capacity.
          - Maintain up-to-date Redpanda versions.
          - Test disaster recovery procedures.
      references:
        - "https://docs.redpanda.com/redpanda-cloud/reference/public-metrics-reference/#redpanda_cluster_members_backend_queued_node_operations"
        - "https://docs.redpanda.com/redpanda-cloud/reference/rpk/rpk-commands/"
      applications:
        - name: redpanda
          version: "24.1.2"
      impactScore: 0
      mitigationScore: 0
      reports: 1
    rule:
      set:
        event:
          source: cre.log.redpanda
        match:
          - regex: 'failure|leaving all raft groups|down|CRITICAL|Multiple nodes unresponsive|Low available memory|health degraded'
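As a sanity check, the rule's match regex can be exercised against representative log lines. A minimal Python sketch, assuming Python `re` semantics approximate the rule engine's matcher (the pattern is copied verbatim from the rule above; the sample lines are abridged from the bundled test.log, plus one illustrative benign line):

```python
import re

# The rule's match regex, copied verbatim from the YAML above.
# Note: alternatives like 'down' and 'failure' are bare substrings,
# so the pattern matches anywhere in a line.
pattern = re.compile(
    r'failure|leaving all raft groups|down|CRITICAL'
    r'|Multiple nodes unresponsive|Low available memory|health degraded'
)

# Abridged lines from rules/cre-2025-0102/test.log; the last line is a
# made-up benign message that should not trigger the rule.
samples = [
    "[level=CRITICAL] Node shutting down due to persistent critical storage failure.",
    "Heartbeat to node_id=3 failed after 3 retries. Marking node as down.",
    "Cluster health degraded: Multiple nodes unresponsive or reporting critical errors.",
    "Topic 'orders_topic' created successfully.",
]

for line in samples:
    print("MATCH" if pattern.search(line) else "no match", "-", line)
```

The first three samples match; the benign line does not. Because broad substrings such as `down` also appear in innocuous words, a production version of this rule may want word boundaries or level-scoped matching.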

rules/cre-2025-0102/test.log

Lines changed: 14 additions & 0 deletions
2025-06-06 16:15:30.125+05:30 [node_id=3] [subsystem=storage] [level=ERROR] Critical I/O error on partition 'orders_topic-1' (segment: 00000000000000001234.log): Write failed due to EIO (Input/output error) on disk '/var/lib/redpanda/data'. Disk may be failing.
2025-06-06 16:15:30.780+05:30 [node_id=1] [subsystem=rpc] [level=WARN] Heartbeat to node_id=3 (10.0.1.103:33145) timed out after 5000ms. Retrying...
2025-06-06 16:15:31.201+05:30 [node_id=2] [subsystem=raft] [level=ERROR] Partition 'user_profiles_topic-0' (group_id=5, term=102): Failed to replicate entry 5038291 to follower node_id=3. Follower is offline or disk full.
2025-06-06 16:15:32.500+05:30 [node_id=3] [subsystem=main] [level=CRITICAL] Node shutting down due to persistent critical storage failure. Unable to write to data directory '/var/lib/redpanda/data'.
2025-06-06 16:15:32.505+05:30 [node_id=3] [subsystem=rpc] [level=INFO] RPC server shutting down.
2025-06-06 16:15:32.510+05:30 [node_id=3] [subsystem=raft] [level=INFO] Leaving all Raft groups.
2025-06-06 16:15:33.950+05:30 [node_id=1] [subsystem=rpc] [level=ERROR] Heartbeat to node_id=3 (10.0.1.103:33145) failed after 3 retries. Marking node as down.
2025-06-06 16:15:34.010+05:30 [node_id=1] [subsystem=cluster] [level=WARN] Node node_id=3 reported as down. Re-evaluating leadership for its partitions.
2025-06-06 16:15:34.550+05:30 [node_id=2] [subsystem=controller] [level=CRITICAL] Controller quorum lost. Available controller nodes: [node_id=2]. Required: 2. Cluster metadata operations stalled.
2025-06-06 16:15:35.112+05:30 [node_id=1] [subsystem=raft] [level=ERROR] Partition 'orders_topic-2' (group_id=8, term=77): Leadership election failed. Not enough live replicas to form quorum (1/3 available). Data production/consumption may be stalled.
2025-06-06 16:15:35.834+05:30 [node_id=2] [subsystem=kafka_api] [level=ERROR] Produce request for topic 'inventory_updates_topic' failed: NOT_LEADER_FOR_PARTITION. No leader available for partition 0.
2025-06-06 16:15:36.200+05:30 [node_id=1] [subsystem=main] [level=ERROR] High number of under-replicated partitions: 15. Cluster health critical.
2025-06-06 16:15:37.005+05:30 [node_id=2] [subsystem=memory_monitor] [level=CRITICAL] Low available memory detected (210MB free / 16384MB total). Risk of OOM. Aggressive resource reclamation initiated.
2025-06-06 16:15:38.150+05:30 [node_id=1] [subsystem=health_manager] [level=CRITICAL] Cluster health degraded: Multiple nodes unresponsive or reporting critical errors. Controller quorum lost. Data availability severely impacted.
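Each test line carries bracketed key=value fields (`[node_id=…] [subsystem=…] [level=…]`). For tooling around this fixture, those fields can be pulled apart with a small parser; a sketch in Python (the field names come from the lines above, everything else is illustrative):

```python
import re

# Each test line embeds fields like [node_id=3] [subsystem=storage] [level=ERROR].
FIELD = re.compile(r'\[(\w+)=([^\]]+)\]')

line = ("2025-06-06 16:15:32.500+05:30 [node_id=3] [subsystem=main] [level=CRITICAL] "
        "Node shutting down due to persistent critical storage failure.")

# findall returns (key, value) pairs; dict() collapses them into one record.
fields = dict(FIELD.findall(line))
print(fields)  # {'node_id': '3', 'subsystem': 'main', 'level': 'CRITICAL'}
```

A parser like this makes it easy to assert in tests that, say, every CRITICAL line in the fixture actually triggers the rule's regex.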

rules/tags/categories.yaml

Lines changed: 6 additions & 0 deletions
@@ -144,9 +144,15 @@ categories:
   - name: distributed-worker-connectivity
     displayName: Distributed Worker Connectivity Issues
     description: Failures where a distributed systems worker fails to reach or stay connected to the orchestration backend (e.g., Temporal, Celery).
+
+  - name: redpanda-cluster-failure
+    displayName: Redpanda Cluster Failure
+    description: Problems related to Redpanda cluster failures, including node loss, quorum loss, and data availability impact.
+
   - name: authorization-systems
     displayName: Authorization Systems
     description: |
       Failures in systems that manage access control, identity, or permissions.
       This includes tools like SpiceDB, OPA, or Auth0 where schema, policy, or
       integration issues can block authentication or authorization flows.
+

rules/tags/tags.yaml

Lines changed: 24 additions & 0 deletions
@@ -567,6 +567,26 @@ tags:
   - name: startup-failure
     displayName: Startup Failure
     description: Problems related to start-up failures
+
+  - name: redpanda
+    displayName: Redpanda
+    description: Problems related to Redpanda, a streaming data platform
+  - name: streaming-data
+    displayName: Streaming Data
+    description: Problems related to streaming data platforms and systems
+  - name: cluster-failure
+    displayName: Cluster Failure
+    description: Problems related to cluster failures, including node loss, quorum loss, and data availability impact
+  - name: node-down
+    displayName: Node Down
+    description: Problems related to nodes going down in a cluster, impacting availability and performance
+  - name: quorum-loss
+    displayName: Quorum Loss
+    description: Problems related to loss of quorum in distributed systems, impacting consensus and availability
+  - name: data-availability
+    displayName: Data Availability
+    description: Problems related to data availability in distributed systems, such as loss of access to critical data
+
   - name: spicedb
     displayName: SpiceDB
     description: Issues related to SpiceDB authorization service

@@ -576,6 +596,10 @@ tags:
   - name: schema-error
     displayName: Schema Error
     description: Missing or corrupted database schema elements such as tables or columns
+
+
+
   - name: logs
     displayName: Logs
     description: Problems with log processing
+
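Commits like this touch the rule file and the tag registry together, so a lightweight consistency check is useful: every tag a rule declares should exist in tags.yaml. A stdlib-only sketch (the inline strings stand in for the real files; in the repo they would be read from rules/cre-2025-0102/redpanda-test-error.yaml and rules/tags/tags.yaml, and a real check would use a YAML parser rather than a regex):

```python
import re

# Inline stand-in for rules/tags/tags.yaml (abridged).
tags_yaml = """\
tags:
  - name: redpanda
  - name: streaming-data
  - name: cluster-failure
  - name: quorum-loss
"""

# Tags declared by the rule (abridged from the YAML file in this commit).
rule_tags = ["redpanda", "streaming-data", "cluster-failure", "quorum-loss"]

# Collect every registered tag name from the registry text.
registered = set(re.findall(r'-\s*name:\s*(\S+)', tags_yaml))

# Fail loudly if the rule references a tag that was never registered.
missing = set(rule_tags) - registered
assert not missing, f"unregistered tags: {missing}"
print("all rule tags registered")
```

Wired into CI, a check like this would catch a rule landing without its matching tags.yaml entries.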
