Commit 813251c

Add RedPanda cluster-failure detection rule (prequel-dev#63)
* new file: rules/cre-2025-0102/redpanda-test-error.yml
* new file: rules/cre-2025-0102/test.log
* modified: rules/tags/categories.yaml
* modified: rules/tags/tags.yaml
* modified: rules/cre-2025-0102/redpanda-test-error.yml
* modified: rules/cre-2025-0102/redpanda-test-error.yml
* renamed: rules/cre-2025-0102/redpanda-test-error.yml -> rules/cre-2025-0102/redpanda-test-error.yaml
1 parent e8b1f82 commit 813251c

4 files changed: 118 additions & 0 deletions

rules/cre-2025-0102/redpanda-test-error.yaml

Lines changed: 74 additions & 0 deletions
rules:
  - metadata:
      kind: prequel
      id: JQAAVPJSiVrsWxtj3jqzBy
      generation: 1
    cre:
      id: CRE-2025-0102
      severity: 1
      title: Redpanda Cluster Critical Failure - Node Loss, Quorum Lost, and Data Availability Impacted
      category: "distributed-systems"
      author: Prequel
      description: |
        - The Redpanda streaming data platform is experiencing a severe, cascading failure.
        - This typically involves critical errors on one or more nodes (e.g., storage failures), leading to nodes becoming unresponsive or shutting down.
        - Subsequently, this can cause loss of controller quorum, leadership election problems for partitions, and a significant degradation in overall cluster health and data availability.
      cause: |
        - Persistent hardware failures on a broker node (e.g., disk I/O errors, failing disk, NIC issues).
        - A Redpanda broker process crashing repeatedly due to software bugs, resource exhaustion (OOM), or critical misconfiguration.
        - Severe network partitioning that isolates nodes or groups of nodes from each other, preventing Raft consensus.
        - Failure of critical system resources (e.g., full disk preventing writes, insufficient memory).
        - Cascading effects where an initial failure on one node triggers instability across other nodes.
      impact: |
        - Significant disruption or complete loss of data production and consumption capabilities for multiple topics/partitions.
        - High likelihood of data unavailability for partitions whose leaders or sufficient replicas are on the failed/unreachable nodes.
        - Inability to perform cluster management operations (e.g., creating/deleting topics, reconfiguring partitions) if controller quorum is lost.
        - Drastically increased error rates and latencies for client applications.
        - Potential for prolonged service outage requiring manual intervention and complex recovery procedures.
        - In worst-case scenarios with multiple simultaneous failures, risk of data loss for under-replicated partitions.
      tags:
        - redpanda
        - streaming-data
        - cluster-failure
        - node-down
        - quorum-loss
        - data-availability
        - critical-error
        - distributed-system
      mitigation: |
        - **Initial Triage & Isolation:**
          - Identify the specific node(s) reporting critical errors (e.g., I/O errors, shutdown messages) from Redpanda logs.
          - Check basic system health on affected nodes: `dmesg`, disk space (`df -h`), memory usage (`free -m`), CPU load (`top` or `htop`), network connectivity (`ping`, `ip addr`).
        - **Address Node-Specific Failures:**
          - **Disk Issues:** If I/O errors or "No space left on device" occur, check disk health (e.g., `smartctl`), free up space, or prepare for disk replacement.
          - **Node Shutdowns:** Investigate Redpanda logs on the failed node for the root cause of the shutdown.
          - Attempt to restart the Redpanda service on the affected node if the underlying issue is resolved.
        - **Cluster Stability:**
          - **Controller Quorum:** If controller quorum is lost, prioritize bringing controller nodes back online. This may require resolving issues on those specific nodes.
          - **Network Issues:** Verify robust network connectivity and low latency between all Redpanda brokers. Check switches, firewalls, and MTU settings.
        - **Recovery Procedures (Consult Redpanda Documentation):**
          - If a node is permanently lost, follow official Redpanda procedures for removing the dead node from the cluster and replacing it. This will trigger data re-replication.
          - Monitor partition health and re-replication progress (`rpk cluster partitions list`, `rpk cluster health`).
          - If leadership elections are failing, ensure a majority of replicas for those partitions are online and healthy.
        - **Preventative Measures:**
          - Implement comprehensive monitoring for Redpanda metrics and system-level metrics (disk, CPU, memory, network).
          - Regularly review Redpanda logs for warnings or errors.
          - Ensure sufficient disk space and I/O capacity.
          - Maintain up-to-date Redpanda versions.
          - Test disaster recovery procedures.
      references:
        - "https://docs.redpanda.com/redpanda-cloud/reference/public-metrics-reference/#redpanda_cluster_members_backend_queued_node_operations"
        - "https://docs.redpanda.com/redpanda-cloud/reference/rpk/rpk-commands/"
      applications:
        - name: redpanda
          version: "24.1.2"
      impactScore: 0
      mitigationScore: 0
      reports: 1
    rule:
      set:
        event:
          source: cre.log.redpanda
        match:
          - regex: 'failure|leaving all raft groups|down|CRITICAL|Multiple nodes unresponsive|Low available memory|health degraded'
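As a sanity check, the rule's match regex can be exercised against representative log lines. A minimal Python sketch, assuming Python `re` semantics approximate the rule engine's matcher (the pattern is copied verbatim from the rule above; the sample lines are abridged from the bundled test.log, plus one illustrative benign line):

```python
import re

# The rule's match regex, copied verbatim from the YAML above.
# Note: alternatives like 'down' and 'failure' are bare substrings,
# so the pattern matches anywhere in a line.
pattern = re.compile(
    r'failure|leaving all raft groups|down|CRITICAL'
    r'|Multiple nodes unresponsive|Low available memory|health degraded'
)

# Abridged lines from rules/cre-2025-0102/test.log; the last line is a
# made-up benign message that should not trigger the rule.
samples = [
    "[level=CRITICAL] Node shutting down due to persistent critical storage failure.",
    "Heartbeat to node_id=3 failed after 3 retries. Marking node as down.",
    "Cluster health degraded: Multiple nodes unresponsive or reporting critical errors.",
    "Topic 'orders_topic' created successfully.",
]

for line in samples:
    print("MATCH" if pattern.search(line) else "no match", "-", line)
```

The first three samples match; the benign line does not. Because broad substrings such as `down` also appear in innocuous words, a production version of this rule may want word boundaries or level-scoped matching.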

rules/cre-2025-0102/test.log

Lines changed: 14 additions & 0 deletions
2025-06-06 16:15:30.125+05:30 [node_id=3] [subsystem=storage] [level=ERROR] Critical I/O error on partition 'orders_topic-1' (segment: 00000000000000001234.log): Write failed due to EIO (Input/output error) on disk '/var/lib/redpanda/data'. Disk may be failing.
2025-06-06 16:15:30.780+05:30 [node_id=1] [subsystem=rpc] [level=WARN] Heartbeat to node_id=3 (10.0.1.103:33145) timed out after 5000ms. Retrying...
2025-06-06 16:15:31.201+05:30 [node_id=2] [subsystem=raft] [level=ERROR] Partition 'user_profiles_topic-0' (group_id=5, term=102): Failed to replicate entry 5038291 to follower node_id=3. Follower is offline or disk full.
2025-06-06 16:15:32.500+05:30 [node_id=3] [subsystem=main] [level=CRITICAL] Node shutting down due to persistent critical storage failure. Unable to write to data directory '/var/lib/redpanda/data'.
2025-06-06 16:15:32.505+05:30 [node_id=3] [subsystem=rpc] [level=INFO] RPC server shutting down.
2025-06-06 16:15:32.510+05:30 [node_id=3] [subsystem=raft] [level=INFO] Leaving all Raft groups.
2025-06-06 16:15:33.950+05:30 [node_id=1] [subsystem=rpc] [level=ERROR] Heartbeat to node_id=3 (10.0.1.103:33145) failed after 3 retries. Marking node as down.
2025-06-06 16:15:34.010+05:30 [node_id=1] [subsystem=cluster] [level=WARN] Node node_id=3 reported as down. Re-evaluating leadership for its partitions.
2025-06-06 16:15:34.550+05:30 [node_id=2] [subsystem=controller] [level=CRITICAL] Controller quorum lost. Available controller nodes: [node_id=2]. Required: 2. Cluster metadata operations stalled.
2025-06-06 16:15:35.112+05:30 [node_id=1] [subsystem=raft] [level=ERROR] Partition 'orders_topic-2' (group_id=8, term=77): Leadership election failed. Not enough live replicas to form quorum (1/3 available). Data production/consumption may be stalled.
2025-06-06 16:15:35.834+05:30 [node_id=2] [subsystem=kafka_api] [level=ERROR] Produce request for topic 'inventory_updates_topic' failed: NOT_LEADER_FOR_PARTITION. No leader available for partition 0.
2025-06-06 16:15:36.200+05:30 [node_id=1] [subsystem=main] [level=ERROR] High number of under-replicated partitions: 15. Cluster health critical.
2025-06-06 16:15:37.005+05:30 [node_id=2] [subsystem=memory_monitor] [level=CRITICAL] Low available memory detected (210MB free / 16384MB total). Risk of OOM. Aggressive resource reclamation initiated.
2025-06-06 16:15:38.150+05:30 [node_id=1] [subsystem=health_manager] [level=CRITICAL] Cluster health degraded: Multiple nodes unresponsive or reporting critical errors. Controller quorum lost. Data availability severely impacted.
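Each test line carries bracketed key=value fields (`[node_id=…] [subsystem=…] [level=…]`). For tooling around this fixture, those fields can be pulled apart with a small parser; a sketch in Python (the field names come from the lines above, everything else is illustrative):

```python
import re

# Each test line embeds fields like [node_id=3] [subsystem=storage] [level=ERROR].
FIELD = re.compile(r'\[(\w+)=([^\]]+)\]')

line = ("2025-06-06 16:15:32.500+05:30 [node_id=3] [subsystem=main] [level=CRITICAL] "
        "Node shutting down due to persistent critical storage failure.")

# findall returns (key, value) pairs; dict() collapses them into one record.
fields = dict(FIELD.findall(line))
print(fields)  # {'node_id': '3', 'subsystem': 'main', 'level': 'CRITICAL'}
```

A parser like this makes it easy to assert in tests that, say, every CRITICAL line in the fixture actually triggers the rule's regex.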

rules/tags/categories.yaml

Lines changed: 6 additions & 0 deletions
@@ -144,9 +144,15 @@ categories:
   - name: distributed-worker-connectivity
     displayName: Distributed Worker Connectivity Issues
     description: Failures where a distributed systems worker fails to reach or stay connected to the orchestration backend (e.g., Temporal, Celery).
+
+  - name: redpanda-cluster-failure
+    displayName: Redpanda Cluster Failure
+    description: Problems related to Redpanda cluster failures, including node loss, quorum loss, and data availability impact.
+
   - name: authorization-systems
     displayName: Authorization Systems
     description: |
       Failures in systems that manage access control, identity, or permissions.
       This includes tools like SpiceDB, OPA, or Auth0 where schema, policy, or
       integration issues can block authentication or authorization flows.
+

rules/tags/tags.yaml

Lines changed: 24 additions & 0 deletions
@@ -567,6 +567,26 @@ tags:
   - name: startup-failure
     displayName: Startup Failure
     description: Problems related to start-up failures
+
+  - name: redpanda
+    displayName: Redpanda
+    description: Problems related to Redpanda, a streaming data platform
+  - name: streaming-data
+    displayName: Streaming Data
+    description: Problems related to streaming data platforms and systems
+  - name: cluster-failure
+    displayName: Cluster Failure
+    description: Problems related to cluster failures, including node loss, quorum loss, and data availability impact
+  - name: node-down
+    displayName: Node Down
+    description: Problems related to nodes going down in a cluster, impacting availability and performance
+  - name: quorum-loss
+    displayName: Quorum Loss
+    description: Problems related to loss of quorum in distributed systems, impacting consensus and availability
+  - name: data-availability
+    displayName: Data Availability
+    description: Problems related to data availability in distributed systems, such as loss of access to critical data
+
   - name: spicedb
     displayName: SpiceDB
     description: Issues related to SpiceDB authorization service

@@ -576,6 +596,10 @@ tags:
   - name: schema-error
     displayName: Schema Error
     description: Missing or corrupted database schema elements such as tables or columns
+
+
+
   - name: logs
     displayName: Logs
     description: Problems with log processing
+
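Commits like this touch the rule file and the tag registry together, so a lightweight consistency check is useful: every tag a rule declares should exist in tags.yaml. A stdlib-only sketch (the inline strings stand in for the real files; in the repo they would be read from rules/cre-2025-0102/redpanda-test-error.yaml and rules/tags/tags.yaml, and a real check would use a YAML parser rather than a regex):

```python
import re

# Inline stand-in for rules/tags/tags.yaml (abridged).
tags_yaml = """\
tags:
  - name: redpanda
  - name: streaming-data
  - name: cluster-failure
  - name: quorum-loss
"""

# Tags declared by the rule (abridged from the YAML file in this commit).
rule_tags = ["redpanda", "streaming-data", "cluster-failure", "quorum-loss"]

# Collect every registered tag name from the registry text.
registered = set(re.findall(r'-\s*name:\s*(\S+)', tags_yaml))

# Fail loudly if the rule references a tag that was never registered.
missing = set(rule_tags) - registered
assert not missing, f"unregistered tags: {missing}"
print("all rule tags registered")
```

Wired into CI, a check like this would catch a rule landing without its matching tags.yaml entries.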
