title: Redpanda Cluster Critical Failure - Node Loss, Quorum Lost, and Data Availability Impacted
category: "distributed-systems"
author: Prequel
description: |
  - The Redpanda streaming data platform is experiencing a severe, cascading failure.
  - This typically involves critical errors on one or more nodes (e.g., storage failures), leading to nodes becoming unresponsive or shutting down.
  - Subsequently, this can cause loss of controller quorum, leadership election problems for partitions, and a significant degradation in overall cluster health and data availability.
cause: |
  - Persistent hardware failures on a broker node (e.g., disk I/O errors, failing disk, NIC issues).
  - A Redpanda broker process crashing repeatedly due to software bugs, resource exhaustion (OOM), or critical misconfiguration.
  - Severe network partitioning that isolates nodes or groups of nodes from each other, preventing Raft consensus.
  - Exhaustion of critical system resources (e.g., a full disk preventing writes, insufficient memory).
  - Cascading effects where an initial failure on one node triggers instability across other nodes.
impact: |
  - Significant disruption or complete loss of data production and consumption capabilities for multiple topics/partitions.
  - High likelihood of data unavailability for partitions whose leaders or sufficient replicas are on the failed/unreachable nodes.
  - Inability to perform cluster management operations (e.g., creating/deleting topics, reconfiguring partitions) if controller quorum is lost.
  - Drastically increased error rates and latencies for client applications.
  - Potential for prolonged service outage requiring manual intervention and complex recovery procedures.
  - In worst-case scenarios with multiple simultaneous failures, risk of data loss for under-replicated partitions.
tags:
  - redpanda
  - streaming-data
  - cluster-failure
  - node-down
  - quorum-loss
  - data-availability
  - critical-error
  - distributed-system
mitigation: |
  - **Initial Triage & Isolation:**
    - Identify the specific node(s) reporting critical errors (e.g., I/O errors, shutdown messages) in the Redpanda logs.
    - Check basic system health on the affected nodes: kernel messages (`dmesg`), disk space (`df -h`), memory usage (`free -m`), CPU load (`top` or `htop`), and network connectivity (`ping`, `ip addr`); see the triage sketch after this list.
  - **Address Node-Specific Failures:**
    - **Disk Issues:** If I/O errors or "No space left on device" errors occur, check disk health (e.g., with `smartctl`), free up space, or prepare for disk replacement.
    - **Node Shutdowns:** Investigate the Redpanda logs on the failed node for the root cause of the shutdown.
    - Attempt to restart the Redpanda service on the affected node once the underlying issue is resolved (see the node-recovery sketch after this list).
  - **Cluster Stability:**
    - **Controller Quorum:** If controller quorum is lost, prioritize bringing controller nodes back online. This may require resolving issues on those specific nodes.
    - **Network Issues:** Verify robust network connectivity and low latency between all Redpanda brokers. Check switches, firewalls, and MTU settings.
    - If a node is permanently lost, follow the official Redpanda procedures for removing the dead node from the cluster and replacing it. This will trigger data re-replication.
    - Monitor partition health and re-replication progress (`rpk cluster partitions list`, `rpk cluster health`); see the cluster-status sketch after this list.
    - If leadership elections are failing, ensure a majority of replicas for the affected partitions are online and healthy.
  - **Preventative Measures:**
    - Implement comprehensive monitoring of Redpanda metrics and system-level metrics (disk, CPU, memory, network); see the monitoring sketch after this list.
    - Regularly review Redpanda logs for warnings and errors.
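  A minimal triage sketch for the checks above, assuming shell access to the affected broker, the default data directory `/var/lib/redpanda/data`, and the peer address from the example logs (10.0.1.103); adjust paths and hosts to your deployment:

  ```bash
  # Kernel-level symptoms (I/O errors, OOM kills) usually surface here first.
  dmesg -T | tail -n 50

  # Disk space on the Redpanda data directory; a full disk stalls all writes.
  df -h /var/lib/redpanda/data

  # Memory and CPU pressure on the node.
  free -m
  top -b -n 1 | head -n 20

  # Reachability of a peer broker (33145 is Redpanda's default internal RPC port).
  ping -c 3 10.0.1.103
  ip addr show
  ```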
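  A node-recovery sketch for the node-specific steps, assuming a systemd-managed install where the service unit is named `redpanda` and the suspect disk is `/dev/sda` (both are assumptions; Kubernetes deployments restart the pod instead):

  ```bash
  # SMART health of the suspect disk (device name is an assumption; find yours
  # with `lsblk`).
  sudo smartctl -H /dev/sda

  # Root cause of the shutdown from the broker's own logs.
  sudo journalctl -u redpanda --since "1 hour ago" | grep -Ei 'error|critical'

  # Restart the broker only after the underlying disk/resource issue is fixed.
  sudo systemctl restart redpanda
  sudo systemctl status redpanda
  ```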
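  A cluster-status sketch using the `rpk` commands referenced above; subcommand availability varies by Redpanda version, and broker ID 3 mirrors the failed node in the example logs:

  ```bash
  # Overall health: nodes down, leaderless partitions, under-replicated counts.
  rpk cluster health

  # Per-partition view to watch leadership and re-replication progress.
  rpk cluster partitions list

  # If node 3 is permanently lost, decommission it so its partitions are
  # rebuilt on the surviving brokers (consult the official docs first).
  rpk redpanda admin brokers decommission 3
  ```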
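  A monitoring sketch, assuming the admin API listens on its default port 9644 and the broker is reachable locally; exact metric names vary by version, so the grep patterns are illustrative:

  ```bash
  # Redpanda exposes Prometheus-format metrics on the admin API; look for
  # under-replication and unavailable-partition series.
  curl -s http://localhost:9644/public_metrics | grep -E 'under_replicated|unavailable'

  # Generate a Prometheus scrape config covering all brokers.
  rpk generate prometheus-config
  ```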
Example log excerpt illustrating the failure sequence described above:

2025-06-06 16:15:30.125+05:30 [node_id=3] [subsystem=storage] [level=ERROR] Critical I/O error on partition 'orders_topic-1' (segment: 00000000000000001234.log): Write failed due to EIO (Input/output error) on disk '/var/lib/redpanda/data'. Disk may be failing.
2025-06-06 16:15:30.780+05:30 [node_id=1] [subsystem=rpc] [level=WARN] Heartbeat to node_id=3 (10.0.1.103:33145) timed out after 5000ms. Retrying...
2025-06-06 16:15:31.201+05:30 [node_id=2] [subsystem=raft] [level=ERROR] Partition 'user_profiles_topic-0' (group_id=5, term=102): Failed to replicate entry 5038291 to follower node_id=3. Follower is offline or disk full.
2025-06-06 16:15:32.500+05:30 [node_id=3] [subsystem=main] [level=CRITICAL] Node shutting down due to persistent critical storage failure. Unable to write to data directory '/var/lib/redpanda/data'.
2025-06-06 16:15:32.505+05:30 [node_id=3] [subsystem=rpc] [level=INFO] RPC server shutting down.
2025-06-06 16:15:32.510+05:30 [node_id=3] [subsystem=raft] [level=INFO] Leaving all Raft groups.
2025-06-06 16:15:33.950+05:30 [node_id=1] [subsystem=rpc] [level=ERROR] Heartbeat to node_id=3 (10.0.1.103:33145) failed after 3 retries. Marking node as down.
2025-06-06 16:15:34.010+05:30 [node_id=1] [subsystem=cluster] [level=WARN] Node node_id=3 reported as down. Re-evaluating leadership for its partitions.
2025-06-06 16:15:34.550+05:30 [node_id=2] [subsystem=controller] [level=CRITICAL] Controller quorum lost. Available controller nodes: [node_id=2]. Required: 2. Cluster metadata operations stalled.
2025-06-06 16:15:35.112+05:30 [node_id=1] [subsystem=raft] [level=ERROR] Partition 'orders_topic-2' (group_id=8, term=77): Leadership election failed. Not enough live replicas to form quorum (1/3 available). Data production/consumption may be stalled.
2025-06-06 16:15:35.834+05:30 [node_id=2] [subsystem=kafka_api] [level=ERROR] Produce request for topic 'inventory_updates_topic' failed: NOT_LEADER_FOR_PARTITION. No leader available for partition 0.
2025-06-06 16:15:36.200+05:30 [node_id=1] [subsystem=main] [level=ERROR] High number of under-replicated partitions: 15. Cluster health critical.