User tickets & software testing related to slow faults

| System | Created | Feature | Fixed? | Testing | Others |
|--------|--------|--------|--------|--------|--------|
| HBase | 2019.04 | [Log roll on slow sync](https://issues.apache.org/jira/browse/HBASE-22301) | ✔️  | [TestLogRolling.java](https://github.com/apache/hbase/blob/master/hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java#L153) | Explicitly mentioned 'gray failure'.|
| HBase | 2017.09 | [Alert slow reads on a block](https://issues.apache.org/jira/browse/HBASE-18764) | Stale  | x | x |
| etcd | 2023.02 | [Leader stuck in handling raft Ready due to slow fdatasync or high CPU](https://github.com/etcd-io/etcd/issues/15247) | ✔️ | [v3_lease_no_proxy_test.go](https://github.com/etcd-io/etcd/blob/main/tests/e2e/v3_lease_no_proxy_test.go) | Still active |
| crdb | 2023.01 | [Adapt charybdefs-based roachtest to allow for indefinite stalls](https://github.com/cockroachdb/cockroach/issues/95874). Also discussed in [etcd-issues](https://github.com/etcd-io/etcd/issues/15247#issuecomment-1423769455) | ✔️ | [commit history](https://github.com/cockroachdb/cockroach/commit/6ed2fb0f1e98b0bdbf6d0f100bef2c1bf5dbf6de), [disk_stall.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/tests/disk_stall.go), [pebble_test.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/pebble_test.go#L277-L291) | x |
| crdb | 2022.05 | [Initial discussion on handling of degraded storage devices](https://github.com/cockroachdb/cockroach/issues/80942).  | 🚫  | x | Developers said "not clear to me what the best mechanism nor policy for this would be".  Marked as stale 18 months later. |
| crdb | 2020.10 | [Throw warning on slow log write, fatal on engine disk stall](https://github.com/cockroachdb/cockroach/pull/55186). | ✔️ | They found this to be [incomplete](https://github.com/cockroachdb/cockroach/issues/56893) and thus moved to the solutions listed above | Developers intend to be "conservative": they act on disk stalls (>20s) and only print a warning when the disk is slow (>10s). |
| cassandra | 2016.08 | [Slow query detecting](https://issues.apache.org/jira/browse/CASSANDRA-12403). | ✔️ | [stef1927/cql_test.py](https://github.com/stef1927/cassandra-dtest/blob/ed55a4961f424f8456e125fdeb70ca644e8572c9/cql_test.py#L1006) and [riptano/cql_test.py](https://github.com/riptano/cassandra-dtest-deprecated/pull/1249/files) | The tests set `slow_query_log_timeout_in_ms` to a small value (10ms and 30ms) so that the slow-query log is easily triggered; **no slow faults are injected**. |
| kafka | - | [Mitigating Kafka Broker ‘Gray’ Failures](https://www.confluent.io/events/current/2023/mitigating-kafka-broker-gray-failures-for-key-based-partitioners-with/).  |   |  |    |
| kafka | - | [SRECON'23: Improving Kafka Resilience - Gray Failures Mitigation](https://www.usenix.org/conference/srecon23emea/presentation/valentinova).  |   |  |    |

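The two thresholds quoted in the crdb 2020.10 row (warn above 10s, fatal above 20s) amount to a small classification policy. The following is a minimal sketch of that policy only, not CockroachDB's code: the names `Action`, `classify_disk_latency`, `WARN_AFTER_S`, and `FATAL_AFTER_S` are hypothetical, and the numbers are taken from the table above.

```python
from enum import Enum

# Thresholds as quoted in the table above; the names are hypothetical.
WARN_AFTER_S = 10.0
FATAL_AFTER_S = 20.0


class Action(Enum):
    OK = "ok"
    WARN = "warn"
    FATAL = "fatal"


def classify_disk_latency(
    elapsed_s: float,
    warn_after_s: float = WARN_AFTER_S,
    fatal_after_s: float = FATAL_AFTER_S,
) -> Action:
    """Map the duration of one disk operation to a response."""
    if elapsed_s > fatal_after_s:
        return Action.FATAL  # treated as a disk stall
    if elapsed_s > warn_after_s:
        return Action.WARN   # treated as a slow disk: log only
    return Action.OK         # anything at or below the warn threshold is unflagged
```

Under this policy an operation that takes 9s every time is never flagged, which is the kind of mid-severity slow fault the rows above rarely test.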

Misc:
* CRDB actually uses `charybdefs`, but only to simulate and test **_disk stalls_**. See [this ticket](https://github.com/OrderLab/xinda/issues/17#issue-2133165699) for how developers use charybdefs.
> In general, developers have access to tools for injecting slow faults and use them deliberately, but they target only the worst-case scenario (disk stalls); they do not iterate through different severity levels even though the tools allow it.

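Iterating through severity levels, as opposed to testing only a full stall, could look roughly like the sketch below. It is an in-process stand-in that uses `time.sleep` to delay an operation; it does not use charybdefs or any real fault injector, and the helpers `with_injected_delay` and `sweep_severities` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Sequence


def with_injected_delay(op: Callable[[], None], delay_s: float) -> Callable[[], None]:
    """Wrap an operation so each call is delayed by delay_s seconds first."""
    def slowed() -> None:
        time.sleep(delay_s)  # stand-in for a slow disk or network
        op()
    return slowed


def sweep_severities(
    op: Callable[[], None],
    delays_s: Sequence[float],
    runs: int = 3,
) -> Dict[float, float]:
    """Run op under each injected delay and return the mean latency per delay."""
    results: Dict[float, float] = {}
    for delay in delays_s:
        slowed = with_injected_delay(op, delay)
        start = time.perf_counter()
        for _ in range(runs):
            slowed()
        results[delay] = (time.perf_counter() - start) / runs
    return results


if __name__ == "__main__":
    # Sweep from no fault to a mild slowdown instead of only the worst case.
    for delay, mean_latency in sweep_severities(lambda: None, [0.0, 0.01, 0.05]).items():
        print(f"injected delay {delay:.3f}s -> mean latency {mean_latency:.3f}s")
```

A real test would replace the no-op with a client request against the system under test and check its behavior at each level, not only at the stall extreme.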