User tickets & software testing related to slow faults #17

@ruiming-lu

| System | Created | Feature | Fixed? | Testing | Others |
| --- | --- | --- | --- | --- | --- |
| HBase | 2019.04 | Log roll on slow sync | ✔️ | TestLogRolling.java | Explicitly mentioned 'gray failure'. |
| HBase | 2017.09 | Alert slow reads on a block | Stale | x | x |
| etcd | 2023.02 | Leader stuck in handling raft Ready due to slow fdatasync or high CPU | ✔️ | v3_lease_no_proxy_test.go | Still active |
| crdb | 2023.01 | Adapt charybdefs-based roachtest to allow for indefinite stalls. Also discussed in etcd-issues | ✔️ | commit history, disk_stall.go, pebble_test.go | x |
| crdb | 2022.05 | Initial discussion on handling of degraded storage devices. | 🚫 | x | Developers said "not clear to me what the best mechanism nor policy for this would be". Marked as stale 18 months later. |
| crdb | 2020.10 | Throw warning on slow log write, fatal on engine disk stall. | ✔️ | They found this to be incomplete and thus move to the above mentioned solutions | Developers intend to be "conservative" to respond to disk stalls (>20s) and only print a warning on disk slow (>10s). |
| cassandra | 2016.08 | Slow query detecting. | ✔️ | stef1927/cql_test.py and riptano/cql_test.py | By setting slow_query_log_timeout_in_ms to a small value (i.e., 10ms and 30ms) so that it's easily triggered (NOT INJECTING SLOW FAULTS) |
| kafka | - | Mitigating Kafka Broker ‘Gray’ Failures. | | | |
| kafka | - | SRECON'23: Improving Kafka Resilience - Gray Failures Mitigation. | | | |
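
To make the crdb 2020.10 policy concrete, here is a minimal sketch of a two-threshold classifier: warn when a write is slow (>10s), treat it as fatal when the disk looks stalled (>20s). The two thresholds come from the row above; the names `DiskHealth`, `classify_write`, and the constants are hypothetical and are not taken from CockroachDB's code.

```python
from enum import Enum

# Thresholds quoted in the crdb 2020.10 row; the constant names are made up.
SLOW_THRESHOLD_S = 10.0   # slower than this: print a warning
STALL_THRESHOLD_S = 20.0  # slower than this: treat as a disk stall (fatal)


class DiskHealth(Enum):
    OK = "ok"
    SLOW = "slow"
    STALLED = "stalled"


def classify_write(latency_s: float) -> DiskHealth:
    """Map an observed write latency (seconds) to a coarse health state."""
    if latency_s > STALL_THRESHOLD_S:
        return DiskHealth.STALLED
    if latency_s > SLOW_THRESHOLD_S:
        return DiskHealth.SLOW
    return DiskHealth.OK
```

Under this kind of policy, any write that takes 10s or less is reported as healthy, so a disk that is consistently, say, 5s slow would never be flagged.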

Misc:

  • CRDB actually uses charybdefs, but only to simulate and test disk stalls. See this ticket for how developers use charybdefs.

Generally, developers have access to slow-fault injection tools and use them intentionally, but they target only the worst-case scenario: disk stalls. They do not iterate through different severity levels, even though they are able to.
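
As a rough illustration of what iterating through severity levels could look like, the sketch below sweeps a list of injected write delays and records whether a single warning threshold would catch each one. It uses a fake clock so no real sleeping happens. All names (`FakeClock`, `make_slow_write`, `sweep_severities`) are hypothetical and do not correspond to charybdefs or to any system's test harness.

```python
from typing import Callable, Dict, List


class FakeClock:
    """Simulated clock so the sweep runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def sleep(self, seconds: float) -> None:
        self.now += seconds

    def time(self) -> float:
        return self.now


def make_slow_write(delay_s: float, sleep: Callable[[float], None]) -> Callable[[bytes], int]:
    """Return a write function that is delayed by delay_s before completing."""

    def write(data: bytes) -> int:
        sleep(delay_s)  # the injected slow fault
        return len(data)

    return write


def sweep_severities(delays_s: List[float], warn_after_s: float) -> Dict[float, bool]:
    """For each injected delay, report whether the warning threshold fires."""
    detected: Dict[float, bool] = {}
    for delay in delays_s:
        clock = FakeClock()
        write = make_slow_write(delay, clock.sleep)
        start = clock.time()
        write(b"x")
        elapsed = clock.time() - start
        detected[delay] = elapsed > warn_after_s
    return detected
```

For example, `sweep_severities([0.001, 0.1, 5.0, 15.0], warn_after_s=10.0)` marks only the 15s delay as detected; the intermediate severities pass silently, which is the gap that stall-only testing leaves unexplored.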
