Add runbook for common operational incidents #814

## Background

Operating a production financial application on the Stellar network means being responsible for users' funds and transaction integrity at all times. The FuTuRe platform interacts with several external systems — Stellar Horizon nodes, a PostgreSQL database, a Redis cache, and a JWT-based authentication layer — each of which can experience degraded performance or outright failure. Without written runbooks, incident response relies entirely on tribal knowledge held by a small number of senior engineers, creating an unacceptable single point of failure for on-call rotations.

## Problem

There is currently no centralised documentation for handling operational incidents. When something goes wrong at 2 AM, the on-call engineer must improvise a response based on memory, Slack history, or conversations with colleagues who may be unavailable. This leads to several recurring issues:

- **Slow mean time to recovery (MTTR).** Without step-by-step guidance, responders spend valuable time diagnosing before they even begin mitigating. In a financial application, every minute of downtime has user impact.
- **Inconsistent remediation.** Different engineers apply different approaches to the same incident class, sometimes leaving the system in subtly different post-incident states that require additional cleanup.
- **Knowledge concentration risk.** Operational procedures known only to one or two people become unavailable when those people are on leave, in a different timezone, or have left the organisation.
- **Audit gaps.** Post-incident reviews cannot verify that correct procedures were followed when no reference document exists.
- **Repeated root cause analysis.** Without a documented investigation playbook, engineers re-derive the same diagnostic steps for recurring incident classes.

## Proposed Solution

Create a `docs/runbooks/` directory containing one Markdown file per incident type. Each runbook should follow a consistent template:

- **Overview:** What the incident is and how it typically presents.
- **Indicators:** Which alerts fire, what log patterns appear, or what user-visible symptoms occur.
- **Immediate mitigation:** Steps to stop further impact as quickly as possible.
- **Root cause investigation:** How to determine why the incident occurred.
- **Resolution:** Steps to fully restore service to a healthy state.
- **Escalation path:** When to escalate and whom to contact.
- **Post-incident actions:** Required cleanup, monitoring follow-up, and post-mortem tasks.
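
As a starting point, the template above could be captured in a small scaffold script so that every runbook begins with the same sections. This is a minimal sketch: the heading levels, the `TODO` placeholder, and the function names `render_runbook` and `scaffold` are assumptions for illustration, not something the project already provides.

```python
from pathlib import Path

# Section titles taken from the template list in this issue.
TEMPLATE_SECTIONS = [
    "Overview",
    "Indicators",
    "Immediate mitigation",
    "Root cause investigation",
    "Resolution",
    "Escalation path",
    "Post-incident actions",
]


def render_runbook(title: str) -> str:
    """Return an empty runbook skeleton with one heading per template section."""
    lines = [f"# {title}", ""]
    for section in TEMPLATE_SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)


def scaffold(directory: Path, slug: str, title: str) -> Path:
    """Write <slug>.md into directory unless it already exists (never overwrite)."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{slug}.md"
    if not path.exists():
        path.write_text(render_runbook(title), encoding="utf-8")
    return path
```

For example, `scaffold(Path("docs/runbooks"), "horizon-outage", "Horizon outage")` would create the first skeleton and leave an existing draft untouched.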

The initial runbooks to create are:

1. **`horizon-outage.md`** — Detecting and responding to Stellar Horizon being unreachable or returning persistent 5xx errors, including fallback configuration and user communication guidance.
2. **`db-failover.md`** — Promoting a PostgreSQL replica to primary, updating application connection strings, and verifying full application connectivity after failover.
3. **`jwt-secret-rotation.md`** — Rotating the JWT signing secret using the dual-secret approach described in issue #727 without terminating active user sessions.
4. **`stuck-transaction-recovery.md`** — Identifying transactions stuck in a pending state, determining whether they were broadcast to the Stellar network, and either confirming or safely cancelling them.
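
To illustrate the idea behind `jwt-secret-rotation.md`, here is a minimal sketch of one common dual-secret scheme: new tokens are signed with the current secret, while verification also accepts the previous secret until older tokens expire. The details of issue #727 are not reproduced here, so the scheme itself, the choice of HS256, and every name (`sign`, `verify`) are assumptions for illustration. A real service would likely use its existing JWT library and would also validate claims such as `exp`.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional, Sequence


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs do."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _signature(signing_input: str, secret: bytes) -> str:
    digest = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return _b64url(digest)


def sign(payload: dict, secret: bytes) -> str:
    """Create an HS256 JWT signed with the given (current) secret."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    body = _b64url(json.dumps(payload).encode("utf-8"))
    signing_input = f"{header}.{body}"
    return f"{signing_input}.{_signature(signing_input, secret)}"


def verify(token: str, secrets: Sequence[bytes]) -> Optional[dict]:
    """Return the payload if any accepted secret matches the signature, else None.

    Signature check only: expiry and other claims are NOT validated here.
    """
    try:
        header, body, signature = token.split(".")
    except ValueError:
        return None
    signing_input = f"{header}.{body}"
    for secret in secrets:
        expected = _signature(signing_input, secret).encode("ascii")
        if hmac.compare_digest(expected, signature.encode("utf-8")):
            padded = body + "=" * (-len(body) % 4)
            return json.loads(base64.urlsafe_b64decode(padded))
    return None
```

During the overlap window the service would call `verify(token, [new_secret, old_secret])`, so sessions issued before the rotation stay valid. Once every token signed with the old secret has expired, the old secret is dropped from the list and such tokens are rejected.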

## Implementation Steps

1. Create `docs/runbooks/` and add a `README.md` index listing each runbook with a one-line summary.
2. Draft each of the four runbooks, validating procedures with engineers who have first-hand incident experience for each area.
3. Link the runbooks index from the main project `README.md` and `docs/README.md`.
4. Have each runbook reviewed and approved by at least one engineer with relevant experience before merging.
5. Add a link to the runbooks directory in the on-call rotation documentation or internal Slack channel topic.
6. Establish a quarterly review schedule to keep runbooks current as infrastructure changes.

## Acceptance Criteria

- `docs/runbooks/` exists with a `README.md` index and the four initial runbooks.
- Each runbook follows the standard template defined in this issue.
- All runbooks have been reviewed by at least one subject-matter expert.
- The runbooks index is linked from the main project `README.md` and from `docs/README.md`.
- Each runbook is written for an engineer who may be unfamiliar with the specific subsystem.
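
The structural criteria above (files present, template followed) could be checked automatically; review sign-off and readability still need a human. The sketch below assumes that each template section appears as a Markdown heading with the exact title used in this issue, and the function names `missing_sections` and `check_runbooks` are hypothetical.

```python
import re
from pathlib import Path

REQUIRED_FILES = [
    "README.md",
    "horizon-outage.md",
    "db-failover.md",
    "jwt-secret-rotation.md",
    "stuck-transaction-recovery.md",
]
REQUIRED_SECTIONS = [
    "Overview",
    "Indicators",
    "Immediate mitigation",
    "Root cause investigation",
    "Resolution",
    "Escalation path",
    "Post-incident actions",
]


def missing_sections(markdown: str) -> list:
    """Return template sections with no matching Markdown heading (case-insensitive)."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    }
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


def check_runbooks(directory: Path) -> list:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    for name in REQUIRED_FILES:
        path = directory / name
        if not path.is_file():
            problems.append(f"{name}: missing")
        elif name != "README.md":  # the index is not expected to follow the template
            for section in missing_sections(path.read_text(encoding="utf-8")):
                problems.append(f"{name}: missing section '{section}'")
    return problems
```

Calling `check_runbooks(Path("docs/runbooks"))` from a CI job and failing the build on a non-empty result would be one way to keep the template from drifting between quarterly reviews.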

## Notes

Runbooks are living documents. Accuracy is more important than completeness: a shorter runbook that is correct is more valuable than a long one with outdated steps. Version-control history will serve as an audit trail. The runbooks should assume access to standard tools (`psql`, `stellar-cli`, `docker`) but should not assume familiarity with recent architectural changes.
