Add runbook for common operational incidents #814

Description

@Mystery-CLI

Background

Operating a production financial application on the Stellar network means being responsible for users' funds and transaction integrity at all times. The FuTuRe platform interacts with several external systems — Stellar Horizon nodes, a PostgreSQL database, a Redis cache, and a JWT-based authentication layer — each of which can experience degraded performance or outright failure. Without written runbooks, incident response relies entirely on tribal knowledge held by a small number of senior engineers, creating an unacceptable single point of failure for on-call rotations.

Problem

There is currently no centralised documentation for handling operational incidents. When something goes wrong at 2 AM, the on-call engineer must improvise a response based on memory, Slack history, or conversations with colleagues who may be unavailable. This leads to several recurring issues:

  • Slow mean time to recovery (MTTR). Without step-by-step guidance, responders spend valuable time diagnosing before they even begin mitigating. In a financial application, every minute of downtime has user impact.
  • Inconsistent remediation. Different engineers apply different approaches to the same incident class, sometimes leaving the system in subtly different post-incident states that require additional cleanup.
  • Knowledge concentration risk. Operational procedures known only to one or two people become unavailable when those people are on leave, in a different timezone, or have left the organisation.
  • Audit gaps. Post-incident reviews cannot verify that correct procedures were followed when no reference document exists.
  • Repeated root cause analysis. Without a documented investigation playbook, engineers re-derive the same diagnostic steps for recurring incident classes.

Proposed Solution

Create a docs/runbooks/ directory containing one Markdown file per incident type. Each runbook should follow a consistent template:

  • Overview: What the incident is and how it typically presents.
  • Indicators: Which alerts fire, what log patterns appear, or what user-visible symptoms occur.
  • Immediate mitigation: Steps to stop further impact as quickly as possible.
  • Root cause investigation: How to determine why the incident occurred.
  • Resolution: Steps to fully restore service to a healthy state.
  • Escalation path: When to escalate and who to contact.
  • Post-incident actions: Required cleanup, monitoring follow-up, and post-mortem tasks.
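A template like the one above can be checked mechanically, for example in CI. The following is a minimal sketch, not an agreed implementation: it assumes each template section appears as a level-2 Markdown heading (`## Overview`, and so on), which this issue does not specify, and `check_runbook` is a hypothetical helper name.

```python
import re

# Section names taken from the template in this issue, in the order listed.
REQUIRED_SECTIONS = [
    "Overview",
    "Indicators",
    "Immediate mitigation",
    "Root cause investigation",
    "Resolution",
    "Escalation path",
    "Post-incident actions",
]

# Assumption: every template section is a level-2 heading such as "## Overview".
HEADING_RE = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def check_runbook(markdown_text: str) -> list[str]:
    """Return a list of problems; an empty list means the runbook matches the template."""
    found = HEADING_RE.findall(markdown_text)
    problems = [
        f"missing section: {name}" for name in REQUIRED_SECTIONS if name not in found
    ]
    # Compare ordering only among the template sections that are actually present.
    present = [heading for heading in found if heading in REQUIRED_SECTIONS]
    expected = [name for name in REQUIRED_SECTIONS if name in present]
    if present != expected:
        problems.append("sections are out of template order")
    return problems
```

Running `check_runbook(path.read_text())` over every file in docs/runbooks/ would flag a runbook that drops or reorders a section. Extra headings beyond the seven are allowed, so individual runbooks can still add subsystem-specific sections.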

The initial runbooks to create are:

  1. horizon-outage.md — Detecting and responding to Stellar Horizon being unreachable or returning persistent 5xx errors, including fallback configuration and user communication guidance.
  2. db-failover.md — Promoting a PostgreSQL replica to primary, updating application connection strings, and verifying full application connectivity after failover.
  3. jwt-secret-rotation.md — Rotating the JWT signing secret using the dual-secret approach described in issue #727 (Rotate JWT secret without service restart), without terminating active user sessions.
  4. stuck-transaction-recovery.md — Identifying transactions stuck in a pending state, determining whether they were broadcast to the Stellar network, and either confirming or safely cancelling them.
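To make the dual-secret idea behind jwt-secret-rotation.md concrete, here is a minimal sketch using only the Python standard library. It assumes that "dual-secret" means new tokens are signed with the new secret while verification accepts both the new and the previous secret during an overlap window. The actual design is in issue #727 and may differ. The function names are hypothetical, and the sketch checks only the HS256 signature, not `exp` or other claims, so a maintained JWT library should be used in practice.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _signature(signing_input: str, secret: bytes) -> str:
    digest = hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest()
    return _b64url(digest)


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Build a compact HS256 token: header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    return f"{signing_input}.{_signature(signing_input, secret)}"


def verify_with_any(token: str, secrets: list[bytes]) -> bool:
    """Return True if the signature matches any currently accepted secret."""
    try:
        signing_input, signature = token.rsplit(".", 1)
    except ValueError:
        return False  # not even "something.signature"
    for secret in secrets:
        expected = _signature(signing_input, secret)
        # Constant-time comparison on bytes avoids timing side channels.
        if hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8")):
            return True
    return False
```

During rotation the service would verify with `[new_secret, old_secret]` and sign only with `new_secret`. Once the longest-lived token issued under the old secret has expired, the old secret can be dropped from the list, which is the step the runbook would need to time carefully.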

Implementation Steps

  1. Create docs/runbooks/ and add a README.md index listing each runbook with a one-line summary.
  2. Draft each of the four runbooks, validating procedures with engineers who have first-hand incident experience for each area.
  3. Link the runbooks index from the main project README.md and docs/README.md.
  4. Have each runbook reviewed and approved by at least one engineer with relevant experience before merging.
  5. Add a link to the runbooks directory in the on-call rotation documentation or internal Slack channel topic.
  6. Establish a quarterly review schedule to keep runbooks current as infrastructure changes.
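The index from step 1 could be generated rather than hand-maintained, so it stays in step with the files. The following is a rough sketch under assumptions: `build_index` is a hypothetical helper, the summaries are paraphrased from this issue, and the bullet format is only one possible layout for README.md.

```python
def build_index(runbooks: dict[str, str]) -> str:
    """Render a Markdown index: one bullet per runbook with its one-line summary."""
    lines = ["# Runbooks", ""]
    for filename in sorted(runbooks):
        # Relative link, since README.md lives in the same directory as the runbooks.
        lines.append(f"- [{filename}]({filename}): {runbooks[filename]}")
    return "\n".join(lines) + "\n"


# Summaries paraphrased from the list of initial runbooks in this issue.
RUNBOOKS = {
    "horizon-outage.md": "Stellar Horizon unreachable or returning persistent 5xx errors.",
    "db-failover.md": "Promoting a PostgreSQL replica to primary and verifying connectivity.",
    "jwt-secret-rotation.md": "Rotating the JWT signing secret without ending active sessions.",
    "stuck-transaction-recovery.md": "Triaging transactions stuck in a pending state.",
}

if __name__ == "__main__":
    print(build_index(RUNBOOKS), end="")
```

Keeping the summaries in one mapping means adding a fifth runbook is a one-line change, and the quarterly review in step 6 can diff the generated index against the committed README.md.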

Acceptance Criteria

  • docs/runbooks/ exists with a README.md index and the four initial runbooks.
  • Each runbook follows the standard template defined in this issue.
  • All runbooks have been reviewed by at least one subject-matter expert.
  • The directory is linked from the project's main documentation index.
  • Each runbook is written for an engineer who may be unfamiliar with the specific subsystem.

Notes

Runbooks are living documents. Accuracy is more important than completeness — a shorter runbook that is correct is more valuable than a long one with outdated steps. Version-control history will serve as an audit trail. The runbooks should assume access to standard tools (psql, stellar-cli, docker) but should not assume familiarity with recent architectural changes.

Labels: documentation, enhancement
