Background
Operating a production financial application on the Stellar network means being responsible for users' funds and transaction integrity at all times. The FuTuRe platform interacts with several external systems (Stellar Horizon nodes, a PostgreSQL database, a Redis cache, and a JWT-based authentication layer), each of which can experience degraded performance or outright failure. Without written runbooks, incident response relies entirely on tribal knowledge held by a small number of senior engineers, creating an unacceptable single point of failure for on-call rotations.
Problem
There is currently no centralised documentation for handling operational incidents. When something goes wrong at 2 AM, the on-call engineer must improvise a response based on memory, Slack history, or conversations with colleagues who may be unavailable. This leads to several recurring issues:
Slow mean time to recovery (MTTR). Without step-by-step guidance, responders spend valuable time diagnosing before they even begin mitigating. In a financial application, every minute of downtime has user impact.
Inconsistent remediation. Different engineers apply different approaches to the same incident class, sometimes leaving the system in subtly different post-incident states that require additional cleanup.
Knowledge concentration risk. Operational procedures known only to one or two people become unavailable when those people are on leave, in a different timezone, or have left the organisation.
Audit gaps. Post-incident reviews cannot verify that correct procedures were followed when no reference document exists.
Repeated root cause analysis. Without a documented investigation playbook, engineers re-derive the same diagnostic steps for recurring incident classes.
Proposed Solution
Create a docs/runbooks/ directory containing one Markdown file per incident type. Each runbook should follow a consistent template:
Overview: What the incident is and how it typically presents.
Indicators: Which alerts fire, what log patterns appear, or what user-visible symptoms occur.
Immediate mitigation: Steps to stop further impact as quickly as possible.
Root cause investigation: How to determine why the incident occurred.
Resolution: Steps to fully restore service to a healthy state.
Escalation path: When to escalate and who to contact.
Post-incident actions: Required cleanup, monitoring follow-up, and post-mortem tasks.
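To make the template concrete, the following is a minimal sketch of a helper that renders an empty runbook skeleton with one heading per template section. The function name, the heading levels, and the `TODO` placeholder are illustrative assumptions rather than part of the proposal.

```python
# Section names are taken from the template above; everything else
# (heading levels, placeholder text) is an assumption for illustration.
TEMPLATE_SECTIONS = [
    "Overview",
    "Indicators",
    "Immediate mitigation",
    "Root cause investigation",
    "Resolution",
    "Escalation path",
    "Post-incident actions",
]


def render_skeleton(title: str) -> str:
    """Return a Markdown skeleton with one heading per template section."""
    lines = [f"# {title}", ""]
    for section in TEMPLATE_SECTIONS:
        # Each section starts as a placeholder for the author to fill in.
        lines += [f"## {section}", "", "TODO", ""]
    return "\n".join(lines)
```

Calling `render_skeleton("Horizon outage")` would give a starting point for `horizon-outage.md`. Generating the skeleton from a single list keeps section names identical across runbooks, which makes later consistency checks simpler.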
The initial runbooks to create are:
horizon-outage.md: Detecting and responding to Stellar Horizon being unreachable or returning persistent 5xx errors, including fallback configuration and user communication guidance.
db-failover.md: Promoting a PostgreSQL replica to primary, updating application connection strings, and verifying full application connectivity after failover.
jwt-secret-rotation.md: Rotating the JWT signing secret without terminating active user sessions, using the dual-secret approach described in issue #727 (Rotate JWT secret without service restart).
stuck-transaction-recovery.md: Identifying transactions stuck in a pending state, determining whether they were broadcast to the Stellar network, and either confirming or safely cancelling them.
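As an illustration of the kind of diagnostic step the stuck-transaction runbook might document, here is a hedged sketch that asks Horizon whether a transaction hash is known. It relies on Horizon's `GET /transactions/{hash}` endpoint, which returns 200 with a `successful` field for ingested transactions and 404 for unknown hashes. The base URL, the function names, and the result labels are assumptions; the platform may well use its own Horizon nodes and its own status vocabulary.

```python
import json
import urllib.error
import urllib.request

# Assumption: the public SDF Horizon instance; substitute the platform's node.
HORIZON_URL = "https://horizon.stellar.org"


def classify(status: int, body: dict) -> str:
    """Map a Horizon transaction lookup result to a hypothetical label."""
    if status == 200:
        # Horizon reports whether the ingested transaction succeeded.
        return "confirmed" if body.get("successful") else "failed_on_ledger"
    if status == 404:
        # Unknown to this Horizon node; not proof it was never broadcast.
        return "not_found"
    return "unknown"


def lookup(tx_hash: str, base_url: str = HORIZON_URL) -> str:
    """Query Horizon for a transaction hash and classify the response."""
    url = f"{base_url}/transactions/{tx_hash}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return classify(resp.status, json.load(resp))
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions carrying the status code.
        return classify(err.code, {})
    except OSError:
        # Network failure or timeout: Horizon itself may be degraded.
        return "unknown"
```

A `not_found` result only means this Horizon node has no record of the hash. A submitted transaction can still be included in a later ledger until its time bounds expire, so the runbook would need to spell out when cancellation is actually safe.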
Implementation Steps
Create docs/runbooks/ and add a README.md index listing each runbook with a one-line summary.
Draft each of the four runbooks, validating procedures with engineers who have first-hand incident experience for each area.
Link the runbooks index from the main project README.md and docs/README.md.
Have each runbook reviewed and approved by at least one engineer with relevant experience before merging.
Add a link to the runbooks directory in the on-call rotation documentation or internal Slack channel topic.
Establish a quarterly review schedule to keep runbooks current as infrastructure changes.
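The index described in the first step could be generated rather than hand-maintained. The sketch below builds a `README.md` body from a mapping of filenames to one-line summaries. The summaries are shortened from the descriptions in this issue, and the function name and Markdown layout are assumptions.

```python
# Filenames come from this issue; the summaries are abbreviated paraphrases.
RUNBOOKS = {
    "horizon-outage.md": "Responding to Stellar Horizon being unreachable or returning persistent 5xx errors.",
    "db-failover.md": "Promoting a PostgreSQL replica to primary and verifying connectivity.",
    "jwt-secret-rotation.md": "Rotating the JWT signing secret without terminating active sessions.",
    "stuck-transaction-recovery.md": "Confirming or safely cancelling transactions stuck in a pending state.",
}


def render_index(runbooks: dict[str, str]) -> str:
    """Return a Markdown index with one linked bullet per runbook."""
    lines = ["# Runbooks", ""]
    for filename, summary in runbooks.items():
        # Relative links work because the index lives next to the runbooks.
        lines.append(f"- [{filename}]({filename}): {summary}")
    return "\n".join(lines) + "\n"
```

Writing the result of `render_index(RUNBOOKS)` to `docs/runbooks/README.md` would produce the index. Whether to generate it or edit it by hand is a judgment call; with only four runbooks, manual maintenance is likely fine.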
Acceptance Criteria
docs/runbooks/ exists with a README.md index and the four initial runbooks.
Each runbook follows the standard template defined in this issue.
All runbooks have been reviewed by at least one subject-matter expert.
The directory is linked from the project's main documentation index.
Each runbook is written for an engineer who may be unfamiliar with the specific subsystem.
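The structural criteria above (required files and template sections) lend themselves to an automated check. The following sketch assumes each template section appears as a Markdown heading of any level; the function name and the problem-message format are hypothetical, and the review and readability criteria still need a human.

```python
from pathlib import Path

REQUIRED_FILES = [
    "README.md",
    "horizon-outage.md",
    "db-failover.md",
    "jwt-secret-rotation.md",
    "stuck-transaction-recovery.md",
]
REQUIRED_SECTIONS = [
    "Overview",
    "Indicators",
    "Immediate mitigation",
    "Root cause investigation",
    "Resolution",
    "Escalation path",
    "Post-incident actions",
]


def check_runbooks(root: Path) -> list[str]:
    """Return a list of problems found under the runbooks directory."""
    problems = []
    for name in REQUIRED_FILES:
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    for path in sorted(root.glob("*.md")):
        if path.name == "README.md":
            continue  # the index is not expected to follow the template
        text = path.read_text(encoding="utf-8")
        # Collect heading titles, ignoring heading level and letter case.
        headings = {
            line.lstrip("#").strip().lower()
            for line in text.splitlines()
            if line.startswith("#")
        }
        for section in REQUIRED_SECTIONS:
            if section.lower() not in headings:
                problems.append(f"{path.name}: missing section '{section}'")
    return problems
```

Running `check_runbooks(Path("docs/runbooks"))` in CI and failing the build on a non-empty result would be one way to keep new runbooks aligned with the template.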
Notes
Runbooks are living documents. Accuracy is more important than completeness — a shorter runbook that is correct is more valuable than a long one with outdated steps. Version-control history will serve as an audit trail. The runbooks should assume access to standard tools (psql, stellar-cli, docker) but should not assume familiarity with recent architectural changes.