76 changes: 76 additions & 0 deletions .claude/commands/loki-query.md
@@ -81,6 +81,82 @@ bash tools/loki/loki-query.sh --output=raw --limit=20000 --since=1h \

For raw-output queries, log lines are JSON; useful fields include `["@timestamp"]`, `log_level`, `["log.logger"]`, `["process.thread.name"]`, `message`. Pipe through `jq -r '...'` to extract a clean digest.
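
A minimal digest sketch, assuming the raw output of such a query was saved to a local file (the `raw.jsonl` filename is illustrative):

```bash
# One tab-separated line per record, using the fields listed above; jq's
# bracket syntax handles the dotted key names.
jq -r '[.["@timestamp"], .log_level, .["log.logger"], .message] | @tsv' raw.jsonl
```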

## Direct kubectl logs (when Loki isn't enough)

Use `kubectl logs` directly when:
- **A pod just crashed/restarted** — `--previous` shows the prior container's logs. Critical when a `JmsFailoverWatchdog` `System.exit` recycles a pod and you need to see why.
- **Broker pods** — `activemqint` (ActiveMQ Classic, legacy daemon-to-daemon traffic — counterpart to client-side JMS errors in `data`/`db`/`sched`/`submit`) and `activemqsim` (sim/solver traffic). These may not be in Loki's collection set; check Loki first and fall back here.
- **Real-time tail** of a single pod, or windows that have aged out of Loki retention.

Setup (same kubeconfig as Loki):
```bash
KCFG="${LOKI_KUBECONFIG:-$HOME/.kube/kubeconfig_vxrails.yaml}"
NS=prod # or stage / dev
```

Discover the actual workload names — deployment vs statefulset, exact selectors:
```bash
kubectl --kubeconfig "$KCFG" -n "$NS" get deployments,statefulsets,pods \
| grep -iE "data|activemq|api|db|sched|submit"
```

Common log patterns:
```bash
# Last 200 lines from the data pod (the one that wedged on 2026-05-06)
kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --tail=200

# Real-time tail
kubectl --kubeconfig "$KCFG" -n "$NS" logs -f deployment/data

# Previous container instance — after a crash, OOM, or watchdog-driven exit
kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --previous --tail=500

# Time-bounded
kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --since=10m
kubectl --kubeconfig "$KCFG" -n "$NS" logs deployment/data --since-time="2026-05-06T06:30:00Z"

```

### Broker pods are special — supervisord hides the broker log

`activemqint`, `activemqsim`, and `artemismq` all run as `Deployments` in `prod` (and the same naming applies in `stage`/`dev` if present), but their container PID 1 is **supervisord**. The broker process itself logs to a file inside the pod, not to stdout. Consequences:

- `kubectl logs deployment/activemqint` returns ~10 lines of supervisord lifecycle, frozen at pod startup (44d old in prod). It does **not** show broker activity.
- Loki's promtail isn't scraping these pods either (verified by `logcli series '{namespace="prod"}'` — no `activemq`/`artemis` containers appear).
- The only path to broker events is `kubectl exec` against the in-pod log file.

```bash
# ActiveMQ Classic (activemqint, activemqsim) — log at /var/log/activemq/activemq.log
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
tail -200 /var/log/activemq/activemq.log

# Around an incident window (in-pod awk filter — efficient on a multi-MB log)
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
bash -c 'awk "/2026-05-06 06:[34][0-9]/" /var/log/activemq/activemq.log'

# Errors only
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
grep -E "WARN|ERROR" /var/log/activemq/activemq.log

# Artemis (artemismq) — different layout; discover the log path
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/artemismq -- \
bash -c 'find / -name "*.log" -size +1k 2>/dev/null | head -10'
```

The in-pod ActiveMQ log file rotates after roughly 14h of activity. For older incidents, the broker side is not recoverable — a real operational gap when investigating wedges that took longer than that to manifest.
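
Before relying on the in-pod file, check how far back it actually reaches; a quick sketch using the same path as above:

```bash
# File sizes and mtimes show how much history the current log still covers
# and whether rotated copies are kept alongside it.
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
  ls -lh /var/log/activemq/

# The first line's timestamp is the oldest broker event still recoverable.
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
  head -1 /var/log/activemq/activemq.log
```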

If `deployment/<name>` doesn't resolve (e.g. a broker is later deployed as a StatefulSet), discover the pod by label or by listing pods.
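
A sketch of that discovery (the `app` label key is an assumption; check the actual labels first):

```bash
# See what exists and which labels the broker pods carry.
kubectl --kubeconfig "$KCFG" -n "$NS" get pods --show-labels | grep -i activemq

# If the pods carry an app=activemqint label (verify above), -l targets them
# without caring whether the controller is a Deployment or a StatefulSet.
kubectl --kubeconfig "$KCFG" -n "$NS" logs -l app=activemqint --tail=200
```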

**Worth fixing**: reconfigure these brokers to log to stdout (or have supervisord forward child stdout/stderr). Once stdout flows, both `kubectl logs` and Loki's promtail pick them up automatically — no more `exec` archaeology and no more 14h horizon.
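
If someone picks this up, a first step is to see how supervisord currently launches the broker; the config path below is an assumption about these images:

```bash
# Inspect supervisord's config inside the pod (path is an assumption; it may
# live under /etc/supervisor/ or elsewhere in the image).
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
  cat /etc/supervisord.conf

# In the broker's [program:...] section, redirect_stderr=true together with
# stdout_logfile=/dev/stdout and stdout_logfile_maxbytes=0 makes supervisord
# forward the child's output to the container's stdout.
```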

### Putting it together for a JMS-wedge incident

Pull both sides around the same window (a combined sketch follows this list):
- **Client side** (`data`/`db`/`sched`/`submit`) via Loki — the `IllegalStateException: Timer already cancelled.` WARNs surface here, but the *triggering* network event happened earlier and silently.
- **Broker side** (`activemqint`/`activemqsim`) via `kubectl exec` — `Transport Connection to: tcp://<podIP>:<port> failed: java.io.EOFException` lines tell you which client connection actually dropped, and at what timestamp. Compare to client-side reconnect attempts to confirm the wedge was already latent before the trigger.
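
A hypothetical sketch for pulling both sides of a suspected wedge window into local files; deployment names, the broker log path, and the timestamps reuse the examples above and are assumptions about this cluster:

```bash
WINDOW_START="2026-05-06T06:30:00Z"

# Client side: capture each daemon's logs from the window start onward.
for d in data db sched submit; do
  kubectl --kubeconfig "$KCFG" -n "$NS" logs "deployment/$d" \
    --since-time="$WINDOW_START" > "client-$d.log" 2>/dev/null || true
done

# Broker side: same in-pod awk filter as above, saved locally.
kubectl --kubeconfig "$KCFG" -n "$NS" exec deployment/activemqint -- \
  bash -c 'awk "/2026-05-06 06:[34][0-9]/" /var/log/activemq/activemq.log' \
  > broker-activemqint.log

# Which client connection dropped, and when?
grep -n "EOFException" broker-activemqint.log

# Did the clients react, or only log the late WARNs?
grep -rn "Timer already cancelled" client-*.log
```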

When to switch back to Loki: multi-pod sweeps, time-range queries spanning multiple containers, structured filters (`|~ "ERROR"`, regex), and historical incidents older than the kubelet's local log retention.
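
For those cases, `logcli` against the same namespace works; the `container` label name and any `--addr`/auth flags depend on the local logcli setup and are assumptions here:

```bash
# Re-check which containers Loki is scraping (same series query as above).
logcli series '{namespace="prod"}'

# Multi-container sweep over a fixed window; --from/--to take RFC3339 timestamps.
logcli query --limit=5000 \
  --from="2026-05-06T06:30:00Z" --to="2026-05-06T07:00:00Z" \
  '{namespace="prod", container=~"data|db|sched|submit"} |~ "WARN|ERROR"'
```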

## Workflow

1. Read the user's request and identify: namespace (prod/stage/dev), time window, suspected container(s), keywords.