istio-meshmedic

MeshMedic — a Go CLI and node-agent DaemonSet that detects and heals Istio ambient-mesh enrollment orphans: workloads that should be captured by their node's ztunnel but aren't.

The problem it solves

In Istio ambient mode, every pod's traffic is redirected to its node's ztunnel. A pod can be correctly enrolled at startup and then silently lose its in-pod redirection — most often after a node reboot, when istio-cni fails to re-reconcile already-running pods (reconcileExistingPod/getNetns bails; upstream istio/istio#55968, unfixed). When that happens the pod:

- keeps its ambient.istio.io/redirection: enabled annotation and looks healthy,
- remains in ztunnel's workloadState (so the control plane thinks it's fine),
- but its netns is missing ztunnel's in-pod listeners, so its traffic isn't captured: peers reject it (policy rejection: allow policies exist, but none allowed, empty src.identity) and HBONE to it is refused.

See docs/upstream-istio-55968.md for the root-cause analysis and a proposed upstream fix — the only true cure. MeshMedic is the mitigation.

How it detects (the important part)

A control-plane check is blind to this: the orphan stays in ztunnel's workloadState. The authoritative signal is the pod's own network namespace — an orphan is annotated ambient-enrolled but has none of ztunnel's in-pod listeners (15001 outbound, 15006 inbound, 15008 HBONE, 15053 DNS). MeshMedic reads /proc/net/tcp from the pod's netns and checks for those ports.
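The port check described above can be sketched roughly as follows. This is a minimal illustration, not MeshMedic's actual code: the function names (`listeningPorts`, `missingZtunnelListeners`, `looksOrphaned`) are hypothetical, and it assumes the caller has already obtained the text of `/proc/net/tcp` from the pod's netns. In that file the local address is `HEXIP:HEXPORT` and state `0A` means LISTEN.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// ztunnelPorts are the in-pod listener ports named in this README.
var ztunnelPorts = []uint16{15001, 15006, 15008, 15053}

// tcpListen is the hex state code the kernel prints for a LISTEN socket.
const tcpListen = "0A"

// listeningPorts parses the text of /proc/net/tcp and returns the set of
// local ports that have a socket in LISTEN state. (Hypothetical helper.)
func listeningPorts(procNetTCP string) map[uint16]bool {
	ports := make(map[uint16]bool)
	sc := bufio.NewScanner(strings.NewReader(procNetTCP))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		// Columns: sl, local_address, rem_address, st, ...
		// The header row is skipped because its "st" column is not "0A".
		if len(f) < 4 || f[3] != tcpListen {
			continue
		}
		i := strings.LastIndexByte(f[1], ':')
		if i < 0 {
			continue
		}
		p, err := strconv.ParseUint(f[1][i+1:], 16, 16)
		if err != nil {
			continue
		}
		ports[uint16(p)] = true
	}
	return ports
}

// missingZtunnelListeners returns the ztunnel ports with no LISTEN socket.
func missingZtunnelListeners(procNetTCP string) []uint16 {
	have := listeningPorts(procNetTCP)
	var missing []uint16
	for _, p := range ztunnelPorts {
		if !have[p] {
			missing = append(missing, p)
		}
	}
	return missing
}

// looksOrphaned reports whether none of the ztunnel listeners are present,
// matching the "has none of ztunnel's in-pod listeners" definition above.
func looksOrphaned(procNetTCP string) bool {
	return len(missingZtunnelListeners(procNetTCP)) == len(ztunnelPorts)
}
```

A result from `looksOrphaned` only means something for a pod that is annotated as ambient-enrolled; an unenrolled pod has none of these listeners by design.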

Caveat learned in production: a missing-listener reading is necessary but not sufficient evidence, because the state flaps (istio-cni reconciles in the background). A single missing reading is therefore re-confirmed before any action.
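One way to express that re-confirmation is a counter of consecutive missing readings per pod. The sketch below is an assumption about shape only: the `flapGuard` type, its threshold, and the pod key format are hypothetical and are not taken from MeshMedic's source.

```go
package main

// flapGuard counts consecutive "listeners missing" readings per pod so that
// one transient miss does not trigger a repair. (Hypothetical type.)
type flapGuard struct {
	threshold int            // consecutive misses required to confirm
	misses    map[string]int // keyed by an arbitrary pod identifier
}

// newFlapGuard returns a guard that confirms after threshold misses in a row.
func newFlapGuard(threshold int) *flapGuard {
	return &flapGuard{threshold: threshold, misses: make(map[string]int)}
}

// observe records one probe result for podKey. It returns true only when the
// pod has been missing its listeners for threshold consecutive readings.
// Any healthy reading resets the count.
func (g *flapGuard) observe(podKey string, missing bool) bool {
	if !missing {
		delete(g.misses, podKey)
		return false
	}
	g.misses[podKey]++
	return g.misses[podKey] >= g.threshold
}
```

With a threshold of 2, a pod that reads missing, then healthy, then missing again is not confirmed until a second consecutive missing reading arrives.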

Commands

| Command | What it does |
| --- | --- |
| `scan` | netns-socket detector. Execs the pod's own container to read `/proc/net/tcp` (no ephemeral container); falls back to a baseline-safe ephemeral probe only for distroless images. `-n <ns>` to scope. |
| `scan --behavioral` | cheap fleet-wide pre-filter: scrape every ztunnel's access logs for the orphan signatures (HBONE refused to `:15008`; policy-rejection with no `src.identity`), then netns-probe only the flagged pods. |
| `repair` | re-enroll orphans. `--strategy restart` (default) deletes the pod for a fresh, durable enrollment; `--strategy toggle` flips the `dataplane-mode` label (gentle, but flaps). Dry-run unless `--yes`; requires `-n` or `--behavioral`. |
| `agent` | per-node DaemonSet: reads `/host/proc` directly (zero injection), sweeps on a loop, and with `--auto-repair` restarts confirmed-stuck orphans (flap-guarded). |

meshmedic scan -n delivery-bag
meshmedic scan --behavioral --since 15m
meshmedic repair -n hookshot            # dry-run
meshmedic repair -n hookshot --yes      # restart the orphans

Deploy the node-agent (no ephemeral probes, continuous)

kubectl apply -f deploy/meshmedic-daemonset.yaml   # detect-only by default

One pod per node (hostPID, host /proc mounted at /host/proc) maps each local pod to a PID via the pod UID in its cgroup path and reads that PID's netns sockets directly, with no added capabilities and nothing injected. Set --auto-repair=true in the DaemonSet to have it keep the mesh healed.
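The pod-to-PID mapping relies on the kubelet embedding the pod UID in each container's cgroup path. A rough sketch of extracting it from one line of `/host/proc/<pid>/cgroup` is below. The function name and regular expression are illustrative assumptions, not MeshMedic's code. It handles the two common layouts: the systemd driver, which writes the UID with underscores, and the cgroupfs driver, which keeps the dashes.

```go
package main

import (
	"regexp"
	"strings"
)

// podUIDRe matches "pod<uid>" where the UID's separators are either dashes
// (cgroupfs driver) or underscores (systemd driver).
var podUIDRe = regexp.MustCompile(
	`pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})`)

// podUIDFromCgroup extracts the Kubernetes pod UID from a cgroup path and
// normalizes it to dashed form. It returns false for non-pod processes.
// (Hypothetical helper.)
func podUIDFromCgroup(cgroup string) (string, bool) {
	m := podUIDRe.FindStringSubmatch(cgroup)
	if m == nil {
		return "", false
	}
	return strings.ReplaceAll(m[1], "_", "-"), true
}
```

Once a PID is matched to a pod UID, that process's view of its network namespace is available at `/host/proc/<pid>/net/tcp`, so the listener check needs no exec and no ephemeral container.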

Build / run

Built and shipped through StageFreight (docker.io/prplanit/istio-meshmedic). To run the CLI against a cluster from a container:

docker run --rm --network host --user "$(id -u):$(id -g)" -v "$HOME/.kube:/kube:ro" \
  docker.io/prplanit/istio-meshmedic:v0.0.1 \
  scan --kubeconfig /kube/config -n <namespace>

(Run as your own uid so the kubeconfig is readable — the image runs as nonroot.)

Mitigation vs. cure

MeshMedic finds and restarts stuck orphans; it does not stop them being created. The cure is the substrate: the upstream #55968 fix, or rolling ztunnel/istio-cni. Per-pod repair is whack-a-mole on a flapping data plane — which is exactly why the agent runs continuously.
