Q: What happens when your system goes down - how do you approach it?
A: Structured incident response:
- Detect & acknowledge: confirm the outage (monitoring alert, user report, status page). Acknowledge in PagerDuty/Opsgenie to stop alert escalation. Post in the incident Slack channel: "Investigating [service] outage, started [time]."
- Assess blast radius: who is affected? All users? A specific region? A single tenant?
- Mitigate first, fix later: if there's a quick mitigation (rollback a recent deploy, failover to another region, scale up), do it immediately. Don't wait for root cause before stopping user impact.
- Gather signals: check dashboards (Grafana/Datadog), logs (Kibana/Loki), recent changes (git log, deploy history), cloud provider status page.
- Hypothesize and test: form a theory based on signals; validate with a targeted check.
- Fix and verify: apply the fix; watch metrics normalize; verify from a user perspective.
- Communicate: keep stakeholders updated every 15–30 minutes during the incident.
- Post-mortem: blameless RCA within 48 hours; document timeline, root cause, contributing factors, and action items to prevent recurrence.
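The post-mortem timeline is easier to reconstruct if entries are recorded while the incident is in progress. The following is a minimal sketch, not a prescribed tool: log_event and the incident.tsv file name are hypothetical and only illustrate the idea of appending UTC-timestamped notes as you go.

```shell
# Hypothetical helper: append a UTC-timestamped note to an incident timeline file.
# $1 = timeline file; remaining arguments = free-text note.
# The body runs in a subshell so its variables do not leak into the caller.
log_event() (
  file=$1
  shift
  # One tab-separated line per event: ISO 8601 timestamp, then the note.
  printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$file"
)

# Example usage (file name is an assumption):
#   log_event incident.tsv "Acknowledged page; investigating outage"
#   log_event incident.tsv "Rolled back most recent deploy"
```

The resulting file can be pasted into the RCA document as the raw timeline.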
Q: Your app teams demand custom AMIs. What's your pre-prod vetting strategy at kernel and runtime?
A:
- Automated build pipeline: use HashiCorp Packer to build AMIs from a base (Amazon Linux 2023, Ubuntu) with a standardized provisioning script. All customizations go through Packer — no manual AMI modifications.
- CIS hardening: apply the CIS Level 1 benchmark automatically in the Packer build using Ansible roles (e.g., dev-sec/cis-hardening).
- Security scanning: run Amazon Inspector or Trivy on the built AMI; fail the pipeline if HIGH/CRITICAL kernel or package CVEs are found.
- Functional testing: launch a test EC2 instance from the AMI; run automated tests:
  - cloud-init completes without error.
  - Required services start (kubelet, containerd, ssm-agent).
  - Kernel version meets minimum requirement.
  - SELinux/AppArmor in enforcing mode.
  - auditd is running and logging.
- Performance baseline: run sysbench or fio to confirm that no I/O performance regressions are introduced.
- Staged rollout: deploy to dev node group first; monitor for 24h; then staging; then production node groups using a blue-green ASG replacement.
- Approval gate: security team sign-off required before production AMI promotion.
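The functional tests above can be sketched as a small smoke-test script that runs on the instance launched from the candidate AMI. This is a sketch under assumptions: the function names, the MIN_KERNEL default of 5.10, and the systemd unit names (in particular amazon-ssm-agent) are placeholders that vary by distro and are not taken from any specific pipeline.

```shell
# Succeeds (exit 0) when kernel version $1 is >= minimum version $2.
# Relies on GNU sort -V (version sort), assumed to be present on the instance.
kernel_at_least() (
  current=$1
  minimum=$2
  lowest=$(printf '%s\n%s\n' "$current" "$minimum" | sort -V | head -n 1)
  [ "$lowest" = "$minimum" ]
)

# Hypothetical smoke test; defined here but only meant to be called on the test instance.
ami_smoke_test() (
  min_kernel=${MIN_KERNEL:-5.10}   # placeholder minimum, adjust as needed
  failed=0
  kernel_at_least "$(uname -r)" "$min_kernel" ||
    { echo "FAIL: kernel $(uname -r) is older than $min_kernel"; failed=1; }
  # Unit names are assumptions; check what your base image actually ships.
  for svc in kubelet containerd amazon-ssm-agent auditd; do
    systemctl is-active --quiet "$svc" ||
      { echo "FAIL: $svc is not active"; failed=1; }
  done
  # Blocks until cloud-init finishes; non-zero exit means it reported a problem.
  cloud-init status --wait >/dev/null 2>&1 ||
    { echo "FAIL: cloud-init did not finish cleanly"; failed=1; }
  exit "$failed"   # exits only the function's subshell
)
```

Running ami_smoke_test as the last pipeline step before the approval gate gives a single pass/fail exit code to gate promotion on.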
Q: You deployed a sidecar logging agent. Suddenly, CPU throttling spikes. Diagnose and rollback.
A: Diagnose:
# Which pods and containers use the most CPU? (shows usage, not throttling)
kubectl top pods -A --containers --sort-by=cpu
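# Check throttling counters (cgroup-level)
# On the node (cgroup v1 path shown; the exact layout depends on the cgroup driver and the pod's QoS class):
cat /sys/fs/cgroup/cpu/kubepods/pod<pod-uid>/<container-id>/cpu.stat
# Look for: nr_throttled and throttled_time (throttled_usec on cgroup v2)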
# Prometheus metric for throttling
rate(container_cpu_cfs_throttled_seconds_total{container="logging-agent"}[5m])
Root cause candidates:
- The logging agent has a very low resources.limits.cpu, causing CFS throttling when log volume spikes.
- The agent is doing expensive regex parsing or compression on every log line.
- The log volume itself increased (e.g., verbose debug logging turned on accidentally).
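To judge how severe the throttling is, the raw cpu.stat counters from the diagnosis step can be turned into a ratio of throttled periods to total CFS periods. A minimal sketch, assuming cpu.stat content is piped in on stdin; throttle_pct is a hypothetical helper name.

```shell
# Print the percentage of CFS enforcement periods in which the cgroup was throttled.
# Reads cpu.stat content on stdin; nr_periods and nr_throttled are present in
# both cgroup v1 and cgroup v2 cpu.stat files.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "0.0"
    }'
}

# Example usage on the node (path as in the diagnosis step):
#   cat /sys/fs/cgroup/cpu/kubepods/pod<pod-uid>/<container-id>/cpu.stat | throttle_pct
```

The counters are cumulative since the cgroup was created, so sampling twice and comparing the two readings shows whether throttling is ongoing rather than historical.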
Rollback:
# Roll back the DaemonSet to its previous revision
# (if the agent is a true sidecar inside app pods, undo the workload that embeds it instead)
kubectl rollout undo daemonset/logging-agent -n logging
# Verify
kubectl rollout status daemonset/logging-agent -n logging
Fix: increase resources.limits.cpu for the sidecar, add log sampling/filtering to reduce processing volume, or switch to async batched log shipping.
Q: Systemd journal logs vanish on reboot across some AMIs. What do you check in the image build and boot sequence?
A:
- Journal storage mode: check /etc/systemd/journald.conf for Storage=. If Storage=volatile (default on some distros), journals are in /run/log/journal, which is cleared on reboot. Fix: set Storage=persistent to write to /var/log/journal.
- Journal directory: confirm the /var/log/journal/<machine-id> directory exists and has correct permissions (systemd-journal group). With Storage=auto (the systemd default), journald falls back to volatile storage if /var/log/journal doesn't exist.
- Packer provisioner: check if the Packer build script deletes /var/log/journal as part of cleanup (common in image hardening scripts). If so, add a step to recreate it.
- cloud-init: check if cloud-init or user data scripts are inadvertently clearing /var/log.
- AMI snapshot timing: if the AMI snapshot is taken before journald creates the persistent dir on first boot, subsequent instances won't have it. Fix: add mkdir -p /var/log/journal and systemd-tmpfiles --create --prefix /var/log/journal to the Packer provisioner.
- Machine ID: /etc/machine-id must be stable (or cleared for golden AMIs and regenerated on first boot). Journal dirs are keyed by machine ID.
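The first two checks can be scripted so they run as part of image validation. The sketch below is illustrative only: journal_storage_mode and journal_persists are hypothetical helper names, and drop-in files under journald.conf.d/ are deliberately not handled.

```shell
# Print the effective Storage= value from a journald config file.
# Falls back to "auto" (the systemd default) when no uncommented Storage= line exists.
# Assumption: drop-ins under /etc/systemd/journald.conf.d/ are ignored here.
journal_storage_mode() (
  conf=${1:-/etc/systemd/journald.conf}
  mode=$(sed -n 's/^[[:space:]]*Storage=[[:space:]]*\([A-Za-z]*\).*/\1/p' "$conf" 2>/dev/null | tail -n 1)
  printf '%s\n' "${mode:-auto}"
)

# Succeeds when journals are expected to survive a reboot.
# $1 = storage mode, $2 = journal directory (defaults to /var/log/journal).
journal_persists() (
  mode=$1
  dir=${2:-/var/log/journal}
  case $mode in
    persistent) exit 0 ;;       # journald creates the directory itself
    auto)       [ -d "$dir" ] ;; # persistent only if the directory already exists
    *)          exit 1 ;;       # volatile or none
  esac
)

# Example usage on a test instance:
#   journal_persists "$(journal_storage_mode)" || echo "journals will be lost on reboot"
```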
Q: Kernel panic on a GKE node mid-deploy. How do you identify if it's infra, base image, or app-level?
A:
- Collect crash data: GKE nodes write kernel panic info to the serial console. Access via gcloud compute instances get-serial-port-output <instance-name>. Look for Kernel panic, BUG:, Oops:, or Call Trace: in the output.
- Identify the responsible kernel module: the call trace shows the faulting module. If it's a driver or vendor kernel module → infrastructure issue (VM hypervisor, hardware, NVIDIA driver, etc.). If the faulting task is the container runtime (containerd/runc) → base image or runtime issue.
- Correlate with deploy: did the panic happen immediately after deploying a new workload? Check if the panic correlates with a specific container/image.
  - eBPF programs (Cilium, Falco) can trigger kernel panics on buggy kernels; check if the workload uses eBPF.
  - Containers with privileged: true or host PID/network namespace sharing can cause kernel-level issues.
- GKE node logs: gcloud logging read "resource.type=gce_instance AND resource.labels.instance_id=<id>" for OS-level logs before the panic.
- Base image: compare the panicking node's kernel version (uname -r) vs healthy nodes. If the node pool was recently upgraded and the panic matches a known kernel bug, file a support ticket with GCP.
- App-level: if the call trace points to a syscall made by the application (e.g., io_uring, eBPF), that's app-level. Review the application for unsafe kernel interactions.
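Serial console output is long, so it helps to filter it down to the panic markers listed in the first step. A minimal sketch, assuming the console text is piped in on stdin; find_panic_markers is a hypothetical helper name.

```shell
# Print serial-console lines (prefixed with their line numbers) that match
# common kernel panic markers. Reads the console output on stdin and
# exits non-zero when nothing matches.
find_panic_markers() {
  grep -nE 'Kernel panic|BUG:|Oops:|Call Trace:'
}

# Example usage:
#   gcloud compute instances get-serial-port-output <instance-name> | find_panic_markers
#
# To compare kernel versions across nodes for the base-image check:
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
```

The reported line numbers make it easy to jump to the surrounding call trace in the full output.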
Q: How do you check network connectivity between two servers?
A:
# Basic connectivity
ping <target-ip>
# Port-level connectivity
telnet <target-ip> <port>
nc -zv <target-ip> <port> # netcat, non-interactive
curl -v http://<target-ip>:<port> # HTTP-level check
# Trace the network path
traceroute <target-ip> # UDP probes by default on Linux; -I for ICMP, -T for TCP
tracepath <target-ip> # doesn't require root
mtr <target-ip> # continuous traceroute with loss stats
# Check if a port is listening on the target
ss -tlnp | grep <port> # on the target server
netstat -tlnp | grep <port>
# DNS resolution
dig <hostname> @<dns-server>
nslookup <hostname>
# Check firewall / security groups
# On Linux: iptables -L -n -v
# On AWS: check Security Group inbound/outbound rules and NACLs
# Advanced: check MTU / packet loss
ping -M do -s 1472 <target-ip> # don't-fragment probe with large packets (1472 = 1500 MTU - 28 bytes of IP/ICMP headers)
For cross-VPC or cross-account issues in AWS: also verify that VPC peering routes, Transit Gateway route tables, and Security Group rules allow the specific port/protocol.
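The payload size in the MTU check depends on the path MTU being tested (for example, jumbo frames inside a VPC). The sketch below derives it for IPv4 and prints the matching probe command; both function names are hypothetical, and the -M do flag assumes Linux iputils ping.

```shell
# ICMP payload size that exactly fills a given MTU over IPv4:
# MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Print (do not run) a don't-fragment ping probe for a target and MTU.
# $1 = target IP, $2 = MTU to test.
mtu_probe_cmd() {
  printf 'ping -M do -c 3 -s %s %s\n' "$(icmp_payload_for_mtu "$2")" "$1"
}

# Example usage:
#   mtu_probe_cmd <target-ip> 1500
```

If the probe at the expected MTU fails while a smaller payload succeeds, something on the path (a tunnel, VPN, or peering link) is enforcing a lower MTU.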