# 30. Production Checklist
```mermaid
mindmap
  root((Production<br/>Readiness))
    Resources
      CPU/Memory Requests & Limits
      HPA / VPA
      PodDisruptionBudget
      ResourceQuota & LimitRange
    Reliability
      Liveness / Readiness / Startup Probes
      Pod Anti-affinity
      Topology Spread Constraints
      Graceful Shutdown
      Rolling Update Strategy
    Security
      RBAC Least Privilege
      Pod Security Standards
      NetworkPolicy Default Deny
      Image Scanning & Signing
      Secrets Management
    Observability
      Prometheus Metrics
      Structured Logging
      Distributed Tracing
      Alerting Rules & SLOs
    Networking
      TLS Everywhere
      Ingress Configuration
      DNS & External DNS
      Rate Limiting & WAF
    Backup & DR
      etcd Snapshots
      PV Snapshots
      Database Backups
      Disaster Recovery Plan
    CI/CD
      GitOps Workflow
      Canary / Blue-Green
      Rollback Procedure
      Staging Parity
```
This chapter provides a comprehensive pre-production checklist organized by category. Use it as a gate before any production deployment. Every item should be verified and signed off by the responsible team.
## Resource Management
Proper resource configuration prevents scheduling failures, noisy-neighbor problems, and unexpected OOM kills.
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | CPU requests set for all containers | ☐ | Critical | Ensures fair scheduling |
| 2 | Memory requests set for all containers | ☐ | Critical | Prevents eviction |
| 3 | CPU limits set for all containers | ☐ | High | Prevents CPU starvation |
| 4 | Memory limits set for all containers | ☐ | Critical | Prevents OOMKill of other pods |
| 5 | HPA configured for stateless workloads | ☐ | High | Auto-scale on CPU/memory/custom metrics |
| 6 | VPA evaluated for stateful workloads | ☐ | Medium | Right-size resource requests |
| 7 | PodDisruptionBudget (PDB) set | ☐ | High | Ensures availability during node drain |
| 8 | ResourceQuota set per namespace | ☐ | High | Prevents resource exhaustion |
| 9 | LimitRange set per namespace | ☐ | Medium | Default limits for pods without explicit limits |
| 10 | No unbounded resource usage | ☐ | High | Every container has requests AND limits |
```yaml
# ResourceQuota — cap total resources per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    persistentvolumeclaims: "20"
    services.loadbalancers: "5"
---
# LimitRange — default limits for containers without explicit config
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "4"
      memory: "8Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    type: Container
---
# PodDisruptionBudget — ensure minimum availability
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: production
spec:
  minAvailable: 2  # or use maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp
```
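Checklist item 5 calls for an HPA on stateless workloads. A minimal `autoscaling/v2` sketch, assuming a Deployment named `myapp` and a 70% CPU target (both illustrative values, not prescriptions):

```yaml
# HorizontalPodAutoscaler — scale a stateless Deployment on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3        # keep at least the availability floor
  maxReplicas: 10       # cap to stay within the namespace ResourceQuota
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Keep `minReplicas` at or above the PDB's `minAvailable`, and make sure `maxReplicas × limits` fits inside the namespace ResourceQuota above.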
### Verification Commands
```bash
# Check all pods have resource requests/limits
kubectl get pods -n production -o json | \
  jq '.items[] | {name: .metadata.name, containers: [.spec.containers[] | {name: .name, resources: .resources}]}'

# Find pods WITHOUT resource limits
kubectl get pods -n production -o json | \
  jq -r '.items[] | .spec.containers[] | select(.resources.limits == null) | .name'

# Check HPA status
kubectl get hpa -n production

# Check PDB status
kubectl get pdb -n production
```
## Reliability
Reliability ensures your application survives failures, restarts, and infrastructure changes without downtime.
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | Liveness probe configured | ☐ | Critical | Restart unhealthy containers |
| 2 | Readiness probe configured | ☐ | Critical | Remove from service during issues |
| 3 | Startup probe for slow-starting apps | ☐ | High | Prevent premature liveness failures |
| 4 | Graceful shutdown implemented | ☐ | Critical | Handle SIGTERM, drain connections |
| 5 | `terminationGracePeriodSeconds` tuned | ☐ | High | Default 30s; increase for long-running requests |
| 6 | `preStop` hook configured | ☐ | Medium | Delay for load balancer deregistration |
| 7 | Pod anti-affinity configured | ☐ | High | Spread pods across nodes |
| 8 | Topology spread constraints | ☐ | High | Even distribution across zones |
| 9 | Rolling update: `maxUnavailable: 0` | ☐ | High | Zero-downtime deployments |
| 10 | Rollback plan documented and tested | ☐ | Critical | |
| 11 | At least 2 replicas for critical services | ☐ | Critical | Single replica = single point of failure |
| 12 | Init containers for dependency checks | ☐ | Medium | Wait for DB/cache before starting |
```yaml
# Complete reliability configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: "1.5.2"
    spec:
      terminationGracePeriodSeconds: 60
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["myapp"]
              topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: myapp
      initContainers:
      - name: wait-for-db
        image: busybox:1.36
        command: ['sh', '-c', 'until nc -z postgres.production.svc.cluster.local 5432; do echo waiting for db; sleep 2; done']
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.5.2
        ports:
        - containerPort: 8080
          name: http
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]  # allow LB to deregister
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 10
          periodSeconds: 15
          timeoutSeconds: 3
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /readyz
            port: http
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /healthz
            port: http
          failureThreshold: 30
          periodSeconds: 10
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
```
### Verification Commands
```bash
# Check all deployments have probes
kubectl get deployments -n production -o json | \
  jq -r '.items[] | .spec.template.spec.containers[] | select(.livenessProbe == null) | .name' | \
  xargs -I{} echo "WARNING: container '{}' has no liveness probe"

# Check replica counts
kubectl get deployments -n production -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas,READY:.status.readyReplicas

# Check PDB coverage
kubectl get pdb -n production -o custom-columns=NAME:.metadata.name,MIN-AVAILABLE:.spec.minAvailable,ALLOWED-DISRUPTIONS:.status.disruptionsAllowed

# Test rollback
kubectl rollout history deployment/myapp -n production
kubectl rollout undo deployment/myapp -n production --to-revision=<N>
```
## Security
Security must be enforced at every layer — from image build to runtime.
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | RBAC configured (least privilege) | ☐ | Critical | No `cluster-admin` for workloads |
| 2 | Service accounts scoped per workload | ☐ | High | Don't use the `default` service account |
| 3 | `automountServiceAccountToken: false` | ☐ | High | Unless SA token is needed |
| 4 | Pod Security Standards: `restricted` | ☐ | Critical | Namespace-level enforcement |
| 5 | Containers run as non-root | ☐ | Critical | `runAsNonRoot: true` |
| 6 | Read-only root filesystem | ☐ | High | `readOnlyRootFilesystem: true` |
| 7 | No privilege escalation | ☐ | Critical | `allowPrivilegeEscalation: false` |
| 8 | All capabilities dropped | ☐ | High | `capabilities.drop: ["ALL"]` |
| 9 | Seccomp profile: RuntimeDefault | ☐ | High | Restrict syscalls |
| 10 | NetworkPolicy: default deny | ☐ | Critical | Explicit allow rules only |
| 11 | Images scanned in CI | ☐ | Critical | Fail on CRITICAL CVEs |
| 12 | Images signed (Cosign) | ☐ | Medium | Verify in admission controller |
| 13 | No `:latest` tags | ☐ | Critical | Use specific version or SHA |
| 14 | Secrets from external store | ☐ | High | Vault / Sealed Secrets / ESO |
| 15 | No secrets in environment variables | ☐ | Medium | Use volume mounts instead |
| 16 | Image pull from private registry | ☐ | High | `imagePullSecrets` configured |
| 17 | OPA/Gatekeeper policies enforced | ☐ | Medium | Require labels, limits, no `:latest` |
```yaml
# Namespace with Pod Security Standards enforced
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
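To pass the `restricted` profile, every pod in the namespace needs the security context settings from checklist items 5 through 9. A sketch of what that looks like on a pod spec (the `myapp` name, image, and UID are illustrative):

```yaml
# Pod spec fragment satisfying the "restricted" Pod Security Standard
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  namespace: production
spec:
  securityContext:                    # pod-level defaults
    runAsNonRoot: true                # item 5
    runAsUser: 10001                  # any non-zero UID your image supports
    seccompProfile:
      type: RuntimeDefault            # item 9
  containers:
  - name: myapp
    image: registry.example.com/myapp:1.5.2
    securityContext:                  # container-level hardening
      allowPrivilegeEscalation: false # item 7
      readOnlyRootFilesystem: true    # item 6
      capabilities:
        drop: ["ALL"]                 # item 8
```

If the app needs writable scratch space with a read-only root filesystem, mount an `emptyDir` volume at the paths it writes to.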
### Security Verification Commands
```bash
# Find pods running as root
kubectl get pods -n production -o json | \
  jq -r '.items[] | .spec.containers[] | select(.securityContext.runAsNonRoot != true) | .name'

# Find pods without security context
kubectl get pods -n production -o json | \
  jq -r '.items[] | select(.spec.securityContext == null) | .metadata.name'

# Check NetworkPolicies exist
kubectl get networkpolicy -n production
# Should have at least: default-deny-all, allow-dns, and app-specific rules

# Audit RBAC — find cluster-admin bindings
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | select(.roleRef.name == "cluster-admin") | {name: .metadata.name, subjects: .subjects}'

# Scan running images
kubectl get pods -n production -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | \
  sort -u | xargs -I{} trivy image --severity CRITICAL {}
```
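The `default-deny-all` and `allow-dns` policies the check above expects could look like this (a minimal sketch; the kube-dns port and namespace label assume a standard kubeadm-style cluster):

```yaml
# Default deny — blocks all ingress and egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}                 # empty selector = all pods
  policyTypes: ["Ingress", "Egress"]
---
# Allow DNS — without this, default-deny breaks name resolution
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Add app-specific allow rules on top of these for each legitimate traffic path.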
## Observability
You cannot manage what you cannot measure. Observability covers metrics, logs, traces, and alerts.
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | Prometheus metrics exposed (`/metrics`) | ☐ | Critical | Application + infrastructure metrics |
| 2 | ServiceMonitor / PodMonitor configured | ☐ | Critical | Prometheus scraping |
| 3 | Grafana dashboards created | ☐ | High | RED metrics (Rate, Errors, Duration) |
| 4 | Alerting rules defined | ☐ | Critical | Error rate, latency, saturation |
| 5 | Alert routing configured | ☐ | Critical | PagerDuty / Slack / email |
| 6 | Structured logging (JSON) | ☐ | High | Parseable by log aggregator |
| 7 | Log aggregation configured | ☐ | High | Loki / ELK / CloudWatch |
| 8 | Log retention policy set | ☐ | Medium | 30-90 days depending on compliance |
| 9 | Distributed tracing enabled | ☐ | High | OpenTelemetry / Jaeger / Tempo |
| 10 | SLOs defined and monitored | ☐ | High | Availability, latency targets |
| 11 | Error budget tracking | ☐ | Medium | SLO-based alerting |
| 12 | On-call runbook documented | ☐ | Critical | Step-by-step for each alert |
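Item 2's ServiceMonitor, for the Prometheus Operator, might look like this (the `release: prometheus` label is an assumption — it must match whatever `serviceMonitorSelector` your Prometheus instance is configured with):

```yaml
# ServiceMonitor — tell the Prometheus Operator to scrape myapp's /metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    release: prometheus     # must match Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp            # matches the Service, not the pods directly
  endpoints:
  - port: http              # named port on the Service
    path: /metrics
    interval: 30s
```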
### Must-Have Alerts
```yaml
# PrometheusRule — critical alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: production
spec:
  groups:
  - name: myapp.rules
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{namespace="production",app="myapp",status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{namespace="production",app="myapp"}[5m]))
        > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High 5xx error rate (> 5%) for myapp"
        runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace="production",app="myapp"}[5m])) by (le))
        > 1.0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "P99 latency > 1s for myapp"
    - alert: PodCrashLooping
      expr: |
        increase(kube_pod_container_status_restarts_total{namespace="production"}[1h]) > 3
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
    - alert: PodOOMKilled
      expr: |
        kube_pod_container_status_last_terminated_reason{namespace="production",reason="OOMKilled"} == 1
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} was OOMKilled"
    - alert: PVCAlmostFull
      expr: |
        kubelet_volume_stats_used_bytes{namespace="production"}
        /
        kubelet_volume_stats_capacity_bytes{namespace="production"}
        > 0.85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} is > 85% full"
    - alert: CertificateExpiringSoon
      expr: |
        (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 14
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} expires in < 14 days"
```
## Networking
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | TLS termination at Ingress | ☐ | Critical | cert-manager + Let's Encrypt |
| 2 | HTTP → HTTPS redirect | ☐ | Critical | `ssl-redirect: "true"` annotation |
| 3 | TLS certificates auto-renewed | ☐ | Critical | cert-manager ClusterIssuer |
| 4 | Rate limiting configured | ☐ | High | Prevent abuse / DDoS |
| 5 | CORS configured (if API) | ☐ | Medium | Restrict allowed origins |
| 6 | Security headers set | ☐ | High | X-Frame-Options, CSP, HSTS |
| 7 | DNS records configured | ☐ | Critical | A/CNAME for all services |
| 8 | External DNS automated | ☐ | Medium | Auto-create DNS records |
| 9 | Health check endpoints excluded from auth | ☐ | High | `/healthz`, `/readyz` |
| 10 | gRPC/WebSocket support configured | ☐ | Medium | If applicable |
```yaml
# Production Ingress with security headers and TLS
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  namespace: production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/rate-limit-window: "1m"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-Frame-Options: DENY";
      more_set_headers "X-Content-Type-Options: nosniff";
      more_set_headers "X-XSS-Protection: 1; mode=block";
      more_set_headers "Referrer-Policy: strict-origin-when-cross-origin";
      more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains";
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp
            port:
              number: 80
```
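The `letsencrypt-prod` issuer referenced by the Ingress annotation could be defined like this (a sketch using the HTTP-01 solver; the contact email is a placeholder and the `class: nginx` assumes the ingress-nginx controller):

```yaml
# ClusterIssuer — Let's Encrypt production ACME with HTTP-01 challenges
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com        # placeholder — expiry notices go here
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - http01:
        ingress:
          class: nginx
```

With this in place, cert-manager issues and renews the `myapp-tls` certificate automatically, covering checklist items 1 and 3.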
## Backup & Disaster Recovery
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | etcd backup automated (daily) | ☐ | Critical | Tested restore procedure |
| 2 | etcd backup stored off-cluster | ☐ | Critical | S3 / GCS / separate storage |
| 3 | PV snapshots configured | ☐ | High | VolumeSnapshot for databases |
| 4 | Database backup automated | ☐ | Critical | pg_dump / mysqldump + CronJob |
| 5 | Backup retention policy defined | ☐ | High | 7 daily, 4 weekly, 12 monthly |
| 6 | Restore procedure documented | ☐ | Critical | Step-by-step runbook |
| 7 | Restore tested quarterly | ☐ | Critical | Regular DR drills |
| 8 | Multi-region / multi-cluster DR plan | ☐ | Medium | For critical services |
| 9 | Velero or similar backup tool | ☐ | High | Full cluster backup |
```bash
# etcd backup CronJob
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"  # daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
          - name: etcd-backup
            image: bitnami/etcd:3.5
            command:
            - /bin/sh
            - -c
            - |
              etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key
              # Upload to S3
              aws s3 cp /backup/ s3://my-etcd-backups/ --recursive
              # Clean up local backups older than 7 days
              find /backup -mtime +7 -delete
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/lib/etcd-backups
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
EOF
```
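Item 4's database backup follows the same CronJob pattern. A sketch for PostgreSQL — the `db-backups` PVC, `postgres-credentials` Secret, and schedule are all illustrative names you would adapt:

```yaml
# Database backup CronJob — nightly pg_dump to a dedicated backup PVC
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: production
spec:
  schedule: "30 2 * * *"   # after the etcd backup window
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: postgres:16
            command:
            - /bin/sh
            - -c
            - |
              pg_dump "$DATABASE_URL" | gzip > /backup/db-$(date +%Y%m%d-%H%M%S).sql.gz
              # Apply local retention; off-cluster copy should happen separately
              find /backup -mtime +7 -delete
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: url
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: db-backups
```

As with etcd, a backup only counts once the restore has been tested — schedule a quarterly drill that restores the latest dump into a scratch database.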
## CI/CD
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | CI pipeline runs unit + integration tests | ☐ | Critical | Every PR |
| 2 | Image built and scanned in CI | ☐ | Critical | Fail on critical CVEs |
| 3 | GitOps workflow configured | ☐ | High | ArgoCD / Flux |
| 4 | Staging environment mirrors production | ☐ | High | Same configs, smaller scale |
| 5 | Canary or blue-green deployment | ☐ | Medium | Progressive rollout |
| 6 | Rollback procedure documented | ☐ | Critical | One-command rollback |
| 7 | Deployment notifications | ☐ | High | Slack / Teams on deploy |
| 8 | Feature flags for risky changes | ☐ | Medium | Decouple deploy from release |
| 9 | Database migration strategy | ☐ | High | Backward-compatible migrations |
| 10 | Smoke tests after deployment | ☐ | High | Automated health verification |
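Item 3's GitOps workflow, using Argo CD as the example, boils down to one Application resource per environment. A sketch (the repo URL, chart path, and values file are hypothetical placeholders):

```yaml
# Argo CD Application — continuously sync production from Git
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/myapp-deploy.git   # placeholder repo
    targetRevision: main
    path: chart
    helm:
      valueFiles: ["values-production.yaml"]
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift in the cluster
```

With `selfHeal` on, rollback becomes a Git revert: restoring the previous commit restores the previous deployment, which satisfies item 6's one-command rollback.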
### Pre-Deployment Verification Script
Run this script before every production deployment:
```bash
#!/bin/bash
# pre-deploy-check.sh — run before production deployment
set -euo pipefail

NAMESPACE="production"
RELEASE="myapp"
IMAGE_TAG="${1:?Usage: $0 <image-tag>}"
REGISTRY="registry.example.com"

echo "═══════════════════════════════════════════"
echo " Pre-Deployment Checklist"
echo "═══════════════════════════════════════════"

# 1. Verify image exists
echo -n "✓ Image exists... "
docker manifest inspect ${REGISTRY}/${RELEASE}:${IMAGE_TAG} > /dev/null 2>&1 && echo "OK" || { echo "FAIL"; exit 1; }

# 2. Verify image was scanned
echo -n "✓ Image scan passed... "
trivy image --exit-code 1 --severity CRITICAL --quiet ${REGISTRY}/${RELEASE}:${IMAGE_TAG} && echo "OK" || { echo "FAIL — critical CVEs found"; exit 1; }

# 3. Dry-run the deployment
echo -n "✓ Helm dry-run... "
helm upgrade --install ${RELEASE} ./chart \
  -f chart/values-production.yaml \
  --set image.tag=${IMAGE_TAG} \
  --namespace ${NAMESPACE} \
  --dry-run > /dev/null 2>&1 && echo "OK" || { echo "FAIL"; exit 1; }

# 4. Check resource quotas
echo -n "✓ Resource quota available... "
kubectl describe resourcequota -n ${NAMESPACE} 2>/dev/null | grep -q "used" && echo "OK" || echo "WARN — no quota set"

# 5. Check PDB exists
echo -n "✓ PDB configured... "
kubectl get pdb -n ${NAMESPACE} -l app=${RELEASE} 2>/dev/null | grep -q ${RELEASE} && echo "OK" || echo "WARN — no PDB"

# 6. Check current deployment health
echo -n "✓ Current deployment healthy... "
kubectl rollout status deployment/${RELEASE} -n ${NAMESPACE} --timeout=10s > /dev/null 2>&1 && echo "OK" || echo "WARN — current deployment unhealthy"

echo ""
echo "═══════════════════════════════════════════"
echo " All checks passed. Ready to deploy."
echo "═══════════════════════════════════════════"
```
### Post-Deployment Verification
```bash
#!/bin/bash
# post-deploy-verify.sh — run after production deployment
set -euo pipefail

NAMESPACE="production"
RELEASE="myapp"
URL="https://myapp.example.com"

echo "Waiting for rollout to complete..."
kubectl rollout status deployment/${RELEASE} -n ${NAMESPACE} --timeout=300s

echo "Running health checks..."

# Health endpoint
curl -sf ${URL}/healthz || { echo "FAIL: healthz"; exit 1; }
echo "✓ /healthz OK"

# Readiness endpoint
curl -sf ${URL}/readyz || { echo "FAIL: readyz"; exit 1; }
echo "✓ /readyz OK"

# Smoke test — basic API call
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" ${URL}/api/v1/status)
if [ "$HTTP_CODE" -eq 200 ]; then
  echo "✓ API returns 200"
else
  echo "FAIL: API returned ${HTTP_CODE}"
  echo "Rolling back..."
  kubectl rollout undo deployment/${RELEASE} -n ${NAMESPACE}
  exit 1
fi

# Check no pods are crash-looping
RESTARTS=$(kubectl get pods -n ${NAMESPACE} -l app=${RELEASE} -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}' | tr ' ' '\n' | sort -rn | head -1)
if [ "${RESTARTS:-0}" -gt 2 ]; then
  echo "WARN: Pod restarts detected (${RESTARTS})"
fi

echo ""
echo "✅ Deployment verified successfully"
```
## Summary: The Golden Rules
1. Every container has resource requests AND limits
2. Every deployment has liveness, readiness, and startup probes
3. Every namespace has NetworkPolicy default-deny
4. Every image is scanned and uses a specific tag (never `:latest`)
5. Every secret comes from an external store (Vault / Sealed Secrets)
6. Every service has at least 2 replicas with PDB
7. Every deployment has a documented rollback procedure
8. Every alert has a runbook
9. Every backup has a tested restore procedure
10. Every change goes through Git (GitOps)