# 30. Production Checklist
```mermaid
mindmap
  root((Production<br/>Readiness))
    Resources
      CPU/Memory Requests & Limits
      HPA / VPA
      PodDisruptionBudget
      ResourceQuota & LimitRange
    Reliability
      Liveness / Readiness / Startup Probes
      Pod Anti-affinity
      Topology Spread Constraints
      Graceful Shutdown
      Rolling Update Strategy
    Security
      RBAC Least Privilege
      Pod Security Standards
      NetworkPolicy Default Deny
      Image Scanning & Signing
      Secrets Management
    Observability
      Prometheus Metrics
      Structured Logging
      Distributed Tracing
      Alerting Rules & SLOs
    Networking
      TLS Everywhere
      Ingress Configuration
      DNS & External DNS
      Rate Limiting & WAF
    Backup & DR
      etcd Snapshots
      PV Snapshots
      Database Backups
      Disaster Recovery Plan
    CI/CD
      GitOps Workflow
      Canary / Blue-Green
      Rollback Procedure
      Staging Parity
```
This chapter provides a comprehensive pre-production checklist organized by category. Use it as a gate before any production deployment. Every item should be verified and signed off by the responsible team.
## Resource Management
Proper resource configuration prevents scheduling failures, noisy-neighbor problems, and unexpected OOM kills.
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | CPU requests set for all containers | ☐ | Critical | Ensures fair scheduling |
| 2 | Memory requests set for all containers | ☐ | Critical | Prevents eviction |
| 3 | CPU limits set for all containers | ☐ | High | Prevents CPU starvation |
| 4 | Memory limits set for all containers | ☐ | Critical | Prevents OOMKill of other pods |
| 5 | HPA configured for stateless workloads | ☐ | High | Auto-scale on CPU/memory/custom metrics |
| 6 | VPA evaluated for stateful workloads | ☐ | Medium | Right-size resource requests |
| 7 | PodDisruptionBudget (PDB) set | ☐ | High | Ensures availability during node drain |
| 8 | ResourceQuota set per namespace | ☐ | High | Prevents resource exhaustion |
| 9 | LimitRange set per namespace | ☐ | Medium | Default limits for pods without explicit limits |
| 10 | No unbounded resource usage | ☐ | High | Every container has requests AND limits |
```yaml
# ResourceQuota — cap total resources per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
    persistentvolumeclaims: "20"
    services.loadbalancers: "5"
---
# LimitRange — default limits for containers without explicit config
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    max:
      cpu: "4"
      memory: "8Gi"
    min:
      cpu: "50m"
      memory: "64Mi"
    type: Container
---
# PodDisruptionBudget — ensure minimum availability
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: production
spec:
  minAvailable: 2  # or use maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp
```
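Checklist item 5 calls for an HPA on stateless workloads. A minimal `autoscaling/v2` sketch, assuming a Deployment named `myapp` and a 70% CPU target (both illustrative values, not prescriptions):

```yaml
# HorizontalPodAutoscaler — scale a stateless Deployment on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3        # keep at least the availability floor
  maxReplicas: 10       # cap to stay within the namespace ResourceQuota
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Keep `minReplicas` at or above the PDB's `minAvailable`, and make sure `maxReplicas × limits` fits inside the namespace ResourceQuota above.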
### Verification Commands
```bash
# Check all pods have resource requests/limits
kubectl get pods -n production -o json | \
  jq '.items[] | {name: .metadata.name, containers: [.spec.containers[] | {name: .name, resources: .resources}]}'

# Find pods WITHOUT resource limits
kubectl get pods -n production -o json | \
  jq -r '.items[] | .spec.containers[] | select(.resources.limits == null) | .name'

# Check HPA status
kubectl get hpa -n production

# Check PDB status
kubectl get pdb -n production
```
## Reliability
Reliability ensures your application survives failures, restarts, and infrastructure changes without downtime.
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | Liveness probe configured | ☐ | Critical | Restart unhealthy containers |
| 2 | Readiness probe configured | ☐ | Critical | Remove from service during issues |
| 3 | Startup probe for slow-starting apps | ☐ | High | Prevent premature liveness failures |
| 4 | Graceful shutdown implemented | ☐ | Critical | Handle SIGTERM, drain connections |
| 5 | `terminationGracePeriodSeconds` tuned | ☐ | High | Default 30s; increase for long-running requests |
| 6 | `preStop` hook configured | ☐ | Medium | Delay for load balancer deregistration |
| 7 | Pod anti-affinity configured | ☐ | High | Spread pods across nodes |
| 8 | Topology spread constraints | ☐ | High | Even distribution across zones |
| 9 | Rolling update: `maxUnavailable: 0` | ☐ | High | Zero-downtime deployments |
| 10 | Rollback plan documented and tested | ☐ | Critical | |
| 11 | At least 2 replicas for critical services | ☐ | Critical | Single replica = single point of failure |
| 12 | Init containers for dependency checks | ☐ | Medium | Wait for DB/cache before starting |
```yaml
# Complete reliability configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: "1.5.2"
    spec:
      terminationGracePeriodSeconds: 60
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["myapp"]
              topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: myapp
      initContainers:
      - name: wait-for-db
        image: busybox:1.36
        command: ['sh', '-c', 'until nc -z postgres.production.svc.cluster.local 5432; do echo waiting for db; sleep 2; done']
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.5.2
        ports:
        - containerPort: 8080
          name: http
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]  # allow LB to deregister
        livenessProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 10
          periodSeconds: 15
          timeoutSeconds: 3
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /readyz
            port: http
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /healthz
            port: http
          failureThreshold: 30
          periodSeconds: 10
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
          limits:
            cpu: "1"
            memory: "512Mi"
```
### Verification Commands
```bash
# Check all deployments have probes
kubectl get deployments -n production -o json | \
  jq -r '.items[] | .spec.template.spec.containers[] | select(.livenessProbe == null) | .name' | \
  xargs -I{} echo "WARNING: container '{}' has no liveness probe"

# Check replica counts
kubectl get deployments -n production -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas,READY:.status.readyReplicas

# Check PDB coverage
kubectl get pdb -n production -o custom-columns=NAME:.metadata.name,MIN-AVAILABLE:.spec.minAvailable,ALLOWED-DISRUPTIONS:.status.disruptionsAllowed

# Test rollback
kubectl rollout history deployment/myapp -n production
kubectl rollout undo deployment/myapp -n production --to-revision=<N>
```
## Security
Security must be enforced at every layer — from image build to runtime.
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | RBAC configured (least privilege) | ☐ | Critical | No `cluster-admin` for workloads |
| 2 | Service accounts scoped per workload | ☐ | High | Don't use the `default` service account |
| 3 | `automountServiceAccountToken: false` | ☐ | High | Unless SA token is needed |
| 4 | Pod Security Standards: `restricted` | ☐ | Critical | Namespace-level enforcement |
| 5 | Containers run as non-root | ☐ | Critical | `runAsNonRoot: true` |
| 6 | Read-only root filesystem | ☐ | High | `readOnlyRootFilesystem: true` |
| 7 | No privilege escalation | ☐ | Critical | `allowPrivilegeEscalation: false` |
| 8 | All capabilities dropped | ☐ | High | `capabilities.drop: ["ALL"]` |
| 9 | Seccomp profile: RuntimeDefault | ☐ | High | Restrict syscalls |
| 10 | NetworkPolicy: default deny | ☐ | Critical | Explicit allow rules only |
| 11 | Images scanned in CI | ☐ | Critical | Fail on CRITICAL CVEs |
| 12 | Images signed (Cosign) | ☐ | Medium | Verify in admission controller |
| 13 | No `:latest` tags | ☐ | Critical | Use specific version or SHA |
| 14 | Secrets from external store | ☐ | High | Vault / Sealed Secrets / ESO |
| 15 | No secrets in environment variables | ☐ | Medium | Use volume mounts instead |
| 16 | Image pull from private registry | ☐ | High | `imagePullSecrets` configured |
| 17 | OPA/Gatekeeper policies enforced | ☐ | Medium | Require labels, limits, no `:latest` |
```yaml
# Namespace with Pod Security Standards enforced
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```
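To pass the `restricted` profile, every pod in the namespace needs the security context settings from checklist items 5 through 9. A sketch of what that looks like on a pod spec (the `myapp` name, image, and UID are illustrative):

```yaml
# Pod spec fragment satisfying the "restricted" Pod Security Standard
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  namespace: production
spec:
  securityContext:                    # pod-level defaults
    runAsNonRoot: true                # item 5
    runAsUser: 10001                  # any non-zero UID your image supports
    seccompProfile:
      type: RuntimeDefault            # item 9
  containers:
  - name: myapp
    image: registry.example.com/myapp:1.5.2
    securityContext:                  # container-level hardening
      allowPrivilegeEscalation: false # item 7
      readOnlyRootFilesystem: true    # item 6
      capabilities:
        drop: ["ALL"]                 # item 8
```

If the app needs writable scratch space with a read-only root filesystem, mount an `emptyDir` volume at the paths it writes to.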
### Security Verification Commands
```bash
# Find pods running as root
kubectl get pods -n production -o json | \
  jq -r '.items[] | .spec.containers[] | select(.securityContext.runAsNonRoot != true) | .name'

# Find pods without security context
kubectl get pods -n production -o json | \
  jq -r '.items[] | select(.spec.securityContext == null) | .metadata.name'

# Check NetworkPolicies exist
kubectl get networkpolicy -n production
# Should have at least: default-deny-all, allow-dns, and app-specific rules

# Audit RBAC — find cluster-admin bindings
kubectl get clusterrolebindings -o json | \
  jq -r '.items[] | select(.roleRef.name == "cluster-admin") | {name: .metadata.name, subjects: .subjects}'

# Scan running images
kubectl get pods -n production -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | \
  sort -u | xargs -I{} trivy image --severity CRITICAL {}
```
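The `default-deny-all` and `allow-dns` policies the check above expects could look like this (a minimal sketch; the kube-dns port and namespace label assume a standard kubeadm-style cluster):

```yaml
# Default deny — blocks all ingress and egress for every pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}                 # empty selector = all pods
  policyTypes: ["Ingress", "Egress"]
---
# Allow DNS — without this, default-deny breaks name resolution
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes: ["Egress"]
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Add app-specific allow rules on top of these for each legitimate traffic path.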
## Observability
You cannot manage what you cannot measure. Observability covers metrics, logs, traces, and alerts.
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | Prometheus metrics exposed (`/metrics`) | ☐ | Critical | Application + infrastructure metrics |
| 2 | ServiceMonitor / PodMonitor configured | ☐ | Critical | Prometheus scraping |
| 3 | Grafana dashboards created | ☐ | High | RED metrics (Rate, Errors, Duration) |
| 4 | Alerting rules defined | ☐ | Critical | Error rate, latency, saturation |
| 5 | Alert routing configured | ☐ | Critical | PagerDuty / Slack / email |
| 6 | Structured logging (JSON) | ☐ | High | Parseable by log aggregator |
| 7 | Log aggregation configured | ☐ | High | Loki / ELK / CloudWatch |
| 8 | Log retention policy set | ☐ | Medium | 30-90 days depending on compliance |
| 9 | Distributed tracing enabled | ☐ | High | OpenTelemetry / Jaeger / Tempo |
| 10 | SLOs defined and monitored | ☐ | High | Availability, latency targets |
| 11 | Error budget tracking | ☐ | Medium | SLO-based alerting |
| 12 | On-call runbook documented | ☐ | Critical | Step-by-step for each alert |
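Item 2's ServiceMonitor, for the Prometheus Operator, might look like this (the `release: prometheus` label is an assumption — it must match whatever `serviceMonitorSelector` your Prometheus instance is configured with):

```yaml
# ServiceMonitor — tell the Prometheus Operator to scrape myapp's /metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: production
  labels:
    release: prometheus     # must match Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp            # matches the Service, not the pods directly
  endpoints:
  - port: http              # named port on the Service
    path: /metrics
    interval: 30s
```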
### Must-Have Alerts
```yaml
# PrometheusRule — critical alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-alerts
  namespace: production
spec:
  groups:
  - name: myapp.rules
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{namespace="production",app="myapp",status=~"5.."}[5m]))
        /
        sum(rate(http_requests_total{namespace="production",app="myapp"}[5m]))
        > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High 5xx error rate (> 5%) for myapp"
        runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
    - alert: HighLatency
      expr: |
        histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace="production",app="myapp"}[5m])) by (le))
        > 1.0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "P99 latency > 1s for myapp"
    - alert: PodCrashLooping
      expr: |
        increase(kube_pod_container_status_restarts_total{namespace="production"}[1h]) > 3
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} is crash looping"
    - alert: PodOOMKilled
      expr: |
        kube_pod_container_status_last_terminated_reason{namespace="production",reason="OOMKilled"} == 1
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: "Pod {{ $labels.pod }} was OOMKilled"
    - alert: PVCAlmostFull
      expr: |
        kubelet_volume_stats_used_bytes{namespace="production"}
        /
        kubelet_volume_stats_capacity_bytes{namespace="production"}
        > 0.85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} is > 85% full"
    - alert: CertificateExpiringSoon
      expr: |
        (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 14
      for: 1h
      labels:
        severity: warning
      annotations:
        summary: "Certificate {{ $labels.name }} expires in < 14 days"
```
## Networking
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | TLS termination at Ingress | ☐ | Critical | cert-manager + Let's Encrypt |
| 2 | HTTP → HTTPS redirect | ☐ | Critical | `ssl-redirect: "true"` annotation |
| 3 | TLS certificates auto-renewed | ☐ | Critical | cert-manager ClusterIssuer |
| 4 | Rate limiting configured | ☐ | High | Prevent abuse / DDoS |
| 5 | CORS configured (if API) | ☐ | Medium | Restrict allowed origins |
| 6 | Security headers set | ☐ | High | X-Frame-Options, CSP, HSTS |
| 7 | DNS records configured | ☐ | Critical | A/CNAME for all services |
| 8 | External DNS automated | ☐ | Medium | Auto-create DNS records |
| 9 | Health check endpoints excluded from auth | ☐ | High | `/healthz`, `/readyz` |
| 10 | gRPC/WebSocket support configured | ☐ | Medium | If applicable |
```yaml
# Production Ingress with security headers and TLS
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  namespace: production
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/rate-limit: "100"
    nginx.ingress.kubernetes.io/rate-limit-window: "1m"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-Frame-Options: DENY";
      more_set_headers "X-Content-Type-Options: nosniff";
      more_set_headers "X-XSS-Protection: 1; mode=block";
      more_set_headers "Referrer-Policy: strict-origin-when-cross-origin";
      more_set_headers "Strict-Transport-Security: max-age=31536000; includeSubDomains";
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - myapp.example.com
    secretName: myapp-tls
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp
            port:
              number: 80
```
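The `letsencrypt-prod` issuer referenced by the Ingress annotation could be defined like this (a sketch using the HTTP-01 solver; the contact email is a placeholder and the `class: nginx` assumes the ingress-nginx controller):

```yaml
# ClusterIssuer — Let's Encrypt production ACME with HTTP-01 challenges
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform@example.com        # placeholder — expiry notices go here
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - http01:
        ingress:
          class: nginx
```

With this in place, cert-manager issues and renews the `myapp-tls` certificate automatically, covering checklist items 1 and 3.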
## Backup & Disaster Recovery
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | etcd backup automated (daily) | ☐ | Critical | Tested restore procedure |
| 2 | etcd backup stored off-cluster | ☐ | Critical | S3 / GCS / separate storage |
| 3 | PV snapshots configured | ☐ | High | VolumeSnapshot for databases |
| 4 | Database backup automated | ☐ | Critical | pg_dump / mysqldump + CronJob |
| 5 | Backup retention policy defined | ☐ | High | 7 daily, 4 weekly, 12 monthly |
| 6 | Restore procedure documented | ☐ | Critical | Step-by-step runbook |
| 7 | Restore tested quarterly | ☐ | Critical | Regular DR drills |
| 8 | Multi-region / multi-cluster DR plan | ☐ | Medium | For critical services |
| 9 | Velero or similar backup tool | ☐ | High | Full cluster backup |
```bash
# etcd backup CronJob
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"  # daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          containers:
          - name: etcd-backup
            image: bitnami/etcd:3.5
            command:
            - /bin/sh
            - -c
            - |
              etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key
              # Upload to S3
              aws s3 cp /backup/ s3://my-etcd-backups/ --recursive
              # Clean up local backups older than 7 days
              find /backup -mtime +7 -delete
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/lib/etcd-backups
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
EOF
```
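Item 4's database backup follows the same CronJob pattern. A sketch for PostgreSQL — the `db-backups` PVC, `postgres-credentials` Secret, and schedule are all illustrative names you would adapt:

```yaml
# Database backup CronJob — nightly pg_dump to a dedicated backup PVC
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: production
spec:
  schedule: "30 2 * * *"   # after the etcd backup window
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: postgres:16
            command:
            - /bin/sh
            - -c
            - |
              pg_dump "$DATABASE_URL" | gzip > /backup/db-$(date +%Y%m%d-%H%M%S).sql.gz
              # Apply local retention; off-cluster copy should happen separately
              find /backup -mtime +7 -delete
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: url
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: db-backups
```

As with etcd, a backup only counts once the restore has been tested — schedule a quarterly drill that restores the latest dump into a scratch database.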
## CI/CD
| # | Item | Status | Priority | Notes |
|---|---|---|---|---|
| 1 | CI pipeline runs unit + integration tests | ☐ | Critical | Every PR |
| 2 | Image built and scanned in CI | ☐ | Critical | Fail on critical CVEs |
| 3 | GitOps workflow configured | ☐ | High | ArgoCD / Flux |
| 4 | Staging environment mirrors production | ☐ | High | Same configs, smaller scale |
| 5 | Canary or blue-green deployment | ☐ | Medium | Progressive rollout |
| 6 | Rollback procedure documented | ☐ | Critical | One-command rollback |
| 7 | Deployment notifications | ☐ | High | Slack / Teams on deploy |
| 8 | Feature flags for risky changes | ☐ | Medium | Decouple deploy from release |
| 9 | Database migration strategy | ☐ | High | Backward-compatible migrations |
| 10 | Smoke tests after deployment | ☐ | High | Automated health verification |
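Item 3's GitOps workflow, using Argo CD as the example, boils down to one Application resource per environment. A sketch (the repo URL, chart path, and values file are hypothetical placeholders):

```yaml
# Argo CD Application — continuously sync production from Git
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/myapp-deploy.git   # placeholder repo
    targetRevision: main
    path: chart
    helm:
      valueFiles: ["values-production.yaml"]
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift in the cluster
```

With `selfHeal` on, rollback becomes a Git revert: restoring the previous commit restores the previous deployment, which satisfies item 6's one-command rollback.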
### Pre-Deployment Verification Script
Run this script before every production deployment:
```bash
#!/bin/bash
# pre-deploy-check.sh — run before production deployment
set -euo pipefail

NAMESPACE="production"
RELEASE="myapp"
IMAGE_TAG="${1:?Usage: $0 <image-tag>}"
REGISTRY="registry.example.com"

echo "═══════════════════════════════════════════"
echo " Pre-Deployment Checklist"
echo "═══════════════════════════════════════════"

# 1. Verify image exists
echo -n "✓ Image exists... "
docker manifest inspect ${REGISTRY}/${RELEASE}:${IMAGE_TAG} > /dev/null 2>&1 && echo "OK" || { echo "FAIL"; exit 1; }

# 2. Verify image was scanned
echo -n "✓ Image scan passed... "
trivy image --exit-code 1 --severity CRITICAL --quiet ${REGISTRY}/${RELEASE}:${IMAGE_TAG} && echo "OK" || { echo "FAIL — critical CVEs found"; exit 1; }

# 3. Dry-run the deployment
echo -n "✓ Helm dry-run... "
helm upgrade --install ${RELEASE} ./chart \
  -f chart/values-production.yaml \
  --set image.tag=${IMAGE_TAG} \
  --namespace ${NAMESPACE} \
  --dry-run > /dev/null 2>&1 && echo "OK" || { echo "FAIL"; exit 1; }

# 4. Check resource quotas
echo -n "✓ Resource quota available... "
kubectl describe resourcequota -n ${NAMESPACE} 2>/dev/null | grep -q "used" && echo "OK" || echo "WARN — no quota set"

# 5. Check PDB exists
echo -n "✓ PDB configured... "
kubectl get pdb -n ${NAMESPACE} -l app=${RELEASE} 2>/dev/null | grep -q ${RELEASE} && echo "OK" || echo "WARN — no PDB"

# 6. Check current deployment health
echo -n "✓ Current deployment healthy... "
kubectl rollout status deployment/${RELEASE} -n ${NAMESPACE} --timeout=10s > /dev/null 2>&1 && echo "OK" || echo "WARN — current deployment unhealthy"

echo ""
echo "═══════════════════════════════════════════"
echo " All checks passed. Ready to deploy."
echo "═══════════════════════════════════════════"
```
### Post-Deployment Verification
```bash
#!/bin/bash
# post-deploy-verify.sh — run after production deployment
set -euo pipefail

NAMESPACE="production"
RELEASE="myapp"
URL="https://myapp.example.com"

echo "Waiting for rollout to complete..."
kubectl rollout status deployment/${RELEASE} -n ${NAMESPACE} --timeout=300s

echo "Running health checks..."

# Health endpoint
curl -sf ${URL}/healthz || { echo "FAIL: healthz"; exit 1; }
echo "✓ /healthz OK"

# Readiness endpoint
curl -sf ${URL}/readyz || { echo "FAIL: readyz"; exit 1; }
echo "✓ /readyz OK"

# Smoke test — basic API call
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" ${URL}/api/v1/status)
if [ "$HTTP_CODE" -eq 200 ]; then
  echo "✓ API returns 200"
else
  echo "FAIL: API returned ${HTTP_CODE}"
  echo "Rolling back..."
  kubectl rollout undo deployment/${RELEASE} -n ${NAMESPACE}
  exit 1
fi

# Check no pods are crash-looping
RESTARTS=$(kubectl get pods -n ${NAMESPACE} -l app=${RELEASE} -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}' | tr ' ' '\n' | sort -rn | head -1)
if [ "${RESTARTS:-0}" -gt 2 ]; then
  echo "WARN: Pod restarts detected (${RESTARTS})"
fi

echo ""
echo "✅ Deployment verified successfully"
```
## Summary: The Golden Rules
1. Every container has resource requests AND limits
2. Every deployment has liveness, readiness, and startup probes
3. Every namespace has NetworkPolicy default-deny
4. Every image is scanned and uses a specific tag (never `:latest`)
5. Every secret comes from an external store (Vault / Sealed Secrets)
6. Every service has at least 2 replicas with PDB
7. Every deployment has a documented rollback procedure
8. Every alert has a runbook
9. Every backup has a tested restore procedure
10. Every change goes through Git (GitOps)