saturation
In software service observability, saturation refers to how "full" or constrained a system or service is in terms of its capacity to handle more work. It is one of the "Four Golden Signals" in the Google SRE book:
Latency, Traffic (Usage), Errors, and Saturation
To understand saturation beyond Usage (traffic), Errors, and Delay (latency), think of it as how close the system is to its resource limits — even before it starts to show degraded performance.
📌 Key Ways to Understand Saturation
1. Resource Utilization
Saturation is typically measured by resource usage nearing limits:
- CPU usage near 100%
- Memory usage nearing full allocation
- Disk I/O or network bandwidth bottlenecks
- Thread pool exhaustion or connection pool exhaustion
Example:
A web server has 100 worker threads and all are busy. Even if latency hasn't spiked yet, new requests are queuing up — this is saturation.
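A minimal sketch of the worker-thread example above, assuming you can sample the pool size, busy workers, and queued requests. `PoolSnapshot` and its field names are hypothetical and do not come from any particular server.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Point-in-time view of a worker pool (hypothetical structure)."""
    max_workers: int      # configured pool size
    busy_workers: int     # workers currently handling a request
    queued_requests: int  # requests waiting for a free worker


def utilization(snap: PoolSnapshot) -> float:
    """Fraction of workers in use, between 0.0 and 1.0."""
    return snap.busy_workers / snap.max_workers


def is_saturated(snap: PoolSnapshot) -> bool:
    """Saturated here means: every worker is busy and work has started to queue."""
    return snap.busy_workers >= snap.max_workers and snap.queued_requests > 0


snap = PoolSnapshot(max_workers=100, busy_workers=100, queued_requests=7)
print(f"utilization={utilization(snap):.0%} saturated={is_saturated(snap)}")
```

Note that the check looks at queued work as well as utilization: a pool at 100% with an empty queue is fully used but not yet holding requests back.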
2. Queue Lengths and Backpressure
Saturation often shows up in growing queues:
- Request queues (HTTP handlers, message brokers)
- Internal work queues (channels feeding goroutines, thread pool task queues)
- Database connection waiting queues
Example:
A Redis instance is still responding fast, but the number of client connections waiting for responses is growing — it’s becoming saturated.
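One rough way to catch a growing queue is to look at the trend of recent depth samples, not a single reading. The sketch below assumes depth is sampled at a fixed interval. The function name and the five-sample window are illustrative choices, not a standard.

```python
from typing import Sequence


def queue_is_growing(depths: Sequence[int], min_samples: int = 5) -> bool:
    """Return True if queue depth rose steadily over the last `min_samples` samples.

    Simple rule (an assumption, tune for your workload): no sample in the
    window is lower than the one before it, and the last is higher than the first.
    """
    if len(depths) < min_samples:
        return False  # not enough data to call it a trend
    window = depths[-min_samples:]
    non_decreasing = all(b >= a for a, b in zip(window, window[1:]))
    return non_decreasing and window[-1] > window[0]


print(queue_is_growing([2, 3, 5, 8, 13]))  # steadily rising
print(queue_is_growing([5, 5, 5, 5, 5]))   # flat, not growing
```

A flat but deep queue is not flagged by this rule, so in practice it would be paired with an absolute depth threshold.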
3. Retries and Retry Queuing
When a system saturates, its callers often time out and retry, which adds load to an already constrained service and can lead to:
- Retry storms
- Feedback loops of congestion
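To see why retries feed on themselves, consider a call chain where every layer retries independently. In the worst case the attempts multiply at each layer. The sketch below assumes every attempt at every layer fails and is retried, and the function name is illustrative.

```python
def worst_case_amplification(attempts_per_layer: int, layers: int) -> int:
    """Worst-case requests reaching the deepest service for one user request.

    `attempts_per_layer` counts the original call plus its retries, and
    `layers` is the number of services in the chain that retry independently.
    """
    return attempts_per_layer ** layers


# Three layers, each making up to 3 attempts (1 original + 2 retries):
print(worst_case_amplification(3, 3))  # 27
```

This is an upper bound, not a prediction, but it shows how a modest retry policy can turn one slow request into dozens against a service that is already saturated.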
4. Capacity Forecasting
Saturation monitoring can also be proactive: track usage trends and predict when you'll run out of headroom.
Example:
You’re running at 70% memory today, and usage is growing by 5 percentage points per week. At that rate you’ll hit 90% in 4 weeks.
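The arithmetic behind that forecast depends on how the growth rate is read. A small sketch with two hypothetical helpers: linear growth (a fixed number of percentage points per week) gives the 4-week figure, while compounding growth (5% of the current value each week) takes a little longer.

```python
import math


def weeks_until_linear(current_pct: float, threshold_pct: float,
                       points_per_week: float) -> float:
    """Weeks until the threshold, assuming a fixed percentage-point increase per week."""
    return (threshold_pct - current_pct) / points_per_week


def weeks_until_compound(current_pct: float, threshold_pct: float,
                         rate_per_week: float) -> float:
    """Weeks until the threshold, assuming usage grows by a fixed ratio each week."""
    return math.log(threshold_pct / current_pct) / math.log(1.0 + rate_per_week)


print(weeks_until_linear(70, 90, 5))                 # 4.0
print(round(weeks_until_compound(70, 90, 0.05), 1))  # 5.2
```

Both models are simplifications. Real usage rarely grows this smoothly, so treat the result as a rough planning horizon.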
🔍 Observability Indicators of Saturation
- Queue depth metrics
- CPU/Memory usage
- GC pause times
- Connection pool usage
- Error spikes from downstream services (for example, timeouts or rejected connections)
- Thread exhaustion (e.g., JVM thread count)
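As a rough illustration, several of these indicators can be checked together against warning thresholds. The metric names and threshold values below are hypothetical placeholders. Real limits depend on the service and should come from load testing or past incidents.

```python
from typing import Dict, List

# Hypothetical warning thresholds; tune per service.
THRESHOLDS: Dict[str, float] = {
    "cpu_utilization": 0.85,              # fraction of CPU in use
    "memory_utilization": 0.90,           # fraction of allocated memory in use
    "connection_pool_utilization": 0.80,  # fraction of pooled connections in use
    "queue_depth": 100,                   # items waiting in the work queue
    "gc_pause_ms_p99": 200,               # 99th percentile GC pause, milliseconds
}


def saturated_indicators(sample: Dict[str, float],
                         thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the names of indicators at or above their warning threshold."""
    return sorted(
        name for name, value in sample.items()
        if name in thresholds and value >= thresholds[name]
    )


sample = {"cpu_utilization": 0.62, "memory_utilization": 0.93, "queue_depth": 140}
print(saturated_indicators(sample))  # ['memory_utilization', 'queue_depth']
```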
⚠️ Saturation ≠ Errors or Latency (but can cause them)
Saturation often precedes errors or latency spikes. Think of it as an early warning.
📖 Summary
Signal | Description | Analogy |
---|---|---|
Usage | How much work the system is doing | Cars on a highway |
Errors | Failed or incorrect responses | Crashes on the road |
Delay | How long requests take | Time to drive from A to B |
Saturation | Nearing capacity or limits | Highway traffic jam forming |
If you'd like, I can help you add saturation metrics to your current observability setup (e.g., with Prometheus, Grafana, or OpenTelemetry).