saturation

May 16, 2025 by walter

In software service observability, saturation refers to how "full" or constrained a system or service is in terms of its capacity to handle more work. It is one of the "Four Golden Signals" in the Google SRE book:

Latency, Traffic (Usage), Errors, and Saturation

Usage (traffic), Errors, and Delay (latency) describe what the system is doing right now. Saturation describes how close the system is to its resource limits, which you can see even before performance starts to degrade.

📌 Key Ways to Understand Saturation

1. Resource Utilization

Saturation is typically measured by resource usage nearing limits:

CPU usage near 100%
Memory usage nearing full allocation
Disk I/O or network bandwidth bottlenecks
Thread pool exhaustion or connection pool exhaustion

Example:
A web server has 100 worker threads and all are busy. Even if latency hasn't spiked yet, new requests are queuing up — this is saturation.
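As a rough illustration of the worker-thread example, the sketch below separates utilization (how many workers are busy) from saturation (all workers busy and requests waiting). The `PoolSnapshot` class and its field names are hypothetical and not taken from any particular web server.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Point-in-time view of a worker pool (all field names are illustrative)."""
    max_workers: int   # configured pool size
    busy_workers: int  # workers currently handling a request
    queued: int        # requests waiting for a free worker

    @property
    def utilization(self) -> float:
        """Fraction of workers that are busy, from 0.0 to 1.0."""
        return self.busy_workers / self.max_workers

    @property
    def saturated(self) -> bool:
        """True when every worker is busy and work is already waiting."""
        return self.busy_workers >= self.max_workers and self.queued > 0


# The example from the text: 100 workers, all busy, requests queuing up.
snap = PoolSnapshot(max_workers=100, busy_workers=100, queued=12)
print(f"utilization={snap.utilization:.0%} saturated={snap.saturated}")
```

A pool at 100% utilization with an empty queue is merely fully used. The waiting requests are what mark it as saturated.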

2. Queue Lengths and Backpressure

Saturation often shows up in growing queues:

Request queues (HTTP handlers, message brokers)
Internal work queues (goroutines, thread pools)
Database connection waiting queues

Example:
A Redis instance is still responding fast, but the number of client connections waiting for responses is growing — it’s becoming saturated.
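One way to make queue growth visible is to bound the queue and track how full it is. The following is a minimal sketch using Python's standard `queue.Queue`. The helper names `try_enqueue` and `fill_ratio` and the queue size of 5 are assumptions made for illustration.

```python
import queue


def try_enqueue(q: queue.Queue, item) -> bool:
    """Add an item without blocking; return False if the queue is full.

    Rejecting work when the queue is full is a simple form of backpressure:
    the caller learns right away that the service has no headroom.
    """
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


def fill_ratio(q: queue.Queue) -> float:
    """Queue depth as a fraction of capacity (requires a bounded queue)."""
    return q.qsize() / q.maxsize


work_queue = queue.Queue(maxsize=5)  # illustrative capacity
results = [try_enqueue(work_queue, n) for n in range(7)]
accepted = results.count(True)
rejected = results.count(False)
print(f"accepted={accepted} rejected={rejected} fill={fill_ratio(work_queue):.0%}")
```

Exporting the fill ratio as a metric gives a saturation signal that rises before latency does.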

3. Retries and Retry Queuing

When a system saturates, its callers often time out and retry, which adds load to a service that is already constrained and can lead to:

Retry storms
Feedback loops of congestion
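A common mitigation for retry storms, though not one this post prescribes, is to cap the number of retries and spread them out with exponential backoff plus random jitter. The sketch below computes such delays. The function name `backoff_delays` and its default values are assumptions.

```python
import random


def backoff_delays(attempts, base=0.1, cap=5.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    The upper bound doubles with each attempt (base, 2*base, 4*base, ...)
    up to `cap`. Each delay is drawn uniformly between 0 and that bound,
    so clients that failed together do not all retry at the same moment.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


# A seeded generator keeps this demonstration repeatable.
delays = backoff_delays(5, rng=random.Random(0))
for attempt, delay in enumerate(delays, start=1):
    print(f"retry {attempt}: wait {delay:.3f}s")
```

Limiting `attempts` matters as much as the delay itself, because unbounded retries keep feeding the congestion loop.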

4. Capacity Forecasting

Saturation monitoring can also be proactive: track usage trends and predict when you will run out of headroom.

Example:
You’re running at 70% memory today, and usage is growing by 5 percentage points per week. At that rate you’ll hit 90% in 4 weeks.
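The arithmetic behind this kind of forecast can be sketched as follows. The four-week figure assumes linear growth of 5 percentage points per week. If usage instead grows by 5% of its current value each week (compounding), the 90% mark is a little further away. Both function names are hypothetical.

```python
import math


def weeks_until_linear(current_pct, threshold_pct, points_per_week):
    """Weeks until the threshold, assuming growth by a fixed number of
    percentage points per week. Returns None if usage is not growing."""
    if points_per_week <= 0:
        return None
    remaining = threshold_pct - current_pct
    return max(remaining, 0) / points_per_week


def weeks_until_compounding(current_pct, threshold_pct, weekly_rate):
    """Weeks until the threshold, assuming usage grows by a fixed fraction
    of its current value each week (for example 0.05 for 5%)."""
    if weekly_rate <= 0:
        return None
    if current_pct >= threshold_pct:
        return 0.0
    return math.log(threshold_pct / current_pct) / math.log(1 + weekly_rate)


linear = weeks_until_linear(70, 90, 5)
compounding = weeks_until_compounding(70, 90, 0.05)
print(f"linear: {linear:.1f} weeks, compounding: {compounding:.1f} weeks")
```

Real usage rarely follows either curve exactly, so a forecast like this is best treated as a prompt to investigate rather than a deadline.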

🔍 Observability Indicators of Saturation

Queue depth metrics
CPU/Memory usage
GC pause times
Connection pool usage
Error spikes from downstream services
Thread exhaustion (e.g., JVM thread count)
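Several of these indicators can be checked together against thresholds. The sketch below is only an illustration: the metric names and threshold values are invented for this example and would need to be tuned for a real system.

```python
# Hypothetical thresholds, each expressed as a fraction of capacity.
THRESHOLDS = {
    "cpu_utilization": 0.85,
    "memory_utilization": 0.90,
    "connection_pool_utilization": 0.80,
    "queue_fill_ratio": 0.70,
}


def saturated_signals(sample, thresholds=THRESHOLDS):
    """Return the names of metrics at or above their threshold, sorted.

    Metrics missing from the sample are treated as 0.0 (not saturated).
    """
    return sorted(
        name for name, limit in thresholds.items()
        if sample.get(name, 0.0) >= limit
    )


sample = {
    "cpu_utilization": 0.95,
    "memory_utilization": 0.50,
    "queue_fill_ratio": 0.70,
}
flagged = saturated_signals(sample)
print(flagged)
```

Here the CPU and queue metrics are flagged, while memory is not and the missing connection pool metric is ignored.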

⚠️ Saturation ≠ Errors or Latency (but can cause them)

Saturation often precedes errors or latency spikes. Think of it as an early warning.

📖 Summary

Signal	Description	Analogy
Usage	How much work the system is doing	Cars on a highway
Errors	Failed or incorrect responses	Crashes on the road
Delay	How long requests take	Time to drive from A to B
Saturation	Nearing capacity or limits	Highway traffic jam forming

Category: 似水流年

老范的自言自语 Walter's Solo