saturation

In service observability, saturation describes how "full" a system or service is: how much of its capacity is already consumed and how little headroom remains for more work. It is one of the "Four Golden Signals" described in Google's Site Reliability Engineering book:

Latency, Traffic (Usage), Errors, and Saturation

Usage (traffic), Errors, and Delay (latency) describe what the system is doing right now. Saturation describes how close the system is to its resource limits, even before performance visibly degrades.


📌 Key Ways to Understand Saturation

1. Resource Utilization

Saturation is typically measured by resource usage nearing limits:

  • CPU usage near 100%
  • Memory usage nearing full allocation
  • Disk I/O or network bandwidth bottlenecks
  • Thread pool exhaustion or connection pool exhaustion

Example:
A web server has 100 worker threads and all are busy. Even if latency hasn't spiked yet, new requests are queuing up — this is saturation.
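
As a rough illustration of the worker-thread example, the sketch below computes pool utilization and flags saturation. The `PoolSnapshot` type, its field names, and the 0.9 threshold are assumptions made for this example, not part of any particular server's API.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Point-in-time view of a worker pool (hypothetical type for illustration)."""
    busy_workers: int
    total_workers: int
    queued_requests: int


def utilization(snapshot: PoolSnapshot) -> float:
    """Fraction of workers that are busy, between 0.0 and 1.0."""
    if snapshot.total_workers <= 0:
        raise ValueError("total_workers must be positive")
    return snapshot.busy_workers / snapshot.total_workers


def is_saturated(snapshot: PoolSnapshot, threshold: float = 0.9) -> bool:
    """Saturated when utilization reaches the threshold or work is already waiting."""
    return utilization(snapshot) >= threshold or snapshot.queued_requests > 0


# All 100 workers are busy and 12 requests are waiting for a free worker.
snapshot = PoolSnapshot(busy_workers=100, total_workers=100, queued_requests=12)
print(utilization(snapshot), is_saturated(snapshot))  # 1.0 True
```

Treating any queued request as saturation is a deliberately strict choice; a real service would likely tolerate a short queue and alert only when it persists.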


2. Queue Lengths and Backpressure

Saturation often shows up in growing queues:

  • Request queues (HTTP handlers, message brokers)
  • Internal work queues (goroutines, thread pools)
  • Database connection waiting queues

Example:
A Redis instance is still responding fast, but the number of client connections waiting for responses is growing — it’s becoming saturated.
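
One way to spot a growing queue before latency moves is to watch the trend of recent depth samples rather than a single value. The sketch below is a minimal version of that idea; the `QueueDepthMonitor` class and its window size are hypothetical, and the samples would come from whatever queue-depth metric your system exposes.

```python
from collections import deque


class QueueDepthMonitor:
    """Keeps the last `window` queue-depth samples and reports whether depth is growing."""

    def __init__(self, window: int = 5) -> None:
        if window < 2:
            raise ValueError("window must be at least 2")
        self.samples = deque(maxlen=window)

    def record(self, depth: int) -> None:
        """Store one sample; the oldest sample is dropped once the window is full."""
        self.samples.append(depth)

    def is_growing(self) -> bool:
        """True when the window is full and every sample exceeds the previous one."""
        if len(self.samples) < self.samples.maxlen:
            return False
        values = list(self.samples)
        return all(later > earlier for earlier, later in zip(values, values[1:]))


monitor = QueueDepthMonitor(window=4)
for depth in [3, 5, 9, 14]:
    monitor.record(depth)
print(monitor.is_growing())  # True
```

Requiring strictly increasing samples is a simple heuristic; noisy queues may call for comparing averages over two windows instead.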


3. Retries and Retry Queuing

When a system saturates, its callers often time out and retry. The retries add load to a system that is already short on capacity, which can lead to:

  • Retry storms
  • Feedback loops of congestion
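
To see why retries feed congestion, it helps to estimate how much extra load they generate. The sketch below assumes a simplified model in which every attempt fails independently with the same probability and each caller retries up to a fixed number of times; the function name and the model are illustrative assumptions, not a description of any specific client library.

```python
def retry_amplification(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per original request under independent failures.

    With failure probability p and up to r retries, the expected number of
    attempts is 1 + p + p^2 + ... + p^r.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return sum(failure_rate ** attempt for attempt in range(max_retries + 1))


# Healthy system: almost no extra load from retries.
print(retry_amplification(0.0, 3))  # 1.0
# Half of all attempts fail: load nearly doubles.
print(retry_amplification(0.5, 3))  # 1.875
# Fully saturated system: every request is sent four times.
print(retry_amplification(1.0, 3))  # 4.0
```

The closer the failure rate gets to 1, the closer the load multiplier gets to `max_retries + 1`, which is the feedback loop described above.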

4. Capacity Forecasting

Saturation monitoring can also be proactive: track usage trends and predict when you will run out of headroom.

Example:
You're running at 70% memory today, and usage is growing by 5 percentage points per week. At that rate you'll hit 90% in 4 weeks.
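
The arithmetic behind this kind of forecast depends on what the weekly growth figure means. The sketch below shows two readings, using hypothetical helper names: a fixed number of percentage points per week (which gives the 4-week answer) and compound growth relative to current usage (which takes a little longer).

```python
import math


def weeks_until_linear(current_pct: float, threshold_pct: float,
                       points_per_week: float) -> float:
    """Weeks until the threshold if usage grows by a fixed number of points per week."""
    if points_per_week <= 0:
        raise ValueError("points_per_week must be positive")
    return max(0.0, (threshold_pct - current_pct) / points_per_week)


def weeks_until_compound(current_pct: float, threshold_pct: float,
                         weekly_growth_rate: float) -> float:
    """Weeks until the threshold if usage grows by a fixed fraction of itself per week."""
    if weekly_growth_rate <= 0 or current_pct <= 0:
        raise ValueError("weekly_growth_rate and current_pct must be positive")
    if current_pct >= threshold_pct:
        return 0.0
    return math.log(threshold_pct / current_pct) / math.log(1.0 + weekly_growth_rate)


# 70% today, +5 percentage points per week: (90 - 70) / 5
print(weeks_until_linear(70, 90, 5))  # 4.0
# 70% today, growing 5% of itself per week: 70 * 1.05^n = 90
print(round(weeks_until_compound(70, 90, 0.05), 2))  # 5.15
```

Either way, the useful output is the remaining lead time, which tells you how soon capacity needs to be added.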


🔍 Observability Indicators of Saturation

  • Queue depth metrics
  • CPU/Memory usage
  • GC pause times
  • Connection pool usage
  • Error spikes from downstream services
  • Thread exhaustion (e.g., JVM thread count)

⚠️ Saturation ≠ Errors or Latency (but can cause them)

Saturation often precedes errors or latency spikes. Think of it as an early warning.


📖 Summary

| Signal | Description | Analogy |
|---|---|---|
| Usage | How much work the system is doing | Cars on a highway |
| Errors | Failed or incorrect responses | Crashes on the road |
| Delay | How long requests take | Time to drive from A to B |
| Saturation | Nearing capacity or limits | Highway traffic jam forming |

