Skip to content

SLI — Service Level Indicator#

How do you know what your system is actually doing right now?

You measure it. That measurement is your SLI.


What it is#

An SLI is simply a number your system is actively generating right now.

Not a target. Not a promise. Just a real measurement being produced by your system at this moment.

  • P99 latency = 180ms → SLI
  • 0.5% of requests are returning errors → SLI
  • System was available 99.7% of the last 30 days → SLI
  • 95% of DB queries completed under 50ms → SLI

What to measure — and what NOT to measure#

You don't measure everything. You pick the measurements that directly reflect whether your users are having a good experience.

SLIs must reflect user experience — not internal infrastructure health

Good SLIs: - Message delivery latency — users feel this directly - Error rate on the API — users see these failures - Availability — is the service up when users try to use it?

Bad SLIs: - CPU usage — a server can be at 90% CPU and users are perfectly fine - Memory usage — same problem, doesn't map to user experience - Number of database connections — internal plumbing, users don't feel this

A server can be at 10% CPU and every request is still failing

Internal metrics tell you about your infrastructure. SLIs tell you about your users. These are different things.


Wait — what's the difference between P99 and an SLI?#

This is the most common confusion after learning percentiles.

P99 is a calculation method. SLI is a concept — what you decided to measure.

  • The SLI is the what — what are we measuring?
  • P99 is the how — how are we computing that measurement?

You decide to measure response time → that decision is your SLI. You choose to express it as P99 → that's the calculation method you picked.

The same SLI can be expressed multiple ways:

SLI Expressed as Result
Response time P99 P99 latency = 180ms
Response time P50 P50 latency = 50ms
Response time Average Average latency = 60ms (but averages lie)

And not all SLIs use percentiles at all: - SLI = error rate → "0.5% of requests failed" — a ratio, no percentile involved - SLI = availability → "99.9% uptime this month" — a percentage, no percentile involved

P99 is just one tool for expressing an SLI — not the same thing as an SLI


SLIs are always ratios or percentiles — never raw counts#

"500 errors happened today" — raw count, meaningless without context. 500 errors out of 100 requests is catastrophic. 500 errors out of 10 million requests is fine.

"0.5% of requests returned errors" — a ratio, meaningful regardless of traffic volume

The test for a good SLI

If traffic doubles, does the number stay meaningful?

  • Error count doubles → not meaningful anymore
  • Error rate stays the same → meaningful, it's a good SLI

Common SLIs by system type#

System SLI examples
API / Web service Request success rate, P99 latency
Storage system Read/write success rate, P99 read latency
Data pipeline Freshness (how old is the latest data?), throughput
Batch job Job completion rate, job duration P95
Video streaming Buffering rate, playback start latency