What are Service Levels?

You already have percentiles to measure if your system is doing well. So what problem does this framework solve?

Percentiles give you the raw numbers. Service Levels give those numbers meaning — a target, an agreement, and consequences.


The Problem

You've built a system. It's running in production. Thousands of users are using it.

How do you know if it's working well?

You could wait for users to complain. But by the time they complain, the damage is done. And "users are complaining" is not a useful signal — how many? About what exactly? How bad is it?

You need something more structured. You need to:

  1. Measure specific things about your system continuously
  2. Agree on what "good enough" looks like before anything goes wrong
  3. Commit to those standards externally so there are real consequences for failure

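That's exactly what SLIs, SLOs, and SLAs give you: an SLI is the measurement, an SLO is the agreed target, and an SLA is the external commitment.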


But We Already Have Percentiles: What's the Difference?

Percentiles tell you the shape of your performance. P99 latency = 180ms. That's a measurement.

But percentiles alone don't answer:

  • Is 180ms good or bad for this system?
  • Who agreed that 180ms is the target?
  • What happens if it goes to 300ms?

Percentiles are just raw numbers. This framework is what you build around those numbers.

Percentiles feed INTO this framework — they don't replace it

  • SLI — P99 latency = 180ms ← this IS a percentile
  • SLO — P99 latency must stay under 200ms ← target set around that percentile
  • SLA — if P99 exceeds 200ms for more than 1 hour per month, customers get a refund ← consequence
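
The three layers above can be sketched in a few lines of Python. This is a minimal illustration, not a production monitoring setup: the function names (`percentile`, `evaluate`), the constants, and the nearest-rank percentile method are all assumptions made for the example. Only the 200ms target and the 1-hour-per-month threshold come from the text.

```python
import math

# Values taken from the example above; the constant names are hypothetical.
SLO_P99_MS = 200             # SLO: P99 latency must stay under 200ms
SLA_MAX_BREACH_MINUTES = 60  # SLA: refund if breached for more than 1 hour per month


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate(latencies_ms, breach_minutes_this_month):
    """Turn raw latencies into an SLI, then check it against the SLO and SLA."""
    sli_p99 = percentile(latencies_ms, 99)          # SLI: the measurement
    return {
        "sli_p99_ms": sli_p99,
        "slo_met": sli_p99 < SLO_P99_MS,            # SLO: the internal target
        "sla_refund_owed": breach_minutes_this_month > SLA_MAX_BREACH_MINUTES,  # SLA: the consequence
    }


# Hypothetical sample: 98 fast requests, one at 180ms, one slow outlier.
report = evaluate([100] * 98 + [180, 450], breach_minutes_this_month=30)
print(report)
```

The percentile produces the number, and the SLO and SLA are just comparisons layered on top of it. In practice the breach duration would come from a monitoring system rather than being passed in by hand.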

The Three Layers at a Glance

Term | Full name               | What it is             | Who it's for
-----|-------------------------|------------------------|---------------------
SLI  | Service Level Indicator | The actual measurement | Engineering team
SLO  | Service Level Objective | The internal target    | Engineering team
SLA  | Service Level Agreement | The external contract  | Customers / business

Without vs With This Framework

Without: "Is the system healthy?" "Yeah it seems fine... it's not throwing errors?"

With: "Our SLI shows P99 latency at 180ms. Our SLO is 200ms — we're within target. Error rate is 0.3%, SLO is 1%. We have 72% of our error budget remaining for the month."

Same question. Completely different answer. The second one is specific, measurable, and tied to agreed targets, which is how a senior engineer answers.
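
The "error budget remaining" figure in the second answer can be derived in more than one way. The sketch below shows one common approach, assuming the budget is the number of failed requests the SLO allows over the month's expected traffic. The function name and the request counts are hypothetical, chosen only so the result lines up with the 72% quoted above.

```python
def error_budget_remaining(failed_requests, expected_monthly_requests, slo_error_rate):
    """Fraction of the monthly error budget still unspent (can go negative).

    Assumes the budget is "failures allowed by the SLO over the month's
    expected request volume"; other definitions (e.g. time-based) exist.
    """
    allowed_failures = expected_monthly_requests * slo_error_rate
    if allowed_failures <= 0:
        raise ValueError("SLO must allow a non-zero number of failures")
    return 1 - failed_requests / allowed_failures


# Hypothetical numbers: a 1% error-rate SLO over 1,000,000 expected requests
# allows 10,000 failures; 2,800 have occurred so far this month.
remaining = error_budget_remaining(
    failed_requests=2_800,
    expected_monthly_requests=1_000_000,
    slo_error_rate=0.01,
)
print(f"{remaining:.0%} of the error budget remains")
```

A negative result would mean the budget is already overspent, which is usually the signal to pause risky changes until the service is back within its SLO.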