A 5-minute introduction to SLA, SLO, and SLI

This is not the first time we have built critical financial services applications, but these days we are building for high efficiency, and we are building lots of “mini” applications that leverage distributed, elastic environments.

Whether we call it “breaking the monolith,” “going digital,” or “microservices,” these applications are no longer a single blob.

I usually point people to this great introductory article, but I generally still get a few follow-up questions.

The theme behind SLAs, SLOs, and SLIs is: you can only improve what you can measure.

Historically, we measured SLAs by the number of 9’s.

Example: 99.9%, 99.99%, 99.995% (three, four, and four-and-a-half 9’s), and this was a theoretical exercise at best.

I will take a quick detour here to walk through the math behind these 9’s, starting with legacy SLAs.

Hypothetically, say you are going to use an AWS EC2 machine to run a web server, and you want its static content to be served 99.99% of the time.

The EC2 SLA documentation says, “AWS will use commercially reasonable efforts to make the Included Services each available for each AWS region with a Monthly Uptime Percentage of at least 99.99%”!

AWS will return up to 30% in service credits if it cannot maintain the SLA. Now, what are you going to do? Can you offer the same to your customer?

https://aws.amazon.com/compute/sla/ [update as of Mar’19]

If you cannot refund your customer’s money, then you have to make the service highly available. How do you do that? You build a cluster of multiple EC2 machines, each at 95% availability (the most pessimistic estimate, per AWS), and calculate the probability that at least one of the servers is up and running.

We call these SLA tables. As you increase the number of servers in the cluster, you decrease the probability of service failure, never actually reaching 100%, just getting very, very close. SRE principles say that there is no 100% availability.

For n web servers, the availability of the cluster is 1 − (1 − 0.95)^n, where 0.95 (95%) is the expected availability of a single server.

So, you will build your web server farm as a small cluster to reach four 9’s for your customer: three nodes give 1 − 0.05³ = 99.9875%, just shy of four 9’s, and a fourth node (1 − 0.05⁴ ≈ 99.9994%) puts you comfortably over.
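The formula above is easy to sanity-check numerically. A minimal sketch, using the pessimistic 95% per-node estimate from earlier:

```python
# Availability of a cluster where at least one of n nodes must be up:
# A(n) = 1 - (1 - a)^n, with a = per-node availability.

def cluster_availability(a: float, n: int) -> float:
    """Probability that at least one of n independent nodes is up."""
    return 1 - (1 - a) ** n

for n in range(1, 5):
    print(f"{n} node(s): {cluster_availability(0.95, n):.6%}")
# 1 node:  95%
# 2 nodes: 99.75%
# 3 nodes: 99.9875%
# 4 nodes: 99.999375%
```

Note how quickly redundancy pays off: each extra node multiplies the failure probability by another 0.05.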

Now, let’s say your application is built from a stack of services (an example below), each with its own SLA. You make each tier horizontally scalable, compute each tier’s availability as above, and multiply all the tiers’ availabilities together to obtain the overall “application SLA.”

An n-tier application availability

Obviously, the availability of the entire application falls below your four 9’s: the more components you add, the more numbers less than 1 you multiply together, and the more points of failure you have.

How do you make this 99.99% available?

You add another data center!

1 − (1 − 0.9947)², where 2 stands for the two data centers, gives you ≈ 99.997%, comfortably above 99.99%. Voila!
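Both steps, multiplying tier availabilities in series and then composing redundant data centers in parallel, can be sketched in a few lines. The per-tier numbers below are illustrative, not taken from any real SLA:

```python
from math import prod

def serial(availabilities):
    """Tiers in series: the app is up only if every tier is up."""
    return prod(availabilities)

def parallel(a: float, n: int) -> float:
    """n redundant copies: up if at least one copy is up."""
    return 1 - (1 - a) ** n

# Hypothetical per-tier availabilities (load balancer, web, app, DB):
one_dc = serial([0.9999, 0.9995, 0.9990, 0.9963])  # ~0.9947
two_dcs = parallel(one_dc, 2)                      # ~0.99997, above four 9's
```

The asymmetry is the whole point: chaining tiers always pulls availability down, while redundancy pulls it back up.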

All of this is good on paper, but does it really work? Can you put it in a contract? Maybe not. Maybe?

So, we broke legacy SLAs into SLOs plus legalese.

The AWS table above is an SLA; it has two parts: availability, and the consequences of breaching it.

SLA = Availability (SLOs) + Consequences (e.g., payback 10%).

This might sound trivial, but it is extremely difficult to enumerate the consequences, especially in financial services applications. Traditionally, we have ignored them, which is why this has been a theoretical exercise. SLA = SLO means we cannot articulate the cost of unavailability!

But now, availability (i.e., 99.99%), a.k.a. the SLO, has many dimensions beyond just being up: responding with satisfactory latency (under 2 seconds, as an example), and serving all of my customers (10,000 expected users, as an example). Some SLOs are not system-based but are helpdesk-type metrics (say, respond within 8 hours); we put those aside for now.

So, we said, let’s break these up as well: break each SLO into multiple “measurable” metrics called SLIs. That way, technologists and operations teams can measure them at any point in time and answer, “How is the system doing?”

SLO = SLI 1 + SLI 2 + SLI 3

SLIs are specific to the type of application you are building:

If you are building a simple web server or REST endpoint, you might want to know the number of HTTP 200 responses, CPU usage, memory usage, and so on.

If you are building a Kafka publisher/consumer for order messages, you might want to know queue depth, messages processed per second, number of brokers, and average CPU across the cluster, in addition to CPU and memory usage.

You have to be judicious about which metrics you pick as SLIs; think of them as KPIs for your resource. Not every measurable metric should be an SLI.
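To make this concrete, here is a minimal sketch of two SLIs for the web-server case, a success rate and a p95 latency, computed over a window of request records. The record shape and the thresholds are made up for illustration:

```python
# Two example SLIs for an HTTP service, computed over a window of requests.
# Each record is (http_status, latency_seconds); the shape is illustrative.

def success_rate(requests):
    """SLI 1: fraction of requests answered with an HTTP 2xx status."""
    ok = sum(1 for status, _ in requests if 200 <= status < 300)
    return ok / len(requests)

def p95_latency(requests):
    """SLI 2: 95th-percentile response time, in seconds."""
    latencies = sorted(lat for _, lat in requests)
    return latencies[int(0.95 * (len(latencies) - 1))]

window = [(200, 0.12), (200, 0.30), (503, 1.80), (200, 0.25)]

# A hypothetical SLO combining the two SLIs:
slo_met = success_rate(window) >= 0.999 and p95_latency(window) < 2.0
```

The SLO is then just a predicate over the SLIs, which is exactly the SLO = SLI 1 + SLI 2 + SLI 3 decomposition above.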

So, to recap: given that “SLA” is such a heavily loaded term, not easily measurable at the highest level yet needing to be easily understood by a customer, we keep it simple.

To measure, we broke an SLA into many SLOs, and each of those SLOs into one or many SLIs.

SLIs are what actually gets recorded as time series in your centralized log and monitoring platform (say, Datadog, Splunk, or even just your log4j log file) at the lowest granularity; you can then use various aggregate measures to calculate SLOs and combine those to showcase your SLAs!
