A few weeks ago, Zencoder published a new SLA. By SLA standards, we think it's pretty good: if our service is available less than 99.9% of a given month, we issue a service credit of 10%, and for every additional 1% of downtime, we issue another 10% credit, up to 100% of your monthly bill.
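To make that schedule concrete, here is a minimal Python sketch of one way to read it. The function name `service_credit_percent` is ours, and we are assuming that "every additional 1% of downtime" means each full percentage point below 99.9%; the SLA text itself is the authority on the exact rounding.

```python
from fractions import Fraction


def service_credit_percent(availability_percent: str) -> int:
    """Return the credit, as a percent of the monthly bill.

    Assumed reading of the schedule: no credit at or above 99.9%,
    10% once availability drops below 99.9%, plus another 10% for each
    further full percentage point of downtime, capped at 100%.
    """
    # Fractions avoid floating-point surprises around the 99.9 threshold.
    availability = Fraction(availability_percent)
    threshold = Fraction("99.9")
    if availability >= threshold:
        return 0
    shortfall = threshold - availability      # percentage points below 99.9%
    extra_full_points = int(shortfall // 1)   # whole additional points of downtime
    return min(100, 10 + 10 * extra_full_points)


if __name__ == "__main__":
    for measured in ("99.95", "99.5", "98.9", "95", "80"):
        print(measured, "->", service_credit_percent(measured), "% credit")
```

Under this reading, a month at 99.5% availability earns a 10% credit, and anything below roughly 90% availability hits the 100% cap.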
Frankly, though, we feel a little ambivalent about offering an SLA. Most SLAs are worthless. They often have so many caveats and carve-outs that no one will ever see a single dime. And even when they're strong, they're marginally beneficial at best; for most people, service uptime is far more valuable than the money paid for the service.
Think about it this way. If you're paying Rackspace $500/month for a server, and your server is down for 7 days, how much does that cost you? $500? Obviously not. The downtime might cost you $10,000, or even run you out of business. A service credit for downtime is a weak remedy at best.
SLAs are redundant too. It would be odd for a service provider to be motivated to stay up by an uptime SLA, because with or without an SLA, an unreliable service provider is in trouble. So at least when it comes to uptime and availability, SLAs aren't a big motivator.
So why offer an SLA at all?
First, many large businesses expect it. If you want to sell to enterprise customers, you might be forced to offer an SLA. Often, the quality of the SLA doesn't even matter: if you offer an SLA, even if it sucks, your prospects can check off another box on their checklist. That way, their back is covered if you're a bust; if you don't have an SLA, and you're unreliable, that's just one more reason for your enterprise buyer to get yelled at by his or her boss. You can approach this cynically, or you can use this as an excuse to actually put together something decent.
Second, an SLA sets expectations. It makes an implicit commitment explicit. We want to offer a service with near-zero downtime, and we're working hard to provide that, with or without an SLA. The SLA doesn't change our goals. It just puts them on paper and makes them clearly understood. This might be useful for someone offering a service on top of another; if the first provider is aiming at 99.9% uptime or a recovery point objective (RPO) of 2 hours, a second service built on top of the first can align its own SLA with these levels.
Third, an SLA can increase transparency. Ask us how we're doing, and we'll tell you. The SLA means that we can't hide that information from you, even if we wanted to.
What if you're on the other side, and you're evaluating an SLA? What should you look for? As I said before, many SLAs are so ineffectual that they're basically worthless. So here are a few quick tips for parsing an SLA.
1. What is guaranteed? What is the unit: uptime, availability, responsiveness? What is the level: 99%? What is the time granularity: 99% per month, per year?
2. What is measured? If service availability is being guaranteed, for example, what is tested? The uptime of a particular URL? For Zencoder, we're guaranteeing the availability of our API, so we guarantee that our API at https://app.zencoder.com/api/jobs will respond successfully to a valid HTTP request.
3. How is it measured? Who does the measuring, and how frequently? You'll get very different results if you check availability every minute vs. every 3 hours. Whenever possible, look for independent third-party monitoring, and not internal/closed monitoring. For example, we use Pingdom to check our service every minute. We can't fudge the results, because we aren't doing the measurement.
4. What is the remedy? If the service level isn't met, what happens? Do you get a refund or a credit? How much? When? What do you have to do to trigger the remedy, if anything? Most SLAs offer a credit towards ongoing service, not a refund; and most cap the credit amount. Zencoder offers a service credit with a cap of 100% of your monthly service cost.
5. What are the exclusions? Most SLAs will have carve-outs for problems beyond the control of the service provider. Expect a few for things like force majeure, DDoS attacks, and the unavailability of a critical third party. Here is the exclusion language from Zencoder's SLA: "The calculation of Service Availability excludes instances of: your acts or omissions, force majeure events, scheduled downtime, hackers or virus attacks, unavailability of Amazon Web Services, or emergency maintenance."
Beware of SLAs with too many exclusions. I've seen SLAs that literally exclude everything, such that no downtime whatsoever, for any reason, is covered by the SLA. It's easy to guarantee 100% uptime when 100% of downtime is excluded from the guarantee.
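To see how much the measurement interval and the exclusions can move the number, here is a rough Python sketch. Everything in it is made up for illustration: the `Check` record, the helper names, the 30-day month, and the two-hour outage. We also assume that an excluded failure is simply dropped from the calculation, which is only one plausible reading of exclusion language; it is not how Pingdom or any particular SLA computes availability.

```python
from dataclasses import dataclass

MINUTES_IN_MONTH = 30 * 24 * 60  # 43,200 one-minute checks in a 30-day month


@dataclass(frozen=True)
class Check:
    minute: int             # minutes since the start of the month
    ok: bool                # True if the monitored URL responded successfully
    excluded: bool = False  # True if an SLA exclusion covers this minute


def availability_percent(checks, honor_exclusions=True):
    """Percent of counted checks that succeeded.

    Failed checks flagged as excluded are dropped when honor_exclusions
    is True (an assumption about how exclusions are applied).
    """
    counted = [c for c in checks
               if c.ok or not (honor_exclusions and c.excluded)]
    if not counted:
        return 100.0
    return 100.0 * sum(c.ok for c in counted) / len(counted)


def build_month(outage, excluded):
    """One check per minute; `outage` and `excluded` are ranges of minutes."""
    return [Check(minute=m, ok=m not in outage, excluded=m in excluded)
            for m in range(MINUTES_IN_MONTH)]


if __name__ == "__main__":
    outage = range(1000, 1120)  # a hypothetical 120-minute outage
    month = build_month(outage, excluded=range(1000, 1090))

    raw = availability_percent(month, honor_exclusions=False)
    with_exclusions = availability_percent(month)
    every_3_hours = availability_percent(month[::180], honor_exclusions=False)
    all_excluded = availability_percent(build_month(outage, excluded=outage))

    print(f"per-minute checks, no exclusions: {raw:.4f}%")
    print(f"per-minute checks, 90 min excluded: {with_exclusions:.4f}%")
    print(f"checks every 3 hours, no exclusions: {every_3_hours:.4f}%")
    print(f"all downtime excluded: {all_excluded:.4f}%")
```

In this made-up month, the same two-hour outage reads as roughly 99.72% availability with per-minute checks, about 99.93% once 90 of those minutes are excluded, and exactly 100% when every failed minute is excluded. Sampling every 3 hours gives yet another answer (about 99.58% here) and could just as easily have missed the outage altogether.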
Finally, if you're interested in this sort of thing and have taken the time to read our SLA, we'd love your feedback. How does it look? If you were a Zencoder user, what would you like to see?