Cloud service providers are notoriously cautious when discussing outages, service levels, and SLAs. While many trumpet “strength,” the perceived risk remains, and it is only reinforced by the massive scale and duration of recent service outages.
CloudShare has a status page – check it out here – and we think this is a very good start. It’s a simple pane to view the metrics we find critical to service uptime. We believe in simplicity and transparency, and it is our goal to communicate what we can, when we can.
More Users, More Problems:
“Why is this even news? Were you expecting 100% availability?”
Expectations rise as more critical services – healthcare, transportation, Netflix – move to the cloud. As these services provide more daily value, more users hit the servers, and more comments hit the support forum. In the ideal ecosystem, this cycle improves the service based on user input.
Take Netflix as an example. As a former CloudShare architect hypothesizes, the public cloud made Netflix a success. The company could focus on becoming an entertainment service, not a server service.
But this comes with a risk. While outbound surveys of Netflix show loyal, satisfied users, enraged Netflix users – perhaps after they were denied their latest House of Cards episode – show something different:
“Every night, it loses the movie and frequently will not restart. This is a fairly new issue within the last few months, even though my internet has more than doubled. Fix the issue!” — From the vitriol pool at Consumer Affairs dot com
Netflix’s NPS Score shows that users love Netflix
What Happens When the Internet Has More Than Doubled?
The expectations for consumer services are not governed by SLAs with assumed risk levels. So when Azure went down for eleven hours in November, the response was hotly debated – response time, alerting, and transparency are difficult tasks at scale – but in the end, a version of the root cause analysis was made public, and that’s a good thing.
Since no service can work beyond the “9s”, the rule that is developing seems to be a good one: where the cloud cannot speak technically and factually, the cloud must pass over in silence.
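For readers who want to put numbers on the “9s”, here is a rough sketch of the arithmetic in Python. It assumes a 365-day year and treats the eleven-hour Azure outage mentioned earlier as the only downtime in that year, which is a simplification for illustration. The function names are hypothetical and do not come from any provider's tooling.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def allowed_downtime_hours(nines: int) -> float:
    """Yearly downtime budget for N nines of availability (3 means 99.9%)."""
    unavailability = 10 ** (-nines)
    return HOURS_PER_YEAR * unavailability


def availability_after_outage(outage_hours: float) -> float:
    """Availability, as a percentage, for a year with one outage of this length."""
    return 100.0 * (1 - outage_hours / HOURS_PER_YEAR)


if __name__ == "__main__":
    # Print the yearly downtime budget for two through five nines.
    for n in range(2, 6):
        hours = allowed_downtime_hours(n)
        print(f"{n} nines: {hours:.2f} hours of downtime per year")

    # Illustrative only: one eleven-hour outage and nothing else all year.
    pct = availability_after_outage(11)
    print(f"One 11-hour outage: {pct:.3f}% availability for the year")
```

Under these assumptions, three nines leaves a budget of roughly 8.76 hours a year, so a single eleven-hour outage is already enough to drop a service below 99.9% for that year.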
While there will not be clear SLAs or infinitely redundant services any time soon, there will be more real-time status pages and incident histories like CloudShare’s. AWS, Google and Azure keep similar reporting. As more end users begin to rely on the cloud, the service providers in the middle must encourage users to compare reports, ask questions, and report service problems – though probably not to Netflix.