When we talk about cloud services, we tend to think they are perfect: they never fail and they are operational 100% of the time.
Is it the reality? Is it true? Can a system be operational forever?
I don’t think so. Do you?
Just today, Google’s cloud computing platform went down for more than an hour (the picture is at the end, because putting it here misleads readers: they start thinking “yet another external link”).
To be honest, it was a very strange failure: it went down globally, everywhere. So the principle of spawning systems in multiple regions didn’t help either (to implement high availability in the cloud, you are kindly asked to split servers across different regions). So, if all the regions go down, what happens? Your services are down. Straight and easy!
SLA, our beloved and hated target!
Are we (traditional data-centers) better than them? No, of course not. I am not saying this. For sure we cannot even dream of having their availability index! I am just saying that perfection, 100% availability, does not exist anywhere.
I still remember my professor saying “Systems fail. Design your system for failure”. This is an evergreen mantra. It is correct.
This is indeed the principle that most cloud providers rely on.
If you have time, I will explain how a company like Microsoft can guarantee 99.9% availability for its Office 365 mailbox service (something that every on-premise solution would struggle to achieve).
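To put a number on what 99.9% means, here is a minimal sketch of the downtime budget behind an availability target. The function name and the 30-day period are my own assumptions for illustration; they are not taken from any provider’s SLA document.

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime allowed over `days` days for a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)


# Compare a few common targets over an assumed 30-day period.
for target in (0.99, 0.999, 0.9999):
    budget = downtime_budget_minutes(target)
    print(f"{target:.2%} -> {budget:.1f} minutes of downtime per 30 days")
```

With these assumptions, 99.9% leaves roughly 43 minutes of downtime in a 30-day month.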
To understand how they can do it, we should consider the number of mail servers a company can realistically deploy and the number of mailboxes it needs to serve.
Let me go through this with an example. A company has the following requirements (I am using very small numbers here to keep it simple):
The idea is to split the mailboxes among the 3 servers and guarantee at least 2 copies of each mailbox in order to implement “high availability”:
With this information, the layout is rather basic:
Again, nothing difficult: a very simple random distribution of mailboxes among 3 servers.
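A random distribution like this can be sketched in a few lines. Take it as an illustration only: the mailbox names A to I, the function names and the fixed seed are my assumptions, and the layout it produces is not the one from the original table.

```python
import random
from collections import Counter


def place_mailboxes(mailboxes, n_servers, copies=2, seed=None):
    """Put each mailbox on `copies` distinct servers picked at random."""
    rng = random.Random(seed)
    return {m: sorted(rng.sample(range(n_servers), copies)) for m in mailboxes}


def server_load(placement):
    """Count how many mailbox copies each server holds."""
    return Counter(server for servers in placement.values() for server in servers)


# Hypothetical example: 9 mailboxes, 3 servers, 2 copies each.
placement = place_mailboxes("ABCDEFGHI", n_servers=3, copies=2, seed=42)
for mailbox, servers in placement.items():
    print(f"Mailbox {mailbox}: servers {servers}")
print("Copies per server:", dict(server_load(placement)))
```

A purely random pick does not guarantee a perfectly even load; with so few mailboxes one server may end up holding a few more copies than the others.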
Now, let’s simulate a server’s downtime.
Let’s say Server 0 goes down. What will happen?
The service will still be available for everyone, so availability is still 100%. There could be a performance degradation for all the users but, all in all, they will survive.
Moving forward, what will happen if Server 2 goes down in the meantime? There will be an outage.
Users E, G, I (guess what, 33% 🙂 ) will be without their mailboxes and a chain of events will be triggered:
- Users call the service desk (SD)
- SD calls the System admins
- Users call the directors
- Directors call the System admins
- System admins call……no one 🙁
- SLA is jeopardized!
In simple words, we all know what happens. Anyway, the above was just a simple example to explain how it goes.
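The two failures above can be replayed with a small sketch. Only the placement of E, G and I (both copies on servers 0 and 2) follows the text; the rest of the layout is assumed for illustration.

```python
# Hypothetical layout: mailbox -> the two servers holding its copies.
# Only E, G and I (servers 0 and 2) follow the text; the rest is assumed.
PLACEMENT = {
    "A": {0, 1}, "B": {1, 2}, "C": {0, 1},
    "D": {1, 2}, "E": {0, 2}, "F": {0, 1},
    "G": {0, 2}, "H": {1, 2}, "I": {0, 2},
}


def unavailable(placement, down_servers):
    """Return the mailboxes whose every copy sits on a failed server."""
    down = set(down_servers)
    return sorted(m for m, servers in placement.items() if servers <= down)


print(unavailable(PLACEMENT, [0]))     # [] : every mailbox still has a live copy
print(unavailable(PLACEMENT, [0, 2]))  # ['E', 'G', 'I'] : one third of the users
```

A mailbox is lost only when the set of servers holding it is fully contained in the set of failed servers, which is why the first failure costs nothing and the second one hurts.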
Moving forward, try to replace the numbers. What if I have 15,000 users and 3 servers? In this case, 5,000 users would be without mailboxes. Not a nice experience.
How can we mitigate this? By increasing the number of servers? Yes, an easy catch, John Snow, but you didn’t consider another constraint in the story: money. Money is an important factor. How many servers can you afford? Guess a number; the real one will be lower than that.
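To see why more servers help, here is a rough estimate under one assumption of mine: the 2 copies of each mailbox are spread uniformly over all possible pairs of servers. Under that assumption, the expected share of mailboxes that lose every copy is the number of pairs inside the failed servers divided by the number of all pairs. The function name is hypothetical.

```python
from math import comb


def expected_fraction_lost(n_servers, failed, copies=2):
    """Expected share of mailboxes with all copies on failed servers,
    assuming copies are spread uniformly over all server combinations."""
    return comb(failed, copies) / comb(n_servers, copies)


users = 15_000
for n in (3, 9, 100, 1000):
    share = expected_fraction_lost(n, failed=2)
    print(f"{n:>5} servers, 2 down: {share:.4%} of mailboxes, "
          f"about {users * share:.0f} of {users:,} users")
```

With 3 servers and 2 of them down this gives one third, which matches the 5,000 users above; the share drops quickly as the number of servers grows, and that is exactly the part that costs money.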
So, how does Microsoft do it?
They do it in the simplest way possible and everything is based on the following facts:
- Microsoft has huge data-centers with thousands of servers
- It mixes tenants’ mailboxes
- The SLA they establish is global. This means that the whole service must be down in order to consider a system downtime 🙂
So, how do they do it? Let’s use the previous numbers in a slightly more complicated way, in a 3-tenant scenario:
Number of servers: 9
Minimum number of copies: 2 (in reality they can afford to have 3 copies)
Mailboxes would be distributed like this:
As you can see, there is an excellent distribution of data.
Like we did before, let’s try to simulate a failure. If Server 0 goes down, who will be impacted? Pretty much no one, like before. There will probably be a performance degradation (but email is asynchronous by nature, so who cares?) but no one will complain.
Now, like we did before, let’s simulate another server going down. Let’s say Server 6 goes down as well.
How many mailboxes will be impacted?
Rather easy to say. We need to check which mailboxes are on both servers:
All in all, 2 servers went down but only one user is affected: user A1 from Company2.
So what will happen in this case? Simply nothing.
Let’s say we are particularly unlucky and Server 8 goes down in the meantime:
So, in this case, we have 3 servers down and 3 users are affected: user I from Company1, user A1 from Company2 and user H2 from Company3.
What will happen in this case? Nothing. Why will nothing happen? John Snow, Company1, Company2 and Company3 are 3 isolated brains.
They are neither aware of nor concerned about what happens to the other companies. All in all, only a tiny part of their users is down. The vast majority of their users are operational.
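Here is a sketch of the same failure seen from each tenant’s side. Only the three affected mailboxes come from the text, and even there which exact pair of servers holds I and H2 is my guess. The remaining layout, the user names A to I, A1 to I1 and A2 to I2, and the function names are assumptions for illustration.

```python
from itertools import combinations, cycle

LETTERS = "ABCDEFGHI"
TENANTS = {
    "Company1": list(LETTERS),
    "Company2": [c + "1" for c in LETTERS],
    "Company3": [c + "2" for c in LETTERS],
}

# From the text: these mailboxes are down once servers 0, 6 and 8 have failed.
# Which pair holds I and which holds H2 is an assumption.
KNOWN = {"A1": {0, 6}, "I": {0, 8}, "H2": {6, 8}}

# Assumed layout for everyone else: round-robin over the remaining server pairs.
other_pairs = cycle(
    set(pair) for pair in combinations(range(9), 2)
    if set(pair) not in KNOWN.values()
)
PLACEMENT = {
    user: KNOWN.get(user) or next(other_pairs)
    for users in TENANTS.values()
    for user in users
}


def impact_per_tenant(down_servers):
    """For each tenant, list the users with no live copy and their share."""
    down = set(down_servers)
    report = {}
    for tenant, users in TENANTS.items():
        lost = [u for u in users if PLACEMENT[u] <= down]
        report[tenant] = (lost, len(lost) / len(users))
    return report


for tenant, (lost, share) in impact_per_tenant([0, 6, 8]).items():
    print(f"{tenant}: {lost} down, {1 - share:.0%} of users still working")
```

Each company sees a single broken mailbox and 8 users out of 9 still working, so from the inside it looks like a user problem and not like an outage.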
You know what will happen? The company will start self-inflicting pain: the service desk will start looking for a problem with the user’s mailbox. All in all, the system is operational, so there must be a problem on the client’s workstation. Reset the password? Antivirus? Take your guess. No one will suspect Office 365, since the service is up for all the other users.
Now, multiply this thousands of times. We have 15,000 users split among tens of thousands of servers. Who will notice if 100 users are down at any given moment? No one!
Will we be able to sue Microsoft for reimbursement? Not at all! The SLA is calculated on the service, not on the single mailbox. Globally, the service availability will be more than 99%. In simple words: stay there, quiet and calm. Nothing to report! Business as usual.
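The arithmetic behind “more than 99%” is short. This sketch assumes availability is simply the share of mailboxes reachable at a given moment across the whole service; it is a simplification of mine and does not reproduce Microsoft’s actual SLA formula.

```python
def service_availability(total_users, users_down):
    """Share of mailboxes reachable at a given moment, across the whole service."""
    return 1 - users_down / total_users


availability = service_availability(total_users=15_000, users_down=100)
print(f"Service-wide availability: {availability:.2%}")  # 99.33%
```

For the 100 unlucky users the availability at that moment is 0%, but the service-wide number barely moves.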
What the simple examples above wanted to demonstrate is pretty much this: availability is a matter of client perception and impact. It is not so much a matter of how well a service performs.
If we check AWS’s metrics, we will see the same: instances fail (rather often) and Amazon does not say otherwise (why should they?).
What is calculated, in terms of service availability, is the availability of the whole, not the availability of the single component!
This is normal and understandable. I like to picture cloud customers as ants: they all follow pretty much the same pattern, and the providers understand it and take advantage of it. If an ant dies, the ants’ nest is not impacted. What the providers answer for, and what they care most about, is the health of the ants’ nest, not the health of the single ants 🙂