SLA of Cloud Services: How Do They Make It?

Reading Time: 4 minutes

Introduction

When we talk about cloud services, we tend to think they are perfect: they never fail and they are operational 100% of the time.

Is it the reality? Is it true? Can a system be operational forever?

I don’t think so. Do you?

Just today, the Google cloud computing platform went down for more than one hour (the picture is at the end of the post, because an inline link misleads readers: they start thinking "yet another external link").

To be honest it was a very strange failure: it went down globally, everywhere. So the principle of spawning systems across multiple regions didn't help either (to implement high availability in the cloud, you are kindly asked to split servers across different regions). So, if all the regions go down, what happens? Your services are down. Straight and easy!

 

SLA, our beloved and hated target!

Are we (traditional data-centers) better than them? No, of course not. I am not saying that. For sure we cannot even dream of having their availability index! I am just saying that perfection, 100% availability, does not exist anywhere.

I still remember my professor saying "Systems fail. Design your system for failure". This is an evergreen mantra, and it is correct.

This is indeed the principle that most cloud providers leverage.

If you have time, I will explain how a company like Microsoft can guarantee 99.9% availability of its Office 365 mailbox service (something every on-premise solution would struggle to achieve).

To understand how they can do it, we should compare the number of mail servers a company can realistically implement with the number of mailboxes it needs to serve.

Let me go through this with an example; a company has the following requirements (here I am using very small numbers to keep it simple):

Number of servers:        3
Number of mailboxes:      9 (users A to I)
Minimum number of copies: 2

The idea is to split the mailboxes among the 3 servers and guarantee at least 2 copies of each mailbox in order to implement high availability.

With this information, the distribution is rather basic:

Again, nothing difficult: a very simple random distribution of mailboxes among the 3 servers.

Now, let’s simulate a server’s downtime.

Let’s say the Server0 goes down. What will happen?

The service will still be available for everyone, so availability is still 100%. There could be a performance degradation for all the users but, all in all, they will survive.

Moving forward, what happens if Server2 goes down in the meanwhile? There will be an outage.

Users E, G and I (guess what: 33% 🙂) will be without their mailboxes, and a chain of events will be triggered:

  • Users call the SD (service desk)
  • The SD calls the system admins
  • Users call the directors
  • Directors call the system admins
  • System admins call… no one 🙁
  • The SLA is jeopardized!
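The whole scenario can be sketched in a few lines of Python. The placement below is hypothetical, chosen only to match the example: 9 mailboxes, 2 copies each on distinct servers, with E, G and I happening to share Server0 and Server2.

```python
# Hypothetical placement consistent with the example: 9 mailboxes,
# 2 copies each, spread over servers 0, 1 and 2.
placement = {
    "A": {0, 1}, "B": {0, 1}, "C": {0, 1},
    "D": {1, 2}, "F": {1, 2}, "H": {1, 2},
    "E": {0, 2}, "G": {0, 2}, "I": {0, 2},
}

def unavailable(placement, down):
    """Mailboxes whose every copy sits on a failed server."""
    return sorted(m for m, copies in placement.items() if copies <= down)

print(unavailable(placement, {0}))     # Server0 down -> []  (still 100% available)
print(unavailable(placement, {0, 2}))  # Server0 and Server2 down -> ['E', 'G', 'I']
```

With one server down no mailbox loses both copies; with two down, every mailbox whose pair of copies happened to be on exactly those two servers goes dark.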

 

In simple words, we all know what happens. Anyway, the above was just a simple example to explain how it goes.

Moving forward, try replacing the numbers. What if I have 15.000 users and 3 servers? In that case, 5.000 users would be without mailboxes. Not a nice experience.

How can we mitigate this? By increasing the number of servers? Yes, an easy catch, John Snow, but you didn't consider another constraint in the story: money. Money is an important factor; how many servers can you afford? Guess a number, the real one will be lower than that.

So, how does Microsoft do it?

They do it in the simplest way possible and everything is based on the following facts:

  • Microsoft has huge data-centers with thousands of servers
  • It mixes tenants’ mailboxes
  • The SLA they establish is global. This means that the whole service must be down in order to count as a system downtime 🙂

So, how do they do it? Let’s use the previous numbers in a slightly more complicated way, in a 3-tenant scenario:


Number of servers:        9
Minimum number of copies: 2 (in reality they can afford to have 3 copies)

Mailboxes would be distributed like this:


As you can see, there is an excellent distribution of data.

Like we did before, let’s try to simulate a failure. If Server0 goes down, who will be impacted? Pretty much no one, like before. There will probably be a performance degradation (but email is asynchronous by nature, who cares?) but no one will complain.

Now, like we did before, let’s simulate that another server goes down. Let’s say Server6 goes down too.

How many mailboxes will be impacted?

Rather easy to tell: we need to check which mailboxes live on both servers:

All in all, 2 servers went down but only one user is affected: user A1 from Company2.

So, all in all, what will happen in this case? Simply nothing.

Let’s say we are particularly unlucky and Server8 goes down in the meantime:

So, in this case, we have 3 servers down and 3 users affected: user I from Company1, user A1 from Company2 and user H2 from Company3.

What will happen in this case? Nothing. Why nothing? Because, John Snow, Company1, Company2 and Company3 are 3 isolated brains.

They are not aware of, nor concerned about, what happens to the other companies. All in all, only a tiny part of their users is down; the vast majority is operational.

You know what will happen? The company will start self-inflicting pain: the service desk will start suspecting a problem with the user’s mailbox. After all, the system is operational, so there must be a problem on the client’s workstation. Reset the password? The antivirus? You guess. No one will think about an Office 365 outage, since the service is up for all the other users.

Now, multiply this by thousands of times: those 15.000 users become millions, split among tens of thousands of servers. Who will notice if 100 users are down at any given moment? No one!
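The odds behind these examples can be checked combinatorially. A small sketch, assuming each mailbox’s 2 copies land on distinct servers chosen uniformly at random: the chance that a mailbox loses every copy is C(down, copies) / C(servers, copies).

```python
from math import comb

def loss_fraction(servers, copies, down):
    # Probability that ALL copies of a mailbox land on failed servers,
    # assuming copies sit on distinct, uniformly chosen servers.
    return comb(down, copies) / comb(servers, copies)

print(loss_fraction(3, 2, 2))  # 2 of 3 servers down -> 0.333... (the 33% above)
print(loss_fraction(9, 2, 3))  # 3 of 9 servers down -> 0.083...
```

With 27 mailboxes on 9 servers that is roughly 2 expected lost mailboxes (the exact count, 3 in the example, depends on the actual placement); with thousands of servers and 3 copies per mailbox, the fraction collapses towards zero.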

Will we be able to sue Microsoft for reimbursement? Not at all! The SLA is calculated on the service, not on the single mailbox. Globally, the service availability will be more than 99%; in simple words, stay there, quiet and calm. Nothing to report! Business as usual.
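As a back-of-the-envelope check with the numbers above (the 100 users down is of course a hypothetical snapshot):

```python
total_users = 15_000
down_users = 100   # hypothetical number of users down at a given moment

availability = 1 - down_users / total_users
print(f"{availability:.2%}")  # 99.33%: comfortably above a 99% global SLA
```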

 

Conclusions

What the simple example above wanted to demonstrate is that availability is a matter of client perception and impact. It is not so much a matter of how well a service performs.

If we check AWS’s metrics, we will see the same: instances fail (rather often) and Amazon does not claim otherwise (why should they?).

What is calculated, in terms of service availability, is the availability of the whole, not the availability of the single component!

This is normal and understandable. I like to picture cloud customers as ants: they all follow pretty much the same pattern, the providers understand it and leverage it. If an ant dies, the ants’ nest is not impacted. What providers answer for, and what they care about, is the health of the ants’ nest, not the health of the single ant 🙂


RabbitMQ Service

Reading Time: 4 minutes

Introduction

Recently, I have implemented a new RabbitMQ infrastructure for the company I regularly work for, which is supposed to serve several different customers and services, spanning from complex applications to CMS services to long-lasting transaction systems.

What is extremely interesting about this exercise is that the request for this new service was driven by developers, accepted by operations and implemented together. All in all, the solution is tailored to developers’ needs while respecting the normal operational constraints of a production service (i.e. backup, monitoring, etc.).

 

What is RabbitMQ?

For those of you who don’t know the service, RabbitMQ is a free/open-source implementation of the AMQP messaging protocol.

The system is written in Erlang and runs (in my specific case) on a Red Hat 7 virtual machine.

Out of curiosity, tales say that someone is running RabbitMQ on a Microsoft platform somewhere in a dark wood. I usually don’t follow these sad stories and, sincerely, I don’t know how good it is on Windows. What I know is that RabbitMQ rocks on Linux, and Linux is the OS meant for production usage.

 

Tell me more about it

I suppose anyone in the IT business knows what a messaging queue is. I am sure a quick refresh of our knowledge will not hurt anyone, btw.

A message queue is a system that lets different applications, or applications’ components, communicate in an asynchronous manner. All in all, the world around us is asynchronous by definition, and it is of paramount importance to be able to implement asynchronous solutions.

In this way, applications can easily perform tasks such as:

  • Scale depending on the need
  • Implement the so-called “in-flight transactions”
  • Implement heterogeneous solutions, taking the best of breed the market can offer nowadays
  • Offload frontend systems
  • Implement microservice components specialized in specific tasks. Kill the monolith!
  • Anything else your imagination suggests!

A message queue typically has 3 actors involved:

  • The provider, the guy/item which feeds the queue with messages
  • The queue, the container of the messages
  • The consumer, the actor(s) that take messages and perform a given action

 

Note: Real, modern IT systems usually involve more actors than this, but the basic logic remains.
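The three actors can be sketched with Python’s standard library alone (no broker involved; `queue.Queue` here merely stands in for the RabbitMQ queue, and the `None` sentinel is just a convention to stop the consumer):

```python
import queue
import threading

q = queue.Queue()                      # the queue: the container of messages

def provider():
    for i in range(3):                 # the provider feeds the queue
        q.put(f"message-{i}")
    q.put(None)                        # sentinel: nothing more to send

results = []

def consumer():
    while True:                        # the consumer takes messages and acts on them
        msg = q.get()
        if msg is None:
            break
        results.append(msg.upper())    # the "given action" performed per message

t = threading.Thread(target=consumer)
t.start()
provider()
t.join()
print(results)                         # ['MESSAGE-0', 'MESSAGE-1', 'MESSAGE-2']
```

The provider and consumer never call each other directly: they only agree on the queue, which is exactly the decoupling a broker like RabbitMQ gives you across processes and machines.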

 

A practical example

Let me give you an example of message queue usage.

You are on a social platform and you want to upload your latest profile picture (the one from your holiday last summer!).

What you usually do is upload the picture file straight from your camera. Modern cameras have billions of pixels.

The system accepts the picture but needs to post-process it. Post-processing is typically a heavy task and might include:

  • Resize the picture to a manageable format (do I really need a 20000x300000 resolution?)
  • Create a thumbnail of your picture
  • Duplicate the picture in different resolutions for different devices (HD, non-HD, mobile devices, search results list, etc.)
  • Create different versions of the picture depending on the theme of your system (black and white? Sepia? You guess)
  • Any other business (use your imagination 🙂 )

Now, we have two different ways to accomplish this:

– The “1970 IT way”: you let the user wait for 20 minutes, hoping that the last step does not fail (only God knows why it would), because if it does the user has to start from scratch again
– The “nowadays IT way”: you queue n messages for the post-processing units. Each message will be handled by a different specialized bot/algorithm which takes care of one step. If any of the steps fails, it can be reprocessed autonomously.

Which solution is the best? The 1970 one or the nowadays one?
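The “nowadays way” can be sketched like this. The step names are made up for illustration; in a real setup each step would be a message on a RabbitMQ queue, drained by specialized workers, while here a thread pool plays that role:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical post-processing steps: one message/job per step
STEPS = ["resize", "thumbnail", "hd_version", "mobile_version", "sepia"]

def process(step):
    # Placeholder for the real work of a specialized worker; a failed
    # step could simply be re-queued without restarting the upload.
    return f"{step}: done"

# The pool stands in for the pool of consumers draining the queue
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process, STEPS))

print(results)
```

The user’s upload returns immediately; the steps run independently and in parallel, and a failure in one of them does not force the others, or the upload itself, to start over.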

 

Tell me more about its implementation

The implementation is rather easy. Our design of RabbitMQ was done bearing in mind the need for multi-tenancy and service sharing, in order to optimize costs and reduce the number of infrastructure components to be maintained. Having said this, RabbitMQ has been configured to support multiple virtual hosts, where each virtual host represents a logical split between customers/tenants (the server takes care of vhost isolation).

The developers/application owners get an account that can spawn different entities into the virtual host, such as queues, exchanges and messages.

Queues inside a given virtual host are controlled by the application: we (application hosting) don’t control the nature of the content (just as we don’t enter the logic of other services), but we take care of whatever is needed to guarantee the service is fully operational.

For us a queue is a generic entity, a container of information. For you the queue is the aggregation of your application’s messages and all the messages together implement your business.

 

Why do we need it?

The question is not “why do we need it?”; the real question is “why not?”. RabbitMQ makes developers’ lives easier, allows service decoupling, makes scalability easier and so on. The cost is ridiculously low in terms of hosting, and the software comes for free even though it offers production-level robustness.

Moreover, if we do not provide the service as a commodity hosting service, developers will do it anyway, on their own, in an unprotected/unruled way. This is already happening on tens of Feeder containers. Developers need it; it is our duty to assist them in the most efficient and cost-effective way possible.

 

Companies are moving to the cloud, why now?

When we migrate to the cloud, depending on the provider, we will either have an equivalent SaaS solution or the ability to implement RabbitMQ on a virtual machine in our VPC. Think about AWS: the SQS system implements exactly the same concepts we have been discussing so far, and it does it in a decent way. It is not an AMQP-compliant system, but all in all millions of developers live without that 🙂 and we are not a special case (actually SQS does much more than this). If we want to stick with standard AMQP, we can still implement RabbitMQ in the very same way we implemented it now. Nothing changes. The cloud does not change a service’s needs; it changes the way you satisfy them.


Conclusions

I strongly believe that anyone who, nowadays, has the need to decouple a service’s functions should have a look at RabbitMQ.

The service is rock solid, easy to install and has a great maturity level.

For sure there are other competitors out there doing similar things. In my humble experience, anyway, RabbitMQ is easy and straightforward and can represent an excellent deal for your needs.

 

Amazon Web Services X1 Instances

Reading Time: 1 minute

AWS released the X1 instances. The new instance type offers 128 virtual cores and about 2 TB of RAM.

I will never stop being surprised by this job; I still feel excited reading this stuff. I feel like my son when he sees a new toy: he just wants to play with it.

The new instance is designed for big data analysis and, in particular, for SAP HANA.

I suggest checking the article; it is worth reading.

https://aws.amazon.com/blogs/aws/x1-instances-for-ec2-ready-for-your-memory-intensive-workloads/

Besides, AWS released a new guide on how to implement a new SAP HANA environment in a VPC in one hour. If I am not mistaken, it has just been released: https://s3.amazonaws.com/quickstart-reference/sap/hana/latest/doc/SAP+HANA+Quick+Start.pdf

Talking to a friend of mine, he asked: “Yes, nice, but how much will it cost?”. It is a good question, but I believe the point here is not “how much will it cost”; it is about making possible something that is usually impossible in most traditional datacenters.

Well done AWS, once again.

 

Dropbox Magic Pocket

Reading Time: 1 minute

Dropbox, one of the biggest AWS installations, implemented its own storage solution.

The read is very interesting and highlights what is under the hood of this fantastic service.

 

Actually, Dropbox has impressive numbers: they currently store 500 PB of data, and very few companies in the world have such a massive amount of storage.

 

https://blogs.dropbox.com/tech/2016/03/magic-pocket-infrastructure/