HTTPS flow explained

This is an interesting graphical explanation of how the HTTPS handshake works.

http://sudhakar.online/programming/2015/08/09/https.html

Most people working with web services don’t fully understand the implications of the HTTPS handshake. This is critical information to know, since the handshake can have a detrimental effect on high-latency network connections.
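
To get a feeling for that cost, here is a small sketch (not from the linked article) that uses Python’s standard ssl module to time the TCP connect and the TLS handshake separately; the hostname is just a placeholder.

import socket
import ssl
import time

HOST = "www.example.com"   # placeholder target, replace with your own host
PORT = 443

context = ssl.create_default_context()

t0 = time.monotonic()
raw = socket.create_connection((HOST, PORT), timeout=10)
t1 = time.monotonic()      # TCP three-way handshake done

tls = context.wrap_socket(raw, server_hostname=HOST)
t2 = time.monotonic()      # TLS handshake (certificates, key exchange) done

print(f"TCP connect:   {(t1 - t0) * 1000:.1f} ms")
print(f"TLS handshake: {(t2 - t1) * 1000:.1f} ms")
tls.close()

On a high-latency link the handshake round trips dominate, which is exactly the point made above.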

More information about this can be found on the following websites.

 

A good overview; it requires basic IT knowledge:

http://robertheaton.com/2014/03/27/how-does-https-actually-work/

A very technical article; it requires a solid IT background:

http://www.moserware.com/2009/06/first-few-milliseconds-of-https.html

Only for Chuck Norris IT level:

https://tools.ietf.org/html/rfc6101

 

Read More

Big Data and Social Cooling

The evolution of Big Data platforms

I found this article about Big Data solutions and the cooling effect they are having on our society.

Social Cooling is the effect that Big Data platforms have on the way we live the web. The rationale behind this is that a person who knows he/she is being monitored and recorded will most probably not behave as naturally as he/she would otherwise. Somehow, this is changing the way people behave in social relations.

https://www.socialcooling.com/

This topic is very interesting and a good point of discussion with people who use their brain along with their mouse. Definitely something to read.

We are reaching a moment in our IT evolution where everything is recorded, logged and stored. Forever.

Implications

When people realize that a job offer might not come because the wrong people are in their Facebook circle of friends, they will start to filter their relations accordingly.

Another important aspect is that systems never forget mistakes. If a person misbehaves once, he or she will have trouble for the rest of their life, since everything remains logged in the systems. This is very different from what we have experienced so far and certainly opens up a wide set of considerations.

Is this the right path to take? Is this what we want for humankind?

Maybe it all depends on the culture of the person, on his/her age, on education. There are people who don’t care at all about the possible implications, while others live with the constant phobia that systems are watching them, spying on them.

 

Conclusion

Where is the truth? Difficult to say, and difficult to evaluate with the scarce (public) data at our disposal. We know what companies publicly do with our data (we accept terms and conditions for the solutions we use, such as a Google account). We don’t know what else can potentially be logged (first) and used (second) behind the scenes.

Is this really the direction we want our society to move in? Is this what we wish for our future? I don’t think so. What is your opinion about it?

 

Read More

Let’s Encrypt, free SSL encryption for everyone.

Let’s Encrypt!

I recently started to focus on the Let’s Encrypt free certificate authority, a total revolution in the way encryption is offered. I thought it would be useful to write this article, and here we are.

For those of you who don’t know what it is, here is a short description from the service’s home page:

 

Let’s Encrypt is a free, automated, and open certificate authority (CA), run for the public’s benefit. It is a service provided by the Internet Security Research Group (ISRG).

 

At the risk of oversimplifying the concept behind it: you can have a fully trusted, fully operational SSL certificate for free.

Wait, hold on, free SSL certificates? Why?

HTTPS Everywhere logo

The Let’s Encrypt initiative is the result of a worldwide movement that aims to encrypt the entire internet. In other words, the basic principle of this movement is that the ability to protect one’s own data is one of the rights of a modern internet citizen.

The initiative started thanks to three main players: the Electronic Frontier Foundation, the Mozilla Foundation and the University of Michigan.

Since then, several sponsors have started to support the Let’s Encrypt initiative. Here is a link listing all the major players behind it (https://letsencrypt.org/sponsors/).

More links about this topic can be found at the end of this article.

 

 

What is the difference between this CA and others?

The “Let’s Encrypt” CA is a fully featured, fully recognized CA, period.

The differences between this CA and other CAs mainly lie in:

  • The level of professional support they provide
  • No warranty on the issued certificates (while others offer up to several million dollars of warranty)
  • It is fully automated; there is no human interaction and there are no (useless) additional controls
  • The CA does not validate the legal or offline identity of the certificate applicant
  • The issued certificates cannot be used for S/MIME email signing or code signing
  • Their certificates are valid for a maximum of 90 days. This is very important. The good news is that they offer a tool suite which makes renewal extremely easy (and it can be automated :))

Is it a well-recognized Certification Authority?

The certificate authority is well recognized by modern browsers and operating systems:

– Mozilla Firefox, which implements its own certificate authority trust store, recognizes it.
https://mozillacaprogram.secure.force.com/CA/CACertificatesInFirefoxReport

– Chrome, which uses the underlying OS certificate trust store, is covered on all major operating systems (Windows, Linux, macOS).
http://www.chromium.org/Home/chromium-security/root-ca-policy

– Internet Explorer uses the underlying OS CA trust mechanism as well.

Windows recognizes the Let’s Encrypt CA since it is cross-signed by “DST Root CA X3”, which is part of the Windows trusted root CAs.
https://social.technet.microsoft.com/wiki/contents/articles/37425.microsoft-trusted-root-certificate-program-participants-as-of-march-9-2017.aspx

(for a full set of compatibility information, refer to this article https://community.letsencrypt.org/t/which-browsers-and-operating-systems-support-lets-encrypt/4394)

Why should I use an SSL certificate?

To protect the information you handle, to guarantee your identity, and to protect data in transit.

Because data is the most valuable asset we have and it is our duty to protect it.

Do I need to protect non-production systems?

My production system is already covered by production certificates; do I really need to protect the other systems?

Actually yes.

Every system handling actual data (application data sets, user credentials, authentication tokens) is exposed to possible attacks and malicious access.

 

How difficult is it?

Generating an SSL certificate with Let’s Encrypt is extremely easy, and the tooling supports multiple platforms and web servers.

I honestly only operate in Linux environments and cannot judge the complexity on a Windows system but, hey, they built an extremely easy tool set.

Using Fedora (I use Fedora 25), the installation of the letsencrypt utility is just a “dnf” away.

sudo dnf install letsencrypt

Once the utility is installed, generating a certificate is very simple; it only requires publishing a check file on your website, containing a special hash code (provided during the process):

 letsencrypt --text --email recovery@example.com --domains www.example.com,example.com,foo.example.com --agree-tos --renew-by-default --manual certonly

Once done, all the needed files are stored in the /etc/letsencrypt folder structure.

In particular, the full chain, the public key and the private key are generated (I used the PEM format).

On Red Hat systems the process can be a bit more cumbersome; this guide can help: http://www.tecmint.com/install-lets-encrypt-ssl-certificate-to-secure-apache-on-rhel-centos/

 

Some useful links

Electronic Frontier Foundation

Wired – Half the web is now encrypted

Wired – A scheme to encrypt the entire web is actually working

 

Read More

How to create a Linux Daemon

Introduction

I have developed several Linux daemons and Windows services in my career. The differences between the two systems are enormous.

I personally prefer to develop Linux daemons due to my love of Linux development. I feel at home working on Linux.

What is a Linux daemon and how does it differ from a traditional user-space program?

A Linux Daemon has the following characteristics and usually performs the following macro activities:

  • When it starts, it checks whether another instance of the daemon is already running. If so, it dies.
  • It resets the file creation mask (umask)
  • It forks itself and terminates the original process, leaving only the child alive. In this way the process detaches from the controlling terminal. When the parent dies, the child is re-parented to init (init is the root/ancestor of all processes running on Linux) and becomes the session leader
  • It changes the current directory to the root file system.
  • It redirects standard input, output and error to /dev/null
  • It opens a syslog stream for logging (to /var/log/messages on Red Hat, for instance). Logging is performed through the syslog daemon (this may vary depending on your needs)
  • It creates/updates a PID file in /var/run/ for single instancing. Usually the file has a .pid extension, though a different extension can be used
  • It reads its configuration file from /etc/ (this is optional, depending on the nature of your service)
  • It sets up a signal handler (or a dedicated signal-handling thread) to interact with the OS

The above task list describes the generic/basic behavior of a typical Linux daemon.

Now, let’s see step by step what each operation does and how it is typically implemented. (more…)
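
Before diving into the details, here is a minimal sketch of how those steps fit together. It is written in Python purely for readability (real daemons are usually written in C, and classic implementations fork twice); the pid file path and program name are made up.

import os
import sys
import signal
import syslog

PID_FILE = "/var/run/mydaemon.pid"   # hypothetical name; single-instance marker

def already_running():
    # single-instance check: read the pid file and test whether that pid is alive
    try:
        with open(PID_FILE) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return False
    try:
        os.kill(pid, 0)              # signal 0 only checks that the process exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True                  # it exists but belongs to someone else
    return True

def daemonize():
    if already_running():
        sys.exit(1)                  # another instance is running: die

    os.umask(0)                      # reset the file creation mask

    if os.fork() > 0:                # fork and let the original process die;
        sys.exit(0)                  # the child is re-parented to init
    os.setsid()                      # the child becomes the session leader

    os.chdir("/")                    # change the current directory to the root fs

    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):             # redirect stdin, stdout and stderr to /dev/null
        os.dup2(devnull, fd)

    syslog.openlog("mydaemon")       # log through the syslog daemon

    with open(PID_FILE, "w") as f:   # create/update the pid file
        f.write(str(os.getpid()))

    signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))  # interact with the OS

if __name__ == "__main__":
    daemonize()
    syslog.syslog("daemon started")
    signal.pause()                   # the real work loop would go here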

Read More

About end-to-end encryption

Introduction

How many times have you felt protected by the new end-to-end encryption feature of your favorite mobile chat application?

I recently participated in a work group on data security in Brussels. The work group was organized by the Brussels Privacy Hub together with the ICRC (International Committee of the Red Cross). It was about data security in the cloud and the possible security/privacy implications of using messaging applications.

The work group was an excellent opportunity for all the attendees to discuss these topics as applied to popular cloud services and, in addition, the security implications of the top players in the instant messaging market. Particular attention was given to the legal aspects involved in their usage and the applicability of the immunities and privileges of humanitarian and United Nations organizations.

I expected to be in a technical work group. Instead, I ended up discussing the aforementioned topics from totally different standpoints: the legal standpoint and all the implications for beneficiaries and core entities in our business. It was an incredibly good opportunity for me, very well-spent time.

During the workshop days, I noticed that the concept of end-to-end encryption can be a misleading topic for most non-technical people.

On one side, people tend not to understand some basic concepts about key exchange; that is normal, as not everyone in this world has a background in data encryption techniques. On the other side, the concept itself creates a lot of expectations and a false certainty of data security and protection. For instance, the general perception is that, since WhatsApp now has end-to-end encryption, the system is 100% secure.

While end-to-end encryption is surely an excellent mechanism for protecting data exchange, it does not cover some aspects that I believe are of paramount importance in this area.

Security of the endpoint devices

Since the keys are generated by the endpoints and exchanged between the two of them, people generally consider the keys to be safe. After all, the provider does not have the keys, and supposedly no one else in the world can have the secret keys. This is an extremely wrong assumption, because the endpoints are actually the weak point of this mechanism. If an endpoint (i.e. a mobile device) is compromised, the keys can be (and are) compromised as well. Keys must be stored somewhere (either on storage or in memory) and must be loaded by the software in order to be used. What if your mobile phone, running WhatsApp, has a malware program running on it that silently snoops your encryption keys? I bet the concept is easy to understand: the system is secure only if we can guarantee the security of the underlying components. If the device, the OS or the hardware is compromised, there is very little you can do to protect the application running on top of it.

 


 

 

Man in the middle

This kind of security weakness is related to the fact that an eavesdropper can potentially impersonate the recipient you intend to communicate with. The situation is rather simple to explain. Let’s consider two valid endpoints, A and B, and a third endpoint, which I will call C; C is the bad guy. The attack consists in C playing the role of one of the two real actors; for simplicity, let’s say that C impersonates B.

So, since C is pretending to be B, the key exchange occurs between A and C (remember that the handshake occurs between the two devices without any ruling entity, and therefore real identity verification cannot happen).

In this situation, C intercepts the messages and establishes a forward handshake with the actual endpoint B.

In simple words, A talks to C, which forwards to B, and the very same happens in the other direction.

The flow will look normal to both A and B, and they are not aware that there is an invisible man in the middle. C is silently listening.

How difficult is this to accomplish? It is rather difficult, it is not easy. Is it impossible? No, it is not. Is it probable? I think it is quite probable. It all depends on whether the exchanged information is relevant to skilled malicious hackers.
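
To make the idea concrete, here is a toy sketch (deliberately tiny numbers, nothing close to real cryptography) of a Diffie-Hellman-style exchange with no identity verification: C simply runs one exchange with A and another with B, then reads and relays everything in between.

# Toy man-in-the-middle illustration: parameters are far too small for real use.
p, g = 23, 5

a_priv, b_priv, c_priv = 6, 13, 15       # secrets chosen by A, B and the attacker C

# A thinks it is exchanging keys with B, but C answers instead
a_pub = pow(g, a_priv, p)
c_pub = pow(g, c_priv, p)
key_A_C = pow(c_pub, a_priv, p)          # the key A unknowingly shares with C
assert key_A_C == pow(a_pub, c_priv, p)

# C then opens a second, perfectly normal exchange with the real B
b_pub = pow(g, b_priv, p)
key_C_B = pow(b_pub, c_priv, p)
assert key_C_B == pow(c_pub, b_priv, p)

def xor_cipher(data: bytes, key: int) -> bytes:
    # stand-in for a real cipher, just to show who can read what
    return bytes(b ^ key for b in data)

msg = b"hello B"
wire = xor_cipher(msg, key_A_C)                            # A -> "B" (really C)
relayed = xor_cipher(xor_cipher(wire, key_A_C), key_C_B)   # C reads it, then re-encrypts for B
print(xor_cipher(relayed, key_C_B))                        # B happily decrypts b'hello B'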

 


 

Software vulnerabilities and backdoors

We always tend to consider the software we use as a bug-free component.

At the same time, we tend to trust the most popular software, because millions (billions?) of people are already using it. I believe we somehow behave like herds of gnus that need to cross crocodile-infested rivers: they do it because they need to, driven by their instincts. Even though we have evolved, we still show the same behavior. I need to use a piece of software even if I know it has drawbacks, and at the same time I feel reassured by the fact that millions of people are placing the same trust in the software manufacturer. The reasoning becomes: there are millions doing it, why should it be me?

Guess what, no software is perfect. More importantly, we cannot be 100% sure that a piece of software does not contain a potential backdoor. I am saying this not to make you paranoid or to start the useless discussion about how “the NSA can access a system if they want”. It is just a professional consideration of what a piece of software can potentially cause, either by mistake (a bug acting as a backdoor) or by intention (a real backdoor).

What if the software has a flaw that a malicious attacker can leverage to obtain your encryption keys? In that case we would not even notice the vulnerability.

How can you protect yourself from this? There is little anyone can do about it; we just need to be aware of this potential problem.

Are we all doomed? No, we are not. Just try not to be overconfident in the security of certain services.

 

Conclusion

To conclude, I strongly believe that it is quite safe to use today’s instant messaging software for personal purposes.

All in all, no third world war will start if our conversations are sniffed. It all depends on how important the messages we are exchanging are.

At the same time, the false reassurance given by end-to-end encryption can be misleading.

Having said this, I wouldn’t use instant messaging software to communicate sensitive information: things like bank account details or any other item that could jeopardize my security or the security of my organization.

 

 

 

Read More

Need to scale up 10X? It’s a cakewalk, in a real DevOps environment

Introduction

Hey guys, today I would like to share one of the recent experiences I have had while carrying out my duties at WFP.

We recently deployed a new platform consisting of a set of RESTful APIs that serve as a data gateway/broker between different systems.

Until now, the API gateway system has been delivered on a standard hosting container solution and its cost was ridiculously low, less than 100 dollars per month. The system currently exposes more than 30 million records and serves around 40,000 requests per day. All in all, the numbers are growing as we speak (or better, as you read).

Everything worked smoothly until the usage and the load requirements of the API gateway increased. It was a very good sign: the business was growing. At the same time, the need to scale up arose. The good thing is that we were all ready for it; it didn’t find us unprepared. We were ready (Developers and Operations) not because the API gateway is a special case, but because, thanks to all the efforts made, we had the right culture to support those changes quickly and effectively. All in all, the system and the service were designed the way they should be.

In this article I will try to share the “how”: how we were able to scale an application stack up about ten times with no particular effort. All it takes is the right approach, a good understanding of what running a modern IT service means, and a strong automation culture.

 

Tell me more, I am curious

The API gateway used to be operated by a simple infrastructure composed of 7 different elements:

A – A Django/Python application server, the nerve of the solution

B – An Nginx front-end HTTP server to serve static content and route requests to the right app pool

C – A Celery task manager to perform scheduled and/or asynchronous activities

D – A reverse proxy service, to handle client connections and redirect HTTP requests

E – A Redis server for data caching

F – A RabbitMQ server for asynchronous in flight transactions

G – A database service, to store authentication/authorization data

All these services were hosted on simple Feeder containers. All in all, the cost of the infrastructure was around $100 per month: a very low price for such a complex and rich service.

Once the load on the service increased, we reacted very fast. We met (a 5-minute coffee), we implemented the new platform (in around 1 hour) and voilà, the new API Gateway, withstanding 10 times the original load, was ready to go.

How did you do it?

To answer this question there are many concepts we need to bring to the table. The most important one is “culture change”.

Developers no longer see Operations as an antagonist, and Operations no longer see Developers as a problem. A success is a success for both, and so is a failure. This is the way we are.

No one likes to fail and everyone enjoys succeeding; we are human beings and we like to do our work in the best way possible, no compromises.

The only way to keep up with today’s IT requirements is to work together in a seamless manner.

In this environment it is very easy to operate. A need is identified and we know exactly who does what. We have no doubt about whether the other team will be able to deliver what it is supposed to deliver. We know they will make it. And this is fantastic; it is incredibly good to work this way.

In this specific case, the result was: spawn additional containers, configure them automatically using automated deployment techniques, reconfigure the load balancing system, and the game was done. Who did what? Why do you care? It is not important; it was the DevOps team :)


Tell me more, give me some detail

Sure, this is the reason why I am writing this article.

Let me start from the basics, the hosting component. We spawned 4 brand-new Feeder containers with 4 cores and 8 GB of RAM each in a matter of 10 minutes. How? Using our internal orchestration tool. It is the key to success here, at least for the hosting component. The mantra “if you do it more than twice, automate it” worked and will always work in these cases. We did our homework and we were ready for such challenges.

Then, once you have multiple hosting environments ready, it is as if you have a set of empty boxes. How do you start using them? It is simple: automation. In particular, each and every application deployed in our IT is deployed using Ansible. Ansible is an agentless automation system (https://www.ansible.com/); it is free and does not require a dedicated infrastructure. Nice? No, amazing.

If you have used Puppet or Chef in the past, Ansible does pretty much the same things but does not require a client installation. Which is the best solution in this field? It depends on your needs. Is Ansible the best? I don’t know; I know that it fits our needs best, and this is enough. We evaluated pretty much all of them, and Ansible looks to be the best for us.

So, to cut a long story short, deployment is just one click away. It is a matter of updating your inventory (hosts) file and updating the playbook to deploy on multiple containers instead of a single one.

Is it really that easy? If everything is done by the book, yes. But it requires good application design. Why? Keep reading.

Can I do it for any application?

In order to serve an application from multiple nodes (possibly load balanced) in a multi-master scenario, one of the key elements is a stateless design.

I assume that everyone with a bit of application design knowledge knows what it means to be stateless. Stateless design is a way of designing your services so that you don’t use server-side sessions and you don’t keep transaction state. Each HTTP operation (GET, POST, etc.) is unrelated to the previous ones.

A good article, if you are interested in this topic, is the one from Rackspace called “Coding in the cloud – rule 3” (http://blog.rackspace.com/coding-in-the-cloud-rule-3-use-a-stateless-des…). It is worth reading.

The concept is very basic, yet many applications fail to implement it. Delivering good performance with this pattern is painful and requires skill; that’s why many applications fail at it.
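
To make the idea concrete, here is a tiny hypothetical sketch (not the actual API Gateway code): every request carries a signed token instead of relying on a server-side session, so any node behind the load balancer can validate it with a shared secret.

import base64
import hashlib
import hmac
import json

SECRET = b"shared-secret-from-config"          # hypothetical; distributed to every node

def issue_token(user_id: str) -> str:
    # everything the server needs later travels inside the token itself
    payload = base64.urlsafe_b64encode(json.dumps({"user": user_id}).encode())
    sig = base64.urlsafe_b64encode(hmac.new(SECRET, payload, hashlib.sha256).digest())
    return (payload + b"." + sig).decode()

def handle_request(token: str) -> dict:
    # any node can validate the token with the shared secret: no session store needed
    payload, sig = token.encode().split(b".")
    expected = base64.urlsafe_b64encode(hmac.new(SECRET, payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("invalid token")
    return json.loads(base64.urlsafe_b64decode(payload))

token = issue_token("alice")
print(handle_request(token))                   # works the same on node 1, node 2, node N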

So, let me say this: our API Gateway does it, and it does it in an excellent way. The way it implements this design pattern is simply a work of art, an example many applications should follow. I leave the description of its internals to the developers. What I can say is that, for the first time in my career, I asked a developer “Can I go for a multi-master configuration?” and the developer looked at me to check whether I was serious and then said, smiling, “Sure you can, we designed the service for it”.

In addition to this, our API Gateway implements several other features that make scaling operations easy, such as a distributed multi-layer caching architecture, predictive data loading and asynchronous logging. All features that make the system capable of scaling when needed.

Well, to cut a long story short, I have managed huge clusters and several load-balanced systems in the past, and this is one of the easiest implementations I have ever managed (easy is better in this case; we all love the KISS approach :) )

To answer the original question, “Can I do it for any application?” the response is “yes, if you did your homework”.

 

How do you balance the load?

Here in our datacenter we implemented a cluster of Nginx reverse proxies which do an excellent job at this.

I have been using Apache for more than 10 years. If you had asked me 3 years ago about Apache vs Nginx, I would have told you “my life for Apache”. Now I use Nginx everywhere and I am an active Nginx follower/contributor. I would use Nginx everywhere, even in my coffee machine if possible; it is simply incredible.

Nginx, in the way it is set up here, makes this job extremely easy: it is just a matter of defining a new upstream, configuring the balancing rules and reloading the reverse proxy. It turns into a simple software configuration what, until a few years ago, would have required dedicated hardware appliances for load balancing with sticky sessions (who remembers the Cisco load balancers?).

With Nginx it is easy and straightforward to implement microservices and load balancing. I suggest checking it out if you don’t know it already.

The primary node serves hundreds of applications for thousands of users, and this is how “overloaded” the node is:

 

We serve around 100 services, thousands of users and millions of GET/POST operations per day, with SSL termination for all our services, and this is the hardware used by the proxy services: less than a GB of RAM (the rest is buffering). Surprised? Don’t be, this is Nginx.

 

Come on guys, everyone is moving to the cloud. Why do it now?

The solution itself is ready for the cloud as well. Nothing changes there, we will just have more flexibility.

When we run the API Gateway on AWS, we will scale up in similar ways, either using ELB in place of the on-premise Nginx-based load balancer or using built-in autoscaling features. The core concepts remain and the solution applies to the cloud as well; only the tools used will change.

We should see AWS or Azure as a baseline for the implementation of these solutions. Surely life would be easier and less complicated, since most of the implemented solutions would be delivered in a SaaS flavor.

 

Conclusion

With this article I wanted to share the advantages of automation and orchestration. Such results couldn’t be achieved without strong cooperation and trust between the different parties. In the past, the way IT was until a few years ago, the approach would have been to point fingers at the other team for the failure.

The way it is now is to have a friendly talk about how to improve things and choose the best way to move forward.

In the past we used to scale vertically; now we scale horizontally, the way it should be.

Before, IT took weeks to implement a fix or a new architecture; now it is done in less than a working day.

This is the way to go guys, this is the IT we all love!

Special thanks to all the people involved in this, from both teams, development and operations

Read More

SLA of Cloud Services. How can they make it?

Introduction

When we talk about cloud services, we always think they are perfect, they never fail and they are operational 100% of the time.

Is it the reality? Is it true? Can a system be operational forever?

I don’t think so. Do you?

Just today, the Google Cloud Platform went down for more than one hour (picture at the end of the post, because putting it here would mislead readers into thinking “yet another external link”).

To be honest it was a very strange failure: it went down globally, everywhere. So the principle of spawning systems in multiple regions didn’t help either (in order to implement high availability in the cloud, you are kindly asked to split servers across different regions). So, if all the regions go down, what happens? Your services are down. Plain and simple!

 

SLA, our beloved and hated target!

Are we (traditional data centers) better than them? Of course not, I am not saying that. For sure we cannot even dream of having their availability figures! I am just saying that perfection, 100% availability, does not exist anywhere.

I still remember my professor saying “systems fail, design your system for failure”. This is an evergreen mantra, and it is correct.

This is indeed the principle that most cloud providers leverage.

If you have time, let me explain how a company like Microsoft can guarantee 99.9% availability for its Office 365 mailbox service (something that every on-premise solution would struggle to achieve).

To understand how they can do it, we should consider the number of mail servers a company can realistically run and the number of mailboxes it needs to serve.

Let me go through this with an example; a company has the following requirements (here I am using very small numbers to keep it simple): 3 mail servers and 9 mailboxes (users A to I).

The idea is to split the mailboxes among the 3 servers and guarantee at least 2 copies of each mailbox in order to implement “high availability”.

With this information, the distribution is rather basic:

Again, nothing difficult: a very simple random distribution of mailboxes among the 3 servers.

Now, let’s simulate a server’s downtime.

Let’s say Server 0 goes down. What will happen?

The service will still be available for everyone, so availability is still 100%. There could be some performance degradation for all the users but, all in all, they will survive.

Moving forward, what will happen if “Server 2” goes down in the meantime? There will be an outage.

Users E, G, I (guess what, 33% :) ) will be without their mailboxes and a chain of events will be triggered:

  • Users call the SD (service desk)
  • The SD calls the system admins
  • Users call the directors
  • The directors call the system admins
  • The system admins call… no one :(
  • The SLA is jeopardized!

 

In simple words, we all know what happens. Anyway, the above example was just a simple one to explain how things go.

Moving forward, try replacing the numbers. What if I have 15,000 users and 3 servers? In this case, 5,000 users would be without their mailboxes. Not a nice experience.

How can we mitigate this? By increasing the number of servers? Yes, an easy catch, Jon Snow, but you didn’t consider another constraint in the story: money. Money is an important factor; how many servers can you afford? Guess a number, the real one will be lower than that.

So, how does Microsoft do it?

They do it in the simplest way possible and everything is based on the following facts:

  • Microsoft has huge data-centers with thousands of servers
  • It mixes tenants’ mailboxes
  • The SLA they establish is global. This means that the whole service must be down in order for it to count as a system downtime :)

So, how do they do it? Let’s use the previous numbers in a slightly more complicated way, in a 3-tenant scenario:


  • Number of servers: 9
  • Minimum number of copies: 2 (in reality they can afford to have 3 copies)

Mailboxes would be distributed like this:


As you can see, there is an excellent distribution of data.

As we did before, let’s try to simulate a failure. If Server 0 goes down, who will be impacted? Pretty much no one, as before. There will probably be some performance degradation (but email is asynchronous by nature, who cares?) but no one will complain.

Now, as we did before, let’s simulate another server going down. Let’s say Server 6 goes down as well.

How many mailboxes will be impacted?

Rather easy to tell: we need to check which mailboxes have copies on both of those servers:

All in all, 2 servers went down but only one user is affected: user A1 from Company2.

So, all in all, what will happen in this case? Simply nothing.

Let’s say we are particularly unlucky and Server8 goes down in the meantime:

So, in this case, we have 3 servers down and 3 users affected: user I from Company1, user A1 from Company2 and user H2 from Company3.

What will happen in this case? Nothing. Why will nothing happen? Jon Snow, Company1, Company2 and Company3 are three isolated brains.

They are neither aware of nor concerned about what happens to the other companies. All in all, only a tiny part of their users is down. The vast majority of their users is operational.

You know what will happen? The company will start inflicting pain on itself: the service desk will start suspecting a problem with the user’s mailbox. After all, the system is operational. Maybe there is a problem on the client’s workstation? Reset the password? Antivirus? You guess. No one will think about an Office 365 outage, since the service is up for all the other users.

Now, multiply this thousands of times. We have 15,000 users split among tens of thousands of servers. Who will notice if 100 users are down at any given moment? No one!

Will we be able to sue Microsoft for reimbursement? Not at all! The SLA is measured on the service, not on the single mailbox. Globally, the service availability will be above 99%; in simple words, sit there, quiet and calm. Nothing to be reported! Business as usual.
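
As a rough illustration of the argument above, here is a small Python sketch (made-up numbers, not Microsoft’s real layout) that spreads mailbox replicas over a pool of servers and counts how many users lose every copy when servers fail.

import random

random.seed(1)

SERVERS = list(range(9))                    # 9 servers, as in the example above
USERS = [f"user-{i}" for i in range(90)]    # hypothetical mailboxes
COPIES = 2                                  # each mailbox is kept on 2 different servers

placement = {u: random.sample(SERVERS, COPIES) for u in USERS}

def impacted(down):
    # a user is impacted only when *all* of their replicas sit on failed servers
    return [u for u, replicas in placement.items()
            if all(s in down for s in replicas)]

for down in ({0}, {0, 6}, {0, 6, 8}):
    hit = impacted(down)
    print(f"servers down {sorted(down)}: {len(hit)}/{len(USERS)} mailboxes unavailable")

Even with three servers down, only a handful of mailboxes lose both copies, while the service as a whole stays “available”.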

 

Conclusions

What the simple example above was meant to demonstrate is that availability is largely a matter of client perception and impact. It is not so much a matter of how well a service is performing.

If we check AWS’s metrics, we will see the same: instances fail (rather often) and Amazon does not claim otherwise (why should they?).

What is calculated, in terms of service availability, is the availability of the whole, not the availability of the single component!

This is normal and understandable. I like to picture cloud customers as ants: they all follow pretty much the same pattern, the providers understand it and they leverage it. If an ant dies, the ant nest is not impacted. What providers answer for, and the thing they care most about, is the health of the ant nest, not the health of the single ant :)

 

 

 

 

Read More

RabbitMQ Service

Introduction

Recently, I implemented a new RabbitMQ infrastructure for the company I regularly work for, which is supposed to serve several different customers and services, ranging from complex applications to CMS services to long-lasting transaction systems.

What is extremely interesting about this exercise is that the request for this new service was driven by developers, accepted by operations and implemented together. All in all, the solution is tailored to developers’ needs while respecting the normal operational constraints of a production service (i.e. backup, monitoring, etc.).

 

What is RabbitMQ?

For those of you who don’t know the service, RabbitMQ is a free/open-source implementation of the AMQP messaging protocol.

The system is written in Erlang and, in my specific case, runs on a Red Hat 7 virtual machine.

Out of curiosity, tales say that someone is running RabbitMQ on a Microsoft platform somewhere in a dark wood. I usually don’t follow these sad stories and, sincerely, I don’t know how good it is on Windows. What I know is that RabbitMQ rocks on Linux systems, and Linux is the OS meant to be used in production.

 

Tell me more about it

I suppose anyone in the IT business knows what a message queue is. I am sure a quick refresh of our knowledge won’t hurt anyone, though.

A message queue is a system that lets different applications, or components of an application, communicate in an asynchronous manner. After all, the world around us is asynchronous by definition, and it is of paramount importance to be able to implement asynchronous solutions.

In this way, applications can easily perform tasks such as:

  • Scale depending on the need
  • Implement so-called “in-flight transactions”
  • Implement heterogeneous solutions taking the best of breed the market can offer nowadays
  • Offload frontend systems
  • Implement microservice components specialized in specific tasks. Kill the monolith!
  • Whatever your imagination suggests!

A message queue typically has 3 actors involved:

  • The provider (producer), the one that feeds the queue with messages
  • The queue, the container of the messages
  • The consumer, the actor(s) that takes the message and performs a given action

 

Note: real, modern IT systems usually have more actors involved than this, but the basic logic remains.

 

A practical example

Let me give you an example of message queue usage.

You are on a social platform and you want to upload your latest profile picture (the one from your holiday last summer!).

What you usually do is upload the picture file straight from your camera. Modern cameras have tens of millions of pixels.

The system accepts the picture but needs to post-process it. Post-processing is typically a heavy task and might include:

  • Resizing the picture to a manageable format (do I really need a 20000x300000 resolution?)
  • Creating a thumbnail of the picture
  • Duplicating the same picture at different resolutions for different devices (HD, non-HD, mobile devices, search result lists, etc.)
  • Creating different versions of the picture depending on the theme of your system (black and white? sepia? you name it)
  • Any other business (use your imagination :) )

Now, we have two different ways to accomplish this:

– The “1970s IT way”: you let the user wait for 20 minutes hoping that the last step doesn’t fail (it will, and only God knows why), so that the user has to start from scratch again
– The “nowadays IT way”: you queue n messages for the post-processing units. Each message is handled by a different specialized bot/algorithm that takes care of its part of the process. If any step fails, it can be reprocessed autonomously.

Which solution is the best? The 1970s one or the modern one?
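
As a concrete sketch of the provider/queue/consumer flow described above, here is a small example using the Python pika client (recent 1.x API); the host, queue name and payload are made up.

import json
import pika

# connect to the broker and declare a durable queue (names are hypothetical)
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="picture.postprocess", durable=True)

# the provider: the web frontend queues one message per post-processing task
for task in ("thumbnail", "resize-hd", "resize-mobile", "sepia"):
    channel.basic_publish(
        exchange="",
        routing_key="picture.postprocess",
        body=json.dumps({"picture_id": 42, "task": task}),
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )

# the consumer: a specialized worker picks messages up asynchronously
def on_message(ch, method, properties, body):
    job = json.loads(body)
    print(f"processing {job['task']} for picture {job['picture_id']}")
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="picture.postprocess", on_message_callback=on_message)
channel.start_consuming()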

 

Tell me more about its implementation

The implementation is rather easy. Our RabbitMQ design was done bearing in mind the need for multi-tenancy and service sharing, in order to optimize costs and reduce the number of infrastructure components to be maintained. Having said this, RabbitMQ has been configured to support multiple virtual hosts, where each virtual host represents a logical split between customers/tenants (the server takes care of vhost isolation).

The developers/application owners get an account that can create different entities inside their virtual host, such as queues, exchanges and messages.

Queues inside a given virtual host are controlled by the application; we (application hosting) don’t control the nature of the content (just as we don’t enter into the logic of other services), but we take care of whatever is needed to guarantee that the service is fully operational.

For us, a queue is a generic entity, a container of information. For you, the queue is the aggregation of your application’s messages, and all the messages together implement your business.

 

Why do we need it?

The question is not “why do we need it?”; the real question is “why not?”. RabbitMQ makes developers’ lives easier, allows service decoupling, makes scalability easier and so on. The cost is ridiculously low in terms of hosting, and the software comes for free while offering production-level robustness.

Moreover, if we do not provide the service as a commodity hosting service, developers will do it anyway, in their own unprotected/unruled way. This is already happening on tens of Feeder containers. Developers need it; it is our duty to assist them in the most efficient and cost-effective way possible.

 

Companies are moving to the cloud, so why do this now?

When we migrate to the cloud, depending on the provider, we will either have an equivalent SaaS solution or the ability to run RabbitMQ on a virtual machine in our VPC. Think about AWS: the SQS service implements the very same concepts we have been discussing so far, and it does so in a decent way. It is not an AMQP-compliant system but, all in all, millions of developers live without that :) We are not a special case (actually SQS does much more than this). If we want to stick with standard AMQP, we can still implement RabbitMQ in the very same way we implemented it now. Nothing changes. The cloud does not change the services’ needs; it changes the way you satisfy those needs.

 

 

Conclusions

I strongly believe that, nowadays, anyone who needs to decouple a service’s functions should have a look at RabbitMQ.

The service is rock solid, easy to install and has a great level of maturity.

For sure there are other competitors out there that do similar things. In my humble experience, anyway, RabbitMQ is easy and straightforward and can be an excellent fit for your needs.

 

Read More