Hey guys, today I would like to share a recent experience from my work at WFP.
We recently deployed a new platform consisting of a set of RESTful APIs that act as a data gateway/broker between different systems.
The API gateway system has so far been delivered on a standard hosting container solution, and its cost was ridiculously low: less than 100 dollars per month. The system currently exposes more than 30 million records and serves around 40,000 requests per day. All in all, the numbers are growing as we speak (or better, as you read).
Everything worked smoothly until the usage and load requirements of the API gateway system increased. It was a very good sign: the business was growing. At the same time, the need to scale up arose. The good thing is that it did not find us unprepared. We (Developers and Operations) were ready, not because the API gateway is a special case, but because, thanks to all the effort invested, we had the right culture to support those changes quickly and effectively. All in all, the system and the service were designed the way they should be.
In this article I will try to share “how” we were able to scale an application stack up about ten times with no particular effort. All it takes is the right approach and a good knowledge of what running a modern IT service means. All it takes is a strong automation culture.
Tell me more, I am curious
The API gateway used to be operated by a simple infrastructure composed of 7 different elements:
A – A Django/Python application server, the nerve of the solution
B – An Nginx front HTTP server to serve static content and route requests to the right app pool
C – A Celery task manager to perform scheduled and/or asynchronous activities
D – A reverse proxy service, to handle client connections and redirect HTTP requests
E – A Redis server for data caching
F – A RabbitMQ server for asynchronous in-flight transactions
G – A database service, to store authentication/authorization data
All these services were hosted on simple feeder containers. All in all, the cost of the infrastructure was around $100 per month, a very low price for such a complex and rich service.
Once the load on the service increased, we reacted very fast. We met (a 5-minute coffee), implemented the new platform (in around 1 hour) and voilà: the new API Gateway, able to withstand 10 times the original load, was ready to go.
How did you do it?
To answer this question there are many concepts we need to bring to the table. The most important one is “culture change”.
Developers no longer see Operations as an antagonist, and Operations no longer see Developers as a problem. A success is a success for both, and so is a failure. This is the way we are.
No one likes to fail and everyone enjoys succeeding. We are human beings and we like to do our work in the best possible way, no compromises.
The only way to meet today's IT requirements is to work together in a seamless manner.
In this environment it is very easy to operate. A need is identified and we know exactly who does what. We have no doubt about whether the other team will be able to deliver what it is supposed to deliver. We know they will make it. And this is fantastic; it is incredibly good to work in this way.
In this very specific case, the result was to spawn additional containers, configure them automatically using automated deployment techniques and reconfigure the load balancing system, and the job was done. Who did what? Why do you care? It is not important, it was the DevOps team 🙂
Tell me more, give me some detail
Sure, this is the reason why I am writing this article.
Let me start from the basics, the hosting component. We spawned 4 brand new Feeder containers with 4 cores and 8 GB of RAM each in a matter of 10 minutes. How? Using our internal orchestration tool. It is the key to success here, at least for the hosting component. The mantra “if you do it more than twice, automate it” worked and will always work in these cases. We did our homework and we are ready for such challenges.
Then, once you have multiple hosting environments ready, it is as if you had a set of empty boxes. How do you start using them? Simple: with automation. In particular, each and every application deployed in our IT environment is deployed using Ansible. Ansible is an agentless automation system (https://www.ansible.com/); it is free and does not require a dedicated infrastructure. Nice? No, amazing.
In case you used Puppet or Chef in the past, Ansible does pretty much the same stuff but does not require a client installation. Which is the best solution in this field? It depends on your needs. Is Ansible the best? I don’t know. I know that it fits our needs best, and this is enough. We evaluated pretty much all of them, and Ansible looks to be the best for our needs.
So, to cut a long story short, deployment is just one click away. It is a matter of updating your hosts file and the playbook so that they deploy on multiple containers instead of a single one.
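To give an idea of how small the hosts file change can be, here is a minimal sketch in Python that renders an INI-style Ansible inventory for one group. The group name api_gateway and the feederNN.example.org host names are made up for illustration; they are not our real names, and this is not our actual tooling.

```python
def render_inventory(group, hosts):
    """Render a minimal INI-style Ansible inventory for a single host group."""
    lines = [f"[{group}]"]
    lines.extend(hosts)
    return "\n".join(lines) + "\n"


# Hypothetical names, for illustration only.
GROUP = "api_gateway"

# Before: one single container in the group.
single = render_inventory(GROUP, ["feeder01.example.org"])

# After: four containers in the same group.
scaled = render_inventory(GROUP, [f"feeder{i:02d}.example.org" for i in range(1, 5)])

print(scaled)
```

A playbook that targets the group (hosts: api_gateway in this sketch) does not need to know how many hosts the group contains, which is why scaling out is mostly an inventory change.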
Is it that easy? If everything is done by the book, yes. But it requires a good application design. Why? Keep reading.
Can I do it for any application?
In order to serve an application from multiple nodes (possibly load balanced) in a multi-master scenario, one of the key elements is a stateless design.
I assume that anyone with a bit of application design knowledge knows what being stateless means. Stateless design is a way to design your services so that you don’t use sessions and you don’t keep the state of a transaction. Each HTTP operation (GET, POST etc.) is unrelated to the previous ones.
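As a rough illustration of the idea (not the actual code of our API Gateway), here is a minimal Python sketch of a stateless request handler. Every request carries what is needed to process it, here a client id plus an HMAC signature, so any node sharing the same secret can answer it without a session store. SECRET_KEY, sign and handle_request are names invented for this example.

```python
import hashlib
import hmac
from typing import Tuple

# Hypothetical shared secret, identical on every node (assumption for this sketch).
SECRET_KEY = b"change-me"


def sign(client_id: str) -> str:
    """Compute the signature a client must send along with its id."""
    return hmac.new(SECRET_KEY, client_id.encode(), hashlib.sha256).hexdigest()


def handle_request(client_id: str, signature: str, record_id: int) -> Tuple[int, str]:
    """Answer a request using only its own inputs: no session, no per-node memory."""
    if not hmac.compare_digest(sign(client_id), signature):
        return 401, "invalid credentials"
    return 200, f"record {record_id} for {client_id}"


# Any node running this code returns the same answer for the same request.
print(handle_request("client-a", sign("client-a"), 42))
```

Because the handler keeps nothing between calls, two consecutive requests from the same client can land on two different containers without any difference in behavior.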
A good article, if you are interested in this topic, is the one from Rackspace called “coding in the cloud – rule 3” (http://blog.rackspace.com/coding-in-the-cloud-rule-3-use-a-stateless-des…). It is worth reading.
The concept is very basic, yet many applications fail to implement it. Delivering good performance with this pattern is painful and requires skill; that’s why many applications fail at it.
So, let me say this: our API Gateway does it, and it does it in an excellent way. The way it implements this design pattern is simply a work of art, the way many applications should follow. I leave the description of its internals to the developers. What I can say is that for the first time in my career I asked a developer “Can I go for a multi-master configuration?” and the developer looked at me, checking whether I was serious, and said with a smile “sure you can, we designed the service for it”.
In addition to this, our API Gateway implements several other features that make scaling operations easy, such as Distributed Multi-Layer Caching Architecture, Predictive Data Loading and Asynchronous logging. All of these features make the system capable of scaling when needed.
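I leave the real internals to the developers, but to give a feeling of what a multi-layer cache means in general, here is a minimal sketch in Python. It assumes two layers: a small in-process dictionary per node and a shared store (a plain dict standing in for something like Redis). TwoLayerCache and slow_lookup are hypothetical names, and the sketch leaves out expiry and invalidation.

```python
class TwoLayerCache:
    """Look up the per-node layer first, then the shared layer, then the source."""

    def __init__(self, shared, loader):
        self.local = {}        # layer 1: private to this node
        self.shared = shared   # layer 2: visible to every node (stand-in for Redis)
        self.loader = loader   # slow source of truth, called only on a full miss

    def get(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.shared:
            value = self.shared[key]
        else:
            value = self.loader(key)
            self.shared[key] = value
        self.local[key] = value
        return value


def slow_lookup(key):
    # Placeholder for an expensive backend query (assumption for this sketch).
    return key.upper()


shared_store = {}
node_a = TwoLayerCache(shared_store, slow_lookup)
node_b = TwoLayerCache(shared_store, slow_lookup)
node_a.get("record-1")   # full miss: goes to the slow source
node_b.get("record-1")   # served from the shared layer, no backend call
```

The point for scaling is that a newly added node benefits from the shared layer immediately, so adding containers does not multiply the load on the backend.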
Well, to cut a long story short, I have managed huge clusters and several balanced systems in the past, and this is one of the easiest implementations I have ever managed (easy is better in this case, we all love the KISS approach 🙂 )
To answer the original question, “Can I do it for any application?”, the answer is “yes, if you did your homework”.
How do you balance the load?
Here in our datacenter we implemented a cluster of Nginx reverse proxies, which do an excellent job at this.
I have been using Apache for more than 10 years. If you had asked me 3 years ago about Apache vs Nginx, I would have told you “My life for Apache”. Now I use Nginx everywhere and I am an active Nginx follower/contributor. I would use Nginx everywhere, even in my coffee machine if possible; it is simply incredible.
Nginx, the way it is set up here, makes this job extremely easy: it is just a matter of delivering a new upstream, configuring the balancing rules and reconfiguring the reverse proxy. It becomes a simple software configuration when, until a few years ago, balancing with sticky sessions would have required dedicated hardware appliances (who remembers the Cisco Load Balancers?).
With Nginx it is easy and straightforward to implement micro services and load balancing. I suggest checking it out if you don’t know it already.
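To show the idea behind an upstream with several servers, here is a minimal sketch in Python of round-robin selection, which is the default balancing method of an Nginx upstream. It only illustrates the concept: it is not Nginx code, it ignores weights, health checks and sticky sessions, and the host names are invented.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out upstream servers in turn, one per incoming request."""

    def __init__(self, upstreams):
        self._upstreams = cycle(list(upstreams))

    def pick(self):
        return next(self._upstreams)


# Hypothetical upstream servers, one per container.
balancer = RoundRobinBalancer([
    "feeder01.example.org:8000",
    "feeder02.example.org:8000",
    "feeder03.example.org:8000",
    "feeder04.example.org:8000",
])

for _ in range(5):
    print(balancer.pick())
```

With a stateless application behind it, this simple rotation is enough: it does not matter which container receives which request.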
The primary node serves hundreds of applications for thousands of users, and this is how much the node is overloaded:
We serve around 100 services, thousands of users, millions of GET/POST operations per day, SSL termination for all our services and this is the HW used by the proxy services. Less than a GB of RAM (the rest is buffering). Surprised? Don’t be, this is Nginx.
Come on guys, everyone is moving to the cloud. Why do it now?
The solution itself is ready for the cloud as well. Nothing changes there, we will just have more flexibility.
When we run the API Gateway on AWS, we will scale up in similar ways, either using ELB in place of the on-premise Nginx-based load balancer or using built-in autoscaling features. The core concepts remain and the solution applies to the cloud as well; only the tools will change.
We should see AWS or Azure as a baseline for the implementation of these solutions. Life would surely be easier and less complicated, since most of the implemented solutions would be delivered in a SaaS flavor.
With this article I wanted to share with you the advantages of automation and orchestration. Such results couldn’t be achieved without strong cooperation and trust between the different parties. The way IT was until a few years ago, the approach would have been to point the finger at the other team for the failure.
The way it is now is to talk in a friendly way about how to improve and to choose the best way to move forward.
In the past we scaled vertically; now we scale horizontally, the way it should be.
Before, IT took weeks to implement a fix or a new architecture; now it is done in less than a working day.
This is the way to go guys, this is the IT we all love!
Special thanks to all the people involved in this, from both teams, development and operations.