Principles for C programming

Reading Time: 1 minute

A very interesting article about C Programming Principles.

Even if it does not say anything new and could be considered "yet another good article about C programming", I found it useful and therefore I am reporting it here.

https://drewdevault.com/2017/03/15/How-I-learned-to-stop-worrying-and-love-C.html

How to create a Linux Daemon

Reading Time: 5 minutes

Introduction

I have developed several Linux daemons and Windows services in my career. The differences between the two systems are enormous.

I personally prefer to develop Linux daemons due to my love of Linux development. I feel at home working on Linux.

What is a Linux daemon and how does it differ from a traditional user-space program?

A Linux daemon has the following characteristics and usually performs the following macro activities:

  • When it starts, it checks whether another instance of the daemon is already running. If so, it exits.
  • It resets the file creation mask (umask).
  • It forks and terminates the original (parent) process, leaving only the child alive. In this way the process detaches from the console. The moment the parent dies, the child is re-parented to init (the root/parent of all processes running on Linux) and becomes the session leader.
  • It changes the current directory to the root file system.
  • It redirects standard input, output and error to /dev/null.
  • It opens a syslog stream for logging (to /var/log/messages on Red Hat, for instance). Logging is performed through the syslog daemon (this may vary depending on your needs).
  • It creates/updates a .pid file in /var/run/ for single instancing. The file usually has a .pid extension, though a different one can be used.
  • It reads its configuration file in /etc/ (this is optional, depending on the nature of your service).
  • It spawns a signal handler thread to interact with the OS.

The above task list is a generic/basic implementation of a typical Linux daemon.

Now, let's see step by step what each operation does and how it is typically implemented.
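
As a reference, here is a minimal sketch of the whole sequence in C. The daemon name and the pid file path are illustrative, error handling is reduced to the bare minimum, and a simple signal handler stands in for a dedicated signal handling thread:

/* daemonize.c - minimal sketch of the task list above */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <signal.h>
#include <syslog.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

static volatile sig_atomic_t running = 1;

static void on_term(int sig) { (void)sig; running = 0; }

int main(void) {
    umask(0);                                  /* reset the file creation mask */

    pid_t pid = fork();                        /* fork ... */
    if (pid < 0) exit(EXIT_FAILURE);
    if (pid > 0) exit(EXIT_SUCCESS);           /* ... and kill the parent: the child
                                                  is re-parented to init */

    if (setsid() < 0) exit(EXIT_FAILURE);      /* become session leader, detach from
                                                  the controlling terminal */

    if (chdir("/") < 0) exit(EXIT_FAILURE);    /* move to the root file system */

    int devnull = open("/dev/null", O_RDWR);   /* redirect stdin/stdout/stderr */
    dup2(devnull, STDIN_FILENO);
    dup2(devnull, STDOUT_FILENO);
    dup2(devnull, STDERR_FILENO);

    openlog("mydaemon", LOG_PID, LOG_DAEMON);  /* log through the syslog daemon */

    /* single instancing: O_EXCL makes open() fail if the pid file exists */
    int fd = open("/var/run/mydaemon.pid", O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        syslog(LOG_ERR, "another instance seems to be running, exiting");
        exit(EXIT_FAILURE);
    }
    dprintf(fd, "%d\n", (int)getpid());
    close(fd);

    signal(SIGTERM, on_term);                  /* react to the OS on shutdown */

    syslog(LOG_INFO, "daemon started");
    while (running)
        sleep(1);                              /* the real work goes here */

    unlink("/var/run/mydaemon.pid");
    syslog(LOG_INFO, "daemon stopped");
    closelog();
    return 0;
}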

SLA of Cloud Services. How can they make it?

Reading Time: 4 minutes

Introduction

When we talk about cloud services, we always think they are perfect, that they never fail and that they are operational 100% of the time.

Is that the reality? Is it true? Can a system be operational forever?

I don’t think so. Do you?

Just today, the Google Cloud Platform went down for more than an hour (picture at the end of the post, so readers don't immediately think "yet another external link").

To be honest, it is a very strange failure: it went down globally, everywhere. So the principle of spawning systems across multiple regions didn't help either (to implement high availability in the cloud, you are kindly asked to split servers across different regions). So, if all the regions go down, what happens? Your services are down. Straight and easy!

 

SLA, our beloved and hated target!

Are we (traditional data-centers) better than them? No, of course not. I am not saying that. For sure we cannot even dream of their availability index! I am just saying that perfection, 100% availability, does not exist anywhere.

I still remember my professor saying "systems fail, design your system for failures". This is an evergreen mantra, and it is correct.

This is indeed the principle that most cloud providers leverage.

If you have time, I will explain how a company like Microsoft can guarantee 99.9% availability for their Office 365 mailbox services (something every on-premise solution would struggle to achieve).

To understand how they can do it, we should consider the number of mail servers a company can realistically deploy and the number of mailboxes it needs to serve.

Let me go through this by example; suppose a company has the following requirements (here I am using very small numbers to make it simple):

Number of servers: 3
Number of mailboxes: 9 (users A through I)
Minimum number of copies: 2

The idea is to split the mailboxes among the 3 servers and guarantee at least 2 copies of each mailbox in order to implement "high availability".

With this information, the distribution is rather basic: nothing difficult, a very simple, near-random spread of the nine mailboxes among the 3 servers, with each mailbox stored on two of them.
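
To make the mechanics concrete, here is a toy C simulation of this layout and of the double failure discussed next. The round-robin placement is my own illustration, not the actual table of the example, so the exact set of lost mailboxes may differ, but the proportion is the point:

/* sla_sim.c - toy sketch: place MAILBOXES on SERVERS with COPIES replicas
   each, then count the mailboxes that lose every copy when a given set of
   servers is down. The placement is illustrative, not a real algorithm. */
#include <stdio.h>

#define SERVERS   3
#define MAILBOXES 9
#define COPIES    2

int main(void) {
    int placement[MAILBOXES][COPIES];          /* server holding each copy */
    for (int m = 0; m < MAILBOXES; m++)
        for (int c = 0; c < COPIES; c++)
            placement[m][c] = (m + c) % SERVERS;   /* round robin */

    int down[SERVERS] = { 1, 0, 1 };           /* Server0 and Server2 are down */

    int lost = 0;
    for (int m = 0; m < MAILBOXES; m++) {
        int alive = 0;
        for (int c = 0; c < COPIES; c++)
            if (!down[placement[m][c]])
                alive++;
        if (alive == 0) {
            printf("mailbox %c is unavailable\n", 'A' + m);
            lost++;
        }
    }
    printf("%d of %d mailboxes lost (%.0f%%)\n",
           lost, MAILBOXES, 100.0 * lost / MAILBOXES);
    return 0;
}

With Server0 and Server2 down, exactly one third of the mailboxes lose both copies: the same 33% the example below arrives at. Raise SERVERS to 9 and the very same double failure touches almost nothing, which is the multi-tenant scenario discussed later.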

Now, let’s simulate a server’s downtime.

Let's say Server0 goes down. What will happen?

The service will still be available for everyone, so availability is still 100%. There could be a performance degradation for all the users, but all in all, they will survive.

Moving forward, what happens if Server2 goes down in the meanwhile? There will be an outage.

Users E, G and I (guess what, 33% 🙂) will be without their mailboxes, and a chain of events will be triggered:

  • Users call the service desk (SD)
  • The SD calls the system admins
  • Users call the directors
  • Directors call the system admins
  • System admins call... no one 🙁
  • The SLA is jeopardized!

 

In simple words, we all know what happens. Anyway, the above was just a simple example to explain how it goes.

Moving forward, try replacing the numbers. What if I have 15,000 mailboxes and 3 servers? In this case, 5,000 users would be without mailboxes. Not a nice experience.

How can we mitigate this? By increasing the number of servers? Yes, an easy catch, Jon Snow, but you didn't consider another constraint in the story: money. Money is an important factor; how many servers can you afford? Guess a number, it will be lower than that.

So, how does Microsoft do it?

They do it in the simplest way possible and everything is based on the following facts:

  • Microsoft has huge data-centers with thousands of servers
  • It mixes tenants' mailboxes
  • The SLA they establish is global. This means that the whole service must be down in order to be considered a system downtime 🙂

So, how do they do it? Let's use the previous numbers in a slightly more complicated, 3-tenant scenario:


Number of servers: 9
Minimum number of copies: 2 (in reality they can afford to have 3 copies)

Mailboxes from all three tenants are then spread across the nine servers.


The result is an excellent distribution of the data.

Like we did before, let's try to simulate a failure. If Server0 goes down, who will be impacted? Pretty much no one, like before. There will probably be a performance degradation (but email is asynchronous by nature, who cares?), and no one will complain.

Now, like we did before, let's simulate another server going down. Let's say Server6 fails as well.

How many mailboxes will be impacted?

Rather easy to say: we need to check which mailboxes had both copies on the two failed servers.

All in all, 2 servers went down, but only one user is affected: user A1 from Company2.

So, what will happen in this case? Simply nothing.

Let's say we are particularly unlucky and Server8 goes down in the meantime.

So, in this case, we have 3 servers down and 3 users affected: user I from Company1, user A1 from Company2 and user H2 from Company3.

What will happen in this case? Nothing. Why nothing? Jon Snow, Company1, Company2 and Company3 are three isolated brains.

They are neither aware of nor concerned about what happens to the other companies. All in all, only a tiny part of their users is down; the vast majority of their users is operational.

You know what will happen? The company will start self-inflicting pain: the service desk will start suspecting a problem with the user's mailbox. After all, the system is operational. There must be a problem on the client's workstation? Reset the password? The antivirus? You guess. No one will suspect an Office 365 outage, since the service is up for all the other users.

Now, multiply this by thousands of times. Take 15,000 users split among tens of thousands of servers: who will notice if 100 users are down at any given moment? No one!

Will we be able to sue Microsoft for reimbursement? Not at all! The SLA is calculated on the service, not on the single mailbox. Globally, the service availability will be above 99%; in simple words, stay there, quiet and calm. Nothing to be reported! Business as usual.

 

Conclusions

What these simple examples wanted to demonstrate is that availability is mostly a matter of client perception and impact. It is not so much a matter of how well a service performs.

If we check AWS's metrics, we will see the same: instances fail (rather often) and Amazon does not claim otherwise (why should they?).

What is calculated, in terms of service availability, is the availability of the whole, not the availability of the single component!

This is normal and understandable. I like to picture cloud customers as ants: they all follow pretty much the same patterns, the providers understand this and leverage it. If an ant dies, the ants' nest is not impacted. What providers answer for, and what they care more about, is the health of the ants' nest, not the health of the single ant 🙂

RabbitMQ Service

Reading Time: 4 minutes

Introduction

Recently, I have implemented a new RabbitMQ infrastructure for the company I regularly work for, which is supposed to serve several different customers and services, spanning from complex applications to CMS services to long-lasting transaction systems.

What is extremely interesting about this exercise is that the request for this new service was driven by developers, accepted by operations and implemented together. All in all, the solution is tailored to developers' needs while respecting the normal operational constraints of a production service (i.e. backup, monitoring, etc.).

 

What is RabbitMQ?

For those of you who don't know the service, RabbitMQ is a free/open-source implementation of the AMQP messaging protocol.

The system is written in Erlang and runs (in my specific case) on a Red Hat 7 virtual machine.

Out of curiosity, tales say that someone is running RabbitMQ on a Microsoft platform somewhere in a dark wood. I usually don't follow these sad stories, and sincerely I don't know how good it is on the Windows OS. What I know is that RabbitMQ rocks on Linux systems, and Linux is the OS that is meant to be used for production.

 

Tell me more about it

I suppose anyone in the IT business knows what a message queue is. I am sure a quick refresh of our knowledge won't hurt anyone, though.

A message queue is a system that lets different applications, or an application's components, communicate in an asynchronous manner. All in all, the world around us is asynchronous by definition, and it is of paramount importance to be able to implement asynchronous solutions.

In this way, applications can easily perform tasks such as:

  • Scale depending on the need
  • Implement the so-called "in-flight transactions"
  • Implement heterogeneous solutions taking the best of breed the market can offer nowadays
  • Offload frontend systems
  • Implement microservice components specialized in specific tasks. Kill the monolith!
  • Whatever your imagination suggests!

A message queue typically has 3 actors involved:

  • The producer, the guy/item which feeds the queue with messages
  • The queue, the container of the messages
  • The consumer, the actor(s) which takes the messages and performs a given action

 

Note: real, modern IT systems usually have more actors involved than this, but the basic logic remains.

 

A practical example

Let me give you an example of message queue usage.

You are on a social platform and you want to upload your latest profile picture (the picture from when you were on holiday last summer!).

What you usually do is upload the picture file straight from your camera. Modern cameras have billions of pixels.

The system accepts the picture, but it needs to post-process it. Post-processing is typically a heavy task and might include:

  • Resizing the picture to a manageable format (do I really need a 20000x300000 resolution?)
  • Creating a thumbnail of your picture
  • Duplicating the picture at different resolutions for different devices (HD, non-HD, mobile devices, search result lists, etc.)
  • Creating different versions of the picture depending on the theme of your system (black and white? Sepia? You guess)
  • Any other business (use your imagination 🙂 )

Now, we have two different ways to accomplish this:

– The "1970 IT way": you let the user wait for 20 minutes, hoping that the last step doesn't fail (only god knows why it would), forcing the user to start from scratch again
– The "nowadays IT way": you queue n messages for the post-processing units. Each message is handled by a different specialized bot/algorithm which takes care of its step. If any step fails, it can be reprocessed autonomously.

Which solution is the best? The 1970 one or the nowadays one?
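
To give a taste of the "nowadays" way, here is a minimal producer sketch in C using the rabbitmq-c client library. The broker host, virtual host, credentials, queue name and step list are all made up for the example, and error handling is kept to the bare minimum:

/* postprocess_publish.c - queue one message per post-processing step.
   Build with: gcc postprocess_publish.c -lrabbitmq */
#include <stdio.h>
#include <amqp.h>
#include <amqp_tcp_socket.h>

int main(void) {
    amqp_connection_state_t conn = amqp_new_connection();
    amqp_socket_t *socket = amqp_tcp_socket_new(conn);
    if (!socket || amqp_socket_open(socket, "localhost", 5672)) {
        fprintf(stderr, "cannot reach the broker\n");
        return 1;
    }
    /* log in to a (hypothetical) "media" virtual host */
    amqp_login(conn, "media", 0, 131072, 0, AMQP_SASL_METHOD_PLAIN,
               "guest", "guest");
    amqp_channel_open(conn, 1);

    /* one durable queue feeding the post-processing workers */
    amqp_queue_declare(conn, 1, amqp_cstring_bytes("postprocess"),
                       0, 1, 0, 0, amqp_empty_table);

    /* one message per step of the picture's post-processing */
    const char *steps[] = { "resize", "thumbnail", "hd-variant", "sepia" };
    for (int i = 0; i < 4; i++) {
        amqp_basic_publish(conn, 1,
                           amqp_cstring_bytes(""),             /* default exchange */
                           amqp_cstring_bytes("postprocess"),  /* routing key */
                           0, 0, NULL,
                           amqp_cstring_bytes(steps[i]));
        printf("queued step: %s\n", steps[i]);
    }

    amqp_channel_close(conn, 1, AMQP_REPLY_SUCCESS);
    amqp_connection_close(conn, AMQP_REPLY_SUCCESS);
    amqp_destroy_connection(conn);
    return 0;
}

Each specialized consumer then subscribes to the queue, takes one message at a time and performs its step; a failed step can simply be re-queued without restarting the whole upload.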

 

Tell me more about its implementation

The implementation is rather easy. Our RabbitMQ design was done bearing in mind the need for multi-tenancy and service sharing, in order to optimize costs and reduce the number of infrastructure components to be maintained. Having said this, RabbitMQ has been configured to support multiple virtual hosts, where each virtual host represents a logical split between customers/tenants (the server takes care of vhost isolation).

The developers/application owners get an account that can spawn different entities inside the virtual host, such as queues, exchanges and messages.

Queues inside a given virtual host are controlled by the application: we (application hosting) don't control the nature of the content (just as we don't enter the logic of other services), but we take care of whatever is needed to keep the service fully operational.

For us, a queue is a generic entity, a container of information. For you, the queue is the aggregation of your application's messages, and all the messages together implement your business.

 

Why do we need it?

The question is not "why do we need it?"; the real question is "why not?". RabbitMQ makes developers' lives easier, allows service decoupling, makes scalability easier, and so on. The cost is ridiculously low in terms of hosting, and the software comes for free even though it offers production-level robustness.

Moreover, if we do not provide the service as a commodity hosting service, developers will do it anyway, on their own, in an unprotected/unruled way. This is already happening on tens of Feeder containers. Developers need it; it is our duty to assist them in the most efficient and cost-effective way possible.

 

Companies are moving to the cloud, why now?

When we migrate to the cloud, depending on the provider, we will either have an equivalent SaaS solution or the ability to run RabbitMQ on a virtual machine in our VPC. Think about AWS: the SQS system implements exactly the same concepts we have been discussing so far, and it does so in a decent way. It is not an AMQP-compliant system, but all in all, millions of developers live without that 🙂 We are not a special case (actually, SQS does much more than this). If we want to stick with standard AMQP, we can still implement RabbitMQ in the very same way we implemented it now. Nothing changes. Cloud does not change a service's needs; it changes the way you satisfy them.

Conclusions

I strongly believe that anyone who nowadays needs to decouple a service's functions should have a look at RabbitMQ.

The service is rock solid, easy to install and has reached a great level of maturity.

For sure there are other competitors out there which do similar things. In my humble experience, anyway, RabbitMQ is easy and straightforward and can represent an excellent deal for your needs.

 

Red Hat Enterprise Linux is free now! (for development use!)

Reading Time: 1 minute

Since yesterday, Red Hat Enterprise Linux is free to use for developers!

This is incredible news for anyone who has worked with the best enterprise Linux distro.

This is an evident countermeasure against Ubuntu and Amazon Linux. Well done, Red Hat!

No-Cost RHEL Developer Subscription now available

In simple words: I am sorry, CentOS 🙂 I loved you, but it's over.

Install Chromium on Fedora 22

Reading Time: 1 minute

Today I tried to install Chromium on my new Fedora 22 installation.

The Chromium package is not present in Fedora’s standard repositories.

Following Fedora’s documentation I performed the following:

1. Download the new repo information containing the chromium package

wget https://repos.fedorapeople.org/repos/spot/chromium-stable/fedora-chromium-stable.repo

2. Copy the new repo file into yum's repos folder

cp fedora-chromium-stable.repo /etc/yum.repos.d/.

3. Install Chromium package

yum install chromium

The 3 steps above are part of the official Fedora configuration.

In my case, though, the installation failed due to a missing signature key:

warning: /var/cache/dnf/x86_64/22/fedora-chromium-stable/packages/chromium-43.0.2357.124-2.fc22.x86_64.rpm: Header V4 DSA/SHA1 Signature, key ID 93054260: NOKEY

Curl error (37): Couldn’t read a file:// file for file:///etc/pki/rpm-gpg/spot.gpg [Couldn’t open file /etc/pki/rpm-gpg/spot.gpg]

If the same happens to you, don't worry: it is just a matter of skipping the signature key check.

Run the yum install command with the --nogpgcheck switch, like this:

yum install chromium --nogpgcheck

I hope this helps you in some way folks.

The 777 developer

Reading Time: 2 minutes


I have loved writing code since I was 13, and I am now 38. Development is my life; I have coded in pretty much every one of the most common languages. For me, development is a serious business.

Coming from a computer games development world, with a strong background in assembler, C and C++, I have always disliked web scripting languages. I love to develop 3D engines in C or C++, and I have done it. I love to write C daemons on Linux, automation routines for my Linux machines, my own containerization solution and many other deep coding solutions. I simply love it. I hate to write PHP, JavaScript, Django and other web scripts. I am not tailored for them. I simply hate them. Simple as that.

My main duty, anyway, in my current role, is to work with 777 developers, people coming from a pure web scripting background. It is very difficult to deal with them. They believe themselves to be master developers just because they know how to copy and paste pieces of code into already available CMSs.

Anyway, this is today's world. Some of them are very good (and those friends are not 777 developers); some others are terrible.

Who is a 777 developer? A 777 developer is the typical guy who comes to you with his own application and asks you to provide hosting for it. They just tell you which stack they need, and you are supposed to do the hard work.

The most typical problem with those guys is getting them to understand that security is an important topic in a production environment. When they give you a package and it does not work, they just tell you that "it works wonderfully on my workstation". It is there that you understand that the pal is a 777 developer.

A 777 developer is a person who sets the entire nginx or Apache directory tree with a chmod -R 777 and feels happy: it works! Who cares that this is the most stupid way to do it. It works! Doesn't it work in the production environment? The fault is yours, since you don't know how to do stuff right. On my personal EC2 instance it goes very well, in my Docker container it works like a charm, on my workstation it is perfect. You see? It is your fault. The secret is the great 777 command!

I do not expect to work with the best developers on the market. They pretend to be the best, but you can recognize that they are just pretending with a simple look. In the end, if you are just configuring a CMS there is a reason for it, but I do not expect to be blamed if an application does not work.

A 777 developer always hides behind the DevOps trend. In fact, it is your fault if the application doesn't run: you do not understand the DevOps principles.

…it is so difficult sometimes to keep a professional approach with a 777 developer. Anyway, I think we are paid to do it, and we need to do it.

Present sorry page for everyone except you

Reading Time: 1 minute

 

During maintenance operations on an application's back-end systems, we usually present the so-called sorry page to end users. This is a nice way to inform them that something is going on, in particular on small systems that don't have a multi-node setup.

Typical pages are the ones saying "I am sorry, the system is under maintenance, it will be back soon".

I usually configure the sorry page at the reverse proxy level (nginx), since the back-end system is the one going up and down most of the time.

The way I use is the following:

  • Create the HTML of the sorry page and put it on the local file system, e.g. /var/whatYouPrefer/www/sorry_page/index.html
  • Comment out the proxy_pass directive
  • Define a new document root for the website pointing to the sorry page's location:
                             root /var/whatYouPrefer/www/sorry_page/;
At this point, everyone will get the sorry page in place of a proxy service error.
This is fine, basic and easy. Everyone knows it. But what about your own access? What if you cannot reach the HTTP service on the application server, due to a local firewall configuration (in particular, if you perform SSL termination, you shouldn't allow connections from players other than your reverse proxy) or due to a name-based virtual hosting limitation?
The best way is to create a conditional rule that serves the sorry page to everyone except your own IP address.
The way I usually achieve this result is:

 

location / {
    if ($remote_addr ~* yourIPAddress) {
        proxy_pass http://remoteServer;
    }
    root /var/whatYouPrefer/www/sorry_page/;
}
The result is obvious: if your IP matches the one in the if condition, nginx will proxy the back-end service to you. If it doesn't, nginx will serve the sorry page. (Note that nginx only allows proxy_pass inside an if block within a location context, hence the wrapper above.)

Nginx processes’ users

Reading Time: 3 minutes

Many people get confused about the user ownership of nginx processes.

Most people believe that nginx runs as root (oh my god), and some others believe that nginx runs entirely as the nobody user.

Now, let's make a distinction between the master and the worker processes.

Master process

The master process runs as the user which launched the service, typically root. Why root? Root is generally used in order to be able to bind sockets to port numbers below 1024 (privileged ports); in fact, unprivileged users cannot bind ports below 1024.
In general, the master process has the following tasks (see http://www.aosabook.org/en/nginx.html):
  • reading and validating configuration
  • creating, binding and closing sockets
  • starting, terminating and maintaining the configured number of worker processes
  • re-configuring without service interruption
  • controlling non-stop binary upgrades (starting new binary and rolling back if necessary)
  • re-opening log files
  • compiling embedded Perl scripts
The fact that the master process starts as root is simply the result of launching it with the root account (otherwise you would have port binding problems). Potentially, you could run it as a different user and configure your system to use a port number higher than 1024. It is as simple as that.
Common usage, anyway (in particular for reverse proxies), is to bind the standard HTTP or HTTPS ports, 80 and/or 443. It wouldn't make much sense to have a reverse proxy listening on custom ports: it can be done technically, but it doesn't make much sense logically.
So, the golden rule here is that the master process belongs to the user which started/spawned nginx. Simple as that.
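
To see the privileged-ports rule in action, here is a tiny C sketch: run it as a normal user and bind() fails with EACCES; run it as root (or give the binary the CAP_NET_BIND_SERVICE capability) and it succeeds:

/* bind80.c - try to bind TCP port 80 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(80);                 /* privileged: below 1024 */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        perror("bind");                        /* "Permission denied" if not root */
    else
        puts("bound to port 80");
    close(fd);
    return 0;
}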

Child Process(es), workers

Child processes, known as workers, are the actors which actually move your web services. While the master process is pretty much an orchestrator, the workers are the real heroes here.
One golden rule of web servers is to grant the web server's processes the minimum access level required (yes, I am talking to you, my developer friend who always uses the chmod 777 approach). Having said this, it wouldn't make much sense to run the workers as root.
What is the risk? The risk is that if someone breaches your web application or your web server, they don't need to elevate their account, since it is already root. Attackers cannot ask for more! They will love you (and no, they won't have mercy on you!).
Now, the question is, which user is being used by workers?
The answer is simple, but a bit more complicated than it seems.
1. If nothing is specified at compile time (configure phase) and nothing is specified in the configuration file, then the user used is "nobody".
2. If a user (and optionally a group) is specified at compile time with the --user and --group options, and nothing is specified in the config file, then the user specified during the build is used.
3. If a user (and optionally a group) is specified in the config file (the user directive), it will be used as the owner of the spawned workers.
In general, the priority is the following:
– Config file
– Default (nobody, or the user name specified at compile time)
Note: a common pitfall for newcomers is to specify a user that does not exist (on some systems even "nobody" doesn't exist). In that case, remember to add the user before you actually start nginx.

Some examples

On a typical real environment with 3 workers activated, nginx operates as follows:
root     21279  0.0  0.0 109152  4136 ?        Ss   14:37   0:00 nginx: master process /usr/sbin/nginx
nginx    21280  0.0  0.0 109588  5876 ?        S    14:37   0:00  _ nginx: worker process
nginx    21281  0.0  0.0 109588  5876 ?        S    14:37   0:00  _ nginx: worker process
nginx    21282  0.0  0.0 109588  5876 ?        S    14:37   0:00  _ nginx: worker process
Needless to say, to check it, use the ps auxf command.
This is very similar (in terms of users) to the Apache HTTPD server using the pre-fork model:
root     21401  6.5  0.3 417676 24852 ?        Ss   14:39   0:00 /usr/sbin/httpd -DFOREGROUND
apache   21410  0.5  0.1 690288 13748 ?        Sl   14:39   0:00  _ /usr/sbin/httpd -DFOREGROUND
apache   21411  0.5  0.1 690288 13748 ?        Sl   14:39   0:00  _ /usr/sbin/httpd -DFOREGROUND
apache   21412  0.5  0.1 690288 13748 ?        Sl   14:39   0:00  _ /usr/sbin/httpd -DFOREGROUND
apache   21413  0.5  0.1 690288 13748 ?        Sl   14:39   0:00  _ /usr/sbin/httpd -DFOREGROUND
apache   21414  0.5  0.1 419840 14584 ?        S    14:39   0:00  _ /usr/sbin/httpd -DFOREGROUND
apache   21415  0.5  0.1 419840 14584 ?        S    14:39   0:00  _ /usr/sbin/httpd -DFOREGROUND
apache   21416  0.5  0.1 419840 14584 ?        S    14:39   0:00  _ /usr/sbin/httpd -DFOREGROUND
apache   21417  0.5  0.1 419840 14584 ?        S    14:39   0:00  _ /usr/sbin/httpd -DFOREGROUND
apache   21418  0.5  0.1 419840 14584 ?        S    14:39   0:00  _ /usr/sbin/httpd -DFOREGROUND