Introduction

While creating a critical piece of software (e.g. a Smart Home Controller for my Master's Thesis), one of the main points of emphasis is reliability.

Broadly speaking, reliability means not crashing or, failing that, exiting gracefully and then restarting without requiring any supervision.

Hence, I set about writing a tiny wrapper to execute applications in. It waits for heartbeats from the application, restarts it if it misses too many heartbeats or if it crashes (exits with a non-zero return value), and listens to the outside world, restarting the application when asked to.

Wrapper ensuring some level of reliability

Reliability

Making applications reliable is difficult. Nay, it is impossible. Horrible things will happen to critical applications. You can blame Murphy if you want, but that is not going to improve your plight. The best we can do is to identify some common sources of errors and try to build our application such that it can recover from such problems. Let's first look at the common sorts of problems encountered.

Some failures

Among all the things which can go wrong with an application, a few are fairly common (and somewhat easily diagnosable):

Dying unnaturally (possibly because of our much loved SIGSEGV)
Never dying (deadlocks, livelocks, infinite loops, etc.)
Aberrant behaviour (Byzantine faults, expletives in communication, physical violence, etc.)

At an abstract level, these problems can be detected by using failure detectors and protocols which promise correctness. However, reliable failure detectors alone are hard, both in theory and in implementation, not to mention getting Paxos right. Also, these algorithms rely heavily on replication, which may not be available in certain settings (e.g. resource-limited drivers which accept only one controller instance).

Among many other issues, one of the things which makes designing failure detectors difficult is that a process might just be too slow rather than dead. It might come back from its premature coffee break at some point and start working as if nothing had happened.

One radical way of fixing all these errors while circumventing the problem of undead processes is to kill the unreliable process, to make sure that it never pops back up and creates confusion, and then restart it from scratch. This, of course, assumes that the application is intelligently written: either it is stateless (e.g. a webserver) or it can recover its state (e.g. a stateful database). This STONITH (Shoot The Other Node In The Head) approach is also fraught with land-mines and should be applied carefully. In the rest of this article I will present a wrapper based on this approach to run your own code. This wrapper is written to make your code controllable at a very fine-grained level (the application), so that one does not need to restart the entire machine (with all other applications running on it) if one application misbehaves. Some other (very) coarse-grained solutions are Linux HA and Pacemaker.
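The kill-then-restart idea can be sketched in a few lines of Python. This is only an illustration and not the code of the wrapper itself; the function name `kill_and_restart` and its arguments are made up for this example.

```python
import subprocess
import sys


def kill_and_restart(proc, cmd):
    """Kill proc unconditionally, wait until it is gone, then start cmd afresh.

    proc is a subprocess.Popen object; cmd is the argument list used to
    start the replacement. Both names are assumptions for this sketch.
    """
    if proc.poll() is None:  # still running (or hung)
        proc.kill()          # SIGKILL on POSIX: cannot be caught or ignored
    proc.wait()              # reap it, so the old instance cannot come back
    return subprocess.Popen(cmd)


if __name__ == "__main__":
    # Hypothetical "stuck" application that would sleep for a minute.
    stuck = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    fresh = kill_and_restart(stuck, [sys.executable, "-c", "pass"])
    fresh.wait()
```

Waiting for the old process before starting the new one is the important part: it guarantees that two instances never run side by side.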

Of course, there are many other kinds of failures as well: OS crashes, virtual machine crashes, network partitioning, etc. A fairly general way of dealing with these (except the irreparable network partitioning), based on hierarchical spies, is explained in this excellent paper about FALCON (Fast And Lethal Component Observation Network). The basic idea employed therein is also STONITH. The authors also handle the case where the wrapper itself has bugs which might lead to any of the three problems discussed above: these are taken care of by the spies higher up in the hierarchy.

Solutions

The ideal solution would be to actually fix the problem, i.e. fix the program. However, often it is not possible (e.g., unpredictable closed-source libraries underneath) or not desirable (a few bazillion lines of code to sift through).

Hence, there are workarounds to these common problems, and this wrapper presents a very basic way of adding some level of reliability to your home-brewed application before you move on to more coarse-grained and heavyweight solutions.

Now, dealing with unnatural deaths is fairly easy: just run the application as a subprocess of the wrapper and, if it exits with a non-zero exit code, restart it. (Thanks to Nirbheek for the code.)
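A minimal sketch of such a restart loop in Python follows. It is not the code referred to above; the function name `run_with_restarts` and the `max_restarts` and `delay` parameters are assumptions made for illustration.

```python
import subprocess
import sys
import time


def run_with_restarts(cmd, max_restarts=3, delay=0.0):
    """Run cmd and restart it for as long as it exits with a non-zero code.

    Gives up after max_restarts restarts and returns the list of exit
    codes observed, one per run.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = subprocess.call(cmd)  # blocks until the child exits
        exit_codes.append(code)
        if code == 0:
            break                    # clean exit: nothing to recover from
        time.sleep(delay)            # optional pause so a crash loop does not spin
    return exit_codes


if __name__ == "__main__":
    # Hypothetical child which always crashes with exit code 1.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(run_with_restarts(crashing, max_restarts=2))  # prints [1, 1, 1]
```

Capping the number of restarts is a design choice of this sketch: an application which crashes immediately on every start is better reported than restarted forever.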

Dealing with infinite loops is trickier. Heartbeats are a promising solution, but getting them right is difficult both on the client and the server side. The things to watch out for (paraphrased from the reference):

Note that heartbeats are not request-reply. They flow asynchronously.

Do not force durability for heartbeats. They should not be stored, lest they overwhelm memory while being sent to a slow wrapper. Also, all pending heartbeats should be dealt with during a check.

First, get the heartbeating working, and only then add in the rest of the message flow. You should be able to prove the heartbeating works by starting peers in any order, stopping and restarting them, simulating freezes, and so on.

Use a secondary timer to trigger heartbeat checks. Do not use the poll loop for this, because it might return before timeouts occur.

In a real application, heartbeating must be configurable and usually negotiated with the peer. Some peers will want aggressive heartbeating, as low as 10 msecs. Other peers will be far away and want heartbeating as high as 30 seconds.

If you have different heartbeat intervals for different peers, your poll timeout should be the lowest of these.

You might be tempted to open a separate socket dialog for heartbeats. This is superficially nice because you can separate different dialogs, e.g. the synchronous request-reply from the asynchronous heartbeating. However it's a bad idea for several reasons. First, if you're sending data you don't need to send heartbeats. Second, sockets may, due to network vagaries, become jammed. You need to know when your main data socket is silent because it's dead, rather than just not busy, so you need heartbeats on that socket. Lastly, two sockets is more complex than one.
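The bookkeeping behind some of these points can be sketched as follows. This is a simplified illustration, not the wrapper's actual implementation: the class `HeartbeatMonitor`, the helper `poll_timeout`, and the injectable `clock` argument are all assumptions of this example, and the transport which delivers the heartbeats is left out entirely.

```python
import time


class HeartbeatMonitor:
    """Remembers only the latest heartbeat and counts missed intervals."""

    def __init__(self, interval, max_missed, clock=time.monotonic):
        self.interval = interval      # expected seconds between heartbeats
        self.max_missed = max_missed  # tolerated number of missed heartbeats
        self.clock = clock            # injectable so the logic can be tested
        self.last_beat = clock()

    def beat(self):
        # No durability: a new heartbeat simply overwrites the previous one.
        self.last_beat = self.clock()

    def missed(self):
        # Number of whole intervals elapsed since the last heartbeat.
        return int((self.clock() - self.last_beat) // self.interval)

    def is_dead(self):
        return self.missed() > self.max_missed


def poll_timeout(monitors):
    """With different intervals per peer, poll with the lowest of them."""
    return min(m.interval for m in monitors)
```

With a fake clock, the behaviour is easy to check by hand:

```python
now = [0.0]
monitor = HeartbeatMonitor(interval=1.0, max_missed=3, clock=lambda: now[0])
now[0] = 2.5
print(monitor.missed(), monitor.is_dead())  # prints 2 False
now[0] = 4.5
print(monitor.missed(), monitor.is_dead())  # prints 4 True
monitor.beat()
print(monitor.missed(), monitor.is_dead())  # prints 0 False
```

In a real wrapper, `is_dead()` would be called from a separate timer rather than from the poll loop, for the reason given in the list above.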

Lastly, detecting aberrant behaviour is too domain-specific a problem to be addressed by a generic wrapper. For all the wrapper knows, you probably wanted to write a violent program. Hence, the wrapper has a big red panic button on top which anyone from the outside world can press. Hopefully, whoever presses this button does so only for good reasons.

"What possibly could go wrong?"

One of the many things which can go wrong with this naive panic button (even after taking care of all the security issues) is a classic walk to infinity, or livelock, if more than one external process is allowed to press the big red button. Consider two independent processes which communicate with the wrapped application: Paranoid Monitor 1 and Paranoid Monitor 2. If they alternately ask for the application to be restarted before it has had enough time to initialize properly, this could happen:


Livelock because of not having enough time to initialize properly

It is relatively easy to avoid this problem in a single-process setting by ensuring that a proper restart has taken place, but it gets more and more complicated as the complexity of the world increases.
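One possible guard against this livelock, in the simple setting, is to ignore restart requests which arrive while the previous restart is still settling. The sketch below is an assumption-laden illustration rather than the wrapper's actual logic: the class `RestartGuard` and the fixed `init_grace` period stand in for whatever "a proper restart has taken place" means for your application.

```python
import time


class RestartGuard:
    """Accepts a restart request only if the previous restart has settled."""

    def __init__(self, init_grace, clock=time.monotonic):
        self.init_grace = init_grace  # seconds the application needs to initialize
        self.clock = clock            # injectable so the logic can be tested
        self.last_restart = None
        self.last_requester = None

    def request_restart(self, requester):
        now = self.clock()
        initializing = (
            self.last_restart is not None
            and now - self.last_restart < self.init_grace
        )
        if initializing:
            return False              # drop the request: a restart is in progress
        self.last_restart = now
        self.last_requester = requester
        return True                   # caller should now restart the application


now = [0.0]
guard = RestartGuard(init_grace=5.0, clock=lambda: now[0])
print(guard.request_restart("Paranoid Monitor 1"))  # prints True
now[0] = 1.0
print(guard.request_restart("Paranoid Monitor 2"))  # prints False
now[0] = 6.0
print(guard.request_restart("Paranoid Monitor 2"))  # prints True
```

A fixed grace period is the crudest option; an explicit "I am ready" message from the application would be more accurate but needs cooperation from the wrapped code.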

These problems may be avoided by carefully coding the logic and putting in more checks, but there will always remain things you have not thought about.

Bottom line

Absolute reliability is impossible.
A little reliability is possible, but not very easy.
There always are things you have not thought about.

Code

With these, the wrapper is good to go. The source code of this reliable wrapper is posted on GitHub.

Feel free to check it out; any feedback is welcome.

~
musically_ut

Musically yours

Thursday, October 20, 2011

Software Reliability: 3 general problems
