By PJ Pérez in Cloud Infrastructure — 04 Feb 2025

Reliability: That untamed beast

Reliability in a system is a very complex topic.

You start designing your system and account for as many general failure cases as you can, then you start relying on your experience with less obvious failures and try to account for those, too. You get too excited and end up adding too much complexity to the system. Is it worth it? Are you adding extra complexity to cover a potential failure scenario that you have seen once in 20 years? It's tempting to just say it is really not worth it, but what would be the impact if the issue were to happen? What if it could instantly kill your business? And can you take a different approach to reduce complexity while still accounting for those failure cases?

There are countless technologies, techniques, protocols, blueprints, theorems, best practices and other helpful resources that are well known and readily available to you. It is overwhelming, so let's try to take on the challenge step by step, right?

One of my favourite ways to start tackling the problem is to talk about the CAP theorem, which will help us define and understand some important concepts.

CAP Theorem

Seth Gilbert and Nancy Lynch, in their paper “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services”, state that when designing a distributed web service, there are three properties that are commonly desired:

Consistency
Availability
Partition tolerance

They also state that it is impossible to guarantee all three at the same time.

But what does each one of those properties mean?

Consistency: For a distributed system to be consistent, I should be able to get the same data if I query any of its nodes.
Availability: For a distributed system to be considered available, I should always be able to get a response when requesting data from it.
Partition tolerance: For a distributed system to be considered tolerant to network partition, the system should sustain any amount of network failures that don’t result in a failure of the whole network.

Trade-offs

If you are building a distributed system the first thing you need to do is to avoid single points of failure. That means that the system must be tolerant to network partitions, so you have to find a trade-off between consistency and availability.

Alright, that sounds interesting and it may have raised many questions for the reader, so let's see how it works in practice by talking about a practical scenario.

Alice’s reminder services

Bob is a middle-aged white-collar worker with a short attention span. This has caused him problems in the past as he forgets appointments.

Bob is also a smart person, so he has decided to find a solution. His solution is to write down appointments in a notebook as soon as he gets them. It works brilliantly as he can now see what appointments he’s got.

Bob casually mentions this to his wife Alice during dinner. Alice is a smart person too and an entrepreneur by nature. She understands that Bob’s method covers a need that more people will surely have, so she starts a reminders business. She buys a new phone and a notebook and starts advertising her 9-to-5, Monday-to-Friday appointment reminder business in the local newspapers.

The customers start coming

Alice starts getting a few calls per day. People will call her, tell her the appointment they need to be reminded of and she will write it down in her notebook. When the time arrives, she will call the customer back to remind them about the appointment. Some customers will also make a second call to update or amend the appointments, which Alice promptly finds and amends in her notebook.

The business works great and the customers are very happy with the service, so the call rate starts to increase. Money is flowing!

At this point in time, her business is consistent because the data is always up to date. As her system is only one node (Alice), it can't be partitioned, so there's no point in talking about partition tolerance.

Time to scale

Eventually, Alice receives too many calls for her to handle during peak hours, so she starts missing calls to add reminders and has no time to call customers to remind them about their appointments! It is so bad that she needs to take a few long breaks just to cope with the stress! Her business is crumbling down!

Turns out her business is not available as she has several downtimes per day due to the stress caused by the work load, but it is still consistent because when Alice manages to call a customer she still delivers the right information. Alice’s business is not scalable as it is!

Scaling out and load balancing

She decides she will enlist her husband Bob to help. After all, Alice’s business is booming, so they can both afford to work on it. The first action is to buy a second phone so they can both receive calls in parallel, and a notebook for Bob so they can both write down and consult appointments. They configure their phone system so that they take incoming calls in turns. Balancing the load in that way is known as a round-robin system.
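As a rough illustration, the turn-taking phone setup could be sketched like this in Python. The function name round_robin and the call labels are made up for the example; they are not part of any real phone system.

```python
from itertools import cycle

def round_robin(calls, workers):
    """Assign each incoming call to the next worker in turn (hypothetical helper)."""
    assignment = {worker: [] for worker in workers}
    turn = cycle(workers)  # Alice, Bob, Alice, Bob, ...
    for call in calls:
        assignment[next(turn)].append(call)
    return assignment

# Illustrative call labels, not real data.
calls = ["call-1", "call-2", "call-3", "call-4", "call-5"]
schedule = round_robin(calls, ["Alice", "Bob"])
print(schedule)
```

Note that round-robin ignores how busy each person already is: it only looks at whose turn it is.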

The measure turns out to be a great success! They are not missing any customer calls and both are just busy enough, so they still have time to call customers to remind them about their appointments.

One day Alice is not feeling great, so she decides to skip work and rest until she’s feeling better. After all she doesn’t want to write down incorrect information or make other mistakes. That would impact her business negatively. The business can keep functioning without her though because Alice’s business is tolerant to partition faults.

Eventually Bob has a really busy day and misses a few calls and appointment reminders. A small number of customers complain to Alice about the bad service. Bob clearly can’t handle this by himself.

Alice tried to make her business available by scaling out, but unfortunately the business becomes unavailable at peak times if she is not working.

Scaling up

Alice realises that it takes her 30 seconds to write down a new appointment, while Bob needs 50 seconds to do the same. She can’t afford to add another employee, so improving the efficiency of the current ones is the way to go.

She decides to send Bob to a reminder writing training course (over the weekend as they can’t afford to have only one worker in the office during weekdays) so he can improve his reminder writing efficiency. She decides to join the course too in case she can also improve her efficiency.

The course proves to be very effective and now both Bob and Alice require only 20 seconds to write a new appointment.
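To get a feel for what the course buys them, here is a back-of-the-envelope calculation in Python. It assumes, purely for illustration, that writing the appointment down is the only cost of a call; time spent talking or making reminder calls is ignored, so the absolute numbers are optimistic.

```python
SECONDS_PER_HOUR = 3600

def appointments_per_hour(seconds_per_appointment):
    """Upper bound on appointments one person can write down in an hour."""
    return SECONDS_PER_HOUR / seconds_per_appointment

# Writing times in seconds, taken from the story.
before = {"Alice": 30, "Bob": 50}
after = {"Alice": 20, "Bob": 20}

capacity_before = sum(appointments_per_hour(s) for s in before.values())
capacity_after = sum(appointments_per_hour(s) for s in after.values())
alice_alone_after = appointments_per_hour(after["Alice"])

print(capacity_before, capacity_after, alice_alone_after)
```

Under that assumption the pair goes from 192 to 360 appointments per hour, and Alice alone (180 per hour) now handles almost as much as both of them did before the course.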

Bob has a doctor’s appointment the next Monday and that day Alice works by herself. She is perfectly capable of coping with all the calls and appointment reminders, but she decides to keep Bob on the payroll because she needs to be able to take time off sometimes and eventually the load will catch up with her again.

Alice’s business is available, and it keeps growing at an astounding pace!

Consistency, or lack thereof

A few days later Alice gets a call from a very angry customer. He claims he didn’t receive a very important reminder call last Monday! Alice is surprised and appalled in equal parts. How could they miss a call? And it had to be her fault, because Bob was at the doctor’s on Monday!

Alice investigates and discovers that this was an appointment that Bob wrote down in his notebook, but she had been calling customers from her notebook only!

Alice’s business is not consistent because different notebooks have different data. Customers are starting to feel the consequences of that.
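The missed reminder can be reproduced with a toy model in Python: two independent notebooks, where each write lands in only one of them. The names write_appointment, reminders_due and the customer labels are invented for this sketch.

```python
# Each worker keeps a private notebook: customer -> reminder time.
notebooks = {"Alice": {}, "Bob": {}}

def write_appointment(worker, customer, when):
    """Record the appointment only in the notebook of whoever took the call."""
    notebooks[worker][customer] = when

def reminders_due(worker, when):
    """Reminders this worker knows about, reading only their own notebook."""
    return [c for c, t in notebooks[worker].items() if t == when]

# Bob takes one call, Alice takes another, both for the same Monday slot.
write_appointment("Bob", "customer-A", "Monday 10:00")
write_appointment("Alice", "customer-B", "Monday 10:00")

# On Monday Bob is away, so only Alice's notebook gets read.
due = reminders_due("Alice", "Monday 10:00")
print(due)  # customer-A is missing: it only exists in Bob's notebook
```

Both notebooks are correct on their own; the problem is that a read against one node does not see writes made on the other.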

Well, that hurts.

Alice comes up with an initial idea to make her business consistent again. She and Bob will write down the appointments on both notebooks for every single call.

What on paper sounded like a brilliant solution turns out to be a terrible decision. Now each phone call takes more than twice as long as before! They need to write the appointment down twice, and they also need to walk to each other’s desk!

To make things even worse, when Alice is writing in Bob’s notebook, Bob can’t write, so he has to wait! Each appointment now requires 90 seconds on average to be entered into the system! Due to their reduced efficiency, Alice and Bob can only take one quarter of that day’s calls.

The system really crumbles down that day and they miss countless calls and reminders. It is so bad that they make the local news that evening.

Alice can barely sleep that night. She can’t seem to make her business available and consistent at the same time! But since she added Bob, her business has definitely been tolerant to network partitions.

The CAP theorem predicted this outcome, but Alice is not the kind of person that gives up easily, especially now that her family’s wellbeing depends directly on the success of her business.

Time to choose a trade-off

Alice didn’t sleep much last night. She has been thinking about the problem and studying her company’s records. She had a revelation around 4.30 am when she realised that hardly any of her customers called to set an appointment reminder for the same day they were calling!

This revelation lets her choose to trade off some consistency for the sake of availability. Her plan is to go back to her and Bob writing down appointments only in their own notebooks, but at the end of the day, when incoming calls are closed, they’ll synchronise their notebooks. If there is a conflict for the same appointment written down at two different times, they will remove the one that was logged earlier, as the later one would surely be an amendment.
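The end-of-day routine is essentially a "last write wins" merge. A minimal Python sketch follows, assuming each entry records when it was logged; the field names logged_at and remind_at, the function synchronise and the sample entries are all hypothetical.

```python
from datetime import datetime

def synchronise(notebook_a, notebook_b):
    """Merge two notebooks, keeping the most recently logged entry per appointment."""
    merged = dict(notebook_a)
    for appointment_id, entry in notebook_b.items():
        current = merged.get(appointment_id)
        # A later log time is treated as an amendment and replaces the earlier one.
        if current is None or entry["logged_at"] > current["logged_at"]:
            merged[appointment_id] = entry
    return merged

# Illustrative data only.
alice = {
    "dentist-smith": {"logged_at": datetime(2025, 2, 3, 9, 15), "remind_at": "2025-02-05 10:00"},
    "vet-jones": {"logged_at": datetime(2025, 2, 3, 11, 0), "remind_at": "2025-02-06 16:00"},
}
bob = {
    "dentist-smith": {"logged_at": datetime(2025, 2, 3, 14, 40), "remind_at": "2025-02-05 11:30"},
    "mot-lee": {"logged_at": datetime(2025, 2, 3, 15, 5), "remind_at": "2025-02-07 09:00"},
}

merged = synchronise(alice, bob)
alice, bob = dict(merged), dict(merged)  # both notebooks now hold the same data
print(merged["dentist-smith"]["remind_at"])
```

In this sketch a tie on logged_at keeps the entry from the first notebook, and the merge only works if both people agree on the time. Between two synchronisations the notebooks can still disagree, which is exactly the consistency Alice is giving up.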

To finish it off Alice updates their service conditions and explicitly writes that same-day reminders are not supported.

Conclusions

Alice had to reduce functionality in order to have a functioning business, but that was a data-driven decision that didn’t make her lose any current customers.

Alice’s business is now available and tolerant to partition faults. Is it consistent, though? In short, no; but Alice’s business is eventually consistent, which has proven to be consistent enough for her customers’ needs.

Moving forward

I will leave it to the reader as an exercise to think further and try to solve some problems such as:

What changes does Alice’s business need to support same-day appointments?
If Alice keeps adding employees, eventually it will take too long to synchronise everyone’s notebooks. What strategies could we follow to keep the business running at that point?
Can we improve the overall efficiency of the system by adding employees with specialised tasks?
How would you design a scalable system with employees dedicated to get calls and write appointments and others dedicated to read appointments and make calls?

Please leave your questions, comments, complaints, doubts, queries, etc in the comments section.