Distributed Systems
Thanesh Pannirselvam
I've been thinking a lot about distributed systems lately, and how they relate to modern application architectures and cloud-native technologies.
The whole concept of distributed systems piqued my interest once I started learning about Kubernetes and what the technology aims to achieve.
From my perspective, the objectives of a distributed system boil down to achieving three things: fault-tolerance, scalability and resource utilization - which, if we think about it, map closely to some of the proposed benefits of using cloud technologies.
In this post, I'll be sharing some of my current thoughts around these objectives and distributed systems more broadly.
What makes a system distributed?
I think it's important to firstly define what makes a system distributed.
At a high level, a distributed system is a system where the components are distributed across multiple machines.
A system is typically considered distributed if it meets the following criteria:
- The system is composed of multiple independent components.
- The components are distributed across multiple machines.
- The components are loosely coupled and can operate independently.
Practical examples of distributed systems include Kubernetes and Apache Kafka: multiple nodes working together to achieve a common goal, managed by a central authority.
The cloud, or cloud technologies more broadly, can also be thought of as a distributed system: datacenters located across the globe are segmented into regions and availability zones, each working independently but connected together to form a global system.
Goals of Distributed Systems
Now that we've defined what makes a system distributed, we can start to think about the problems it aims to solve.
1. Fault-tolerance
Fault-tolerance is the ability of a system to continue operating even in the presence of faults, i.e. where some components are not working as expected.
Fault-tolerance is generally measured in terms of availability, i.e. the fraction of time that the system is operational.
- 99.9% availability (3 nines) -> 10 minutes of downtime per week
- 99.99% availability (4 nines) -> 1 minute of downtime per week
- 99.999% availability (5 nines) -> 6 seconds of downtime per week
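As a rough illustration of how those figures are derived, here's a minimal Go sketch that turns an availability target into a weekly downtime budget (the function name is my own; the arithmetic is simply one minus the availability, applied over a week):

```go
package main

import (
	"fmt"
	"time"
)

// downtimePerWeek converts an availability target (e.g. 0.999 for "3 nines")
// into the maximum downtime that target allows over one week.
func downtimePerWeek(availability float64) time.Duration {
	week := 7 * 24 * time.Hour
	return time.Duration((1 - availability) * float64(week))
}

func main() {
	for _, a := range []float64{0.999, 0.9999, 0.99999} {
		fmt.Printf("%.3f%% -> %v of downtime per week\n", a*100, downtimePerWeek(a).Round(time.Second))
	}
}
```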
When designing a distributed system, we typically need to consider the following:
- Service Level Objectives (SLOs) - What is the acceptable level of availability for the system?
- Service Level Agreements (SLAs) - What level of service has the provider contractually agreed with its users?
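To make the SLO side a little more concrete, here's a hedged sketch of an error budget: the downtime a service can still absorb within a window before its objective is breached. The function name and shape are my own for illustration, not from any particular SRE tooling.

```go
package main

import (
	"fmt"
	"time"
)

// remainingErrorBudget reports how much more downtime a service can tolerate
// within the given window before it falls below its SLO target.
func remainingErrorBudget(slo float64, window, observedDowntime time.Duration) time.Duration {
	budget := time.Duration((1 - slo) * float64(window))
	return budget - observedDowntime
}

func main() {
	// A 99.9% SLO over a week allows roughly 10 minutes of downtime; after
	// 4 minutes of observed downtime, roughly 6 minutes of budget remain.
	week := 7 * 24 * time.Hour
	fmt.Println(remainingErrorBudget(0.999, week, 4*time.Minute).Round(time.Second))
}
```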
We can think of availability zones in a cloud environment as a way of achieving fault-tolerance: if one zone within a region goes down, the system can continue to operate predictably from another zone.
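As a toy example of that failover behaviour, the sketch below tries replicas of a service zone by zone and succeeds as long as any zone is healthy. The zone names and the hard-coded failure are invented for illustration; in practice this is usually handled by load balancers and health checks rather than application code.

```go
package main

import (
	"errors"
	"fmt"
)

// callZone stands in for sending a request to a replica of the service in a
// given availability zone. Here zone-a is pretended to be down.
func callZone(zone string) (string, error) {
	if zone == "zone-a" {
		return "", errors.New(zone + " is unavailable")
	}
	return "response from " + zone, nil
}

func main() {
	zones := []string{"zone-a", "zone-b", "zone-c"}
	for _, z := range zones {
		resp, err := callZone(z)
		if err == nil {
			fmt.Println(resp)
			return
		}
		// The request failed in this zone; fall back to the next one.
		fmt.Println(err, "- trying the next zone")
	}
	fmt.Println("all zones failed")
}
```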
2. Scalability
Scalability is the ability of a system to handle an increasing amount of work, or load, by adding more resources based on demand.
In a distributed system, there are typically two types of scalability:
- Horizontal scalability (scaling out)
- Vertical scalability (scaling up)
Horizontal scalability is achieved by adding more machines to the system, while vertical scalability is achieved by adding more resources to the existing machines.
When we think about designing a distributed system, our focus should be on ensuring that the service is able to scale automatically based on demand. Typically, we may bake in some additional rules, such as a minimum number of instances or CPU/memory thresholds.
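As a sketch of what such a rule could look like, the function below scales the replica count in proportion to how far observed CPU utilization is from a target, then clamps the result between a minimum and maximum. It's loosely modelled on the proportional formula Kubernetes' Horizontal Pod Autoscaler documents, with my own names and simplifications.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas scales the replica count proportionally to how far the
// observed CPU utilization is from the target, clamped to [minReplicas, maxReplicas].
func desiredReplicas(current int, observedCPU, targetCPU float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(float64(current) * observedCPU / targetCPU))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// 4 replicas running at 90% CPU against a 60% target -> scale out to 6.
	fmt.Println(desiredReplicas(4, 0.90, 0.60, 2, 10))
}
```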
3. Resource utilization
Resource utilization is the ability of a system to use resources efficiently.
In a distributed system, we can use resources efficiently by distributing the work across multiple machines and segmenting the workload based on the functional boundaries of the system.
More practically, we can think of a microservices architecture where different components of the system utilize different cloud resources based on their specific needs. For example, compute-intensive services might run on Kubernetes clusters, data-heavy services might use managed databases, and caching layers might leverage in-memory stores like Redis.
This approach allows for more efficient use of resources, as we can allocate resources based on the demands of each specific part of the system.
When designing a distributed system, we can also think of allocating resources based on their function and/or expected usage, i.e. reserving more resources for components with greater demands and fewer for those with less.
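To ground the "distributing the work across multiple machines" idea, here's a deliberately simplified sketch that assigns work to nodes by hashing a key. Real systems usually prefer consistent hashing (or a scheduler that knows about each node's capacity) so that adding or removing a node moves less work, but the basic idea of partitioning is the same; the node and key names are made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// nodeFor picks which node should handle a given key by hashing the key and
// taking the result modulo the number of nodes.
func nodeFor(key string, nodes []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return nodes[h.Sum32()%uint32(len(nodes))]
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3"}
	for _, key := range []string{"orders", "payments", "inventory"} {
		fmt.Printf("%s -> %s\n", key, nodeFor(key, nodes))
	}
}
```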
Closing thoughts
In this post, I touched on some very high-level concepts of distributed systems - which I can confidently say barely scratches the surface.
Distributed systems is a fascinating yet extensive topic, with many intricacies that could honestly take a lifetime to uncover - which is to say that I may share some deeper thoughts in the future...