Designing Data-Intensive Applications Chapter 1 Summary


Introduction

This post is a summary of the key points found in Chapter 1 of Designing Data-Intensive Applications.

Reliable, Scalable, and Maintainable Systems

The most important concerns of software systems are:

  • Reliability - The system should continue to work correctly even in the face of hardware faults, software faults, and human error.
  • Scalability - The system should be able to handle growth in load reasonably.
  • Maintainability - The system should remain easy to understand and modify as time goes on, without hurting the productivity of the people working on it.

Reliability

Faults

A fault is defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. A reliable system is fault-tolerant: it prevents faults from turning into failures. Some companies, like Netflix, deliberately introduce faults with tools like Chaos Monkey to test how well their systems cope.

One way to cope with hardware faults is to add redundancy to the hardware. Systems built to tolerate the loss of entire machines are usually easier to upgrade (nodes can be taken down one at a time) and less likely to go down, because load is spread horizontally across many machines rather than vertically on a single large one.

To avoid human errors, design systems so that opportunities for error are minimized. Continuously monitor performance metrics and error rates (telemetry), and back that up with good management practices and training.

Scalability

Scalability is a term used to describe a system's ability to cope with increased load.

What is Load?

Load can be described with a few numbers called load parameters (e.g. requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users, etc.).

Twitter Example

The book illustrates scalability with Twitter. Twitter's two primary operations are loading a user's home timeline and posting a tweet, and there are two main approaches to implementing them.

Approach 1: early versions of Twitter inserted each tweet into a global tweets table with a reference to the user who posted it; loading a timeline meant joining the Tweets and Users tables for everyone the logged-in user followed. The downside was that timeline reads were slow.

Approach 2: Twitter switched to pre-computing timelines. When a user posts a tweet, it is pushed into a cached home timeline for each of that user's followers. Reading from this cache is much faster and cheaper than querying the database on every read, which let Twitter scale. The downside is that a user with millions of followers makes posting expensive, because a single tweet has to be fanned out to a huge number of caches.

In practice Twitter takes a hybrid approach: for users with very large follower counts it falls back to approach 1, and it uses approach 2 for everyone else.
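A minimal sketch of the two approaches in Python, assuming in-memory dictionaries and a made-up follower threshold (this is purely illustrative, not Twitter's actual implementation):

```python
# Illustrative sketch only: in-memory dicts stand in for real storage,
# and FANOUT_THRESHOLD is an invented number.
FANOUT_THRESHOLD = 100_000   # above this, skip fan-out on write

tweets = {}          # tweet_id -> {"author": user_id, "text": text}
followers = {}       # user_id  -> set of follower user_ids
timeline_cache = {}  # user_id  -> list of tweet_ids (approach 2)

def post_tweet(author_id, tweet_id, text):
    tweets[tweet_id] = {"author": author_id, "text": text}
    fans = followers.get(author_id, set())
    if len(fans) < FANOUT_THRESHOLD:
        # Approach 2: push the new tweet into every follower's cached timeline.
        for follower_id in fans:
            timeline_cache.setdefault(follower_id, []).append(tweet_id)
    # Very popular authors skip fan-out; their tweets are merged at read time.

def load_timeline(user_id, followed_celebrities):
    # Approach 1, applied only to the celebrities this user follows:
    # fetch their tweets at read time.
    celebrity_tweets = [tid for tid, t in tweets.items()
                        if t["author"] in followed_celebrities]
    # Combine with the timeline pre-computed at write time (approach 2).
    return timeline_cache.get(user_id, []) + celebrity_tweets
```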

What is Performance?

Throughput is described as the number of records that can be processed per second, or the total time it takes to run a job on a dataset of a certain size.

Latency is the duration that a request spends waiting to be handled, whereas response time is what the client actually observes: the time to process the request (service time) plus network delays and queueing delays.

To measure response time, it is best to capture the response time of every request, sort the values, and look at percentiles rather than averages. The median (50th percentile) tells you how long a typical user waits, while higher percentiles such as the 95th, 99th, and 99.9th show how bad the tail latencies are. Tail latencies matter because they often hit your heaviest (and most valuable) users, but optimizing extremely high percentiles (e.g. the 99.99th) is usually too expensive to be worth the benefit.
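As a rough illustration (the sample durations below are invented), percentiles can be computed from a sorted list of measured response times:

```python
# Toy example: response-time percentiles via the nearest-rank method.
def percentile(sorted_samples, p):
    """Return the value at percentile p (0-100) of an ascending list."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, round(p / 100 * len(sorted_samples)))  # nearest rank
    return sorted_samples[rank - 1]

response_times_ms = sorted([12, 15, 18, 22, 25, 31, 40, 55, 90, 450])

print("p50 (median):", percentile(response_times_ms, 50), "ms")  # typical request
print("p95:", percentile(response_times_ms, 95), "ms")           # slow requests
print("p99:", percentile(response_times_ms, 99), "ms")           # tail latency
```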

Distributing load across multiple machines is known as a shared-nothing architecture. Scaling up (vertical scaling, using a more powerful machine) is usually simpler, but it doesn't scale as far as scaling out (horizontal scaling, distributing load across many machines). In practice, good architectures usually involve a pragmatic mixture of both.

Maintainability

There are 3 design principles for software systems:

  • Operability - Make it easy for operations teams to keep the systems running.
  • Simplicity - Make it easy for new engineers to understand the system and keep it simple.
  • Evolvability - Make it easy for engineers to change the system in the future and adapt it for new requirements.

Operability

Operations teams are typically responsible for:

  • Monitoring the health of the system and quickly restoring service when it goes into a bad state.
  • Tracking down the cause of problems.
  • Keeping software and platforms up to date.
  • Keeping tabs on how different systems interact with one another.
  • Anticipating future problems (e.g. capacity planning).
  • Establishing good practices and tools for deployment and configuration management.
  • Performing maintenance tasks.
  • Maintaining the security of the system.
  • Defining processes that make deployments predictable and operations stable.
  • Keeping the organization knowledgeable about changes across all systems.

Simplicity

As a system grows, its complexity tends to increase. You should avoid letting the architecture degrade into a "big ball of mud." When complexity gets out of control, maintenance becomes hard and projects fall behind schedule and run over budget. Reducing complexity makes the system easier to maintain, and the main tool for doing so is abstraction.

Evolvability

Evolvability is the ease with which you can modify a data system and adapt it to changing requirements. Simple, easy-to-understand systems are generally also easier to evolve.
