Chapter 1
Most Important System Concerns
- Reliability
- System works correctly
- Scalability
- Continues to perform well as data volume, traffic volume, or complexity grows
- Maintainability
- Ease of maintaining existing behavior and adaptability to new use cases
The author lumps Databases and Message Queues into one broad term “Data Systems” because the lines between the two are getting blurred.
- Data stores like Redis are used as message queues
- Message queues like Apache Kafka now have database-like durability guarantees
Imagine you are a system designer:
- How would you ensure data is correct and complete? (even when failures happen)
- How would you ensure fast performance?
- How would you scale?
- What would an API for the service look like?
Fault vs Failure
- A fault is when one component of the system deviates from its spec
- A failure is when the system as a whole stops providing the required service to the user
It is impossible to reduce the probability of a fault to zero, so the goal is to prevent faults from becoming failures
Many bugs are actually the fault of improper error handling
Systematic software errors are hard to anticipate and, because they are correlated across nodes, tend to cause many more system failures than uncorrelated hardware faults
Examples of software errors
- A bug in the Linux kernel triggered by the leap second on June 30, 2012
- Runaway process uses up memory, CPU time, disk space, or network bandwidth
- Dependencies become unresponsive or corrupt resources
- Cascading failure (small fault causes bigger faults)
Software faults are mitigated by
- Thinking carefully about assumptions and interaction
- Testing
- Process isolation
- Letting processes crash and restart
- Measure and monitor system behavior in production
- Build in continuous self-checks (e.g., a message queue verifying that messages in equals messages out) that alert on any discrepancy (see the sketch below)
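A minimal sketch of that last point (all names hypothetical): a toy queue that audits its own invariant, messages in = messages out + messages pending, and raises an alert on any discrepancy before it can turn into a failure.

```python
import logging

class SelfAuditingQueue:
    """Toy queue that continually checks its own invariant:
    enqueued == dequeued + pending."""

    def __init__(self):
        self._items = []
        self._enqueued = 0
        self._dequeued = 0

    def put(self, item):
        self._items.append(item)
        self._enqueued += 1

    def get(self):
        item = self._items.pop(0)  # raises IndexError if empty (toy code)
        self._dequeued += 1
        return item

    def audit(self):
        # Run this periodically: a discrepancy here is a fault caught
        # before it becomes a user-visible failure.
        if self._enqueued != self._dequeued + len(self._items):
            logging.error("invariant violated: in=%d out=%d pending=%d",
                          self._enqueued, self._dequeued, len(self._items))
```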
Minimize human error (like bad config changes) by
- Design good APIs and admin UIs that make it easy to do “the right thing” and discourage mistakes, while limiting access where appropriate (but not so restrictive that people work around them)
- Create sandbox for devs to mess with and experiment
- Test thoroughly (cover corner cases)
- Make rollbacks easy
- Set up good monitoring/telemetry
Twitter has a fan-out problem: when a user requests their home timeline, there are two ways to get tweets:
- Look up each person they follow, find all tweets from each of those users, and merge them sorted by time
- Cache each user’s home timeline. When a user posts, look up each person who follows them and insert the new post into each follower’s timeline cache. (This is precomputation, i.e., fan-out on write)
Question: Why did Twitter go with approach 1 for celebrities?
In the case of a celebrity, a user pulls the celebrity’s tweets at the time they load their home timeline. Otherwise, every tweet by the celebrity would have to be inserted into the cached timelines of millions of followers. (See the hybrid sketch below.)
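A sketch of the hybrid approach (all names and the follower threshold are hypothetical): fan-out on write for ordinary users, fan-out on read for celebrities.

```python
CELEBRITY_THRESHOLD = 100_000  # hypothetical follower cutoff

followers = {}         # user_id -> set of follower user_ids
timeline_cache = {}    # user_id -> list of (timestamp, tweet)
celebrity_tweets = {}  # user_id -> list of (timestamp, tweet)

def post_tweet(user_id, timestamp, tweet):
    if len(followers.get(user_id, ())) >= CELEBRITY_THRESHOLD:
        # Approach 1 for celebrities: just record the tweet;
        # followers pull it when they load their timelines.
        celebrity_tweets.setdefault(user_id, []).append((timestamp, tweet))
    else:
        # Approach 2 (fan-out on write): push the tweet into every
        # follower's precomputed timeline cache.
        for f in followers.get(user_id, ()):
            timeline_cache.setdefault(f, []).append((timestamp, tweet))

def home_timeline(user_id, followed_celebrities):
    cached = timeline_cache.get(user_id, [])
    pulled = [t for c in followed_celebrities
                for t in celebrity_tweets.get(c, [])]
    # Merge the precomputed cache with celebrity tweets fetched at read time.
    return sorted(cached + pulled, reverse=True)
```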
Latency vs Response Time
- Latency: The duration that the request waits to be handled
- Response Time: What the client sees (including network delays and queuing delays)
- Service Time: Actual time the service takes to process the request
Response time is a distribution of measurable values.
- The arithmetic mean is not a good measure of response time because it does not tell you how many users actually experienced that delay.
- We use the median (p50) because it tells us that half of all user requests were served in less than that value (e.g., in under 200 ms)
- High percentiles are known as tail latencies and are tracked closely by companies like Amazon
Amazon uses the 99.9th percentile to understand the experience of the customers with the most data (possibly the most valuable customers); improving the 99.99th percentile, however, was deemed too expensive for too little benefit, since latency at that level is often outside your control.
High response time percentiles are often caused by queueing delays.
- Ex: a small number of long requests block subsequent requests
Because of this, it is important to measure response times on the client side: the server may be processing requests quickly, but if requests sit in a queue, the response time the user sees still suffers.
It is also important that load-test clients do not wait for one request to finish before sending the next; otherwise the test’s queues will be artificially shorter than in reality.
Tail latency amplification occurs when a single end-user request requires multiple backend calls: even if only a small fraction of backend calls are slow, the chance that at least one call in a request is slow grows with the number of calls, so the end-user response time increases. (Quick arithmetic below.)
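A back-of-the-envelope illustration: even if only 1% of individual backend calls are slow, a request that fans out to 100 such calls is slow about 63% of the time.

```python
p_slow = 0.01  # 1% of individual backend calls exceed the latency budget
n_calls = 100  # backend calls made for a single end-user request

# Probability that at least one of the n calls is slow:
p_any_slow = 1 - (1 - p_slow) ** n_calls
print(f"{p_any_slow:.0%}")  # prints 63%
```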
The naïve way to calculate percentiles over a given period is to keep a list of every response time in the window and sort that list every minute.
- Do not average percentiles; instead, aggregate response time data into histograms, which can be merged across windows and machines (see the sketch below)
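A sketch of both approaches (the bucket size is an arbitrary choice): the naive sorted-list percentile, and a histogram that can be merged across minutes or machines before the percentile is computed.

```python
from collections import Counter

def percentile_naive(samples_ms, p):
    # Naive: keep every response time in the window, sort, and index.
    s = sorted(samples_ms)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

def to_histogram(samples_ms, bucket_ms=10):
    # Bucket each sample; unlike percentiles, histograms can be summed.
    return Counter(t // bucket_ms * bucket_ms for t in samples_ms)

def percentile_from_histogram(hist, p):
    # Walk the buckets in order until p% of all samples have been seen.
    threshold = p / 100 * sum(hist.values())
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= threshold:
            return bucket  # lower edge of the percentile's bucket

# Merging is just Counter addition, e.g. across two one-minute windows:
# combined = to_histogram(minute_1) + to_histogram(minute_2)
```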
Scaling up (vertical scaling): moving to a more powerful machine
Scaling out (horizontal scaling): distributing the load across multiple smaller machines
Question: “Good architectures usually involve a pragmatic mixture of approaches.” (referring to both scaling up and scaling out). What is an example of a situation where it is better to scale up, and one where it is better to scale out? What is the tradeoff (beyond cost) of each approach? (Section: “Approaches for Coping with Load”)
Maintenance Design Principles
- Operability: Make it easy for things to continue running smoothly
- Simplicity: Make it easy for new engineers to understand the system (by removing complexity when possible)
- Evolvability: Make it easy for engineers to change the system and adapt it to new use cases
“good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations”
Operations teams should
- Monitor the health of the system and quickly restore service when it goes into a bad state
- Track down the cause of problems like system failures or degraded performance
- Keep things up to date/security patched
- Keep tabs on how systems affect each other
- Anticipate future problems (capacity planning)
- Create good config management and deployment tools
- Perform complex maintenance, like migrations or moving from one platform to another
- Maintain security
- Define processes that make operations predictable and keep the production environment stable
- Preserve knowledge of the system (even when people leave)
Thoughts: I’ve been warned about over-abstracting, and I’m curious whether this is a good use of abstraction. Wouldn’t it be better to try to simplify the complex behavior to make it more understandable with helper functions/microservices, rather than abstracting?
Referring to this section
“One of the best tools we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade. A good abstraction can also be used for a wide range of different applications. Not only is this reuse more efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.”
Chapter 2
Data Models and Query Languages
Most applications are built by layering one data model on top of another
Ask yourself:
“How is each layer represented in terms of the next-lower layer?”
Example of how to think in layers
- An app developer modeling real people, organizations, goods, actions which are manipulated by APIs
- When storing these data structures, store as JSON or XML
- JSON or XML is in turn represented as bytes in memory, on disk, or over the network
- Those bytes are represented by electrical currents, pulses of light, magnetic fields, etc.
Abstractions and layers allow people to work together effectively.
Relational Model vs Document Model
Best-known data model: SQL (based on the relational model)
Relational Database:
In a relational database, each row in the table is a record with a unique ID called the key. The columns of the table hold attributes of the data, and each record usually has a value for each attribute, making it easy to establish the relationships among data points.
Relational databases have their roots in business data processing in the 1960s and 70s: mundane use cases like invoicing, payroll, reporting, banking transactions, airline reservations, etc.
Most of what is built today still relies on Relational Databases (As of when this book was written in 2017, lol dang I was a junior in high school)
The Birth of NoSQL
Driving forces behind NoSQL
- A need for better scalability than Relational Databases (RD) (large datasets/high write throughput)
- Preference for free/open source software over commercial databases
- Some queries are not well supported by RD
- Restrictiveness of RD’s schemas, desire for dynamic/expressive data models
Note: document databases typically store JSON-like documents; JSON itself is a data format, not a type of database
Object-Relational Mismatch
If data is stored in a relational database, a translation is needed between the database model of tables, rows, and columns and the objects in application code. This disconnect is called impedance mismatch.
Object-relational mapping (ORM) frameworks reduce the boilerplate in this translation layer, but they can’t completely hide the differences between the two models (relational and application code).
A JSON document has better locality than a multi-table relational schema. To query data spread across multiple tables, messy joins or multiple queries are necessary; a JSON document requires only a single query.
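For example (a hypothetical profile in the spirit of the book’s résumé example), one self-contained document holds what a normalized relational schema would spread across users, positions, and education tables joined on user_id:

```python
# One document, one query. The relational equivalent is three tables
# (users, positions, education) requiring a join or three queries.
profile = {
    "user_id": 251,
    "name": "Bill Gates",
    "positions": [  # a one-to-many relationship, nested in place
        {"job_title": "Co-chair", "organization": "Gates Foundation"},
        {"job_title": "Co-founder", "organization": "Microsoft"},
    ],
    "education": [
        {"school": "Harvard University", "start": 1973, "end": 1975},
    ],
}
```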
Many-to-One and Many-to-Many Relationships
Sometimes storing an ID that refers to a fixed list of options (e.g., for locations or job titles) is better than storing free text, because it ensures
- Consistency
- No Ambiguity for options with identical names
- Ease of updating
- Localization support
- Better Search
Using an ID also reduces value duplication. Instead of storing copies of the value, you can store the value with an ID and use the ID to refer to the single instance of the value. Removing duplications is the idea behind normalization in databases.
Normalizing data requires many-to-one relationships
- This does not fit nicely into the document model
- Support for joins in document databases is often weak
If your database does not support joins, you need to emulate a join in application code by making multiple queries.
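A sketch of such an emulated join (the `db.find_one` client API is hypothetical): resolving a many-to-one reference takes a second round trip instead of one joined query.

```python
def user_with_organization(db, user_id):
    # Round trip 1: the user document stores only an organization_id.
    user = db.find_one("users", {"_id": user_id})
    # Round trip 2: resolve the reference in application code,
    # because the database won't perform the join for us.
    org = db.find_one("organizations", {"_id": user["organization_id"]})
    user["organization"] = org
    return user
```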
Even if you design your application to not need joins and have a perfect document model, data has a tendency to become more interconnected over time.
The most popular database for business data processing in the 1970s was IBM’s Information Management System (IMS). It used a hierarchical model, which has many similarities to JSON: all data is records nested within records. This worked well for one-to-many relationships, but many-to-many relationships were awkward and joins were not supported.
This was solved with the Relational Model and with the Network Model.
The Network Model
The network model was standardized by the Conference on Data Systems Languages (CODASYL), so it is also called the CODASYL model. It generalized the hierarchical model: in the hierarchical model every record has exactly one parent, whereas in the network model a record can have multiple parents.
Queries on these models were performed by following “pointers” along a path from a root record through a chain of links. Such a path is called an access path, and it made queries difficult: with either the hierarchical or the network model, if you did not have an access path to the data you were looking for, you were stuck.
The Relational Model
A relation: a table, i.e., a collection of rows (tuples)
A query optimizer will decide which parts of the query to execute in which order and which indexes to use. This is effectively an “access path” but it is dynamic because it is made automatically by the query optimizer.
Schema Flexibility in the Document Model
JSON and other document models are sometimes called schemaless, but they really have an implicit schema; a better term is schema-on-read: the schema is interpreted only when the data is read. Relational databases are typically schema-on-write: there is an explicit schema, and the database ensures that all written data conforms to it.
This is similar to dynamic runtime type checking vs static compile time type checking.
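For example (the split-name case is the book’s; the helper function is a sketch): with schema-on-read, documents written before a schema change are interpreted at read time, whereas a schema-on-write database would be migrated up front (ALTER TABLE plus an UPDATE backfill).

```python
def first_name(user_doc):
    # New documents already have the field.
    if "first_name" in user_doc:
        return user_doc["first_name"]
    # Old documents only have a full name; interpret them on read.
    return user_doc["name"].split(" ")[0]
```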
Chapter 3
This chapter will discuss storage engines and compare log-structured vs page-oriented storage engines.
Data Structures That Power Your Database
The word log is often used to refer to application logs, where an application outputs text that describes what’s happening; in this chapter it means something more general: an append-only sequence of records.
- Simple storage: key-value pairs written to an append-only log. Writes are fast because appending is cheap and nothing is updated in place. Retrieval scans for the latest occurrence of the key, which is O(n): if the number of records n doubles, a lookup takes twice as long. To find values more efficiently, we use an index. (See the sketch below.)
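A minimal Python sketch of that append-only store (the book sketches the same idea with a pair of shell functions; this toy assumes values contain no commas or newlines):

```python
def db_set(path, key, value):
    # Fast write: append a record to the end of the log.
    with open(path, "a") as f:
        f.write(f"{key},{value}\n")

def db_get(path, key):
    # O(n) read: scan the whole log, keeping the last value for the key.
    result = None
    with open(path) as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v
    return result
```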
Def: An index is an additional structure that is derived from the primary data
Indexes can be added and removed without affecting the contents of the database; they only affect the performance of queries. They do add overhead, however, especially on writes. (See the hash-index sketch below.)
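One way to build such an index over the log above (a sketch, foreshadowing the hash indexes the chapter covers next): an in-memory map from key to the byte offset of its latest record. Reads become one seek instead of a full scan; the cost is the extra index update on every write.

```python
class IndexedLog:
    """Append-only log plus an in-memory hash index: key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}
        open(path, "ab").close()  # make sure the log file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()  # where this record begins
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset  # the write-time overhead of the index

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # jump straight to the latest record
            _, _, value = f.readline().rstrip(b"\n").partition(b",")
            return value.decode()
```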