Jim Cheung

reading notes on Database Internals

I. Storage Engines (10:21 mins)

II. Distributed Systems (03:27 mins)

The Dynamo paper, published by the team at Amazon in 2007, had so much impact on the database community that within a short period it inspired many variants and implementations. The most prominent of them were Apache Cassandra, created at Facebook; Project Voldemort, created at LinkedIn; and Riak, created by former Akamai engineers.

Storage engines such as BerkeleyDB, LevelDB and its descendant RocksDB, LMDB and its descendant libmdbx, Sophia, HaloDB, and many others were developed independently from the database management systems they’re now embedded into. Using pluggable storage engines has enabled database developers to bootstrap database systems using existing storage engines, and concentrate on the other subsystems.

1. Introduction and Overview

DBMS Architecture

         Transport
       +-------------------------------------+
       | +---------------+ +---------------+ |
  +--->| |    Cluster    | |    Client     | |
  |    | | Communication | | Communication | |
  |    | +---------------+ +---------------+ |
  |    +-------------------------------------+
  |                                    A
  |                                    |
  |      Query Processsor              V
  |    +-------------------------------------+
  |    | +---------------------------------+ |
  |    | |          Query Parser           | |
  |    | +---------------------------------+ |
  |    | +---------------------------------+ |
  |    | |         Query Optimizer         | |
  |    | +---------------------------------+ |
  |    +-------------------------------------+
  |                           A
  |                           |
  |      Execution Engine     V
  |    +-------------------------------------+
  |    | +---------------+ +---------------+ |
  |    | |    Remote     | |     Local     | |
  +--->| |   Execution   | |    Execution  | |
       | +---------------+ +---------------+ |
       +-------------------------------------+
                              A
                              |
         Storage Engine       V
       +-------------------------------------+
       | +---------------+ +---------------+ |
       | |  Transaction  | |     Lock      | |
       | |    Manager    | |    Manager    | |
       | +---------------+ +---------------+ |
       | +---------------------------------+ |
       | |         Access Methods          | |
       | +---------------------------------+ |
       | +---------------+ +---------------+ |
       | |    Buffer     | |   Recovery    | |
       | |    Manager    | |    Manager    | |
       | +---------------+ +---------------+ |
       +-------------------------------------+

Memory- Versus Disk-Based DBMS

The main limiting factors on the growth of in-memory databases are RAM volatility (in other words, lack of durability) and costs.

The situation is likely to change as the availability and popularity of Non-Volatile Memory (NVM) technologies grow. NVM storage reduces or completely eliminates (depending on the exact technology) asymmetry between read and write latencies, further improves read and write performance, and allows byte-addressable access.

Durability in Memory-Based Stores

In-memory database systems maintain backups on disk to provide durability and prevent loss of the volatile data.

Before the operation can be considered complete, its results have to be written to a sequential log file. (write-ahead logs)

Column- Versus Row-Oriented DBMS

One of the ways to classify databases is by how the data is stored on disk: row- or column-wise. Tables can be partitioned either horizontally (storing values belonging to the same row together), or vertically (storing values belonging to the same column together).

The two pioneer open source column-oriented stores are MonetDB and C-Store

Row-Oriented Data Layout

| ID | Name  | Birth Date  | Phone Number   |
| 10 | John  | 01 Aug 1981 | +1 111 222 333 |
| 20 | Sam   | 14 Sep 1988 | +1 555 888 999 |
| 30 | Keith | 07 Jan 1984 | +1 333 444 555 |

Column-Oriented Data Layout

Values for the same column are stored contiguously on disk

Column-oriented stores are a good fit for analytical workloads that compute aggregates, such as finding trends, computing average values, etc.

From a logical perspective, the data representing stock market price quotes can still be expressed as a table:

| ID | Symbol | Date        | Price     |
| 1  | DOW    | 08 Aug 2018 | 24,314.65 |
| 2  | DOW    | 09 Aug 2018 | 24,136.16 |
| 3  | S&P    | 08 Aug 2018 | 2,414.45  |
| 4  | S&P    | 09 Aug 2018 | 2,232.32  |

However, the physical column-based database layout looks entirely different. Values belonging to the same column are stored closely together:

Symbol: 1:DOW; 2:DOW; 3:S&P; 4:S&P
Date:   1:08 Aug 2018; 2:09 Aug 2018; 3:08 Aug 2018; 4:09 Aug 2018
Price:  1:24,314.65; 2:24,136.16; 3:2,414.45; 4:2,232.32

Some column stores use implicit identifiers (virtual IDs) instead and use the position of the value (in other words, its offset) to map it back to the related values

we’ve seen many new column-oriented file formats such as Apache Parquet, Apache ORC, RCFile, as well as column-oriented stores, such as Apache Kudu, ClickHouse, and many others

Wide Column Stores

Column-oriented databases should not be mixed up with wide column stores, such as BigTable or HBase

Data Files and Index Files

A database system usually separates data files and index files: data files store data records, while index files store record metadata and use it to locate records in data files.

Data Files

Data files (sometimes called primary files) can be implemented as index-organized tables (IOT), heap-organized tables (heap files), or hash-organized tables (hashed files).

Index Files

An index is a structure that organizes data records on disk in a way that facilitates efficient retrieval operations.

2. B-Tree Basics

3. File Formats

4. Implementing B-Trees

5. Transaction Processing and Recovery

6. B-Tree Variants

7. Log-Structured Storage

8. Introduction and Overview

9. Failure Detection

10. Leader Election

11. Replication and Consistency

12. Anti-Entropy and Dissemination

13. Distributed Transactions

14. Consensus