reading notes on Database Internals
I. Storage Engines (10:21 mins)
- 1. Introduction and Overview
- 2. B-Tree Basics
- 3. File Formats
- 4. Implementing B-Trees
- 5. Transaction Processing and Recovery
- 6. B-Tree Variants
- 7. Log-Structured Storage
II. Distributed Systems (03:27 mins)
- 8. Introduction and Overview
- 9. Failure Detection
- 10. Leader Election
- 11. Replication and Consistency
- 12. Anti-Entropy and Dissemination
- 13. Distributed Transactions
- 14. Consensus
The Dynamo paper, published by the team at Amazon in 2007, had so much impact on the database community that within a short period it inspired many variants and implementations. The most prominent of them were Apache Cassandra, created at Facebook; Project Voldemort, created at LinkedIn; and Riak, created by former Akamai engineers.
Storage engines such as BerkeleyDB, LevelDB and its descendant RocksDB, LMDB and its descendant libmdbx, Sophia, HaloDB, and many others were developed independently from the database management systems they’re now embedded into. Using pluggable storage engines has enabled database developers to bootstrap database systems using existing storage engines, and concentrate on the other subsystems.
1. Introduction and Overview
DBMS Architecture
Transport
+-------------------------------------+
| +---------------+ +---------------+ |
+--->| | Cluster | | Client | |
| | | Communication | | Communication | |
| | +---------------+ +---------------+ |
| +-------------------------------------+
| A
| |
| Query Processsor V
| +-------------------------------------+
| | +---------------------------------+ |
| | | Query Parser | |
| | +---------------------------------+ |
| | +---------------------------------+ |
| | | Query Optimizer | |
| | +---------------------------------+ |
| +-------------------------------------+
| A
| |
| Execution Engine V
| +-------------------------------------+
| | +---------------+ +---------------+ |
| | | Remote | | Local | |
+--->| | Execution | | Execution | |
| +---------------+ +---------------+ |
+-------------------------------------+
A
|
Storage Engine V
+-------------------------------------+
| +---------------+ +---------------+ |
| | Transaction | | Lock | |
| | Manager | | Manager | |
| +---------------+ +---------------+ |
| +---------------------------------+ |
| | Access Methods | |
| +---------------------------------+ |
| +---------------+ +---------------+ |
| | Buffer | | Recovery | |
| | Manager | | Manager | |
| +---------------+ +---------------+ |
+-------------------------------------+
Memory- Versus Disk-Based DBMS
The main limiting factors on the growth of in-memory databases are RAM volatility (in other words, lack of durability) and costs.
The situation is likely to change as the availability and popularity of Non-Volatile Memory (NVM) technologies grow. NVM storage reduces or completely eliminates (depending on the exact technology) asymmetry between read and write latencies, further improves read and write performance, and allows byte-addressable access.
Durability in Memory-Based Stores
In-memory database systems maintain backups on disk to provide durability and prevent loss of the volatile data.
Before the operation can be considered complete, its results have to be written to a sequential log file. (write-ahead logs)
Column- Versus Row-Oriented DBMS
One of the ways to classify databases is by how the data is stored on disk: row- or column-wise. Tables can be partitioned either horizontally (storing values belonging to the same row together), or vertically (storing values belonging to the same column together).
The two pioneer open source column-oriented stores are MonetDB and C-Store
Row-Oriented Data Layout
| ID | Name | Birth Date | Phone Number |
| 10 | John | 01 Aug 1981 | +1 111 222 333 |
| 20 | Sam | 14 Sep 1988 | +1 555 888 999 |
| 30 | Keith | 07 Jan 1984 | +1 333 444 555 |
Column-Oriented Data Layout
Values for the same column are stored contiguously on disk
Column-oriented stores are a good fit for analytical workloads that compute aggregates, such as finding trends, computing average values, etc.
From a logical perspective, the data representing stock market price quotes can still be expressed as a table:
| ID | Symbol | Date | Price |
| 1 | DOW | 08 Aug 2018 | 24,314.65 |
| 2 | DOW | 09 Aug 2018 | 24,136.16 |
| 3 | S&P | 08 Aug 2018 | 2,414.45 |
| 4 | S&P | 09 Aug 2018 | 2,232.32 |
However, the physical column-based database layout looks entirely different. Values belonging to the same column are stored closely together:
Symbol: 1:DOW; 2:DOW; 3:S&P; 4:S&P
Date: 1:08 Aug 2018; 2:09 Aug 2018; 3:08 Aug 2018; 4:09 Aug 2018
Price: 1:24,314.65; 2:24,136.16; 3:2,414.45; 4:2,232.32
Some column stores use implicit identifiers (virtual IDs) instead and use the position of the value (in other words, its offset) to map it back to the related values
we’ve seen many new column-oriented file formats such as Apache Parquet, Apache ORC, RCFile, as well as column-oriented stores, such as Apache Kudu, ClickHouse, and many others
Wide Column Stores
Column-oriented databases should not be mixed up with wide column stores, such as BigTable or HBase
Data Files and Index Files
A database system usually separates data files and index files: data files store data records, while index files store record metadata and use it to locate records in data files.
Data Files
Data files (sometimes called primary files) can be implemented as index-organized tables (IOT), heap-organized tables (heap files), or hash-organized tables (hashed files).
Index Files
An index is a structure that organizes data records on disk in a way that facilitates efficient retrieval operations.