Relational Database Architecture
- When we talk about key-value, value could be anything like actual row like document or vertex
- The place where rows are stored is known as heap file
Storage Engines
- Some are designed for OLAP, other for OLTP
- Types
- Log-structured
- Page-oriented
Log Structured
- Log is append only sequence of immutable records
- Suitable where value is updated frequently
- Example: Video URL ⇒ number of times it was played
- Example Engine:
- Bitcask Storage Engine of Riak
- Indexing
- whole keys are kept in memory
- values can be on disk (one disk seek to access value)
Idea
- We break the log into segments of certain size
- To efficiently use disk space, we perform following in the background thread
- compaction: throwing away duplicate keys in the log
- merging: after compaction, the segments become small and are merged in new segment
- Each segment has its own in-memory hash index
- Deletion
- adds a special deletion record to log called tombstone
- During merging, the record is discarded
- Benefits
- Efficient Disk space utilization
- Limitations
- all keys must fit in memory
- range queries are inefficient
SSTable
- Sorted String Table
- We keep the segment sorted by keys
- To achieve this, we maintain in-memory balanced tree before writing to segment
LSM-Tree
- Log-structured Merge Tree
- Originally described by Patrick O’Neil, the same idea is used to construct in-memory balanced tree
- The in-memory balanced tree is also called memtable
- Google Bigtable Paper introduced the terms SSTable and memtable
- When the memtable grows more than threshold (few MBs) it is written to disk as SSTable file/segment in sorted order
- meanwhile writes will go to new memtable while this happens
- separate log (unsorted) is also maintained for backup purpose in case of crash
- Reads will first search in memtable then recent segment then next-older etc.
- In background merging and compaction happens
- Examples:
- LevelDB Storage Engine in Riak
- Cassandra engine
- HBase engine
- Lucene Indexing Engine of Solr and Elasticsearch
- Limitations
- Slow when the key does not exist since need to look through all segments
- Compaction/Merging process can interfere with performance
- can cause multiple writes, known as write amplification
- Difficult to implement locks since the multiple copies of same key is stored in multiple segments
- Benefits
- Efficient range queries
- High throughput since writes are sequential
- https://stackoverflow.com/questions/58168809/what-is-the-differences-between-the-term-sstable-and-lsm-tree
- https://www.scylladb.com/glossary/log-structured-merge-tree/
Issues to deal with
- Concurrency Control
- Reclaiming Disk space so the log does not grow forever (Disk Usage)
- Handling errors and partially written records (Crash Recovery)
Page Oriented
- B-Trees and B+ Trees
- B-Trees break down DB into fixed size blocks or Pages
- A Page is traditionally 4KB in size, close to underlying hardware disk
- A Page can be read or written at a time
- Each Page can be identified by address
- One page is root of B-Tree
- Each Page has references to other pages based on range of keys
- Leaf Page has
- values stored inline to keys or
- reference to the pages where values are stored
- Add: Split the page and update the parent
- A 4-level tree of 4KB pages with branching factor of 500 can store up to 250 TB
- Crash Recovery
- Concurrency:
- use latches (DB terminology) — Lightweight Locks
- Disk Usage
- Use abbreviated keys
- variant known as B+ Tree
- Increases branching factor, hence fewer levels
- Examples
- All Traditional Relational DB supports B/B+ Trees
Write Ahead Log (WAL)
- aka redo log
- For Crash Recovery
- Append only file to which every B-Tree modification must be written before it can be applied to the pages of the tree
LSM Tree vs B-Tree
- Disk Usage
- LSM Trees can compress better and hence low disk space
- B-Tree can have fragmentation due to splitting of Pages
- Performance
- LSM Trees compaction process can interfere with ongoing read/writes
- LSM Trees have faster writes because of sequential writes on disk
- B-Trees are thought to have better reads
- Transactions
- B-Trees can have locks attached to the tree itself and possible to implement different isolation levels
- Difficult to implement locks on LSM-Trees
Indexes
- Clustered Index
- Store all row data within the index
- Non-Clustered Index
- Store only references to the data within the index
- Geospatial Index
- R-Trees are used
- Example: PostgreSQL generalized search tree indexing
- Fuzzy Index
- Include Synonyms
- Edit-distance is used
- Finite automation similar to Tries used
- Example: Lucene Search Index