Relational Database Architecture

  • When we talk about key-value, value could be anything like actual row like document or vertex
  • The place where rows are stored is known as heap file

Storage Engines

  • Some are designed for OLAP, other for OLTP
  • Types
    • Log-structured
    • Page-oriented

Log Structured

  • Log is append only sequence of immutable records
  • Suitable where value is updated frequently
  • Example: Video URL number of times it was played
  • Example Engine:
    • Bitcask Storage Engine of Riak
    • Indexing
      • whole keys are kept in memory
      • values can be on disk (one disk seek to access value)

Idea

  • We break the log into segments of certain size
  • To efficiently use disk space, we perform following in the background thread
    • compaction: throwing away duplicate keys in the log
    • merging: after compaction, the segments become small and are merged in new segment
  • Each segment has its own in-memory hash index
  • Deletion
    • adds a special deletion record to log called tombstone
    • During merging, the record is discarded
  • Benefits
    • Efficient Disk space utilization
  • Limitations
    • all keys must fit in memory
    • range queries are inefficient

SSTable

  • Sorted String Table
  • We keep the segment sorted by keys
  • To achieve this, we maintain in-memory balanced tree before writing to segment

LSM-Tree

  • Log-structured Merge Tree
  • Originally described by Patrick O’Neil, the same idea is used to construct in-memory balanced tree
  • The in-memory balanced tree is also called memtable
  • Google Bigtable Paper introduced the terms SSTable and memtable
  • When the memtable grows more than threshold (few MBs) it is written to disk as SSTable file/segment in sorted order
    • meanwhile writes will go to new memtable while this happens
    • separate log (unsorted) is also maintained for backup purpose in case of crash
  • Reads will first search in memtable then recent segment then next-older etc.
  • In background merging and compaction happens
  • Examples:
    • LevelDB Storage Engine in Riak
    • Cassandra engine
    • HBase engine
    • Lucene Indexing Engine of Solr and Elasticsearch
  • Limitations
    • Slow when the key does not exist since need to look through all segments
    • Compaction/Merging process can interfere with performance
      • can cause multiple writes, known as write amplification
    • Difficult to implement locks since the multiple copies of same key is stored in multiple segments
  • Benefits
    • Efficient range queries
    • High throughput since writes are sequential
  • https://stackoverflow.com/questions/58168809/what-is-the-differences-between-the-term-sstable-and-lsm-tree
  • https://www.scylladb.com/glossary/log-structured-merge-tree/

Issues to deal with

  • Concurrency Control
  • Reclaiming Disk space so the log does not grow forever (Disk Usage)
  • Handling errors and partially written records (Crash Recovery)

Page Oriented

  • B-Trees and B+ Trees
  • B-Trees break down DB into fixed size blocks or Pages
  • A Page is traditionally 4KB in size, close to underlying hardware disk
  • A Page can be read or written at a time
  • Each Page can be identified by address
  • One page is root of B-Tree
  • Each Page has references to other pages based on range of keys
  • Leaf Page has
    • values stored inline to keys or
    • reference to the pages where values are stored
  • Add: Split the page and update the parent
  • A 4-level tree of 4KB pages with branching factor of 500 can store up to 250 TB
  • Crash Recovery
    • Use Write Ahead Log
  • Concurrency:
    • use latches (DB terminology) — Lightweight Locks
  • Disk Usage
    • Use abbreviated keys
    • variant known as B+ Tree
    • Increases branching factor, hence fewer levels
  • Examples
    • All Traditional Relational DB supports B/B+ Trees

Write Ahead Log (WAL)

  • aka redo log
  • For Crash Recovery
  • Append only file to which every B-Tree modification must be written before it can be applied to the pages of the tree

LSM Tree vs B-Tree

  • Disk Usage
    • LSM Trees can compress better and hence low disk space
    • B-Tree can have fragmentation due to splitting of Pages
  • Performance
    • LSM Trees compaction process can interfere with ongoing read/writes
    • LSM Trees have faster writes because of sequential writes on disk
    • B-Trees are thought to have better reads
  • Transactions
    • B-Trees can have locks attached to the tree itself and possible to implement different isolation levels
    • Difficult to implement locks on LSM-Trees

Indexes

  • Clustered Index
    • Store all row data within the index
  • Non-Clustered Index
    • Store only references to the data within the index
  • Geospatial Index
    • R-Trees are used
    • Example: PostgreSQL generalized search tree indexing
  • Fuzzy Index
    • Include Synonyms
    • Edit-distance is used
    • Finite automation similar to Tries used
    • Example: Lucene Search Index