Column oriented database

aka Columnar Database
stores all the values from each columns together
Each column file stores all the rows in the same order
Bottleneck
- Transfer lot of Data ⇒ Main memory ⇒ CPU cache
- CPU issues
  - branch misprediction and bubbles
  - Use of SIMD (Single-instruction multi data)
- CPU optimizations
  - designed to make efficient use of CPU cycles
  - uses bitwise AND/OR operators using bitmap encoding (vectorized processing)
Row oriented store keeps every row in one place
- heap or clustered index
Column oriented store don’t have pointers to data
- column stores containing values
Useful for OLAP workloads
Examples
- Snowflake
- Google BigQuery
- DuckDB (in-memory)

Architecture

key	product_sk	store_sk	quantity
1	29	1	1
2	29	2	5
3	31	1	4
4	31	3	2
5	31	3	7

Bitmap encoding of product_sk
- map of unique product_sk -⇒ Bitmap (represent all rows)
- 1 mean the the value is present in the ith row, else 0

`product_sk`, key or row → ↓	1	2	3	4	5
29	1	1	0	0	0
31	0	0	1	1	1

`store_sk`, key or row → ↓	1	2	3	4	5
1	1	0	1	0	0
2	0	1	0	0	0
3	0	0	0	1	1

Let n = unique products
- if n is small $⟹$ Bitmaps will be dense
- If n is large $⟹$ Bitmaps will be sparse (lot of zeros)
  - Run Length encoding can be used
Run Length encoding of product_sk
- Always start with zero count
- 29 ⇒ 11000 ⇒ 0,2,3 (0 zeros, 2 ones, 3 zeros)
- 31 ⇒ 00111 ⇒ 2,3 (2 zeros, 3 ones)
Run length encoding of store_sk
- 1 ⇒ 10100 ⇒ 0,1,1,1,2 (0 zeros, 1 one, 1 zero, 1 one, 2 zeros)
- 2 ⇒ 01000 ⇒ 1,1,3 (1 zero, 1 one, 3 zeros)
- 3 ⇒ 00011 ⇒ 3,2 (3 zeros, 2 ones)

Sort keys can be defined
If sort keys have less distinct values, then Run-length compression will be good
- column value with few KBs can store about billion rows
The compression effect is strongest on first key, then second key, and so on