OLAP and OLTP
- https://stackoverflow.com/questions/21900185/what-are-oltp-and-olap-what-is-the-difference-between-them
- OL — “ONLINE”
- completion of data processing in real time and not batch-processing
- OLTP — Online Transaction Processing
- large number of short on-line transactions (INSERT, UPDATE, DELETE)
- fast query processing, maintaining data integrity
- effectiveness measured by number of transactions per second
- OLAP — Online Analytical Processing
- Historical Data
- Relatively low volume of transactions
- Queries are often very complex and involve aggregations
- suited for analytics, data mining, fewer queries but they are usually bigger
- effectiveness measured by response time
| Property | OLTP | OLAP |
|---|---|---|
| Main Read Pattern | Small number of records | Aggregate over large number of records |
| Main Write Pattern | Random access, low latency writes from user input | Batch Processing or Event Stream |
| Primary Used by | End User/Customer | Internal Analyst for Business Intelligence (BI) |
| What Data Represents | Latest state of Data | History of events that happened over time |
| Dataset size | GBs to TBs | TBs to PBs |
OLAP
- DB is known as Data Warehouse where Analyst can run queries without interfering with live data
- Process of getting data into data warehouse is called Extract-Transform-Load (ETL)
- Schema:
- Star schema
- aka dimensional modeling
- center of schema is fact table where each row = event at particular time
- can have 100s of columns
- columns are foreign keys to other tables called dimension tables
- Simpler to work with
- Snowflake schema
- Similar to Star schema
- But its dimensions are further broken down into sub-dimensions
- Star schema
- Query often involves aggregation of data
- For example:
COUNT,MAX,MIN,AVG - Materialized Views are created to cache the data
- actual copy of the query results, written to disk
- Different from virtual views, where the query is executed when the view is read
- Makes writes expensive
- Data Cubes
- aka OLAP cube
- Special case of Materialized Views
- grid of aggregates grouped by different dimensions
- for example we precompute (2-dimensions)
- sum of price of products on a given date
- sum of price of products on all dates
- Only used when performance boost needed for a particular query
- For example:
- Uses Database_Column_Oriented
- Examples
- Snowflake
- Apache Spark (for batch processing)
- Google BigQuery
- DuckDB (in-memory)
Types of OLAP
- Not sure if this is useful
- ROLAP — Relational OLAP
- complex query possible
- slower
- MOLAP — Multi-dimensional OLAP
- uses dimensions
- faster
- HOLAP — Hybrid OLAP (ROLAP + MOLAP)
Data Lake
- Unstructured Data compared to Data Warehouse which has structured data