Distributed File System
- Useful for Big Data
- Advantages:
- Scalability
- Reliability
- High Performance
- Fault Tolerance
- Consistency
- Traditional file systems include NTFS, exFAT, etc.
- In a distributed FS, file chunks are distributed and replicated across servers in a cluster
- Uses
- You can store much larger files in a distributed file system
- You can still retrieve your file after losing a few nodes
- Examples:
- GFS (Google File System) — Proprietary, Research Paper published
- HDFS (Hadoop Distributed File System) — Open Source
- GlusterFS — Open Source
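
To make the chunking and replication idea concrete, here is a minimal sketch. It is not how GFS, HDFS, or GlusterFS place data; the function names, the round-robin placement, and the tiny chunk size are all assumptions chosen for illustration.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Split a file's bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each chunk to `replication` distinct nodes.

    Hypothetical round-robin placement: chunk i goes to nodes i, i+1, ...
    (wrapping around). Real systems also consider racks, load, and free space.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def file_is_recoverable(placement: dict[int, list[str]], failed: set[str]) -> bool:
    """A file can be read back if every chunk has at least one replica on a live node."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


chunks = split_into_chunks(b"abcdefghij", chunk_size=4)
placement = place_replicas(len(chunks), ["n1", "n2", "n3", "n4"], replication=3)
print(placement)
print(file_is_recoverable(placement, failed={"n1", "n2"}))
```

With three replicas per chunk, losing any two nodes still leaves one copy of every chunk, which is the "retrieve your file even after losing a few nodes" property noted above.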
HDFS
- Open Source
- From Apache Hadoop
- Inspired by GFS research paper
- Runs on commodity hardware, which keeps costs down
Big Data Processing
MapReduce
- Programming paradigm
- Inspired by map and reduce in functional programming
- Uses two main processing steps:
- In Map, the input data is split across parallel processing tasks
- Reduce aggregates the data after mapping is done
- Used by Apache Hadoop, Apache CouchDB, Infinispan, Riak
- Use cases:
- MapReduce is being phased out and replaced by faster frameworks such as Apache Spark
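
The two steps can be illustrated with a single-process word count. This is only a sketch of the paradigm, not the Hadoop API: the function names are made up for illustration, and the explicit shuffle step (grouping values by key between Map and Reduce) is something a real framework does for you across machines.

```python
from collections import defaultdict


def map_phase(document: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by key so each key reaches a single reducer."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents: list[str]) -> dict[str, int]:
    # In a real cluster each document (split) would be mapped on a different node.
    mapped: list[tuple[str, int]] = []
    for doc in documents:
        mapped.extend(map_phase(doc))
    return reduce_phase(shuffle(mapped))


print(word_count(["big data big", "data lake"]))
```

Each call to `map_phase` is independent of the others, which is what lets a framework run the Map step in parallel across nodes.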
flowchart TD
U[User Uploads Video] --> LB[Upload Load Balancer]
LB --> Ingest[Video Ingestion Service]
Ingest --> MetaDB["Metadata DB<br/>(Cassandra / DynamoDB)"]
Ingest --> Blob["Blob Storage<br/>(S3 / GCS)"]
Ingest --> DFS["Distributed FS<br/>(GFS / HDFS)"]
Blob --> CDN[CDN Edge Servers]
CDN --> Viewer[Viewer Streaming]
DFS --> Transcode[Transcoding Service]
Transcode --> Blob
Transcode --> ML[ML Pipelines]
MetaDB --> API[Metadata API]
API --> Viewer