Distributed File System

Useful for Big Data
Advantages:
- Scalability
- Reliability
- High Performance
- Fault Tolerance
- Consistency
Traditional file systems include NTFS, exFAT, etc.
In a distributed file system, each file is split into chunks, and the chunks are distributed and replicated across the servers in a cluster
Uses
- You can store files much larger than a single machine's disk could hold
- You can still retrieve your files after losing a few nodes
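The chunking and replication idea above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the names `store` and `read`, the 4-byte chunk size, the replication factor of 3, and the round-robin placement are all made up for illustration and are not how GFS or HDFS actually place data (real systems use far larger chunks and placement that is aware of racks and load).

```python
CHUNK_SIZE = 4    # bytes per chunk; tiny on purpose (assumption)
REPLICATION = 3   # copies kept of each chunk (assumption)


def store(data, node_names, chunk_size=CHUNK_SIZE, replication=REPLICATION):
    """Split data into chunks and copy each chunk onto several nodes.

    Returns (nodes, chunk_count), where nodes maps a node name to the
    chunks it holds as {chunk_index: chunk_bytes}.
    Assumes replication <= len(node_names) so replicas land on distinct nodes.
    """
    nodes = {name: {} for name in node_names}
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for index, chunk in enumerate(chunks):
        # Hypothetical round-robin placement: chunk i goes to nodes i, i+1, i+2, ...
        for r in range(replication):
            target = node_names[(index + r) % len(node_names)]
            nodes[target][index] = chunk
    return nodes, len(chunks)


def read(nodes, chunk_count, failed=frozenset()):
    """Reassemble the file using any surviving replica of each chunk."""
    parts = []
    for index in range(chunk_count):
        replica = next(
            (held[index] for name, held in nodes.items()
             if name not in failed and index in held),
            None,
        )
        if replica is None:
            raise IOError(f"chunk {index} has no surviving replica")
        parts.append(replica)
    return b"".join(parts)


if __name__ == "__main__":
    cluster, count = store(b"hello distributed world", ["n0", "n1", "n2", "n3", "n4"])
    # Two of five nodes are down, yet every chunk still has a live copy.
    print(read(cluster, count, failed={"n1", "n3"}))
```

With three replicas per chunk, losing any two nodes leaves at least one copy of every chunk. Losing all three nodes that hold the same chunk makes `read` raise an error, which is why the replication factor bounds how many failures the file can survive.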
Examples:
- GFS (Google File System) — Proprietary, Research Paper published
- HDFS (Hadoop Distributed File System) — Open Source
- GlusterFS — Open Source

HDFS

Open Source
From Apache Hadoop
Inspired by GFS research paper
Runs on commodity hardware, which keeps costs down

Big Data Processing

Tools include:
- MapReduce
- Apache Spark

MapReduce

Programming paradigm
Inspired by map and reduce in functional programming
Uses two main processing steps:
- Map
- Reduce
In the Map step, the data is split across tasks that process it in parallel
In the Reduce step, the mapped results are aggregated once mapping is done
Used by Apache Hadoop, Apache CouchDB, Infinispan, Riak
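The two steps can be sketched with the classic word-count example. This is a minimal single-process illustration, not any framework's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the map calls run sequentially here, whereas a real framework would run them as parallel tasks across machines. The grouping step between Map and Reduce is usually called the shuffle.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for one input split handled by one map task.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data pipeline"]))
```

Because each `map_phase` call looks only at its own split, and each `reduce_phase` call looks only at one key, both steps can be spread across many workers without coordination beyond the shuffle.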
Use cases:
- Batch Processing
MapReduce is being phased out in favor of faster frameworks such as Apache Spark

flowchart TD
  U[User Uploads Video] --> LB[Upload Load Balancer]
  LB --> Ingest[Video Ingestion Service]

  Ingest --> MetaDB["Metadata DB<br/>(Cassandra / DynamoDB)"]
  Ingest --> Blob["Blob Storage<br/>(S3 / GCS)"]
  Ingest --> DFS["Distributed FS<br/>(GFS / HDFS)"]

  Blob --> CDN[CDN Edge Servers]
  CDN --> Viewer[Viewer Streaming]

  DFS --> Transcode[Transcoding Service]
  Transcode --> Blob
  Transcode --> ML[ML Pipelines]

  MetaDB --> API[Metadata API]
  API --> Viewer
