Distributed File System

  • Useful for Big Data
  • Advantages:
    • Scalability
    • Reliability
    • High Performance
    • Fault Tolerance
    • Consistency
  • Traditional file systems include NTFS, exFAT, etc.
  • In a distributed file system, files are split into chunks that are distributed and replicated across the servers in a cluster
  • Uses
    • You can store much larger files in a distributed file system than on a single machine
    • You can still retrieve your files after losing a few nodes
  • Examples:
    • GFS (Google File System) — Proprietary, Research Paper published
    • HDFS (Hadoop Distributed File System) — Open Source
    • GlusterFS — Open Source
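
The chunking and replication idea described above can be sketched in a few lines of Python. This is a toy in-memory model, not how GFS or HDFS is implemented: the chunk size, the replication factor, the round-robin placement rule, and every name in it (`store`, `read`, `place_replicas`, the `node-*` labels) are assumptions made for illustration.

```python
CHUNK_SIZE = 4          # bytes per chunk; tiny on purpose (assumed value)
REPLICATION_FACTOR = 3  # copies kept of each chunk (assumed value)


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Cut the file into fixed-size chunks; the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(chunk_index: int, nodes: list,
                   replicas: int = REPLICATION_FACTOR) -> list:
    """Pick `replicas` distinct nodes, round-robin from the chunk index.

    Assumes replicas <= len(nodes), otherwise nodes would repeat.
    """
    return [nodes[(chunk_index + r) % len(nodes)] for r in range(replicas)]


def store(data: bytes, nodes: list):
    """Write every chunk to each of its replica nodes."""
    cluster = {node: {} for node in nodes}  # node -> {chunk index: bytes}
    chunks = split_into_chunks(data)
    for index, chunk in enumerate(chunks):
        for node in place_replicas(index, nodes):
            cluster[node][index] = chunk
    return cluster, len(chunks)


def read(cluster: dict, chunk_count: int, failed=frozenset()) -> bytes:
    """Rebuild the file, reading each chunk from any node that is still up."""
    parts = []
    for index in range(chunk_count):
        for node, stored in cluster.items():
            if node not in failed and index in stored:
                parts.append(stored[index])
                break
        else:
            # Every replica of this chunk sits on a failed node.
            raise IOError(f"chunk {index} is unavailable")
    return b"".join(parts)


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
data = b"distributed file system"
cluster, chunk_count = store(data, nodes)

# Two nodes are down, yet each chunk still has at least one live replica.
recovered = read(cluster, chunk_count, failed={"node-b", "node-d"})
```

With three replicas per chunk, any two nodes can fail and the file is still readable. Losing all three nodes that hold the same chunk makes `read` raise an error, which is why real systems re-replicate chunks when a node disappears.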

HDFS

  • Open Source
  • From Apache Hadoop
  • Inspired by GFS research paper
  • Runs on commodity hardware, which keeps costs down

Big Data Processing

  • Tools include:
    • MapReduce
    • Apache Spark

MapReduce

  • Programming paradigm
  • Inspired by map and reduce in functional programming
  • Uses two main processing steps:
    • Map
    • Reduce
  • In Map, the input is split across parallel tasks, each of which processes its own portion
  • In Reduce, the mapped output is aggregated once mapping is done
  • Used by Apache Hadoop, Apache CouchDB, Infinispan, Riak
  • Use cases:
    • Batch Processing
  • MapReduce is being phased out and replaced by faster frameworks such as Apache Spark
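
The Map and Reduce steps above can be illustrated with the classic word count, sketched here in plain Python. This is a single-process approximation, not the Hadoop API: the function names are made up for the example, a thread pool stands in for the parallel map tasks, and the intermediate `shuffle` step (grouping values by key between Map and Reduce) is included because real MapReduce systems perform it even though the notes above do not list it.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(split: str) -> list:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped: list) -> dict:
    """Group all emitted values by key so each key is reduced once."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(splits: list) -> dict:
    # Each split is handled by its own map task; threads stand in for
    # the worker machines of a real cluster (an assumption of this sketch).
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, splits))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())


splits = ["big data big cluster", "data node", "big node"]
counts = word_count(splits)
print(counts)
```

Because each map task only sees its own split and each reduce call only sees one key, both phases can run in parallel across machines, which is what makes the paradigm suit batch processing.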

flowchart TD
  U[User Uploads Video] --> LB[Upload Load Balancer]
  LB --> Ingest[Video Ingestion Service]

  Ingest --> MetaDB["Metadata DB<br/>(Cassandra / DynamoDB)"]
  Ingest --> Blob["Blob Storage<br/>(S3 / GCS)"]
  Ingest --> DFS["Distributed FS<br/>(GFS / HDFS)"]

  Blob --> CDN[CDN Edge Servers]
  CDN --> Viewer[Viewer Streaming]

  DFS --> Transcode[Transcoding Service]
  Transcode --> Blob
  Transcode --> ML[ML Pipelines]

  MetaDB --> API[Metadata API]
  API --> Viewer