Why Spark

The MapReduce Disk Bottleneck

MapReduce writes intermediate results to disk between every phase:

Map → disk → Shuffle → disk → Reduce → disk

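As a rough rule of thumb, disk access is about 100x slower than RAM access. In a 5-step pipeline where every step reads its input from disk and writes its output back to disk, that adds up to 10 disk reads and writes before you get your answer. This is a large part of why MapReduce jobs take minutes to hours.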
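The count of 10 can be reproduced with a small tally. This is a toy model rather than Hadoop code: it assumes each step does exactly one disk read for its input and one disk write for its output, and the function name `disk_ops` is made up for illustration.

```python
def disk_ops(steps: int, materialize_every_step: bool) -> int:
    """Count disk reads plus writes for a linear pipeline of `steps` steps.

    Assumption (illustrative only): a step that materializes its result
    does one disk read for its input and one disk write for its output.
    """
    if steps < 1:
        raise ValueError("a pipeline needs at least one step")
    if materialize_every_step:
        # MapReduce-style: every step reads from disk and writes to disk.
        return 2 * steps
    # Disk touched only twice: read the initial input, write the final output.
    return 2


if __name__ == "__main__":
    print(disk_ops(5, materialize_every_step=True))   # 10
    print(disk_ops(5, materialize_every_step=False))  # 2
```

Under this model the disk cost grows with the number of steps when every step is materialized, and stays constant when only the input and final output touch disk.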


What Spark Does Differently

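Spark keeps intermediate results in RAM wherever it can, so in the ideal case disk is only touched at the start (read input) and end (write output). The comparison below is simplified: Spark still writes shuffle files to local disk and spills to disk when data does not fit in memory, but it avoids materializing the output of every step.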

MapReduce:   Map → disk → Shuffle → disk → Reduce → disk
Spark:       Map → RAM  → Shuffle → RAM  → Reduce → disk (final output only)

RAM access is ~100x faster than disk. As a result, Spark can run up to 10–100x faster than MapReduce for the same job, with the biggest gains on workloads that reuse the same data across steps.
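One way to see why the speedup is a range rather than a flat 100x: only the time spent on intermediate disk I/O gets faster. The sketch below applies Amdahl's law, assuming that a fraction `io_fraction` of the original job's runtime is intermediate I/O and that this part becomes 100x faster in RAM. The fractions used are hypothetical inputs, not measurements, and `estimated_speedup` is an illustrative name.

```python
def estimated_speedup(io_fraction: float, io_speedup: float = 100.0) -> float:
    """Overall speedup when only the I/O share of the runtime gets faster.

    io_fraction: share of the original runtime spent on intermediate I/O (0..1).
    io_speedup:  how much faster that share becomes (assumed 100x here).
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if io_speedup <= 0:
        raise ValueError("io_speedup must be positive")
    # Amdahl's law: the non-I/O part is unchanged, the I/O part shrinks.
    remaining_time = (1.0 - io_fraction) + io_fraction / io_speedup
    return 1.0 / remaining_time


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99):
        print(f"{fraction:.0%} I/O-bound -> {estimated_speedup(fraction):.1f}x")
```

Under these assumptions a job has to spend roughly 90% or more of its time on intermediate I/O before the overall gain reaches about 10x, which is why CPU-heavy or single-pass jobs see smaller improvements.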


When To Use Spark vs MapReduce

| | MapReduce | Spark |
|---|---|---|
| Speed | Slow (disk I/O) | Fast (in-memory) |
| Iterative workloads | Bad: reads/writes disk every iteration | Great: stays in RAM |
| Memory requirement | Low: spills to disk | High: needs enough RAM |
| Fault tolerance | Re-read from disk | Recompute from lineage |
| Use case | Simple batch ETL, large data that doesn't fit in RAM | ML training, iterative analytics, fast batch jobs |
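"Recompute from lineage" can be illustrated with a toy model. The `ToyDataset` class below is hypothetical and is not Spark's API: it only assumes that each dataset remembers its parent and the function that derived it, so a lost in-memory result can be rebuilt by replaying those steps from the original input.

```python
from typing import Callable, Iterable, List, Optional


class ToyDataset:
    """A dataset that remembers how it was derived (its lineage)."""

    def __init__(
        self,
        source: Optional[Iterable[int]] = None,
        parent: Optional["ToyDataset"] = None,
        fn: Optional[Callable[[int], int]] = None,
    ) -> None:
        self._source = list(source) if source is not None else []
        self._parent = parent
        self._fn = fn
        self._cached: Optional[List[int]] = None  # in-memory result, may be lost
        self.computations = 0  # how many times this dataset was (re)built

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Nothing is computed here; only the lineage is recorded.
        return ToyDataset(parent=self, fn=fn)

    def collect(self) -> List[int]:
        if self._cached is None:
            self.computations += 1
            if self._parent is None:
                self._cached = list(self._source)
            else:
                # Rebuild from the parent, which rebuilds itself if needed.
                self._cached = [self._fn(x) for x in self._parent.collect()]
        return list(self._cached)

    def lose_cache(self) -> None:
        """Simulate losing the in-memory copy, e.g. a failed worker."""
        self._cached = None


if __name__ == "__main__":
    events = ToyDataset(source=[1, 2, 3])
    totals = events.map(lambda x: x * 2).map(lambda x: x + 1)
    before = totals.collect()
    totals.lose_cache()
    after = totals.collect()  # rebuilt by replaying the recorded steps
    print(before == after, totals.computations)
```

The same cached result is what makes iterative workloads cheap in this model: repeated calls to `collect()` reuse the in-memory copy, and recomputation only happens after a loss.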

Key interview point: "For the nightly billing report I'd use Spark: read the raw event log from S3 and process it in memory for fast, exact counts. MapReduce would work too, but Spark is significantly faster for iterative aggregations."