MapReduce and Hadoop

Why Parallelism

  • Increasing data sizes
  • Limitations of single-node architectures (bandwidth, memory, etc.)
  • The emergence of affordable cluster architectures (clusters of commodity Linux nodes connected by Gigabit Ethernet)

Challenges

  • Cheap nodes fail
    Solution: build fault tolerance into the storage infrastructure (a distributed file system with replication)
  • Commodity networks have low bandwidth (roughly 1 Gbps within a rack, 8 Gbps aggregate out of the rack)
    Solution: bring computation close to the data
  • Programming distributed systems is hard
    Solution: a data-parallel programming model
    Users write “map” and “reduce” functions
    The system handles work distribution and fault tolerance
  • Other failures
    Loss of a single node, e.g. a disk crash
    Loss of an entire rack, e.g. a network failure

Hadoop Components

Distributed File System: files stored redundantly

Files are huge, rarely updated in place, and accessed mostly through reads and appends.

  • Data nodes (chunk servers)
    Files are split into contiguous chunks (64-128 MB)
    Each chunk is replicated, typically 3 times, with copies kept on different racks (see the placement sketch below)
  • Master node
    Stores metadata about where files’ chunks are located; this metadata may itself be replicated
    The client library asks the master where the chunks live, then connects directly to the chunk servers to access the data
  • Reliability and availability
    Data is kept in chunks that are replicated across many machines (chunk servers)
    Chunk servers also serve as compute servers, so computation can run where the data lives
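
To make chunking and replica placement concrete, here is a minimal single-process sketch. The chunk size, replication factor, rack names, and server names are all hypothetical, chosen only to illustrate “contiguous chunks, copies on different racks”:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the low end of the 64-128 MB range
REPLICATION = 3                # each chunk kept on roughly 3 chunk servers

# Hypothetical cluster layout: rack name -> chunk servers on that rack.
RACKS = {
    "rack-a": ["cs-a1", "cs-a2"],
    "rack-b": ["cs-b1", "cs-b2"],
    "rack-c": ["cs-c1", "cs-c2"],
}

def split_into_chunks(file_size: int) -> list[tuple[int, int]]:
    """Return an (offset, length) pair for each contiguous chunk."""
    return [(off, min(CHUNK_SIZE, file_size - off))
            for off in range(0, file_size, CHUNK_SIZE)]

def place_replicas(chunk_index: int) -> list[str]:
    """Pick REPLICATION servers, each on a distinct rack, rotating with
    the chunk index so load spreads across the cluster."""
    rack_names = list(RACKS)
    chosen = []
    for i in range(REPLICATION):
        rack = rack_names[(chunk_index + i) % len(rack_names)]
        servers = RACKS[rack]
        chosen.append(servers[chunk_index % len(servers)])
    return chosen

# A 200 MB file becomes three 64 MB chunks plus one 8 MB tail chunk.
file_size = 200 * 1024 * 1024
for idx, (off, length) in enumerate(split_into_chunks(file_size)):
    print(f"chunk {idx}: offset={off} length={length} "
          f"replicas={place_replicas(idx)}")
```

The chunk-to-server mapping printed here is essentially the metadata the master keeps; the bytes themselves live only on the chunk servers.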

MapReduce Programming System: divides a computation into tasks

MapReduce is a data-parallel programming model for clusters of commodity machines

  1. Sequentially read a lot of data
  2. Map: extract something you care about (defined by the user)
  3. Group by key: sort and shuffle (done by the system)
  4. Reduce: aggregate, summarize, filter, or transform (defined by the user)
  5. Write the result
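
The canonical example of these five steps is word count. Below is a minimal single-process sketch in Python: `map_fn` and `reduce_fn` are the two functions a user would write, while `run_mapreduce` is only a stand-in for the system’s grouping machinery:

```python
from collections import defaultdict
from typing import Iterable, Iterator

def map_fn(_key: str, line: str) -> Iterator[tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word: str, counts: Iterable[int]) -> Iterator[tuple[str, int]]:
    """Reduce: sum all counts collected for one key."""
    yield (word, sum(counts))

def run_mapreduce(lines: list[str]) -> list[tuple[str, int]]:
    """Single-process stand-in for the framework."""
    groups: defaultdict[str, list[int]] = defaultdict(list)
    for i, line in enumerate(lines):             # 1. read the input
        for key, value in map_fn(str(i), line):  # 2. Map
            groups[key].append(value)            # 3. group by key (shuffle)
    out = []
    for key in sorted(groups):                   # 4. Reduce, keys in order
        out.extend(reduce_fn(key, groups[key]))
    return out                                   # 5. write the result

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In a real deployment only the two user functions remain; partitioning, shuffling, and failure handling are the framework’s job, which is what makes the model easy to program against.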

MapReduce environment

The environment takes care of:

  • partitioning the input data
  • scheduling the program’s execution across a set of machines
  • performing the group-by-key step
  • handling machine failures
  • managing inter-machine communication

Grouping by key

  • The user defines the number of Reduce tasks, R
  • Each key output by a Map task is hashed to a bucket number from 0 to R-1 (see the partitioner sketch below)
  • The Map task writes each key-value pair into one of R local files according to its bucket
  • The master controller merges the files from all Map tasks that are destined for the same Reduce task and presents them to it as a sorted sequence of key-value pairs
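
A minimal sketch of that hash-based bucketing, using Python’s built-in hash() as a stand-in hash function (Hadoop’s default HashPartitioner works the same way, taking the key’s hashCode modulo R):

```python
R = 4  # number of Reduce tasks, chosen by the user

def partition(key: str, num_reduce_tasks: int = R) -> int:
    """Map a key to a bucket in [0, R-1]. Every pair with the same key
    lands in the same bucket, so it reaches the same Reduce task.
    Note: Python salts str hashes per process, but consistency within
    one run is all this example needs."""
    return hash(key) % num_reduce_tasks

# Each Map task would append these pairs to one of R local files;
# pairs sharing a key always go to the same file.
for key, value in [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]:
    print(f"({key}, {value}) -> local file for reduce task {partition(key)}")
```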

MapReduce Diagram

[Figure: MapReduce (diagram not preserved)]