February 20, 2013

Implementing a Left Outer Join in Map Reduce

The Problem:

I have two datasets:
  1. User information (id, mobile, location)
  2. Transaction information (transaction-id, car-id, user-id, CarBookingDate)
Given these data sets, I want to find the number of unique locations in which each car has been sold.

One Solution

  1. For each transaction, look up the user record for the transaction’s user-id
  2. Join that user record onto the transaction
  3. Create a mapping from car-id to a list of locations
  4. Count the number of distinct locations per car-id (a sketch of these steps follows)
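
A minimal, non-distributed Python sketch of these four steps (the sample records and field names are purely illustrative, not taken from any real dataset):

# Toy inputs, just to make the steps concrete.
users = [
    {"id": "u1", "mobile": "555-0100", "location": "Boston"},
    {"id": "u2", "mobile": "555-0101", "location": "Austin"},
]
transactions = [
    {"transaction-id": "t1", "car-id": "c1", "user-id": "u1", "booking-date": "2013-01-05"},
    {"transaction-id": "t2", "car-id": "c1", "user-id": "u2", "booking-date": "2013-01-07"},
    {"transaction-id": "t3", "car-id": "c2", "user-id": "u9", "booking-date": "2013-01-09"},  # no matching user
]

users_by_id = {u["id"]: u for u in users}               # index users by id for step 1
locations_per_car = {}
for txn in transactions:
    user = users_by_id.get(txn["user-id"])              # steps 1-2: look up and join (user may be missing)
    location = user["location"] if user else None
    locations_per_car.setdefault(txn["car-id"], []).append(location)   # step 3
# step 4: count distinct non-null locations per car
counts = {car: len({loc for loc in locs if loc}) for car, locs in locations_per_car.items()}
print(counts)   # {'c1': 2, 'c2': 0}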

The Map Reduce Solution

First off, the problem requires that we write a two-stage map-reduce job:
  1. Join users onto transactions and emit a set of car-location pairs (one for each transaction)
  2. For each car sold, count the # of distinct locations it was sold in

STAGE 1

We’re basically building a left outer join with map-reduce: every transaction is kept, and the matching user’s location is attached when a user record exists.
  • the transaction map task outputs (K,V) with K = userId and V = carId
  • the user map task outputs (K,V) with K = userId and V = location
  • the reducer receives, for each userId, that user’s location together with the carIds from their transactions, and outputs (K,V) with K = carId and V = location, one pair per transaction (a streaming-style sketch follows)
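
A sketch of the stage 1 mapper and reducer in the Hadoop Streaming style (plain Python over tab-separated lines on stdin). The script names, the U/T record tags, and the trick of telling user and transaction records apart by field count are illustrative assumptions, not part of the original design.

#!/usr/bin/env python3
# stage1_mapper.py (hypothetical name): tag each record so the reducer can
# tell users from transactions; users have 3 fields, transactions have 4.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:                       # user record: id, mobile, location
        user_id, _mobile, location = fields
        print(f"{user_id}\tU\t{location}")
    elif len(fields) == 4:                     # transaction: txn-id, car-id, user-id, booking-date
        _txn_id, car_id, user_id, _date = fields
        print(f"{user_id}\tT\t{car_id}")

#!/usr/bin/env python3
# stage1_reducer.py (hypothetical name): after the shuffle, all records for a
# given userId arrive together, so one pass over each group is enough.
import sys
from itertools import groupby

def records(stream):
    for line in stream:
        user_id, tag, value = line.rstrip("\n").split("\t")
        yield user_id, tag, value

for user_id, group in groupby(records(sys.stdin), key=lambda r: r[0]):
    location = None
    car_ids = []
    for _, tag, value in group:
        if tag == "U":
            location = value
        else:
            car_ids.append(value)
    for car_id in car_ids:
        # left outer join: every transaction is emitted, even with no matching user
        print(f"{car_id}\t{location or ''}")

These two scripts would be wired up with Hadoop Streaming; the reducer relies on the shuffle delivering all records for a userId contiguously.
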
STAGE 2

  • the map task is an identity mapper: it outputs (K,V) with K = carId and V = location
  • the reducer counts the number of unique locations it sees per carId and outputs (K,V) with K = carId and V = # distinct locations (a sketch follows)
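
A matching sketch of stage 2 in the same Hadoop Streaming style; the script names and the tab-separated carId/location record layout are again assumptions for illustration.

#!/usr/bin/env python3
# stage2_mapper.py (hypothetical name): identity mapper, pass lines through unchanged.
import sys
sys.stdout.writelines(sys.stdin)

#!/usr/bin/env python3
# stage2_reducer.py (hypothetical name): lines for the same carId arrive together,
# so count the distinct locations within each group.
import sys
from itertools import groupby

def records(stream):
    for line in stream:
        car_id, _, location = line.rstrip("\n").partition("\t")
        yield car_id, location

for car_id, group in groupby(records(sys.stdin), key=lambda r: r[0]):
    distinct = {loc for _, loc in group if loc}   # ignore empty locations from unmatched users
    print(f"{car_id}\t{len(distinct)}")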
