The Problem:
I have two datasets:
- User information (id, mobile, location)
- Transaction information (transaction-id, car-id, user-id, CarBookingDate)
Given these data sets, I want to find the number of unique locations in which each car has been sold.
One Solution
- For each transaction, look up the user record for the transaction’s user-id
- Join the user records to each transaction
- Create a mapping from car-id to a list of locations
- Count the number of distinct locations per car-id.
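The steps above can be sketched in plain Python on a single machine. This is only an illustration under stated assumptions: the sample rows, the tuple layout and the function name `distinct_locations_per_car` are hypothetical and not part of the original datasets.

```python
from collections import defaultdict

# Hypothetical sample data; field order follows the dataset descriptions:
# users: (id, mobile, location)
# transactions: (transaction-id, car-id, user-id, CarBookingDate)
users = [
    (1, "555-0101", "Austin"),
    (2, "555-0102", "Boston"),
    (3, "555-0103", "Austin"),
]
transactions = [
    ("t1", "carA", 1, "2015-01-01"),
    ("t2", "carA", 2, "2015-01-02"),
    ("t3", "carA", 3, "2015-01-03"),
    ("t4", "carB", 2, "2015-01-04"),
]


def distinct_locations_per_car(users, transactions):
    # Build the lookup table: user-id -> location.
    location_by_user = {user_id: location for user_id, _mobile, location in users}

    # Join each transaction to its user and collect locations per car-id.
    # A set keeps only distinct locations.
    locations_by_car = defaultdict(set)
    for _txn_id, car_id, user_id, _date in transactions:
        location = location_by_user.get(user_id)
        if location is not None:  # skip transactions with no matching user
            locations_by_car[car_id].add(location)

    # Count the distinct locations per car-id.
    return {car_id: len(locs) for car_id, locs in locations_by_car.items()}


result = distinct_locations_per_car(users, transactions)
print(result)
```

This works only while the user table fits in memory, which is the limitation the map-reduce version avoids.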
The Map Reduce Solution
The problem requires a two-stage map-reduce job:
- Join users onto transactions and emit a set of car-location pairs (one for each transaction)
- For each car sold, count the number of distinct locations it was sold in
STAGE 1
- transaction map task outputs (K,V) with K = userId and V = carId
- user map task outputs (K,V) with K = userId and V = location
- reducer receives, for each userId, that user's location and the carIds from that user's transactions; for each carId it outputs (K,V) with K = carId and V = location
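Stage 1 can be simulated in plain Python to make the join concrete. This is a rough sketch, not Hadoop code: the shuffle is imitated with a dictionary, and the `"car"` / `"loc"` tags on the values are an assumption added here so the reducer can tell the two kinds of value apart. All function names and sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data in the layouts described for the two datasets.
sample_users = [
    (1, "555-0101", "Austin"),
    (2, "555-0102", "Boston"),
    (3, "555-0103", "Austin"),
]
sample_transactions = [
    ("t1", "carA", 1, "2015-01-01"),
    ("t2", "carA", 2, "2015-01-02"),
    ("t3", "carA", 3, "2015-01-03"),
    ("t4", "carB", 2, "2015-01-04"),
]


def map_transaction(record):
    # K = userId, V = carId (tagged so the reducer can recognise it).
    _txn_id, car_id, user_id, _date = record
    yield user_id, ("car", car_id)


def map_user(record):
    # K = userId, V = location (tagged so the reducer can recognise it).
    user_id, _mobile, location = record
    yield user_id, ("loc", location)


def reduce_join(user_id, values):
    # All values for one userId arrive together: one location, many carIds.
    location = None
    car_ids = []
    for tag, payload in values:
        if tag == "loc":
            location = payload
        else:
            car_ids.append(payload)
    if location is not None:  # drop transactions with no matching user
        for car_id in car_ids:
            yield car_id, location  # K = carId, V = location


def run_stage1(users, transactions):
    # Imitate the shuffle: group every mapper output by key.
    groups = defaultdict(list)
    for record in users:
        for key, value in map_user(record):
            groups[key].append(value)
    for record in transactions:
        for key, value in map_transaction(record):
            groups[key].append(value)

    output = []
    for key, values in groups.items():
        output.extend(reduce_join(key, values))
    return output


stage1_pairs = run_stage1(sample_users, sample_transactions)
print(stage1_pairs)
```

Note that the output still contains duplicate car-location pairs (one per transaction); removing duplicates is left to the second stage.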
STAGE 2
- map task is an identity mapper, outputs (K,V) with K = carId and V = location
- reducer counts the number of unique locations that it sees per carId, outputs (K,V) with K = carId and V = number of distinct locations
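Stage 2 can be sketched the same way. Again this is a plain-Python imitation rather than Hadoop code, and the input list `stage1_output` is hypothetical sample data standing in for the car-location pairs produced by the first stage.

```python
from collections import defaultdict

# Hypothetical stage 1 output: one (carId, location) pair per transaction.
stage1_output = [
    ("carA", "Austin"),
    ("carA", "Boston"),
    ("carB", "Boston"),
    ("carA", "Austin"),
]


def identity_map(car_id, location):
    # Identity mapper: pass the pair through unchanged.
    yield car_id, location


def reduce_count_distinct(car_id, locations):
    # Collapse duplicates with a set, then count what is left.
    yield car_id, len(set(locations))


def run_stage2(pairs):
    # Imitate the shuffle: group locations by carId.
    groups = defaultdict(list)
    for car_id, location in pairs:
        for key, value in identity_map(car_id, location):
            groups[key].append(value)

    counts = {}
    for key, values in groups.items():
        for car_id, count in reduce_count_distinct(key, values):
            counts[car_id] = count
    return counts


distinct_counts = run_stage2(stage1_output)
print(distinct_counts)
```

The reducer here holds all locations for one car in memory at once, which is fine when the number of distinct locations per car is modest.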