February 20, 2013

Implementing a Left Outer Join in Map Reduce

The Problem:

I have two datasets:
  1. User information (id, mobile, location)
  2. Transaction information (transaction-id, car-id, user-id, CarBookingDate)
Given these data sets, I want to find the number of unique locations in which each car has been sold.

One Solution

  1. For each transaction, look up the user record for the transaction’s user-id
  2. Join that user record onto the transaction
  3. Create a mapping from car-id to a list of locations
  4. Count the number of distinct locations per car-id (a sketch of these steps follows)
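
A minimal, non-distributed Python sketch of these four steps (the sample records and field names are purely illustrative, not taken from any real dataset):

# Toy inputs, just to make the steps concrete.
users = [
    {"id": "u1", "mobile": "555-0100", "location": "Boston"},
    {"id": "u2", "mobile": "555-0101", "location": "Austin"},
]
transactions = [
    {"transaction-id": "t1", "car-id": "c1", "user-id": "u1", "booking-date": "2013-01-05"},
    {"transaction-id": "t2", "car-id": "c1", "user-id": "u2", "booking-date": "2013-01-07"},
    {"transaction-id": "t3", "car-id": "c2", "user-id": "u9", "booking-date": "2013-01-09"},  # no matching user
]

users_by_id = {u["id"]: u for u in users}               # index users by id for step 1
locations_per_car = {}
for txn in transactions:
    user = users_by_id.get(txn["user-id"])              # steps 1-2: look up and join (user may be missing)
    location = user["location"] if user else None
    locations_per_car.setdefault(txn["car-id"], []).append(location)   # step 3
# step 4: count distinct non-null locations per car
counts = {car: len({loc for loc in locs if loc}) for car, locs in locations_per_car.items()}
print(counts)   # {'c1': 2, 'c2': 0}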

The Map Reduce Solution

First off, the problem requires that we write a two-stage map-reduce job:
  1. Join users onto transactions and emit a set of car-location pairs (one for each transaction)
  2. For each car sold, count the # of distinct locations it was sold in

STAGE 1

We’re basically building a left outer join with map-reduce: every transaction is kept, and the matching user’s location is attached when a user record exists.
  • the transaction map task outputs (K,V) with K = userId and V = carId
  • the user map task outputs (K,V) with K = userId and V = location
  • the reducer receives, for each userId, that user’s location together with the carIds from their transactions, and outputs (K,V) with K = carId and V = location, one pair per transaction (a streaming-style sketch follows)
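
A sketch of the stage 1 mapper and reducer in the Hadoop Streaming style (plain Python over tab-separated lines on stdin). The script names, the U/T record tags, and the trick of telling user and transaction records apart by field count are illustrative assumptions, not part of the original design.

#!/usr/bin/env python3
# stage1_mapper.py (hypothetical name): tag each record so the reducer can
# tell users from transactions; users have 3 fields, transactions have 4.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) == 3:                       # user record: id, mobile, location
        user_id, _mobile, location = fields
        print(f"{user_id}\tU\t{location}")
    elif len(fields) == 4:                     # transaction: txn-id, car-id, user-id, booking-date
        _txn_id, car_id, user_id, _date = fields
        print(f"{user_id}\tT\t{car_id}")

#!/usr/bin/env python3
# stage1_reducer.py (hypothetical name): after the shuffle, all records for a
# given userId arrive together, so one pass over each group is enough.
import sys
from itertools import groupby

def records(stream):
    for line in stream:
        user_id, tag, value = line.rstrip("\n").split("\t")
        yield user_id, tag, value

for user_id, group in groupby(records(sys.stdin), key=lambda r: r[0]):
    location = None
    car_ids = []
    for _, tag, value in group:
        if tag == "U":
            location = value
        else:
            car_ids.append(value)
    for car_id in car_ids:
        # left outer join: every transaction is emitted, even with no matching user
        print(f"{car_id}\t{location or ''}")

These two scripts would be wired up with Hadoop Streaming; the reducer relies on the shuffle delivering all records for a userId contiguously.
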
STAGE 2

  • the map task is an identity mapper: it outputs (K,V) with K = carId and V = location
  • the reducer counts the number of unique locations it sees per carId and outputs (K,V) with K = carId and V = # distinct locations (a sketch follows)
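
A matching sketch of stage 2 in the same Hadoop Streaming style; the script names and the tab-separated carId/location record layout are again assumptions for illustration.

#!/usr/bin/env python3
# stage2_mapper.py (hypothetical name): identity mapper, pass lines through unchanged.
import sys
sys.stdout.writelines(sys.stdin)

#!/usr/bin/env python3
# stage2_reducer.py (hypothetical name): lines for the same carId arrive together,
# so count the distinct locations within each group.
import sys
from itertools import groupby

def records(stream):
    for line in stream:
        car_id, _, location = line.rstrip("\n").partition("\t")
        yield car_id, location

for car_id, group in groupby(records(sys.stdin), key=lambda r: r[0]):
    distinct = {loc for _, loc in group if loc}   # ignore empty locations from unmatched users
    print(f"{car_id}\t{len(distinct)}")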
