July 29, 2018

Word Count - the "Hello World" of Big Data



A "Hello, World!" program is traditionally used to introduce novice programmers to a programming language. "Hello, world!" is also traditionally used in a sanity test to make sure that a computer language is correctly installed, and that the operator understands how to use it.

Similarly "Word Count" is the "Hello World" of Big Data.
The text from the input text file is tokenized into words to form a key value pair with all the words present in the input text file. The key is the word from the input file and value is ‘1’.
For instance if you consider the sentence “Hello World”. The pyspark in the WordCount example will split the string into individual tokens i.e. words. In this case, the entire sentence will be split into 2 tokens (one for each word) with a value 1.
(Hello,1)
(World,1)
file.txt contains “Hello World”
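To make the pair-creation step concrete, here is a quick sketch (run in the pyspark shell, where sc is the predefined SparkContext) that stops before the aggregation and shows the intermediate (word, 1) pairs:

pairs = sc.textFile("file.txt").flatMap(lambda line: line.split()).map(lambda w: (w, 1))
pairs.collect()   # [(u'Hello', 1), (u'World', 1)]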


PySpark Code:

# sc is the SparkContext that the pyspark shell creates for you
lines = sc.textFile("file.txt")                       # RDD of lines from the input file
counts = (lines.flatMap(lambda line: line.split())    # split each line into words
               .map(lambda w: (w, 1))                 # pair each word with a count of 1
               .reduceByKey(lambda v1, v2: v1 + v2))  # sum the counts per word
sorted(counts.collect())


Output:
[(u'Hello', 1), (u'World', 1)]
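
The snippet above assumes the interactive pyspark shell, where sc is already defined. Outside the shell, the same job can be packaged as a standalone script; a minimal sketch, assuming Spark 2.x and a local file.txt, might look like this:

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession and grab its SparkContext.
spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("file.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda v1, v2: v1 + v2))

print(sorted(counts.collect()))
spark.stop()

Saved as, for example, word_count.py, it could then be run with spark-submit word_count.py.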

