July 29, 2018

Word Count - the "Hello World" of Big Data



A "Hello, World!" program is traditionally used to introduce novice programmers to a programming language. "Hello, world!" is also traditionally used in a sanity test to make sure that a computer language is correctly installed, and that the operator understands how to use it.

Similarly "Word Count" is the "Hello World" of Big Data.
The text from the input text file is tokenized into words to form a key value pair with all the words present in the input text file. The key is the word from the input file and value is ‘1’.
For instance if you consider the sentence “Hello World”. The pyspark in the WordCount example will split the string into individual tokens i.e. words. In this case, the entire sentence will be split into 2 tokens (one for each word) with a value 1.
(Hello,1)
(World,1)
file.txt contains “Hello World”
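To make the pair-creation step concrete, here is a quick sketch (run in the pyspark shell, where sc is the predefined SparkContext) that stops before the aggregation and shows the intermediate (word, 1) pairs:

pairs = sc.textFile("file.txt").flatMap(lambda line: line.split()).map(lambda w: (w, 1))
pairs.collect()   # [(u'Hello', 1), (u'World', 1)]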


PySpark Code:

# sc is the SparkContext that the pyspark shell creates for you
lines = sc.textFile("file.txt")                       # RDD of lines from the input file
counts = (lines.flatMap(lambda line: line.split())    # split each line into words
               .map(lambda w: (w, 1))                 # pair each word with a count of 1
               .reduceByKey(lambda v1, v2: v1 + v2))  # sum the counts per word
sorted(counts.collect())


Output:
[(u'Hello', 1), (u'World', 1)]
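
The snippet above assumes the interactive pyspark shell, where sc is already defined. Outside the shell, the same job can be packaged as a standalone script; a minimal sketch, assuming Spark 2.x and a local file.txt, might look like this:

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession and grab its SparkContext.
spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("file.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda v1, v2: v1 + v2))

print(sorted(counts.collect()))
spark.stop()

Saved as, for example, word_count.py, it could then be run with spark-submit word_count.py.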

