A "Hello, World!" program is traditionally used to introduce novice programmers to a programming language. "Hello, World!" is also commonly used as a sanity test to confirm that a language's toolchain is correctly installed and that the operator understands how to use it. Similarly, "Word Count" is the "Hello, World!" of Big Data.
The text from the input file is tokenized into words, and each word is emitted as a key-value pair: the key is the word itself and the value is 1.
For instance, consider the sentence "Hello World". The PySpark WordCount example splits the string into individual tokens, i.e. words. In this case, the sentence is split into two tokens (one per word), each paired with the value 1:
(Hello,1)
(World,1)
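To make the tokenize-and-pair step concrete, here is a minimal plain-Python sketch (no Spark required), assuming simple whitespace tokenization as in the example above:

```python
# Tokenize a line and pair each word with the value 1,
# mirroring the flatMap + map steps of the WordCount job.
line = "Hello World"
pairs = [(word, 1) for word in line.split()]
print(pairs)  # [('Hello', 1), ('World', 1)]
```

Spark performs the same transformation, but distributed across the lines of the input file.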
Suppose file.txt contains "Hello World".

PySpark code (run in the pyspark shell, where the SparkContext is already available as sc):

lines = sc.textFile("file.txt")
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda v1, v2: v1 + v2)
sorted(counts.collect())
Output:
[(u'Hello', 1), (u'World', 1)]
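The reduceByKey step is what sums the 1s for repeated words. As a plain-Python sketch of that reduction (using a dict instead of a distributed RDD, with a hypothetical input that repeats "Hello" to show the summing):

```python
# Plain-Python equivalent of reduceByKey(lambda v1, v2: v1 + v2):
# sum the values of all pairs that share the same key.
pairs = [("Hello", 1), ("World", 1), ("Hello", 1)]
counts = {}
for word, value in pairs:
    counts[word] = counts.get(word, 0) + value
print(sorted(counts.items()))  # [('Hello', 2), ('World', 1)]
```

In Spark, the same pairwise summing happens in parallel across partitions, with partial sums combined per key before the final result is collected.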