July 29, 2018

Word Count - the "Hello World" of Big Data



A "Hello, World!" program is traditionally used to introduce novice programmers to a programming language. "Hello, world!" is also traditionally used in a sanity test to make sure that a computer language is correctly installed, and that the operator understands how to use it.

Similarly "Word Count" is the "Hello World" of Big Data.
The text from the input text file is tokenized into words to form a key value pair with all the words present in the input text file. The key is the word from the input file and value is ‘1’.
For instance if you consider the sentence “Hello World”. The pyspark in the WordCount example will split the string into individual tokens i.e. words. In this case, the entire sentence will be split into 2 tokens (one for each word) with a value 1.
(Hello,1)
(World,1)
The input file, file.txt, contains "Hello World".


PySpark Code:

lines = sc.textFile("file.txt")                       # read the input file as an RDD of lines

counts = (lines.flatMap(lambda line: line.split())    # split each line into words
               .map(lambda w: (w, 1))                 # pair each word with a count of 1
               .reduceByKey(lambda v1, v2: v1 + v2))  # sum the counts for each word

sorted(counts.collect())


Output:
[(u'Hello', 1), (u'World', 1)]

July 27, 2018

Prepare Your Apache Hadoop Cluster for PySpark Jobs

Since Spark itself runs in the JVM, JVM languages such as Java and Scala have advantages: platform independence from running inside the JVM, self-contained packaging of code and its dependencies into JAR files, and higher performance.

If you choose to write your Spark code in Python, a non-JVM language, you have to manage the dependencies yourself and make them available to PySpark jobs on the cluster.

In an Apache Hadoop cluster, you first need to identify the required dependencies and understand where the different parts of your Spark code are executed and how the computation is distributed across the cluster. Spark orchestrates its operations through the driver program. The driver program initializes a SparkContext, in which you define your data actions and transformations, e.g. map, flatMap, and filter. When the driver program runs, the Spark framework starts executor processes on the worker nodes, which then process your data across the cluster.
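To make the split concrete, here is a minimal driver-program sketch (the input file name and the filter condition are placeholders, not taken from this post). The SparkContext is created in the driver process, while the functions passed to flatMap and filter are shipped to and executed by the executors on the worker nodes:

from pyspark import SparkConf, SparkContext

# Driver-side setup; the application name is arbitrary.
conf = SparkConf().setAppName("driver-executor-demo")
sc = SparkContext(conf=conf)

lines = sc.textFile("logs.txt")                   # hypothetical input file
words = lines.flatMap(lambda line: line.split())  # this lambda runs on the executors
errors = words.filter(lambda w: w == "ERROR")     # so does this one
print(errors.count())                             # the action triggers the distributed job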

If the Python transformations you define use any third-party libraries, like NumPy or nltk, then the Spark executors need access to those libraries when they execute your code on the remote worker nodes.
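For example, the following sketch (the vectors are invented for illustration) only works if NumPy is installed on every worker node, because the lambda passed to map runs on the executors, not on the driver:

import numpy as np

vectors = sc.parallelize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# np.linalg.norm is evaluated inside the executors for each element
norms = vectors.map(lambda v: float(np.linalg.norm(np.array(v))))
print(norms.collect())   # fails with "ImportError: No module named numpy" on workers without NumPy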

In this post, we will resolve the following error:
 ImportError: No module named numpy

This means that MLlib functions, which depend on NumPy, do not work on the cluster.
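For instance, a basic pyspark.mllib call such as the one below (the sample data is invented for illustration) hits this ImportError, because MLlib converts the input rows into NumPy-backed vectors:

from pyspark.mllib.stat import Statistics

rdd = sc.parallelize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
summary = Statistics.colStats(rdd)   # needs NumPy on the driver and the executors
print(summary.mean())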

First, we try to install NumPy using the following command:

$ sudo pip install numpy

This command can fail with a system error, for example when pip itself or the Python development headers are missing, and NumPy is not installed.

Follow the steps below to resolve the error.

Solution:

$ wget https://bootstrap.pypa.io/get-pip.py
$ sudo python get-pip.py
$ sudo apt-get install python-dev     # the equivalent yum package is python-devel
$ sudo pip install numpy

Then verify the installation from the Python shell:
>>> import numpy
>>> a1 = numpy.array([1,2,3,4,5])
>>> a1sum = a1.sum()
>>> print(a1sum)


Output : 15
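Note that import numpy in a local Python shell only proves the driver host can see the library; the same installation has to be done on every worker node. A quick sanity check like the one below (the partition count of 4 is arbitrary) confirms the executors can import NumPy as well:

# Run inside pyspark: import NumPy on the executors and report the versions found.
versions = sc.parallelize(range(4), 4) \
             .map(lambda _: __import__("numpy").__version__) \
             .distinct() \
             .collect()
print(versions)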


Creating DataFrames from CSV in Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV Example").getOrCreate()
sc = spark.sparkContext
Sp...