July 27, 2018

Prepare Your Apache Hadoop Cluster for PySpark Jobs

Since Spark itself runs in the JVM, Java has advantages: platform independence from running inside the JVM, self-contained packaging of code and its dependencies into JAR files, and higher performance.

If you choose to use Python, a non-JVM language, for your Spark code, you will often struggle with managing dependencies and making them available to PySpark jobs on a cluster.

In an Apache Hadoop cluster, you first need to identify the required dependencies and understand where the different parts of your Spark code are executed and how computation is distributed across the cluster. Spark orchestrates its operations via the driver program. The driver program initializes a SparkContext, in which you define your data actions and transformations, e.g. map, flatMap, and filter. When the driver program runs, the Spark framework starts executor processes on the worker nodes, which then process your data across the cluster.
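As a small, hypothetical sketch of that split (the values are illustrative, and sc is the SparkContext provided by the pyspark shell), the driver only records the transformations below, while the executors on the worker nodes evaluate the lambdas once an action runs:

>>> words = sc.parallelize(["spark", "hadoop", "python"])
>>> # map and filter are transformations; the driver only records them
>>> lengths = words.map(lambda w: len(w)).filter(lambda n: n > 5)
>>> # collect() is an action: it triggers execution on the executors
>>> # and brings the results back to the driver
>>> lengths.collect()
[6, 6]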

If the Python transformations you define use any third-party libraries, like NumPy or nltk, then the Spark executors will need access to those libraries when they execute your code on the remote worker nodes.
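For instance (again an illustrative sketch from the pyspark shell), the lambda below is shipped to the executors, so NumPy is imported and used on the worker nodes rather than on the driver:

>>> import numpy
>>> rdd = sc.parallelize([[1.0, 2.0], [3.0, 4.0]])
>>> # numpy.mean runs inside the executor processes on the workers,
>>> # so NumPy must be installed on every worker node
>>> rdd.map(lambda row: float(numpy.mean(row))).collect()
[1.5, 3.5]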

In this post, we shall resolve the following error:
 ImportError: No module named numpy

This error means that MLlib functions do not work on the cluster, since pyspark.mllib depends on NumPy.
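A typical, hypothetical reproduction from the pyspark shell looks like this: pyspark.mllib is backed by NumPy, so any call into it fails while NumPy is missing.

>>> from pyspark.mllib.stat import Statistics
>>> rdd = sc.parallelize([[1.0, 2.0], [3.0, 4.0]])
>>> # pyspark.mllib needs NumPy on the driver and on the worker nodes;
>>> # without it, this raises ImportError: No module named numpy
>>> Statistics.colStats(rdd)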

We shall first try to install NumPy using the following command:

$ sudo pip install numpy

But this command may fail with a system error, for example if pip itself is not installed.

Follow the steps below to resolve this error.

Solution:

$ wget https://bootstrap.pypa.io/get-pip.py
$ sudo python get-pip.py
$ sudo apt-get install python-dev    # on RHEL/CentOS: sudo yum install python-devel
$ sudo pip install numpy
Then verify the installation in a Python shell:

>>> import numpy
>>> a1 = numpy.array([1,2,3,4,5])
>>> a1sum = a1.sum()
>>> print(a1sum)


Output : 15
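Note that the pip install has to be repeated on every worker node (or automated across the cluster), because the executors import NumPy locally. As a final, hypothetical check from the pyspark shell, the MLlib call that failed earlier should now succeed:

>>> from pyspark.mllib.stat import Statistics
>>> rdd = sc.parallelize([[1.0, 2.0], [3.0, 4.0]])
>>> summary = Statistics.colStats(rdd)
>>> summary.mean()   # column means: approximately [2.0, 3.0]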


