Step 1: Set up a CentOS AMI on an Amazon AWS EC2 instance
Choose a CentOS AMI when launching the Amazon AWS EC2 instance.
Step 2: Connect to the CentOS AWS instance
ssh -i ~/.ssh/filename.pem centos@awsinstanceip
Step 3: Connect as the root user (super admin)
sudo su -
Step 4: Update all the available packages from the repository
yum update
Step 5: Install JDK
yum install java-1.8.0-openjdk.x86_64
Check the Java version:
java -showversion
java -version
whereis java
sudo alternatives --config javac
Step 6: Install Hadoop
Install wget to allow downloading software:
yum -y install wget
cd /usr/local
wget http://apache.javapipe.com/hadoop/common/hadoop-2.7.6/hadoop-2.7.6.tar.gz
tar -zxvf hadoop-2.7.6.tar.gz
Set the Hadoop home to /usr/local/hadoop-2.7.6
Set the Java home to the installed JRE (verify the path exists):
cd /usr/lib/jvm/jre-1.8.0-openjdk
Step 7: Configure Hadoop
Set JAVA_HOME and HADOOP_HOME in the /root/.bashrc file by copying the following content.
7.1 Open the file: vi /root/.bashrc
7.2 Copy the content:
export HADOOP_HOME=/usr/local/hadoop-2.7.6
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
lzohead () {
    hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
export PATH=$PATH:$HADOOP_HOME/bin
7.3 Restart the instance (or simply run source /root/.bashrc) and check the Java and Hadoop locations:
echo $JAVA_HOME
echo $HADOOP_HOME
7.4 Create a temp directory for Hadoop data storage:
mkdir -p /tmp/hadoop/data
7.5 Set JAVA_HOME in /usr/local/hadoop-2.7.6/etc/hadoop/hadoop-env.sh
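For example, the JAVA_HOME line in hadoop-env.sh can point at the same JRE path used in .bashrc above:
export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk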
7.6 Configure etc/hadoop/core-site.xml. The scheme and authority of the default filesystem URI determine the FileSystem implementation.
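A minimal core-site.xml sketch for a single node; the hdfs://localhost:9000 URI and the temp directory from step 7.4 are assumed values, adjust them to your environment:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop/data</value>
  </property>
</configuration>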
7.7 Configure etc/hadoop/mapred-site.xml with the following content. It is the configuration for the JobTracker.
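A minimal sketch for the JobTracker configuration on a single node; the localhost:54311 address is an assumed value (with YARN, which Step 8 starts, setting mapreduce.framework.name to yarn is the common alternative):
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>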
7.8 Configure etc/hadoop/hdfs-site.xml. Set the replication factor for HDFS blocks via the dfs.replication property.
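A minimal hdfs-site.xml sketch; a replication factor of 1 is assumed since this is a single-node cluster:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>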
Step 8: Start Hadoop
8.1 Format the Hadoop filesystem, which is implemented on top of the local filesystems of your cluster. You need to do this the first time you set up a Hadoop installation. From the Hadoop home directory (/usr/local/hadoop-2.7.6), run:
./bin/hdfs namenode -format
8.2 Start your Hadoop single-node cluster:
./sbin/start-dfs.sh
./sbin/start-yarn.sh
8.3 jps (Java Virtual Machine Process Status Tool)
jps is a command used to check all the Hadoop daemons running on the machine, such as the NameNode, DataNode, ResourceManager, and NodeManager. If jps doesn't run, install it via ant.
sudo yum install ant
jps
Output:
1600 ResourceManager
1703 NodeManager
1288 DataNode
1449 SecondaryNameNode
2331 Jps
1164 NameNode
Apache Spark:
Download and install Spark:
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.7.tgz
tar xf spark-2.0.0-bin-hadoop2.7.tgz
mkdir /usr/local/spark
cp -r spark-2.0.0-bin-hadoop2.7/* /usr/local/spark
Add the following to ~/.bash_profile, then reload it:
export SPARK_EXAMPLES_JAR=/usr/local/spark/examples/jars/spark-examples_2.11-2.0.0.jar
PATH=$PATH:$HOME/bin:/usr/local/spark/bin
source ~/.bash_profile
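A quick sanity check that the PATH change took effect; spark-submit ships with the Spark distribution:
which spark-submit
spark-submit --version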
Start a PySpark session
./bin/pyspark
Python 2.7.5 (default, Jul 13 2018, 13:06:57)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/09/30 17:07:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Python version 2.7.5 (default, Jul 13 2018 13:06:57)
SparkSession available as 'spark'.
>>>
The text from the input text file is tokenized into words to form a key-value pair for every word present in the input file. The key is the word from the input file and the value is 1.
For instance, consider the sentence "Hello World". PySpark in the WordCount example will split the string into individual tokens, i.e. words. In this case, the sentence is split into 2 tokens (one for each word), each with a value of 1:
(Hello,1)
(World,1)
file.txt contains "Hello World"
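One way to create the sample input file (copy it into HDFS with hadoop fs -put if HDFS is configured as the default filesystem):
echo "Hello World" > file.txt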
Test PySpark code
PySpark code:
lines = sc.textFile("file.txt")
sorted(lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(lambda v1, v2: v1 + v2).collect())
Output:
[(u'Hello', 1), (u'World', 1)]