
October 18, 2018

Safe Mode in Hadoop

Safe mode in Hadoop is a maintenance state of the NameNode during which the NameNode does not allow any changes to the file system. An HDFS cluster is in safe mode during startup because the NameNode needs to validate all the blocks and their locations before it starts serving writes.


Common errors that occur when Hadoop is in safe mode:

Error when reading/writing HDFS data
Cannot create directory

e.g.
# ./bin/hdfs  dfs -mkdir /usr/local/hdfsdata
mkdir: Cannot create directory /usr/local/hdfsdata. Name node is in safe mode.

If required, HDFS can also be placed in safe mode explicitly for administration activities with the dfsadmin command utility, e.g. bin/hdfs dfsadmin -safemode enter.

# hdfs dfsadmin -safemode get
Safe mode is ON

Once the blocks are validated, safe mode can be disabled:
# hdfs  dfsadmin -safemode leave
Safe mode is OFF
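
If you script your jobs, you can also wait for safe mode to clear before submitting work. Below is a minimal Python sketch (not from the original post) that polls the same dfsadmin command; hdfs dfsadmin -safemode wait does the equivalent directly from the shell.

import subprocess
import time

def wait_for_safemode_off(poll_seconds=10, timeout_seconds=600):
    """Poll `hdfs dfsadmin -safemode get` until the NameNode reports OFF."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        result = subprocess.run(["hdfs", "dfsadmin", "-safemode", "get"],
                                capture_output=True, text=True, check=True)
        if "OFF" in result.stdout:   # output looks like "Safe mode is OFF"
            return True
        time.sleep(poll_seconds)     # NameNode is still validating blocks
    return False

if __name__ == "__main__":
    print("Safe mode off:", wait_for_safemode_off())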

HDFS Cheat Sheet

Commonly used & most useful  HDFS Commands



1. Create a directory in HDFS at given path(s).

# cd $HADOOP_HOME

# hdfs  dfs -mkdir /usr/local/hadoopdata



2. Upload and download a file in HDFS

   Upload:

    hdfs dfs -put /usr/local/localdata/names.csv /usr/local/hadoopdata/names.csv

 
Download:

   hdfs dfs -get /usr/local/hadoopdata/names.csv /usr/local/localdata/hnames.csv



3.  List the contents of a directory.

 hdfs  dfs -ls     /usr/local/hadoopdata



4. Copy a file from/to the local file system to/from HDFS

These work like put and get, except that copyFromLocal restricts the source to a local file reference and copyToLocal restricts the destination to a local file reference.

hdfs  dfs -copyFromLocal  /usr/local/localdata/names.csv  /usr/local/hadoopdata/hnames.csv

hdfs  dfs -copyToLocal    /usr/local/hadoopdata/hnames.csv  /usr/local/localdata/fromhdnames2.csv

5. See contents of a file



 hdfs dfs -cat /usr/local/hadoopdata/hnames.csv

6. Display last few lines of a file.

hdfs dfs -tail /usr/local/hadoopdata/hnames.csv

7. Display the aggregate length of a file.

hdfs dfs -du /usr/local/hadoopdata/hnames.csv


8. Run some of the examples provided:

hadoop jar hadoop-mapreduce-examples-2.7.6.jar wordcount  /usr/local/hdfsdata/names/names.csv  /usr/local/hdfsdata/names/namescount.csv

9. Examine the output files

 hdfs dfs -ls /usr/local/hdfsdata/names/namescount.csv

 hdfs dfs -cat /usr/local/hdfsdata/names/namescount.csv/part-r-00000

10. Example: Top 5 words in a text file

   10.1 Copy local file to Hadoop
    hdfs dfs -copyFromLocal  /usr/local/localdata/data.txt  /usr/local/hdfsdata/names/data.txt

   10.2 Run wordcount to produce result
   hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount  /usr/local/hdfsdata/names/data.txt  /usr/local/hdfsdata/names/datacount

  10.3 Use sort and head for top 5
 hdfs dfs -cat /usr/local/hdfsdata/names/datacount/part-r-00000 | sort -k2 -nr | head -5

  10.4 Save the top 3 to a local file

    hdfs dfs -cat /usr/local/hdfsdata/names/datacount/part* | sort -k2 -nr | head -3 > /usr/local/localdata/top3.txt
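
The same top-N step can also be done on the client side in Python. A minimal sketch (not from the original post; the output path is reused from step 10.2) that streams the part files out of HDFS and sorts the counts numerically:

import subprocess
import heapq

def top_n_words(output_dir="/usr/local/hdfsdata/names/datacount", n=5):
    # Stream every part file of the wordcount output out of HDFS.
    result = subprocess.run(["hdfs", "dfs", "-cat", output_dir + "/part-*"],
                            capture_output=True, text=True, check=True)
    pairs = []
    for line in result.stdout.splitlines():
        word, _, count = line.rpartition("\t")   # wordcount emits "word<TAB>count"
        if word and count.isdigit():
            pairs.append((int(count), word))
    return heapq.nlargest(n, pairs)              # numeric top-N

if __name__ == "__main__":
    for count, word in top_n_words():
        print(word, count)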

July 27, 2018

Prepare Your Apache Hadoop Cluster for PySpark Jobs

Since Spark itself runs in the JVM, JVM languages such as Java and Scala have advantages: platform independence from running inside the JVM, self-contained packaging of code and its dependencies into JAR files, and generally higher performance.

If you choose to use Python, a non-JVM language, for your Spark code, you will struggle with managing dependencies and making them available for PySpark jobs on a cluster.

On an Apache Hadoop cluster, you first need to identify the required dependencies and understand where the different parts of your Spark code get executed and how the computation is distributed across the cluster. Spark orchestrates its operations via the driver program. The driver program initializes a SparkContext in which you define your data actions and transformations, e.g. map, flatMap, and filter. When the driver program is run, the Spark framework initializes executor processes on the worker nodes, which then process your data across the cluster.
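
As an illustration, here is a minimal PySpark sketch (not from the original post; the HDFS path is borrowed from the cheat sheet above) that marks which parts run on the driver and which on the executors:

from pyspark import SparkContext

sc = SparkContext(appName="driver-vs-executors")            # runs in the driver

lines = sc.textFile("/usr/local/hdfsdata/names/data.txt")    # path assumed from the examples above
word_counts = (lines
               .flatMap(lambda line: line.split())           # executed on the executors
               .filter(lambda word: word)                    # drop empty tokens
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(word_counts.take(5))                                   # results collected back to the driver
sc.stop()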

If the Python transformations you define use any third-party libraries, like NumPy or NLTK, then the Spark executors will need access to those libraries when they execute your code on the remote worker nodes.

In this post, we shall resolve the following error:
 ImportError: No module named numpy

This error means that MLlib functions will not work on the cluster.

We can try to install numpy using the following command:

$ sudo pip install numpy

But this gives a system error, and numpy does not get installed.

Follow the steps below on each node of the cluster to resolve this error.

Solution:

$ wget https://bootstrap.pypa.io/get-pip.py
$ sudo python get-pip.py
$ sudo apt-get install python-dev
$ sudo pip install numpy

Verify the installation from a Python shell:

>>> import numpy
>>> a1 = numpy.array([1,2,3,4,5])
>>> a1sum = a1.sum()
>>> print(a1sum)


Output : 15
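
After numpy is installed on every worker node, a quick way to confirm that the executors (and not just the driver) can import it is a small PySpark job such as this sketch (not from the original post):

from pyspark import SparkContext

def uses_numpy(x):
    import numpy                      # imported inside the task, i.e. on an executor
    return float(numpy.sqrt(x))

sc = SparkContext(appName="numpy-on-executors")
print(sc.parallelize([1, 4, 9, 16]).map(uses_numpy).collect())   # expect [1.0, 2.0, 3.0, 4.0]
sc.stop()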


January 07, 2016

What does the term "Data Locality" mean in Hadoop?


Hadoop believes in "moving computation is cheaper than moving data", i.e. it works on local data as far as possible. This means that when we invoke any MapReduce job, the logic (the map and reduce code) is sent to each data node in the cluster.

Hadoop puts the MapReduce job's JAR into HDFS, and the TaskTrackers that need it pull it from there, so each node can process its local data.

When a dataset is stored in HDFS, it is divided into blocks and stored across the DataNodes in the Hadoop cluster. When a MapReduce job is executed against the dataset, the individual Mappers process the blocks (input splits). When the data is not available to a Mapper on the same node where it is being executed, the data needs to be copied over the network from the DataNode that holds it to the DataNode that is executing the Mapper task.

So suppose you have a text file of 1 GB and you have written MapReduce code to convert all the text in that file to upper case. First the file is broken into chunks (blocks), and the logic to convert the text to upper case is made available to each data node. The TaskTracker on each node then runs the MapReduce code only on the data block(s) present on that local node. This is known as data locality.
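
To make the upper-case example concrete, the map logic could be as small as this Hadoop Streaming style sketch (the script name and the use of Streaming are assumptions for illustration, not from the original post); run as a map-only job, each mapper uppercases the block stored on or near its own node and writes the result back to HDFS.

#!/usr/bin/env python
# upper_mapper.py (hypothetical file name): a map-only Hadoop Streaming mapper.
# Each mapper instance receives an input split that is, whenever possible,
# stored on its own node, and simply uppercases the lines it is given.
import sys

for line in sys.stdin:
    sys.stdout.write(line.upper())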

Why is Data Locality important?

Imagine a MapReduce job with over 100 Mappers, each trying to copy its data from another DataNode in the cluster at the same time: this would result in serious network congestion. So it is always more effective and cheaper to move the computation closer to the data than to move the data closer to the computation.

February 20, 2013

Implementing a Left Outer Join in Map Reduce

The Problem:

I have two datasets:
  1. User information (id, mobile,  location)
  2. Transaction information (transaction-id, car-id, user-id, CarBookingDate)
Given these data sets, I want to find the number of unique locations in which each car has been sold.

 One Solution

  1. For each transaction, look up the user record for the transaction’s user-Id
  2. Join the user records to each transaction
  3. Create a mapping from car-id to a list of locations
  4. Count the number of distinct locations per car-id.

The Map Reduce Solution

First off, the problem requires that we write a two-stage MapReduce job:
  1. Join users onto transactions and emit a set of car-location pairs (one for each transaction)
  2. For each car sold, count the # of distinct locations it was sold in

STAGE 1

We’re basically building a left outer join with map reduce.
  • transaction map task outputs (K,V) with K = userId, and V = carId
  • user map tasks outputs (K,V) with K = userId, and V = location
  • reducer gets both the user's location and the carIds for that userId, and thus outputs (K,V) with K = carId and V = location
STAGE 2

  • map task is an identity mapper, outputs (K,V) with K = carId and V = location
  • reducer counts the number of unique locations that it sees per carId, outputs (K,V) with K = carId and V = # distinct locations
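
A minimal Hadoop Streaming style sketch of STAGE 1 (the file name join_stage1.py, the comma-separated record layout, and the U/T tags are assumptions for illustration, not the post's own code):

import sys
from itertools import groupby

def mapper(stream):
    """Tag each record with its source and key it by userId."""
    for line in stream:
        fields = line.strip().split(",")
        if len(fields) == 3:                       # user record: id, mobile, location
            user_id, _mobile, location = fields
            print("%s\tU\t%s" % (user_id, location))
        elif len(fields) == 4:                     # transaction: txn-id, car-id, user-id, date
            _txn_id, car_id, user_id, _date = fields
            print("%s\tT\t%s" % (user_id, car_id))

def reducer(stream):
    """For each userId, emit (carId, location) for every transaction.
    Transactions with no matching user keep a NULL location, which is what
    makes this a left outer join on the transaction side."""
    def parse(lines):
        for line in lines:
            user_id, tag, value = line.rstrip("\n").split("\t")
            yield user_id, tag, value
    for user_id, records in groupby(parse(stream), key=lambda r: r[0]):
        location, car_ids = None, []
        for _uid, tag, value in records:
            if tag == "U":
                location = value
            else:
                car_ids.append(value)
        for car_id in car_ids:
            print("%s\t%s" % (car_id, location if location is not None else "NULL"))

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)

Hadoop Streaming sorts the mapper output by the userId key between the two phases, so the reducer sees each user's records together. STAGE 2 is then a word-count-style job over the emitted carId/location pairs that counts the distinct locations per carId.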

February 12, 2012

Hadoop

What is Hadoop?
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
Apache Hadoop is an ideal platform for consolidating large-scale data from a variety of new and legacy sources. It complements existing data management solutions with new analyses and processing tools. It delivers immediate value to companies in a variety of vertical markets.
Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.
Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data — and can run large-scale, high-performance processing jobs — in spite of system changes or failures.

Where did Hadoop come from?
The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo has played a key role in developing Hadoop for enterprise applications.

What problems can Hadoop solve?
The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built.

How is Hadoop architected?
Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.
In a centralized database system, you've got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
Architecturally, the reason you're able to deal with lots of data is because Hadoop spreads it out. And the reason you're able to ask complicated computational questions is because you've got all of these processors, working in parallel, harnessed together.

Hadoop Project:

The project includes these subprojects:

Hadoop Common: The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:

Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™: A high-performance coordination service for distributed applications.
