November 07, 2011

Hive - Data Warehousing & Analytics on Hadoop

Hive is an open source, petabyte-scale data warehousing framework built on Hadoop. It was developed by the Data Infrastructure Team at Facebook and facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query it using a SQL-like language called HiveQL. HiveQL also lets traditional map/reduce programmers plug in their custom mappers and reducers when it is inconvenient or inefficient to express that logic in HiveQL.
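As a rough illustration of the custom mapper hook, here is a minimal Python sketch of a script that HiveQL's TRANSFORM clause could stream rows through. It assumes rows arrive as tab-separated lines, which is TRANSFORM's default row format. The table name (page_views), the column names (user_id, url), the script name (extract_host.py), and the helper names are hypothetical, chosen only for this example.

```python
import io

# Hypothetical HiveQL that would stream rows through this script:
#   ADD FILE extract_host.py;
#   SELECT TRANSFORM(user_id, url) USING 'python extract_host.py'
#          AS (user_id, host)
#   FROM page_views;

def map_rows(lines):
    """Yield (user_id, host) for each tab-separated (user_id, url) input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows instead of failing the whole job
        user_id, url = fields
        # Drop an optional scheme ("http://"), then keep everything before the first "/".
        host = url.split("://", 1)[-1].split("/", 1)[0].lower()
        yield user_id, host

def run(inp, out):
    """Read rows from inp and write tab-separated result rows to out."""
    for user_id, host in map_rows(inp):
        out.write(f"{user_id}\t{host}\n")

# Small in-memory demonstration standing in for the rows Hive would send.
demo_in = io.StringIO("1\thttp://Example.com/a/b\n2\thive.apache.org/docs\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
```

When used as an actual streaming script, the same function would be called as run(sys.stdin, sys.stdout) after importing sys.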


Hive architecture:
Hive organizes data into tables and partitions. A good partitioning scheme lets Hive prune data while processing a query, that is, skip partitions that cannot contain matching rows, which directly affects how quickly the query result can be produced. Behind the scenes, Hive stores tables and partitions as directories in the Hadoop Distributed File System (HDFS).
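The idea of partition pruning can be sketched in a few lines of Python. This is only a toy model, not Hive's implementation: it assumes partitions are laid out as key=value directories under the table directory, and the table name, partition keys (dt, country), and helper functions are made up for illustration.

```python
def parse_partition(path):
    """Extract {key: value} pairs from the key=value components of a path."""
    spec = {}
    for part in path.strip("/").split("/"):
        if "=" in part:
            key, value = part.split("=", 1)
            spec[key] = value
    return spec

def prune(partition_dirs, **wanted):
    """Keep only the directories whose partition values match every requested key."""
    return [
        d for d in partition_dirs
        if all(parse_partition(d).get(k) == v for k, v in wanted.items())
    ]

# Hypothetical partition directories for a page_views table.
dirs = [
    "/user/hive/warehouse/page_views/dt=2011-11-06/country=US",
    "/user/hive/warehouse/page_views/dt=2011-11-07/country=US",
    "/user/hive/warehouse/page_views/dt=2011-11-07/country=IN",
]

# A query filtering on dt = '2011-11-07' only needs to read two of the three directories.
selected = prune(dirs, dt="2011-11-07")
```

The fewer directories survive the pruning step, the less data the resulting map/reduce jobs have to scan, which is why the choice of partition keys matters.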


Hive comprises the following major components:

- Metastore: Stores the metadata.
- Query compiler and execution engine: Convert HiveQL queries into a sequence of map/reduce jobs that are then executed on Hadoop.
- SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types.
- UDF and UDAF: Programmable interfaces and implementations for user-defined functions (scalar and aggregate functions).
- Clients: A command line client similar to the MySQL command line, and a web UI.
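To make the difference between scalar (UDF) and aggregate (UDAF) functions concrete, here is a conceptual sketch in Python. Real Hive UDFs and UDAFs are written against Hive's Java interfaces; the names below are illustrative only, with the aggregate's method names loosely modeled on the iterate / partial result / merge / final result steps of an aggregate evaluator.

```python
def udf_upper(value):
    """Scalar function: one input value in, one output value out, row by row."""
    return None if value is None else value.upper()

class AverageUDAF:
    """Aggregate function: consumes many rows and produces a single value."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        """Fold one input row into the running state, ignoring NULLs."""
        if value is not None:
            self.total += value
            self.count += 1

    def terminate_partial(self):
        """Return the partial state, e.g. what a map task would hand to a reducer."""
        return (self.total, self.count)

    def merge(self, partial):
        """Combine a partial state produced by another task."""
        total, count = partial
        self.total += total
        self.count += count

    def terminate(self):
        """Return the final result, or None if no rows were seen."""
        return self.total / self.count if self.count else None

# Two "map-side" aggregators each see part of the data ...
part_a, part_b = AverageUDAF(), AverageUDAF()
for v in [1, 2, 3]:
    part_a.iterate(v)
for v in [4, None, 5]:
    part_b.iterate(v)

# ... and a "reduce-side" aggregator merges their partial results.
final = AverageUDAF()
final.merge(part_a.terminate_partial())
final.merge(part_b.terminate_partial())
average = final.terminate()
```

Keeping a (total, count) pair as the partial state, rather than a partial average, is what allows the results of separate tasks to be merged correctly.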

Data Flow into Hadoop Cloud:




For more information:
http://www.vldb.org/pvldb/2/vldb09-938.pdf
