February 28, 2013

NoSQL Key-Value Store

Basic terminology:
  • Key-Value Store – data is stored in unstructured records consisting of a key + the values associated with that record
  • NoSQL – a class of databases that doesn't rely on SQL commands or the relational model
Let's say you've got millions of data records, as you might have, for example, if you've got millions of users who visit your website. Each user record (which looks like a "row" in a database table) holds whatever information you have about that user. Note that not every user has the same information: some users will have a username, some will only have an email address, some will have provided their name and others will not. Each record has a different length and different values.
To store this kind of data, you create a key for each record and then store whatever fields are available as bins (what would be columns in a structured database), where each bin consists of a name and a value. If you don't have a particular piece of data, you don't have a blank field (as in a relational table); you simply don't store a bin for that data.
This type of database is called a Key-Value Store because each record has a primary key and a collection of values (bins).  It’s also called a Row Store because all of the data for a single record is stored together, in something that we can think of conceptually as a row.

Example of unstructured data for user records:

Key: 1 – ID: av, First Name: Avishkar
Key: 2 – Email: avishkarm@gmail.com, Location: Mumbai, Age: 37
Key: 3 – Facebook ID: avishkarmeshram, Password: xxx, Name: Avishkar

Data is organized into policy containers called 'namespaces', semantically similar to 'databases' in an RDBMS. Namespaces are configured when the cluster is started, and are used to control retention and reliability requirements for a given set of data.
Within a namespace, data is subdivided into ‘sets’ (similar to ‘tables’) and ‘records’ (similar to ‘rows’). Each record has an indexed ‘key’ that is unique in the set, and one or more named ‘bins’ (similar to columns) that hold values associated with the record.
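As a minimal sketch of this hierarchy (plain Python illustrating the data model, not any particular product's API), namespaces, sets, records and bins can be pictured as nested dictionaries, where a record simply omits any bin it doesn't have:

```python
# namespace -> set -> primary key -> {bin name: bin value}
cluster = {"users_ns": {"profiles": {}}}

def put(namespace, set_name, key, **bins):
    """Store only the bins provided; records in the same set can differ freely."""
    cluster[namespace][set_name].setdefault(key, {}).update(bins)

def get(namespace, set_name, key, bin_name, default=None):
    """Read one bin of a record, or a default if the record or bin is absent."""
    return cluster[namespace][set_name].get(key, {}).get(bin_name, default)

put("users_ns", "profiles", 1, id="av", first_name="Avishkar")
put("users_ns", "profiles", 2, email="avishkarm@gmail.com",
    location="Mumbai", age=37)

print(get("users_ns", "profiles", 2, "location"))  # Mumbai
print(get("users_ns", "profiles", 1, "location"))  # None: bin never stored
```

Notice that record 1 never stored a location bin, so reading it back returns nothing rather than a blank field.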




Indexes (primary keys) are stored in DRAM for ultra-fast access, and values can be stored either in DRAM or, more cost-effectively, on SSDs. Each namespace can be configured separately, so small namespaces can take advantage of DRAM while larger ones gain the cost advantage of SSDs.

February 20, 2013

Hadoop Architecture and its Usage at Facebook

Lots of data is generated on Facebook
– 300+ million active users 
– 30 million users update their statuses at least once each day
– More than 1 billion photos uploaded each month 
– More than 10 million videos uploaded each month 
– More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week

Data Usage
Statistics per day:
– 4 TB of compressed new data added per day
– 135TB of compressed data scanned per day
– 7500+ Hive jobs on production cluster per day
– 80K compute hours per day
Barrier to entry is significantly reduced:
– New engineers go through a Hive training session
– ~200 people/month run jobs on Hadoop/Hive
– Analysts (non-engineers) use Hadoop through Hive

Where is this data stored?
Hadoop/Hive Warehouse
– 4800 cores, 5.5 petabytes
– 12 TB per node
– Two level network topology
1 Gbit/sec from node to rack switch
4 Gbit/sec to top level rack switch

Data Flow into Hadoop Cloud

[Diagram omitted: data flow into the Hadoop cloud; old data is moved to cheap storage.]

Implementing a Left Outer Join in Map Reduce

The Problem:

I have two datasets:
  1. User information (id, mobile, location)
  2. Transaction information (transaction-id, car-id, user-id, CarBookingDate)
Given these data sets, I want to find the number of unique locations in which each car has been sold.

One Solution

  1. For each transaction, look up the user record for the transaction's user-id
  2. Join the user records to each transaction
  3. Create a mapping from car-id to a list of locations
  4. Count the number of distinct locations per car-id.
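Before turning to MapReduce, the four steps above can be sketched directly in plain Python (field names follow the datasets described above; the sample data is made up for illustration):

```python
# users: id -> location; transactions: (transaction_id, car_id, user_id, booking_date)
users = {1: "Mumbai", 2: "Pune", 3: "Mumbai"}
transactions = [
    ("t1", "carA", 1, "2013-02-01"),
    ("t2", "carA", 2, "2013-02-02"),
    ("t3", "carB", 3, "2013-02-03"),
    ("t4", "carA", 3, "2013-02-04"),
]

# Steps 1-3: join each transaction to its user and map car_id -> set of locations.
locations_per_car = {}
for _tx_id, car_id, user_id, _date in transactions:
    location = users.get(user_id)          # step 1: look up the user record
    if location is not None:               # step 2: join (skip users with no record)
        locations_per_car.setdefault(car_id, set()).add(location)  # step 3

# Step 4: count distinct locations per car.
distinct_counts = {car: len(locs) for car, locs in locations_per_car.items()}
print(distinct_counts)  # {'carA': 2, 'carB': 1}
```

This works in memory; the point of the MapReduce version below is to do the same join when neither dataset fits on one machine.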

The Map Reduce Solution

First off, the problem requires that we write a two-stage map-reduce job:
  1. Join users onto transactions and emit a set of car-location pairs (one for each transaction)
  2. For each car sold, count the # of distinct locations it was sold in

STAGE 1

We're basically building a left outer join with map reduce.
  • transaction map task outputs (K,V) with K = userId and V = carId
  • user map task outputs (K,V) with K = userId and V = location
  • reducer receives both the user's location and the carIds for a given userId, and outputs (K,V) with K = carId and V = location

STAGE 2

  • map task is an identity mapper: outputs (K,V) with K = carId and V = location
  • reducer counts the number of unique locations it sees per carId, and outputs (K,V) with K = carId and V = # distinct locations
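The two stages can be simulated in plain Python with a tiny map/shuffle/reduce harness (this mimics the MapReduce dataflow in-process; it is a sketch, not Hadoop API code, and the sample records are made up):

```python
from collections import defaultdict

def run_stage(mapped_pairs, reducer):
    """Shuffle (group values by key) and reduce, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(reducer(key, values))
    return out

users = [(1, "Mumbai"), (2, "Pune"), (3, "Mumbai")]                  # (user_id, location)
transactions = [("carA", 1), ("carA", 2), ("carB", 3), ("carA", 3)]  # (car_id, user_id)

# STAGE 1 map: tag each value so the reducer can tell the two inputs apart.
stage1_input = [(uid, ("loc", loc)) for uid, loc in users] + \
               [(uid, ("car", car)) for car, uid in transactions]

def stage1_reduce(user_id, values):
    """Left outer join: pair every carId for this user with the user's location."""
    locations = [v for tag, v in values if tag == "loc"]
    cars = [v for tag, v in values if tag == "car"]
    loc = locations[0] if locations else None
    return [(car, loc) for car in cars]

car_location_pairs = run_stage(stage1_input, stage1_reduce)

# STAGE 2: identity map, then count distinct (non-null) locations per car.
def stage2_reduce(car_id, locations):
    return [(car_id, len(set(l for l in locations if l is not None)))]

result = dict(run_stage(car_location_pairs, stage2_reduce))
print(result)  # {'carA': 2, 'carB': 1}
```

Tagging the values ("loc" vs "car") is what lets one reducer see both sides of the join for a given userId; a transaction whose user has no record still comes out of stage 1, which is what makes the join a left outer join.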

February 19, 2013

What's the big deal about Hadoop?

Hadoop has advantages over traditional database management systems, especially the ability to handle both structured data like that found in relational databases, say, as well as unstructured information such as video -- and lots of it. The system can also scale up with a minimum of fuss and bother.
A growing number of firms are using Hadoop and related technologies such as Hive, Pig and HBase to analyze data in ways that cannot easily or affordably be done using traditional relational database technologies.
JPMorgan Chase, for instance, is using Hadoop to improve fraud detection, IT risk management, and self-service applications. The financial services firm is also using the technology to enable a far more comprehensive view of its customers than was possible previously, executives said.
Meanwhile, eBay is using Hadoop and the HBase open source database, which enables real-time analysis of data in Hadoop environments, to build a new search engine for its auction site.
The new eBay search engine, code-named Cassini, will replace the Voyager technology that's been used since the early 2000s. The update is needed in part due to surging volumes of data that need to be managed. Cassini will deliver more accurate and more context-based results to user search queries.

What is Hadoop used for?
Search
– Yahoo, Amazon, Zvents
Log processing
– Facebook, Yahoo, ContextWeb, Joost, Last.fm
Recommendation Systems
– Facebook
Data Warehouse
– Facebook, AOL
Video and Image Analysis
– New York Times, Eyealike


Goals of HDFS

Very Large Distributed File System
– 10K nodes, 100 million files, 10-100 PB

Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them

Optimized for Batch Processing
– Data locations exposed so that computations can move to
where data resides
– Provides very high aggregate bandwidth 

User Space, runs on heterogeneous OS  
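The replication goal above can be illustrated with a small placement sketch. This is a simplification of HDFS's default rack-aware policy (first replica on the writer's node, further replicas on a different rack); node and rack names are made up, and this is not Hadoop API code:

```python
# A toy cluster: rack name -> nodes on that rack.
cluster = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}

def place_replicas(writer_node):
    """Place 3 replicas: the writer's node, then two nodes on a different rack,
    so losing any single node (or even a whole rack) never loses the block."""
    local_rack = next(r for r, nodes in cluster.items() if writer_node in nodes)
    remote_rack = next(r for r in cluster if r != local_rack)
    return [writer_node, cluster[remote_rack][0], cluster[remote_rack][1]]

print(place_replicas("node2"))  # ['node2', 'node4', 'node5']
```

The point is that failure handling is a placement decision made at write time, not a recovery procedure bolted on afterwards; the real system additionally re-replicates blocks when it detects a dead node.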

February 15, 2013

Scaling Out SQL Server

Scalability is the ability of an application to efficiently use more resources in order to do more useful work. For example, an application that can service four users on a single-processor system may be able to service 15 users on a four-processor system. In this case, the application is scalable. If adding more processors doesn't increase the number of users serviced (if the application is single threaded, for example), the application isn't scalable.

There are two kinds of scalability: scaleup and scaleout.
Scaleup means scaling to a bigger, more powerful server—going from a four-processor server to a 64-processor or 128-processor server, for example. This is the most common way for databases to scale. When your database runs out of resources on your current hardware, you go out and buy a bigger box with more processors and more memory. Scaleup has the advantage of not requiring significant changes to the database. In general, you just install your database on a bigger box and keep running the way you always have, with more database power to handle a heavier load.  

Scaleout means expanding to multiple servers rather than a single, bigger server. Scaleout usually has some initial hardware cost advantages: eight four-processor servers generally cost less than one 32-processor server. Scaleout means separating or partitioning the database system so that the parts can be placed on separate database servers, which lets you spread processing power across as many servers as necessary to accommodate expanding growth. However, the added capability comes with added complexity. A scaleout database scenario is not a particularly easy one to design or administer, and you must answer many difficult business and technology-driven questions before you can successfully implement a scaleout of a database system.
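As a minimal sketch of the partitioning idea (plain Python with made-up server names; a real SQL Server scaleout would use features such as distributed partitioned views, not application-side routing like this):

```python
# Route each row to one of several database servers by hashing its key.
servers = ["db-server-1", "db-server-2", "db-server-3"]

def server_for(customer_id):
    """Pick the server that owns this customer's data (hash partitioning)."""
    return servers[customer_id % len(servers)]

print(server_for(7))   # db-server-2  (7 % 3 == 1)
print(server_for(9))   # db-server-1  (9 % 3 == 0)
```

The hard questions alluded to above show up immediately in even this toy: how do you run a query that spans customers on different servers, and what happens to the routing function when you add a fourth server?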

There is no rule of thumb for choosing between scaleup and scaleout. If the cost of bigger hardware is less than the extra licensing and maintenance costs of many servers, scaleup is better than scaleout. On the other hand, with scaleout, if one machine out of your N machines fails it matters less: the system stays up and running, and the same resilience applies not only to failures but to hardware, OS, and software updates and upgrades. In those cases scaleout is better than scaleup.

February 07, 2013

What is Big Data

What is Big Data?

Big Data is a massive collection of data produced by multiple traffic sources which is constantly being updated – the very nature of Big Data means it’s complex and almost impossible to even get a handle on in the first place, let alone break down, assess and produce tangible results and recommendations that companies can learn from.
With Big Data, traditional web analytics is just the tip of the iceberg. We still need to know what traffic we're getting, where it's coming from and which journeys customers are taking when they arrive on the site, but in order to run a successful eCommerce store, we also need to take into account and learn from other data which is out of our control and not necessarily ours to "own".

Big Data is a massive volume of both structured and unstructured data that is so large that it's difficult to process using traditional database and software techniques.

The "structured" portion of Big Data refers to fixed fields within a database. For ecommerce merchants, this could be customer data — address, zip code — that's stored in a shopping cart.

The "unstructured" part encompasses email, video, tweets, Facebook Likes, product reviews, social media data and images: things you know are out there and relate to your business, but can't necessarily get hold of. None of this unstructured data resides in a fixed database that's accessible to merchants. Even so, feedback from, say, social media has become a very useful research tool for businesses.




Creating DataFrames from CSV in Apache Spark

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CSV Example").getOrCreate()
sc = spark.sparkContext
...