March 16, 2014

Apache Cassandra

Introduction

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

Apache Cassandra is a highly scalable, high-performance distributed database management system that can serve both as an operational datastore (the "system of record") for online/transactional applications and as a read-intensive database for business intelligence systems. Cassandra manages the distribution of data across multiple data centers and offers incremental scalability with no single point of failure, which makes it a logical choice for enterprises that need high degrees of uptime, reliability, and very fast performance.

Cassandra was originally developed at Facebook and draws on the designs of Google's Bigtable and Amazon's Dynamo. The result is an extremely scalable, fault-tolerant data infrastructure that solves small to big data problems, handles write-intensive user traffic, delivers sub-millisecond reads when used as a caching layer, and supports demanding workloads involving petabytes of data.

Cassandra Architecture

Cassandra is a peer-to-peer distributed data management system in which every node plays the same role in the cluster. There is no concept of a "master node" or anything similar, so no single point of failure exists for any key process or function.

The scale-out design allows nodes to be added with no disruption to application uptime. Once one or more nodes have been added to a cluster, Cassandra automatically repartitions data across the nodes and "seeds" the new ones from existing machines in the cluster. Data redundancy, which protects against hardware failure and other data-loss scenarios, is likewise built into and managed transparently by Cassandra.

An administrator, architect, or developer only has to specify a replication and data-partitioning strategy; from there, Cassandra takes care of the rest.
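For example, the replication strategy is declared when a keyspace is created. A minimal sketch follows; the keyspace and data center names are illustrative:

cqlsh> CREATE KEYSPACE myapp
   ... WITH replication = { 'class' : 'NetworkTopologyStrategy',
   ...                      'dc1' : 3, 'dc2' : 2 };

This asks Cassandra to keep three replicas of each row in data center dc1 and two in dc2; the partitioner then decides which nodes own which rows.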
All nodes in the cluster communicate with each other through the gossip protocol. If a node goes
down, the cluster detects the failure and automatically routes user requests away from the failed
machine. Once the failed node is operational again, it rejoins the cluster, and its data is brought
back up to date via the other nodes.
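The effect of failure detection can be observed with the nodetool utility that ships with Cassandra: nodetool status prints every node in the cluster with a two-letter state code whose first letter is U (up) or D (down), so a failed machine appears as DN until it rejoins the ring.

$ nodetool status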

Why Cassandra
  • MySQL drives too many random I/Os
  • File-based solutions require far too many locks
The new face of data
  • Scale out, not up
  • Online load balancing, cluster growth
  • Flexible schema
  • Key-oriented queries (see the example after this list)
  • CAP-aware
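
As an example of the key-oriented query style, consider the timeline table defined in the CQL section below (the UUID here is illustrative). Efficient Cassandra queries always restrict the partition key; a query that omitted userid would have to touch the whole cluster:

cqlsh> SELECT body, posted_by FROM timeline
   ... WHERE userid = 550e8400-e29b-41d4-a716-446655440000
   ... AND posted_month = 3;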

CQL Language

CQL provides a syntax very similar to SQL, making it easy for developers and administrators coming from the relational world to begin working with Cassandra.

DDL, DML, and SELECT functionality all can be found in CQL.

cqlsh> CREATE TABLE monkeySpecies (
    species text PRIMARY KEY,
    common_name text,
    population varint,
    average_size int
) WITH comment='Important biological records'
   AND read_repair_chance = 1.0;

CREATE TABLE timeline (
    userid uuid,
    posted_month int,
    posted_time uuid,
    body text,
    posted_by text,
    -- userid is the partition key; posted_month and posted_time are
    -- clustering columns that order rows within each partition
    PRIMARY KEY (userid, posted_month, posted_time)
) WITH compaction = { 'class' : 'LeveledCompactionStrategy' };

cqlsh> INSERT INTO users (KEY, password) VALUES ('jsmith', 'ch@ngem3a');

cqlsh> SELECT * FROM users
... WHERE gender='f' AND
... state='TX' AND
... birth_year=1968;
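
A query like this filters on non-primary-key columns, which Cassandra only permits when secondary indexes exist on those columns (or when ALLOW FILTERING is appended). Assuming the users table has these columns, the indexes could be created like so:

cqlsh> CREATE INDEX ON users (gender);
cqlsh> CREATE INDEX ON users (state);
cqlsh> CREATE INDEX ON users (birth_year);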

Batched Operations
Cassandra supports tunable consistency on a per-operation basis, meaning developers can choose how strong or loose they want data consistency to be for a particular request. If a developer wants to apply a certain consistency level to a number of different requests, they can wrap them in a BEGIN BATCH ... APPLY BATCH statement.

BEGIN BATCH USING CONSISTENCY QUORUM
INSERT INTO users (KEY, password) VALUES ('user1', 'mypass');
UPDATE users SET password = 'newpass' WHERE KEY = 'user1';
INSERT INTO users (KEY, password) VALUES ('user2', 'user2pass');
DELETE name FROM users WHERE KEY = 'user5';
APPLY BATCH;
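
Note that USING CONSISTENCY inside a batch is CQL 2 syntax; in CQL 3 (Cassandra 1.2 and later) the consistency level is set per request by the client driver, or interactively in cqlsh with the CONSISTENCY command:

cqlsh> CONSISTENCY QUORUM;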

Batched operations allow a developer to retry (if necessary) a group of changes in an idempotent
fashion.

Cassandra highlights


  • High availability
  • Incremental scalability
  • Eventually consistent
  • Tunable tradeoffs between consistency and latency
  • Minimal administration
  • No single point of failure (SPOF)
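
The consistency/latency tradeoff comes down to simple arithmetic: if each row has RF replicas, a write is acknowledged by W of them, and a read consults R of them, then W + R > RF guarantees that every read overlaps the most recent write. With the common RF = 3, QUORUM reads and writes give W = R = 2, and since 2 + 2 > 3, quorum-to-quorum operations behave consistently while still tolerating one replica being down; dropping to ONE (W = 1 or R = 1) trades that guarantee for lower latency.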

Applications suited to Cassandra


  • Geographically dispersed applications that need to serve numerous regions with the same fast response times
  • Web/online applications and other systems needing around-the-clock transactional input capabilities
  • Applications needing extreme degrees of uptime and no single point of failure
  • Applications that need easy data elasticity, so capacity can be added to service peak workloads for various periods of time and then shrunk back when user traffic subsides, all done in an online fashion
  • Write-intensive applications that must take in large volumes of data continuously, e.g. credit card systems, music download purchases, device/sensor data, web clickstream data, archiving systems, and event logging (see the table sketch after this list)
  • Management of large data volumes (terabytes to petabytes) that must be kept online for query access and business intelligence processing
  • Systems that need to store and directly deal with a combination of structured, unstructured, and semi-structured data, with a requirement for a flexible schema/data storage paradigm that allows for easy and online structure modifications
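
As a sketch of the write-intensive, sensor-style workload mentioned above (table and column names are illustrative; the default_time_to_live property requires Cassandra 2.0 or later), a time-series table is typically partitioned by source plus a time bucket so that partitions stay bounded, with a TTL to age data out:

cqlsh> CREATE TABLE sensor_readings (
    sensor_id uuid,
    day int,                 -- time bucket, e.g. days since epoch
    reading_time timestamp,
    value double,
    PRIMARY KEY ((sensor_id, day), reading_time)
) WITH default_time_to_live = 2592000;  -- expire rows after 30 days

Each write appends within one (sensor_id, day) partition, which plays directly to Cassandra's log-structured storage engine.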
