February 12, 2012

Hadoop

What is Hadoop?
Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the Hadoop Distributed File System are designed so that node failures are automatically handled by the framework.
Apache Hadoop is an ideal platform for consolidating large-scale data from a variety of new and legacy sources. It complements existing data management solutions with new analyses and processing tools. It delivers immediate value to companies in a variety of vertical markets.
Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce.
Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data — and can run large-scale, high-performance processing jobs — in spite of system changes or failures.
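To make the MapReduce idea concrete, below is a minimal word-count sketch written for Hadoop Streaming, the facility that lets the map and reduce steps be plain scripts reading stdin and writing stdout. The file names mapper.py and reducer.py are illustrative; the reducer relies on the framework's shuffle phase having already sorted its input by key.

#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word seen on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))

#!/usr/bin/env python
# reducer.py -- sum the counts per word; the shuffle delivers mapper
# output sorted by key, so identical words arrive on adjacent lines
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))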

Where did Hadoop come from?
The underlying technology was invented by Google back in their earlier days so they could usefully index all the rich textual and structural information they were collecting, and then present meaningful and actionable results to users. There was nothing on the market that would let them do that, so they built their own platform. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from that. Yahoo! has played a key role in developing Hadoop for enterprise applications.

What problems can Hadoop solve?
The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn't fit nicely into tables. It's for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting. That's exactly what Google was doing when it was indexing the web and examining user behavior to improve performance algorithms.
Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they're more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built.

How is Hadoop architected?
Hadoop is designed to run on a large number of machines that don't share any memory or disks. That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one. When you want to load all of your organization's data into Hadoop, what the software does is bust that data into pieces that it then spreads across your different servers. There's no one place where you go to talk to all of your data; Hadoop keeps track of where the data resides. And because multiple copies are stored, data on a server that goes offline or dies can be automatically replicated from a known good copy.
In a centralized database system, you've got one big disk connected to four or eight or 16 big processors. But that is as much horsepower as you can bring to bear. In a Hadoop cluster, every one of those servers has two or four or eight CPUs. You can run your indexing job by sending your code to each of the dozens of servers in your cluster, and each server operates on its own little piece of the data. Results are then delivered back to you in a unified whole. That's MapReduce: you map the operation out to all of those servers and then you reduce the results back into a single result set.
Architecturally, the reason you're able to deal with lots of data is because Hadoop spreads it out. And the reason you're able to ask complicated computational questions is because you've got all of these processors, working in parallel, harnessed together.
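The same map-then-reduce flow can be sketched on a single machine: split the input into chunks, hand each chunk to a separate worker process standing in for a server in the cluster, and merge the partial results at the end. This is only a toy illustration of the idea, not how Hadoop itself is implemented.

# Toy illustration of the map/reduce flow using local worker processes.
from multiprocessing import Pool
from collections import Counter

def count_words(chunk):
    # The "map" step: each worker counts words in its own piece of the data.
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    # Split the data into pieces, one per worker (Hadoop does this per HDFS block/split).
    chunks = [lines[i::3] for i in range(3)]
    pool = Pool(processes=3)
    partial_counts = pool.map(count_words, chunks)   # run the map step in parallel
    pool.close()
    pool.join()
    # The "reduce" step: merge the partial results into a single result set.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    print(total.most_common())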

Hadoop Project:

The project includes these subprojects:

Hadoop Common: The common utilities that support the other Hadoop subprojects.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:

Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™: A high-performance coordination service for distributed applications.

February 02, 2012

Oracle vs SQL Server: High Availability, Licensing, Performance and Security

High Availability
SQL Server 2008:
Database mirroring:
Use a rolling upgrade process to upgrade database instances in a database-mirroring session.
Take advantage of write-ahead functionality on the incoming log stream on the mirror server.
Use page read-ahead capability during the undo phase to further improve performance.
Provide reporting capabilities with a database snapshot as a source for reports on the mirror server.

Failover clustering:
Enable failover support by sharing access among nodes and restarting SQL Server on a working node.
Increase scalability with support of up to 16 nodes in a single failover cluster.
Support a rolling upgrade process for servers participating in a failover-clustering configuration.

Peer-to-peer replication:
Replicate changes in near real time, while all databases also handle their primary responsibilities.
Boost scalability, availability, and processing capacity by configuring applications to use peers and to fail over to another peer.
Protect against accidental conflicts with built-in conflict detection.
Increase availability by dynamically adding a new node to an existing topology.

Log shipping:
Provide database redundancy by automatically backing up transaction logs and restoring them on one or more standby servers.
Increase availability by providing multiple failover sites.
Reduce the load on the primary server by using a secondary server for read-only query processing.

Oracle 11g:

Oracle provides the following features for high availability:

Real Application Clusters
Clusterware
Data Guard
GoldenGate
Streams
Secure Backup
Recovery Manager (RMAN)
Flashback Technologies
VM
Cloud Computing
Cloud Storage
Cross-Platform Transportable Tablespace
Edition-Based Redefinition
Online Reorganization

License cost

Oracle 11g license cost

- Per Processor = $17,500
- Support (22%) = $3,850
- Total (Per Processor) = $21,350
- Total (4 Processors) = $85,400


SQL Server license cost

- Per Processor = $5,999
- Total (4 Processors) = $23,996
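The totals above are simple per-processor arithmetic; a quick sketch follows (list prices as quoted above; note that the SQL Server figure excludes a support line item).

# License-cost arithmetic for the figures quoted above (4 processors).
oracle_per_proc = 17500.0
oracle_support = oracle_per_proc * 0.22                    # 22% support = $3,850
oracle_total_per_proc = oracle_per_proc + oracle_support   # $21,350
oracle_total_4cpu = 4 * oracle_total_per_proc              # $85,400

sqlserver_per_proc = 5999.0
sqlserver_total_4cpu = 4 * sqlserver_per_proc              # $23,996 (no support included)

print("Oracle 11g, 4 processors:  ${:,.0f}".format(oracle_total_4cpu))
print("SQL Server, 4 processors:  ${:,.0f}".format(sqlserver_total_4cpu))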

Security:

SQL Server features role-based security for server, database and application profiles; integrated tools for security auditing, tracking 18 different security events and additional sub-events; plus support for sophisticated file and network encryption, including SSL, Kerberos and delegation.

Oracle provides powerful security features such as database activity monitoring and blocking, privileged user and multi-factor access control, data classification, transparent data encryption, consolidated auditing and reporting, secure configuration management, and data masking. With these, customers can deploy reliable data security solutions that do not require any changes to existing applications, saving time and money.

ORACLE DATABASE SECURITY PRODUCTS:
Oracle Advanced Security
Oracle Audit Vault
Oracle Label Security
Oracle Configuration Management
Oracle Secure Backup
Oracle Database Firewall
Oracle Database Vault
Oracle Data Masking
Oracle Total Recall

Performance:

In SQL Server, the DBA has no "real" control over sort and cache memory allocation. Memory allocation is decided only globally, in the server properties memory settings, and that single setting applies to ALL memory rather than to caching, sorting, and so on individually.
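For comparison, the instance-wide knob referred to above is SQL Server's 'max server memory' option. A hedged sketch of changing it from Python follows; the pyodbc connection string, server name and the 8192 MB figure are illustrative assumptions, and sysadmin rights are required.

# Illustrative only: SQL Server exposes a single, instance-wide memory cap;
# there is no separate per-cache or per-sort-area setting to tune here.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=myserver;Trusted_Connection=yes",  # hypothetical connection
    autocommit=True,  # sp_configure/RECONFIGURE should not run inside a transaction
)
cur = conn.cursor()
cur.execute("EXEC sp_configure 'show advanced options', 1; RECONFIGURE;")
cur.execute("EXEC sp_configure 'max server memory (MB)', 8192; RECONFIGURE;")
conn.close()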


The following Oracle features do not exist in SQL Server:

Bitmap indexes
Reverse key indexes
Function-based indexes
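For illustration, here is roughly what those Oracle-only index types look like when created through the cx_Oracle driver; the connection details, table and column names are made-up assumptions, not part of any real schema.

# Illustrative sketch: Oracle index types with no direct SQL Server equivalent.
import cx_Oracle

conn = cx_Oracle.connect("scott", "tiger", "dbhost/orcl")  # hypothetical credentials/DSN
cur = conn.cursor()

# Bitmap index: efficient for low-cardinality columns such as a status flag.
cur.execute("CREATE BITMAP INDEX orders_status_bix ON orders (status)")

# Reverse key index: spreads sequential key values to reduce insert hot spots.
cur.execute("CREATE INDEX orders_id_rix ON orders (order_id) REVERSE")

# Function-based index: indexes the result of an expression.
cur.execute("CREATE INDEX emp_upper_name_fix ON employees (UPPER(last_name))")

conn.close()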
