January 17, 2020

Data warehouse vs. Data mart vs. Operational Data Stores vs. Data lake


Data warehouse(DW)

A data warehouse is a central repository of subject-oriented, integrated, time-variant, and non-volatile collection of data from one or more disparate sources in support of management’s decision-making process. They store current and historical data in one single place that are used for creating analytical reports.
Subject-oriented implies that the data is organized around subjects such as customers, products, sales, etc.
The data warehouse is integrated in the sense that it integrates data from a variety of operational sources and a variety of formats such as relational database management systems, legacy database management systems, and flat files. It may require data cleansing for additional operations to ensure data quality before it is used in the DW for reporting.Time variant refers to the fact that the data warehouse essentially stores a time series of periodic snapshots. Operational data is always up-to-date and represents the most recent state of the data elements, whereas a data warehouse is not necessarily up to date but represents the state at some specific moment(s) in time.
Non-volatile implies that the data is primarily read-only and will thus not be frequently updated or deleted over time
.

Data mart(DM)



The data mart is a subset, condensed and more focused version of the data warehouse and is usually oriented to a specific business line or team and provide focused content.


Data marts improve end-user response time by offloading complex queries, and therefore workloads and allowing users to have access to the specific type of data they need to view most often by providing the data in a way that supports the collective view of a group of users.
An ODS is a database a designed to integrate data to from multiple sources for additional operations on the data, for reporting, controls and operational decision support . An ODS is still subject oriented, integrated and time variant, but volatile in nature.


An ODS is usually designed to contain real-time or near real- time data as it is received from the respective source systems so that analysis tools can query the ODS data as it is opposed to the data warehouses not containing up-to-date data and time-consuming transformation and loading operations.


The ODS provides access to the current, fine-grained and non-aggregated data, which can be queried in an integrated manner without burdening the transactional systems.

Data Lake

A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).


A key distinguishing property of a data lake compared to Data warehouse is that it stores raw data in its native format, which could be structured, unstructured, or semi-structured. This makes data lakes fit for more exotic and “bulk” data types that we generally do not find in data warehouses, such as social media feeds, clickstreams, server logs, and sensor data.


A data lake collects data emanating from operational sources “as is,” , either no or only very limited transformations (formatting, cleansing) are performed on the data before it enters the data lake and often without knowing upfront which analyses will be performed on it, or even whether the data will ever be involved in analysis at all.


Data Lake Vs Data warehouse:


The data schema definitions are only determined when the data is read (schema-on-read) instead of when the data is loaded (schema-on-write) as is the case for a data warehouse. Storage costs for data lakes are also relatively low because most of the implementations are open source solutions that can be easily installed on low-cost commodity hardware.


Since a data warehouse assumes a predefined structure, it is less agile compared to a data lake, which has no structure. Also, data warehouses have been around for quite some time already, which automatically implies that their security facilities are more mature.


Because the data within data lakes may be uncurated and can originate from sources outside of the company's operational systems, it isn't a good fit for the average business analytics user; rather, data lakes are the playground of data scientists and other data analytics experts. i.e. a data warehouse is targeted toward decision makers at the middle and top management level, whereas a data lake requires a data scientist, which is a more specialized profile in terms of data handling and analysis.



January 08, 2020

Amazon Athena


Overview of Athena
Amazon Athena is an interactive query service, which developers and data analysts use to analyze data stored in Amazon S3. Athena’s serverless architecture lowers operational costs and means users don’t need to scale, provision or manage any servers.

Amazon Athena users can use standard SQL when analysing data. Athena does not require a server, so there is no need to oversee infrastructure; users only pay for the queries they request. You don’t even need to load your data into Athena, just need to point to their data in Amazon S3, define the schema, and begin querying.

To get started, just log into the Athena Management Console, define your schema, and start querying. Amazon Athena uses Presto with full standard SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Apache Parquet and Avro. While Amazon Athena is ideal for quick, ad-hoc querying and integrates with Amazon QuickSight for easy visualization, it can also handle complex analysis, including large joins, window functions, and arrays.



Some Athena Facts
  • Athena supports only EXTERNAL tables, when you drop a table in Athena, only the table metadata is removed; the data remains in Amazon S3
  • Athena uses an approach known as schema-on-read
  • Athena does not modify your data in Amazon S3
  • Athena uses Apache Hive to define tables and create databases, which are essentially a logical namespace of tables
  • Athena can only query the latest version of data on a versioned Amazon S3 bucket, and cannot query previous versions of the data.
  • Athena does not support querying the data in the GLACIER storage class
  • Athena performs full table scans instead of using indexes
  • Athena supports ACID-compliant.
  • Athena is case-insensitive and turns table names and column names to lower case.
  • Athena table, view, database, and column names cannot contain special characters, other than underscore (_)


Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...