January 17, 2020

Data warehouse vs. Data mart vs. Operational Data Stores vs. Data lake


Data warehouse(DW)

A data warehouse is a central repository of subject-oriented, integrated, time-variant, and non-volatile collection of data from one or more disparate sources in support of management’s decision-making process. They store current and historical data in one single place that are used for creating analytical reports.
Subject-oriented implies that the data is organized around subjects such as customers, products, sales, etc.
The data warehouse is integrated in the sense that it integrates data from a variety of operational sources and a variety of formats such as relational database management systems, legacy database management systems, and flat files. It may require data cleansing for additional operations to ensure data quality before it is used in the DW for reporting.Time variant refers to the fact that the data warehouse essentially stores a time series of periodic snapshots. Operational data is always up-to-date and represents the most recent state of the data elements, whereas a data warehouse is not necessarily up to date but represents the state at some specific moment(s) in time.
Non-volatile implies that the data is primarily read-only and will thus not be frequently updated or deleted over time
.

Data mart(DM)



The data mart is a subset, condensed and more focused version of the data warehouse and is usually oriented to a specific business line or team and provide focused content.


Data marts improve end-user response time by offloading complex queries, and therefore workloads and allowing users to have access to the specific type of data they need to view most often by providing the data in a way that supports the collective view of a group of users.
An ODS is a database a designed to integrate data to from multiple sources for additional operations on the data, for reporting, controls and operational decision support . An ODS is still subject oriented, integrated and time variant, but volatile in nature.


An ODS is usually designed to contain real-time or near real- time data as it is received from the respective source systems so that analysis tools can query the ODS data as it is opposed to the data warehouses not containing up-to-date data and time-consuming transformation and loading operations.


The ODS provides access to the current, fine-grained and non-aggregated data, which can be queried in an integrated manner without burdening the transactional systems.

Data Lake

A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).


A key distinguishing property of a data lake compared to Data warehouse is that it stores raw data in its native format, which could be structured, unstructured, or semi-structured. This makes data lakes fit for more exotic and “bulk” data types that we generally do not find in data warehouses, such as social media feeds, clickstreams, server logs, and sensor data.


A data lake collects data emanating from operational sources “as is,” , either no or only very limited transformations (formatting, cleansing) are performed on the data before it enters the data lake and often without knowing upfront which analyses will be performed on it, or even whether the data will ever be involved in analysis at all.


Data Lake Vs Data warehouse:


The data schema definitions are only determined when the data is read (schema-on-read) instead of when the data is loaded (schema-on-write) as is the case for a data warehouse. Storage costs for data lakes are also relatively low because most of the implementations are open source solutions that can be easily installed on low-cost commodity hardware.


Since a data warehouse assumes a predefined structure, it is less agile compared to a data lake, which has no structure. Also, data warehouses have been around for quite some time already, which automatically implies that their security facilities are more mature.


Because the data within data lakes may be uncurated and can originate from sources outside of the company's operational systems, it isn't a good fit for the average business analytics user; rather, data lakes are the playground of data scientists and other data analytics experts. i.e. a data warehouse is targeted toward decision makers at the middle and top management level, whereas a data lake requires a data scientist, which is a more specialized profile in terms of data handling and analysis.



No comments:

Creating DataFrames from CSV in Apache Spark

 from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CSV Example").getOrCreate() sc = spark.sparkContext Sp...