December 24, 2023

Medallion architecture: Data platform strategy and best practices for managing Bronze, Silver and Gold

 


The medallion architecture describes a series of data layers that denote the quality of data stored in the lakehouse. Databricks recommends taking a multi-layered approach to building a single source of truth for enterprise data products. Because the underlying storage layer (Delta Lake) provides ACID transactions, data retains atomicity, consistency, isolation, and durability as it passes through multiple layers of validation and transformation before being stored in a layout optimized for efficient analytics. The terms Bronze (raw), Silver (validated), and Gold (enriched) describe the quality of the data in each of these layers.

 


Bronze layer:

The Bronze layer is usually a reservoir that stores data in its natural, original state.

Bronze layer characteristics:

·         Maintains the raw state of the data source in the structure “as-is”.

·         Data is immutable (read-only).

·         Can be any combination of streaming and batch transactions.
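As a minimal sketch of these characteristics in plain Python (not Spark; the function and field names here are illustrative, not part of any platform's API), Bronze ingestion keeps each record exactly as received and adds only lineage metadata:

```python
import json
from datetime import datetime, timezone

def ingest_to_bronze(raw_records, source_name):
    """Append raw records to the Bronze layer unchanged ("as-is"),
    adding only ingestion metadata for lineage and auditing."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            "_source": source_name,       # lineage: where the record came from
            "_ingested_at": ingested_at,  # audit: when it landed
            "_raw": json.dumps(record),   # the payload itself, untouched
        }
        for record in raw_records
    ]

# Raw records land exactly as received -- even incomplete ones are kept.
orders = [{"id": 1, "amount": "49.90"}, {"id": 2, "amount": None}]
bronze = ingest_to_bronze(orders, "orders_api")
```

Note that the record with a missing amount is stored anyway: filtering and validation are deliberately deferred to Silver, so Bronze remains a faithful, replayable copy of the source.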

Silver layer:

The Silver layer provides a refined structure over data that has been ingested. It represents a validated, enriched version of our data that can be trusted for downstream workloads, both operational and analytical. Silver layer characteristics:

·         Uses data quality rules for validating and processing data.

·         Typically contains only functional data; technical or irrelevant data from Bronze is filtered out.

·         Historization is usually applied by merging incoming data; changes are tracked using slowly changing dimensions (SCD).

·         Data is stored in an efficient storage format; preferably Delta, alternatively Parquet.

·         Handles missing data and standardizes empty or malformed fields.

·         Data is often clustered around certain subject areas.

·         Data is often still source-system aligned and organized. 
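To make the Silver-layer rules concrete, here is a minimal stdlib-Python sketch (names and rules are illustrative): parse raw payloads, drop records that fail validation, standardize fields, and keep only the latest version per business key. In a lakehouse you would typically express the merge step as a Delta Lake MERGE to maintain SCD history; this sketch simply keeps the most recent version.

```python
import json

def refine_to_silver(bronze_rows):
    """Apply simple data-quality rules to Bronze rows: parse the raw payload,
    drop records that fail validation, standardize fields, and keep only the
    latest version per business key (a crude stand-in for a merge)."""
    silver = {}
    for row in bronze_rows:
        record = json.loads(row["_raw"])
        # Quality rule: a valid order must carry an id and a positive amount.
        try:
            amount = float(record.get("amount") or 0)
        except (TypeError, ValueError):
            continue
        if record.get("id") is None or amount <= 0:
            continue
        # Standardized shape; later versions of a key overwrite earlier ones.
        silver[record["id"]] = {"id": record["id"], "amount": round(amount, 2)}
    return list(silver.values())

bronze = [
    {"_raw": '{"id": 1, "amount": "49.90"}'},
    {"_raw": '{"id": 2, "amount": null}'},     # fails validation, filtered out
    {"_raw": '{"id": 1, "amount": "50.00"}'},  # newer version of id 1
]
silver = refine_to_silver(bronze)
```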

Gold layer:

In a Lakehouse architecture, the Gold layer houses data that is structured in “project-specific” databases, making it readily available for consumption. It uses a denormalized, read-optimized data model with fewer joins, such as a Kimball-style star schema, depending on the specific use case. Gold layer characteristics:

·         Gold tables represent data that has been transformed for consumption or use cases.

·         Data is stored in an efficient storage format, preferably Delta.

·         Gold can be a selection or aggregation of data that’s found in Silver.

·         Complex business rules are applied in Gold, so it involves many post-processing activities: calculations, enrichments, use-case-specific optimizations, etc.

·         Data is highly governed and well-documented.
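A Gold table can then be sketched as an aggregation of validated records into a denormalized, consumption-ready shape (again plain stdlib Python with illustrative names; on a real platform this would be a Spark job writing a Delta table):

```python
from collections import defaultdict

def build_gold_revenue(silver_rows):
    """Aggregate validated Silver-style records into a denormalized,
    consumption-ready Gold table: total revenue per customer."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    # Read-optimized output: one row per customer, no joins needed downstream.
    return [{"customer": c, "revenue": round(v, 2)}
            for c, v in sorted(totals.items())]

silver = [
    {"customer": "acme", "amount": 49.9},
    {"customer": "acme", "amount": 50.0},
    {"customer": "globex", "amount": 10.0},
]
gold = build_gold_revenue(silver)
```

The business rule (how revenue is computed, and at what grain) lives here rather than in Silver, which is what makes Gold use-case specific.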

 

 



September 21, 2023

Key causes of performance differences between SQL managed instance and SQL Server and recommended solutions.

Migrating to a Microsoft Azure SQL Database managed instance does not always provide immediate performance benefits. In some cases, you might get better results on the on-premises SQL Server database, because a SQL Database managed instance introduces some overhead for manageability and high availability. In other cases, you might get better results on a SQL Database managed instance, because the latest version of the database engine has improved query processing and optimization features compared to older versions of SQL Server.


This article will help you understand the underlying factors that can cause performance differences, along with recommended solutions to improve Azure SQL Database managed instance performance, such as changing the default MAXDOP, increasing data file sizes, and adjusting the backup schedule.




January 12, 2023

Modernizing Data Architecture using Data Fabric


In my 20+ years of experience building, leading, and scaling Data Engineering teams and platforms, I have worked on many database projects and on strong data foundations that require a large volume of current, clean, and accurate data from different business silos to provide business value and insights. But I found that a data warehouse (DWH) restricts analysis of historical data, and data lakes have often been nothing more than data swamps.

I found that the Data Fabric architecture is a promising solution to this problem. Data fabric is a design concept and architecture discussed by database veterans such as James Serra (and often contrasted with Data Mesh, as positioned by Zhamak Dehghani). It addresses the complexity of data management and minimizes disruption to data consumers, while ensuring that any data on any platform from any location can be successfully combined, accessed, shared, and governed efficiently and effectively. A data fabric architecture is enabled by AI/ML-driven augmentation and automation.
It enables organizations to make better use of their existing data architectures, without requiring an entire structural rebuild of every application or data store, to gain new insights and create intelligent prescriptive services and applications.

Fortune Global Bank wanted to use its proprietary data and build analytics capabilities based on business priorities for:
  • Customer acquisition and retention
  • Social media sentiment analysis of its customers and prospects
  • Credit Risk Analysis
  • Real-Time Fraud Detection and Risk Mitigation
  • Self-service analytics

DWH/Data Lake/Data Lakehouse possible solutions:

DWH 
  • Customer 360-degree view and predict the best products for the customer
  • Self-service analytics
DATA LAKE 
  • Social media sentiment analysis of customers and prospects
  • Gauge the possibility of a customer defaulting on a loan or card
DATA LAKEHOUSE
  • Match the right customers with the right merchants to provide customized offers
  • Enhance customer trust by tracking transactional activity across accounts and devices 24×7, alerting instantly on potential threats and detecting fraud at the first instance.


But these solutions have the following limitations:

  • Lack of customer-centricity
  • Data is frequently held in silos
  • Built for specific purposes
  • Manual data governance limits speed and agility
  • Risk of becoming data swamps without a data catalog
  • Lack of governance and data security (compliance)
  • Limited data democratization
  • Unable to address the data of a hybrid data landscape


What is Data fabric?
  • Data fabric refers to a machine-enabled data integration architecture that utilizes metadata assets to unify, integrate, and govern disparate data environments. By standardizing, connecting, and automating data management practices and processes, data fabrics improve data security and accessibility and provide end-to-end integration of data pipelines across on-premises, cloud, hybrid multicloud, and edge-device platforms.
  • Data fabric continuously identifies, connects, cleanses and enriches real-time data from different applications to discover relationships between data points. For example, a data fabric might monitor various data pipelines — the set of actions that ingest raw data from a source and move it to a destination — to suggest better alternatives before automating the most repeatable tasks. A data fabric might also “heal” failed data integration jobs, handle more complicated data management aspects like creating — and profiling — datasets and offer ways to govern and secure data by limiting who can access what data and infrastructure.
  • To uncover the relationships between data, a data fabric builds a graph that stores interlinked descriptions of data such as objects, events, situations, and concepts. Algorithms can use this graph for different business analytics purposes, like making predictions and surfacing previously hard-to-find datasets.
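The graph idea can be sketched in a few lines of plain Python (the class, node names, and predicates below are purely hypothetical illustrations, not any vendor's API): nodes are data assets or concepts, and edges describe how they relate.

```python
from collections import defaultdict

class MetadataGraph:
    """A tiny illustration of the interlinked metadata graph a data fabric
    maintains: subject -> (predicate, object) edges between data assets."""
    def __init__(self):
        self.edges = defaultdict(list)

    def relate(self, subject, predicate, obj):
        # Record one relationship, e.g. ("orders_table", "derived_from", "orders_api")
        self.edges[subject].append((predicate, obj))

    def related(self, subject):
        # Return all known relationships for an asset (empty if unknown)
        return self.edges.get(subject, [])

g = MetadataGraph()
g.relate("orders_table", "derived_from", "orders_api")
g.relate("orders_table", "contains_pii", "customer_email")
g.relate("churn_model", "trained_on", "orders_table")
```

Algorithms can walk such a graph to surface hard-to-find relationships, for example which downstream assets depend on a table that contains PII.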

Technology Components of Data Fabric Architecture

What differentiates a data fabric from a standard data integration ecosystem:
  • Augmented data catalog. Your data catalog will include and analyze all types of metadata (structural, descriptive, and administrative) in order to provide context to your information.
  • Knowledge graph. To help you and the AI/ML algorithms interpret the meaning of your data, you will build and manage a knowledge graph that formally illustrates the relationships between entities in your data (concepts, objects, events, etc.). And it should be enhanced with unified data semantics, which describes the meaning of data components themselves.
  • Metadata activation. You will switch from manual (passive) metadata to automatic (active) metadata. Active metadata management leverages machine learning to allow you to create and process metadata at massive scale.
  • Recommendation engine. Based on your active metadata, AI/ML algorithms will continuously analyze, learn, and make recommendations and predictions about your data integration and management ecosystem.
  • Data prep & ingestion. All common data preparation and delivery approaches will be supported, including the five key patterns of data integration: ETL, ELT, data streaming, application integration, and data virtualization.
  • DataOps. Bring your DevOps team together with your data engineers and data scientists to ensure that your fabric supports the needs of both IT and business users.

DATA FABRIC SOLUTION

Data fabric is a single environment consisting of a unified architecture, with services and technologies running on it, that helps a company manage its data.

Data Fabric adds the following to a modern data warehouse:
  • Augmented knowledge graph 
  • Intelligent integration
  • Unified data lifecycle 
  • Multimodal governance 
  • Data Democratization
  • Operationalize AI and mitigate risk/regulatory compliance 
  • Automated Data Pipeline & Data Discovery

BUSINESS BENEFITS
  • Improved Self-service data consumption capabilities
  • On demand real-time insights 
  • Automate governance
  • Data protection and security
  • Automate data engineering tasks and augment data integration

DATA FABRIC IMPLEMENTATION

There is not currently a single, stand-alone tool or platform you can use to fully establish a data fabric architecture. You’ll have to employ a mix of solutions, such as using a top data management tool for most of your needs and then finishing out your architecture with other tools and/or custom-coded solutions. Still, according to research firm Gartner, there are four pillars to consider when implementing a data fabric:

  • Collect and analyze all types of metadata
  • Convert passive metadata to active metadata
  • Create and curate knowledge graphs that enrich data with semantics
  • Ensure a robust data integration foundation
In addition to these pillars, you’ll need to have in place the typical elements of a robust data integration solution: mechanisms for collecting, managing, storing, and accessing your data, plus a proper data governance framework that includes metadata management, data lineage, and data integrity best practices.





Creating DataFrames from CSV in Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV Example").getOrCreate()
sc = spark.sparkContext
...