January 12, 2023

Modernizing Data Architecture using Data Fabric


In my 20+ years of experience building, leading, and scaling Data Engineering teams and platforms, I have worked on many database projects and on strong data foundations that require a large volume of current, clean, and accurate data from different business silos to provide business value and insights. But I found that a DWH restricts analysis to historical data, and data lakes have often been nothing more than data swamps.

I found Data Fabric architecture to be a promising solution to this problem. Data fabric is a design concept and architecture, described by database veterans such as James Serra (and often contrasted with the Data Mesh positioned by Zhamak Dehghani), that addresses the complexity of data management and minimizes disruption to data consumers, while ensuring that any data on any platform from any location can be successfully combined, accessed, shared, and governed efficiently and effectively. A data fabric architecture is enabled by AI/ML-driven augmentation and automation.
It enables organizations to make better use of their existing data architectures, without requiring an entire structural rebuild of every application or data store, so they can gain new insights and create intelligent, prescriptive services and applications.

Fortune Global Bank wanted to use its proprietary data and build analytics capabilities, based on business priorities, for:
  • Customer acquisition and retention
  • Social media sentiment analysis of customers and prospects
  • Credit risk analysis
  • Real-time fraud detection and risk mitigation
  • Self-service analytics

DWH / Data Lake / Data Lakehouse: Possible Solutions

DWH 
  • Customer 360-degree view and prediction of the best products for the customer
  • Self-service analytics
DATA LAKE 
  • Social media sentiment analysis of customers and prospects
  • Gauge the possibility of a customer defaulting on a loan or card
DATA LAKEHOUSE
  • Match the right customers with the right merchants to provide customized offers
  • Enhance customer trust by tracking transactional activity across accounts and devices 24×7, alerting instantly in case of a potential threat and detecting fraud at the very first instance


But these solutions have the following limitations:

  • Lack of customer-centricity
  • Data is frequently held in silos
  • Built for specific purposes
  • Manual data governance limits speed and agility
  • Run the risk of becoming data swamps without a data catalog
  • Lack of governance and data security (i.e., compliance)
  • Limited data democratization
  • Unable to address the data of a hybrid data landscape


What is Data fabric?
  • Data fabric refers to a machine-enabled data integration architecture that utilizes metadata assets to unify, integrate, and govern disparate data environments. By standardizing, connecting, and automating data management practices and processes, data fabrics improve data security and accessibility and provide end-to-end integration of data pipelines across on-premises, cloud, hybrid multicloud, and edge device platforms.
  • Data fabric continuously identifies, connects, cleanses and enriches real-time data from different applications to discover relationships between data points. For example, a data fabric might monitor various data pipelines — the set of actions that ingest raw data from a source and move it to a destination — to suggest better alternatives before automating the most repeatable tasks. A data fabric might also “heal” failed data integration jobs, handle more complicated data management aspects like creating — and profiling — datasets and offer ways to govern and secure data by limiting who can access what data and infrastructure.
  • To uncover the relationships between data, a data fabric builds a graph that stores interlinked descriptions of data such as objects, events, situations, and concepts. Algorithms can use this graph for different business analytics purposes, like making predictions and surfacing previously hard-to-find datasets (see the sketch below).
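
To make the graph idea concrete, here is a minimal sketch, in plain Python, of how a fabric might store interlinked descriptions of data assets and surface related ones. The asset names and relationship types are hypothetical, for illustration only.

    from collections import defaultdict

    # Toy knowledge graph: nodes are data assets, edges are typed relationships.
    # All names and relationship types are illustrative, not a product API.
    edges = [
        ("customers", "feeds", "customer_360_view"),
        ("transactions", "feeds", "fraud_model"),
        ("customers", "joins_on:customer_id", "transactions"),
        ("social_posts", "enriches", "customer_360_view"),
    ]

    graph = defaultdict(list)
    for src, rel, dst in edges:
        graph[src].append((rel, dst))
        graph[dst].append(("inverse:" + rel, src))  # navigable in both directions

    def related_assets(asset):
        # The kind of lookup a fabric runs to surface hard-to-find datasets.
        return graph.get(asset, [])

    for rel, dst in related_assets("customers"):
        print("customers -[" + rel + "]-> " + dst)

Running this lists everything linked to the customers asset, which is exactly the question a recommendation layer asks before suggesting datasets to an analyst.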

Technology Components of Data Fabric Architecture

Several components distinguish a data fabric from a standard data integration ecosystem:
  • Augmented data catalog. Your data catalog will include and analyze all types of metadata (structural, descriptive, and administrative) in order to provide context to your information.
  • Knowledge graph. To help you and the AI/ML algorithms interpret the meaning of your data, you will build and manage a knowledge graph that formally illustrates the relationships between entities in your data (concepts, objects, events, etc.). And it should be enhanced with unified data semantics, which describes the meaning of data components themselves.
  • Metadata activation. You will switch from manual (passive) metadata to automatic (active) metadata. Active metadata management leverages machine learning to allow you to create and process metadata at massive scale.
  • Recommendation engine. Based on your active metadata, AI/ML algorithms will continuously analyze, learn, and make recommendations and predictions about your data integration and management ecosystem (see the sketch after this list).
  • Data prep & ingestion. All common data preparation and delivery approaches will be supported, including the five key patterns of data integration: ETL, ELT, data streaming, application integration, and data virtualization.
  • DataOps. Bring your DevOps team together with your data engineers and data scientists to ensure that your fabric supports the needs of both IT and business users.
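
As a rough illustration of active metadata feeding a recommendation engine, the sketch below scans hypothetical pipeline-run records and flags jobs that are failing or ripe for automation. The record fields and thresholds are assumptions, not any vendor's model.

    from dataclasses import dataclass

    # Hypothetical active-metadata records emitted by pipeline runs.
    @dataclass
    class RunMetadata:
        pipeline: str
        succeeded: bool
        rows_out: int

    runs = [
        RunMetadata("ingest_transactions", True, 1200000),
        RunMetadata("ingest_transactions", True, 1180000),
        RunMetadata("sentiment_scoring", False, 0),
        RunMetadata("sentiment_scoring", False, 0),
    ]

    def recommend(history):
        # Turn passive run logs into suggestions -- the "active" step.
        by_pipeline = {}
        for r in history:
            by_pipeline.setdefault(r.pipeline, []).append(r)
        tips = []
        for name, rs in by_pipeline.items():
            fail_rate = sum(not r.succeeded for r in rs) / len(rs)
            if fail_rate >= 0.5:
                tips.append(name + ": repeated failures, consider auto-healing or rerouting")
            elif all(r.succeeded for r in rs):
                tips.append(name + ": stable and repeatable, candidate for automation")
        return tips

    for tip in recommend(runs):
        print(tip)

In a real fabric these signals would come from orchestration logs and query history rather than hand-built records, but the loop is the same: observe metadata, then recommend or automate.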

DATA FABRIC SOLUTION

Data fabric is a single environment consisting of a unified architecture, with services and technologies running on it, that helps a company manage its data.

Data Fabric adds the following to a modern data warehouse:
  • Augmented knowledge graph 
  • Intelligent integration (see the sketch after this list)
  • Unified data lifecycle 
  • Multimodal governance 
  • Data Democratization
  • Operationalize AI while mitigating risk and meeting regulatory compliance
  • Automated Data Pipeline & Data Discovery
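
To show what intelligent integration across stores can look like to a consumer, here is a minimal PySpark sketch that joins a warehouse extract with raw data-lake files through a single engine. The paths, table layout, and column names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("FabricAccessSketch").getOrCreate()

    # Hypothetical warehouse extract stored as Parquet (path is illustrative).
    customers = spark.read.parquet("/warehouse/customers")

    # Hypothetical raw events landed in the data lake as CSV.
    events = spark.read.csv("/lake/raw/events.csv", header=True, inferSchema=True)

    # One query surface over both stores; the consumer never sees the silos.
    joined = customers.join(events, on="customer_id", how="inner")
    joined.groupBy("segment").count().show()

A full fabric would resolve those locations through the catalog and enforce governance policies along the way, rather than hard-coding paths as this sketch does.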

BUSINESS BENEFITS
  • Improved self-service data consumption capabilities
  • On-demand, real-time insights
  • Automated governance
  • Data protection and security
  • Automated data engineering tasks and augmented data integration

DATA FABRIC IMPLEMENTATION

There is not currently a single, stand-alone tool or platform you can use to fully establish a data fabric architecture. You’ll have to employ a mix of solutions, such as using a top data management tool for most of your needs and then finishing out your architecture with other tools and/or custom-coded solutions. Still, according to research firm Gartner, there are four pillars to consider when implementing a data fabric:

  • Collect and analyze all types of metadata
  • Convert passive metadata to active metadata
  • Create and curate knowledge graphs that enrich data with semantics
  • Ensure a robust data integration foundation
In addition to these pillars, you’ll need to have in place the typical elements of a robust data integration solution. This includes the mechanisms for collecting, managing, storing, and accessing your data, plus a proper data governance framework that includes metadata management, data lineage, and data integrity best practices.
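
As one concrete slice of that foundation, the sketch below records simple dataset lineage as steps run, so governance tooling can answer where a dataset came from. The asset names and the shape of the lineage record are assumptions for illustration.

    # Minimal lineage ledger: each transformation records its inputs and output.
    # Asset and job names below are hypothetical.
    lineage = []

    def register_step(output, inputs, job):
        lineage.append({"output": output, "inputs": inputs, "job": job})

    register_step("clean_customers", ["raw_customers"], job="dedupe_v2")
    register_step("customer_360_view", ["clean_customers", "transactions"], job="join_v1")

    def upstream(asset):
        # Walk lineage backwards to find every source feeding `asset`.
        sources = set()
        for step in lineage:
            if step["output"] == asset:
                for src in step["inputs"]:
                    sources.add(src)
                    sources |= upstream(src)
        return sources

    print(upstream("customer_360_view"))
    # Contains raw_customers, clean_customers, and transactions.

Production catalogs capture this automatically from query plans and job manifests; the point here is only that lineage is data you can store and traverse.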




