January 12, 2023

Modernizing Data Architecture using Data Fabric


In my 20+ years of experience building, leading, and scaling Data Engineering teams and platforms, I have worked on many database projects and on strong data foundations that require a large volume of current, clean, and accurate data from different business silos to provide business value and insights. But I found that a DWH restricts analysis to historical data, and data lakes have often been nothing more than data swamps.

I found Data Fabric architecture to be a promising solution to this problem. Data fabric is a design concept and architecture, described by database veterans such as James Serra (and often contrasted with the Data Mesh positioned by Zhamak Dehghani), that addresses the complexity of data management and minimizes disruption to data consumers, while ensuring that any data on any platform from any location can be successfully combined, accessed, shared, and governed efficiently and effectively. A data fabric architecture is enabled by AI/ML-driven augmentation and automation.
It enables organizations to make better use of their existing data architectures, without requiring an entire structural rebuild of every application or data store, so they can gain new insights and create intelligent, prescriptive services and applications.

Fortune Global Bank wanted to use its proprietary data and build analytics capabilities, based on business priorities, for:
  • Customer acquisition and retention
  • Social media sentiment analysis of customers and prospects
  • Credit risk analysis
  • Real-time fraud detection and risk mitigation
  • Self-service analytics

DWH / Data Lake / Data Lakehouse: Possible Solutions

DWH 
  • Customer 360-degree view and prediction of the best products for the customer
  • Self-service analytics
DATA LAKE 
  • Social media sentiment analysis of customers and prospects
  • Gauge the possibility of a customer defaulting on a loan or card
DATA LAKEHOUSE
  • Match the right customers with the right merchants to provide customized offers
  • Enhance customer trust by tracking transactional activity across accounts and devices 24×7, alerting instantly in case of a potential threat and detecting fraud at the very first instance


But these solutions have the following limitations:

  • Lack of customer-centricity
  • Data is frequently held in silos
  • Built for specific purposes
  • Manual data governance limits speed and agility
  • Run the risk of becoming data swamps without a data catalog
  • Lack of governance and data security (i.e., compliance)
  • Limited data democratization
  • Unable to address the data of a hybrid data landscape


What is Data fabric?
  • Data fabric refers to a machine-enabled data integration architecture that utilizes metadata assets to unify, integrate, and govern disparate data environments. By standardizing, connecting, and automating data management practices and processes, data fabrics improve data security and accessibility and provide end-to-end integration of data pipelines across on-premises, cloud, hybrid multicloud, and edge device platforms.
  • Data fabric continuously identifies, connects, cleanses and enriches real-time data from different applications to discover relationships between data points. For example, a data fabric might monitor various data pipelines — the set of actions that ingest raw data from a source and move it to a destination — to suggest better alternatives before automating the most repeatable tasks. A data fabric might also “heal” failed data integration jobs, handle more complicated data management aspects like creating — and profiling — datasets and offer ways to govern and secure data by limiting who can access what data and infrastructure.
  • To uncover the relationships between data, a data fabric builds a graph that stores interlinked descriptions of data such as objects, events, situations, and concepts. Algorithms can use this graph for different business analytics purposes, like making predictions and surfacing previously hard-to-find datasets (see the sketch below).
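
To make the graph idea concrete, here is a minimal sketch, in plain Python, of how a fabric might store interlinked descriptions of data assets and surface related ones. The asset names and relationship types are hypothetical, for illustration only.

    from collections import defaultdict

    # Toy knowledge graph: nodes are data assets, edges are typed relationships.
    # All names and relationship types are illustrative, not a product API.
    edges = [
        ("customers", "feeds", "customer_360_view"),
        ("transactions", "feeds", "fraud_model"),
        ("customers", "joins_on:customer_id", "transactions"),
        ("social_posts", "enriches", "customer_360_view"),
    ]

    graph = defaultdict(list)
    for src, rel, dst in edges:
        graph[src].append((rel, dst))
        graph[dst].append(("inverse:" + rel, src))  # navigable in both directions

    def related_assets(asset):
        # The kind of lookup a fabric runs to surface hard-to-find datasets.
        return graph.get(asset, [])

    for rel, dst in related_assets("customers"):
        print("customers -[" + rel + "]-> " + dst)

Running this lists everything linked to the customers asset, which is exactly the question a recommendation layer asks before suggesting datasets to an analyst.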

Technology Components of Data Fabric Architecture

Several components distinguish a data fabric from a standard data integration ecosystem:
  • Augmented data catalog. Your data catalog will include and analyze all types of metadata (structural, descriptive, and administrative) in order to provide context to your information.
  • Knowledge graph. To help you and the AI/ML algorithms interpret the meaning of your data, you will build and manage a knowledge graph that formally illustrates the relationships between entities in your data (concepts, objects, events, etc.). And it should be enhanced with unified data semantics, which describes the meaning of data components themselves.
  • Metadata activation. You will switch from manual (passive) metadata to automatic (active) metadata. Active metadata management leverages machine learning to allow you to create and process metadata at massive scale.
  • Recommendation engine. Based on your active metadata, AI/ML algorithms will continuously analyze, learn, and make recommendations and predictions about your data integration and management ecosystem (see the sketch after this list).
  • Data prep & ingestion. All common data preparation and delivery approaches will be supported, including the five key patterns of data integration: ETL, ELT, data streaming, application integration, and data virtualization.
  • DataOps. Bring your DevOps team together with your data engineers and data scientists to ensure that your fabric supports the needs of both IT and business users.
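
As a rough illustration of active metadata feeding a recommendation engine, the sketch below scans hypothetical pipeline-run records and flags jobs that are failing or ripe for automation. The record fields and thresholds are assumptions, not any vendor's model.

    from dataclasses import dataclass

    # Hypothetical active-metadata records emitted by pipeline runs.
    @dataclass
    class RunMetadata:
        pipeline: str
        succeeded: bool
        rows_out: int

    runs = [
        RunMetadata("ingest_transactions", True, 1200000),
        RunMetadata("ingest_transactions", True, 1180000),
        RunMetadata("sentiment_scoring", False, 0),
        RunMetadata("sentiment_scoring", False, 0),
    ]

    def recommend(history):
        # Turn passive run logs into suggestions -- the "active" step.
        by_pipeline = {}
        for r in history:
            by_pipeline.setdefault(r.pipeline, []).append(r)
        tips = []
        for name, rs in by_pipeline.items():
            fail_rate = sum(not r.succeeded for r in rs) / len(rs)
            if fail_rate >= 0.5:
                tips.append(name + ": repeated failures, consider auto-healing or rerouting")
            elif all(r.succeeded for r in rs):
                tips.append(name + ": stable and repeatable, candidate for automation")
        return tips

    for tip in recommend(runs):
        print(tip)

In a real fabric these signals would come from orchestration logs and query history rather than hand-built records, but the loop is the same: observe metadata, then recommend or automate.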

DATA FABRIC SOLUTION

Data fabric is a single environment consisting of a unified architecture, with services and technologies running on it, that helps a company manage its data.

Data Fabric adds the following to a modern data warehouse:
  • Augmented knowledge graph 
  • Intelligent integration (see the sketch after this list)
  • Unified data lifecycle 
  • Multimodal governance 
  • Data Democratization
  • Operationalize AI while mitigating risk and meeting regulatory compliance
  • Automated Data Pipeline & Data Discovery
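
To show what intelligent integration across stores can look like to a consumer, here is a minimal PySpark sketch that joins a warehouse extract with raw data-lake files through a single engine. The paths, table layout, and column names are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("FabricAccessSketch").getOrCreate()

    # Hypothetical warehouse extract stored as Parquet (path is illustrative).
    customers = spark.read.parquet("/warehouse/customers")

    # Hypothetical raw events landed in the data lake as CSV.
    events = spark.read.csv("/lake/raw/events.csv", header=True, inferSchema=True)

    # One query surface over both stores; the consumer never sees the silos.
    joined = customers.join(events, on="customer_id", how="inner")
    joined.groupBy("segment").count().show()

A full fabric would resolve those locations through the catalog and enforce governance policies along the way, rather than hard-coding paths as this sketch does.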

BUSINESS BENEFITS
  • Improved self-service data consumption capabilities
  • On-demand, real-time insights
  • Automated governance
  • Data protection and security
  • Automated data engineering tasks and augmented data integration

DATA FABRIC IMPLEMENTATION

There is not currently a single, stand-alone tool or platform you can use to fully establish a data fabric architecture. You’ll have to employ a mix of solutions, such as using a top data management tool for most of your needs and then finishing out your architecture with other tools and/or custom-coded solutions. Still, according to research firm Gartner, there are four pillars to consider when implementing a data fabric:

  • Collect and analyze all types of metadata
  • Convert passive metadata to active metadata
  • Create and curate knowledge graphs that enrich data with semantics
  • Ensure a robust data integration foundation
In addition to these pillars, you’ll need to have in place the typical elements of a robust data integration solution. This includes the mechanisms for collecting, managing, storing, and accessing your data, plus a proper data governance framework that includes metadata management, data lineage, and data integrity best practices.
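
As one concrete slice of that foundation, the sketch below records simple dataset lineage as steps run, so governance tooling can answer where a dataset came from. The asset names and the shape of the lineage record are assumptions for illustration.

    # Minimal lineage ledger: each transformation records its inputs and output.
    # Asset and job names below are hypothetical.
    lineage = []

    def register_step(output, inputs, job):
        lineage.append({"output": output, "inputs": inputs, "job": job})

    register_step("clean_customers", ["raw_customers"], job="dedupe_v2")
    register_step("customer_360_view", ["clean_customers", "transactions"], job="join_v1")

    def upstream(asset):
        # Walk lineage backwards to find every source feeding `asset`.
        sources = set()
        for step in lineage:
            if step["output"] == asset:
                for src in step["inputs"]:
                    sources.add(src)
                    sources |= upstream(src)
        return sources

    print(upstream("customer_360_view"))
    # Contains raw_customers, clean_customers, and transactions.

Production catalogs capture this automatically from query plans and job manifests; the point here is only that lineage is data you can store and traverse.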




