December 24, 2023

Medallion architecture: Data platform strategy and best practices for managing Bronze, Silver and Gold

 

 

The medallion architecture describes a series of data layers that denote the quality of data stored in the lakehouse. Databricks recommends taking a multi-layered approach to building a single source of truth for enterprise data products. When the layers are stored in a transactional table format such as Delta Lake, this architecture guarantees atomicity, consistency, isolation, and durability as data passes through multiple layers of validations and transformations before being stored in a layout optimized for efficient analytics. The terms Bronze (raw), Silver (validated), and Gold (enriched) describe the quality of the data in each of these layers.

 


Bronze layer:

The Bronze layer is usually a reservoir that stores data in its natural, original state.

Bronze layer characteristics:

·         Maintains the raw state of the data source in the structure “as-is”.

·         Data is immutable (read-only).

·         Can be any combination of streaming and batch transactions.
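The Bronze characteristics above can be sketched in a few lines. This is a minimal illustration in plain Python rather than Spark, so that it stays self-contained. The names `BronzeRecord` and `ingest_raw`, and the metadata fields `source` and `ingested_at`, are assumptions made for this example and not part of any Databricks API. The sketch keeps each source line exactly as received and only ever appends immutable records, whether they arrive from a batch file or a stream.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a record cannot be modified after creation
class BronzeRecord:
    raw: str          # the source line, kept exactly as received
    source: str       # hypothetical metadata: where the line came from
    ingested_at: str  # hypothetical metadata: load timestamp (UTC, ISO 8601)


def ingest_raw(lines, source, bronze):
    """Append raw lines to the Bronze store without parsing or cleaning them."""
    now = datetime.now(timezone.utc).isoformat()
    new_records = tuple(
        BronzeRecord(raw=line, source=source, ingested_at=now) for line in lines
    )
    # Append-only: build a new tuple instead of changing existing records.
    return bronze + new_records


bronze = ()
# A batch load and a streaming load land in the same Bronze store.
bronze = ingest_raw(["1,alice, 42 ", "2,,n/a"], "batch/customers.csv", bronze)
bronze = ingest_raw(["3,carol,17"], "stream/customers", bronze)
```

Note that the stray whitespace and the `n/a` placeholder are deliberately left untouched here; cleaning them up is the job of the Silver layer.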

Silver layer:

The Silver layer provides a refined structure over data that has been ingested. It represents a validated, enriched version of our data that can be trusted for downstream workloads, both operational and analytical.

Silver layer characteristics:

·         Uses data quality rules for validating and processing data.

·         Typically contains only functional data; technical or irrelevant data from Bronze is filtered out.

·         Historization is usually applied by merging incoming data into the existing data, typically using slowly changing dimensions (SCD).

·         Data is stored in an efficient storage format; preferably Delta, alternatively Parquet.

·         Handles missing data and standardizes empty fields.

·         Data is often clustered around certain subject areas.

·         Data is often still source-system aligned and organized. 
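Three of the Silver characteristics above (data quality rules, standardizing empty fields, and SCD-style historization) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not how a lakehouse engine implements a merge: the three-field record layout, the `EMPTY_MARKERS` set, the validation rules, and the function names are all hypothetical. The merge follows the common Type 2 pattern, where a changed row closes the previous version and appends a new one.

```python
from datetime import date

EMPTY_MARKERS = {"", "n/a", "null", "none"}  # assumed spellings of "missing"


def standardize(value):
    """Trim whitespace and map empty-looking values to None."""
    cleaned = value.strip()
    return None if cleaned.lower() in EMPTY_MARKERS else cleaned


def to_silver_row(raw_line):
    """Parse one raw Bronze line; return a clean row, or None if validation fails."""
    parts = raw_line.split(",")
    if len(parts) != 3:
        return None  # rule: exactly three fields
    customer_id, name, city = (standardize(p) for p in parts)
    # Rules: the key must be numeric and the name must be present.
    if customer_id is None or not customer_id.isdigit() or name is None:
        return None
    return {"customer_id": int(customer_id), "name": name, "city": city}


def scd2_merge(history, incoming, load_date):
    """Type 2 merge: close the open version of a changed row, then append the new one."""
    merged = list(history)
    open_index = {
        r["customer_id"]: i for i, r in enumerate(merged) if r["valid_to"] is None
    }
    for row in incoming:
        i = open_index.get(row["customer_id"])
        if i is not None:
            old = merged[i]
            if all(old[k] == row[k] for k in row):
                continue  # unchanged: keep the existing version
            merged[i] = {**old, "valid_to": load_date}  # close the old version
        merged.append({**row, "valid_from": load_date, "valid_to": None})
        open_index[row["customer_id"]] = len(merged) - 1
    return merged


raw_day1 = ["1,alice, Paris ", "2,,n/a", "x,bob,Rome"]  # two lines fail validation
raw_day2 = ["1,alice,Berlin"]                           # alice moves city

rows1 = [r for r in map(to_silver_row, raw_day1) if r is not None]
silver = scd2_merge([], rows1, date(2023, 12, 1))
rows2 = [r for r in map(to_silver_row, raw_day2) if r is not None]
silver = scd2_merge(silver, rows2, date(2023, 12, 2))
```

After the second load, `silver` holds two versions of customer 1: the Paris row closed on the second load date, and an open Berlin row. With Delta tables, the same idea is usually expressed as a merge operation rather than a Python loop.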

Gold layer:

In a lakehouse architecture, the Gold layer houses data that is structured in “project-specific” databases, making it readily available for consumption. It uses denormalized, read-optimized data models with fewer joins, such as a Kimball-style star schema, depending on the specific use case.

Gold layer characteristics:

·         Gold tables represent data that has been transformed for consumption or use cases.

·         Data is stored in an efficient storage format, preferably Delta.

·         Gold can be a selection or aggregation of data that’s found in Silver.

·         Complex business rules are applied in Gold, so it involves many post-processing activities, calculations, enrichments, and use-case-specific optimizations.

·         Data is highly governed and well-documented.
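As a rough illustration of Gold being a selection or aggregation of Silver, the sketch below rolls validated order rows up into one read-optimized row per city. It is plain Python with made-up data; the `silver_orders` rows, the `build_gold_sales_by_city` function, and the average-order-value rule are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical Silver rows: one validated order per row.
silver_orders = [
    {"order_id": 1, "city": "Paris", "amount": 120.0},
    {"order_id": 2, "city": "Paris", "amount": 80.0},
    {"order_id": 3, "city": "Berlin", "amount": 50.0},
]


def build_gold_sales_by_city(orders):
    """Aggregate Silver orders into one denormalized row per city."""
    totals = defaultdict(lambda: {"order_count": 0, "revenue": 0.0})
    for order in orders:
        bucket = totals[order["city"]]
        bucket["order_count"] += 1
        bucket["revenue"] += order["amount"]
    # Business rule (assumed for illustration): average order value per city.
    return [
        {"city": city, **agg, "avg_order_value": agg["revenue"] / agg["order_count"]}
        for city, agg in sorted(totals.items())
    ]


gold_sales_by_city = build_gold_sales_by_city(silver_orders)
```

A consumer such as a dashboard can read `gold_sales_by_city` directly, without joining or re-aggregating the order-level data.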

 

 



Creating DataFrames from CSV in Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CSV Example").getOrCreate()
sc = spark.sparkContext
Sp...