Data Engineering with Avishkar: 12/01/2023

December 24, 2023

Medallion architecture: Data platform strategy and best practices for managing Bronze, Silver and Gold

The medallion architecture describes a series of data layers that denote the quality of data stored in the lakehouse. Databricks recommends taking a multi-layered approach to building a single source of truth for enterprise data products. When the layers are stored in a transactional table format such as Delta, this architecture provides atomicity, consistency, isolation, and durability (ACID) guarantees as data passes through multiple layers of validations and transformations before being stored in a layout optimized for efficient analytics. The terms Bronze (raw), Silver (validated), and Gold (enriched) describe the quality of the data in each of these layers.

Bronze layer:

The Bronze layer is usually a reservoir that stores data in its natural, original state.

Bronze layer characteristics:

· Maintains the raw state of the data source in the structure “as-is”.

· Data is immutable (read-only).

· Can be any combination of streaming and batch transactions.
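
The characteristics above can be sketched in plain Python, independent of any particular engine. This is a minimal illustration under stated assumptions, not a Databricks or Spark API: the `BronzeRecord` type, the `ingest_raw` helper, and the field names are all hypothetical. The sketch keeps each payload exactly as received, stores ingestion metadata next to it, and only ever appends.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: a stored record cannot be modified afterwards
class BronzeRecord:
    raw_payload: str       # the source data "as-is", not parsed or cleaned
    source: str            # hypothetical label for the originating system
    ingested_at: datetime  # technical metadata added at ingestion time


def ingest_raw(
    bronze: Tuple[BronzeRecord, ...], payloads: List[str], source: str
) -> Tuple[BronzeRecord, ...]:
    """Return a new Bronze store with the payloads appended (append-only)."""
    now = datetime.now(timezone.utc)
    new_records = tuple(BronzeRecord(p, source, now) for p in payloads)
    return bronze + new_records


# Batch and streaming sources land in the same layer, unchanged.
bronze: Tuple[BronzeRecord, ...] = ()
bronze = ingest_raw(bronze, ['{"id": 1, "amount": " 10.5 "}'], source="batch_file")
bronze = ingest_raw(bronze, ['{"id": 2, "amount": ""}'], source="event_stream")
```

Note that the untrimmed and empty `amount` values are stored untouched. Cleaning them is left to the Silver layer, so the original input can always be reprocessed.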

Silver layer:

The Silver layer provides a refined structure over data that has been ingested. It represents a validated, enriched version of our data that can be trusted for downstream workloads, both operational and analytical.

Silver layer characteristics:

· Uses data quality rules for validating and processing data.

· Typically contains only functional data; technical or irrelevant data from Bronze is filtered out.

· Historization is usually applied by merging all data. Data is processed using slowly changing dimensions (SCD).

· Data is stored in an efficient storage format; preferably Delta, alternatively Parquet.

· Handles missing data and standardizes empty or uncleaned fields.

· Data is often clustered around certain subject areas.

· Data is often still source-system aligned and organized.
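
Three of the characteristics above (data quality rules, standardizing empty fields, and historization with SCD) can be sketched together in plain Python. This is only an illustration under stated assumptions: the customer rows, the column names, the single validation rule, and the choice of SCD Type 2 are all hypothetical, and a real lakehouse would typically express the merge step with the engine's own merge operation instead of a Python loop.

```python
from datetime import date
from typing import Dict, List, Optional


def standardize(value: Optional[str]) -> Optional[str]:
    """Trim whitespace and map empty strings to None so missing values look alike."""
    if value is None:
        return None
    value = value.strip()
    return value or None


def is_valid(row: Dict[str, Optional[str]]) -> bool:
    """Example data quality rule (assumed): a customer row needs an id and a city."""
    return row["customer_id"] is not None and row["city"] is not None


def scd2_merge(silver: List[dict], row: Dict[str, Optional[str]], load_date: date) -> None:
    """Apply one validated row to a Silver table kept as SCD Type 2 history."""
    current = next(
        (r for r in silver
         if r["customer_id"] == row["customer_id"] and r["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["city"] == row["city"]:
            return  # nothing changed, keep the current version
        current["valid_to"] = load_date  # close the old version
    silver.append({**row, "valid_from": load_date, "valid_to": None})


silver: List[dict] = []
rejected: List[dict] = []
batches = [
    (date(2023, 12, 1), [{"customer_id": "42", "city": " Pune "},
                         {"customer_id": "", "city": "Delhi"}]),
    (date(2023, 12, 24), [{"customer_id": "42", "city": "Mumbai"}]),
]
for load_date, rows in batches:
    for raw in rows:
        row = {key: standardize(val) for key, val in raw.items()}
        if is_valid(row):
            scd2_merge(silver, row, load_date)
        else:
            rejected.append(row)
```

After both batches, `silver` holds two versions of customer 42: a closed Pune row and an open Mumbai row. The row with the empty id ends up in `rejected` instead of being silently dropped, which keeps data quality failures visible.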

Gold layer:

In a Lakehouse architecture, the Gold layer houses data that is structured in “project-specific” databases, making it readily available for consumption. It uses a denormalized, read-optimized data model with fewer joins, such as a Kimball-style star schema, depending on the specific use case.

Gold layer characteristics:

· Gold tables represent data that has been transformed for consumption by specific use cases.

· Data is stored in an efficient storage format, preferably Delta.

· Gold can be a selection or aggregation of data that’s found in Silver.

· Complex business rules are applied in Gold, so it involves many post-processing activities: calculations, enrichments, use-case specific optimizations, etc.

· Data is highly governed and well-documented.
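
As a rough sketch of the selection and aggregation described above, the following plain Python builds one read-optimized Gold table from Silver rows. Everything here is an assumption made for illustration: the `silver_orders` data, the column names, the `build_gold_revenue_by_city` helper, and the business rule that only completed orders count as revenue.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical Silver rows: validated orders, still aligned to the source system.
silver_orders: List[dict] = [
    {"order_id": 1, "customer_id": "42", "city": "Pune", "amount": 120.0, "status": "completed"},
    {"order_id": 2, "customer_id": "42", "city": "Pune", "amount": 80.0, "status": "cancelled"},
    {"order_id": 3, "customer_id": "7", "city": "Mumbai", "amount": 300.0, "status": "completed"},
    {"order_id": 4, "customer_id": "9", "city": "Mumbai", "amount": 50.0, "status": "completed"},
]


def build_gold_revenue_by_city(orders: List[dict]) -> List[dict]:
    """Aggregate Silver orders into one denormalized row per city."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for order in orders:
        # Business rule (assumed): only completed orders count as revenue.
        if order["status"] != "completed":
            continue
        totals[order["city"]] += order["amount"]
        counts[order["city"]] += 1
    return [
        {
            "city": city,
            "order_count": counts[city],
            "revenue": totals[city],
            "avg_order_value": totals[city] / counts[city],
        }
        for city in sorted(totals)
    ]


gold_revenue_by_city = build_gold_revenue_by_city(silver_orders)
```

The result is a small table that a report can read without joins or further calculations. The cancelled order is excluded by the business rule, not by the reporting tool.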
