This project implements a declarative ETL pipeline using Delta Live Tables (DLT). It demonstrates the Medallion Architecture by processing raw JSON data from cloud storage into high-quality Materialized Views for analytics.
This project is part of the Databricks Data Engineer Learning Plan.
- Course: Build Data Pipelines with Lakeflow / Spark Declarative Pipelines
- Source: Databricks Academy - Learning Plan
The pipeline transforms data through three stages, utilizing both Streaming Tables (for incremental processing) and Materialized Views (for final aggregations).
**Bronze Layer:** Ingests raw JSON files from cloud storage.
- `orders_bronze` (Streaming Table)
- `status_bronze` (Streaming Table)
- `customers_bronze` (Streaming Table)
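A Bronze ingestion table in DLT is typically defined with Auto Loader (`cloudFiles`). The sketch below is a hedged illustration, not the project's actual code: the storage path, schema options, and the `ingest_time` column are assumptions, and it only runs inside a Databricks DLT pipeline, where `spark` is provided globally.

```python
import dlt
from pyspark.sql.functions import current_timestamp

@dlt.table(comment="Raw orders ingested incrementally from cloud storage")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")        # Auto Loader: incremental file discovery
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/Volumes/demo/raw/orders")            # hypothetical source path
        .withColumn("ingest_time", current_timestamp())  # assumed audit column
    )
```

The `status_bronze` and `customers_bronze` tables would follow the same pattern against their own source folders.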
**Silver Layer:** Cleans data and handles history.
- `orders_silver` & `status_silver`: Cleaned streaming tables.
- Customer CDC logic:
  - `customers_bronze_clean`: Preliminary cleaning.
  - `type1_customers_silver`: Applies Change Data Capture (CDC) logic to handle inserts, updates, and deletes, so the table always reflects the current state of each customer (SCD Type 1).
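The SCD Type 1 step is usually expressed with DLT's `apply_changes` API. This is a sketch under assumptions: the key column (`customer_id`), sequencing column (`timestamp`), and CDC operation column (`operation`) are hypothetical names, and the code requires the Databricks DLT runtime.

```python
import dlt
from pyspark.sql.functions import col

@dlt.view(comment="Preliminary cleaning of raw customer records")
def customers_bronze_clean():
    # Assumed cleaning rule: drop records without a key
    return dlt.read_stream("customers_bronze").where(col("customer_id").isNotNull())

# Declare the target; apply_changes keeps it in sync with the CDC feed
dlt.create_streaming_table("type1_customers_silver")

dlt.apply_changes(
    target="type1_customers_silver",
    source="customers_bronze_clean",
    keys=["customer_id"],                             # assumed business key
    sequence_by=col("timestamp"),                     # assumed ordering column
    apply_as_deletes=col("operation") == "DELETE",    # assumed CDC op column
    stored_as_scd_type=1,                             # Type 1: keep only the latest state
)
```

With `stored_as_scd_type=1`, updates overwrite the existing row and deletes remove it, so no history is retained in the target.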
**Gold Layer:** Business-level aggregates and joins exposed as Materialized Views.
- `full_order_info_gold`: Joins orders and status to provide a complete view.
- `gold_orders_by_date`: Aggregates order volume over time.
- Filtered views:
  - `cancelled_orders`: Subset of cancelled transactions.
  - `delivered_orders`: Subset of successful deliveries.
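Gold-layer Materialized Views are plain `@dlt.table` definitions built with batch reads of upstream tables. A minimal sketch, assuming `order_id`, `order_date`, and `order_status` column names (not confirmed by the source):

```python
import dlt
from pyspark.sql.functions import col, count

@dlt.table(comment="Orders joined with their status for a complete view")
def full_order_info_gold():
    # Assumed join key: order_id
    return dlt.read("orders_silver").join(dlt.read("status_silver"), on="order_id")

@dlt.table(comment="Order volume aggregated per day")
def gold_orders_by_date():
    return (
        dlt.read("full_order_info_gold")
        .groupBy("order_date")                        # assumed date column
        .agg(count("order_id").alias("order_count"))
    )

@dlt.table(comment="Subset of cancelled transactions")
def cancelled_orders():
    return dlt.read("full_order_info_gold").where(col("order_status") == "cancelled")
```

Because these use `dlt.read` rather than `dlt.read_stream`, DLT materializes them as full recomputations (Materialized Views) instead of incremental Streaming Tables.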
- Platform: Databricks (Data Intelligence Platform)
- Orchestration: Delta Live Tables (Declarative Pipelines)
- Format: Delta Lake
- Languages: Python (PySpark) / SQL