andre-salvati/databricks-template

databricks-template

A production-ready PySpark project template with medallion architecture, Python packaging, unit tests, integration tests, coverage tests, CI/CD automation, Declarative Automation Bundles, and DQX data quality framework.


🚀 Overview

This project template is designed to boost productivity and promote maintainability when developing ETL pipelines on Databricks. It aims to bring software engineering best practices, such as modular architecture, automated unit and integration testing, and CI/CD, into the world of data engineering. By combining a clean project structure with robust development and deployment jobs, this template helps teams move faster with confidence.

You're encouraged to adapt the structure and tooling to suit your project's specific needs and environment.

Interested in bringing these principles to your own project? Let's connect on LinkedIn.

🧪 Technologies

  • Databricks Free Edition (Serverless)
  • Databricks Runtime 18.0 LTS
  • Databricks Unity Catalog
  • Databricks Declarative Automation Bundles (formerly Databricks Asset Bundles)
  • Databricks CLI
  • Databricks Python SDK
  • Databricks DQX
  • PySpark 4.1
  • Python 3.12+
  • GitHub Actions
  • Pytest

📦 Features

This project template demonstrates how to:

  • structure PySpark code inside classes/packages, instead of notebooks.
  • package and deploy code to different environments (dev, staging, prod).
  • use a CI/CD pipeline with GitHub Actions.
  • run unit tests on transformations with the pytest package. Set up VS Code to run unit tests on your local machine.
  • run integration tests setting the input data and validating the output data.
  • isolate "dev" environments / catalogs to avoid concurrency issues between developer tests.
  • show developer name and branch as job tags to track issues.
  • use the coverage package to generate test coverage reports.
  • utilize uv as a project/package manager.
  • configure job to run tasks selectively.
  • use medallion architecture pattern.
  • lint and format code with ruff and pre-commit.
  • use a Makefile to automate repetitive tasks.
  • use the argparse package to build a flexible command line interface to start the jobs.
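
As a rough illustration of two of the points above (code organized in classes, and tasks that can be run selectively), a task layout might look like the sketch below. The names `BaseTask`, `GenerateOrders`, and `run_if_enabled` are hypothetical and are not taken from this repository. The transformation works on plain Python dicts rather than Spark DataFrames so that the example runs without a cluster.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BaseTask(ABC):
    """Hypothetical base class: shared plumbing here, logic in subclasses."""

    def __init__(self, skip: bool = False) -> None:
        self.skip = skip

    @abstractmethod
    def transform(self, rows: list[dict]) -> list[dict]:
        """Pure transformation, easy to unit test in isolation."""

    def run_if_enabled(self, rows: list[dict]) -> list[dict] | None:
        # Returning None signals that the task was skipped.
        if self.skip:
            return None
        return self.transform(rows)


class GenerateOrders(BaseTask):
    """Hypothetical task: keeps only orders with a positive amount."""

    def transform(self, rows: list[dict]) -> list[dict]:
        return [row for row in rows if row["amount"] > 0]


orders = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 0.0}]
kept = GenerateOrders().run_if_enabled(orders)
skipped = GenerateOrders(skip=True).run_if_enabled(orders)
```

Keeping `transform` free of I/O is one way to make the pytest unit tests fast, since they can call it directly with small in-memory inputs.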

🧠 Resources

For a debate on the use of notebooks vs. Python packaging, please refer to:

Sessions on Databricks Declarative Automation Bundles, CI/CD, and Software Development Life Cycle at Data + AI Summit 2025:

Other:

πŸ“ Folder Structure

databricks-template/
│
├── .github/                       # CI/CD automation
│   └── workflows/
│       └── onpush.yml             # GitHub Actions pipeline
│
├── src/                           # Main source code
│   └── template/                  # Python package
│       ├── main.py                # Entry point with CLI (argparse)
│       ├── config.py              # Configuration management
│       ├── baseTask.py            # Base class for all tasks
│       ├── commonSchemas.py       # Shared PySpark schemas
│       ├── job1/                  # Job-specific tasks
│       │   ├── extract_source1.py
│       │   ├── extract_source2.py
│       │   ├── generate_orders.py
│       │   ├── generate_orders_agg.py
│       │   ├── integration_setup.py
│       │   └── integration_validate.py
│       └── job2/                  # Additional job tasks
│
├── tests/                         # Unit tests
│   ├── job1/
│   │   └── unit_test.py           # Pytest unit tests
│   └── job2/
│
├── resources/                     # Databricks workflow templates
│   └── jobs.yml                   # Generated job definition (auto-created)
│
├── scripts/                             # Helper scripts
│   ├── sdk_generate_template_job.py     # Job definition generator (Databricks SDK)
│   ├── sdk_init.py                      # Workspace initialization
│   ├── sdk_analyze_job_costs.py         # Cost analysis script
│   └── sdk_workspace_and_account.py     # Workspace and account management
│
├── docs/                          # Documentation assets
│   ├── dag.png
│   ├── task_output.png
│   ├── data_lineage.png
│   ├── data_quality.png
│   └── ci_cd.png
│
├── dist/                          # Build artifacts (Python wheel)
├── coverage_reports/              # Test coverage reports
│
├── databricks.yml                 # Declarative Automation Bundle config
├── pyproject.toml                 # Python project configuration (uv)
├── Makefile                       # Build automation
├── .pre-commit-config.yaml        # Pre-commit hooks (ruff)
└── README.md                      # This file

CI/CD pipeline



Jobs



Task Output



Data Lineage



Data Quality (generated by Databricks DQX)



Instructions

  1. Create a workspace. Use a Databricks Free Edition workspace.

  2. Install and configure the Databricks CLI on your local machine. Check the required version in databricks.yml. Follow instructions here.

  3. Build the Python environment and execute the unit tests on your local machine.

     make sync && make test
    
  4. Create an external location in Databricks and update the "storage-root" parameter in the Makefile. For more details, see Overview of external locations. Then run the command below, which creates the catalogs, schemas, service principal, and the required grants:

     make init
    
  5. Generate a secret for the service principal. In Databricks, go to: Workspace -> Settings -> Identity and access -> Service principals -> Secrets. Generate a new secret for your service principal and update the corresponding profiles in your .databrickscfg file. Your configuration should look similar to this:

     [dev]
     host             = https://xxxx.cloud.databricks.com/
     token            = bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
                     
     [staging]
     host          = https://xxxx.cloud.databricks.com/
     client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
     client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    
     [prod]
     host          = https://xxxx.cloud.databricks.com/
     client_id     = yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
     client_secret = aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    
  6. Deploy and execute on the dev workspace.

     make deploy env=dev
    
  7. Configure CI/CD automation. Configure GitHub Actions repository secrets (DATABRICKS_HOST, DATABRICKS_PRINCIPAL_ID, DATABRICKS_SECRET).

  8. You can also execute unit tests from your preferred IDE, for example VS Code with Microsoft's Python extension installed.
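
Before deploying, it can help to confirm that the profiles from step 5 exist and carry the expected keys. The sketch below is a hypothetical helper, not part of this repository: `missing_profile_keys` and `EXPECTED_KEYS` are made-up names, and the required keys simply mirror the example configuration in step 5. It relies only on the fact that `.databrickscfg` uses INI syntax, which Python's standard `configparser` can read.

```python
import configparser

# Keys each profile is expected to define, mirroring the example in step 5.
# This mapping is an assumption for illustration, not a Databricks requirement.
EXPECTED_KEYS = {
    "dev": {"host", "token"},
    "staging": {"host", "client_id", "client_secret"},
    "prod": {"host", "client_id", "client_secret"},
}


def missing_profile_keys(cfg_text: str) -> dict[str, set[str]]:
    """Return, per profile, the expected keys that are absent."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    problems: dict[str, set[str]] = {}
    for profile, keys in EXPECTED_KEYS.items():
        if not parser.has_section(profile):
            # The whole profile is missing, so every key is missing.
            problems[profile] = set(keys)
            continue
        missing = keys - set(parser[profile].keys())
        if missing:
            problems[profile] = missing
    return problems


sample = """
[dev]
host  = https://xxxx.cloud.databricks.com/
token = bbbb

[staging]
host      = https://xxxx.cloud.databricks.com/
client_id = yyyy
"""
report = missing_profile_keys(sample)
```

To check a real file, the same function can be given the text of `~/.databrickscfg`, for example via `pathlib.Path.home().joinpath(".databrickscfg").read_text()`.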

Task parameters


  • task (required) - determines the current task to be executed.
  • env (required) - determines the AWS account where the job is running. This parameter also defines the default catalog for the task.
  • user (required) - determines the name of the catalog when env is "dev".
  • schema (optional) - determines the default schema to read/store tables.
  • skip (optional) - determines if the current task should be skipped.
  • debug (optional) - determines whether the current task runs its debug-only code path.
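
The parameters above map naturally onto an `argparse` interface. The following is a minimal sketch, not the repository's actual `main.py`: the flag spellings, the allowed `env` values, the boolean handling of `skip` and `debug`, and the `resolve_catalog` rule (use the user name as the catalog in dev, otherwise the environment name) are all assumptions inferred from the descriptions above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical CLI mirroring the documented task parameters."""
    parser = argparse.ArgumentParser(description="Run a single job task.")
    parser.add_argument("--task", required=True, help="task to execute")
    parser.add_argument("--env", required=True, choices=["dev", "staging", "prod"])
    parser.add_argument("--user", required=True, help="catalog name when env is dev")
    parser.add_argument("--schema", default=None, help="default schema for tables")
    parser.add_argument("--skip", action="store_true", help="skip this task")
    parser.add_argument("--debug", action="store_true", help="enable debug path")
    return parser


def resolve_catalog(env: str, user: str) -> str:
    # Assumed rule: each developer gets a personal catalog in dev,
    # while staging and prod use a catalog named after the environment.
    return user if env == "dev" else env


args = build_parser().parse_args(
    ["--task", "extract_source1", "--env", "dev", "--user", "alice"]
)
catalog = resolve_catalog(args.env, args.user)
```

Giving every developer a separate catalog in dev is what keeps concurrent test runs from overwriting each other's tables.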
