Databricks Pipeline for Ceres

Medallion Architecture pipeline for open data analytics on Databricks

This repository contains the Databricks analytics pipeline for the Ceres project. It implements a Medallion Architecture (Bronze → Silver → Gold) that ingests the Ceres open data index from Hugging Face and produces analytics-ready tables plus a lightweight semantic search engine, all running on Databricks.

Architecture

Screenshots

Notebooks

#	Notebook	Layer	Description
01	`01_ingest_huggingface_to_bronze.py`	Bronze	Loads the dataset from Hugging Face, coerces types, and writes to a managed Delta table with audit columns
02	`02_process_bronze_to_silver.py`	Silver	Deduplicates, parses timestamps, splits tags into arrays, and standardizes text fields
03	`03_create_gold_analytics.py`	Gold	Produces three analytics tables: monthly ingestion trends, topic frequency analysis, and portal-level statistics
04	`04_semantic_search_engine.py`	Gold / ML	Builds a TF-IDF feature store using Spark ML and exposes a simple hashing-based search engine with an interactive widget
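The Silver step in the table above (deduplicate, parse timestamps, split tags into arrays, standardize text) can be illustrated with a minimal plain-Python sketch. This is not the notebook's Spark code: the field names (`id`, `title`, `tags`, `metadata_modified`), the comma-separated tag format, the "keep the most recently modified record" rule, and the treatment of naive timestamps as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone

# Hypothetical field names; the real Bronze schema may differ.
_OLDEST = datetime.min.replace(tzinfo=timezone.utc)


def parse_ts(text):
    """Parse an ISO-8601 string into an aware UTC datetime, or None if unparseable."""
    if not text:
        return None
    try:
        ts = datetime.fromisoformat(text)
    except (TypeError, ValueError):
        return None
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return ts.astimezone(timezone.utc)


def clean_record(raw):
    """Standardize one Bronze-style record."""
    # Collapse runs of whitespace and trim the title.
    title = " ".join((raw.get("title") or "").split())
    # Split the tag string into a lowercase array, dropping empty entries.
    tags = [t.strip().lower() for t in (raw.get("tags") or "").split(",") if t.strip()]
    return {
        "id": raw["id"],
        "title": title,
        "tags": tags,
        "modified": parse_ts(raw.get("metadata_modified")),
    }


def to_silver(bronze_rows):
    """Clean every row and keep one record per id (the most recently modified)."""
    latest = {}
    for raw in bronze_rows:
        rec = clean_record(raw)
        prev = latest.get(rec["id"])
        if prev is None or (rec["modified"] or _OLDEST) >= (prev["modified"] or _OLDEST):
            latest[rec["id"]] = rec
    return list(latest.values())
```

In Spark the same steps would typically be expressed with DataFrame operations rather than Python loops; the sketch only shows the intended row-level behaviour.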

Quick Start

Prerequisites

- A Databricks workspace (Community Edition works for testing)
- Databricks CLI configured (`databricks configure`)

Option A — Databricks Asset Bundles (recommended)

```bash
# Clone and deploy
git clone https://github.com/AndreaBozzo/databricks-ceres-pipeline.git
cd databricks-ceres-pipeline

# Validate the bundle
databricks bundle validate

# Deploy to your workspace
databricks bundle deploy -t dev

# Run the pipeline
databricks bundle run ceres_pipeline -t dev
```

Option B — Manual import

1. Import the four `.py` notebooks into your Databricks workspace.
2. Run them in order: 01 → 02 → 03 → 04.

Notebook 01 installs its Hugging Face dependencies automatically via `%pip`.

Configuration

Databricks Asset Bundle

The pipeline is defined as a Databricks Asset Bundle in `databricks.yml`, with two targets:

Target	Description
`dev`	Development — runs on your personal workspace folder
`prod`	Production — designed for a shared workspace with job scheduling

Environment variables

No secrets are required. The pipeline reads from a public Hugging Face dataset.

Variable	Default	Description
`dataset_name`	`AndreaBozzo/ceres-open-data-index`	Hugging Face dataset identifier
`table_name`	`bronze_ceres_metadata`	Bronze target table
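As a rough illustration of how these two variables and the Bronze audit columns fit together, the sketch below resolves the settings from their documented defaults and stamps each row with `ingestion_ts` and `source_system`. The helper names (`resolve_config`, `add_audit_columns`), the override mechanism, and the `"huggingface"` value for `source_system` are hypothetical; the actual notebook may read its settings and tag its rows differently.

```python
from datetime import datetime, timezone

# Defaults as documented in the table above.
DEFAULTS = {
    "dataset_name": "AndreaBozzo/ceres-open-data-index",
    "table_name": "bronze_ceres_metadata",
}


def resolve_config(overrides=None):
    """Return the documented defaults, replaced by any non-empty known override."""
    config = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key in DEFAULTS and value:
            config[key] = value
    return config


def add_audit_columns(rows, source_system="huggingface", now=None):
    """Attach audit columns to each row.

    `source_system="huggingface"` is an assumed label, not taken from the notebook.
    All rows in one call share the same ingestion timestamp.
    """
    ingestion_ts = now or datetime.now(timezone.utc)
    return [
        {**row, "ingestion_ts": ingestion_ts, "source_system": source_system}
        for row in rows
    ]
```

For example, `resolve_config({"table_name": "bronze_ceres_metadata_test"})` keeps the default dataset but points the write at a different table, which can be handy when experimenting in a shared workspace.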

Delta Tables Produced

Table	Layer	Description
`bronze_ceres_metadata`	Bronze	Raw dataset metadata + `ingestion_ts`, `source_system`
`silver_ceres_metadata`	Silver	Cleaned, deduplicated, with parsed timestamps and tag arrays
`gold_monthly_trend`	Gold	Monthly dataset ingestion counts by portal
`gold_topic_analysis`	Gold	Top 200 topics by frequency across portals
`gold_portal_stats`	Gold	Per-portal statistics (dataset count, orgs, date range)
`gold_ml_features`	Gold	TF-IDF feature vectors (1024-dim) for semantic search
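To make the `gold_ml_features` row more concrete, here is a minimal plain-Python sketch of hashed TF-IDF search over 1024 buckets. It is an illustration under stated assumptions, not the notebook's implementation: it uses `zlib.crc32` as the hash (Spark ML's `HashingTF` uses a different hash function), a simple alphanumeric tokenizer, the smoothed weight log((N + 1) / (df + 1)) that Spark ML documents for its `IDF` estimator, and cosine similarity for ranking. All function names are hypothetical.

```python
import math
import re
import zlib
from collections import Counter

NUM_FEATURES = 1024  # matches the 1024-dim vectors listed in the table


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_tf(text, num_features=NUM_FEATURES):
    """Sparse term-frequency vector: bucket index -> count (hashing trick)."""
    return Counter(
        zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokenize(text)
    )


def fit_idf(tf_vectors):
    """Per-bucket inverse document frequency, log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    doc_freq = Counter(bucket for tf in tf_vectors for bucket in tf)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


def tfidf(tf, idf):
    """Weight term counts by IDF; buckets never seen at fit time get weight 0."""
    return {b: count * idf.get(b, 0.0) for b, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, docs, top_k=3):
    """Rank docs against the query; returns (doc, score) pairs with score > 0."""
    tfs = [hashed_tf(d) for d in docs]
    idf = fit_idf(tfs)
    doc_vecs = [tfidf(tf, idf) for tf in tfs]
    q_vec = tfidf(hashed_tf(query), idf)
    scored = sorted(
        ((cosine(q_vec, v), i) for i, v in enumerate(doc_vecs)), reverse=True
    )
    return [(docs[i], score) for score, i in scored[:top_k] if score > 0]
```

Because terms are hashed into a fixed number of buckets, unrelated words can occasionally collide; that is the usual trade-off of the hashing trick in exchange for a fixed vector size and no vocabulary to store. Note also that this kind of search matches on shared terms rather than on meaning.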

Relationship to Ceres

Ceres is a Rust-based semantic search engine that harvests metadata from CKAN open data portals and indexes it with vector embeddings. This pipeline provides a complementary analytics layer on the same data:

Ceres (main repo) → Real-time harvesting, Gemini embeddings, PostgreSQL + pgvector, REST API
This pipeline → Batch analytics, Spark ML features, Delta Lake, Databricks dashboards

Both consume the same Hugging Face dataset as their source of truth.

Development

```bash
# Lint notebooks
pip install -r requirements-dev.txt
ruff check .

# Run tests (requires Databricks Connect or a cluster)
pytest tests/
```

License

Licensed under the Apache License, Version 2.0.

Acknowledgments

Ceres — the main semantic search engine project
Databricks — unified analytics platform
Hugging Face — dataset hosting
Delta Lake — open-source storage layer
