AndreaBozzo/databricks-ceres-pipeline

Databricks Pipeline for Ceres

Medallion Architecture pipeline for open data analytics on Databricks


This repository contains the Databricks analytics pipeline for the Ceres project. It implements a Medallion Architecture (Bronze → Silver → Gold) that ingests the Ceres open data index from Hugging Face and produces analytics-ready tables plus a lightweight semantic search engine, all running on Databricks.

Architecture

Ceres Databricks Pipeline Architecture

Screenshots

  • Pipeline overview
  • Silver to Gold transformations
  • Example visualization

Notebooks

| # | Notebook | Layer | Description |
|---|----------|-------|-------------|
| 01 | 01_ingest_huggingface_to_bronze.py | Bronze | Loads the dataset from Hugging Face, coerces types, and writes to a managed Delta table with audit columns |
| 02 | 02_process_bronze_to_silver.py | Silver | Deduplicates, parses timestamps, splits tags into arrays, and standardizes text fields |
| 03 | 03_create_gold_analytics.py | Gold | Produces three analytics tables: monthly ingestion trends, topic frequency analysis, and portal-level statistics |
| 04 | 04_semantic_search_engine.py | Gold / ML | Builds a TF-IDF feature store using Spark ML and exposes a simple hashing-based search engine with an interactive widget |
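
To make the Silver step more concrete, here is a minimal plain-Python sketch of the row-level logic it describes (deduplicate, parse timestamps, split tags, standardize text). It is an illustration only, not the notebook's code: the actual notebook runs on Spark, and the field names `id`, `title`, `tags`, and `metadata_modified`, the comma-separated tag format, and the "keep the most recently modified record" rule are all assumptions made for this example.

```python
from datetime import datetime


def clean_record(raw):
    """Standardize one Bronze-style record (field names are hypothetical)."""
    # Collapse runs of whitespace and trim the title.
    title = " ".join((raw.get("title") or "").split())

    # Assumption: tags arrive as one comma-separated string.
    tags_raw = raw.get("tags") or ""
    tags = [t.strip().lower() for t in tags_raw.split(",") if t.strip()]

    # Unparseable or missing timestamps become None instead of failing the row.
    ts_raw = raw.get("metadata_modified")
    try:
        modified = datetime.fromisoformat(ts_raw) if ts_raw else None
    except ValueError:
        modified = None

    return {"id": raw["id"], "title": title, "tags": tags, "modified": modified}


def _recency_key(rec):
    # Records without a timestamp sort before any record that has one.
    return (rec["modified"] is not None, rec["modified"] or datetime.min)


def bronze_to_silver(rows):
    """Clean every row and keep one record per id (the most recently modified)."""
    latest = {}
    for raw in rows:
        rec = clean_record(raw)
        prev = latest.get(rec["id"])
        if prev is None or _recency_key(rec) > _recency_key(prev):
            latest[rec["id"]] = rec
    return list(latest.values())


if __name__ == "__main__":
    sample = [
        {"id": "a", "title": "  Air   Quality ", "tags": "Environment, AIR ,",
         "metadata_modified": "2024-01-05T10:00:00"},
        {"id": "a", "title": "Air Quality v2", "tags": "air",
         "metadata_modified": "2024-03-01T08:30:00"},
    ]
    for record in bronze_to_silver(sample):
        print(record)
```

In Spark the same ideas would typically be expressed with column functions and a window or `dropDuplicates`; the sketch assumes timestamps are all naive (no mixed time zones) so they can be compared directly.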

Quick Start

Prerequisites

  • A Databricks workspace (Community Edition works for testing)
  • Databricks CLI configured (databricks configure)

Option A — Databricks Asset Bundles (recommended)

# Clone and deploy
git clone https://github.com/AndreaBozzo/databricks-ceres-pipeline.git
cd databricks-ceres-pipeline

# Validate the bundle
databricks bundle validate

# Deploy to your workspace
databricks bundle deploy -t dev

# Run the pipeline
databricks bundle run ceres_pipeline -t dev

Option B — Manual import

  1. Import the four .py notebooks into your Databricks workspace
  2. Run them in order: 01 → 02 → 03 → 04
  3. Notebook 01 installs HuggingFace dependencies automatically via %pip

Configuration

Databricks Asset Bundle

The pipeline is configured as a Databricks Asset Bundle in databricks.yml. Targets:

| Target | Description |
|--------|-------------|
| dev | Development: runs in your personal workspace folder |
| prod | Production: designed for a shared workspace with job scheduling |

Environment variables

No secrets are required. The pipeline reads from a public Hugging Face dataset.

| Variable | Default | Description |
|----------|---------|-------------|
| dataset_name | AndreaBozzo/ceres-open-data-index | Hugging Face dataset identifier |
| table_name | bronze_ceres_metadata | Bronze target table |

Delta Tables Produced

| Table | Layer | Description |
|-------|-------|-------------|
| bronze_ceres_metadata | Bronze | Raw dataset metadata + ingestion_ts, source_system |
| silver_ceres_metadata | Silver | Cleaned, deduplicated, with parsed timestamps and tag arrays |
| gold_monthly_trend | Gold | Monthly dataset ingestion counts by portal |
| gold_topic_analysis | Gold | Top 200 topics by frequency across portals |
| gold_portal_stats | Gold | Per-portal statistics (dataset count, orgs, date range) |
| gold_ml_features | Gold | TF-IDF feature vectors (1024-dim) for semantic search |
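
As a rough illustration of how hashed TF-IDF vectors like those in gold_ml_features can drive a search, the following standard-library sketch hashes tokens into 1024 buckets, weights them with a smoothed IDF, and ranks documents by cosine similarity. It is not the notebook's implementation: the notebook uses Spark ML, while this example substitutes `zlib.crc32` for Spark's hash function, and the tokenizer, the function names, and the cosine ranking are assumptions made for the example.

```python
import math
import re
import zlib
from collections import Counter

NUM_FEATURES = 1024  # same dimensionality as listed for gold_ml_features


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def hashed_tf(text, num_features=NUM_FEATURES):
    """Hashing trick: map each token to a bucket index and count occurrences."""
    return Counter(
        zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokenize(text)
    )


def fit_idf(tf_vectors):
    """Smoothed inverse document frequency per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    doc_freq = Counter(bucket for tf in tf_vectors for bucket in tf)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


def tfidf(tf, idf):
    """Sparse TF-IDF vector; buckets never seen at fit time get weight 0."""
    return {b: count * idf.get(b, 0.0) for b, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(b, 0.0) for b, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def search(query, docs, top_k=3):
    """Return up to top_k (document, score) pairs with a non-zero score."""
    tfs = [hashed_tf(d) for d in docs]
    idf = fit_idf(tfs)
    doc_vecs = [tfidf(tf, idf) for tf in tfs]
    q_vec = tfidf(hashed_tf(query), idf)
    scored = sorted(
        ((cosine(q_vec, vec), i) for i, vec in enumerate(doc_vecs)), reverse=True
    )
    return [(docs[i], round(score, 3)) for score, i in scored[:top_k] if score > 0]


if __name__ == "__main__":
    titles = [
        "Air quality measurements in Milan",
        "Municipal budget and public spending",
        "Public transport timetables",
    ]
    print(search("air quality", titles))
```

Because distinct tokens can land in the same bucket, a hashing-based index trades a small amount of precision for a fixed vector size and no vocabulary to maintain. Note that this kind of search matches on shared terms rather than on meaning, unlike the embedding-based search in the main Ceres project.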

Relationship to Ceres

Ceres is a Rust-based semantic search engine that harvests metadata from CKAN open data portals and indexes them with vector embeddings. This pipeline provides a complementary analytics layer on the same data:

  • Ceres (main repo) → Real-time harvesting, Gemini embeddings, PostgreSQL + pgvector, REST API
  • This pipeline → Batch analytics, Spark ML features, Delta Lake, Databricks dashboards

Both consume the same Hugging Face dataset as their source of truth.

Development

# Lint notebooks
pip install -r requirements-dev.txt
ruff check .

# Run tests (requires Databricks Connect or a cluster)
pytest tests/

License

Licensed under the Apache License, Version 2.0.

Acknowledgments