This repository contains the Databricks analytics pipeline for the Ceres project. It implements a Medallion Architecture (Bronze → Silver → Gold) that ingests the Ceres open data index from Hugging Face and produces analytics-ready tables plus a lightweight semantic search engine, all running on Databricks.
| # | Notebook | Layer | Description |
|---|---|---|---|
| 01 | `01_ingest_huggingface_to_bronze.py` | Bronze | Loads the dataset from Hugging Face, coerces types, and writes to a managed Delta table with audit columns |
| 02 | `02_process_bronze_to_silver.py` | Silver | Deduplicates, parses timestamps, splits tags into arrays, and standardizes text fields |
| 03 | `03_create_gold_analytics.py` | Gold | Produces three analytics tables: monthly ingestion trends, topic frequency analysis, and portal-level statistics |
| 04 | `04_semantic_search_engine.py` | Gold / ML | Builds a TF-IDF feature store using Spark ML and exposes a simple hashing-based search engine with an interactive widget |
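For intuition, the hashing-based search in notebook `04` can be sketched in plain Python. This is a minimal stand-in for Spark ML's hashed term-frequency features (same 1024-dimension idea, no IDF weighting), with cosine similarity for ranking; the function names, toy corpus, and use of MD5 as the hash are illustrative assumptions, not code from the notebook.

```python
import hashlib
import math
from collections import Counter

DIM = 1024  # matches the pipeline's 1024-dim feature vectors


def hash_features(text: str) -> dict[int, int]:
    """Map tokens into a fixed 1024-bucket sparse vector (hashing trick).

    MD5 is used here only to get a deterministic hash; Spark ML's
    HashingTF uses MurmurHash3 instead.
    """
    buckets = (
        int(hashlib.md5(tok.encode()).hexdigest(), 16) % DIM
        for tok in text.lower().split()
    )
    return dict(Counter(buckets))


def cosine(a: dict[int, int], b: dict[int, int]) -> float:
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


# Toy corpus standing in for dataset titles/descriptions
corpus = {
    "ds1": "air quality measurements milan",
    "ds2": "public transport timetables rome",
    "ds3": "air pollution sensors turin",
}
vectors = {key: hash_features(text) for key, text in corpus.items()}


def search(query: str, top_n: int = 2) -> list[str]:
    """Return the top_n dataset keys ranked by cosine similarity."""
    q = hash_features(query)
    ranked = sorted(vectors, key=lambda k: cosine(q, vectors[k]), reverse=True)
    return ranked[:top_n]
```

The hashing trick avoids building a vocabulary: any token maps to one of 1024 fixed buckets, so feature vectors have a constant width regardless of corpus size, at the cost of occasional collisions.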
- A Databricks workspace (Community Edition works for testing)
- Databricks CLI configured (`databricks configure`)
```bash
# Clone and deploy
git clone https://github.com/AndreaBozzo/databricks-ceres-pipeline.git
cd databricks-ceres-pipeline

# Validate the bundle
databricks bundle validate

# Deploy to your workspace
databricks bundle deploy -t dev

# Run the pipeline
databricks bundle run ceres_pipeline -t dev
```

Alternatively, run the notebooks manually:

- Import the four `.py` notebooks into your Databricks workspace
- Run them in order: `01` → `02` → `03` → `04`
- Notebook `01` installs Hugging Face dependencies automatically via `%pip`
The pipeline is configured as a Databricks Asset Bundle in `databricks.yml`. Targets:
| Target | Description |
|---|---|
| `dev` | Development — runs on your personal workspace folder |
| `prod` | Production — designed for a shared workspace with job scheduling |
No secrets are required. The pipeline reads from a public Hugging Face dataset.
| Variable | Default | Description |
|---|---|---|
| `dataset_name` | `AndreaBozzo/ceres-open-data-index` | Hugging Face dataset identifier |
| `table_name` | `bronze_ceres_metadata` | Bronze target table |
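A bundle wiring the targets and variables above typically looks like the following. This is a hypothetical sketch, not the repository's actual `databricks.yml`; only the target names, variable names, and defaults come from the tables above, while every other field value is illustrative.

```yaml
# Illustrative sketch of databricks.yml (values are assumptions)
bundle:
  name: ceres_pipeline

variables:
  dataset_name:
    default: AndreaBozzo/ceres-open-data-index
  table_name:
    default: bronze_ceres_metadata

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production
```

With a layout like this, `databricks bundle deploy -t dev` deploys into a per-user development folder, while `-t prod` targets the shared production workspace.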
| Table | Layer | Description |
|---|---|---|
| `bronze_ceres_metadata` | Bronze | Raw dataset metadata plus `ingestion_ts` and `source_system` audit columns |
| `silver_ceres_metadata` | Silver | Cleaned, deduplicated, with parsed timestamps and tag arrays |
| `gold_monthly_trend` | Gold | Monthly dataset ingestion counts by portal |
| `gold_topic_analysis` | Gold | Top 200 topics by frequency across portals |
| `gold_portal_stats` | Gold | Per-portal statistics (dataset count, organizations, date range) |
| `gold_ml_features` | Gold | TF-IDF feature vectors (1024-dim) for semantic search |
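The aggregation behind a table like `gold_monthly_trend` (monthly ingestion counts per portal) can be sketched in plain Python. The row shape and field names below are assumptions for illustration; the actual pipeline performs the equivalent group-by in Spark.

```python
from collections import Counter
from datetime import datetime

# Toy silver-layer rows; field names are illustrative assumptions.
rows = [
    {"portal": "dati.gov.it", "ingestion_ts": "2024-01-15T10:00:00"},
    {"portal": "dati.gov.it", "ingestion_ts": "2024-01-20T09:30:00"},
    {"portal": "data.europa.eu", "ingestion_ts": "2024-02-03T12:00:00"},
]


def monthly_trend(rows: list[dict]) -> dict[tuple[str, str], int]:
    """Count datasets per (portal, year-month) pair."""
    return dict(Counter(
        (r["portal"], datetime.fromisoformat(r["ingestion_ts"]).strftime("%Y-%m"))
        for r in rows
    ))
```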
Ceres is a Rust-based semantic search engine that harvests metadata from CKAN open data portals and indexes it with vector embeddings. This pipeline provides a complementary analytics layer on the same data:
- Ceres (main repo) → Real-time harvesting, Gemini embeddings, PostgreSQL + pgvector, REST API
- This pipeline → Batch analytics, Spark ML features, Delta Lake, Databricks dashboards
Both consume the same Hugging Face dataset as their source of truth.
```bash
# Lint notebooks
pip install -r requirements-dev.txt
ruff check .

# Run tests (requires Databricks Connect or a cluster)
pytest tests/
```

Licensed under the Apache License, Version 2.0.
- Ceres — the main semantic search engine project
- Databricks — unified analytics platform
- Hugging Face — dataset hosting
- Delta Lake — open-source storage layer




