Crawls RSS/Atom feeds, scores articles against configured topics using local embeddings, de-duplicates by news event, and posts to Slack channels.
config.yaml → Crawler → Embedder → Scorer → Deduplicator → Slack Publisher
                                   ↕
                               SQLite DB
- Crawl — Fetches RSS/Atom feeds in parallel (including Google News search feeds for 30-day backfill)
- Embed — Encodes "{title}. {summary}" with all-MiniLM-L6-v2 (384-dim, runs on MPS/CUDA/CPU)
- Score — Mean-centered cosine similarity against topic embeddings. Centering subtracts the corpus mean so omnipresent terms (like "AI") are down-weighted and only topic-specific language drives relevance.
- Dedup — Two phases:
- Cross-tick: suppresses articles similar to ones posted in the last N hours
- Intra-tick: greedy clustering so only the highest-scored representative of each news event is posted
- Publish — Posts to Slack via Block Kit with rate-limit-aware retries
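The mean-centered scoring step above can be sketched as follows. `mean_centered_scores` and its shape conventions are illustrative, not the actual `scorer.py` API, and whether the topic vectors are themselves centered is left open here:

```python
import numpy as np

def mean_centered_scores(article_vecs: np.ndarray, topic_vecs: np.ndarray) -> np.ndarray:
    """Illustrative sketch: subtract the corpus mean from article embeddings,
    then score each article by its best cosine similarity over the topics.

    article_vecs: (n_articles, dim), topic_vecs: (n_topics, dim).
    Returns an (n_articles,) array of relevance scores.
    """
    # Centering down-weights directions shared by the whole corpus
    centered = article_vecs - article_vecs.mean(axis=0, keepdims=True)
    centered /= np.linalg.norm(centered, axis=1, keepdims=True)
    topics = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    # Cosine similarity matrix (n_articles, n_topics); keep the best topic match
    return (centered @ topics.T).max(axis=1)
```

Articles whose score falls below `relevance_threshold` would then be filtered out before deduplication.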
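The intra-tick greedy clustering can likewise be sketched. `greedy_intra_tick_dedup` is a hypothetical helper, not the real `dedup.py` interface: walk articles in descending score order and keep only those not too similar to anything already kept.

```python
import numpy as np

def cosine(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def greedy_intra_tick_dedup(articles, threshold: float = 0.82):
    """articles: iterable of (score, embedding) pairs.

    Keeps the highest-scored representative of each news event; anything
    with cosine similarity >= threshold to a kept article is dropped.
    """
    kept = []
    for score, vec in sorted(articles, key=lambda a: a[0], reverse=True):
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((score, vec))
    return kept
```

Because the loop runs in score order, each cluster's surviving article is automatically its best-scored member, matching the behavior described above.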
# Setup (Mac — gets MPS acceleration automatically)
make setup
# Set your Slack bot token
export SLACK_BOT_TOKEN=xoxb-...
# Test run — logs what would be posted without hitting Slack
make dry-run
# Single tick with real posting
make run-once
# Scheduled loop (default: every 15 minutes)
make run
# Run tests
make test

Edit config.yaml:
global:
  crawl_interval_minutes: 15
  dedup_threshold: 0.82              # similarity to consider duplicate
  dedup_window_hours: 72             # look-back window for cross-tick dedup
  relevance_threshold: 0.35          # default, overridable per channel
  max_articles_per_channel_per_tick: 10
  database_path: "data/news_intel.db"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"

slack:
  bot_token: "${SLACK_BOT_TOKEN}"    # env var interpolation
  post_delay_seconds: 1.5

channels:
  - name: "ai-energy"
    slack_channel: "#ai-energy"
    relevance_threshold: 0.38        # channel override
    topics:
      - "Autonomous AI control systems for industrial energy optimization"
      - "AI-driven meteorology for renewable grid stability"
    feeds:
      - url: "https://cleantechnica.com/feed/"
        label: "CleanTechnica"
      - url: "https://news.google.com/rss/search?q=AI+%22energy+optimization%22+when:30d&hl=en-US&gl=US&ceid=US:en"
        label: "GNews: AI energy optimization"

Google News RSS search feeds (news.google.com/rss/search?q=...&when:30d) are useful for niche topics that general feeds don't cover well.
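A search-feed URL like the one above can be assembled with the standard library. `gnews_search_feed` is a hypothetical helper; note that `quote_plus` percent-encodes the colon in `when:30d`, which is URL-equivalent to the literal colon used in the config:

```python
from urllib.parse import quote_plus

def gnews_search_feed(query: str, window: str = "30d") -> str:
    """Hypothetical helper: build a Google News RSS search URL.

    The when: operator restricts results to a trailing time window,
    which is what enables the 30-day backfill mentioned above.
    """
    q = quote_plus(f"{query} when:{window}")
    return f"https://news.google.com/rss/search?q={q}&hl=en-US&gl=US&ceid=US:en"
```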
# Build and run with GPU
make docker-build
SLACK_BOT_TOKEN=xoxb-... make docker-run
# One-off run
make docker-run-once
# Logs
make docker-logs

Requires nvidia-container-toolkit for GPU access.
src/
├── main.py # Entrypoint: scheduler, signal handling, pipeline orchestration
├── config.py # YAML loading with ${ENV_VAR} interpolation
├── models.py # Dataclasses: RawArticle, EmbeddedArticle, ScoredArticle, etc.
├── crawler.py # Parallel RSS/Atom fetching via feedparser + ThreadPoolExecutor
├── embedder.py # SentenceTransformer singleton, batch encoding
├── scorer.py # Mean-centered cosine similarity scoring
├── dedup.py # Cross-tick sliding window + intra-tick greedy clustering
├── publisher.py # Slack Block Kit formatting, rate-limit-aware posting
├── db.py # Thread-safe SQLite: schema, queries, cleanup
└── utils.py # Logging, HTML stripping, retry decorator
tests/
├── conftest.py # Shared fixtures
├── test_config.py # Config loading, env vars, validation
├── test_crawler.py # Feed parsing, dedup, channel distribution
├── test_embedder.py # Text construction, mock model encoding
├── test_scorer.py # Mean centering, filtering, thresholds
├── test_dedup.py # Intra/cross-tick deduplication
└── test_publisher.py # Block Kit format, dry run, edge cases
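The `${ENV_VAR}` interpolation performed by config.py might look roughly like this minimal sketch (the real loader may accept different patterns or handle missing variables differently):

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def interpolate_env(value: str) -> str:
    """Replace ${VAR} with the environment value; leave unknown vars as-is."""
    return _ENV_PATTERN.sub(
        lambda m: os.environ.get(m.group(1), m.group(0)), value
    )
```

This is how `bot_token: "${SLACK_BOT_TOKEN}"` in config.yaml would resolve to the exported token at load time.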
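The retry decorator in utils.py and the rate-limit-aware posting in publisher.py suggest a pattern like this exponential-backoff sketch (names, signature, and defaults are assumptions, not the project's actual API):

```python
import functools
import time

def retry(times: int = 3, delay: float = 1.0, backoff: float = 2.0,
          exceptions=(Exception,)):
    """Retry a function up to `times` attempts with exponential backoff."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == times - 1:
                        raise  # out of attempts: propagate the last error
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator
```

For Slack specifically, a rate-limited response carries a Retry-After header, so the publisher would sleep for that duration rather than a fixed backoff.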
- Python 3.9+
- PyTorch (auto-detects MPS on Apple Silicon, CUDA on Linux)
- Dependencies: feedparser, sentence-transformers, slack-sdk, apscheduler, pyyaml