News Intelligence Pipeline

Crawls RSS/Atom feeds, scores articles against configured topics using local embeddings, de-duplicates by news event, and posts to Slack channels.

config.yaml → Crawler → Embedder → Scorer → Deduplicator → Slack Publisher
                                       ↕
                                   SQLite DB

How It Works

  1. Crawl — Fetches RSS/Atom feeds in parallel (including Google News search feeds for 30-day backfill)
  2. Embed — Encodes "{title}. {summary}" with all-MiniLM-L6-v2 (384-dim, runs on MPS/CUDA/CPU)
  3. Score — Mean-centered cosine similarity against topic embeddings. Centering subtracts the corpus mean so omnipresent terms (like "AI") are down-weighted and only topic-specific language drives relevance.
  4. Dedup — Two phases:
    • Cross-tick: suppresses articles similar to ones posted in the last N hours
    • Intra-tick: greedy clustering so only the highest-scored representative of each news event is posted
  5. Publish — Posts to Slack via Block Kit with rate-limit-aware retries
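The mean-centered scoring in step 3 can be sketched as follows. This is an illustrative sketch, not the repo's actual `scorer.py`; the function name is made up, and whether topic embeddings are centered with the same corpus mean is an assumption:

```python
import numpy as np

def score_articles(article_vecs: np.ndarray, topic_vecs: np.ndarray) -> np.ndarray:
    """Mean-centered cosine similarity (illustrative sketch).

    article_vecs: (n_articles, 384) article embeddings
    topic_vecs:   (n_topics, 384) topic embeddings
    Returns each article's best similarity across topics.
    """
    # Subtract the corpus mean so language shared by every article
    # (e.g. "AI") stops dominating the similarity.
    center = article_vecs.mean(axis=0)
    a = article_vecs - center
    t = topic_vecs - center
    # Cosine similarity after centering.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sims = a @ t.T                    # (n_articles, n_topics)
    return sims.max(axis=1)           # best topic match per article
```

Articles whose best score falls below `relevance_threshold` would then be dropped before dedup.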

Quick Start

# Setup (Mac — gets MPS acceleration automatically)
make setup

# Set your Slack bot token
export SLACK_BOT_TOKEN=xoxb-...

# Test run — logs what would be posted without hitting Slack
make dry-run

# Single tick with real posting
make run-once

# Scheduled loop (default: every 15 minutes)
make run

# Run tests
make test

Configuration

Edit config.yaml:

global:
  crawl_interval_minutes: 15
  dedup_threshold: 0.82          # similarity to consider duplicate
  dedup_window_hours: 72         # look-back window for cross-tick dedup
  relevance_threshold: 0.35      # default, overridable per channel
  max_articles_per_channel_per_tick: 10
  database_path: "data/news_intel.db"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"

slack:
  bot_token: "${SLACK_BOT_TOKEN}"  # env var interpolation
  post_delay_seconds: 1.5

channels:
  - name: "ai-energy"
    slack_channel: "#ai-energy"
    relevance_threshold: 0.38      # channel override
    topics:
      - "Autonomous AI control systems for industrial energy optimization"
      - "AI-driven meteorology for renewable grid stability"
    feeds:
      - url: "https://cleantechnica.com/feed/"
        label: "CleanTechnica"
      - url: "https://news.google.com/rss/search?q=AI+%22energy+optimization%22+when:30d&hl=en-US&gl=US&ceid=US:en"
        label: "GNews: AI energy optimization"
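The `${SLACK_BOT_TOKEN}` value above relies on the env var interpolation that `config.py` is described as providing. A minimal sketch of that behavior (the regex and error handling here are assumptions, not the repo's actual code):

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def interpolate_env(value: str) -> str:
    """Replace ${VAR} with os.environ['VAR'] in a config string (sketch)."""
    def _sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            # Failing loudly beats posting with a literal "${...}" token.
            raise KeyError(f"config references unset env var {name}")
        return os.environ[name]
    return _ENV_PATTERN.sub(_sub, value)
```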

Google News RSS search feeds (news.google.com/rss/search?q=<query>+when:30d) are useful for niche topics that general feeds don't cover well. Note that the when:30d operator, which limits results to the last 30 days, is part of the q query parameter rather than a separate parameter.
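Such a feed URL can be built with the standard library; this is a hypothetical helper, not part of the repo:

```python
from urllib.parse import urlencode

def gnews_search_feed(query: str, days: int = 30) -> str:
    """Build a Google News RSS search-feed URL (illustrative helper).

    The when:<N>d recency operator goes inside the q parameter.
    """
    params = {
        "q": f"{query} when:{days}d",
        "hl": "en-US",
        "gl": "US",
        "ceid": "US:en",
    }
    return "https://news.google.com/rss/search?" + urlencode(params)
```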

Docker (Linux/CUDA)

# Build and run with GPU
make docker-build
SLACK_BOT_TOKEN=xoxb-... make docker-run

# One-off run
make docker-run-once

# Logs
make docker-logs

Requires nvidia-container-toolkit for GPU access.

Project Structure

src/
├── main.py          # Entrypoint: scheduler, signal handling, pipeline orchestration
├── config.py        # YAML loading with ${ENV_VAR} interpolation
├── models.py        # Dataclasses: RawArticle, EmbeddedArticle, ScoredArticle, etc.
├── crawler.py       # Parallel RSS/Atom fetching via feedparser + ThreadPoolExecutor
├── embedder.py      # SentenceTransformer singleton, batch encoding
├── scorer.py        # Mean-centered cosine similarity scoring
├── dedup.py         # Cross-tick sliding window + intra-tick greedy clustering
├── publisher.py     # Slack Block Kit formatting, rate-limit-aware posting
├── db.py            # Thread-safe SQLite: schema, queries, cleanup
└── utils.py         # Logging, HTML stripping, retry decorator
tests/
├── conftest.py      # Shared fixtures
├── test_config.py   # Config loading, env vars, validation
├── test_crawler.py  # Feed parsing, dedup, channel distribution
├── test_embedder.py # Text construction, mock model encoding
├── test_scorer.py   # Mean centering, filtering, thresholds
├── test_dedup.py    # Intra/cross-tick deduplication
└── test_publisher.py # Block Kit format, dry run, edge cases
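The intra-tick greedy clustering that `dedup.py` is described as implementing can be sketched like this (names and details are illustrative; the 0.82 default mirrors `dedup_threshold` from config.yaml):

```python
import numpy as np

def intra_tick_dedup(vecs: np.ndarray, scores: list, threshold: float = 0.82) -> list:
    """Greedy dedup sketch: walk articles from highest score down,
    keeping one only if it is below `threshold` cosine similarity
    to every article already kept. Returns indices of survivors."""
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    order = np.argsort(scores)[::-1]   # highest score first
    kept = []
    for i in order:
        if all(normed[i] @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept
```

Because the walk is in descending score order, the representative kept for each cluster of near-duplicates is always the highest-scored one, matching the behavior described in "How It Works".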

Requirements

  • Python 3.9+
  • PyTorch (auto-detects MPS on Apple Silicon, CUDA on Linux)
  • Dependencies: feedparser, sentence-transformers, slack-sdk, apscheduler, pyyaml
