News Intelligence Pipeline

Crawls RSS/Atom feeds, scores articles against configured topics using local embeddings, de-duplicates by news event, and posts to Slack channels.

config.yaml → Crawler → Embedder → Scorer → Deduplicator → Slack Publisher
                                       ↕
                                   SQLite DB

How It Works

  1. Crawl — Fetches RSS/Atom feeds in parallel (including Google News search feeds for 30-day backfill)
  2. Embed — Encodes "{title}. {summary}" with all-MiniLM-L6-v2 (384-dim, runs on MPS/CUDA/CPU)
  3. Score — Mean-centered cosine similarity against topic embeddings. Centering subtracts the corpus mean so omnipresent terms (like "AI") are down-weighted and only topic-specific language drives relevance.
  4. Dedup — Two phases:
    • Cross-tick: suppresses articles similar to ones posted in the last N hours
    • Intra-tick: greedy clustering so only the highest-scored representative of each news event is posted
  5. Publish — Posts to Slack via Block Kit with rate-limit-aware retries
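The mean-centered scoring in step 3 can be sketched as follows. This is an illustrative sketch, not the repo's actual `scorer.py`; the function name is made up, and whether topic embeddings are centered with the same corpus mean is an assumption:

```python
import numpy as np

def score_articles(article_vecs: np.ndarray, topic_vecs: np.ndarray) -> np.ndarray:
    """Mean-centered cosine similarity (illustrative sketch).

    article_vecs: (n_articles, 384) article embeddings
    topic_vecs:   (n_topics, 384) topic embeddings
    Returns each article's best similarity across topics.
    """
    # Subtract the corpus mean so language shared by every article
    # (e.g. "AI") stops dominating the similarity.
    center = article_vecs.mean(axis=0)
    a = article_vecs - center
    t = topic_vecs - center
    # Cosine similarity after centering.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    sims = a @ t.T                    # (n_articles, n_topics)
    return sims.max(axis=1)           # best topic match per article
```

Articles whose best score falls below `relevance_threshold` would then be dropped before dedup.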

Quick Start

# Setup (Mac — gets MPS acceleration automatically)
make setup

# Set your Slack bot token
export SLACK_BOT_TOKEN=xoxb-...

# Test run — logs what would be posted without hitting Slack
make dry-run

# Single tick with real posting
make run-once

# Scheduled loop (default: every 15 minutes)
make run

# Run tests
make test

Configuration

Edit config.yaml:

global:
  crawl_interval_minutes: 15
  dedup_threshold: 0.82          # similarity to consider duplicate
  dedup_window_hours: 72         # look-back window for cross-tick dedup
  relevance_threshold: 0.35      # default, overridable per channel
  max_articles_per_channel_per_tick: 10
  database_path: "data/news_intel.db"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"

slack:
  bot_token: "${SLACK_BOT_TOKEN}"  # env var interpolation
  post_delay_seconds: 1.5

channels:
  - name: "ai-energy"
    slack_channel: "#ai-energy"
    relevance_threshold: 0.38      # channel override
    topics:
      - "Autonomous AI control systems for industrial energy optimization"
      - "AI-driven meteorology for renewable grid stability"
    feeds:
      - url: "https://cleantechnica.com/feed/"
        label: "CleanTechnica"
      - url: "https://news.google.com/rss/search?q=AI+%22energy+optimization%22+when:30d&hl=en-US&gl=US&ceid=US:en"
        label: "GNews: AI energy optimization"
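The `${SLACK_BOT_TOKEN}` value above relies on the env var interpolation that `config.py` is described as providing. A minimal sketch of that behavior (the regex and error handling here are assumptions, not the repo's actual code):

```python
import os
import re

_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")

def interpolate_env(value: str) -> str:
    """Replace ${VAR} with os.environ['VAR'] in a config string (sketch)."""
    def _sub(match: re.Match) -> str:
        name = match.group(1)
        if name not in os.environ:
            # Failing loudly beats posting with a literal "${...}" token.
            raise KeyError(f"config references unset env var {name}")
        return os.environ[name]
    return _ENV_PATTERN.sub(_sub, value)
```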

Google News RSS search feeds (news.google.com/rss/search?q=<query>+when:30d) are useful for niche topics that general feeds don't cover well. Note that the when:30d operator, which limits results to the last 30 days, is part of the q query parameter rather than a separate parameter.
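Such a feed URL can be built with the standard library; this is a hypothetical helper, not part of the repo:

```python
from urllib.parse import urlencode

def gnews_search_feed(query: str, days: int = 30) -> str:
    """Build a Google News RSS search-feed URL (illustrative helper).

    The when:<N>d recency operator goes inside the q parameter.
    """
    params = {
        "q": f"{query} when:{days}d",
        "hl": "en-US",
        "gl": "US",
        "ceid": "US:en",
    }
    return "https://news.google.com/rss/search?" + urlencode(params)
```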

Docker (Linux/CUDA)

# Build and run with GPU
make docker-build
SLACK_BOT_TOKEN=xoxb-... make docker-run

# One-off run
make docker-run-once

# Logs
make docker-logs

Requires nvidia-container-toolkit for GPU access.

Project Structure

src/
├── main.py          # Entrypoint: scheduler, signal handling, pipeline orchestration
├── config.py        # YAML loading with ${ENV_VAR} interpolation
├── models.py        # Dataclasses: RawArticle, EmbeddedArticle, ScoredArticle, etc.
├── crawler.py       # Parallel RSS/Atom fetching via feedparser + ThreadPoolExecutor
├── embedder.py      # SentenceTransformer singleton, batch encoding
├── scorer.py        # Mean-centered cosine similarity scoring
├── dedup.py         # Cross-tick sliding window + intra-tick greedy clustering
├── publisher.py     # Slack Block Kit formatting, rate-limit-aware posting
├── db.py            # Thread-safe SQLite: schema, queries, cleanup
└── utils.py         # Logging, HTML stripping, retry decorator
tests/
├── conftest.py      # Shared fixtures
├── test_config.py   # Config loading, env vars, validation
├── test_crawler.py  # Feed parsing, dedup, channel distribution
├── test_embedder.py # Text construction, mock model encoding
├── test_scorer.py   # Mean centering, filtering, thresholds
├── test_dedup.py    # Intra/cross-tick deduplication
└── test_publisher.py # Block Kit format, dry run, edge cases
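The intra-tick greedy clustering that `dedup.py` is described as implementing can be sketched like this (names and details are illustrative; the 0.82 default mirrors `dedup_threshold` from config.yaml):

```python
import numpy as np

def intra_tick_dedup(vecs: np.ndarray, scores: list, threshold: float = 0.82) -> list:
    """Greedy dedup sketch: walk articles from highest score down,
    keeping one only if it is below `threshold` cosine similarity
    to every article already kept. Returns indices of survivors."""
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    order = np.argsort(scores)[::-1]   # highest score first
    kept = []
    for i in order:
        if all(normed[i] @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept
```

Because the walk is in descending score order, the representative kept for each cluster of near-duplicates is always the highest-scored one, matching the behavior described in "How It Works".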

Requirements

  • Python 3.9+
  • PyTorch (auto-detects MPS on Apple Silicon, CUDA on Linux)
  • Dependencies: feedparser, sentence-transformers, slack-sdk, apscheduler, pyyaml
