Skip to content

muhammedafsalpm/Intelligent-Publisher-Website-Classifier

Repository files navigation

Intelligent Publisher Website Classifier

A FastAPI service that evaluates publisher websites for affiliate marketing campaigns using LLMs and RAG.


🚀 Setup Instructions

Prerequisites

  • Python 3.10+
  • OpenAI API key (or Ollama for local)

Installation

# Clone and enter directory
git clone <repo>
cd publisher-classifier

# Install dependencies
pip install -r requirements.txt

# Create environment file
cp .env.example .env
# Edit .env with your OpenAI API key

# Load policies into RAG (one-time setup)
python load_policies_manual.py

# Start the API
uvicorn app.main:app --reload

Test the API

curl -X POST http://localhost:8000/api/classify \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.amazon.com"}'

📊 Sample Outputs (5 URLs)

1. E-commerce (Flipkart)

{
  "url": "https://www.flipkart.com/",
  "is_cashback_site": false,
  "is_adult_content": false,
  "is_gambling": false,
  "is_agency_or_introductory": false,
  "is_scam_or_low_quality": false,
  "overall_score": 90,
  "summary": "The website is a legitimate e-commerce platform with original content, clear business model, and all positive indicators present.",
  "confidence": "high",
  "scraped_content_length": 10000,
  "classification_time_ms": 11868,
  "rag_chunks_used": 4,
  "llm_provider": "openai",
  "cache_hit": false
}

2. Cashback (Rakuten)

{
  "url": "https://www.rakuten.com/",
  "is_cashback_site": true,
  "is_adult_content": false,
  "is_gambling": false,
  "is_agency_or_introductory": false,
  "is_scam_or_low_quality": false,
  "overall_score": 90,
  "summary": "The site is a legitimate cashback platform with clear terms, user engagement, and positive community feedback.",
  "confidence": "high",
  "scraped_content_length": 1386,
  "classification_time_ms": 9853,
  "rag_chunks_used": 4,
  "llm_provider": "openai",
  "cache_hit": false
}

3. Gambling (Rummy Circle)

{
  "url": "https://www.rummycircle.com/",
  "is_cashback_site": false,
  "is_adult_content": false,
  "is_gambling": true,
  "is_agency_or_introductory": false,
  "is_scam_or_low_quality": false,
  "overall_score": 30,
  "summary": "The site is classified as gambling, which falls under unacceptable policy violations.",
  "confidence": "high",
  "scraped_content_length": 566,
  "classification_time_ms": 6765,
  "rag_chunks_used": 4,
  "llm_provider": "openai",
  "cache_hit": false
}

4. Tech Company (IBM)

{
  "url": "https://www.ibm.com/in-en",
  "is_cashback_site": false,
  "is_adult_content": false,
  "is_gambling": false,
  "is_agency_or_introductory": false,
  "is_scam_or_low_quality": false,
  "overall_score": 90,
  "summary": "The site is a legitimate technology company offering high-quality software and services with all necessary trust signals present.",
  "confidence": "high",
  "scraped_content_length": 2357,
  "classification_time_ms": 8222,
  "rag_chunks_used": 4,
  "llm_provider": "openai",
  "cache_hit": false
}
{
  "url": "http://netflix-support-billing.com/",
  "is_cashback_site": false,
  "is_adult_content": false,
  "is_gambling": false,
  "is_agency_or_introductory": true,
  "is_scam_or_low_quality": true,
  "overall_score": 0,
  "summary": "Classification stalled due to technical error: [Errno 11001] getaddrinfo failed",
  "confidence": "low",
  "scraped_content_length": 0,
  "classification_time_ms": 0,
  "rag_chunks_used": 0,
  "llm_provider": "openai",
  "cache_hit": false
}

🧠 RAG Design Decisions

Chunking Strategy

Strategy: Paragraph-based semantic chunking with section headers, 800 characters per chunk, 100 character overlap.

def _create_retrieval_units(self, text: str, title: str) -> List[str]:
    paras = [p.strip() for p in text.split('\n\n') if p.strip()]
    units, current, c_len = [], [], 0
    for p in paras:
        if c_len + len(p) > 800 and current:
            units.append(f"[{title}]\n" + "\n\n".join(current))
            current, c_len = [], 0
        current.append(p)
        c_len += len(p)
    if current:
        units.append(f"[{title}]\n" + "\n\n".join(current))
    return units
Decision Rationale
Paragraph boundaries Preserves semantic coherence; each chunk contains complete ideas
800-character chunks Balances granularity vs context; optimal for embedding models
100-character overlap Prevents boundary fragmentation; ensures concepts aren't lost
Section headers Maintains policy category context for better retrieval

Policy Store vs Direct Prompt

In Policy Store (RAG) In Direct Prompt
Detailed classification rules Task definition
Edge cases and examples Output format (JSON structure)
Category-specific guidelines Basic instructions
Score ranges per category System persona

Why This Separation:

  • Dynamic relevance - Only relevant policies are injected (gambling sites get gambling rules)
  • Token efficiency - 1000 tokens instead of 3000+ per request
  • Maintainability - Update policies by editing markdown file; no code changes
  • Scalability - Add new categories without modifying core logic

🤖 Agentic Approach: Is It Meaningfully Better?

Architecture:

  1. Agent 1 (Researcher) - Extracts signals: category, quality, trust signals, risk indicators
  2. Agent 2 (Classifier) - Uses signals + retrieved policies for final decision

Yes, It's Better Because:

Benefit Explanation
Explainability We can trace WHY a site was classified (signals detected, policies applied)
Better edge cases Researcher distinguishes "gambling affiliate" vs actual casino
Debugging Isolate errors: signals wrong? policies wrong? classification wrong?
Token efficiency Classifier gets compressed signals, not raw HTML
Auditability Complete log of signals and policies for each classification

Trade-off: Adds ~0.5-1s latency (2 LLM calls vs 1)

Verdict: Worth it for production where accuracy and explainability matter.


🏭 Part 4: Production Thinking

1. 500 URLs/Day - What Breaks First? How Do You Fix It?

What Breaks First: LLM API rate limits and cost

Problem Fix
Rate limits exceeded Redis cache (24h TTL) → 70% hit rate reduces calls by 70%
High costs Batch processing during off-peak hours; fallback to Ollama for low-confidence
Scraping timeouts Queue system (Celery) with exponential backoff
API downtime Circuit breakers; fallback to cached results

Architecture: Load balancer → API instances (3) → Redis cache → Celery queue → LLM providers

2. Improving 78% Accuracy - Next 3 Experiments

# Experiment Expected Gain
1 Few-shot examples - Add 5 examples per category to prompt +8-10%
2 Ensemble classification - Combine GPT-4o-mini + Claude + Llama, majority vote +5-7%
3 Active learning - Human review of low-confidence predictions; add to training +3-5%

Example few-shot:

Examples:
- Amazon.com → ecommerce, score 95
- Rakuten.com → cashback, score 78
- FanDuel.com → gambling, score 0

3. Handling Incorrect Flag Complaints (Adult Content Example)

Operational Process:

Step Timeline Action
1 0-15 min Invalidate cache; reclassify with higher temperature; auto-resolve if changed
2 24 hours Human review: visit site, check signals, review policies, determine root cause
3 24-48 hours Update policies if edge case; add to few-shot examples; notify publisher
4 Weekly Analyze all disputes; update training data; measure false positive rate

Root Cause Analysis:

  • LLM hallucination → Add corrected example to few-shot
  • Overly broad policy → Update policy with edge case
  • Scraper error → Fix extraction for site type
  • Missing edge case → Add new policy section

Communication Template (False Positive):

Subject: Your Classification Has Been Corrected ✓

Dear Publisher,

We've reviewed [URL] and confirmed the adult content flag was incorrect.

Root Cause: [e.g., "Mainstream dating app misidentified"]
Fix Applied: [e.g., "Policy updated to distinguish dating apps"]

Your site is now correctly classified with score [85]/100.

We apologize for the inconvenience.

📁 Project Structure

publisher-classifier/
├── app/
│   ├── main.py                 # FastAPI app
│   ├── config.py               # Configuration
│   ├── models.py               # Pydantic models
│   ├── api/
│   │   ├── classify.py         # POST /classify endpoint
│   │   └── admin.py            # Policy management endpoints
│   ├── services/
│   │   ├── classifier.py       # Main pipeline orchestrator
│   │   ├── scraper.py          # Website scraping
│   │   ├── rag.py              # ChromaDB policy store
│   │   ├── signal_extractor.py # Agent 1: Researcher
│   │   └── llm/
│   │       ├── client.py       # Unified LLM client
│   │       ├── factory.py      # Provider factory
│   │       ├── openai_client.py
│   │       └── ollama_client.py
│   ├── policies/
│   │   └── classification_policies.md
│   └── utils/
│       ├── cache.py            # Redis/memory cache
│       └── logger.py           # Structured logging
├── tests/
├── requirements.txt
├── .env.example
├── load_policies_manual.py
└── README.md

🚀 Quick Commands

# Load policies (first time only)
python load_policies_manual.py

# Start API
uvicorn app.main:app --reload

# Test classification
curl -X POST http://localhost:8000/api/classify \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.amazon.com"}'

# Health check
curl http://localhost:8000/health

# Upload new policy
curl -X POST http://localhost:8000/rag-policy-store/upload \
  -F "file=@new_policy.md"

About

FastAPI service that classifies publisher websites for affiliate campaigns using an LLM pipeline (scrape → signals → RAG → scoring). Detects cashback, adult, gambling, scams. Supports OpenAI/Ollama, Redis cache, Docker.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors