A FastAPI service that evaluates publisher websites for affiliate marketing campaigns using LLMs and RAG.
- Python 3.10+
- OpenAI API key (or Ollama for local)
# Clone and enter directory
git clone <repo>
cd publisher-classifier
# Install dependencies
pip install -r requirements.txt
# Create environment file
cp .env.example .env
# Edit .env with your OpenAI API key
# Load policies into RAG (one-time setup)
python load_policies_manual.py
# Start the API
uvicorn app.main:app --reloadcurl -X POST http://localhost:8000/api/classify \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com"}'{
"url": "https://www.flipkart.com/",
"is_cashback_site": false,
"is_adult_content": false,
"is_gambling": false,
"is_agency_or_introductory": false,
"is_scam_or_low_quality": false,
"overall_score": 90,
"summary": "The website is a legitimate e-commerce platform with original content, clear business model, and all positive indicators present.",
"confidence": "high",
"scraped_content_length": 10000,
"classification_time_ms": 11868,
"rag_chunks_used": 4,
"llm_provider": "openai",
"cache_hit": false
}{
"url": "https://www.rakuten.com/",
"is_cashback_site": true,
"is_adult_content": false,
"is_gambling": false,
"is_agency_or_introductory": false,
"is_scam_or_low_quality": false,
"overall_score": 90,
"summary": "The site is a legitimate cashback platform with clear terms, user engagement, and positive community feedback.",
"confidence": "high",
"scraped_content_length": 1386,
"classification_time_ms": 9853,
"rag_chunks_used": 4,
"llm_provider": "openai",
"cache_hit": false
}{
"url": "https://www.rummycircle.com/",
"is_cashback_site": false,
"is_adult_content": false,
"is_gambling": true,
"is_agency_or_introductory": false,
"is_scam_or_low_quality": false,
"overall_score": 30,
"summary": "The site is classified as gambling, which falls under unacceptable policy violations.",
"confidence": "high",
"scraped_content_length": 566,
"classification_time_ms": 6765,
"rag_chunks_used": 4,
"llm_provider": "openai",
"cache_hit": false
}{
"url": "https://www.ibm.com/in-en",
"is_cashback_site": false,
"is_adult_content": false,
"is_gambling": false,
"is_agency_or_introductory": false,
"is_scam_or_low_quality": false,
"overall_score": 90,
"summary": "The site is a legitimate technology company offering high-quality software and services with all necessary trust signals present.",
"confidence": "high",
"scraped_content_length": 2357,
"classification_time_ms": 8222,
"rag_chunks_used": 4,
"llm_provider": "openai",
"cache_hit": false
}5. Low Quality (http://netflix-support-billing.com)
{
"url": "http://netflix-support-billing.com/",
"is_cashback_site": false,
"is_adult_content": false,
"is_gambling": false,
"is_agency_or_introductory": true,
"is_scam_or_low_quality": true,
"overall_score": 0,
"summary": "Classification stalled due to technical error: [Errno 11001] getaddrinfo failed",
"confidence": "low",
"scraped_content_length": 0,
"classification_time_ms": 0,
"rag_chunks_used": 0,
"llm_provider": "openai",
"cache_hit": false
}Strategy: Paragraph-based semantic chunking with section headers, 800 characters per chunk, 100 character overlap.
def _create_retrieval_units(self, text: str, title: str) -> List[str]:
paras = [p.strip() for p in text.split('\n\n') if p.strip()]
units, current, c_len = [], [], 0
for p in paras:
if c_len + len(p) > 800 and current:
units.append(f"[{title}]\n" + "\n\n".join(current))
current, c_len = [], 0
current.append(p)
c_len += len(p)
if current:
units.append(f"[{title}]\n" + "\n\n".join(current))
return units| Decision | Rationale |
|---|---|
| Paragraph boundaries | Preserves semantic coherence; each chunk contains complete ideas |
| 800-character chunks | Balances granularity vs context; optimal for embedding models |
| 100-character overlap | Prevents boundary fragmentation; ensures concepts aren't lost |
| Section headers | Maintains policy category context for better retrieval |
| In Policy Store (RAG) | In Direct Prompt |
|---|---|
| Detailed classification rules | Task definition |
| Edge cases and examples | Output format (JSON structure) |
| Category-specific guidelines | Basic instructions |
| Score ranges per category | System persona |
Why This Separation:
- Dynamic relevance - Only relevant policies are injected (gambling sites get gambling rules)
- Token efficiency - 1000 tokens instead of 3000+ per request
- Maintainability - Update policies by editing markdown file; no code changes
- Scalability - Add new categories without modifying core logic
Architecture:
- Agent 1 (Researcher) - Extracts signals: category, quality, trust signals, risk indicators
- Agent 2 (Classifier) - Uses signals + retrieved policies for final decision
Yes, It's Better Because:
| Benefit | Explanation |
|---|---|
| Explainability | We can trace WHY a site was classified (signals detected, policies applied) |
| Better edge cases | Researcher distinguishes "gambling affiliate" vs actual casino |
| Debugging | Isolate errors: signals wrong? policies wrong? classification wrong? |
| Token efficiency | Classifier gets compressed signals, not raw HTML |
| Auditability | Complete log of signals and policies for each classification |
Trade-off: Adds ~0.5-1s latency (2 LLM calls vs 1)
Verdict: Worth it for production where accuracy and explainability matter.
What Breaks First: LLM API rate limits and cost
| Problem | Fix |
|---|---|
| Rate limits exceeded | Redis cache (24h TTL) → 70% hit rate reduces calls by 70% |
| High costs | Batch processing during off-peak hours; fallback to Ollama for low-confidence |
| Scraping timeouts | Queue system (Celery) with exponential backoff |
| API downtime | Circuit breakers; fallback to cached results |
Architecture: Load balancer → API instances (3) → Redis cache → Celery queue → LLM providers
| # | Experiment | Expected Gain |
|---|---|---|
| 1 | Few-shot examples - Add 5 examples per category to prompt | +8-10% |
| 2 | Ensemble classification - Combine GPT-4o-mini + Claude + Llama, majority vote | +5-7% |
| 3 | Active learning - Human review of low-confidence predictions; add to training | +3-5% |
Example few-shot:
Examples:
- Amazon.com → ecommerce, score 95
- Rakuten.com → cashback, score 78
- FanDuel.com → gambling, score 0
Operational Process:
| Step | Timeline | Action |
|---|---|---|
| 1 | 0-15 min | Invalidate cache; reclassify with higher temperature; auto-resolve if changed |
| 2 | 24 hours | Human review: visit site, check signals, review policies, determine root cause |
| 3 | 24-48 hours | Update policies if edge case; add to few-shot examples; notify publisher |
| 4 | Weekly | Analyze all disputes; update training data; measure false positive rate |
Root Cause Analysis:
- LLM hallucination → Add corrected example to few-shot
- Overly broad policy → Update policy with edge case
- Scraper error → Fix extraction for site type
- Missing edge case → Add new policy section
Communication Template (False Positive):
Subject: Your Classification Has Been Corrected ✓
Dear Publisher,
We've reviewed [URL] and confirmed the adult content flag was incorrect.
Root Cause: [e.g., "Mainstream dating app misidentified"]
Fix Applied: [e.g., "Policy updated to distinguish dating apps"]
Your site is now correctly classified with score [85]/100.
We apologize for the inconvenience.
publisher-classifier/
├── app/
│ ├── main.py # FastAPI app
│ ├── config.py # Configuration
│ ├── models.py # Pydantic models
│ ├── api/
│ │ ├── classify.py # POST /classify endpoint
│ │ └── admin.py # Policy management endpoints
│ ├── services/
│ │ ├── classifier.py # Main pipeline orchestrator
│ │ ├── scraper.py # Website scraping
│ │ ├── rag.py # ChromaDB policy store
│ │ ├── signal_extractor.py # Agent 1: Researcher
│ │ └── llm/
│ │ ├── client.py # Unified LLM client
│ │ ├── factory.py # Provider factory
│ │ ├── openai_client.py
│ │ └── ollama_client.py
│ ├── policies/
│ │ └── classification_policies.md
│ └── utils/
│ ├── cache.py # Redis/memory cache
│ └── logger.py # Structured logging
├── tests/
├── requirements.txt
├── .env.example
├── load_policies_manual.py
└── README.md
# Load policies (first time only)
python load_policies_manual.py
# Start API
uvicorn app.main:app --reload
# Test classification
curl -X POST http://localhost:8000/api/classify \
-H "Content-Type: application/json" \
-d '{"url": "https://www.amazon.com"}'
# Health check
curl http://localhost:8000/health
# Upload new policy
curl -X POST http://localhost:8000/rag-policy-store/upload \
-F "file=@new_policy.md"