MinerU-HTML: An SLM-powered HTML main content extractor that outputs clean HTML bodies. Perfect for Deep Research Agents, RAG applications, and training data generation.
Updated Mar 27, 2026 - Python
pebkac Chrome Nonautomation - A Local LLM-Driven Web Co-Browser using Smolagents, Zendriver, Trafilatura.
Fast, accurate web content extraction in Rust. ML page-type classification, per-type extraction, confidence scoring. F1=0.966 on ScrapingHub (#1), F1=0.859 across 2,008 annotated pages (1,497 development + 511 held-out test
Web scraper in Python
Telegram Mini App that saves internet articles to read them later
Lean Python tool for extracting clean, LLM-optimized markdown from web pages. Handles dynamic content with Playwright + Trafilatura for maximum information extraction efficiency.
Tools for LLMs to anonymously search and browse the web
Selective web content extraction for AI agents — URL + query returns only the chunks that matter (Python library + MCP server)
ChatGPT AI Clone
A pipe-based news article scraping and metadata extraction library for Python
Real-time AI search and chat backend with WebSocket streaming, powered by Tavily web search and Google Gemini for Flutter apps.
Protocol for collecting and analyzing Wayback Machine archives for textual and statistical analysis
Tools for LLMs to anonymously search and browse the web
🕷️ Clean, chunked documentation crawler optimized for RAG & AnythingLLM. Dockerized.
🕵️‍♂️ Enable anonymous web searches for your LLM with the first-ever Model Context Protocol server utilizing Tor for secure and private information retrieval.
Trafilatura API for extracting content information from HTML
A web scraper with an LLM-powered document suggestion system that combines web crawling, data extraction, and advanced AI capabilities to recommend relevant documents.
FastAPI service that classifies publisher websites for affiliate campaigns using an LLM pipeline (scrape → signals → RAG → scoring). Detects cashback, adult, gambling, scams. Supports OpenAI/Ollama, Redis cache, Docker.