Convert any webpage to clean, LLM-friendly markdown. Powered by Crawl4AI.
LLMs work best with clean text, but the web is full of messy HTML, navigation menus, ads, and JavaScript-rendered content. This template gives you a simple API that crawls any URL—including pages that require JavaScript to load—and converts the content into clean markdown ready to feed into GPT-4, Claude, or any other model.
It includes a minimal web UI so you can try it out immediately—just paste a URL and see the markdown output. For production use, call the REST API from your code to integrate web content into your LLM pipelines.
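To call the API from code rather than curl, a minimal Python sketch using only the standard library; the base URL is a placeholder for your own deployed service:

```python
import json
from urllib import request

def build_crawl_request(base_url: str, url: str) -> request.Request:
    """Build a POST request for the /crawl endpoint.

    base_url is your deployed service, e.g. https://your-service.onrender.com
    (placeholder -- substitute your own).
    """
    body = json.dumps({"url": url}).encode("utf-8")
    return request.Request(
        f"{base_url.rstrip('/')}/crawl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it (requires the service to be running):
# with request.urlopen(build_crawl_request(base, "https://example.com")) as resp:
#     result = json.load(resp)
#     print(result["markdown"])
```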
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web UI for converting URLs to markdown |
| `/health` | GET | Health check |
| `/docs` | GET | Interactive API documentation (Swagger UI) |
| `/crawl` | POST | Crawl a single URL and return markdown |
| `/crawl/batch` | POST | Crawl multiple URLs concurrently (max 10) |
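Because `/crawl/batch` accepts at most 10 URLs per request, a larger list has to be split into conforming payloads first. A small sketch (the 10-URL cap comes from the table above; the helper itself is this example's assumption):

```python
def batch_payloads(urls: list[str], max_per_request: int = 10) -> list[list[str]]:
    """Split a URL list into JSON-ready bodies for POST /crawl/batch,
    each within the service's per-request limit."""
    return [urls[i:i + max_per_request]
            for i in range(0, len(urls), max_per_request)]
```

Each returned sublist can be sent as the JSON body of one `/crawl/batch` call.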
```bash
curl -X POST https://your-service.onrender.com/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

Response:
```json
{
  "url": "https://example.com",
  "title": "Example Domain",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples...",
  "word_count": 42,
  "success": true
}
```

```bash
curl -X POST https://your-service.onrender.com/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "include_raw": true,
    "filter_threshold": 0.5
  }'
```

Request options:
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | The URL to crawl |
| `include_raw` | bool | `false` | Include unfiltered raw markdown |
| `filter_threshold` | float | `0.4` | Content filter aggressiveness (0-1) |
| `wait_for_selector` | string | `null` | CSS selector to wait for before extraction |
| `js_code` | string | `null` | JavaScript to execute before extraction |
```bash
curl -X POST https://your-service.onrender.com/crawl/batch \
  -H "Content-Type: application/json" \
  -d '[
    "https://example.com",
    "https://httpbin.org/html"
  ]'
```

For pages that load content via JavaScript:
```bash
curl -X POST https://your-service.onrender.com/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/dynamic-page",
    "wait_for_selector": ".content-loaded",
    "js_code": "document.querySelector(\"button.load-more\").click()"
  }'
```

- Click **Use this template** to create your own copy of this repository
- In your Render Dashboard, create a new Blueprint
- Connect your new repository
- Render automatically detects and deploys from `render.yaml`
| Variable | Description | Default |
|---|---|---|
| `PORT` | Port for the web server | `8000` |
```bash
# Clone the repository
git clone https://github.com/render-examples/crawl4ai.git
cd crawl4ai

# Build and run with Docker
docker build -t url-to-markdown .
docker run -p 8000:8000 url-to-markdown

# Visit http://localhost:8000 for the web UI
```

Once you have this running, here are some ideas for what you could build:
- Add an LLM layer - Pipe the markdown into GPT-4 or Claude to summarize, extract entities, or answer questions about the content
- Build a RAG pipeline - Store crawled content in a vector database and build a chatbot that answers questions from your sources
- Create a monitoring service - Schedule regular crawls to track changes on competitor sites, documentation, or news sources
- Power an AI agent - Let your agent browse the web by calling this API to gather information for complex tasks
- Generate datasets - Collect domain-specific content for fine-tuning models or building evaluation sets
Crawl4AI also supports LLM-based extraction if you want structured data instead of markdown.
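For the LLM-layer and RAG ideas above, crawled markdown can easily exceed a model's context window. A rough word-budget trim before prompting (word counts are only an approximation of tokens, and the helper is this example's assumption, not part of the template):

```python
def trim_markdown(markdown: str, max_words: int = 3000) -> str:
    """Crude context-budget guard: keep the first max_words words
    and mark the cut so the prompt admits the truncation."""
    words = markdown.split()
    if len(words) <= max_words:
        return markdown
    return " ".join(words[:max_words]) + "\n\n[truncated]"
```

The trimmed text can then be dropped into a summarization or Q&A prompt; for anything beyond a quick prototype, a real tokenizer for your target model is a better budget check.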