
url-scraper

A lightweight, recursive web crawler that maps a static website, extracts all unique hyperlinks, categorizes them as internal or external, and outputs a redirect-ready CSV dataset.

Features

  • Recursive crawl starting from a root URL
  • Normalizes URLs: strips query parameters and fragments, so example.com/page?a=1 and example.com/page?a=2#top resolve to the same page (see the sketch after this list)
  • Categorizes links as Internal or External
  • Records HTTP status codes for every internal URL
  • MIME-type filtering: only parses text/html pages, skipping PDFs, images, and other binary content
  • Graceful error handling: logs timeouts, connection errors, and HTTP errors without crashing
  • Rate limiting with configurable delay between requests
  • Parallel crawling with configurable worker threads
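
The URL normalization and MIME-type filtering described above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual code: the helper names normalize_url and is_html are invented here, and requests is assumed as the HTTP client.

from urllib.parse import urlparse, urlunparse

import requests

def normalize_url(url: str) -> str:
    # Keep scheme, host, and path; drop params, query string, and fragment
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))

def is_html(response: requests.Response) -> bool:
    # Only pages served as text/html should be parsed for further links
    return response.headers.get("Content-Type", "").startswith("text/html")

# Both query-string variants collapse to the same page
assert normalize_url("https://example.com/page?a=1") == "https://example.com/page"
assert normalize_url("https://example.com/page?a=2#top") == "https://example.com/page"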

Output

Exports a results.csv file with columns:

| Column | Description |
| --- | --- |
| Original_URL | The full normalized URL |
| HTTP_Status_Code | HTTP response code, or TIMEOUT / CONNECTION_ERROR on failure |
| Link_Type | Internal or External |
| Found_On | The page where this URL was discovered (empty for the start URL) |
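
For illustration only, a small crawl might produce rows like the following (all URLs and status codes here are made up; the external row's status is blank because status codes are recorded for internal URLs):

Original_URL,HTTP_Status_Code,Link_Type,Found_On
https://example.com,200,Internal,
https://example.com/about,200,Internal,https://example.com
https://example.com/old-page,404,Internal,https://example.com/about
https://partner.example/,,External,https://example.com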

Page content (<domain>.json)

A <domain>.json file is saved alongside the CSV, containing extracted content for every internal text/html page, keyed by URL:

{
  "https://example.com/about": {
    "title": "About Us",
    "meta_description": "Learn more about our team.",
    "h1": "Our Story",
    "headings": ["Our Story", "The Team", "Contact"],
    "body_text": "Full visible text content of the page..."
  }
}
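
Because the file is keyed by URL, it is straightforward to consume downstream. A minimal sketch, assuming a crawl of example.com produced example.com.json:

import json

with open("example.com.json", encoding="utf-8") as f:
    pages = json.load(f)

# Print each page's title, e.g. for a quick content audit
for url, content in pages.items():
    print(url, "->", content["title"])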

Requirements

  • Python 3.11+
  • uv

Setup

uv sync

Usage

uv run main.py <url> [options]

Arguments

Crawl settings

| Flag | Default | Description |
| --- | --- | --- |
| url | (required) | Root URL to start crawling from |
| -o, --output | <domain>.csv | Output CSV file path |
| -w, --workers | 10 | Number of parallel worker threads |
| -d, --delay | 0.3 | Delay between requests per worker (seconds) |
| --timeout | 5.0 | HTTP request timeout (seconds) |
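
Note that workers and delay interact: with the defaults of 10 workers and a 0.3 s per-worker delay, the crawler can issue on the order of 10 / 0.3 ≈ 33 requests per second against the target. Lower -w or raise -d for a gentler crawl, as in the examples below.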

Output filters

| Flag | Description |
| --- | --- |
| --filter [all\|internal\|external] | Only include the specified link type in the CSV (default: all) |
| --errors-only | Only export URLs with error status codes (4xx, 5xx, timeouts) |
| --no-found-on | Omit the Found_On column from the CSV |
| --no-text | Skip page content extraction and do not write the JSON file |

Examples

# Basic crawl
uv run main.py https://example.com

# Save only internal URLs
uv run main.py https://example.com --filter internal

# Find all broken links
uv run main.py https://example.com --errors-only

# Export external links without the Found_On column
uv run main.py https://example.com --filter external --no-found-on -o external.csv

# Gentler crawl (fewer workers, longer delay)
uv run main.py https://example.com -w 5 -d 1.0

# Faster crawl with longer timeout for slow servers
uv run main.py https://example.com -w 20 --timeout 10

# Crawl without saving page content
uv run main.py https://example.com --no-text
