Version: v1.2.3 | Status: Active | Last Updated: March 2026
The scrape module is the web data extraction engine of Codomyrmex. It provides a unified interface for scraping web content, crawling websites, mapping site structures, and extracting structured data. It abstracts the complexities of different scraping providers (e.g., Firecrawl) behind a consistent Pythonic interface.
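A minimal quick-start sketch of the unified interface; the import path and default behavior shown here are assumptions based on the package layout, not verified against the module:

```python
from codomyrmex.scrape import Scraper  # assumed import path

scraper = Scraper()  # assumes a default ScrapeConfig / provider is used when none is given
result = scraper.scrape("https://example.com")
if result.success:
    print(result.content[:200])  # first part of the scraped content
```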
- Provider Abstraction: `BaseScraper` interface allows multiple provider implementations
- Adapter Pattern: Provider-specific adapters (e.g., `FirecrawlAdapter`) implement the core interface
- Pluggable Architecture: Easy to add new scraping providers without changing core code
- Strong Typing: All methods use type hints
- Structured Results: Standard result types (`ScrapeResult`, `CrawlResult`, etc.)
- Input Validation: Validation at API boundaries
- No Reinvention: Wraps existing services (Firecrawl) rather than reimplementing scraping logic
- Minimal State: Configuration is externalized, operations are stateless where possible
- Clear Abstractions: Simple, focused interfaces
- Multiple Formats: Support for markdown, HTML, JSON, screenshots, metadata
- Batch Operations: Efficient processing of multiple URLs
- Dynamic Content: Support for JavaScript-rendered content
- LLM Integration: AI-powered structured data extraction
- Unit Tests: Test core abstractions and adapters
- Integration Tests: Test end-to-end workflows (with controlled real URLs)
- Error Cases: Comprehensive error handling tests
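To make the provider abstraction concrete, here is a minimal sketch of what the `BaseScraper` contract could look like. The method names and signatures mirror the `Scraper` API documented later in this file; the actual abstract base class in the module may differ:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class BaseScraper(ABC):
    """Illustrative sketch of the provider contract; the real interface may differ."""

    @abstractmethod
    def scrape(self, url: str, options: Optional["ScrapeOptions"] = None) -> "ScrapeResult": ...

    @abstractmethod
    def crawl(self, url: str, options: Optional["ScrapeOptions"] = None) -> "CrawlResult": ...

    @abstractmethod
    def map(self, url: str, search: Optional[str] = None) -> "MapResult": ...

    @abstractmethod
    def search(self, query: str, options: Optional["ScrapeOptions"] = None) -> "SearchResult": ...

    @abstractmethod
    def extract(
        self,
        urls: List[str],
        schema: Optional[Dict[str, Any]] = None,
        prompt: Optional[str] = None,
    ) -> "ExtractResult": ...
```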
```mermaid
graph TD
subgraph sg_8cbf9f5f16 [User Code]
A[Scraper]
end
subgraph sg_6572b4b746 [Core Layer]
B[BaseScraper Interface]
C[ScrapeResult Types]
D[ScrapeOptions]
end
subgraph sg_b461fb937f [Provider Layer]
E[FirecrawlAdapter]
F[Future Adapters]
end
subgraph sg_6136584869 [Client Layer]
G[FirecrawlClient]
H[Future Clients]
end
subgraph sg_ec1bb9b527 [External Services]
I[Firecrawl API]
J[Future APIs]
end
A -->|implements| B
A -->|uses| E
E -->|implements| B
E -->|uses| G
G -->|calls| I
E -->|returns| C
A -->|takes| D
```
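The provider layer in the diagram is where pluggability happens: a new adapter only has to translate between the core types and its own client. The following skeleton is purely illustrative (a plain-HTTP provider that does not exist in the module), assuming the import path shown; only `scrape` is implemented to keep the sketch short:

```python
import requests  # hypothetical static-HTML provider, purely illustrative

from codomyrmex.scrape import BaseScraper, ScrapeResult  # assumed import path


class RequestsAdapter(BaseScraper):
    """Hypothetical provider: static HTML only, no JS rendering.

    The remaining BaseScraper methods (crawl, map, search, extract) are
    omitted for brevity; a real adapter would implement them all.
    """

    def scrape(self, url, options=None):
        timeout = options.timeout if options and options.timeout else 30
        try:
            resp = requests.get(url, timeout=timeout)
        except requests.RequestException as exc:
            # Failures are reported through the result type rather than raised
            return ScrapeResult(url=url, content="", formats={}, metadata={},
                                status_code=None, success=False, error=str(exc))
        return ScrapeResult(url=url, content=resp.text, formats={"html": resp.text},
                            metadata={}, status_code=resp.status_code,
                            success=resp.ok, error=None)
```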
- Single URL Scraping (see the combined example after this list)
  - Scrape individual web pages
  - Support multiple output formats (markdown, HTML, JSON, etc.)
  - Handle dynamic content with actions (click, scroll, wait)
  - Extract metadata (title, description, etc.)
- Website Crawling
  - Crawl entire websites starting from a URL
  - Control crawl depth and page limits
  - Respect robots.txt (configurable)
  - Return structured results for all pages
- Site Mapping
  - Discover all links on a website
  - Filter links by search term
  - Return link metadata (title, description, URL)
- Web Search
  - Search the web with queries
  - Optionally scrape search results
  - Control number of results
  - Return search results with content
- LLM Extraction
  - Extract structured data from URLs using AI/LLM
  - Support JSON schema definitions
  - Support prompt-based extraction
  - Handle multiple URLs (including wildcards)
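The sketch below exercises each capability once. It is illustrative only: the import path is assumed, and it assumes `ScrapeOptions` fields have defaults so that partial construction works:

```python
from codomyrmex.scrape import Scraper, ScrapeOptions  # assumed import path

scraper = Scraper()

# Single URL scraping with multiple output formats
page = scraper.scrape("https://example.com",
                      options=ScrapeOptions(formats=["markdown", "html"]))

# Website crawling with depth and page limits
crawl = scraper.crawl("https://example.com",
                      options=ScrapeOptions(max_depth=2, limit=25))

# Site mapping, optionally filtered by a search term
links = scraper.map("https://example.com", search="docs")

# Web search with a bounded number of results
hits = scraper.search("codomyrmex scraping module",
                      options=ScrapeOptions(limit=5))

# LLM extraction against a JSON schema
schema = {"type": "object", "properties": {"title": {"type": "string"}}}
data = scraper.extract(["https://example.com"], schema=schema)
```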
- Reliability: Robust error handling and retry logic
- Performance: Efficient batch operations and parallel processing where possible
- Security: API key management, rate limiting, robots.txt respect
- Documentation: Comprehensive API documentation and examples
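One way the reliability and security goals can be applied from the caller's side is shown below. The retry policy, the `api_key` configuration field, and the `FIRECRAWL_API_KEY` variable name are illustrative assumptions, not the module's built-in behavior:

```python
import os
import time

from codomyrmex.scrape import Scraper, ScrapeConfig              # assumed import path
from codomyrmex.scrape import ScrapeConnectionError, ScrapeTimeoutError  # assumed export location


def scrape_with_retry(url: str, attempts: int = 3, backoff: float = 2.0):
    # Keep the API key out of source; the variable name is an assumption
    scraper = Scraper(ScrapeConfig(api_key=os.environ.get("FIRECRAWL_API_KEY")))
    for attempt in range(1, attempts + 1):
        try:
            return scraper.scrape(url)
        except (ScrapeConnectionError, ScrapeTimeoutError):
            if attempt == attempts:
                raise                      # give up after the final attempt
            time.sleep(backoff * attempt)  # simple linear backoff between retries
```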
```python
class Scraper:
    # Main interface
    def __init__(self, config: Optional[ScrapeConfig] = None): ...

    # Scraping
    def scrape(self, url: str, options: Optional[ScrapeOptions] = None) -> ScrapeResult: ...

    # Crawling
    def crawl(self, url: str, options: Optional[ScrapeOptions] = None) -> CrawlResult: ...

    # Mapping
    def map(self, url: str, search: Optional[str] = None) -> MapResult: ...

    # Searching
    def search(self, query: str, options: Optional[ScrapeOptions] = None) -> SearchResult: ...

    # Extraction
    def extract(
        self,
        urls: List[str],
        schema: Optional[Dict[str, Any]] = None,
        prompt: Optional[str] = None,
    ) -> ExtractResult: ...
```
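Dynamic content is handled by passing action dictionaries through `ScrapeOptions.actions`. The action keys below mirror Firecrawl-style actions and are illustrative; the exact action format depends on the provider, and the import path is assumed:

```python
from codomyrmex.scrape import Scraper, ScrapeOptions  # assumed import path

scraper = Scraper()
options = ScrapeOptions(
    formats=["markdown", "screenshot"],
    actions=[
        {"type": "wait", "milliseconds": 2000},       # let JS-rendered content load
        {"type": "click", "selector": "#load-more"},  # trigger lazy-loaded sections
        {"type": "scroll", "direction": "down"},
    ],
    timeout=30.0,
)
result = scraper.scrape("https://example.com/feed", options=options)
```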
```python
@dataclass
class ScrapeResult:
    url: str
    content: str
    formats: Dict[str, Any]
    metadata: Dict[str, Any]
    status_code: Optional[int]
    success: bool
    error: Optional[str]


@dataclass
class ScrapeOptions:
    formats: List[ScrapeFormat | str]
    timeout: Optional[float]
    headers: Dict[str, str]
    actions: List[Dict[str, Any]]
    max_depth: Optional[int]
    limit: Optional[int]
    follow_links: bool
    respect_robots_txt: bool
```
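Consuming a result typically means checking `success` before reading the payload fields. The sketch below uses only the documented dataclass fields; the import path and the assumption that `formats` is keyed by format name are not verified against the module:

```python
from codomyrmex.scrape import Scraper  # assumed import path

scraper = Scraper()
result = scraper.scrape("https://example.com")

if not result.success:
    print(f"scrape failed ({result.status_code}): {result.error}")
else:
    print(result.metadata.get("title"))   # page metadata such as title/description
    html = result.formats.get("html")     # per-format payloads, assumed keyed by format name
    markdown = result.content             # primary content
```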
```python
class ScrapeError(CodomyrmexError): ...
class ScrapeConnectionError(ScrapeError): ...
class ScrapeTimeoutError(ScrapeError): ...
class ScrapeValidationError(ScrapeError): ...
class FirecrawlError(ScrapeError): ...
```

- `codomyrmex.logging_monitoring` - For logging
- `codomyrmex.exceptions` - Base exception classes
- `firecrawl-py` - Firecrawl Python SDK (optional, for Firecrawl provider)
- Unit Tests: Test core abstractions, data structures, configuration with real implementations
- Adapter Tests: Test FirecrawlAdapter with real FirecrawlClient (skipped if firecrawl-py unavailable)
- Integration Tests: Test end-to-end with real URLs and API calls (when API key available)
- Error Tests: Test error handling and exception translation with real error propagation
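A hedged sketch of what an error-handling test in this style could look like; the test name, the invalid input, and the assumption that boundary validation raises `ScrapeValidationError` are illustrative, and the real suite lives in the module's test directory:

```python
import pytest

from codomyrmex.scrape import Scraper, ScrapeValidationError  # assumed import path


def test_scrape_rejects_invalid_url():
    scraper = Scraper()
    # Input validation at the API boundary is assumed to surface a module-specific exception
    with pytest.raises(ScrapeValidationError):
        scraper.scrape("not-a-url")
```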
- Additional provider adapters (e.g., Scrapy, BeautifulSoup)
- Caching layer for scraped content
- Rate limiting and throttling
- Proxy support
- Custom user agents and headers per request
- Content filtering and transformation pipelines
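For the caching item above, one possible shape is a thin wrapper that memoizes results per URL. Nothing like this exists in the module yet; it is only a sketch of the idea:

```python
from functools import lru_cache


class CachingScraper:
    """Sketch: cache ScrapeResult objects per URL in front of a real Scraper."""

    def __init__(self, scraper):
        self._scraper = scraper
        self._scrape_cached = lru_cache(maxsize=256)(self._scraper.scrape)

    def scrape(self, url, options=None):
        if options is not None:               # option objects may not be hashable; bypass the cache
            return self._scraper.scrape(url, options)
        return self._scrape_cached(url)
```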
- Human Documentation: README.md
- Technical Documentation: AGENTS.md
- Functional Specification: SPEC.md
- Parent Directory: codomyrmex
- Repository Root: ../../../README.md
- Repository SPEC: ../../../SPEC.md