
Improve robustness, simplify consolidator API, and align docs/examples#110

Merged
jgarciab merged 28 commits into main from increase_robustness
Mar 9, 2026
Conversation

@jgarciab
Collaborator

Summary

  • simplify consolidator usage with default input/output resolution: Consolidator(target_folder_path=out).consolidate()
  • keep backward compatibility for explicit/legacy consolidator call patterns
  • harden config initialization/restore path handling for clean instance setup
  • remove CLI backend override flag and clarify backend behavior during init
  • sync and update featured notebook/docs/readme for consistent CLI vs library guidance
  • add and update tests for consolidator defaults/compatibility and config restore/init behavior

Validation

  • local test suite: 37 passed, 5 skipped
  • CLI scratch flow verified: init -> crawl -> extract -> consolidate
  • library scratch flow verified: Crawler -> Extractor -> Consolidator

Notes

  • backend selection is now instance-oriented in CLI (database mode during init; DuckDB preferred, SQLite fallback)
  • consolidator now defaults to latest extracted_data/*.ndjson and standard consolidated output path
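
The "latest extracted_data/*.ndjson" default can be sketched roughly as below. This is an illustrative sketch only, not websweep's actual implementation: the function name, error handling, and the use of modification time as the tiebreaker are assumptions.

```python
from pathlib import Path


def resolve_default_input(target_folder: str) -> Path:
    """Pick the most recently modified NDJSON file under extracted_data/.

    Hypothetical sketch of the default-input resolution described in this
    PR; the real consolidator may resolve its input differently.
    """
    candidates = sorted(
        Path(target_folder, "extracted_data").glob("*.ndjson"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError("no extracted_data/*.ndjson files found")
    return candidates[-1]  # newest file wins
```

With a default like this, `Consolidator(target_folder_path=out).consolidate()` needs no explicit input path, while explicit/legacy call patterns can still pass one.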

@jgarciab jgarciab self-assigned this Feb 26, 2026
@jgarciab jgarciab requested a review from vankesteren February 26, 2026 12:22
@vankesteren
Member

This is the first time I'm properly looking at websweep, and my conclusion is that it's a really awesome piece of research infrastructure 🚀 It should be used widely.

I tried it out on some lists of URLs and looked in detail at each component that was generated. I really like that the CLI is built using typer (from the FastAPI author), which gives nice help/docs in the terminal. I also really like the thought that has gone into the internal data structures and intermediate results of the steps. It looks like an excellent resource, especially for recurring sweeps of base domain names, which is a unique selling point that could be promoted more. Almost like archive.org.

Below are some comments. However, I suggest you turn some of these into issues to be solved after merging this PR. I think you can consider almost all of the comments optional (except the first one).

Overall comments

  • This is a huge PR, almost 30k LOC, because the entire Sphinx build output (docs/build) is included. Please exclude that directory from the PR.
  • Stuff like this file as part of the docs build feels like maintenance hell waiting to happen.

Infrastructure comments

  • Why are the date/timestamp fields in the Overview table in the DuckDB database not stored as actual dates and times? That would be much easier to work with and more fault-tolerant.
  • I also think it may be nice to separate HTTP status codes (which can be stored as integers) from other errors in this database, to allow things like SELECT * FROM Overview WHERE status < 300;
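
The suggested query becomes trivial once the status is a real integer column and non-HTTP failures live in their own column. A minimal sketch of the idea, using stdlib sqlite3 for portability (DuckDB accepts essentially the same DDL); all table and column names beyond Overview and status are assumptions, not websweep's actual schema:

```python
import sqlite3

# Hypothetical schema illustrating the suggestion: HTTP status codes as
# integers in their own column, separate from non-HTTP crawl errors.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Overview (
        url TEXT,
        status INTEGER,       -- NULL when the request never got a response
        error TEXT,           -- non-HTTP failure (timeout, DNS, ...)
        fetched_at TIMESTAMP  -- a real timestamp, not a formatted string
    )
""")
conn.executemany(
    "INSERT INTO Overview VALUES (?, ?, ?, ?)",
    [
        ("https://a.example", 200, None, "2026-03-01 10:00:00"),
        ("https://b.example", 404, None, "2026-03-01 10:00:05"),
        ("https://c.example", None, "DNS lookup failed", "2026-03-01 10:00:09"),
    ],
)
# Numeric status makes range queries work; rows with a NULL status
# (non-HTTP errors) simply drop out of the comparison.
ok = conn.execute("SELECT url FROM Overview WHERE status < 300").fetchall()
```
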

Documentation comments

  • I immediately got blocked on several domains: add a warning in the docs, or use a larger wait time by default? People will try this out; let them try it out safely, and only increase the number of requests per second in production.
  • The docs are not so human-friendly, even though they look good. Small example: people don't read the docs from front to back, so in "Library Quickstart" and "Library Workflow (detailed)", add some helpful comments in the code for the steps. Also, don't separate the quickstart and detailed sections; just merge them, because they are almost the same.
  • A question I had for a long time: what does the extractor do by default, and why do I get all these empty struct[0] fields when I do pl.read_ndjson("consolidated.ndjson")? I think I even got a "kvk" field by default, which I think is not needed? Explain this early and immediately point/link to the concept of extraction add-ons.
  • Similar comment for crawler: explain in a few sentences early what it does more specifically, and link to the URL filtering rules. On the landing page of the docs, you can spend a little more space for each step.
  • Add an image for the pipeline instead of the small monospaced text in the readthedocs landing page
  • In the documentation it's unclear how to set up recurring crawls using the CLI. In some parts it seems like this is automatic? Or should we set up a cronjob? For example, the section "Recurring CLI pattern (every X months)" states "Keep one configured instance and run the same sequence on each update cycle:". What does this even mean?
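
One plain reading of "run the same sequence on each update cycle" is an external scheduler such as cron driving one configured instance. A hypothetical crontab entry is sketched below; the websweep command name and subcommands follow the CLI flow in this PR (init is run once up front, not in the job), but the exact invocation and any flags are assumptions, so check the actual CLI help:

```cron
# Hypothetical: re-run the sequence at 03:00 on the 1st of every third
# month, inside a previously configured instance directory.
0 3 1 */3 * cd /data/websweep-instance && websweep crawl && websweep extract && websweep consolidate
```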

@jgarciab
Collaborator Author

jgarciab commented Mar 9, 2026

  • docs/build: fixed.
  • clean_apidoc.py: removed

Infrastructure comments

  • DuckDB date/timestamp fields: agreed, not implemented yet.
  • Separating HTTP status codes from non-HTTP errors: also agreed, not implemented yet.

Documentation comments

  • README.md: added focus on research.
  • Blocking / rate limiting: implemented; less concurrency, longer waits.
  • Library Quickstart vs detailed workflow: merged
  • Extractor defaults / add-ons: partly addressed.
  • Crawler explanation / filtering rules: implemented partially
  • Pipeline image on the landing page: implemented, though not super pretty.
  • Recurring CLI runs: implemented. example provided
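
On the extractor-defaults question above: since consolidated NDJSON carries a fixed schema, add-on fields that were never enabled come back empty for every record. A small stdlib-only sketch of how a user could spot such columns without polars; the sample records and field names here (including "kvk") are illustrative, not websweep's actual output:

```python
import json

# Hypothetical consolidated.ndjson contents: the extractor emits a fixed
# schema, so add-on fields that were never enabled show up empty.
lines = [
    '{"url": "https://a.example", "emails": ["x@a.example"], "kvk": []}',
    '{"url": "https://b.example", "emails": [], "kvk": []}',
]
records = [json.loads(line) for line in lines]

# Fields that are empty in every record are candidates for unused add-ons.
always_empty = [
    key for key in records[0]
    if all(not rec.get(key) for rec in records)
]
```
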

@jgarciab jgarciab merged commit 3d1eeb5 into main Mar 9, 2026
12 checks passed
