
Improve robustness, simplify consolidator API, and align docs/examples#110

Merged
jgarciab merged 28 commits into main from increase_robustness
Mar 9, 2026
Conversation

@jgarciab
Collaborator

Summary

  • simplify consolidator usage with default input/output resolution: Consolidator(target_folder_path=out).consolidate()
  • keep backward compatibility for explicit/legacy consolidator call patterns
  • harden config initialization/restore path handling for clean instance setup
  • remove CLI backend override flag and clarify backend behavior during init
  • sync and update featured notebook/docs/readme for consistent CLI vs library guidance
  • add and update tests for consolidator defaults/compatibility and config restore/init behavior

Validation

  • local test suite: 37 passed, 5 skipped
  • CLI scratch flow verified: init -> crawl -> extract -> consolidate
  • library scratch flow verified: Crawler -> Extractor -> Consolidator

Notes

  • backend selection is now instance-oriented in CLI (database mode during init; DuckDB preferred, SQLite fallback)
  • consolidator now defaults to latest extracted_data/*.ndjson and standard consolidated output path
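
The "latest extracted_data/*.ndjson" default can be sketched roughly as below. This is an illustrative sketch only, not websweep's actual implementation: the function name, error handling, and the use of modification time as the tiebreaker are assumptions.

```python
from pathlib import Path


def resolve_default_input(target_folder: str) -> Path:
    """Pick the most recently modified NDJSON file under extracted_data/.

    Hypothetical sketch of the default-input resolution described in this
    PR; the real consolidator may resolve its input differently.
    """
    candidates = sorted(
        Path(target_folder, "extracted_data").glob("*.ndjson"),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError("no extracted_data/*.ndjson files found")
    return candidates[-1]  # newest file wins
```

With a default like this, `Consolidator(target_folder_path=out).consolidate()` needs no explicit input path, while explicit/legacy call patterns can still pass one.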

@jgarciab jgarciab self-assigned this Feb 26, 2026
@jgarciab jgarciab requested a review from vankesteren February 26, 2026 12:22
@vankesteren
Member

This is the first time I'm properly looking at websweep, and my conclusion is that it's a really awesome piece of research infrastructure 🚀 It should be used widely.

I tried it out on some lists of URLs and looked in detail at each component that was generated. I really like that the CLI is built using typer (from the FastAPI author), which gives nice help/docs in the terminal. I also really like the thought that has gone into the internal data structures and intermediate results of the steps. It looks like an excellent resource, especially for recurring sweeps of base domain names, which is a unique selling point that could be promoted more. Almost like archive.org.

Below are some comments. However, I suggest you turn some of these into issues to be solved after merging this PR. I think you can consider almost all of the comments optional (except the first one).

Overall comments

  • This is a huge PR, almost 30k LOC, because the entire Sphinx build output (docs/build) is included. Please exclude that directory from the PR.
  • Stuff like this file as part of the docs build feels like maintenance hell waiting to happen.

Infrastructure comments

  • Why are the date/timestamp fields in the Overview table in the DuckDB database not stored as actual dates and times? That would be much easier to work with and more fault-tolerant.
  • I also think it may be nice to separate HTTP status codes (which can be stored as integers) from other errors in this database, to allow things like SELECT * FROM Overview WHERE status < 300;
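
The suggested query becomes trivial once the status is a real integer column and non-HTTP failures live in their own column. A minimal sketch of the idea, using stdlib sqlite3 for portability (DuckDB accepts essentially the same DDL); all table and column names beyond Overview and status are assumptions, not websweep's actual schema:

```python
import sqlite3

# Hypothetical schema illustrating the suggestion: HTTP status codes as
# integers in their own column, separate from non-HTTP crawl errors.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Overview (
        url TEXT,
        status INTEGER,       -- NULL when the request never got a response
        error TEXT,           -- non-HTTP failure (timeout, DNS, ...)
        fetched_at TIMESTAMP  -- a real timestamp, not a formatted string
    )
""")
conn.executemany(
    "INSERT INTO Overview VALUES (?, ?, ?, ?)",
    [
        ("https://a.example", 200, None, "2026-03-01 10:00:00"),
        ("https://b.example", 404, None, "2026-03-01 10:00:05"),
        ("https://c.example", None, "DNS lookup failed", "2026-03-01 10:00:09"),
    ],
)
# Numeric status makes range queries work; rows with a NULL status
# (non-HTTP errors) simply drop out of the comparison.
ok = conn.execute("SELECT url FROM Overview WHERE status < 300").fetchall()
```
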

Documentation comments

  • I immediately got blocked on several domains: add a warning in the docs, or use a larger wait time by default? People will try this out; let them try it out safely, and only increase the number of requests per second in production.
  • The docs are not so human-friendly, even though they look good. Small example: people don't read the docs from front to back, so in "Library Quickstart" and "Library Workflow (detailed)", add some helpful comments in the code for the steps. Also, don't separate the quickstart and detailed sections; just merge them, because they are almost the same.
  • A question I had for a long time: what does the extractor do by default, and why do I get all these empty struct[0] fields when I do pl.read_ndjson("consolidated.ndjson")? I think I even got a "kvk" field by default, which I think is not needed? Explain this early and immediately point/link to the concept of extraction add-ons.
  • Similar comment for crawler: explain in a few sentences early what it does more specifically, and link to the URL filtering rules. On the landing page of the docs, you can spend a little more space for each step.
  • Add an image for the pipeline instead of the small monospaced text in the readthedocs landing page
  • In the documentation it's unclear how to set up recurring crawls using the CLI. In some parts it seems like this is automatic? Or should we set up a cronjob? For example, the section "Recurring CLI pattern (every X months)" states "Keep one configured instance and run the same sequence on each update cycle:". What does this even mean?
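
One plain reading of "run the same sequence on each update cycle" is an external scheduler such as cron driving one configured instance. A hypothetical crontab entry is sketched below; the websweep command name and subcommands follow the CLI flow in this PR (init is run once up front, not in the job), but the exact invocation and any flags are assumptions, so check the actual CLI help:

```cron
# Hypothetical: re-run the sequence at 03:00 on the 1st of every third
# month, inside a previously configured instance directory.
0 3 1 */3 * cd /data/websweep-instance && websweep crawl && websweep extract && websweep consolidate
```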

@jgarciab
Collaborator Author

jgarciab commented Mar 9, 2026

  • docs/build: fixed.
  • clean_apidoc.py: removed

Infrastructure comments

  • DuckDB date/timestamp fields: agreed, not implemented yet.
  • Separating HTTP status codes from non-HTTP errors: also agreed, not implemented yet.

Documentation comments

  • README.md: added focus on research.
  • Blocking / rate limiting: implemented; less concurrency, longer waits.
  • Library Quickstart vs detailed workflow: merged
  • Extractor defaults / add-ons: partly addressed.
  • Crawler explanation / filtering rules: implemented partially
  • Pipeline image on the landing page: implemented, though not super pretty.
  • Recurring CLI runs: implemented. example provided
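
On the extractor-defaults question above: since consolidated NDJSON carries a fixed schema, add-on fields that were never enabled come back empty for every record. A small stdlib-only sketch of how a user could spot such columns without polars; the sample records and field names here (including "kvk") are illustrative, not websweep's actual output:

```python
import json

# Hypothetical consolidated.ndjson contents: the extractor emits a fixed
# schema, so add-on fields that were never enabled show up empty.
lines = [
    '{"url": "https://a.example", "emails": ["x@a.example"], "kvk": []}',
    '{"url": "https://b.example", "emails": [], "kvk": []}',
]
records = [json.loads(line) for line in lines]

# Fields that are empty in every record are candidates for unused add-ons.
always_empty = [
    key for key in records[0]
    if all(not rec.get(key) for rec in records)
]
```
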

@jgarciab jgarciab merged commit 3d1eeb5 into main Mar 9, 2026
12 checks passed
