Replace page classifier with dit, add -fpt flag#2404
Replace page classifier with dit, add -fpt flag#2404dogancanbakir wants to merge 6 commits intodevfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 WalkthroughReplaces the internal Naive Bayes page classifier with the external Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI / Flags
participant Runner as Runner
participant Dit as dit.Classifier
participant KB as KnowledgeBase
CLI->>Runner: start enumeration (flags include -fpt / outputs)
Runner->>Dit: ExtractPageType(headlessBody, body)
alt dit available
Dit-->>Runner: {PageType, Forms}
Runner->>KB: populate entry with PageType, Forms, pHash
else classifier nil or error
Dit-->>Runner: error / nil
Runner->>KB: populate entry with pHash (no PageType/Forms)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 6✅ Passed checks (6 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
runner/runner.go (1)
717-722:⚠️ Potential issue | 🟡 MinorFix unchecked error return from
finput.Close().The pipeline failure indicates that the error return value of
finput.Closeis not checked. This violates theerrchecklinter rule.🐛 Proposed fix
defer finput.Close() + defer func() { + if err := finput.Close(); err != nil { + gologger.Warning().Msgf("Could not close input file '%s': %s\n", r.options.InputFile, err) + } + }() - defer finput.Close()Alternatively, since
deferis already used, you can suppress the linter for this specific case if the error is intentionally ignored:- defer finput.Close() + defer finput.Close() //nolint:errcheck // best-effort close
🤖 Fix all issues with AI agents
In `@README.md`:
- Line 65: Update the Go version note for the httpx installation: replace or
make the `>=1.25.0` requirement in the README (the string "`>=1.25.0`" that
currently references Go for `httpx`) conditional or placeholder until the
correct Go version is finalized, and ensure consistency with the `go.mod` module
declaration (`go` directive) by updating both the README text and the `go.mod`
`go` version to the final approved Go version once determined.
| # Installation Instructions | ||
|
|
||
| `httpx` requires **go >=1.24.0** to install successfully. Run the following command to get the repo: | ||
| `httpx` requires **go >=1.25.0** to install successfully. Run the following command to get the repo: |
There was a problem hiding this comment.
Go version requirement should be updated once Go version is corrected.
The >=1.25.0 requirement aligns with go.mod, but both reference a non-existent Go version. Update this once the correct Go version is determined.
🤖 Prompt for AI Agents
In `@README.md` at line 65, Update the Go version note for the httpx installation:
replace or make the `>=1.25.0` requirement in the README (the string
"`>=1.25.0`" that currently references Go for `httpx`) conditional or
placeholder until the correct Go version is finalized, and ensure consistency
with the `go.mod` module declaration (`go` directive) by updating both the
README text and the `go.mod` `go` version to the final approved Go version once
determined.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@runner/runner.go`:
- Around line 655-669: The PageType value stored in Runner.classifyPage may be a
non-string type and later code does resp.KnowledgeBase["PageType"].(string),
causing a panic; change the insertion so PageType is coerced to a stable string
(e.g., use fmt.Sprint or fmt.Sprintf("%v", result.Type)) when setting
kb["PageType"] = ..., keep the existing nil/empty handling and ensure Forms
logic is unchanged so downstream filters can safely type-assert to string.
🧹 Nitpick comments (2)
runner/runner.go (2)
433-438: Initialization fallback is OK, but consider gating/noise + explicit nil on error.Proceeding with a nil classifier is fine, but:
- This warning will show even when page-type classification is irrelevant to the user’s chosen flags. Consider initializing
ditonly when-json/-csv/-fpt(or similar) is enabled to reduce surprise/noise.- Minor clarity: explicitly set
runner.ditClassifier = nilon error (so future refactors don’t accidentally use a partially-initialized value).
2639-2640: Consider preferringheadlessBody(when available) for classification accuracy.Right now KnowledgeBase classification uses
respDataeven whenscanopts.ScreenshotproducedheadlessBody. If the goal is better login/captcha/etc detection and form extraction on JS-rendered pages, you may want:
classifyPage(headlessBody, pHash)whenheadlessBody != ""- else fall back to
respDataThis is not strictly required for correctness, but it likely improves real-world detection rates.
f43056f to
9ba4754
Compare
dwisiswant0
left a comment
There was a problem hiding this comment.
Any deprecation process should be planned with a clear roadmap and timeline. Pulling support abruptly without giving users enough time to prepare/migrate is just going to cause UX frustration and/or unnecessary (downstream) breakage.
Proposed changes
Replace built-in Naive Bayes page classifier with dit (20 page types, 8 form types, 79 field types). Add
-fpt/-filter-page-typeflag for filtering by any page type(s). Deprecate-fepas alias for-fpt error.common/pagetypeclassifier/withdit.Classifier-fptflag (e.g.-fpt login,captcha,parked)-fepwith info messageFormswith form type and field classificationsCloses #2403
Proof
httpx -u https://github.com/login -json— KnowledgeBase showsPageType: login+ Forms-fpt loginfilters login pages,-fpt errorfilters error pages-fpt login,errorfilters multiple types, case-insensitive-fepbackward compat filters error pages + shows deprecation messagego build ./...andgo test ./...passChecklist
Summary by CodeRabbit
Chores
New Features
Deprecations
Removals
Documentation