Skip to content

Replace page classifier with dit, add -fpt flag#2404

Open
dogancanbakir wants to merge 6 commits intodevfrom
feat/dit-page-classifier
Open

Replace page classifier with dit, add -fpt flag#2404
dogancanbakir wants to merge 6 commits intodevfrom
feat/dit-page-classifier

Conversation

@dogancanbakir
Copy link
Member

@dogancanbakir dogancanbakir commented Feb 16, 2026

Proposed changes

Replace built-in Naive Bayes page classifier with dit (20 page types, 8 form types, 79 field types). Add -fpt/-filter-page-type flag for filtering by any page type(s). Deprecate -fep as alias for -fpt error.

  • Replace common/pagetypeclassifier/ with dit.Classifier
  • Add -fpt flag (e.g. -fpt login,captcha,parked)
  • Deprecate -fep with info message
  • KnowledgeBase now includes Forms with form type and field classifications
  • Bump Go to 1.25.7, update CI/CD workflows and Dockerfile

Closes #2403

Proof

  • httpx -u https://github.com/login -json — KnowledgeBase shows PageType: login + Forms
  • -fpt login filters login pages, -fpt error filters error pages
  • -fpt login,error filters multiple types, case-insensitive
  • -fep backward compat filters error pages + shows deprecation message
  • go build ./... and go test ./... pass

Checklist

  • Pull request is created against the dev branch
  • All checks passed (lint, unit/integration/regression tests etc.) with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Summary by CodeRabbit

  • Chores

    • Bumped Go toolchain to 1.25.7 and updated base builder image; refreshed dependency set and added a new classifier dependency.
  • New Features

    • Added -fpt / --filter-page-type flag to filter output by page type.
  • Deprecations

    • Deprecated -fep / --filter-error-page; kept for backward compatibility with a deprecation notice.
  • Removals

    • Removed the prior page-type classifier, its dataset, and related tests.
  • Documentation

    • README updated with new flag examples and Go requirement.

@auto-assign auto-assign bot requested a review from dwisiswant0 February 16, 2026 09:48
@coderabbitai
Copy link

coderabbitai bot commented Feb 16, 2026

No actionable comments were generated in the recent review. 🎉


Walkthrough

Replaces the internal Naive Bayes page classifier with the external dit classifier; removes common/pagetypeclassifier and its dataset/tests; adds --filter-page-type (-fpt) flag (keeps -fep as deprecated alias); updates runner integration to use dit, includes Forms in KnowledgeBase, and bumps Go to 1.25.

Changes

Cohort / File(s) Summary
Build & Toolchain
Dockerfile, go.mod
Bump Go to 1.25.7; add dit v0.0.9; update/remove several dependencies (remove html-to-markdown/dom/snowball).
Documentation
README.md
Raise Go requirement to >=1.25.0; add --filter-page-type/-fpt flag docs and example; reintroduce -fep with deprecation note pointing to -fpt.
Removed classifier package
common/pagetypeclassifier/...
common/pagetypeclassifier/dataset.txt, common/pagetypeclassifier/pagetypeclassifier.go, common/pagetypeclassifier/pagetypeclassifier_test.go
Delete old Naive Bayes classifier, dataset, and tests; remove PageTypeClassifier type, New(), Classify(), html-to-text helpers, and related tests.
CLI Flags / Options
runner/options.go
Add OutputFilterPageType goflags.StringSlice and --filter-page-type/-fpt; keep -fep as a deprecated alias that defaults to {"error"} when used.
Runner / Classifier integration
runner/runner.go
Replace pagetypeclassifier import/field with dit; rename field to ditClassifier; conditionally initialize dit when outputs/filters require it (warn on error, continue); add classifyPage(headlessBody, body, pHash) to call dit.ExtractPageType and include PageType and Forms in KnowledgeBase; update RunEnumeration to use new flow.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI / Flags
    participant Runner as Runner
    participant Dit as dit.Classifier
    participant KB as KnowledgeBase

    CLI->>Runner: start enumeration (flags include -fpt / outputs)
    Runner->>Dit: ExtractPageType(headlessBody, body)
    alt dit available
        Dit-->>Runner: {PageType, Forms}
        Runner->>KB: populate entry with PageType, Forms, pHash
    else classifier nil or error
        Dit-->>Runner: error / nil
        Runner->>KB: populate entry with pHash (no PageType/Forms)
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped from Bayes to shiny new Dit,
More page types, more forms—what a fit!
Flags reworded, the old one says "bye",
Knowledge grows—hop, skip, and classify! 🥕

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: replacing the page classifier with dit and adding the -fpt flag.
Linked Issues check ✅ Passed The PR implements all primary objectives from #2403: replaced Naive Bayes classifier with dit, added -fpt flag, deprecated -fep, removed pagetypeclassifier package, and updated KnowledgeBase.
Out of Scope Changes check ✅ Passed All changes are in scope: Go version bumps (1.24→1.25), Dockerfile/dependency updates, and dit integration are directly related to the classifier replacement objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Merge Conflict Detection ✅ Passed ✅ No merge conflicts detected when merging into dev

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/dit-page-classifier

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
runner/runner.go (1)

717-722: ⚠️ Potential issue | 🟡 Minor

Fix unchecked error return from finput.Close().

The pipeline failure indicates that the error return value of finput.Close is not checked. This violates the errcheck linter rule.

🐛 Proposed fix
 				defer finput.Close()
+				defer func() {
+					if err := finput.Close(); err != nil {
+						gologger.Warning().Msgf("Could not close input file '%s': %s\n", r.options.InputFile, err)
+					}
+				}()
-				defer finput.Close()

Alternatively, since defer is already used, you can suppress the linter for this specific case if the error is intentionally ignored:

-				defer finput.Close()
+				defer finput.Close() //nolint:errcheck // best-effort close
🤖 Fix all issues with AI agents
In `@README.md`:
- Line 65: Update the Go version note for the httpx installation: replace or
make the `>=1.25.0` requirement in the README (the string "`>=1.25.0`" that
currently references Go for `httpx`) conditional or placeholder until the
correct Go version is finalized, and ensure consistency with the `go.mod` module
declaration (`go` directive) by updating both the README text and the `go.mod`
`go` version to the final approved Go version once determined.

# Installation Instructions

`httpx` requires **go >=1.24.0** to install successfully. Run the following command to get the repo:
`httpx` requires **go >=1.25.0** to install successfully. Run the following command to get the repo:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Go version requirement should be updated once Go version is corrected.

The >=1.25.0 requirement aligns with go.mod, but both reference a non-existent Go version. Update this once the correct Go version is determined.

🤖 Prompt for AI Agents
In `@README.md` at line 65, Update the Go version note for the httpx installation:
replace or make the `>=1.25.0` requirement in the README (the string
"`>=1.25.0`" that currently references Go for `httpx`) conditional or
placeholder until the correct Go version is finalized, and ensure consistency
with the `go.mod` module declaration (`go` directive) by updating both the
README text and the `go.mod` `go` version to the final approved Go version once
determined.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@runner/runner.go`:
- Around line 655-669: The PageType value stored in Runner.classifyPage may be a
non-string type and later code does resp.KnowledgeBase["PageType"].(string),
causing a panic; change the insertion so PageType is coerced to a stable string
(e.g., use fmt.Sprint or fmt.Sprintf("%v", result.Type)) when setting
kb["PageType"] = ..., keep the existing nil/empty handling and ensure Forms
logic is unchanged so downstream filters can safely type-assert to string.
🧹 Nitpick comments (2)
runner/runner.go (2)

433-438: Initialization fallback is OK, but consider gating/noise + explicit nil on error.

Proceeding with a nil classifier is fine, but:

  • This warning will show even when page-type classification is irrelevant to the user’s chosen flags. Consider initializing dit only when -json/-csv/-fpt (or similar) is enabled to reduce surprise/noise.
  • Minor clarity: explicitly set runner.ditClassifier = nil on error (so future refactors don’t accidentally use a partially-initialized value).

2639-2640: Consider preferring headlessBody (when available) for classification accuracy.

Right now KnowledgeBase classification uses respData even when scanopts.Screenshot produced headlessBody. If the goal is better login/captcha/etc detection and form extraction on JS-rendered pages, you may want:

  • classifyPage(headlessBody, pHash) when headlessBody != ""
  • else fall back to respData

This is not strictly required for correctness, but it likely improves real-world detection rates.

@dogancanbakir dogancanbakir force-pushed the feat/dit-page-classifier branch from f43056f to 9ba4754 Compare February 16, 2026 10:35
Copy link
Member

@dwisiswant0 dwisiswant0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any deprecation process should be planned with a clear roadmap and timeline. Pulling support abruptly without giving users enough time to prepare/migrate is just going to cause UX frustration and/or unnecessary (downstream) breakage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace page classifier with dit, add -fpt flag

2 participants