
feat: add NER entity extraction and knowledge graph to docsaf #10

Merged
ajroetker merged 17 commits into main from feat/docsaf-ner-knowledge-graph on Mar 21, 2026

Summary

  • Adds GliNER2-based named entity recognition to the docsaf documentation sync tool via termite's /api/recognize endpoint
  • Extracted entities are stored as document metadata (faceted keyword search) and as nodes in a knowledge graph index (entity-based document traversal)
  • The graph index uses a field-based mentions_entity edge type that automatically creates edges from doc sections to the entity nodes they reference
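
To make the field-based edge idea concrete, here is a minimal Go sketch of deriving mentions_entity edges from an entity field on each doc section. The Section and Edge types and the mentionEdges function are hypothetical illustrations, not antfly's graph index implementation; only the edge type name and the entity key shape come from this PR.

```go
package main

import "fmt"

// Section is a hypothetical stand-in for a prepared doc section record.
type Section struct {
	Key      string
	Entities []string // entity keys of the form entity:<label>:<name>
}

// Edge is a hypothetical directed edge in the knowledge graph.
type Edge struct {
	From, Type, To string
}

// mentionEdges emits one mentions_entity edge per distinct entity key
// found in each section's Entities field.
func mentionEdges(sections []Section) []Edge {
	var edges []Edge
	for _, s := range sections {
		seen := make(map[string]bool)
		for _, e := range s.Entities {
			if seen[e] {
				continue // skip repeated mentions within one section
			}
			seen[e] = true
			edges = append(edges, Edge{From: s.Key, Type: "mentions_entity", To: e})
		}
	}
	return edges
}

func main() {
	sections := []Section{
		{Key: "docs/raft.md#overview", Entities: []string{"entity:technology:raft", "entity:technology:raft", "entity:concept:consensus"}},
	}
	for _, e := range mentionEdges(sections) {
		fmt.Printf("%s -[%s]-> %s\n", e.From, e.Type, e.To)
	}
}
```

In the PR itself the index is described as creating these edges automatically from the field, so client code like this would not be needed; the sketch only shows the mapping from field values to edges.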

New flags (prepare/sync)

  • --ner-model — Termite recognizer model (e.g., fastino/gliner2-base-v1)
  • --ner-label — Zero-shot entity labels (repeatable)
  • --ner-threshold — Confidence threshold (default 0.5)
  • --ner-batch-size — Texts per NER batch (default 32)
  • --termite-url — Termite API URL (default http://localhost:8088)
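
The flag set above could be wired up roughly as follows. This is a sketch using Go's standard flag package, with a custom flag.Value for the repeatable --ner-label; docsaf may well use a different CLI library, and the nerOptions struct, labelList type, parseNEROptions function, and error message are all assumptions. The defaults are the ones listed above.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
	"strings"
)

// labelList collects every occurrence of a repeatable string flag.
type labelList []string

func (l *labelList) String() string { return strings.Join(*l, ",") }

func (l *labelList) Set(v string) error {
	*l = append(*l, v)
	return nil
}

// nerOptions is a hypothetical container for the NER-related flags.
type nerOptions struct {
	Model      string
	Labels     labelList
	Threshold  float64
	BatchSize  int
	TermiteURL string
}

// parseNEROptions parses args and rejects --ner-model without any --ner-label.
func parseNEROptions(args []string) (*nerOptions, error) {
	opts := &nerOptions{}
	fs := flag.NewFlagSet("docsaf", flag.ContinueOnError)
	fs.StringVar(&opts.Model, "ner-model", "", "Termite recognizer model")
	fs.Var(&opts.Labels, "ner-label", "zero-shot entity label (repeatable)")
	fs.Float64Var(&opts.Threshold, "ner-threshold", 0.5, "confidence threshold")
	fs.IntVar(&opts.BatchSize, "ner-batch-size", 32, "texts per NER batch")
	fs.StringVar(&opts.TermiteURL, "termite-url", "http://localhost:8088", "Termite API URL")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if opts.Model != "" && len(opts.Labels) == 0 {
		return nil, errors.New("--ner-model requires at least one --ner-label")
	}
	return opts, nil
}

func main() {
	opts, err := parseNEROptions(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", *opts)
}
```

Leaving --ner-model empty keeps NER disabled, which matches the backward-compatible run without NER flags in the test plan.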

Usage

docsaf sync --dir ./docs --table docs --create-table \
  --ner-model fastino/gliner2-base-v1 \
  --ner-label technology --ner-label concept --ner-label api_endpoint

Test plan

  • Build docsaf: cd examples/docsaf && GOWORK=off go build ./...
  • Run without NER flags (backward compatible): docsaf sync --dir ./docs --table docs --create-table
  • Run with NER: docsaf sync --dir ./docs --table docs --create-table --ner-model fastino/gliner2-base-v1 --ner-label technology --ner-label concept
  • Verify entity records in prepared JSON: docsaf prepare --dir ./docs --ner-model fastino/gliner2-base-v1 --ner-label technology && jq 'to_entries[] | select(.value._type == "entity")' docs.json
  • Verify graph edges created: query the knowledge graph index for mentions_entity edges
  • Verify --ner-model without --ner-label gives a clear error
  • Verify load auto-detects entity records and creates graph index
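
The jq check above has a straightforward Go equivalent, which also suggests how load could decide whether a graph index is needed. This is a sketch under the assumption, taken from the jq filter, that the prepared JSON is an object mapping keys to records with a "_type" field; the entityKeys function, the sample keys, and every field other than "_type" are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// entityKeys returns the sorted keys of all records whose "_type" is "entity".
// Values that are not JSON objects are skipped rather than treated as errors.
func entityKeys(data []byte) ([]string, error) {
	var records map[string]json.RawMessage
	if err := json.Unmarshal(data, &records); err != nil {
		return nil, err
	}
	var keys []string
	for key, raw := range records {
		var rec struct {
			Type string `json:"_type"`
		}
		if err := json.Unmarshal(raw, &rec); err != nil {
			continue
		}
		if rec.Type == "entity" {
			keys = append(keys, key)
		}
	}
	sort.Strings(keys)
	return keys, nil
}

func main() {
	// Hypothetical sample of a prepared docs.json.
	sample := []byte(`{
	  "docs/intro.md#overview": {"_type": "doc_section", "title": "Overview"},
	  "entity:technology:raft": {"_type": "entity", "label": "technology", "name": "raft"}
	}`)
	keys, err := entityKeys(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("entity records:", keys)
	fmt.Println("graph index needed:", len(keys) > 0)
}
```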

Add GliNER2-based named entity recognition to the docsaf documentation
sync tool via termite's /api/recognize endpoint. Extracted entities are
stored as both document metadata (for faceted keyword search) and as
nodes in a knowledge graph index (for entity-based document traversal).

Key changes:
- New --ner-model, --ner-label, --ner-threshold, --ner-batch-size flags
  on prepare and sync commands
- Entity extraction via termite client with batched NER inference
- Normalized entity keys (entity:<label>:<name>) with unicode support
- Graph index with field-based "mentions_entity" edge type that
  automatically creates edges from documents to referenced entities
- Auto-detection of entity records in load command for graph index setup
- Entity schema added to schemas.yaml
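
As an illustration of the entity:<label>:<name> key shape, the following Go sketch builds a normalized key with Unicode-aware handling. The exact normalization rules are not stated in this PR, so the choices here (lowercasing, trimming, and collapsing runs of non-letter, non-digit characters in the name into a single hyphen) are assumptions, as are the function names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeName lowercases s and joins its runs of Unicode letters and
// digits with single hyphens, dropping leading and trailing separators.
func normalizeName(s string) string {
	var b strings.Builder
	pendingSep := false
	for _, r := range strings.ToLower(strings.TrimSpace(s)) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			if pendingSep && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingSep = false
			b.WriteRune(r)
		} else {
			pendingSep = true
		}
	}
	return b.String()
}

// entityKey builds a key of the form entity:<label>:<name>.
// The label is only trimmed and lowercased so labels such as
// api_endpoint keep their underscore.
func entityKey(label, name string) string {
	return "entity:" + strings.ToLower(strings.TrimSpace(label)) + ":" + normalizeName(name)
}

func main() {
	fmt.Println(entityKey("Technology", "  Knowledge  Graph "))
	fmt.Println(entityKey("api_endpoint", "/api/recognize"))
	fmt.Println(entityKey("concept", "Café Münster"))
}
```

Normalizing before keying means that different surface forms of the same mention collapse to one entity node, which is what lets several doc sections link to a shared node.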

The test only severed the original leader→follower link, so a leader
election after the follower restart could let the third node (now
leader) replicate to the follower through an uncut link, making the
require.Error convergence assertion flaky.

Cut links from both peers to the follower and drop the
hasSnapshotTransferEvent assertion that assumed the original leader
would be the snapshot sender.

The pkg/generating module was introduced in 076d39b but missing from
the e2e go.mod replace block and the Makefile GO_SUBMODULES list,
causing e2e builds to fail with "unknown revision" errors.

- Split monolithic `antfly` job into `unit` (build + tests) and `e2e`
  (postgres + e2e tests) jobs that run in parallel with `sim-validate`
- Remove ollama install, model pull, and server management — all tests
  that use ollama are already skipped in CI via env var guards
- Remove GONOPROXY/GOPRIVATE config (antfly is now public)
- Remove unnecessary `go clean -modcache` steps
- Remove zstd install (Go uses bundled DataDog/zstd via CGO)
- Combine sim-validate and unit into a single job to avoid redundant
  runner spin-up

- Fix EmbeddingsIndexConfig.Equal to compare all fields (DistanceMetric,
  Sparse, ChunkSize, MinWeight, TopK, Chunker) instead of only a subset
- Normalize empty DistanceMetric to l2_squared default so omitempty
  round-trips don't cause false mismatches
- Add tests for EmbeddingsIndexConfig.Equal and IndexConfig.Equal
- Remove unused ai import from openapi.go
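
The Equal fix and the default normalization can be sketched as below. The struct is a stand-in that contains only the fields named in this commit message, with guessed types and JSON tags; the real EmbeddingsIndexConfig likely has more fields and different types. Only the l2_squared default comes from the text above.

```go
package main

import "fmt"

// EmbeddingsIndexConfig is a hypothetical stand-in; field types are assumed.
type EmbeddingsIndexConfig struct {
	DistanceMetric string  `json:"distance_metric,omitempty"`
	Sparse         bool    `json:"sparse,omitempty"`
	ChunkSize      int     `json:"chunk_size,omitempty"`
	MinWeight      float64 `json:"min_weight,omitempty"`
	TopK           int     `json:"top_k,omitempty"`
	Chunker        string  `json:"chunker,omitempty"`
}

const defaultDistanceMetric = "l2_squared"

// normalizedMetric maps the empty value, which omitempty drops during a
// JSON round-trip, to the default so both forms compare equal.
func normalizedMetric(m string) string {
	if m == "" {
		return defaultDistanceMetric
	}
	return m
}

// Equal compares every field, treating an empty DistanceMetric as the default.
func (c EmbeddingsIndexConfig) Equal(o EmbeddingsIndexConfig) bool {
	return normalizedMetric(c.DistanceMetric) == normalizedMetric(o.DistanceMetric) &&
		c.Sparse == o.Sparse &&
		c.ChunkSize == o.ChunkSize &&
		c.MinWeight == o.MinWeight &&
		c.TopK == o.TopK &&
		c.Chunker == o.Chunker
}

func main() {
	stored := EmbeddingsIndexConfig{ChunkSize: 512}
	requested := EmbeddingsIndexConfig{DistanceMetric: "l2_squared", ChunkSize: 512}
	fmt.Println("equal after normalization:", stored.Equal(requested))

	requested.TopK = 10
	fmt.Println("equal after TopK change:", stored.Equal(requested))
}
```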

…ng, and CI

Resolve merge conflicts and consolidate changes from origin/main:

- Resolve conflicts in CI workflow and retrieval agent generator chain logic
- Migrate resolveEffectiveGeneratorChain to use pkg/generating instead of lib/ai
  for ResolveGeneratorOrChain and GetDefaultChain (matching the broader refactor)
- Add resolveProvider/resolveProviderName helpers for consistent provider resolution
- Refactor GenerationError to pointer type with Unwrap support, preserving the
  original error chain through Cause field for proper errors.As/errors.Is usage
- Add asGenerationErrorResponse and emitStreamError helpers to reduce repetitive
  error classification and streaming boilerplate in retrieval agent endpoints
- Move git hook from pre-commit to pre-push, rewriting it to diff pushed commits
  rather than the staging area
- Update CONTRIBUTING.md to reflect pre-push hook and add dependency hygiene docs
- Fix retrieval_agent_test.go to use generating.Get/SetDefaultChain (not lib/ai)
- Clean up redundant import alias (generating "..." where package already matches)
- Fix import ordering in lib/scraping/scraping.go
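
The GenerationError change relies on Go's standard error-wrapping contract: a pointer-receiver error type with an Unwrap method lets errors.As find the typed error and errors.Is reach the original cause. The sketch below shows that pattern only; the Message field, the errUpstream sentinel, and the wrapGeneration and asGenerationError helpers are hypothetical, and the real type may carry other fields besides Cause.

```go
package main

import (
	"errors"
	"fmt"
)

// errUpstream is a hypothetical sentinel standing in for a provider error.
var errUpstream = errors.New("upstream provider unavailable")

// GenerationError is a simplified stand-in for the refactored type.
type GenerationError struct {
	Message string
	Cause   error
}

func (e *GenerationError) Error() string {
	if e.Cause != nil {
		return e.Message + ": " + e.Cause.Error()
	}
	return e.Message
}

// Unwrap exposes Cause so errors.Is and errors.As can walk the chain.
func (e *GenerationError) Unwrap() error { return e.Cause }

// wrapGeneration wraps cause in a *GenerationError and adds outer context.
func wrapGeneration(cause error) error {
	return fmt.Errorf("retrieval agent: %w", &GenerationError{Message: "generation failed", Cause: cause})
}

// asGenerationError reports whether err's chain contains a *GenerationError.
func asGenerationError(err error) (*GenerationError, bool) {
	var ge *GenerationError
	ok := errors.As(err, &ge)
	return ge, ok
}

func main() {
	err := wrapGeneration(errUpstream)
	if ge, ok := asGenerationError(err); ok {
		fmt.Println("generation error:", ge.Message)
	}
	fmt.Println("caused by upstream:", errors.Is(err, errUpstream))
}
```

Using a pointer type consistently matters here: errors.As matches on the exact target type, so a chain holding *GenerationError is not found by a target of the value type.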

kin-openapi v0.134.0 breaks oapi-codegen v2.5.1 (MappingRef type change)
and panics in InternalizeRefs on schemas with empty refs. oasdiff/yaml
v0.0.1 breaks kin-openapi v0.133.0 (OriginOpt API change). go-yit
20250909 pulls in go.yaml.in/yaml/v4 which breaks yaml-jsonpath.

Pin all three via replace directives in root go.mod and downgrade across
all sub-modules to restore a working dependency set.
@ajroetker ajroetker merged commit 3e66da0 into main Mar 21, 2026
7 checks passed
@ajroetker ajroetker deleted the feat/docsaf-ner-knowledge-graph branch March 21, 2026 19:37