
Agentic KIE Evals

Benchmarking single-pass and agentic extraction strategies across LLM providers on the Kleister NDA dataset.

Extracting structured fields from legal documents is deceptively hard. This project measures how well modern LLMs handle that task on real NDA documents from the SEC EDGAR database. The benchmark covers three model families (Claude, Gemini, and GPT) and scores each run in LangSmith using exact and fuzzy F1 evaluators.

Dataset

This project uses the Kleister NDA dataset from Applica AI, which consists of NDA documents sourced from SEC EDGAR, annotated with four entity types: effective_date, jurisdiction, party, and term.

Dataset preprocessing and delivery are handled by the Python package kleister-nda-preparation. The preparation pipeline reads the original TSV partitions, transforms raw labels into structured records validated against a Pydantic schema, relocates the corresponding PDF documents, and writes the results as partitioned Parquet files. A hedged sketch of one such record follows the note below.

Note

This step runs automatically as part of make install.
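
For orientation, here is a minimal sketch of what one validated record could look like. The field names mirror the four entity types above; the filename field and the exact types are assumptions, since the actual Pydantic model lives in kleister-nda-preparation.

    # Hypothetical sketch of a prepared record; the real schema in
    # kleister-nda-preparation may differ in field names and types.
    from datetime import date

    from pydantic import BaseModel

    class NdaRecord(BaseModel):
        filename: str                       # source PDF name (assumed field)
        effective_date: date | None = None  # contract effective date, if annotated
        jurisdiction: str | None = None     # governing-law jurisdiction
        party: list[str] = []               # all named parties (multi-valued)
        term: str | None = None             # contract duration, e.g. "2 years"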

Uploading the dataset to LangSmith

Before running the benchmark, the preprocessed Parquet files and their PDF attachments need to be uploaded to LangSmith. The upload_dataset.py module supports several behaviors:

  1. Dry run (validates Parquet files and PDF paths, no API calls)
     uv run python -m agentic_kie_evals.upload_dataset --dry-run
  2. Upload all partitions
     uv run python -m agentic_kie_evals.upload_dataset
  3. Upload specific partitions
     uv run python -m agentic_kie_evals.upload_dataset --partitions train dev-0
  4. Delete and recreate the dataset from scratch
     uv run python -m agentic_kie_evals.upload_dataset --recreate

Tip

The upload script is idempotent: re-running it is safe. It reuses an existing dataset, and deterministic example IDs prevent duplicate examples.
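
To see why that holds, here is one way deterministic example IDs can be derived. Deriving a UUIDv5 from a stable key means the same example always gets the same ID, so LangSmith sees an existing example rather than a duplicate; the key and namespace below are illustrative assumptions, not necessarily what upload_dataset.py uses.

    # Stable key -> stable ID: re-running the upload produces identical IDs.
    # The namespace and key format here are hypothetical.
    import uuid

    NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "agentic-kie-evals")

    def example_id(partition: str, filename: str) -> uuid.UUID:
        return uuid.uuid5(NAMESPACE, f"{partition}/{filename}")

    assert example_id("dev-0", "nda_001.pdf") == example_id("dev-0", "nda_001.pdf")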


Running the benchmark

The benchmark runner evaluates the full experiment matrix (model × strategy × modality) against the LangSmith dataset. Each run is scored by the evaluators and logged back to LangSmith. A sketch of how such a matrix can be enumerated follows the examples below.

  1. Dry run (print the experiment matrix without making any API calls)
     uv run python -m agentic_kie_evals.run_benchmark --dry-run
  2. Single quick test (one model, one strategy, 10 examples)
     uv run python -m agentic_kie_evals.run_benchmark \
         --tier lite --model gemini --strategy single_pass --limit 10
  3. Full matrix, lite tier (cost-optimised models) on the dev split
     uv run python -m agentic_kie_evals.run_benchmark
  4. Full matrix, standard tier (full-capability models) on the dev split
     uv run python -m agentic_kie_evals.run_benchmark --tier standard
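
As promised above, a minimal sketch of enumerating the model × strategy × modality matrix. The models and strategies come from the CLI reference below; the modality values are assumptions, since the real lists live in run_benchmark.py (see the note after the CLI reference).

    # Enumerate every (model, strategy, modality) experiment combination.
    from itertools import product

    MODELS = ["claude", "gemini", "gpt"]
    STRATEGIES = ["single_pass", "agentic"]
    MODALITIES = {"single_pass": ["text", "pdf"], "agentic": ["pdf"]}  # assumed values

    experiments = [
        (model, strategy, modality)
        for model, strategy in product(MODELS, STRATEGIES)
        for modality in MODALITIES[strategy]
    ]

    for model, strategy, modality in experiments:
        print(f"{model} / {strategy} / {modality}")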

CLI reference

Flag               Choices                    Default  Description
--tier             lite, standard, flagship   lite     Model tier: cost-optimised, full-capability, or top-capability
--model            claude, gemini, gpt        all      Restrict to a single model
--strategy         single_pass, agentic       both     Restrict to a single extraction strategy
--split            train, dev, test           dev      Dataset split to evaluate against
--limit            int                        none     Cap the number of examples evaluated
--max-concurrency  int                        3        Max concurrent evaluations
--max-retries      int                        6        Max retries per extractor call
--dry-run          (boolean flag)             false    Print the experiment matrix and exit

Note

Modalities are configured via SINGLE_PASS_MODALITIES and AGENTIC_MODALITIES in run_benchmark.py.


Evaluators

Evaluators live in evaluators.py and follow the LangSmith custom evaluator signature (outputs, reference_outputs) -> {"key": str, "score": float}.
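
As a concrete instance of that signature, a minimal exact-match evaluator might look like the sketch below. This is an illustration of the contract only; the real implementations in evaluators.py also apply the normalization described after the table.

    # Illustrative evaluator matching the (outputs, reference_outputs) contract;
    # not the implementation in evaluators.py.
    def exact_effective_date_f1(outputs: dict, reference_outputs: dict) -> dict:
        predicted = (outputs.get("effective_date") or "").strip().lower()
        expected = (reference_outputs.get("effective_date") or "").strip().lower()
        return {"key": "exact_effective_date_f1", "score": float(predicted == expected)}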

Evaluator                 Field           Method                            Score
exact_effective_date_f1   effective_date  Exact match                       0 or 1
exact_jurisdiction_f1     jurisdiction    Exact match                       0 or 1
fuzzy_jurisdiction_f1     jurisdiction    SequenceMatcher ≥ 0.85            0 or 1
exact_term_f1             term            Exact match                       0 or 1
fuzzy_term_f1             term            SequenceMatcher ≥ 0.85            0 or 1
exact_party_f1            party           Set F1, exact string              0–1 continuous
fuzzy_party_f1            party           Set F1, SequenceMatcher ≥ 0.85    0–1 continuous
exact_f1                  all fields      Macro-average of exact F1 scores  0–1 continuous
fuzzy_f1                  all fields      Macro-average of fuzzy F1 scores  0–1 continuous

Normalization (lowercasing, whitespace trimming, trailing-period stripping) is applied to both sides before comparison.
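
Putting the normalization and the fuzzy threshold together, here is a hedged sketch of a set-based fuzzy F1 for the multi-valued party field. The greedy one-to-one matching is an assumption; evaluators.py may pair predictions and references differently.

    # Set-based fuzzy F1 using the normalization above and a
    # SequenceMatcher ratio threshold of 0.85 (greedy matching).
    from difflib import SequenceMatcher

    def normalize(value: str) -> str:
        # Lowercase, trim whitespace, strip trailing periods.
        return value.strip().lower().rstrip(".")

    def fuzzy_set_f1(predicted: list[str], expected: list[str],
                     threshold: float = 0.85) -> float:
        preds = [normalize(p) for p in predicted]
        refs = [normalize(r) for r in expected]
        if not preds and not refs:
            return 1.0
        if not preds or not refs:
            return 0.0
        matched = 0
        remaining = list(refs)
        for p in preds:
            for r in remaining:
                if SequenceMatcher(None, p, r).ratio() >= threshold:
                    matched += 1
                    remaining.remove(r)  # each reference matches at most once
                    break
        precision = matched / len(preds)
        recall = matched / len(refs)
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # exact_f1 and fuzzy_f1 then macro-average the per-field scores:
    def macro_average(scores: list[float]) -> float:
        return sum(scores) / len(scores)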


Contributing

See CONTRIBUTING.md for the development workflow, available make targets, and the CI pipeline.
