Benchmarking single-pass and agentic extraction strategies across LLM providers on the Kleister NDA dataset.
Extracting structured fields from legal documents is deceptively hard. This project measures how well modern LLMs handle the task on real NDA documents from the SEC EDGAR database. The benchmark covers three model families (Claude, Gemini, and GPT) and scores each run in LangSmith using exact and fuzzy F1 evaluators.
This project uses the Kleister NDA dataset from Applica AI, which consists of NDA documents sourced from SEC EDGAR and annotated with four entity types: `effective_date`, `jurisdiction`, `party`, and `term`.
Dataset preprocessing and delivery are handled by the Python package `kleister-nda-preparation`. The preparation pipeline reads the original TSV partitions, transforms the raw labels into structured records validated against a Pydantic schema, relocates the corresponding PDF documents, and writes the results as partitioned Parquet files.
> **Note**
> This step runs automatically as part of `make install`.
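
For orientation, the validated record roughly corresponds to a Pydantic model like the one below. This is a minimal sketch: the actual field names and types live in `kleister-nda-preparation`, and the model name `NdaRecord` is hypothetical.

```python
from datetime import date
from pydantic import BaseModel, Field


class NdaRecord(BaseModel):
    """Hypothetical sketch of one preprocessed Kleister NDA example."""

    filename: str                        # source PDF within the partition
    effective_date: date | None = None   # agreement effective date, if annotated
    jurisdiction: str | None = None      # governing-law jurisdiction
    party: list[str] = Field(default_factory=list)  # contracting parties (may be several)
    term: str | None = None              # NDA term, if annotated
```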
Before running the benchmark, the preprocessed Parquet files and their PDF attachments need to be uploaded to LangSmith. The `upload_dataset.py` module supports several modes:
- Dry run (validates parquet files and PDF paths, no API calls)

  ```bash
  uv run python -m agentic_kie_evals.upload_dataset --dry-run
  ```

- Upload all partitions

  ```bash
  uv run python -m agentic_kie_evals.upload_dataset
  ```

- Upload specific partitions

  ```bash
  uv run python -m agentic_kie_evals.upload_dataset --partitions train dev-0
  ```

- Delete and recreate the dataset from scratch

  ```bash
  uv run python -m agentic_kie_evals.upload_dataset --recreate
  ```

> **Tip**
> The upload script is idempotent: re-running it is safe. It reuses an existing dataset, and deterministic example IDs prevent duplicates.
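
Deterministic IDs can be derived from stable example metadata, so the same document always maps to the same LangSmith example. Below is a minimal sketch of that idea; the exact ID scheme used by `upload_dataset.py` may differ, and the namespace string is illustrative.

```python
import uuid

# Fixed namespace so the same (split, filename) pair always yields the same UUID.
# The namespace value here is illustrative, not the one used by the project.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "agentic-kie-evals.examples")


def example_id(split: str, filename: str) -> uuid.UUID:
    """Derive a stable example ID from the split and source PDF filename."""
    return uuid.uuid5(EXAMPLE_NAMESPACE, f"{split}/{filename}")


# Re-running the upload recomputes identical IDs, so existing examples are
# updated rather than duplicated.
assert example_id("dev", "nda_0001.pdf") == example_id("dev", "nda_0001.pdf")
```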
The benchmark runner evaluates the full experiment matrix (model × strategy × modality) against the LangSmith dataset. Each run is scored by the evaluators and logged back to LangSmith.
- Dry run (print the experiment matrix without making any API calls)

  ```bash
  uv run python -m agentic_kie_evals.run_benchmark --dry-run
  ```

- Single quick test (one model, one strategy, 10 examples)

  ```bash
  uv run python -m agentic_kie_evals.run_benchmark \
    --tier lite --model gemini --strategy single_pass --limit 10
  ```

- Full matrix, lite tier (cost-optimised models) on the dev split

  ```bash
  uv run python -m agentic_kie_evals.run_benchmark
  ```

- Full matrix, standard tier (full-capability models) on the dev split
  ```bash
  uv run python -m agentic_kie_evals.run_benchmark --tier standard
  ```

| Flag | Choices | Default | Description |
|---|---|---|---|
| `--tier` | `lite`, `standard`, `flagship` | `lite` | Model tier: cost-optimised, full-capability, or top-capability |
| `--model` | `claude`, `gemini`, `gpt` | all | Restrict to a single model |
| `--strategy` | `single_pass`, `agentic` | both | Restrict to a single extraction strategy |
| `--split` | `train`, `dev`, `test` | `dev` | Dataset split to evaluate against |
| `--limit` | int | none | Cap the number of examples evaluated |
| `--max-concurrency` | int | 3 | Max concurrent evaluations |
| `--max-retries` | int | 6 | Max retries per extractor call |
| `--dry-run` | — | false | Print the experiment matrix and exit |
> **Note**
> Modalities are configured via `SINGLE_PASS_MODALITIES` and `AGENTIC_MODALITIES` in `run_benchmark.py`.
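
Conceptually, the runner expands the matrix as a Cartesian product of models, strategies, and each strategy's modality list. A rough sketch of that expansion (the model, strategy, and modality values below are illustrative, not the project's actual configuration):

```python
from itertools import product

# Illustrative values; the real lists live in run_benchmark.py.
MODELS = ["claude", "gemini", "gpt"]
STRATEGIES = ["single_pass", "agentic"]
MODALITIES = {
    "single_pass": ["text", "pdf"],  # assumed modality names
    "agentic": ["pdf"],
}


def experiment_matrix() -> list[tuple[str, str, str]]:
    """Enumerate every (model, strategy, modality) combination to evaluate."""
    return [
        (model, strategy, modality)
        for model, strategy in product(MODELS, STRATEGIES)
        for modality in MODALITIES[strategy]
    ]


for model, strategy, modality in experiment_matrix():
    print(f"{model} / {strategy} / {modality}")
```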
Evaluators live in `evaluators.py` and follow the LangSmith custom evaluator signature `(outputs, reference_outputs) -> {"key": str, "score": float}`.
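
As an example of that signature, an exact-match evaluator for `effective_date` might look roughly like this. It is a sketch only; the actual implementation in `evaluators.py` may differ in its normalization details.

```python
def exact_effective_date_f1(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1.0 when the predicted effective_date exactly matches the reference."""
    predicted = (outputs.get("effective_date") or "").strip().lower().rstrip(".")
    expected = (reference_outputs.get("effective_date") or "").strip().lower().rstrip(".")
    return {"key": "exact_effective_date_f1", "score": 1.0 if predicted == expected else 0.0}
```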
| Evaluator | Field | Method | Score |
|---|---|---|---|
| `exact_effective_date_f1` | `effective_date` | Exact match | 0 or 1 |
| `exact_jurisdiction_f1` | `jurisdiction` | Exact match | 0 or 1 |
| `fuzzy_jurisdiction_f1` | `jurisdiction` | SequenceMatcher ≥ 0.85 | 0 or 1 |
| `exact_term_f1` | `term` | Exact match | 0 or 1 |
| `fuzzy_term_f1` | `term` | SequenceMatcher ≥ 0.85 | 0 or 1 |
| `exact_party_f1` | `party` | Set F1, exact string | 0–1 continuous |
| `fuzzy_party_f1` | `party` | Set F1, SequenceMatcher ≥ 0.85 | 0–1 continuous |
| `exact_f1` | all fields | Macro-average of exact F1 scores | 0–1 continuous |
| `fuzzy_f1` | all fields | Macro-average of fuzzy F1 scores | 0–1 continuous |
Normalization (lowercasing, whitespace trimming, trailing-period stripping) is applied to both sides before comparison.
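
For the fuzzy party evaluator, this amounts to normalizing each name, matching predictions against references with `difflib.SequenceMatcher`, and computing F1 over the matched sets. The sketch below shows one way to do that under those assumptions; the real `fuzzy_party_f1` in `evaluators.py` may pair names and handle empty values differently.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase, trim whitespace, and strip a trailing period before comparison."""
    return value.strip().lower().rstrip(".")


def fuzzy_party_f1(predicted: list[str], reference: list[str], threshold: float = 0.85) -> float:
    """Set F1 where two party names count as equal if their similarity ratio >= threshold."""
    preds = [normalize(p) for p in predicted]
    refs = [normalize(r) for r in reference]
    if not preds and not refs:
        return 1.0
    if not preds or not refs:
        return 0.0

    unmatched_refs = list(refs)
    true_positives = 0
    for pred in preds:
        for ref in unmatched_refs:
            if SequenceMatcher(None, pred, ref).ratio() >= threshold:
                true_positives += 1
                unmatched_refs.remove(ref)  # each reference may be matched only once
                break

    precision = true_positives / len(preds)
    recall = true_positives / len(refs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: a small wording difference still counts as a match at the 0.85 threshold.
print(fuzzy_party_f1(["Acme Corp."], ["Acme Corp"]))  # 1.0
```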
See CONTRIBUTING.md for the development workflow, available make targets, and the CI pipeline.