Benchmarking single-pass and agentic extraction strategies across LLM providers on the Kleister NDA dataset.
Extracting structured fields from legal documents is deceptively hard. This project measures how well modern LLMs handle the task on real NDA documents from the SEC EDGAR database. The benchmark covers three model families (Claude, Gemini, and GPT) and scores each run in LangSmith using exact and fuzzy F1 evaluators.
This project uses the Kleister NDA dataset from Applica AI, which consists of NDA documents sourced from SEC EDGAR and annotated with four entity types: `effective_date`, `jurisdiction`, `party`, and `term`.
Dataset preprocessing and delivery are handled by the Python package `kleister-nda-preparation`. The preparation pipeline reads the original TSV partitions, transforms the raw labels into structured records validated against a Pydantic schema, relocates the corresponding PDF documents, and writes the results as partitioned Parquet files.
> **Note**
> This step runs automatically as part of `make install`.
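
For orientation, the validated record roughly corresponds to a Pydantic model like the one below. This is a minimal sketch: the actual field names and types live in `kleister-nda-preparation`, and the model name `NdaRecord` is hypothetical.

```python
from datetime import date
from pydantic import BaseModel, Field


class NdaRecord(BaseModel):
    """Hypothetical sketch of one preprocessed Kleister NDA example."""

    filename: str                        # source PDF within the partition
    effective_date: date | None = None   # agreement effective date, if annotated
    jurisdiction: str | None = None      # governing-law jurisdiction
    party: list[str] = Field(default_factory=list)  # contracting parties (may be several)
    term: str | None = None              # NDA term, if annotated
```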
Before running the benchmark, the preprocessed Parquet files and their PDF attachments need to be uploaded to LangSmith. The `upload_dataset.py` module supports several modes:
- Dry run (validates parquet files and PDF paths, no API calls)

  ```bash
  uv run python -m agentic_kie_evals.upload_dataset --dry-run
  ```

- Upload all partitions

  ```bash
  uv run python -m agentic_kie_evals.upload_dataset
  ```

- Upload specific partitions

  ```bash
  uv run python -m agentic_kie_evals.upload_dataset --partitions train dev-0
  ```

- Delete and recreate the dataset from scratch

  ```bash
  uv run python -m agentic_kie_evals.upload_dataset --recreate
  ```

> **Tip**
> The upload script is idempotent: re-running it is safe. It reuses an existing dataset, and deterministic example IDs prevent duplicates.
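
Deterministic IDs can be derived from stable example metadata, so the same document always maps to the same LangSmith example. Below is a minimal sketch of that idea; the exact ID scheme used by `upload_dataset.py` may differ, and the namespace string is illustrative.

```python
import uuid

# Fixed namespace so the same (split, filename) pair always yields the same UUID.
# The namespace value here is illustrative, not the one used by the project.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "agentic-kie-evals.examples")


def example_id(split: str, filename: str) -> uuid.UUID:
    """Derive a stable example ID from the split and source PDF filename."""
    return uuid.uuid5(EXAMPLE_NAMESPACE, f"{split}/{filename}")


# Re-running the upload recomputes identical IDs, so existing examples are
# updated rather than duplicated.
assert example_id("dev", "nda_0001.pdf") == example_id("dev", "nda_0001.pdf")
```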
The benchmark runner evaluates the full experiment matrix (model × strategy × modality) against the LangSmith dataset. Each run is scored by the evaluators and logged back to LangSmith.
- Dry run (print the experiment matrix without making any API calls)

  ```bash
  uv run python -m agentic_kie_evals.run_benchmark --dry-run
  ```

- Single quick test (one model, one strategy, 10 examples)

  ```bash
  uv run python -m agentic_kie_evals.run_benchmark \
    --tier lite --model gemini --strategy single_pass --limit 10
  ```

- Full matrix, lite tier (cost-optimised models) on the dev split

  ```bash
  uv run python -m agentic_kie_evals.run_benchmark
  ```

- Full matrix, standard tier (full-capability models) on the dev split
  ```bash
  uv run python -m agentic_kie_evals.run_benchmark --tier standard
  ```

| Flag | Choices | Default | Description |
|---|---|---|---|
| `--tier` | `lite`, `standard`, `flagship` | `lite` | Model tier: cost-optimised, full-capability, or top-capability |
| `--model` | `claude`, `gemini`, `gpt` | all | Restrict to a single model |
| `--strategy` | `single_pass`, `agentic` | both | Restrict to a single extraction strategy |
| `--split` | `train`, `dev`, `test` | `dev` | Dataset split to evaluate against |
| `--limit` | int | none | Cap the number of examples evaluated |
| `--max-concurrency` | int | 3 | Max concurrent evaluations |
| `--max-retries` | int | 6 | Max retries per extractor call |
| `--dry-run` | — | false | Print the experiment matrix and exit |
> **Note**
> Modalities are configured via `SINGLE_PASS_MODALITIES` and `AGENTIC_MODALITIES` in `run_benchmark.py`.
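
Conceptually, the runner expands the matrix as a Cartesian product of models, strategies, and each strategy's modality list. A rough sketch of that expansion (the model, strategy, and modality values below are illustrative, not the project's actual configuration):

```python
from itertools import product

# Illustrative values; the real lists live in run_benchmark.py.
MODELS = ["claude", "gemini", "gpt"]
STRATEGIES = ["single_pass", "agentic"]
MODALITIES = {
    "single_pass": ["text", "pdf"],  # assumed modality names
    "agentic": ["pdf"],
}


def experiment_matrix() -> list[tuple[str, str, str]]:
    """Enumerate every (model, strategy, modality) combination to evaluate."""
    return [
        (model, strategy, modality)
        for model, strategy in product(MODELS, STRATEGIES)
        for modality in MODALITIES[strategy]
    ]


for model, strategy, modality in experiment_matrix():
    print(f"{model} / {strategy} / {modality}")
```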
Evaluators live in `evaluators.py` and follow the LangSmith custom evaluator signature `(outputs, reference_outputs) -> {"key": str, "score": float}`.
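
As an example of that signature, an exact-match evaluator for `effective_date` might look roughly like this. It is a sketch only; the actual implementation in `evaluators.py` may differ in its normalization details.

```python
def exact_effective_date_f1(outputs: dict, reference_outputs: dict) -> dict:
    """Score 1.0 when the predicted effective_date exactly matches the reference."""
    predicted = (outputs.get("effective_date") or "").strip().lower().rstrip(".")
    expected = (reference_outputs.get("effective_date") or "").strip().lower().rstrip(".")
    return {"key": "exact_effective_date_f1", "score": 1.0 if predicted == expected else 0.0}
```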
| Evaluator | Field | Method | Score |
|---|---|---|---|
| `exact_effective_date_f1` | `effective_date` | Exact match | 0 or 1 |
| `exact_jurisdiction_f1` | `jurisdiction` | Exact match | 0 or 1 |
| `fuzzy_jurisdiction_f1` | `jurisdiction` | SequenceMatcher ≥ 0.85 | 0 or 1 |
| `exact_term_f1` | `term` | Exact match | 0 or 1 |
| `fuzzy_term_f1` | `term` | SequenceMatcher ≥ 0.85 | 0 or 1 |
| `exact_party_f1` | `party` | Set F1, exact string | 0–1 continuous |
| `fuzzy_party_f1` | `party` | Set F1, SequenceMatcher ≥ 0.85 | 0–1 continuous |
| `exact_f1` | all fields | Macro-average of exact F1 scores | 0–1 continuous |
| `fuzzy_f1` | all fields | Macro-average of fuzzy F1 scores | 0–1 continuous |
Normalization (lowercasing, whitespace trimming, trailing-period stripping) is applied to both sides before comparison.
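
For the fuzzy party evaluator, this amounts to normalizing each name, matching predictions against references with `difflib.SequenceMatcher`, and computing F1 over the matched sets. The sketch below shows one way to do that under those assumptions; the real `fuzzy_party_f1` in `evaluators.py` may pair names and handle empty values differently.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase, trim whitespace, and strip a trailing period before comparison."""
    return value.strip().lower().rstrip(".")


def fuzzy_party_f1(predicted: list[str], reference: list[str], threshold: float = 0.85) -> float:
    """Set F1 where two party names count as equal if their similarity ratio >= threshold."""
    preds = [normalize(p) for p in predicted]
    refs = [normalize(r) for r in reference]
    if not preds and not refs:
        return 1.0
    if not preds or not refs:
        return 0.0

    unmatched_refs = list(refs)
    true_positives = 0
    for pred in preds:
        for ref in unmatched_refs:
            if SequenceMatcher(None, pred, ref).ratio() >= threshold:
                true_positives += 1
                unmatched_refs.remove(ref)  # each reference may be matched only once
                break

    precision = true_positives / len(preds)
    recall = true_positives / len(refs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: a small wording difference still counts as a match at the 0.85 threshold.
print(fuzzy_party_f1(["Acme Corp."], ["Acme Corp"]))  # 1.0
```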
See CONTRIBUTING.md for the development workflow, available make targets, and the CI pipeline.