feat: Add post-training baseline comparison, report and config examples #214

Open

Ricardo-M-Shang wants to merge 1 commit into aiwaves-cn:master from Ricardo-M-Shang:feat/train-baseline-report

Changelog

Overview

Adds an evaluation-and-baseline reporting workflow to the trainer:

  • Baseline (fixed prompts) evaluation before training.
  • Final evaluation after training.
  • Aggregated JSON report with score comparison, success rate, loss/score curve, and top reflection snippets.

What’s New

  • New ReportConfig and EvaluationReporter to drive end-of-training reporting (see the sketch after this list).
  • The trainer can automatically perform:
    • Baseline evaluation using the initial (or provided) solution.
    • Final evaluation using the trained solution.
    • Training curve accumulation (per-step average score and loss).
    • Reflection harvesting from loss outputs.
    • A single report.json written under the run log directory.
  • README updated with config snippet and output layout.
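
For orientation, a minimal sketch of what ReportConfig could look like is shown below. The field names mirror the report_config keys documented in the Configuration section; the defaults are illustrative and the actual class in src/agents/evaluation/report.py may be laid out differently.

# Hypothetical sketch only: field names follow the documented config keys;
# defaults are illustrative and may not match the real implementation.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ReportConfig:
    enable: bool = False
    sample_size: int = 8
    eval_indices: Optional[List[int]] = None
    success_threshold: Optional[float] = None
    top_k_reflections: int = 5
    report_name: str = "report.json"
    baseline_solution_path: Optional[str] = None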

Configuration

Add to trainer_config.json:

{
  "report_config": {
    "enable": true,
    "sample_size": 8,
    "eval_indices": null,
    "success_threshold": 0.5,
    "top_k_reflections": 5,
    "report_name": "report.json",
    "baseline_solution_path": null
  }
}

Key fields:

  • enable: turn reporting on/off.
  • sample_size: number of cases to evaluate when eval_indices is not specified.
  • eval_indices: optional explicit indices to evaluate.
  • success_threshold: if set, the success rate is the fraction of cases with score >= threshold; if scores are already 0/1, the success rate is computed automatically (see the example after this list).
  • top_k_reflections: limit for reflection snippets from loss outputs.
  • report_name: filename for the report JSON.
  • baseline_solution_path: optional fixed baseline; defaults to the initial solution.
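
To make the success_threshold semantics concrete, here is a small illustrative computation (not code from this PR) of how a success rate can be derived from per-case scores:

# Illustrative only: deriving a success rate from per-case scores.
def success_rate(scores, success_threshold=None):
    if success_threshold is not None:
        hits = [s >= success_threshold for s in scores]
    elif all(s in (0, 1) for s in scores):
        # 0/1 scores: a score of 1 counts as success even without a threshold.
        hits = [s == 1 for s in scores]
    else:
        # No threshold and non-binary scores: success rate is not reported.
        return None
    return sum(hits) / len(hits)

print(success_rate([0.2, 0.6, 0.9, 0.4], success_threshold=0.5))  # 0.5
print(success_rate([1, 0, 1, 1]))                                 # 0.75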

Outputs

Under logs/<timestamp>/ when enabled:

  • baseline/: baseline evaluation cases and results.
  • final/: final evaluation cases and results.
  • report.json: summary (an inspection example follows this list) containing:
    • baseline and final aggregates (average score, success rate if applicable, average loss).
    • delta_average_score (final minus baseline).
    • training_curve (list of step-level avg score and avg loss).
    • top_reflections (loss-based requirements/suggestions with associated scores).
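
As a quick sanity check after a run, the report can be inspected along these lines; the exact JSON keys here are assumptions based on the fields listed above, so adjust them to the actual file contents:

# Hypothetical post-run inspection; key names are assumed, not guaranteed.
import json

run_dir = "logs/<timestamp>"  # replace with the actual run log directory
with open(f"{run_dir}/report.json") as f:
    report = json.load(f)

print("baseline avg score:", report["baseline"]["average_score"])
print("final avg score:", report["final"]["average_score"])
print("delta:", report["delta_average_score"])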

How It Works (Trainer Flow)

  1. Optional baseline run:
    • Loads baseline solution (initial or provided).
    • For the selected indices (or sampled cases), runs the forward pass and dataset evaluation.
    • Summarizes to baseline/.
  2. Training loop:
    • Records per-step average score and average loss into a training curve.
  3. Final run:
    • Evaluates the trained solution on the same sampled/selected indices.
    • Saves to final/.
  4. Report generation:
    • Builds report.json from the baseline summary, final summary, training curve, and reflections (a simplified sketch follows).
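
The flow above roughly corresponds to the following simplified sketch; every name here (the import path, EvaluationReporter methods, trainer attributes, config fields) is illustrative rather than the actual trainer API:

# Simplified, hypothetical sketch of the trainer flow described above.
from agents.evaluation import EvaluationReporter  # import path assumed

def train_with_report(trainer, dataset, config):
    reporter = EvaluationReporter(config.report_config)  # assumed constructor

    baseline_summary = None
    if config.report_config.enable:
        # 1. Baseline run with the initial (or provided) solution.
        baseline_summary = reporter.evaluate(trainer.initial_solution, dataset, out_dir="baseline")

    training_curve = []
    for step in range(config.num_steps):
        # 2. Training loop: record per-step average score and average loss.
        avg_score, avg_loss = trainer.train_step(dataset)
        training_curve.append({"step": step, "avg_score": avg_score, "avg_loss": avg_loss})

    if config.report_config.enable:
        # 3. Final run on the same sampled/selected indices.
        final_summary = reporter.evaluate(trainer.solution, dataset, out_dir="final")
        # 4. Aggregate everything into report.json under the run log directory.
        reporter.write_report(baseline_summary, final_summary, training_curve)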

Files Touched

  • src/agents/evaluation/report.py (new)
  • src/agents/evaluation/__init__.py
  • src/agents/optimization/trainer.py
  • README.md

Notes and Tips

  • Use a small sample_size first to validate paths and report generation.
  • If your dataset scores are not 0/1, set success_threshold to define success rate semantics.
  • report_name can be customized per run; default is report.json.
  • Baseline uses the initial solution unless baseline_solution_path is provided.
