feat: Add post-training baseline comparison, report and config examples #214

Open

Ricardo-M-Shang wants to merge 1 commit into aiwaves-cn:master from Ricardo-M-Shang:feat/train-baseline-report

Changelog

Overview

Adds an evaluation-and-baseline reporting workflow to the trainer:

  • Baseline (fixed prompts) evaluation before training.
  • Final evaluation after training.
  • Aggregated JSON report with score comparison, success rate, loss/score curve, and top reflection snippets.

What’s New

  • New ReportConfig and EvaluationReporter to drive end-of-training reporting (see the sketch after this list).
  • The trainer can automatically perform:
    • Baseline evaluation using the initial (or provided) solution.
    • Final evaluation using the trained solution.
    • Training curve accumulation (per-step average score and loss).
    • Reflection harvesting from loss outputs.
    • A single report.json written under the run log directory.
  • README updated with config snippet and output layout.
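
For orientation, a minimal sketch of what ReportConfig could look like is shown below. The field names mirror the report_config keys documented in the Configuration section; the defaults are illustrative and the actual class in src/agents/evaluation/report.py may be laid out differently.

# Hypothetical sketch only: field names follow the documented config keys;
# defaults are illustrative and may not match the real implementation.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ReportConfig:
    enable: bool = False
    sample_size: int = 8
    eval_indices: Optional[List[int]] = None
    success_threshold: Optional[float] = None
    top_k_reflections: int = 5
    report_name: str = "report.json"
    baseline_solution_path: Optional[str] = None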

Configuration

Add to trainer_config.json:

{
  "report_config": {
    "enable": true,
    "sample_size": 8,
    "eval_indices": null,
    "success_threshold": 0.5,
    "top_k_reflections": 5,
    "report_name": "report.json",
    "baseline_solution_path": null
  }
}

Key fields:

  • enable: turn reporting on/off.
  • sample_size: number of cases to evaluate when eval_indices is not specified.
  • eval_indices: optional explicit indices to evaluate.
  • success_threshold: if set, the success rate is the fraction of cases with score >= threshold; if scores are already 0/1, the success rate is computed automatically (see the example after this list).
  • top_k_reflections: limit for reflection snippets from loss outputs.
  • report_name: filename for the report JSON.
  • baseline_solution_path: optional fixed baseline; defaults to the initial solution.
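
To make the success_threshold semantics concrete, here is a small illustrative computation (not code from this PR) of how a success rate can be derived from per-case scores:

# Illustrative only: deriving a success rate from per-case scores.
def success_rate(scores, success_threshold=None):
    if success_threshold is not None:
        hits = [s >= success_threshold for s in scores]
    elif all(s in (0, 1) for s in scores):
        # 0/1 scores: a score of 1 counts as success even without a threshold.
        hits = [s == 1 for s in scores]
    else:
        # No threshold and non-binary scores: success rate is not reported.
        return None
    return sum(hits) / len(hits)

print(success_rate([0.2, 0.6, 0.9, 0.4], success_threshold=0.5))  # 0.5
print(success_rate([1, 0, 1, 1]))                                 # 0.75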

Outputs

Under logs/<timestamp>/ when enabled:

  • baseline/: baseline evaluation cases and results.
  • final/: final evaluation cases and results.
  • report.json: summary (an inspection example follows this list) containing:
    • baseline and final aggregates (average score, success rate if applicable, average loss).
    • delta_average_score (final minus baseline).
    • training_curve (list of step-level avg score and avg loss).
    • top_reflections (loss-based requirements/suggestions with associated scores).
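
As a quick sanity check after a run, the report can be inspected along these lines; the exact JSON keys here are assumptions based on the fields listed above, so adjust them to the actual file contents:

# Hypothetical post-run inspection; key names are assumed, not guaranteed.
import json

run_dir = "logs/<timestamp>"  # replace with the actual run log directory
with open(f"{run_dir}/report.json") as f:
    report = json.load(f)

print("baseline avg score:", report["baseline"]["average_score"])
print("final avg score:", report["final"]["average_score"])
print("delta:", report["delta_average_score"])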

How It Works (Trainer Flow)

  1. Optional baseline run:
    • Loads baseline solution (initial or provided).
    • For the selected indices (or sampled cases), runs the forward pass and dataset evaluation.
    • Summarizes to baseline/.
  2. Training loop:
    • Records per-step average score and average loss into a training curve.
  3. Final run:
    • Evaluates the trained solution on the same sampled/selected indices.
    • Saves to final/.
  4. Report generation:
    • Builds report.json from the baseline summary, final summary, training curve, and reflections (a simplified sketch follows).
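
The flow above roughly corresponds to the following simplified sketch; every name here (the import path, EvaluationReporter methods, trainer attributes, config fields) is illustrative rather than the actual trainer API:

# Simplified, hypothetical sketch of the trainer flow described above.
from agents.evaluation import EvaluationReporter  # import path assumed

def train_with_report(trainer, dataset, config):
    reporter = EvaluationReporter(config.report_config)  # assumed constructor

    baseline_summary = None
    if config.report_config.enable:
        # 1. Baseline run with the initial (or provided) solution.
        baseline_summary = reporter.evaluate(trainer.initial_solution, dataset, out_dir="baseline")

    training_curve = []
    for step in range(config.num_steps):
        # 2. Training loop: record per-step average score and average loss.
        avg_score, avg_loss = trainer.train_step(dataset)
        training_curve.append({"step": step, "avg_score": avg_score, "avg_loss": avg_loss})

    if config.report_config.enable:
        # 3. Final run on the same sampled/selected indices.
        final_summary = reporter.evaluate(trainer.solution, dataset, out_dir="final")
        # 4. Aggregate everything into report.json under the run log directory.
        reporter.write_report(baseline_summary, final_summary, training_curve)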

Files Touched

  • src/agents/evaluation/report.py (new)
  • src/agents/evaluation/__init__.py
  • src/agents/optimization/trainer.py
  • README.md

Notes and Tips

  • Use a small sample_size first to validate paths and report generation.
  • If your dataset scores are not 0/1, set success_threshold to define success rate semantics.
  • report_name can be customized per run; default is report.json.
  • Baseline uses the initial solution unless baseline_solution_path is provided.
