feat: Add post-training baseline comparison, report and config examples #214
Open
Ricardo-M-Shang wants to merge 1 commit into aiwaves-cn:master from
Changelog
Overview
Adds an evaluation-and-baseline reporting workflow to the trainer.
What’s New
- ReportConfig and EvaluationReporter to drive end-of-training reporting.
- report.json written under the run log directory.

Configuration
Add to trainer_config.json:

```json
{
  "report_config": {
    "enable": true,
    "sample_size": 8,
    "eval_indices": null,
    "success_threshold": 0.5,
    "top_k_reflections": 5,
    "report_name": "report.json",
    "baseline_solution_path": null
  }
}
```

Key fields (see the sketch after this list):
- enable: turn reporting on/off.
- sample_size: number of cases to evaluate when indices are not specified.
- eval_indices: optional explicit indices to evaluate.
- success_threshold: if set, computes success rate based on score >= threshold; if scores are 0/1, success rate is auto-computed.
- top_k_reflections: limit for reflection snippets from loss outputs.
- report_name: filename for the report JSON.
- baseline_solution_path: optional fixed baseline; defaults to the initial solution.
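For orientation, here is a minimal, non-authoritative sketch of a ReportConfig that simply mirrors the report_config keys above; the real class in src/agents/evaluation/report.py may use different defaults or helpers, and from_dict is hypothetical.

```python
# Hypothetical ReportConfig sketch mirroring the "report_config" JSON keys above.
# The actual class in src/agents/evaluation/report.py may differ.
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReportConfig:
    enable: bool = False                          # turn reporting on/off
    sample_size: int = 8                          # cases to evaluate when eval_indices is None
    eval_indices: Optional[List[int]] = None      # explicit case indices to evaluate
    success_threshold: Optional[float] = None     # score >= threshold counts as a success
    top_k_reflections: int = 5                    # max reflection snippets from loss outputs
    report_name: str = "report.json"              # report filename under the run log directory
    baseline_solution_path: Optional[str] = None  # fixed baseline; defaults to the initial solution

    @classmethod
    def from_dict(cls, d: dict) -> "ReportConfig":
        """Build a config from the "report_config" block of trainer_config.json."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in d.items() if k in known})
```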
Outputs

Under logs/<timestamp>/ when enabled:

- baseline/: baseline evaluation cases and results.
- final/: final evaluation cases and results.
- report.json: summary (illustrated below) with:
  - baseline and final aggregates (average score, success rate if applicable, average loss).
  - delta_average_score (final minus baseline).
  - training_curve (list of step-level avg score and avg loss).
  - top_reflections (loss-based requirements/suggestions with associated scores).
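To make the summary layout concrete, here is an illustrative shape for report.json with placeholder numbers; the top-level keys come from the list above, while the nested key names (average_score, success_rate, average_loss, step, avg_score, avg_loss, reflection) are assumptions rather than the actual schema.

```python
# Illustrative report.json shape; all numbers are placeholders and the nested
# key names are assumptions, not the actual schema produced by the reporter.
import json

report = {
    "baseline": {"average_score": 0.42, "success_rate": 0.25, "average_loss": 0.61},
    "final": {"average_score": 0.58, "success_rate": 0.50, "average_loss": 0.44},
    "delta_average_score": 0.58 - 0.42,  # final minus baseline
    "training_curve": [                  # one entry per training step
        {"step": 0, "avg_score": 0.42, "avg_loss": 0.61},
        {"step": 1, "avg_score": 0.51, "avg_loss": 0.52},
    ],
    "top_reflections": [                 # loss-based requirements/suggestions with scores
        {"score": 0.2, "reflection": "Clarify the expected output format in the prompt."},
    ],
}

print(json.dumps(report, indent=2))
```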
How It Works (Trainer Flow)

- Evaluates the baseline solution and writes its cases/results under baseline/.
- Evaluates the final solution and writes its cases/results under final/.
- Builds report.json from the baseline summary, final summary, training curve, and reflections (sketched below).
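The flow above can be pictured as a rough end-of-training hook; the reporter method names (evaluate, write_report) and arguments here are hypothetical stand-ins, not the actual EvaluationReporter API from this PR.

```python
# Hypothetical end-of-training reporting hook. Method and argument names are
# illustrative; the real EvaluationReporter in src/agents/evaluation/report.py may differ.
from pathlib import Path


def run_post_training_report(reporter, baseline_solution, final_solution,
                             log_dir, training_curve, reflections):
    log_dir = Path(log_dir)

    # 1. Evaluate the baseline solution; cases and results go under baseline/.
    baseline_summary = reporter.evaluate(baseline_solution, out_dir=log_dir / "baseline")

    # 2. Evaluate the final (trained) solution; cases and results go under final/.
    final_summary = reporter.evaluate(final_solution, out_dir=log_dir / "final")

    # 3. Assemble report.json from both summaries, the training curve, and reflections.
    reporter.write_report(
        baseline=baseline_summary,
        final=final_summary,
        training_curve=training_curve,
        top_reflections=reflections,
        path=log_dir / "report.json",
    )
```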
Files Touched

- src/agents/evaluation/report.py (new)
- src/agents/evaluation/__init__.py
- src/agents/optimization/trainer.py
- README.md

Notes and Tips
- Start with a small sample_size first to validate paths and report generation (see the sketch after this list).
- Set success_threshold to define success-rate semantics.
- report_name can be customized per run; the default is report.json.
- The initial solution is used as the baseline unless baseline_solution_path is provided.
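As a quick way to act on the first tip, one option is to temporarily shrink report_config before a smoke run; the snippet below only assumes the trainer_config.json layout shown earlier.

```python
# Shrink report_config for a quick smoke test of paths and report generation.
import json

with open("trainer_config.json") as f:
    cfg = json.load(f)

cfg.setdefault("report_config", {}).update({
    "enable": True,
    "sample_size": 2,                    # small sample just to exercise the pipeline
    "report_name": "smoke_report.json",  # keep the smoke-test report separate
})

with open("trainer_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```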