* tmp
* initial conversion to reward in accumulate_inference_batches
* nearly working
* first test fixes
* running, just need to test reduced logging
* test scripts, tmp commit for integration test
* update tests
* intermediate commit
* fix accumulate_inference_batches inputs
* change model to one that can actually solve the task
* refill filtered prompts
move weight sync directly after update
episode now refers to "training episode", not "generation episode" as it did previously
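A minimal sketch of the refill idea from the "refill filtered prompts" commit, assuming a GRPO-style filter that drops prompt groups whose rewards have zero variance (their advantages would all be zero) and then tops the batch back up with freshly sampled prompts. `refill_filtered_prompts` and `sample_new_group` are hypothetical names for illustration, not the open_instruct API:

```python
import numpy as np


def refill_filtered_prompts(groups, sample_new_group, batch_size):
    """groups: list of (prompt, rewards) pairs, rewards being a 1-D array.

    sample_new_group() is a hypothetical callable that generates and
    scores one fresh prompt group.
    """
    # Drop groups whose rewards are constant: they carry no learning signal.
    kept = [(p, r) for p, r in groups if np.std(r) > 0]
    # Refill with new groups until the training batch is full again.
    while len(kept) < batch_size:
        prompt, rewards = sample_new_group()
        if np.std(rewards) > 0:
            kept.append((prompt, rewards))
    return kept
```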
* fix test reward fn
* cleanup and move episode to later
* allow for not having time/reward metric
* always calculate advantage
becomes the same as reward when num_responses_per_prompt is 1
just because Cursor keeps complaining
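A hedged sketch of the "always calculate advantage" behavior, assuming a GRPO-style group normalization that skips centering for a singleton group, so the advantage reduces to the raw reward when num_responses_per_prompt is 1. `group_advantages` is an illustrative name, not the actual open_instruct function:

```python
import numpy as np


def group_advantages(rewards, eps=1e-8):
    """rewards: scores for all responses to a single prompt."""
    rewards = np.asarray(rewards, dtype=float)
    if rewards.size == 1:
        # A lone response has no group baseline, so advantage == reward.
        return rewards.copy()
    # Standard group-normalized advantage for larger groups.
    return (rewards - rewards.mean()) / (rewards.std() + eps)


assert float(group_advantages([0.7])[0]) == 0.7  # reduces to the reward
```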
* try to fix test
* fix ground truths and datasets
makes grpo and ppo reward functions the same
* fix test
we now return k repeats of each prompt in the batch, not just 1
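A minimal sketch of the "k repeats of each prompt" batching noted above; `repeat_each` is assumed here for illustration and may not match the repo's helper:

```python
def repeat_each(items, k):
    """[a, b] with k=3 -> [a, a, a, b, b, b]."""
    return [item for item in items for _ in range(k)]


batch = repeat_each(["prompt_0", "prompt_1"], k=4)
assert batch == ["prompt_0"] * 4 + ["prompt_1"] * 4
```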
* active sampling in large tests
* Update open_instruct/grpo_fast.py
Co-authored-by: Finbarr Timbers <[email protected]>
* Update open_instruct/grpo_fast.py
Co-authored-by: Finbarr Timbers <[email protected]>
* Cursor was right
* address comments
* nit
* 32b without active sampling
* repeat_each fix
---------
Co-authored-by: root <[email protected]>
Co-authored-by: Finbarr Timbers <[email protected]>