diff --git a/README.md b/README.md index 15a0b90b0..378949aed 100644 --- a/README.md +++ b/README.md @@ -2,123 +2,171 @@ # rLLM -
-πŸš€ Reinforcement Learning for Language Agents🌟 -
- -
-
+**Train your AI agents with RL. Any framework. Minimal code changes.** -
- [![Documentation](https://img.shields.io/badge/Documentation-blue?style=for-the-badge&logo=googledocs&logoColor=white)](https://docs.rllm-project.com/) [![Slack](https://img.shields.io/badge/Slack-4A154B?style=for-the-badge&logo=slack&logoColor=white)](https://join.slack.com/t/rllmproject/shared_invite/zt-3pyblo6ef-m9kqAoInI8xSyUBkpuOyXA) [![Website](https://img.shields.io/badge/Site-%233f72af.svg?style=for-the-badge&logo=semanticweb&logoColor=white)](https://rllm-project.com) [![Blogs](https://img.shields.io/badge/Blogs-007AFF?style=for-the-badge)](https://rllm-project.com/blog) [![X](https://img.shields.io/badge/-black?logo=X&style=for-the-badge)](https://x.com/rllm_project) -
+
-rLLM is an open-source framework for post-training language agents via reinforcement learning. With rLLM, you can easily build your custom agents and environments, train them with reinforcement learning, and deploy them for real-world workloads. - -## Releases πŸ“° - -[2026/02/11] We release [`rLLM-FinQA-4B`](https://rllm-project.com/blog), a 4B financial analysis agent trained with RL that outperforms Qwen3-235B (**59.7% vs 51.4%**) and rivals Gemini 2.5 Pro on Snorkel Finance Benchmark. [[Blog]](https://rllm-project.com/blog) [[Model]](https://huggingface.co/rLLM/rLLM-FinQA-4B) [[Dataset]](https://huggingface.co/datasets/rLLM/finqa) +rLLM is an open-source framework for training AI agents with reinforcement learning. Swap in a tracked client, define a reward function, and let RL handle the rest β€” no matter what agent framework you use. -[2025/12/11] We release rLLM [v0.2.1](https://github.com/rllm-org/rllm/tree/v0.2.1) which comes with support for Tinker backend, LoRA and VLM training, and support for Eval Protocol. We also bumped our `verl` backend to `v0.6.1`. [[SDK Blogpost]](https://rllm-project.com/post.html?post=sdk.md) +## Core Features -[2025/10/16] rLLM [v0.2](https://github.com/rllm-org/rllm/tree/v0.2) is now officially released! We introduce `AgentWorkflowEngine` for training over arbitrary agentic programs. It also comes integrated with the official `verl-0.5.0`, featuring support for Megatron training. Check out this [blog post](https://rllm-project.com/post.html?post=rllm_v0.2.md) for more. +- **Works with any agent framework** β€” LangGraph, SmolAgent, Strands, OpenAI Agents SDK, Google ADK, or plain `openai.OpenAI`. Just swap the client. πŸ”Œ +- **Near-zero code changes** β€” Add `@rllm.rollout` to wrap your agent code, and rLLM traces every LLM call automatically. πŸͺ„ +- **CLI-first workflow** β€” Eval and train from the command line with 50+ built-in benchmarks. `rllm eval gsm8k` just works. 
⚑ +- **Battle-tested results** β€” rLLM-trained agents beat models 50x their size (4B β†’ outperforms 235B on finance, 1.5B β†’ surpasses O1-Preview on math). πŸ“ˆ +- **Multiple RL algorithms** β€” GRPO, REINFORCE, RLOO, rejection sampling, and more. 🧠 +- **Two training backends** β€” `verl` for distributed multi-GPU training, `tinker` for single-machine / CPU setups. Same API either way. πŸ”§ -[2025/07/01] We release [`DeepSWE-Preview`](https://pretty-radio-b75.notion.site/DeepSWE-Training-a-Fully-Open-sourced-State-of-the-Art[…]-by-Scaling-RL-22281902c1468193aabbe9a8c59bbe33?pvs=73), a 32B software engineering agent (SWE) trained with purely RL that achieves 59% on SWEBench-Verified with test-time scaling,(42.2% Pass@1), topping the SWEBench leaderboard for open-weight models. +Read more on our [documentation site](https://docs.rllm-project.com/). -[2025/04/08] We release [`DeepCoder-14B-Preview`](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), a 14B coding model that achieves an impressive **60.6%** Pass@1 accuracy on LiveCodeBench (+8% improvement), matching the performance of `o3-mini-2025-01-031 (Low)` and `o1-2024-12-17`. - -[2025/02/10] We release [`DeepScaleR-1.5B-Preview`](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2), a 1.5B model that surpasses O1-Preview and achieves 43.1% Pass@1 on AIME. We achieve this by iteratively scaling Deepseek's GRPO algorithm from 8Kβ†’16K->24K context length for thinking. - -## Getting Started 🎯 +## Installation rLLM requires `Python >= 3.10` (`3.11` is needed if using `tinker`). You can install it either directly via pip or build from source. 
-There are three ways that you can install rLLM: - -### Approach A: Direct Installation - ```bash -uv pip install "rllm[verl] @ git+https://github.com/rllm-org/rllm.git" +uv pip install "rllm @ git+https://github.com/rllm-org/rllm.git" ``` -_(or replace the `verl` above for `tinker` to install with tinker backend, see below for more details)_ - -### Approach B: Building from Source with `uv` +This installs the dependencies for the rLLM CLI, which uses Tinker as the training backend. -**Step 1: Clone and Setup Environment** +To use `verl` as the training backend (a GPU machine is required), install with: ```bash -# Clone the repository -git clone https://github.com/rllm-org/rllm.git -cd rllm - -# Create an uv environment -uv venv --python 3.11 -source .venv/bin/activate +# For distributed GPU training (verl + vLLM/SGLang) +uv pip install "rllm[verl] @ git+https://github.com/rllm-org/rllm.git" ``` -**Step 2: Install rLLM with Training Backend** +For building from source or Docker, see the [installation guide](https://docs.rllm-project.com/installation). -rLLM supports two training backends: `verl` and `tinker`. Choose one based on your needs. +## Quickstart -_**Option I:** Using `verl` as Training Backend_ +### Option A: CLI (no code needed) ```bash -uv pip install -e .[verl] -``` +# 1. Configure your model provider +rllm model setup -_**Option II:** Using `tinker` as Training Backend_ +# 2. Evaluate on a benchmark +rllm eval gsm8k -```bash -# can add --torch-backend=cpu to train on CPU-only machines -uv pip install -e .[tinker] +# 3. 
Train with RL +rllm train gsm8k ``` -### Approach C: Installation with Docker 🐳 +### Option B: Python API + +Define a rollout (your agent) and an evaluator (your reward function), then hand them to the trainer: + +```python +# my_flow.py +from openai import OpenAI +import rllm +from rllm.experimental.eval.types import AgentConfig, Task +from rllm.types import Episode, Trajectory + +@rllm.rollout +def solve(task: Task, config: AgentConfig) -> Episode: + client = OpenAI(base_url=config.base_url, api_key="EMPTY") + response = client.chat.completions.create( + model=config.model, + messages=[{"role": "user", "content": task.data["question"]}], + ) + answer = response.choices[0].message.content or "" + return Episode( + trajectories=[Trajectory(name="solver", steps=[])], + artifacts={"answer": answer}, + ) +``` -For a containerized setup, you can use Docker: +```python +# my_evaluator.py +import rllm +from rllm.experimental.eval.types import EvalOutput, Signal, _extract_agent_answer +from rllm.types import Episode + +@rllm.evaluator +def score(task: dict, episode: Episode) -> EvalOutput: + answer = _extract_agent_answer(episode) + is_correct = answer.strip() == task["ground_truth"].strip() + reward = 1.0 if is_correct else 0.0 + return EvalOutput(reward=reward, is_correct=is_correct, + signals=[Signal(name="accuracy", value=reward)]) +``` -```bash -# Build the Docker image -docker build -t rllm . 
+```python +# train.py +from my_flow import solve +from my_evaluator import score +from rllm.experimental.unified_trainer import AgentTrainer + +# `config` and `dataset` are assumed to be loaded elsewhere +trainer = AgentTrainer( + backend="tinker", + agent_flow=solve, + evaluator=score, + config=config, + train_dataset=dataset, +) +trainer.train() +``` -# Create and start the container -docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/rllm -v /tmp:/tmp --name rllm-container rllm sleep infinity -docker start rllm-container +During training, `config.base_url` points to a gateway that transparently captures token IDs and logprobs β€” your agent code stays the same for eval and training. -# Enter the container -docker exec -it rllm-container bash -``` +See the [cookbooks](./cookbooks) for complete working examples (single-turn VLM solver, multi-agent solver-judge, and more). + +## Architecture -For more detailed installation guide, including using `sglang` for `verl` backend, please refer to our [documentation](https://rllm-project.readthedocs.io/en/latest/getting-started/installation). +rLLM follows a simple pipeline: **run your agent β†’ collect traces β†’ compute rewards β†’ update the model**. -## Awesome Projects using rLLM πŸ”₯ + ``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ Your Agent │───▢│ Traces │───▢│ Rewards │───▢│ RL Update β”‚ +β”‚ (any code) β”‚ β”‚ (auto-logged)β”‚ β”‚ (your logic) β”‚ β”‚ (GRPO etc.) 
β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` -* [DeepScaleR](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2): Surpassing O1-Preview with a 1.5B Model by Scaling RL -* [DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51): A Fully Open-Source 14B Coder at O3-mini Level -* [DeepSWE](https://pretty-radio-b75.notion.site/DeepSWE-Training-a-Fully-Open-sourced-State-of-the-Art[%E2%80%A6]-by-Scaling-RL-22281902c1468193aabbe9a8c59bbe33): Training a Fully Open-sourced, State-of-the-Art Coding Agent by Scaling RL -* [Tongyi DeepResearch](https://github.com/Alibaba-NLP/DeepResearch): A New Era of Open-Source AI Researchers [![GitHub Repo stars](https://img.shields.io/github/stars/Alibaba-NLP/DeepResearch)](https://github.com/Alibaba-NLP/DeepResearch) -* [Terminal-Bench-RL](https://github.com/Danau5tin/terminal-bench-rl): Training Long-Horizon Terminal Agents with Reinforcement Learning [![GitHub Repo stars](https://img.shields.io/github/stars/Danau5tin/terminal-bench-rl)](https://github.com/Danau5tin/terminal-bench-rl) -* [Cogito, Ergo Ludo](https://www.arxiv.org/abs/2509.25052): An Agent that Learns to Play by Reasoning and Planning -* [PettingLLMs](https://pettingllms-ai.github.io/): Using On-Policy Reinforcement Learning for Stronger Multi-Agent System [![GitHub Repo stars](https://img.shields.io/github/stars/pettingllms-ai/PettingLLMs)](https://github.com/pettingllms-ai/PettingLLMs) -* [Cut the Bill, Keep the Turns](https://agate-slipper-ef0.notion.site/Cut-the-Bill-Keep-the-Turns-Affordable-Multi-Turn-Search-RL-003f78214a4d451fb06f453d084e666c): Affordable Multi-Turn Search RL -* 
[SETA](https://eigent-ai.notion.site/SETA-Scaling-Environment-for-Terminal-Agent-2d2511c70ba280a9b7c0fe3e7f1b6ab8): Scaling Environments for Terminal Agents [![GitHub Repo stars](https://img.shields.io/github/stars/camel-ai/seta)](https://github.com/camel-ai/seta) -* [LLM-in-Sandbox](https://arxiv.org/abs/2601.16206): Building General Agents by running LLMs in a sandbox (virtual computer) [![GitHub Repo stars](https://img.shields.io/github/stars/llm-in-sandbox/llm-in-sandbox?style=social)](https://github.com/llm-in-sandbox/llm-in-sandbox) -* [Experiential Reinforcement Learning](https://arxiv.org/pdf/2602.13949v1): Reinforcement Learning with a Experience–Reflection–Consolidation Loop. -* [rLLM-FinQA-4B](https://rllm-project.com/blog): A 4B Financial Analysis Agent that Outperforms 235B and Rivals Gemini 2.5 Pro [[Model]](https://huggingface.co/rLLM/rLLM-FinQA-4B) [[Dataset]](https://huggingface.co/datasets/rLLM/finqa) +Your agent runs as-is β€” rLLM's SDK intercepts LLM calls and structures them into **Episodes** (one task) containing **Trajectories** (one agent run) made of **Steps** (one LLM call). A reward function scores the result, and the RL algorithm updates the model weights. The same agent code works for both eval and training. 
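The Episode / Trajectory / Step hierarchy can be sketched with simplified dataclasses. This is an illustrative stand-in, not the exact definition of `rllm.types` (the real classes may carry additional fields):

```python
# Simplified sketch of the data model: one Episode per task, one Trajectory
# per agent run, one Step per LLM call. Illustrative only, the real
# rllm.types classes may differ in fields.
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str = ""  # output of one LLM call


@dataclass
class Trajectory:
    name: str  # e.g. "solver" or "judge"
    steps: list[Step] = field(default_factory=list)
    reward: float = 0.0  # filled in later by the evaluator


@dataclass
class Episode:
    trajectories: list[Trajectory] = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)


episode = Episode(
    trajectories=[Trajectory(name="solver", steps=[Step(action="x = 48")])],
    artifacts={"answer": "x = 48"},
)
```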
+ +Under the hood: +- **Workflow Engine** runs N parallel agent instances to collect rollouts +- **LiteLLM Proxy** routes requests and captures token IDs + logprobs +- **Transform Pipeline** groups trajectories for advantage computation +- **Training Backend** (verl or tinker) handles the policy update + +## Community Projects + +- [Tongyi DeepResearch](https://github.com/Alibaba-NLP/DeepResearch) β€” Open-source AI researchers by Alibaba NLP [![Stars](https://img.shields.io/github/stars/Alibaba-NLP/DeepResearch)](https://github.com/Alibaba-NLP/DeepResearch) +- [Terminal-Bench-RL](https://github.com/Danau5tin/terminal-bench-rl) β€” Training long-horizon terminal agents with RL [![Stars](https://img.shields.io/github/stars/Danau5tin/terminal-bench-rl)](https://github.com/Danau5tin/terminal-bench-rl) +- [PettingLLMs](https://github.com/pettingllms-ai/PettingLLMs) β€” Multi-agent RL with on-policy training [![Stars](https://img.shields.io/github/stars/pettingllms-ai/PettingLLMs)](https://github.com/pettingllms-ai/PettingLLMs) +- [SETA](https://github.com/camel-ai/seta) β€” Scaling environments for terminal agents [![Stars](https://img.shields.io/github/stars/camel-ai/seta)](https://github.com/camel-ai/seta) +- [LLM-in-Sandbox](https://github.com/llm-in-sandbox/llm-in-sandbox) β€” Building general agents by running LLMs in a sandbox [![Stars](https://img.shields.io/github/stars/llm-in-sandbox/llm-in-sandbox)](https://github.com/llm-in-sandbox/llm-in-sandbox) +- [Cogito, Ergo Ludo](https://www.arxiv.org/abs/2509.25052) β€” An agent that learns to play by reasoning and planning +- [Cut the Bill, Keep the Turns](https://agate-slipper-ef0.notion.site/Cut-the-Bill-Keep-the-Turns-Affordable-Multi-Turn-Search-RL-003f78214a4d451fb06f453d084e666c) β€” Affordable multi-turn search RL +- [Experiential Reinforcement Learning](https://arxiv.org/abs/2602.13949) β€” Experience-reflection-consolidation loop for RL with sparse rewards +- [V1: Unifying Generation and 
Self-Verification](https://arxiv.org/abs/2603.04304) β€” Pairwise self-verification for parallel test-time scaling +## Articles & Blog Posts + +- [rLLM UI: Real-Time Observability Tool for Agent Training & Evaluation](https://rllm-project.com/post.html?post=rllm_ui.md) β€” Mar 2026 +- [rLLM On-Policy Distillation: Training Smaller Students from Stronger Teachers](https://rllm-project.com/post.html?post=opd.md) β€” Mar 2026 +- [Faster and Better: Open-Source Recipe for Deep Research Agents with Fully Async Training](https://rllm-project.com/post.html?post=async_rl.md) β€” Feb 2026 +- [rLLM-FinQA: How a 4B Model Outperforms 235B and Rivals Gemini 2.5 Pro on Financial Analysis](https://rllm-project.com/post.html?post=finqa.md) β€” Feb 2026 +- [rLLM SDK: Training Any Agentic Program without Code Changes](https://rllm-project.com/post.html?post=sdk.md) β€” Dec 2025 +- [rLLM v0.2: RL Training for General Agentic Programs](https://rllm-project.com/post.html?post=rllm_v0.2.md) β€” Oct 2025 +- [DeepSWE: Open-source SWE Agent via RL](https://pretty-radio-b75.notion.site/DeepSWE-Training-a-Fully-Open-sourced-State-of-the-Art-Coding-Agent-by-Scaling-RL-22281902c1468193aabbe9a8c59bbe33) β€” Jul 2025 +- [DeepCoder: 14B Coder at O3-mini Level](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51) β€” Apr 2025 +- [DeepScaleR: 1.5B Surpasses O1-Preview](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) β€” Feb 2025 ## Acknowledgements -Our work is done as part of [Berkeley Sky Computing Lab](https://sky.cs.berkeley.edu/). The rLLM team is generously supported by grants from [Laude Institute](https://www.laude.org/), [AWS](https://aws.amazon.com/), [Hyperbolic](https://www.hyperbolic.ai/), [Fireworks AI](https://fireworks.ai/), and [Modal](https://modal.com/). 
We pay special thanks to [Together AI](https://www.together.ai/) for the research partnership and compute support. + +Our work is done as part of [Berkeley Sky Computing Lab](https://sky.cs.berkeley.edu/). The rLLM team is generously supported by grants from [Laude Institute](https://www.laude.org/), [AWS](https://aws.amazon.com/), [Hyperbolic](https://www.hyperbolic.ai/), [Fireworks AI](https://fireworks.ai/), and [Modal](https://modal.com/). We pay special thanks to [Together AI](https://www.together.ai/) for the research partnership and compute support. ## Citation + ```bibtex @misc{rllm2025, title={rLLM: A Framework for Post-Training Language Agents}, @@ -126,7 +174,6 @@ Our work is done as part of [Berkeley Sky Computing Lab](https://sky.cs.berkeley year={2025}, howpublished={\url{https://pretty-radio-b75.notion.site/rLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31}}, note={Notion Blog}, - year={2025} } ``` diff --git a/cookbooks/geo3k/README.md b/cookbooks/geo3k/README.md new file mode 100644 index 000000000..53b937ba4 --- /dev/null +++ b/cookbooks/geo3k/README.md @@ -0,0 +1,97 @@ +# Geo3K Flow + +A VLM geometry problem solver for rLLM that trains on the [Geometry3K](https://huggingface.co/datasets/hiyouga/geometry3k) dataset using the **AgentFlow protocol**. + +## Overview + +A single-turn VLM agent that receives a geometry problem with a diagram image and produces a step-by-step solution with a boxed final answer. Uses a plain `OpenAI` client with multimodal content blocks (base64-encoded images). + +During training, `config.base_url` points to the model gateway which transparently captures token IDs and logprobs. During eval, it points directly to the model provider. The agent code is identical in both cases. 
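As a concrete illustration of the base64-encoded multimodal content blocks mentioned above, raw image bytes can be packed into an OpenAI-style `image_url` part roughly like this (a sketch; the helper name is hypothetical, the cookbook's own version is `_build_vlm_content`, and the byte string below is just a fake PNG signature, not a real image):

```python
# Sketch: wrap raw PNG bytes into an OpenAI-style image content block.
# image_block is a hypothetical helper; the fake PNG header is for illustration.
import base64


def image_block(png_bytes: bytes) -> dict:
    encoded = base64.b64encode(png_bytes).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encoded}"},
    }


block = image_block(b"\x89PNG\r\n\x1a\n")  # 8-byte PNG signature as a stand-in
```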
+ +## Architecture + +``` +AgentFlow.run(task, config) + β”‚ + └── Solver + └── OpenAI(base_url=config.base_url).chat.completions.create( + messages=[system_prompt, {images + question}] + ) + β†’ Trajectory(name="solver", steps=[Step(action=response)]) + β”‚ + └── Episode(trajectories=[solver], artifacts={"answer": response}) +``` + +The evaluator extracts `\boxed{}` from the response and grades it against the ground truth using symbolic math grading. + +## Installation + +```bash +# From the rllm repo root +uv pip install -e ".[tinker]" # rllm + tinker backend +uv pip install -e cookbooks/geo3k # this cookbook +``` + +After installation, the agent and evaluator are discoverable by the CLI: + +```bash +rllm agent list # should show "geo3k" as a plugin +``` + +## Dataset + +Pull the Geometry3K dataset (one-time): + +```bash +rllm dataset pull geo3k +``` + +## Training + +### Option 1: rllm CLI + +```bash +rllm train geo3k \ + --agent geo3k \ + --evaluator geo3k_math \ + --model Qwen/Qwen3-VL-30B-A3B-Instruct \ + --lora-rank 32 \ + --group-size 8 \ + --epochs 3 +``` + +### Option 2: Python API + +```bash +python cookbooks/geo3k/train.py \ + rllm/backend=tinker \ + model.name=Qwen/Qwen3-VL-30B-A3B-Instruct \ + model.lora_rank=32 \ + training.group_size=8 +``` + +Or use the provided script (wraps train.py with defaults): + +```bash +bash cookbooks/geo3k/train.sh +``` + +## Eval + +```bash +rllm eval geo3k \ + --agent geo3k \ + --evaluator geo3k_math \ + --model Qwen/Qwen3-VL-30B-A3B-Instruct +``` + +## Files + +| File | Description | +|------|-------------| +| `geo3k_flow.py` | `Geo3KFlow` β€” AgentFlow implementation (VLM single-turn solver) | +| `evaluator.py` | `Geo3KEvaluator` β€” math answer grading with `\boxed{}` extraction | +| `train.py` | Python API training script (Hydra config) | +| `train.sh` | Shell wrapper β€” calls `train.py` with default overrides | +| `pyproject.toml` | Plugin metadata and entry points | +| `test.py` | Unit tests for image handling 
and evaluation | \ No newline at end of file diff --git a/cookbooks/geo3k/evaluator.py b/cookbooks/geo3k/evaluator.py new file mode 100644 index 000000000..a9ee33cc3 --- /dev/null +++ b/cookbooks/geo3k/evaluator.py @@ -0,0 +1,42 @@ +"""Geo3K evaluator: scores geometry answers using math grading.""" + +from __future__ import annotations + +import rllm +from rllm.experimental.eval.types import EvalOutput, Signal, _extract_agent_answer +from rllm.types import Episode + + +@rllm.evaluator +def geo3k_evaluator(task: dict, episode: Episode) -> EvalOutput: + """Grade geometry answers by extracting the boxed answer and comparing to ground truth.""" + from rllm.rewards.math_utils.utils import extract_answer, grade_answer_mathd, grade_answer_sympy + + answer_text = _extract_agent_answer(episode) + model_answer = extract_answer(answer_text) + + if model_answer is None: + return EvalOutput( + reward=0.0, + is_correct=False, + signals=[Signal(name="accuracy", value=0.0)], + ) + + ground_truth = task.get("ground_truth") + if ground_truth is None: + return EvalOutput( + reward=0.0, + is_correct=False, + signals=[Signal(name="accuracy", value=0.0)], + ) + + gt_str = str(ground_truth) + gt_extracted = extract_answer(gt_str) if "\\boxed" in gt_str else gt_str + + is_correct = grade_answer_mathd(model_answer, gt_extracted) or grade_answer_sympy(model_answer, gt_extracted) + reward = 1.0 if is_correct else 0.0 + return EvalOutput( + reward=reward, + is_correct=is_correct, + signals=[Signal(name="accuracy", value=reward)], + ) diff --git a/cookbooks/geo3k/geo3k_flow.py b/cookbooks/geo3k/geo3k_flow.py new file mode 100644 index 000000000..d85341b90 --- /dev/null +++ b/cookbooks/geo3k/geo3k_flow.py @@ -0,0 +1,96 @@ +"""Geo3K AgentFlow β€” VLM geometry problem solver. + +A single-turn VLM agent that solves geometry problems from the Geometry3K +dataset. Uses plain OpenAI client with multimodal content blocks β€” works +identically for eval and training (the gateway handles trace capture). 
+""" + +from __future__ import annotations + +import base64 +import logging + +from openai import OpenAI + +import rllm +from rllm.experimental.eval.types import AgentConfig, Task +from rllm.types import Episode, Trajectory + +logger = logging.getLogger(__name__) + +SYSTEM_PROMPT = """\ +You are a math problem solver with vision capabilities. You are given a \ +geometry problem that includes a diagram. +Solve the problem step by step, showing your reasoning clearly. +Put your final answer in \\boxed{} notation. + +For example: The answer is \\boxed{42}.""" + + +@rllm.rollout +def geo3k_flow(task: Task, config: AgentConfig) -> Episode: + """Single-turn VLM geometry solver.""" + data = task.data + client = OpenAI(base_url=config.base_url, api_key="EMPTY") + question = data.get("question", "") + images = data.get("images", []) + + user_content = _build_vlm_content(question, images) if images else question + + messages = [ + {"role": "system", "content": SYSTEM_PROMPT}, + {"role": "user", "content": user_content}, + ] + + response_text = "" + try: + response = client.chat.completions.create( + model=config.model, + messages=messages, + temperature=0.6, + max_tokens=2048, + ) + response_text = response.choices[0].message.content or "" + except Exception as e: + logger.warning("LLM call failed: %s", e) + + return Episode( + task=data, + trajectories=[Trajectory(name="solver", steps=[])], + artifacts={"answer": response_text}, + ) + + +def _detect_mime(data: bytes) -> str: + if data[:4] == b"\x89PNG": + return "image/png" + if data[:3] == b"\xff\xd8\xff": + return "image/jpeg" + if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP": + return "image/webp" + return "image/png" + + +def _build_vlm_content(text: str, images: list) -> list[dict]: + """Build OpenAI multimodal content blocks from text + image data.""" + content: list[dict] = [] + for img in images: + if img is None: + continue + if isinstance(img, bytes): + mime = _detect_mime(img) + encoded = 
base64.b64encode(img).decode("utf-8") + data_uri = f"data:{mime};base64,{encoded}" + elif isinstance(img, str): + data_uri = img # assume already a URI or URL + else: + # PIL Image β€” convert to bytes + import io + + buf = io.BytesIO() + img.save(buf, format="PNG") + encoded = base64.b64encode(buf.getvalue()).decode("utf-8") + data_uri = f"data:image/png;base64,{encoded}" + content.append({"type": "image_url", "image_url": {"url": data_uri}}) + content.append({"type": "text", "text": text}) + return content diff --git a/cookbooks/geo3k/pyproject.toml b/cookbooks/geo3k/pyproject.toml new file mode 100644 index 000000000..907802175 --- /dev/null +++ b/cookbooks/geo3k/pyproject.toml @@ -0,0 +1,20 @@ +[build-system] +requires = ["setuptools>=64"] +build-backend = "setuptools.build_meta" + +[project] +name = "geo3k-flow" +version = "0.1.0" +description = "Geo3K VLM geometry solver plugin for rLLM (AgentFlow protocol)" +requires-python = ">=3.10" +dependencies = ["rllm", "openai"] + +[tool.setuptools] +py-modules = ["geo3k_flow", "evaluator"] + +# Plugin discovery: rllm CLI finds these by name +[project.entry-points."rllm.agents"] +geo3k = "geo3k_flow:geo3k_flow" + +[project.entry-points."rllm.evaluators"] +geo3k_math = "evaluator:geo3k_evaluator" diff --git a/cookbooks/geo3k/test.py b/cookbooks/geo3k/test.py new file mode 100644 index 000000000..98a139475 --- /dev/null +++ b/cookbooks/geo3k/test.py @@ -0,0 +1,76 @@ +"""Tests for geo3k flow.""" + +from evaluator import geo3k_evaluator +from geo3k_flow import _build_vlm_content, _detect_mime + +from rllm.types import Episode, Step, Trajectory + + +def test_detect_mime_png(): + assert _detect_mime(b"\x89PNG\r\n\x1a\n") == "image/png" + + +def test_detect_mime_jpeg(): + assert _detect_mime(b"\xff\xd8\xff\xe0") == "image/jpeg" + + +def test_build_vlm_content_with_bytes(): + # Minimal PNG header + fake_png = b"\x89PNG" + b"\x00" * 20 + content = _build_vlm_content("What is x?", [fake_png]) + assert len(content) == 2 + assert 
content[0]["type"] == "image_url" + assert content[0]["image_url"]["url"].startswith("data:image/png;base64,") + assert content[1]["type"] == "text" + assert content[1]["text"] == "What is x?" + + +def test_build_vlm_content_no_images(): + content = _build_vlm_content("What is x?", []) + assert len(content) == 1 + assert content[0]["type"] == "text" + + +def test_evaluator_correct(): + task = {"question": "Find x", "ground_truth": "48"} + + episode = Episode( + trajectories=[ + Trajectory(name="solver", steps=[Step(action="The answer is \\boxed{48}")]), + ], + artifacts={"answer": "The answer is \\boxed{48}"}, + ) + + result = geo3k_evaluator.evaluate(task, episode) + assert result.is_correct is True + assert result.reward == 1.0 + + +def test_evaluator_wrong(): + task = {"question": "Find x", "ground_truth": "48"} + + episode = Episode( + trajectories=[ + Trajectory(name="solver", steps=[Step(action="The answer is \\boxed{24}")]), + ], + artifacts={"answer": "The answer is \\boxed{24}"}, + ) + + result = geo3k_evaluator.evaluate(task, episode) + assert result.is_correct is False + assert result.reward == 0.0 + + +def test_evaluator_no_boxed(): + task = {"question": "Find x", "ground_truth": "48"} + + episode = Episode( + trajectories=[ + Trajectory(name="solver", steps=[Step(action="I think 48")]), + ], + artifacts={"answer": "I think 48"}, + ) + + result = geo3k_evaluator.evaluate(task, episode) + assert result.is_correct is False + assert result.reward == 0.0 diff --git a/cookbooks/geo3k/train.py b/cookbooks/geo3k/train.py new file mode 100644 index 000000000..f3fff5693 --- /dev/null +++ b/cookbooks/geo3k/train.py @@ -0,0 +1,39 @@ +"""Train geo3k VLM geometry solver using the Python API. 
+ +Usage (from rllm repo root): + python cookbooks/geo3k/train.py + +Or with Hydra overrides: + python cookbooks/geo3k/train.py model.name=Qwen/Qwen3-VL-30B-A3B-Instruct training.group_size=4 +""" + +import hydra +from evaluator import geo3k_evaluator +from geo3k_flow import geo3k_flow +from omegaconf import DictConfig + +from rllm.data.dataset import DatasetRegistry +from rllm.experimental.unified_trainer import AgentTrainer + + +@hydra.main(config_path="pkg://rllm.experimental.config", config_name="unified", version_base=None) +def main(config: DictConfig): + train_dataset = DatasetRegistry.load_dataset("geo3k", "train") + test_dataset = DatasetRegistry.load_dataset("geo3k", "test") + + if train_dataset is None: + raise RuntimeError("geo3k train split not found. Run: rllm dataset pull geo3k") + + trainer = AgentTrainer( + backend="tinker", + agent_flow=geo3k_flow, + evaluator=geo3k_evaluator, + config=config, + train_dataset=train_dataset, + val_dataset=test_dataset, + ) + trainer.train() + + +if __name__ == "__main__": + main() diff --git a/cookbooks/geo3k/train.sh b/cookbooks/geo3k/train.sh new file mode 100644 index 000000000..667bea785 --- /dev/null +++ b/cookbooks/geo3k/train.sh @@ -0,0 +1,20 @@ +#!/usr/bin/env bash +# Train geo3k VLM geometry solver via train.py with Hydra overrides. +# +# Prerequisites: +# 1. Install rllm with tinker extras: uv pip install -e ".[tinker]" +# 2. Install this cookbook: uv pip install -e cookbooks/geo3k +# 3. 
Pull the dataset: rllm dataset pull geo3k + +set -euo pipefail + +python -u cookbooks/geo3k/train.py \ + rllm/backend=tinker \ + model.name=Qwen/Qwen3-VL-30B-A3B-Instruct \ + model.lora_rank=32 \ + training.group_size=8 \ + rllm.trainer.total_epochs=3 \ + rllm.trainer.test_freq=10 \ + rllm.trainer.project_name=geo3k \ + rllm.trainer.experiment_name=qwen3-vl-30b-instruct \ + "$@" diff --git a/cookbooks/solver_judge_flow/README.md b/cookbooks/solver_judge_flow/README.md new file mode 100644 index 000000000..444c3bf09 --- /dev/null +++ b/cookbooks/solver_judge_flow/README.md @@ -0,0 +1,105 @@ +# Solver-Judge Flow + +A multi-agent flow for rLLM that trains a solver-judge system on the countdown task using the **AgentFlow protocol**. + +## Overview + +The solver-judge pattern uses two agents cooperatively: + +1. **Solver** β€” generates N candidate solutions to a countdown problem in parallel +2. **Judge** β€” evaluates the candidates and selects the best one + +Both agents use a plain `OpenAI` client pointed at `config.base_url`. During training, this URL points to the model gateway which transparently captures token IDs and logprobs for RL optimization. During eval, it points directly to the model provider. The agent code is identical in both cases. + +## Architecture + +``` +AgentFlow.run(task, config) + β”‚ + β”œβ”€β”€ Solver (N parallel threads) + β”‚ └── OpenAI(base_url=config.base_url).chat.completions.create(...) + β”‚ β†’ Trajectory(name="solver", steps=[Step(action=parsed_answer)]) + β”‚ + └── Judge + └── OpenAI(base_url=config.base_url).chat.completions.create(...) 
β†’ Trajectory(name="judge", steps=[Step(action=selected_answer)]) + β”‚ + └── Episode(trajectories=[solver_0, solver_1, ..., judge]) +``` + +The evaluator scores each trajectory independently: +- Solver trajectories are scored by whether their answer is correct +- Judge trajectory is scored by whether the selected answer is correct +- GRPO computes advantages separately for the `solver` and `judge` trajectory groups + +## Installation + +```bash +# From the rllm repo root +uv pip install -e ".[tinker]" # rllm + tinker backend +uv pip install -e cookbooks/solver_judge_flow # this cookbook +``` + +After installation, the agent and evaluator are discoverable by the CLI: + +```bash +rllm agent list # should show "solver_judge" as a plugin +``` + +## Dataset + +Pull the countdown dataset (one-time): + +```bash +rllm dataset pull countdown +``` + +## Training + +### Option 1: rllm CLI + +```bash +rllm train countdown \ + --agent solver_judge \ + --evaluator solver_judge_countdown \ + --model Qwen/Qwen3-8B \ + --lora-rank 32 \ + --group-size 8 \ + --epochs 1 +``` + +### Option 2: Python API + +```bash +python cookbooks/solver_judge_flow/train.py \ + rllm/backend=tinker \ + model.name=Qwen/Qwen3-8B \ + model.lora_rank=32 \ + training.group_size=8 +``` + +Or use the provided script (wraps train.py with defaults): + +```bash +bash cookbooks/solver_judge_flow/train.sh +``` + +## Eval + +```bash +rllm eval countdown \ + --agent solver_judge \ + --evaluator solver_judge_countdown \ + --model Qwen/Qwen3-8B +``` + +## Files + +| File | Description | +|------|-------------| +| `solver_judge_flow.py` | `SolverJudgeFlow` β€” AgentFlow implementation | +| `evaluator.py` | `SolverJudgeCountdownEvaluator` β€” per-trajectory reward scoring | +| `train.py` | Python API training script (Hydra config) | +| `train.sh` | Shell wrapper β€” calls `train.py` with default overrides | +| `pyproject.toml` | Plugin metadata and entry points | +| `test.py` | Unit tests for parsing and evaluation 
| diff --git a/cookbooks/solver_judge_flow/evaluator.py b/cookbooks/solver_judge_flow/evaluator.py new file mode 100644 index 000000000..d9f06a113 --- /dev/null +++ b/cookbooks/solver_judge_flow/evaluator.py @@ -0,0 +1,46 @@ +"""Solver-Judge evaluator: scores solver and judge trajectories independently.""" + +from __future__ import annotations + +import rllm +from rllm.experimental.eval.types import EvalOutput, Signal +from rllm.rewards.countdown_reward import compute_score +from rllm.types import Episode + + +@rllm.evaluator +def solver_judge_countdown_evaluator(task: dict, episode: Episode) -> EvalOutput: + """Score solver and judge trajectories independently. + + Sets per-trajectory rewards so GRPO can compute advantages separately + for solver vs judge trajectory groups. + """ + ground_truth = {"target": task["target"], "numbers": task["nums"]} + + solver_correct = 0 + solver_total = 0 + judge_reward = 0.0 + is_correct = False + + for traj in episode.trajectories: + answer = traj.steps[-1].action if traj.steps else "" + score = compute_score(str(answer), ground_truth) + reward = 1.0 if score >= 1.0 else 0.0 + traj.reward = reward + + if traj.name == "solver": + solver_total += 1 + solver_correct += int(reward >= 1.0) + elif traj.name == "judge": + judge_reward = reward + is_correct = reward >= 1.0 + + solver_acc = solver_correct / solver_total if solver_total > 0 else 0.0 + return EvalOutput( + reward=judge_reward, + is_correct=is_correct, + signals=[ + Signal(name="solver_acc", value=solver_acc), + Signal(name="judge_acc", value=float(is_correct)), + ], + ) diff --git a/cookbooks/solver_judge_flow/pyproject.toml b/cookbooks/solver_judge_flow/pyproject.toml new file mode 100644 index 000000000..6b46d9056 --- /dev/null +++ b/cookbooks/solver_judge_flow/pyproject.toml @@ -0,0 +1,20 @@ +[build-system] +requires = ["setuptools>=64"] +build-backend = "setuptools.build_meta" + +[project] +name = "solver-judge-flow" +version = "0.1.0" +description = "Solver-Judge 
multi-agent flow plugin for rLLM (countdown task)"
+requires-python = ">=3.10"
+dependencies = ["rllm", "openai"]
+
+[tool.setuptools]
+py-modules = ["solver_judge_flow", "evaluator"]
+
+# Plugin discovery: the rllm CLI finds these by entry-point name
+[project.entry-points."rllm.agents"]
+solver_judge = "solver_judge_flow:solver_judge_flow"
+
+[project.entry-points."rllm.evaluators"]
+solver_judge_countdown = "evaluator:solver_judge_countdown_evaluator"
diff --git a/cookbooks/solver_judge_flow/solver_judge_flow.py b/cookbooks/solver_judge_flow/solver_judge_flow.py
new file mode 100644
index 000000000..38010f3b4
--- /dev/null
+++ b/cookbooks/solver_judge_flow/solver_judge_flow.py
@@ -0,0 +1,133 @@
+"""Solver-Judge AgentFlow: a multi-agent countdown solver.
+
+A solver generates N candidate solutions in parallel, then a judge
+selects the best one. It uses a plain OpenAI client and works identically
+for eval and training (the gateway handles trace capture).
+"""
+
+from __future__ import annotations
+
+import re
+from concurrent.futures import ThreadPoolExecutor
+
+from openai import OpenAI
+
+import rllm
+from rllm.experimental.eval.types import AgentConfig, Task
+from rllm.types import Episode, Step, Trajectory
+
+N_SOLUTIONS = 2
+
+
+@rllm.rollout(name="solver-judge")
+def solver_judge_flow(task: Task, config: AgentConfig) -> Episode:
+    """AgentFlow: the solver generates N solutions, the judge picks the best."""
+    data = task.data
+    client = OpenAI(base_url=config.base_url, api_key="EMPTY")
+    problem = data.get("question", "")
+
+    # Step 1: Solver generates N solutions in parallel
+    solver_trajectories = _generate_solutions(client, config.model, problem)
+
+    # Step 2: Judge selects the best solution
+    solutions = [t.steps[0].action for t in solver_trajectories]
+    judge_trajectory = _judge_solutions(client, config.model, problem, solutions)
+
+    selected = judge_trajectory.steps[0].action
+    return Episode(
+        trajectories=[*solver_trajectories, judge_trajectory],
+        artifacts={"answer": 
selected},
+    )
+
+
+def _generate_solutions(client: OpenAI, model: str, problem: str) -> list[Trajectory]:
+    """Generate N solutions in parallel using threads."""
+
+    def _solve() -> Trajectory:
+        messages = [{"role": "user", "content": f"{problem}. Output the final answer within <answer>...</answer>"}]
+        response = client.chat.completions.create(
+            model=model,
+            messages=messages,
+            temperature=1,
+            max_tokens=1000,
+        )
+        content = response.choices[0].message.content or ""
+        parsed = _parse_answer(content)
+        return Trajectory(
+            name="solver",
+            steps=[
+                Step(
+                    chat_completions=messages + [{"role": "assistant", "content": content}],
+                    model_response=content,
+                    action=parsed,
+                )
+            ],
+        )
+
+    with ThreadPoolExecutor(max_workers=N_SOLUTIONS) as pool:
+        futures = [pool.submit(_solve) for _ in range(N_SOLUTIONS)]
+        return [f.result() for f in futures]
+
+
+def _judge_solutions(client: OpenAI, model: str, problem: str, solutions: list[str]) -> Trajectory:
+    prompt = _create_judge_prompt(problem, solutions)
+    messages = [{"role": "user", "content": prompt}]
+    response = client.chat.completions.create(
+        model=model,
+        messages=messages,
+        temperature=1,
+        max_tokens=1000,
+    )
+    content = response.choices[0].message.content or ""
+    selected = _parse_judge_response(content, solutions)
+    return Trajectory(
+        name="judge",
+        steps=[
+            Step(
+                chat_completions=messages + [{"role": "assistant", "content": content}],
+                model_response=content,
+                action=selected,
+            )
+        ],
+    )
+
+
+# -- Parsing helpers --------------------------------------------------------
+
+
+def _parse_answer(response: str) -> str:
+    match = re.search(r"<answer>(.*?)</answer>", response, re.IGNORECASE | re.DOTALL)
+    if match:
+        return f"<answer>{match.group(1).strip()}</answer>"
+    return "No solution found"
+
+
+def _parse_judge_response(response: str, solutions: list[str]) -> str:
+    match = re.search(r"<answer>(.*?)</answer>", response, re.IGNORECASE | re.DOTALL)
+    if match:
+        try:
+            idx = int(match.group(1).strip())
+            return solutions[idx - 1] if 1 <= idx <= len(solutions) else ""
+        except (ValueError, 
IndexError):
+            return ""
+    return ""
+
+
+def _create_judge_prompt(problem: str, solutions: list[str]) -> str:
+    prompt = f"""You are an expert verifier. Given a countdown problem and multiple solution attempts, select a correct solution.
+Problem:
+{problem}
+Solutions to evaluate:
+"""
+    for i, solution in enumerate(solutions, 1):
+        prompt += f"\nSolution {i}:\n{solution}\n"
+
+    prompt += """
+A correct solution must satisfy the following criteria:
+1. The solution uses only the given numbers.
+2. Each number is used exactly once.
+3. Only basic arithmetic operations (+, -, *, /) are used.
+4. The calculation results in the target number.
+5. The final answer is clearly marked within <answer>...</answer> tags.
+Output the index of your selected solution within <answer>...</answer> tags, e.g., <answer>1</answer> for the first solution, <answer>2</answer> for the second solution, etc. If multiple solutions are correct, output the index of the first correct solution."""
+    return prompt
diff --git a/cookbooks/solver_judge_flow/test.py b/cookbooks/solver_judge_flow/test.py
new file mode 100644
index 000000000..609a15080
--- /dev/null
+++ b/cookbooks/solver_judge_flow/test.py
@@ -0,0 +1,58 @@
+"""Tests for the solver-judge flow."""
+
+from evaluator import solver_judge_countdown_evaluator
+from solver_judge_flow import _parse_answer, _parse_judge_response
+
+from rllm.types import Episode, Step, Trajectory
+
+
+def test_parse_answer_extracts_tagged_answer():
+    assert _parse_answer("blah <answer>(1+2)*3</answer> blah") == "<answer>(1+2)*3</answer>"
+
+
+def test_parse_answer_no_match():
+    assert _parse_answer("no answer here") == "No solution found"
+
+
+def test_parse_judge_response():
+    assert _parse_judge_response("<answer>2</answer>", ["sol1", "sol2"]) == "sol2"
+
+
+def test_parse_judge_response_invalid():
+    assert _parse_judge_response("bad", ["sol1"]) == ""
+
+
+def test_evaluator_scores_trajectories():
+    task = {"question": "reach 6", "target": 6, "nums": [1, 2, 3]}
+
+    episode = Episode(
+        trajectories=[
+            Trajectory(name="solver", steps=[Step(action="1 + 2 + 3")]),
+            
Trajectory(name="solver", steps=[Step(action="wrong")]), + Trajectory(name="judge", steps=[Step(action="1 + 2 + 3")]), + ], + ) + + result = solver_judge_countdown_evaluator.evaluate(task, episode) + + assert result.is_correct is True + assert result.reward == 1.0 + assert episode.trajectories[0].reward == 1.0 + assert episode.trajectories[1].reward == 0.0 + assert episode.trajectories[2].reward == 1.0 + + +def test_evaluator_wrong_judge(): + task = {"question": "reach 6", "target": 6, "nums": [1, 2, 3]} + + episode = Episode( + trajectories=[ + Trajectory(name="solver", steps=[Step(action="1 + 2 + 3")]), + Trajectory(name="judge", steps=[Step(action="wrong")]), + ], + ) + + result = solver_judge_countdown_evaluator.evaluate(task, episode) + + assert result.is_correct is False + assert result.reward == 0.0 diff --git a/cookbooks/solver_judge_flow/train.py b/cookbooks/solver_judge_flow/train.py new file mode 100644 index 000000000..01f6f27b2 --- /dev/null +++ b/cookbooks/solver_judge_flow/train.py @@ -0,0 +1,39 @@ +"""Train solver-judge using the Python API. + +Usage (from rllm repo root): + python cookbooks/solver_judge_flow/train.py + +Or with Hydra overrides: + python cookbooks/solver_judge_flow/train.py model.name=Qwen/Qwen3-1.7B training.group_size=4 +""" + +import hydra +from evaluator import solver_judge_countdown_evaluator +from omegaconf import DictConfig +from solver_judge_flow import solver_judge_flow + +from rllm.data.dataset import DatasetRegistry +from rllm.experimental.unified_trainer import AgentTrainer + + +@hydra.main(config_path="pkg://rllm.experimental.config", config_name="unified", version_base=None) +def main(config: DictConfig): + train_dataset = DatasetRegistry.load_dataset("countdown", "train") + test_dataset = DatasetRegistry.load_dataset("countdown", "test") + + if train_dataset is None: + raise RuntimeError("countdown train split not found. 
Run: rllm dataset pull countdown")
+
+    trainer = AgentTrainer(
+        backend="tinker",
+        agent_flow=solver_judge_flow,
+        evaluator=solver_judge_countdown_evaluator,
+        config=config,
+        train_dataset=train_dataset,
+        val_dataset=test_dataset,
+    )
+    trainer.train()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/cookbooks/solver_judge_flow/train.sh b/cookbooks/solver_judge_flow/train.sh
new file mode 100755
index 000000000..e02fdbfb2
--- /dev/null
+++ b/cookbooks/solver_judge_flow/train.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+# Train solver-judge via train.py with Hydra overrides.
+#
+# Prerequisites:
+# 1. Install rllm with tinker extras: uv pip install -e ".[tinker]"
+# 2. Install this cookbook: uv pip install -e cookbooks/solver_judge_flow
+# 3. Pull the dataset: rllm dataset pull countdown
+#
+# UI logging is enabled via rllm.trainer.logger=[console,ui]; remove that override for console-only logs.
+
+set -euo pipefail
+
+python -u train.py \
+    rllm/backend=tinker \
+    model.name=Qwen/Qwen3-8B \
+    model.lora_rank=32 \
+    training.group_size=8 \
+    rllm.workflow.n_parallel_tasks=32 \
+    rllm.trainer.total_epochs=1 \
+    rllm.trainer.test_freq=5 \
+    rllm.trainer.project_name=solver_judge \
+    rllm.trainer.experiment_name=qwen3-8b \
+    rllm.trainer.logger=[console,ui] \
+    "$@"
diff --git a/rllm/__init__.py b/rllm/__init__.py
index 235620579..5170ee075 100644
--- a/rllm/__init__.py
+++ b/rllm/__init__.py
@@ -5,11 +5,20 @@
 
 import sys
 
-__all__ = ["BaseAgent", "Action", "Step", "Trajectory", "Episode"]
+__all__ = ["BaseAgent", "Action", "Step", "Trajectory", "Episode", "rollout", "evaluator"]
 
 
 def __getattr__(name: str):
-    if name in __all__:
+    if name in ("rollout", "evaluator"):
+        from rllm.experimental.eval.rollout_decorator import evaluator, rollout
+
+        _mod = sys.modules[__name__]
+        _mod.rollout = rollout
+        _mod.evaluator = evaluator
+        return rollout if name == "rollout" else evaluator
+
+    _agent_exports = {"BaseAgent", "Action", "Step", "Trajectory", "Episode"}
+    if name in _agent_exports:
+        from 
rllm.agents.agent import Action, BaseAgent, Episode, Step, Trajectory _exports = { diff --git a/rllm/experimental/eval/rollout_decorator.py b/rllm/experimental/eval/rollout_decorator.py new file mode 100644 index 000000000..b3150205c --- /dev/null +++ b/rllm/experimental/eval/rollout_decorator.py @@ -0,0 +1,261 @@ +"""Decorators that turn plain functions into AgentFlow / Evaluator objects. + +``@rollout`` wraps a function so it satisfies the :class:`AgentFlow` protocol, +and ``@evaluator`` wraps a function so it satisfies the :class:`Evaluator` +protocol. Both support bare (``@rollout``) and parameterized +(``@rollout(name="solver")``) syntax. +""" + +from __future__ import annotations + +import asyncio +import inspect +import logging +from collections.abc import Callable +from functools import update_wrapper +from typing import Any, overload + +from rllm.experimental.eval.types import AgentConfig, EvalOutput, Task +from rllm.types import Episode, Trajectory + +logger = logging.getLogger(__name__) + + +# --------------------------------------------------------------------------- +# Return-value coercion helpers +# --------------------------------------------------------------------------- + + +def _coerce_to_episode(result: Any, task: Task, traj_name: str) -> Episode: + """Convert a user function's return value into an Episode.""" + if isinstance(result, Episode): + if result.task is None: + result.task = task.data + return result + + if isinstance(result, list) and result and isinstance(result[0], Trajectory): + answer = "" + if result and result[-1].output: + answer = str(result[-1].output) + return Episode(task=task.data, trajectories=result, artifacts={"answer": answer}) + + if isinstance(result, str): + traj = Trajectory(name=traj_name, steps=[]) + return Episode(task=task.data, trajectories=[traj], artifacts={"answer": result}) + + if isinstance(result, dict): + traj = Trajectory(name=traj_name, steps=[]) + answer = result.get("answer", "") + return 
Episode(task=task.data, trajectories=[traj], artifacts={"answer": answer, **result})
+
+    # Fallback: stringify
+    traj = Trajectory(name=traj_name, steps=[])
+    return Episode(task=task.data, trajectories=[traj], artifacts={"answer": str(result)})
+
+
+def _coerce_to_eval_output(result: Any) -> EvalOutput:
+    """Convert a user function's return value into an EvalOutput."""
+    if isinstance(result, EvalOutput):
+        return result
+
+    if isinstance(result, bool):
+        return EvalOutput(reward=1.0 if result else 0.0, is_correct=result)
+
+    if isinstance(result, int | float):
+        return EvalOutput(reward=float(result), is_correct=float(result) > 0)
+
+    if isinstance(result, tuple) and len(result) == 2:
+        reward, is_correct = result
+        return EvalOutput(reward=float(reward), is_correct=bool(is_correct))
+
+    raise TypeError(f"@evaluator function returned unsupported type {type(result).__name__}; expected EvalOutput, float, bool, or (float, bool)")
+
+
+# ---------------------------------------------------------------------------
+# AgentFlowFn - wrapper produced by @rollout
+# ---------------------------------------------------------------------------
+
+
+class AgentFlowFn:
+    """AgentFlow wrapper that delegates to a plain function.
+
+    Satisfies the :class:`AgentFlow` protocol (``run(task, config) -> Episode``).
+    If the wrapped function is async, ``arun`` is also provided so that
+    :func:`run_agent_flow` can await it directly.
+    """
+
+    def __init__(self, fn: Callable, *, name: str = "solver") -> None:
+        self._fn = fn
+        self._name = name
+        self._is_async = inspect.iscoroutinefunction(fn)
+        update_wrapper(self, fn)
+
+    def run(self, task: Task, config: AgentConfig) -> Episode:
+        if self._is_async:
+            result = asyncio.run(self._fn(task, config))
+        else:
+            result = self._fn(task, config)
+        return _coerce_to_episode(result, task, self._name)
+
+    async def arun(self, task: Task, config: AgentConfig) -> Episode:
+        if self._is_async:
+            result = await self._fn(task, config)
+        else:
+            result = self._fn(task, config)
+        return _coerce_to_episode(result, task, self._name)
+
+    def __call__(self, task: Task, config: AgentConfig) -> Episode:
+        return self.run(task, config)
+
+    def __repr__(self) -> str:
+        return f"AgentFlowFn({self._fn.__name__!r}, name={self._name!r})"
+
+
+# ---------------------------------------------------------------------------
+# EvaluatorFn - wrapper produced by @evaluator
+# ---------------------------------------------------------------------------
+
+
+class EvaluatorFn:
+    """Evaluator wrapper that delegates to a plain function.
+
+    Satisfies the :class:`Evaluator` protocol
+    (``evaluate(task, episode) -> EvalOutput``).
+    """
+
+    def __init__(self, fn: Callable) -> None:
+        self._fn = fn
+        update_wrapper(self, fn)
+
+    def evaluate(self, task: dict, episode: Episode) -> EvalOutput:
+        result = self._fn(task, episode)
+        return _coerce_to_eval_output(result)
+
+    def __call__(self, task: dict, episode: Episode) -> EvalOutput:
+        return self.evaluate(task, episode)
+
+    def __repr__(self) -> str:
+        return f"EvaluatorFn({self._fn.__name__!r})"
+
+
+# ---------------------------------------------------------------------------
+# @rollout decorator
+# ---------------------------------------------------------------------------
+
+
+@overload
+def rollout(fn: Callable) -> AgentFlowFn: ...
+
+
+@overload
+def rollout(*, name: str = "solver", register: str | None = None) -> Callable[[Callable], AgentFlowFn]: ...
+
+
+def rollout(
+    fn: Callable | None = None,
+    *,
+    name: str = "solver",
+    register: str | None = None,
+) -> AgentFlowFn | Callable[[Callable], AgentFlowFn]:
+    """Decorator that turns a function into an :class:`AgentFlow`.
+
+    The decorated function must accept ``(task, config)`` where *task* is a
+    :class:`Task` and *config* is an :class:`AgentConfig`. It may return:
+
+    * a ``str`` - wrapped as the episode answer
+    * an :class:`Episode` - passed through
+    * a ``list[Trajectory]`` - wrapped in an Episode
+    * a ``dict`` - treated as episode artifacts
+
+    Examples::
+
+        @rllm.rollout
+        def solver(task, config):
+            client = OpenAI(base_url=config.base_url, api_key="EMPTY")
+            resp = client.chat.completions.create(
+                model=config.model,
+                messages=[{"role": "user", "content": task.data["question"]}],
+            )
+            return resp.choices[0].message.content
+
+        @rllm.rollout(name="reasoning", register="my-agent")
+        def reasoning_agent(task, config):
+            ...
+
+    Args:
+        fn: The function to decorate (when used without parentheses).
+        name: Trajectory name (default ``"solver"``).
+        register: If provided, register the agent under this name in
+            ``~/.rllm/agents.json`` for CLI discovery.
+    """
+
+    def _decorator(fn: Callable) -> AgentFlowFn:
+        agent = AgentFlowFn(fn, name=name)
+        if register is not None:
+            from rllm.experimental.eval.agent_loader import register_agent
+
+            register_agent(register, agent)
+        return agent
+
+    if fn is not None:
+        return _decorator(fn)
+    return _decorator
+
+
+# ---------------------------------------------------------------------------
+# @evaluator decorator
+# ---------------------------------------------------------------------------
+
+
+@overload
+def evaluator(fn: Callable) -> EvaluatorFn: ...
+
+
+@overload
+def evaluator(*, register: str | None = None) -> Callable[[Callable], EvaluatorFn]: ...
+
+
+def evaluator(
+    fn: Callable | None = None,
+    *,
+    register: str | None = None,
+) -> EvaluatorFn | Callable[[Callable], EvaluatorFn]:
+    """Decorator that turns a function into an :class:`Evaluator`.
+
+    The decorated function must accept ``(task, episode)`` where *task* is a
+    ``dict`` and *episode* is an :class:`Episode`. It may return:
+
+    * an :class:`EvalOutput` - passed through
+    * a ``float`` - reward value (``is_correct = reward > 0``)
+    * a ``bool`` - correct/incorrect (reward 1.0 or 0.0)
+    * a ``(float, bool)`` tuple - ``(reward, is_correct)``
+
+    Examples::
+
+        @rllm.evaluator
+        def exact_match(task, episode):
+            answer = episode.artifacts.get("answer", "")
+            return 1.0 if answer.strip() == task["ground_truth"].strip() else 0.0
+
+        @rllm.evaluator(register="my-eval")
+        def custom_eval(task, episode):
+            ...
+            return EvalOutput(reward=score, is_correct=score > 0.5)
+
+    Args:
+        fn: The function to decorate (when used without parentheses).
+        register: If provided, register the evaluator under this name in
+            ``~/.rllm/evaluators.json`` for CLI discovery.
+ """ + + def _decorator(fn: Callable) -> EvaluatorFn: + ev = EvaluatorFn(fn) + if register is not None: + from rllm.experimental.eval.evaluator_loader import register_evaluator + + register_evaluator(register, ev) + return ev + + if fn is not None: + return _decorator(fn) + return _decorator diff --git a/tests/eval/test_rollout_decorator.py b/tests/eval/test_rollout_decorator.py new file mode 100644 index 000000000..cdb575070 --- /dev/null +++ b/tests/eval/test_rollout_decorator.py @@ -0,0 +1,406 @@ +"""Tests for @rollout and @evaluator decorators.""" + +from __future__ import annotations + +import asyncio +from unittest.mock import patch + +import pytest + +from rllm.experimental.eval.rollout_decorator import ( + AgentFlowFn, + EvaluatorFn, + _coerce_to_episode, + _coerce_to_eval_output, + evaluator, + rollout, +) +from rllm.experimental.eval.types import AgentConfig, AgentFlow, EvalOutput, Evaluator, Task, run_agent_flow +from rllm.types import Episode, Trajectory + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- + + +@pytest.fixture() +def task(): + return Task(data={"question": "What is 2+2?", "ground_truth": "4"}) + + +@pytest.fixture() +def config(): + return AgentConfig(base_url="http://localhost:4000/v1", model="test-model", session_uid="test-uid") + + +# --------------------------------------------------------------------------- +# @rollout tests +# --------------------------------------------------------------------------- + + +class TestRolloutBareDecorator: + def test_creates_agent_flow_fn(self): + @rollout + def my_agent(task, config): + return "hello" + + assert isinstance(my_agent, AgentFlowFn) + + def test_has_run_method(self): + @rollout + def my_agent(task, config): + return "hello" + + assert hasattr(my_agent, "run") + assert callable(my_agent.run) + + def test_has_arun_method(self): + @rollout + def my_agent(task, config): + 
return "hello" + + assert hasattr(my_agent, "arun") + + def test_satisfies_agent_flow_protocol(self): + @rollout + def my_agent(task, config): + return "hello" + + assert isinstance(my_agent, AgentFlow) + + def test_run_returns_episode(self, task, config): + @rollout + def my_agent(task, config): + return "the answer is 4" + + episode = my_agent.run(task, config) + assert isinstance(episode, Episode) + assert episode.artifacts["answer"] == "the answer is 4" + assert episode.task == task.data + + def test_default_trajectory_name(self, task, config): + @rollout + def my_agent(task, config): + return "answer" + + episode = my_agent.run(task, config) + assert len(episode.trajectories) == 1 + assert episode.trajectories[0].name == "solver" + + def test_callable(self, task, config): + @rollout + def my_agent(task, config): + return "answer" + + episode = my_agent(task, config) + assert isinstance(episode, Episode) + + +class TestRolloutParameterizedDecorator: + def test_custom_name(self, task, config): + @rollout(name="reasoning") + def my_agent(task, config): + return "answer" + + episode = my_agent.run(task, config) + assert episode.trajectories[0].name == "reasoning" + + def test_register_calls_register_agent(self): + with patch("rllm.experimental.eval.agent_loader.register_agent") as mock_reg: + + @rollout(register="my-agent") + def my_agent(task, config): + return "answer" + + mock_reg.assert_called_once_with("my-agent", my_agent) + + def test_repr(self): + @rollout(name="solver") + def my_agent(task, config): + return "answer" + + assert "AgentFlowFn" in repr(my_agent) + assert "my_agent" in repr(my_agent) + + +class TestRolloutReturnCoercion: + def test_str_return(self, task, config): + @rollout + def my_agent(task, config): + return "four" + + ep = my_agent.run(task, config) + assert ep.artifacts["answer"] == "four" + assert len(ep.trajectories) == 1 + assert ep.trajectories[0].steps == [] + + def test_episode_return(self, task, config): + @rollout + def 
my_agent(task, config): + return Episode(task=task.data, trajectories=[], artifacts={"answer": "direct"}) + + ep = my_agent.run(task, config) + assert ep.artifacts["answer"] == "direct" + + def test_episode_return_fills_task(self, task, config): + @rollout + def my_agent(task, config): + return Episode(trajectories=[], artifacts={"answer": "x"}) + + ep = my_agent.run(task, config) + assert ep.task == task.data + + def test_trajectory_list_return(self, task, config): + @rollout + def my_agent(task, config): + return [ + Trajectory(name="solver", steps=[], output="sol1"), + Trajectory(name="judge", steps=[], output="sol2"), + ] + + ep = my_agent.run(task, config) + assert len(ep.trajectories) == 2 + assert ep.trajectories[0].name == "solver" + assert ep.artifacts["answer"] == "sol2" + + def test_dict_return(self, task, config): + @rollout + def my_agent(task, config): + return {"answer": "four", "confidence": 0.9} + + ep = my_agent.run(task, config) + assert ep.artifacts["answer"] == "four" + assert ep.artifacts["confidence"] == 0.9 + + def test_fallback_to_str(self, task, config): + @rollout + def my_agent(task, config): + return 42 + + ep = my_agent.run(task, config) + assert ep.artifacts["answer"] == "42" + + +class TestRolloutAsync: + def test_async_function_via_run(self, task, config): + @rollout + async def my_agent(task, config): + return "async answer" + + ep = my_agent.run(task, config) + assert ep.artifacts["answer"] == "async answer" + + def test_async_function_via_arun(self, task, config): + @rollout + async def my_agent(task, config): + return "async answer" + + ep = asyncio.run(my_agent.arun(task, config)) + assert ep.artifacts["answer"] == "async answer" + + def test_sync_function_via_arun(self, task, config): + @rollout + def my_agent(task, config): + return "sync answer" + + ep = asyncio.run(my_agent.arun(task, config)) + assert ep.artifacts["answer"] == "sync answer" + + def test_works_with_run_agent_flow(self, task, config): + @rollout + def 
my_agent(task, config): + return "hello" + + ep = asyncio.run(run_agent_flow(my_agent, task, config)) + assert isinstance(ep, Episode) + assert ep.artifacts["answer"] == "hello" + + +# --------------------------------------------------------------------------- +# @evaluator tests +# --------------------------------------------------------------------------- + + +class TestEvaluatorBareDecorator: + def test_creates_evaluator_fn(self): + @evaluator + def my_eval(task, episode): + return 1.0 + + assert isinstance(my_eval, EvaluatorFn) + + def test_has_evaluate_method(self): + @evaluator + def my_eval(task, episode): + return 1.0 + + assert hasattr(my_eval, "evaluate") + assert callable(my_eval.evaluate) + + def test_satisfies_evaluator_protocol(self): + @evaluator + def my_eval(task, episode): + return 1.0 + + assert isinstance(my_eval, Evaluator) + + def test_evaluate_returns_eval_output(self): + @evaluator + def my_eval(task, episode): + return 1.0 + + result = my_eval.evaluate({"ground_truth": "4"}, Episode(trajectories=[])) + assert isinstance(result, EvalOutput) + + def test_callable(self): + @evaluator + def my_eval(task, episode): + return 1.0 + + result = my_eval({"ground_truth": "4"}, Episode(trajectories=[])) + assert isinstance(result, EvalOutput) + + +class TestEvaluatorParameterizedDecorator: + def test_register_calls_register_evaluator(self): + with patch("rllm.experimental.eval.evaluator_loader.register_evaluator") as mock_reg: + + @evaluator(register="my-eval") + def my_eval(task, episode): + return 1.0 + + mock_reg.assert_called_once_with("my-eval", my_eval) + + def test_repr(self): + @evaluator + def my_eval(task, episode): + return 1.0 + + assert "EvaluatorFn" in repr(my_eval) + assert "my_eval" in repr(my_eval) + + +class TestEvaluatorReturnCoercion: + def test_eval_output_passthrough(self): + @evaluator + def my_eval(task, episode): + return EvalOutput(reward=0.5, is_correct=False) + + result = my_eval.evaluate({}, Episode(trajectories=[])) + 
assert result.reward == 0.5 + assert result.is_correct is False + + def test_float_return(self): + @evaluator + def my_eval(task, episode): + return 0.75 + + result = my_eval.evaluate({}, Episode(trajectories=[])) + assert result.reward == 0.75 + assert result.is_correct is True + + def test_float_zero_is_not_correct(self): + @evaluator + def my_eval(task, episode): + return 0.0 + + result = my_eval.evaluate({}, Episode(trajectories=[])) + assert result.reward == 0.0 + assert result.is_correct is False + + def test_bool_true(self): + @evaluator + def my_eval(task, episode): + return True + + result = my_eval.evaluate({}, Episode(trajectories=[])) + assert result.reward == 1.0 + assert result.is_correct is True + + def test_bool_false(self): + @evaluator + def my_eval(task, episode): + return False + + result = my_eval.evaluate({}, Episode(trajectories=[])) + assert result.reward == 0.0 + assert result.is_correct is False + + def test_tuple_return(self): + @evaluator + def my_eval(task, episode): + return (0.5, True) + + result = my_eval.evaluate({}, Episode(trajectories=[])) + assert result.reward == 0.5 + assert result.is_correct is True + + def test_unsupported_type_raises(self): + @evaluator + def my_eval(task, episode): + return "not valid" + + with pytest.raises(TypeError, match="unsupported type"): + my_eval.evaluate({}, Episode(trajectories=[])) + + +# --------------------------------------------------------------------------- +# Coercion helper unit tests +# --------------------------------------------------------------------------- + + +class TestCoerceToEpisode: + def test_episode_passthrough(self): + task = Task(data={"q": "test"}) + ep = Episode(task={"q": "test"}, trajectories=[], artifacts={"answer": "x"}) + result = _coerce_to_episode(ep, task, "solver") + assert result is ep + + def test_str(self): + task = Task(data={"q": "test"}) + result = _coerce_to_episode("hello", task, "solver") + assert result.artifacts["answer"] == "hello" + + def 
test_dict(self): + task = Task(data={"q": "test"}) + result = _coerce_to_episode({"answer": "world", "extra": 1}, task, "solver") + assert result.artifacts["answer"] == "world" + assert result.artifacts["extra"] == 1 + + +class TestCoerceToEvalOutput: + def test_eval_output(self): + eo = EvalOutput(reward=1.0, is_correct=True) + assert _coerce_to_eval_output(eo) is eo + + def test_float(self): + result = _coerce_to_eval_output(0.5) + assert result.reward == 0.5 + assert result.is_correct is True + + def test_bool(self): + result = _coerce_to_eval_output(False) + assert result.reward == 0.0 + assert result.is_correct is False + + def test_tuple(self): + result = _coerce_to_eval_output((0.3, False)) + assert result.reward == 0.3 + assert result.is_correct is False + + +# --------------------------------------------------------------------------- +# Top-level import test +# --------------------------------------------------------------------------- + + +class TestTopLevelImport: + def test_rllm_rollout(self): + import rllm + + assert rllm.rollout is rollout + + def test_rllm_evaluator(self): + import rllm + + assert rllm.evaluator is evaluator
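
For readers who want the `@evaluator` return-value contract in isolation, here is a framework-free sketch of the coercion rules that `_coerce_to_eval_output` implements (float, bool, and `(reward, is_correct)` tuples). The `EvalResult` dataclass below is an illustrative stand-in, not rllm's `EvalOutput`:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class EvalResult:
    # Stand-in for rllm's EvalOutput; only the two fields used here.
    reward: float
    is_correct: bool


def coerce(result: Any) -> EvalResult:
    """Apply the documented @evaluator coercion rules to a raw return value."""
    if isinstance(result, EvalResult):
        return result
    # bool must be checked before int/float: bool is a subclass of int.
    if isinstance(result, bool):
        return EvalResult(1.0 if result else 0.0, result)
    if isinstance(result, (int, float)):
        return EvalResult(float(result), float(result) > 0)
    if isinstance(result, tuple) and len(result) == 2:
        reward, ok = result
        return EvalResult(float(reward), bool(ok))
    raise TypeError(f"unsupported evaluator return type: {type(result).__name__}")


# Usage mirrors the coercion cases exercised by the tests above.
assert coerce(0.75) == EvalResult(0.75, True)
assert coerce(False) == EvalResult(0.0, False)
assert coerce((0.3, False)) == EvalResult(0.3, False)
```

Note the ordering of the `isinstance` checks: because `True` is an instance of `int`, the bool branch must come first or `coerce(True)` would take the numeric path.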