[Feature]: rLLM CLI, AgentFlow Framework, Model Gateway & Plugin System #438
Merged
jeffreysijuntan merged 77 commits into main on Mar 13, 2026
Conversation
Add RLLMTrajectoryHookProvider, a Strands HookProvider that captures LLM calls during agent execution and builds TrajectoryView objects for RL training. Converts Bedrock-style messages to OpenAI Chat Completions format. Includes examples for simple, tool-using, and multi-agent setups. Also harden integration __init__.py imports with broad exception handling to prevent broken optional deps from blocking unrelated integrations. Made-with: Cursor
Resolve conflict in unified_trainer.py:
- Adopt main's asyncio.run() pattern (drop the background event loop thread)
- Keep SDK integration (agent_run_func, SdkWorkflowFactory, post_execute_hook)
- Adopt main's TrainerState enhancements (total_steps, reset_batch, etc.)
- Keep SDK factory cleanup in shutdown()
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SDK previously maintained separate StepView/TrajectoryView aliases for the canonical Step/Trajectory types from rllm.types. This removes the indirection and uses Step/Trajectory directly across the entire codebase, including integrations, engines, examples, tests, and docs. Also renames trace_to_step_view -> trace_to_step. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the InferenceAPIServer-based Tinker path with a TinkerProxyManager that follows the same LiteLLM proxy pattern as VerlProxyManager. This ensures Tinker traces flow through TracingCallback (metadata routing, session context) instead of being stored directly by the inference server.
Key changes:
- New TinkerBackendServer: a lightweight FastAPI app wrapping TinkerEngine as an OpenAI-compatible /v1/chat/completions endpoint, with token IDs and logprobs embedded as top-level choice fields (LiteLLM auto-collects these into provider_specific_fields)
- New TinkerProxyManager: starts the backend server, generates a LiteLLM config with the hosted_vllm/ prefix, and manages the lifecycle of both proxy and backend
- Clean LiteLLM message dicts in build_llm_output to match Step's dict[str, str] schema (strip None values, promote reasoning from provider_specific_fields)
- Remove InferenceAPIServer usage and the inference_server parameter from SdkWorkflowFactory/SdkWorkflow
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…LMModel

The root cause was that Strands' OpenAIModel always uses stream=True with the raw OpenAI SDK, but the LiteLLM proxy (hosted_vllm/ + fake_stream) sent SSE chunks lacking a proper finish_reason. Without finish_reason, Strands' process_stream never emitted contentBlockStop, so accumulated text was never finalized, resulting in empty message content.
Key changes:
- examples/sdk/strands_math: switch from OpenAIModel to LiteLLMModel
- strands.py: remove the `not self._traces` guard from _build_trajectories() so trajectories are always built even when no traces were recorded
- strands.py: add support for plain-string content (OpenAI format) and reasoningContent blocks in the message converters
- tinker_backend_server.py: add streaming SSE support and content-block array flattening for multi-format message content
- sdk_workflow.py: add tracer flush retry and a positional step-matching fallback for Strands hook provider ID mismatches
- proxy_manager.py: enable fake_stream for TinkerProxyManager
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
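For context on why the missing finish_reason broke things: OpenAI-style streaming consumers only finalize accumulated deltas once a chunk carries a terminal finish_reason. A minimal sketch of a well-formed chunk stream follows; the function name and chunk shape are illustrative, not code from this PR.

```python
import json

def sse_chunks(text: str):
    """Yield OpenAI-style chat.completion.chunk SSE lines for `text`.

    Each word becomes a content delta; the final chunk carries
    finish_reason='stop' so a stream consumer knows to finalize the
    accumulated content, followed by the [DONE] sentinel.
    """
    for piece in text.split():
        chunk = {
            "object": "chat.completion.chunk",
            "choices": [{"index": 0, "delta": {"content": piece + " "}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    final = {
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
    }
    yield f"data: {json.dumps(final)}\n\n"
    yield "data: [DONE]\n\n"

lines = list(sse_chunks("hello world"))
```

A stream that omits the finish_reason="stop" chunk is exactly the failure mode described above: the consumer keeps waiting and never emits its stop event.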
TinkerBackendServer has native SSE streaming support, so the LiteLLM proxy no longer needs to convert non-streaming responses to SSE. Strands' LiteLLMModel sends stream=True by default, which the backend now handles directly via _stream_sse(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New example at examples/sdk/openai_agents_math/ following the same agent_run_func pattern as the ADK and Strands examples. Uses OpenAIProvider with use_responses=False to route Chat Completions through the trainer's LiteLLM proxy. Also fix the same _ensure_trajectories guard bug in openai_agents.py that was fixed in strands.py — always build a trajectory even when _traces is empty so output/input are preserved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enables running agent code inside isolated environments (local subprocess, Docker, with stubs for Modal/AgentCore) instead of inside the trainer process. Agents expose a `rollout(task, config) -> list[Trajectory]` contract and communicate results back via a SQLite-backed result store through the proxy.
Key components:
- SandboxOrchestrator with a persistent worker pool and per-task modes
- ExecutionResultStore (SQLite + WAL) for cross-process result delivery
- worker_server.py runner for inside sandboxes (fire-and-forget execution)
- Local and Docker sandbox backends
- Proxy routes for result submission/retrieval
- SdkWorkflow integration with a sandboxed execution path
- Lazy __getattr__ imports in rllm/__init__.py to avoid a torch dependency
- Example sandbox_math agent with a smoke test
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
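The sandbox contract can be sketched as below. The Trajectory dataclass here is a simplified stand-in for rllm's real type, and the agent body is a trivial placeholder; only the `rollout(task, config) -> list[Trajectory]` shape comes from the commit message.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Trajectory:
    """Simplified stand-in for rllm's Trajectory type (illustration only)."""
    steps: list[dict] = field(default_factory=list)
    reward: float = 0.0

def rollout(task: dict[str, Any], config: dict[str, Any]) -> list[Trajectory]:
    """The contract a sandboxed agent exposes: the worker server calls this
    with a task, and the returned trajectories are posted back to the
    proxy's result store."""
    # A trivial agent: record the task as a single user step.
    step = {"role": "user", "content": str(task.get("question", ""))}
    return [Trajectory(steps=[step])]

trajs = rollout({"question": "What is 2 + 2?"}, {"mode": "local"})
```

Keeping the contract this small is what lets the same agent code run unchanged under local subprocess, Docker, or remote backends.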
Implement the agentcore sandbox backend, which invokes pre-deployed AgentCore Runtimes via boto3 instead of managing local sandboxes. The ACR container runs agentcore_worker.py and POSTs results directly to the proxy's result store, following the same pattern as the local/docker backends.
- AgentCoreOrchestrator: bypasses the SandboxOrchestrator protocol, invokes ACR via invoke_agent_runtime with rate limiting and adaptive retry
- agentcore_worker.py: a self-contained BedrockAgentCoreApp entrypoint with inlined metadata slug encoding and result POST (no rllm dependency)
- base_url override in config.extra for proxy reachability from ACR
- Example Dockerfile, requirements, and step-by-step deployment docs using the agentcore CLI (configure/deploy workflow)
- IAM permissions documentation for InvokeAgentRuntime
- Fallback rllm_types.py for agent environments without rllm installed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion reuse Add exponential-backoff retries to result submission across all worker variants, replace per-request aiohttp sessions with shared sessions, and introduce async event-based waiting in ExecutionResultStore to eliminate thread-blocking polling. Also parallelizes persistent pool worker creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
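The exponential-backoff retry pattern referenced here can be sketched as follows. This is a generic illustration, not the PR's code: the `submit` callable stands in for whatever async result POST the workers perform, and the delay constants are arbitrary.

```python
import asyncio
import random

async def submit_with_retry(submit, max_attempts: int = 5, base_delay: float = 0.05):
    """Retry an async submission callable with exponential backoff plus jitter.

    `submit` is any zero-arg coroutine function (e.g. a result POST);
    the last failure is re-raised once attempts are exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return await submit()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Doubling delay per attempt, with jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

# Usage: a flaky submission that fails twice, then succeeds.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = asyncio.run(submit_with_retry(flaky))
```

Combined with a shared HTTP session (rather than one per request), this keeps transient network failures from dropping results while bounding retry load.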
Add the rllm CLI (`rllm setup`, `rllm dataset`, `rllm agent`, `rllm eval`) built on Click, including interactive provider configuration, dataset pull/list/info/inspect/remove, agent listing, and LiteLLM proxy-based eval.
- Add CLI entry point and commands (setup, dataset, agent, eval)
- Add a v2 dataset registry with parquet storage and v1 migration
- Add DatasetMetadata and DatasetConfig types
- Add click and simple-term-menu dependencies
- Include registry/*.json as package data
- Add tests for CLI commands and dataset registry migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce a two-stage eval pipeline that separates agent execution from evaluation. AgentFlows produce Episodes (trajectories without rewards), and Evaluators score them independently, enabling swappable evaluation logic, multiple signals, and diverse agent programs (multi-agent, ADK, OpenAI SDK).
- Add AgentFlow/Evaluator protocols and AgentConfig, Signal, EvalOutput types
- Add built-in evaluators: MathEvaluator, CountdownEvaluator, CodeEvaluator, F1Evaluator, CompoundEvaluator
- Rewrite built-in agents as AgentFlow classes (no reward fn imports)
- Update EvalRunner for the two-stage pipeline (agent.run → evaluator.evaluate)
- Add evaluator_loader with registry and catalog auto-resolution
- Add signals to EvalItem/EvalResult, artifacts to Episode, auto-generated Episode.id
- Add --evaluator CLI option with auto-resolve from datasets.json
- Update eval-framework.md documentation
- Add comprehensive tests (test_eval_types, test_evaluator_loader, test_eval_runner)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
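The two-stage split can be sketched with Python Protocols. The Episode and Signal dataclasses here are simplified stand-ins (their real field sets are not shown in this PR text), and EchoFlow/ExactMatchEvaluator are toy implementations for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

@dataclass
class Episode:
    """Stand-in: an agent run's output, with no reward attached."""
    messages: list[dict]
    artifacts: dict[str, Any] = field(default_factory=dict)

@dataclass
class Signal:
    """Stand-in: one named evaluation score."""
    name: str
    value: float

class AgentFlow(Protocol):
    def run(self, task: dict[str, Any]) -> Episode: ...

class Evaluator(Protocol):
    def evaluate(self, task: dict[str, Any], episode: Episode) -> list[Signal]: ...

class EchoFlow:
    """Toy flow: answers with the question text verbatim."""
    def run(self, task: dict[str, Any]) -> Episode:
        return Episode(messages=[{"role": "assistant", "content": task["question"]}])

class ExactMatchEvaluator:
    """Toy evaluator: 1.0 if the last message equals the reference answer."""
    def evaluate(self, task: dict[str, Any], episode: Episode) -> list[Signal]:
        answer = episode.messages[-1]["content"]
        return [Signal("exact_match", float(answer == task["answer"]))]

# Two-stage pipeline: execution produces an Episode; scoring happens separately.
task = {"question": "4", "answer": "4"}
episode = EchoFlow().run(task)
signals = ExactMatchEvaluator().evaluate(task, episode)
```

Because the evaluator never runs inside the agent, the same flow can be re-scored with different evaluators, or with several at once.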
…ucture

Expand the rllm eval framework from 5 to 19 supported benchmarks by adding MCQ (mmlu_pro, mmlu_redux, gpqa_diamond, supergpqa, ceval, mmmlu), math (hmmt), code (humaneval, mbpp, livecodebench), instruction-following (ifeval, ifbench), long-context (longbench_v2), and agentic (bfcl, multichallenge) benchmarks. All datasets pull from upstream HuggingFace repos with row-level transforms for field normalization.
Key additions:
- 14 dataset transform functions in rllm/data/transforms.py
- 4 new agents: MCQ, IFEval, BFCL function-calling, multi-turn
- 4 new evaluators: MCQ, IFEval (vendored verification), BFCL, LLM judge
- Extended _pull.py with hf_config, field_map, and transform support
- 216 tests passing across all new and existing components
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… keys Replace `rllm setup` with `rllm model setup/swap/show`. Config now stores per-provider API keys so swapping providers doesn't re-prompt for known keys. Add GPT-5 family, o3-pro, Gemini 3 family, and gemini-2.5-flash-lite to supported models. Old config format and `rllm setup` alias are preserved for backward compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ngBench configs New benchmarks: HMMT Nov 25, AA-LCR, HLE, MMLU-ProX, INCLUDE, Global PIQA, PolyMATH, WMT24++. Fix LongBench v2 (was QA+F1, now MCQ), fix HMMT split (test→train). Add aggregate_configs support in pull to merge all language configs into a single dataset with a language column. New infrastructure: reasoning agent (CoT), translation agent, LLM equality evaluator, ChrF translation evaluator. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
First vision-capable extension of the eval framework:
- Image extraction in the pull pipeline (PIL Images → on-disk PNGs, replaced with relative paths)
- 3 VLM agent flows (vlm_mcq, vlm_math, vlm_open) with multimodal OpenAI API messages
- 9 dataset transforms (MMMU, MMMU-Pro, MathVision, MathVista, DynaMath, ZEROBench, ZEROBench-Sub, VLMs Are Blind, BabyVision)
- Registry entries for 10 datasets and 3 agents under a new 'vlm' category
- 54 new tests (agent flows, transforms, protocol conformance)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements `rllm train <benchmark>`, which reuses the eval framework's AgentFlows and Evaluators to run RL training via the Tinker backend. Wraps AgentFlow + Evaluator into an agent_run_func and hands it to the experimental AgentTrainer.
Key fixes beyond the initial implementation:
- Inject session routing metadata into the base URL so plain OpenAI clients (used by AgentFlows) propagate session_uids to the LiteLLM proxy, enabling trace collection for training episodes.
- Pass workflow timeout/gamma/reward_bonus_coeff through SdkWorkflowFactory.get_workflow_args() to prevent training hangs.
- Set val_before_train=false in the base.yaml default config.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
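The "metadata in the base URL" trick can be illustrated as below. This is a hypothetical encoding scheme, not the PR's actual slug format: the point is only that a plain OpenAI client, given a doctored base URL, carries routing metadata in its request path without any client-side changes.

```python
import base64
import json

def encode_metadata_slug(metadata: dict) -> str:
    """Pack routing metadata into a URL-safe slug that rides along in the
    base URL path, so an unmodified OpenAI client transmits it."""
    raw = json.dumps(metadata, separators=(",", ":"), sort_keys=True).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_metadata_slug(slug: str) -> dict:
    """Inverse of encode_metadata_slug; restores the stripped base64 padding."""
    padded = slug + "=" * (-len(slug) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))

# A proxy (URL and port are hypothetical) would parse the slug out of the
# request path before forwarding the chat completion upstream.
meta = {"session_uid": "abc123"}
base_url = f"http://localhost:4000/meta/{encode_metadata_slug(meta)}/v1"
roundtrip = decode_metadata_slug(encode_metadata_slug(meta))
```

The proxy side then attributes every trace produced under that base URL to the right training episode.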
Replace the two-hop architecture (LiteLLM proxy → TinkerBackendServer → TinkerEngine) with a single lightweight TinkerProxy that calls TinkerEngine directly in-process. This eliminates LiteLLM overhead and one HTTP round-trip per inference call during `rllm train`.
- Add rllm/sdk/proxy/tinker_proxy.py: a FastAPI server handling OpenAI-compatible chat completions, metadata-slug routing, and trace persistence via SqliteTracer
- Rewrite TinkerProxyManager to start TinkerProxy instead of a LiteLLM subprocess + TinkerBackendServer
- Simplify SdkWorkflowFactory._setup_tinker_proxy() to match the new API
- Fix trace output format: nest token_ids/response_logprobs under provider_specific_fields so data_process.py extractors find them
- Fix flush_tracer: use asyncio.to_thread for the queue drain instead of loop.run_until_complete, which fails inside a running event loop
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Defer torch, pandas, polars, litellm, and training module imports from top-level to point-of-use in CLI commands and Dataset class. Replace eager subcommand registration with a _LazyGroup that imports modules on first invocation. Update test mock patch paths to match new import sites. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
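The lazy-group idea can be sketched without Click: defer each subcommand's module import until the command is actually invoked. The class and mapping below are illustrative stand-ins for the PR's Click-based _LazyGroup, using `math:sqrt` as a harmless demo target.

```python
import importlib

class LazyCommandGroup:
    """Sketch of deferred subcommand loading: a command's module is imported
    on first invocation rather than at CLI startup, so `--help` and light
    commands don't pay for heavy dependencies (torch, pandas, etc.)."""

    def __init__(self, lazy_subcommands: dict[str, str]):
        # Maps command name -> "module.path:attribute".
        self._lazy = lazy_subcommands
        self._loaded: dict[str, object] = {}

    def get_command(self, name: str):
        if name not in self._loaded:
            module_path, attr = self._lazy[name].split(":")
            module = importlib.import_module(module_path)  # deferred import
            self._loaded[name] = getattr(module, attr)
        return self._loaded[name]

group = LazyCommandGroup({"sqrt": "math:sqrt"})
value = group.get_command("sqrt")(9.0)
```

With Click, the same shape is achieved by overriding `Group.get_command` (and `list_commands`) on a Group subclass, which is what a `_LazyGroup` typically does.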
…-point discovery, and dataset CLI

- Agent loader: persistent registry (~/.rllm/agents.json), auto-instantiation of classes, entry-point discovery (rllm.agents group), register/unregister/list_agents API
- Evaluator loader: persistent registry (~/.rllm/evaluators.json), entry-point discovery (rllm.evaluators group), register/unregister/list_evaluators API
- Dataset CLI: add `rllm dataset register` command (JSON, JSONL, CSV, Parquet)
- Agent CLI: show a Source column (built-in/registered/plugin) in `rllm agent list`
- Examples: agent_plugin (pyproject.toml entry points) and agent_python_api (Python API registration)
- Tests: 62 new tests for the agent loader, evaluator loader, and dataset commands
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the PIL→PNG file pipeline with direct raw-byte extraction from HuggingFace using Image(decode=False). Store binary image columns in Arrow IPC files instead of writing thousands of individual PNGs to disk. At eval time, base64-encode directly from in-memory bytes instead of reading files from disk.
- Add _disable_image_decoding() for Image and Sequence(Image()) columns
- Add _flatten_image_dicts() to extract bytes from HF image dicts
- Add Arrow IPC save/load to DatasetRegistry with format-aware dispatch
- Update VLM agents with _detect_mime_type() and _image_to_data_uri() to handle both bytes (new) and str paths (legacy) transparently
- Update the dataset inspect CLI to display <N bytes (image)> for binary data
- Clean up the legacy images/ directory on dataset removal
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port upstream background worker (#419) and merge with our eval-specific additions (session_type, log_eval_result, error handling improvements).
feat(cli): port non-blocking UILogger, simplify --ui flag, and support eval UI logging
Replace the LiteLLM proxy with rllm-model-gateway for AgentFlow-based training. Agents write standard OpenAI client code; the gateway transparently captures token IDs and logprobs for training via post-hoc enrichment.
New files:
- GatewayManager: manages the gateway lifecycle (thread/process) and worker setup
- AgentFlowEngine: runs AgentFlows in parallel with gateway trace capture
- trace_converter: converts gateway TraceRecord → training Step
Key changes:
- CLI passes agent_flow + evaluator directly to AgentTrainer
- UnifiedTrainer routes to AgentFlowEngine when agent_flow + evaluator are provided
- TinkerBackendServer outputs logprobs in the vLLM standard format for the gateway
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- New `rllm login` command that validates the API key against the UI backend
- Persists the key in ~/.rllm/config.json with 0o600 permissions
- Shows login status if already logged in; --relogin to force
- Handles a pasted RLLM_API_KEY=... prefix gracefully
- UILogger falls back to the stored key (env var > config > None)
- Fix raw string for the banner to suppress a SyntaxWarning
…training

- Wrap create_session and get_traces in run_in_executor to prevent blocking the asyncio event loop in AgentFlowEngine
- Preserve per-trajectory rewards from multi-agent evaluators
- Widen the Step.chat_completions type to dict[str, Any] for VLM content blocks
- Reduce the default train_batch_size from 64 to 32 to match the CLI default
- Downgrade per-task agent flow logs from INFO to DEBUG
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(cli): add `rllm login` command for UI authentication
Add optional async `arun` method to AgentFlow. All three execution paths (eval runner, gateway training engine, tinker CLI) now prefer `arun` when available, falling back to sync `run` in a thread executor. Centralizes dispatch logic in `run_agent_flow()` helper. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
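The centralized dispatch described here can be sketched as follows; the helper name matches the commit message, but its exact signature and the flow classes are illustrative.

```python
import asyncio
import inspect

async def run_agent_flow(flow, task):
    """Prefer the flow's async `arun` when it defines one; otherwise run the
    sync `run` in a thread so it never blocks the event loop. (Sketch of the
    dispatch pattern; the real helper may differ.)"""
    arun = getattr(flow, "arun", None)
    if arun is not None and inspect.iscoroutinefunction(arun):
        return await arun(task)
    return await asyncio.to_thread(flow.run, task)

class SyncFlow:
    def run(self, task):
        return f"sync:{task}"

class AsyncFlow:
    def run(self, task):
        return f"sync:{task}"

    async def arun(self, task):
        return f"async:{task}"

sync_result = asyncio.run(run_agent_flow(SyncFlow(), "t1"))
async_result = asyncio.run(run_agent_flow(AsyncFlow(), "t2"))
```

Putting the check in one helper means the eval runner, the gateway training engine, and the tinker CLI all get identical fallback behavior for free.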
…ntegrations
Add SDK tracing integrations and plugin agent packages for three popular
agent frameworks, enabling `rllm eval <benchmark> --agent {smolagents,strands,langgraph}`.
SDK integrations (rllm/sdk/integrations/):
- smolagents.py: RLLMSmolAgentsTracer — model wrapper that intercepts __call__
- langgraph.py: RLLMTrajectoryCallbackHandler — LangChain BaseCallbackHandler
Plugin packages (plugins/):
- smolagents_agent: ToolCallingAgent + OpenAIServerModel with VLM image support
- strands_agent: Strands Agent + OpenAIModel with VLM ContentBlock support
- langgraph_agent: LangGraph StateGraph + ChatOpenAI with native multimodal support
All plugins follow the react_agent convention, adapt to any benchmark via
TaskSpec, and are discovered via rllm.agents entry points.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The agent_run_func training path is redundant now that the CLI uses AgentFlow + Evaluator natively. This removes Path 2 (SdkWorkflowFactory) from UnifiedTrainer, AgentTrainer, and all launcher classes, along with the make_agent_run_func() bridge function and its tests. The underlying rllm/sdk/ modules are kept intact for standalone SDK usage. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove unused variables, chain ImportError exceptions with `from err`, use `X | Y` union syntax in isinstance calls, rename ambiguous variable, and remove unused imports. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace the plain ASCII banner with block-letter Unicode art using a cyan-to-blue gradient, wrapped in a Rich Panel
- Redesign `rllm dataset list` to show the full catalog by default, with datasets grouped by category, emoji icons, and color-coded status indicators (● pulled, ○ available, ◆ local)
- Add a --local flag to show only pulled datasets (replaces the old default)
- Use Rich Tables with rounded borders and a consistent color theme
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hardcoded provider dicts with a ProviderInfo registry supporting 14 providers: openai, anthropic, gemini, openrouter, deepseek, together, fireworks, groq, cerebras, xai, zhipu, kimi, minimax, and custom OpenAI-compatible endpoints. Add base_url config field for custom endpoints, use display labels in CLI menus, and route through correct LiteLLM prefixes. Custom provider bypasses LiteLLM proxy entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change --ui from a boolean flag to --ui/--no-ui with auto-detection. When neither is passed, UI logging is automatically enabled if the user has a stored ui_api_key (via `rllm login`) or RLLM_API_KEY env var. Users can explicitly disable with --no-ui. Applied to both eval and train commands. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove minimal, langchain, openai-agents, crewai, and google-adk templates from `rllm init`. The react template is now the sole option and is selected automatically without prompting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move agent plugins from plugins/ to agenthub/ for clearer naming. Archive legacy examples (geo3k_tinker, ocr) and remove outdated CLI examples (agent_plugin, agent_python_api). Update imports, docs, and pyproject.toml references accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nudge

- Add progressive episode uploads during eval via an on_episode_complete callback with a thread-safe buffer (batch size 50), instead of sending everything after the run
- Add POST /api/episodes/batch support in UILogger with fallback to individual POSTs
- Print a clickable wandb-style session URL on UILogger init ("rllm-ui: View run at ...") with local dev detection (localhost → frontend port 5173)
- Add a registration nudge in the eval/train CLI when the user is not logged in
- Add a nudge in Tracking.__init__ when 'ui' is not in the logger list
Same pattern as episodes — POST /api/trajectory-groups/batch with fallback to individual POSTs for backward compatibility.
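The batch-with-fallback pattern used for both episodes and trajectory groups can be sketched transport-agnostically; the two callables below stand in for the batch and single-item POST endpoints, which are injected so the sketch stays free of HTTP specifics.

```python
def upload_in_batches(items, post_batch, post_one, batch_size=50):
    """Try the batch endpoint first; if it fails (e.g. an older backend
    without /batch routes), fall back to individual uploads so no item
    is silently dropped on a backend that predates batching."""
    uploaded = 0
    for i in range(0, len(items), batch_size):
        chunk = items[i:i + batch_size]
        try:
            post_batch(chunk)
            uploaded += len(chunk)
        except Exception:
            for item in chunk:
                try:
                    post_one(item)
                    uploaded += 1
                except Exception:
                    pass  # item failed both paths; logging elided in this sketch
    return uploaded

# Usage: a backend with no batch endpoint still receives every item.
received = []

def no_batch(chunk):
    raise RuntimeError("404: batch endpoint not found")

count = upload_in_batches(list(range(7)), no_batch, received.append, batch_size=3)
```

Falling back per-chunk (rather than aborting the whole upload) is what keeps the new client backward compatible with old servers.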
Action and SWEAction are now Pydantic BaseModels which require keyword arguments. Updated all positional argument instantiations across agents and workflows. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat(ui): progressive batched uploads, session URL, and registration nudge
Summary
This is a major release that introduces a full-featured CLI (rllm eval, rllm train, rllm init, rllm login), a comprehensive eval framework with 40+ benchmarks, a model gateway for RL agent training, an agent/evaluator plugin system, and sandboxed execution support. It also slims core dependencies, deprecates legacy APIs, and adds SDK integrations for popular agent frameworks.

Key Changes

CLI (rllm/experimental/cli/)
- rllm eval <benchmark> --model <name> — run evaluations against any supported benchmark
- rllm train <benchmark> --model <name> — train with session-aware proxy tracing via the tinker backend
- rllm init — scaffold new agent projects from templates (ReAct, OpenAI Agents, ADK, LangChain, CrewAI)
- rllm login — authenticate with the rLLM UI
- rllm dataset list/pull — browse and pull datasets from the HuggingFace catalog
- rllm model setup — configure model providers with per-provider API keys
- --ui flag on eval/train for auto-enabling UI logging when logged in

AgentFlow Abstraction for Eval (rllm/experimental/eval/)
- AgentFlow/Evaluator protocol abstractions with async support (arun)
- EvalRunner with thread pool concurrency and async AgentFlow support
- Task spec (TaskSpec) and eval config management

AgentHub Plugins (agenthub/)
- Plugin agent packages discovered via entry points (example: examples/cli/agent_plugin/)

Training & RL Improvements
- AgentFlow + Workflow training path via the model gateway
- ChatTemplateParser in the SFT trainer and OpenAI engine

Packaging & Housekeeping
- Optional dependency extras ([verl], [tinker], [sdk], [dev])
- StepView/TrajectoryView aliases removed — use Step/Trajectory everywhere
- Legacy examples moved to examples/archive/
- Legacy integrations (rllm/integrations/) deprecated and removed

Test Plan
- pytest — comprehensive test suite added for CLI commands, eval framework, data pipeline, model gateway, and SDK workflows
- ruff check . — linting passes
- rllm eval gsm8k --model gpt-4o-mini runs end-to-end
- rllm init scaffolds a project correctly
- rllm dataset list displays the catalog
- uv pip install -e . installs with slim core deps