Statistical analysis methods for comparing prompt and model performance in LLM evaluations.
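For a taste of what such a statistical comparison looks like, here is a minimal sketch of a paired bootstrap test over per-example scores for two prompts; the function name, toy score arrays, and resample count are illustrative assumptions, not this repo's actual method.

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often prompt A beats prompt B under resampling.

    scores_a, scores_b: per-example metric values (same eval set, same order).
    Returns the observed mean difference and the fraction of bootstrap
    resamples in which A's mean exceeds B's.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    n = len(a)
    idx = rng.integers(0, n, size=(n_resamples, n))  # resample example indices
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)  # paired: same indices for both
    return a.mean() - b.mean(), float((diffs > 0).mean())

# Hypothetical per-example accuracies for two prompt variants on one eval set.
delta, p_a_wins = paired_bootstrap([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 0])
print(f"mean diff = {delta:.3f}, P(A > B) = {p_a_wins:.3f}")
```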
Another day, another Awesome List repo: a comprehensive list of ChainForge-related content.
Measure prompt and skill improvements with blind A/B comparison.
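The core of a blind A/B comparison is hiding which prompt produced which output until after the vote is recorded. A minimal sketch, assuming a simple coin-flip assignment (the helper name and labels are hypothetical):

```python
import random

def blind_pair(output_old, output_new, rng=random):
    """Randomly assign the two outputs to labels 'A' and 'B' so the rater
    cannot tell which prompt produced which. Returns the labeled pair plus
    a key for unblinding after the vote is recorded."""
    swapped = rng.random() < 0.5
    pair = {"A": output_new if swapped else output_old,
            "B": output_old if swapped else output_new}
    key = {"A": "new" if swapped else "old",
           "B": "old" if swapped else "new"}
    return pair, key

pair, key = blind_pair("old prompt output...", "new prompt output...")
vote = "A"  # whichever label the blinded rater prefers
print("rater preferred the", key[vote], "prompt")
```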
Official implementation of "GLaPE: Gold Label-agnostic Prompt Evaluation and Optimization for Large Language Models" (stay tuned; more updates to come).
The prompt engineering, prompt management, and prompt evaluation tool for Python
The prompt engineering, prompt management, and prompt evaluation tool for TypeScript, JavaScript, and NodeJS.
A project to take a suboptimal prompt from LangSmith, enhance it, resubmit it, and reevaluate the results. #LangSmith #PromptEngineer
pi extension for fixed-task-set eval runs and prompt/system comparisons with reproducible reports.
A simple prompt optimization experiment that tests three different algorithms.
A Streamlit web app that uses a Groq-powered LLM (Llama 3) to act as an impartial judge for evaluating and comparing two model outputs. Supports custom criteria, presets like creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, Groq API, and Streamlit.
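For reference, the LLM-as-judge pattern such an app centers on can be sketched with the groq Python client. The judge prompt, JSON schema, and model name below are assumptions for illustration, not the app's actual code:

```python
import json
import os
from groq import Groq  # pip install groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

JUDGE_PROMPT = """You are an impartial judge. Score each response from 1-10
on the given criteria, explain briefly, and pick a winner. Reply in JSON:
{"score_a": int, "score_b": int, "explanation": str, "winner": "A" or "B"}"""

def judge(question, answer_a, answer_b, criteria="helpfulness, accuracy"):
    # Model name is an assumption; substitute whichever Groq model you use.
    resp = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content":
                f"Criteria: {criteria}\nQuestion: {question}\n"
                f"Response A: {answer_a}\nResponse B: {answer_b}"},
        ],
    )
    # Assumes the model returns pure JSON; production code would validate.
    return json.loads(resp.choices[0].message.content)
```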
The prompt engineering, prompt management, and prompt evaluation tool for Ruby.
Building a framework to run prompt evaluation tasks.
Benchmark and continuously improve your Superwhisper custom modes against your own voice recording history.
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
A few prompts stored in a repo for running controlled experiments that compare and benchmark different LLMs on defined use cases.
Vitest-for-prompts. File-based prompt eval runner with snapshots and assertions — no SaaS, no dashboard, CI-native. Built for agents.
Test prompt variants across LLM providers with LLM-as-judge evaluation.
Local-first LLM evaluation for Ollama: benchmark, compare, judge, battle, and export results.
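A minimal sketch of the benchmark/compare idea using the ollama Python client; the model names and dict-style response access are assumptions that may vary by client version:

```python
import time
import ollama  # pip install ollama; assumes a local Ollama server is running

PROMPT = "Summarize the plot of Hamlet in two sentences."

def compare(models, prompt=PROMPT):
    """Run the same prompt against several local models and report latency.
    Model names are placeholders; use whatever `ollama list` shows locally."""
    for model in models:
        start = time.perf_counter()
        resp = ollama.chat(model=model,
                           messages=[{"role": "user", "content": prompt}])
        elapsed = time.perf_counter() - start
        text = resp["message"]["content"]
        print(f"{model}: {elapsed:.1f}s, {len(text)} chars\n{text[:120]}...\n")

compare(["llama3", "mistral"])
```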
The prompt engineering, prompt management, and prompt evaluation tool for Java.
The prompt engineering, prompt management, and prompt evaluation tool for Kotlin.