Why LLM Evaluation Is Hard
Traditional software has clear pass/fail tests. LLM outputs are open-ended text: the same question can have many correct answers, and judging whether a response is "good enough" is subjective. Skipping evaluation because it is hard is worse, though. Without it, you ship blind and find out about problems from angry users.
This guide covers the evaluation approaches that work in practice, from public benchmarks to custom test suites you build for your specific use case.
Public Benchmarks: Useful for Model Selection
Benchmarks like MMLU, HumanEval, GSM8K, HellaSwag, and MT-Bench test models on standardized tasks. They are useful for comparing models against each other before you pick one for your project.
- MMLU: Multiple-choice questions across 57 subjects, from elementary math to professional law. Tests general knowledge.
- HumanEval: Code generation. The model writes Python functions, which are then run against unit tests.
- GSM8K: Grade school math word problems. Tests multi-step arithmetic reasoning.
- HellaSwag: Sentence completion. Tests commonsense reasoning about everyday situations.
- MT-Bench: Multi-turn conversation quality, scored by GPT-4 as a judge.
Benchmarks tell you which model is generally stronger. They do not tell you which model works best for your specific prompts and data. For that, you need custom evaluation.
Custom Evaluation: What Actually Matters
Build an eval set of 50 to 200 question-answer pairs drawn from your actual use case. Each pair should include the input (query + context), the expected answer, and scoring criteria.
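One way to represent such a pair in code is a small record type. The sketch below is only an illustration: the class name `EvalCase`, its field names, the `to_prompt` helper, and the sample data are all assumptions, not part of any framework.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One entry in a custom eval set (all names here are illustrative)."""
    query: str            # the user input
    context: str          # retrieved or supplied context; may be empty
    expected_answer: str  # ground-truth answer or reference text
    criteria: list[str] = field(default_factory=list)  # scoring rubric items


def to_prompt(case: EvalCase) -> str:
    """Join context and query into the text sent to the model."""
    return f"Context:\n{case.context}\n\nQuestion: {case.query}"


# A made-up example case; real cases should come from your own use case.
eval_set = [
    EvalCase(
        query="What is the refund window?",
        context="Refunds are accepted within 30 days of purchase.",
        expected_answer="30 days",
        criteria=["states 30 days", "does not invent exceptions"],
    ),
]
```

Keeping the scoring criteria next to each case makes it easier to review why a case passed or failed later.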
Metrics to track:
- Correctness: Does the answer contain the right information? For factual tasks, compare against a ground truth answer.
- Hallucination rate: Does the model state things that are not supported by the provided context? For RAG systems, this is usually the most important metric.
- Format compliance: Does the output match the expected structure (JSON, bullet points, specific length)?
- Latency: How long does the response take? Latency varies by model, prompt length, and output length.
- Cost per query: Token usage multiplied by per-token price. Providers often price input and output tokens differently, so count them separately.
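Several of these metrics can be computed with a few lines of plain Python. The following is a minimal sketch under stated assumptions: the function names are hypothetical, correctness is reduced to a case-insensitive substring match (too crude for many tasks), format compliance is checked for JSON only, and the prices in the usage note are placeholders rather than real rates. Hallucination rate is left out because it generally needs a judge model or a dedicated framework.

```python
import json
import time


def is_correct(response: str, expected: str) -> bool:
    """Crude correctness check: the expected answer appears in the response."""
    return expected.strip().lower() in response.lower()


def is_valid_json(response: str) -> bool:
    """Format compliance for tasks whose output must parse as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def query_cost(input_tokens: int, output_tokens: int,
               input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one query, assuming separate prices for input and output tokens."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

For example, with placeholder prices of 0.01 per 1,000 input tokens and 0.03 per 1,000 output tokens, `query_cost(1000, 500, 0.01, 0.03)` comes to about 0.025.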
LLM-as-Judge
For tasks where there is no single correct answer (creative writing, summarization, open-ended advice), you can use a stronger model to judge the output. A model such as GPT-4 or Claude 3.5 Sonnet evaluates the response on criteria you define: relevance, accuracy, helpfulness, and tone.
This approach is imperfect. The judge model has its own biases (it tends to prefer verbose answers and its own writing style). But it scales better than human evaluation and catches more issues than simple string matching.
To reduce bias, use pairwise comparison. Show the judge two responses to the same query, without labels revealing which model or prompt produced each, and ask which is better. This tends to be more reliable than absolute scoring.
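A pairwise comparison can be sketched as below. The judge is passed in as a plain function, so no real API is called; the `Judge` signature, the `"1"`/`"2"` return convention, and the stand-in judge `prefers_longer` are all assumptions made for illustration. The sketch also judges each pair twice with the order swapped, which is a common mitigation for judges that favor whichever response appears first. That step goes beyond what the text above describes.

```python
from typing import Callable

# Hypothetical judge interface: given (query, response_1, response_2),
# return "1" if the first response shown is better, otherwise "2".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, query: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging the pair in both orders."""
    first = judge(query, a, b)   # a is shown first
    second = judge(query, b, a)  # b is shown first
    a_wins_first = first == "1"
    a_wins_second = second == "2"
    if a_wins_first and a_wins_second:
        return "a"
    if not a_wins_first and not a_wins_second:
        return "b"
    return "tie"  # the verdict flipped with position, so treat it as inconclusive


def prefers_longer(query: str, r1: str, r2: str) -> str:
    """Stand-in judge for demonstration only: picks the longer response."""
    return "1" if len(r1) >= len(r2) else "2"


verdict = pairwise_verdict(prefers_longer, "q", "short", "a much longer answer")
```

In a real setup, the judge function would wrap a call to the judge model and parse its answer. The stand-in here deliberately mimics the verbosity bias mentioned above.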
Testing Frameworks
Several frameworks automate LLM evaluation:
- Ragas: Built for RAG evaluation. Measures faithfulness (are the claims in the answer supported by the retrieved context?), answer relevance, and context recall.
- DeepEval: General-purpose LLM testing with pytest integration. Define test cases and metrics in Python.
- LangSmith: LangChain's platform for tracing, evaluating, and monitoring LLM applications. Good if you already use LangChain.
- Braintrust: Eval platform with support for custom metrics, LLM-as-judge, and A/B testing.
Building Your Eval Pipeline
- Collect 50 to 200 representative queries from production logs or expected use cases.
- Write expected answers or scoring rubrics for each query.
- Run your model against the eval set after every prompt or model change.
- Track metrics over time. A dashboard showing correctness, hallucination rate, and latency per prompt version is worth the setup time.
- Block deploys that degrade eval scores. Treat prompt changes like code changes with CI/CD gates.
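The deploy gate in the last step can be as simple as a script that compares the current run to a stored baseline. This is a minimal sketch, assuming metrics are rates between 0 and 1 stored in plain dictionaries. The function names, the metric names, the 0.02 tolerance, and the sample numbers are all hypothetical.

```python
def find_regressions(baseline: dict, current: dict, tolerance: float = 0.02,
                     lower_is_better=("hallucination_rate",)) -> list[str]:
    """List the metrics that got worse than the baseline by more than tolerance."""
    failures = []
    for metric, base in baseline.items():
        if metric not in current:
            failures.append(f"{metric}: missing from current run")
            continue
        new = current[metric]
        if metric in lower_is_better:
            worse = new > base + tolerance
        else:
            worse = new < base - tolerance
        if worse:
            failures.append(f"{metric}: {base:.3f} -> {new:.3f}")
    return failures


def gate(baseline: dict, current: dict) -> int:
    """Return a process exit code: 0 to allow the deploy, 1 to block it."""
    failures = find_regressions(baseline, current)
    for line in failures:
        print("REGRESSION", line)
    return 1 if failures else 0


# Made-up numbers: correctness dropped by 6 points, so the gate blocks.
baseline = {"correctness": 0.90, "hallucination_rate": 0.04}
current = {"correctness": 0.84, "hallucination_rate": 0.05}
exit_code = gate(baseline, current)
```

In a CI job, the script would end with `sys.exit(exit_code)` so that a nonzero code fails the pipeline. The tolerance keeps small run-to-run noise from blocking every change.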
Frequently Asked Questions
How many eval examples do I need?
Start with 50. That is enough to catch major regressions. Grow to 200+ as you discover edge cases in production. Quality matters more than count. One well-crafted edge case test is worth ten generic ones.
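A rough calculation shows why 50 examples catches major regressions but not small ones. The sketch below uses the normal approximation for a pass rate, which is only approximate at small sample sizes. The function name and the 80% pass rate are assumptions chosen for illustration.

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a pass rate over n cases."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


# Compare the uncertainty at the two eval set sizes discussed above.
margins = {n: round(margin_of_error(0.8, n), 3) for n in (50, 200)}
print(margins)  # {50: 0.111, 200: 0.055}
```

At an 80% pass rate, 50 examples gives a margin of roughly 11 percentage points, so only drops of about that size stand out from noise. Going to 200 examples roughly halves the margin.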
Can I automate evaluation completely?
Partly. Automated metrics catch format errors, many hallucinations, and obvious regressions. But some quality dimensions (tone, helpfulness, user satisfaction) still benefit from periodic human review.