Evaluation

Online or Offline Evaluation

Offline Evaluation: Evaluating the agent in a controlled setting, typically on fixed test datasets rather than live user queries
- Automatic Benchmarks
- Custom Benchmarks
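
As a rough illustration of an offline custom benchmark, the sketch below scores an agent against a fixed test set using exact-match accuracy. The names `exact_match_accuracy`, `toy_agent`, and `TEST_SET` are hypothetical, and exact match is only one possible metric, chosen here because it is simple to check.

```python
from typing import Callable, Iterable


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    agent: Callable[[str], str],
    dataset: Iterable[tuple[str, str]],
) -> float:
    """Return the fraction of (question, reference) pairs the agent answers exactly."""
    total = 0
    correct = 0
    for question, reference in dataset:
        total += 1
        if normalize(agent(question)) == normalize(reference):
            correct += 1
    # An empty dataset yields 0.0 rather than a division error.
    return correct / total if total else 0.0


# Hypothetical test set: (question, reference answer) pairs.
TEST_SET = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),
]


def toy_agent(question: str) -> str:
    """Stand-in for a real agent; it knows two of the three answers."""
    answers = {"What is 2 + 2?": "4", "Capital of France?": "paris "}
    return answers.get(question, "unknown")


score = exact_match_accuracy(toy_agent, TEST_SET)
print(f"exact-match accuracy: {score:.2f}")
```

Because the dataset is fixed, the same run can be repeated after every change to the agent, which is what makes offline results comparable over time.
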
Online Evaluation: This refers to evaluating the agent in a live, real-world environment
- Success rates
- User satisfaction scores
- LLM-as-a-Judge
  - Set up a separate LLM call that scores each output in real time for correctness, toxicity, style, or any other criterion you care about
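
The LLM-as-a-Judge idea above can be sketched as follows. This is a minimal sketch under stated assumptions: `call_llm` is a hypothetical function that sends a prompt to whichever model you use and returns its text reply, and the prompt wording, 1-to-5 scale, and JSON reply format are illustrative choices, not a standard.

```python
import json
from typing import Callable

# Illustrative prompt; the wording and the 1-5 scale are assumptions.
JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Answer: {answer}
Rate the answer on each criterion from 1 (worst) to 5 (best).
Criteria: {criteria}
Reply with a JSON object mapping each criterion to an integer score."""


def judge(
    question: str,
    answer: str,
    criteria: list[str],
    call_llm: Callable[[str], str],
) -> dict[str, int]:
    """Ask a judge model to score an answer; return only well-formed scores."""
    prompt = JUDGE_PROMPT.format(
        question=question, answer=answer, criteria=", ".join(criteria)
    )
    raw = call_llm(prompt)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # The judge did not reply with valid JSON.
    if not isinstance(parsed, dict):
        return {}
    scores = {}
    for name in criteria:
        value = parsed.get(name)
        # Keep only integer scores inside the requested range.
        if isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 5:
            scores[name] = value
    return scores


def fake_judge_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, so the sketch runs offline."""
    return '{"correctness": 5, "non_toxicity": 5, "style": 4}'


scores = judge(
    "What is the capital of France?",
    "Paris.",
    ["correctness", "non_toxicity", "style"],
    fake_judge_llm,
)
print(scores)
```

Validating the judge's reply matters in a live setting: a malformed or out-of-range score is dropped here instead of being logged as a real measurement.
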

General Knowledge Benchmarks
- MMLU (Massive Multitask Language Understanding) tests knowledge across 57 subjects
- HLE (Humanity’s Last Exam) consists of 2,500 questions across dozens of subjects, including mathematics, the humanities, and the natural sciences
  - It was created because modern LLMs now score above 90% on MMLU, which leaves little room to tell strong models apart
Reasoning Benchmarks
- BBH (BIG-Bench Hard) for logical thinking and planning
- GSM8K for grade school math problem-solving
Language Understanding
- HELM (Holistic Evaluation of Language Models) for commonsense, world knowledge, and reasoning
Domain-Specific Benchmarks
- MATH for mathematical reasoning, with 12,500 problems
- HumanEval for Python coding, with 164 problems
- AlpacaEval is an automated evaluation framework for assessing the quality of instruction-following models
Alternatives
- LLM-as-a-Judge: Using one LLM to evaluate another’s outputs
- Evaluation Arenas: Users engage in anonymous “battles” between two LLMs, asking questions and voting on which model provides better responses
- Custom Benchmarks
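
For evaluation arenas, the pairwise votes have to be turned into a ranking somehow. One common way is an Elo-style rating, sketched below. The notes above do not say which rating scheme an arena uses, so treat the Elo choice, the `k` factor, the starting rating of 1000, and the model names as assumptions for illustration.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rate(battles, k: float = 32.0, initial: float = 1000.0) -> dict:
    """Update ratings from (model_a, model_b, winner) votes.

    winner is "a", "b", or "tie".
    """
    ratings = defaultdict(lambda: initial)
    outcome = {"a": 1.0, "b": 0.0, "tie": 0.5}
    for model_a, model_b, winner in battles:
        r_a, r_b = ratings[model_a], ratings[model_b]
        e_a = expected_score(r_a, r_b)
        s_a = outcome[winner]
        # The winner gains what the loser drops, so total rating is conserved.
        ratings[model_a] = r_a + k * (s_a - e_a)
        ratings[model_b] = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)


# Hypothetical votes from three anonymous battles.
battles = [
    ("model_1", "model_2", "a"),
    ("model_1", "model_3", "a"),
    ("model_2", "model_3", "tie"),
]
ratings = rate(battles)
for name, value in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.1f}")
```

Sequential Elo updates depend on the order of the votes, so with real data the vote log is usually shuffled or fitted in one pass with a different model.
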