Evaluation

Online or Offline Evaluation

  • Offline Evaluation: This involves evaluating the agent in a controlled setting, typically using test datasets rather than live user queries
    • Automatic Benchmarks
    • Custom Benchmarks
  • Online Evaluation: This refers to evaluating the agent in a live, real-world environment
    • success rates
    • user satisfaction scores
    • LLM-as-a-Judge
      • Set up a separate LLM call that scores each output in real time for correctness, toxicity, style, or any other criteria you care about
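
The LLM-as-a-Judge idea can be sketched roughly as follows. This is a minimal illustration, not a specific library's API: the prompt wording, the 1 to 5 scale, the function name judge_output, and the call_llm parameter (any function that takes a prompt string and returns the model's text reply) are all assumptions made for the example.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; the wording and 1-5 scale are assumptions.
JUDGE_PROMPT = (
    "You are grading an AI agent's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer for {criterion} on a scale of 1 to 5. "
    "Reply with the number only."
)


def judge_output(
    question: str,
    answer: str,
    criterion: str,
    call_llm: Callable[[str], str],
) -> Optional[int]:
    """Ask a separate judge model to score one output on one criterion.

    call_llm is a placeholder for whatever client sends a prompt to the
    judge model and returns its reply as text.
    """
    prompt = JUDGE_PROMPT.format(
        question=question, answer=answer, criterion=criterion
    )
    reply = call_llm(prompt)
    # Accept only a standalone digit from 1 to 5; anything else is unparseable.
    match = re.search(r"\b[1-5]\b", reply)
    return int(match.group()) if match else None


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "4"


score = judge_output(
    "What is the capital of France?", "Paris", "correctness", fake_judge
)
print(score)
```

In a live setting, fake_judge would be replaced by a real model call, and unparseable replies (returned here as None) would typically be logged or retried rather than counted as a score.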

Benchmarks

  • General Knowledge Benchmarks
    • MMLU (Massive Multitask Language Understanding) tests knowledge across 57 subjects
    • HLE (Humanity’s Last Exam) consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences
      • It was created because modern LLMs have been able to score more than 90% on MMLU
  • Reasoning Benchmarks
    • BBH (BIG-Bench Hard) for logical thinking and planning
    • GSM8K for grade school math problem-solving
  • Language Understanding
    • HELM (Holistic Evaluation of Language Models) for commonsense, world knowledge, and reasoning
  • Domain-Specific Benchmarks
    • MATH benchmark for mathematical reasoning using 12,500 problems
    • HumanEval benchmark for Python coding using 164 problems
    • AlpacaEval is an automated evaluation framework for assessing the quality of instruction-following models
  • Alternatives
    • LLM-as-a-Judge: Using one LLM to evaluate another’s outputs
    • Evaluation Arenas: Users engage in anonymous “battles” between two LLMs, asking questions and voting on which model provides better responses
    • Custom Benchmark
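
The notes above do not say how arena votes become a ranking. One common approach is an Elo-style rating, sketched below under that assumption. The function names, the model names, the starting rating of 1000, and the K-factor of 32 are all illustrative choices, not values taken from any particular arena.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    ratings: Dict[str, float],
    winner: str,
    loser: str,
    k: float = 32.0,         # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating for unseen models
) -> Dict[str, float]:
    """Apply one user vote (winner preferred over loser) to the ratings."""
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    # The winner gains more when the win was unexpected.
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta
    return ratings


# Hypothetical votes: each tuple is (preferred model, other model).
votes: List[Tuple[str, str]] = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

ratings: Dict[str, float] = {}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Print models from highest to lowest rating.
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update moves the same number of points from loser to winner, the total rating across models stays constant, so ratings are only meaningful relative to each other.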