
Eval

Agent evals:

  1. Start early.
  2. Source realistic tasks from failures.
  3. Define unambiguous, robust success criteria.
  4. Design graders thoughtfully and combine multiple types: code-based, model-based, and human (see the sketch after this list).
  5. Make sure the problems are hard enough for the model.
  6. Iterate on evaluations to improve the signal-to-noise ratio.
  7. Read the transcripts.
  8. Pick a framework: promptfoo, Harbor.
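
A minimal sketch of points 3-4 in practice: a code-based grader that checks unambiguous, machine-verifiable success criteria. The task shape and the `report.md` criterion are illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass
class GradeResult:
    passed: bool
    reason: str


def code_based_grader(final_answer: str, files_written: list[str]) -> GradeResult:
    """Deterministic grader: checks unambiguous, machine-verifiable criteria."""
    # Criterion 1: the agent must have produced the expected artifact.
    if "report.md" not in files_written:
        return GradeResult(False, "expected artifact report.md was not written")
    # Criterion 2: the final answer must point the user at that artifact.
    if "report.md" not in final_answer:
        return GradeResult(False, "final answer does not mention report.md")
    return GradeResult(True, "all success criteria met")
```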
Method๐Ÿ‘ Strengths๐Ÿ‘Ž Weaknesses
Human EvaluationCaptures nuanced behaviorSubjective, time-consuming, expensive, difficult to scale
LLM-as-a-JudgeConsistent, scalable, efficientMay overlook intermediate steps, limited by LLM capabilities
Automated MetricsObjective, scalable, efficientMay not capture full capabilities
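
A hedged sketch of the LLM-as-a-Judge row: the judge prompt wording and the `call_model` callable are assumptions standing in for whatever model client is used; the point is to ask the judge to score intermediate steps and to return a parseable verdict.

```python
import json

JUDGE_PROMPT = """You are grading an agent transcript.
Task: {task}
Transcript: {transcript}
Score the run from 0 to 1. Penalize wrong or wasteful intermediate steps,
not only a wrong final answer. Reply with JSON: {{"score": <float>, "reason": "<short>"}}"""


def llm_judge(task: str, transcript: str, call_model) -> dict:
    """Model-based grader. `call_model` is any text-in/text-out LLM client (assumed)."""
    raw = call_model(JUDGE_PROMPT.format(task=task, transcript=transcript))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Judges fail too: surface unparseable output instead of silently scoring 0.
        return {"score": None, "reason": f"unparseable judge output: {raw[:200]}"}
```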

Trace

When building agents, the trace is the source of truth:

  • Debugging becomes trace analysis
  • Testing becomes eval-driven
  • You can't set breakpoints in the model's reasoning
  • Performance optimization targets change: task success rate, reasoning quality, tool-usage efficiency (see the logging sketch below)
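
A sketch of what that can look like, assuming each agent step is appended to a JSONL trace file (the field names `type` and `error` are illustrative): transcripts become greppable files, and metrics like tool-usage efficiency become a few lines of analysis.

```python
import json
import time


def log_step(path: str, step: dict) -> None:
    """Append one agent step (reasoning, tool call, tool result) as a JSONL record."""
    step["ts"] = time.time()
    with open(path, "a") as f:
        f.write(json.dumps(step) + "\n")


def tool_usage_efficiency(path: str) -> float:
    """Fraction of tool calls in one trace whose results were not errors."""
    with open(path) as f:
        steps = [json.loads(line) for line in f if line.strip()]
    tool_calls = [s for s in steps if s.get("type") == "tool_call"]
    if not tool_calls:
        return 1.0
    ok = sum(1 for s in tool_calls if not s.get("error"))
    return ok / len(tool_calls)
```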

Trajectory

The trajectory matters as much as the final response; common ways to score it (implemented in the sketch after this list):

  • Exact match: the trajectory perfectly mirrors the ideal solution.
  • In-order match: completes the expected trajectory in order, while accommodating extra, unpenalized actions.
  • Any-order match: includes all necessary actions, in any order.
  • Precision: what fraction of the tool calls made were relevant.
  • Recall: what fraction of the essential tool calls were actually made.
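
Those matching rules are straightforward to implement when a trajectory is represented as an ordered list of tool-call names (an assumed representation); a minimal sketch:

```python
from collections import Counter


def exact_match(actual: list[str], expected: list[str]) -> bool:
    """Actual trajectory perfectly mirrors the ideal one."""
    return actual == expected


def in_order_match(actual: list[str], expected: list[str]) -> bool:
    """Expected steps appear in order; extra actions are not penalized."""
    it = iter(actual)
    return all(step in it for step in expected)


def any_order_match(actual: list[str], expected: list[str]) -> bool:
    """All necessary actions appear, regardless of order."""
    return not (Counter(expected) - Counter(actual))


def precision_recall(actual: list[str], expected: list[str]) -> tuple[float, float]:
    """Precision: share of actual calls that were expected.
    Recall: share of expected calls that were actually made."""
    actual_set, expected_set = set(actual), set(expected)
    hit = actual_set & expected_set
    precision = len(hit) / len(actual_set) if actual_set else 1.0
    recall = len(hit) / len(expected_set) if expected_set else 1.0
    return precision, recall
```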


Benchmarks

When reading benchmark results:

  • Aggregate: Don’t obsess over a 1-2% lead on a single benchmark; look at a specific, comprehensive set of benchmarks for your domain.
  • Relative: Compare within the same model family or lab, e.g. how did the score change from v1 to v2?
  • Verify: The only benchmark that matters at the end of the day is your workload.