Observability

Traces and Spans

  • Traces: represent a complete agent task from start to finish
    • user query
  • Spans: individual steps within the trace
    • calling language model
    • retrieving data
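
The trace/span relationship above can be sketched with nothing but the standard library. This is a minimal illustration, not the API of any particular tracing tool: the `Trace` and `Span` classes and the `span()` helper are hypothetical names made up for this example, and the two steps are empty placeholders.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step inside a trace (hypothetical structure)."""
    name: str
    start: float
    end: float = 0.0

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Trace:
    """One complete agent task, holding its spans in order."""
    name: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name: str):
        # Record one step; the span is closed even if the step raises.
        s = Span(name=name, start=time.perf_counter())
        self.spans.append(s)
        try:
            yield s
        finally:
            s.end = time.perf_counter()


trace = Trace(name="user query")
with trace.span("retrieving data"):
    pass  # placeholder for a retrieval call
with trace.span("calling language model"):
    pass  # placeholder for a model call

span_names = [s.name for s in trace.spans]
for s in trace.spans:
    print(f"{trace.trace_id[:8]} {s.name}: {s.duration:.6f}s")
```

In practice a tracing library would also record parent/child links and attributes per span; the sketch only keeps names and timings.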

Metrics

  • Latency: How quickly does the agent respond?
  • Costs: What’s the expense per agent run?
  • Request Errors: How many requests did the agent fail to complete?
  • User Feedback: Direct user evaluations provide valuable insights
    • explicit ratings (thumbs-up/down, 1-5 stars) or textual comments by user
  • Implicit User Feedback: User behaviors provide indirect feedback even without explicit ratings
    • immediate question rephrasing
    • repeated queries
    • clicking a retry button
  • Accuracy: How frequently does the agent produce correct or desirable outputs?
  • Automated Evaluation Metrics: for example, LLM-as-a-Judge
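
Several of the metrics above reduce to simple aggregates over logged runs. The following is a rough sketch under assumed inputs: the `AgentRun` record, its field names, and the sample values are all invented for illustration and do not come from a real logging schema.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class AgentRun:
    """One logged agent run (hypothetical schema)."""
    latency_s: float                  # wall-clock time for the run
    cost_usd: float                   # total spend for the run
    failed: bool                      # True if the request errored
    thumbs_up: Optional[bool] = None  # explicit feedback, if the user gave any


def summarize(runs):
    # Only runs with explicit feedback count toward the thumbs-up rate.
    rated = [r for r in runs if r.thumbs_up is not None]
    return {
        "avg_latency_s": mean(r.latency_s for r in runs),
        "avg_cost_usd": mean(r.cost_usd for r in runs),
        "error_rate": sum(r.failed for r in runs) / len(runs),
        "thumbs_up_rate": sum(r.thumbs_up for r in rated) / len(rated) if rated else None,
    }


# Made-up sample data, purely to exercise the function.
runs = [
    AgentRun(1.0, 0.02, False, True),
    AgentRun(3.0, 0.04, True, None),
    AgentRun(2.0, 0.03, False, False),
    AgentRun(2.0, 0.03, False, True),
]
summary = summarize(runs)
print(summary)
```

Implicit feedback (rephrasing, repeated queries, retries) and accuracy would need additional fields and, for accuracy, some labeled or judged outputs; they are left out to keep the sketch short.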

Performance Metrics

  • Time to First Token (TTFT)
    • How quickly can you get the first response?
    • This is crucial for user experience and is primarily affected by the prefill phase
  • Time Per Output Token (TPOT)
    • How fast can you generate subsequent tokens?
    • This determines the overall generation speed
  • Throughput
    • How many requests can you handle simultaneously?
    • This affects scaling and cost efficiency
  • VRAM Usage
    • How much GPU memory do you need?
    • This often becomes the primary constraint in real-world applications
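
TTFT and TPOT can both be read off a token stream by timestamping each token as it arrives. The sketch below assumes the model response is exposed as a Python iterable of tokens; `fake_token_stream` is a hypothetical stand-in that simulates a prefill delay followed by evenly spaced tokens, so the numbers it produces say nothing about any real model.

```python
import time
from typing import Iterable, Iterator


def fake_token_stream(n_tokens: int, prefill_s: float, per_token_s: float) -> Iterator[str]:
    # Stand-in for a streaming model response (hypothetical, for illustration).
    time.sleep(prefill_s)  # simulated prefill phase before the first token
    for i in range(n_tokens):
        if i > 0:
            time.sleep(per_token_s)  # simulated decode time per later token
        yield f"tok{i}"


def measure_stream(stream: Iterable[str]) -> dict:
    start = time.perf_counter()
    first = None
    last = start
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now  # arrival of the first token
        last = now
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    # TPOT averages the gaps between tokens after the first one.
    tpot = (last - first) / (count - 1) if count > 1 else 0.0
    tps = 1 / tpot if tpot > 0 else float("inf")
    return {"tokens": count, "ttft_s": ttft, "tpot_s": tpot, "tps": tps}


stats = measure_stream(fake_token_stream(n_tokens=5, prefill_s=0.05, per_token_s=0.01))
print(stats)
```

Throughput and VRAM usage are not covered here: they depend on the serving setup and hardware rather than on a single stream.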

Inference Speed

  • https://mikeveerman.github.io/tokenspeed/?rate=60&mode=text
  • Measured in TPS (tokens per second)
  • Human reading speed
    • 1 token in English ≈ 0.75 words
    • 200 - 300 words per minute = 3.3 - 5 words per second
    • ≈ 4.4 - 6.7 TPS
  • Human typing speed
    • 40 - 60 words per minute = 0.67 - 1 words per second
    • = 0.89 - 1.3 TPS
  • For chatbots
    • 10 - 20 TPS is good
  • For Agents or Coding tasks
    • 50+ TPS is good since you don’t need to read everything
  • MacBook M1 Pro, 16 GB
    • Qwen 2.5 (7b): 27 TPS
    • Gemma 3 (4b): 44 TPS
    • Gemma 4 (12b): 13 TPS
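
The reading and typing conversions above follow from dividing words per second by the rough figure of 0.75 words per token. A small sketch of that arithmetic, where the helper name `wpm_to_tps` is made up for this example:

```python
WORDS_PER_TOKEN = 0.75  # rough English average quoted in the notes


def wpm_to_tps(words_per_minute: float, words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a words-per-minute rate into tokens per second."""
    words_per_second = words_per_minute / 60
    return words_per_second / words_per_token


reading = (wpm_to_tps(200), wpm_to_tps(300))
typing = (wpm_to_tps(40), wpm_to_tps(60))

print(f"reading: {reading[0]:.1f} - {reading[1]:.1f} TPS")  # reading: 4.4 - 6.7 TPS
print(f"typing:  {typing[0]:.2f} - {typing[1]:.2f} TPS")    # typing:  0.89 - 1.33 TPS
```

The 0.75 ratio is only an approximation for English; other languages and tokenizers will shift these ranges.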