Observability
Traces and Spans
- Traces: represent a complete agent task from start to finish
- Spans: individual steps within the trace
  - calling a language model
  - retrieving data
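
A minimal sketch of how a trace and its spans relate, using only the Python standard library. The `Trace` and `Span` classes and the `run_agent_task` function are hypothetical names chosen for illustration, and the `time.sleep` calls stand in for real retrieval and model calls. A production setup would more likely rely on a dedicated tracing library.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed step inside a trace (hypothetical structure)."""
    name: str
    trace_id: str
    start: float = 0.0
    end: float = 0.0

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class Trace:
    """One complete agent task, holding the spans recorded along the way."""
    task: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name: str):
        # Time the enclosed block and attach the result to this trace.
        s = Span(name=name, trace_id=self.trace_id, start=time.perf_counter())
        try:
            yield s
        finally:
            s.end = time.perf_counter()
            self.spans.append(s)


def run_agent_task(question: str) -> Trace:
    trace = Trace(task=question)
    with trace.span("retrieve_data"):
        time.sleep(0.01)  # placeholder for a real data lookup
    with trace.span("call_language_model"):
        time.sleep(0.01)  # placeholder for a real model call
    return trace


if __name__ == "__main__":
    trace = run_agent_task("What is observability?")
    for s in trace.spans:
        print(f"{trace.trace_id[:8]} {s.name}: {s.duration * 1000:.1f} ms")
```

Each call to `run_agent_task` produces one trace whose `spans` list holds one timed entry per step, all sharing the same trace ID.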
Metrics
- Latency: How quickly does the agent respond?
- Costs: What’s the expense per agent run?
- Request Errors: How many requests did the agent fail to complete?
- User Feedback: Direct user evaluations provide valuable insights
  - explicit ratings (thumbs-up/down, 1-5 stars) or textual comments from users
- Implicit User Feedback: User behaviors provide indirect feedback even without explicit ratings
  - immediate question rephrasing
  - repeated queries
  - clicking a retry button
- Accuracy: How frequently does the agent produce correct or desirable outputs?
- Automated Evaluation Metrics: automated scoring of outputs, such as LLM-as-a-Judge
- Time to First Token (TTFT)
  - How quickly can you get the first response?
  - This is crucial for user experience and is primarily affected by the prefill phase
- Time Per Output Token (TPOT)
  - How fast can you generate subsequent tokens?
  - This determines the overall generation speed
- Throughput
  - How many requests can you handle simultaneously?
  - This affects scaling and cost efficiency
- VRAM Usage
  - How much GPU memory do you need?
  - This often becomes the primary constraint in real-world applications
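
TTFT and TPOT from the list above can be measured on any streaming response by timestamping tokens as they arrive. A rough sketch, assuming the model output is available as a Python iterable of tokens. `measure_stream` and `fake_stream` are hypothetical helpers, not part of any library, and the sleeps in `fake_stream` only simulate prefill and decode delays.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report TTFT and TPOT in seconds."""
    start = clock()
    first = None
    last = start
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        last = now
        count += 1
    if count == 0:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    # TPOT averages the gaps between tokens after the first one.
    tpot = (last - first) / (count - 1) if count > 1 else 0.0
    return {"tokens": count, "ttft_s": ttft, "tpot_s": tpot}


def fake_stream(n: int, prefill_s: float = 0.05,
                per_token_s: float = 0.01) -> Iterator[str]:
    """Simulated model: one prefill delay, then a fixed delay per token."""
    time.sleep(prefill_s)
    for i in range(n):
        if i > 0:
            time.sleep(per_token_s)
        yield f"tok{i}"


if __name__ == "__main__":
    m = measure_stream(fake_stream(20))
    print(f"TTFT: {m['ttft_s'] * 1000:.0f} ms, "
          f"TPOT: {m['tpot_s'] * 1000:.1f} ms, "
          f"decode speed: {1 / m['tpot_s']:.0f} TPS")
```

The `clock` parameter exists so the timing logic can be checked with a deterministic fake clock. The decode speed in TPS is simply the inverse of TPOT.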
Inference Speed
- Token speed simulator: https://mikeveerman.github.io/tokenspeed/?rate=60&mode=text
- Measured in TPS (tokens per second)
- Human reading speed
  - 1 token in English ≈ 0.75 words
  - 200 - 300 words per minute = 3.3 - 5 words per second
    - ≈ 4.4 - 6.7 TPS (roughly 4 - 7 TPS)
- Human typing speed
  - 40 - 60 words per minute = 0.67 - 1 words per second
    - ≈ 0.89 - 1.3 TPS
- For chatbots
- For agents or coding tasks
  - 50+ TPS is good since you don’t need to read everything
- MacBook M1 Pro, 16 GB
  - Qwen 2.5 (7b): 27 TPS
  - Gemma 3 (4b): 44 TPS
  - Gemma 4 (12b): 13 TPS
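
The reading and typing speed conversions in this section follow from dividing words per second by the 0.75 words-per-token rule of thumb. A small sketch of that arithmetic; `wpm_to_tps` is a hypothetical helper name, and the ratio is only an approximation for English text.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English, taken from the notes above


def wpm_to_tps(words_per_minute: float,
               words_per_token: float = WORDS_PER_TOKEN) -> float:
    """Convert a human speed in words per minute to tokens per second."""
    words_per_second = words_per_minute / 60
    return words_per_second / words_per_token


for label, (low, high) in {"reading": (200, 300), "typing": (40, 60)}.items():
    print(f"{label}: {wpm_to_tps(low):.2f} - {wpm_to_tps(high):.2f} TPS")
# reading: 4.44 - 6.67 TPS
# typing: 0.89 - 1.33 TPS
```

Comparing these ranges with the measured model speeds shows how much headroom a given model has over a human reader.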