Lessons:
Staying on task
Context management
Crystallizing good behavior
Debugging a real agent
# Current Plan
0. [x] Sort LLM spans by latency
1. [~] Identify bottlenecks ← CURRENT
2. [ ] Suggest improvements
Call todo_update(id=1, status="completed")
when you finish this task.
You must complete or mark as blocked all todos before finishing.
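A minimal sketch of how a plan block like the one above might be rendered and updated, to show the "staying on task" idea in code. Only the `[x]`, `[~]`, `[ ]` markers, the `← CURRENT` tag, and the `todo_update(id=..., status=...)` call shape come from the slide. The `Todo` class, `render_plan`, `can_finish`, the status names other than `"completed"`, and the `[!]` marker for blocked items are assumptions for illustration.

```python
from dataclasses import dataclass

# Markers seen on the slide, plus a hypothetical "[!]" for blocked items.
MARKERS = {
    "completed": "[x]",
    "in_progress": "[~]",
    "pending": "[ ]",
    "blocked": "[!]",  # assumption: the slide does not show a blocked marker
}


@dataclass
class Todo:
    id: int
    text: str
    status: str = "pending"


def render_plan(todos):
    """Render the plan text that is re-injected into the agent's context."""
    current = next((t for t in todos if t.status == "in_progress"), None)
    lines = ["# Current Plan"]
    for t in todos:
        suffix = " ← CURRENT" if t is current else ""
        lines.append(f"{t.id}. {MARKERS[t.status]} {t.text}{suffix}")
    if current is not None:
        lines.append(f'Call todo_update(id={current.id}, status="completed")')
        lines.append("when you finish this task.")
    lines.append("You must complete or mark as blocked all todos before finishing.")
    return "\n".join(lines)


def todo_update(todos, id, status):
    """Hypothetical tool handler: set the status of one todo by id."""
    if status not in MARKERS:
        raise ValueError(f"unknown status: {status}")
    for t in todos:
        if t.id == id:
            t.status = status
            return
    raise KeyError(id)


def can_finish(todos):
    """The agent may stop only when every todo is completed or blocked."""
    return all(t.status in ("completed", "blocked") for t in todos)
```

Re-rendering the plan after every `todo_update` call keeps the current step in front of the model, and a `can_finish` style check gives the harness a hard gate instead of relying on the model to remember the rule.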
Gets a human in the loop
DO NOT TRY TO COMPARE MORE THAN 2 EXPERIMENTS AT A TIME
jq '.experiments[0].rows[:5]'
jq '[.rows[] | select(.eval_score < 0.5)]'
jq '[.rows[].latency_ms] | add / length'
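For readers less familiar with jq, here is a rough Python equivalent of the three filters above. The field names (`experiments`, `rows`, `eval_score`, `latency_ms`) are taken from the filters themselves; the function names are made up for illustration, and the sketch assumes every row carries both fields.

```python
def first_rows(doc, n=5):
    # jq '.experiments[0].rows[:5]' : peek at the first few rows only
    return doc["experiments"][0]["rows"][:n]


def low_scoring(experiment, threshold=0.5):
    # jq '[.rows[] | select(.eval_score < 0.5)]' : keep only failing rows
    return [row for row in experiment["rows"] if row["eval_score"] < threshold]


def mean_latency_ms(experiment):
    # jq '[.rows[].latency_ms] | add / length' : average latency
    latencies = [row["latency_ms"] for row in experiment["rows"]]
    return sum(latencies) / len(latencies)
```

Each filter returns a small slice or a single number rather than the whole export, so the model can ask a narrow question without the full file landing in its context.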
grep_json pattern="error"
Does not scale
contains_any=[
["2000ms", "2.0 seconds", "two seconds"],
["OpenAIChat.invoke", "LLM span"],
]
LLM reads alyx-traces skill
→ pulls full session trace
→ identifies failing tool call, notes trace_id
LLM reads datadog-debug skill
→ searches backend spans for that trace
→ finds 500 error on GraphQL resolver
LLM reads gcloud-logs skill
→ finds OOMKill 2 minutes earlier
→ "here's your root cause"
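Going back to the `contains_any=[...]` check shown before the skill chain, one plausible reading is that the agent's answer must mention at least one phrase from every inner list. The sketch below implements that reading. The function signature, the case-insensitive matching, and the sample answers are assumptions, not the actual evaluator.

```python
# Phrase groups copied from the slide.
GROUPS = [
    ["2000ms", "2.0 seconds", "two seconds"],
    ["OpenAIChat.invoke", "LLM span"],
]


def contains_any(output, groups):
    """Pass if every group has at least one phrase present in the output.

    Assumption: matching is a case-insensitive substring test.
    """
    text = output.lower()
    return all(
        any(phrase.lower() in text for phrase in group)
        for group in groups
    )


# A correct answer phrased the expected way passes...
ok = contains_any("The LLM span OpenAIChat.invoke took 2.0 seconds", GROUPS)
# ...but a paraphrase outside the phrase lists fails under this reading.
missed = contains_any("The slowest model call took about 2s", GROUPS)
```

Under this reading every new paraphrase needs another hand-written entry in the lists, which is the kind of maintenance burden that makes string-matching checks hard to scale.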
not in polite suggestions
(again)
The big lessons: #1
The big lessons: #2
The big lessons: #3
The big lessons: #4
Follow me on BlueSky: 🦋 @seldo.com
These slides: slides.com/seldo/alyx-lessons-learned