Ship Real Agents
Hands-on evals for agentic applications
AIE Europe 2026-04-08

What are we talking about?
- What is an eval, and why you need them
- Setting up tracing with Phoenix
- Building an AI agent with the Claude Agent SDK
From data to evals
- Looking at your data: error analysis
- Code evals and built-in LLM evals
- Writing custom eval rubrics
- Meta-evaluation: testing your tests
From evals to experiments and beyond
- Datasets, experiments, and the improvement cycle
- A look at what comes next
Get set up
- Learner notebook: bit.ly/build-real-agents
- Claude API key: platform.claude.com/settings/keys
- Phoenix Cloud account: app.phoenix.arize.com/login/sign-up
- Phoenix hostname: app.phoenix.arize.com/s/yourusername
- Phoenix API key: app.phoenix.arize.com/s/yourusername/settings/general
These slides: slides.com/seldo/ship-real-agents

1. Notebook
2. Claude key
3. Phoenix cloud
What is an eval?
Traces are logs, evals are tests
- Traces = logs, for AI
- Evals = tests, for AI
Spans: the building blocks
- Each span = one step in execution
- LLM call, tool call, agent turn
- Records input, output, timing, token counts
The vibes problem
What you can't do without evals
- Can't detect regressions when you change a prompt
- Can't compare prompt versions objectively
- Can't know if a new model is actually better
- Can't run quality gates in CI
You can't switch models without evals
- New models drop every few months
- Without evals, switching = weeks of manual testing
- With evals, you know within hours
This is not theoretical
Everybody vibes until they can't any more
Two types of evals
- Code evals — deterministic, free, fast
- LLM-as-a-judge — semantic, flexible, powerful
LLM-as-a-judge evals
- A second LLM grades outputs against a rubric
- Handles meaning, not just strings
- Non-deterministic — needs calibration
LLM judges: tradeoffs
When to use which
- Code evals → format, structure, constraints
- LLM judge → accuracy, relevance, tone, faithfulness
- Human review → novel failures, calibrating judges
Why agents make this harder
- Single LLM call: input → output. Done.
- Agent: input → tool call → result → reasoning → another tool call → output
- Errors cascade. Each step can go wrong.
Multi-agent complexity
- Handoffs between agents add another layer
- Triage routing, specialist handling
- Each layer = new ways things can go wrong
Cascading failures
- Bad retrieval → bad reasoning → confidently wrong output
- The user sees a polished response and trusts it.
- This is worse than an obvious failure.
Creatively correct vs. wrong
- Sometimes the agent finds a better solution
- Your eval says "fail" — but the agent was right
- Evals need to distinguish creative from wrong
Another way to categorize evals
- Capability evals: can it do this new thing?
- Regression evals: can it do the stuff it used to do?
What an eval result looks like
Code eval: score: 1 · label: "valid"
LLM judge: score: 0 · label: "incorrect"
explanation: "The response fails to include..."
What a real explanation looks like
label: "incorrect"
explanation: "The response fails to include a budget
breakdown, which is a core requirement. The agent
provides destination info and local recommendations
but omits all cost estimates, making the plan
incomplete for a user who asked specifically
about budget travel to Tokyo."
Explanations make evals actionable
- Concrete failure → you know what to fix
- Same explanation across 50 traces = systematic problem
- Evals become a debugging tool, not just a scoreboard
The full loop

Setting up Phoenix
Step 1: Tracing
What is Phoenix?
- Open-source AI observability platform
- Captures traces from any AI framework
- Free cloud tier at app.phoenix.arize.com
Open the notebook
Install dependencies
pip install claude-agent-sdk \
    openinference-instrumentation-claude-agent-sdk \
    arize-phoenix anthropic
The Claude Agent SDK
- Anthropic's framework for building agents
- Tool use, web search, conversation context
- Auto-instrumented by OpenInference
What are we building?
- A financial analysis chatbot
- Two-turn agent: research then write
- Web search tool for real financial data
- Traces everything to Phoenix automatically
How the two turns connect
- Two-turn architecture: research → write
- Claude SDK saves context between turns
- Phoenix traces every step automatically
Set your API keys
import os
from google.colab import userdata
# In Colab, prefer userdata.get("ANTHROPIC_API_KEY") over pasting keys inline
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-XXXX"
os.environ["PHOENIX_API_KEY"] = "YYYY"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/s/yourusername/"
Register the tracer
from phoenix.otel import register
register(
project_name="aie-claude-financial-agent",
auto_instrument=True
)
Build the agent
A financial analysis chatbot
The agent setup
from claude_agent_sdk import (
    ClaudeSDKClient, ClaudeAgentOptions, AssistantMessage, TextBlock,
)
options = ClaudeAgentOptions(
    model="claude-haiku-4-5-20251001",
    allowed_tools=["WebSearch"],
    permission_mode="acceptEdits",
)
The two-turn pattern
RESEARCH_PROMPT = (
    "Research {tickers}. Focus on: {focus}. "
    "Use web search to find current financial data."
)
WRITE_PROMPT = (
    "Now write a concise financial report "
    "based on your research above."
)
The financial_report function
async def financial_report(tickers, focus):
    async with ClaudeSDKClient(options=options) as client:
        await client.query(RESEARCH_PROMPT.format(...))
        async for message in client.receive_response():
            ...  # research completes
        await client.query(WRITE_PROMPT)
        async for message in client.receive_response():
            ...  # collect the report
    return report
Wrapping it in a span
with tracer.start_as_current_span(
    "financial_report",
    attributes={
        "input.value": f"Research: {tickers}\nFocus: {focus}",
    },
) as span:
Run it!
result = await financial_report(
    "TSLA",
    "financial performance and growth outlook"
)
print(result)
Non-deterministic by design
Why this needs evals
Look at the trace
Traces reveal every decision
What lives in a span
- kind: AGENT
- attributes.input.value: "Research: TSLA\nFocus: financial..."
- attributes.output.value: "# TESLA, INC. (TSLA)..."
- start_time, end_time, duration
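One way to picture the fields above is as a small record. The sketch below is illustrative only: the `Span` class and its field names are assumptions modeled on the attributes listed on this slide, not Phoenix's or OpenTelemetry's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified stand-in for a trace span."""
    name: str
    kind: str                          # e.g. "AGENT", "LLM", "TOOL"
    start_time: datetime
    end_time: datetime
    attributes: dict = field(default_factory=dict)
    parent_id: Optional[str] = None    # None marks a root (top-level) span

    @property
    def duration(self) -> timedelta:
        return self.end_time - self.start_time


start = datetime(2026, 4, 8, 10, 0, 0)
root = Span(
    name="financial_report",
    kind="AGENT",
    start_time=start,
    end_time=start + timedelta(seconds=42),
    attributes={
        "input.value": "Research: TSLA\nFocus: financial performance",
        "output.value": "# TESLA, INC. (TSLA)...",
    },
)
print(root.kind, root.duration.total_seconds())
```

Duration is derived from the two timestamps rather than stored, which keeps the record from contradicting itself.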
Generate test data
Here's one I made earlier
Test queries
test_queries = [
    {"tickers": "AAPL", "focus": "revenue growth"},
    {"tickers": "NVDA", "focus": "AI chip demand"},
    {"tickers": "AMZN", "focus": "AWS performance"},
    {"tickers": "GOOGL", "focus": "advertising revenue"},
    {"tickers": "MSFT", "focus": "cloud computing segment"},
    {"tickers": "META", "focus": "metaverse investments"},
    {"tickers": "TSLA", "focus": "vehicle deliveries"},
    {"tickers": "RIVN", "focus": "financial health"},
    {"tickers": "AAPL, MSFT", "focus": "comparative analysis"},
    {"tickers": "NVDA", "focus": "competitive landscape"},
    {"tickers": "KO", "focus": "dividend yield"},
    {"tickers": "AMZN", "focus": "profitability trends"},
]
Covering the edge cases
Traces are loaded
Start with data, not metrics
Read your traces before you write evals
You need requirements first
- You can't say "it doesn't work" if you haven't defined what "works" looks like
- Write down explicit success criteria
Defining success is cross-functional work
Where to get test data
- Before production: synthetic data (LLM-generated queries)
- After production: real user queries from traces
- Diversity is critical — vary phrasing, intent, complexity
Don't forget the edge cases
Examine traces
When the data looks right but isn't
The "confidently wrong" problem
Categorize by root cause
- "The response was wrong" — not actionable. Ask why.
- Retrieval failure → better search
- Reasoning error → better prompts
- Hallucination → grounding checks
- Scope violation → explicit boundaries
- Formatting / tone → output guardrails
Categorize what we found
trace_categories = {
    "TSLA performance": "looks good",
    "NVDA competitive": "possible hallucination",
    "AAPL vs MSFT": "reasoning gap",
    "RIVN financial health": "looks good",
    ...
}
What the frequency table tells you
Category counts:
looks good               5  █████
possible hallucination   3  ███
reasoning gap            2  ██
unverifiable data        1  █
missing recommendation   1  █
Frequency x Severity = Priority
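A minimal sketch of how frequency and severity could combine into a ranking. The counts come from the frequency table; the severity weights are hypothetical values chosen for illustration, and in practice they would come from your own judgement of how much each failure hurts a user.

```python
# Counts taken from the frequency table ("looks good" is not a failure, so it is left out).
category_counts = {
    "possible hallucination": 3,
    "reasoning gap": 2,
    "unverifiable data": 1,
    "missing recommendation": 1,
}

# Hypothetical severity weights (1 = cosmetic, 5 = user could be misled).
severity = {
    "possible hallucination": 5,
    "reasoning gap": 3,
    "unverifiable data": 4,
    "missing recommendation": 2,
}

priority = {cat: count * severity[cat] for cat, count in category_counts.items()}
ranked = sorted(priority.items(), key=lambda item: item[1], reverse=True)

for cat, score in ranked:
    print(f"{cat:<24} {score}")
```

With these assumed weights, a rare but severe category such as "unverifiable data" outranks nothing here, yet it scores closer to "reasoning gap" than its raw count would suggest.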
The Swiss Cheese Model

Stack your eval layers
Evaluations
Step 4: Code evals
The simplest useful eval
Get your spans
from phoenix.client import Client
px_client = Client()
spans_df = px_client.spans.get_spans_dataframe(
    project_name="aie-claude-financial-agent"
)
# Root spans have no parent; copy so the rename doesn't act on a view
parent_spans = spans_df[spans_df["parent_id"].isna()].copy()
parent_spans.rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output"
}, inplace=True)
Ticker check eval
import re
from phoenix.evals import create_evaluator
@create_evaluator(name="mentions_ticker", kind="code")
def mentions_ticker(input, output):
    # Ticker-like tokens: 1-5 capital letters as a whole word
    tickers = re.findall(r"\b([A-Z]{1,5})\b", input)
    likely_tickers = [t for t in tickers
                      if len(t) >= 2 and t not in ("AI", "US", ...)]
    missing = [t for t in likely_tickers
               if t not in output.upper()]
    if not missing:
        return {"label": "pass", "score": 1}
    return {"label": "fail", "score": 0,
            "explanation": f"Missing: {', '.join(missing)}"}
How the ticker check works
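To see the logic in isolation, here is a plain-Python version of the same check with the Phoenix decorator removed, so it runs anywhere. The function name `check_tickers` and the `NOT_TICKERS` exclusion set are assumptions for this sketch; the slide's own exclusion list is elided.

```python
import re

# Hypothetical exclusion set; extend it with acronyms common in your queries.
NOT_TICKERS = {"AI", "US"}


def check_tickers(input_text: str, output_text: str) -> dict:
    """Pass if every ticker-like token in the input also appears in the output."""
    candidates = re.findall(r"\b([A-Z]{1,5})\b", input_text)
    likely = [t for t in candidates if len(t) >= 2 and t not in NOT_TICKERS]
    missing = [t for t in likely if t not in output_text.upper()]
    if not missing:
        return {"label": "pass", "score": 1}
    return {"label": "fail", "score": 0,
            "explanation": f"Missing: {', '.join(missing)}"}


ok = check_tickers("Research: AAPL, MSFT\nFocus: comparative analysis",
                   "Apple (AAPL) and Microsoft (MSFT) both grew revenue.")
bad = check_tickers("Research: NVDA\nFocus: AI chip demand",
                    "Nvidia continues to dominate the accelerator market.")
print(ok)
print(bad)
```

Note that the membership test is a substring match, so a short ticker such as KO would also "appear" inside a word like KOREA. A stricter variant could match on word boundaries in the output as well.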
Running the ticker check
Why this matters
Code evals aren't just toy examples
- Did the output parse as JSON?
- Is the response under 500 tokens?
- Does it avoid forbidden phrases?
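Each of the three checks above fits in a few lines of standard-library Python. This is a sketch under stated assumptions: the forbidden phrases and the 500-token budget are placeholders, and whitespace splitting is only a rough proxy for a real tokenizer.

```python
import json

# Hypothetical phrase list and budget, chosen only for illustration.
FORBIDDEN_PHRASES = ("guaranteed returns", "risk-free")
MAX_TOKENS = 500


def parses_as_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def under_token_budget(output: str, limit: int = MAX_TOKENS) -> bool:
    # Whitespace splitting approximates token count; swap in the model's tokenizer for accuracy.
    return len(output.split()) <= limit


def avoids_forbidden_phrases(output: str) -> bool:
    lowered = output.lower()
    return not any(phrase in lowered for phrase in FORBIDDEN_PHRASES)


report = '{"ticker": "KO", "summary": "Stable dividend payer."}'
print(parses_as_json(report), under_token_budget(report), avoids_forbidden_phrases(report))
```

All three are deterministic and cost nothing to run, which is why they make good first-line quality gates.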
Grade what was produced, not the path
- Don't check that the agent followed specific steps
- Agents find valid approaches you didn't anticipate
- Check the outcome, not the trajectory
Check the outcome, not the trajectory
Step 5: Built-in LLM evals
What code can't check
Three components
- A judge model (the LLM that grades)
- A prompt template (the rubric)
- Data (the examples being evaluated)
Phoenix ships built-in evals
- Correctness, Faithfulness, Conciseness
- Tool Selection, Tool Invocation
- Document Relevance, Refusal
- No prompt engineering required
Tool selection evals
Set up the judge
from phoenix.evals import LLM
from phoenix.evals.metrics import CorrectnessEvaluator
llm = LLM(model="claude-sonnet-4-6", provider="anthropic")
correctness_eval = CorrectnessEvaluator(llm=llm)
Run the evaluation
from phoenix.evals import evaluate_dataframe
from phoenix.trace import suppress_tracing
with suppress_tracing():
    results_df = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[correctness_eval]
    )
Log the results
from phoenix.evals.utils import to_annotation_dataframe
evaluations = to_annotation_dataframe(dataframe=results_df)
Client().spans.log_span_annotations_dataframe(
    dataframe=evaluations
)
Every score is zero
Faithfulness:
Giving the judge context
- Correctness: "Is this factually accurate?" (no context)
- Faithfulness: "Does this stick to the source material?" (with context)
- The difference: faithfulness gets the research the agent found
- Not always better; more appropriate for this use case
How faithfulness works
- FaithfulnessEvaluator needs three columns:
- input: the user's query
- output: the agent's response
- context: the source material to check against
Extract research context
child_spans = spans_df[spans_df["parent_id"].notna()]
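One possible way to turn child spans into a `context` column is to join each parent's child outputs into a single string. The sketch below uses plain dictionaries so it runs without a Phoenix project; the record shape (`span_id`, `parent_id`, `input`, `output`) and the placeholder values are assumptions, and the notebook may do this differently with DataFrame operations.

```python
from collections import defaultdict

# Hypothetical, flattened span records; real spans carry many more fields.
spans = [
    {"span_id": "a", "parent_id": None, "input": "Research: KO", "output": "final report"},
    {"span_id": "b", "parent_id": "a", "input": "web search 1", "output": "search result 1"},
    {"span_id": "c", "parent_id": "a", "input": "web search 2", "output": "search result 2"},
]

# Collect the outputs of direct children under each parent span.
context_by_parent = defaultdict(list)
for span in spans:
    if span["parent_id"] is not None:
        context_by_parent[span["parent_id"]].append(span["output"])

# One row per root span, with the three columns a faithfulness check needs.
rows = [
    {
        "input": span["input"],
        "output": span["output"],
        "context": "\n".join(context_by_parent[span["span_id"]]),
    }
    for span in spans
    if span["parent_id"] is None
]
print(rows[0]["context"])
```

This only gathers direct children. Deeper traces, where tool calls sit under intermediate spans, would need a walk down the whole tree.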
Run faithfulness
from phoenix.evals.metrics import FaithfulnessEvaluator
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
with suppress_tracing():
    faith_results = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[faithfulness_eval]
    )
Faithfulness results
Two built-in evals, two different signals
- Correctness: 0/13 passed — eval doesn't fit the use case
- Faithfulness: 13/13 passed!
- Choosing the right eval matters more than tuning it
What you see
Built-in evals are your starting point
Step 6: Custom eval rubrics
Five parts of a good eval prompt
- Define the judge's role
- Explicit CORRECT / INCORRECT criteria
- Present the data clearly
- Add labeled examples
- Constrain the output labels
Part 1: Define the role
"You are an expert financial analyst evaluator.
Your task is to judge whether a financial report
provides actionable investment guidance,
not just raw data."
Part 2: Explicit criteria
- ACTIONABLE — The report:
- Contains specific recommendations (buy/sell/hold or equivalent)
- Identifies concrete risks with supporting data
- Includes forward-looking analysis, not just historical data
- Provides context for *why* recommendations are made
- NOT ACTIONABLE — The report:
- Only summarizes publicly available data without interpretation
- Lacks specific recommendations or next steps
- Presents risks without supporting evidence
- Contains only backward-looking analysis
What makes criteria specific
- ACTIONABLE — The report:
- Contains specific recommendations (buy/sell/hold or equivalent)
- Identifies concrete risks with supporting data
- Includes forward-looking analysis, not just historical data
- Provides context for *why* recommendations are made
Criteria come from error analysis
Part 3: Present the data
[BEGIN DATA]
************
User query: {input}
************
Financial Report: {output}
************
[END DATA]
Part 4: Add examples
An actionable example
Example — ACTIONABLE:
"Based on NVDA's 122% YoY revenue growth driven by
data center demand, strong forward P/E of 35x relative
to sector median of 22x, and expanding margins, NVDA
presents a compelling growth position. Key risk:
concentration in AI training chips (~70% of revenue).
Recommendation: accumulate on pullbacks below $800."
A not-actionable example
Example — NOT ACTIONABLE:
"NVDA is a major player in the semiconductor industry.
The company has seen significant growth in recent years
driven by AI demand. NVDA's stock has performed well.
Investors should consider various factors when making
investment decisions."
Part 5: Constrain the output
- "Based on the criteria above, is this financial report ACTIONABLE or NOT ACTIONABLE?"
Chain-of-thought for judges
The full template
actionability_template = """
You are an expert financial analyst evaluator...
ACTIONABLE — [criteria]
NOT ACTIONABLE — [criteria]
[examples]
[BEGIN DATA]
User query: {input}
Financial Report: {output}
[END DATA]
Is this report ACTIONABLE or NOT ACTIONABLE?
"""
How the template flows
- Role definition
- Explicit criteria
- Labeled examples
- Data section with delimiters
- Constrained output question
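The five-part flow above can be sketched as string assembly plus a strict label parser. Everything here is a simplified illustration: the shortened part texts and the `parse_label` helper are assumptions, and this is not how `ClassificationEvaluator` is implemented internally.

```python
ROLE = "You are an expert financial analyst evaluator."
CRITERIA = (
    "ACTIONABLE: gives specific recommendations.\n"
    "NOT ACTIONABLE: only summarizes data."
)
EXAMPLES = 'Example (NOT ACTIONABLE): "Investors should consider various factors."'
DATA = "[BEGIN DATA]\nUser query: {input}\nFinancial Report: {output}\n[END DATA]"
QUESTION = "Is this report ACTIONABLE or NOT ACTIONABLE?"

# Role, criteria, examples, data, question: same order as the list above.
template = "\n\n".join([ROLE, CRITERIA, EXAMPLES, DATA, QUESTION])

CHOICES = {"actionable": 1.0, "not actionable": 0.0}


def parse_label(raw: str) -> tuple:
    """Map a judge's raw answer onto one of the allowed labels, or raise."""
    label = raw.strip().strip(".").lower()
    if label not in CHOICES:
        raise ValueError(f"Unexpected judge label: {raw!r}")
    return label, CHOICES[label]


prompt = template.format(input="Research: NVDA", output="NVDA is a major player.")
print(prompt)
print(parse_label("NOT ACTIONABLE"))
```

Rejecting anything outside the allowed labels is the point of constraining the output: a judge that answers "somewhat actionable" becomes a visible error instead of a silently mis-scored row.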
Wire it up
from phoenix.evals import ClassificationEvaluator
actionability_eval = ClassificationEvaluator(
    name="actionability",
    prompt_template=actionability_template,
    llm=llm,
    choices={"actionable": 1.0, "not actionable": 0.0},
)
with suppress_tracing():
    action_results_df = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[actionability_eval]
    )
Log and review
action_evaluations = to_annotation_dataframe(
    dataframe=action_results_df
)
Client().spans.log_span_annotations_dataframe(
    dataframe=action_evaluations
)
Read the explanations
Treat eval prompts like code
- Version them. Test them against known answers.
- Use Phoenix's prompt playground for fast iteration.
- An eval you haven't validated is just a fancy way of being wrong at scale.
The God Evaluator anti-pattern
- Don't build one eval that checks everything
- One evaluator per dimension
- Guardrails vs. north-star metrics
One evaluator per dimension
Guardrails vs. north-star metrics
Can you trust your judges?
Meta-evaluation
Your judge is a classifier
- It makes predictions: pass or fail
- Predictions can be compared against ground truth
- Your job: check the judge's homework
Human judgement is a lot of work
Building your golden dataset
Write unambiguous tasks
- If 0% pass rate on many trials → broken task, not broken agent
- Each task needs a reference solution
- Test when a behavior SHOULD occur AND when it shouldn't
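A quick way to spot candidates for "broken task, not broken agent" is to compute the pass rate per task across repeated trials and flag any task that never passes. The task names and results below are hypothetical.

```python
# Hypothetical trial results: task name -> pass/fail for each trial.
trials = {
    "includes_budget_breakdown": [True, False, True, True],
    "cites_exact_page_number": [False, False, False, False],
}


def pass_rate(results: list) -> float:
    return sum(results) / len(results)


# A task that fails every trial deserves a manual look before blaming the agent.
suspect_tasks = [name for name, results in trials.items() if pass_rate(results) == 0.0]
print(suspect_tasks)
```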
The golden set grows over time
Dev/test splits for your labels
Run the judge on the same examples
- actionability results on the labeled subset
- Compare: judge label vs. human label
Where they agree and disagree
Fixing the rubric
- Disagreement → read the explanation → find the ambiguity → tighten the criteria
- "Forward-looking analysis" → "Forward-looking analysis WITH specific recommendations"
Rubric iteration
Precision and recall
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True positive ✅ | Miss (false negative) ❌ |
| Actual negative | False positive ❌ | True negative ✅ |
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
- High precision means you minimize false positives
- High recall means you minimize misses
Precision and recall
Precision: when the judge says "fail," is it right?
Recall: of all real fails, how many does it catch?
Prioritize recall: catching defects matters more than occasional false positives
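As a worked example, the sketch below scores a judge against human labels with "fail" as the positive class, matching the framing above. The eight labels are made up for illustration.

```python
def precision_recall(human: list, judge: list, positive: str = "fail") -> tuple:
    """Treat `positive` as the class of interest and score the judge against humans."""
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h == positive and j == positive)
    fp = sum(1 for h, j in pairs if h != positive and j == positive)
    fn = sum(1 for h, j in pairs if h == positive and j != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels for eight traces.
human = ["fail", "fail", "fail", "pass", "pass", "pass", "pass", "fail"]
judge = ["fail", "fail", "pass", "pass", "fail", "pass", "pass", "fail"]

precision, recall = precision_recall(human, judge)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the judge catches three of four real failures (recall 0.75) and one of its four "fail" verdicts is a false alarm (precision 0.75).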
Interpreting precision and recall
Prioritize recall
Judge pitfalls
- Position bias — judges favor the first or last option
- Length bias — longer responses score higher
- Confidence bias — fooled by confidently wrong answers
- Self-preference — same model rates its own output higher
Mitigating self-preference bias
Detecting judge bias
The benchmark is human performance
- Human inter-rater reliability: often 0.2–0.3 (Cohen's Kappa)
- If your judge is more consistent than humans, that's a win
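Cohen's Kappa measures agreement between two raters after subtracting the agreement expected by chance. A minimal stdlib sketch follows, with hypothetical labels; the same function works for human-vs-human and judge-vs-human comparisons.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight traces.
human = ["fail", "fail", "fail", "pass", "pass", "pass", "pass", "fail"]
judge = ["fail", "fail", "pass", "pass", "fail", "pass", "pass", "fail"]

kappa = cohens_kappa(human, judge)
print(f"kappa={kappa:.2f}")
```

Raw agreement in this example is 75%, but half of that is expected by chance with balanced labels, so Kappa comes out at 0.5.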
Failures should seem fair
- When a task fails, is it clear what the agent got wrong?
- If scores don't climb, is the eval at fault?
- Reading transcripts is how you verify
The CORE-Bench story
The lesson: verify your evals
Datasets and experiments
Step 7: Iterate
The problem with one-off fixes
Save failures as a dataset
Save passing traces too
Improve the agent
IMPROVED_RESEARCH_PROMPT = """Research {tickers}.
Focus on: {focus}.
You MUST include:
- Specific financial ratios (P/E, P/B, debt-to-equity)
- News from the last 6 months
- Current stock price or recent performance data
- Competitive context and market positioning"""
IMPROVED_WRITE_PROMPT = """Write a concise financial report.
The report MUST be actionable. Specifically:
- Include explicit buy/sell/hold recommendations
- Identify concrete risks with supporting data
- Include forward-looking analysis
- Provide context for WHY each recommendation is made"""
Every change maps to a finding
Data-driven prompt engineering
Run an experiment
dataset = Client().datasets.get_dataset(
    dataset="aie-financial-agent-fails"
)
async def my_task(example):
    return await improved_financial_report(
        tickers, focus
    )
experiment = await async_client.experiments.run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=evaluators
)
The task function
The task abstraction
What experiments show you
- Same inputs, same evaluators, different agent version
- The only variable is your change
- Side-by-side comparison, example by example
Dealing with non-determinism
Compare the results
The eval-iterate cycle
Find failures → Read explanations → Fix the prompt → Run experiment → Repeat
How many samples do you need?
- 200 samples, 3% defect rate → 95% CI: 0.6%–5.4%
- 400 samples → 95% CI: 1.3%–4.7%
- More samples = tighter confidence, but diminishing returns
Diminishing returns in sample size
- Double to 400 samples → 95% CI: 1.3%–4.7%
- Halving margin of error requires 4x the samples
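The intervals quoted above are consistent with the normal approximation to the binomial, which the sketch below assumes (the slides do not say which method was used). The helper name `defect_rate_ci` is made up for this example.

```python
import math


def defect_rate_ci(defect_rate: float, n: int, z: float = 1.96) -> tuple:
    """Confidence interval (95% by default) via the normal approximation."""
    margin = z * math.sqrt(defect_rate * (1 - defect_rate) / n)
    return max(0.0, defect_rate - margin), min(1.0, defect_rate + margin)


for n in (200, 400, 800):
    low, high = defect_rate_ci(0.03, n)
    print(f"n={n}: {low:.1%} to {high:.1%}")
```

Because the margin shrinks with the square root of the sample size, going from 200 to 800 samples is what halves it. For very low defect rates or small samples, a Wilson interval is usually a safer choice than this approximation.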
Practical sample size guidance
- Workshop experiments: 12–20 examples for directional signal
- Shipping decisions: 200–400 samples
Making the cycle systematic
Where to invest your effort
The impact hierarchy
- Data quality fixes (highest impact)
- Prompting improvements
- Model selection
- Hyperparameter tuning (lowest impact)
Model selection and tuning
- Model upgrades: sometimes necessary, always costly
- Hyperparameter tuning: lowest impact, try last
Eval-driven development
- Write the eval first, then build the feature
- Like test-driven development, but for AI
- The eval defines what "done" means
Eval-driven development in practice
- Claude Code evolved this way — evals before capabilities
- Non-engineers can define evals too
Who can write evals?
- Non-engineers can define evals
- Product managers, customer success, even salespeople
The data flywheel
- Log → Sample → Review → Improve → Repeat
- Each iteration compounds
- Production failures become tomorrow's test cases
Model adoption advantage
- Teams with evals upgrade models in days
- Teams without face weeks of manual testing
Beyond the workshop
What to explore next
Production monitoring
- Run evals on sampled live traffic
- Alert on sustained score drops, not individual failures
- Route production failures back into your test suite
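"Sustained drop, not individual failure" can be expressed as a rolling mean that must stay under a threshold for several consecutive steps. This is a sketch: the window, threshold, and patience values are arbitrary assumptions, and a production setup would more likely rely on the monitoring platform's own alerting.

```python
from collections import deque


def sustained_drop(scores, window: int = 5, threshold: float = 0.8,
                   patience: int = 3) -> bool:
    """True once the rolling mean stays below `threshold` for `patience` steps in a row."""
    recent = deque(maxlen=window)
    below = 0
    for score in scores:
        recent.append(score)
        if len(recent) < window:
            continue  # not enough data for a full window yet
        rolling_mean = sum(recent) / window
        below = below + 1 if rolling_mean < threshold else 0
        if below >= patience:
            return True
    return False


one_bad_trace = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
real_regression = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
print(sustained_drop(one_bad_trace), sustained_drop(real_regression))
```

A single failing trace never pulls the five-score mean under 0.8, so it does not fire, while a run of failures does.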
Cost-aware evaluation
- Not every query needs a frontier model
- Cost-Normalized Accuracy: accuracy ÷ cost
- An agent that's 92% accurate at $0.02/query may beat one that's 95% accurate at $0.15/query
Cost-Normalized Accuracy
- Accuracy ÷ cost = value per dollar
- 92% at $0.02/query can beat 95% at $0.15/query
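The arithmetic behind that comparison, as a short sketch (the function name is an assumption; the numbers are the ones on the slide):

```python
def cost_normalized_accuracy(accuracy: float, cost_per_query: float) -> float:
    """Accuracy points per dollar spent on a single query."""
    return accuracy / cost_per_query


cheap = cost_normalized_accuracy(0.92, 0.02)      # 92% accurate at $0.02/query
frontier = cost_normalized_accuracy(0.95, 0.15)   # 95% accurate at $0.15/query
print(f"cheap={cheap:.1f} frontier={frontier:.1f}")
```

The cheaper agent delivers roughly seven times the accuracy per dollar. The ratio says nothing about whether 92% is good enough for the use case, so it works best alongside a minimum-accuracy guardrail.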
Pairwise evaluation
- "Is A better than B?" is more reliable than "Rate A from 1-10"
- Run twice with order swapped to control for position bias
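Order swapping can be sketched with a judge passed in as a function. The stub judges below stand in for real LLM calls and are purely hypothetical; the "first"/"second" return convention is also an assumption of this sketch.

```python
from typing import Callable

# A judge takes (first, second) and returns "first" or "second".
Judge = Callable[[str, str], str]


def pairwise_winner(judge: Judge, a: str, b: str) -> str:
    """Ask twice with the order swapped; only a consistent verdict counts."""
    a_wins_ab = judge(a, b) == "first"    # A shown first
    a_wins_ba = judge(b, a) == "second"   # B shown first
    if a_wins_ab and a_wins_ba:
        return "A"
    if not a_wins_ab and not a_wins_ba:
        return "B"
    return "tie"  # verdict flipped with position, so treat it as inconclusive


def always_first(first: str, second: str) -> str:
    """Stub judge with total position bias, standing in for an LLM call."""
    return "first"


def prefers_longer(first: str, second: str) -> str:
    """Stub judge that ignores order (but is length-biased)."""
    return "first" if len(first) >= len(second) else "second"


print(pairwise_winner(always_first, "x", "y"))
print(pairwise_winner(prefers_longer, "short", "a much longer answer"))
```

A judge that always picks whatever it sees first ends up with a tie, which exposes the bias instead of letting it decide the winner. The swap does nothing for length bias, as the second stub shows.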
Reliability scoring

The frontier
- Multi-judge systems — consensus reduces bias
- Agent-as-a-judge — judges with tools
- Benchmark saturation — when evals need to get harder
What we built today

Start small
Evals are infrastructure
- Treat evals as a core part of your system, not an afterthought
- The value compounds — but only if you keep investing
Go try it
- app.phoenix.arize.com
- arize.com/docs/phoenix
- github.com/Arize-ai/phoenix
Arize AX
- Enterprise SAML, SSO
- Compliance: SOC2 etc.
- ADB: billions of rows
- Session-aware agent tracing
- Alyx: AI assistant
- Agent graphs
- Metrics
- Dashboards
- Monitoring
Thank you!
🦋 @seldo.com on BlueSky
arize.com/docs/phoenix
These slides:
Ship Real Agents (AIE Europe)
By Laurie Voss