Ship Real Agents

Hands-on evals for agentic applications

AIE Europe 2026-04-08

What are we talking about?

  • What is an eval, and why you need them
  • Setting up tracing with Phoenix
  • Building an AI agent with the Claude Agent SDK

From data to evals

  • Looking at your data: error analysis
  • Code evals and built-in LLM evals
  • Writing custom eval rubrics
  • Meta-evaluation: testing your tests

From evals to experiments

and beyond

  • Datasets, experiments, and the improvement cycle
  • A look at what comes next

Get set up

  1. Learner notebook: bit.ly/build-real-agents
  2. Claude API key: platform.claude.com/settings/keys
  3. Phoenix Cloud account: app.phoenix.arize.com/login/sign-up


What is an eval?

Traces are logs, evals are tests

  • Traces = logs, for AI
  • Evals = tests, for AI

Spans: the building blocks

  • Each span = one step in execution
  • LLM call, tool call, agent turn
  • Records input, output, timing, token counts

The vibes problem

What you can't do without evals

  • Can't detect regressions when you change a prompt
  • Can't compare prompt versions objectively
  • Can't know if a new model is actually better
  • Can't run quality gates in CI

You can't switch models without evals

  • New models drop every few months
  • Without evals, switching = weeks of manual testing
  • With evals, you know within hours

This is not theoretical

Everybody vibes until they can't any more

Two types of evals

  • Code evals — deterministic, free, fast
  • LLM-as-a-judge — semantic, flexible, powerful

LLM-as-a-judge evals

  • A second LLM grades outputs against a rubric
  • Handles meaning, not just strings
  • Non-deterministic — needs calibration

LLM judges: tradeoffs

When to use which

  • Code evals → format, structure, constraints
  • LLM judge → accuracy, relevance, tone, faithfulness
  • Human review → novel failures, calibrating judges

Why agents make this harder

  • Single LLM call: input → output. Done.
  • Agent: input → tool call → result → reasoning → another tool call → output
  • Errors cascade. Each step can go wrong.
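A quick way to see why cascades hurt: if each step succeeds independently with probability p, an n-step run succeeds with probability p^n. A sketch with hypothetical numbers:

```python
# Hypothetical per-step success rate; every step must succeed for the run to succeed.
per_step = 0.95

for steps in (1, 3, 5, 8):
    end_to_end = per_step ** steps
    print(f"{steps} steps: {end_to_end:.1%} end-to-end success")
```

At 95% per step, a five-step agent already fails more than one run in five.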

Multi-agent complexity

  • Handoffs between agents add another layer
  • Triage routing, specialist handling
  • Each layer = new ways things can go wrong

Cascading failures

  • Bad retrieval → bad reasoning → confidently wrong output
  • The user sees a polished response and trusts it.
  • This is worse than an obvious failure.

Creatively correct vs. wrong

  • Sometimes the agent finds a better solution
  • Your eval says "fail" — but the agent was right
  • Evals need to distinguish creative from wrong

Another way to categorize evals

  • Capability evals: can it do this new thing?
  • Regression evals: can it do the stuff it used to do?

What an eval result looks like

Code eval:  score: 1 · label: "valid"
LLM judge:  score: 0 · label: "incorrect"
            explanation: "The response fails to include..."
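The two shapes above share a common structure; a minimal sketch in Python (field names follow the slide, not any particular library's API):

```python
from typing import Optional, TypedDict

class EvalResult(TypedDict, total=False):
    score: float                # 1 = pass, 0 = fail (or a graded value)
    label: str                  # e.g. "valid", "incorrect"
    explanation: Optional[str]  # LLM judges usually attach one; code evals often don't

code_result: EvalResult = {"score": 1, "label": "valid"}
judge_result: EvalResult = {
    "score": 0,
    "label": "incorrect",
    "explanation": "The response fails to include...",
}
```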

What a real explanation looks like

label: "incorrect"
explanation: "The response fails to include a budget
breakdown, which is a core requirement. The agent
provides destination info and local recommendations
but omits all cost estimates, making the plan
incomplete for a user who asked specifically
about budget travel to Tokyo."

Explanations make evals actionable

  • Concrete failure → you know what to fix
  • Same explanation across 50 traces = systematic problem
  • Evals become a debugging tool, not just a scoreboard

The full loop

Setting up Phoenix

Step 1: Tracing

What is Phoenix?

  • Open-source AI observability platform
  • Captures traces from any AI framework
  • Free cloud tier at app.phoenix.arize.com

Open the notebook

Install dependencies

pip install claude-agent-sdk \
  openinference-instrumentation-claude-agent-sdk \
  arize-phoenix anthropic

The Claude Agent SDK

  • Anthropic's framework for building agents
  • Tool use, web search, conversation context
  • Auto-instrumented by OpenInference

What are we building?

  • A financial analysis chatbot
  • Two-turn agent: research then write
  • Web search tool for real financial data
  • Traces everything to Phoenix automatically

How the two turns connect

  • Two-turn architecture: research → write
  • Claude SDK saves context between turns
  • Phoenix traces every step automatically

Set your API keys

import os
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-XXXX"
os.environ["PHOENIX_API_KEY"] = "YYYY"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = (
    "https://app.phoenix.arize.com/s/yourusername/"
)

Register the tracer

from phoenix.otel import register
tracer_provider = register(
    project_name="aie-claude-financial-agent",
    auto_instrument=True
)
tracer = tracer_provider.get_tracer(__name__)

Build the agent

A financial analysis chatbot

The agent setup

from claude_agent_sdk import (
    ClaudeSDKClient, ClaudeAgentOptions, AssistantMessage, TextBlock
)
options = ClaudeAgentOptions(
    model="claude-haiku-4-5-20251001",
    allowed_tools=["WebSearch"],
    permission_mode="acceptEdits",
)

The two-turn pattern

RESEARCH_PROMPT = (
    "Research {tickers}. Focus on: {focus}. "
    "Use web search to find current financial data."
)
WRITE_PROMPT = (
    "Now write a concise financial report "
    "based on your research above."
)

The financial_report function

async def financial_report(tickers, focus):
    async with ClaudeSDKClient(options=options) as client:
        # Turn 1: research (the client keeps context between turns)
        await client.query(RESEARCH_PROMPT.format(tickers=tickers, focus=focus))
        async for message in client.receive_response():
            pass  # drain the stream; research completes
        # Turn 2: write the report from the research context
        await client.query(WRITE_PROMPT)
        report = ""
        async for message in client.receive_response():
            if isinstance(message, AssistantMessage):
                for block in message.content:
                    if isinstance(block, TextBlock):
                        report += block.text
        return report

Wrapping it in a span

with tracer.start_as_current_span(
        "financial_report",
        attributes={
            "input.value": f"Research: {tickers}\nFocus: {focus}",
        },
    ) as span:
    # run the agent inside the span, then record its output
    report = await financial_report(tickers, focus)
    span.set_attribute("output.value", report)

Run it!

result = await financial_report(
    "TSLA",
    "financial performance and growth outlook"
)
print(result)

Non-deterministic by design

Why this needs evals

Look at the trace

Traces reveal every decision

What lives in a span

  • kind: AGENT
  • attributes.input.value: "Research: TSLA\nFocus: financial..."
  • attributes.output.value: "# TESLA, INC. (TSLA)..."
  • start_time, end_time, duration

Generate test data

Here's one I made earlier

Test queries

test_queries = [
    {"tickers": "AAPL", "focus": "revenue growth"},
    {"tickers": "NVDA", "focus": "AI chip demand"},
    {"tickers": "AMZN", "focus": "AWS performance"},
    {"tickers": "GOOGL", "focus": "advertising revenue"},
    {"tickers": "MSFT", "focus": "cloud computing segment"},
    {"tickers": "META", "focus": "metaverse investments"},
    {"tickers": "TSLA", "focus": "vehicle deliveries"},
    {"tickers": "RIVN", "focus": "financial health"},
    {"tickers": "AAPL, MSFT", "focus": "comparative analysis"},
    {"tickers": "NVDA", "focus": "competitive landscape"},
    {"tickers": "KO", "focus": "dividend yield"},
    {"tickers": "AMZN", "focus": "profitability trends"},
]

Covering the edge cases

Traces are loaded

Start with data, not metrics

Read your traces before you write evals

You need requirements first

  • You can't say "it doesn't work" if you haven't defined what "works" looks like
  • Write down explicit success criteria

Defining success is cross-functional work

Where to get test data

  • Before production: synthetic data (LLM-generated queries)
  • After production: real user queries from traces
  • Diversity is critical — vary phrasing, intent, complexity

Don't forget the edge cases

Examine traces

When the data looks right but isn't

The "confidently wrong" problem

Categorize by root cause

  • "The response was wrong" — not actionable. Ask why.
  • Retrieval failure → better search
  • Reasoning error → better prompts
  • Hallucination → grounding checks
  • Scope violation → explicit boundaries
  • Formatting / tone → output guardrails

Categorize what we found

trace_categories = {
    "TSLA performance": "looks good",
    "NVDA competitive": "possible hallucination",
    "AAPL vs MSFT": "reasoning gap",
    "RIVN financial health": "looks good",
    ...
}
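A mapping like `trace_categories` turns into the frequency table with `collections.Counter`; a sketch using an abridged version of the categories above:

```python
from collections import Counter

trace_categories = {
    "TSLA performance": "looks good",
    "NVDA competitive": "possible hallucination",
    "AAPL vs MSFT": "reasoning gap",
    "RIVN financial health": "looks good",
}

# Tally how often each root-cause category appears across traces
counts = Counter(trace_categories.values())
for category, n in counts.most_common():
    print(f"{category:25s} {n:2d} {'█' * n}")
```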

What the frequency table tells you

Category counts:
looks good                5  █████
possible hallucination    3  ███
reasoning gap             2  ██
unverifiable data         1  █
missing recommendation    1  █

Frequency x Severity = Priority

The Swiss Cheese Model

Stack your eval layers

Evaluations

Step 4: Code evals

The simplest useful eval

Get your spans

from phoenix.client import Client
px_client = Client()
spans_df = px_client.spans.get_spans_dataframe(
    project_name="aie-claude-financial-agent"
)
parent_spans = spans_df[spans_df["parent_id"].isna()].rename(columns={
    "attributes.input.value": "input",
    "attributes.output.value": "output"
})

Ticker check eval

import re
from phoenix.evals import create_evaluator
@create_evaluator(name="mentions_ticker", kind="code")
def mentions_ticker(input, output):
    tickers = re.findall(r"\b([A-Z]{1,5})\b", input)
    likely_tickers = [t for t in tickers
        if len(t) >= 2 and t not in ("AI", "US", ...)]
    missing = [t for t in likely_tickers
               if t not in output.upper()]
    if not missing:
        return {"label": "pass", "score": 1}
    return {"label": "fail", "score": 0,
        "explanation": f"Missing: {', '.join(missing)}"}

How the ticker check works

Running the ticker check

Why this matters

Code evals aren't just toy examples

  • Did the output parse as JSON?
  • Is the response under 500 tokens?
  • Does it avoid forbidden phrases?
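Each of those checks is a few lines of deterministic Python; a sketch (the function name, length budget, and phrase list are hypothetical):

```python
import json

def output_checks(output: str, max_chars: int = 2000) -> dict:
    """Deterministic checks on an agent's output: no LLM required."""
    forbidden = ("as an ai language model",)  # hypothetical blocklist
    results = {}
    # Did the output parse as JSON?
    try:
        json.loads(output)
        results["valid_json"] = 1
    except json.JSONDecodeError:
        results["valid_json"] = 0
    # Is the response under a length budget? (chars as a cheap token proxy)
    results["under_budget"] = int(len(output) <= max_chars)
    # Does it avoid forbidden phrases?
    lowered = output.lower()
    results["no_forbidden"] = int(not any(p in lowered for p in forbidden))
    return results
```

Checks like these run in microseconds and cost nothing, so they can gate every trace.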

Grade what was produced, not the path

  • Don't check that the agent followed specific steps
  • Agents find valid approaches you didn't anticipate
  • Check the outcome, not the trajectory

Check the outcome, not the trajectory

Step 5: Built-in LLM evals

What code can't check

Three components

  1. A judge model (the LLM that grades)
  2. A prompt template (the rubric)
  3. Data (the examples being evaluated)

Phoenix ships built-in evals

  • Correctness, Faithfulness, Conciseness
  • Tool Selection, Tool Invocation
  • Document Relevance, Refusal
  • No prompt engineering required

Tool selection evals

Set up the judge

from phoenix.evals import LLM
from phoenix.evals.metrics import CorrectnessEvaluator
llm = LLM(model="claude-sonnet-4-6", provider="anthropic")
correctness_eval = CorrectnessEvaluator(llm=llm)

Run the evaluation

from phoenix.evals import evaluate_dataframe
from phoenix.trace import suppress_tracing
with suppress_tracing():
    results_df = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[correctness_eval]
    )

Log the results

from phoenix.evals.utils import to_annotation_dataframe
evaluations = to_annotation_dataframe(dataframe=results_df)
Client().spans.log_span_annotations_dataframe(
    dataframe=evaluations
)

Every score is zero

Faithfulness

Giving the judge context

  • Correctness: "Is this factually accurate?" (no context)
  • Faithfulness: "Does this stick to the source material?" (with context)
  • The difference: faithfulness gets the research the agent found
  • Not always better; more appropriate for this use case

How faithfulness works

  • FaithfulnessEvaluator needs three columns:
    • input: the user's query
    • output: the agent's response
    • context: the source material to check against

Extract research context

child_spans = spans_df[spans_df["parent_id"].notna()]

Run faithfulness

from phoenix.evals.metrics import FaithfulnessEvaluator
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
with suppress_tracing():
    faith_results = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[faithfulness_eval]
    )

Faithfulness results

Two built-in evals, two different signals

  • Correctness: 0/13 passed — eval doesn't fit the use case
  • Faithfulness: 13/13 passed!
  • Choosing the right eval matters more than tuning it

What you see

Built-in evals are your starting point

Step 6: Custom eval rubrics

Five parts of a good eval prompt

  1. Define the judge's role
  2. Explicit CORRECT / INCORRECT criteria
  3. Present the data clearly
  4. Add labeled examples
  5. Constrain the output labels

Part 1: Define the role

"You are an expert financial analyst evaluator.
Your task is to judge whether a financial report
provides actionable investment guidance,
not just raw data."

Part 2: Explicit criteria

  • ACTIONABLE — The report:
    • Contains specific recommendations (buy/sell/hold or equivalent)
    • Identifies concrete risks with supporting data
    • Includes forward-looking analysis, not just historical data
    • Provides context for *why* recommendations are made
  • NOT ACTIONABLE — The report:
    • Only summarizes publicly available data without interpretation
    • Lacks specific recommendations or next steps
    • Presents risks without supporting evidence
    • Contains only backward-looking analysis

What makes criteria specific

  • ACTIONABLE — The report:
    • Contains specific recommendations (buy/sell/hold or equivalent)
    • Identifies concrete risks with supporting data
    • Includes forward-looking analysis, not just historical data
    • Provides context for *why* recommendations are made

Criteria come from error analysis

Part 3: Present the data

[BEGIN DATA]
************
User query: {input}
************
Financial Report: {output}
************
[END DATA]

Part 4: Add examples

An actionable example

Example — ACTIONABLE:
"Based on NVDA's 122% YoY revenue growth driven by
data center demand, strong forward P/E of 35x relative
to sector median of 22x, and expanding margins, NVDA
presents a compelling growth position. Key risk:
concentration in AI training chips (~70% of revenue).
Recommendation: accumulate on pullbacks below $800."

A not-actionable example

Example — NOT ACTIONABLE:
"NVDA is a major player in the semiconductor industry.
The company has seen significant growth in recent years
driven by AI demand. NVDA's stock has performed well.
Investors should consider various factors when making
investment decisions."

Part 5: Constrain the output

"Based on the criteria above, is this financial report ACTIONABLE or NOT ACTIONABLE?"

Chain-of-thought for judges

The full template

actionability_template = """
You are an expert financial analyst evaluator...
ACTIONABLE — [criteria]
NOT ACTIONABLE — [criteria]
[examples]
[BEGIN DATA]
User query: {input}
Financial Report: {output}
[END DATA]
Is this report ACTIONABLE or NOT ACTIONABLE?
"""

How the template flows

  1. Role definition
  2. Explicit criteria
  3. Labeled examples
  4. Data section with delimiters
  5. Constrained output question

Wire it up

from phoenix.evals import ClassificationEvaluator
actionability_eval = ClassificationEvaluator(
    name="actionability",
    prompt_template=actionability_template,
    llm=llm,
    choices={"actionable": 1.0, "not actionable": 0.0},
)
with suppress_tracing():
    action_results_df = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[actionability_eval]
    )

Log and review

action_evaluations = to_annotation_dataframe(
    dataframe=action_results_df
)
Client().spans.log_span_annotations_dataframe(
    dataframe=action_evaluations
)

Read the explanations

Treat eval prompts like code

  • Version them. Test them against known answers.
  • Use Phoenix's prompt playground for fast iteration.
  • An eval you haven't validated is just a fancy way of being wrong at scale.

The God Evaluator anti-pattern

  • Don't build one eval that checks everything
  • One evaluator per dimension
  • Guardrails vs. north-star metrics

One evaluator per dimension

Guardrails vs. north-star metrics

Can you trust your judges?

Meta-evaluation

Your judge is a classifier

  • It makes predictions: pass or fail
  • Predictions can be compared against ground truth
  • Your job: check the judge's homework

Human judgement is a lot of work

Building your golden dataset

Write unambiguous tasks

  • If 0% pass rate on many trials → broken task, not broken agent
  • Each task needs a reference solution
  • Test when a behavior SHOULD occur AND when it shouldn't

The golden set grows over time

Dev/test splits for your labels

Run the judge on the same examples

  • actionability results on the labeled subset
  • Compare: judge label vs. human label

Where they agree and disagree

Fixing the rubric

  • Disagreement → read the explanation → find the ambiguity → tighten the criteria
  • "Forward-looking analysis" → "Forward-looking analysis WITH specific recommendations"

Rubric iteration

Precision and recall

                    Predicted positive    Predicted negative
Actual positive     True positive ✅      Miss (false negative) ❌
Actual negative     False positive ❌     True negative ✅

Precision = true positives / (true positives + false positives)

Recall = true positives / (true positives + false negatives)

  • High precision means you minimize false positives
  • High recall means you minimize misses
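Treating the judge as a classifier against human ground truth, both metrics are a few lines of Python; a sketch where "fail" is the positive class, since defects are what we want to catch:

```python
def judge_metrics(judge_labels, human_labels, positive="fail"):
    """Score an LLM judge's labels against human ground-truth labels."""
    pairs = list(zip(judge_labels, human_labels))
    tp = sum(j == positive and h == positive for j, h in pairs)  # caught defects
    fp = sum(j == positive and h != positive for j, h in pairs)  # false alarms
    fn = sum(j != positive and h == positive for j, h in pairs)  # misses
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```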

Precision and recall

Precision: when the judge says "fail," is it right?
Recall: of all real fails, how many does it catch?
Prioritize recall — catching defects matters more
  than occasional false positives

Interpreting precision and recall

Prioritize recall

Judge pitfalls

  • Position bias — judges favor the first or last option
  • Length bias — longer responses score higher
  • Confidence bias — fooled by confidently wrong answers
  • Self-preference — same model rates its own output higher

Mitigating self-preference bias

Detecting judge bias

The benchmark is human performance

  • Human inter-rater reliability: often 0.2–0.3 (Cohen's Kappa)
  • If your judge is more consistent than humans, that's a win
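Cohen's Kappa corrects raw agreement for agreement expected by chance; a minimal two-rater implementation for intuition:

```python
def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance (Cohen's Kappa)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Two raters who flip coins land near 0; perfect agreement is 1.0, which is why 0.2-0.3 between humans is a sobering baseline.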

Failures should seem fair

  • When a task fails, is it clear what the agent got wrong?
  • If scores don't climb, is the eval at fault?
  • Reading transcripts is how you verify

The CORE-Bench story

The lesson: verify your evals

Datasets and experiments

Step 7: Iterate

The problem with one-off fixes

Save failures as a dataset

Save passing traces too

Improve the agent

IMPROVED_RESEARCH_PROMPT = """Research {tickers}.
    Focus on: {focus}.
    You MUST include:
    - Specific financial ratios (P/E, P/B, debt-to-equity)
    - News from the last 6 months
    - Current stock price or recent performance data
    - Competitive context and market positioning"""
IMPROVED_WRITE_PROMPT = """Write a concise financial report.
    The report MUST be actionable. Specifically:
    - Include explicit buy/sell/hold recommendations
    - Identify concrete risks with supporting data
    - Include forward-looking analysis
    - Provide context for WHY each recommendation is made"""

Every change maps to a finding

Data-driven prompt engineering

Run an experiment

dataset = Client().datasets.get_dataset(
    dataset="aie-financial-agent-fails"
)
async def my_task(example):
    # example.input holds the dataset row's input fields
    return await improved_financial_report(
        example.input["tickers"], example.input["focus"]
    )
experiment = await async_client.experiments.run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=evaluators
)

The task function

The task abstraction

What experiments show you

  • Same inputs, same evaluators, different agent version
  • The only variable is your change
  • Side-by-side comparison, example by example

Dealing with non-determinism

Compare the results

The eval-iterate cycle

Find failures → Read explanations → Fix the prompt → Run experiment → Repeat

How many samples do you need?

  • 200 samples, 3% defect rate → 95% CI: 0.6%–5.4%
  • 400 samples → 95% CI: 1.3%–4.7%
  • More samples = tighter confidence, but diminishing returns
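Those intervals follow from the normal approximation to the binomial; a sketch that reproduces the numbers above:

```python
import math

def defect_rate_ci(defects: int, n: int, z: float = 1.96):
    """Approximate 95% confidence interval for an observed defect rate."""
    p = defects / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

print(defect_rate_ci(6, 200))   # 3% rate, 200 samples → ≈ (0.006, 0.054)
print(defect_rate_ci(12, 400))  # same rate, 400 samples → ≈ (0.013, 0.047)
```

The half-width shrinks with the square root of n, which is where the 4x-samples-for-half-the-error rule comes from.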

Diminishing returns in sample size

  • Double to 400 samples → 95% CI: 1.3%–4.7%
  • Halving margin of error requires 4x the samples

Practical sample size guidance

  • Workshop experiments: 12–20 examples for directional signal
  • Shipping decisions: 200–400 samples

Making the cycle systematic

Where to invest your effort

The impact hierarchy

  1. Data quality fixes (highest impact)
  2. Prompting improvements
  3. Model selection
  4. Hyperparameter tuning (lowest impact)

Model selection and tuning

  • Model upgrades: sometimes necessary, always costly
  • Hyperparameter tuning: lowest impact, try last

Eval-driven development

  • Write the eval first, then build the feature
  • Like test-driven development, but for AI
  • The eval defines what "done" means

Eval-driven development in practice

  • Claude Code evolved this way — evals before capabilities
  • Non-engineers can define evals too

Who can write evals?

  • Non-engineers can define evals
  • Product managers, customer success, even salespeople

The data flywheel

  • Log → Sample → Review → Improve → Repeat
  • Each iteration compounds
  • Production failures become tomorrow's test cases

Model adoption advantage

  • Teams with evals upgrade models in days
  • Teams without face weeks of manual testing

Beyond the workshop

What to explore next

Production monitoring

  • Run evals on sampled live traffic
  • Alert on sustained score drops, not individual failures
  • Route production failures back into your test suite

Cost-aware evaluation

  • Not every query needs a frontier model
  • Cost-Normalized Accuracy: accuracy ÷ cost
  • An agent that's 92% accurate at $0.02/query may beat one that's 95% accurate at $0.15/query
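With the slide's numbers, the comparison is a one-line division; a sketch:

```python
def cost_normalized_accuracy(accuracy: float, cost_per_query: float) -> float:
    """Accuracy per dollar: higher means more value per unit spend."""
    return accuracy / cost_per_query

cheap = cost_normalized_accuracy(0.92, 0.02)     # ≈ 46 accuracy-points per dollar
frontier = cost_normalized_accuracy(0.95, 0.15)  # ≈ 6.3
```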

Cost-Normalized Accuracy

  • Accuracy ÷ cost = value per dollar
  • 92% at $0.02/query can beat 95% at $0.15/query

Pairwise evaluation

  • "Is A better than B?" is more reliable than "Rate A from 1-10"
  • Run twice with order swapped to control for position bias
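The order-swap control wraps around any pairwise judge; a sketch where `judge` is a hypothetical callable returning "first" or "second":

```python
def pairwise_judge(judge, a: str, b: str) -> str:
    """Run a pairwise comparison twice with order swapped to control position bias."""
    first_pass = judge(a, b)   # judge sees A first
    second_pass = judge(b, a)  # judge sees B first
    if first_pass == "first" and second_pass == "second":
        return "A"  # preferred A regardless of position
    if first_pass == "second" and second_pass == "first":
        return "B"  # preferred B regardless of position
    return "tie"    # verdict flipped with position: no reliable signal

# A position-biased judge that always picks whatever it saw first
biased = lambda x, y: "first"
print(pairwise_judge(biased, "answer A", "answer B"))  # "tie" — bias detected
```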

Reliability scoring

The frontier

  • Multi-judge systems — consensus reduces bias
  • Agent-as-a-judge — judges with tools
  • Benchmark saturation — when evals need to get harder

What we built today

Start small

Evals are infrastructure

  • Treat evals as a core part of your system, not an afterthought
  • The value compounds — but only if you keep investing

Go try it

  • app.phoenix.arize.com
  • arize.com/docs/phoenix
  • github.com/Arize-ai/phoenix

Arize AX

  • Enterprise SAML, SSO
  • Compliance: SOC2 etc.
  • ADB: billions of rows
  • Session-aware agent tracing
  • Alyx: AI assistant
  • Agent graphs
  • Metrics
  • Dashboards
  • Monitoring

Thank you!

🦋 @seldo.com on BlueSky

 

arize.com/docs/phoenix

Ship Real Agents (AIE Europe)

By Laurie Voss
