How to Evaluate AI Agents: 3 Framework Comparison

Compare Strands Agents, PydanticAI, and DeepEval for AI agent evaluation: same test cases, same rubrics, different frameworks, with code examples and results.

Find all the code here: Evaluate AI Agents with Strands

Your AI agent produces answers. But how do you know if they’re good?

Three frameworks promise to solve this: Strands Agents, PydanticAI, and DeepEval. They all use LLM-as-Judge. They all detect hallucinations. But when you run the exact same test through each one, the scores diverge.

The problem: Framework comparisons usually test different things and call it “fair.” This post is different. We run identical test cases with the same judge model (gpt-4o-mini) through all three frameworks. The only variable? The framework API.

What you’ll learn:

  • Why GEval scores differ from direct rubric prompting (it’s by design, not a bug)
  • Which framework works best for your stack (AWS vs type-safety vs framework-agnostic)
  • When to use deterministic checks vs LLM-based evaluation
  • Why PydanticAI can’t evaluate pre-computed tool lists (OpenTelemetry requirement)

What’s actually being compared:

  • Strands Agents = Agent framework + evaluation library (strands-agents-evals)
  • PydanticAI = Agent framework + evaluation library (pydantic-evals)
  • DeepEval = Evaluation-only framework (works with any agent)

DeepEval doesn’t build agents—it only evaluates them. This makes it comparable to strands-agents-evals and pydantic-evals (the evaluation libraries), not to the full Strands/PydanticAI frameworks.

The evaluation landscape for AI agents saw 45+ new research papers in the past 6 months on arXiv (Cornell University’s open-access preprint repository), proposing new metrics for trajectory quality (TRACE), hallucination detection (LSC), and cost-performance tradeoffs (KAMI). But when it comes to implementing these evaluations, which framework should you use?


Why these 3 frameworks (and not CrewAI, LangGraph, or AutoGen)?

I compared 8 agent frameworks for evaluation capabilities. Most popular frameworks (CrewAI, LangGraph, AutoGen, OpenAI Agents SDK, Google ADK) focus on building agents, not evaluating them. They do not ship dedicated evaluation libraries.

These 3 were selected because they are the only ones with dedicated, open-source evaluation SDKs:

| Framework | Evaluation Library | What It Provides |
| --- | --- | --- |
| Strands Agents | strands-agents-evals | OutputEvaluator, TrajectoryEvaluator, ToolCalled, ActorSimulator, Experiment runner |
| PydanticAI | pydantic-evals | LLMJudge, typed Datasets with YAML, report diffing, HasMatchingSpan |
| DeepEval | deepeval (standalone) | 30+ metrics: GEval, HallucinationMetric, FaithfulnessMetric, ToolCorrectnessMetric |

What about the others?

| Framework | Why Not Included |
| --- | --- |
| CrewAI | crewai test only supports OpenAI, provides basic 1-10 scoring. No rubrics, no trajectory eval, no hallucination detection. |
| LangGraph | Evaluation lives in LangSmith (paid SaaS), not in the open-source framework. |
| AutoGen | Has AutoGen Bench for benchmarking but no evaluation SDK with comparable metrics. |
| OpenAI Agents SDK | Provides tracing hooks but no evaluation library. Pair it with DeepEval to evaluate. |
| Google ADK | Has adk eval CLI but tightly coupled to the Gemini ecosystem. |

If you use CrewAI, LangGraph, or AutoGen to build your agent, you still need one of these 3 frameworks to evaluate it. DeepEval in particular is framework-agnostic and works with any agent.

Figure: The same test data flows through each framework's own evaluation API (Strands Agents, PydanticAI, DeepEval).

What evaluation tasks are we running?

We evaluate the same travel assistant agent scenario across all three frameworks. The agent answers questions from travelers using tools (search flights, check hotel availability, get weather).

  1. Output Quality - Is the agent’s answer helpful and accurate? (LLM-as-Judge)
  2. Tool Correctness - Did the agent call the right tools with the right parameters?
  3. Hallucination Detection - Did the agent fabricate information not in the context?
  4. Faithfulness - Is the answer grounded in the retrieved information?

Same test cases. Same judge model (Claude on Amazon Bedrock). Same rubrics where possible.
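To make "same test cases" concrete, here is a minimal sketch of how one shared case could be mapped onto each framework's field names. `SharedCase` and the three `*_kwargs` helpers are hypothetical names introduced only for illustration; the field names (`input`, `inputs`, `actual_output`, `expected_output`) are taken from the `Case` and `LLMTestCase` calls used in the rounds of this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedCase:
    """Framework-neutral test case (hypothetical helper, not part of any library)."""
    name: str
    question: str
    expected: str


def strands_kwargs(case: SharedCase) -> dict:
    # Strands Case(...) takes `input` and `expected_output`.
    return {"input": case.question, "expected_output": case.expected}


def pydantic_kwargs(case: SharedCase) -> dict:
    # pydantic-evals Case(...) takes `name`, `inputs`, and `expected_output`.
    return {"name": case.name, "inputs": case.question, "expected_output": case.expected}


def deepeval_kwargs(case: SharedCase, actual_output: str) -> dict:
    # DeepEval LLMTestCase(...) also needs the agent's answer up front.
    return {
        "input": case.question,
        "actual_output": actual_output,
        "expected_output": case.expected,
    }


flight_case = SharedCase(
    name="flight_search",
    question="Find flights from NYC to London for next Friday",
    expected="Should include airline, price range, and departure times",
)
print(pydantic_kwargs(flight_case))
```

Keeping the cases in one neutral structure means a change to a question or expectation reaches all three frameworks at once, so the framework API stays the only variable.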


Round 1: Output Quality (LLM-as-Judge)

Quick answer: All three frameworks support LLM-as-Judge with custom rubrics, but Strands requires the fewest lines (7), PydanticAI offers the most configuration options (score + assertion modes), and DeepEval supports the widest range of custom criteria via GEval. Strands and PydanticAI support Bedrock natively; DeepEval requires a custom wrapper.

LLM-as-Judge is the most fundamental evaluation technique: use a large language model to score whether the agent’s output meets quality criteria. All three frameworks support this pattern, but the API differs significantly.

Strands Agents (7 lines)

Strands uses OutputEvaluator with a custom rubric, making it the most concise option for basic LLM-as-Judge:

from strands_evals import Experiment, Case
from strands_evals.evaluators import OutputEvaluator

cases = [
    Case(input="Find flights from NYC to London for next Friday",
         expected_output="Should include airline, price range, and departure times"),
]

evaluator = OutputEvaluator(
    rubric="Rate the response on helpfulness (0-1). A helpful response includes "
           "specific flight options with airlines, prices, and times. Penalize "
           "vague or generic responses.",
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
)

experiment = Experiment(cases=cases, evaluators=[evaluator])
# `agent` is the travel assistant under test, created elsewhere
reports = experiment.run_evaluations(lambda case: agent(case.input))
reports[0].display()

PydanticAI (10 lines)

PydanticAI wraps cases in a Dataset and provides separate score and assertion modes, giving you more control over pass/fail criteria:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name="flight_search",
            inputs="Find flights from NYC to London for next Friday",
            expected_output="Should include airline, price range, and departure times",
        ),
    ],
    evaluators=[
        LLMJudge(
            rubric="Rate the response on helpfulness. A helpful response includes "
                   "specific flight options with airlines, prices, and times. "
                   "Penalize vague or generic responses.",
            model="anthropic:claude-sonnet-4-6",
            include_input=True,
            include_expected_output=True,
            score={"include_reason": True},
        ),
    ],
)

report = dataset.evaluate_sync(lambda inputs: agent(inputs))
report.print(include_input=True)

DeepEval (12 lines)

DeepEval uses GEval with explicit evaluation parameters, allowing you to control which fields the judge sees:

from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

test_case = LLMTestCase(
    input="Find flights from NYC to London for next Friday",
    actual_output=agent("Find flights from NYC to London for next Friday"),
    expected_output="Should include airline, price range, and departure times",
)

metric = GEval(
    name="Helpfulness",
    criteria="Rate the response on helpfulness. A helpful response includes "
             "specific flight options with airlines, prices, and times. "
             "Penalize vague or generic responses.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.5,
)

result = evaluate(test_cases=[test_case], metrics=[metric])

Verdict: Which Framework Wins?

| Aspect | Strands | PydanticAI | DeepEval |
| --- | --- | --- | --- |
| Lines of code | 7 | 10 | 12 |
| Bedrock native | Yes | Yes | Custom wrapper needed |
| Score format | 0.0-1.0 | 0.0-1.0 + pass/fail | 0.0-1.0 |
| Reason included | Yes | Yes (configurable) | Yes |
| Batch evaluation | Experiment.run_evaluations() | Dataset.evaluate_sync() | evaluate() |
| Prompting method | Direct rubric → LLM | Direct rubric → LLM | G-Eval (CoT + logprobs) |

Strands is the most concise. PydanticAI offers the most configuration (separate score vs. assertion modes). DeepEval uses GEval, a research-backed technique from the paper “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment”.

⚠️ Why scores may differ: Even with the same model and rubric text, GEval uses a fundamentally different prompting strategy:

  1. Chain-of-thought decomposition - Breaks evaluation into explicit steps
  2. Logprobs weighting - Uses token probabilities to weight scores
  3. Structured template - Optimized prompt format for human alignment

This is by design. GEval optimizes for correlation with human judgments, not identical scoring to direct rubric prompting. Strands and PydanticAI optimize for transparency and customizability.
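The logprobs weighting step is the easiest of the three to see in code. The sketch below is an illustration of the idea from the G-Eval paper (a probability-weighted sum over candidate scores), not DeepEval's actual implementation. The 1-5 scale, the `weighted_score` name, and the example probabilities are all assumptions.

```python
import math


def weighted_score(token_logprobs: dict[str, float],
                   min_score: int = 1, max_score: int = 5) -> float:
    """Expected score under the judge's token distribution, rescaled to 0-1."""
    probs: dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        token = token.strip()
        # Keep only tokens that parse as a valid integer score.
        if token.isdigit() and min_score <= int(token) <= max_score:
            probs[int(token)] = probs.get(int(token), 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no valid score tokens in logprobs")
    # Renormalize over the score tokens, then take the expectation.
    expected = sum(score * p for score, p in probs.items()) / total
    return (expected - min_score) / (max_score - min_score)


# Hypothetical judge distribution: mostly "4", some "5", a little "3".
example = {"4": math.log(0.6), "5": math.log(0.3), "3": math.log(0.1)}
score = weighted_score(example)
print(round(score, 3))
```

A direct rubric prompt would report only the single most likely score (here 4 out of 5), while the weighted version blends in the probability mass on neighbouring scores. That alone is enough to move the final number, even when the judge model and rubric text are identical.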


Round 2: Tool Correctness Evaluation

Quick answer: Strands provides built-in trajectory extraction and deterministic tool checks (zero cost). DeepEval has a dedicated ToolCorrectnessMetric with LLM-based comparison. PydanticAI’s HasMatchingSpan requires OpenTelemetry instrumentation and is not comparable to the other two for simple tool list validation.

Tool correctness measures whether the agent called the right tools with the right parameters. This is critical for agents that interact with APIs and databases, because a wrong tool call can cause real-world side effects.

⚠️ PydanticAI excluded from direct comparison: PydanticAI’s HasMatchingSpan evaluator requires full OpenTelemetry traces from live agent execution. It cannot evaluate pre-computed tool lists like ["search_flights", "check_availability"], making it fundamentally incomparable to Strands’ ToolCalled and DeepEval’s ToolCorrectnessMetric for basic tool validation.

Strands Agents (with trajectory extraction)

Strands automatically extracts tool usage from agent execution traces, making trajectory evaluation seamless:

from strands_evals import Experiment, Case
from strands_evals.evaluators import TrajectoryEvaluator
from strands_evals.extractors import tools_use_extractor

traj_eval = TrajectoryEvaluator(
    rubric="The agent should search for flights first, then check availability. "
           "Calling weather tools is optional but acceptable.",
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
)

cases = [
    Case(
        input="Find flights from NYC to London for next Friday",
        expected_trajectory=["search_flights", "check_availability"],
    ),
]

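def task_with_trajectory(case):
    # Reset conversation state so tool calls from earlier cases don't leak in
    agent.messages = []
    response = agent(case.input)
    # Give the judge a description of the tools available to the agent
    traj_eval.update_trajectory_description(
        tools_use_extractor.extract_tools_description(agent)
    )
    # Pull the ordered list of tool calls out of the conversation history
    trajectory = tools_use_extractor.extract_agent_tools_used_from_messages(
        agent.messages
    )
    return {"output": str(response), "trajectory": trajectory}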

experiment = Experiment(cases=cases, evaluators=[traj_eval])
reports = experiment.run_evaluations(task_with_trajectory)

Bonus: Deterministic tool check (no LLM needed, zero cost)

For simple “was this tool called?” checks, Strands provides instant verification with no API calls:

from strands_evals.evaluators import ToolCalled

# Check if a specific tool was called (instant, no API call)
experiment = Experiment(
    cases=cases,
    evaluators=[ToolCalled(tool_name="search_flights")],
)

PydanticAI (with span-based tool detection)

PydanticAI uses OpenTelemetry spans to detect tool usage, requiring custom evaluator code for trajectory validation:

from dataclasses import dataclass
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, HasMatchingSpan

dataset = Dataset(
    cases=[
        Case(
            name="flight_search",
            inputs="Find flights from NYC to London for next Friday",
            metadata={"expected_tools": ["search_flights", "check_availability"]},
        ),
    ],
    evaluators=[
        HasMatchingSpan(
            query={"name_contains": "search_flights"},
            evaluation_name="called_search_flights",
        ),
    ],
)

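# Custom evaluator for full trajectory check
@dataclass
class ToolSequenceCheck(Evaluator):
    def evaluate(self, ctx: EvaluatorContext) -> dict[str, bool]:
        # Collect every span whose name mentions "tool" from the recorded trace
        tool_spans = ctx.span_tree.find(lambda n: "tool" in n.name.lower())
        # Assumes span names equal the tool names; adjust the matching if your
        # instrumentation names tool spans differently
        tool_names = [s.name for s in tool_spans]
        # Case metadata may be None, so fall back to an empty dict
        expected = (ctx.metadata or {}).get("expected_tools", [])
        return {
            "all_tools_called": all(t in tool_names for t in expected),
            "correct_order": self._check_order(tool_names, expected),
        }

    def _check_order(self, actual, expected):
        # Compare the first occurrence of each expected tool;
        # expected tools that were never called are skipped here
        positions = []
        for tool in expected:
            if tool in actual:
                positions.append(actual.index(tool))
        return positions == sorted(positions)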

DeepEval (with ToolCall objects)

DeepEval uses structured ToolCall objects with explicit parameter validation and ordering checks:

from deepeval import evaluate
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="Find flights from NYC to London for next Friday",
    actual_output="I found 3 flights...",
    tools_called=[
        ToolCall(name="search_flights", input_parameters={"origin": "NYC", "dest": "LHR"}),
        ToolCall(name="check_availability", input_parameters={"flight_id": "BA117"}),
    ],
    expected_tools=[
        ToolCall(name="search_flights", input_parameters={"origin": "NYC", "dest": "LHR"}),
        ToolCall(name="check_availability"),
    ],
)

metric = ToolCorrectnessMetric(
    threshold=0.5,
    should_consider_ordering=True,
    should_exact_match=False,
)

result = evaluate(test_cases=[test_case], metrics=[metric])

Verdict: Which Framework Wins?

| Aspect | Strands | PydanticAI | DeepEval |
| --- | --- | --- | --- |
| Trajectory extraction | Built-in extractor | Via OpenTelemetry spans | Manual ToolCall objects |
| LLM-based trajectory eval | TrajectoryEvaluator | Not comparable (OTEL-only) | ToolCorrectnessMetric |
| Deterministic check | ToolCalled (zero-cost) | HasMatchingSpan (OTEL-only) | N/A |
| Ordering validation | in_order_match_scorer | Custom code | should_consider_ordering |
| Parameter validation | Via rubric | Via span attributes | should_exact_match |
| Works with pre-computed tool lists | Yes | No (requires live traces) | Yes |

Strands wins for simplicity with built-in trajectory extraction from agent messages. DeepEval has the most structured ToolCall API with dedicated LLM-based comparison. PydanticAI is the most flexible via span trees but requires OpenTelemetry instrumentation, making it suitable only for live agent evaluation, not pre-computed analysis.
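To show what a deterministic check on a pre-computed tool list amounts to, here is a framework-free sketch. It is not the implementation behind Strands' `ToolCalled` or `in_order_match_scorer`. The function names `tool_called` and `in_order_match` are hypothetical, and the ordering rule (expected tools must appear as an ordered subsequence, with extra tools allowed in between) is an assumption chosen to match the rubric "weather tools are optional but acceptable".

```python
def tool_called(actual: list[str], tool_name: str) -> bool:
    """Was this tool called at least once?"""
    return tool_name in actual


def in_order_match(actual: list[str], expected: list[str]) -> bool:
    """True if `expected` appears in `actual` as an ordered subsequence."""
    remaining = iter(actual)
    # `tool in remaining` advances the iterator until it finds the tool,
    # so each expected tool must appear after the previous one.
    return all(tool in remaining for tool in expected)


expected = ["search_flights", "check_availability"]
good_run = ["search_flights", "get_weather", "check_availability"]
bad_run = ["check_availability", "search_flights"]

print(tool_called(good_run, "search_flights"))
print(in_order_match(good_run, expected))
print(in_order_match(bad_run, expected))
```

Checks like these need no judge model, so they can run on every commit. The LLM-based trajectory evaluators are then reserved for questions a list comparison cannot answer, such as whether the tool parameters made sense.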


Round 3: Hallucination Detection

Quick answer: DeepEval provides a purpose-built HallucinationMetric that decomposes claims and checks each against context. Strands and PydanticAI use general-purpose LLM-as-judge with custom rubrics, which is flexible but less specialized. DeepEval wins for hallucination detection with its dedicated metric and per-context contradiction counting.

Hallucination detection measures whether the agent fabricates information not present in the source context. This is one of the most critical evaluation dimensions, with recent research (LSC, Jan 2026) showing that zero-shot detection methods can identify fabricated content without any training data.

Strands Agents

Strands uses OutputEvaluator with a hallucination-focused rubric:

from strands_evals import Experiment, Case
from strands_evals.evaluators import OutputEvaluator

cases = [
    Case(
        input="What is the baggage policy for Delta flights to London?",
        expected_output="Based on the context: 2 checked bags, 23kg each, free for international",
    ),
]

hallucination_eval = OutputEvaluator(
    rubric="Score 1.0 if the response ONLY contains information present in the "
           "expected output (ground truth). Score 0.0 if the response includes "
           "any fabricated details such as specific prices, dates, or policies "
           "not mentioned in the ground truth. Partially correct responses "
           "should score between 0.3-0.7.",
    model="us.anthropic.claude-sonnet-4-20250514-v1:0",
)

experiment = Experiment(cases=cases, evaluators=[hallucination_eval])
reports = experiment.run_evaluations(lambda case: agent(case.input))

PydanticAI

PydanticAI uses LLMJudge with separate score and assertion modes for hallucination detection:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name="baggage_policy",
            inputs="What is the baggage policy for Delta flights to London?",
            expected_output="Based on the context: 2 checked bags, 23kg each, free for international",
        ),
    ],
    evaluators=[
        LLMJudge(
            rubric="Does the response ONLY contain information present in the "
                   "expected output? Score 0.0 for fabricated details, 1.0 for "
                   "fully grounded responses.",
            model="anthropic:claude-sonnet-4-6",
            include_expected_output=True,
            score={"include_reason": True, "evaluation_name": "hallucination"},
            assertion={"include_reason": True, "evaluation_name": "grounded"},
        ),
    ],
)

report = dataset.evaluate_sync(lambda inputs: agent(inputs))

DeepEval (dedicated HallucinationMetric)

DeepEval provides a specialized HallucinationMetric that decomposes responses into claims and verifies each against the source context:

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the baggage policy for Delta flights to London?",
    actual_output=agent("What is the baggage policy for Delta flights to London?"),
    context=[
        "Delta international flights include 2 checked bags at 23kg each, free of charge.",
        "Carry-on must fit in overhead bin. One personal item allowed.",
    ],
)

metric = HallucinationMetric(threshold=0.5)
result = evaluate(test_cases=[test_case], metrics=[metric])

Verdict: Which Framework Wins?

| Aspect | Strands | PydanticAI | DeepEval |
| --- | --- | --- | --- |
| Dedicated metric | No (via OutputEvaluator rubric) | No (via LLMJudge rubric) | Yes (HallucinationMetric) |
| Context as input | Via expected_output | Via expected_output | Dedicated context field |
| Scoring method | LLM judge with rubric | LLM judge with rubric | Claim-by-claim verification |
| Granularity | Single score | Score + assertion | Per-context contradiction count |

DeepEval wins here with a purpose-built HallucinationMetric that decomposes claims and checks each against context. Strands and PydanticAI use general-purpose LLM-as-judge with custom rubrics. This approach is flexible but less specialized for hallucination detection.
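The per-context contradiction count can be sketched as simple arithmetic: the share of context entries that the answer contradicts, where lower is better. This is an illustration of the idea, not DeepEval's code. `hallucination_score` and `toy_contradicts` are hypothetical names, the toy rule stands in for the LLM judge that decides each verdict, and treating the threshold as a maximum is an assumption.

```python
from typing import Callable


def hallucination_score(actual_output: str, contexts: list[str],
                        contradicts: Callable[[str, str], bool]) -> float:
    """Fraction of context entries contradicted by the output (0.0 is best)."""
    if not contexts:
        raise ValueError("at least one context entry is required")
    contradicted = sum(1 for ctx in contexts if contradicts(actual_output, ctx))
    return contradicted / len(contexts)


def toy_contradicts(output: str, context: str) -> bool:
    # Toy stand-in for the judge: the context says bags are free,
    # but the output quotes a dollar amount.
    return "free of charge" in context and "$" in output


contexts = [
    "Delta international flights include 2 checked bags at 23kg each, free of charge.",
    "Carry-on must fit in overhead bin. One personal item allowed.",
]
answer = "Delta allows 2 checked bags at 23kg each for $75 per bag."

score = hallucination_score(answer, contexts, toy_contradicts)
passed = score <= 0.5  # assumption: threshold acts as a maximum
print(score, passed)
```

A single rubric score hides which source statement the answer got wrong. Counting verdicts per context entry keeps that information, which is the granularity difference noted in the table above.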


Round 4: Batch Evaluation

Quick answer: PydanticAI has the best reporting with baseline diffing (compare v1 vs v2). DeepEval has the most metrics out-of-the-box (30+). Strands has the cleanest API for mixing LLM and deterministic evaluators in a single experiment.

Real-world evaluation runs multiple metrics on multiple test cases at the same time. This section compares how each framework handles parallel execution, mixed metric types, and reporting.
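As a rough mental model for what a concurrency limit does in any of these frameworks, the sketch below caps the number of in-flight judge calls with an `asyncio.Semaphore`. It is not taken from any of the three libraries. `run_bounded` and `fake_judge` are hypothetical names, and the fake judge only sleeps instead of calling a model.

```python
import asyncio


async def run_bounded(items, worker, max_concurrency: int):
    """Run `worker` over `items`, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_judge(question: str) -> float:
    # Stand-in for a network call to the judge model.
    await asyncio.sleep(0.01)
    return min(len(question) / 40, 1.0)


questions = [
    "Find flights NYC to London",
    "What's the weather in Paris tomorrow?",
    "Book hotel in Tokyo for 3 nights",
]
scores = asyncio.run(run_bounded(questions, fake_judge, max_concurrency=2))
print(scores)
```

The limit matters mostly for rate limits and cost control: with LLM judges, the number of calls is roughly cases times evaluators, so unbounded parallelism can hit provider throttling quickly.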

Strands Agents

Strands combines multiple evaluators in a single Experiment, automatically running all combinations:

from strands_evals import Experiment, Case
from strands_evals.evaluators import (
    OutputEvaluator, TrajectoryEvaluator, ToolCalled,
)

cases = [
    Case(input="Find flights NYC to London",
         expected_output="Flight options with prices",
         expected_trajectory=["search_flights"]),
    Case(input="What's the weather in Paris tomorrow?",
         expected_output="Temperature and conditions",
         expected_trajectory=["get_weather"]),
    Case(input="Book hotel in Tokyo for 3 nights",
         expected_output="Booking confirmation with dates and price",
         expected_trajectory=["search_hotels", "book_hotel"]),
]

experiment = Experiment(
    cases=cases,
    evaluators=[
        OutputEvaluator(rubric="Is the response helpful and specific?"),
        TrajectoryEvaluator(rubric="Did the agent use the right tools?"),
        ToolCalled(tool_name="search_flights"),
    ],
)

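# task_function: callable that takes a Case and returns the agent's output and
# trajectory (for example, task_with_trajectory from Round 2)
reports = experiment.run_evaluations(task_function)
for report in reports:
    report.display()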

PydanticAI

PydanticAI uses Dataset.evaluate_sync() with a max_concurrency parameter for parallel execution:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge, EqualsExpected, HasMatchingSpan

dataset = Dataset(
    cases=[
        Case(name="flights", inputs="Find flights NYC to London",
             expected_output="Flight options with prices"),
        Case(name="weather", inputs="What's the weather in Paris tomorrow?",
             expected_output="Temperature and conditions"),
        Case(name="hotel", inputs="Book hotel in Tokyo for 3 nights",
             expected_output="Booking confirmation with dates and price"),
    ],
    evaluators=[
        LLMJudge(rubric="Is the response helpful and specific?",
                 score={"include_reason": True}),
    ],
)

report = dataset.evaluate_sync(task_function, max_concurrency=3)
report.print(include_input=True, include_averages=True)

DeepEval

DeepEval uses AsyncConfig to control parallel execution and supports the widest range of built-in metrics:

from deepeval import evaluate
from deepeval.metrics import (
    GEval, AnswerRelevancyMetric, HallucinationMetric, ToolCorrectnessMetric,
)
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.evaluate.configs import AsyncConfig

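# build_test_case and questions are defined elsewhere: build_test_case runs the
# agent on a question and wraps the result in an LLMTestCase
test_cases = [build_test_case(q) for q in questions]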

metrics = [
    GEval(name="Helpfulness",
          criteria="Is the response helpful and specific?",
          evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT]),
    AnswerRelevancyMetric(threshold=0.7),
    HallucinationMetric(threshold=0.5),
]

result = evaluate(
    test_cases=test_cases,
    metrics=metrics,
    async_config=AsyncConfig(max_concurrent=5),
)

Verdict: Which Framework Wins?

| Aspect | Strands | PydanticAI | DeepEval |
| --- | --- | --- | --- |
| Parallel execution | run_evaluations_async() | max_concurrency param | AsyncConfig(max_concurrent=N) |
| Mixed metric types | LLM + deterministic | LLM + deterministic + span | LLM only (30+ metrics) |
| Report format | Rich table via .display() | Rich table via .print() | Console + Confident AI dashboard |
| Report diffing | No | Yes (baseline= param) | Via Confident AI |
| Export | JSON file | YAML/JSON file | JSON/CSV + cloud |

PydanticAI has the best reporting with baseline diffing (compare v1 vs v2). DeepEval has the most metrics out-of-the-box. Strands has the cleanest API for mixing LLM and deterministic evaluators.
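If your framework of choice lacks baseline diffing, the core of it is small enough to sketch by hand. The example below compares per-case scores from two runs and labels each case. It is a generic illustration, not PydanticAI's report diffing. `diff_reports`, the 0.05 tolerance, and the v1/v2 scores are all made up for the example.

```python
def diff_reports(baseline: dict[str, float], current: dict[str, float],
                 tolerance: float = 0.05) -> list[tuple]:
    """Label each case as improved, regressed, unchanged, added, or removed."""
    rows = []
    for name in sorted(set(baseline) | set(current)):
        old, new = baseline.get(name), current.get(name)
        if old is None:
            status = "added"
        elif new is None:
            status = "removed"
        elif new - old > tolerance:
            status = "improved"
        elif old - new > tolerance:
            status = "regressed"
        else:
            # Small moves are treated as judge noise, not real changes.
            status = "unchanged"
        rows.append((name, old, new, status))
    return rows


# Hypothetical scores from two versions of the same agent.
v1 = {"flights": 0.80, "weather": 0.90, "hotel": 0.60}
v2 = {"flights": 0.82, "weather": 0.70, "hotel": 0.75}

for name, old, new, status in diff_reports(v1, v2):
    print(f"{name:8} {old} -> {new}  {status}")
```

The tolerance is the important design choice: LLM judges are not perfectly repeatable, so a diff that flags every 0.01 change will drown real regressions in noise.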


What is the complete feature comparison?

This table summarizes every evaluation capability across all three frameworks. Use it as a reference when choosing a framework for your specific evaluation needs.

| Feature | Strands + evals | PydanticAI + evals | DeepEval |
| --- | --- | --- | --- |
| LLM-as-Judge | OutputEvaluator | LLMJudge | GEval |
| Trajectory evaluation | TrajectoryEvaluator + extractors | SpanTree + custom | ToolCorrectnessMetric |
| Hallucination detection | Via rubric | Via rubric | HallucinationMetric |
| Faithfulness | FaithfulnessEvaluator (trace) | Via rubric | FaithfulnessMetric |
| Deterministic checks | Equals, Contains, ToolCalled | Equals, Contains, IsInstance | N/A |
| Multi-agent evaluation | InteractionsEvaluator | Custom evaluator | N/A |
| Multi-turn simulation | ActorSimulator | N/A | ConversationalTestCase |
| Test case generation | ExperimentGenerator | N/A | deepeval generate |
| Bedrock native | Yes | Yes | Custom wrapper |
| OpenTelemetry | Built-in | Via Logfire | N/A |
| Dataset serialization | JSON | YAML/JSON | JSON/CSV |
| Report comparison | No | Baseline diffing | Confident AI |
| pytest integration | Via Experiment | dataset.evaluate_sync() | assert_test() / deepeval test |
| Total built-in metrics | 12 evaluators | 6 evaluators + custom | 30+ metrics |

Try it yourself

A companion Jupyter notebook in the GitHub repository runs every comparison from this post with live, executable code, side by side across all three frameworks on the same evaluation tasks, so you can reproduce the results yourself.

Setup

cd blog-framework-comparison
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Frequently asked questions

Which AI agent evaluation framework is easiest to learn? Strands Agents requires the fewest lines of code (7 lines for LLM-as-Judge). PydanticAI is close at 10 lines. DeepEval requires the most setup, especially for non-OpenAI models where you need a custom wrapper class.

Do Strands, PydanticAI, and DeepEval support Amazon Bedrock? Strands and PydanticAI support Bedrock natively (one-line configuration). DeepEval requires a custom DeepEvalBaseLLM wrapper that maps Bedrock’s API to DeepEval’s interface. The wrapper adds approximately 25 lines of code.

Do I need OpenTelemetry to evaluate AI agents? Only for trace-based evaluation. In Strands, that means evaluators such as FaithfulnessEvaluator and ToolSelectionAccuracyEvaluator. In PydanticAI, it means span-based evaluators such as HasMatchingSpan, which read OpenTelemetry spans (via Logfire). Output-based evaluators in all three frameworks work without OpenTelemetry.

What is the cost of running AI agent evaluations? Every LLM-based evaluator makes API calls to the judge model, which incurs token costs. Deterministic evaluators run locally with no judge calls: Strands provides ToolCalled, Equals, and Contains, and PydanticAI provides Equals, Contains, and IsInstance. The DeepEval metrics in this comparison are LLM-based, so each one adds judge-model cost.
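One way to keep judge costs down, sketched here under stated assumptions, is to run the free deterministic checks first and call the judge only for cases that pass them. `evaluate_with_gate` and `stub_judge` are hypothetical names, and the stub returns a fixed score instead of calling a model.

```python
from typing import Callable


def evaluate_with_gate(output: str, tools_used: list[str], required_tool: str,
                       judge: Callable[[str], float]) -> dict:
    """Run free deterministic checks first; pay for the judge only if they pass."""
    if required_tool not in tools_used:
        return {"score": 0.0, "judge_called": False,
                "reason": f"{required_tool} was not called"}
    if not output.strip():
        return {"score": 0.0, "judge_called": False, "reason": "empty output"}
    return {"score": judge(output), "judge_called": True, "reason": "scored by judge"}


judge_calls = []


def stub_judge(output: str) -> float:
    # Stand-in for a real LLM judge call; records each invocation.
    judge_calls.append(output)
    return 0.9


skipped = evaluate_with_gate("I found 3 flights...", ["get_weather"],
                             "search_flights", stub_judge)
scored = evaluate_with_gate("I found 3 flights...", ["search_flights"],
                            "search_flights", stub_judge)
print(skipped["reason"], "|", scored["score"], "| judge calls:", len(judge_calls))
```

The trade-off is that a gated case gets a hard 0.0 with no judge reasoning attached, so this pattern fits failures that are unambiguous, such as a required tool never being called.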

Can I use multiple evaluation frameworks together? Yes. You can use DeepEval’s specialized metrics (such as HallucinationMetric) alongside Strands Agents for the agent runtime and trajectory capture. The frameworks evaluate outputs, not agents directly, so the agent framework and evaluation framework are independent choices.


Conclusion

There is no single “best” evaluation framework. The right choice depends on your stack, priorities, and what you’re comparing.

Key takeaway: Different methodologies produce different scores by design.

  • Strands and PydanticAI send rubrics directly to the LLM (transparent, customizable)
  • DeepEval uses research-backed techniques like G-Eval (optimized for human alignment)
  • PydanticAI requires OpenTelemetry for tool evaluation (live traces only)
  • Strands and DeepEval work with pre-computed data (simpler testing)

When to use each:

Strands Agents is the most cohesive option if you build on AWS. Agent creation, tool calling, trajectory capture, and evaluation live in the same ecosystem. The hooks system and built-in metrics mean evaluation is instrumented into the agent runtime, not bolted on after the fact. Best for AWS-first teams who want tight integration.

PydanticAI is the most elegant option if you value type safety and structured evaluation pipelines. YAML datasets, report diffing, and the Evaluator protocol make it ideal for teams that want evaluation-as-code with strong guarantees. Best for teams prioritizing type safety and reproducible pipelines.

DeepEval is the most comprehensive option if you want specialized metrics without building them yourself. Over 30 metrics, including purpose-built hallucination detection and faithfulness checking, let you evaluate immediately without writing custom rubrics. Best for framework-agnostic evaluation with research-validated techniques.

The evaluation concepts (LLM-as-judge, trajectory scoring, hallucination detection) are framework-independent. The research papers and techniques behind them work regardless of which framework you choose. For the full list of 45+ papers that informed this comparison, see the RESEARCH.md file.


Amazon Bedrock AgentCore: A Fourth Option

Amazon Bedrock AgentCore provides built-in evaluators and managed deployment for agents. If you’re committed to AWS and want a fully managed solution, AgentCore is worth considering alongside the open-source frameworks.

Built-In Evaluators

AgentCore includes 13 pre-built evaluators accessible via the AgentCore CLI and AWS SDK. These evaluators cover common evaluation dimensions without requiring custom code:

| Evaluator | What It Measures | When to Use |
| --- | --- | --- |
| Builtin.Helpfulness | Output quality and relevance | Same use case as Strands OutputEvaluator |
| Builtin.GoalSuccessRate | Task completion accuracy | Binary success metric (compare to trajectory scoring) |
| Builtin.ToolSelection | Tool choice correctness | Same as Strands ToolCalled or DeepEval ToolCorrectnessMetric |
| Builtin.Faithfulness | Grounding in retrieved context | Same as DeepEval FaithfulnessMetric |
| Builtin.Harmfulness | Safety and policy compliance | Detects unsafe outputs |

How evaluations work: You invoke the agentcore run eval CLI command with your agent ID, the desired evaluator name (such as Builtin.Helpfulness), and a test cases file. AgentCore runs the agent on each test case and returns a JSON report with scores and reasoning for each query. See the AgentCore Evaluation Guide for examples.

Trace Capture for Observability

AgentCore captures full execution traces when you enable the enableTrace parameter in the invoke_agent API call. Traces include:

  • Rationale: The agent’s reasoning before each tool call
  • Tool invocations: Which tools were called with what parameters
  • Observations: Results returned from each tool
  • Orchestration steps: The full decision-making sequence

All traces are automatically logged to Amazon CloudWatch for analysis and monitoring. You can query traces using CloudWatch Logs Insights or export them to S3 for batch analysis. See the Bedrock Agent Tracing Documentation for trace schema details.

When to use AgentCore:

  • You’re already on AWS and want a managed service
  • You need CloudWatch-native observability and compliance logging
  • Your team prefers infrastructure-as-code (CDK/CloudFormation) over custom evaluation scripts
  • You don’t need to evaluate agents on other cloud providers

When to use open-source frameworks:

  • You deploy across multiple clouds or model providers (Strands works with Bedrock, OpenAI, Anthropic, Ollama)
  • You need fine-grained control over evaluation logic
  • You want to iterate quickly on custom metrics without deploying Lambda functions
  • You are doing research or prototyping, where flexibility matters more than managed infrastructure

Thank you!
