AIToday

AWS releases Agent-EvalKit, an open-source toolkit that helps development teams systematically test AI agents by tracing their full execution—tool calls, data returned, and reasoning steps—rather than just checking final outputs.

Amazon AI Blog · 2h ago · 3 min read

3 Key Points

  1. What happened: AWS published Agent-EvalKit, an open-source toolkit under the Apache 2.0 license that integrates with AI coding assistants (Claude Code, Kiro CLI, Kilo Code) to evaluate AI agents in six phases: planning evaluation metrics, generating test cases, adding execution tracing, running agents against the tests, computing evaluation scores, and producing code-level improvement recommendations.

  2. Why it matters: Most teams evaluate AI agents by checking whether outputs look correct, but an agent can produce a plausible answer while hallucinating facts or skipping verification steps. The toolkit addresses this gap by making it practical to inspect the agent's full decision path (which tools it called, what data they returned, and whether the response matches that data) without teams having to build evaluation infrastructure from scratch.

  3. What to watch: The toolkit runs inside an existing development environment (your AI coding assistant) rather than as a separate platform, and it supports multiple agent frameworks, including Strands, LangGraph, and CrewAI. You steer it with natural-language guidance, so the evaluation can focus on your agent's specific failure modes, such as hallucinations triggered by empty tool results.
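To make the idea of trace-based evaluation concrete, here is a minimal Python sketch of the kind of check described above: flagging a run in which every tool call came back empty but the agent still answered confidently. This is not Agent-EvalKit's API or trace schema. The names `ToolCall`, `AgentTrace`, `evaluate_trace`, and `GAP_MARKERS` are hypothetical and invented for illustration, and the keyword match is a deliberately crude stand-in for a real grounding check.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical record, not Agent-EvalKit's schema)."""
    name: str
    arguments: dict[str, Any]
    result: Any


@dataclass
class AgentTrace:
    """Hypothetical trace of a single agent run: tool calls plus the final answer."""
    question: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_answer: str = ""


# Assumed phrases by which an agent admits it found nothing (illustrative only).
GAP_MARKERS = ("no results", "could not find", "not found", "no data")


def is_empty(result: Any) -> bool:
    """Treat None and zero-length strings or containers as an empty tool result."""
    if result is None:
        return True
    return isinstance(result, (str, list, tuple, dict)) and len(result) == 0


def evaluate_trace(trace: AgentTrace) -> dict[str, bool]:
    """Inspect the decision path instead of only the final output."""
    called_tools = len(trace.tool_calls) > 0
    all_empty = called_tools and all(is_empty(c.result) for c in trace.tool_calls)
    answer = trace.final_answer.lower()
    acknowledges_gap = any(marker in answer for marker in GAP_MARKERS)
    return {
        "called_tools": called_tools,
        "all_results_empty": all_empty,
        # Every tool returned nothing, yet the answer does not say so.
        "possible_hallucination": all_empty and not acknowledges_gap,
    }


# Example: the lookup returned nothing, but the agent states a fact anyway.
suspicious = AgentTrace(
    question="What is the status of order 123?",
    tool_calls=[ToolCall("lookup_order", {"order_id": "123"}, [])],
    final_answer="Your order shipped yesterday.",
)
honest = AgentTrace(
    question="What is the status of order 123?",
    tool_calls=[ToolCall("lookup_order", {"order_id": "123"}, [])],
    final_answer="I could not find an order with that ID.",
)
```

An output-only check would accept the first answer because it looks plausible. The trace-level check flags it because the only tool call returned an empty list.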


