
LangSmith on AWS provides an evaluation framework to catch agent behavior issues early, track them in production, and continuously improve agent reliability. The post combines learnings from LangChain's work on evaluating deep agents and Anthropic's guide to demystifying evals.
Agent evaluation uses three types of graders: code-based graders (deterministic logic like string matching and tool call verification), model-based graders (LLM-as-judge using rubric-based scoring), and human graders (subject matter expert review for calibration). The practical recommendation is to use deterministic graders where possible, LLM graders for nuance, and human graders for calibration.
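As a rough illustration of the code-based grader category, here is a minimal Python sketch. It is not taken from the post or from the LangSmith SDK: `AgentRun`, `grade_contains`, `grade_tool_calls`, and `evaluate` are hypothetical names, and the optional `judge` callable merely stands in for whatever model-based grader a team might plug in.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentRun:
    """Hypothetical record of one agent run: final output plus tool names called."""
    output: str
    tool_calls: list[str] = field(default_factory=list)


def grade_contains(run: AgentRun, expected: str) -> bool:
    """Code-based grader: case-insensitive substring match on the final output."""
    return expected.lower() in run.output.lower()


def grade_tool_calls(run: AgentRun, required: list[str]) -> bool:
    """Code-based grader: required tools were all called, in the given order.

    Other calls may be interleaved; `tool in calls` advances the iterator,
    so this is an ordered-subsequence check.
    """
    calls = iter(run.tool_calls)
    return all(tool in calls for tool in required)


def evaluate(
    run: AgentRun,
    expected: str,
    required_tools: list[str],
    judge: Optional[Callable[[AgentRun], float]] = None,  # assumed LLM-as-judge hook
) -> dict:
    """Run the deterministic checks first; call the judge only if they pass."""
    result: dict = {
        "contains_expected": grade_contains(run, expected),
        "tools_ok": grade_tool_calls(run, required_tools),
    }
    deterministic_pass = all(result.values())
    if judge is not None and deterministic_pass:
        result["judge_score"] = judge(run)
    return result
```

Running the deterministic checks first keeps the more expensive model-based grader for runs that already pass the cheap tests, which is one way to read the recommendation above. Human review would sit outside this function, for example by periodically comparing `judge_score` values against expert labels.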
Amazon Nova 2 Lite, available in Amazon Bedrock, supports extended thinking with configurable budget levels (low, medium, high) and accepts text, image, video, and document inputs with a 1-million-token context window. The walkthrough uses a text-to-SQL agent built on Nova 2 Lite to cover the full development-to-production lifecycle.