AIToday

Amazon SageMaker AI launches multi-turn reinforcement learning for agent training

Amazon AI Blog · 2d ago · 4 min read

Key takeaway

Amazon SageMaker AI multi-turn reinforcement learning is a new service that simplifies training multi-step AI agents by handling infrastructure and orchestration while giving teams full control over environment design, reward functions, and evaluation. The service is serverless, priced per token, and covers the algorithmic choices most relevant to agentic tasks such as customer support and content moderation.

3 Key Points

  • What happened

    Amazon announced SageMaker AI multi-turn reinforcement learning (MTRL), a service that provides the training loop, hardware, and orchestration for agents that perform multi-step tasks. The service supports several training objectives, including Proximal Policy Optimization (PPO), Clipped Importance Sampling Policy Optimization (CISPO), and importance-sampling losses, paired with advantage estimators such as Group Relative Policy Optimization (GRPO) and REINFORCE Leave-One-Out (RLOO).

  • Why it matters

    Building reliable multi-turn agents requires handling sequences of dependent steps: reading instructions, making tool calls, reading results, and deciding on the next action. The service abstracts away infrastructure complexity through serverless execution at per-token pricing, so teams can focus on the choices that decide reliability: building trustworthy training environments, designing aligned rewards, and setting up external evaluation that is independent of the reward signal.

  • What to watch

    The service integrates with Amazon Bedrock AgentCore, Amazon EKS, Amazon EC2, AWS Fargate, or infrastructure of your choice through a small adapter. Evaluation jobs report reward, pass@k, and trajectory metrics, and provide trajectory observability in MLflow, before deployment to a SageMaker AI endpoint or Amazon Bedrock.
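
To make the advantage estimators named above more concrete, here is a minimal Python sketch of how GRPO-style and RLOO-style advantages are commonly computed from the rewards of several rollouts of the same prompt. It is not the service's implementation: the function names, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: centre each reward on the group mean
    and scale by the group standard deviation (population std assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def rloo_advantages(rewards):
    """Leave-one-out advantages: each reward minus the mean reward of
    the other rollouts in the same group."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two rollouts per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical episode rewards for four rollouts of one task.
rewards = [1.0, 0.0, 0.0, 1.0]
grpo = grpo_advantages(rewards)
rloo = rloo_advantages(rewards)
print(grpo)
print(rloo)
```

Both estimators use the other rollouts in the group as the baseline, so neither needs a separately trained value model. Successful rollouts get positive advantages and failed ones negative.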

FAQ

What infrastructure does SageMaker AI MTRL require?
The agent can run on Amazon Bedrock AgentCore, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Compute Cloud (Amazon EC2), AWS Fargate, or infrastructure of your choice, connected through a small adapter that exposes your tool surface to the rollout server.
What are the three patterns for building a simulated training environment?
Read-only tools replay recorded responses keyed by inputs; stateful tools use seeded sandboxes that hold state for the length of an episode; verifiable outcomes run genuine execution in isolated simulation environments like Docker, SQLite, or managed code interpreters.
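
The first two patterns, plus a verifiable outcome check, can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the class names `ReplayTool` and `TicketSandbox`, the ticket schema, and the reward rule are assumptions, not part of the service or its adapter interface.

```python
import json
import sqlite3


class ReplayTool:
    """Read-only tool: replays recorded responses keyed by the call inputs."""

    def __init__(self, recordings):
        # recordings is a list of (arguments dict, recorded response) pairs.
        self._recordings = {self._key(args): resp for args, resp in recordings}

    @staticmethod
    def _key(args):
        # Sort keys so argument order does not change the lookup key.
        return json.dumps(args, sort_keys=True)

    def __call__(self, **args):
        return self._recordings.get(
            self._key(args), {"error": "no recording for these inputs"}
        )


class TicketSandbox:
    """Stateful tool: an in-memory SQLite database seeded once per episode."""

    def __init__(self, seed_rows):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT)")
        self.db.executemany("INSERT INTO tickets VALUES (?, ?)", seed_rows)

    def close_ticket(self, ticket_id):
        cur = self.db.execute(
            "UPDATE tickets SET status = 'closed' WHERE id = ?", (ticket_id,)
        )
        return {"updated": cur.rowcount}

    def verify_closed(self, ticket_id):
        # Verifiable outcome: inspect the real end state, not the agent's claim.
        row = self.db.execute(
            "SELECT status FROM tickets WHERE id = ?", (ticket_id,)
        ).fetchone()
        return row is not None and row[0] == "closed"


lookup = ReplayTool([({"order_id": "A1"}, {"status": "shipped"})])
sandbox = TicketSandbox([(1, "open"), (2, "open")])
print(lookup(order_id="A1"))
sandbox.close_ticket(1)
reward = 1.0 if sandbox.verify_closed(1) else 0.0
print(reward)
```

Because the sandbox is rebuilt from the same seed rows for every episode, each rollout starts from an identical state, and the reward is read from the database instead of from the agent's own report.
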
Why is external evaluation separate from the reward function?
Reinforcement learning optimizes the reward signal literally, so if the reward is the only metric you watch, you cannot separate progress on the actual task from progress on satisfying the reward criteria. External evaluation measures success independently to guide iteration on rewards, environment seeding, and hyperparameters.
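
One metric the evaluation jobs are said to report is pass@k. The summary does not say how the service computes it, so the Python sketch below assumes the widely used unbiased estimator: given n sampled attempts of which c succeeded, it estimates the probability that at least one of k attempts succeeds. The function name and argument names are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled attempts with c successes.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k attempts contains no successful attempt.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb returns 0 when k > n - c, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

Success here should be judged by a check that is independent of the training reward, so that a rising pass@k reflects progress on the task and not on the reward criteria.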
