
Researchers call the software layer that enables agent behavior a 'harness,' which includes tools, interfaces, sandboxed execution environments, memory, testing, permission boundaries, execution loops, and feedback channels. Without it, a language model is stateless; with it, the model becomes a working agent.
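To make the harness idea concrete, here is a minimal Python sketch of an execution loop with a tool registry, a permission boundary, and memory. It is an illustration under stated assumptions, not the researchers' design: `Harness`, `stub_model`, and `add_tool` are hypothetical names, and the "model" is a hard-coded stand-in for a real language model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch: none of these names come from the article.

@dataclass
class Harness:
    tools: dict[str, Callable[[str], str]]           # tool registry
    allowed: set[str]                                # permission boundary
    memory: list[str] = field(default_factory=list)  # state the model itself lacks
    max_steps: int = 5                               # cap on the execution loop

    def run(self, model: Callable[[list[str]], tuple[str, str]], task: str) -> str:
        self.memory.append(f"task: {task}")
        for _ in range(self.max_steps):
            # The model sees the full memory and picks an action.
            action, arg = model(self.memory)
            if action == "finish":
                return arg
            if action not in self.allowed or action not in self.tools:
                observation = f"denied: {action}"
            else:
                observation = self.tools[action](arg)
            # Feedback channel: the observation goes back into memory.
            self.memory.append(f"{action}({arg}) -> {observation}")
        return "step limit reached"


def stub_model(memory: list[str]) -> tuple[str, str]:
    """Stand-in for a language model: call 'add' once, then report the result."""
    last = memory[-1]
    if last.startswith("task:"):
        return ("add", "2,3")
    return ("finish", last.split("-> ")[1])


def add_tool(arg: str) -> str:
    a, b = arg.split(",")
    return str(int(a) + int(b))


harness = Harness(tools={"add": add_tool}, allowed={"add"})
result = harness.run(stub_model, "add 2 and 3")
print(result)
```

The stub model keeps no state between calls. Everything it knows about earlier steps comes from the `memory` list the harness passes in, which is the sense in which the harness turns a stateless model into an agent.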
The framework splits agent systems into three levels: code that bridges the model and environment (like Program-of-Thoughts or Chain of Code), infrastructure that keeps agents reliable across many steps (planning, memory, tool use, and plan-execute-verify cycles), and multiple agents coordinating through shared code workspaces and execution logs (as in systems like ChatDev and MetaGPT).
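The plan-execute-verify cycle named at the second level can be sketched as below. This is a hedged illustration only: the function `plan_execute_verify`, the retry policy, and the `flaky_execute` example are assumptions made for the sketch, and the article does not describe how any particular system implements the cycle.

```python
from typing import Callable

# Hypothetical sketch of a plan-execute-verify loop; names are illustrative.

def plan_execute_verify(
    plan: list[str],
    execute: Callable[[str, int], str],
    verify: Callable[[str, str], bool],
    max_retries: int = 2,
) -> tuple[bool, list[tuple[str, int, bool]]]:
    """Run each planned step, verify its result, and retry failed steps."""
    log: list[tuple[str, int, bool]] = []  # execution log: (step, attempt, passed)
    for step in plan:
        for attempt in range(1, max_retries + 2):
            result = execute(step, attempt)
            ok = verify(step, result)
            log.append((step, attempt, ok))
            if ok:
                break
        else:
            # Every attempt failed verification: stop instead of building on a bad step.
            return False, log
    return True, log


def flaky_execute(step: str, attempt: int) -> str:
    """Toy executor: the 'test' step fails on its first attempt only."""
    if step == "test" and attempt == 1:
        return "test-error"
    return f"{step}-done"


ok, log = plan_execute_verify(
    ["write", "test"],
    flaky_execute,
    lambda step, result: result.endswith("done"),
)
print(ok, log)
```

Keeping the log as plain data is a deliberate choice in this sketch: the same record could serve as the shared execution log that multiple agents read at the third level.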
Production systems already follow this pattern: Anthropic's Claude Code ties together the terminal, development environment, and browser under permission rules; OpenAI's Codex and GitHub Copilot move similar workflows to managed cloud environments; and DeepSeek is building a dedicated 'Harness' team in Beijing to handle tool use, planning, and storage.
The authors flag open problems: evaluations that measure more than raw success rates, checking the substance of results when tests alone don't suffice, letting a harness improve itself without regressions, sharing state across multiple agents, human oversight, and extending harnesses to environments with image or sensor data, such as those of GUI agents and robots.