
GitHub released benchmark data showing its AI agent harness, which powers GitHub Copilot tools, matches the performance of competing model-vendor harnesses on software engineering tasks while using fewer tokens. The harness supports multiple AI models from different providers—GPT, Claude, Gemini, and others—giving developers flexibility to choose the best model for each job without sacrificing quality or efficiency.
What happened
GitHub published benchmark results showing its Copilot agentic harness—the shared engine powering Copilot CLI, the Copilot app, and code review—achieves task-completion rates on par with Claude Code and Codex CLI across SWE-bench Verified, SWE-bench Pro, SkillsBench, and TerminalBench, while consuming fewer tokens in most configurations.
Why it matters
The harness supports 20+ frontier models across GPT, Claude, Gemini, and MAI families, plus open-source and local models, letting developers pick the right model for each task's cost and capability needs without being locked into a single vendor's tool. Lower token use translates to reduced API costs for equivalent work.
What to watch
GitHub's multi-model architecture enables cross-model critique (e.g., Rubber Duck, where one model reviews another's output), a capability single-model vendor harnesses cannot offer. The benchmarks show GPT models deliver the best value, with strong resolution rates at the lowest cost, while Claude Opus reaches the highest resolution rate at a higher cost.