Researchers evaluated LLM agents on 80 greenfield generation tasks and 20 feature-implementation tasks across eight web frameworks, using end-to-end behavioral tests and static verifiers to measure performance under structural constraints.
Agent performance exhibits "constraint decay": as structural requirements (such as architectural patterns, databases, and object-relational mappings) accumulate, capable configurations lose 30 points on average in assertion pass rate between baseline and fully specified tasks, while some weaker configurations approach zero.
Framework sensitivity analysis found significant performance disparities: agents succeed in minimal, explicit frameworks (e.g., Flask) but perform substantially worse on average in convention-heavy environments (e.g., FastAPI, Django). Data-layer defects (incorrect query composition and ORM runtime violations) emerge as the leading root causes of failures.


