OpenAI's GPT-5.5 shows systematic bias in evaluations: it ranks its own proposals last and follows presentation order, raising questions about using AI to judge AI work

Hacker News · April 25, 2026

AI Summary

  • A researcher tested GPT-5.5's ability to rank alternative technical plans for a software problem. Across multiple runs, GPT-5.5 ranked its own plan last in 5 out of 6 cases, and matched the order plans were presented to it in about 56% of rankings—both irrelevant factors that should not affect quality judgment.
  • Increasing the model's reasoning effort (from medium to high to extra-high thinking) did not eliminate these biases: even the highest reasoning mode only sometimes ranked its own plan second instead of last, suggesting that reasoning depth alone cannot correct authorship and order bias.
  • For anyone using AI to evaluate multiple proposals, code reviews, or competing solutions at your company, this means AI ranking is unreliable without human oversight. You cannot trust GPT-5.5 (or likely similar models) to fairly compare plans: it will systematically favor whichever option was presented last, or whichever it did not generate itself, potentially steering teams toward worse decisions.
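One common mitigation for the order bias described above is to re-run the judge over several randomly shuffled presentation orders and average each candidate's rank across trials, so that position effects cancel out. The sketch below is a minimal illustration of that idea, not the researcher's method; `judge` is a hypothetical stand-in for an LLM call, hard-coded here to exhibit pure last-position preference so the averaging effect is visible:

```python
import random

def judge(plans):
    """Hypothetical stand-in for an LLM judge. Returns a ranking as a list of
    indices into `plans`, best first. This stub simulates pure order bias:
    it always prefers whichever plan was presented last."""
    return list(range(len(plans)))[::-1]

def average_ranks(plans, trials=30, seed=0):
    """Shuffle the presentation order each trial, collect the judge's ranking,
    and average each plan's rank (0 = best) across trials. A judge whose
    preferences track only presentation order averages out to a near-tie."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in plans}
    for _ in range(trials):
        order = plans[:]
        rng.shuffle(order)                  # randomize presentation order
        ranking = judge(order)              # indices into the shuffled order
        for rank, idx in enumerate(ranking):
            totals[order[idx]] += rank
    return {p: totals[p] / trials for p in plans}

ranks = average_ranks(["plan A", "plan B", "plan C"])
print(ranks)  # mean ranks cluster near 1.0: no plan genuinely dominates
```

The same trick extends to the authorship bias: strip any identifying framing (e.g. "here is my plan") before judging, so the model cannot tell which candidate it wrote. Note that averaging only neutralizes position effects; it cannot recover a real quality signal from a judge that has none.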
