AIToday

Fine-tuned open model beats GPT, Claude on finance tasks at 1/14th cost

THE DECODER · 7h ago · 5 min read

Key takeaway

Hedge fund Bridgewater and AI startup Thinking Machines Lab say a fine-tuned open-weight model outperformed GPT and Claude on financial document evaluation tasks while costing nearly 14 times less to operate. The work highlights that proprietary corporate data and expert human judgment—which large AI labs cannot easily access—remain a significant source of improvement for specialized tasks, making open-model fine-tuning an attractive alternative for businesses that want to keep sensitive data private.


3 Key Points

  • What happened

    Bridgewater and Thinking Machines Lab tested AI models on six finance-focused tasks—like flagging relevant articles and spotting central bank signals. Frontier models (Gemini, Claude, GPT variants) hit only about 50% accuracy with basic prompts; expert instructions lifted them to the mid-70s, still below an 80% deployment threshold. A fine-tuned open model reached 84.7% accuracy versus 78.2% for the best frontier model tested, while costing nearly 14 times less to run.

  • Why it matters

    The result suggests large AI labs have not absorbed all valuable training data. Proprietary corporate data and human expertise locked inside companies remain untapped as training material, a gap that open-model fine-tuning can close. For businesses worried about sending sensitive data to OpenAI or Anthropic, fine-tuning with tools like Thinking Machines' Tinker platform offers a way to keep weights, data, and compute infrastructure in-house while matching or exceeding frontier-model performance.

  • What to watch

    The evaluation comes from the two companies involved, so it is not independent; both have a commercial interest in the result. The finding is part of a broader pattern: newer frontier models show only marginal accuracy gains per dollar (GPT 5.4 costs 43% more than 5.2 but is only marginally more accurate), suggesting diminishing returns in throwing more compute at public benchmarks.
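
To put the accuracy figures under "What happened" in perspective, the short sketch below checks each score against the 80% deployment threshold and computes how much of the best frontier model's error the fine-tuned model removes. The numbers are the ones quoted above; the function and variable names are illustrative assumptions, not anything from the authors' evaluation.

```python
# Figures quoted in the summary above (percent accuracy).
DEPLOY_THRESHOLD = 80.0
FINE_TUNED_ACC = 84.7
BEST_FRONTIER_ACC = 78.2


def clears_threshold(acc: float, threshold: float = DEPLOY_THRESHOLD) -> bool:
    """Return True if an accuracy score meets the deployment threshold."""
    return acc >= threshold


def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Fraction of the baseline's errors that the new model eliminates."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Gap in percentage points between the two models.
point_gap = FINE_TUNED_ACC - BEST_FRONTIER_ACC
# Share of the frontier model's errors removed by the fine-tuned model.
err_cut = relative_error_reduction(BEST_FRONTIER_ACC, FINE_TUNED_ACC)

print(f"fine-tuned clears threshold: {clears_threshold(FINE_TUNED_ACC)}")
print(f"best frontier clears threshold: {clears_threshold(BEST_FRONTIER_ACC)}")
print(f"gap: {point_gap:.1f} points, error reduction: {err_cut:.1%}")
```

Read this way, the 6.5-point gap corresponds to roughly 30% fewer errors than the best frontier model, and only the fine-tuned model lands above the threshold.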

FAQ

Which frontier models were tested?
Variants of Gemini, Claude, and GPT were tested. The best frontier model achieved 78.2% accuracy in the authors' evaluation.
How did Bridgewater improve labeling accuracy without expensive expert review?
A first model trained on imperfect labels from cheap contractors re-evaluated the same documents. Wherever the model and the original label disagreed, an error was likely present; only those disputed cases were sent to investors for correction.
What open model was fine-tuned?
Training ran on Qwen3-235B through Thinking Machines Lab's Tinker platform.
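
The label-cleaning step described in the FAQ (send only the cases where a first model disagrees with the contractor label to expert review) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Bridgewater's actual pipeline: the `LabeledDoc` type, the `find_disputed` helper, the label names, and the keyword-based `toy_predict` stand-in for the first model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LabeledDoc:
    doc_id: str
    text: str
    contractor_label: str  # cheap, possibly noisy label


def find_disputed(
    docs: List[LabeledDoc], predict: Callable[[str], str]
) -> List[LabeledDoc]:
    """Return documents where the model's prediction differs from the
    contractor label; these are the likeliest labeling errors."""
    return [doc for doc in docs if predict(doc.text) != doc.contractor_label]


def toy_predict(text: str) -> str:
    # Hypothetical stand-in for the first model trained on noisy labels.
    return "relevant" if "rate" in text.lower() else "irrelevant"


docs = [
    LabeledDoc("a", "Central bank hints at rate cut", "relevant"),
    LabeledDoc("b", "Local bakery wins award", "relevant"),
    LabeledDoc("c", "Rate decision postponed", "irrelevant"),
    LabeledDoc("d", "Football results", "irrelevant"),
]

# Only the disputed cases would go to expert reviewers for correction.
review_queue = find_disputed(docs, toy_predict)
print([doc.doc_id for doc in review_queue])
```

The point of the design is that expert time is spent only on the disputed subset (here two of four documents) rather than on every label.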
