Sail Research closes Series A funding to offer batch inference for AI agents at 6× lower cost than real-time alternatives.

Tomasz Tunguz (Theory Ventures) · 8h ago · 3 min read

Key takeaway

Sail Research, a startup offering cheaper batch inference for AI agents, has closed Series A funding. The company routes requests to open-source models at far lower cost than real-time services (GLM-5.1 on Sail costs 6× less per token than Anthropic's Haiku) by using spare server capacity and queuing requests instead of reserving capacity per user. As AI agents move from chat assistants to background workers processing data overnight, this batch-focused approach may become the dominant way inference is served.

3 Key Points

What happened
Sail Research, founded by Neil Movva and Samir Menon, announced a Series A round with investors including Kleiner Perkins, Redpoint, and Sequoia. The company routes asynchronous inference requests across open models like DeepSeek, Qwen, Kimi, and GLM, selecting the cheapest capable model for each task. GLM-5.1 on Sail costs 6× less per token than Anthropic's Haiku.
Why it matters
As AI agents shift from chat assistants into background workers running overnight tasks, most inference workloads will likely flow through batch queues rather than real-time systems. Batch inference costs far less because it uses spot capacity and idle server time instead of reserving capacity per request, making it economically viable for long-running tasks like code review, research, and document processing.
What to watch
Sailboxes—cloud computers that hold state across agent tasks, pause during inference waits, and resume in seconds—let customers pay only for active compute time. Sail has already served trillions of tokens to customers in code review, deep research, and cybersecurity.
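
The pay-for-active-time idea can be illustrated with a small sketch. This is not Sail's billing code: the `ActiveTimeMeter` class, its method names, and the rate are all hypothetical, chosen only to show how a sandbox that pauses during an inference wait would accrue no charge for that wait.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActiveTimeMeter:
    """Hypothetical meter: accrues billable seconds only while running."""

    rate_per_second: float               # illustrative rate, arbitrary units
    active_seconds: float = 0.0
    resumed_at: Optional[float] = None   # None means the sandbox is paused

    def resume(self, now: float) -> None:
        # Start the clock only if the sandbox is currently paused.
        if self.resumed_at is None:
            self.resumed_at = now

    def pause(self, now: float) -> None:
        # Stop the clock and bank the elapsed active time.
        if self.resumed_at is not None:
            self.active_seconds += now - self.resumed_at
            self.resumed_at = None

    def cost(self) -> float:
        return self.active_seconds * self.rate_per_second


meter = ActiveTimeMeter(rate_per_second=0.25)
meter.resume(now=0.0)
meter.pause(now=5.0)       # agent submits an inference request and waits
meter.resume(now=125.0)    # result arrives 120 s later
meter.pause(now=130.0)
print(meter.active_seconds)  # 10.0 (the 120 s wait is not counted)
print(meter.cost())          # 2.5
```

In this toy run the sandbox exists for 130 seconds but is billed for 10, which is the effect the article attributes to pausing during inference waits.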

FAQ

How much cheaper is Sail's inference than alternatives?
GLM-5.1 on Sail costs 6× less per token than Anthropic's Haiku. The same tokens cost 6× less if the customer can wait two minutes instead of two seconds for a code review.
What models does Sail support?
Sail distributes requests across open models including DeepSeek, Qwen, Kimi, and GLM, selecting the cheapest capable model for each task.
How does Sail keep costs low?
Sail uses spot capacity when available and fails over to reliable compute when it is not, using fleet-aware orchestration to keep utilization high and cost low. Customers pay only for active time in Sailboxes, not idle time.
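
As a rough illustration of the two decisions described in the answers above (pick the cheapest capable model, then prefer spot capacity and fail over when none is free), here is a minimal sketch. It is not Sail's orchestration logic: the `ModelOption` fields, the capability scores, the prices, the placeholder model names, and the pool labels are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelOption:
    name: str
    price_per_mtok: float   # illustrative price per million tokens
    capability: int         # illustrative score; higher handles harder tasks


def pick_model(options: List[ModelOption], required_capability: int) -> ModelOption:
    """Return the cheapest option whose capability meets the requirement."""
    capable = [m for m in options if m.capability >= required_capability]
    if not capable:
        raise ValueError("no model meets the required capability")
    return min(capable, key=lambda m: m.price_per_mtok)


def pick_pool(free_spot_slots: int) -> str:
    """Prefer spot capacity; fall back to reliable compute when none is free."""
    return "spot" if free_spot_slots > 0 else "reliable"


# Placeholder catalog; names and numbers are invented for illustration.
catalog = [
    ModelOption("model-a", price_per_mtok=1.0, capability=3),
    ModelOption("model-b", price_per_mtok=0.4, capability=2),
    ModelOption("model-c", price_per_mtok=0.2, capability=1),
]

print(pick_model(catalog, required_capability=2).name)  # model-b
print(pick_pool(free_spot_slots=0))                     # reliable
```

A real scheduler would also weigh queue depth, deadlines, and preemption risk; the sketch only captures the cost-first ordering the article describes.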
