How much does it cost to run a model server on Hugging Face Jobs?

An a10g-large GPU instance runs at $1.50/hour, and the service bills per second of hardware usage. Smaller or larger GPU flavors are available at different price points; you can check the full price list with the `hf jobs hardware` command.

What authentication is required to query the endpoint?

Every request must carry a Hugging Face token with read access to the job's namespace. The endpoint is gated and not publicly accessible; plain browser visits are rejected.

Can I run very large models on this service?

Yes. You can scale to much larger models by selecting beefier hardware flavors (such as h200x2 for a 122B parameter model) and using the `--tensor-parallel-size` flag to shard the model across multiple GPUs. Larger models require longer timeout values since they take longer to download and load.

Back to articlesLarge Language Models

Large Language Models

Hugging Face now lets you run a private AI model server on its infrastructure with a single command, billing by the minute instead of requiring upfront server provisioning.

Hugging Face Blog8h ago4 min read

Key takeaway

Hugging Face launched a simplified way to deploy private AI model servers using a single CLI command on its infrastructure, with OpenAI-compatible APIs and per-second billing. This removes the need to provision and manage servers or Kubernetes, making it practical for developers to spin up models for testing and batch work without fixed infrastructure commitments.

Summaries like this, in your inbox every morning.

3 Key Points

What happened
Hugging Face introduced a one-command way to launch a vLLM (a fast AI inference engine) server on its Jobs infrastructure, with OpenAI-compatible API access and per-second billing. Users authenticate with their Hugging Face token and can query the endpoint from anywhere.
Why it matters
Previously, setting up a private model server required provisioning and managing Kubernetes infrastructure. This removes that operational overhead, making it faster and cheaper to stand up models for testing, evaluation, or batch generation work without committing to fixed infrastructure costs.
What to watch
The service charges $1.50/hour for an a10g-large GPU instance, and users can scale to larger models by specifying bigger hardware flavors (like h200x2 for the 122B Qwen3.5 model). Jobs are billed per second, so explicit cancellation is cheaper than relying on the timeout safety net.

FAQ

How much does it cost to run a model server on Hugging Face Jobs?: An a10g-large GPU instance runs at $1.50/hour, and the service bills per second of hardware usage. Smaller or larger GPU flavors are available at different price points; you can check the full price list with the `hf jobs hardware` command.
What authentication is required to query the endpoint?: Every request must carry a Hugging Face token with read access to the job's namespace. The endpoint is gated and not publicly accessible; plain browser visits are rejected.
Can I run very large models on this service?: Yes. You can scale to much larger models by selecting beefier hardware flavors (such as h200x2 for a 122B parameter model) and using the `--tensor-parallel-size` flag to shard the model across multiple GPUs. Larger models require longer timeout values since they take longer to download and load.

Discussion

No comments yet. Be the first to share your thoughts!

Unable to generate summary — the article body provided contains no news content, only a newsletter registration prompt.

Nikkei AI Stocks1h ago

GitHub Copilot's shared AI harness matches rival model-vendor tools on task completion while using fewer tokens, letting developers mix-and-match models without sacrificing performance.

GitHub Copilot Blog4h ago

OpenAI will release GPT-5.6 in limited preview with Trump administration case-by-case approval, a less restrictive arrangement than the export controls imposed on rival Anthropic.

The Verge AI4h ago

The Trump administration is pressuring OpenAI to limit early access to its new GPT 5.6 model to select partners before a broader public release, shifting from its previous hands-off AI stance.

TechCrunch AI4h ago

Visa partners with AI and stablecoin firms to build payments infrastructure for autonomous agents and digital assets, diversifying revenue beyond traditional card fees.

Top Companies AI — US (1/2)7h ago

TrueFoundry acquires MLOps pioneer Seldon AI to combine infrastructure for deploying AI agents at scale, addressing a gap where only 14% of enterprises have moved AI pilots into production.

Top Companies AI — US (1/2)7h ago

Stay ahead with AI news

Get curated AI news from 200+ sources delivered daily to your inbox. Free to use.

Get Started Free

Free · takes 30 seconds · unsubscribe anytime

5 minutes a day. The AI essentials.

200+ sources · Email / LINE / Slack

Get it free →