AIToday

Hugging Face now lets you run a private AI model server on its infrastructure with a single command, billing by the minute instead of requiring upfront server provisioning.

Hugging Face Blog8h ago4 min read
Hugging Face now lets you run a private AI model server on its infrastructure with a single command, billing by the minute instead of requiring upfront server provisioning.

Key takeaway

Hugging Face launched a simplified way to deploy private AI model servers using a single CLI command on its infrastructure, with OpenAI-compatible APIs and per-second billing. This removes the need to provision and manage servers or Kubernetes, making it practical for developers to spin up models for testing and batch work without fixed infrastructure commitments.

Summaries like this, in your inbox every morning.

Sign up free →

3 Key Points

  • What happened

    Hugging Face introduced a one-command way to launch a vLLM (a fast AI inference engine) server on its Jobs infrastructure, with OpenAI-compatible API access and per-second billing. Users authenticate with their Hugging Face token and can query the endpoint from anywhere.

  • Why it matters

    Previously, setting up a private model server required provisioning and managing Kubernetes infrastructure. This removes that operational overhead, making it faster and cheaper to stand up models for testing, evaluation, or batch generation work without committing to fixed infrastructure costs.

  • What to watch

    The service charges $1.50/hour for an a10g-large GPU instance, and users can scale to larger models by specifying bigger hardware flavors (like h200x2 for the 122B Qwen3.5 model). Jobs are billed per second, so explicit cancellation is cheaper than relying on the timeout safety net.

FAQ

How much does it cost to run a model server on Hugging Face Jobs?
An a10g-large GPU instance runs at $1.50/hour, and the service bills per second of hardware usage. Smaller or larger GPU flavors are available at different price points; you can check the full price list with the `hf jobs hardware` command.
What authentication is required to query the endpoint?
Every request must carry a Hugging Face token with read access to the job's namespace. The endpoint is gated and not publicly accessible; plain browser visits are rejected.
Can I run very large models on this service?
Yes. You can scale to much larger models by selecting beefier hardware flavors (such as h200x2 for a 122B parameter model) and using the `--tensor-parallel-size` flag to shard the model across multiple GPUs. Larger models require longer timeout values since they take longer to download and load.

Discussion

No comments yet. Be the first to share your thoughts!

Log in to join the discussion

Related Articles

Stay ahead with AI news

Get curated AI news from 200+ sources delivered daily to your inbox. Free to use.

Get Started Free

Free · takes 30 seconds · unsubscribe anytime

5 minutes a day. The AI essentials.

200+ sources · Email / LINE / Slack

Get it free →