NVIDIA releases Cosmos 3, a unified omni-model combining world generation, physical reasoning, and action generation in a single architecture

Hugging Face Blog · Jun 1, 2026

3 Key Points

NVIDIA released Cosmos 3 on Hugging Face in two model sizes: Cosmos 3 Nano (8B parameters), optimized for efficient inference on workstation-grade compute such as the RTX PRO 6000 GPU, and Cosmos 3 Super (32B parameters), designed for large-scale synthetic data generation and research on NVIDIA Hopper and Blackwell GPUs.
Cosmos 3 is built on a Mixture-of-Transformers (MoT) architecture that processes text, image, video, audio, and action in a single unified model. It replaces the previous approach, in which developers had to work with separate models for world generation, controlled generation, scene understanding, and policy generation.
The model supports multiple input-output combinations: text/image/video-to-video generation, text/video-to-text output (for vision language tasks), action/image/text-to-video (forward dynamics), text/video-to-action (inverse dynamics), and image/text-to-video-and-action (policy model). The release includes Diffusers integration, post-training scripts on GitHub, and open synthetic data generation datasets for physical AI.
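The input-output combinations listed above can be written down as a small lookup table. The following is only an illustrative Python sketch, not the Cosmos 3 or Diffusers API: the `Mode` class, the `MODES` table, and the helper functions are hypothetical names introduced here. It also assumes that a slash-separated list such as "text/image/video" means the set of accepted input modalities, which the summary does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """One input-output combination from the summary (hypothetical type)."""
    name: str
    inputs: frozenset   # modalities the mode is described as accepting
    outputs: frozenset  # modalities the mode is described as producing


# Transcribed from the summary; the labels in parentheses there are used as names.
MODES = [
    Mode("video generation", frozenset({"text", "image", "video"}), frozenset({"video"})),
    Mode("vision language", frozenset({"text", "video"}), frozenset({"text"})),
    Mode("forward dynamics", frozenset({"action", "image", "text"}), frozenset({"video"})),
    Mode("inverse dynamics", frozenset({"text", "video"}), frozenset({"action"})),
    Mode("policy model", frozenset({"image", "text"}), frozenset({"video", "action"})),
]


def modes_producing(output):
    """Names of modes whose outputs include the requested modality."""
    return [m.name for m in MODES if output in m.outputs]


def modes_for(available_inputs, output):
    """Names of modes that accept every supplied input and produce `output`.

    Assumption: supplied inputs must be a subset of the accepted set.
    """
    supplied = frozenset(available_inputs)
    return [
        m.name
        for m in MODES
        if supplied <= m.inputs and output in m.outputs
    ]


if __name__ == "__main__":
    print(modes_producing("action"))
    print(modes_for({"text", "video"}, "action"))
    print(modes_for({"image", "text"}, "video"))
```

Under these assumptions, an image plus a text prompt with video as the target matches three of the listed modes (video generation, forward dynamics, and the policy model). This shows why a unified model still needs the caller to say which task is intended.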
