RL.cu: From-scratch CUDA implementation of LLM reinforcement learning achieves 1.37× faster training than TRL with vLLM

Hacker News, Jun 7, 2026


3 Key Points

RL.cu is a complete LLM RL pipeline written in pure CUDA. It combines hand-written kernels (FlashAttention-2, RMSNorm, RoPE, SwiGLU, AdamW, GRPO loss), a vLLM-style inference engine with continuous batching and a paged KV cache, and SFT and GRPO training. It requires CUDA Toolkit >= 12.0 and an Ampere or newer GPU (sm_80+).
In GRPO training of Qwen3-0.6B on DeepMath-103K (8 prompts × 8 generations), RL.cu achieved a 1.37× wall-clock speedup over TRL with the vLLM backend: 33.7s per step vs. 46.3s, with a comparable final reward (0.307 vs. 0.312 over the last 100 steps). The speed gains come from 15% higher generation throughput (2,992 tok/s vs. 2,602 tok/s) and from eliminating the weight transfer between the inference and training processes.
The inference engine reaches 6,963 tok/s at batch size 256 on an RTX PRO 6000, which is 94% of nano-vLLM's throughput (7,411 tok/s). The project is open source; code and instructions are available on GitHub.
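
For readers unfamiliar with the "8 prompts × 8 generations" setup, the following is a minimal Python sketch of the group-relative advantage step that GRPO is generally built on: each prompt's generations form a group, and every reward is normalized against its own group's mean and standard deviation. This illustrates the general technique only and is not RL.cu's CUDA kernel; the function name `group_relative_advantages`, the `eps` value, the use of the population standard deviation, and the reward values are all assumptions made for illustration (implementations differ on these details).

```python
import math


def group_relative_advantages(rewards, num_generations, eps=1e-6):
    """Normalize each reward against the other generations for the same prompt.

    `rewards` is a flat list ordered prompt by prompt, so its length must be
    num_prompts * num_generations. Hypothetical helper, not part of RL.cu.
    """
    if num_generations <= 0 or len(rewards) % num_generations != 0:
        raise ValueError("rewards must contain whole groups of num_generations")

    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = sum(group) / len(group)
        # Population variance (assumption; some libraries use the sample variance).
        var = sum((r - mean) ** 2 for r in group) / len(group)
        std = math.sqrt(var)
        # eps keeps the division finite when every generation got the same reward.
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages


# Illustrative batch shaped like the benchmark: 8 prompts x 8 generations.
# The rewards are made up: the first two generations per prompt are "correct".
num_prompts, num_generations = 8, 8
rewards = [1.0 if g < 2 else 0.0
           for _ in range(num_prompts)
           for g in range(num_generations)]
advantages = group_relative_advantages(rewards, num_generations)
print(len(advantages))  # one advantage per generation
```

Generations that score above their group's mean receive a positive advantage and the rest a negative one, so a group in which every generation earns the same reward contributes no learning signal.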
