AIToday

RL.cu: From-scratch CUDA implementation of LLM reinforcement learning achieves 1.37× faster training than TRL with vLLM

Hacker News · 23h ago · 2 min read

3 Key Points

  1. RL.cu is a complete LLM RL pipeline written in pure CUDA. It combines hand-written kernels (FlashAttention-2, RMSNorm, RoPE, SwiGLU, AdamW, GRPO loss), a vLLM-style inference engine with continuous batching and a paged KV cache, and SFT + GRPO training. It requires CUDA Toolkit >= 12.0 and an Ampere or newer GPU (sm_80+).

  2. On GRPO training with Qwen3-0.6B over DeepMath-103K (8 prompts × 8 generations), RL.cu achieved a 1.37× wall-clock speedup over TRL with the vLLM backend: 33.7s per step vs. 46.3s, with a comparable final reward (0.307 vs. 0.312 over the last 100 steps). The speedup comes from 15% higher generation throughput (2,992 tok/s vs. 2,602 tok/s) and from eliminating weight transfer between the inference and training processes.

  3. The inference engine reaches 6,963 tok/s at batch size 256 on an RTX PRO 6000, about 94% of nano-vLLM's throughput (7,411 tok/s). The project is open source; code and instructions are available on GitHub.
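The headline ratios can be re-derived from the raw figures quoted in the key points. The short Python sketch below is only an arithmetic check; the variable names are illustrative and are not taken from the RL.cu codebase.

```python
# Figures as quoted in the summary (Qwen3-0.6B, GRPO on DeepMath-103K).
RLCU_STEP_S = 33.7        # RL.cu seconds per training step
TRL_STEP_S = 46.3         # TRL + vLLM seconds per training step
RLCU_GEN_TOK_S = 2992     # RL.cu generation throughput during training
TRL_GEN_TOK_S = 2602      # TRL + vLLM generation throughput during training
RLCU_INFER_TOK_S = 6963   # RL.cu inference engine, batch size 256
NANO_VLLM_TOK_S = 7411    # nano-vLLM reference throughput
PROMPTS, GENERATIONS = 8, 8

# Wall-clock speedup is the ratio of step times (slower / faster).
step_speedup = TRL_STEP_S / RLCU_STEP_S

# Relative generation-throughput gain of RL.cu over TRL + vLLM.
gen_gain = RLCU_GEN_TOK_S / TRL_GEN_TOK_S - 1

# Fraction of nano-vLLM throughput reached by the inference engine.
infer_fraction = RLCU_INFER_TOK_S / NANO_VLLM_TOK_S

# Completions sampled per GRPO step: one group of generations per prompt.
completions_per_step = PROMPTS * GENERATIONS

print(f"step speedup:          {step_speedup:.2f}x")
print(f"generation gain:       {gen_gain * 100:.0f}%")
print(f"vs. nano-vLLM:         {infer_fraction * 100:.0f}%")
print(f"completions per step:  {completions_per_step}")
```

The 15% generation gain is smaller than the 1.37× step-time speedup, which is consistent with the summary's statement that part of the saving comes from avoiding weight transfer rather than from generation alone.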
