
RL.cu is a complete LLM RL pipeline written in pure CUDA. It combines hand-written kernels (FlashAttention-2, RMSNorm, RoPE, SwiGLU, AdamW, GRPO loss), a vLLM-style inference engine with continuous batching and a paged KV cache, and SFT and GRPO training. It requires CUDA Toolkit 12.0 or later and an Ampere or newer GPU (sm_80+).
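To make the "paged KV cache" idea concrete, here is a minimal CPU-side sketch of the bookkeeping such a cache typically needs: a pool of fixed-size blocks and a per-sequence block table that maps token positions to physical blocks. This is an illustration of the general technique, not RL.cu's implementation; the class name `PagedKVAllocator`, its methods, and the block sizes are all hypothetical.

```python
class PagedKVAllocator:
    """Hypothetical block allocator illustrating paged KV cache bookkeeping."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        # Free list of physical block ids; pop() hands out the lowest id first.
        self.free_blocks = list(range(num_blocks - 1, -1, -1))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # A new block is needed only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(reversed(self.block_tables.pop(seq_id, [])))
        self.seq_lens.pop(seq_id, None)
```

Because sequences only claim blocks as they grow and release them when they finish, a scheduler doing continuous batching can admit a new request as soon as enough blocks are free, rather than reserving memory for the maximum sequence length up front.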
In GRPO training of Qwen3-0.6B on DeepMath-103K (8 prompts × 8 generations), RL.cu achieved a 1.37× wall-clock speedup over TRL with the vLLM backend: 33.7 s per step vs. 46.3 s, with a matching final reward (0.307 vs. 0.312 over the last 100 steps). The speed gains come from 15% higher generation throughput (2,992 tok/s vs. 2,602 tok/s) and from eliminating weight transfer between the inference and training processes.
The inference engine reaches 6,963 tok/s at batch size 256 on an RTX PRO 6000, which is 94% of nano-vLLM's throughput (7,411 tok/s). The project is open source; code and instructions are available on GitHub.
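For readers unfamiliar with the "8 prompts × 8 generations" setup, the sketch below shows the group-relative advantage step commonly used in GRPO: each completion's reward is normalized against the other completions sampled for the same prompt. It is a plain-Python illustration of the standard formulation, not code from RL.cu. The function name, the `eps` value, and the use of the population standard deviation are assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards_per_prompt, eps=1e-6):
    """Normalize rewards within each prompt's group of generations.

    rewards_per_prompt: list of groups, one group per prompt, each a list
    of scalar rewards (e.g. 8 groups of 8 rewards for 8 prompts x 8 generations).
    eps is an assumed small constant that avoids division by zero.
    """
    advantages = []
    for group in rewards_per_prompt:
        mu = mean(group)
        sigma = pstdev(group)  # assumption: population std within the group
        advantages.append([(r - mu) / (sigma + eps) for r in group])
    return advantages
```

A group in which every generation receives the same reward yields all-zero advantages, so that prompt contributes no gradient signal for the step.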