
Reinforcement learning (RL) trains an AI system by defining a success metric, letting the system attempt a task millions of times, and reinforcing the moves that scored well. Examples include a drone trained in simulation that beat three FPV world champions on a real track (2023), a robot dog that learned to walk on a yoga ball using an LLM-written reward function (2024), and a robot hand that solved a Rubik's cube even when physically handicapped (2019).
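The try-score-keep loop described above can be illustrated with a toy example. This is a minimal sketch, not the method used by any of the systems mentioned: it assumes a three-action problem with made-up success rates, and the function name `train_bandit` and its parameters are hypothetical.

```python
import random


def train_bandit(true_success_rates, trials=20000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error on a toy multi-armed bandit.

    true_success_rates: hidden probability that each action "scores" (assumed values).
    Returns the learned value estimate and the try count for each action.
    """
    rng = random.Random(seed)
    n_actions = len(true_success_rates)
    counts = [0] * n_actions    # how often each action was tried
    values = [0.0] * n_actions  # running average reward per action

    for _ in range(trials):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: repeat the action that has scored best so far.
            action = max(range(n_actions), key=lambda a: values[a])

        # The success metric: 1 if the attempt worked, 0 otherwise.
        reward = 1.0 if rng.random() < true_success_rates[action] else 0.0

        # Incremental update of the running average for the chosen action.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]

    return values, counts


values, counts = train_bandit([0.2, 0.5, 0.8])
best = max(range(len(values)), key=lambda a: values[a])
print("best action:", best, "estimates:", [round(v, 2) for v in values])
```

Real systems replace the table of averages with a neural network policy and the coin-flip reward with a simulator or judge, but the loop of acting, scoring, and shifting toward what scored well is the same idea.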
RL systems can optimize for multiple competing objectives. Stable Diffusion models were tuned with different reward functions (aesthetic, compressible, incompressible, prompt-matching), and a ByteDance text-to-video model optimizes across five qualities (image aesthetics, text alignment, motion quality, overall visuals, and binary pass/fail constraints) using specialized judge models.
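One common way to fold several judge scores into a single training signal is a weighted sum gated by hard constraints. The sketch below is an illustration under stated assumptions, not the ByteDance method: the score names, the equal weights, and the rule that any failed constraint zeroes the reward are all hypothetical.

```python
def combined_reward(scores, weights, constraints_passed):
    """Combine per-quality judge scores into one scalar reward.

    scores: quality name -> judge score in [0, 1] (hypothetical names).
    weights: quality name -> relative importance (assumed, not from the source).
    constraints_passed: list of booleans from binary pass/fail checks.
    """
    # Assumption: a single failed pass/fail check vetoes the sample entirely.
    if not all(constraints_passed):
        return 0.0
    # Otherwise, trade the competing qualities off with a weighted sum.
    return sum(weights[name] * scores[name] for name in weights)


scores = {
    "image_aesthetics": 0.8,
    "text_alignment": 0.6,
    "motion_quality": 0.4,
    "overall_visuals": 0.5,
}
weights = {name: 0.25 for name in scores}  # equal weights, purely illustrative

reward_ok = combined_reward(scores, weights, constraints_passed=[True, True])
reward_vetoed = combined_reward(scores, weights, constraints_passed=[True, False])
print(reward_ok, reward_vetoed)
```

Changing the weights shifts which quality the policy favors, which is why the same base model can be steered toward different behaviors by swapping the reward function.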
RL has been deployed in consumer-facing systems: Meta's Advantage+ auto-generates ad variants and uses engagement signals to select which to show, with over a million advertisers running 15M+ AI-generated ads in a single month; YouTube's recommender uses a REINFORCE-trained policy to choose what to autoplay; and OpenAI Operator and Claude use RL-discovered strategies to click and navigate computers.
RL controls infrastructure and medical systems: a DeepMind controller adjusted 19 magnetic coils 10,000 times per second to shape plasma in a real tokamak (2022); a DeepMind system cut Google data center cooling costs by 40% (2016–2018); and an AI trained on 17,000+ ICU admissions recommended fluid and vasopressor doses, with mortality lowest among patients whose doctors' doses matched its recommendations (2018).