Developer ports 500M-parameter LLM training pipeline to ROCm on AMD Strix Halo APU

Hacker News · April 28, 2026

AI Summary

  • A fork of 1386.ai has been adapted to run on ROCm (a GPU compute platform for AMD hardware), targeting the AMD Strix Halo APU. The original author trained a 235M-parameter model; this port enables training a 500M-parameter model on the 128 GB Strix Halo APU in a GMKTec Evo X2 mini PC.
  • PyTorch's ROCm backend required virtually no model-specific code changes for training. The pipeline now uses torch.compile for performance, ships a Dockerfile to simplify ROCm installation, and drops the data-loader worker count from 2 to 0 (loading data on the main process), because training could not start with workers enabled; see the sketch after this list.
  • Training a 500M-parameter model on this hardware takes roughly three weeks at ~4,750 tokens/s; a back-of-the-envelope check of that figure follows below. The author notes there is likely not much low-hanging fruit left for optimization without writing custom CUDA kernels or pursuing deeper fused-operator optimizations.
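
A minimal sketch of what the two adjustments above might look like in PyTorch. Everything here is illustrative: the model, dataset, and hyperparameters are placeholders rather than the port's actual code, and a ROCm (or CUDA) GPU is assumed, since PyTorch's ROCm build exposes AMD GPUs through the usual "cuda" device type.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"  # PyTorch's ROCm build exposes AMD GPUs as "cuda" devices

# Placeholder stand-in for the 500M-parameter LLM trained in the port.
model = torch.nn.Linear(1024, 1024).to(device)

# torch.compile is reported to work unchanged on the ROCm backend.
compiled_model = torch.compile(model)

# num_workers=0 keeps data loading on the main process; per the summary,
# training could not start on this setup with worker processes enabled.
dataset = TensorDataset(torch.randn(64, 1024), torch.randn(64, 1024))
loader = DataLoader(dataset, batch_size=8, num_workers=0)

optimizer = torch.optim.AdamW(compiled_model.parameters(), lr=3e-4)
loss_fn = torch.nn.MSELoss()

for x, y in loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    loss = loss_fn(compiled_model(x), y)
    loss.backward()
    optimizer.step()
```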
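
As a quick sanity check on the three-week figure, the stated throughput and duration imply a total token budget in the same ballpark as the common ~20-tokens-per-parameter rule of thumb; the per-parameter comparison is an inference, not something the post states.

```python
# Back-of-the-envelope: how many tokens does ~3 weeks at ~4,750 tokens/s buy?
tokens_per_sec = 4_750
seconds = 3 * 7 * 24 * 3600                # three weeks in seconds
total_tokens = tokens_per_sec * seconds
print(f"{total_tokens / 1e9:.1f}B tokens")             # ~8.6B tokens

# For a 500M-parameter model, that is roughly 17 tokens per parameter,
# near the ~20 tokens/parameter Chinchilla-style training budget.
print(f"{total_tokens / 500e6:.0f} tokens/parameter")  # ~17
```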
