Developer ports 500M-parameter LLM training pipeline to ROCm on AMD Strix Halo APU
Hacker News · April 28, 2026
AI Summary
• A fork of 1386.ai has been adapted to run on ROCm (AMD's open-source GPU compute platform), targeting the AMD Strix Halo APU. The original author trained a 235M-parameter model; this port enables training a 500M-parameter model on the 128 GB Strix Halo APU in a GMKTec Evo X2 mini PC.
• PyTorch's ROCm backend required virtually no model-specific code changes to get training running. The pipeline now uses torch.compile for performance, ships a Dockerfile to simplify ROCm installation, and drops the data-loading worker count from 2 to 0 (batches now load on the main process) because training would not start with workers enabled; a minimal sketch of this wiring appears after the summary.
• Training a 500M-parameter model on this hardware takes roughly three weeks at ~4,750 tokens/s; the token budget this implies is worked out in the second sketch below. The author notes there is likely little low-hanging fruit left for optimization short of writing custom GPU kernels (HIP rather than CUDA, given the AMD hardware) or deeper fused-operator optimizations.
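A minimal sketch (placeholder code, not the project's actual pipeline) of why the port needed so few changes: ROCm builds of PyTorch expose HIP devices through the familiar torch.cuda API, so device selection, torch.compile, and the DataLoader tweak described above use the same calls as on NVIDIA hardware. The model and dataset below are tiny stand-ins.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# On ROCm builds of PyTorch, "cuda" is backed by HIP, so existing
# CUDA-targeted training code runs without model-specific changes.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(1024, 1024).to(device)  # tiny stand-in for the 500M-parameter LLM
model = torch.compile(model)                    # same torch.compile call as on NVIDIA GPUs

dataset = TensorDataset(torch.randn(64, 1024))
# num_workers=0 loads batches on the main process; per the summary, the port
# dropped this from 2 to 0 because training would not start with workers enabled.
loader = DataLoader(dataset, batch_size=8, num_workers=0)

for (batch,) in loader:
    out = model(batch.to(device))  # one forward pass, just to show the wiring
    break
```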
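The summary does not state the run's total token budget, but the reported throughput and duration imply one; a back-of-the-envelope check:

```python
tokens_per_s = 4_750                         # reported sustained throughput
seconds = 3 * 7 * 24 * 3600                  # roughly three weeks of wall-clock time
total_tokens = tokens_per_s * seconds
print(f"~{total_tokens / 1e9:.1f}B tokens")  # ≈ 8.6B tokens over the full run
```

For reference, ~8.6B tokens is in the neighborhood of the common ~20-tokens-per-parameter training heuristic for a 500M-parameter model, which is consistent with the three-week figure.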