New StoSignSGD algorithm fixes SignSGD's convergence failures on the non-smooth objectives used to train large language models
arXiv cs.LG · April 20, 2026
AI Summary
•SignSGD has been popular for distributed learning and foundation model training, but it can fail to converge on the non-smooth objectives common in modern ML (ReLU activations, max-pooling, mixture-of-experts routing)
•StoSignSGD introduces structural stochasticity into the sign operator while keeping updates unbiased, solving SignSGD's fundamental convergence limitations
•Theoretical analysis proves StoSignSGD achieves sharp convergence rates matching lower bounds in convex optimization and improves performance in challenging non-convex non-smooth settings
•The algorithm maintains the computational efficiency benefits of sign-based methods while extending applicability to modern neural network architectures
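The summary does not specify the exact operator StoSignSGD uses, but one standard way to inject stochasticity into a sign-style update while keeping it unbiased is 1-bit stochastic quantization: each coordinate is rounded to ±scale with probabilities chosen so the expectation equals the original gradient. The sketch below illustrates that general construction (the function names and the `scale` parameter are illustrative, not from the paper):

```python
import numpy as np

def stochastic_sign(g, scale, rng):
    """Illustrative unbiased sign operator (1-bit stochastic quantization).

    For each coordinate g_i with |g_i| <= scale, emit +scale with
    probability (1 + g_i/scale)/2 and -scale otherwise, so that
    E[output_i] = g_i. Plain SignSGD would instead emit sign(g_i),
    whose expectation is biased whenever g_i != +/-scale.
    """
    g = np.asarray(g, dtype=float)
    p_plus = 0.5 * (1.0 + np.clip(g / scale, -1.0, 1.0))
    return scale * np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)

def sto_sign_step(params, grads, lr, scale, rng):
    """One optimizer step using the stochastic sign in place of
    the deterministic sign(grads) of plain SignSGD."""
    return params - lr * stochastic_sign(grads, scale, rng)
```

Because each update equals the true gradient in expectation, the scheme retains the 1-bit-per-coordinate communication cost of sign-based methods while avoiding the systematic bias that breaks SignSGD on non-smooth losses.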