A developer has demonstrated what is described as the first published multi-chip pipelined language model inference on ESP32-class microcontrollers, splitting a 15-million-parameter model across two boards to work around single-chip memory limits.

Hacker News · Jun 12, 2026


3 Key Points

What happened
A 15-million-parameter Llama-architecture language model (AI that understands and generates text) runs with its layers split across two ESP32-S3 microcontroller boards connected by UART (a serial communication link), producing ~1.4 tokens (words or word fragments) per second. The project is verified as the first published multi-chip pipelined inference on ESP32-class hardware, and it shows a path to running a 42-million-parameter model at ~0.4–0.7 tokens per second with the same approach.
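The layer split can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the project's code: the toy "layers", the function names, and the choice of little-endian float32 on the wire are all hypothetical. It shows why a split run can match a single-chip run bit for bit: if every layer already produces float32 values, sending those same 32 bits over a serial link loses nothing.

```python
import struct


def f32(value):
    """Round a Python float to the nearest float32, as an MCU float would hold it."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def make_layer(scale, shift):
    """Toy stand-in for a transformer layer (hypothetical): elementwise affine map."""
    def layer(x):
        return [f32(scale * v + shift) for v in x]
    return layer


def run_layers(layers, x):
    """Monolithic run: every layer on one chip."""
    for layer in layers:
        x = layer(x)
    return x


def to_wire(x):
    """Serialize activations as little-endian float32 bytes (assumed wire format)."""
    return struct.pack(f"<{len(x)}f", *x)


def from_wire(payload):
    """Rebuild the activation vector from the received bytes."""
    return list(struct.unpack(f"<{len(payload) // 4}f", payload))


def run_split(layers, x, split_at):
    """Board A runs layers[:split_at], ships the activations, board B runs the rest."""
    hidden = run_layers(layers[:split_at], x)
    payload = to_wire(hidden)  # the bytes that would cross the UART link
    return run_layers(layers[split_at:], from_wire(payload))


if __name__ == "__main__":
    layers = [make_layer(1.5, 0.25), make_layer(0.5, -1.0),
              make_layer(2.0, 0.125), make_layer(0.75, 0.5)]
    x = [f32(v) for v in (0.1, -2.3, 4.0)]
    whole = run_layers(layers, x)
    split = run_split(layers, x, split_at=2)
    print("bit-exact:", whole == split)
```

Only the activation vector crosses the link, at 4 bytes per value in this sketch, so the traffic per token depends on the hidden size rather than on the number of weights.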
Why it matters
A single ESP32-S3 with 16MB of flash can hold only a ~15-million-parameter model; the next model size up, at ~24MB, does not fit on one board. By distributing layers across two chips and streaming weights directly from flash memory (using 0 bytes of RAM for weights), this approach lets developers run larger models on cheap, low-power hardware without buying bigger boards. The output remains bit-exact with the monolithic (single-chip) model, as verified by tests against a NumPy reference.
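To make "streaming weights from flash" concrete, here is a minimal Python sketch in which a temporary file stands in for flash. The function names and the row-major float32 layout are assumptions for illustration; the summary does not describe the project's actual storage format or read path. The point is that a matrix-vector product only ever needs one row of weights at a time, so the full matrix never has to sit in RAM.

```python
import os
import struct
import tempfile


def write_matrix(path, rows):
    """Store a row-major float32 matrix in a file that stands in for flash."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(f"<{len(row)}f", *row))


def matvec_streamed(path, n_rows, n_cols, x):
    """Compute W @ x while holding only one row of W in memory at a time."""
    row_bytes = 4 * n_cols
    out = []
    with open(path, "rb") as f:
        for _ in range(n_rows):
            row = struct.unpack(f"<{n_cols}f", f.read(row_bytes))
            out.append(sum(w * v for w, v in zip(row, x)))
    return out


def matvec_in_ram(rows, x):
    """Reference: the same product with the whole matrix resident in memory."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]


if __name__ == "__main__":
    # Values chosen to be exactly representable in float32.
    W = [[0.5, -1.25, 2.0], [1.5, 0.25, -0.75]]
    x = [1.0, 2.0, 3.0]
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_matrix(path, W)
        streamed = matvec_streamed(path, len(W), len(x), x)
        print("matches in-RAM result:", streamed == matvec_in_ram(W, x))
    finally:
        os.remove(path)
```

Because both paths multiply the same values in the same order, the streamed result equals the in-RAM result exactly, which is the same kind of check the bit-exactness claim relies on.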
What to watch
The roadmap targets measured on-hardware performance for the 42-million-parameter model, a 2–3× speedup from SIMD optimization of the matrix multiplication step, and an on-device touch UI that removes the need for a PC. Code and setup instructions are publicly available under the MIT License; the approach uses the architecture from Karpathy's llama2.c and Microsoft Research's TinyStories training dataset.
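How much a faster matrix multiplication helps end to end depends on how much of each token's time is spent there, since link transfers and other steps are untouched by SIMD. The summary does not give that share, nor does it say whether the 2–3× figure refers to the matmul alone or to overall throughput, so `matmul_fraction` below is a purely hypothetical input. A minimal sketch using Amdahl's law, with the ~1.4 tokens per second figure as the baseline:

```python
def overall_speedup(matmul_fraction, matmul_speedup):
    """Amdahl's law: only the matmul share of per-token time gets faster."""
    if not 0.0 <= matmul_fraction <= 1.0:
        raise ValueError("matmul_fraction must be between 0 and 1")
    if matmul_speedup <= 0:
        raise ValueError("matmul_speedup must be positive")
    return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup)


def projected_tokens_per_second(baseline_tps, matmul_fraction, matmul_speedup):
    """Scale a measured baseline by the estimated overall speedup."""
    return baseline_tps * overall_speedup(matmul_fraction, matmul_speedup)


if __name__ == "__main__":
    baseline = 1.4  # tokens per second, from the summary
    for fraction in (0.5, 0.8, 1.0):  # hypothetical matmul shares
        for speedup in (2.0, 3.0):    # the roadmap's 2-3x range
            tps = projected_tokens_per_second(baseline, fraction, speedup)
            print(f"matmul share {fraction:.0%}, {speedup:.0f}x matmul: {tps:.2f} tok/s")
```

The closer the matmul share is to 100%, the closer the overall gain gets to the full 2–3×; any time spent outside the matmul caps the end-to-end improvement.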
