Revolutionary: Training Transformers Directly on Apple Neural Engine
Apple's Neural Engine (ANE) delivers 15.8 TFLOPS of inference power in M4 chips, but training? Officially impossible. Until now.
The Breakthrough: Pure ANE Training
ANE Training is a from-scratch implementation running full transformer training loops - forward AND backward passes - directly on ANE hardware using reverse-engineered private APIs. No CoreML training APIs. No Metal shaders. No GPU fallback. Pure ANE compute.
Current Benchmarks (M4, dim=768, seq=512):

- 9.3 ms/step
- 11.2% ANE utilization (1.78 TFLOPS sustained)
- 6 ANE kernel dispatches per training step
Architecture Deep Dive
The system orchestrates 6 specialized ANE kernels per training step:
| Kernel | Function | Key Innovation |
|---|---|---|
| kFwdAttn | RMSNorm + QKV + SDPA | Forward taps avoid CPU recompute |
| kFwdFFN | SwiGLU FFN | ANE RMSNorm fusion |
| kFFNBwd | FFN backward | Channel-first layout |
| kSdpaBwd1/2 | SDPA backward | Wo^T fusion reduces kernel count |
| kQKVb | QKV backward | GCD async cblas overlap |
CPU handles only: RMSNorm backward, residuals, loss, dW accumulation (Accelerate cblas), Adam updates.
Key Optimizations That Matter
- Channel-first layout - matches the ANE IOSurface [1,C,1,S] layout, eliminating transpose overhead
- vDSP RMSNorm - 10x speedup (6.7 ms → 0.7 ms)
- ANE RMSNorm fusion - baked into the forward kernels
- Deferred cblas - maximizes ANE/CPU overlap
- exec() restart - bypasses the 119-compile limit
Performance evolution: 33.5 ms → 9.3 ms through systematic optimization.
Get Started in Minutes
```sh
# macOS 15+ on Apple Silicon required
xcrun clang -O2 -framework Foundation -framework IOSurface \
  -framework CoreML -framework Accelerate -ldl -lobjc \
  -o train_large training/train_large.m
./train_large
```
Zero dependencies beyond system frameworks.
File Structure Highlights
- `api_exploration.m` - Private API discovery
- `inmem_bench.m` - ANE dispatch latency
- `sram_probe.m` - SRAM bandwidth exploration
- `training/train_large.m` - Production single-layer trainer
Limitations & Roadmap
- ✅ Causal masking via decomposition
- ✅ Gradient accumulation/checkpointing
- ✅ Adam optimizer
- 🚧 Multi-layer pipeline
- 🚧 Real tokenized datasets
- 🚧 Full model training
Legal Note
Uses runtime introspection of undocumented APIs for research/educational purposes (DMCA §1201(f)). No Apple proprietary code included.
2.1k stars, 362 forks - Join the Apple Silicon ML revolution: https://github.com/maderix/ANE