
Revolutionary: Training Transformers Directly on Apple Neural Engine

Apple's Neural Engine (ANE) delivers 15.8 TFLOPS of inference power in M4 chips, but training? Officially impossible. Until now.

The Breakthrough: Pure ANE Training

ANE Training is a from-scratch implementation running full transformer training loops - forward AND backward passes - directly on ANE hardware using reverse-engineered private APIs. No CoreML training APIs. No Metal shaders. No GPU fallback. Pure ANE compute.

Current Benchmarks (M4, dim=768, seq=512):

  • 9.3 ms/step
  • 11.2% ANE utilization (1.78 TFLOPS sustained)
  • 6 ANE kernel dispatches per training step

Architecture Deep Dive

The system orchestrates 6 specialized ANE kernels per training step:

Kernel        Function               Key Innovation
kFwdAttn      RMSNorm + QKV + SDPA   Forward taps avoid CPU recompute
kFwdFFN       SwiGLU FFN             ANE RMSNorm fusion
kFFNBwd       FFN backward           Channel-first layout
kSdpaBwd1/2   SDPA backward          Wo^T fusion reduces kernel count
kQKVb         QKV backward           GCD async cblas overlap

CPU handles only: RMSNorm backward, residuals, loss, dW accumulation (Accelerate cblas), Adam updates.

Key Optimizations That Matter

  1. Channel-first layout - Matches the ANE IOSurface layout [1,C,1,S], eliminating transpose overhead
  2. vDSP RMSNorm - 10x speedup (6.7ms → 0.7ms)
  3. ANE RMSNorm fusion - Baked into forward kernels
  4. Deferred cblas - Maximum ANE/CPU overlap
  5. exec() restart - Bypasses 119 compile limit

Performance evolution: 33.5ms → 9.3ms through systematic optimization.
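The "deferred cblas" idea (optimization 4) is a producer/worker overlap: hand the dW accumulation to a background thread and keep dispatching ANE work on the main thread. A portable sketch, with pthreads and a naive GEMM standing in for the repo's GCD + Accelerate cblas (the `accum_job`/`accum_async` names are mine):

```c
#include <pthread.h>
#include <stddef.h>

/* Deferred dW accumulation job: dw += A * B, A is MxK, B is KxN, row-major. */
typedef struct {
    const float *a, *b;
    float *dw;
    size_t M, N, K;
} accum_job;

static void *accum_worker(void *p)
{
    accum_job *j = p;
    for (size_t i = 0; i < j->M; i++)
        for (size_t n = 0; n < j->N; n++) {
            float acc = 0.0f;
            for (size_t k = 0; k < j->K; k++)
                acc += j->a[i * j->K + k] * j->b[k * j->N + n];
            j->dw[i * j->N + n] += acc;   /* accumulate into the gradient */
        }
    return NULL;
}

/* Launch and return immediately; the caller overlaps other work
   (e.g. the next ANE dispatch) and joins the thread later. */
static pthread_t accum_async(accum_job *j)
{
    pthread_t t;
    pthread_create(&t, NULL, accum_worker, j);
    return t;
}
```

The same pattern with `dispatch_async` and `cblas_sgemm` lets the CPU-side accumulation hide entirely behind the ANE kernel latency.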

Get Started in Minutes

# macOS 15+ Apple Silicon required
xcrun clang -O2 -framework Foundation -framework IOSurface \
-framework CoreML -framework Accelerate -ldl -lobjc \
-o train_large training/train_large.m

./train_large

Zero dependencies beyond system frameworks.

File Structure Highlights

  • api_exploration.m - Private API discovery
  • inmem_bench.m - ANE dispatch latency
  • sram_probe.m - SRAM bandwidth exploration
  • training/train_large.m - Production single-layer trainer

Limitations & Roadmap

✅ Causal masking via decomposition
✅ Gradient accumulation/checkpointing
✅ Adam optimizer

🔄 Multi-layer pipeline
🔄 Real tokenized datasets
🔄 Full model training

Uses runtime introspection of undocumented APIs for research/educational purposes (DMCA §1201(f)). No Apple proprietary code included.

2.1k stars, 362 forks - Join the Apple Silicon ML revolution: https://github.com/maderix/ANE

