
Revolutionary: Training Transformers Directly on Apple Neural Engine

Apple's Neural Engine (ANE) delivers 15.8 TFLOPS of inference power in M4 chips, but training? Officially impossible. Until now.

The Breakthrough: Pure ANE Training

ANE Training is a from-scratch implementation running full transformer training loops - forward AND backward passes - directly on ANE hardware using reverse-engineered private APIs. No CoreML training APIs. No Metal shaders. No GPU fallback. Pure ANE compute.

Current Benchmarks (M4, dim=768, seq=512):

  • 9.3 ms/step
  • 11.2% ANE utilization (1.78 TFLOPS sustained)
  • 6 ANE kernel dispatches per training step

Architecture Deep Dive

The system orchestrates 6 specialized ANE kernels per training step:

Kernel        Function               Key Innovation
kFwdAttn      RMSNorm + QKV + SDPA   Forward taps avoid CPU recompute
kFwdFFN       SwiGLU FFN             ANE RMSNorm fusion
kFFNBwd       FFN backward           Channel-first layout
kSdpaBwd1/2   SDPA backward          Wo^T fusion reduces kernel count
kQKVb         QKV backward           GCD async cblas overlap

CPU handles only: RMSNorm backward, residuals, loss, dW accumulation (Accelerate cblas), Adam updates.

Key Optimizations That Matter

  1. Channel-first layout - Matches the ANE IOSurface layout [1,C,1,S], eliminating transpose overhead
  2. vDSP RMSNorm - 10x speedup (6.7ms → 0.7ms)
  3. ANE RMSNorm fusion - Baked into forward kernels
  4. Deferred cblas - Maximum ANE/CPU overlap
  5. exec() restart - Bypasses 119 compile limit

Performance evolution: 33.5ms → 9.3ms through systematic optimization.
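The "deferred cblas" idea (optimization 4) is a producer/worker overlap: hand the dW accumulation to a background thread and keep dispatching ANE work on the main thread. A portable sketch, with pthreads and a naive GEMM standing in for the repo's GCD + Accelerate cblas (the `accum_job`/`accum_async` names are mine):

```c
#include <pthread.h>
#include <stddef.h>

/* Deferred dW accumulation job: dw += A * B, A is MxK, B is KxN, row-major. */
typedef struct {
    const float *a, *b;
    float *dw;
    size_t M, N, K;
} accum_job;

static void *accum_worker(void *p)
{
    accum_job *j = p;
    for (size_t i = 0; i < j->M; i++)
        for (size_t n = 0; n < j->N; n++) {
            float acc = 0.0f;
            for (size_t k = 0; k < j->K; k++)
                acc += j->a[i * j->K + k] * j->b[k * j->N + n];
            j->dw[i * j->N + n] += acc;   /* accumulate into the gradient */
        }
    return NULL;
}

/* Launch and return immediately; the caller overlaps other work
   (e.g. the next ANE dispatch) and joins the thread later. */
static pthread_t accum_async(accum_job *j)
{
    pthread_t t;
    pthread_create(&t, NULL, accum_worker, j);
    return t;
}
```

The same pattern with `dispatch_async` and `cblas_sgemm` lets the CPU-side accumulation hide entirely behind the ANE kernel latency.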

Get Started in Minutes

# macOS 15+ Apple Silicon required
xcrun clang -O2 -framework Foundation -framework IOSurface \
-framework CoreML -framework Accelerate -ldl -lobjc \
-o train_large training/train_large.m

./train_large

Zero dependencies beyond system frameworks.

File Structure Highlights

  • api_exploration.m - Private API discovery
  • inmem_bench.m - ANE dispatch latency
  • sram_probe.m - SRAM bandwidth exploration
  • training/train_large.m - Production single-layer trainer

Limitations & Roadmap

✅ Causal masking via decomposition
✅ Gradient accumulation/checkpointing
✅ Adam optimizer

🔄 Multi-layer pipeline
🔄 Real tokenized datasets
🔄 Full model training

Uses runtime introspection of undocumented APIs for research/educational purposes (DMCA §1201(f)). No Apple proprietary code included.

2.1k stars, 362 forks - Join the Apple Silicon ML revolution: https://github.com/maderix/ANE

