Run TinyLlama on a $10 Board with PicoLM – A Complete Tutorial
Embedded AI is no longer a luxury reserved for GPUs or cloud servers. PicoLM, a lightweight C‑only inference engine, runs an entire 1.1‑billion‑parameter model on a 256 MB device like the LicheeRV Nano, or on a Raspberry Pi Zero 2W. This tutorial walks through every step:
- Why PicoLM? | TinyLlama 1.1B in GGUF format, zero dependencies, <80 KB binary, ~45 MB runtime RAM.
- Hardware prerequisites – Raspberry Pi 5 (4‑core), Pi 4, Pi 3 B+, or Pi Zero 2W; or a RISC‑V board such as the LicheeRV Nano.
- Build‑and‑install – one‑liner for Pi/Linux, full‑source builds for other platforms (including Windows).
- Download the model – a single 638 MB file; memory‑map the weights, no RAM copy.
- Run a quick test – see the prompt‑to‑response pipeline.
- Performance – token‑per‑second charts, RAM usage, and how to tweak threads.
- Integrate with PicoClaw – the ultra‑light Go assistant that pipes prompts and reads JSON output.
- Advanced options – JSON grammar constraints, KV cache persistence, mixed‑precision, GPU‑free offline use.
- FAQ & troubleshooting – common pitfalls and tips for debugging.
- Next steps – extending to Llama‑2, other LLaMA models, adding custom tools.
1. Why PicoLM?
| Feature | Benefit |
|---|---|
| Pure C implementation | No external libraries, compile‑time SIMD auto‑detection |
| GGUF native | Read Q4_K_M weights directly |
| Flash Attention | O(seq) memory footprint |
| FP16 KV cache | Halves cache size from 88 MB to 44 MB |
| Grammar‑constrained JSON | Reliable tool‑calling in small models |
| Cross‑compile for RISC‑V | Run on the LicheeRV Nano |
| Tiny binary (~80 KB) | Deploy anywhere |
Result – TinyLlama 1.1B runs on a ~$10 board with 256 MB RAM and 45 MB runtime memory.
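The 88 MB → 44 MB figure in the table checks out with a back‑of‑the‑envelope calculation. The model dimensions below (22 layers, 4 KV heads via grouped‑query attention, head dimension 64, 2,048‑token context) come from TinyLlama's public model card, not from PicoLM itself:

```shell
# KV cache = K and V tensors, per layer, per KV head, per position.
# TinyLlama 1.1B: 22 layers, 4 KV heads (GQA), head_dim 64, 2048-token context.
layers=22; kv_heads=4; head_dim=64; ctx=2048
bytes_fp16=$(( 2 * layers * kv_heads * head_dim * ctx * 2 ))   # K+V, 2 bytes each in FP16
echo "FP16 KV cache: $(( bytes_fp16 / 1024 / 1024 )) MB"
echo "FP32 KV cache: $(( bytes_fp16 * 2 / 1024 / 1024 )) MB"
```

This lands exactly on 44 MB for FP16 and 88 MB for FP32, matching the table.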
2. Hardware prerequisites
| Board | RAM | Cost | Notes |
|---|---|---|---|
| Raspberry Pi 5 4‑core | 2 GB+ | $60 | Highest performance |
| Raspberry Pi 4 4‑core | 1 GB | $35 | Good trade‑off |
| Raspberry Pi 3 B+ | 512 MB | $25 | Still works |
| Raspberry Pi Zero 2W | 512 MB | $15 | Ultra‑cheap |
| LicheeRV Nano | 256 MB | $10 | RISC‑V, NEON‑like SIMD |
You'll need an SD card with at least 1 GB of free storage for the model and runtime.
3. Build & install
One‑liner installer (Pi/Linux)
```shell
curl -sSL https://raw.githubusercontent.com/RightNow-AI/picolm/main/install.sh | bash
```
This script:
1. Detects your architecture (ARM64, ARMv7, x86‑64).
2. Installs gcc, make, and curl if missing.
3. Builds PicoLM with the optimal SIMD flags.
4. Downloads the TinyLlama 1.1B Q4_K_M model.
5. Adds picolm to your $PATH.
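The detection in step 1 boils down to a switch on `uname -m`. Here is an illustrative sketch of that step only; the real install.sh may map names differently:

```shell
# Map the kernel's machine name to the installer's target labels.
# Illustrative sketch, not the actual install.sh logic.
arch="$(uname -m)"
case "$arch" in
  aarch64) target="ARM64"  ;;
  armv7l)  target="ARMv7"  ;;
  x86_64)  target="x86-64" ;;
  *)       target="unsupported: $arch" ;;
esac
echo "Detected architecture: $target"
```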
Full source build (cross‑platform)
```shell
git clone https://github.com/rightnow-ai/picolm.git
cd picolm/picolm

# Auto-detect CPU for local build
make native

# Or cross-compile for Raspberry Pi from an x86 host
make cross-pi

# Build for RISC-V
make riscv
```
Windows (MSVC)
```bat
cd picolm
build.bat
picolm.exe model.gguf -p "Hello" -n 20
```
4. Download the model
The default `make model` target fetches TinyLlama 1.1B Chat (Q4_K_M) – a single 638 MB file.
```shell
cd /opt/picolm
make model
```
If you prefer a different GGUF, place it under ~/.picolm/models/ and update the config later.
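If you fetch a GGUF manually, it's worth sanity‑checking the file before pointing PicoLM at it: every valid GGUF file starts with the 4‑byte ASCII magic `GGUF`. A small check — the snippet below creates a tiny stand‑in file so it runs anywhere; point `$model` at your real download instead:

```shell
# GGUF files begin with the ASCII magic "GGUF".
# Stand-in file for demonstration; a real model is ~638 MB.
model=/tmp/demo.gguf
printf 'GGUFxxxx' > "$model"
magic="$(head -c 4 "$model")"
if [ "$magic" = "GGUF" ]; then
  echo "looks like a GGUF file"
else
  echo "not a GGUF file"
fi
```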
5. Quick test
```shell
./picolm /opt/picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "Explain gravity in one sentence" -n 30
```
You should see a short, coherent explanation. The binary size is about 80 KB and the program consumes <45 MB RAM.
6. Performance benchmarks
| Device | Cost | Tokens/s | RAM | Notes |
|---|---|---|---|---|
| Pi 5 4‑core | $60 | ~10 | 45 MB | 4‑core NEON |
| Pi 4 4‑core | $35 | ~8 | 45 MB | NEON |
| Pi 3 B+ | $25 | ~4 | 45 MB | NEON |
| Pi Zero 2W | $15 | ~2 | 45 MB | ARMv7 |
| LicheeRV Nano | $10 | ~1 | 45 MB | RISC‑V SIMD |
Tip: Use the `-j` flag to increase the thread count (max 8 on the Pi 5). On RISC‑V you’re limited to single‑threaded performance.
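A reasonable default is to match `-j` to the number of online cores. A sketch — the picolm invocation is commented out since it needs the binary and model on disk:

```shell
# Pick a thread count equal to the number of online cores.
threads="$(nproc)"
echo "Using $threads threads"
# ./picolm model.gguf -p "Explain gravity" -n 30 -j "$threads"
```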
7. Integrate with PicoClaw
PicoClaw is a lightweight Go assistant that spawns PicoLM as a subprocess. Configure it to use the built binary:
```json
{
  "agents": {
    "defaults": { "provider": "picolm", "model": "picolm-local" }
  },
  "providers": {
    "picolm": {
      "binary": "~/.picolm/bin/picolm",
      "model": "~/.picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
      "max_tokens": 256,
      "threads": 4,
      "template": "chatml"
    }
  }
}
```
Run:
```shell
picoclaw agent -m "What is photosynthesis?"
```
PicoClaw handles the protocol, so your device stays fully offline while responses are generated locally.
8. Advanced options
| Flag | Purpose |
|---|---|
| `--json` | Forces grammatically valid JSON output (essential for tool calling). |
| `--cache file.kvc` | Persists the KV cache; skips prefill on repeated prompts. |
| `-t <float>` | Temperature; set to 0 for greedy output. |
| `-k <float>` | Top‑p nucleus sampling. |
| `-s <int>` | RNG seed. |
| `-c <int>` | Override context length (e.g., 512 for constrained devices). |
Example JSON test
```shell
./picolm /opt/picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --json -p "Return JSON with name and age" -n 40 -t 0.3
```
You’ll receive something like `{"name":"Alice","age":30}`.
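Because `--json` guarantees well‑formed output, downstream scripts can extract fields reliably. A dependency‑free sketch using POSIX `sed` on a canned response — substitute the real `picolm --json` invocation for the hard‑coded string:

```shell
# Canned stand-in for: ./picolm model.gguf --json -p "..." -n 40
response='{"name":"Alice","age":30}'
# Pull out the "name" field with sed (fine for flat JSON like this;
# use a real JSON parser for nested structures).
name="$(printf '%s' "$response" | sed -n 's/.*"name":"\([^"]*\)".*/\1/p')"
echo "name=$name"
```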
9. FAQ & troubleshooting
Q: My Pi stalls after a few seconds.
A: The SD card might be slow; try a high‑speed UHS‑I card.
Q: I get an “Out‑of‑memory” error.
A: Reduce `-j` or the context length. On the Pi Zero, keep the context at 512.
Q: How much storage does the model take?
A: 638 MB for TinyLlama 1.1B Q4_K_M. Other models vary (Llama‑2 7B ≈ 4.1 GB).
Q: Can this run Llama‑2 7B?
A: Yes, if your device has ~1.4 GB of runtime RAM for the KV cache. A Pi 4 with 4 GB will work, just slower.
10. Next steps
- Add custom tools: PicoClaw can call external binaries (e.g., web requesters) and embed responses in JSON.
- Support more models: Any GGUF LLaMA‑architecture works; download from Hugging Face and update the config.
- Improve speed: AVX2 or AVX‑512 kernels for desktop CPUs, speculative decoding, or larger SIMD kernels.
- Edge AI packaging: Bundle PicoLM into an app image for Raspberry Pi OS or Alpine Linux.
Bottom line
PicoLM demonstrates that a 1‑billion‑parameter LLM can run on a $10, 256 MB board without a GPU, cloud, or internet access. With a 2‑minute install, an 80 KB binary, and a 45 MB RAM footprint, it’s the cheapest local AI you’ll find. Whether you’re an IoT hobbyist, a privacy advocate, or a developer pushing the limits of edge inference, PicoLM gives you a fully offline, open‑source, and highly optimized LLM solution.