Run prism-ml's 1-bit Bonsai models locally
14x smaller · 4-5x less energy · 56 ms response time
The easiest way to run 1-bit quantized models locally
prism-ml's Bonsai models use real 1-bit quantization across all layers. No mixed-precision workarounds or escape hatches.
A built-in model registry maps short names to HuggingFace repos. Just `bonsai pull bonsai-4b` -- it downloads directly, no URLs to remember.
Interactive multi-turn conversations with real-time token streaming. One-shot prompts too.
Everything runs on your machine via llama.cpp. No API keys, no telemetry, no cloud dependency.
Models from 248 MB to 1.2 GB. Run a capable LLM on a phone, laptop, or Raspberry Pi.
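Those sizes follow from simple arithmetic: at 1 bit per weight, dividing the parameter count by 8 gives the approximate size in bytes (real files run somewhat larger because of embeddings, norms, and GGUF metadata). A back-of-envelope check:

```shell
# 4B weights at 1 bit each: bits -> bytes -> MB (approximate)
echo $(( 4000000000 / 8 / 1000000 ))   # prints 500, in the ballpark of bonsai-4b's 572 MB
```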
Auto-starts the server, manages the process lifecycle, and finds your models. Just `bonsai run`.
Direct inference means faster responses and full control
| | Ollama | llama.cpp (bonsai v2) |
|---|---|---|
| Response time | 4,585 ms | 56 ms (~82x faster) |
| Forced thinking mode | Yes -- injects `<think>` tags | No -- clean responses |
| Wasted tokens | 160-265 per response | 0 |
| Dependencies | Ollama daemon + SDK + 8 deps | Single llama-server binary |
| Model storage | Opaque blob store | Plain GGUF files you control |
| Template control | Locked per model family | Full control |
Ollama's Qwen3 template forces chain-of-thought reasoning on every query, even trivial ones.
llama.cpp serves the model directly with no middleware overhead.
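Because llama.cpp's `llama-server` exposes an OpenAI-compatible chat endpoint, you can query it directly with nothing in between. A minimal sketch, assuming a server is already listening on `localhost:8080` (substitute your actual host and port):

```shell
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one word."}]}'
```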
Want a different backend? Just set `BONSAI_HOST` to any OpenAI-compatible server.
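For example, to point bonsai at a server you already run yourself (the URL below is illustrative; use your server's actual address):

```shell
export BONSAI_HOST=http://localhost:8080
bonsai run
```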
True 1-bit quantized language models in GGUF format
| Model | Parameters | Size | Profile | Pull command |
|---|---|---|---|---|
| bonsai-8b | 8B | 1.2 GB | Best quality | `bonsai pull bonsai-8b` |
| bonsai-4b | 4B | 572 MB | Best balance | `bonsai pull bonsai-4b` |
| bonsai-1.7b | 1.7B | 248 MB | Ultra-portable | `bonsai pull bonsai-1.7b` |
Models by prism-ml · Explore on HuggingFace
More intelligence per byte than any other quantization approach
Three commands, zero configuration
```shell
go install github.com/nareshnavinash/bonsai@latest
bonsai pull bonsai-4b && bonsai run
```