bonsai

Run prism-ml's 1-bit Bonsai models locally

14x smaller · 4-5x less energy · 56 ms response time

$ bonsai pull bonsai-4b
Downloading bonsai-4b (572 MB)...
pulling... [====================] 100% 572 MB/572 MB
Downloaded bonsai-4b

$ bonsai run
Interactive chat with bonsai-4b. Type /bye to exit.
>>> what is quantum computing?
Quantum computing uses qubits that can exist in
superposition, enabling parallel computation...
>>>

Why Bonsai?

The easiest way to run 1-bit quantized models locally

True 1-Bit Models

prism-ml's Bonsai models use real 1-bit quantization across all layers. No mixed-precision workarounds or escape hatches.

One-Command Pull

A built-in model registry maps short names to HuggingFace repos. Just run bonsai pull bonsai-4b; it downloads directly, with no URLs to remember.

Streaming Chat

Interactive multi-turn conversations with real-time token streaming; one-shot prompts are supported too.

100% Local

Everything runs on your machine via llama.cpp. No API keys, no telemetry, no cloud dependency.

Tiny Footprint

Models from 248 MB to 1.2 GB. Run a capable LLM on a phone, laptop, or Raspberry Pi.

Zero Config

Auto-starts the server, manages the process lifecycle, and finds your models. Just bonsai run.
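
Under the hood, bonsai run does roughly what you would do by hand with llama.cpp: start llama-server on a pulled GGUF file and talk to its OpenAI-compatible endpoint. A minimal sketch of the manual equivalent (the model path and port are assumptions, not bonsai's documented defaults):

# start llama.cpp's server on a pulled model (path and port are assumptions)
llama-server -m ~/.bonsai/models/bonsai-4b.gguf --port 8080 &

# query the OpenAI-compatible chat endpoint it exposes
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "what is quantum computing?"}]}'

bonsai run automates exactly this: it starts the server, manages its lifecycle, and shuts it down when you exit.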

Why llama.cpp, Not Ollama?

Direct inference means faster responses and full control

                        Ollama                         llama.cpp (bonsai v2)
Response time           4,585 ms                       56 ms (78x faster)
Forced thinking mode    Yes -- injects <think> tags    No -- clean responses
Wasted tokens           160-265 per response           0
Dependencies            Ollama daemon + SDK + 8 deps   Single llama-server binary
Model storage           Opaque blob store              Plain GGUF files you control
Template control        Locked per model family        Full control

Ollama's Qwen3 template forces chain-of-thought reasoning on every query, even trivial ones. llama.cpp serves the model directly with no middleware overhead. Want a different backend? Just set BONSAI_HOST to any OpenAI-compatible server.
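
For example, to point bonsai at a backend you run yourself (the exact URL format BONSAI_HOST expects is an assumption; the page only says any OpenAI-compatible server works):

# start your own llama-server, then aim bonsai at it
llama-server -m ./bonsai-4b.gguf --port 9090 &
BONSAI_HOST=http://localhost:9090 bonsai run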

Bonsai Models by prism-ml

True 1-bit quantized language models in GGUF format

bonsai-8b

8B parameters

1.2 GB

Best quality

bonsai pull bonsai-8b

bonsai-4b

4B parameters

572 MB

Best balance

bonsai pull bonsai-4b

bonsai-1.7b

1.7B parameters

248 MB

Ultra-portable

bonsai pull bonsai-1.7b

14x smaller than FP16
4-5x less energy
131 tok/s on M4 Pro
368 tok/s on RTX 4090
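
The headline ratio is easy to sanity-check: a 4B-parameter model at FP16 takes roughly 8 GB (two bytes per weight), and 8 GB / 572 MB ≈ 14, matching the quoted 14x for bonsai-4b.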

Models by prism-ml · Explore on HuggingFace
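
Because the weights are plain GGUF files hosted on HuggingFace, you can also fetch them without bonsai. A sketch using the HuggingFace CLI (the repo id below is hypothetical; check prism-ml's HuggingFace page for the actual names):

# download the GGUF directly (repo id is an assumption)
huggingface-cli download prism-ml/bonsai-4b-gguf --include "*.gguf"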

The 1-Bit Advantage

More intelligence per byte than any other quantization approach

14x smaller than FP16 equivalents
4-5x lower energy per token
1.06 Intelligence/GB (vs 0.10 for full precision)
40+ tokens/sec even on an iPhone

Get Started in 60 Seconds

Three commands, zero configuration

1. Install llama.cpp

brew install llama.cpp

Or build from source.

2. Install Bonsai

go install github.com/nareshnavinash/bonsai@latest

3. Pull a model and chat

bonsai pull bonsai-4b && bonsai run
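
If bonsai run fails to start a server, first confirm llama.cpp is on your PATH, since bonsai drives the llama-server binary directly:

which llama-server      # should print the binary's location
llama-server --version  # prints the llama.cpp build info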