Run prism-ml's 1-bit Bonsai models locally
14x smaller · 4-5x less energy · 56 ms response time
The easiest way to run 1-bit quantized models locally
prism-ml's Bonsai models use real 1-bit quantization across all layers. No mixed-precision workarounds or escape hatches.
A built-in model registry maps short names to HuggingFace repos. Just `bonsai pull bonsai-4b` -- it downloads directly, no URLs to remember.
Interactive multi-turn conversations with real-time token streaming. One-shot prompts too.
Everything runs on your machine via llama.cpp. No API keys, no telemetry, no cloud dependency.
Models from 248 MB to 1.2 GB. Run a capable LLM on a phone, laptop, or Raspberry Pi.
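Those sizes follow from simple arithmetic: at 1 bit per weight, dividing the parameter count by 8 gives the approximate size in bytes (real files run somewhat larger because of embeddings, norms, and GGUF metadata). A back-of-envelope check:

```shell
# 4B weights at 1 bit each: bits -> bytes -> MB (approximate)
echo $(( 4000000000 / 8 / 1000000 ))   # prints 500, in the ballpark of bonsai-4b's 572 MB
```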
Auto-starts the server, manages the process lifecycle, and finds your models. Just `bonsai run`.
Direct inference means faster responses and full control
| | Ollama | llama.cpp (bonsai v2) |
|---|---|---|
| Response time | 4,585 ms | 56 ms (~82x faster) |
| Forced thinking mode | Yes -- injects `<think>` tags | No -- clean responses |
| Wasted tokens | 160-265 per response | 0 |
| Dependencies | Ollama daemon + SDK + 8 deps | Single llama-server binary |
| Model storage | Opaque blob store | Plain GGUF files you control |
| Template control | Locked per model family | Full control |
Ollama's Qwen3 template forces chain-of-thought reasoning on every query, even trivial ones.
llama.cpp serves the model directly with no middleware overhead.
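Because llama.cpp's `llama-server` exposes an OpenAI-compatible chat endpoint, you can query it directly with nothing in between. A minimal sketch, assuming a server is already listening on `localhost:8080` (substitute your actual host and port):

```shell
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one word."}]}'
```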
Want a different backend? Just set `BONSAI_HOST` to any OpenAI-compatible server.
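For example, to point bonsai at a server you already run yourself (the URL below is illustrative; use your server's actual address):

```shell
export BONSAI_HOST=http://localhost:8080
bonsai run
```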
True 1-bit quantized language models in GGUF format
| Model | Parameters | Size | Profile | Pull command |
|---|---|---|---|---|
| bonsai-8b | 8B | 1.2 GB | Best quality | `bonsai pull bonsai-8b` |
| bonsai-4b | 4B | 572 MB | Best balance | `bonsai pull bonsai-4b` |
| bonsai-1.7b | 1.7B | 248 MB | Ultra-portable | `bonsai pull bonsai-1.7b` |
Models by prism-ml · Explore on HuggingFace
More intelligence per byte than any other quantization approach
Three commands, zero configuration
```shell
go install github.com/nareshnavinash/bonsai@latest
bonsai pull bonsai-4b && bonsai run
```