Ollama benchmark

Continuous, automated speed tests for every Ollama Cloud model. One streaming request is sent per model roughly every 10 minutes. Results are not cherry-picked: they are raw measurements taken from outside Ollama's network.

Top 10 fastest Ollama Cloud models right now

Rank	Model	TPS now	TPS 24h avg	TTFT	Reliability
#1	GLM 4.7	265.8	122.4	418ms	100%
#2	Nemotron 3 Nano 30B	261.5	223.2	427ms	100%
#3	Ministral 3 3B	221.5	204.0	474ms	100%
#4	Ministral 3 3B	217.8	211.4	444ms	100%
#5	GLM 4.7	195.6	109.1	448ms	100%
#6	Ministral 3 8B	171.6	132.8	453ms	100%
#7	Qwen3 Coder 480B	162.7	108.6	784ms	100%
#8	Qwen3 Coder 480B	147.8	129.2	841ms	100%
#9	Gemini 3 Flash Preview	145.0	114.5	1.7s	100%
#10	Ministral 3 8B	140.9	129.5	591ms	100%

See all 56 models on the leaderboard →

What the Ollama benchmark measures

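Each benchmark run sends a single streaming chat-completion request to the Ollama Cloud API endpoint. The model is prompted to write a 400-word prose explanation of HTTP request routing, with a max_tokens cap of 300. Since a 400-word answer needs more than 300 tokens, responses are expected to reach the cap rather than finish early.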

TPS (tokens per second): Generation throughput, measured as output tokens divided by the time between the first and last token. TTFT is excluded, so TPS reflects pure decode speed rather than queue or prompt-processing delay.
TTFT (time to first token): Milliseconds from request dispatch to the first content chunk in the stream. It captures the network round trip plus the provider's prompt-processing latency.
Reliability: Percentage of benchmark runs that succeeded in the last 24 hours. Failures are classified as auth, rate_limit, server, timeout, network, or malformed.
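
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: the `RunTiming` record and its field names are hypothetical, and it assumes the client has already recorded a timestamp at dispatch, at the first content chunk, and at the last content chunk.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RunTiming:
    """Timestamps (in seconds) for one streaming run. Hypothetical record."""
    sent_at: float          # when the request was dispatched
    first_token_at: float   # when the first content chunk arrived
    last_token_at: float    # when the final content chunk arrived
    output_tokens: int      # number of tokens generated


def ttft_ms(run: RunTiming) -> float:
    """Time to first token: dispatch to first content chunk, in milliseconds."""
    return (run.first_token_at - run.sent_at) * 1000.0


def tps(run: RunTiming) -> Optional[float]:
    """Output tokens divided by the first-to-last-token window (TTFT excluded)."""
    window = run.last_token_at - run.first_token_at
    if window <= 0:
        return None  # single-chunk or empty response: throughput is undefined
    return run.output_tokens / window


def reliability_pct(outcomes: Sequence[bool]) -> Optional[float]:
    """Share of successful runs, as a percentage; None if there were no runs."""
    if not outcomes:
        return None
    return 100.0 * sum(outcomes) / len(outcomes)
```

For example, a run dispatched at t=0.0 s whose first chunk arrives at 0.5 s and last chunk at 2.5 s, with 300 output tokens, has a TTFT of 500 ms and a TPS of 150. The half second spent waiting for the first token does not lower the TPS figure.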

Benchmark cadence and fairness

The worker uses a priority queue that always picks the most-overdue (provider, model) pair, targeting a ~10-minute interval per model. Benchmarks run sequentially — one request at a time — mirroring realistic single-client usage.
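
One way to get "always pick the most-overdue pair" is a min-heap keyed on each pair's next due time. The sketch below is an assumption about how such a queue could look, not the worker's real implementation; `OverdueQueue` and its method names are hypothetical, and the 600-second interval stands in for the ~10-minute target.

```python
import heapq
from typing import List, Tuple

INTERVAL_S = 600.0  # ~10-minute target per model (assumed exact value)

Pair = Tuple[str, str]  # (provider, model)


class OverdueQueue:
    """Min-heap keyed on next due time, so the most-overdue pair pops first."""

    def __init__(self, pairs: List[Pair], now: float) -> None:
        # Every pair starts out due immediately.
        self._heap: List[Tuple[float, Pair]] = [(now, p) for p in pairs]
        heapq.heapify(self._heap)

    def pop_most_overdue(self) -> Tuple[float, Pair]:
        """Return (due_time, pair) for the pair with the earliest due time."""
        return heapq.heappop(self._heap)

    def reschedule(self, pair: Pair, finished_at: float) -> None:
        """Queue the pair again, one interval after its run finished."""
        heapq.heappush(self._heap, (finished_at + INTERVAL_S, pair))
```

A sequential worker would loop over this queue: pop a pair, wait if its due time is still in the future, run one benchmark request, then reschedule it. Because only one request is in flight at a time, a slow run delays every pair behind it, which is why the interval is a target rather than a guarantee.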

We benchmark on the Ollama Cloud premium plan, which gives full catalog access, including models behind the paywall. Speed numbers reflect premium-tier infrastructure, not the free tier, which may be slower under load.

Full methodology →

How to read the numbers

TPS is relative, not absolute. The same model can vary 20–30% across hours depending on provider load and time of day. Use the 24h average for a more stable comparison.
TTFT matters for interactive use. A model with high TPS but 3 s TTFT feels slow in a chat interface. The leaderboard sorts by latest TPS by default — sort by TTFT to optimise for responsiveness.
Reliability is often the deciding factor. A model that returns errors 30% of the time needs retry logic in production. Filter for ≥90% reliability for production workloads.
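
The reading guide above amounts to a simple selection rule: filter on reliability first, cap TTFT for interactive use, then rank by the 24h average rather than the latest TPS. The sketch below shows that rule under stated assumptions: `Row` and `pick_for_production` are hypothetical names, and the 1000 ms TTFT ceiling is an illustrative default, not a recommendation from the benchmark.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    """One leaderboard row. Hypothetical record mirroring the table columns."""
    model: str
    tps_now: float
    tps_24h: float
    ttft_ms: float
    reliability_pct: float


def pick_for_production(
    rows: List[Row],
    min_reliability: float = 90.0,   # threshold suggested in the text above
    max_ttft_ms: float = 1000.0,     # assumed ceiling for interactive use
) -> List[Row]:
    """Keep reliable, responsive models, ranked by 24h average TPS (descending)."""
    eligible = [
        r for r in rows
        if r.reliability_pct >= min_reliability and r.ttft_ms <= max_ttft_ms
    ]
    return sorted(eligible, key=lambda r: r.tps_24h, reverse=True)
```

Applied to three rows from the table above (GLM 4.7 at 265.8 TPS now, Nemotron 3 Nano 30B, and Gemini 3 Flash Preview), this ranks Nemotron 3 Nano 30B first on its 223.2 24h average, places GLM 4.7 second at 122.4 despite its higher latest reading, and drops Gemini 3 Flash Preview because its 1.7 s TTFT exceeds the assumed ceiling.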