New Same benchmark, more providers: Ollama vs OpenCode Zen & Go. Compare on TokenDyno →

Ollama benchmark

Continuous, automated speed tests for every Ollama Cloud model. One streaming request per model every ~10 minutes. No cherry-picked results — just raw measurements from outside Ollama's network.

Top 10 fastest Ollama Cloud models right now

Rank Model TPS now TPS 24h avg TTFT Reliability
#1 GLM 4.7 265.8 122.4 418ms 100%
#2 Nemotron 3 Nano 30B 261.5 223.2 427ms 100%
#3 Ministral 3 3B 221.5 204.0 474ms 100%
#4 Ministral 3 3B 217.8 211.4 444ms 100%
#5 GLM 4.7 195.6 109.1 448ms 100%
#6 Ministral 3 8B 171.6 132.8 453ms 100%
#7 Qwen3 Coder 480B 162.7 108.6 784ms 100%
#8 Qwen3 Coder 480B 147.8 129.2 841ms 100%
#9 Gemini 3 Flash Preview 145.0 114.5 1.7s 100%
#10 Ministral 3 8B 140.9 129.5 591ms 100%

See all 56 models on the leaderboard →

What the Ollama benchmark measures

Each benchmark run sends a single streaming chat-completion request to the Ollama Cloud API endpoint. The model is prompted to write a 400-word prose explanation of HTTP request routing, with a max_tokens cap of 300.

TPS — tokens per second
Generation throughput: output tokens divided by the time between first and last token. Excludes TTFT so TPS reflects pure decode speed, not queue or prompt-processing delay.
TTFT — time to first token
Milliseconds from request dispatch to the first content chunk in the stream. Captures network round-trip plus the provider's prompt-processing latency.
Reliability
Percentage of benchmark runs that succeeded in the last 24 hours. Failures are classified as auth, rate_limit, server, timeout, network, or malformed.

Benchmark cadence and fairness

The worker uses a priority queue that always picks the most-overdue (provider, model) pair, targeting a ~10-minute interval per model. Benchmarks run sequentially — one request at a time — mirroring realistic single-client usage.

We benchmark on the Ollama Cloud premium plan. This gives full catalog access including models behind the paywall. Speed numbers reflect premium-tier infrastructure, not free-tier which may be slower under load.

Full methodology →

How to read the numbers

  • TPS is relative, not absolute. The same model can vary 20–30% across hours depending on provider load and time of day. Use the 24h average for a more stable comparison.
  • TTFT matters for interactive use. A model with high TPS but 3 s TTFT feels slow in a chat interface. The leaderboard sorts by latest TPS by default — sort by TTFT to optimise for responsiveness.
  • Reliability is often the deciding factor. A model that returns errors 30% of the time needs retry logic in production. Filter for ≥90% reliability for production workloads.