Ollama benchmark
Continuous, automated speed tests for every Ollama Cloud model. One streaming request per model every ~10 minutes. No cherry-picked results — just raw measurements from outside Ollama's network.
Top 10 fastest Ollama Cloud models right now
| Rank | Model | TPS now | TPS 24h avg | TTFT | Reliability |
|---|---|---|---|---|---|
| #1 | GLM 4.7 | 265.8 | 122.4 | 418ms | 100% |
| #2 | Nemotron 3 Nano 30B | 261.5 | 223.2 | 427ms | 100% |
| #3 | Ministral 3 3B | 221.5 | 204.0 | 474ms | 100% |
| #4 | Ministral 3 3B | 217.8 | 211.4 | 444ms | 100% |
| #5 | GLM 4.7 | 195.6 | 109.1 | 448ms | 100% |
| #6 | Ministral 3 8B | 171.6 | 132.8 | 453ms | 100% |
| #7 | Qwen3 Coder 480B | 162.7 | 108.6 | 784ms | 100% |
| #8 | Qwen3 Coder 480B | 147.8 | 129.2 | 841ms | 100% |
| #9 | Gemini 3 Flash Preview | 145.0 | 114.5 | 1.7s | 100% |
| #10 | Ministral 3 8B | 140.9 | 129.5 | 591ms | 100% |
What the Ollama benchmark measures
Each benchmark run sends a single streaming chat-completion request to the Ollama Cloud API endpoint. The model is prompted to write a 400-word prose explanation of HTTP request routing, with a max_tokens cap of 300.
- TPS (tokens per second): Generation throughput, calculated as output tokens divided by the time between the first and last token. TTFT is excluded, so TPS reflects pure decode speed rather than queue or prompt-processing delay.
- TTFT (time to first token): Milliseconds from request dispatch to the first content chunk in the stream. Captures the network round-trip plus the provider's prompt-processing latency.
- Reliability: Percentage of benchmark runs that succeeded in the last 24 hours. Failures are classified as auth, rate_limit, server, timeout, network, or malformed.
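The three definitions above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the benchmark's actual code: the `RunResult` record, its field names, and the helper functions are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Failure classes named in the text; any other value is outside this sketch.
FAILURE_KINDS = {"auth", "rate_limit", "server", "timeout", "network", "malformed"}


@dataclass
class RunResult:
    """One benchmark run (hypothetical record layout, timestamps in seconds)."""
    dispatched_at: float
    first_token_at: Optional[float]
    last_token_at: Optional[float]
    output_tokens: int
    failure: Optional[str] = None  # one of FAILURE_KINDS, or None on success


def ttft_ms(run: RunResult) -> float:
    """Milliseconds from request dispatch to the first content chunk."""
    return (run.first_token_at - run.dispatched_at) * 1000.0


def tps(run: RunResult) -> Optional[float]:
    """Output tokens divided by the first-to-last-token window (TTFT excluded)."""
    decode_seconds = run.last_token_at - run.first_token_at
    if decode_seconds <= 0:
        return None  # a single-chunk response has no measurable decode window
    return run.output_tokens / decode_seconds


def reliability_pct(runs: Sequence[RunResult]) -> Optional[float]:
    """Share of successful runs, as a percentage; None if there are no runs."""
    if not runs:
        return None
    succeeded = sum(1 for r in runs if r.failure is None)
    return 100.0 * succeeded / len(runs)
```

For example, a run dispatched at t = 0 s whose first token arrives at 0.5 s and last token at 2.5 s, with 300 output tokens, gives a TTFT of 500 ms and a TPS of 150.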
Benchmark cadence and fairness
The worker uses a priority queue that always picks the most-overdue (provider, model) pair, targeting a ~10-minute interval per model. Benchmarks run sequentially — one request at a time — mirroring realistic single-client usage.
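One way to picture the "most-overdue first" policy is a min-heap keyed on each pair's next due time, so the pair with the earliest due time (the most overdue) is always popped first. The sketch below is an assumption about how such a queue could look; the `OverdueQueue` class, its method names, and the 600-second constant are illustrative and not taken from the worker itself.

```python
import heapq
from typing import Iterable, List, Tuple

INTERVAL_SECONDS = 600.0  # the ~10-minute target interval per model


class OverdueQueue:
    """Min-heap of (due_at, provider, model); the earliest due time pops first."""

    def __init__(self, pairs: Iterable[Tuple[str, str]], now: float = 0.0) -> None:
        # Every pair starts out due immediately.
        self._heap: List[Tuple[float, str, str]] = [
            (now, provider, model) for provider, model in pairs
        ]
        heapq.heapify(self._heap)

    def pop_most_overdue(self) -> Tuple[float, str, str]:
        """Remove and return the pair whose due time is furthest in the past."""
        return heapq.heappop(self._heap)

    def reschedule(self, provider: str, model: str, finished_at: float) -> None:
        """Queue the pair again, one interval after its run finished."""
        heapq.heappush(self._heap, (finished_at + INTERVAL_SECONDS, provider, model))
```

A sequential worker would loop over `pop_most_overdue()`, run one request, then call `reschedule()` before picking the next pair, which keeps a single request in flight at any time.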
We benchmark on the Ollama Cloud premium plan. This gives full catalog access, including models behind the paywall. Speed numbers reflect premium-tier infrastructure, not the free tier, which may be slower under load.
How to read the numbers
- TPS is relative, not absolute. The same model can vary 20–30% across hours depending on provider load and time of day. Use the 24h average for a more stable comparison.
- TTFT matters for interactive use. A model with high TPS but 3 s TTFT feels slow in a chat interface. The leaderboard sorts by latest TPS by default — sort by TTFT to optimise for responsiveness.
- Reliability is often the deciding factor. A model that returns errors 30% of the time needs retry logic in production. Filter for ≥90% reliability for production workloads.
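The reading advice above can be applied mechanically: drop models below the reliability threshold, then rank the rest by TTFT, using the 24h average TPS as a tiebreaker. The sketch below assumes a hypothetical `Row` record and `shortlist` helper; the rows in the usage note are made-up values, not leaderboard data.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Row:
    """One leaderboard entry (hypothetical field names)."""
    model: str
    tps_24h_avg: float
    ttft_ms: float
    reliability_pct: float


def shortlist(rows: Sequence[Row], min_reliability: float = 90.0) -> List[Row]:
    """Keep rows at or above the reliability floor, most responsive first."""
    eligible = [r for r in rows if r.reliability_pct >= min_reliability]
    # Lower TTFT ranks first; a higher 24h average TPS breaks ties.
    return sorted(eligible, key=lambda r: (r.ttft_ms, -r.tps_24h_avg))
```

With illustrative rows `Row("a", 200, 3000, 100)`, `Row("b", 120, 450, 99)`, and `Row("c", 250, 400, 70)`, the shortlist is `b` then `a`: `c` has the best TTFT but is removed by the reliability filter, and `a` ranks last despite its higher TPS because of its 3 s TTFT.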