Ollama Cloud limits
What limits Ollama Cloud places on your usage: by plan tier, concurrency, and model weight. All figures are from ollama.com/cloud (June 2026). For token-per-minute or request-per-minute limits, see the rate limits section below.
Concurrency limits
Each plan limits how many cloud model requests can be in flight at once. Requests over the limit are queued; once the queue fills, additional requests are rejected until a slot opens.
| Plan | Concurrent cloud models | Price |
|---|---|---|
| Free | 1 | $0 |
| Pro | 3 | $20/mo or $200/yr |
| Max | 10 | $100/mo |
| Team | TBA | Contact Ollama |
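A client can avoid relying on the server-side queue by capping its own in-flight requests at the plan's concurrency figure. The following is a minimal sketch using `asyncio.Semaphore`; the names `PLAN_CONCURRENCY`, `run_with_plan_limit`, and `demo` are illustrative and not part of any Ollama API, and `fake_request` stands in for a real cloud call.

```python
import asyncio

# Concurrency caps copied from the plan table above.
# Team is omitted because its limit is not announced.
PLAN_CONCURRENCY = {"free": 1, "pro": 3, "max": 10}


async def run_with_plan_limit(plan, jobs):
    """Run coroutine factories with at most PLAN_CONCURRENCY[plan] in flight."""
    semaphore = asyncio.Semaphore(PLAN_CONCURRENCY[plan])

    async def guarded(job):
        # Each job waits here until one of the plan's slots is free.
        async with semaphore:
            return await job()

    # gather() returns results in the same order as `jobs`.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    """Send 8 fake requests on the Pro cap and record peak concurrency."""
    in_flight = 0
    peak = 0

    async def fake_request():
        # Placeholder for a real request; it only tracks concurrency.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return "ok"

    results = await run_with_plan_limit("pro", [fake_request] * 8)
    return peak, results
```

Calling `asyncio.run(demo())` returns the peak number of simultaneous fake requests together with the results list, which makes it easy to confirm the cap holds before pointing the limiter at real traffic.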
Usage balance and resets
Ollama Cloud charges by GPU time, not token count. Usage balance resets on a rolling weekly cycle:
- Weekly limit — resets every 7 days. The overall cap for a rolling week.
An email alert fires at 90% of your plan limit. Pro and Max users can purchase additional usage balance when the plan balance is exhausted — Ollama does not publicly list the per-unit price, so check your account dashboard.
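To estimate when an exhausted balance will be restored, the 7-day cycle can be sketched as below. This assumes the cycle is anchored at a known start time; `cycle_start` is a hypothetical value you would take from your account dashboard, and Ollama does not document how the cycle is anchored, so treat the result as an estimate only.

```python
from datetime import datetime, timedelta, timezone

WEEK = timedelta(days=7)


def next_reset(cycle_start, now):
    """Return the first reset after `now`, assuming resets every 7 days from cycle_start.

    ASSUMPTION: the anchoring of the cycle is not documented by Ollama.
    """
    # Whole 7-day cycles that have fully elapsed since the anchor.
    elapsed_cycles = (now - cycle_start) // WEEK
    return cycle_start + (elapsed_cycles + 1) * WEEK


start = datetime(2026, 6, 1, tzinfo=timezone.utc)
now = datetime(2026, 6, 10, tzinfo=timezone.utc)
upcoming = next_reset(start, now)  # 2026-06-15 00:00 UTC under this assumption
```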
Model usage levels
Each cloud model has a usage level from 1 to 4. Heavier models consume more GPU time per request and therefore drain your plan balance faster.
| Level | Description | Example models |
|---|---|---|
| 1 | Light | gpt-oss:20b and similar small models |
| 2 | Moderate | Mid-size models |
| 3 | Heavy | Large models |
| 4 | Extra heavy | deepseek-v4-pro and similar flagship models |
Ollama does not publish the exact GPU-seconds each level consumes. Check your account usage dashboard or the model's detail page on ollama.com for the level of a specific model.
Rate limits (RPM / TPM / context)
Ollama Cloud does not publicly list requests-per-minute (RPM) limits, tokens-per-minute (TPM) limits, or maximum context window sizes in its documentation as of June 2026. The only usage constraints described publicly are the concurrency limits and GPU-time usage balance above.
If you need exact rate limit numbers for production planning:
- Check ollama.com/cloud — limits may have been added since this page was last updated.
- Check the Ollama Cloud documentation for any new constraints.
- Contact Ollama at [email protected] for Team plan details.
Model deprecation schedule
Ollama Cloud deprecates cloud models with advance notice via email and the website. The following models were announced for retirement on June 16, 2026:
| Retiring model | Replacement |
|---|---|
| kimi-k2-thinking | kimi-k2.6 |
| kimi-k2:1t | kimi-k2.6 |
| minimax-m2 | minimax-m3 |
| glm-4.6 | glm-5.1 |
| qwen3-next:80b | qwen3.5 |
| qwen3-vl:235b | qwen3.5 |
| qwen3-vl:235b-instruct | qwen3.5 |
| cogito-2.1:671b | deepseek-v4-flash |
Deprecations only affect cloud models. Local models are not affected. Source: docs.ollama.com/cloud.
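If your code pins model names, a small lookup can redirect retiring names to their replacements. This is a sketch that copies the table above verbatim; `REPLACEMENTS` and `resolve_model` are illustrative names, and the exact tag your client must send (for example, any cloud suffix) is not covered here and should be checked against the model's page.

```python
# Retiring model -> replacement, copied from the deprecation table above.
REPLACEMENTS = {
    "kimi-k2-thinking": "kimi-k2.6",
    "kimi-k2:1t": "kimi-k2.6",
    "minimax-m2": "minimax-m3",
    "glm-4.6": "glm-5.1",
    "qwen3-next:80b": "qwen3.5",
    "qwen3-vl:235b": "qwen3.5",
    "qwen3-vl:235b-instruct": "qwen3.5",
    "cogito-2.1:671b": "deepseek-v4-flash",
}


def resolve_model(name):
    """Return the listed replacement for a retiring model, or the name unchanged."""
    return REPLACEMENTS.get(name, name)
```

For example, `resolve_model("glm-4.6")` returns `"glm-5.1"`, while a name that is not in the table, such as `"gpt-oss:20b"`, is returned as is.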
How limits affect benchmark numbers
The benchmarks on this site run on the Ollama Cloud Pro plan. They run sequentially (one request at a time), so concurrency limits do not affect our measurements. Usage balance limits can affect data freshness: if the balance is exhausted, benchmarks pause until the weekly reset.
See the full methodology for details, including how circuit-breaker logic and rate-limit backoff work.
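The methodology page remains the authority on how this site handles rate limits. As a generic illustration of rate-limit backoff only, and not a description of this site's implementation, the sketch below retries a rejected call with capped exponential delays plus jitter. `RateLimited` and `call_with_backoff` are hypothetical names; a real client would need to map its own rejection signal onto the exception.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by a client wrapper when a request is rejected."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call send(); on RateLimited, wait with capped exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the rejection to the caller.
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter of up to 50% spreads out retries from parallel clients.
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists so the waiting behavior can be replaced in tests; in normal use it can be left at its default.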