Ollama Cloud limits

This page covers the limits Ollama Cloud places on your usage, by plan tier, concurrency, and model weight. All figures come from ollama.com/cloud (June 2026). For tokens-per-minute or requests-per-minute limits, see the rate limits section below.

Concurrency limits

Each plan limits how many cloud model requests can be in-flight at once. Requests over the limit are queued; once the queue fills, additional requests are rejected until a slot opens.

Plan	Concurrent cloud models	Price
Free	1	$0
Pro	3	$20/mo or $200/yr
Max	10	$100/mo
Team	TBA	Contact Ollama
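
Because requests over the cap are queued and may eventually be rejected, it can help to cap in-flight requests on the client side as well. The following is a minimal sketch, assuming an asyncio-based client; `run_limited` and `fake_request` are hypothetical names, and `fake_request` stands in for whatever call actually reaches a cloud model. It is not part of any Ollama library.

```python
import asyncio

# Concurrency caps copied from the plan table above; Team is unpublished (TBA).
PLAN_CONCURRENCY = {"free": 1, "pro": 3, "max": 10}


async def run_limited(plan, jobs):
    """Run coroutine factories with at most PLAN_CONCURRENCY[plan] in flight.

    Returns (results, peak), where peak is the highest number of jobs
    observed running at the same time.
    """
    limit = asyncio.Semaphore(PLAN_CONCURRENCY[plan])
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with limit:  # waits here while all slots are taken
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_request(i):
    # Hypothetical stand-in for a real cloud model call.
    await asyncio.sleep(0.01)
    return i * 2


def demo():
    # Eight jobs on the Pro cap: never more than three run at once.
    jobs = [lambda i=i: fake_request(i) for i in range(8)]
    return asyncio.run(run_limited("pro", jobs))
```

Calling `demo()` returns the eight results in submission order along with the observed peak concurrency, which stays at the Pro cap of 3.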

Usage balance and resets

Ollama Cloud charges by GPU time, not token count. The usage balance resets on a rolling weekly cycle: the weekly limit is the overall cap for a rolling week and resets every 7 days.

An email alert fires at 90% of your plan limit. Pro and Max users can purchase additional usage balance when the plan balance is exhausted. Ollama does not publicly list the per-unit price, so check your account dashboard.
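
The 90% alert threshold and the 7-day reset can be turned into simple arithmetic for your own monitoring. This is a sketch under stated assumptions: the usage units are whatever your dashboard reports, and resets are assumed to land at fixed 7-day steps from a known cycle start, which Ollama does not document. `alert_due` and `next_reset` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

ALERT_PERCENT = 90           # email alert threshold stated above
CYCLE = timedelta(days=7)    # weekly reset stated above


def alert_due(used, limit):
    """True once usage reaches 90% of the plan limit."""
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return used * 100 >= ALERT_PERCENT * limit


def next_reset(cycle_start, now):
    """First reset strictly after `now`.

    Assumption: resets fall at cycle_start + k * 7 days for integer k.
    """
    elapsed_cycles = (now - cycle_start) // CYCLE
    return cycle_start + (elapsed_cycles + 1) * CYCLE
```

For example, with a cycle that started on 1 June 2026 (UTC), `next_reset` called on 3 June returns 8 June, and `alert_due(90, 100)` is true.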

Model usage levels

Each cloud model is assigned a usage level from 1 to 4. Heavier models consume more GPU time per request and therefore drain your plan balance faster.

Level	Description	Example models
1	Light	gpt-oss:20b and similar small models
2	Moderate	Mid-size models
3	Heavy	Large models
4	Extra heavy	deepseek-v4-pro and similar flagship models

Ollama does not publish the exact GPU-seconds each level consumes. Check your account usage dashboard or the model's detail page on ollama.com for the level of a specific model.

Rate limits (RPM / TPM / context)

Ollama Cloud does not publicly list requests-per-minute (RPM) limits, tokens-per-minute (TPM) limits, or maximum context window sizes in its documentation as of June 2026. The only usage constraints described publicly are the concurrency limits and GPU-time usage balance above.

If you need exact rate limit numbers for production planning:

Check ollama.com/cloud — limits may have been added since this page was last updated.
Check the Ollama Cloud documentation for any new constraints.
Contact Ollama at [email protected] for Team plan details.
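
With no published RPM or TPM figures, a client cannot pace itself against a known number, so a common defensive pattern is to retry rejected requests with exponential backoff and jitter. The sketch below is generic and makes an assumption: how Ollama Cloud signals a rejection is not described on this page, so `Rejected` is a hypothetical exception that your own request wrapper would raise. The retry counts and delays are illustrative defaults, not recommended values.

```python
import random
import time


class Rejected(Exception):
    """Hypothetical error raised by the caller's wrapper when a request is refused."""


def call_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call send(); on Rejected, wait with exponential backoff plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except Rejected:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the rejection
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter of up to 50% spreads out retries from parallel clients.
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists so the wait can be swapped out in tests; in normal use the default `time.sleep` applies.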

Model deprecation schedule

Ollama Cloud deprecates cloud models with advance notice via email and the website. The following models were announced for retirement on June 16, 2026:

Retiring model	Replacement
kimi-k2-thinking	kimi-k2.6
kimi-k2:1t	kimi-k2.6
minimax-m2	minimax-m3
glm-4.6	glm-5.1
qwen3-next:80b	qwen3.5
qwen3-vl:235b	qwen3.5
qwen3-vl:235b-instruct	qwen3.5
cogito-2.1:671b	deepseek-v4-flash

Deprecations only affect cloud models. Local models are not affected. Source: docs.ollama.com/cloud.
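
If your code pins cloud model names, the table above can be encoded as a lookup so that retiring names are swapped for their listed replacements. This is a minimal sketch; `resolve_model` is a hypothetical helper, and the mapping simply copies the table, so it needs updating whenever Ollama announces new retirements.

```python
# Retiring model -> replacement, copied from the table above.
REPLACEMENTS = {
    "kimi-k2-thinking": "kimi-k2.6",
    "kimi-k2:1t": "kimi-k2.6",
    "minimax-m2": "minimax-m3",
    "glm-4.6": "glm-5.1",
    "qwen3-next:80b": "qwen3.5",
    "qwen3-vl:235b": "qwen3.5",
    "qwen3-vl:235b-instruct": "qwen3.5",
    "cogito-2.1:671b": "deepseek-v4-flash",
}


def resolve_model(name):
    """Return the listed replacement for a retiring model, or the name unchanged."""
    return REPLACEMENTS.get(name, name)
```

Names that are not in the table, such as `gpt-oss:20b`, pass through unchanged.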

How limits affect benchmark numbers

The benchmarks on this site run on the Ollama Cloud Pro plan. Benchmarks run sequentially (one request at a time), so concurrency limits do not affect our measurements. Usage balance limits can affect data freshness: if the balance is exhausted, benchmarks pause until the weekly reset.

See the full methodology for details, including how circuit-breaker logic and rate-limit backoff work.

Source: ollama.com/cloud and docs.ollama.com/cloud, retrieved June 2026. Always verify at the source — limits and pricing can change without notice.