Methodology

Every number on this site comes from a reproducible automated benchmark. This page documents exactly what we measure, how we measure it, and what we do not claim.

What one benchmark run looks like

The worker sends a single streaming chat-completion request to the provider's OpenAI-compatible API endpoint. The model is asked to write a 400-word prose explanation of HTTP request routing. max_tokens is capped at 300 — enough to produce a full streaming response without waiting for a very long generation, and consistent across every model so comparisons are fair. The prompt and cap are fixed; they never change between runs.

TTFT — time to first token

TTFT (milliseconds) is measured from the moment the HTTP request is dispatched to the moment the first non-empty content chunk arrives in the stream. It captures network round-trip plus the provider's prompt-processing time. It does not include DNS or TLS handshake if a keep-alive connection is reused, but those costs are typical of real API usage.

TPS — tokens per second

TPS measures generation throughput — how fast the model emits output tokens, excluding the initial wait (TTFT).

For Ollama Cloud we use the server's own reported timing rather than measuring token arrival on our end. Each response includes eval_count (tokens generated) and total_duration (total server-side time). We compute:

TPS = eval_count ÷ (total_duration − time_to_first_token)

Reading the server's timing makes the number immune to network jitter and token buffering — some models stream their whole answer in a burst, which would wildly inflate a naïve client-side stopwatch. Subtracting time-to-first-token isolates generation from prompt processing and queueing. A run is discarded as malformed if the server returns no token count or generation time.

Timeout

Each request has a hard timeout of 120 seconds. If no complete response arrives in that window the run is recorded as a timeout error and counted against reliability.

Error taxonomy

Every failed run is classified into one of six error kinds:

auth — HTTP 401 or 403 (bad or expired API key)
rate_limit — HTTP 429 (provider throttle)
server — HTTP 5xx (provider-side error)
timeout — no complete response within 120 s
network — TCP/fetch-level failure (ECONNREFUSED, etc.)
malformed — response arrived but was unparseable or too short

Failed runs are stored with ok = false and excluded from TPS and TTFT statistics. They are counted in the reliability percentage (success rate = successful runs ÷ total runs in the window).

Benchmark cadence

The worker runs benchmarks continuously using a round-robin priority queue. Each (provider, model) pair has a target interval of 10 minutes. The scheduler always picks the most-overdue pair next, so the order naturally staggers across models without fixed cron slots.

Circuit breaker: if a model records 3 consecutive failures, it is dropped to 30-minute probe intervals until it recovers (a successful run resets the counter). This prevents a failing model from flooding the queue.

Rate-limit backoff: a 429 response pushes that provider's next benchmark slot back by 5 minutes, giving the provider time to recover without hammering a quota.

Data retention

Raw samples — kept for 60 days, then deleted.
Hourly rollups (avg/min/max/p50/p95 TPS, avg TTFT, success rate) — kept forever.
Daily rollups — kept forever.

Chart windows pick the right table automatically: 24 h and 7 d windows use raw samples; 30 d and 1 y windows use hourly or daily rollups.

Reasoning mode

Some Ollama Cloud models support extended thinking (reasoning) via the native think API parameter. When a model advertises the thinking capability (/api/show), we benchmark with reasoning enabled — think: true for most models, or the effort level that matches the linked Artificial Analysis entry (e.g. think: "high" for gpt-oss, which AA scores as the high-effort variant). Models without the thinking capability are benchmarked as a single endpoint with no think parameter.

On the leaderboard, model names are suffixed with (non-reasoning) when our benchmark and Intelligence Index refer to AA's non-reasoning variant. Reasoning models are shown without an extra label — the default case.

Plan tiers — what we actually benchmark

Provider pricing tiers can change the models available and the speed a given model runs at. We benchmark on the following plans:

Ollama Cloud — premium plan. Full model catalog coverage. Premium-gated models (such as deepseek-v4 and others behind the paywall) are included. Speed numbers reflect the premium-tier infrastructure.
OpenCode Zen — pay-per-use API (Zen endpoint). Benchmarked only when an active key is configured.
OpenCode Go — monthly subscription plan (Go endpoint). Different model set and endpoint from Zen; benchmarked separately where enabled.

If you are on a free or lower tier you may see different throughput. Our numbers are not a ceiling — they reflect our specific plan and the state of the provider's infrastructure at measurement time.

Sequential, not parallel

Benchmarks run sequentially — one request at a time, waiting for the full response before the next. This mirrors realistic single-client usage and avoids inflating TPS numbers by running requests in parallel (which would share provider capacity).

What we do not claim

Absolute throughput — numbers depend on network path, time of day, and provider load. Treat them as relative indicators, not hardware specs.
Batch or parallel throughput — if your workload sends many concurrent requests, throughput per request will differ.
Internal SLA compliance — we measure from outside the provider's network; TTFT includes our egress latency.

Open questions and feedback

If you notice a measurement that looks wrong, or want to suggest an improvement to the methodology, open an issue or start a discussion in the project repository. Accuracy and transparency are the point.