
Token generation speed

Guide

Separate prefill (prompt processing) and decode (autoregressive generation) throughput. Real systems batch, speculate, and overlap I/O — use this as a quick what-if, not a benchmark.

Prefill time: 0.205 s
Decode time: 2.327 s
Total time: 2.532 s
Effective average ((prompt + output) / total time): 303.3 tok/s

Guide: Token generation speed


What is this tool?

A free LLM inference speed calculator and token generation latency estimator. Enter prefill (prompt) and decode throughput in tokens per second, plus how many prompt tokens and how many new tokens to generate. The tool shows prefill time, decode time, total latency, and an effective average tok/s over the whole request. Use it for what-if planning — not as a benchmark. Runs in your browser.

Prefill vs decode

Prefill processes the full prompt (often in parallel across the sequence) and is typically compute-bound: large matrix multiplies over many tokens at once. Decode generates one token at a time (autoregressive) and is typically memory-bandwidth-bound, since the weights and KV cache are re-read on every step. Real stacks report different tok/s for each phase — this tool keeps them separate so you can match numbers from your profiler or vendor docs.

  • High prefill, low decode — Long prompts feel slow to "start" even if streaming is steady afterward.
  • Low prefill, high decode — Short prompts start fast; long answers still take time.

How timing works

With prompt token count P, new tokens N, prefill tok/s Tp, and decode tok/s Td:

  • Prefill time ≈ P / Tp
  • Decode time ≈ N / Td
  • Total ≈ sum of the two (no overlap modeled)
  • Effective average tok/s = (P + N) / total_seconds

Production systems pipeline, batch, and use speculative decoding — this model is intentionally simple.
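The arithmetic above is small enough to sketch directly. A minimal version in Python, assuming the same clamp-to-1-tok/s behavior the tool describes (the function and field names are illustrative, not the tool's actual code):

```python
def request_latency(prompt_tokens: int, new_tokens: int,
                    prefill_tps: float, decode_tps: float) -> dict:
    """Split request latency into prefill and decode; no overlap modeled."""
    # Clamp throughputs to at least 1 tok/s to avoid divide-by-zero.
    prefill_tps = max(prefill_tps, 1.0)
    decode_tps = max(decode_tps, 1.0)

    prefill_s = prompt_tokens / prefill_tps   # P / Tp
    decode_s = new_tokens / decode_tps        # N / Td
    total_s = prefill_s + decode_s
    # Effective average counts prompt + output tokens over the whole request.
    effective_tps = (prompt_tokens + new_tokens) / total_s if total_s else 0.0
    return {"prefill_s": prefill_s, "decode_s": decode_s,
            "total_s": total_s, "effective_tps": effective_tps}

# e.g. a 512-token prompt and 256 new tokens at 2500 / 110 tok/s
r = request_latency(512, 256, 2500, 110)
```

With those inputs, prefill takes about 0.205 s, decode about 2.327 s, and the effective average lands near 303 tok/s — well below the decode rate is not the case here; it sits between the two phase rates, weighted by time spent in each.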

Features

  • Presets — CPU-ish prefill; laptop / desktop / strong GPU (7B Q4, 13B Q4); server 70B multi-GPU (illustrative tok/s pairs).
  • Custom throughput — Edit prefill and decode tok/s directly.
  • Token counts — Prompt tokens and new tokens to generate.
  • Results — Prefill s, decode s, total s, effective average tok/s.

How to use

  1. Pick a preset or type your measured prefill and decode tok/s.
  2. Set prompt tokens — From a tokenizer or rough estimate.
  3. Set new tokens — Expected completion length (max output cap, typical reply, etc.).
  4. Read times — See which phase dominates for your scenario.

Use cases

Scenario — how this helps:

  • RAG / long system prompt — Large P inflates prefill time; compare before trimming context.
  • Chat UI SLA — Estimate time to first token vs time to finish a 256-token reply.
  • Hardware comparison — Plug in tok/s from two GPUs and compare total latency for the same P and N.
  • Education — See why prefill and decode are not interchangeable metrics.
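The RAG row can be made concrete with the two timing formulas alone (the throughput numbers here are made-up for illustration):

```python
# Illustrative throughputs, not measurements: 2000 tok/s prefill, 100 tok/s decode.
PREFILL_TPS, DECODE_TPS = 2000.0, 100.0
REPLY_TOKENS = 256

latencies = {}
for prompt_tokens in (4000, 1000):  # full RAG context vs trimmed context
    ttft = prompt_tokens / PREFILL_TPS        # time to first token ~= prefill time
    total = ttft + REPLY_TOKENS / DECODE_TPS  # plus the decode phase
    latencies[prompt_tokens] = (ttft, total)

# Trimming the prompt from 4000 to 1000 tokens cuts TTFT from 2.0 s to 0.5 s,
# while total time only drops from 4.56 s to 3.06 s (decode dominates it).
```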

Limits & disclaimers

  • Presets are fictional ballparks — Not measured on your machine or model.
  • No batching, speculation, or overlap — Real servers hide latency with concurrency.
  • Quantization & context affect tok/s — Use numbers from your actual run.
  • Benchmark with your stack — vLLM, llama.cpp, TensorRT-LLM, etc. each report differently.

People often search for LLM tokens per second calculator, inference latency calculator, prefill vs decode time, time to first token estimate, generation speed calculator, LLM throughput estimator, decode tokens per second, and prompt processing speed. This page splits prefill from decode and adds the two times — a deliberately simple model for planning.

FAQ

Is this token speed simulator free?

Yes. It runs client-side in your browser.

Why is my effective tok/s lower than decode tok/s?

Effective average divides total tokens (prompt + output) by total time, so it is a time-weighted blend of the two phase rates. Whenever your prefill tok/s is below your decode tok/s, the time spent in prefill pulls the average below the decode rate.
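A quick worked example, with illustrative numbers where the prompt pass is slower than decode:

```python
# Illustrative numbers: prefill slower than decode (e.g. a weak CPU prompt pass).
P, N = 400, 200                # prompt tokens, new tokens
Tp, Td = 40.0, 100.0           # prefill and decode tok/s
total_s = P / Tp + N / Td      # 10.0 s + 2.0 s = 12.0 s
effective = (P + N) / total_s  # 600 / 12 = 50.0 tok/s, half the decode rate
```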

Can I use this as a benchmark?

No — use your framework’s benchmarks and hardware measurements. This is a calculator from numbers you supply.

What if prefill or decode is zero tokens?

Zero prompt tokens makes prefill time 0; zero new tokens makes decode time 0. Throughput inputs are clamped to at least 1 tok/s to avoid divide-by-zero.


Conclusion

Use Token generation speed to reason about prefill vs decode latency from tok/s and token counts. For VRAM planning, use LLM RAM / VRAM; for token counts, use Token calculator or Token & context budget; for wall-clock conversions, see Unix time.