Guide: LLM RAM / VRAM estimator
What is this tool?
A free LLM VRAM calculator and GPU memory estimator for decoder-only transformers. It approximates model weights from parameter count and quantization (FP32, FP16, INT8, INT4, Q4-class), optionally adds KV cache from context length and batch size, then applies an overhead percentage. Use it for capacity planning — not as a substitute for profiling on your hardware, framework, or serving stack. Everything runs in your browser.
How weights & KV cache are estimated
- Weights — Bytes ≈ parameters × bytes-per-parameter for the selected dtype (e.g. FP16 ≈ 2 bytes/param; Q4-class uses a GGUF-style rule-of-thumb density).
- KV cache — Per layer, K and V tensors shaped by batch, sequence length, KV heads, and head dimension:
2 × layers × batch × seq × kv_heads × head_dim × elem_bytes. Optional; you can disable KV to see the weights-only footprint.
- Overhead — Extra percentage on top of weights + KV for activations, allocator slack, CUDA graphs, etc. Tune it to match how conservative you want the estimate to be.
MoE, CPU offload, multi-GPU sharding, and flash / sparse attention change real memory — this tool does not model those paths.
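The formulas above can be sketched as a small function. This is an illustrative reimplementation, not the tool's actual code; the example layer/head shape for a 7B model is an assumption, and `overhead_pct` is whatever margin you choose.

```python
def estimate_vram_gb(params_b, bytes_per_param,
                     layers, kv_heads, head_dim,
                     batch=1, seq_len=4096,
                     kv_elem_bytes=2, include_kv=True,
                     overhead_pct=10):
    """Weights + optional KV cache + overhead, in GiB."""
    GB = 1024 ** 3
    # Weights: parameters x bytes per parameter for the chosen dtype.
    weights = params_b * 1e9 * bytes_per_param
    kv = 0
    if include_kv:
        # 2 (K and V) x layers x batch x seq x kv_heads x head_dim x elem bytes
        kv = 2 * layers * batch * seq_len * kv_heads * head_dim * kv_elem_bytes
    total = (weights + kv) * (1 + overhead_pct / 100)
    return {"weights_gb": weights / GB, "kv_gb": kv / GB, "total_gb": total / GB}

# Illustrative 7B-like shape in FP16: 32 layers, 32 KV heads, head_dim 128.
print(estimate_vram_gb(7, 2, layers=32, kv_heads=32, head_dim=128, seq_len=4096))
```

With these inputs the weights come to roughly 13 GiB and the 4K-context KV cache to 2 GiB before overhead.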
Features
- Presets — 7B, 8B, 13B, 70B (GQA), plus custom parameter count with derived layer/head heuristics.
- Quantization — FP32, FP16, INT8, INT4, Q4-class for weight bytes.
- Context & batch — Sequence length (with quick chips, e.g. 4K–128K), batch size up to 32.
- KV toggle — Include or exclude KV cache (FP16-style element size when on).
- Breakdown — Weights, KV, subtotal, and total after overhead in GB.
How to use
- Pick a model size — Choose a preset or enter custom billions of parameters.
- Set weight format — Match how you plan to load the model (FP16 vs quantized).
- Set context & batch — Align sequence length and batch with your inference scenario.
- Toggle KV & overhead — Turn KV on for long-context generation; adjust overhead for your comfort margin.
- Read totals — Compare weights vs KV; if KV dominates, context or batch is driving VRAM.
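The steps above can be followed with a concrete worked example. The numbers here are hypothetical: a 13B model loaded in INT4 (0.5 bytes/param), 8K context, batch 1, KV on with an FP16-style cache, 10% overhead; the 40-layer / 40-KV-head / head_dim-128 shape is an illustrative assumption.

```python
# Worked example of the estimator's arithmetic (illustrative shape, not a real checkpoint).
GB = 1024 ** 3
weights = 13e9 * 0.5                          # 13B params x 0.5 bytes/param (INT4)
kv = 2 * 40 * 1 * 8192 * 40 * 128 * 2         # K+V x layers x batch x seq x heads x dim x bytes
total = (weights + kv) * 1.10                 # 10% overhead on the subtotal
print(f"weights {weights/GB:.1f} GiB + KV {kv/GB:.1f} GiB -> total {total/GB:.1f} GiB")
```

Note that at 8K context the KV cache is already comparable to the quantized weights, which is exactly the "if KV dominates" signal the last step describes.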
Use cases
| Scenario | How this helps |
|---|---|
| Buying a GPU | Ballpark whether a 7B at Q4 or a 70B fits alongside KV cache at your target context. |
| Batching APIs | See how batch > 1 scales KV and total memory. |
| Long context | Compare 32K vs 128K — KV often becomes the dominant term. |
| Quantization tradeoffs | Switch INT4 / Q4-class vs FP16 to estimate weight savings before conversion work. |
Limits & disclaimers
- Decoder-only — Encoder-decoder and non-transformer architectures are out of scope.
- Simplified KV — Real systems use paging, quantization of cache, or fused kernels; numbers are idealized.
- Presets are illustrative — Layer and head counts approximate “typical” shapes; your checkpoint may differ.
- Profile before production — Use vendor tools (nvidia-smi, framework memory stats) for ground truth.
Related terms
People often search for LLM VRAM calculator, GPU memory calculator for LLM, how much VRAM for 7B model, KV cache memory calculator, LLM model size in GB, quantization VRAM, Llama memory estimate, inference RAM estimator, and transformer memory footprint. This page focuses on weights + optional KV + overhead for quick planning.
FAQ
Is this LLM VRAM calculator free?
Yes. It runs in your browser with no sign-up.
Will my GPU match these GB numbers exactly?
Unlikely. Framework overhead, kernels, and system reserved memory differ. Treat output as a ballpark and profile on device.
Why does KV cache explode at long context?
KV grows with sequence length (and batch). At long contexts it can exceed weights — that is expected in this simplified model.
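The linear growth is easy to see from the per-token KV cost. A quick sketch for an illustrative 7B-like shape (32 layers, 32 KV heads, head_dim 128, FP16 cache) follows; real models using GQA have fewer KV heads and a proportionally smaller cache.

```python
# KV cache bytes per token: K+V x layers x kv_heads x head_dim x elem bytes.
per_token = 2 * 32 * 32 * 128 * 2   # 524,288 bytes, ~0.5 MiB per token
for seq in (4096, 32768, 131072):
    print(f"{seq:>6} tokens -> {per_token * seq / 1024**3:.1f} GiB of KV cache")
```

At this assumed shape, 128K tokens of cache would dwarf the ~13 GiB of FP16 weights, which is the "KV exceeds weights" effect described above.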
Does this support MoE or pipeline parallel?
No. Expert routing and sharding change which weights live on each device; use this for single-device or rough totals only.
Similar tools
Other Spoold utilities that pair with this estimator include the Token & context budget, Token calculator, and Token speed simulator.
Conclusion
Use this LLM RAM / VRAM estimator for quick weight and KV planning. For token counts and context limits, use Token & context budget or Token calculator; for latency what-ifs, use Token speed simulator.