Guide: LLM RAM / VRAM estimator
What is this tool?
A free LLM VRAM calculator and GPU memory estimator for decoder-only transformers. It approximates model weights from parameter count and quantization (FP32, FP16, INT8, INT4, Q4-class), optionally adds KV cache from context length and batch size, then applies an overhead percentage. Use it for capacity planning — not as a substitute for profiling on your hardware, framework, or serving stack. Everything runs in your browser.
How weights & KV cache are estimated
- Weights — Bytes ≈ parameters × bytes-per-parameter for the selected dtype (e.g. FP16 ≈ 2 bytes/param; Q4-class uses a GGUF-style rule-of-thumb density).
- KV cache — Per layer, K and V tensors are shaped by batch, sequence length, KV heads, and head dimension: 2 × layers × batch × seq × kv_heads × head_dim × elem_bytes, where the leading 2 counts K and V. Optional; you can disable KV to see the weights-only footprint.
- Overhead — Extra percent on top of weights + KV for activations, allocator slack, CUDA graphs, etc. Tune it to match how conservative you want the estimate to be.
MoE, CPU offload, multi-GPU sharding, and flash / sparse attention change real memory — this tool does not model those paths.
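The three terms above (weights, KV cache, overhead) can be sketched in a few lines of Python. This is a minimal illustration of the stated formulas, not the tool's actual source: the function names, the 10% default overhead, the `q4` density of about 4.5 bits per parameter, and the 7B-like layer and head counts are all assumptions made for the example.

```python
# Bytes per parameter by weight format. The first four follow from bit width;
# the "q4" entry (~4.5 bits/param) is an ASSUMED rule of thumb, not the tool's constant.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5, "q4": 0.5625}


def weight_bytes(params, dtype):
    """Weights ~= parameters x bytes-per-parameter."""
    return params * BYTES_PER_PARAM[dtype]


def kv_cache_bytes(layers, batch, seq, kv_heads, head_dim, elem_bytes=2):
    """K and V (hence the leading 2) for every layer, position, and KV head."""
    return 2 * layers * batch * seq * kv_heads * head_dim * elem_bytes


def estimate(params, dtype, layers, batch, seq, kv_heads, head_dim,
             include_kv=True, overhead_pct=10.0):
    """Return a byte breakdown: weights, KV, subtotal, and total after overhead."""
    weights = weight_bytes(params, dtype)
    kv = kv_cache_bytes(layers, batch, seq, kv_heads, head_dim) if include_kv else 0
    subtotal = weights + kv
    total = subtotal * (1 + overhead_pct / 100)
    return {"weights": weights, "kv": kv, "subtotal": subtotal, "total": total}


# Hypothetical 7B-like shape: 32 layers, 32 KV heads, head_dim 128 (no GQA).
result = estimate(7e9, "fp16", layers=32, batch=1, seq=4096, kv_heads=32, head_dim=128)
for name, value in result.items():
    print(f"{name:>8}: {value / 1e9:6.2f} GB")  # GB = 10^9 bytes here (assumption)
```

With this assumed shape, the KV term at 4K context and batch 1 is exactly 2 GiB (2,147,483,648 bytes) next to 14 GB of FP16 weights. Passing `include_kv=False` gives the weights-only footprint.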
Features
- Presets — 7B, 8B, 13B, 70B (GQA), plus custom parameter count with derived layer/head heuristics.
- Quantization — FP32, FP16, INT8, INT4, Q4-class for weight bytes.
- Context & batch — Sequence length (with quick chips e.g. 4K–128K), batch up to 32.
- KV toggle — Include or exclude KV cache (FP16-style element size when on).
- Breakdown — Weights, KV, subtotal, and total after overhead in GB.
How to use
- Pick a model size — Choose a preset or enter custom billions of parameters.
- Set weight format — Match how you plan to load the model (FP16 vs quantized).
- Set context & batch — Align sequence length and batch with your inference scenario.
- Toggle KV & overhead — Turn KV on for long-context generation; adjust overhead for your comfort margin.
- Read totals — Compare weights vs KV; if KV dominates, context or batch is driving VRAM.
Use cases
| Scenario | How this helps |
|---|---|
| Buying a GPU | Ballpark whether a 7B Q4 or a 70B model fits alongside its KV cache at your target context. |
| Batching APIs | See how batch > 1 scales KV and total memory. |
| Long context | Compare 32K vs 128K — KV often becomes the dominant term. |
| Quantization tradeoffs | Switch INT4 / Q4-class vs FP16 to estimate weight savings before conversion work. |
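For the GPU-buying and long-context scenarios, the same formula can be turned around to ask how much context fits in a given memory budget. The sketch below is an illustration under assumptions: `max_context_tokens` is a hypothetical helper (not part of the tool), the 10% overhead is an arbitrary default, and the 32-layer, 32-KV-head, 128-dim shape is a stand-in for a 7B-class model.

```python
def max_context_tokens(vram_bytes, weight_bytes, layers, kv_heads, head_dim,
                       batch=1, elem_bytes=2, overhead_pct=10.0):
    """Largest sequence length whose KV cache fits after weights and overhead.

    Inverts: (weights + kv) * (1 + overhead) <= vram.
    """
    budget = vram_bytes / (1 + overhead_pct / 100) - weight_bytes
    if budget <= 0:
        return 0  # weights alone (plus overhead) already exceed the budget
    # KV bytes added per token of context, summed over the whole batch.
    per_token = 2 * layers * batch * kv_heads * head_dim * elem_bytes
    return int(budget // per_token)


# Hypothetical case: 24 GB card, 7B model at INT4 (7e9 * 0.5 bytes).
tokens = max_context_tokens(24e9, 7e9 * 0.5, layers=32, kv_heads=32, head_dim=128)
print(f"approx. max context at batch 1: {tokens} tokens")
```

Under these assumptions the answer is roughly 35K tokens at batch 1, and doubling the batch halves it. A model that uses GQA (fewer KV heads) would fit proportionally more context.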
Limits & disclaimers
- Decoder-only — Encoder-decoder and non-transformer architectures are out of scope.
- Simplified KV — Real systems use paging, quantization of cache, or fused kernels; numbers are idealized.
- Presets are illustrative — Layer and head counts approximate “typical” shapes; your checkpoint may differ.
- Profile before production — Use vendor tools (nvidia-smi, framework memory stats) for ground truth.
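As one way to get that ground truth, the sketch below reads used and total memory per GPU from `nvidia-smi` and parses its CSV output. The helper names are hypothetical; the command uses the `--query-gpu` and `--format=csv,noheader,nounits` options, which report memory in MiB. It assumes an NVIDIA driver is installed and that every field is numeric (some devices report `[N/A]`, which this parser does not handle).

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, as bare CSV numbers (MiB).
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text):
    """Parse lines like '1234, 24576' into (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...]; requires an NVIDIA driver."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)
```

Calling `read_gpu_memory()` before and after loading a model gives a device-level delta to compare with the estimate. Inside PyTorch, `torch.cuda.max_memory_allocated()` is one framework-level counter that serves the same purpose.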
Related terms
Planning: LLM VRAM calculator, GPU memory calculator for LLM, how much VRAM for 7B model, 13B 70B VRAM chart rough, LLM model size in GB, parameter count to disk size.
Quantization: quantization VRAM, fp16 vs bf16 vs int8 vs int4, GGUF Q4_K_M size (concept), AWQ GPTQ memory savings, bitsandbytes 8bit estimate.
Inference: KV cache memory calculator, context length VRAM scaling, batch size VRAM multiply, Llama memory estimate, Mistral Mixtral MoE RAM note, transformer memory footprint.
CPU / edge: inference RAM estimator, llama.cpp mmap footprint. This page focuses on weights + optional KV + overhead for quick planning, not vendor benchmarks.
FAQ
Is this LLM VRAM calculator free?
Yes. It runs in your browser with no sign-up.
Will my GPU match these GB numbers exactly?
Unlikely. Framework overhead, kernel choices, and system-reserved memory differ between setups. Treat the output as a ballpark and profile on your own device.
Why does KV cache explode at long context?
KV grows linearly with sequence length and with batch size. At long contexts it can exceed the weights; that is expected in this simplified model.
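A small worked example makes the scaling concrete. It assumes a hypothetical 7B-class shape without GQA (32 layers, 32 KV heads, head dimension 128), 2-byte cache elements, and FP16 weights. These values are illustrative and are not taken from the tool's presets.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, elem_bytes=2, batch=1):
    """KV bytes added by each extra token of context (K and V, all layers)."""
    return 2 * layers * batch * kv_heads * head_dim * elem_bytes


LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128   # assumed 7B-like shape
WEIGHTS_FP16 = 7e9 * 2                      # 7B parameters at 2 bytes each

per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)

rows = []
for seq in (4096, 32768, 131072):
    kv = per_token * seq                    # linear in sequence length
    ratio = kv / WEIGHTS_FP16
    rows.append((seq, kv, ratio))
    print(f"{seq:>7} tokens: KV {kv / 1024**3:6.1f} GiB, {ratio:.2f}x FP16 weights")

# Context length at which the KV cache equals the FP16 weights.
crossover = WEIGHTS_FP16 / per_token
print(f"KV matches weights at about {crossover:,.0f} tokens")
```

With these assumptions each token costs 512 KiB of cache, so the KV term passes the FP16 weights at a little under 27K tokens of context at batch 1. Fewer KV heads (GQA) or a smaller cache element size pushes that point further out.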
Does this support MoE or pipeline parallel?
No. Expert routing and sharding change which weights live on each device; use this for single-device or rough totals only.
Similar tools
Other Spoold utilities that pair with this estimator:
- Token calculator & cost
- Token speed simulator
- Token & context budget
- Vision token estimator
- Binary Decoder
Conclusion
Use this LLM RAM / VRAM estimator for quick weight and KV planning. For token counts and context limits, use Token & context budget or Token calculator; for latency what-ifs, use Token speed simulator.