LLM RAM / VRAM estimator

Rough decoder-only inference footprint: model weights plus optional KV cache (GQA/MHA-style). Add overhead for activations, framework buffers, and CUDA graphs in production. Profile on your stack for real numbers.


KV uses 2 bytes per element (FP16), a common default for planning. Offloading the cache or storing it in INT8 would change this figure.

Context length (tokens): 4,096
Batch size: 1 (range 1 to 32)
Architecture details
Parameters: 7.00B
Bytes / param (weights): 2
Layers: 32
Attention / KV heads: 32 / 32
Hidden size: 4096
Head dim: 128.00
Preset: 7B (typical Llama-class)

Weights: 14.00 GB
KV cache (seq 4,096 × batch 1): 2.15 GB
Subtotal (weights + KV): 16.15 GB
Overhead (+8%): +1.29 GB
Total estimate: 17.44 GB
How these numbers are computed
  • Weights: params (in billions) × 10⁹ × bytes per param (2 B/param for the FP16 example above).
  • KV cache: the decoder stores K and V for every layer, so bytes = 2 × layers × batch × seq × kv_heads × head_dim × elem_bytes (the leading 2 covers the K and V tensors). Here head_dim = hidden / attn_heads and elem_bytes = 2 for FP16 KV.
  • Overhead: applied to (weights + KV) to approximate activations and runtime buffers—not exact VRAM profiling.
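
The three formulas above can be sketched in a few lines of Python. This is a minimal sketch and not the tool's actual source: the names `ModelShape` and `estimate_bytes` are hypothetical, and the use of decimal gigabytes (10⁹ bytes) is an assumption inferred from the 14.00 GB weights figure.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    params_billion: float  # parameter count in billions
    layers: int
    attn_heads: int
    kv_heads: int
    hidden: int


def estimate_bytes(shape, weight_bytes_per_param=2.0, seq=4096, batch=1,
                   kv_elem_bytes=2, overhead=0.08, include_kv=True):
    """Return (weights, kv, total) in bytes using the formulas listed above."""
    weights = shape.params_billion * 1e9 * weight_bytes_per_param
    head_dim = shape.hidden // shape.attn_heads
    kv = 0
    if include_kv:
        # The leading 2 accounts for the separate K and V tensors in every layer.
        kv = 2 * shape.layers * batch * seq * shape.kv_heads * head_dim * kv_elem_bytes
    total = (weights + kv) * (1 + overhead)
    return weights, kv, total


GB = 1e9  # decimal gigabytes (assumed), which reproduces the figures above

llama_7b = ModelShape(params_billion=7.0, layers=32, attn_heads=32,
                      kv_heads=32, hidden=4096)
weights, kv, total = estimate_bytes(llama_7b)
print(f"weights {weights / GB:.2f} GB, kv {kv / GB:.2f} GB, total {total / GB:.2f} GB")
# weights 14.00 GB, kv 2.15 GB, total 17.44 GB
```

With these inputs the KV cache is exactly 2³¹ bytes (2 GiB), which is why it shows as 2.15 GB in decimal units.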

Guide: LLM RAM / VRAM estimator

What is this tool?

A free LLM VRAM calculator and GPU memory estimator for decoder-only transformers. It approximates model weights from parameter count and quantization (FP32, FP16, INT8, INT4, Q4-class), optionally adds KV cache from context length and batch size, then applies an overhead percentage. Use it for capacity planning — not as a substitute for profiling on your hardware, framework, or serving stack. Everything runs in your browser.

How weights & KV cache are estimated

  • Weights — Bytes ≈ parameters × bytes-per-parameter for the selected dtype (e.g. FP16 ≈ 2 bytes/param; Q4-class uses a GGUF-style rule-of-thumb density).
  • KV cache — Per layer, K and V tensors shaped by batch, sequence length, KV heads, and head dimension: 2 × layers × batch × seq × kv_heads × head_dim × elem_bytes. Optional; you can disable KV to see weights-only footprint.
  • Overhead — Extra percent on top of weights + KV for activations, allocator slack, CUDA graphs, etc. Tune to match how conservative you want the estimate.
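
To make the weights rule concrete, the sketch below maps each listed format to a bytes-per-parameter value and prints the weights-only footprint of a 7B model. The Q4-class density used by the tool is not stated on this page; the 4.5 bits/weight value here is an assumption borrowed from the GGUF Q4_0 block layout (32 four-bit weights plus one FP16 scale in 18 bytes). `BYTES_PER_PARAM` and `weight_gb` are hypothetical names.

```python
# Bytes per parameter for each weight format the tool lists.
# The Q4-class entry is an assumption (4.5 bits/weight, as in GGUF Q4_0);
# the tool's own rule-of-thumb density may differ.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "INT8": 1.0,
    "INT4": 0.5,
    "Q4-class": 4.5 / 8,
}


def weight_gb(params_billion, fmt):
    """Weights-only footprint in decimal GB for the given format."""
    return params_billion * 1e9 * BYTES_PER_PARAM[fmt] / 1e9


for fmt in BYTES_PER_PARAM:
    print(f"{fmt:>8}: {weight_gb(7.0, fmt):6.2f} GB")
```

Under these assumptions, moving a 7B model from FP16 to INT4 cuts the weights from 14 GB to 3.5 GB, while the KV cache stays at its FP16 size unless the cache itself is quantized.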

MoE, CPU offload, multi-GPU sharding, and flash / sparse attention change real memory — this tool does not model those paths.

Features

  • Presets — 7B, 8B, 13B, 70B (GQA), plus custom parameter count with derived layer/head heuristics.
  • Quantization — FP32, FP16, INT8, INT4, Q4-class for weight bytes.
  • Context & batch — Sequence length (with quick chips e.g. 4K–128K), batch up to 32.
  • KV toggle — Include or exclude KV cache (FP16-style element size when on).
  • Breakdown — Weights, KV, subtotal, and total after overhead in GB.

How to use

  1. Pick a model size — Choose a preset or enter custom billions of parameters.
  2. Set weight format — Match how you plan to load the model (FP16 vs quantized).
  3. Set context & batch — Align sequence length and batch with your inference scenario.
  4. Toggle KV & overhead — Turn KV on for long-context generation; adjust overhead for your comfort margin.
  5. Read totals — Compare weights vs KV; if KV dominates, context or batch is driving VRAM.

Use cases

Scenario | How this helps
Buying a GPU | Ballpark whether a 7B Q4 vs 70B fits next to KV at your target context.
Batching APIs | See how batch > 1 scales KV and total memory.
Long context | Compare 32K vs 128K, where KV often becomes the dominant term.
Quantization tradeoffs | Switch INT4 / Q4-class vs FP16 to estimate weight savings before conversion work.
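
The GPU-buying and long-context scenarios can be combined into one rough fit check. This is a sketch under stated assumptions: the 7B shape matches the preset shown on this page, the 70B shape (80 layers, 8 KV heads, head dim 128) is a Llama-2-70B-like guess rather than the tool's preset, and the 24 GB card is hypothetical. Function names are illustrative only.

```python
def kv_bytes(layers, kv_heads, head_dim, seq, batch=1, elem_bytes=2):
    """KV cache size in bytes; the leading 2 covers the K and V tensors."""
    return 2 * layers * batch * seq * kv_heads * head_dim * elem_bytes


def total_gb(params_billion, bytes_per_param, layers, kv_heads, head_dim,
             seq, batch=1, overhead=0.08):
    """Weights plus KV, scaled by the overhead percentage, in decimal GB."""
    weights = params_billion * 1e9 * bytes_per_param
    kv = kv_bytes(layers, kv_heads, head_dim, seq, batch)
    return (weights + kv) * (1 + overhead) / 1e9


# Assumed shapes: the 7B row matches the preset on this page; the 70B row
# is a Llama-2-70B-like guess, not necessarily the tool's 70B (GQA) preset.
SHAPES = {
    "7B": dict(params_billion=7.0, layers=32, kv_heads=32, head_dim=128),
    "70B (GQA)": dict(params_billion=70.0, layers=80, kv_heads=8, head_dim=128),
}
GPU_GB = 24.0     # hypothetical card capacity
INT4_BYTES = 0.5  # bytes per parameter for 4-bit weights

for name, shape in SHAPES.items():
    for seq in (32_768, 131_072):
        need = total_gb(bytes_per_param=INT4_BYTES, seq=seq, **shape)
        verdict = "fits" if need <= GPU_GB else "does not fit"
        print(f"{name:>9} @ {seq:>7,} tokens: {need:6.2f} GB -> {verdict}")
```

Because batch and sequence length both enter the KV formula as plain multipliers, doubling either one doubles the KV term while the weights stay fixed.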

Limits & disclaimers

  • Decoder-only — Encoder-decoder and non-transformer architectures are out of scope.
  • Simplified KV — Real systems use paging, quantization of cache, or fused kernels; numbers are idealized.
  • Presets are illustrative — Layer and head counts approximate “typical” shapes; your checkpoint may differ.
  • Profile before production — Use vendor tools (nvidia-smi, framework memory stats) for ground truth.
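
For the profiling step, one option is to read actual usage from `nvidia-smi` and compare it with the estimate. The sketch below assumes an NVIDIA driver is installed; `parse_memory_csv` and `query_gpu_memory` are hypothetical helper names, and the sample line is made-up data used only to show the parsing. With `nounits`, `nvidia-smi` reports values in MiB, so convert before comparing against the GB figures on this page.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory usage (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_csv(out)


MIB = 1024 * 1024

# Hypothetical output for a single card, used here only to show the parsing.
sample = "17012, 24564\n"
for used_mib, total_mib in parse_memory_csv(sample):
    print(f"{used_mib * MIB / 1e9:.2f} GB used of {total_mib * MIB / 1e9:.2f} GB")
```

On a machine with a GPU, calling `query_gpu_memory()` while the model is loaded and serving gives the number to compare against the estimator's total.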


FAQ

Is this LLM VRAM calculator free?

Yes. It runs in your browser with no sign-up.

Will my GPU match these GB numbers exactly?

Unlikely. Framework overhead, kernels, and system reserved memory differ. Treat output as a ballpark and profile on device.

Why does KV cache explode at long context?

KV grows linearly with sequence length and with batch size, while the weights stay fixed. At long contexts it can exceed the weights; that is expected in this simplified model.
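
As a worked example, the break-even context length at which KV equals the weights follows directly from the KV formula. The symbols below are introduced here for illustration ($L$ layers, $b$ batch, $s$ sequence length, $h_{kv}$ KV heads, $d$ head dim, $e$ element bytes, $W$ weight bytes), and the numbers assume the 7B FP16 preset shown on this page.

```latex
\[
\mathrm{KV}(s) = 2\,L\,b\,s\,h_{kv}\,d\,e
\qquad\Longrightarrow\qquad
s^{*} = \frac{W}{2\,L\,b\,h_{kv}\,d\,e}
\]
\[
\text{7B, FP16, } b = 1:\quad
2 \cdot 32 \cdot 1 \cdot 32 \cdot 128 \cdot 2 = 524{,}288 \ \text{bytes per token},
\qquad
s^{*} = \frac{14 \times 10^{9}}{524{,}288} \approx 26{,}703 \ \text{tokens}
\]
```

Under these assumptions, any context beyond roughly 27K tokens at batch 1 makes the KV cache larger than the 14 GB of weights, and raising the batch size lowers that threshold proportionally.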

Does this support MoE or pipeline parallel?

No. Expert routing and sharding change which weights live on each device; use this for single-device or rough totals only.

Conclusion

Use this LLM RAM / VRAM estimator for quick weight and KV planning. For token counts and context limits, use Token & context budget or Token calculator; for latency what-ifs, use Token speed simulator.