
LLM RAM / VRAM estimator

Guide

Rough decoder-only inference footprint: model weights plus optional KV cache (GQA/MHA-style). Add overhead for activations, framework buffers, and CUDA graphs in production. Profile on your stack for real numbers.

Architecture
Weight format
KV cache

The KV cache assumes 2 bytes per element (FP16), a common default for planning. Offloading or an INT8 KV cache would change this.

Context length (tokens): 4,096
Batch size: 1 (range 1–32)
Architecture details
Parameters: 7.00B
Bytes / param (weights): 2
Layers: 32
Attention / KV heads: 32 / 32
Hidden size: 4096
Head dim: 128
Preset: 7B (typical Llama-class)
Weights: 14.00 GB
KV cache (seq 4,096 × batch 1): 2.15 GB
Subtotal (weights + KV): 16.15 GB
Overhead (+8%): +1.29 GB
Total estimate: 17.44 GB
How these numbers are computed
  • Weights: params × 10⁹ × bytes-per-param (2 B/param for the FP16 format selected above)
  • KV cache: decoder stores K and V per layer: 2 × layers × batch × seq × kv_heads × head_dim × elem_bytes (leading 2 = K+V tensors). Here head_dim = hidden / attn_heads, elem_bytes = 2 for FP16 KV.
  • Overhead: applied to (weights + KV) to approximate activations and runtime buffers—not exact VRAM profiling.
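The three bullets above can be sketched as a small Python helper. This is an illustrative reimplementation of the arithmetic, not the tool's actual code; the shapes plugged in are the 7B preset shown above.

```python
def estimate_vram_gb(params_b, bytes_per_param, layers, kv_heads, head_dim,
                     seq_len, batch=1, kv_elem_bytes=2, overhead=0.08):
    """Rough decoder-only inference footprint in GB (1 GB = 10**9 bytes)."""
    weights = params_b * 1e9 * bytes_per_param                  # weight tensors
    # Leading 2 = K and V tensors stored per layer.
    kv = 2 * layers * batch * seq_len * kv_heads * head_dim * kv_elem_bytes
    subtotal = weights + kv
    return {
        "weights_gb": weights / 1e9,
        "kv_gb": kv / 1e9,
        "total_gb": subtotal * (1 + overhead) / 1e9,
    }

# 7B Llama-class preset, FP16 weights, FP16 KV, 4,096-token context, batch 1:
est = estimate_vram_gb(7.0, 2, layers=32, kv_heads=32, head_dim=128, seq_len=4096)
# est["weights_gb"] ≈ 14.00, est["kv_gb"] ≈ 2.15, est["total_gb"] ≈ 17.44
```

Matching the breakdown above: 14.00 GB weights + 2.15 GB KV, times 1.08 overhead, gives 17.44 GB.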

Guide: LLM RAM / VRAM estimator


What is this tool?

A free LLM VRAM calculator and GPU memory estimator for decoder-only transformers. It approximates model weights from parameter count and quantization (FP32, FP16, INT8, INT4, Q4-class), optionally adds KV cache from context length and batch size, then applies an overhead percentage. Use it for capacity planning — not as a substitute for profiling on your hardware, framework, or serving stack. Everything runs in your browser.

How weights & KV cache are estimated

  • Weights — Bytes ≈ parameters × bytes-per-parameter for the selected dtype (e.g. FP16 ≈ 2 bytes/param; Q4-class uses a GGUF-style rule-of-thumb density).
  • KV cache — Per layer, K and V tensors shaped by batch, sequence length, KV heads, and head dimension: 2 × layers × batch × seq × kv_heads × head_dim × elem_bytes. Optional; you can disable KV to see weights-only footprint.
  • Overhead — Extra percent on top of weights + KV for activations, allocator slack, CUDA graphs, etc. Tune to match how conservative you want the estimate.
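The kv_heads factor in the KV formula is why GQA models report much smaller caches than MHA models of similar or larger size. A quick comparison using the same formula; the 70B-class shape below (80 layers, 8 KV heads) is an assumed typical configuration, so check your checkpoint's config:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch=1, elem_bytes=2):
    # Leading 2 = K and V tensors per layer; elem_bytes=2 assumes FP16 cache.
    return 2 * layers * batch * seq_len * kv_heads * head_dim * elem_bytes / 1e9

# MHA 7B-class shape: every attention head has its own K/V.
mha = kv_cache_gb(layers=32, kv_heads=32, head_dim=128, seq_len=4096)   # ≈ 2.15 GB
# GQA 70B-class shape (assumed): 8 KV heads shared across the query heads.
gqa = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=4096)    # ≈ 1.34 GB
```

Despite having 10× the parameters, the GQA model's cache at this context is smaller, because KV size tracks kv_heads, not total attention heads.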

MoE, CPU offload, multi-GPU sharding, and flash / sparse attention change real memory — this tool does not model those paths.

Features

  • Presets — 7B, 8B, 13B, 70B (GQA), plus custom parameter count with derived layer/head heuristics.
  • Quantization — FP32, FP16, INT8, INT4, Q4-class for weight bytes.
  • Context & batch — Sequence length (with quick chips e.g. 4K–128K), batch up to 32.
  • KV toggle — Include or exclude KV cache (FP16-style element size when on).
  • Breakdown — Weights, KV, subtotal, and total after overhead in GB.

How to use

  1. Pick a model size — Choose a preset or enter custom billions of parameters.
  2. Set weight format — Match how you plan to load the model (FP16 vs quantized).
  3. Set context & batch — Align sequence length and batch with your inference scenario.
  4. Toggle KV & overhead — Turn KV on for long-context generation; adjust overhead for your comfort margin.
  5. Read totals — Compare weights vs KV; if KV dominates, context or batch is driving VRAM.

Use cases

Scenario | How this helps
Buying a GPU | Ballpark whether a 7B Q4 vs 70B fits next to KV at your target context.
Batching APIs | See how batch > 1 scales KV and total memory.
Long context | Compare 32K vs 128K; KV often becomes the dominant term.
Quantization tradeoffs | Switch INT4 / Q4-class vs FP16 to estimate weight savings before conversion work.
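The long-context scenario is easy to check by hand, since KV grows linearly with sequence length while weights stay fixed. A sketch using the 7B preset's shape (32 layers, 32 KV heads, head dim 128, FP16 cache):

```python
# Per-token KV bytes for the 7B preset: K+V × layers × kv_heads × head_dim × 2 B (FP16)
BYTES_PER_TOKEN = 2 * 32 * 32 * 128 * 2  # = 524,288 B, ≈ 0.5 MB per cached token

kv_gb = {seq: BYTES_PER_TOKEN * seq / 1e9 for seq in (4_096, 32_768, 131_072)}
for seq, gb in kv_gb.items():
    print(f"{seq:>7} tokens -> KV {gb:6.2f} GB (FP16 weights stay at 14.00 GB)")
# ≈ 2.15 GB at 4K, ≈ 17.18 GB at 32K, ≈ 68.72 GB at 128K
```

At 128K tokens the cache is several times the weights themselves, which is why the long-context row above singles KV out as the dominant term.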

Limits & disclaimers

  • Decoder-only — Encoder-decoder and non-transformer architectures are out of scope.
  • Simplified KV — Real systems use paging, quantization of cache, or fused kernels; numbers are idealized.
  • Presets are illustrative — Layer and head counts approximate “typical” shapes; your checkpoint may differ.
  • Profile before production — Use vendor tools (nvidia-smi, framework memory stats) for ground truth.

People often search for LLM VRAM calculator, GPU memory calculator for LLM, how much VRAM for 7B model, KV cache memory calculator, LLM model size in GB, quantization VRAM, Llama memory estimate, inference RAM estimator, and transformer memory footprint. This page focuses on weights + optional KV + overhead for quick planning.

FAQ

Is this LLM VRAM calculator free?

Yes. It runs in your browser with no sign-up.

Will my GPU match these GB numbers exactly?

Unlikely. Framework overhead, kernels, and system reserved memory differ. Treat output as a ballpark and profile on device.

Why does KV cache explode at long context?

KV grows with sequence length (and batch). At long contexts it can exceed weights — that is expected in this simplified model.

Does this support MoE or pipeline parallel?

No. Expert routing and sharding change which weights live on each device; use this for single-device or rough totals only.

Similar tools

Other Spoold utilities that pair with this estimator:

  • Token & context budget
  • Token calculator
  • Token speed simulator

Conclusion

Use this LLM RAM / VRAM estimator for quick weight and KV planning. For token counts and context limits, use Token & context budget or Token calculator; for latency what-ifs, use Token speed simulator.