Spoold is a free, privacy-first developer toolbox. Paste into the Magic Box and it detects JSON, HTML, JWT, curl, OpenAPI, CSV, timestamps, and more—then suggests the best tool. You can also browse the full catalog by category. No sign-up is required.

Is Spoold free to use?

Yes. Core tools are free. Formatting, decoding, and similar utilities run in your browser, so your payloads are not processed on Spoold servers.

Is my data safe with Spoold?

Tool processing for supported client-side utilities happens in your browser. Encrypted share links are designed so only people with the link can read the payload. Review the Privacy Policy for details on cookies, optional ads, and third-party services.

What tools are available?

Spoold ships 61+ tools including JSON formatter and diff, JSON Schema validation, YAML and TOML converters, HTML and Markdown preview, JWT decode and sign, OpenAPI/Swagger viewer with curl, GraphQL formatter, curl to code and curl compare, certificate viewer, CSV preview, regex tester, Mermaid, code editor, LLM token utilities, QR codes, and many encoding and text utilities.

Does Spoold show ads?

The site may display advertisements or sponsored placements to help keep the service free. House promotions highlight other Spoold tools. Ads do not change the fact that supported tools process your pasted content locally in the browser. You can read how to manage ad cookies and opt-outs in the Privacy Policy.

Can I use keyboard shortcuts?

Yes. Press Cmd+K or Ctrl+K to open tool search, press / on the homepage, and use the per-tool shortcuts shown in the UI, such as jj for JSON and hh for HTML where configured.

LLM VRAM Calculator — GPU Memory & KV Cache Estimator (7B, 70B, Quantization)

Guide: LLM RAM / VRAM estimator

What is this tool?

A free LLM VRAM calculator and GPU memory estimator for decoder-only transformers. It approximates weight memory from parameter count and quantization (FP32, FP16, INT8, INT4, Q4-class), optionally adds KV cache from context length and batch size, then applies an overhead percentage. Use it for capacity planning, not as a substitute for profiling on your hardware, framework, or serving stack. Everything runs in your browser.

How weights & KV cache are estimated

Weights — Bytes ≈ parameters × bytes-per-parameter for the selected dtype (e.g. FP16 ≈ 2 bytes/param; Q4-class uses a GGUF-style rule-of-thumb density).
KV cache — Each layer stores K and V tensors shaped by batch, sequence length, KV heads, and head dimension, so total bytes ≈ 2 × layers × batch × seq × kv_heads × head_dim × elem_bytes, where the leading 2 covers K and V and elem_bytes is the cache element size (2 for the FP16-style cache this tool assumes). Optional; you can disable KV to see the weights-only footprint.
Overhead — Extra percent on top of weights + KV for activations, allocator slack, CUDA graphs, etc. Tune it to match how conservative you want the estimate to be.

MoE, CPU offload, multi-GPU sharding, and flash / sparse attention all change real memory use; this tool does not model them.
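
The three terms above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's source: the function names, the 4.5-bits-per-weight value for Q4-class, the 20% default overhead, and the 7B shape (32 layers, 32 KV heads, head dimension 128) are all assumptions chosen for the example.

```python
# Bytes per parameter by weight format. The Q4-class entry is an ASSUMED
# rule of thumb (about 4.5 bits per weight including block scales); the
# tool's own constant may differ.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5, "Q4": 4.5 / 8}


def weight_bytes(params, fmt):
    """Weights: parameters x bytes-per-parameter."""
    return params * BYTES_PER_PARAM[fmt]


def kv_cache_bytes(layers, batch, seq, kv_heads, head_dim, elem_bytes=2):
    """KV cache: the leading 2 covers the separate K and V tensors per layer."""
    return 2 * layers * batch * seq * kv_heads * head_dim * elem_bytes


def estimate(params, fmt, layers, kv_heads, head_dim, seq, batch=1,
             include_kv=True, overhead_pct=20.0):
    """Return a breakdown in bytes: weights, KV, subtotal, total after overhead."""
    weights = weight_bytes(params, fmt)
    kv = kv_cache_bytes(layers, batch, seq, kv_heads, head_dim) if include_kv else 0
    subtotal = weights + kv
    total = subtotal * (1 + overhead_pct / 100)
    return {"weights": weights, "kv": kv, "subtotal": subtotal, "total": total}


if __name__ == "__main__":
    # Hypothetical 7B shape: 32 layers, 32 KV heads, head_dim 128 (assumption).
    result = estimate(7e9, "FP16", layers=32, kv_heads=32, head_dim=128, seq=4096)
    for name, value in result.items():
        print(f"{name:>8}: {value / 1e9:.2f} GB")
```

With these assumed inputs the weights come to 14 GB and the KV cache to about 2.15 GB (exactly 2 GiB) before overhead. The sketch reports decimal GB; whether the tool uses GB or GiB is not stated on this page.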

Features

Presets — 7B, 8B, 13B, 70B (GQA), plus custom parameter count with derived layer/head heuristics.
Quantization — FP32, FP16, INT8, INT4, Q4-class for weight bytes.
Context & batch — Sequence length (with quick chips e.g. 4K–128K), batch up to 32.
KV toggle — Include or exclude KV cache (FP16-style element size when on).
Breakdown — Weights, KV, subtotal, and total after overhead in GB.
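
To make the preset and GQA features concrete, here is a sketch of what a preset table could look like and how grouped-query attention shrinks the per-token KV cost. The shapes are assumptions based on publicly documented Llama-family configurations, not the tool's actual presets, and the names `Preset` and `kv_bytes_per_token` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    params: float   # total parameter count
    layers: int
    kv_heads: int   # fewer KV heads than query heads means GQA
    head_dim: int


# ASSUMED shapes (Llama-family style); the tool's presets may differ.
PRESETS = {
    "7B": Preset(7e9, layers=32, kv_heads=32, head_dim=128),
    "8B": Preset(8e9, layers=32, kv_heads=8, head_dim=128),
    "13B": Preset(13e9, layers=40, kv_heads=40, head_dim=128),
    "70B (GQA)": Preset(70e9, layers=80, kv_heads=8, head_dim=128),
}


def kv_bytes_per_token(p: Preset, elem_bytes: int = 2) -> int:
    """K and V for one token across all layers, FP16-style elements by default."""
    return 2 * p.layers * p.kv_heads * p.head_dim * elem_bytes


if __name__ == "__main__":
    for name, preset in PRESETS.items():
        print(f"{name:>10}: {kv_bytes_per_token(preset) / 1024:.0f} KiB per token")
```

Under these assumptions the 70B GQA shape needs less KV per token than the 7B one (320 KiB versus 512 KiB), which is why the KV head count matters as much as the parameter count when reading the KV line of the breakdown.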

How to use

Pick a model size — Choose a preset or enter custom billions of parameters.
Set weight format — Match how you plan to load the model (FP16 vs quantized).
Set context & batch — Align sequence length and batch with your inference scenario.
Toggle KV & overhead — Turn KV on for long-context generation; adjust overhead for your comfort margin.
Read totals — Compare weights vs KV; if KV dominates, context or batch is driving VRAM.
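
The five steps map onto a short calculation. The sketch below walks through them for a hypothetical 70B GQA model loaded as INT4 with a 32K context; the shape (80 layers, 8 KV heads, head dimension 128) and the 10% overhead are assumptions picked for illustration.

```python
# Step 1: pick a model size (ASSUMED 70B GQA shape).
params = 70e9
layers, kv_heads, head_dim = 80, 8, 128

# Step 2: set the weight format. INT4 stores half a byte per parameter.
bytes_per_param = 0.5

# Step 3: set context and batch.
seq_len, batch = 32 * 1024, 1

# Step 4: toggle KV and choose an overhead percentage.
include_kv = True
kv_elem_bytes = 2      # FP16-style cache elements
overhead_pct = 10.0    # assumption: pick your own comfort margin

# Step 5: read the totals.
weights = params * bytes_per_param
kv = 2 * layers * batch * seq_len * kv_heads * head_dim * kv_elem_bytes if include_kv else 0
subtotal = weights + kv
total = subtotal * (1 + overhead_pct / 100)

print(f"weights : {weights / 1e9:6.2f} GB")
print(f"KV cache: {kv / 1e9:6.2f} GB")
print(f"subtotal: {subtotal / 1e9:6.2f} GB")
print(f"total   : {total / 1e9:6.2f} GB")
print("KV dominates" if kv > weights else "weights dominate")
```

In this scenario the weights (35 GB) still outweigh the KV cache (about 10.7 GB), so a lower-bit weight format would save more memory than a shorter context.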

Use cases

Scenario	How this helps
Buying a GPU	Ballpark whether a 7B Q4 or a 70B model fits alongside its KV cache at your target context.
Batching APIs	See how batch > 1 scales KV and total memory.
Long context	Compare 32K vs 128K — KV often becomes the dominant term.
Quantization tradeoffs	Switch INT4 / Q4-class vs FP16 to estimate weight savings before conversion work.
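
The scenarios in the table can be explored with a small sweep. This sketch assumes an 8B-class GQA shape (32 layers, 8 KV heads, head dimension 128) and FP16-style cache elements; `kv_gb` and `weights_gb` are hypothetical helpers, not part of the tool.

```python
def kv_gb(seq_len, batch, layers=32, kv_heads=8, head_dim=128, elem_bytes=2):
    """KV cache size in decimal GB for an ASSUMED 8B-class GQA shape."""
    return 2 * layers * batch * seq_len * kv_heads * head_dim * elem_bytes / 1e9


def weights_gb(params, bytes_per_param):
    """Weight footprint in decimal GB."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Long context and batching: KV scales linearly with both.
    for seq_len in (32 * 1024, 128 * 1024):
        for batch in (1, 4):
            print(f"seq={seq_len:>6} batch={batch}: KV = {kv_gb(seq_len, batch):.2f} GB")

    # Quantization tradeoff for a 70B model: FP16 vs INT4 weights.
    print(f"70B FP16 weights: {weights_gb(70e9, 2.0):.0f} GB")
    print(f"70B INT4 weights: {weights_gb(70e9, 0.5):.0f} GB")
```

Because the formula is linear in both terms, going from 32K to 128K context has the same effect on KV as going from batch 1 to batch 4: each multiplies it by four.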

Limits & disclaimers

Decoder-only — Encoder-decoder and non-transformer architectures are out of scope.
Simplified KV — Real systems use paging, quantization of cache, or fused kernels; numbers are idealized.
Presets are illustrative — Layer and head counts approximate “typical” shapes; your checkpoint may differ.
Profile before production — Use vendor tools (nvidia-smi, framework memory stats) for ground truth.
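
As one way to get a ground-truth reading, the sketch below queries `nvidia-smi` for used and total memory per GPU so the numbers can be set next to the estimate. It assumes an NVIDIA driver is installed and returns an empty list otherwise; the helper names and the `estimate_gb` figure are hypothetical. With the `nounits` option, `nvidia-smi` reports memory values in MiB.

```python
import shutil
import subprocess


def parse_memory_line(line):
    """Parse one 'used, total' CSV line (MiB) into a pair of ints."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory_mib():
    """Return [(used_mib, total_mib), ...] per GPU, or [] if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    readings = []
    for line in result.stdout.splitlines():
        try:
            readings.append(parse_memory_line(line))
        except ValueError:
            continue  # skip blank or non-numeric lines such as "[N/A]"
    return readings


if __name__ == "__main__":
    estimate_gb = 19.4  # hypothetical figure copied from the estimator
    for index, (used, total) in enumerate(query_gpu_memory_mib()):
        used_gb = used * 1024 ** 2 / 1e9
        print(f"GPU {index}: {used_gb:.2f} GB used of {total} MiB (estimate {estimate_gb} GB)")
```

Run it while the model is loaded and serving a representative request; a reading taken at idle only reflects weights and framework baseline, not peak KV or activation memory.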

FAQ

Is this LLM VRAM calculator free?

Yes. It runs in your browser with no sign-up.

Will my GPU match these GB numbers exactly?

Unlikely. Framework overhead, kernels, and system-reserved memory differ between setups. Treat the output as a ballpark and profile on the target device.

Why does KV cache explode at long context?

KV grows linearly with both sequence length and batch size, while weight memory stays fixed. At long contexts KV can therefore exceed the weights, which is expected in this simplified model.
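
A quick way to see where that crossover happens is to solve the KV formula for the sequence length at which KV bytes equal weight bytes. The sketch below does this for an assumed 7B shape (32 layers, 32 KV heads, head dimension 128, FP16-style cache); `break_even_tokens` is a hypothetical helper.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, elem_bytes=2):
    """K and V bytes added per token, per sequence, across all layers."""
    return 2 * layers * kv_heads * head_dim * elem_bytes


def break_even_tokens(params, bytes_per_param, layers, kv_heads, head_dim, batch=1):
    """Sequence length at which KV cache bytes equal weight bytes."""
    weights = params * bytes_per_param
    return weights / (kv_bytes_per_token(layers, kv_heads, head_dim) * batch)


if __name__ == "__main__":
    shape = dict(layers=32, kv_heads=32, head_dim=128)  # ASSUMED 7B shape
    print(f"FP16 weights, batch 1: {break_even_tokens(7e9, 2.0, **shape):,.0f} tokens")
    print(f"INT4 weights, batch 1: {break_even_tokens(7e9, 0.5, **shape):,.0f} tokens")
    print(f"FP16 weights, batch 8: {break_even_tokens(7e9, 2.0, batch=8, **shape):,.0f} tokens")
```

With this assumed shape, the KV cache matches FP16 weights at roughly 26.7K tokens. Quantizing the weights to INT4 or raising the batch size moves that point much earlier, because the weights shrink or the KV grows while the other term stays put.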

Does this support MoE or pipeline parallel?

No. Expert routing and sharding change which weights live on each device; use this for single-device or rough totals only.

Conclusion

Use this LLM RAM / VRAM estimator for quick weight and KV planning. For token counts and context limits, use Token & context budget or Token calculator; for latency what-ifs, use Token speed simulator.