Physical AI Inference · Batch-1 Decode
A robot serves one token at a time, not a crowd. Across four NVIDIA GPUs and 44 measured cells, the realised fraction of peak HBM bandwidth falls as peak bandwidth rises, and the cheapest GPU often serves the stream cheapest.
The workload
Cloud serving batches a crowd to hide the cost of memory. A robot can't: it serves one stream, and the metric is the latency between tokens. The textbook says step time should track peak HBM bandwidth, so a faster GPU should be a faster robot. We measured it: the textbook is right on slow silicon and badly wrong on fast silicon.
Explore · interactive
Every number below is measured, not modelled. R_floor = analytic memory floor ÷ observed step time; each value is recomputed from first principles and matches the paper's tables to three decimals.
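As a minimal sketch of the ratio defined above, assuming the analytic floor is simply the bytes streamed per decode step divided by peak HBM bandwidth. The function names and the example inputs are illustrative placeholders, not values from the paper's tables.

```python
def memory_floor_s(bytes_per_step: float, peak_bw_bytes_per_s: float) -> float:
    """Shortest possible step time if HBM streamed at its peak rate."""
    return bytes_per_step / peak_bw_bytes_per_s


def r_floor(bytes_per_step: float, peak_bw_bytes_per_s: float, observed_s: float) -> float:
    """Analytic memory floor divided by observed step time (1.0 = at the floor)."""
    return memory_floor_s(bytes_per_step, peak_bw_bytes_per_s) / observed_s


# Placeholder inputs for illustration only, NOT measured values from this page:
# 15 GB streamed per step, 300 GB/s peak bandwidth, 60 ms observed step time.
example = r_floor(bytes_per_step=15e9, peak_bw_bytes_per_s=300e9, observed_s=0.060)
print(f"R_floor = {example:.3f}")
```

A value near 1 means the step is close to bandwidth-bound; a low value means most of the step is spent on something other than streaming bytes.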
Mechanism · §1
Where the bytes go. One Qwen-2.5-7B decoder block, drawn with each kernel's HBM weight traffic to scale. The SwiGLU MLP moves roughly 407 MB per block, nearly seven times the attention projections, and every block fires the same short kernels back to back. Twenty-eight such blocks per token set the memory floor.
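The 407 MB figure and the near-sevenfold ratio can be reproduced with a few lines of arithmetic. This is a sketch that assumes the Qwen-2.5-7B shape values as published in its model config (hidden size 3584, MLP intermediate size 18944, 28 query heads, 4 KV heads, head dimension 128) and bf16 weights; it ignores biases and norm weights, which are small by comparison.

```python
BYTES_BF16 = 2

# Assumed Qwen-2.5-7B shape values (taken from the public model config).
HIDDEN = 3584
INTERMEDIATE = 18944
N_HEADS = 28
N_KV_HEADS = 4
HEAD_DIM = 128
N_LAYERS = 28


def mlp_weight_bytes() -> int:
    """SwiGLU MLP: gate, up and down projections."""
    return 3 * HIDDEN * INTERMEDIATE * BYTES_BF16


def attn_weight_bytes() -> int:
    """Attention projections under GQA: full-width Q and O, narrow K and V."""
    q = HIDDEN * N_HEADS * HEAD_DIM
    k = HIDDEN * N_KV_HEADS * HEAD_DIM
    v = HIDDEN * N_KV_HEADS * HEAD_DIM
    o = N_HEADS * HEAD_DIM * HIDDEN
    return (q + k + v + o) * BYTES_BF16


mlp_mb = mlp_weight_bytes() / 1e6
attn_mb = attn_weight_bytes() / 1e6
ratio = mlp_weight_bytes() / attn_weight_bytes()
all_blocks_gb = N_LAYERS * (mlp_weight_bytes() + attn_weight_bytes()) / 1e9

print(f"MLP: {mlp_mb:.0f} MB, attention: {attn_mb:.1f} MB, ratio: {ratio:.2f}")
print(f"All {N_LAYERS} blocks: {all_blocks_gb:.2f} GB of weights per token")
```

The per-token total here covers decoder-block weights only; embeddings, the output head, and KV-cache reads add to it.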
Mechanism · §2
Every step pays two bills: the memory clock (streaming weights + KV, the analytic floor) and the overhead clock (everything above the floor). The slower clock sets the pace. Change GPU in the Intro tab and watch the binding clock flip.
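The KV term of the memory clock grows with context length. A rough sketch of how many KV-cache bytes one decode step reads, assuming a bf16 cache and the Qwen-2.5-7B GQA shape from its public config (28 layers, 4 KV heads, head dimension 128); the helper name is illustrative.

```python
BYTES_BF16 = 2

# Assumed Qwen-2.5-7B GQA shape values (from the public model config).
N_LAYERS = 28
N_KV_HEADS = 4
HEAD_DIM = 128


def kv_bytes_per_step(context_len: int) -> int:
    """Bytes of cached keys and values read by one batch-1 decode step."""
    per_token_per_layer = 2 * N_KV_HEADS * HEAD_DIM * BYTES_BF16  # K and V
    return N_LAYERS * per_token_per_layer * context_len


kv_2048 = kv_bytes_per_step(2048)
print(f"KV read at ctx 2048: {kv_2048 / 1e6:.1f} MB per step")
```

With only 4 KV heads the cache stays small next to the weights at this context length, which is why weight traffic dominates the floor for short contexts.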
Mechanism · §3
One decoder block, ten kernels per layer. On slow silicon the kernels are a solid highway of compute; on fast silicon the same kernels are islands in a sea of CPU launch latency. CUDA Graphs removes the launch tax. Watch the gaps collapse.
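To see why launch latency can dominate on fast silicon, it helps to count launches. The sketch below assumes 28 decoder blocks (the Qwen-2.5-7B depth) at ten kernels each; the per-launch CPU cost is a made-up placeholder, since the measured value is not given here.

```python
def launches_per_token(kernels_per_layer: int, n_layers: int) -> int:
    """Number of separate kernel launches needed to decode one token."""
    return kernels_per_layer * n_layers


def launch_bound_ms(kernels_per_layer: int, n_layers: int, launch_us: float) -> float:
    """Time per token spent purely on launches if each one costs launch_us."""
    return launches_per_token(kernels_per_layer, n_layers) * launch_us / 1000.0


KERNELS_PER_LAYER = 10    # from the text above
N_LAYERS = 28             # assumed: Qwen-2.5-7B depth
ASSUMED_LAUNCH_US = 5.0   # hypothetical per-launch CPU cost, NOT a measured value

n_launches = launches_per_token(KERNELS_PER_LAYER, N_LAYERS)
tax_ms = launch_bound_ms(KERNELS_PER_LAYER, N_LAYERS, ASSUMED_LAUNCH_US)
print(f"{n_launches} launches per token, about {tax_ms:.1f} ms of launch time")
```

The launch cost is roughly fixed per kernel while the memory floor shrinks as bandwidth rises, so the same tax takes a larger share of the step on a faster GPU. Replaying the whole step as one captured graph is what removes it.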
Mechanism · §4
Pure-bandwidth scaling says decision rate should climb with HBM bandwidth. It doesn't. Drag the required control rate and feel how few GPUs clear it.
Methods · the map
Three 7-8B GQA models × four GPUs × four context lengths. Each cell is the median of 30 measured decode steps after 5 warmup steps, bf16, sdpa, batch 1, on Modal cloud hosts. Darker is closer to the memory floor. Four L4 cells run out of memory (OOM) at long context, leaving 44 measured cells.
Methods · the surface
The same R_floor values as a terrain. Height is the fraction of the memory floor a cell actually reaches. A bright high plateau on the cheap, short-context corner collapses into a dark trench where the silicon is fastest and the context longest. The surface follows the model selected in the Intro console.
Methods · the proof
The load-bearing claim is mechanistic: the H100 gap is per-kernel CPU launch overhead. It was tested with a CUDA Graphs A/B that touches the launch term and nothing else, with kill-conditions stated in advance.
Methods · §A
Qwen-2.5-7B, ctx 2048, L4. The 4× weight-traffic saving only lands when the int4 kernel is tuned for Ada SM89.
Methods · §B
Per-layer p50, Llama-3-8B decode shape, H100. Default SDPA beats every pinned backend; cuDNN rejects the shape.
Methods · §C
Each GPU runs its best measured lever (Qwen-2.5-7B, ctx 2048). Cost uses the editable rates from the Intro console.
Methods · reproduce
Each cell loads the model in bf16, runs a fixed prefill to populate the KV cache, then times 5 warmup decode steps and 30 measured single-token decode steps at batch 1. Step time is the median. The directly measured ratio, R_floor = analytic memory floor ÷ median observed step time, is the only number that matters.
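The timing protocol above can be sketched as a small harness. This is not the authors' script: `median_step_time` and `fake_step` are hypothetical names, and the stand-in step does no GPU work. On a real GPU the clock should only stop after the device has finished, for example by calling `torch.cuda.synchronize()` inside the step.

```python
import statistics
import time
from typing import Callable, List


def median_step_time(step_fn: Callable[[], None], warmup: int = 5, measured: int = 30) -> float:
    """Run warmup steps untimed, then return the median of the timed steps in seconds."""
    for _ in range(warmup):
        step_fn()

    samples: List[float] = []
    for _ in range(measured):
        start = time.perf_counter()
        step_fn()  # on a GPU, step_fn must synchronize before returning
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in for one single-token decode step at batch 1 (hypothetical).
calls: List[int] = []


def fake_step() -> None:
    calls.append(1)


step_s = median_step_time(fake_step)
print(f"median step time: {step_s * 1e3:.4f} ms over 30 measured steps")
```

Using the median rather than the mean keeps a single slow step, such as one hit by a host-side hiccup, from skewing the cell.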
Sandbox · the bill
At batch 1 every stream needs its own GPU. Per token, a $0.30/hr L4 with an Ada-tuned int4 runtime serves a 7B model 7.9× cheaper than a $3.50/hr H100. Slide your monthly token volume and see the bill.
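The bill follows from two inputs per GPU: the hourly rate and the measured ms/token. The sketch below uses only the rates and the 7.9× figure quoted above to back out the speed gap that figure implies; the helper names and the example volume are illustrative, and no ms/token values are taken from the measurements.

```python
def cost_per_token(rate_usd_per_hr: float, ms_per_token: float) -> float:
    """Dollars per token for one batch-1 stream occupying a whole GPU."""
    tokens_per_hr = 3_600_000 / ms_per_token
    return rate_usd_per_hr / tokens_per_hr


def monthly_bill(tokens_per_month: float, rate_usd_per_hr: float, ms_per_token: float) -> float:
    """Dollars per month at a given token volume."""
    return tokens_per_month * cost_per_token(rate_usd_per_hr, ms_per_token)


L4_RATE = 0.30          # $/hr, from the text
H100_RATE = 3.50        # $/hr, from the text
COST_ADVANTAGE = 7.9    # L4 cost advantage per token, from the text

price_ratio = H100_RATE / L4_RATE
# Cost per token scales with rate * ms_per_token, so the quoted advantage
# implies this ratio of L4 ms/token to H100 ms/token.
implied_l4_slowdown = price_ratio / COST_ADVANTAGE

print(f"price ratio: {price_ratio:.2f}x, implied L4 slowdown: {implied_l4_slowdown:.2f}x")
```

In other words, the quoted numbers are consistent with an H100 that costs about 11.7 times more per hour while serving each token only about 1.5 times faster.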
Sandbox · the feel
Stream the same sentence at each GPU's measured ms/token and watch two things: the same $0.30/hr L4 goes from slow to fast on a runtime swap alone (bf16 to ExLlamaV2 int4), and that cheap L4 then serves each token cheaper than the $3.50/hr H100 with 11× the bandwidth. (Qwen-2.5-7B, ctx 2048, batch 1.)