Physical AI Inference · Batch-1 Decode

The faster the silicon, the less of its bandwidth ever reaches the work.

A robot serves one token at a time, not a crowd. Across four NVIDIA GPUs and 44 measured cells, the realised fraction of peak HBM bandwidth falls as peak bandwidth rises, and the cheapest GPU often serves the stream cheapest.

0.27 H100 floor used · 0.81 L4 floor used · 7.9× cheaper per token


ONE TOKEN AT A TIME · AUTOREGRESSIVE DECODE

The workload

One token at a time, one stream at a time

Cloud serving batches a crowd to hide the cost of memory. A robot can't: it serves one stream, and the metric is the latency between tokens. The textbook says step time should track peak HBM bandwidth, so a faster GPU should be a faster robot. We measured it. The textbook is right on slow silicon and badly wrong on fast silicon.

Explore · interactive

Pick a cell, watch the bandwidth split

Every number below is measured, not modelled. R_floor = analytic memory floor ÷ observed step time; it is recomputed from first principles and matches the paper's tables to three decimals.

Inputs: model · GPU · context length · GPU rate ($/hr) · lever (bf16 / eager, or best lever per GPU)

Readouts: step time · analytic HBM floor, (W+K)/B_peak · % of peak HBM realised, R_floor · serving cost, $ / Mtok

Most of the bandwidth never reaches the work.

Bright ribbon = useful bandwidth (R_floor × peak GB/s) for the selected cell; faded stream peeling below = bandwidth lost, dominated by CPU launch latency on fast silicon. The H100 pours in 11× the L4's bandwidth and delivers barely 3.7× the useful traffic.
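
The ribbon arithmetic can be checked in a few lines. This is a minimal sketch that uses only the headline figures quoted on this page (L4 at 300 GB/s with R_floor 0.81, H100 at 3350 GB/s with R_floor 0.27); the dictionary and function names are illustrative, not taken from the page's code.

```python
# Peak HBM bandwidth (GB/s) and R_floor for the two headline cells, as quoted on this page.
PEAK_GBPS = {"L4": 300.0, "H100": 3350.0}
R_FLOOR = {"L4": 0.81, "H100": 0.27}


def useful_gbps(gpu: str) -> float:
    """Bandwidth that reaches the work: R_floor x peak GB/s (the bright ribbon)."""
    return R_FLOOR[gpu] * PEAK_GBPS[gpu]


peak_ratio = PEAK_GBPS["H100"] / PEAK_GBPS["L4"]
useful_ratio = useful_gbps("H100") / useful_gbps("L4")

for gpu in PEAK_GBPS:
    lost = PEAK_GBPS[gpu] - useful_gbps(gpu)  # the faded stream
    print(f"{gpu}: useful {useful_gbps(gpu):.1f} GB/s, lost {lost:.1f} GB/s")
print(f"peak ratio {peak_ratio:.1f}x, useful ratio {useful_ratio:.1f}x")
```

The peak ratio comes out a little above 11 and the useful ratio a little above 3.7, which is the gap the ribbon is drawn to show.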

Mechanism · §1

Anatomy of one decode step

Where the bytes go. One Qwen-2.5-7B decoder block, drawn with each kernel's HBM weight traffic to scale. The SwiGLU MLP moves roughly 407 MB per block, nearly seven times the attention projections, and every block fires the same short kernels back to back. Twenty-eight of them per token set the memory floor.

Box width is HBM bytes moved by that kernel at bf16 (weights), the dominant traffic at batch 1. KV read at ctx 2048 is about 4.2 MB per block, small but growing linearly with context. Twenty-eight blocks total about 13.16 GB per step; over the H100's 3.35 TB/s that is a 3.93 ms floor, against a measured 14.83 ms (eager), so R_floor 0.27. This block-only tally leaves out weights outside the decoder blocks (such as the embedding and LM head), which is why it sits below the 4.58 ms full-weight floor used in the measurement protocol. The kernels are tiny and many, which is exactly why per-kernel launch cost, not bandwidth, dominates on fast silicon.
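
The per-block byte counts can be rebuilt from the model's shape. The sketch below assumes the public Qwen-2.5-7B configuration (hidden size 3584, MLP width 18944, 28 query heads, 4 KV heads, head dimension 128); none of those dimensions are stated on this page, and projection biases and norm weights are ignored as negligible.

```python
# Assumed Qwen-2.5-7B shape (public model config, not stated on this page).
HIDDEN = 3584
INTERMEDIATE = 18944
N_LAYERS = 28          # "twenty-eight blocks" per the text above
N_HEADS = 28
N_KV_HEADS = 4
HEAD_DIM = 128
BYTES_BF16 = 2
CTX = 2048
H100_PEAK = 3.35e12    # bytes/s, the 3.35 TB/s quoted above

# SwiGLU MLP: gate, up and down projections.
mlp_bytes = 3 * HIDDEN * INTERMEDIATE * BYTES_BF16

# Attention projections: Q and O are full width, K and V are GQA-narrow.
q_out = N_HEADS * HEAD_DIM
kv_out = N_KV_HEADS * HEAD_DIM
attn_bytes = (HIDDEN * q_out + 2 * HIDDEN * kv_out + q_out * HIDDEN) * BYTES_BF16

# KV cache read for one block: K and V for every cached position.
kv_bytes = 2 * N_KV_HEADS * HEAD_DIM * CTX * BYTES_BF16

step_bytes = N_LAYERS * (mlp_bytes + attn_bytes + kv_bytes)
floor_ms = step_bytes / H100_PEAK * 1e3

print(f"MLP {mlp_bytes / 1e6:.1f} MB, attention {attn_bytes / 1e6:.1f} MB, "
      f"ratio {mlp_bytes / attn_bytes:.2f}")
print(f"KV {kv_bytes / 1e6:.2f} MB per block")
print(f"{step_bytes / 1e9:.2f} GB per step, block-only floor {floor_ms:.2f} ms")
```

Under these assumptions the script lands within rounding of the figures in the caption: about 407 MB for the MLP, just under seven times the attention projections, and a block-only floor near 3.93 ms.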

Mechanism · §2

Two clocks: memory cost vs launch cost

Every step pays two bills: the memory clock (streaming weights + KV, the analytic floor) and the overhead clock (everything above the floor). The slower clock sets the pace. Change GPU in the Intro tab and watch which one binds flip.
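
As a rough illustration of the two clocks, an observed step time can be split into its memory share (R_floor × step time) and everything above it. The sketch assumes the headline R_floor values belong to the Qwen-2.5-7B, ctx 2048, bf16 cells quoted elsewhere on this page (16.97 ms on the H100, 63.15 ms on the L4); the function name is illustrative.

```python
def two_clocks(t_obs_ms: float, r_floor: float) -> tuple[float, float, str]:
    """Split one decode step into the memory clock and the overhead clock."""
    memory_ms = r_floor * t_obs_ms       # analytic floor: streaming weights + KV
    overhead_ms = t_obs_ms - memory_ms   # everything above the floor
    binding = "memory" if memory_ms >= overhead_ms else "overhead"
    return memory_ms, overhead_ms, binding


# (observed step time in ms, R_floor), as quoted on this page.
CELLS = {"H100": (16.97, 0.27), "L4": (63.15, 0.81)}

for gpu, (t_obs_ms, r_floor) in CELLS.items():
    memory_ms, overhead_ms, binding = two_clocks(t_obs_ms, r_floor)
    print(f"{gpu}: memory {memory_ms:.2f} ms, overhead {overhead_ms:.2f} ms, "
          f"{binding} clock binds")
```

On these numbers the overhead is about 12 ms on both GPUs; what flips is the memory clock, which dwarfs it on the L4 and falls well below it on the H100.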

Follows the Intro selection.

Mechanism · §3

The empty highway

One decoder block, ten kernels per layer. On slow silicon the kernels are a solid highway of compute; on fast silicon the same kernels are islands in a sea of CPU launch latency. CUDA Graphs removes the launch tax. Watch the gaps collapse.

Mechanism · §4

The dexterous barrier

Pure-bandwidth scaling says decision rate should climb with HBM bandwidth. It doesn't. Drag the required control rate and feel how few GPUs clear it.

Sliders: action tokens / chunk · required control rate (30 Hz shown)
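
One way to read the two sliders, offered as an assumption rather than the page's actual model: a decision is one chunk of action tokens decoded serially, so the achievable decision rate is 1000 / (tokens per chunk × step time in ms). The step times are the measured values quoted in the sandbox; the chunk sizes tried are hypothetical.

```python
# Measured ms/token for Qwen-2.5-7B, ctx 2048, batch 1, as quoted on this page.
STEP_MS = {
    "H100 + CUDA Graphs": 11.78,
    "L4 + ExLlamaV2": 17.36,
    "L4 bf16": 63.15,
}


def decision_rate_hz(step_ms: float, tokens_per_chunk: int) -> float:
    """Decisions per second if each decision needs a full chunk of serial tokens."""
    return 1000.0 / (step_ms * tokens_per_chunk)


def clears(step_ms: float, tokens_per_chunk: int, required_hz: float) -> bool:
    return decision_rate_hz(step_ms, tokens_per_chunk) >= required_hz


REQUIRED_HZ = 30.0
for tokens_per_chunk in (1, 2, 4):  # hypothetical chunk sizes
    winners = [name for name, ms in STEP_MS.items()
               if clears(ms, tokens_per_chunk, REQUIRED_HZ)]
    print(f"{tokens_per_chunk} token(s)/chunk at {REQUIRED_HZ:.0f} Hz: {winners or 'none'}")
```

Under this reading the barrier is steep: every extra action token per chunk divides the decision rate again, so the set of configurations that clear a fixed control rate shrinks quickly.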

Methods · the map

All 44 cells at once

Three 7-8B GQA models × four GPUs × four context lengths. Each cell is the median of 30 measured decode steps after 5 warmup steps, at bf16, SDPA attention, batch 1, on Modal cloud hosts. Brighter is closer to the memory floor. Four of the 48 combinations, all L4 cells at long context, run out of memory (OOM), leaving 44.
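
The per-cell timing rule (5 warmup steps, then the median of 30 measured steps) can be sketched as a small harness. This is a generic illustration, not the paper's harness: `step_fn` is a hypothetical callable that runs one decode step, and on a GPU it would need to wait for the device to finish before returning (for example by calling `torch.cuda.synchronize()`), otherwise the timer sees only the launch.

```python
import statistics
import time
from typing import Callable


def median_step_ms(step_fn: Callable[[], None], warmup: int = 5, measured: int = 30) -> float:
    """Run `warmup` untimed steps, then return the median of `measured` timed steps in ms."""
    for _ in range(warmup):
        step_fn()
    samples = []
    for _ in range(measured):
        start = time.perf_counter()
        step_fn()  # must block until the step has actually finished
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs anywhere: sleep for about 2 ms per "step".
    print(f"{median_step_ms(lambda: time.sleep(0.002)):.2f} ms")
```

The median is used rather than the mean so that a single slow step, common on shared cloud hosts, does not move the cell.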

R_floor decreases monotonically as peak bandwidth rises at fixed context, and falls with context at fixed GPU (the KV term grows faster than launch overhead). The L4 column blazes; the H100 column is dim.

Methods · the surface

The efficiency surface

The same R_floor values as a terrain. Height is the fraction of the memory floor a cell actually reaches. A bright high plateau on the cheap, short-context corner collapses into a dark trench where the silicon is fastest and the context longest. Follows the model selected in the Intro console.

Drag to orbit, scroll to zoom. Peak bandwidth runs along the long axis (L4 300 to H100 3350 GB/s), context length along the depth axis (2048 to 16384), and height is R_floor. The bright green plateau is the L4 near its floor; the deep red basin is the H100 sinking to 27% and below. Macro height follows the 44 measured cells. The fine contour relief and the basin are stylised for legibility; the ground truth is the cell values.

Methods · the proof

A pre-registered falsification, not a story

The load-bearing claim is mechanistic: the H100 gap is per-kernel CPU launch overhead. It was tested with a CUDA Graphs A/B that touches the launch term and nothing else, with kill-conditions stated in advance.

Pre-registered: an H100 speedup under ~1.15× would have killed the launch-tax claim; an L4 speedup over ~1.15× would also have killed it. Neither occurred: H100 measured 1.259× (95% bootstrap CI [1.253, 1.267], N=10, cross-session CV 0.9%); L4 measured a null 1.028×. An honest caveat: the A/B only proves the graph-removable slice is launch (~3.05 ms on H100). The remaining ~7.2 ms above the floor survives graphing and is not, by this evidence, attributable to launch.
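
The kill-conditions and the caveat's arithmetic can be written down directly. The sketch assumes that the 11.78 ms CUDA Graphs step time quoted in the sandbox is the graphed arm of this A/B and that the floor is the 4.58 ms full-weight floor from the measurement protocol; the per-launch figure at the end is a rough estimate that counts only the 28 × 10 decoder-block kernels.

```python
KILL_THRESHOLD = 1.15  # pre-registered speedup threshold (approximate, per the text)


def launch_claim_survives(h100_speedup: float, l4_speedup: float,
                          threshold: float = KILL_THRESHOLD) -> bool:
    """The claim dies if the H100 gains too little or the L4 gains too much."""
    return h100_speedup >= threshold and l4_speedup <= threshold


GRAPHED_MS = 11.78   # H100 with CUDA Graphs (assumed to be the A/B's graphed arm)
SPEEDUP = 1.259      # measured H100 speedup
FLOOR_MS = 4.58      # full-weight analytic floor

eager_ms = GRAPHED_MS * SPEEDUP              # implied eager step time
launch_slice_ms = eager_ms - GRAPHED_MS      # removed by graphing
residual_ms = GRAPHED_MS - FLOOR_MS          # survives graphing, not attributed to launch
per_launch_us = launch_slice_ms / (28 * 10) * 1e3  # rough: decoder-block kernels only

print(launch_claim_survives(1.259, 1.028))
print(f"eager {eager_ms:.2f} ms, launch slice {launch_slice_ms:.2f} ms, "
      f"residual {residual_ms:.2f} ms, ~{per_launch_us:.1f} us per launch")
```

Both conditions have to hold at once, which is what makes the test falsifiable in either direction: a large L4 speedup would have pointed at something other than launch overhead.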

Methods · §A

Quant: kernel, not bit-width

Qwen-2.5-7B, ctx 2048, L4. The 4× weight-traffic saving only lands when the int4 kernel is tuned for Ada SM89.
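
A quick check of how much of the 4× saving actually lands, using the two L4 step times quoted in the sandbox (bf16 at 63.15 ms, ExLlamaV2 int4 at 17.36 ms). The ideal ratio is a simplification: it counts weight bits only and ignores the KV cache and any scale or zero-point data an int4 format carries.

```python
BITS_BF16 = 16
BITS_INT4 = 4

L4_BF16_MS = 63.15       # measured, as quoted on this page
L4_EXLLAMAV2_MS = 17.36  # measured, as quoted on this page

ideal_speedup = BITS_BF16 / BITS_INT4            # weight-traffic saving
observed_speedup = L4_BF16_MS / L4_EXLLAMAV2_MS  # what the tuned kernel delivers
fraction_landed = observed_speedup / ideal_speedup

print(f"ideal {ideal_speedup:.0f}x, observed {observed_speedup:.2f}x, "
      f"{fraction_landed:.0%} of the saving lands")
```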

Methods · §B

Attention backends

Per-layer p50, Llama-3-8B decode shape, H100. Default SDPA beats every pinned backend; cuDNN rejects the shape.

Methods · §C

The cost-per-token inversion

Each GPU runs its best measured lever (Qwen-2.5-7B, ctx 2048). Cost uses the editable rates from the Intro console.

Methods · reproduce

The measurement protocol

Each cell loads the model in bf16, runs a fixed prefill to populate the KV cache, then runs 5 warmup decode steps and times 30 measured single-token decode steps at batch 1. Step time is the median of the 30. The directly measured ratio is the only number that matters:

# R_floor = analytic memory floor / observed step time
# W = bf16 weight bytes; K = KV-cache bytes read per decode step
# K counts keys and values (leading 2) at 2 bytes per bf16 value (trailing 2)
K = 2 * n_layers * n_kv_heads * head_dim * ctx * 2
t_floor = (W + K) / B_peak     # B_peak = peak HBM bandwidth in bytes/s
R_floor = t_floor / t_obs      # t_obs = median of 30 measured decode steps, batch 1
# A purely HBM-bandwidth-bound decode would sit at R_floor = 1.
# H100 ctx2048 Qwen: t_floor 4.58 ms, t_obs 16.97 ms  ->  R_floor 0.27
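
A self-contained version of the same calculation, for readers who want to run it. Only the formulas, the 28 layers, the 3.35 TB/s peak and the 16.97 ms step time come from this page; the weight total (about 15.23 GB at bf16) and the KV shape (4 KV heads, head dimension 128) are assumptions taken from the public Qwen-2.5-7B model card and config.

```python
def kv_bytes(n_layers: int, n_kv_heads: int, head_dim: int, ctx: int,
             bytes_per_value: int = 2) -> int:
    """KV-cache bytes read per decode step: keys and values for every cached position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_value


def r_floor(weight_bytes: float, kv: float, peak_bytes_per_s: float,
            t_obs_ms: float) -> tuple[float, float]:
    """Return (analytic floor in ms, R_floor)."""
    t_floor_ms = (weight_bytes + kv) / peak_bytes_per_s * 1e3
    return t_floor_ms, t_floor_ms / t_obs_ms


W_ASSUMED = 15.23e9  # assumed bf16 weight bytes for Qwen-2.5-7B (not stated on this page)
K = kv_bytes(n_layers=28, n_kv_heads=4, head_dim=128, ctx=2048)  # head shape assumed

t_floor_ms, ratio = r_floor(W_ASSUMED, K, peak_bytes_per_s=3.35e12, t_obs_ms=16.97)
print(f"t_floor {t_floor_ms:.2f} ms, R_floor {ratio:.2f}")
```

With those assumed inputs the script reproduces the quoted H100 cell to two decimals.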

Sandbox · the bill

What your fleet costs to serve

At batch 1 every stream needs its own GPU. Per token, a $0.30/hr L4 with an Ada-tuned int4 runtime serves a 7B model 7.9× cheaper than a $3.50/hr H100. Slide your monthly token volume and see the bill.

Slider: tokens served per month, across your fleet

H100 · $3.50/hr · CUDA Graphs · 11.78 ms/tok

L4 · $0.30/hr · ExLlamaV2 · 17.36 ms/tok

Readout: what you save by serving on the L4 instead of the H100

Cost = monthly tokens × measured ms/token × GPU rate. Per-token rates: H100 $11.45 / Mtok, L4 $1.45 / Mtok (Modal published rates, May 2026). Excludes idle, networking, storage and batching; the ratio is the point.
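
The cost formula needs only a unit conversion from milliseconds to hours. A minimal sketch using the rates and step times quoted above; the function names and the one-billion-token monthly volume are illustrative, not from the page.

```python
def usd_per_mtok(step_ms: float, usd_per_hr: float) -> float:
    """Dollars to serve one million tokens at batch 1, one stream per GPU."""
    gpu_seconds = step_ms / 1e3 * 1e6
    return gpu_seconds / 3600.0 * usd_per_hr


def monthly_bill(tokens: float, step_ms: float, usd_per_hr: float) -> float:
    return tokens / 1e6 * usd_per_mtok(step_ms, usd_per_hr)


h100 = usd_per_mtok(step_ms=11.78, usd_per_hr=3.50)  # H100 + CUDA Graphs
l4 = usd_per_mtok(step_ms=17.36, usd_per_hr=0.30)    # L4 + ExLlamaV2
print(f"H100 ${h100:.2f}/Mtok, L4 ${l4:.2f}/Mtok, ratio {h100 / l4:.1f}x")

MONTHLY_TOKENS = 1e9  # hypothetical fleet volume
saving = (monthly_bill(MONTHLY_TOKENS, 11.78, 3.50)
          - monthly_bill(MONTHLY_TOKENS, 17.36, 0.30))
print(f"saving at {MONTHLY_TOKENS:.0e} tokens/month: ${saving:,.2f}")
```

Because both bills scale linearly with volume, the ratio between them does not depend on the slider; only the absolute saving does.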

Sandbox · the feel

Runtime-poor, not compute-poor

Stream the same sentence at each GPU's measured ms/token and watch two things. First, the same $0.30/hr L4 goes from slow to fast on a runtime swap alone (bf16 to ExLlamaV2 int4). Second, that cheap L4 then serves each token more cheaply than the $3.50/hr H100 with 11× the bandwidth. (Qwen-2.5-7B, ctx 2048, batch 1.)

Input: prompt to stream

Readout: real time · tokens × step_ms / 1000 seconds

Text is split on whitespace and punctuation as a proxy for model tokens; each lane emits one token per measured step. H100 + CUDA Graphs 11.78 ms, L4 + ExLlamaV2 17.36 ms, L4 bf16 63.15 ms. Same sentence, very different felt latency, and the cheap GPU with the right kernel nearly keeps up with the flagship.
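
The streaming demo's timing can be approximated offline. The regular expression below is an assumed stand-in for the page's splitter (words, plus each punctuation mark as its own token), and the sample sentence is hypothetical; the step times are the measured values quoted above.

```python
import re

# Measured ms per decode step, as quoted on this page.
STEP_MS = {
    "H100 + CUDA Graphs": 11.78,
    "L4 + ExLlamaV2": 17.36,
    "L4 bf16": 63.15,
}


def proxy_tokens(text: str) -> list[str]:
    """Split into word runs and single punctuation marks (a proxy, not a real tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def stream_seconds(text: str, step_ms: float) -> float:
    """Wall-clock time to emit the sentence at one proxy token per decode step."""
    return len(proxy_tokens(text)) * step_ms / 1e3


sentence = "Pick up the red cup, then stop."  # hypothetical prompt
for lane, ms in STEP_MS.items():
    print(f"{lane}: {stream_seconds(sentence, ms):.3f} s")
```

A real tokenizer would usually produce a different token count for the same text, so the absolute times are indicative; the ratio between lanes is set by the step times alone.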