Escaping the KV-Cache Bottleneck: vLLM vs HuggingFace Transformers on an H100
I needed roughly a hundred thousand LLM generations for a dataset project. On HuggingFace
.generate() that was going to take more than a day. On vLLM it took a few hours
on the same GPU. This is what changed, and the handful of non-obvious things that almost
stopped it from working at all.
Most "vLLM is faster" posts stop at the headline. The headline is true and boring: vLLM is faster, and everyone knows it. What I found useful, and what cost me an afternoon each to figure out, were the parts after you decide to use it: a CUDA version that silently breaks the sampler, a memory flag that means the opposite of what you'd guess on a shared machine, and a workload that turned out to be bound by the part of inference I wasn't even watching. Those are the parts worth writing down.
The setup
I was generating a synthetic question-answering dataset during a research internship: long spoken-video transcripts in, structured Q/A and multiple-choice items out. Concretely:
- Generator: Qwen2.5-14B-Instruct, bf16, on a single H100 (90 GB).
- A second pass with a 27B judge from a different model family for quality checking.
- Order of ~10^5 generations across tens of thousands of input chunks.
- The H100 was shared with several other researchers, so I never had all 90 GB to myself.
My first version was the obvious one: load the model with transformers, loop over
inputs, call model.generate() on small batches. It worked on ten examples. Scaled
to the full set, the ETA was north of 24 hours, and that's before the judging pass. That is
the wall this post is about.
Why .generate() hits a wall
The naive loop wastes the GPU in two compounding ways, and both come back to the KV cache, the per-token key/value tensors a transformer keeps so it doesn't recompute attention over the whole sequence at every step.
Static batching. A standard batched generate() call runs until
the last sequence in the batch finishes. If you batch 16 prompts and 15 produce a
10-token answer while one produces 500 tokens, those 15 slots sit idle, padded and burning
cycles, until the straggler is done. Real generation lengths are wildly uneven, so a lot of
the batch is dead weight a lot of the time.
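To put a number on that example, here is a minimal sketch that counts how many decode slot-steps in a static batch produce no useful token. The function name is my own, and it assumes every sequence occupies its slot until the longest one finishes, which is the simplified picture described above.

```python
def static_batch_idle_fraction(output_lengths):
    """Fraction of decode slot-steps spent on sequences that already finished."""
    steps = max(output_lengths)                 # the batch runs until the longest sequence ends
    total_slots = steps * len(output_lengths)   # slot-steps the GPU pays for
    useful = sum(output_lengths)                # slot-steps that emit a real token
    return 1 - useful / total_slots


# The example from the text: 15 short answers and one 500-token straggler.
lengths = [10] * 15 + [500]
idle = static_batch_idle_fraction(lengths)
print(f"idle slot-steps: {idle:.3f}")
```

For the 15-plus-1 batch, a little over nine in ten slot-steps are padding.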
Fragmented KV memory. Classic implementations reserve a contiguous KV block sized to the maximum possible sequence length for every request, then never use most of it. The vLLM authors measured 60–80% of that memory going to waste through fragmentation and over-reservation (Kwon et al., SOSP 2023). Wasted KV memory means you can't fit many sequences at once, which means low concurrency, which means the H100's compute sits underfed.
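A rough back-of-the-envelope sketch of why over-reservation hurts. The layer count, KV-head count, and head dimension below are what I assume for a Qwen2.5-14B-style config (check the model's config.json before trusting them), and the 600-token request is a made-up illustration, not a measurement from my run.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (keys + values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # leading 2 = K and V


# ASSUMED config values for illustration; dtype_bytes=2 corresponds to bf16.
per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=8, head_dim=128)

max_len = 16384          # contiguous reservation sized to the maximum length
actual_len = 600         # hypothetical request that only ever uses 600 tokens

reserved_gib = per_token * max_len / 2**30
used_gib = per_token * actual_len / 2**30
print(f"per token: {per_token / 1024:.0f} KiB")
print(f"reserved: {reserved_gib:.2f} GiB, actually used: {used_gib:.2f} GiB")
```

Under these assumptions one request reserves 3 GiB and touches a small fraction of it, so only a few dozen such reservations fit on the card no matter how short the real outputs are.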
What vLLM actually changes
It's tempting to credit the speedup to a faster attention kernel. That's not the main story; HuggingFace can use FlashAttention-2 too. The real win is the serving engine, and two ideas in particular:
PagedAttention. Borrowing the operating-system trick of virtual memory, vLLM stores the KV cache in fixed-size pages instead of one contiguous block per request. A sequence's KV can live in scattered pages, allocated on demand as it grows. Fragmentation collapses to near zero, so you can pack far more sequences into the same 90 GB.
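The bookkeeping idea can be sketched as a toy allocator. This is not vLLM's implementation: the class, its method names, and the 16-token page size are illustrative assumptions, and real pages hold tensors rather than counters.

```python
class PagedKVAllocator:
    """Toy page table: each sequence maps to a list of fixed-size physical pages."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}   # seq_id -> list of physical page ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # Allocate a new page only when the sequence has none or its last page is full.
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        # A finished sequence returns all of its pages to the pool at once.
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_pages=8, page_size=16)
for i in range(20):
    alloc.append_token("a")          # grows to 20 tokens -> needs 2 pages
    if i < 5:
        alloc.append_token("b")      # interleaved 5-token sequence -> 1 page
print(alloc.block_tables)            # "a" ends up on non-adjacent pages
print(len(alloc.free_pages), "pages still free")
```

The worst-case waste per sequence is one partially filled page, instead of the gap between the actual length and the maximum length.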
Continuous batching. Instead of waiting for a whole batch to finish, the scheduler works at the granularity of a single decode step. The moment one sequence emits its end token, its slot is freed and a queued request takes its place on the very next step. The GPU stays saturated. This is the lever that turns "more concurrency is possible" into "more concurrency actually happens."
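A small simulation makes the scheduling difference concrete. It is a sketch under heavy simplification: one decode step costs the same regardless of how many slots are busy, prefill is ignored, and the workload (one 200-token straggler per four requests) is invented for illustration.

```python
from collections import deque


def static_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled on the very next step."""
    queue = deque(lengths)       # remaining output lengths of waiting requests
    running = []                 # tokens still to emit for each active slot
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every active sequence emits one token; finished ones release their slot.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [200, 10, 10, 10] * 8  # hypothetical: one straggler in every group of four
static = static_steps(lengths, batch_size=4)
cont = continuous_steps(lengths, batch_size=4)
print(f"static: {static} steps, continuous: {cont} steps")
```

With static batching every group pays for its straggler; with continuous batching the short requests slip through the slots the stragglers are not using.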
Put together: pages let you hold many sequences; continuous batching keeps all of them busy. That combination, not the attention kernel, is where the order of magnitude comes from.
The numbers
For the production run, on the same H100 and the same model, the practical throughput looked like this:
| Setup | Relative throughput | Full-run ETA |
|---|---|---|
| HF .generate(), static batches | 1× | > 24 h |
| vLLM, continuous batching | ~10× | a few hours |
I'm deliberately reporting this as an order of magnitude rather than a polished benchmark table: these are real production numbers from one workload, one model, one GPU, not a controlled three-way shootout. The point isn't the exact multiplier; it's that the difference was the gap between "kick it off and check tomorrow" and "watch it finish over lunch."
The gotchas nobody warns you about
None of the above is why I'm writing this. This is. Each of these cost me real time, and none of them are in the quickstart.
1. A new CUDA can silently break the sampler
On a freshly set-up box with a very recent CUDA toolkit, vLLM tried to JIT-compile its FlashInfer sampling kernels and failed with opaque build errors, not a clean "unsupported" message. The fix was to disable the FlashInfer sampler through an environment variable, set before vLLM is imported, so that vLLM falls back to its default sampling path:
import os
os.environ.setdefault("VLLM_USE_FLASHINFER_SAMPLER", "0")
from vllm import LLM, SamplingParams # read at import time, so set it BEFORE importing vllm
The ordering matters: the variable is read at import time, so setting it after the
vllm import does nothing. If you're on bleeding-edge CUDA and seeing kernel-compilation
failures, try this before you start bisecting driver versions.
2. gpu_memory_utilization is a fraction of total, not free
This one bit me specifically because the GPU was shared. The flag reads like "use up to 85% of what's available," and on an empty GPU that's effectively true. It is not what it means. vLLM treats it as a fraction of total device memory, and pre-allocates that much for weights plus KV pages up front.
If other users already hold 26 GB and you set gpu_memory_utilization=0.85, you are
asking for ~77 GB out of the 64 GB that's actually free. You get an OOM at load time,
not at peak, which is confusing until you realize the number was never about free memory.
On a shared card, compute your fraction against the total and leave the others
headroom: if they hold 26 GB, your ceiling is roughly (90 - 26 - slack) / 90 ≈
0.65, not 0.85. Check nvidia-smi first, every time.
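The arithmetic is simple enough to wrap in a helper so I stop doing it in my head. This is a sketch: the function name and the 5 GB default slack are my own choices, and the inputs are whatever nvidia-smi reports at the moment you launch.

```python
import math


def safe_gpu_memory_utilization(total_gb, used_by_others_gb, slack_gb=5.0):
    """Largest fraction of TOTAL memory that still fits beside the other users."""
    available = total_gb - used_by_others_gb - slack_gb
    if available <= 0:
        raise ValueError("no room left on this device right now")
    # Round DOWN to two decimals so the result never overshoots the budget.
    return math.floor(available / total_gb * 100) / 100


# Numbers from the text: 90 GB card, 26 GB held by others, 5 GB slack (assumed).
fraction = safe_gpu_memory_utilization(total_gb=90, used_by_others_gb=26)
print(fraction)

# On a live machine the inputs could come from torch.cuda.mem_get_info(),
# which returns (free_bytes, total_bytes) for the current device.
```

The result is what you pass as gpu_memory_utilization. Other users' allocations can grow after you check, so the slack is doing real work.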
3. Know whether you're prefill-bound or decode-bound
The generation pass and the judging pass had very different shapes, and it changed which knobs mattered.
- Generation: short prompts, longer outputs. This is decode-bound: the cost is dominated by producing many tokens, where continuous batching and high concurrency pay off most.
- Judging: I fed a whole transcript plus a batch of items as context and asked for a short verdict. That's prefill-bound: almost all the work is the single forward pass over a huge input prompt, and the output is tiny. Cranking concurrency barely helped, because each request was already a large chunk of compute on its own. What mattered there was prompt length and batching items per prompt, not stacking more concurrent requests.
If you turn the usual concurrency dials and throughput doesn't move, you're probably prefill-bound, and the fix is on the input side (shorter context, more items per prompt), not the scheduling side.
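One crude way to check before tuning is to compare how many prompt tokens you feed in against how many tokens you get out. This is only a heuristic sketch: the function is mine, the token counts below are invented to mirror the two passes described above, and a ratio says nothing precise about where the GPU time actually goes.

```python
def prompt_to_output_ratio(requests):
    """requests: iterable of (prompt_tokens, output_tokens) pairs."""
    requests = list(requests)
    prompt_total = sum(p for p, _ in requests)
    output_total = sum(o for _, o in requests)
    return prompt_total / max(output_total, 1)   # guard against zero outputs


# Hypothetical shapes: generation has short prompts and longer outputs,
# judging has a huge context and a tiny verdict.
generation = [(300, 400), (250, 520), (320, 380)]
judging = [(12000, 20), (15000, 16), (9000, 24)]

print(f"generation ratio: {prompt_to_output_ratio(generation):.2f}")
print(f"judging ratio:    {prompt_to_output_ratio(judging):.0f}")
```

A ratio well below 1 suggests decode dominates; a ratio in the hundreds suggests almost all the work is the forward pass over the prompt. In vLLM the real counts can be read off each result after a run, from the prompt token ids and the generated token ids.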
4. The max_model_len cliff
I capped transcripts at a character budget I'd eyeballed as "about 16k tokens" and set
max_model_len=16384. Then a run died with something like 16385 >
16384. Tokenizers don't care about your character estimate; one transcript tokenized
just over the line and the whole request was rejected. Two lessons: leave a real margin (I
moved to 32768), and remember that max_model_len is a hard wall:
crossing it by a single token fails the request, it doesn't truncate it.
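What I should have done from the start is count tokens with the real tokenizer before submitting anything. A minimal sketch follows. The helper and its parameter names are hypothetical, and it assumes the prompt and the generated tokens share the max_model_len budget. The whitespace counter is only a stand-in so the example runs without downloading a model.

```python
def split_by_token_budget(prompts, count_tokens, max_model_len, max_new_tokens):
    """Separate prompts that fit the context window from ones that would be rejected."""
    budget = max_model_len - max_new_tokens   # leave room for the generated tokens
    fits, too_long = [], []
    for prompt in prompts:
        (fits if count_tokens(prompt) <= budget else too_long).append(prompt)
    return fits, too_long


# Stand-in counter for the demo. With a real model you would pass something like
#   lambda s: len(tokenizer.encode(s))
# where tokenizer is the model's own HuggingFace tokenizer.
def whitespace_count(text):
    return len(text.split())


prompts = ["short prompt", "word " * 50]
fits, too_long = split_by_token_budget(
    prompts, whitespace_count, max_model_len=40, max_new_tokens=8
)
print(len(fits), "fit;", len(too_long), "need truncating or a larger max_model_len")
```

Running this over the whole input set up front turns a mid-run crash into a list of offenders you can truncate or split deliberately.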
5. Prefix caching only helps if the shared text comes first
vLLM can cache the KV for a shared prompt prefix across requests, which is great when every prompt starts with the same long system instructions. My first prompt layout put the variable content (the transcript) before the static instructions, so there was no shared prefix to cache and the feature did nothing. Reordering so the constant block leads makes the cache actually reusable. If you want prefix caching, design your template front-loaded with whatever is constant.
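The difference between the two layouts is easy to see by measuring the shared prefix directly. This is a sketch with a made-up instruction block and made-up transcripts. It measures characters, while the cache works on tokens, so treat it as a proxy. In vLLM the feature itself is switched on with the enable_prefix_caching engine argument.

```python
import os

# Hypothetical constant block; in practice this is the long system instruction.
INSTRUCTIONS = (
    "You are a careful annotator. Read the transcript and write three "
    "question/answer pairs grounded only in what the speaker says."
)


def variable_first(transcript):
    """Original layout: the per-request transcript leads, so nothing is shared."""
    return f"{transcript}\n\n{INSTRUCTIONS}"


def constant_first(transcript):
    """Reordered layout: the constant block leads and can be reused across requests."""
    return f"{INSTRUCTIONS}\n\nTranscript:\n{transcript}"


def shared_prefix_chars(prompts):
    # os.path.commonprefix compares character by character, not path by path.
    return len(os.path.commonprefix(prompts))


transcripts = ["Alright, today we cover sorting.", "Welcome back to the lab tour."]
before = shared_prefix_chars([variable_first(t) for t in transcripts])
after = shared_prefix_chars([constant_first(t) for t in transcripts])
print(f"shared prefix, variable first: {before} chars")
print(f"shared prefix, constant first: {after} chars")
```

If the first number is zero or close to it across your prompts, the cache has nothing to reuse, whatever the flag says.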
Takeaways
None of this means HuggingFace Transformers is the wrong tool. It's excellent where I started: experimenting, prototyping, poking at one model interactively, and the wide ecosystem of architectures and utilities around it. The moment the job turns into "run this same model over a large batch as fast as possible," though, you've crossed from prototyping into production, and that's vLLM's home turf. Use each where it's strong.
- For any batch/offline generation past a few thousand items, reach for vLLM (or another paged-KV serving engine) before you optimize a .generate() loop. The win is continuous batching plus paged memory, not a faster kernel.
- On a shared GPU, gpu_memory_utilization is your single most dangerous flag. It's a fraction of total memory. Read nvidia-smi first.
- Figure out if your workload is prefill- or decode-bound before you tune anything. They reward opposite knobs.
- Bleeding-edge CUDA and vLLM's JIT kernels don't always agree. Knowing the one env var above saves an afternoon.
This work was done during my research internship at IIT Kharagpur under Prof. Koustav Rudra. The post is about the inference infrastructure only; the dataset and its research findings are separate work and not discussed here.