Blog · Hitesh Kumar Gupta

Running ~100k LLM generations on a shared H100: why Hugging Face .generate() was going to take a day, why vLLM cut it to hours, and the non-obvious gotchas (CUDA 13 FlashInfer, the gpu_memory_utilization trap, prefill-bound vs decode-bound workloads) that nobody warns you about.