Engineering notes on LLM inference, NLP pipelines, and the machine-learning systems I build.
.generate() was going to take a day, why vLLM cut it to hours, and the non-obvious
gotchas (CUDA-13 FlashInfer, the gpu_memory_utilization trap, prefill-bound vs.
decode-bound workloads) that nobody warns you about.