Your 4090 Is Probably Memory-Bound
On a 24 GB card, single-GPU LLM inference is usually constrained by memory traffic and KV cache growth long before raw math throughput becomes the limit.
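To make the claim concrete, here is a minimal back-of-envelope sketch. It assumes a Llama-7B-class model held in fp16 (32 layers, 32 KV heads, head dim 128) and roughly 1 TB/s of memory bandwidth on an RTX 4090; those numbers and the "stream the weights plus the whole KV cache once per decoded token" approximation are illustrative assumptions, not measurements.

```python
# Back-of-envelope: why single-GPU decode on a 24 GB card is bandwidth-bound.
# Assumed numbers (illustrative): a Llama-7B-class model in fp16 and
# ~1 TB/s of memory bandwidth on an RTX 4090.

GB = 1e9

# Assumed model shape: 32 layers, 32 KV heads, head_dim 128, fp16 storage.
n_layers, n_kv_heads, head_dim = 32, 32, 128
bytes_per_elem = 2  # fp16

# Weights: ~7B parameters * 2 bytes each.
weight_bytes = 7e9 * bytes_per_elem

# KV cache grows linearly with context: K and V per layer per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
context_len = 4096
kv_bytes = kv_bytes_per_token * context_len

# Roofline-style ceiling: each decoded token must stream the weights and
# the current KV cache through the memory system at least once.
bandwidth = 1008 * GB  # RTX 4090 peak; sustained bandwidth is lower
tokens_per_s = bandwidth / (weight_bytes + kv_bytes)

print(f"KV cache per token: {kv_bytes_per_token / 1e6:.2f} MB")
print(f"KV cache at {context_len} tokens: {kv_bytes / GB:.2f} GB")
print(f"Bandwidth-bound decode ceiling: ~{tokens_per_s:.0f} tokens/s")
```

Under these assumptions the ceiling lands around 60 tokens/s, far below what the card's raw FLOPS would permit, which is the sense in which the workload is memory-bound rather than compute-bound.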