The Memory Wall Sets the Pace
For years, GPU progress has been narrated as a FLOPS race. In large-model training, that story is incomplete.
The practical clock is usually memory movement.
Why memory dominates in modern training runs
Recent MLPerf rounds keep showing broad hardware participation and rising multi-node activity. But across systems, one pattern keeps repeating: teams that move tensors efficiently outperform teams that merely maximize peak arithmetic throughput.
Sequence length, activation footprint, and communication topology turn memory behavior into an architecture-level constraint.
Concrete example: Llama 2 70B LoRA pressure
Llama 2 70B LoRA at long sequence length is not a toy workload. Memory pressure is severe enough that placement and parallelism strategy become first-order decisions.
A representative vendor write-up from AMD reports peak memory usage around 210 GB for this class of benchmark. That changes what is feasible on 192 GB-class devices versus 256 GB-class devices.
Reported benchmark figures (as published)
- 22.04 minutes time-to-train on 8× MI325X (single node)
- 29.25 minutes on 8× MI300X (single node)
- 23.97 minutes cited H200 publication average in comparable v5.0 reporting (range 23.08–26.10)
- 10.91 minutes for a 32× MI300X multi-node result cited from MangoBoost
The individual numbers will move. The pattern does not: memory fit and movement strategy determine how much of theoretical compute is actually realized.
What this means for platform evaluation
When vendors emphasize HBM bandwidth, cache, and interconnect, they are responding to physics, not marketing fashion.
Arithmetic units idle when data arrives late. In practice, observed performance is the output of:
- memory bandwidth and latency,
- topology and synchronization overhead,
- kernel quality and scheduling,
- and communication/computation overlap.
Bottom line
If someone asks which GPU is “best” for LLM training, the responsible answer is workload-specific.
The winning platform is usually the one whose memory system, software stack, and topology best match your model shape, sequence length, and communication pattern. Peak FLOPS still matter, but only after memory stops sabotaging them.