Zero Python, Zero Frameworks, 397 Billion Parameters: Flash-MoE Is Kind of Absurd
Someone ran a 397B parameter model on a MacBook Pro using raw C and Metal shaders. Here's why that's actually impressive and not just a stunt.
On a 24 GB card, single-GPU LLM inference is usually constrained by memory traffic and KV cache growth long before raw math throughput becomes the limit.
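To see why bandwidth dominates, here's a back-of-envelope sketch of the decode-speed ceiling. All the numbers are hypothetical placeholders, not Flash-MoE's actual figures: an assumed 800 GB/s of memory bandwidth, an assumed 17B active parameters per token (the point of MoE is that only a fraction of the 397B is touched per token), and 8-bit weights.

```python
GB = 1e9

# Assumed hardware and model figures -- placeholders, not measured values.
mem_bandwidth = 800 * GB   # bytes/s the memory system can stream
active_params = 17e9       # parameters actually read per decoded token (MoE)
bytes_per_param = 1        # 8-bit quantized weights

# Every decoded token must stream all active weights through the compute
# units at least once, so memory traffic sets an upper bound on tokens/s.
bytes_per_token = active_params * bytes_per_param
tokens_per_s = mem_bandwidth / bytes_per_token
print(f"decode ceiling: {tokens_per_s:.1f} tok/s")

# KV cache growth eats the same bandwidth budget. Per token appended:
# 2 tensors (K and V) x layers x kv_heads x head_dim x bytes/element.
n_layers, n_kv_heads, head_dim, elem_bytes = 48, 8, 128, 2  # assumed
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * elem_bytes
print(f"KV cache growth: {kv_bytes_per_token / 1e3:.0f} KB/token")
```

With these placeholder numbers the weight stream alone caps decode at roughly 47 tok/s, and the KV cache adds a per-token read cost that grows linearly with context length, which is why long contexts slow decoding even when the weights fit.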