cross-posted from: https://lemmy.world/post/811496
Huge news for AMD fans and those who are hoping to see a real* open alternative to CUDA that isn’t OpenCL!
*: Intel doesn’t count, they still have to get their shit together in rendering things correctly with their GPUs.
We plan to expand ROCm support from the currently supported AMD RDNA 2 workstation GPUs: the Radeon Pro v620 and w6800 to select AMD RDNA 3 workstation and consumer GPUs. Formal support for RDNA 3-based GPUs on Linux is planned to begin rolling out this fall, starting with the 48GB Radeon PRO W7900 and the 24GB Radeon RX 7900 XTX, with additional cards and expanded capabilities to be released over time.
The comparisons of ROCm and CUDA are inevitable, and AMD’s software support just doesn’t hold up.
That being said: ROCm is… sufficient? The API largely works; the main issue is the narrow set of GPUs that actually work with ROCm. Software-wise, the bulk of “important” CUDA features are replicated, but AMD has inevitably locked itself into a losing game of catch-up, always having to react to new features from CUDA rather than leading with features of their own.
Still, I don’t think that the “bulk” of GPU code has really changed much in the last 10 years. Sure, Tensor Cores / AI and ray tracing added some things, but those are rather specialized operations. A typical GPU programmer doesn’t necessarily work with AI or ray tracing.
AMD GPUs are hugely impressive when you do work with them. Absolutely tons of VRAM on the cheap and huge amounts of TFLOPS to get things working. The software is a bit rougher but it does in fact work.
And even when you do, you’re going to find it infinitely more productive (and also performant!) to use OpenAI’s Triton, or something like Tiramisu or Halide to implement custom fused matrix multiplies or convolutions. I honestly believe CUDA as a distinctive advantage of Nvidia GPUs has plateaued here.
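For anyone wondering what “fused” buys you, here is a rough pure-Python sketch. It is not Triton or Halide code, just a toy with made-up function names (`unfused`, `fused`) showing the idea: the unfused version makes three passes and materializes a full intermediate matrix each time, while the fused version applies the bias and ReLU while the accumulator is still in hand. On a GPU that difference is roughly the memory traffic you save.

```python
# Toy illustration of kernel fusion: same math, fewer intermediate buffers.
# All names here are hypothetical; real fused kernels would be generated
# by something like Triton or Halide and run on the GPU.

def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def unfused(a, b, bias):
    """Three separate passes, each producing a full intermediate matrix."""
    c = matmul(a, b)                                          # pass 1: matmul
    c = [[v + bias[j] for j, v in enumerate(r)] for r in c]   # pass 2: bias add
    return [[max(v, 0) for v in r] for r in c]                # pass 3: ReLU


def fused(a, b, bias):
    """One pass: bias add and ReLU happen right after each accumulation."""
    out = []
    for i in range(len(a)):
        row = []
        for j in range(len(b[0])):
            acc = 0
            for k in range(len(b)):
                acc += a[i][k] * b[k][j]
            row.append(max(acc + bias[j], 0))
        out.append(row)
    return out


a = [[1, -2], [3, 4]]
b = [[5, 6], [7, 8]]
bias = [10, -100]
result = fused(a, b, bias)
```

Both versions return the same matrix; only the number of passes over memory differs.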
Nvidia still has a few nifty tricks. Their sparse matrix multiply lets a 4x4 matrix multiplication use half the space (assuming half of the matrix’s entries are zero, which is common in AI). I don’t think AMD has a sparse FP16 4x4 matrix multiplication instruction yet.
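The “half the space” part can be sketched in plain Python. This assumes the 2:4 pattern (at most two nonzeros in every group of four values), which is how I understand Nvidia’s structured sparsity format to work. The names `compress_2_4` and `sparse_dot` are made up for illustration, and real hardware packs the positions into small metadata fields rather than Python lists.

```python
# Toy model of 2:4 structured sparsity (an assumption about the layout):
# in every group of 4 weights at most 2 are nonzero, so we store
# 2 values plus their positions instead of all 4 values.

def compress_2_4(row):
    """Return (values, indices) holding exactly 2 entries per group of 4."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = [i for i, v in enumerate(group) if v != 0]
        if len(keep) > 2:
            raise ValueError("group has more than 2 nonzeros")
        # Pad with zero-valued slots so every group stores exactly 2 entries.
        for i in range(4):
            if len(keep) == 2:
                break
            if i not in keep:
                keep.append(i)
        keep.sort()
        for i in keep:
            values.append(group[i])
            indices.append(i)
    return values, indices


def sparse_dot(values, indices, dense):
    """Dot product of a compressed row with a dense vector."""
    total = 0
    for k, (v, i) in enumerate(zip(values, indices)):
        group_start = (k // 2) * 4  # two stored entries per group of four
        total += v * dense[group_start + i]
    return total


row = [0, 3, 0, 5, 7, 0, 0, 0]
dense = [1, 2, 3, 4, 5, 6, 7, 8]
values, indices = compress_2_4(row)
```

The compressed row stores half as many values as the original, and `sparse_dot` gives the same answer as the dense dot product while doing half the multiplies.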
AMD is behind in AI, but not significantly. AMD is ahead in double-precision / 64-bit compute by a wide margin. AMD also drew first blood on chiplets with MI200, which puts them in a strong position for future innovation.