The way ROCm has turned out is such a shame. It really is effectively an advertisement for CUDA and served exactly that purpose for me. Even Intel's oneAPI stuff seems to be heading in a more viable direction.
I feel like AMD for some reason doesn't understand that support for GPGPU on the consumer stack is important to eventually getting adoption in the enterprise space. CUDA is so convenient in part because as long as you have an NVIDIA card it'll probably be usable for testing, even if not necessarily optimal or fast. Then, once confidence has been built about its usability, a more serious investment can be made into getting the more powerful cards.
It really is effectively an advertisement for CUDA
100%. At a previous employer, we had some budget left after buying a new NVIDIA GPU machine, but not enough for another batch of workstation/server-class cards. So we decided to buy an additional machine with two Radeon VIIs, which were quite affordable for the compute they provided, to see whether that was a viable path for the future.
Unfortunately, ROCm was death by a thousand paper cuts. You could barely train any model without ROCm crashing with some memory error, hitting a weird performance regression, or producing garbage output. The only thing we ever really got working was transformer training, with various ROCm libraries and PyTorch pinned to specific revisions plus a bunch of patches. Luckily I had the foresight to encode all the dependencies with Nix and put all the Nix files in git, so we could reproduce this exact working version. We attempted to update the stack on various occasions, but it would always lead to more breakage.
Two colleagues meticulously reported bugs for the issues we encountered, but AMD did not really seem interested in fixing them (or was just severely understaffed).
After this experience I vowed to never buy AMD again for ML for at least a decade.
I'm not sure I agree with this. In my view the good alternative would have been OpenCL based machine learning libraries, no?
Nvidia deliberately undermined that at every turn, and pushed a proprietary solution, hard. Almost all GPGPU-based machine learning used it, and what is AMD left with? HIP, the "CUDA wrapper to make it work on ROCm".
What's the alternative here? Say that Nvidia shouldn't do anticompetitive things? That's been an exceedingly successful market strategy for them for quite some time.
In my experience, nVidia had the best OpenCL stack of anyone. AMD should probably focus on CUDA support: everyone writes for it and it is a much nicer API. Nothing in it ties it to nVidia hardware. With enough of a push it will probably get opened up, like SGI's GL became OpenGL. But at this point I wouldn't trust AMD's graphics division to write a competent driver for anything, or to not just abandon their efforts a couple of years in.
As for OpenCL, Apple are the ones who killed it in the end to focus on Metal.
Apple gave OpenCL 1.0 to Khronos. While they could have helped more, it is not their fault that Google (on Android), Intel, and AMD couldn't be bothered to make something out of it that could compete with the tooling, libraries, and polyglot infrastructure of CUDA.
That is a good question, but unfortunately OpenCL was also a total non-starter for my case. We were adding GPU support to an existing large scientific compute application and wanted to minimize code duplication between CPU and GPU (so single source solutions like CUDA were far more convenient).
Additionally, while most users are on Linux, where the OpenCL implementations are fairly reliable, we also needed to support Windows, and on Windows AMD's OpenCL implementation is also terrible: it is extremely buggy and the tooling is outdated. I'd spent so many frustrating hours debugging OpenCL back when I was an AMD shill, only to eventually discover that the runtime was bugging out. Occasionally it would even turn out that the bug had been reported on AMD's dev forums years ago without so much as a response.
I think that to an extent NVIDIA's lead is well earned. The CUDA ecosystem has been very comfortable to work in, the support tools have been amazingly polished compared to what I had been used to when trying to stick to "open" things like OpenCL.
That said, I think that Intel is likely to finally reasonably compete with NVIDIA. While their hardware still has a ways to go, they seem to be putting a large amount of effort into catching up on the software front.
Don't blame NVidia for Intel and AMD's failure to do anything meaningful with OpenCL, or Google for completely ignoring it on Android and pushing RenderScript instead.
What on earth is your point? Let me make one thing clear: I could not care one bit about this team-based mentality. Companies can each be bad at their own thing.
NVIDIA delayed driver support for OpenCL deliberately, because they wanted the only viable alternative for their users to be CUDA, pushing university ML labs to invest resources into it.
Your point about AMD not making something out of OpenCL isn't particularly fair... Maybe what you mean is that they should have sponsored their own HPC labs at various universities? Is that "making something out of it"?
Nvidia delivered a good solution with CUDA. You are whining about them not investing as hard in OpenCL. What if focusing on CUDA was what they thought would let them deliver a good result for users? Lack of focus and sustained investment is part of the issue with AMD and others. If OpenCL was the hot thing that Nvidia was supposedly deliberately delaying, that was a great opportunity for AMD and Intel to eat their lunch. Where were they?
I stopped believing in the good intentions of NVIDIA a long time ago. Mind you, I didn't stop buying their products. But they repeat the same shitty stuff all the time. I get why they do it, I get why it works; it's just annoying that people defend it as if it were altruistically motivated (not accusing you of this in particular, just in general).
And, as I said, AMD didn't, and still doesn't, do a great job in the compute department. And NVIDIA spending a lot of money developing good solutions with CUDA, their proprietary technology, cannot be anything but a good thing, right?
Invest money to "help" universities. Lock core software to your proprietary solutions. Jack up prices.
If they only did it in this instance, I'd perhaps give them the benefit of the doubt. But, off the top of my head:
- cuda
- phys-x
- gsync
- Gameworks in general
- rtx
- dlss
There is really nothing uniquely special about any of these technologies, other than being (mostly) software solutions tied to NVIDIA hardware and being pushed heavily onto both developers and researchers.
It's the same playbook as "give MatLab/<Any AutoDesk Product> for free to students". These good deeds are not altruistic, they are investments in market capture.
My point is exactly that: Intel, AMD, and Google are the ones to blame for OpenCL's failure, not NVIDIA.
And let's not forget that OpenCL came from Apple, who gave up on it after messing with Khronos politics. That is basically the reason Metal came to be, before Khronos had decided what OpenGL vNext was supposed to be. Had it not been for AMD's Mantle, they would probably still be wondering about it.
They were the ones that failed to provide tooling and libraries that would create a valuable ecosystem around OpenCL.
It is more than fair, it is always easy to blame others for our failures.
Nvidia took a risk and invented the whole "AI on GPU" industry. AMD and Intel ignored it until they saw big $ there, and their strategy was to create an "open system" that would put external pressure (from folks like you) on Nvidia to share the industry with them. Now, Nvidia's OpenCL performance is better than AMD's but strictly inferior to CUDA's, so what are you going to do when AMD fails even at that?
NVIDIA did absolutely not "invent AI on GPU"; that's a ridiculous statement. When programmable shaders became commonplace, shaders were used to do general computation. That was the birth of GPGPU. NVIDIA jumped on this and developed a lot of very good tooling around it, known as CUDA. Then they invested a lot to make that tooling standard in research departments.
Once GPGPU got more commonplace, researchers doing AI, finding it a suitable GPGPU task, used the more user-friendly and advanced tooling, which was CUDA.
So, "NVIDIA inventing AI on GPU" suggests you do not know much about the history of AI work on GPUs. But, feel free to correct me.
"The GeForce 3, the first NV20 part, contained the first example of true programmability. Despite NVIDIA being a pioneer of highly configurable fragment processing, its programmability was in its vertex processing. The GeForce 3 was the first GPU that brought programmability to consumer hardware."
AI on GPGPU was possible only because Nvidia provided a library with matrix and activation functions running on the GPU at speeds far surpassing CPUs, and designed a fairly nice API that anybody could understand. There was nothing like that before, and that's why they had such a foothold with academic institutions. Of course they didn't invent AI, but they made it possible to run on their GPUs and actively helped researchers do that while Intel and AMD slept (well, Intel at least tried to do it on the CPU with MKL).
What's the advantage of OpenCL if it still only works well on Nvidia? Because it has "Open" in the name? You are acting like it was Nvidia holding OpenCL back, but I'm not convinced that was the case.
The thing about the OpenCL support story is that if you look at version support, it's obvious NVIDIA wasn't at fault.
Even despite NVIDIA not being particularly interested in strongly supporting it, Intel, AMD and Qualcomm kept up support for it up to OpenCL 2.0. Get to 2.1 and the only vendor bothering to support it is Intel. OpenCL 2.2 is still only available in ROCm and even that came an entire 4 years after the spec was finalized. It's clear that none of the companies were particularly interested in pushing OpenCL with the effort required for its single source features in 2.0 onwards.
Then we get to OpenCL 3.0, where Khronos rolled back most of the big mandatory features from 2.2 and suddenly it's once again supported by NVIDIA, Intel and Samsung mere months after the spec is ratified (AMD noticeably still missing 2+ years later).
NVIDIA was not holding back OpenCL. Sure, it supported a lower version, but it actually worked. AMD delivered something that was nominally better and factually broken, time after time after time. Intel's implementation was somewhat better, but the hardware was not interesting.
CUDA C++ is single-source, easy to use, has good tools, and has lots of libraries you can build your new library on top of (empowered by generics, it can use all pre-existing C++ libraries), ...
OpenCL... is not single-source, is not easy to use, does not have good tools, and does not have good libraries that you can reuse (it does not have generics), ...
AI is like an F1 race: either you go race with an F1 car (CUDA C++), or you don't go at all. Trying to go with the horse cart that OpenCL is would be a waste of time and money. Every other AI startup is going to lap you a million times.
> Nvidia deliberately undermined that at every turn, and pushed a proprietary solution, hard.
Undermined who at what? Nobody was interested in OpenCL succeeding: not Intel, not AMD, not Apple. All these companies pushed and continue to push for their own proprietary incompatible ecosystems to try to create a platform like NVIDIA has with CUDA. Intel pushes for OneAPI which is Intel only, AMD pushes for their CUDA clone, and Apple pushes for Metal.
It's easy to create a new standard. You and I can get together, write something on a napkin, call it a standard, and we are done. We could go around telling people that it's going to be "The Future", like happened with OpenCL, but that doesn't mean anyone will actually ship something that's usable. It is particularly easy to create a Khronos "standard", of which there are 200, and of which _none_ standardizes existing practice: create the standard first, try to see if it solves the problem later. Claiming that you are entitled to NVIDIA implementing them all, is... well... just that... pure entitlement.
Of all vendors, ironically, it actually seems that the only vendor working on a portable way to program GPUs is NVIDIA itself, since they have sent a bunch of people regularly to the ISO Fortran and ISO C++ standard committees to extend these languages to allow portable code to run on GPUs without changes. In contrast to OpenCL, ISO Fortran and ISO C++ are actual international standards.
I think Fortran has supported this since the 2018 standard, and C++ since 2017. Ironically, NVIDIA is the only vendor that actually ships this, and has for years: the NVIDIA Fortran and C++ compilers can run these languages on their GPUs. Intel and AMD will talk about how "portable and cross-vendor" OneAPI and ROCm are, yet their Fortran and C++ implementations still, 5 years later, can't actually use their GPUs. The reason is simple: AMD and Intel don't care about / believe in / want a portable programming model for GPUs. They want their own closed platforms. Unfortunately, since they are at a disadvantage, they need their closed platforms to be able to run on NVIDIA GPUs, because otherwise nobody would use them. But that doesn't mean that their platforms are portable or open, and they don't care about code using their platforms running well on NVIDIA GPUs, since that would be counterproductive. So in practice, people still need to use CUDA C++ to target NVIDIA GPUs.
Anyways, OpenCL didn't fail "because of NVIDIA". It failed because it is a bad standard; it was a bad standard when it was created (it was significantly worse than the standard practice back then, hence why nobody wanted to use it), and it turns out no vendor is actually interested in it, so it is a worthless standard.
> NVIDIA itself, since they have sent a bunch of people regularly to the ISO Fortran and ISO C++ standard committees to extend these languages to allow portable code to run on GPUs without changes.
Could you provide more details? I could only find C++ AMP by Microsoft.
> Claiming that you are entitled to NVIDIA implementing them all, is... well... just that... pure entitlement.
It is? What happened to 'Customer is always right?'
Once, at a Khronos-promoted session, someone asked about the Fortran support roadmap for OpenCL. Nobody seemed aware that this was really something the scientific community cared about, and it ended with "if you care about this kind of stuff please come talk to us".
Meanwhile, CUDA has had Fortran support for ages.
> I feel like AMD for some reason doesn't understand that support for GPGPU on the consumer stack is important to eventually getting adoption in the enterprise space.
I made an attempt a few years ago at getting into the machine learning world and discovered very quickly that I was facing a choice: either AMD and open source drivers, or Nvidia and cutting edge software. Learning the algorithms on my own PC just didn't seem feasible with an AMD GPU: running them on the CPU was too slow, and I never found a way to make things work on the GPU with the resources I located. I gave up on AI, since open source software is more important to me.
Maybe I just lacked the right mindset, but the difficulties were so extreme it seems plausible that most people with an AMD GPU never learned to use ROCm at the amateur level and there wasn't a sufficiently good strategy to get it adopted at the professional level without an amateur community.
Low level GPU programming is really hard. OpenGL, CUDA, OpenCL, whatever. I'm still on the lookout for an introductory resource on how I'm meant to use an AMD GPU to multiply 2 matrices together. I literally can't find one with a 5 minute Google search; I get a blank article [0]. Maybe it is just blank for me? For OpenGL I can just go play around [1]. There needs to be a strong community of advanced people explaining what to do, which means big amateur support.
So, in short, I strongly agree. I think AMD's (poor) support of the consumer stack for GPGPU hurt them a lot more than they realise. OpenCL doesn't seem to cut it. The lack of libraries and community indicates that the amateur -> professional feeder pipeline is broken and nobody is learning to use their platform, even for fun.
Moreover, they screwed over a big chunk of the people willing to use their hardware and their OpenCL stack (who might have migrated to ROCm if there had been a not-too-painful path) when they EoLed their embedded GPU offering out of the blue. Now your options are Tiger Lake (and who knows whether Intel is going to keep up the SYCL/oneAPI support; I don't feel Intel is really serious about GPUs in the long term, or even about the 'open standard' shtick they're selling with oneAPI/SYCL) or NVIDIA and the CUDA collar.
ROCm is a mess and I am similarly disappointed in AMD for the same reasons. Using AMD GPUs for accelerated compute on Linux is buggy and disappointing even if you get close to making it work.
PyTorch would probably be the easiest way to get started there, it should give you a NumPy style handle into high-level objects.
As far as low-level goes, OpenCL has always had a horrendous amount of boilerplate just to get started; CUDA is a little better. One thing that can help is the Thrust framework on CUDA, which attempts to provide a C++ STL-style interface to GPU operations. I am not saying it's super performant, but it can give you a place to start and handle some of the boilerplate automatically.
In classic AMD fashion there is a me-too copy that hasn't been updated in 5+ years (probably some intern's summer project or some employee's resume-driven development), and AMD has just always left this trail of abandoned projects in their wake.
But basically what this is going to look like is a main function that sets everything up, and then a __global__ kernel launch that does the actual computation. You probably will have a 1D or 2D grid of threads, where each thread gets a threadIdx.x and threadIdx.y and that determines the row/column of elements they compute.
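To make that shape concrete, here is the per-thread logic sketched in plain Python (illustrative only: the names `matmul_kernel` and `launch` are made up, and a real CUDA kernel would run the grid positions in parallel on the GPU rather than in nested loops):

```python
# Pure-Python sketch of the CUDA-style matrix multiply described above.
# Each (row, col) position plays the role of one GPU thread computing a
# single element of C = A @ B.

def matmul_kernel(A, B, C, row, col):
    # Body of the hypothetical __global__ kernel for one thread:
    # threadIdx.y -> row, threadIdx.x -> col.
    n = len(A[0])
    acc = 0
    for k in range(n):
        acc += A[row][k] * B[k][col]
    C[row][col] = acc

def launch(A, B):
    # The "grid launch": visit every (row, col) thread position.
    # On a GPU these iterations would all run concurrently.
    rows, cols = len(A), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for row in range(rows):
        for col in range(cols):
            matmul_kernel(A, B, C, row, col)
    return C

print(launch([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The point of the sketch is just the indexing scheme: each thread reads its own row of A and column of B and writes exactly one output element, so no synchronization between threads is needed.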
For Thrust/Bolt you can do a cheat implementation by zipping the thread position into an (int x, int y) pair driven by counting_iterators, and then having a functor which takes the (x, y) and does the computation. Or the "true" implementation would be an accumulate-reduce operation, I guess, but frankly the STL-style stuff starts to fall apart pretty quickly, and it just ends up being a convenient way to hook up the lower-level code without having to worry about too much boilerplate.
Less obvious than you might believe. I suspect PyTorch doesn't support any AMD GPUs that are more than ~5 years old, because PyTorch is currently on ROCm 5.0, and that only supports Vega 2.0 and up [0].
This is that thing about AMD not doing a great job of supporting consumer GPUs. There was a window where PyTorch allegedly supported ROCm 4.x, and therefore my GPU, but I expect that was a misdirection, since AMD never seemed to make this stuff work.
Yeah, I haven't done PyTorch on AMD hardware and I guess that's not surprising with the support story.
I've commented elsewhere about AMD building the whole ecosystem around everyone working/distributing source and compiling only at runtime, and how I think that's kind of a nonstarter for commercial operators or any sort of hobbyists/etc.
But generally it just seems like AMD is utterly uninterested in targeting anyone but HPC. You see the guy in this thread posting "but look at all 2 of the supercomputers using AMD GPUs!" (and I've seen the same talking point brought up in other discussions too), and that's the thing: ROCm is exactly the minimum effort they need to get those HPC wins and not a developer-hour more. ML is maybe a secondary target, but they can't really give it first-class support without solving all those other problems. So, they've built what they can on ROCm, and if it doesn't run on your hardware, welp, not doing that, sucks to be you.
They really, really need some PTX equivalent (or just outright PTX support à la GPU Ocelot) for any sort of serious adoption. PTX gives NVIDIA an incredible forwards-and-backwards compatibility story, while AMD is just like "lol, compile it for each individual chip in each individual family". No way.
Exactly. I wish I could consider a 7900 XTX or two for "enthusiast" level ML work at home (thinking open sourced GPT models, Stable Diffusion, etc), but the state of library support has me pricing out 3090s instead.
Sure, I could probably get ROCm going after a lot of tinkering and headaches but...that's not where I want to spend that effort.
Yeah, I faced the same conundrum back in the Radeon VII/5700 XT days. I had gone with a Radeon VII and a 5700 XT, believing AMD would eventually support the latter for ML.
Using the Radeon VII for ML was still a pain, but I had more free time back then and was willing to tolerate the tinkering. Support for the 5700 XT never came; I was just constantly led on by AMD devs on their GitHub.
I didn't really have that kind of free time when I wanted to upgrade, so buying 3090s ended up being the only realistic option. Even on Linux, as long as I'm not trying to stick to the latest bleeding edge kernel, the closed source drivers work very well. Thus far I feel the 3090s have been well worth the extra cost simply in terms of tinkering time saved.
I'm in the middle point of a similar journey. I got a 5700xt for a steal, and ran GPT2 on it to see if I could. It worked, but just barely.
It now currently drives my primary monitors, and I keep a second GPU, a 1080ti, plugged in and passed through to a VM. I can then operate the 1080ti as a "program" from my desktop, which is fun, but somewhat limited given the space constraints in my case. Extra benefit is it can also be used for gaming. Next steps might be to toss 3090s in a dedicated chassis under Proxmox, but then I lose the part time gaming use case.
The path to >40GB VRAM as a hobbyist is tricky, but I have to remind myself that even having it as an option is wild.
>support for GPGPU on the consumer stack is important to eventually getting adoption in the enterprise space.
I've noticed as time goes on that more and more of the enterprise/mainstream tech that eventually become household names have their roots in ye olde vidja gaemz.
All this technology powering all the algorithms today? Yeah so, it was originally so we could have better and better Unreal Tournaments and Dooms and Call of Dutys and Runescapes.
Sort of. It's true a lot of this stuff was paid for by us gamers so that we could play Quake and stuff. UE is a lot more than a game engine these days. I've always been an id guy but I do consider UE a vital technology at this point. It really is because it can, and does, power so much more than games.
That said, these GPU and AI things started as US military project which trickled down to us
> That said, these GPU and AI things started as US military project which trickled down to us
no, GPGPU started with shaders in graphics APIs like OpenGL being (ab)used for general computation; then CUDA packaged this into something not built around the graphics pipeline, starting with the Tesla microarchitecture (8800/200 series) and fully programmable shaders. AI started really taking off during the Maxwell era because those gaming GPUs offered a fairly incredible amount of FP32 performance, which was enough for ML (in contrast to prior HPC GPGPU, which often focused on FP64).
really at no point has GPU, GPGPU, or AI been a military thing that flowed to gaming; it's completely the opposite. Or otherwise please explain what you mean, because that's a very contrary perspective to the common view.
GPUs came from the enterprise and, to an extent, the military space. Just three letters: SGI. They built mighty expensive 3D accelerators that were used in prohibitively expensive setups, for training pilots for example. They never saw the potential of a consumer grade version of their tech. It should be well known that 3dfx was founded by ex-SGI employees.
ackshually babbage’s difference engine was a commercial and scientific computer so I think you’ll find that’s where this all actually began, if you really understand the roots of the technology
but oh no those trig tables were used for military ballistics, I am le owned!!!
like sorry dude yes everyone here knows computing has always been intertwined with quasi-military applications like codebreaking and ARPANET, but this specific branch of tech has nothing more than the broad general links that everything in computing shares. This particular tech is almost purely an outgrowth of gaming tech, and that's actually sort of unique and interesting in itself. Don't stifle substantive discussion just to throw in a "well ackshually" that's not even very relevant; that's very disruptive and antisocial. Please foster good discussion here, that's the underlying mandate of this site.
You mean “3dfx”. Ironically, the original accelerators of both companies you've named were not a success in any way, and hardly contributed to the 3D revolution.
Haha I think part of that is because most teenage developers go through a phase of wanting to be a game developer. In the process they pick up on things from there and as they move onto their careers they eventually realize how they can extend or apply some technology from games to their current task.
The cycle hasn't really changed since 2007 and ATi Stream. Release some tool to program the GPU, expect someone else to do the rest, watch it fade into obscurity, rename it or release a new tool. Motivated users with simple tasks (hashing) would adapt to anything anyway, and so would industry grunts paid to deal with existing hardware and software. Still, each time they have announced the same magical “offloading of tasks to the GPU”, and the same magical “unified memory” to hold those tasks. Based on those sources, by now we are expected to have smart compilers that can take any Notepad and turn it into some CPU+GPU hybrid.
Those were dropped when the open source driver came to be, hence why my Asus 1215B netbook can only do OpenGL 3.3, when it was originally capable of OpenGL 4.1 with fglrx.
On top of that, GCN was a beautiful design for compute. Truly ahead of its time and a beast of a chip, as evidenced by how, when running something AMD actually supported well, GCN cards would punch well above their gaming weight relative to NVIDIA chips.
It's just a huge shame that they never really got the software sorted out well enough to take advantage of the hardware.
On top of what you said, I hoped that stable diffusion and others would have been a wake up call for AMD. It doesn't seem to be the case.
I can imagine a not too distant future where even "normal" users run a plethora of ML models on their GPUs, and if things stay this way, third-party developers will have a tough time supporting AMD cards in a reliable way.
AMD and compute is one of these cases where no support is better than half assed support. You only get to waste so much of my time with a half working solution until I jump ship and blacklist your products forever.
All AMD has to do is offer lots of petaflops on the cheap, and software that technically works. Once the hardware is there, and research groups have won grants to work on it, it's up to grad students (who actually do the research work) to figure out how to get anything done on ROCm.
The hardware is cool; it does not mention ROCm, the software, though. I read somewhere that ROCm is only like 5% of what the CUDA stack offers; it's basically just the equivalent of cuDNN in the CUDA ecosystem.
Yes, yes. I've heard all the rants and agree to some extent, because they seem to target different markets. I am not even sure AMD is even trying to compete with Nvidia.
The MI300 is specifically targeting the AI and ML markets, which is direct competition with Nvidia. They have been competitors in the GPU market for a long time, but the MI300 is AMD's first offering in the high-end ML market. The point here is that AMD needs to invest 20x more in ROCm to be serious in the new segment; on the other hand, EPYC is already doing extremely well alongside Nvidia ML chips.
From the H100 on, Nvidia is trying to use its own host CPU, and of course AMD now wants its own accelerator, the MI300. Soon they will each have their own turnkey systems; it's just that AMD's software stack is still way behind.
Again, I mostly agree. I alluded to AMD not really trying to compete with Nvidia. I know these are public companies, but they are run by an uncle and a niece. Sometimes it does feel like AMD is trying really hard not to compete.
Indeed, the CEOs of Nvidia and AMD are both from Taiwan and about the same age. I was told they are actually good friends, so yeah, they might not want to fight that hard. I will call them the cousins; maybe they can first crush Intel together.
I built this and the other ROCm packages from the AUR (some removed now), along with PyTorch build-flagged for ROCm. It did eventually work, but it was a real pain in the ass and cost a whole lot of time. After that, though, something broke (who knows what) and I ended up rebuilding everything again just to get it back to a working state. I'm currently too afraid to update to these community packages, partially because I don't know if the build flags I used are the duct tape keeping my builds functional (I have an RX 590 8GB, which is maybe at the edge of supported GPUs). Instead, I just added every ROCm package to IgnorePkg, which isn't a great sign.
Unfortunately, ROCm is broken for gfx803 (and possibly others?). I don't think they perform regression testing with their older cards.
I think this is due to custom assembly kernels that are now outdated, I recall seeing somewhere that removing the optimized kernels and allowing to fall back to the generic implementation allowed it to work.
ROCm is in such a sad state. If I can't make it work at home, why would I ever want to investigate it at work?
ROCm is doing excellently at pursuing the opportunity AMD is interested in. It's running on seriously fast supercomputers doing useful stuff for customers. I think it's also running production machine learning workloads though I don't pay much attention to that.
AMDGPU in general is doing brilliantly in games consoles and, as far as I can tell from the outside, roughly holding its own in PC gaming.
ROCm as a Linux-first, open-source let's-do-maths-on-GPUs stack is not being pursued to the same extent. That's evident in the supported hardware list (which is essentially the cards in supercomputers) and in the fact that the ROCm stack that comes in binary form hasn't been compiled for various gaming cards. Arch, and I think Debian, have packaging efforts in place for it, and with determination it's possible to assemble working toolchains from source code.
It's not easy to build a working system out of exclusively open source code and amdgpu gaming hardware but it can be done. The challenge is that the linux kernel, ROCm libraries and LLVM toolchain all need to broadly agree on ABI at a given point in time, and that interface changes more often than one would like, and not simultaneously across the projects. That's the sort of thing which is easy to manage on HPC and a mess on miscellaneous user systems, and it's expensive to fix properly which brings us back to which opportunities are of highest commercial value.
The positive side to this is that the foundational layers - the driver in the linux kernel and the llvm toolchain - are both developed primarily upstream. Building either from source is therefore likely to give you binaries that work. If they don't work, you can raise bugs against those projects directly. So it's quite DIY but at least you've got the pieces to work from. When cuda falls over, you've got no hope.
Ah yes, the old it-is-FOSS-so-you-can-fix-it-yourself argument. Well, I am not going to. I will be taking my chances with CUDA, because in my experience it does not fall over all the time, unlike AMD's stack.
It's really shocking that AMD fails to extend support natively.
Workarounds such as DirectML claim to be the answer for unifying people with NVIDIA or AMD GPUs, but thus far they haven't delivered, with issues such as [this](https://github.com/microsoft/DirectML/issues/58) constantly popping up.
Lately, however, after I began working on DGX V100s and A100s and using my older laptop with a GTX 1650, it became apparent how simple setting up CUDA was, and how easily I could experiment with it on my consumer card. Many have spoken about similar stories, and here's mine. I really hope AMD does a whole lot more, and doesn't exclusively reserve their powerful GPUs for gaming.
AMD often makes good and even great hardware, but they're just inept at software and always have been. That's obviously a problem, because software is what makes the hardware useful.
I have no idea about the complexity of the underlying CUDA/ROCm GPGPU stack; I am just using my card with CUDA a bit stupidly.
But why doesn't AMD set up a team of five good coders to properly support the widely deployed frameworks in the happy-path case? Later they could cover more cases, but at least it would drive adoption of their cards.
There's a lot of talk about ML in general, and training in particular, when talking about GPGPU. But it's also a very performant resource for various other productivity work. Meshroom (photogrammetry) needs CUDA for high-quality models. Photoshop uses OpenCL for some filters. Lots of consumers and amateurs want to use the new ML models for super resolution, generative art, transcription, etc.
Continuing to sell what are basically supercomputers in a PCI Express card form factor, often half the value of the entire computer, that are only usable for games is a huge waste, and it's a strange strategy for an underdog.
Have been through a couple of generations of Ryzen laptops, and for seemingly arbitrary reasons all the good stuff is locked behind ROCm, and now they aren't even supported.
Given how shonky some aspects are (no proper wake-up from sleep, five years in), there's no way I'd risk a whole other graphics stack that seems locked to a particular distro.
Wasted opportunity sums up the whole AMD experience of the last few years (the same laptops on Windows haven't been the best either the few times I've used them: just as likely to shut down during games, possibly due to overheating, as on Linux).
Question for those with experience in this area: is Vulkan compute or WebGPU going to rescue AMD here? Are ML frameworks going to support either of those as a target, making it easier for AMD to catch up again?
For the last three years I was waiting for things to stabilize a bit on the software and kernel side.
First it was ensuring the kernel driver would work, then ROCm, then something like TensorFlow. If ROCm wasn't going to be there, it was PlaidML for me. A lot of that died off over time.
Quite recently I decided to say f--- it and moved to PyTorch with Dockerized ROCm, and things seem to work great on most distros today.
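For anyone curious what that Docker route looks like, here's a minimal sketch. The image name and device flags follow AMD's published `rocm/pytorch` instructions; the tag and exact flags are examples, so check the current docs for your ROCm version:

```shell
# Pull AMD's official PyTorch-on-ROCm image (tag is an example; check Docker Hub)
docker pull rocm/pytorch:latest

# /dev/kfd is the ROCm compute interface, /dev/dri the graphics devices;
# the container user needs the video group to reach the GPU
docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  rocm/pytorch:latest \
  python3 -c "import torch; print(torch.cuda.is_available())"
```

Note that the ROCm build of PyTorch reuses the `torch.cuda` namespace, so `torch.cuda.is_available()` reporting `True` means the AMD GPU is visible.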
I guess PyTorch did a better job than Google did of getting it to work on AMD cards. Ditto with Blender, which used this stack on Windows long before the rest of us could get to it.
Very recently I tried to HIPify most of my CUDA code and was able to get it to work. AMD seem to be open to patches too. So that may be the best thing that could happen to this platform moving forward.
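The mechanical part of that porting is handled by the hipify tools that ship with ROCm. A minimal session looks something like this (the file name and gfx target are made-up examples; the target must match your actual card):

```shell
# Translate CUDA API calls (cudaMalloc -> hipMalloc, kernel launches, etc.)
# hipify-perl is the simple textual tool; hipify-clang is the AST-based one
hipify-perl vector_add.cu > vector_add.hip.cpp

# Compile for a specific GPU architecture
# (gfx1030 is the Navi21 / RX 6800-6900 class, chosen here for illustration)
hipcc --offload-arch=gfx1030 vector_add.hip.cpp -o vector_add
```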
What sucks with ROCm is that AMD are purposely removing older cards from it, and that is such a wasted opportunity for AMD. Kudos to that one guy who still ports the Polaris code to work with ROCm 5.4.x.
If they extended their attitude of graphics-driver improvements for older cards to ROCm, it would rock 'em.
Finally, I want to add this: if AMD would rather the community do a better job, it should let the community ship the next version. We'd all benefit, like with Mesa. A lot of game consoles would benefit from this work too.
The strategy should be to break the chicken-and-egg situation we are in and fix the impression that NVIDIA is the go-to.
First, make the packages easy to install and get started with, plus show off things like competitive Stable Diffusion performance or models on huggingface.co.
Secondly if AMD can't allocate the development budget, they should open it up to the smart people in the community, give them all the hardware documentation required to build out support.
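On the easy-install point, the PyTorch side at least has gotten better: AMD now publishes ROCm wheels on PyTorch's own package index, so the happy path is roughly a one-liner. The ROCm version segment in the URL changes with each release, so treat this as a pattern rather than a literal command:

```shell
# Install a ROCm build of PyTorch from the official wheel index
# (rocm5.4.2 is an example segment; substitute the current ROCm release)
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.4.2
```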
> Secondly if AMD can't allocate the development budget, they should open it up to the smart people in the community, give them all the hardware documentation required to build out support.
The Linux driver is open source and a rather comprehensive description of how the hardware is believed to work. I'd like source code for the firmware but the HSA programming model is reasonably straightforward and does appear to be implemented by the firmware.
The compiler stack is LLVM and should work out of the box.
As a minor existence proof, I'm one of the people responsible for having OpenMP on amdgpu work, and I haven't read any internal documentation whatsoever. OpenMP has bugs but also has some happy users. The public ISA PDFs have been sufficient so far.
In principle, if you really wanted to stick with AMD GPUs, OpenCL is probably better than ROCm. Rusticl is slightly faster (maybe 3%) than ROCm and supports SPIR-V, but the problem boils down to this: if OpenCL doesn't work well on NVIDIA GPUs, then nobody is going to invest the time and effort into porting PyTorch to take full advantage of OpenCL.
There is a PyTorch dlprim fork, but so far it is just a random GitHub repository. That is better than nothing, but you would expect AMD to just cut some checks so these people can work on the software full time.
I spent some time trying to get them to do this for a previous startup back around 2017 but their view at the time was essentially that a couple interns and/or "the community" would solve the problem. At a higher level I think they just didn't see it as an important market to serve. We were getting decent performance on their cards even without ROCm but it just wasn't something they cared about.
It’s absolutely wild to me that AMD and Intel don’t collaborate to port PyTorch and TensorFlow to OpenCL/Vulkan. I just don’t get it. Seems like the obvious thing to do.
The legacy OpenCL implementation doesn't support modern AMD cards (at least since RDNA). Rusticl seems to be the path forward, but it's still a work in progress for AMD.
SYCL uses SPIR-V. Hopefully AMD doesn't introduce a proprietary format but uses an open standard. It would be amazing to be able to run the same code on AMD, Nvidia, and Intel, both current and future hardware, without recompiling.
It still involves compiling. Shipping SPIR-V just means it gets compiled on the end user's machine, which thus needs a compiler toolchain. So it's more convenient, and popular with closed-source libraries, but not magic.
Intel compiles LLVM IR to SPIR-V to LLVM IR. There was a thing on HSA/AMDGPU called HSAIL which I think was meant to work broadly like PTX but didn't work out.
I think we should adapt LLVM IR for use as a serialisation format which gets specialised to the hardware at the last moment, instead of bothering with the SPIR-V indirection, but that's somewhat in tension with LLVM changing their IR representation.
I'd be pretty happy with running the same code on those architectures _with recompiling_ as a first step, there's way too much #ifdef noise needed to make that hold together today.
> And worse, different binary slices are used between different dies of the same product line. As an example, for RDNA2, ROCm math libraries are compiled only for Navi21. This means that on a (smaller) Navi22 die (notably present in the 6700 XT), those components aren’t functional. The workaround is manually recompiling ROCm with support for more targets. Such a roadblock is very discouraging for adopters – and does complicate application distribution too.
holy shit I didn't realize the portability story was that bad, it can't even target a whole uarch/family at once (eg target all RDNA2) and it has to know about every specific die it will ever run on?
Yeah, I mean, that's the deep, deep problem with ROCm: there's no equivalent to PTX. AMD just wants you to distribute source and recompile everything at runtime; everything about ROCm pushes you towards source distribution rather than any kind of bytecode/IL, or gosh, even an executable file you could actually just run.
On NVIDIA the support story is simple: the driver does the final translation from PTX to assembly at runtime, so, as long as there is at least one overlap in the PTX versions packaged in the app and the PTX versions supported by the driver, it runs.
In practice this means you can take a program that was compiled for a CUDA 1.0 GPU (tesla uarch) and it'll run today on an ada, no questions asked. You might be leaving performance on the table by not fully exploiting the newer uarchs, but it'll run, just like x86. And you can take a program today and compile it against CUDA 1.0 capability targets (as long as it doesn't use any features that don't exist in those older versions) and the full 2023-era software toolchain will all just magically work on your 8800GTX even though driver support for that hardware has been dead for 10+ years. Because the driver knows how to run PTX 1.0 and the toolchain knows how to build PTX 1.0, and that's all that matters.
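This forward compatibility is exactly what nvcc's fatbinary flags control: you embed native SASS for specific chips plus PTX for a virtual architecture, and the driver JIT covers everything newer. The flags below are standard nvcc; the particular architectures and file name are just examples:

```shell
# Embed native SASS for two real architectures, plus PTX for compute_70;
# any GPU at or above compute_70 can JIT the embedded PTX at load time
nvcc vector_add.cu -o vector_add \
  -gencode arch=compute_70,code=sm_70 \
  -gencode arch=compute_86,code=sm_86 \
  -gencode arch=compute_70,code=compute_70
```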
ROCm has always really struck me as being aimed at the HPC market - only supporting pro-tier GPUs, focusing on users who are already running custom software, openly disdaining amateur support, etc. And frankly the distribute-as-source model makes ROCm just a complete nonstarter for any sort of commercial or other model where the vendor would absolutely be distributing libraries or a compiled application, let alone trying to wrangle dumb customers through getting a working ROCm environment. Which right now is the authentic "linux in 1995" level experience/ordeal.
Getting serious would be restarting the GPU Ocelot project and going after PTX compatibility. But a corporation is never going to accept being a client on someone else's platform, just like they won't support the open-source Streamline framework because something something 'pluggable frameworks and library code are anti-user-freedom'. Same thing there too, AMD will only support FSR in statically-compiled code and they expect everyone to recompile all their shit and validate and push updates for every single game, every time AMD releases an update, because it's inconvenient for AMD's corporate strategy.
(oh and now that they finally have the ML hardware, the rumor is they're working on their own ML-based upscaler, quelle surprise that the "gosh we would never do anything that legacy users couldn't run" was just a bit too.)
AMD doesn't want user freedom, they want to be the one with the leash. That's why they do the source-distribution-only model... you'll have to work with your code in the ROCm ecosystem and not NVIDIA's. You'll be compiling against HIP and not NVIDIA's stuff. Etc etc. It's not a zero-effort thing to leap the gap, and they want to keep you once you do it, you'll have the same leap to get back out.
>ROCm has always really struck me as being aimed at the HPC market - only supporting pro-tier GPUs, focusing on users who are already running custom software, openly disdaining amateur support, etc. And frankly the distribute-as-source model makes ROCm just a complete nonstarter for any sort of commercial or other model where the vendor would absolutely be distributing libraries or a compiled application, let alone trying to wrangle dumb customers through getting a working ROCm environment. Which right now is the authentic "linux in 1995" level experience/ordeal.
While I somewhat agree, I'd say they also fail at properly targeting the HPC market. The specific slice they're targeting is people who are already forced to work with them, such as developers for software intended ahead of time to be run on their supercomputers. This leads to them still missing out on the portion of the HPC market which isn't bound to a specific supercomputer.
For example, with the software we work on, we currently benefit from our choice of CUDA due to the supercomputer we have access to being A100 based. But if working with ROCm were better and we could test it on more consumer hardware first, it could make a convincing case for gaining access to supercomputers with AMD GPUs, which would then influence the selection of their hardware in other machines. But since we can't, once our CUDA support is more mature it's essentially a given that our lab will be buying many more NVIDIA cards to allow other researchers to take advantage of it.
It's more subtle than that. AMDGPU doesn't have an equivalent to PTX to abstract over differences between hardware generations. So while CUDA caused problems with Volta changing the intrinsics it was able to mostly paper over it in the toolchain. For AMDGPU, changing from a gfx1030 to a gfx1031 probably means recompiling all the machine code.
Because there are _lots_ of different cards and they all have their own machine code, distributing libraries is a pain. You either distribute N copies or do some packaging effort to hide that you've distributed N copies, or you ship raw LLVM IR and cross your fingers that patching it up on the fly works out, in defiance of LLVM not really supporting that.
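In practice the "N copies" option means enumerating every die at compile time: with hipcc (or clang) each `--offload-arch` adds another code object to the bundled binary, and a die you didn't list simply won't load it. The targets below are illustrative:

```shell
# One code object per gfx target ends up in the fat binary;
# omit a die and the resulting library won't run on it at all
hipcc saxpy.hip.cpp -o saxpy \
  --offload-arch=gfx900 \
  --offload-arch=gfx906 \
  --offload-arch=gfx1030 \
  --offload-arch=gfx1031
```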
AMD as a company is in favour of open source. How they feel about user freedom is less obvious, but as long as I can rebuild their stack from source when it falls over, I'm going to stick with it over nvidia. I especially like that the driver ships in the linux kernel.
I don't know man, did you watch the video I linked from AMD's FSR lead?
"we won't support [an open source API/framework] because it might be used to plug things we don't like" is pretty explicitly hostile to both user freedom and open source as a whole. The freedom to do things that you might not want me to or that you won't do is really the only user freedom in this sense, right?
What happens when some game won't update to FSR 2.2 and you're stuck on FSR 2.1 forever? That's the thing open-source user freedom is supposed to fix, right? AMD doesn't get to determine that "our software is the best and alternatives offer no benefits" either, that's the end-user's choice, and when that's followed by "therefore we will work against adoption of pluggability and interoperability standards" that crosses into openly hostile.
The lead is being very diplomatic and careful but that's a very coached way to say that "this interoperability standard isn't good for us and we will work to kill it and prevent adoption, regardless of its open-source nature. Our product is better than theirs and we are going to work to deny you the freedom to choose otherwise."
That's some microsoft level embrace-extend-extinguish shit right there: this is them embracing upscaling tech, extending it with their own proprietary implementation and deliberately kneecapping interoperability, with the goal of eventually extinguishing DLSS. They're just saying the quiet bit out loud and generally being delusional to think they could ever make this happen.
--
Look, big-picture here: AMD did the open-source driver on linux thing because it was a way to get a force-multiplier on their dev time - the community does the work instead of employees that AMD has to pay for. It's a niche community with a DIY mentality and AMD (and Intel before them) leaned on that to get more work done.
AMD's not open-sourcing drivers on Windows. They slap down open APIs like Streamline when it doesn't support their business strategy.
The same thing is happening with ROCm: everyone else does software this way, but AMD wants you to do it that way, even though it's more work and more cumbersome for end users, because it results in more lock-in for their software ecosystem. Is that not fundamentally the same thing people accuse NVIDIA of doing? And that's how they've handled FSR too: user freedom doesn't really matter that much as a principle; they've actually stated they're explicitly against users having these freedoms. They want people statically compiling it because if they embraced user freedom it would help users move more freely between these ecosystems, and they don't want that; they're fine with it once you're locked into their ecosystem. What they mean by "user freedom" is that they want to prevent users from plugging in a library that interfaces with their competitor's hardware accelerators. That's actually the opposite of user freedom.
And where does PSP fit into user freedoms - a whole closed processor running underneath your user experience doing god knows what? Wasn't that something people flipped an absolute fucking shit over with Intel ME? Yeah it's got closed-source elements that make it a problem to open, but isn't that also true of the NVIDIA drivers people constantly whinge about? What's the reason for giving AMD a pass on external IP but not NVIDIA? Blobs don't matter anymore if it's AMD?
Or the platform lock - literally preventing secondhand resale of server cpus (and now desktop CPUs too) if they're ever used in a branded system. Note that it's not locked to a motherboard - it's not about preventing parts swapouts. It's locked to a brand, so you can swap any other HP-locked cpu into a HP-locked system. Clearly 100% targeted at killing the secondhand market, and that's pretty damn anti-user-freedom as well, why shouldn't I have the freedom to buy a used CPU if I want? Because it would impact AMD's bottom line I guess?
Like, at the end of the day AMD is rolling in the anti-user-freedoms shit same as everyone else. It's a bit, when you're the underdog you need an angle to get people to buy you. When the incentives align and you and AMD are both seeing a benefit from the open-source strategy it's great, but it's not something they "are in favor of as a company", they're happy to say no to open-source when it gives them a strategic advantage.
I didn't watch the youtube link. I have now and looked up who the speaker is. He's not aligned with my philosophy. In a past life I made proprietary tools for games studios releasing proprietary products. Most games dev is decidedly hostile to open source, and most windows development is likewise, so I'm not hugely surprised to see that position stated. There may be similar factors at play with the drivers on windows, not my sandbox.
I remember the epyc processor lock story breaking. I don't know how that played out in practice. I do know I'm going to be _really_ angry if it turns out my chip only works in asrock motherboards since that was the first one I put it in. I'd forgotten about that when buying it :(
It seems plausible that whoever is presently the underdog makes nice with open source and whoever is presently on top is not. See e.g. Microsoft over time.
I don't think ROCm will go proprietary because the commercial pressure is in the direction of supercomputers. Specifically, customers of these computers have their own engineers working directly on the open source upstream of the ROCm stack. Both writing optimisations targeting the applications they care about and fixing bugs that trouble them. I mostly get my code reviewed by people outside of AMD. That sort of dynamic doesn't work if you tell customers they need to connect through your VPN and deal with the internal bug tracking systems in order to request changes, as opposed to patching the toolchain themselves.
I am absolutely sure that the overall ROCm architecture was not designed to achieve user lock-in. To the extent I can see an overarching design, it looks like doing the simplest thing we can think of that works OK on clusters. I do see what looks like scar tissue from an initial bring-up during the period when AMD was flirting with bankruptcy. OpenCL was implemented first and then apparently ignored by industry. The current compute model (HSA) was designed in collaboration with various other companies, of which I think Qualcomm are still using it but no one else is. I believe there are commercial reasons why we can't implement CUDA (and thus created HIP, which looks pretty similar and I'm told runs on amdgpu or on nvptx). Clang's OpenMP is converging on identical implementations for amdgpu and nvptx and will do the same for Intel if they ever show up. That'll already compile a program that runs on either GPU arch if you ask it to.
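To make the OpenMP point concrete: with a sufficiently recent upstream clang, the same target-offload source compiles for either vendor just by changing the offload architecture. The architectures and file name below are illustrative examples:

```shell
# Same OpenMP target-offload source, two different GPU backends
clang -O2 -fopenmp --offload-arch=gfx90a saxpy.c -o saxpy_amd     # AMD MI200-class
clang -O2 -fopenmp --offload-arch=sm_80  saxpy.c -o saxpy_nvidia  # NVIDIA A100
```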
Obviously I can't totally rule out a change in policy - some corporate mandate may come down that GPU compute is done with this open source model and customers will just have to accept what they're given. There are evidently factions within the company who would claim it is the right thing to do and there's a lag on consequences to changes like that. Hopefully maximising HPC sales will imply open source software for a long time.
> Most games dev is decidedly hostile to open source, and most windows development is likewise,
Yes but there's no problem with incorporating BSD/MIT license tools into proprietary games - that's the point of BSD/MIT. I know in general open-source isn't how games roll but games can use MIT tools without issue.
Windows drivers being open isn't something I realistically expect AMD to do, there's not really a demand for it (and maybe not even a delivery pipeline for it in this age of anticheat) but, still sucks. y'all also still have plenty of proprietary lock-ins, like Infinity Fabric Coherent Link vs CXL. They just are in areas where "everyone does that". And yup, so do you. Building a hardware+software stack is a large expense that nobody wants to give away. Gasp, proprietary. Is Gsync different from Infinity Fabric Coherent Link? Someone paid a lot to build them both.
The lack of third-party chipsets nowadays is a bummer. Nobody's gonna allow Nforce chipsets anymore. Although to be fair on AMD you can boot it as X300 (and epyc is SOC) and then do whatever you want as a chipset, I guess. It's just an I/O expander, not tied to the bringup like Intel.
> I remember the epyc processor lock story breaking. I don't know how that played out in practice. I do know I'm going to be _really_ angry if it turns out my chip only works in asrock motherboards since that was the first one I put it in. I'd forgotten about that when buying it :(
Asrock isn't the problem, it's putting Dell/HPE/etc into your Asrock. Essentially AMD destroyed the threat of any competition from secondhand corporate server sales, or at least significantly complicated the issue. There will never be the kind of flood (like is currently happening with 2011-3) where you can drop a server chip into your whitebox build at 1/100th the price it was originally sold for.
Or at least you will have to be very very careful about what you buy - and ebay sellers won't deal in that level of detail most of the time. Is that CPU that doesn't say it's brand-locked really unlocked, or just the seller doesn't know the history? Tune in next week to find out!
Yes, zey do zat on purpose. Absolutely. If it was about swapping CPUs inside datacenters to prevent attackers/hostile hardware you couldn't swap CPUs between motherboards at all, if I need to find another HP to swap with this HP umm ok it's $20 on ebay? It's only between brands because that's the MVP to kill secondhand sales, and it's permanent.
> I am absolutely sure that the overall ROCm architecture was not designed to achieve user lock-in. To the extent I can see an overarching design, it looks like doing the simplest thing we can think of that works OK on clusters.
Yeah that's fair. I can buy that ROCm is the minimum viable product for getting to National Labs HPC sales. I also just think you've got a huge problem with ROCm in general, it's not mature and it's not even a path that will lead you to something viable.
People don't want to distribute source, ever. Windows user application stories starting with "install wsl2" are no bueno let alone if you make it tough because they bought the wrong one. Compiling sucks, that's why Docker is a thing now, distribute a whole userland because compiling sucks and dependencies suck.
Static compilation on FSR may come back to bite AMD sooner or later - games are already dropping off the treadmill and because AMD didn't distribute it as libraries... welp sux, can't even swap DLLs. And they're going to wish they had, when FSR 3.x and FSR 4.x come out. AMD has their own improvements they will need to make and they've ruled out modularity or microdeploys, everyone has to compile and revalidate and submit the whole damn game again.
> I believe there are commercial reasons why we can't implement CUDA (and thus created HIP, which looks pretty similar and I'm told runs on amdgpu or on nvptx).
Microsoft funded GPU Ocelot, lol, but maybe that's a less direct conflict of interest. Maybe join Intel in oneAPI? I've heard good things about the concept (SYCL), although I haven't looked at specifics.
HIP is dependent on ROCm. Tied at the HIP, if you will. Don't have a $5k commercial datacenter card? wow sucks. Actually you will also need to compile that as a different slice so that will be even more complicated.
You gotta fix ROCm before you can lean on the "but we have HIP" thing. You aren't going to win adoption without widespread prosumer support, which ROCm doesn't really offer. You also need to fix the support story with binary slices: why can't it target a whole family ("RDNA2") instead of die-by-die? That sucks. CUDA manages it, and there's a well-defined CUDA Compute Capability table too, which makes feature-targeting code easy to write: umm, this compiles on Turing and up.
It's the college kids and the universities who win you the next 15 years. Know where I learned CUDA? University. What did we have? K40s. What did I do all my dev work on? My grad workstation with a GT 640 I bought for $60 (2 whole GB), and my Thinkpad. Is that the kind of thing that is supported by ROCm? No.
> There are evidently factions within the company who would claim it is the right thing to do and there's a lag on consequences to changes like that. Hopefully maximising HPC sales will imply open source software for a long time.
I know and I'm not ragging on you personally. I didn't know who you were either but it's been an interesting chat.
At one point I set up ROCm with an RX 580, but... it only supported older versions of TensorFlow, I didn't manage to get PyTorch working, and the installation/setup was not fun.