> "Basically nobody writes CUDA," wrote Keller in a follow-up post. "If you do write CUDA, it is probably not fast. […] There is a good reason there is Triton, Tensor RT, Neon, and Mojo."
> Even Nvidia itself has tools that do not exclusively rely on CUDA. For example, Triton Inference Server is an open-source tool by Nvidia that simplifies deploying AI models at scale, supporting frameworks like TensorFlow, PyTorch, and ONNX. Triton also provides features like model versioning, multi-model serving, and concurrent model execution to optimize the utilization of GPU and CPU resources.
> Nvidia's TensorRT is a high-performance deep learning inference optimizer and runtime library that accelerates deep learning inference on Nvidia GPUs. [...]
Keller was speaking of OpenAI's Triton (https://openai.com/research/triton), a Python-like language that is compiled to code for Nvidia GPUs, but Tom's Hardware mixed this up with Nvidia's Triton Inference Server, a higher level tool that's really not a replacement for CUDA and not directly related to the Triton language. Easy to confuse these if you are a writer in a hurry.
Wow, that's some omission in the article. It's mentioned at the very bottom, but with no disclaimer that it might influence his opinion, given that they're a competitor:
> His statements also imply that even though he has worked stints at some of the largest chipmakers in the world, including the likes of Apple, Intel, AMD, Broadcom (and now Tenstorrent), we might not see his name on the Nvidia roster any time soon.
I'd tend to lean this way. Keller is someone who could generally call any chip company in the world, and have a top job tomorrow. Not some scrappy dreamer who's just trying to steal customers from a big player.
The field itself is already pretty niche. The number of people on his level would probably fit on an index card. That probably helps the most.
He's obviously a smart guy, but to me his wisdom comes from being able to actually understand the system and explain it. Whether he actually codes it or is even that smart becomes irrelevant; he has the ability to see the big picture and get aces in their places.
But mostly the first point. The niche is so small you'd probably be hard pressed to find someone in comments who actually has worked alongside him to even attest to any of this.
It's immensely clear that when he goes somewhere, some big advancement usually happens, and he can explain why. Even if that's -all- he knew, it'd be enough to make him super valuable. So if we have to keep going back and pinning traits, I'd say it's some amount of intelligence mixed with a whole lot of passion.
I meant more in terms of the article content. The entire article is about his point-of-view, but it's surely useful to the readers to know if he's just someone who happens to have an opinion, or if he also directly competes with the project/company he is talking about. But they never say that.
Somehow, I don't see him going to Nvidia in any case. Their designs are very different from his philosophy, and he can be quite opinionated. I'd characterise Tenstorrent as his attempt to build the opposite of an Nvidia style accelerator.
If only there was a service that could watch a 2 hour video and give you a detailed answer to this question in a few seconds just by asking it… soon enough I guess.
Feeding the output of a transcription service (e.g. https://youtubetranscript.com/?v=G4hL5Om4IJ4 ) into an LLM for specific requests should be possible, likely in multiple steps to handle context limitations.
Gemini does that already. Here's the output when asked what Keller's philosophy is, given that YouTube interview:
According to the video, Keller’s philosophy emphasizes the importance of both theory and engineering in the field of computer science. He believes that theory provides a foundation for understanding how things work, while engineering is the practical application of that knowledge. He argues that both are essential for making progress in the field.
Keller also emphasizes the importance of craftsmanship and attention to detail. He believes that the best engineers are those who take pride in their work and are constantly striving to improve it. He believes that this is essential for building high-quality, reliable computer systems.
Finally, Keller believes that it is important to be open to new ideas and to be willing to experiment. He believes that this is the best way to make progress in the field of computer science.
Here are some specific examples from the podcast that support Keller's philosophy:
- Keller discusses the importance of theory in the development of branch prediction, a key technique for improving computer performance. He explains that while the basic idea behind branch prediction was known for many years, it was only through theoretical advances that it was possible to develop a practical implementation.
- Keller also discusses the importance of engineering in the development of the Alpha 21264 microprocessor. He explains that while the chip was a groundbreaking design, it also had some flaws that were only discovered after it was released. He says that these flaws could have been avoided if the engineers had paid more attention to detail.
- Finally, Keller discusses the importance of being open to new ideas. He talks about his work on the TenstorFlow chip, which is a new type of chip designed for machine learning applications. He says that he was initially skeptical of the idea, but that he eventually came to believe that it had the potential to be a major breakthrough.
Overall, Keller's philosophy is one of pragmatism and open-mindedness. He believes that the best way to make progress in computer science is to be willing to experiment and to learn from both successes and failures.
...
Here are some timestamps related to Keller's philosophy in the YouTube video:
1:18:02 - Keller discusses the importance of both theory and engineering in good design.
1:23:22 - Keller gives the example of branch prediction as a breakthrough in engineering that was based on theory.
1:34:12 - Keller talks about the importance of craftsmanship in engineering.
1:42:15 - Keller discusses the limitations of human thinking and the importance of being open to new ideas.
2:12:22 - Keller talks about the responsibility of engineers to society.
Yeah, this review is pretty superficial. It seems like it was written by a college freshman who barely engaged with the interview. (Thanks for sharing it though.)
No, he showed up way after founding, he might have been an advisor before though. I'm not sure what the founding CEO is up to now but he had worked in AMD's GPU group and might have met Keller during that work.
Isn't low-level hardware really at the heart of a lot of this? Hasn't a lot of the criticism of CUDA been that it's incredibly difficult for others to implement on other hardware because of the low-level interactions and Nvidia's usage of dark APIs (can't recall the term I've heard used)?
Wasn't this one of the reasons AMD abandoned/deprioritized their efforts on such a project?
I write fast CUDA so I guess I'm one of the lucky few - which I already knew. I spent a lot of time benchmarking, instruction counting the PTX, function approximations, interleaving, etc. I don't think it would cost that much money to write new ops for new hardware for existing frameworks, but for some reason very few do it - I figure it must be cheaper than making the hardware. I assume it's "penny wise, pound foolish", or maybe it's that it's very hard for hardware companies to do software as they try to pay their software guys like hardware guys.
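To give a flavor of what that kind of tuning looks like (a toy sketch, not my actual code): swap the accurate library calls for the fast hardware-approximation intrinsics, then check what the compiler actually emitted.

```cuda
// Toy sketch only - just the flavor of the accuracy-vs-instruction-count trade-off.
// Compile with `nvcc -ptx fast_sigmoid.cu` and read the PTX to count instructions,
// or profile both variants to see what the approximation buys you.
__global__ void fast_sigmoid(const float* __restrict__ x, float* __restrict__ y, int n)
{
    // Grid-stride loop so a single launch covers any n.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
        // __expf and __fdividef are the fast, lower-accuracy hardware approximations;
        // compared to expf() and a plain divide they cut the instruction count,
        // which is exactly the kind of thing that shows up when you count PTX.
        y[i] = __fdividef(1.0f, 1.0f + __expf(-x[i]));
    }
}
```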
More and more I'm doing branchy algorithms, so the CPU is now my new bottleneck - thank you, fast GPUs - but I don't think the market will move to this way of doing things before the current AI wave has crested. So it is probably not good business sense to target my use cases. I have quite a lot of concurrency, so I think my ideal hardware is a whole lot of little CPU cores with decent cache and Half Matrix Multiply Accumulate (HMMA) instructions. Once my CPU bottleneck becomes too painful I'll look at what options are on the market. I think, but am not sure, that the early Tesla AI chips focused too much on optimizing ResNet and that later they went with CPUs with the HMMAs. Instead of using a variety of instructions I can approximate most functions with more HMMAs.
EDIT: changed my reference of matmul intrinsics to the more precise HMMA instruction
> I have quite a lot of concurrency so I think my ideal hardware is a whole lot of little CPU cores with decent cache and matmul intrinsics
Back in 2015 I thought this would be the dominant model in 2022. I thought that the AI startups challenging Nvidia would be about that. Instead, they all targeted inference instead of programmability. I thought that Tenstorrent hardware would be about what you are talking about - lots of tiny cores, local memory, message passing between them, AI/matmul intrinsics.
I've been hyped about Tenstorrent for a long time, but now that it is finally coming out with something, I can see that the Grayskulls are very overpriced. And if you look at the docs for their low-level kernel programming, you will see that Tensix cores can only have four registers, have no register spilling, and also don't support function calls. What would one be able to program with that?
It would have been interesting had the Grayskull cards been released in 2018. But in 2024 I have no idea what the company wants to do with them. It's over five years behind what I was expecting.
My expectations for how the AI hardware wave would unfold were fit for another world entirely. If this is the best the challengers can do, the most we can hope for is that they depress Nvidia's margins somewhat so we can buy its cards cheaper in the future. As we go towards the Singularity, I've gone from expecting revolutionary new hardware from AI startups to hoping Nvidia can keep making GPUs faster and more programmable.
Ironically, that latter thing is one trend that I missed, and going from Maxwell cards to the last generation, the GPUs have gained a lot in terms of how general purpose they are. The range of domains they can be used for is definitely going up as time goes on. I thought that AI chips would be necessary for this, and that GPUs would remain as toys, but it has been the other way around.
I wasn't as optimistic that there would be a broad adoption of some of the more advanced techniques I was working on so I did figure back in 2013 that most people would stick to the GEMMs and Convs with rather simple loss functions - I had a hard enough time explaining BPR triplet loss to people. Now with LLMs people will be doubling down on GEMMs for the foreseeable future.
My customers won't touch non-commodity hardware as they see it as a potential vector for vendors to screw them over, and they're not wrong about that. In a post-apocalyptic scenario they could just pull a graphics card out of a gaming computer to get things working again, which gives them a strong feeling of security. Having very capable GPU cards as a commodity means I can re-use the same ops for my training and inference, which roughly halves my workload.
My approach to hardware companies is that I'll believe it when I see it; I'll wait until something is publicly available that I can buy off the shelf before looking too closely at its architecture. Nvidia with their Tensor Cores got so good so quickly that I never really looked too closely at alternatives. I'm kind of hopeful that an AMD SoC would provide a good edge compute option, so I might give that a go.
I had a look at tenstorrent given this article and the Grendel architecture seems interesting.
Grayskull shipped in 2020 and each Tensix core has five RISC-V cores. Get your basic facts right before you complain. The dev kit is just that, a dev kit. Groq sells their dev kit for $20k even though a single LPU is useless.
> "Basically nobody writes CUDA," wrote Keller in a follow-up post. "If you do write CUDA, it is probably not fast.
I am a big fan of Jim Keller, but this is a semantic argument that CUDA the language is not a moat, when most people mean CUDA the libraries.
CUDA has first-party support in most of the libraries in use today; if you stray from that happy path you need deep pockets to work through the hurdles of something like XLA.
There's a reason why Microsoft, Google, Amazon and Meta are still buying Nvidia accelerators even when they have their own in-house accelerators.
A good number of years ago I wrote my own Torch-like C++ NN framework using CUDA (cuDNN, cuBLAS) for the GPU, and one of the annoyances is that cuDNN has incomplete coverage of even the basic operators needed. Add/Sub/Min/Max/Sqrt/Negate are all provided as part of cuDNN, but if you want other common NN building blocks like Div/Exp/Log/Pow/Inv/InvSqrt then you have to write them yourself in CUDA and either forgo cuDNN's tensor-descriptor layout flexibility or re-implement that yourself too.
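For illustration, here is the kind of kernel you end up writing yourself (a toy sketch assuming contiguous, same-shape tensors, i.e. without re-implementing the tensor-descriptor layout handling mentioned above):

```cuda
// Hypothetical elementwise Div - the sort of op cuDNN doesn't provide.
// Assumes a, b and out are contiguous device buffers of identical shape.
__global__ void elementwise_div(const float* __restrict__ a,
                                const float* __restrict__ b,
                                float* __restrict__ out,
                                int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] / b[i];
}

// Host-side launch would look something like:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   elementwise_div<<<blocks, threads>>>(d_a, d_b, d_out, n);
```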
Of course frameworks like PyTorch support all the operators you'd expect, since they've written their own kernels in CUDA where the functionality is missing from cuDNN.
It's not clear exactly what Keller is referring to there. When people say CUDA they might be referring to the entire ecosystem (the nvcc CUDA C/C++ compiler, CUDA APIs, higher-level cuBLAS, cuDNN, etc.), or maybe just the base compiler (which lets you write your own kernels) and the API for allocating memory, queueing kernels, etc.
The cuDNN kernels (convolution, etc) are highly optimized, as is cuBLAS (e.g. matmul), and I doubt anyone is going to do better writing these themselves. Does Keller consider using cuDNN as "writing CUDA" ?
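To make that distinction concrete, "using cuBLAS" is just host code calling into the pre-built library, with no hand-written kernel involved (illustrative snippet, error handling omitted):

```cuda
#include <cublas_v2.h>

// C = A * B with A (m x k), B (k x n), C (m x n), all column-major device buffers.
// The heavily optimized matmul kernel ships inside cuBLAS; we never write it ourselves.
void sgemm_example(const float* dA, const float* dB, float* dC, int m, int n, int k)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m, dB, k,
                &beta, dC, m);
    cublasDestroy(handle);
}
```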
As far as I'm aware the higher-level, performant CUDA libraries, as well as specialized components like TensorRT, are written in CUDA, although that could mean a combination of C/C++ and PTX pseudo-assembly (ptxas is really a compiler, not an assembler). The alternative would be that they were written in hand-optimized SASS assembly, which afaik is only available outside of NVIDIA via the open-source CuAssembler.
I believe Mojo support for NVIDIA is based on PTX. I'm not sure if that would really be considered "CUDA" or not if they are not using nvcc at all.
Most people, outside of framework vendors, would have no reason to use CUDA anyway, since it's just too low level. The only sane use case would be where writing a custom kernel in (e.g.) PyTorch or Mojo doesn't get the performance you want and you write that one kernel in CUDA. The hope would be that the Mojo compiler is good enough that this would not be necessary.
Well he is obviously just talking his book. He is selling a Nvidia competitor. If CUDA isn’t beautiful and the future deserves to be beautiful, why don’t you make it and show us, Jim? All we are getting is more hot air about CUDA. That is all AMD and Intel have as well, bold statements about the future and no code.
I think you’re assuming these big companies are serious about AI itself and not just joining in on a trend. They would much rather spend huge amounts of money on chips than hire a competent researcher who might tell them that they didn’t really need to waste so much money.
The computer shops in my city's shopping malls have enough laptops with NVIDIA GPUs. While they aren't at the same level as desktop or server GPUs, they are available and usable for people to learn CUDA, quite the contrary to most AMD offerings.
Not sure what you mean; both NVIDIA and AMD desktop GPUs are available, there is no particular shortage. Prices can be high for top ones, like the NVIDIA 4090, but you can buy one.
Server GPUs are the ones difficult to buy.
People don't really learn CUDA, they use it mostly through another library like PyTorch.
I think it's more of a subtle publicity stunt for Tenstorrent's Grayskull [0] cards, which are highly optimized for inference.
With CUDA you get a lot of flexibility which you don't get with Grayskull, at the expense of having a more layered/patched API which has been evolving for many years while R&D came up with new solutions to new problems.
I wonder what Nvidia's capabilities are in creating new, highly optimized cards, similar to Grayskull, but I wouldn't be surprised if they don't have any interest in creating them since their current products are already consuming all their resources.
It's great to see that companies like Tenstorrent are offering more optimized and pricewise more accessible products. It would also ease the situation with how hard it is to get Nvidia cards, because even companies doing only inferencing are buying chips which are developed to be able to do much more than that.
My understanding is that Grayskull is a dev board for inference; Wormhole is supposed to do both, because you can scale it out with multiple cards.
> With CUDA you get a lot of flexibility which you don't get with Grayskull
Do you? From what I can tell, tt-metalium gives you quite low-level access.
Grayskull/Wormhole/... basically have a bunch of Tensix cores, each of which has 5 small RV32 cores, where some are used for data movement and some for computation. IIRC two (could be one) drive a SIMT compute unit.
From what I can tell there is no "matrix in, matrix out" style fixed accelerator; the SIMT unit looks decently flexible. The Grayskull one is a bit limited ISA-wise; Wormhole is supposed to improve on that.
tt-buda is supposed to be the high-level API with PyTorch, TensorFlow, ... support, but I don't know how much compatibility there is at the moment. I assume there is a lot of software work left.
Wormhole is an accelerated NIC, like what Nvidia got access to when it bought Mellanox. It's meant for high-speed routing.
Yet you got me reading the following:
> t-Series Workstations
> Our t-series workstations are turnkey solutions for running training and inference on our processors, from a single-user desktop workstation in the t1000 up to the t7000 rack-mounted system designed specifically to function as a host with our Galaxy 4U Server
Neural networks have two different compute costs: training and inference.
These are roughly analogous to compile time vs runtime for compiled programming languages.
Training is in general a more intensive task. However, in an ideal scenario training is run once and inference is run millions of times, so the lifetime cost of inference is bigger - this is why it might make sense to optimize for inference.
Inference consists of a single forward pass, which is easy to compute. Meanwhile training has both a forward pass and backward pass (back propagation). The minimum required precision for training is higher since you want to be able to have both fast and slow rates of change. Bfloat16 is preferred for training, while int8 tends to be good enough for inference.
You're trying to get an output from giving an input to a bunch of matrices (tensors if you will). You do not touch anything in the matrices themselves, which happens during training. Essentially, they're great for deploying models but not for making one.
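As a toy sketch of what "just a forward pass" means (hypothetical naive CUDA, nothing like a real framework kernel): the weight matrix is only ever read at inference time, whereas training would also need a backward pass that computes gradients and updates it.

```cuda
// Naive dense-layer forward pass: y = W*x + bias, one output element per thread.
// W (out_dim x in_dim, row-major) and bias are frozen parameters - read-only here.
__global__ void dense_forward(const float* __restrict__ W,
                              const float* __restrict__ bias,
                              const float* __restrict__ x,
                              float* __restrict__ y,
                              int in_dim, int out_dim)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < out_dim) {
        float acc = bias[row];
        for (int k = 0; k < in_dim; ++k)
            acc += W[row * in_dim + k] * x[k];
        y[row] = acc;
    }
}
```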
Swamps keep users trapped in, by making it difficult for a competitor to make a compatible alternative. Had Nvidia (or Intel or Microsoft) kept their product lean by aggressively removing backwards compatibility, they would have lost users AND made it easier for another company to compete.
I think that's a misunderstanding of the analogy. The moat isn't about trapping users.
Moat as its typically used is to say that a business is protected by the moat, the prize being the castle/business. The residents/business are free to build and prosper, be productive in other words.
When Keller calls it a swamp, he's trying to say there's no castle to be productive in. The swamp makes it a mess, and thus hard for the business to actually maintain productivity. It is still difficult for the invaders to break in (?) but they don't need to...because it's a swamp. They should be busy making a better castle -- which is what Keller is trying to do.
Disagree with him or not, but I think that's the analogy.
Or maybe just humorously alluding to swamp castle in Monty Python and the Holy Grail. Not sure where you are getting your analogy from, but my guess is Jim's not really making an analogy. Mixing a batch of metaphors is not cooking chips.
PDF hasn't been a vendor-locked swamp for many years now. It got successful because it was better than TIFF for scanned images and better than raw Word/Postscript for print products.
There's a ton of PDF forms that I deal with at work that only work with Adobe Acrobat. When I open them in other PDF viewers, it'll be a single page telling me that if the PDF doesn't load in a few seconds then my PDF viewer is not compatible with the document.
I'm not an expert, but I imagine that they were created with Acrobat and it embeds some (proprietary) stuff in the PDF that only works with Acrobat. Acrobat is also the only PDF client that I'm aware of that supports signing PDF forms with certificates. There's other features in some PDFs that I've encountered that'll only work in Acrobat as well, like buttons embedded in the PDF to email the form or dropdowns in the form that don't work in other viewers.
It seems pretty swampy to me when 90% of the PDF forms I deal with can only be (fully) completed in Acrobat.
Using JavaScript, it's easy to create a standards-compliant PDF with inconsistent behavior in popular viewers. For example,
https://jasomill.at/HelloGoodbye.pdf
contains JavaScript that, per the spec, should display a pop-up alert ("Hello Acrobat <version>") when it is opened and another ("Goodbye Acrobat <version>") when it is closed.
And of course there were multiple unlicensed implementations. Famously, Apple switched from Display Postscript in Rhapsody and Mac OS X Server to Quartz and PDF in Mac OS X to remove the DPS licensing costs.
Color me skeptical. PDF viewing is reasonable, but where are the decent PDF editing tools? I've looked, and it seems like nobody but Adobe is allowed to do real PDF editing. Many, many third-party apps claim to, but only do superficial editing and don't remotely match Acrobat's tools. I don't know how the API licensing works, but I imagine this makes sense because the real market for PDF is business documents that involve signatures, watermarks, password security, versioning, maybe some tracking, etc.
I know you're talking about GUI editing, but I've found libqpdf[1] incredibly useful for making programmatic PDF edits with minimal (typically no) structural disturbance.
PDF manipulation is IMHO a niche application. If one really has to do that at scale, it would probably be better to fix the problem upstream and generate it correctly in the first place, or to layer the changes on top of the existing PDF and sign it.
That’s like saying image, audio, or video editing is niche. The number of PDF editing apps online suggests you might have formed an opinion before doing any research? Apparently lots and lots of people want to edit PDF files, enough that there’s a ton of competition. Why do you assume editing a PDF means the file is somehow incorrect, or that there exists any viable upstream solution? It’s more likely the edits are intended to make a new version, and don’t even belong upstream, and that editors have no access to upstream anyway. In addition to editing, there’s a whole industry of scraping PDF publications for data.
Editing the PDF just seems like the wrong place for the edit, since it is a format intended for interchange and archival.
Layering changes on top makes total sense. Changing existing objects seems fragile. But maybe I feel that way because the format is known for being so complicated and bespoke that only Adobe's tools can edit it reliably. Just about any document format which can be converted to PDF is easier to edit, therefore it would make more sense to apply changes before conversion to PDF, if at all possible.
I agree that reading and scraping PDFs at scale is a very common operation.
Quite a thing to say as head of an AI hardware startup.
Most of the machine learning libraries we use/test in production (and that I use personally on my desktop) have hand written, extensively optimized CUDA kernels. It would be problem #1 switching to Tenstorrent hardware.
Not that I like that one bit, or that I really understand the optimization of these libraries. I do see triton code as well, but it still seems to be the suboptimal path.
But why do I need to write CUDA kernels? The CUDA kernels I need already exist, and I don't particularly care if writing them was so horrible that it required scratching your eyes out and sacrificing kittens to demons - I would care about the (re)writing effort only if I was considering other hardware that doesn't yet have the support that CUDA has as then I'd actually have to do that myself.
CUDA seems to have a lot of mind-share as a strategic asset, but all the evidence I've seen is that AMD defeated themselves. I've never seen them fail due to lacking features or capability, and I haven't heard stories of it either. The complaint seems to be, consistently, that the drivers crash.
If CUDA is so good, it'd be great to know what it does that AMD cards can't. I've never gotten that far because I hit what seems to be some sort of kernel panic or driver lockup.
> I've never seen them fail due to lacking features or capability, and I haven't heard stories of it either.
I've seen a mix of both for years. For the last 6 years the story has usually been some combination of "X program technically supports it, but you'll get 20% of the performance you'd expect given the hardware and/or you'll need to go through a bunch of hoops, and even then you'll have to troubleshoot tons of random errors and/or it'll crash the program or system"
That's on top of "nobody supports ROCm, or if they do it's because of a single person who made the PR - you're on your own though, because none of the core contributors have AMD hardware", which I'll admit is a chicken-and-egg problem.
But given these two factors, it's always meant "if you want to buy hardware to do data science, you have to go Nvidia if you don't want to write the support yourself, and/or face an insane headache that often results in switching to Nvidia hardware anyway".
Supporting anecdote: I built a PC a few years ago, and first tried multiple AMD cards. All of them worked fine while using the default windows drivers, but as soon as I installed AMD's drivers, they stopped working. Spent weeks troubleshooting, no luck.
Most developers would rather write code for a “swamp” that has long term backwards compatibility, than for a perfect ivory tower that will get replaced in 2 years by another more perfect ivory tower.
Interesting that the web ecosystem basically works on the ivory tower model. Roughly every 2 years there's some new ivory tower and a lot of greenfield projects move to that. Brownfield projects become legacy and go into maintenance mode or get rewritten after some number of years.
I've seen something like this:
- Delphi desktop app (some time in the 2000s)
- Adobe Flash web app (in late 2000s to mid 2010s)
- 2015-era JS framework web app (mid 2010s to present, considered legacy codebase now, no feature development)
- 2020-era JS framework web app (late 2010s to present)
I expect another major rewrite or move to a new project in the next ~5 years
Depends on where you look. I have been pretty much Java and .NET since 2005 in what concerns Web development and backend infrastructure in distributed systems.
The web frontend works that way because little depends on the frontend. But the infrastructure one layer down is obsessive about backwards compatibility, both in the browser and in HTTP servers. JavaScript frameworks change all the time, but nginx and apache and friends have lasted for many decades, for good reason.
Given the rise in ARM in many areas of computing, I'm not sure how much of a moat x86 is / will be long-term.
Is there anything 'inherent' to the instruction set that cannot be done anywhere else? I know that many Intel/AMD systems now have a bazillion cores, which is very handy for some things. Also, AMD seems to have a large number of PCIe lanes, which is great for I/O between (e.g.) the network and on-system stuff.
Before Apple's M1, ARM was slow and near unusable for x86-class machines. There were some cheap ARM Chromebooks with a much shorter lifespan than their x86-based brethren, and an ARM tablet that Microsoft abandoned, and that's about it.
After the M1, the ARM companies needed to up their game, and Qualcomm is rumoured to finally release something competitive with a real x64 chip in the near future. This isn't the first time they've made claims like these, though, and I very much doubt they'll live up to their promise.
Meanwhile, x64 has caught up to Apple in terms of compute power (especially per dollar, which is the reason x64 is so popular), is getting closer and closer to Apple's power consumption levels, and unless the M4 will have dramatically more performance, the ARM overtake will just have been an outlier.
I'm not so sure how long non-Apple ARM will be able to stay competitive given the turnaround x64 has managed to make in just a few short years. It looks like x64 still has plenty of room for improvement, and this proves that the reason compute performance has plateaued was that there was no real competition.
> currently there are actually no modern x86 CPUs on the market. Both Intel and AMD don't actually use x86 cores, but instead proprietary RISC cores, with microcode that translates the x86 code to RISC code on the fly at execution time.
> In the P6 and later microarchitectures, x86 instructions are internally converted into simpler RISC-style micro-operations that are specific to a particular processor and stepping level
I would love to know what the person who wrote that stackexchange answer would say in response to the question, "so which x86 processors used 'real' x86 cores?" Because from the very beginning of x86, there's been translation of the front-door opcodes into internal opcodes via microcode.
So it isn't like there is a literal RISC processor inside the x86 processor with a tiny little compiler sitting in the middle. It's just that the out-of-order execution model requires instructions to be broken up into subtasks which can separately queue at the core's various execution units. Even pipelining instructions still wastes a lot of silicon (while you're doing an integer add the floating point ALU is just sitting there, bored) so breaking things up this way greatly improves parallelism. As I understand it, modern μop-based processor cores can actually have dozens of ALUs, multiple load/store units, virtual->physical address translation units, etc all working together asynchronously to chug through the incoming instructions.
> currently there are actually no modern x86 CPUs on the market. Both Intel and AMD don't actually use x86 cores, but instead proprietary RISC cores, with microcode that translates the x86 code to RISC code on the fly at execution time.
This kind of factoid does more to obscure the truth than it does to illuminate it.
The truth of the matter is that all high-end CPUs do a µop translation, whether or not their frontend is a CISC or RISC ISA. Indeed, the very notion of CISC versus RISC is way overwrought in architecture textbooks, and this probably produces the garbled thinking: since everyone "knows" that CISC can't be superscalar, this means that the Pentium (in making superscalar x86) has to somehow be RISC.
Another thing to note is that there's not really anything called CISC. RISC is the overall term for a family of computer architecture design methodologies arising in the 80s that argued for compiler-centric rather than assembler-centric design and simpler instructions, sometimes to the point that you omit hardware and call it a feature (e.g., delay slots). CISC is... everything else; it's a strawman constructed for RISC to compete against rather than a coherent design methodology.
In actual practice, though, RISC v CISC hasn't been relevant for decades. Some of the RISC design ideas have won out: there's generally a high emphasis on instructions that can be selected by the compiler over hand-tuned assembly, for example. But things like delay slots have been generally considered a failure. The architectures that are the most successful--x86 and ARM--are the ones that blur the line between RISC and CISC the most.
Actually, if you scrubbed the x86 assembly away and came up with some new assembly syntax (including new mnemonics of course), you could probably sell the x86 ISA as a "compressed RISC" ISA and get many people to believe you that it was designed as a RISC. x86 doesn't have many instructions that have crazy interrupt rules or multiple memory references (the string instructions are the main exceptions here), and it's this property which turns out to be really key to making something high-performance or not. It would be better for us to be honest about what enables or doesn't enable superscalar architectures rather than trying to argue that somehow x86 cheated its way to success.
Thank you for saying this. The amount of hype RISC gets is unbelievable. It is especially irritating because almost every major RISC design which came out of the 1980s and early 1990s failed in the market. PA-RISC is dead. PowerPC, MIPS and SPARC are either on life support or only used in cheap, low-cost, or maybe low-power products. I believe IBM's Power architecture is still going, but I suspect it's slower than AMD's and Intel's x86 chips (IBM does not publish SPEC CPU benchmark results, which is not a good sign).
Basically, I have noticed that people have been claiming RISC is the best for decades and yet real RISC architectures failed to beat x86's performance. Even worse, most of the RISC workstation and server makers exited the business, went bankrupt, or only sell systems to legacy customers who have not migrated to x86.
It seems I make this comment every 6 months or so, but... x86 uops are not at all RISC-style. They are very large (classic RISC uses a 32-bit encoding, uops are over 100 bits), variable size (classic RISC is fixed-length, uops can use multiple slots to encode constants) and can encode load-op (classic RISC separates loads and stores from operations).
There is also the fact that RISC-ness is a property of the ISA, so applying it to describe a microarchitecture doesn't make much sense.
Also, the more transistors we stuff on to a chip, a process that is still continuing, the less are proportionally dedicated to these decoding pipelines. My impression is that they are already relatively insignificant parts of the chips, though corrections welcome.
I wonder if there's any ISA that could be written against the x86 cores that would be more efficient. In theory the chip could use that as a mode, so the chip itself wouldn't require an entire OS to shift before things could use it. I don't know anywhere near enough about the inside of an x86 core to have even a clue if such a thing would be possible. But it would be an interesting escape hatch from x86. Anyone who can flesh this idea out with knowledge of the x86 core internals is welcome to explain to me why my idea is bad and I should feel bad.
Note that you couldn't directly simplify the architecture by just writing an assembler for the RISC microcode. The reason is that the RISC microcode is much larger than the x86 it is derived from. This on the fly decompression does wonders for cache hit percentage.
RISC is about the ISA, which is the interface between software and hardware. It is irrelevant whether the hardware is implemented with microops or hamsters running on wheels.
Yet if the existing microops used in popular CPUs made today were actually exposed as the interface between software and hardware, they would be VLIW ISAs, not RISC ISAs.
The whole "CISC has a RISC inside" idea is thus wrong at many levels, and just leftover cultural damage from an Intel PR campaign ("RISC vs CISC doesn't matter"), which was very effective at misinforming the tech world.
Note that Intel chips did no-doubt win in the market, but that was despite CISC, rather than thanks to CISC: Intel had a MASSIVE fab advantage, alongside the software moat around Microsoft OSs.
There are two ways to interpret this statement. First is that ISA is completely irrelevant since you can just convert to whatever is best. This is bad for RISC since there is literally no advantage gained by losing backwards compatibility.
The other interpretation is that micro ops are actually a microarchitectural optimization and that RISC processors should use it too. The compressed instruction set of RISC-V is leaning towards that direction. RISC-V processors internally convert RISC-V instructions to RISC-V.
It is kind of hollow to talk about it, since it is hard to say whether this is a unique disadvantage or advantage.
OTOH I have been hearing about ARM's rise for a long time now. It's not just performance per watt, and even that's just Apple beating everyone. See Windows on ARM.
x64 is here to stay with its incredible ecosystem and OSS friendliness - Coreboot, Intel and AMD GPU drivers, great choice of operating systems and apps with a fairly long usable lifespan even with Windows updates.
There is nothing inherent in the ISA. The moat of x86 is purely backward compatibility and decades of experience of ARM and Intel in building large systems.
ARM has abysmal SIMD support, with nothing even equivalent to 256-bit AVX; for reference, that is about a 10-year lag behind x86. Neon is not an adequate substitute. Additionally, when comparable SIMD workloads are run, the power draw of an ARM chip skyrockets to x86 levels, defeating the advantage ARM has over x86 while offering worse performance. ARM is good for things that require many threads and are not heavily dependent on incredibly high integer performance per thread. x86 dominates in high-power scenarios, where individual thread performance is more important due to application requirements or size limits, or where you need to do anything with SIMD. ARM has a lot of push behind it, but contrary to techbro hype, it is not a drop-in replacement for x86, and I don't think it ever will be without shooting itself in the foot and becoming less efficient.
Focussing on vector length is an error. You should care about throughput and number of different instructions you can retire.
If I have a core with four 128-bit NEON vector units, I have the same throughput as an x86 with two AVX2 units or one AVX-512 unit. However, that 4x128-bit core is actually more flexible than the other two, as I can do 4 different things at once, or 4 scalar operations per cycle. (Of course the downside is you spend more frontend resources on decode.)
Given that most code isn't vector code, the multiple short vector length approach is actually superior on many common real-world workloads that aren't machine-learning (and CPU is unit-of-last-resort for large ML workloads anyway).
ARM has, optionally, the very capable SVE instruction set, but no consumer chips currently implement it.
The M1 and later big cores can dispatch four NEON FMA instructions per clock, so 512 bits worth of vector math, which compares OK with most Intel or AMD chips (Zen 4 can do two 256-bit MUL and two ADD, and Intel "client" bigcores since Sunny Cove typically do three 256-bit FMA).
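In fp32 terms (a 128-bit register holds 4 lanes; an FMA counts as 2 FLOPs, a plain MUL or ADD as 1), those figures work out to roughly:

```latex
\begin{aligned}
\text{M1-class core (four 128-bit NEON FMA pipes):} \quad & 4 \times 4 \times 2 = 32 \ \text{FLOPs/cycle} \\
\text{Zen 4 (two 256-bit MUL + two 256-bit ADD):} \quad & 2 \times 8 + 2 \times 8 = 32 \ \text{FLOPs/cycle} \\
\text{Intel client core (three 256-bit FMA):} \quad & 3 \times 8 \times 2 = 48 \ \text{FLOPs/cycle}
\end{aligned}
```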
Then why did Intel and AMD add it to their chips? My guess is AVX is used for when it is worth the effort. If I had to guess, it is used for scientific computing, maybe video processing (as in movies, TV, etc.), maybe audio processing, etc.
I think we could summarize this article as Jim Keller calling CUDA and x86 names. Name calling is not an argument, and whenever I see someone resort to name calling, it tells me they probably don't have better arguments.
In engineering and software development, we have to accept trade-offs and imperfection. Every decision has pluses and minuses. Instead of name calling, Jim Keller should explain what his approach is and why he thinks it is an improvement. He should also explain its limitations and weaknesses. Finally, he should also explain how he will handle the inevitable compromises and imperfections which arise in successful systems.
Hot take: x86 was a swamp, but over time we stopped using legacy features, the instructions got better, and the amount of chip resources needed to implement the CISC architecture decreased to a small percentage of the total, such that it sort of just turned into a weird compression algorithm for encoding RISC instructions and didn't matter much anymore.
It's a swamp and a moat. Lots of moats are swamps. Look at a lot of large enterprise software products. Awful to work with, sticky as hell, swamp as it gets. And very much a moat.
HIP is an AMD portability abstraction layer over the NVIDIA APIs that is pass-thru to cuDNN/cuBLAS/cuRAND on NVIDIA hardware, and pass-thru to AMD's equivalent APIs (that are basically drop-in replacements for NVIDIA's) on AMD hardware.
I don't think this is a deliberate strategy but rather the result of focusing resources in where there are currently the largest margins, which is in the big server chips.
It is somewhat short-sighted though, since having support on all their GPUs would surely do wonders for adoption via more organic channels.
Every successful platform is a swamp. Human lives are short, so we do things hastily; communication and coordination is hard; zero sum views of competition are widespread; we don't know how to do sustainable fast iteration of ecosystems (or even slow iteration); the swampiness helps concentrate wealth, which is even more highly rewarded.
Cities are a swamp. The Web is a swamp. Linux is a swamp. Capitalism is a swamp. Democracy is a swamp.
Well said. Reminds me of a comment I received from a manager years ago: "This component is great, I haven't heard of any bugs." My retort was: "That's because no one is using it." Fast forward 10 years: many bugs, CVEs, complaints about backward compatibility, challenges in porting/maintenance. Show me a technology without complaints, bugs, or technical debt and I suspect you'll find a similarly rarely used or unused one. Swampiness is, in a sense, an emergent property of success at scale.
I think Tenstorrent is based in Canada. At least Jim and his team are all from Canada, and their headquarters seems to be there. Probably more complicated than that but it isn't so clear if this would apply.