The ANE only supports fp16, int16, and int8, all of which are too low-precision to train with directly (too much numerical instability). A common approach is to train in fp32 so you can capture the small differences and gradients, and then, once the model is frozen, do inference in fp16 or bf16.
With mixed precision training you can do most operations in fp16 and only a few in fp32 where it's needed. This is the norm for NVIDIA GPU training nowadays. For instance, with fastai you add `.to_fp16()` after your learner call and it happens automatically.
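For reference, a minimal fastai sketch of what that looks like. The dataset and architecture here are just placeholders, and I'm assuming a recent fastai where the factory is called `vision_learner` (older versions call it `cnn_learner`):

    from fastai.vision.all import *

    # Tiny toy dataset just so the example is self-contained.
    path = untar_data(URLs.MNIST_TINY)
    dls = ImageDataLoaders.from_folder(path, train="train", valid="valid")

    # .to_fp16() wraps the learner with mixed precision training.
    learn = vision_learner(dls, resnet18, metrics=accuracy).to_fp16()
    learn.fine_tune(1)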
This article [0] from Nvidia gives a good overview of how mixed precision training works.
Super high level (from section 3):
1. Converting the model to use the float16 data type where possible.
2. Keeping float32 master weights to accumulate per-iteration weight updates.
3. Using loss scaling to preserve small gradient values.
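Those three steps map pretty directly onto PyTorch's built-in mixed precision tooling. A minimal sketch with synthetic data, assuming a CUDA device (this is my illustration, not code from the article):

    import torch

    model = torch.nn.Linear(512, 10).cuda()        # weights stay in fp32 (step 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    scaler = torch.cuda.amp.GradScaler()           # loss scaling (step 3)

    for _ in range(100):
        x = torch.randn(64, 512, device="cuda")
        target = torch.randint(0, 10, (64,), device="cuda")

        optimizer.zero_grad()
        with torch.cuda.amp.autocast():            # float16 where possible (step 1)
            loss = torch.nn.functional.cross_entropy(model(x), target)

        scaler.scale(loss).backward()              # backward on the scaled loss
        scaler.step(optimizer)                     # unscales grads, updates fp32 weights
        scaler.update()                            # adjusts the scale factor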
That /sounds/ right, but training still has a forward part, so OP does raise a really great question. And looking at the silicon, the neural engine is almost the size of the GPU. Really need someone educated in this area to chime in :)
You have to stash more information from the forward pass in order to calculate the gradients during backprop. You can't just naively use an inference accelerator as part of training - inference-only gets to discard intermediate activations immediately.
(Also, many inference accelerators use lower precision than you do when training)
There are tricks you can do to use inference to accelerate training, such as one we developed to focus on likely-poorly-performing examples: https://arxiv.org/abs/1910.00762
The neural engine is only exposed through a CoreML inference API.
You can't even poke the ANE hardware directly from a regular process. The interface for accessing the neural engine is not hardened (you can easily crash the machine from it).
So the matter is essentially moot in practice as you'd need your users to run with SIP off...
Sounds like you've done a bit of digging around; your efforts are appreciated. I found a GitHub of people sharing what they know, and here's a guy live-streaming hacking on it and building tinygrad: https://youtu.be/mwmke957ki4
Question about terminology (no background in AI). In econometrics, estimation is model fitting (training, I guess), and inference refers to hypothesis testing (e.g. t or F tests). What does inference mean here?
In machine learning (especially deep learning or neural networks), training is done using stochastic gradient descent. The gradients are computed using backpropagation. Backpropagation requires a backward pass through your model (typically many layers of neural weights) and thus requires you to keep a lot of intermediate values (called activations) in memory. However, if you are doing "inference", that is, if the goal is only to get the model's output and not to improve the model, then you don't have to do backpropagation and thus you don't need to store the intermediate values. As the number of layers and parameters in deep learning grows, this difference in computation between training and inference becomes significant. In most modern applications of ML, you train once but infer many times, so it makes sense to have specialized hardware that is optimized for inference at the cost of being unable to do training.
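A minimal PyTorch sketch of that difference, if it helps to see it in code:

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))
    x = torch.randn(32, 256)

    # Training-style forward pass: autograd records the graph and keeps the
    # intermediate activations alive so backward() can compute gradients.
    y = model(x)
    print(y.grad_fn is not None)   # True: graph (and activations) are retained
    y.sum().backward()             # backprop reuses the stored activations

    # Inference: no graph is built, intermediates are freed as soon as they're
    # consumed, which is all an inference-only accelerator needs to support.
    with torch.no_grad():
        y = model(x)
    print(y.grad_fn is None)       # True: nothing retained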
Just to add to this, the reason these inference accelerators have become big recently (see also the "neural core" in Pixel phones) is because they help doing inference tasks in real time (lower model latency) with better power usage than a GPU.
As a concrete example, on a camera you might want to run a face detector so the camera can automatically adjust its focus when it sees a human face. Or you might want a person detector that can find the outline of the person in the shot, so that you can blur or change their background in something like a Zoom call. All of these applications work better if you can run your model at, say, 60 Hz instead of 20 Hz. Optimizing hardware to do inference tasks like this as fast as possible with the least possible power usage is pretty different from optimizing for all the things a GPU needs to do, so you might end up with hardware that has both and uses them for different tasks.
it took me 20 years to learn this body of knowledge and now it can just sort of be summed up in a paragraph.
When I learned and used gradient descent, you had to analytically determine your own gradients (https://web.archive.org/web/20161028022707/https://genomics....). I went to grad school to learn how to determine my own gradients. Unfortunately, in my realm, loss landscapes have multiple minima, and gradient descent just gets trapped in local minima.
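For anyone curious what "determining your own gradients" looks like in practice, here's a small sketch comparing a hand-derived gradient with what autograd produces, on a toy least-squares loss with random data:

    import torch

    # f(w) = sum((x @ w - y)**2); the analytic gradient is 2 * x.T @ (x @ w - y).
    x = torch.randn(100, 5)
    y = torch.randn(100, 1)
    w = torch.randn(5, 1, requires_grad=True)

    # The old way: derive it by hand.
    analytic_grad = 2 * x.t() @ (x @ w.detach() - y)

    # The modern way: let autograd do the calculus.
    loss = ((x @ w - y) ** 2).sum()
    loss.backward()

    print(torch.allclose(analytic_grad, w.grad, atol=1e-5))  # True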
Huh. I talked to some experts and they told me NN loss functions are bowl-shaped and have single minima, but those minima take a very long time to navigate to in high dimensional spaces.
For higher feature counts the real concern is saddle points rather than minima, where the gradient is so small that you barely move at all each iteration and get "stuck".
To add here: for a local minimum to occur, the loss has to increase along every one of those dimensions (or features). That is highly unlikely for modern NNs, where you have millions of dimensions. If the loss goes down along even one dimension while going up along the rest, you have a saddle point instead. Since you can only descend along one (or a few) of the dimensions, progress there is slow.
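A toy illustration of the saddle-point slowdown (not a claim about real loss landscapes, just the textbook example f(x, y) = x^2 - y^2):

    import torch

    # At the origin the gradient of f(x, y) = x**2 - y**2 is zero, but it's not
    # a minimum: the loss still decreases along y. Started near the saddle,
    # plain gradient descent collapses x quickly but escapes along y very slowly.
    p = torch.tensor([0.5, 1e-4], requires_grad=True)
    opt = torch.optim.SGD([p], lr=0.01)

    for step in range(201):
        opt.zero_grad()
        f = p[0] ** 2 - p[1] ** 2
        f.backward()
        opt.step()
        if step % 50 == 0:
            print(step, p.detach().tolist())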
protein folding and structure prediction. Protein simulations typically define an energy function, similar to a loss function, over all the atoms in the protein. There are many terms: at least one per bonded atom pair, at least one per bonded atom triple, at least one per bonded atom quadruple, one per each non-bonded pair (although atoms that are distant can be excluded, sometimes making this a sparse matrix). If you start with a proposed model (say, random coordinates for all the atoms) and apply gradient descent, you'll end up with a mess. All those energy terms end up creating a high dimensional surface that is absurdly spiky in the details, and extremely wavy with many local minima at coarse grain.
Instead of using gradient descent, we used molecular dynamics (I'm unaware if this has a direct equivalent) to sample the space by moving along various isocontours (constant energy, or constant temperature, or usually constant pressure). Even so, you have to do a lot of sampling - in my day it was years of computer time, now it's months - to get a good approximation to the total landscape, and to measure transition frequencies between areas of the landscape separated by energy barriers (local maxima) smaller than the thermal energy available to the system.
It's complicated. Also, DeepMind obviated all my work by proving that sequence data (which is cheap to obtain) can be used to predict very accurate structures with little or no simulation.
Worth noting that inference in "traditional" statistics and ML/AI/DL isn't really that different at some level. In both cases you have an inverse problem; in one case the parameters are about a group or population (e.g., something about all cats in existence), and in another it is about an individual case (something about a particular cat).
This sounds really fascinating. Are there any resources that you'd recommend for someone who's starting out in learning all this? I'm a complete beginner when it comes to Machine Learning.
Deep Learning with Python (2nd ed), by Francois Chollet.
Even if you skip the programming parts, it has a lot of beginner/intermediate concepts clearly explained. If you do dive into the programming examples, you get to play around with a few architectures and ideas, and you're left ready to move on to the more advanced material knowing what you're doing.
It is confusing that the ML community have come to use "inference" to mean prediction, whereas statisticians have long used it to refer to training/fitting, or hypothesis testing.
I have background in both and it's very confusing to me. Inference in DL is running a trained model to predict/classify. Inference in stats and econometrics is totally different as you noted.
Inference here means "running" the model. So maybe it has a similar meaning as in econometrics?
Training is learning the weights (millions or billions of parameters) that control the model's behavior, vs inference is "running" the trained model on user data.
Since it's tangentially relevant, if you have an M1 Mac I've created some boilerplate for working with the latest Tensorflow with GPU acceleration as well: https://github.com/alexfromapex/tensorexperiments . I'm thinking of adding a branch for PyTorch now.
1.) Apple Silicon currently can't compete with Nvidia GPUs in terms of raw compute power, but they're already way ahead on energy efficiency. Training a small deep learning model on battery power on a laptop could actually be a thing now.
Edit: I've been informed that for matrix math, Apple Silicon isn't actually ahead in efficiency
2.) Apple Silicon will probably compete directly with Nvidia GPUs on raw compute power in future generations of products like the Mac Studio and Mac Pro, which is very exciting. Competition in this space is incredibly good for consumers.
3.) At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
And that's with the 3090 set at a very high 400W power limit, can get far more efficient when clocked lower.
(which is expected, notably because there are no dedicated matrix math accelerators on the GPU)
2) We'll see, hopefully Apple thinks that the market is worth bothering with... (which would be great)
3) Indeed, if what you need above everything else is a giant pool of VRAM at a relatively low price tag, Apple is quite an enticing option. If you can stand Metal for your use case, of course.
> but they're already way ahead on energy efficiency.
For raw compute like you need for ML training, the M1's efficiency doesn't matter. Under the hood, at the hardware level, power consumption maps pretty directly onto how much compute circuitry you activate, and you can't really get around that.
The general efficiency of the M1 comes from its architecture and how well it fits normal consumer use: less work in instruction decode, more efficient reordering, less energy wasted moving data around thanks to the shared memory architecture, etc.
And yet somehow Apple's GPU ALUs are more efficient, at 3.8 watts per TFLOP. Mind, I am not talking about specialized matrix multiplication units that have a different internal organization and can do things like matrix multiplication much more efficiently, but about basic general-purpose GPU ALUs.
The comparison of efficiency between Apple and Nvidia here is a bit misleading because it compares Apple's general-purpose ALUs to Nvidia's specialized ALUs. For a more direct efficiency comparison, one would need to compare the Tensor Cores against the AMX or ANE coprocessors.
As to how Apple achieves such high efficiency, nobody knows. Being on the 5nm node might help, but there must be something special about the ALU design as well. My speculation is that their ALUs are wider and much simpler than in other GPUs, which directly translates to efficiency wins.
How they do it at a conceptual level isn't a big secret: they don't need to minimize die area the way other companies do. For Apple, the die is just part of a chip that is part of the larger system they sell, so they can amortize its cost over the whole product. Nvidia doesn't have a system to do that with, so their natural inclination is to keep the die size as small as possible and just overclock the hell out of it. Right there is the 'trick': Apple can afford to do things that chew up die space that Nvidia and others can't while maintaining their profit margins. Being a process generation ahead is also a rather huge thing, and another cost they can amortize over large numbers of complete systems and mobile devices, which their competitors can't.
Also related: Apple designs their hardware to do just what they want it to while everyone else is designing for a more general use case. This also costs die area, IP licensing fees etc.
But how does that apply to GPU ALUs? Looking at M1 die shots, they are comparatively tiny, and compared to other vendors it doesn't seem like Apple is dedicating more logic space to the GPU. The M1 die is roughly 120mm2; an Nvidia Turing TU117 (GTX 1650) is roughly 200mm2. Both feature the same number of GPU ALUs (1024 32-bit units). And of course, M1's 5nm is around 5-6 times denser than Turing's 12nm, but the M1 is an entire SoC with all kinds of components — not to mention a huge cache — and the GPU takes maybe 20% of the die (let's say 1/3 if you also count the display controller and memory controllers). All in all, the amount of normalised die space dedicated to GPU ALUs seems comparable.
Of course, my perspective here might be extremely naive, I know very little about semiconductor technology, just trying to understand the principal design differences.
Also, I thought Apple is adding a large slice of cache. When you look at what 3D V-Cache does for Ryzen performance (+15%?), that has a large impact. And because they sell expensive stuff, they can afford to build expensive CPUs.
Cache doesn't matter for pure ALU efficiency though. I mean, I ran tests on long dependent chains of FMAs; the only memory touched there is a couple of internal registers.
Apple Silicon is not ahead at all on energy efficiency for desktop workloads. If they were ahead on energy efficiency, they would simply be ahead on performance: GPUs are massively parallel architectures, and they are generally limited by their transistor and power budgets (and memory, of course).
Apple is simply behind in the GPU space.
> At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
The reason why it's cheaper is that its memory is at a fraction (around 20-35%) of the memory bandwidth of a 128GB equivalent GPU set up, which also has to be split with the CPU. This is an unavoidable bottleneck of shared memory systems, and for a great many applications this is a terminal performance bottleneck.
That's the reason you don't have a GPU with 128GB of normal DDR5. It would just be quite limited. Perhaps for some cases it can be useful.
I'm not sure what you meant with the link, but the parent is right, so adding an explanation here: the M1 Ultra has about 400GB/s of theoretical bandwidth, but Anandtech shows that none of the SoC blocks can actually reach that, pretty far from it. It seems that Apple summed the bandwidth to all the blocks to get that number, which does mean something, but not that the GPU has access to all of it (the GPU memory controllers seem to be the bottleneck).
By contrast, a laptop 3080 does reach 400GB/s (I see this routinely on AI workloads), so that's part of the explanation for the subpar perf here, the other parts probably being matrix math and mixed precision.
Yes. And the M1 Ultra has even more memory bandwidth than the M1 Max. But a 128GB system made of 3 Nvidia A6000s has 3x 768GB/s of memory bandwidth, and a more common AI-grade setup has 2x 2TB/s, which simply dwarfs the M1 Ultra.
For researchers, sure, but it's still quite an apples-to-oranges comparison.
A6000 is ~$5k per card. I guess you're referring to something like an A100 on that other spec, which is $10k/card (for 40GB of memory).
I do a fair bit of neural/AI art experimentation, where memory on the execution side is sometimes a limiting factor for me. I'm not training models, I'm not a hardcore researcher--those folks will absolutely be using NVIDIA's high-end stuff or TPU pods.
128GB in a Studio is super compelling if it means I can up-res some of my pieces without needing to use high-memory-but-super-slow CPU cloud VMs, or hope I get lucky with an A100 on Colab (or just pay for a GPU VM).
I have a 128GB/Ultra Studio in my office now. It's a great piece of kit, and a big reason I splurged on it--okay, maybe "excuse"--was that I expect it'll be useful for a lot of my side project workloads over the next couple of years...
Hmm, that's interesting. What kind of inference workload requires more than the 48GB of memory you'd get from 2 3090s, for example? I'm genuinely curious because I haven't run across them and it sounds interesting.
Mostly it's old-school style transfer! Well, "old" in the sense that it's pre-CLIP. I've played with CLIP-guided stuff too, but I've been tinkering with a custom style transfer workflow for a few years. The pipeline here is fractal IFS images (Chaotica/JWildfire) -> misc processing -> style transfer -> photo editing, basically.
Only the workflow is the custom part--the core here is literally the original jcjohnson implementation. Occasionally I look around at recent work in the area, but most seems focused on fast (video-speed) inference or pre-baked style models. I've never seen something that retains artistic flexibility.
My original gut feeling on style transfer was that it would be possible to mold it into a neat tool, but most people bumped into it, ran their profile photo against Starry Night, said "cool" and bounced off. And I get that--parameter tuning can be a sloooow process. When I really explore a series with a particular style I start to feed it custom content images made just for how it's reacting with various inputs.
That's from a local server in my garage with a K80. At some point I had two K80s in there (so basically four K40s, given how they work), but dialed it back for power consumption reasons.
I do have a 3090 in the house, and a decent amount of cloud infra that I sometimes tap. The jcjohnson implementation is so far back that it doesn't even run against modern hardware. At some point I need to sort that out, or figure out how to wrangle a more modern implementation into behaving in the way that I like.
I don't really post these anywhere, although do throw them over the wall on Twitter if anyone is curious to see more. These are a mix of things, although the CLIP/Midjourney/etc stuff is pretty easy to spot: https://twitter.com/mwegner/media
Not sure about inference but for training, 128GB is big enough to fit a decent-sized dataset entirely into memory, which causes a massive speedup. It's also probably cheaper to get a 128GB Mac Studio than a dual-3090 rig unless you're willing to build the rig yourself and pay the bare minimum for every component except the GPUs themselves.
As for models that need 128GB of memory at inference time and that a consumer would be interested in, I got nothing, though it certainly seems like it would be fun to mess around with haha
I remain skeptical that Apple's best GPU silicon will match nvidia's premiere products (either the top-end desktop card, or a server monster) for training.
It seems like this is ideal as an accelerator for already trained models; one can imagine Photoshop utilizing it for deep-learning based infill-painting.
I was doing training on battery with a laptop that had a 1080; I have trained models on an airplane while totally unplugged and still had enough power to websurf afterwards.
The thing with the efficiency (which I'm not sure of) and the competition (probably possible) is that the current Nvidia lineup is pretty old and on an even older process. They have a big moat.
There's definitely competition, and it's going to be really interesting to watch Nvidia and Apple duke it out over the next few years:
- Apple undoubtedly owns the densest nodes, and will fight TSMC tooth-and-nail over first dibs on whatever silicon they have coming next.
- Apple's current GPU design philosophy relies on horizontally scaling the tech they already use, whereas Nvidia has been scaling vertically, albeit slowly.
- Nvidia has insane engineers. Despite the fact they're using silicon that's more than twice as large by-area when compared to Apple, they're still doubling their numbers across the board. And that's their last-gen tech too, the comparison once they're on 5nm later this summer is going to be insane.
I expect things to be very heated by the end of this year, with new Nvidia, Intel and potentially new Apple GPUs.
Nice results! But why are people still reporting benchmark results on VGG? Does anybody actually use this network anymore?
Better would be mobilenets or efficientNets or NFNets or vision transformers or almost anything that's come out in the 8 years since VGG was published (great work it was at the time!).
Why not? It's still good for simple classification tasks. We use it as an encoder for a segmentation model in some cases. Most ResNet variants are much heavier.
Those slow and inaccurate models at the bottom of the graph are the VGG models. A resnet34 is faster and more accurate than any VGG model. And there are better options now -- for example resnet34d is as fast as resnet34, and more accurate. And then convnext is dramatically better still.
> ResNet > VGG: ResNet-50 is faster than VGG-16 and more accurate than VGG-19 (7.02 vs 9.0); ResNet-101 is about the same speed as VGG-19 but much more accurate than VGG-16 (6.21 vs 9.0).
Probably because otherwise it would be impossible to compare with old results. If the community chooses a different model every year, how are you going to compare results year over year?
You need something constant, either the model or the hardware, otherwise you cannot have those relative numbers. And you usually want to have a trend.
See my other reply
It doesn't matter. Deep learning has been mainstream for only 10 years. MNIST is a dataset from 1998 and it is still being used in research papers.
The most important thing is to have a constant baseline, and ResNets are a baseline.
Think about changing the model every other year:
- 2015: ResNet trained on an Nvidia K80
- 2017: Inception trained on an Nvidia 1080 Ti
- 2019: Transformer trained on an Nvidia V100
- 2021: GPT-3 trained on a cluster
Now you have your new fancy algorithm X and an Nvidia 4090. How much better is your algorithm compared to the state of the art, and how much have you improved over the algorithms from 5 years ago? Now you are in a nightmare and you have to rerun all the past algorithms in order to compare. Or how fast is the new Nvidia card, which no one has yet, and for which Nvidia has decided to publish numbers based on their own model?
> But why are people still reporting benchmark results on VGG?
It makes me feel like I'm missing something! Is it still used as a backbone in the same way legacy code is everywhere, or is it something else entirely?
Do apple users really require the ability to train large ML models while mobile and without access to A/C power? Is this a real-world use case for the target market?
This is very interesting since the M1 studio supports 128GB of unified memory - training a large memory heavy model slowly on a single device could be interesting, or inferencing a very large model.
...specific use cases being the key phrase here. Unified memory is cool, but there are reasons we don't use it at scale:
- It needs extremely high-bandwidth controllers, which severely limits the amount of memory you can use (Intel Macs could be configured with an order of magnitude more RAM in their server chips)
- ECC is still off-the-table on M1 apparently
- Most workloads aren't really constrained by memory access in modern programs/kernels/compilers. Problems only show up when you want to run a GPU off the same memory, which is what these new Macs account for.
- Most of the so-called "specific workloads" that you're outlining aren't very general applications. So far I've only seen ARM outrun x86 in some low-precision physics demos, which is... fine, I guess? I still don't foresee meteorologists dropping their Intel rigs to buy a Mac Studio anytime soon.
> Most workloads aren't really constrained by memory access in modern programs/kernels/compilers. Problems only show up when you want to run a GPU off the same memory, which is what these new Macs account for.
For sure but I expect this is different for the apps Apple _wants_ to write. It’s easy to imagine the next version of Logic or whatever doing fine tuning everywhere.
What is there to fine-tune, in a program like Logic? I've often heard that word associated with using extended instruction sets and leveraging accelerators, but where would the M1 have "untapped power" so-to-speak? I don't think the "upgrade" from a CISC architecture to a RISC one can yield much opportunity for optimization, at least not besides what the compiler already does for you.
> - It needs extremely high-bandwidth controllers, which severely limits the amount of memory you can use (Intel Macs could be configured with an order of magnitude more ram in it's server chips)
In the first half of 2023, the NVIDIA Grace Superchip will ship with a 1TB memory config (930GB usable because of ECC bits) on a 1024-bit wide LPDDR5X-8533 setup (same width as the M1 Ultra, which uses LPDDR5-6400).
So it's going to become much less of an issue really soon.
> So it's going to become much less of an issue really soon.
The main issue would be trying to purchase one of those, which is likely going to be both very rare and orders of magnitude more expensive than a Mac Studio.
The Mac Studio isn't some crazy exotic hardware like datacenter class GPUs, but definitely has some exotic capabilities.
$ conda install pytorch torchvision torchaudio -c pytorch-nightly
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- torchaudio
And the pip install variant installs an old version of torchaudio that is broken
OSError: dlopen(/opt/homebrew/Caskroom/miniforge/base/envs/test123/lib/python3.10/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at14RecordFunctionC1ENS_11RecordScopeEb
worked for me. I think it's something with your brew installation.
fragmede@samairmac:~$ python
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:24:02)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__file__
'/Users/fragmede/projects/miniforge3/lib/python3.9/site-packages/torch/__init__.py'
>>>
Low. Apple doesn't have matrix math accelerators in their current GPUs.
The neural engine is small and inference only. It's also only exposed by a far higher level interface, CoreML.
Where it could still make sense is if you have a small VRAM pool on the dGPU and a big one on the M1, but with the price of a Mac, not sure that makes a lot of sense either in most scenarios compared to paying for a big dGPU.
have you actually benchmarked that? I think (someone please correct me if I'm way off here) the AMX instructions can hit ~2.8tflops (fp16) per co-processor and there are 2 on the 7-core M1. That's 5.6tflops vs the 4.6tflops the GPU can hit.
Yeah, that's within the M1 family, but compare against dGPUs and it doesn't even come close.
30Tflops for a 3080 for vector FP32, but 119Tflops FP16 dense with FP16 accumulate, 59.5 with FP32 accumulate, and if you exploit sparsity then that can go even higher.
I wrote a comment about a Tensorflow-on-M1 comparison to some cloud providers. I imagine PyTorch on M1 would give similar results. I think the gist is that the 3070 is going to be a better investment.
It’s surprising to see PyTorch developers working on things like that when common operations like group convolutions are still completely unoptimized on Nvidia GPUs, despite many requests.
Why wouldn’t you be able to run them in parallel using CUDA? You shouldn’t be memory-transfer speed limited when group convolution layers are a part of a bigger net.
Note that pointwise 1x1 convolutions are a special case of group convolutions and actually I think they might be specially optimized in PyTorch (I’d have to run some benchmarks to test it though).
pointwise isn’t a case of grouped conv, they’re orthogonal ideas.
You can fuse grouped convs (depthwise is a special case of grouped convs) into preceding or following layers. Maybe JAX can do this already? No clue if any library offers such an optimization out of the box
Sorry, yes, I was replying to the post about depthwise convolution and that’s what I meant (though the naming of it is poor) - i.e. the special case of group convolutions where the number of groups is equal to the number of channels.
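For anyone following along, a quick PyTorch sketch of the terms being thrown around in this subthread (standard vs grouped vs depthwise vs pointwise convolutions):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 32, 32)   # N, C, H, W

    # Standard conv: every output channel sees all 64 input channels.
    standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)

    # Grouped conv: channels split into 4 groups of 16, convolved independently.
    grouped = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=4)

    # Depthwise conv: the special case where groups == in_channels.
    depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)

    # Pointwise 1x1 conv: mixes channels at a single spatial position; depthwise
    # followed by pointwise is the usual "depthwise separable" building block.
    pointwise = nn.Conv2d(64, 128, kernel_size=1)

    for name, layer in [("standard", standard), ("grouped", grouped),
                        ("depthwise", depthwise), ("pointwise", pointwise)]:
        n_params = sum(p.numel() for p in layer.parameters())
        print(f"{name:9s} out={tuple(layer(x).shape)} params={n_params}")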
You must have installed it a while ago then :). I just did recently and only needed to install 2 packages via pip (I think tensorflow-macos and tensorflow-metal), which I found much better than wrangling with CUDA and cuDNN versions for Nvidia cards.
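Assuming those two packages are what's installed, a quick sanity check that the Apple GPU is actually picked up:

    import tensorflow as tf

    # Should list a GPU device if tensorflow-metal is installed and working.
    print(tf.config.list_physical_devices("GPU"))

    # A trivial op to confirm it actually executes on the Metal device.
    with tf.device("/GPU:0"):
        a = tf.random.normal((1000, 1000))
        b = tf.random.normal((1000, 1000))
        print(tf.reduce_sum(a @ b))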
yess! This is important for me, because I don't have any $$$ to rent GPUs for personal projects. Now we just need M1 support for JAX.
Since there are no hard benchmarks against other GPUs, here's a Geekbench against an RTX 3080 Mobile laptop I have [1]. Looks like it's about 2x slower--the RTX laptop absolutely rips for gaming, I love it.
This is currently targeting AMD GPUs and M1 GPUs, not the integrated Intel GPUs present in Intel machines. However, if you have a 16-inch Intel MBP, or a Mac Pro, etc., this should work with your AMD GPUs. That support isn't in the nightly packages yet (only Apple Silicon support so far), but the PyTorch team is saying that it will hopefully be available by the end of the week. If you just can't wait, you should be able to build from source to test it out right now.
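From user code the new backend is just another device string. A minimal sketch (assuming a nightly build with MPS support and macOS 12.3+):

    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    model = torch.nn.Linear(128, 10).to(device)
    x = torch.randn(64, 128, device=device)

    loss = model(x).sum()
    loss.backward()   # both forward and backward run on the Apple GPU

    print(device, model.weight.grad.shape)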
> Accelerated GPU training is enabled using Apple’s Metal Performance Shaders (MPS) as a backend for PyTorch.
What do shaders have to do with it? Deep learning is a mature field now, it shouldn't need to borrow compute architecture from the gaming/entertainment field. Anyone else find this disconcerting?
It’s not the greatest term even for graphics only.
People new to CG are likely to intuit “shaders” as something related to, well, shading, but vertex shaders et al have nothing to do with the color of a pixel or a polygon.
Wait until they learn a kernel has nothing to do with operating systems! And tensor operations have nothing to do with tensor objects! And texture memory often isn't even used for textures!
It's an unfortunate set of terminology due to the way this space evolved from graphics programming - shader cores used to do fixed-function shading! But then people wanted them to be able to run arbitrary shaders and not just fixed-function. And then hey, look at this neat processor, let's run a compute program on it. At first that was "compute shaders" running across graphics APIs, then came CUDA, and later OpenCL. But it is still running on the part of the hardware that provides shading to the graphics pipeline.
Similarly, texture memory actually used to be used for textures, now it is a general-purpose binding that coalesces any type of memory access that has 1D/2D/3D locality.
You kinda just get used to it. Lots of niches have their own lingo that takes some learning. Mathematics is incomprehensible without it, really.
That terminology isn't used at all in GPGPU compute APIs specifically tailored for that purpose, which use quite different programming models where you can mix host and device code in the same program.
And there are "GPUs" today that can't do graphics at all (AMD MI100/MI200 generations) or in a restricted way (Hopper GH100) which has the fixed function pipeline only on two TPCs, for compatibility, but running very slowly due to that.
There's absolutely a lot of "graphics" terminology that spills into GPGPU. For example, texture memory in CUDA :) The reality is that GPUs, even the ones that can't output video, are ultimately still using hardware that is largely rooted in gaming. Obviously the underlying architectures for these ML cards are moving away from that (increasingly using more die space for ML-related operations), but many of the core components like memory are still shared. It boils down to the fact that at the end of the day they're linear algebra processors.
I'd say that there has been quite some sharing between both back and forth. Evolutions in compute stacks shaped modern graphics APIs too.
Texture units are indeed a part that is useful enough to be exposed to GPGPU compute APIs directly. The "shader" term itself disappeared quite early in those though, as did access to a good part of the FF pipeline including the rasterisers themselves.