The ANE only supports fp16, int16, and int8, all of which are too low-precision to train with directly (too much numerical instability). A common approach is to train in fp32 so you can capture the small differences and gradients, and then, once the model is frozen, do inference in fp16 or bf16.
With mixed precision training you can do most operations in fp16 and only a few in fp32 where it's needed. This is the norm for NVIDIA GPU training nowadays. For instance, with fastai you add `.to_fp16()` after your learner call and it happens automatically.
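For reference, a minimal fastai sketch of what that looks like. The dataset and architecture here are just placeholders, and I'm assuming a recent fastai where the factory is called `vision_learner` (older versions call it `cnn_learner`):

    from fastai.vision.all import *

    # Tiny toy dataset just so the example is self-contained.
    path = untar_data(URLs.MNIST_TINY)
    dls = ImageDataLoaders.from_folder(path, train="train", valid="valid")

    # .to_fp16() wraps the learner with mixed precision training.
    learn = vision_learner(dls, resnet18, metrics=accuracy).to_fp16()
    learn.fine_tune(1)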
This article [0] from Nvidia gives a good overview of how mixed precision training works.
Super high level (from section 3):
1. Converting the model to use the float16 data type where possible.
2. Keeping float32 master weights to accumulate per-iteration weight updates.
3. Using loss scaling to preserve small gradient values.
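Those three steps map pretty directly onto PyTorch's built-in mixed precision tooling. A minimal sketch with synthetic data, assuming a CUDA device (this is my illustration, not code from the article):

    import torch

    model = torch.nn.Linear(512, 10).cuda()        # weights stay in fp32 (step 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    scaler = torch.cuda.amp.GradScaler()           # loss scaling (step 3)

    for _ in range(100):
        x = torch.randn(64, 512, device="cuda")
        target = torch.randint(0, 10, (64,), device="cuda")

        optimizer.zero_grad()
        with torch.cuda.amp.autocast():            # float16 where possible (step 1)
            loss = torch.nn.functional.cross_entropy(model(x), target)

        scaler.scale(loss).backward()              # backward on the scaled loss
        scaler.step(optimizer)                     # unscales grads, updates fp32 weights
        scaler.update()                            # adjusts the scale factor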
That /sounds/ right, but training still has a forward part, so OP does raise a really great question. And looking at the silicon, the neural engine is almost the size of the GPU. Really need someone educated in this area to chime in :)
You have to stash more information from the forward pass in order to calculate the gradients during backprop. You can't just naively use an inference accelerator as part of training - inference-only gets to discard intermediate activations immediately.
(Also, many inference accelerators use lower precision than you do when training)
There are tricks you can do to use inference to accelerate training, such as one we developed to focus on likely-poorly-performing examples: https://arxiv.org/abs/1910.00762
The neural engine is only exposed through a CoreML inference API.
You can't even poke the ANE hardware directly from a regular process. The interface for accessing the neural engine is not hardened (you can easily crash the machine from it).
So the matter is essentially moot in practice as you'd need your users to run with SIP off...
Sounds like you've done a bit of digging around; your efforts are appreciated. I found a GitHub of people sharing what they know, and here's a guy live-streaming hacking on it and building tinygrad: https://youtu.be/mwmke957ki4
Question about terminology (no background in AI). In econometrics, estimation is model fitting (training, I guess), and inference refers to hypothesis testing (e.g. t or F tests). What does inference mean here?
In machine learning (especially deep learning or neural networks), training is done using stochastic gradient descent. The gradients are computed using backpropagation. Backpropagation requires a backward pass through your model (typically many layers of neural weights) and thus requires you to keep a lot of intermediate values (called activations) in memory. However, if you are doing "inference", that is, if the goal is only to get the model's output and not to improve the model, then you don't have to do backpropagation and thus you don't need to store the intermediate values. As the number of layers and parameters in deep learning grows, this difference in computation between training and inference becomes significant. In most modern applications of ML, you train once but infer many times, so it makes sense to have specialized hardware that is optimized for inference at the cost of being unable to do training.
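A minimal PyTorch sketch of that difference, if it helps to see it in code:

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))
    x = torch.randn(32, 256)

    # Training-style forward pass: autograd records the graph and keeps the
    # intermediate activations alive so backward() can compute gradients.
    y = model(x)
    print(y.grad_fn is not None)   # True: graph (and activations) are retained
    y.sum().backward()             # backprop reuses the stored activations

    # Inference: no graph is built, intermediates are freed as soon as they're
    # consumed, which is all an inference-only accelerator needs to support.
    with torch.no_grad():
        y = model(x)
    print(y.grad_fn is None)       # True: nothing retained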
Just to add to this, the reason these inference accelerators have become big recently (see also the "neural core" in Pixel phones) is because they help doing inference tasks in real time (lower model latency) with better power usage than a GPU.
As a concrete example, on a camera you might want to run a face detector so the camera can automatically adjust its focus when it sees a human face. Or you might want a person detector that can find the outline of the person in the shot, so that you can blur or change their background in something like a Zoom call. All of these applications work better if you can run your model at, say, 60 Hz instead of 20 Hz. Optimizing hardware to do inference tasks like this as fast as possible with the least possible power usage is pretty different from optimizing for all the things a GPU needs to do, so you might end up with hardware that has both and uses them for different tasks.
it took me 20 years to learn this body of knowledge and now it can just sort of be summed up in a paragraph.
When I learned and used gradient descent, you had to analytically determine your own gradients (https://web.archive.org/web/20161028022707/https://genomics....). I went to grad school to learn how to determine my own gradients. Unfortunately, in my realm, loss landscapes have multiple minima, and gradient descent just gets trapped in local minima.
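For anyone curious what "determining your own gradients" looks like in practice, here's a small sketch comparing a hand-derived gradient with what autograd produces, on a toy least-squares loss with random data:

    import torch

    # f(w) = sum((x @ w - y)**2); the analytic gradient is 2 * x.T @ (x @ w - y).
    x = torch.randn(100, 5)
    y = torch.randn(100, 1)
    w = torch.randn(5, 1, requires_grad=True)

    # The old way: derive it by hand.
    analytic_grad = 2 * x.t() @ (x @ w.detach() - y)

    # The modern way: let autograd do the calculus.
    loss = ((x @ w - y) ** 2).sum()
    loss.backward()

    print(torch.allclose(analytic_grad, w.grad, atol=1e-5))  # True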
Huh. I talked to some experts and they told me NN loss functions are bowl-shaped and have single minima, but those minima take a very long time to navigate to in high dimensional spaces.
For higher feature counts the real concern is saddle points rather than minima, where the gradient is so small that you barely move at all each iteration and get "stuck".
To add here: for a local minimum to occur, the loss has to increase along every one of those dimensions (or features). That is highly unlikely for modern NNs, where you have millions of dimensions. If the loss goes down along even one dimension while going up along the rest, you have a saddle point instead. Since you can only descend along one (or a few) of the dimensions, progress there is slow.
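A toy illustration of the saddle-point slowdown (not a claim about real loss landscapes, just the textbook example f(x, y) = x^2 - y^2):

    import torch

    # At the origin the gradient of f(x, y) = x**2 - y**2 is zero, but it's not
    # a minimum: the loss still decreases along y. Started near the saddle,
    # plain gradient descent collapses x quickly but escapes along y very slowly.
    p = torch.tensor([0.5, 1e-4], requires_grad=True)
    opt = torch.optim.SGD([p], lr=0.01)

    for step in range(201):
        opt.zero_grad()
        f = p[0] ** 2 - p[1] ** 2
        f.backward()
        opt.step()
        if step % 50 == 0:
            print(step, p.detach().tolist())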
protein folding and structure prediction. Protein simulations typically define an energy function, similar to a loss function, over all the atoms in the protein. There are many terms: at least one per bonded atom pair, at least one per bonded atom triple, at least one per bonded atom quadruple, one per each non-bonded pair (although atoms that are distant can be excluded, sometimes making this a sparse matrix). If you start with a proposed model (say, random coordinates for all the atoms) and apply gradient descent, you'll end up with a mess. All those energy terms end up creating a high dimensional surface that is absurdly spiky in the details, and extremely wavy with many local minima at coarse grain.
Instead of using gradient descent, we used molecular dynamics (I'm unaware if this has a direct equivalent) to sample the space by moving along various isocontours (constant energy, or constant temperature, or usually constant pressure). Even so, you have to do a lot of sampling - in my day it was years of computer time, now it's months - to get a good approximation to the total landscape, and to measure transition frequencies between areas of the landscape separated by energy barriers (local maxima) smaller than the thermal energy available to the system.
It's complicated. Also, DeepMind obviated all my work by proving that sequence data (which is cheap to obtain) can be used to predict very accurate structures with little or no simulation.
Worth noting that inference in "traditional" statistics and ML/AI/DL isn't really that different at some level. In both cases you have an inverse problem; in one case the parameters are about a group or population (e.g., something about all cats in existence), and in another it is about an individual case (something about a particular cat).
This sounds really fascinating. Are there any resources that you'd recommend for someone who's starting out in learning all this? I'm a complete beginner when it comes to Machine Learning.
Deep Learning with Python (2nd ed), by Francois Chollet.
Even if you skip the programming parts, it has a lot of beginner/intermediate concepts clearly explained. If you do dive into the programming examples, you get to play around with a few architectures and ideas, and you're left ready to move on to the more advanced material knowing what you're doing.
It is confusing that the ML community have come to use "inference" to mean prediction, whereas statisticians have long used it to refer to training/fitting, or hypothesis testing.
I have background in both and it's very confusing to me. Inference in DL is running a trained model to predict/classify. Inference in stats and econometrics is totally different as you noted.
Inference here means "running" the model. So maybe it has a similar meaning as in econometrics?
Training is learning the weights (millions or billions of parameters) that control the model's behavior, vs inference is "running" the trained model on user data.
Since it's tangentially relevant, if you have an M1 Mac I've created some boilerplate for working with the latest Tensorflow with GPU acceleration as well: https://github.com/alexfromapex/tensorexperiments . I'm thinking of adding a branch for PyTorch now.
1.) Apple Silicon currently can't compete with Nvidia GPUs in terms of raw compute power, but they're already way ahead on energy efficiency. Training a small deep learning model on battery power on a laptop could actually be a thing now.
Edit: I've been informed that for matrix math, Apple Silicon isn't actually ahead in efficiency
2.) Apple Silicon will probably compete directly with Nvidia GPUs on raw compute power in future generations of products like the Mac Studio and Mac Pro, which is very exciting. Competition in this space is incredibly good for consumers.
3.) At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
And that's with the 3090 set at a very high 400W power limit, can get far more efficient when clocked lower.
(which is expected, notably because there are no dedicated matrix math accelerators on the GPU)
2) We'll see, hopefully Apple thinks that the market is worth bothering with... (which would be great)
3) Indeed, if what you need above everything else is a giant pool of VRAM at a relatively low price tag, Apple is quite an enticing option. If you can stand Metal for your use case, of course.
> but they're already way ahead on energy efficiency.
For raw compute like you need for ML training, the M1's efficiency doesn't matter. Under the hood, at the hardware level, power consumption maps pretty directly onto how much compute circuitry you activate, and you can't really get around that.
The general efficiency of the M1 comes from its architecture and how well it fits normal consumer use: less work in instruction decode, more efficient reordering, less energy wasted moving data around thanks to the shared memory architecture, etc.
And yet somehow Apple's GPU ALUs are more efficient, at 3.8 watts per TFLOP. Mind, I am not talking about specialized matrix multiplication units that have a different internal organization and can do things like matrix multiplication much more efficiently, but about basic general-purpose GPU ALUs.
The comparison of efficiency between Apple and Nvidia here is a bit misleading because it compares Apple's general-purpose ALUs to Nvidia's specialized ALUs. For a more direct efficiency comparison, one would need to compare the Tensor Cores against the AMX or ANE coprocessors.
As to how Apple achieves such high efficiency, nobody knows. Being on the 5nm node might help, but there must be something special about the ALU design as well. My speculation is that their ALUs are wider and much simpler than in other GPUs, which directly translates to efficiency wins.
How they do it at a conceptual level isn't a big secret: they don't need to minimize die area the way other companies do. For Apple, the die is just part of a chip that is part of the larger system they sell, so they can amortize its cost over the whole product. Nvidia doesn't have a system to do that with, so their natural inclination is to keep the die size as small as possible and just overclock the hell out of it. Right there is the 'trick': Apple can afford to do things that chew up die space that Nvidia and others can't while maintaining their profit margins. Being a process generation ahead is also a rather huge thing, and another cost they can amortize over large numbers of complete systems and mobile devices, which their competitors can't.
Also related: Apple designs their hardware to do just what they want it to while everyone else is designing for a more general use case. This also costs die area, IP licensing fees etc.
But how does that apply to GPU ALUs? Looking at M1 die shots, they are comparatively tiny, and compared to other vendors it doesn't seem like Apple is dedicating more logic space to the GPU. The M1 die is roughly 120mm2; an Nvidia Turing TU117 (GTX 1650) is roughly 200mm2. Both feature the same number of GPU ALUs (1024 32-bit units). And of course, M1's 5nm is around 5-6 times denser than Turing's 12nm, but the M1 is an entire SoC with all kinds of components — not to mention a huge cache — and the GPU takes maybe 20% of the die (let's say 1/3 if you also count the display controller and memory controllers). All in all, the amount of normalised die space dedicated to GPU ALUs seems comparable.
Of course, my perspective here might be extremely naive, I know very little about semiconductor technology, just trying to understand the principal design differences.
Also, I thought Apple is adding a large slice of cache. When you look at what 3D V-Cache does for Ryzen performance (+15%?), that has a large impact. And because they sell expensive stuff, they can afford to build expensive CPUs.
Cache doesn't matter for pure ALU efficiency though. I mean, I ran tests on long dependent chains of FMAs; the only memory touched there is a couple of internal registers.
Apple Silicon is not ahead at all on energy efficiency for desktop workloads. If they were ahead on energy efficiency, they would simply be ahead on performance: GPUs are massively parallel architectures, and they are generally limited by their transistor and power budgets (and memory, of course).
Apple is simply behind in the GPU space.
> At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
The reason why it's cheaper is that its memory is at a fraction (around 20-35%) of the memory bandwidth of a 128GB equivalent GPU set up, which also has to be split with the CPU. This is an unavoidable bottleneck of shared memory systems, and for a great many applications this is a terminal performance bottleneck.
That's the reason you don't have a GPU with 128GB of normal DDR5. It would just be quite limited. Perhaps for some cases it can be useful.
I'm not sure what you meant with the link, but the parent is right, so adding an explanation here: the M1 Ultra has about 400GB/s of theoretical bandwidth, but Anandtech shows that none of the SoC blocks can actually reach that, pretty far from it. It seems that Apple summed the bandwidth to all the blocks to get that number, which does mean something, but not that the GPU has access to all of it (the GPU memory controllers seem to be the bottleneck).
By contrast, a laptop 3080 does reach 400GB/s (I see this routinely on AI workloads), so that's part of the explanation for the subpar perf here, the other parts probably being matrix math and mixed precision.
Yes. And the M1 Ultra has even more memory bandwidth than the M1 Max. But a 128GB system made of 3 Nvidia A6000s has 3x 768GB/s of memory bandwidth, and a more common AI-grade setup has 2x 2TB/s, which simply dwarfs the M1 Ultra.
For researchers, sure, but it's still quite an apples-to-oranges comparison.
A6000 is ~$5k per card. I guess you're referring to something like an A100 on that other spec, which is $10k/card (for 40GB of memory).
I do a fair bit of neural/AI art experimentation, where memory on the execution side is sometimes a limiting factor for me. I'm not training models, I'm not a hardcore researcher--those folks will absolutely be using NVIDIA's high-end stuff or TPU pods.
128GB in a Studio is super compelling if it means I can up-res some of my pieces without needing to use high-memory-but-super-slow CPU cloud VMs, or hope I get lucky with an A100 on Colab (or just pay for a GPU VM).
I have a 128GB/Ultra Studio in my office now. It's a great piece of kit, and a big reason I splurged on it--okay, maybe "excuse"--was that I expect it'll be useful for a lot of my side project workloads over the next couple of years...
Hmm, that's interesting. What kind of inference workload requires more than the 48GB of memory you'd get from 2 3090s, for example? I'm genuinely curious because I haven't run across them and it sounds interesting.
Mostly it's old-school style transfer! Well, "old" in the sense that it's pre-CLIP. I've played with CLIP-guided stuff too, but I've been tinkering with a custom style transfer workflow for a few years. The pipeline here is fractal IFS images (Chaotica/JWildfire) -> misc processing -> style transfer -> photo editing, basically.
Only the workflow is the custom part--the core here is literally the original jcjohnson implementation. Occasionally I look around at recent work in the area, but most seems focused on fast (video-speed) inference or pre-baked style models. I've never seen something that retains artistic flexibility.
My original gut feeling on style transfer was that it would be possible to mold it into a neat tool, but most people bumped into it, ran their profile photo against Starry Night, said "cool" and bounced off. And I get that--parameter tuning can be a sloooow process. When I really explore a series with a particular style I start to feed it custom content images made just for how it's reacting with various inputs.
That's from a local server in my garage with a K80. At some point I had two K80s in there (so basically four K40s, given how they work), but dialed it back for power consumption reasons.
I do have a 3090 in the house, and a decent amount of cloud infra that I sometimes tap. The jcjohnson implementation is so far back that it doesn't even run against modern hardware. At some point I need to sort that out, or figure out how to wrangle a more modern implementation into behaving in the way that I like.
I don't really post these anywhere, although do throw them over the wall on Twitter if anyone is curious to see more. These are a mix of things, although the CLIP/Midjourney/etc stuff is pretty easy to spot: https://twitter.com/mwegner/media
Not sure about inference but for training, 128GB is big enough to fit a decent-sized dataset entirely into memory, which causes a massive speedup. It's also probably cheaper to get a 128GB Mac Studio than a dual-3090 rig unless you're willing to build the rig yourself and pay the bare minimum for every component except the GPUs themselves.
As for models that need 128GB of memory at inference time and that a consumer would be interested in, I got nothing, though it certainly seems like it would be fun to mess around with haha
I remain skeptical that Apple's best GPU silicon will match nvidia's premiere products (either the top-end desktop card, or a server monster) for training.
It seems like this is ideal as an accelerator for already trained models; one can imagine Photoshop utilizing it for deep-learning based infill-painting.
I was doing training on battery with a laptop that had a 1080; I have trained models on an airplane while totally unplugged and still had enough power to websurf afterwards.
The thing with the efficiency (which I'm not sure of) and the competition (probably possible) is that the current Nvidia lineup is pretty old and on an even older process. They have a big moat.
There's definitely competition, and it's going to be really interesting to watch Nvidia and Apple duke it out over the next few years:
- Apple undoubtedly owns the densest nodes, and will fight TSMC tooth-and-nail over first dibs on whatever silicon they have coming next.
- Apple's current GPU design philosophy relies on horizontally scaling the tech they already use, whereas Nvidia has been scaling vertically, albeit slowly.
- Nvidia has insane engineers. Despite the fact they're using silicon that's more than twice as large by-area when compared to Apple, they're still doubling their numbers across the board. And that's their last-gen tech too, the comparison once they're on 5nm later this summer is going to be insane.
I expect things to be very heated by the end of this year, with new Nvidia, Intel and potentially new Apple GPUs.
Nice results! But why are people still reporting benchmark results on VGG? Does anybody actually use this network anymore?
Better would be mobilenets or efficientNets or NFNets or vision transformers or almost anything that's come out in the 8 years since VGG was published (great work it was at the time!).
Why not? It's still good for simple classification tasks. We use it as an encoder for a segmentation model in some cases. Most ResNet variants are much heavier.
Those slow and inaccurate models at the bottom of the graph are the VGG models. A resnet34 is faster and more accurate than any VGG model. And there are better options now -- for example resnet34d is as fast as resnet34, and more accurate. And then convnext is dramatically better still.
> ResNet > VGG: ResNet-50 is faster than VGG-16 and more accurate than VGG-19 (7.02 vs 9.0); ResNet-101 is about the same speed as VGG-19 but much more accurate than VGG-16 (6.21 vs 9.0).
Probably because otherwise it would be impossible to compare with old results. If the community chooses a different model every year, how are you going to compare results year over year?
You need something constant, either the model or the hardware, otherwise you cannot have those relative numbers. And you usually want to have a trend.
See my other reply
It doesn't matter. Deep learning has been mainstream for only 10 years. MNIST is a dataset from 1998 and it is still being used in research papers.
The most important thing is to have a constant baseline, and ResNets are a baseline.
Think about changing the model every other year:
- 2015: ResNet trained on an Nvidia K80
- 2017: Inception trained on an Nvidia 1080 Ti
- 2019: Transformer trained on an Nvidia V100
- 2021: GPT-3 trained on a cluster
Now you have your new fancy algorithm X and an Nvidia 4090. How much better is your algorithm compared to the state of the art, and how much have you improved over the algorithms from 5 years ago? Now you are in a nightmare and you have to rerun all the past algorithms in order to compare. Or how fast is the new Nvidia card, which no one has yet, and for which Nvidia has decided to publish numbers based on their own model?
> But why are people still reporting benchmark results on VGG?
It makes me feel like I'm missing something! Is it still used as a backbone in the same way legacy code is everywhere, or is it something else entirely?
Do apple users really require the ability to train large ML models while mobile and without access to A/C power? Is this a real-world use case for the target market?
This is very interesting since the M1 studio supports 128GB of unified memory - training a large memory heavy model slowly on a single device could be interesting, or inferencing a very large model.
...specific use cases being the key phrase here. Unified memory is cool, but there are reasons we don't use it at scale:
- It needs extremely high-bandwidth controllers, which severely limits the amount of memory you can use (Intel Macs could be configured with an order of magnitude more RAM in their server chips)
- ECC is still off-the-table on M1 apparently
- Most workloads aren't really constrained by memory access in modern programs/kernels/compilers. Problems only show up when you want to run a GPU off the same memory, which is what these new Macs account for.
- Most of the so-called "specific workloads" that you're outlining aren't very general applications. So far I've only seen ARM outrun x86 in some low-precision physics demos, which is... fine, I guess? I still don't foresee meteorologists dropping their Intel rigs to buy a Mac Studio anytime soon.
> Most workloads aren't really constrained by memory access in modern programs/kernels/compilers. Problems only show up when you want to run a GPU off the same memory, which is what these new Macs account for.
For sure but I expect this is different for the apps Apple _wants_ to write. It’s easy to imagine the next version of Logic or whatever doing fine tuning everywhere.
What is there to fine-tune, in a program like Logic? I've often heard that word associated with using extended instruction sets and leveraging accelerators, but where would the M1 have "untapped power" so-to-speak? I don't think the "upgrade" from a CISC architecture to a RISC one can yield much opportunity for optimization, at least not besides what the compiler already does for you.
> - It needs extremely high-bandwidth controllers, which severely limits the amount of memory you can use (Intel Macs could be configured with an order of magnitude more ram in it's server chips)
In the first half of 2023, the NVIDIA Grace Superchip will ship with a 1TB memory config (930GB usable because of ECC bits) on a 1024-bit wide LPDDR5X-8533 setup (same width as the M1 Ultra, which uses LPDDR5-6400).
So it's going to become much less of an issue really soon.
> So it's going to become much less of an issue really soon.
The main issue would be trying to purchase one of those, which is likely going to be both very rare and orders of magnitude more expensive than a Mac Studio.
The Mac Studio isn't some crazy exotic hardware like datacenter class GPUs, but definitely has some exotic capabilities.
$ conda install pytorch torchvision torchaudio -c pytorch-nightly
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- torchaudio
And the pip install variant installs an old version of torchaudio that is broken
OSError: dlopen(/opt/homebrew/Caskroom/miniforge/base/envs/test123/lib/python3.10/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at14RecordFunctionC1ENS_11RecordScopeEb
worked for me. I think it's something with your brew installation.
fragmede@samairmac:~$ python
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:24:02)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__file__
'/Users/fragmede/projects/miniforge3/lib/python3.9/site-packages/torch/__init__.py'
>>>
Low. Apple doesn't have matrix math accelerators in their current GPUs.
The neural engine is small and inference only. It's also only exposed by a far higher level interface, CoreML.
Where it could still make sense is if you have a small VRAM pool on the dGPU and a big one on the M1, but with the price of a Mac, not sure that makes a lot of sense either in most scenarios compared to paying for a big dGPU.
have you actually benchmarked that? I think (someone please correct me if I'm way off here) the AMX instructions can hit ~2.8tflops (fp16) per co-processor and there are 2 on the 7-core M1. That's 5.6tflops vs the 4.6tflops the GPU can hit.
Yeah, that's within the M1 family, but compare against dGPUs and it doesn't even come close.
30Tflops for a 3080 for vector FP32, but 119Tflops FP16 dense with FP16 accumulate, 59.5 with FP32 accumulate, and if you exploit sparsity then that can go even higher.
I wrote a comment about a Tensorflow-on-M1 comparison to some cloud providers. I imagine PyTorch on M1 would give similar results. I think the gist is that the 3070 is going to be a better investment.
It’s surprising to see PyTorch developers working on things like that when common operations like group convolutions are still completely unoptimized on Nvidia GPUs, despite many requests.
Why wouldn’t you be able to run them in parallel using CUDA? You shouldn’t be memory-transfer speed limited when group convolution layers are a part of a bigger net.
Note that pointwise 1x1 convolutions are a special case of group convolutions and actually I think they might be specially optimized in PyTorch (I’d have to run some benchmarks to test it though).
pointwise isn’t a case of grouped conv, they’re orthogonal ideas.
You can fuse grouped convs (depthwise is a special case of grouped convs) into preceding or following layers. Maybe JAX can do this already? No clue if any library offers such an optimization out of the box
Sorry, yes, I was replying to the post about depthwise convolution and that’s what I meant (though the naming of it is poor) - i.e. the special case of group convolutions where the number of groups is equal to the number of channels.
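For anyone following along, a quick PyTorch sketch of the terms being thrown around in this subthread (standard vs grouped vs depthwise vs pointwise convolutions):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 64, 32, 32)   # N, C, H, W

    # Standard conv: every output channel sees all 64 input channels.
    standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)

    # Grouped conv: channels split into 4 groups of 16, convolved independently.
    grouped = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=4)

    # Depthwise conv: the special case where groups == in_channels.
    depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)

    # Pointwise 1x1 conv: mixes channels at a single spatial position; depthwise
    # followed by pointwise is the usual "depthwise separable" building block.
    pointwise = nn.Conv2d(64, 128, kernel_size=1)

    for name, layer in [("standard", standard), ("grouped", grouped),
                        ("depthwise", depthwise), ("pointwise", pointwise)]:
        n_params = sum(p.numel() for p in layer.parameters())
        print(f"{name:9s} out={tuple(layer(x).shape)} params={n_params}")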
You must have installed it a while ago then :). I just did recently and only needed to install 2 packages via pip (I think tensorflow-macos and tensorflow-metal), which I found much better than wrangling with CUDA and cuDNN versions for Nvidia cards.
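Assuming those two packages are what's installed, a quick sanity check that the Apple GPU is actually picked up:

    import tensorflow as tf

    # Should list a GPU device if tensorflow-metal is installed and working.
    print(tf.config.list_physical_devices("GPU"))

    # A trivial op to confirm it actually executes on the Metal device.
    with tf.device("/GPU:0"):
        a = tf.random.normal((1000, 1000))
        b = tf.random.normal((1000, 1000))
        print(tf.reduce_sum(a @ b))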
yess! This is important for me, because I don't have any $$$ to rent GPUs for personal projects. Now we just need M1 support for JAX.
Since there are no hard benchmarks against other GPUs, here's a Geekbench against an RTX 3080 Mobile laptop I have [1]. Looks like it's about 2x slower--the RTX laptop absolutely rips for gaming, I love it.
This is currently targeting AMD GPUs and M1 GPUs, not the integrated Intel GPUs present in Intel machines. However, if you have a 16-inch Intel MBP, or a Mac Pro, etc., this should work with your AMD GPUs. That support isn't in the nightly packages yet (only Apple Silicon support so far), but the PyTorch team is saying that it will hopefully be available by the end of the week. If you just can't wait, you should be able to build from source to test it out right now.
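From user code the new backend is just another device string. A minimal sketch (assuming a nightly build with MPS support and macOS 12.3+):

    import torch

    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    model = torch.nn.Linear(128, 10).to(device)
    x = torch.randn(64, 128, device=device)

    loss = model(x).sum()
    loss.backward()   # both forward and backward run on the Apple GPU

    print(device, model.weight.grad.shape)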
> Accelerated GPU training is enabled using Apple’s Metal Performance Shaders (MPS) as a backend for PyTorch.
What do shaders have to do with it? Deep learning is a mature field now, it shouldn't need to borrow compute architecture from the gaming/entertainment field. Anyone else find this disconcerting?
It’s not the greatest term even for graphics only.
People new to CG are likely to intuit “shaders” as something related to, well, shading, but vertex shaders et al have nothing to do with the color of a pixel or a polygon.
Wait until they learn a kernel has nothing to do with operating systems! And tensor operations have nothing to do with tensor objects! And texture memory often isn't even used for textures!
It's an unfortunate set of terminology due to the way this space evolved from graphics programming - shader cores used to do fixed-function shading! But then people wanted them to be able to run arbitrary shaders and not just fixed-function. And then hey, look at this neat processor, let's run a compute program on it. At first that was "compute shaders" running across graphics APIs, then came CUDA, and later OpenCL. But it is still running on the part of the hardware that provides shading to the graphics pipeline.
Similarly, texture memory actually used to be used for textures, now it is a general-purpose binding that coalesces any type of memory access that has 1D/2D/3D locality.
You kinda just get used to it. Lots of niches have their own lingo that takes some learning. Mathematics is incomprehensible without it, really.
That terminology isn't used at all in GPGPU compute APIs specifically tailored for that purpose, which use quite different programming models where you can mix host and device code in the same program.
And there are "GPUs" today that can't do graphics at all (AMD MI100/MI200 generations) or in a restricted way (Hopper GH100) which has the fixed function pipeline only on two TPCs, for compatibility, but running very slowly due to that.
There's absolutely a lot of "graphics" terminology that spills into GPGPU. For example, texture memory in CUDA :) The reality is that GPUs, even the ones that can't output video, are ultimately still using hardware that is largely rooted in gaming. Obviously the underlying architectures for these ML cards are moving away from that (increasingly using more die space for ML-related operations), but many of the core components like memory are still shared. It boils down to the fact that at the end of the day they're linear algebra processors.
I'd say that there has been quite some sharing between both back and forth. Evolutions in compute stacks shaped modern graphics APIs too.
Texture units are indeed a part that is useful enough to be exposed to GPGPU compute APIs directly. The "shader" term itself disappeared quite early in those though, as did access to a good part of the FF pipeline including the rasterisers themselves.