Hacker News

An NVIDIA RTX 4090 generates 73 TFLOPS. This iPad gives you nearly half that. The memory bandwidth of 120 GBps is roughly 1/10th of the NVIDIA hardware, but who’s counting!



The 4090 costs ~$1800 and doesn't have dual OLED screens, doesn't have a battery, doesn't weigh less than a pound, and doesn't actually do anything unless it is plugged into a larger motherboard, either.


From Geekbench: https://browser.geekbench.com/opencl-benchmarks

Apple M3: 29685

RTX 4090: 320220

When you line it up like that it's kinda surprising the 4090 is just $1800. They could sell it for $5,000 a pop and it would still be better value than the highest end Apple Silicon.
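As a back-of-envelope sketch of that compute-per-dollar claim (the $1,800 4090 price and Geekbench scores are from this thread; the ~$1,600 price for an M3 system is my own assumption, not something stated above):

```python
# Rough compute-per-dollar comparison using the Geekbench OpenCL
# scores quoted in the thread. The $1,800 4090 price is from the
# thread; the ~$1,600 M3 system price is an assumption for scale.
rtx4090_score, rtx4090_price = 320220, 1800
m3_score, m3_price = 29685, 1600  # assumed M3 system price

def points_per_dollar(score, price):
    """Geekbench points bought per dollar spent."""
    return score / price

rtx_value = points_per_dollar(rtx4090_score, rtx4090_price)  # ~178
m3_value = points_per_dollar(m3_score, m3_price)             # ~19

# Even at a hypothetical $5,000, the 4090 still wins on this metric.
assert points_per_dollar(rtx4090_score, 5000) > m3_value
```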


Comparing these directly like this is problematic.

The 4090 is highly specialized and not usable for general purpose computing.

Whether or not it's a better value than Apple Silicon will highly depend on what you intend to do with it. Especially if your goal is to have a device you can put in your backpack.


I'm not the one making the comparison, I'm just providing the compute numbers to the people who did. Decide for yourself what that means; the only conclusion I drew was about compute-per-dollar.


A bit off-topic since it's not applicable to the iPad:

Adding the M3 Max as well: 86072

I wonder what the results would be if the test were run on Asahi Linux some day. Apple's implementation is fairly unoptimized AFAIK.


That's for OpenCL, Apple gets higher scores through Metal.


And Nvidia annihilates those scores with cuBLAS. I'm going to play nice and post the OpenCL scores, since both sides get a fair opportunity to optimize for it.


Actually, I'd like to see Nvidia's highest Geekbench scores. Feel free to link them.

It's stupid to look at OpenCL when that's not what's used in practice.


This is true, but... the RTX 4090 has only 24GB of RAM, while Apple Silicon can be configured with up to 192GB of unified memory... A game changer for the largest/best models...


CUDA features unified memory, which is limited only by the bandwidth of your PCIe connection: https://developer.nvidia.com/blog/unified-memory-cuda-beginn...

People have been tiling 24GB+ models across a single (or several) 3090s/4090s for a while now.
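A rough sketch of what spilling over PCIe costs, for context (the nominal peak bandwidths are public figures: PCIe 4.0 x16 is about 32 GB/s, the 4090's GDDR6X about 1008 GB/s; the 48 GB model size is a made-up example):

```python
# Back-of-envelope cost of streaming spilled weights over PCIe with
# unified/managed memory, versus reading them from VRAM.
# Nominal peaks: PCIe 4.0 x16 ~= 32 GB/s, RTX 4090 GDDR6X ~= 1008 GB/s.
PCIE_GBPS = 32.0
VRAM_GBPS = 1008.0

def seconds_to_stream(model_gb, bandwidth_gbps):
    """Time to move the whole weight set once at a given bandwidth."""
    return model_gb / bandwidth_gbps

# A hypothetical 48 GB model that doesn't fit in 24 GB of VRAM:
spill_time = seconds_to_stream(48, PCIE_GBPS)  # ~1.5 s per full pass
vram_time = seconds_to_stream(48, VRAM_GBPS)   # ~0.05 s per full pass
```

It works, but each full pass over the spilled weights is ~30x slower than VRAM, which is why people tile carefully rather than naively paging.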


Shhh, don't correct the believers, they might learn something.


I think it would be simpler to compare cost/transistor.


And yet it’s worth it for deep learning. I’d like to see a benchmark training Resnet on an iPad.


TOPS != TFLOPS

The RTX 4090 does 1,321 Tensor TOPS according to the spec sheet, so roughly 35x.

The RTX 4090 is 191 Tensor TFLOPS vs. the M2's 5.6 TFLOPS (M3 specs are hard to find).

RTX 4090 is also 1.5 years old.
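Sanity-checking the "roughly 35x" figure (the 1,321 TOPS number is from the comment above; the 38 TOPS figure for the M4 neural engine is Apple's marketing number, which I'm assuming is the denominator here):

```python
# Sanity check on the "roughly 35x" claim. 1,321 TOPS is the 4090
# spec-sheet figure quoted above; 38 TOPS is Apple's quoted M4
# neural engine figure (an assumption about what's being compared).
rtx4090_tops = 1321
m4_tops = 38

ratio = rtx4090_tops / m4_tops  # ~34.8, i.e. "roughly 35x"
```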


Yeah, where are the bfloat16 numbers for the neural engine? For AMD you can at least divide by four to get the real number. 16 TOPS -> 4 TFLOPS within a mobile power envelope is pretty good for assisting CPU-only inference on device. Not so good if you want to run an inference server, but that wasn't the goal in the first place.

What irritates me the most, though, is people comparing a mobile accelerator with an extreme high-end desktop GPU. Some models only run on a dual-GPU stack of those. Smaller GPUs are not worth the money. NPUs are primarily eating the lunch of low-end GPUs.
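The divide-by-four heuristic from the comment above, spelled out (it assumes the quoted TOPS number is INT8, so the usable bfloat16 throughput is about a quarter of it):

```python
# Rule of thumb from the comment above: a quoted INT8 TOPS figure is
# roughly 4x the usable bfloat16 TFLOPS, so divide by four.
def int8_tops_to_bf16_tflops(tops):
    """Rough conversion per the divide-by-four heuristic."""
    return tops / 4

assert int8_tops_to_bf16_tflops(16) == 4  # 16 TOPS -> 4 TFLOPS
```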


> The memory bandwidth of 120 GBps is roughly 1/10th of the NVIDIA hardware, but who’s counting

Memory bandwidth is literally the main bottleneck for the kinds of applications GPUs are used for, so everyone is counting.


It would also blow through the iPad’s battery in 4 minutes flat


This comment needs to be downvoted more. TFLOPS is not TOPS; this comparison is meaningless. The 4090 has about 40x the TOPS of the M4.


Many thanks for the encouraging comments.



