YOLOv5 on CPUs: Sparsifying to Achieve GPU-Level Performance (neuralmagic.com)
121 points by T-A on Sept 10, 2021 | 53 comments



Even without sparsification, AVX-512 CPUs are far more cost-effective for inference.

To get your money's worth, you must saturate the processor. It's easy, in practice, with an AVX-512 CPU (e.g., see https://NN-512.com or Fabrice Bellard's https://bellard.org/libnc), and almost impossible with a GPU.

A GPU cloud instance costs $1000+ per month (vs. $10 per month for an AVX-512 CPU). A bargain GPU instance (e.g., Linode) costs $1.50 per hour (and far more on AWS) but an AVX-512 CPU costs maybe $0.02 per hour.

That said, GPUs are essential for training.

Note that Neural Magic's engine is completely closed-source, and their Python library communicates with the Neural Magic servers. If you use their engine, your project survives only as long as Neural Magic's business survives.


Usual Disclosure: I used to work on Google Cloud.

> A GPU cloud instance costs $1000+ per month (vs. $10 per month for an AVX-512 CPU). A bargain GPU instance (e.g., Linode) costs $1.50 per hour (and far more on AWS) but an AVX-512 CPU costs maybe $0.02 per hour.

This is a little confused. A T4 on GCP is $0.35/hr at on-demand rates [1]. A single thread of a Skylake+ CPU is in the 2-3c/hr range [2] (so $15-20/month without any committed or sustained use discounts; your $10 is close enough).

So, in price terms, the T4 GPU itself is roughly equivalent to ~10 CPU threads. Both of these are before adding memory, storage, etc., but the T4 is a great inference part and hard to beat.

Comparing a single thread of a CPU to training-optimized GPU parts (like the A100 or V100) is sort of apples and oranges.

[1] https://cloud.google.com/compute/gpus-pricing

[2] https://cloud.google.com/compute/all-pricing


$10 per month (i.e., $0.01389 per hour) Skylake cores: https://www.vultr.com/products/cloud-compute/

If you're not holding it for a full month, it's $0.015 per hour.

Here are the $1000 per month ($1.50 per hour) cloud GPUs I use, the cheapest that Linode provides: https://www.linode.com/pricing/

I would like to see a comparison of Linode vs Google Cloud, taking into account network bandwidth costs, etc. Maybe the Google Cloud T4's total cost of ownership is lower? I doubt it: for example, the Linode plan includes 16 terabytes of fast network traffic at no additional cost, while Google Compute charges $0.085 per gigabyte (at that rate, 16 TB of egress would cost roughly $1,360).

And the T4 is still 23x more expensive than an AVX-512 core, hourly ($0.35 vs. $0.015 per hour).


boulos is 100% right that you are choosing to compare to particularly expensive GPUs instead of cheaper GPU instances that are actually intended for inference at low cost.

EDIT: please stop it with the ninja edits.


> Note that Neural Magic's engine is completely closed-source, and their Python library communicates with the Neural Magic servers. If you use their engine, your project survives only as long as Neural Magic's business survives.

Could be totally off here, but I have a feeling this team is going to get scooped up by one of the big cos. I'm having deja vu of XNOR.ai (purchased by Apple), when partners got the short end of the stick (see the Wyze cam saga: https://www.theverge.com/2019/11/27/20985527/wyze-person-det...)


Disclosure: I work for Neural Magic.

Hi 37ef_ced3, AVX-512 has certainly helped close the gap for CPUs vs GPUs. Even then, though, most inference engines on CPUs are still very compute bound for most networks. Unstructured sparsity enables us to cut the compute requirements for CPUs, leaving most layers memory bound. That, in turn, allows us to focus on the unique cache architectures within the CPUs to remove the memory bottlenecks through proprietary techniques. The combination is what's truly unique to Neural Magic and enables the DeepSparse engine to compete with GPUs while leveraging the flexibility and deployability of highly available CPU servers.
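
For anyone wondering what unstructured sparsity looks like in practice, here is a minimal sketch of one-shot magnitude pruning in PyTorch (a generic illustration with a stand-in torchvision model, not Neural Magic's SparseML recipes, which prune gradually during training):

    import torch
    import torchvision
    from torch.nn.utils import prune

    # stand-in convolutional model; the idea is the same for YOLOv5
    model = torchvision.models.resnet18()

    for module in model.modules():
        if isinstance(module, torch.nn.Conv2d):
            # zero the 80% of weights with the smallest magnitude (unstructured sparsity)
            prune.l1_unstructured(module, name="weight", amount=0.8)
            prune.remove(module, "weight")  # bake the zeros into the weight tensor

    total = sum(p.numel() for p in model.parameters())
    zeros = sum((p == 0).sum().item() for p in model.parameters())
    print(f"overall sparsity: {zeros / total:.1%}")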

Also, as boulos said, the numbers quoted here are a bit lopsided. There are more affordable GPUs, and GCP has enabled some very cost-effective T4 instances. Our performance blog on YOLOv3 walks through a direct comparison to T4s and V100s on GCP for performance and cost: https://neuralmagic.com/blog/benchmark-yolov3-on-cpus-with-d...

Additionally, I'd like to clarify that our open-sourced products do not communicate with any of our servers currently. These are stand-alone products that do not require Neural Magic to be in the loop for use.


GCP lists a T4, which is suitable for inference, at between $0.11/hour and $0.35/hour (depending on commitment duration and preemptibility).

https://cloud.google.com/compute/gpus-pricing


Agreed - I priced this out for a specific distributed inference task a few months ago and the T4 was cheaper and faster than the CPU.

On 26 million images using a PyTorch model that had 41 million parameters, T4 instances were about 32% cheaper than CPU instances, and took about 45% of the time even after accounting for extra GPU startup time.



Yes, I’m using “T4” as a shorthand for “instances otherwise matched to the CPU-based instances but which also have a T4 GPU.”


OP was most likely referring to the AWS EC2 instance type T4, which runs on Amazon Graviton processors IIRC.


> GCP lists a T4

GCP not AWS, or are you talking about a different OP?


Many businesses/services can't saturate the hardware you describe. It's just too much compute power. With CPUs you can scale down to fit your actual needs: all the way down to a single AVX-512 core doing maybe 24 inferences per second (costing a few dollars PER MONTH).

Also, your cost/inference results will change if you use a fast CPU inference engine, instead of something slow like PyTorch (which you appear to be using).


Thanks - this is something I wasn’t familiar with. Do you have any pointers for CPU inference engines that you’ve had good experience with or that I can look into further?


Disclosure: I work for Neural Magic.

Hi carbocation, we'd love to see what you think of the performance using the DeepSparse engine for CPU inference: https://github.com/neuralmagic/deepsparse

Take a look through our getting started pages that walk through performance benchmarking, training, and deployment for our featured models: https://sparsezoo.neuralmagic.com/getting-started
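
If it helps, running an ONNX model through DeepSparse is only a few lines. A rough sketch based on the repo's README at the time (the exact API may have changed since, and the ONNX path is a placeholder):

    import numpy as np
    from deepsparse import compile_model

    onnx_path = "yolov5s.onnx"   # placeholder: a local ONNX export or a SparseZoo download
    batch_size = 1

    engine = compile_model(onnx_path, batch_size=batch_size)
    inputs = [np.random.rand(batch_size, 3, 640, 640).astype(np.float32)]
    outputs = engine.run(inputs)
    print([o.shape for o in outputs])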


Well, AVX-512 is kind of the survivor of the whole Larrabee failure, so it is no wonder that it has similar capabilities.


I notice that you failed to mention that you are the maintainer of NN-512.


“That said, GPUs are essential for training”

http://learningsys.org/neurips19/assets/papers/18_CameraRead...


Read the parent's claim again (emphasis mine):

> Even without sparsification, AVX-512 CPUs are far more cost-effective for *inference*.


I believe that's also why there are so many options for turning fully trained TensorFlow graphs into C++ code.

You use the expensive GPU for building the AI, then dumb it down for mass-deployment on cheap CPUs.


I'm pretty sure this isn't using the Tensor cores on the GPU.

If you look here (https://github.com/ultralytics/yolov5/blob/master/README.md), the speed of inference on a V100 for YOLOv5s should be 2 ms per image, or 500 img/s, not the 44.6 img/s being reported here.

This is important as it is more than an order of magnitude off.


My guess is they are using tensor cores as they report FP16 throughput, but they seem to be measuring at batch size 1, which is hugely unfair to the GPUs.

For inference workloads you usually batch incoming requests together and run once on GPU (though this increases latency). A latency/throughput tradeoff curve at different batch sizes would tell the whole story.
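
For what it's worth, a rough way to sketch that curve with plain PyTorch (illustrative only; the model below is a stand-in, not the YOLOv5 benchmark code):

    import time
    import torch
    import torchvision

    # stand-in FP16 model on the GPU; substitute the real detector here
    model = torchvision.models.resnet50().half().cuda().eval()

    for bs in (1, 2, 4, 8, 16, 32):
        x = torch.randn(bs, 3, 224, 224, device="cuda", dtype=torch.half)
        with torch.no_grad():
            for _ in range(10):                      # warmup
                model(x)
            torch.cuda.synchronize()
            start = time.time()
            for _ in range(50):
                model(x)
            torch.cuda.synchronize()
            latency = (time.time() - start) / 50
        print(f"batch {bs}: {latency * 1e3:.1f} ms/batch, {bs / latency:.0f} img/s")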

Also, they are using INT8 on CPU and neglect to measure the same on GPU. All the GPU throughputs would roughly double.

tl;dr just use GPUs

Edit: to the comments below, I agree low-latency can be important to some workloads, but that's exactly why I think we need to see a latency-throughput tradeoff curve.

Unfortunately, I'm pretty sure that modern GPUs (A10/A30/A40/A100) basically dominate CPUs even when latency is constrained, and the MLPerf results give a good (fair!) comparison of this:

https://developer.nvidia.com/blog/extending-nvidia-performan...

The GPU throughputs are much, much higher than the CPU ones, and I don't think even NM's software can overcome this gap. Not to mention they degrade the model quality...

The last question is whether CPUs are more cost-effective despite being slower, and the answer is still... no. The instances used in this blog post cost:

- C5 CPU (c5.12xlarge): $2.04/hr

- T4 GPU (g4dn.2xlarge): $0.752/hr

NM's best result @ batch-size-1 costs more, at lower throughput, at lower model quality, at ~same latency, than a last-gen GPU operating at half its capacity. A new A10 GPU using INT8 will widen the perf/$ gap by another ~4x.

Also, full disclosure: I don't work at NVIDIA or anything like that, so I'm not trying to shill :) I just like ML hardware a lot and want to help people make fair comparisons.


Disclosure: I work for Neural Magic.

Hi ml_hardware, we report results for both throughput and latency in the blog. As you noted, the throughput performance for GPUs does beat out our implementations by a bit, but we did improve the throughput performance on CPUs by over 10x. Our goal is to enable better flexibility and performance for deployments through more commonly available CPU servers.

For throughput costs, this flexibility becomes essential. The user could scale down to even one core if they wanted to, with a much more significant improvement in cost performance. We walk through these comparisons in more depth in our YOLOv3 blog: https://neuralmagic.com/blog/benchmark-yolov3-on-cpus-with-d...

INT8 wasn't run on GPUs because we hit issues with operator support in the conversion from PyTorch graphs to TensorRT (PyTorch currently doesn't have support for INT8 on GPU). We are actively working on this, though, so stay tuned as we run those comparisons!

The models we're shipping will see performance gains on the A100s as well, due to their support for semi-structured sparsity. Note, though, that A100s are priced higher than the commonly available V100s and T4s, which will need to be considered. We generally keep our current benchmarks limited to what is available in the top cloud services to represent what is deployable for most people on servers. This usability is why we don't consider MLPerf a great source for most users. MLPerf has done a great job in standardizing benchmarking and improving numbers across the industry. Still, the systems submitted are hyper-engineered for MLPerf numbers, and most customers cannot realize these numbers due to the cost involved.

Finally, note that the post-processing for these networks is currently limited to CPUs due to operator support. This limitation will become a bottleneck for most deployments (it already is, both for GPUs and for us, in the YOLOv5s numbers). We are actively working on speeding up the post-processing by leveraging the cache hierarchy in the CPUs through the DeepSparse engine, and are seeing promising early results. We'll be releasing those soon to show even better comparisons.
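
For context, the post-processing in question for YOLO-style detectors is mostly confidence filtering plus non-maximum suppression. A generic torchvision-based sketch of that step (illustrative only, with an assumed prediction layout; not any particular engine's implementation):

    import torch
    from torchvision.ops import nms

    def postprocess(pred: torch.Tensor, conf_thres: float = 0.25, iou_thres: float = 0.45):
        """pred: (N, 6) rows of [x1, y1, x2, y2, score, class] -- assumed layout."""
        pred = pred[pred[:, 4] > conf_thres]            # drop low-confidence boxes
        keep = nms(pred[:, :4], pred[:, 4], iou_thres)  # suppress overlapping boxes
        return pred[keep]

    # toy input: 1000 random but well-formed boxes
    xy = torch.rand(1000, 2) * 600
    wh = torch.rand(1000, 2) * 40 + 1
    fake = torch.cat([xy, xy + wh, torch.rand(1000, 1),
                      torch.randint(0, 80, (1000, 1)).float()], dim=1)
    print(postprocess(fake).shape)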


It depends on your application. If you are running on a smartphone, or on an AR headset, or on a car, or on a camera, etc, you generally do not have the latency budget to wait for multiple frames and run at high batch size.


V100 GPUs have non-tensor-core FP16 operations too, I think.


Yes. Non-tensor-core FP16 ops are the default. Tensor cores are essentially 4x4 FP16 MAC units, and there's a requirement that matrix dimensions be multiples of 8 [1] for them to be used.

[1]: https://docs.nvidia.com/deeplearning/performance/mixed-preci...
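
As a concrete illustration of that constraint (a sketch only; the exact eligibility rules vary by GPU generation and cuBLAS version, per the guide above):

    import torch

    # On Volta/Turing, FP16 GEMMs generally only hit the tensor cores when
    # m, n and k are multiples of 8; otherwise cuBLAS falls back to plain FP16 math.
    m, k, n = 1000, 253, 509                         # k and n are not multiples of 8
    a = torch.randn(m, k, device="cuda", dtype=torch.half)
    b = torch.randn(k, n, device="cuda", dtype=torch.half)
    plain = a @ b                                    # likely no tensor cores

    def round_up8(x: int) -> int:                    # hypothetical helper
        return (x + 7) // 8 * 8

    a_pad = torch.zeros(m, round_up8(k), device="cuda", dtype=torch.half)
    b_pad = torch.zeros(round_up8(k), round_up8(n), device="cuda", dtype=torch.half)
    a_pad[:, :k], b_pad[:k, :n] = a, b
    fast = (a_pad @ b_pad)[:, :n]                    # padded dims -> tensor-core eligible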


That's true... in fact, seeing V100 FP16 < T4 FP16 makes me believe you're right; the V100 should be much faster if the tensor cores were being used.


Batch size 1 improves latency, especially for businesses/services with fewer users. Latency matters.

Also, your CPU cost numbers are way off, using an expensive provider like AWS instead of, say, Vultr (https://www.vultr.com)

And many businesses/services can't saturate the hardware you describe. It's just too much compute power. With CPUs you can scale down to fit your actual needs: all the way down to a single AVX-512 core doing maybe 24 inferences per second (costing a few dollars PER MONTH).


I was providing costs for the exact instance types that NeuralMagic used in their blog post; if we're allowed to change that, then I can also find cheaper GPU providers.

I can agree with you that on super, super small inference deployments, maybe you can lower monthly spend by using CPUs. But I must ask: who is the target customer that is both spending <$100/month and also trying to optimize this? I feel like big players will have big workloads that will be most cost-effective on GPUs.


Disclosure: I work for Neural Magic.

Hi deepnotderp, as noted by others, the speeds listed here are comparing throughput numbers for the GPU from Ultralytics against latency numbers for the GPU from Neural Magic. We did also include throughput measurements, though, where YOLOv5s was around 3 ms per image on a V100 at FP16 in our testing. All benchmarks were run on AWS instances for repeatability and availability, which is likely where the 2 ms vs. 3 ms discrepancy comes from (slower memory transfer on the AWS machine vs. the one Ultralytics used). Note, though, that a slower overall machine will affect CPU results as well.

We benchmarked using the available PyTorch APIs, mimicking what was done for Ultralytics' benchmarking. This code is open-sourced for viewing and use here: https://github.com/neuralmagic/deepsparse/blob/main/examples...


These days I'm tending to prefer structural sparsity over pruning, because if you make good choices you get high-quality models that are fast on both CPU and (G/T)PU. The EfficientNet architecture (which leans heavily on separable convolutions) is a good example of this. Using GRUs with block-diagonal matrices and some 'side channel' for communication between the blocks also works very well.

A good structurally sparse model can usually be carefully pruned as well, but the gains are a bit smaller once you've already settled into a 'small enough' model architecture.
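
For concreteness, the depthwise-separable convolution that EfficientNet leans on is just a per-channel spatial conv followed by a 1x1 pointwise conv. A minimal PyTorch sketch (generic, not tied to any particular model):

    import torch.nn as nn

    class SeparableConv2d(nn.Module):
        """Depthwise 3x3 conv (one filter per channel) + 1x1 pointwise conv."""
        def __init__(self, in_ch: int, out_ch: int):
            super().__init__()
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    # A dense 3x3 conv has roughly in_ch*out_ch*9 weights; the separable version has
    # in_ch*9 + in_ch*out_ch, a structured form of sparsity that stays fast on CPUs
    # and GPUs alike because the remaining work is still dense.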


I would love to use this in a "special setting" but it has to be compiled, static code. So, if there is an implementation for C, C++, Rust or any other language, that would be great! I am more than willing to handle the special build myself. I just need a language that compiles to machine code.


You can use Apache TVM [1] to get a static runtime. I'm not sure of the state of sparsification acceleration support, but at least quantization is supported.

Edit: it's in C++ and has Rust bindings [2]

[1] https://tvm.apache.org/

[2] https://github.com/apache/tvm/tree/main/rust
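
Roughly, the TVM flow would be to import the ONNX export and ahead-of-time compile it into a shared library that C++ or Rust can link against. A sketch assuming a local yolov5s.onnx export with the usual "images" input name (details may differ across TVM versions):

    import onnx
    import tvm
    from tvm import relay

    model = onnx.load("yolov5s.onnx")  # assumption: an ONNX export of the network
    mod, params = relay.frontend.from_onnx(model, shape={"images": (1, 3, 640, 640)})

    # target an AVX-512 CPU; the result is ahead-of-time compiled machine code
    target = "llvm -mcpu=skylake-avx512"
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

    lib.export_library("yolov5s_tvm.so")  # load later from the C++/Rust runtime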


I don't know a lot about this but I think you can use this to remove the Python dependency:

https://pytorch.org/docs/stable/jit.html
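
Something along these lines (a generic sketch with a stand-in torchvision model; the saved archive can then be loaded from C++ via libtorch's torch::jit::load, with no Python interpreter involved):

    import torch
    import torchvision

    # stand-in model; any nn.Module works the same way
    model = torchvision.models.resnet18().eval()

    example = torch.randn(1, 3, 224, 224)
    scripted = torch.jit.trace(model, example)   # or torch.jit.script for data-dependent control flow
    scripted.save("model_torchscript.pt")

    # The .pt archive is self-contained and can be loaded from C++ with
    # torch::jit::load("model_torchscript.pt") -- no Python required at inference time.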


Is the special setting running it on iOS and distributing it via the Apple App Store? :)


No, it's for an edge compute platform that I'm developing for someone. So I will just provide the execution environment, not the actual programs.


Super interesting, but it would probably help the PyTorch performance significantly on both the GPU and CPU if they TorchScripted the models. It's probably pretty simple given they exported to ONNX, and it would be more apples to apples.

It'll be exciting to see what they can do with the A10 GPUs when they're available.


AFAIK models exported to ONNX are already JIT-traced as part of the export process: https://pytorch.org/docs/stable/onnx.html


I was referring to the benchmark script from the article https://github.com/neuralmagic/deepsparse/blob/main/examples...

It looks like they used torch.load instead of torch.jit.load, so I don't think it's scripted here, but I didn't check the model file itself, so I'm not entirely sure.
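
For reference, the two calls load very different things; a small illustration (file names are placeholders):

    import torch

    # torch.load unpickles a checkpoint: you get back Python objects (often a dict
    # holding an eager-mode nn.Module or its state_dict), so Python overhead remains.
    ckpt = torch.load("yolov5s.pt", map_location="cpu")

    # torch.jit.load expects a TorchScript archive produced by
    # torch.jit.trace/script(...).save(...); it runs through the TorchScript
    # interpreter rather than eager-mode Python dispatch.
    scripted = torch.jit.load("yolov5s.torchscript.pt", map_location="cpu")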


Disclosure: I used to work for Google Cloud and know Neural Magic's CEO (Brian Stevens).

It's great to see CPU optimizations, but I have a few nits:

> As of this writing, the new A100s do have hardware support for semi-structured sparsity but are not readily available.

A100s have been readily available on all major cloud providers for a while. This benchmark would only require one A100, anywhere in the world.

Echoing 37ef_ced3's point, I would argue that what matters is not raw qps (the A100 will trounce this) but rather qps at a given latency per dollar. The c5.12xlarge is $2.04/hr (and has a sad 12 Gbps NIC...) while the g4dn.2xlarge is $0.752/hr.

So the T4 setup is about 20% more qps, but almost 3x better inference/$!


Tangentially related: There is some controversy around YOLOv5's name: https://github.com/ultralytics/yolov5/issues/2


This is really impressive work. Would be interested to know whether they plan to support ARM CPUs in their runtime engine (currently looks like just a few AMD & Intel CPUs: https://github.com/neuralmagic/deepsparse). Many IoT and embedded devices with application-class processors could benefit from these speedups.


"Is there plan to support sparse NN influence on mobile devices, such as Arm CPUs?"

"Based on the amount of work for support, it's on our medium to long-term roadmap currently."

https://github.com/neuralmagic/deepsparse/issues/183


And there's no way for the community to pitch in to help with that, either.

> Open sourcing of the backend runtime is something we're not planning for the near to medium term. It is something we're actively reevaluating, though, as the company and the industry as a whole grows to see what works best for the needs of our users.


I guess pretty soon we'll wonder how anyone was ever ok with using raw non-sparsified models


What does sparsification mean here?



I told you! :)

CPUs keep throwing specialised hardware into the garbage bag every few years after mass adoption of something big, as the computer science improves.

People once said that real-time audio was physically impossible to do on the CPU; now you can do it even on a smartphone CPU.


CPUs haven't replaced GPUs for graphics though, and are very far away from doing it...


It does depend on what you mean by CPUs, but the Ryzen 5000G series is able to play video games at 720p (sometimes 1080p) with the graphics integrated on the CPU.


Err, isn't that a GPU on the same die? That's the same as Apple Silicon. No one does software rendering, do they?


The 5000G series is marginally worse than the 5000 series in CPU performance, and it is worse than any discrete GPU released in the last 5 years if we are talking GPU performance.


Some of that is figuring out the software overhead too. Alexia Massalin was doing audio DSP on her 68k workstation back in the late 80s with her Synthesis kernel. Overall, it seems that multiplexing with the right semantics is most of the problem in running new workloads on general-purpose CPUs, and that takes time to understand.



