Perhaps somewhat off topic, does anyone know how stuff like ggml compares to run...

refulgentis · on May 15, 2024

I'm intimately familiar, maintain ONNX and llama.cpp Flutter libraries across all 6 True Platforms.

Quick opinionated TL;DR:

- llama.cpp for LLMs and can do whisper with it's core dependency, GGML.

- ONNX for everything else.

- TF is the Apple of ML, it's great if you're completely wedded to Google ML ecosystem. Virtually dead outside that. (Something absurd ,like, 94%, of HF models are Pytorch)

- only chance I'd have to do a direct comparison in inference performance is Whisper in ONNX vs. GGML, someone got my llama .cpp lib running with Whisper and didn't report significant perf difference

catgary · on May 15, 2024

I don’t think that’s a terribly fair description of Google - AWS’s chips (inferon and trainium) both have robust XLA/JAX support. Plus JAX now exports MLIR, so there is a really compelling JAX -> IREE pipeline so JAX models can more or less be deployed anywhere, even on bare metal embedded devices.

refulgentis · on May 15, 2024

You're right, if you need to go from data => model running in web app, I'd do TF - the inartful Apple analogy is meant to indicate that: great vertical integration.

For local inference of an existing model, TF Lite pales in comparison to ONNX. ONNX goes out of its way for you get ~anything running ~anywhere on the best accelerator available on the platform.* AFAIK TF Lite only helps if your model was in TF.

And there simply isn't an LLM scene for TensorFlow, so it "loses" to llama.cpp for that. There isn't an ONNX LLM scene either, though. (see below)

* There's one key exception...until recently...LLMs! ONNX's model format was limited due to protobuf, IIRC it was 2-4 GB. Part of the Phi-3 announcement was this library they've been stubbing out that's on top of ONNX, but more specialized for LLMs. That being said, haven't seen any LLMs in it except Phi-3, and it's an absolute mess, the library was announced weeks ahead of when it was planned to be released, and then throw in the standard 6-week slippage, I'm probably not trying it again until June.

catgary · on May 15, 2024

I didn’t mention tensorflow or TF Lite? IREE is a different technology, it is essentially a compiler/VM for MLIR. Since JAX exports MLIR, you can run JAX on any platform supported by IREE (which is basically any platform from embedded hardware to datacenter GPUs). It is less mature than ONNX at this point, but much more promising when it comes to deploying on edge devices.

refulgentis · on May 15, 2024

> I didn’t mention tensorflow or TF Lite

I know.

OP didn't mention JAX, or IREE.

I avoided having a knee-jerk "Incorrect, the thing you are talking about is off-topic" reaction because it's inimical to friendly conversation.

Mentioning ONNX and llama.cpp made it clear they were talking about inference. In that context, TF Lite isn't helpful unless you have TF. JAX is irrelevant. IREE is expressly not there yet[1] and has nothing to do with Google.

[1] c.f. the GitHub "IREE is still in its early phase. We have settled down on the overarching infrastructure". From a llama.cpp/ONNX/local inference perspective, there's little more than "well, we'll get MLIR, then we can make N binaries for N model x arch x platform options." Doesn't sound so helpful to me, but I got it easy right now, models are treated as data instead of code in both ONNX and llama.cpp. I'm not certain that's the right thing long-term, ex. it incentivizes "kitchen sink" ML frameworks, people always want me to add a model to the library, rather than use my library as a dependency.

catgary · on May 16, 2024

Speaking as someone who deploys ML models on edge devices - having the MLIR and being able to make binaries for different platforms is terrifically useful! You’re often supporting a range of devices (e.g. video game consoles, phones, embedded, etc) and want to tune the compilation to maximize the performance for your deployment target. And there are a lot of algorithms/models in robotics and animation that simply won’t work with ONNX as they involve computing gradients (which torch cannot export, and the gradients ONNXRuntime give you only work in training sessions, which aren’t as widely). supported.

Also, if you have JAX you can get TF and therefore TFLite. And IREE is part of the OpenXLA project, which Google started.

a_wild_dandan · on May 15, 2024

What’re your thoughts on MLX? It’s been phenomenal on my MBP.

refulgentis · on May 15, 2024

No time* to try it unfortunately :( Sounds great, though, and Mac just kicks the pants of every other platform on local inference thanks to Metal, I imagine MLX must extend that lead to the point Qualcomm/Google has to a serious investment in open source acceleration. Cheapest iPhone from 2022 kicks the most expensive Android from 2023 (Pixel Fold) around the block, 2x on inference. 12 tkns/s vs. 6/s.

* it sounded like a great idea to do an OpenAI LLM x search app. Then it sounded like a great idea to add embeddings locally for privacy (thus, ONNX). Then it sounded like a great idea to do a local LLM (thus, llama.cpp). Then it sounded like a great idea to differentiate by being on all platforms, supported equally. Really taxing. Think I went too far this time. It works but, jeez, the workload...hopefully, after release, it turns out maintenance load is relatively low

svnt · on May 15, 2024

What are the True Platforms, in this case?

crthpl · on May 15, 2024

I'm guessing MacOS, Linux, Android, Windows, iOS, and Web?

refulgentis · on May 15, 2024

Correct

ilaksh · on May 15, 2024

On what hardware exactly?

koe123 · on May 15, 2024

Probably should have specified that! I’m referring to cpu inference.

ilaksh · on May 15, 2024

On what CPU?

koe123 · on May 16, 2024

Typically I'm considering embedded applications (armv8, armv7)