I don’t think that’s a terribly fair description of Google - AWS’s chips (Inferentia and Trainium) both have robust XLA/JAX support. Plus JAX now exports MLIR, so there is a really compelling JAX -> IREE pipeline: JAX models can more or less be deployed anywhere, even on bare-metal embedded devices.
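Roughly, the JAX side of that export looks like this (a minimal sketch with a toy model; API details can shift between JAX versions):

```python
import jax
import jax.numpy as jnp

def predict(params, x):
    # Toy model standing in for whatever JAX model you want to deploy.
    return jnp.tanh(x @ params["w"] + params["b"])

params = {"w": jnp.ones((4, 2)), "b": jnp.zeros(2)}
x = jnp.ones((1, 4))

# Lower the jitted function and dump the StableHLO/MLIR it produces;
# this text is what an MLIR-based compiler like IREE consumes.
mlir_text = jax.jit(predict).lower(params, x).as_text()
print(mlir_text[:500])
```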
You're right, if you need to go from data => model running in a web app, I'd do TF - the inartful Apple analogy was meant to convey exactly that: great vertical integration.
For local inference of an existing model, TF Lite pales in comparison to ONNX. ONNX goes out of its way to get ~anything running ~anywhere on the best accelerator available on the platform.* AFAIK TF Lite only helps if your model started in TF.
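To illustrate the "best accelerator available" bit, here's a minimal onnxruntime sketch (the model path is a placeholder, and which execution providers you actually get depends on the onnxruntime build you install):

```python
import numpy as np
import onnxruntime as ort

# See which execution providers this onnxruntime build supports
# (e.g. CUDA, DirectML, CoreML, or plain CPU).
print(ort.get_available_providers())

# List accelerators in priority order; onnxruntime falls back to CPU
# when the preferred providers aren't available on the machine.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path to some exported ONNX model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Run with a dummy input shaped to match the model's first declared input.
inp = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
outputs = session.run(None, {inp.name: np.zeros(shape, dtype=np.float32)})
```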
And there simply isn't an LLM scene for TensorFlow, so it "loses" to llama.cpp for that. There isn't an ONNX LLM scene either, though. (see below)
* There's one key exception, at least until recently: LLMs. ONNX's model format was limited by protobuf; IIRC the cap was 2-4 GB. Part of the Phi-3 announcement was a library they've been stubbing out on top of ONNX, more specialized for LLMs. That said, I haven't seen any LLMs in it except Phi-3, and it's an absolute mess: the library was announced weeks ahead of when it was planned to be released, and once you throw in the standard 6-week slippage, I'm probably not trying it again until June.
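(To be fair, the limit is really protobuf's cap on a single serialized message; plain ONNX can work around it by pushing weights into external data files, something like this sketch with placeholder file names:)

```python
import onnx

# Re-save a large model with its weight tensors stored outside the protobuf,
# which is how ONNX sidesteps the protobuf message-size limit.
model = onnx.load("big_model.onnx")  # placeholder path
onnx.save_model(
    model,
    "big_model_external.onnx",
    save_as_external_data=True,    # move tensors out of the protobuf
    all_tensors_to_one_file=True,  # keep them in one sidecar file
    location="big_model_external.data",
)
```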
I didn’t mention TensorFlow or TF Lite? IREE is a different technology: it is essentially a compiler and VM for MLIR. Since JAX exports MLIR, you can run JAX on any platform supported by IREE (which is basically any platform, from embedded hardware to datacenter GPUs). It is less mature than ONNX at this point, but much more promising when it comes to deploying on edge devices.
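Concretely, once you have the MLIR out of JAX, the IREE side looks roughly like this (a sketch assuming the iree-compiler Python package and its compile_str entry point; backend names vary by target and IREE version):

```python
import jax
import jax.numpy as jnp
from iree import compiler as ireec

# Export a toy JAX function to StableHLO/MLIR text.
def f(x):
    return jnp.sin(x) * 2.0

mlir_text = jax.jit(f).lower(jnp.ones((8,))).as_text()

# Compile that MLIR with IREE for a CPU target; other backends
# (e.g. Vulkan/SPIR-V, CUDA) cover GPUs and more exotic hardware.
vmfb = ireec.compile_str(
    mlir_text,
    target_backends=["llvm-cpu"],
    input_type="stablehlo",
)
# `vmfb` is an IREE VM flatbuffer you can load with iree.runtime
# on the deployment target.
```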
I avoided having a knee-jerk "Incorrect, the thing you are talking about is off-topic" reaction because it's inimical to friendly conversation.
Mentioning ONNX and llama.cpp made it clear they were talking about inference. In that context, TF Lite isn't helpful unless you have TF. JAX is irrelevant. IREE is expressly not there yet[1] and has nothing to do with Google.
[1] cf. the GitHub README: "IREE is still in its early phase. We have settled down on the overarching infrastructure". From a llama.cpp/ONNX/local-inference perspective, there's little more on offer than "well, we'll get MLIR, then we can make N binaries for N model x arch x platform options." That doesn't sound so helpful to me, but I have it easy right now: models are treated as data instead of code in both ONNX and llama.cpp. I'm not certain that's the right thing long-term, e.g. it incentivizes "kitchen sink" ML frameworks - people always want me to add a model to the library rather than use my library as a dependency.
Speaking as someone who deploys ML models on edge devices - having the MLIR and being able to make binaries for different platforms is terrifically useful! You’re often supporting a range of devices (e.g. video game consoles, phones, embedded, etc.) and want to tune the compilation to maximize performance for your deployment target. And there are a lot of algorithms/models in robotics and animation that simply won’t work with ONNX, as they involve computing gradients (which torch cannot export, and the gradients ONNXRuntime gives you only work in training sessions, which aren’t as widely supported).
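To make the gradient point concrete: in JAX the gradient is just another traced function, so it lowers to MLIR exactly like the forward pass does (toy example, the loss here is made up):

```python
import jax
import jax.numpy as jnp

# Made-up loss, standing in for e.g. an IK or physics objective.
def loss(params, x, target):
    pred = jnp.tanh(x @ params)
    return jnp.mean((pred - target) ** 2)

grad_loss = jax.grad(loss)  # gradient w.r.t. params, traced like any other fn

params = jnp.ones((3, 2))
x = jnp.ones((5, 3))
target = jnp.zeros((5, 2))

# The gradient computation lowers to the same StableHLO/MLIR as the forward
# pass, so it can go down the same MLIR -> IREE deployment path.
mlir_text = jax.jit(grad_loss).lower(params, x, target).as_text()
print(mlir_text[:300])
```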
Also, if you have JAX you can get TF and therefore TFLite. And IREE is part of the OpenXLA project, which Google started.
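The JAX -> TF -> TFLite path is roughly jax2tf plus the TFLite converter (a sketch from memory of the documented flow; details vary with TF/JAX versions):

```python
import jax.numpy as jnp
import tensorflow as tf
from jax.experimental import jax2tf

def predict(x):
    return jnp.tanh(x * 2.0)

# Wrap the JAX function as a TF function and trace a concrete signature.
tf_predict = tf.function(
    jax2tf.convert(predict),
    input_signature=[tf.TensorSpec([1, 4], tf.float32)],
    autograph=False,
)

# Convert the traced function into a TFLite flatbuffer for on-device use.
converter = tf.lite.TFLiteConverter.from_concrete_functions(
    [tf_predict.get_concrete_function()]
)
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```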