I run a startup called Lepton AI. We provide AI PaaS and fast AI runtimes as a service, so we keep a close eye on the IaaS supply chain. Over the last few months we have seen the supply chain getting better and better, so the business model that worked six months ago - "we have GPUs, come buy barebone servers" - no longer works. However, a bigger problem is emerging, one that could shake the industry: people don't know how to use these machines efficiently.
There are clusters of GPUs sitting idle because companies don't know how to use them. Reselling them is embarrassing too, because it makes the company look bad to VCs, but a secondary market is slowly emerging.
Essentially, people want a PaaS or SaaS on top of the barebone machines.
For example, for the last couple of months we have been helping a customer fully utilize their hundreds-of-card cluster. Their IaaS provider was new to the field, so we literally helped both sides to (1) understand InfiniBand, NCCL, the training code, and so on; (2) figure out control-plane traffic; (3) build an accelerated storage layer for training; (4) watch all kinds of subtle signals that need attention - did you know that a GPU can appear OK in nvidia-smi but still fail when you actually run a CUDA or NCCL kernel? That needs care; and (5) provide fast software runtimes, like an LLM runtime, finetuning scripts, and many others.
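The nvidia-smi point above can be sketched as a two-level health check: first ask the driver whether the GPU is visible, then actually launch a kernel and verify the result. A minimal sketch in Python, assuming PyTorch with CUDA is available for the kernel-level probe; the function names and the "silently bad" label are illustrative, not Lepton's tooling:

```python
# Sketch: nvidia-smi can report a GPU as present while a real CUDA/NCCL
# kernel still fails, so a burn-in check should run an actual kernel.
import subprocess

def gpu_visible_in_smi(index: int) -> bool:
    """Level 1: does nvidia-smi even list the GPU? Necessary but not sufficient."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(index), "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return out.returncode == 0 and out.stdout.strip() != ""

def gpu_runs_cuda_kernel(index: int) -> bool:
    """Level 2: launch a real kernel (a matmul) and check the result is sane."""
    try:
        import torch  # assumed available; a torch-free probe would shell out instead
        dev = torch.device(f"cuda:{index}")
        a = torch.randn(1024, 1024, device=dev)
        b = a @ a  # forces an actual kernel launch, not just a driver query
        torch.cuda.synchronize(dev)
        return bool(torch.isfinite(b).all())
    except Exception:
        return False

def classify(smi_ok: bool, kernel_ok: bool) -> str:
    """Combine both signals; 'silently bad' is the case nvidia-smi alone misses."""
    if not smi_ok:
        return "dead"
    return "healthy" if kernel_ok else "silently bad"
```

In practice you would run the same idea one level up as well: a small NCCL allreduce across the whole cluster catches links and switches that every per-node check passes.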
So I think AI PaaS and SaaS are going to be a very valuable (and big) market once people come out of the frenzy of "grabbing GPUs" - now we need to use them efficiently.
Hi folks - Yangqing from Lepton here. The idea came from a coffee chat with a colleague about the question: how much of RAG quality comes from the good old search engine, versus the LLM? We figured the best way to find out was to build a quick experiment and try it. What we learned is that search engine results matter a lot - probably more than the LLM. We decided to put it up as a site and also open-source the full code.
You can try plugging in different search engines (or even your own Elasticsearch interface), writing different LLM prompts, and picking different LLM models - a lot of ablation studies can be tried out.
General availability of structured decoding for ALL open-source models hosted on Lepton AI: simply provide the schema you want the LLM to follow, and all our model APIs will automatically produce outputs matching it. In addition, you can host your own LLMs with structured decoding without having to finetune them.
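To make "provide the schema" concrete, here is a minimal sketch of what a schema-constrained request could look like. It assumes an OpenAI-compatible chat-completions payload; the `response_format`/`json_schema` field names follow that convention and the model name is a placeholder - the announcement above does not specify Lepton's exact wire format:

```python
# Sketch: structured decoding means the server constrains sampling so the
# output always parses against a schema the caller supplies.
import json

# The schema we want the LLM's output to satisfy.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

def build_request(model: str, prompt: str, schema: dict) -> dict:
    """Assemble a chat-completion payload asking for schema-constrained output.
    Field names follow the OpenAI-compatible convention (an assumption here)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_schema", "json_schema": {"schema": schema}},
    }

payload = build_request("my-model", "What is the largest city in Japan?", schema)

# A conforming server guarantees the reply parses and has the required keys,
# e.g. a reply like this illustrative string:
reply = '{"city": "Tokyo", "population": 37000000}'
parsed = json.loads(reply)
assert set(schema["required"]) <= parsed.keys()
```

The point of the feature is the guarantee in the last assertion: you never need retry-and-reparse loops around free-form JSON output.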
Thanks - we definitely agree that llama.cpp is great. Big fan of their optimizations. We are more or less orthogonal to the engines, though - in the sense that we serve as the infra/platform to run and manage those implementations easily. We also support running a wider range of models - for example, sdxl is one single line too:
lep photon run -n sdxl -m hf:stabilityai/stable-diffusion-xl-base-1.0 --local
It's really about productizing a wide range of models as easily as possible.
SDXL is indeed a monster to install and set up. The UIs are even worse.
IDK if the GPL license is compatible with your business, but I wonder if you could package Fooocus or Fooocus-MRE into a window? It's a hairy monster to install and run, but I've never gotten such consistently amazing results from a single prompt box + style dropdown (including native HF diffusers and other diffusers-based frontends). The automatic augmentations to the SDXL pipeline are amazing:
In theory one can have 640G of memory (= 8 * 80G A100s) and launch it. Falcon 180B in fp16 is 360G, so there would be enough memory. It's definitely going to be very expensive, though.
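The arithmetic above is just parameters times bytes per parameter - a quick sketch, counting only the weights (KV cache and activations eat into the remainder):

```python
# Sketch of the memory math: fp16 stores 2 bytes per parameter, so a
# 180B-parameter model needs ~360 GB for weights alone, which fits in the
# 8 * 80 GB = 640 GB of HBM on an 8x A100-80G node.
def weight_memory_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Weights-only footprint: 1e9 params * bytes/param = that many GB."""
    return n_params_billion * bytes_per_param

falcon_gb = weight_memory_gb(180)   # 360 GB of fp16 weights
node_gb = 8 * 80                    # 640 GB total across the node
headroom = node_gb - falcon_gb      # ~280 GB left for KV cache and activations
```

The same function shows why quantization helps: at 1 byte per parameter (int8) the weights drop to 180 GB, and a smaller node starts to fit.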