jiayq84's comments

I run a startup called Lepton AI. We provide AI PaaS and fast AI runtimes as a service, so we keep a close eye on the IaaS supply chain. Over the last few months we've seen the supply chain get better and better, so the business model that worked 6 months ago - "we have GPUs, come buy barebone servers" - no longer works. However, a bigger problem has emerged, one that could shake the industry: people don't know how to use these machines efficiently.

There are clusters of GPUs sitting idle because companies don't know how to use them. Reselling them is embarrassing too, because it makes their image look bad to VCs, but a secondary market is slowly forming.

Essentially, people want a PaaS or SaaS on top of the barebone machines.

For example, for the last couple of months we have been helping a customer fully utilize their hundreds-of-cards cluster. Their IaaS provider was new to the field, so we literally helped both sides to (1) understand InfiniBand, NCCL, the training code, and so on; (2) figure out control-plane traffic; (3) build an accelerated storage layer for training; (4) watch for all kinds of subtle signals that need attention - did you know that a GPU can appear fine in nvidia-smi yet still fail the moment you actually run a CUDA or NCCL kernel? That needs care; (5) ship fast software runtimes, like an LLM runtime, fine-tuning scripts, and many others.
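To make the nvidia-smi point concrete: a minimal per-GPU smoke test looks roughly like the sketch below (assuming PyTorch with CUDA is installed; a real health check goes much further, e.g. a multi-process NCCL all-reduce, ECC counters, link-speed checks).

    # Minimal per-GPU smoke test: a device can look healthy in nvidia-smi
    # yet fail as soon as a real CUDA kernel is launched on it.
    # Sketch only - assumes PyTorch with CUDA support.
    import torch

    def check_gpu(index: int) -> bool:
        try:
            device = torch.device(f"cuda:{index}")
            # Allocate memory and run an actual kernel (matmul), then
            # synchronize so any asynchronous CUDA error surfaces here.
            a = torch.randn(4096, 4096, device=device)
            b = torch.randn(4096, 4096, device=device)
            torch.matmul(a, b)
            torch.cuda.synchronize(device)
            return True
        except RuntimeError as e:
            print(f"GPU {index} failed a real kernel launch: {e}")
            return False

    if __name__ == "__main__":
        bad = [i for i in range(torch.cuda.device_count()) if not check_gpu(i)]
        print("unhealthy GPUs:", bad or "none")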

So I think AI PaaS and SaaS are going to be a very valuable (and big) market once people come out of the frenzy of grabbing GPUs - now we need to use them efficiently.


Full open-source code with Apache license here: https://github.com/leptonai/search_with_lepton


Hi folks - Yangqing from Lepton here. The idea came from a coffee chat with a colleague on the question: how much of RAG quality comes from the good old search engine, versus the LLM? We figured the best way to find out was to build a quick experiment and try it. What we learned is that search engine results matter a lot - probably more than the LLM. We decided to put it up as a site and also open-source the full code.

You can try plugging in different search engines or even your own Elasticsearch interface, write different LLM prompts, and pick different LLM models - there are a lot of ablation studies to try.
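Just to illustrate the shape of the experiment (not the actual code in the repo), the pattern is: fetch search results, stuff them into the prompt with instructions to cite, and let the LLM answer. fetch_search_results and the model name below are placeholders - wire in Bing, Google, or your own Elasticsearch index, and any OpenAI-compatible LLM endpoint.

    # Rough sketch of the search-then-answer pattern (not the repo's code).
    # fetch_search_results() is a placeholder for whichever search backend
    # you plug in; the LLM call assumes an OpenAI-compatible endpoint.
    from openai import OpenAI

    client = OpenAI()  # point base_url / api_key at your LLM provider

    def fetch_search_results(query: str) -> list[dict]:
        """Placeholder: return [{"title": ..., "snippet": ..., "url": ...}, ...]"""
        raise NotImplementedError("wire up your search engine of choice here")

    def answer(query: str) -> str:
        results = fetch_search_results(query)
        context = "\n\n".join(
            f"[{i + 1}] {r['title']}\n{r['snippet']}" for i, r in enumerate(results)
        )
        prompt = (
            "Answer the question using ONLY the numbered search results below, "
            "and cite them like [1].\n\n"
            f"{context}\n\nQuestion: {query}"
        )
        resp = client.chat.completions.create(
            model="llama-3-8b",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

Swapping only the search backend, or only the model, is what makes the "search vs. LLM" ablation easy to run.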

We appreciate your interest and happy Friday!


General availability of structured decoding for ALL open-source models hosted on Lepton AI. Simply provide the schema you want the LLM to follow, and all our model APIs will automatically produce outputs that conform to it. In addition, you can host your own LLMs with structured decoding without having to fine-tune them.
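As a rough illustration of the flow against an OpenAI-compatible endpoint (the exact field used to pass the schema below is an assumption for illustration - check the API docs for the real parameter name):

    # Sketch of requesting schema-constrained output from an OpenAI-compatible
    # endpoint. The "response_format"/schema field passed via extra_body is an
    # assumption - the real parameter name may differ.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="https://<your-endpoint>/api/v1", api_key="...")

    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "founded_year": {"type": "integer"},
        },
        "required": ["name", "founded_year"],
    }

    resp = client.chat.completions.create(
        model="mistral-7b",  # placeholder model name
        messages=[{"role": "user", "content": "Extract the company and year from: 'Acme Robotics was founded in 2015.'"}],
        extra_body={"response_format": {"type": "json_object", "schema": schema}},
    )
    print(json.loads(resp.choices[0].message.content))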


Super cool demonstration of what a local machine can already do amid the AI frenzy!


Thanks so much for the warm words!


Thanks - we definitely agree that llama.cpp is great; big fan of their optimizations. We are more or less orthogonal to the engines, though, in the sense that we serve as the infra/platform to run and manage those implementations easily. We also support running a wider range of models - for example, SDXL is a single line too:

lep photon run -n sdxl -m hf:stabilityai/stable-diffusion-xl-base-1.0 --local

It's really about making it as easy as possible to productize a wide range of models.
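And once a photon is running locally, hitting it is just an HTTP call - the port, path, and payload below are illustrative placeholders rather than the exact photon API, so check the docs for the real interface.

    # Illustrative only: the local port, path, and payload shape are
    # placeholders, not the exact photon API.
    import requests

    resp = requests.post(
        "http://localhost:8080/run",  # assumed local address
        json={"prompt": "a watercolor painting of a lighthouse at dawn"},
        timeout=120,
    )
    resp.raise_for_status()
    with open("out.png", "wb") as f:
        f.write(resp.content)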


SDXL is indeed a monster to install and set up. The UIs are even worse.

IDK if the GPL license is compatible with your business, but I wonder if you could package Fooocus or Fooocus-MRE into a window? It's a hairy monster to install and run, but I've never gotten such consistently amazing results from a single prompt box + style dropdown (including native HF diffusers and other diffusers-based frontends). The automatic augmentations to the SDXL pipeline are amazing:

https://github.com/MoonRide303/Fooocus-MRE


Oh wow yeah, that is a beast. Let me give it a shot.


Thanks - the policies are listed here: https://www.lepton.ai/policies

We'll put a link on our homepage.

In short: we do not collect, record, or log any of your prompts or responses. They are computed in memory, returned, and discarded on the fly.


In theory one can have 640GB of GPU memory (8 × 80GB A100s) and launch it. Falcon-180B in fp16 takes about 360GB for the weights, so there would be enough memory. It's definitely going to be very expensive, though.
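Back-of-the-envelope, for anyone who wants to plug in other models or precisions:

    # Weights-only memory estimate (ignores KV cache, activations, overhead).
    params = 180e9        # Falcon-180B
    bytes_per_param = 2   # fp16
    weights_gb = params * bytes_per_param / 1e9
    cluster_gb = 8 * 80   # 8x A100 80GB
    print(f"weights: {weights_gb:.0f} GB, available: {cluster_gb} GB")
    # -> weights: 360 GB, available: 640 GB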


Llama.cpp can run quantized Falcon on a top-end Mac Studio, which is only five grand: https://twitter.com/ggerganov/status/1699791226780975439

If I'm paying a third party a hundred bucks a month, I'd at least want them to match the capabilities of consumer hardware.


Great catch! Our cloud machine hit a CUDA error (the GPU fell off the PCIe bus) and we had to restart it. It's back to normal now.

All the more reason to have a managed version of the service :)

