Launch HN: Outerport (YC S24) – Instant hot-swapping for AI model weights
93 points by tovacinni 82 days ago | 24 comments
Hi HN! We’re Towaki and Allen, and we’re building Outerport (https://outerport.com), a distribution network for AI model weights that enables ‘hot-swapping’ of AI models to save on GPU costs.

‘Hot-swapping’ lets you serve different models on the same GPU machine with only ~2 second swap times (~150x faster than baseline). You can see this in action in a live demo where you can try the same prompts on different open-source large language models at https://hotswap.outerport.com, and see the docs at https://docs.outerport.com.

Running AI models on the cloud is expensive. Outerport came from our own experience building AI services and struggling with the cost.

Cloud GPUs are charged by the amount of time used. A long start-up time (from loading models into GPU memory) means that to serve requests quickly, we need to acquire extra GPUs with models pre-loaded for spare capacity (i.e. ‘overprovision’). The time spent on loading models also adds to the cost. Both lead to inefficient use of expensive hardware.

The long start-up times are caused by how massive modern AI models are, particularly large language models. These models are often several gigabytes to terabytes in size. Their sizes continue to grow as models evolve, exacerbating the issue.

GPU capacity also needs to adapt dynamically to demand, which complicates things further. Starting up a new GPU machine takes time, and transferring a large model to it takes even more.

Traditional container-based solutions and orchestration systems (like Docker and Kubernetes) are not optimized for these large, storage-intensive AI models; they are designed for smaller, more numerous containerized applications (usually 50MB to 1GB in size). There needs to be a solution designed specifically for model weights (floating point arrays) running on GPUs, one that can take advantage of things like layer sharing, caching, and compression.

We made Outerport, a specialized system to manage and deploy AI models, as a solution to these problems and to help save GPU costs.

Outerport is a caching system for model weights, allowing read-only models to be cached in pinned RAM for fast loading into GPU. Outerport is also hierarchical, maintaining a cache across S3 to local SSD to RAM to GPU memory, optimizing for reduced data transfer costs and load balancing.

Within Outerport, models are managed by a dedicated daemon process that handles transfers to the GPU, loads models from the registry, and orchestrates the ‘hot-swapping’ of multiple models on one machine.

‘Hot-swapping’ lets you provision a single GPU machine to be ‘multi-tenant’, so that multiple services with different models can run on the same machine. For example, this can facilitate A/B testing of two different models, or running a text generation endpoint and an image generation endpoint on the same machine.
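
To make the mechanics concrete, here is a minimal sketch of the idea (not Outerport's actual API): a cache that pulls weights down the tiers (S3 to local SSD to pinned RAM) and swaps one model onto the GPU at a time. It assumes PyTorch and boto3; the class, bucket, and paths are made up for illustration.

  import os
  import boto3
  import torch

  class PinnedModelCache:
      """Keeps read-only state dicts pinned in host RAM; one model lives on the GPU."""

      def __init__(self, bucket, cache_dir="/var/cache/models"):
          self.s3 = boto3.client("s3")
          self.bucket = bucket
          self.cache_dir = cache_dir
          self.pinned = {}      # model name -> state dict held in pinned (page-locked) RAM
          self.on_gpu = None    # name of the model currently resident on the GPU

      def fetch(self, name):
          """Pull weights down the tiers: S3 -> local SSD -> pinned RAM."""
          if name in self.pinned:
              return self.pinned[name]
          path = os.path.join(self.cache_dir, f"{name}.pt")
          if not os.path.exists(path):  # SSD miss: download from the registry (S3 here)
              self.s3.download_file(self.bucket, f"{name}.pt", path)
          state = torch.load(path, map_location="cpu")
          self.pinned[name] = {k: v.pin_memory() for k, v in state.items()}
          return self.pinned[name]

      def hot_swap(self, name, model):
          """Copy `name`'s weights onto the GPU and load them into `model`."""
          state = self.fetch(name)
          gpu_state = {k: v.to("cuda", non_blocking=True) for k, v in state.items()}
          torch.cuda.synchronize()  # wait for the async host-to-device copies to finish
          model.load_state_dict(gpu_state)
          self.on_gpu = name
          return model

The point of the pinned (page-locked) buffers is that host-to-device copies can run as fast DMA transfers, so the expensive part (disk or network) only happens on a cache miss.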

We have been running simulations to determine the cost reductions we can get from this multi-model service scheme compared to multiple single-model services. Our initial results show that we can achieve a 40% reduction in GPU running-time costs. The improvement comes from the multi-model service’s ability to smooth out traffic peaks, which enables more effective horizontal scaling: less time is wasted on acquiring additional machines and loading models. Our hypothesis is that the cost savings are substantial enough to make a viable business while still saving customers significant amounts of money.

We think there are lots of exciting directions to take from here, from more sophisticated compression algorithms to providing a central platform for model management and governance. Towaki worked on ML systems and model compression at NVIDIA, and Allen did research in operations research, which is why we’re so excited about this problem as something that combines both.

We’re super excited to share Outerport with you all. We also intend to release as much of this as possible under an open-core model when we’re ready. We would love to hear what you think, any experience you have with this or related problems, and any other ideas you might have!




Genuine question: what's the difference between your startup and just calling the code below with a different model on a cloud machine, other than some ML/DevOps engineer not knowing what they are doing...?

  def predict(X, model_config, model_path):
      model = get_model(*model_config)
      state_dict = torch.load(model_path, weights_only=True)
      new_state_dict = {k.replace('_orig_mod.', ''): v for k, v in state_dict.items()}
      model.load_state_dict(new_state_dict)
      model.eval()
      with torch.no_grad():
          output = model(torch.FloatTensor(X))
          probabilities = torch.softmax(output, dim=-1)
      return probabilities.numpy()


The advantage of loading from a daemon over loading all the weights at once in Python is that it can support multiple processes, or even the same process consecutively (if it dies, or has to switch to something else).

Loading from disk to VRAM can be super slow- so doing this every time you have a new process is wasteful. Instead, if you have a daemon process that keeps multiple model weights in pinned RAM, you can load them much quicker (~1.5 seconds for an 8B model, like we show in the demo).
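
If you want to see the gap on your own machine, here's a rough, illustrative comparison (assuming PyTorch and a CUDA GPU; sizes and timings will vary):

  import time
  import torch

  # ~2 GB of fake weights: 32 layers of 4096x4096 float32
  state = {f"layer{i}": torch.randn(4096, 4096) for i in range(32)}
  torch.save(state, "/tmp/weights.pt")

  def to_gpu(sd):
      out = {k: v.to("cuda", non_blocking=True) for k, v in sd.items()}
      torch.cuda.synchronize()
      return out

  # Cold path: read from disk every time, then copy to the GPU.
  t0 = time.time()
  to_gpu(torch.load("/tmp/weights.pt", map_location="cpu"))
  print("disk -> GPU:", time.time() - t0)

  # Warm path: weights already resident in pinned (page-locked) host RAM.
  pinned = {k: v.pin_memory() for k, v in state.items()}
  t0 = time.time()
  to_gpu(pinned)
  print("pinned RAM -> GPU:", time.time() - t0)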

You _could_ also make a single mega router process, but then there are issues like all services needing to agree on dependency versioning. This has been a problem for me in the past (like LAVIS requiring a certain transformer version that was not compatible with some other diffusion libraries)


> Outerport is a caching system for model weights, allowing read-only models to be cached in pinned RAM for fast loading into GPU. Outerport is also hierarchical, maintaining a cache across S3 to local SSD to RAM to GPU memory, optimizing for reduced data transfer costs and load balancing.

This is really cool. Are the costs of running this mainly storage, or is there significant compute tied up in it?

The time/cost of downloading models onto a GPU cloud instance really adds up when you are paying per second.


Thanks! If you mean the costs for users of Outerport, it'll be a subscription model for our hosted registry (with a limit on storage / S3 egress) and a license model for self-hosting the registry. So mainly storage, since the idea is also to minimize the egress costs associated with the compute tied up in it!


This is very cool! Most of the work I've seen on reducing inference costs has been via things like LoRAX, which lets multiple fine-tunes share the same underlying base model.

Do you imagine Outerport being a better fit for OSS model hosts like Replicate, Anyscale, etc. or for companies that are trying to host multiple models themselves?

The use case you mentioned speaks more to the latter, but it seems like the value at scale is with the model-hosting-as-a-service companies.


Thanks!

I think both are fits- we've gotten interest from both types of companies, and our first customer is an "OSS model host".

Our 40% savings result is also specifically for the five-model-service case, so there could be non-trivial cost reduction even with a reasonably small number of models.


Could you attach model weights as a preamble to a prompt? So you could submit prompts through a layer that pre-warms the model weights for you based on the prompt, then take the output into some next step in your workflow and apply a new weight preamble depending on what the next phase is?

Like, for a particular portion of the workflow - say some crawler over weird insurance claims data at scale - you want particular weights for the specific logic you're running to search for fraud.


That's a super neat idea- we should in fact be able to use this same system to support the orchestration of a 'system prompt caching' sort of thing (across deployments). I'll put this on my 'things to hack on' list :)
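
Purely as a hypothetical sketch of what that routing layer could look like (the preamble tag format and the prewarm/run hooks are made up, not an existing API):

  # Hypothetical: a "weights preamble" tag at the top of the prompt decides which
  # model gets pre-warmed before the request runs. Nothing here is a real API.
  PREAMBLE_TO_MODEL = {
      "claims-fraud": "fraud-detector-8b",
      "general": "llama-3-8b",
  }

  def route(prompt, prewarm, run):
      tag = "general"
      if prompt.startswith("#weights:"):
          header, _, body = prompt.partition("\n")
          tag, prompt = header.removeprefix("#weights:").strip(), body
      model = PREAMBLE_TO_MODEL.get(tag, PREAMBLE_TO_MODEL["general"])
      prewarm(model)   # e.g. ask the daemon to stage this model's weights into pinned RAM
      return run(model, prompt)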


This seems useful but honestly I think you guys are better off getting IP protection and licensing out the technology. This is a classic "feature not a product" and I don't see you competing against google/microsoft/huggingface in the model management space.


Maybe! Many people don't want vendor lock-in, though, and there are new GPU cloud providers gaining traction. Some still prefer on-prem.

We hope to make it easier to bridge the multi-cloud landscape by being independent and 'outer'.


This is really exciting! I was hoping for someone to tackle inference time and this product will definitely be a boost to some of our use cases in medical imaging.


Awesome to hear- that sounds like an application we'd love to help with!

(Please feel free to reach out to us too at towaki@outerport.com !)


Is this tied to a specific framework like pytorch or an inference server like vLLM?

Our inference stack is built using candle in Rust, how hard would it be to integrate?


We’d just need to write a Rust client for the daemon and load the weights in a way that is compatible with candle- we can definitely look into this, since parts of what we are building are already in Rust!


Do all variations of the model need to have the same architecture?

Or can they be different types of models with different number of layers, etc?


Variants do not have to be the same architecture- the demo (https://hotswap.outerport.com/) runs on a couple of different open source architectures.

That being said, there is some smart caching / hashing on layers, so that if you do have models that are similar (e.g. a fine-tuned model where only some layers are fine-tuned), it'll minimize storage and transfer by reusing those weights.
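
For intuition, here is a minimal sketch of that kind of layer-level content hashing (file names are placeholders, and Outerport's actual scheme may differ):

  import hashlib
  import torch

  def layer_hashes(state_dict):
      """Content-hash each tensor so identical layers can be stored/transferred once."""
      hashes = {}
      for name, tensor in state_dict.items():
          # assumes a numpy-convertible dtype (e.g. float32/float16)
          data = tensor.detach().cpu().contiguous().numpy().tobytes()
          hashes[name] = hashlib.sha256(data).hexdigest()
      return hashes

  base = layer_hashes(torch.load("base.pt", map_location="cpu"))
  tuned = layer_hashes(torch.load("finetuned.pt", map_location="cpu"))
  shared = [n for n in tuned if base.get(n) == tuned[n]]
  print(f"{len(shared)}/{len(tuned)} layers identical -> reuse the cached copies")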


This looks awesome! Will try it out


Thanks!!


Nice! Will this work for Triton instances, i.e. can I swap the model loaded into a Triton instance? Or am I misunderstanding the concept? EDIT: typo


From what I gather, Triton assumes models are stored either in a remote repository or a local folder, and the model loading logic is all kept internal to the server.

Since we use pinned RAM for model loading and manage the cache hierarchy, the server needs to at least make a call to our daemon. So we'd need to fork the Triton Server. But hopefully it'd only take a few lines of change!

I've actually never used Triton Server myself - curious how you have found it so far if you've used it. How does it compare to other alternatives in your opinion?


Yet to go through it in detail, but this is really powerful. Initiatives like these are what we need to further democratize DL. Kudos, team!


Thank you! We definitely stand by broader adoption of DL :)


Cool! Will this work for multi-GPU inference?


Yep, it'll work for multi-GPU as well!



