Launch HN: Outerport (YC S24) – Instant hot-swapping for AI model weights
93 points by tovacinni 82 days ago | 24 comments
Hi HN! We’re Towaki and Allen, and we’re building Outerport (https://outerport.com), a distribution network for AI model weights that enables ‘hot-swapping’ of AI models to save on GPU costs.

‘Hot-swapping’ lets you serve different models on the same GPU machine with only ~2 second swap times (~150x faster than baseline). You can see this in action in a live demo where you can try the same prompts on different open-source large language models at https://hotswap.outerport.com, and see the docs at https://docs.outerport.com.

Running AI models on the cloud is expensive. Outerport came from our own experience building AI services and struggling with the cost.

Cloud GPUs are charged by the amount of time used. A long start-up time (from loading models into GPU memory) means that to serve requests quickly, we need to acquire extra GPUs with models pre-loaded for spare capacity (i.e. ‘overprovision’). The time spent on loading models also adds to the cost. Both lead to inefficient use of expensive hardware.

The long start-up times are caused by how massive modern AI models are, particularly large language models. These models are often several gigabytes to terabytes in size. Their sizes continue to grow as models evolve, exacerbating the issue.

GPU capacity also needs to adapt dynamically to demand, which complicates things further. Starting up a new GPU machine takes time, and transferring a large model to it takes even more.

Traditional container-based solutions and orchestration systems (like Docker and Kubernetes) are not optimized for these large, storage-intensive AI models; they are designed for smaller, more numerous containerized applications (usually 50MB to 1GB in size). There needs to be a solution designed specifically for model weights (floating point arrays) running on GPUs, one that can take advantage of things like layer sharing, caching, and compression.

We made Outerport, a specialized system to manage and deploy AI models, as a solution to these problems and to help save GPU costs.

Outerport is a caching system for model weights, allowing read-only models to be cached in pinned RAM for fast loading into GPU. Outerport is also hierarchical, maintaining a cache across S3 to local SSD to RAM to GPU memory, optimizing for reduced data transfer costs and load balancing.

Within Outerport, models are managed by a dedicated daemon process that handles transfers to the GPU, loads models from the registry, and orchestrates the ‘hot-swapping’ of multiple models on one machine.

‘Hot-swapping’ lets you provision a single GPU machine to be ‘multi-tenant’, so that multiple services with different models can run on the same machine. For example, this can facilitate A/B testing of two different models, or running a text generation endpoint and an image generation endpoint on the same machine.
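
To make the mechanics concrete, here is a minimal sketch of the idea (not Outerport's actual API): a cache that pulls weights down the tiers (S3 to local SSD to pinned RAM) and swaps one model onto the GPU at a time. It assumes PyTorch and boto3; the class, bucket, and paths are made up for illustration.

  import os
  import boto3
  import torch

  class PinnedModelCache:
      """Keeps read-only state dicts pinned in host RAM; one model lives on the GPU."""

      def __init__(self, bucket, cache_dir="/var/cache/models"):
          self.s3 = boto3.client("s3")
          self.bucket = bucket
          self.cache_dir = cache_dir
          self.pinned = {}      # model name -> state dict held in pinned (page-locked) RAM
          self.on_gpu = None    # name of the model currently resident on the GPU

      def fetch(self, name):
          """Pull weights down the tiers: S3 -> local SSD -> pinned RAM."""
          if name in self.pinned:
              return self.pinned[name]
          path = os.path.join(self.cache_dir, f"{name}.pt")
          if not os.path.exists(path):  # SSD miss: download from the registry (S3 here)
              self.s3.download_file(self.bucket, f"{name}.pt", path)
          state = torch.load(path, map_location="cpu")
          self.pinned[name] = {k: v.pin_memory() for k, v in state.items()}
          return self.pinned[name]

      def hot_swap(self, name, model):
          """Copy `name`'s weights onto the GPU and load them into `model`."""
          state = self.fetch(name)
          gpu_state = {k: v.to("cuda", non_blocking=True) for k, v in state.items()}
          torch.cuda.synchronize()  # wait for the async host-to-device copies to finish
          model.load_state_dict(gpu_state)
          self.on_gpu = name
          return model

The point of the pinned (page-locked) buffers is that host-to-device copies can run as fast DMA transfers, so the expensive part (disk or network) only happens on a cache miss.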

We have been running simulations to determine the cost reductions we can get from this multi-model service scheme compared to multiple single-model services. Our initial results show that we can achieve a 40% reduction in GPU running-time costs. The improvement comes from the multi-model service’s ability to smooth out traffic peaks, which enables more effective horizontal scaling: less time is wasted on acquiring additional machines and loading models. Our hypothesis is that the cost savings are substantial enough to make a viable business while still saving customers significant amounts of money.

We think there are lots of exciting directions to take from here, from more sophisticated compression algorithms to providing a central platform for model management and governance. Towaki worked on ML systems and model compression at NVIDIA, and Allen did research in operations research, which is why we’re so excited about this problem as something that combines both.

We’re super excited to share Outerport with you all. We also intend to release as much of this as possible under an open-core model when we’re ready. We would love to hear what you think, any experience you have with this or related problems, and any other ideas you might have!




Genuine question: what's the difference between your startup and just calling the code below with a different model on a cloud machine, other than some ML/DevOps engineer not knowing what they are doing...?

  def predict(X, model_config, model_path):
      model = get_model(*model_config)
      state_dict = torch.load(model_path, weights_only=True)
      new_state_dict = {k.replace('_orig_mod.', ''): v for k, v in state_dict.items()}
      model.load_state_dict(new_state_dict)
      model.eval()
      with torch.no_grad():
          output = model(torch.FloatTensor(X))
          probabilities = torch.softmax(output, dim=-1)
      return probabilities.numpy()


The advantage of loading from a daemon over loading all the weights at once in Python is that it can support multiple processes, or even the same process consecutively (if it dies, or has to switch to something else).

Loading from disk to VRAM can be super slow- so doing this every time you have a new process is wasteful. Instead, if you have a daemon process that keeps multiple model weights in pinned RAM, you can load them much quicker (~1.5 seconds for an 8B model, like we show in the demo).
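
If you want to see the gap on your own machine, here's a rough, illustrative comparison (assuming PyTorch and a CUDA GPU; sizes and timings will vary):

  import time
  import torch

  # ~2 GB of fake weights: 32 layers of 4096x4096 float32
  state = {f"layer{i}": torch.randn(4096, 4096) for i in range(32)}
  torch.save(state, "/tmp/weights.pt")

  def to_gpu(sd):
      out = {k: v.to("cuda", non_blocking=True) for k, v in sd.items()}
      torch.cuda.synchronize()
      return out

  # Cold path: read from disk every time, then copy to the GPU.
  t0 = time.time()
  to_gpu(torch.load("/tmp/weights.pt", map_location="cpu"))
  print("disk -> GPU:", time.time() - t0)

  # Warm path: weights already resident in pinned (page-locked) host RAM.
  pinned = {k: v.pin_memory() for k, v in state.items()}
  t0 = time.time()
  to_gpu(pinned)
  print("pinned RAM -> GPU:", time.time() - t0)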

You _could_ also make a single mega router process, but then there are issues like all services needing to agree on dependency versioning. This has been a problem for me in the past (like LAVIS requiring a certain transformer version that was not compatible with some other diffusion libraries)


> Outerport is a caching system for model weights, allowing read-only models to be cached in pinned RAM for fast loading into GPU. Outerport is also hierarchical, maintaining a cache across S3 to local SSD to RAM to GPU memory, optimizing for reduced data transfer costs and load balancing.

This is really cool. Are the costs of running this mainly storage, or is there significant compute tied up in it?

The time/cost of downloading models onto a GPU cloud instance really adds up when you are paying per second.


Thanks! If you mean the costs for users of Outerport, it'll be a subscription model for our hosted registry (with a limit on storage / S3 egress) and a license model for self-hosting the registry. So mainly storage, since the idea is also to minimize the egress costs associated with the compute tied up in it!


This is very cool! Most of the work I've seen on reducing inference costs has been via things like LoRAX, which lets multiple fine-tunes share the same underlying base model.

Do you imagine Outerport being a better fit for OSS model hosts like Replicate, Anyscale, etc. or for companies that are trying to host multiple models themselves?

The use case you mentioned speaks more to the latter, but it seems like the value at scale is with the model-hosting-as-a-service companies.


Thanks!

I think both are fits- we've gotten interest from both types of companies, and our first customer is an "OSS model host".

Our 40% savings result is also specifically for the five-model-service case, so there could be non-trivial cost reduction even with a reasonably small number of models.


Could you attach model weights as a preamble to a prompt? So you could submit prompts through a layer that pre-warms the model weights for you based on the prompt, then take the output into some next step in your workflow and apply a new weight preamble depending on what the next phase is?

Like, for a particular portion of the workflow - say some crawler over weird insurance claims data at scale - you want particular weights for the specific logic you're running to search for fraud.


That's a super neat idea- we should in fact be able to use this same system to support the orchestration of a 'system prompt caching' sort of thing (across deployments). I'll put this on my 'things to hack on' list :)
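
Purely as a hypothetical sketch of what that routing layer could look like (the preamble tag format and the prewarm/run hooks are made up, not an existing API):

  # Hypothetical: a "weights preamble" tag at the top of the prompt decides which
  # model gets pre-warmed before the request runs. Nothing here is a real API.
  PREAMBLE_TO_MODEL = {
      "claims-fraud": "fraud-detector-8b",
      "general": "llama-3-8b",
  }

  def route(prompt, prewarm, run):
      tag = "general"
      if prompt.startswith("#weights:"):
          header, _, body = prompt.partition("\n")
          tag, prompt = header.removeprefix("#weights:").strip(), body
      model = PREAMBLE_TO_MODEL.get(tag, PREAMBLE_TO_MODEL["general"])
      prewarm(model)   # e.g. ask the daemon to stage this model's weights into pinned RAM
      return run(model, prompt)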


This seems useful but honestly I think you guys are better off getting IP protection and licensing out the technology. This is a classic "feature not a product" and I don't see you competing against google/microsoft/huggingface in the model management space.


Maybe! Many people don't want vendor lock-in, though, and there are new GPU cloud providers gaining traction. Some still prefer on-prem.

We hope to make it easier to bridge the multi-cloud landscape by being independent and 'outer'.


This is really exciting! I was hoping for someone to tackle inference time and this product will definitely be a boost to some of our use cases in medical imaging.


Awesome to hear- that sounds like an application we'd love to help with!

(Please feel free to reach out to us too at towaki@outerport.com !)


Is this tied to a specific framework like pytorch or an inference server like vLLM?

Our inference stack is built using candle in Rust, how hard would it be to integrate?


We’d just need to write a Rust client for the daemon and load the weights in a way that is compatible with candle- we can definitely look into this, since parts of what we are building are already in Rust!


Do all variations of the model need to have the same architecture?

Or can they be different types of models with different number of layers, etc?


Variants do not have to be the same architecture- the demo (https://hotswap.outerport.com/) runs on a couple of different open source architectures.

That being said, there is some smart caching / hashing on layers, so that if you do have models that are similar (e.g. a fine-tuned model where only some layers are fine-tuned), it'll minimize storage and transfer by reusing those weights.
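
For intuition, here is a minimal sketch of that kind of layer-level content hashing (file names are placeholders, and Outerport's actual scheme may differ):

  import hashlib
  import torch

  def layer_hashes(state_dict):
      """Content-hash each tensor so identical layers can be stored/transferred once."""
      hashes = {}
      for name, tensor in state_dict.items():
          # assumes a numpy-convertible dtype (e.g. float32/float16)
          data = tensor.detach().cpu().contiguous().numpy().tobytes()
          hashes[name] = hashlib.sha256(data).hexdigest()
      return hashes

  base = layer_hashes(torch.load("base.pt", map_location="cpu"))
  tuned = layer_hashes(torch.load("finetuned.pt", map_location="cpu"))
  shared = [n for n in tuned if base.get(n) == tuned[n]]
  print(f"{len(shared)}/{len(tuned)} layers identical -> reuse the cached copies")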


This looks awesome! Will try it out


Thanks!!


Nice! Will this work for Triton instances, i.e. can I swap the model loaded into a Triton instance? Or am I misunderstanding the concept? EDIT: typo


From what I gather, Triton assumes models are stored either in a remote repository or a local folder, and the model loading logic is all kept internal to the server.

Since we use pinned RAM for model loading and manage the cache hierarchy, the server needs to at least make a call to our daemon. So we'd need to fork the Triton Server. But hopefully it'd only take a few lines of change!

I've actually never used Triton Server myself - curious how you have found it so far if you've used it. How does it compare to other alternatives in your opinion?


Yet to go through it in detail, but this is really powerful. Initiatives like these are what we need to further democratize DL. Kudos, team!


Thank you! We definitely stand by broader adoption of DL :)


Cool! Will this work for multi-GPU inference?


Yep, it'll work for multi-GPU as well!



