Launch HN: Vellum (YC W23) – Dev Platform for LLM Apps
136 points by noaflaherty on March 6, 2023 | 40 comments
Hi HN – Noa, Akash, and Sidd here. We’re building Vellum (https://www.vellum.ai), a developer platform for building on LLMs like OpenAI’s GPT-3 and Anthropic’s Claude. We provide tools for efficient prompt engineering, semantic search, performance monitoring, and fine-tuning, helping you bring LLM-powered features from prototype to production.

The MLOps industry has matured rapidly for traditional ML (typically open-source models hosted in-house), but companies using LLMs are suffering from a lack of tooling to support things like experimentation, version control, and monitoring. They’re forced to build these tools themselves, taking valuable engineering time away from their core product.

There are four main pain points:

(1) Prompt engineering is tedious and time-consuming. People iterate on prompts in the playgrounds of individual model providers and store results in spreadsheets or documents. Testing across many test cases is usually skipped because of how manual the process is.

(2) LLM calls against a corpus of text are not possible without semantic search. Due to limited context windows, any time an LLM has to return factual data from a set of documents, companies need to create embeddings, store them in a vector database, and host semantic search models to query for relevant results at runtime; building this infrastructure is complex and time-consuming.

(3) There is limited observability/monitoring once LLMs are used in production. With no baseline for how something is performing, it's scary to change it for fear of making it worse.

(4) Creating fine-tuned models and re-training them as new data becomes available is rarely done, despite the potential gains (higher quality, lower cost, lower latency, more defensibility). Companies don't usually have the capacity to build the infrastructure for collecting high-quality training data and the automation pipelines used to re-train and evaluate new models.

We know these pain points from experience. Sidd and Noa are engineers who worked at Quora and DataRobot building ML tooling. Then the three of us worked together for a couple years at Dover (YC S19), where we built features powered by GPT-3 when it was still in beta. Our first production feature was a job description writer, followed by a personalized recruiting email generator and then a classifier for email responses.

We found it was easy enough to prototype, but taking features to production and improving them was a different story. It was a pain to keep track of what prompts we had tried and to monitor how they were performing under real user inputs. We wished we could version control our prompts, roll back, and even A/B test. We found ourselves investing in infrastructure that had nothing to do with our core features (e.g. semantic search). We ended up being scared to change prompts or try different models for fear of breaking existing behavior. As new LLM providers and foundation models were released, we wished we could compare them and use the best tool for the job, but didn’t have the time to evaluate them ourselves. And so on.

It’s clear that better tools are required for businesses to adopt LLMs at scale, and we realized we were in a good position to build them, so here we are! Vellum consists of 4 systems to address the pain points mentioned above:

(1) Playground—a UI for iterating on prompts side-by-side and validating them against multiple test cases at once. Prompt variants may differ in their text, underlying model, model parameters (e.g. “temperature”), and even LLM provider. Each run is saved as a history item and has a permanent URL that can be shared with teammates.

(2) Search—upload a corpus of text (e.g. your company help docs) in our UI (PDF/TXT) and Vellum will convert the text to embeddings and store them in a vector database to be used at runtime. When making an LLM call, we inject relevant context from your documents into the query and instruct the LLM to answer factually using only the provided context. This helps prevent hallucination and saves you from managing your own embeddings, vector store, and semantic search infra. (A rough sketch of this retrieval pattern appears a few paragraphs below.)

(3) Manage—a low-latency, high-reliability API wrapper that’s provider-agnostic across OpenAI, Cohere, and Anthropic (with more coming soon). Every request is captured and persisted in one place, providing full observability into what you’re sending these models, what they’re giving back, and their performance. Prompts and model providers can be updated without code changes. You can replay historical requests and version history is maintained. This serves as a data layer for metrics, monitoring, and soon, alerting.

(4) Optimize—the data collected in Manage is used to passively build up training data, which can be used to fine-tune your own proprietary models. With enough high quality input/output pairs (minimum 100, but depends on the use case), Vellum can produce fine-tuned models to provide better quality, lower cost or lower latency. If a new model solves a problem better, it can be swapped without code changes.

We also offer periodic evaluation against alternative models (e.g. we can check whether fine-tuning Curie produces results of comparable quality to Davinci, but at a lower price). Even though OpenAI is the dominant model provider today, we expect there to be many providers with strong foundation models, and in that case model interoperability will be key!
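To make the Search piece (2) concrete, here's a minimal sketch of the retrieval-plus-injection pattern it abstracts away, written against OpenAI's pre-1.0 Python SDK (circa early 2023). The embedding model, prompt wording, and the in-memory stand-in for a vector database are illustrative assumptions, not Vellum's implementation:

    import numpy as np
    import openai  # pip install openai (pre-1.0 SDK, circa early 2023)

    openai.api_key = "sk-..."  # your key

    def embed(texts):
        """Embed a list of strings with OpenAI's ada-002 embedding model."""
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([d["embedding"] for d in resp["data"]])

    # In production these chunks and embeddings would live in a vector DB;
    # a NumPy matrix stands in for one here.
    chunks = [
        "Refunds are issued within 5 business days.",
        "Support is available Monday through Friday, 9am-5pm ET.",
    ]
    chunk_vectors = embed(chunks)

    def answer(question, top_k=2):
        # 1. Embed the query and find the most similar chunks (cosine similarity).
        q = embed([question])[0]
        sims = chunk_vectors @ q / (
            np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
        )
        context = "\n".join(chunks[i] for i in np.argsort(sims)[::-1][:top_k])

        # 2. Inject the retrieved context and instruct the model to stay factual.
        prompt = (
            "Answer the question using ONLY the context below. "
            "If the answer isn't in the context, say you don't know.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        completion = openai.Completion.create(
            model="text-davinci-003", prompt=prompt, max_tokens=200, temperature=0
        )
        return completion["choices"][0]["text"].strip()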

Here’s a video demo showcasing Vellum (feel free to watch on 1.5x!): https://www.loom.com/share/5dbdb8ae87bb4a419ade05d92993e5a0.

We currently charge a flat monthly platform fee that varies based on the quantity and complexity of your use-cases. In the future, we plan on having more transparent pricing that’s made up of a fixed platform fee + some usage-based component (e.g. number of tokens used or requests made).

If you look at our website you’ll notice the dreaded “Request early access” rather than “Try now”. That’s because the LLM Ops space is evolving extremely quickly right now. To maximize our learning rate, we need to work intensively with a few early customers to help get their AI use cases into production. We’ll invite self-serve signups once that core feature set has stabilized a bit more. In the meantime, if you’re interested in being one of our early customers, we’d love to hear from you and you can request early access here: https://www.vellum.ai/landing-pages/hacker-news.

We deeply value the expertise of the HN community! We’d love to hear your comments and get your perspective on our overall direction, the problems we’re aiming to solve, our solution so far, and anything we may be missing. We hope this post and our demo video provide enough material to start a good conversation and we look forward to your thoughts, questions, and feedback!




Congrats on the launch! I'm glad to see all the tooling come up in this space.

Regarding tests, how do you evaluate the generated completions for tests? Allowing users to execute a set of tests against a prompt and showing completions for visual inspection is a good start, but imho it doesn't scale once the app is in production with a large corpus of tests. Something we are exploring right now is generating a similarity/divergence score between generated completions to make this easy at scale.

Disclosure: We are building something very similar at Promptly (https://trypromptly.com) out of our experience using GPT-3 at MakerDojo


Thanks! We totally agree that spot-checking won't scale long term. We're currently testing a feature in beta that allows you to provide an "expected output" and then choose from a variety of comparison metrics (e.g. exact match, semantic similarity, Levenshtein distance, etc.) to derive a quantitative measure of output quality. The jury's still out on whether this is sufficient, but we're excited to keep pushing in this direction.
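For anyone curious what those comparison metrics look like in practice, here's a minimal, self-contained sketch (not Vellum's code) that scores a completion against an expected output with exact match and Levenshtein distance; semantic similarity would follow the same shape, but compare embeddings rather than characters:

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance between two strings."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution
                ))
            prev = curr
        return prev[-1]

    def score(completion: str, expected: str) -> dict:
        dist = levenshtein(completion.strip(), expected.strip())
        return {
            "exact_match": completion.strip() == expected.strip(),
            "levenshtein": dist,
            # Normalize so 1.0 means identical, 0.0 means totally different.
            "levenshtein_similarity": 1 - dist / max(len(completion), len(expected), 1),
        }

    print(score("Paris is the capital of France.", "The capital of France is Paris."))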

p.s. it's cool to hear from another company that's helping expand this market!


What I think would be really interesting is to apply distance metric learning (DML) to the problem. You have users tell you which responses are good and bad and use that to learn a metric that will classify responses as good or bad. One of the big challenges is that DML is typically applied to data in some vector space as opposed to strings, but I would expect that using an embedding constructed from the output could work well.
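A rough sketch of a simpler baseline in the same spirit — a linear classifier over response embeddings rather than true metric learning; the embedding model, example responses, and labels are purely illustrative:

    import numpy as np
    import openai  # assumes openai.api_key is already set
    from sklearn.linear_model import LogisticRegression

    def embed(texts):
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
        return np.array([d["embedding"] for d in resp["data"]])

    # Responses users have already rated as good (1) or bad (0).
    rated_responses = ["Your refund was processed.", "I dunno, figure it out."]
    labels = [1, 0]

    clf = LogisticRegression().fit(embed(rated_responses), labels)

    # Probability that a new completion is "good"; a learned DML embedding
    # could replace the raw embeddings here without changing the rest.
    print(clf.predict_proba(embed(["Happy to help with that refund!"]))[:, 1])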


Super interesting idea! We already expose UIs and APIs for supplying feedback on the quality of the output, so this could totally be possible once enough feedback has been collected. Thanks for sharing


Letting users pick a comparison metric of their choice is a good option till something better comes along. Good luck with Vellum!


Please remove that "text-shadow: 8px -9px 0px #ffffff;" for the "hero-title" class. It is possible to use text shadows effectively, but it is very, very easy to use them in ways that are a lot worse than not using them at all.


congrats on launching! 1) how do you evaluate the opportunity here vs previous players like Humanloop (seems to have pivoted to weak labeling) and Dust.tt (unclear traction)?

and 2) it seems that with OpenAI being so far ahead of everyone else (https://crfm.stanford.edu/helm/latest/?group=core_scenarios), "model interoperability" is a key assumption that needs to be tested. Nobody's talking about "model interoperability" between DALL-E, Midjourney, or Stable Diffusion - they each have their strengths, and that's that. Prompts aren't code that can be shipped indiscriminately everywhere; they only exist within the context of the model they are run against


Thank you for the thoughtful questions!

1) We believe that timing is a critical piece of this opportunity. With the recent media buzz around ChatGPT, we've found that leadership at companies large and small is actively considering how best to make use of LLMs in their business. The problems we've identified emerged as clear patterns across hundreds of calls with companies that are either currently managing LLM-powered features in production or aspiring to. The level of interest was much lower just six months ago, has grown quickly, and we anticipate it will only keep growing in the near future.

2) We agree that with OpenAI's current dominance in the space, being provider-agnostic is not top of mind for most at the moment. We are betting that this will become increasingly important as the space evolves. We are already seeing Google investing hundreds of millions in Anthropic (https://www.bloomberg.com/news/articles/2023-02-03/google-in...), Google working on its own LLMs (e.g. Bard), and Facebook launching its own LLM (https://ai.facebook.com/blog/large-language-model-llama-meta...). We expect this to become an increasingly competitive space and hope to provide companies with the tools needed to effectively evaluate their options.


Congrats on launching!

I personally think the target audience for this is a little hard to find when compared to products like LangChain that do something similar already (I wouldn't be surprised if you guys built on top of it).

As a developer, I wouldn't have much difficulty spinning up a Colab instance and running LangChain (takes a few minutes) to get something up and running, compared to a solution like yours. It would be awesome to get a pros/cons table comparing your solution to LangChain so developers can figure out how best to spend their time without having to try both tools.


i have nearly 20 years of experience in operations and development (including information retrieval / search) and it took me a good 30 solid days of staring at this stuff with a dumb look on my face before i could wrap my head around the basics of how to do something even remotely useful in this ecosystem (torch/tf, huggingface, langchain, openai, vector dbs for embeddings, all of the ancillary python modules, CUDA, basic linear algebra, etc).

in addition to pure dev and ops people, i can see how a tool like this could be useful for prompt engineers, product people, prototype / spec / doc authors, and others who aren't necessarily going to be involved directly in the nitty-gritty of writing production code or MLOps.


Appreciate the feedback! A comparison table is a great idea and something we'll look into.

We fully anticipate having tighter integrations with LangChain in the near future. We view them as complementary frameworks in many ways. For example, we might subclass the `BaseLLM` class so that you can interact with Vellum deployments and get all the monitoring/observability that Vellum provides, but invoke them via your LangChain chain.
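For the curious, here's roughly what that integration pattern could look like using LangChain's custom-LLM interface as it stood in early 2023. The endpoint URL, payload fields, and class name are hypothetical stand-ins, not Vellum's actual API:

    import requests
    from typing import List, Optional
    from langchain.llms.base import LLM

    class VellumDeploymentLLM(LLM):
        """Hypothetical LangChain wrapper that routes calls through a Vellum deployment."""
        deployment_name: str
        api_key: str

        @property
        def _llm_type(self) -> str:
            return "vellum-deployment"

        def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
            # Placeholder endpoint and payload — the real API may differ.
            resp = requests.post(
                "https://api.vellum.example/v1/generate",
                headers={"Authorization": f"Bearer {self.api_key}"},
                json={"deployment": self.deployment_name, "prompt": prompt, "stop": stop},
            )
            resp.raise_for_status()
            return resp.json()["text"]

    llm = VellumDeploymentLLM(deployment_name="email-classifier", api_key="...")
    print(llm("Classify this email: ..."))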


Congrats on the launch! I'm building an app that interfaces with OpenAI GPT models and they recently released an API to upload and create text embeddings.

I watched most of your Loom and was left wondering why wouldn't I use them directly vs you?


Thank you and good question! If you're comfortable with the quality of OpenAI's embeddings, performing your own chunking, rolling your own integration with a vector db, and don't need Vellum's other features that surround the usage of those embeddings, then Vellum is probably not a good fit. Vellum's Search offering is most valuable to companies that want to be able to experiment with different embedding models, don't want to manage their own semantic search infra, and want a tight integration with how those embeddings are used downstream.
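For a sense of what "rolling your own" involves on the indexing side (the query side is sketched in the launch post above), a minimal version might look like the following; the chunk size, overlap, and embedding model are arbitrary assumptions, and a real setup would upsert the vectors into a vector database rather than keep them in memory:

    import numpy as np
    import openai  # assumes openai.api_key is already set

    def chunk(text: str, size: int = 800, overlap: int = 100):
        """Naive fixed-size character chunking with a little overlap."""
        step = size - overlap
        return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

    def index_document(doc_id: str, text: str):
        chunks = chunk(text)
        resp = openai.Embedding.create(model="text-embedding-ada-002", input=chunks)
        vectors = np.array([d["embedding"] for d in resp["data"]])
        # A real implementation would upsert (doc_id, chunk, vector) into a
        # vector DB such as Pinecone or Weaviate; here we just return them.
        return list(zip([doc_id] * len(chunks), chunks, vectors))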


Are you planning to make this Open Source? We run an Open Source project[0] and we would love to work with y'all since this is related to where our heads are at, but we would want to have the core be open for that!

0: https://github.com/lunasec-io/lunasec


Thanks for your interest! We've toyed with the idea of making at least pieces of Vellum open source, but have decided against it for the time being. There are some great open source libraries like Langchain or GPT-Index that, while quite different, may satisfy some of your needs.


First: Congratulations! I've just sent an early access request. Second - I've got questions!

Here is my use case. We have hundreds of clients who each have dozens of videos on our platform. Videos are grouped in collections. For each video we have structured content (think of it like a complex mashup of transcript + other info). I would like to be able to send the API a group of 13 documents, grouped into a collection called "User123_collection2" or whatever. And then run natural language queries against that through an LLM.

A. Do I understand correctly that you allow sending and organizing documents programmatically? Are there any length restrictions?

B. In your demo, Vellum simply grabs the top 3 most relevant snippets (and those seem to be relatively short snippets). Can this be customized (longer snippets and more of them?)

C. Can I get the "sources" cited with the answer? Let's say I run a query on these snippets such as the ones in your demo. (An "end user" kind of question). Similar to how Bing's chatbot will give the links to the pages used to build the answer, I'd like the response I get to tell me it comes from Document 7 in collection "User123_collection2". Even better if I can get the response to tell me where in the document (otherwise I'd have to split my documents into smaller pieces when uploading).

D. Do you offer any guarantees of privacy of, not only the data, but also the prompts? I think that these prompts might be one of the valuable "trade secrets" for startups who want to add LLM features. If the prompt is leaked publicly or to another Vellum customer/stakeholder then it makes the feature replicable.

E. How much latency does this additional layer tack onto typical response times? If I was going to make an API request to OpenAI, how much longer does my app wait for the response?

I'm impressed by where you're going. It's a brilliant idea and I hope we get to be a customer, rather than build the whole LangChain/GPT-Index idea we were going to run with up until I read your post :)


Thanks so much for your interest and thoughtful questions! I'll do my best to answer here, but looking forward to digging in deeper with you offline, too :)

A. Documents are uploaded and organized via the UI at this moment, but later this week we will be exposing APIs to do the same programmatically. You'll be able to programmatically create a "Document Index" and then upload documents to a given index. In your case, you'd likely have one index per collection. We don't currently enforce a strict size limit, but it's likely we'll need to soon. In this case, you might break the document up into smaller documents prior to uploading.

B. Yes, the number of chunks returned can be specified as part of the API call when performing a search. Currently, the chunking strategy and max size is static, but we fully intend on making this configurable very soon.

C. Yes, we track which document and which index each chunk came from. With proper prompt engineering, you can have the LLM include this citation in the final response (a rough sketch of such a prompt is at the end of this reply). We recently helped a customer construct a prompt that did exactly this! Saying where in the document the answer came from is a bit trickier (although you do know the text of the most relevant chunk, which is a helpful starting point).

D. We do not share data or prompts across our customers, although we do provide strategic advice that's informed by our experiences. We'd love to learn what guarantees you're looking for and feel confident we can work within most reasonable bounds. For what it's worth, my personal opinion is that companies should be cautious about banking on prompts as the primary point of defensibility for LLM-powered apps. Reverse prompt-engineering is a thing (interesting article here: https://lspace.swyx.io/p/reverse-prompt-eng). My take is that LLM defensibility will come from your data (e.g. data that powers your semantic search or training data used to fine-tune proprietary models), as this is much harder for competitors to recreate, not to mention the user experience and go-to-market that surrounds it all.

E. We haven't yet done formal benchmarking (although we're admittedly overdue for it!), but we have architected our inference endpoint with low-latency as a top priority. For example, we've selected cloud data centers close to where we understand OpenAI's to be and have minimized all blocking procedures such that we perform as much as we can asynchronously. We host this endpoint separately from the rest of our web application, have at least one instance running at all times (to prevent cold starts), and have it auto-scale as traffic demands.
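Re: question C above, a minimal sketch of a citation-style prompt; the source-tag format and wording are just one way to do it, not a prescribed Vellum template:

    def build_prompt(question, chunks):
        """chunks: list of dicts like
        {"document": "doc-7", "index": "User123_collection2", "text": "..."}"""
        context = "\n".join(
            f"[{c['index']}/{c['document']}] {c['text']}" for c in chunks
        )
        return (
            "Answer the question using only the sources below. "
            "After the answer, cite the source tag(s) you used, e.g. "
            "[User123_collection2/doc-7].\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
        )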


Thanks for your reply!

A. OK, I like that, thanks

B. OK

C. Just to confirm - the chunks are verbatim right? So in theory I could just do a string search in the document to locate the chunk?

D. I would assume that you are currently not encrypting the data at rest. Encrypting it with the customer's API would probably result in a performance hit?

In any case, if you're not encrypting it at all, and in absence of any certification/assurances as to your data security practices then as a customer I'm forced to assume that giving you access to the data (and prompts) is tantamount to public disclosure of it. I mean, LastPass had their data stolen. You and us are both startups. Anything valuable that is not nailed to the floor (encrypted with utmost paranoia) is like leaving the chairs on the patio of a beach bar at night.

In which case there is no defensibility to be found there for us. It doesn't prevent us from becoming a customer, but it means we have to hedge our use cases so as not to put our own customers' data (e.g. trade secrets that might be included in their documents) at risk.

E. Great to know, thanks!


What if I'm building a service that leverages LLMs for my customers? Would I be able to use an API to upload my customers' data and have embeddings created for that? Or is this not a use case you're building for?


Hi yes, that's the idea! The example shown in the demo video uses internal help docs as the "source of knowledge" for embeddings, but the same principles apply to customer data.


Great! Would I be able to provide customers any guarantees about the privacy of their data? Could you create embeddings based on data encrypted homomorphically?


We'd love to learn more about what types of guarantees your customers expect – it's likely we can provide many of them now and will inevitably offer even more down the line. Feel free to reach out directly to noa@vellum.ai if you'd like to discuss!

Vellum currently embeds any text you send it, but to be honest, we haven't experimented with performing semantic search across homomorphically encrypted text and can't speak to its performance. If this becomes a recurring theme from our customers, we'd be excited to dig into it deeper!


Yeah I understand that operating on opaque data might not be one of the first items on your roadmap. Thanks for the quick responses.


Do you have any plans for automated acceptance testing?


Great question! We're starting with the manually triggered unit tests in Playground and the back-tests run prior to updating existing deployments (both shown in the demo video), but we absolutely envision automated tests as a natural extension of this once we learn what works well through manually triggered testing.


Do you provide optimization options for finetuning, RLHF or both?


Thanks for the question! Would you mind elaborating on what you mean by "optimization options"? We've helped a number of our customers fine-tune models and optimize for increased quality, lower cost, or decreased latency (e.g. fine-tune Curie to perform as well as regular Davinci, but at lower cost and latency).

We offer UIs and APIs for "feeding back actuals" and providing indications of the quality of the model's output / what it should have output. This feedback loop is then used to periodically re-train fine-tuned models.

Hopefully this answers your question, but happy to respond with follow-ups if not!


I'm thinking about improving model response quality.

Training of pre-existing LLMs, as I'm familiar with it, consists of two aspects: fine-tuning the model with additional, domain-specific data (like internal company documentation), and RLHF (like comparing model responses to actual customer-service responses) to further improve how well it uses that data and the resources it already has access to. That's how https://github.com/CarperAI sets up the process, for example.

What you're describing seems closer to the latter, but I'm not entirely sure if you're following the same structure at all.


Hey, Sidd from Vellum here!

Right now we offer traditional fine-tuning with prompt/completion pairs, but not training a reward model. This works great for a lot of use cases, including classification, extracting structured data, and responding with a very specific tone and style.
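As a rough illustration of what that flow looks like with OpenAI's (early-2023) fine-tuning APIs — prompt/completion pairs written to JSONL, uploaded, then used to train a smaller base model; the example pairs, file name, and choice of Curie are assumptions:

    import json
    import openai  # assumes openai.api_key is already set

    # Prompt/completion pairs, e.g. collected from production traffic plus feedback.
    examples = [
        {"prompt": "Classify the email:\n\nCan we reschedule?\n\nLabel:", "completion": " reschedule"},
        {"prompt": "Classify the email:\n\nNot interested, thanks.\n\nLabel:", "completion": " decline"},
    ]

    with open("training_data.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

    # Upload the file and kick off a fine-tune of a smaller base model.
    uploaded = openai.File.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
    job = openai.FineTune.create(training_file=uploaded["id"], model="curie")
    print(job["id"])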

For making use of domain-specific data, we recommend using semantic search to pull in the correct context at runtime instead of trying to fine-tune a model on the entire corpus of knowledge.


Cool, I just requested early access! I have been using OpenAI's APIs for text summarization tasks, and have also played around with a few other platforms.


I appreciate your interest! We'll be reaching out soon :)


I just saw your demo, Noa - great job, guys. Is there a max size for the documents you can upload for training the model? How large can they be?


Congratulations guys! We've been looking for a good automated test solution for our prompts and haven't found anything solid. Vellum looks like it has some solid potential. Looking forward to trying it out.


Thank you for the kind words :)


Congrats on the Launch!

So that you know, Vellum [1] is also the name of an often-used and well-known piece of software for writing books. It's an absolutely fantastic piece of software and has been around since 2015. [2]

Vellum (the word) is prepared animal skin or membrane, typically used as writing material. [3]

[1] https://vellum.pub/

[2] https://web.archive.org/web/20151112064306/http://vellum.pub...

[3] https://en.wikipedia.org/wiki/Vellum


Yeah, it sounded almost too good to be true: a name that is related to writing and has the letters LLM in it, in the right order even. Wow.

Interesting to see how this works out - whether or not it's "too close for comfort" relative to your [1]. Thanks!


Thank you for flagging! We've come across them in prior searches, but interesting to learn how well-known they are


It's also a major component of the 3d software Houdini, and is used for physics simulations around cloth, grains, hair, etc. https://www.sidefx.com/docs/houdini/vellum/index.html


Yup I came here to say this too. It's a well-known product in the self-publishing world.


Likewise, it is a well-known app among writers.



