One thing I can confirm from this code is that, contrary to what I previously thought, vector search is not the main point of a vector database. Rather, it's read/write/update CRUD, just like a normal SQL database, except that the data are vectors instead of text and numbers.
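To make that concrete, here's a minimal sketch (my own illustration, not this project's code) of a vector "database" reduced to its CRUD core, where the stored values just happen to be vectors:

```python
import numpy as np

# A toy vector "database" reduced to its CRUD core: an id -> vector map.
store: dict[str, np.ndarray] = {}

def create(doc_id: str, vec: np.ndarray) -> None:
    store[doc_id] = vec

def read(doc_id: str) -> np.ndarray:
    return store[doc_id]

def update(doc_id: str, vec: np.ndarray) -> None:
    store[doc_id] = vec  # same as create; an upsert in most real systems

def delete(doc_id: str) -> None:
    del store[doc_id]

create("doc-1", np.random.rand(384))  # 384 dims is a common embedding size
```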
I think the confusion comes from the mix-up between the words "database", "store", and "index". A vector "store" is trivial: even for hundreds of millions of vectors you are still in the realm of what fits on a single disk. A vector "index" to enable efficient ANN search is not trivial for large numbers of high-dimensional vectors, and that is usually the proposed value-add of someone providing a vector "database", which combines the two. I think this is also how the words are understood more generally. This project is a wrapper over Cloudflare's infrastructure, which does provide a vector index, though it is not clear how well their index performs in real-world use cases.
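A rough illustration of that store/index split (my own sketch, nothing to do with Cloudflare's internals; hnswlib stands in for any ANN index): brute-force search over the "store" is a few lines of numpy, and the "index" is what gets layered on top to make it fast.

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, n = 384, 100_000
vectors = np.random.rand(n, dim).astype(np.float32)  # the "store": just an array
query = np.random.rand(dim).astype(np.float32)

# Brute-force search over the store: exact, but O(n * dim) per query.
scores = vectors @ query
top10_exact = np.argsort(-scores)[:10]

# The "index": an HNSW graph answering approximate NN queries in roughly
# logarithmic time, at the cost of build time, memory, and some recall.
index = hnswlib.Index(space="ip", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))
top10_approx, _ = index.knn_query(query, k=10)
```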
It's a stretch to call this building a vector database when this is just an API over Cloudflare's distributed database offerings.
This also uses a fixed embedding, which will not be compatible with all of the machine-learning projects people will want a vector database for. The chosen embedding only supports text, so building an image search, for example, wouldn't be possible.
Most vector databases use some local or external provider to get the embeddings and then some storage engine to store and retrieve them: pgvector leaning on PostgreSQL, chromadb on SQLite, or Pinecone originally on RocksDB (I believe they've since built their own engine).
This is no different, and it's still in its infancy, so one presumes they might add support for other methods of getting the embeddings, much as chromadb has.
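That pattern (swap the embedding provider, keep the storage engine) is easy to sketch. The names here are hypothetical, loosely modeled on chromadb's pluggable embedding-function idea:

```python
from typing import Protocol
import numpy as np

class EmbeddingProvider(Protocol):
    """Anything that turns texts into vectors: a local model, an API, etc."""
    def embed(self, texts: list[str]) -> np.ndarray: ...

class RandomProvider:
    """Stand-in provider for the sketch; a real one would call a model."""
    def __init__(self, dim: int = 384):
        self.dim = dim
    def embed(self, texts: list[str]) -> np.ndarray:
        return np.random.rand(len(texts), self.dim).astype(np.float32)

class VectorStore:
    """Storage engine kept separate from embedding, as pgvector/chromadb do."""
    def __init__(self, provider: EmbeddingProvider):
        self.provider = provider
        self.ids: list[str] = []
        self.matrix = None  # lazily built (n, dim) array

    def add(self, ids: list[str], texts: list[str]) -> None:
        vecs = self.provider.embed(texts)
        self.ids.extend(ids)
        self.matrix = vecs if self.matrix is None else np.vstack([self.matrix, vecs])

    def query(self, text: str, k: int = 5) -> list[str]:
        q = self.provider.embed([text])[0]
        scores = self.matrix @ q
        return [self.ids[i] for i in np.argsort(-scores)[:k]]
```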
This project is extremely simplistic in regards to its vector search tech. pgvector is an open-source implementation of an _index_ (multiple algorithms, actually), while this uses Cloudflare's completely proprietary index behind a single call.
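For contrast, here's roughly what choosing between pgvector's index algorithms looks like, driven from Python (a sketch based on pgvector's documented SQL; the table, column, and connection details are made up):

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(384))")

# pgvector ships multiple index algorithms; you pick one explicitly:
cur.execute("CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)")
# ...or IVFFlat, with its own tuning knob:
# cur.execute("CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)")

# <=> is pgvector's cosine-distance operator; the planner can use the index.
cur.execute("SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT 5",
            ("[" + ",".join(["0.1"] * 384) + "]",))
print(cur.fetchall())
conn.commit()
```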
Naming is hard. AthenaDB was the initial name of the MVP; I just never came up with a better one. Guess I'll have to do that now that I know AWS has another DB-adjacent product with that name.
Curious how people are using vector DBs at an enterprise level.
Let's say you have a team of 5 data scientists/developers working on a collection of GenAI features/tooling. Does it make sense to have one single vector DB where all documentation is embedded and powers all the apps, or do you make a bunch of niche databases tailored to each service?
Also, one of the things I've noticed is that these databases seem less optimized for update operations. So when user #1 embeds and saves 100 documents and then user #2 does the same with 10 overlapping, I'd guess that doubling in the similarity space would exclude new documents. How are people handling that?
I am not sure what you mean specifically by "overlapping". But high-dimensional vector space is really "big" in a way that makes everything look comparatively close together: distances between random points concentrate, so nearest and farthest neighbors become nearly indistinguishable (this is the curse of dimensionality for the Euclidean norm), and that's already something one has to think about regardless of the similarity of the source documents. From reading Wikipedia, it seems it's been argued that the curse is worst with independent and uniformly distributed features.
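You can see that concentration effect numerically in a few lines (my illustration of the standard result, not a benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

# Distance concentration: as dimension grows, the nearest and farthest
# random neighbors of a query become nearly indistinguishable.
for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(10_000, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
# The contrast shrinks toward 0 as dim grows, for i.i.d. uniform features.
```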
We have multiple teams kind of working in silos, so we haven't really consolidated on a single enterprise solution yet. That said, the team I'm on has consolidated on using qdrant with different collections. We've also started using a sort of Hungarian notation in collection names, as we've just run into the problem of multiple embedding models.
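Something like this, to give the flavor (the specific naming scheme is my guess at the idea, not their actual convention; the client calls are the real qdrant-client API):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Encode the embedding model and its dimensionality into the collection
# name, so nobody queries a collection with vectors from the wrong model.
collections = {
    "docs__text-embedding-3-small__1536": 1536,
    "docs__all-MiniLM-L6-v2__384": 384,
}
for name, dim in collections.items():
    client.create_collection(
        collection_name=name,
        vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
    )
```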
Curious to hear what criteria each team considers most important (top 3) when picking their VDB. There are so many vector databases available, and in my experience it's actually not the most critical component in the overall GenAI/RAG architecture, although it gets the most attention.
I’m finding smaller vector databases can be almost ephemeral if you avoid parsing: https://hushh-labs.github.io/hushh-labs-blog/posts/you_dont_...
I can receive a query, encode the embedding, load the vector store, calculate KNN, and return rendered results in under 50 milliseconds. It’s even faster if you simply cache the vector store.
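A minimal version of that pipeline (my sketch of the shape, not their code; the file layout and the embedding function are stand-ins):

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=1)
def load_store() -> tuple[np.ndarray, list[str]]:
    """Load the vector store once and keep it in memory (the 'cache' step)."""
    data = np.load("store.npz", allow_pickle=True)  # hypothetical file layout
    return data["vectors"].astype(np.float32), list(data["ids"])

def embed(query: str) -> np.ndarray:
    """Stand-in for whatever model encodes the query; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(query)) % 2**32)
    v = rng.standard_normal(384).astype(np.float32)
    return v / np.linalg.norm(v)

def search(query: str, k: int = 5) -> list[str]:
    vectors, ids = load_store()          # cached after the first call
    scores = vectors @ embed(query)      # cosine similarity if rows are normalized
    return [ids[i] for i in np.argsort(-scores)[:k]]
```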
Really interested in this space and hoping to hear more ideas.