Show HN: I built a vector database API on Cloudflare (github.com/timesurgelabs)
101 points by chand1012 11 months ago | 28 comments



One thing I can confirm from this code is that, unlike what I previously thought, vector search is not the main point of a vector database. Rather, it's read/write/update CRUD, just like normal SQL databases, except that the data are vectors instead of text and numbers.


What did you previously think a vector db was?


Sounds like they thought it was something more akin to Elasticsearch.


I think the confusion comes from the mix-up between the words "database", "store", and "index". A vector "store" is trivial; even for hundreds of millions of vectors you are still in the realm of what is possible on a single disk. A vector "index" to enable efficient ANN (approximate nearest neighbor) search is not trivial for large numbers of high-dimensional vectors, and this is usually the proposed value-add of someone providing a vector "database", which combines the two. I think this is also how the words are understood more generally. This project is a wrapper over Cloudflare's infrastructure, which does provide a vector index, though it is not clear how well their index performs in real-world use cases.
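To make the store/index distinction concrete, here's a minimal numpy sketch (mine, not the project's; the sizes are made up): the "store" is just an array, and exact search is a full O(N) scan per query, which is exactly the cost an ANN index exists to avoid.

    import numpy as np

    dim = 384                                     # hypothetical embedding size
    vectors = np.random.rand(100_000, dim).astype(np.float32)   # the "store"
    query = np.random.rand(dim).astype(np.float32)

    # Exact (brute-force) search: cosine similarity against every stored vector.
    scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    top5 = np.argsort(-scores)[:5]                # an ANN index avoids this full scan
    print(top5, scores[top5])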


It's a stretch to call this building a vector database when this is just an API over Cloudflare's distributed database offerings.

This also uses a fixed embedding model, which won't be compatible with every machine learning project people will want a vector database for. The chosen embedding only supports text, so building an image search, for example, wouldn't be possible.


Glass half empty then.

Most vector databases use some local or external provider to get the embeddings and then use some storage engine to store and retrieve them, whether it's pgvector leaning on PostgreSQL, chromadb on SQLite, or Pinecone originally on RocksDB (I believe they've since built their own engine).

This is no different, and it's still in its infancy, so one presumes they might add support for other methods of getting the embeddings, much as chromadb has.
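For what it's worth, a minimal chromadb sketch of that pattern (the documents and ids are made up; by default chroma computes embeddings locally with its bundled model):

    import chromadb

    client = chromadb.Client()                    # in-memory instance
    collection = client.create_collection("docs")

    # chroma embeds these with its default local model, then stores the vectors.
    collection.add(
        documents=["Workers are Cloudflare's serverless runtime",
                   "pgvector adds a vector type and indexes to Postgres"],
        ids=["doc1", "doc2"],
    )

    results = collection.query(query_texts=["serverless at the edge"], n_results=1)
    print(results["ids"], results["distances"])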


> to store and retrieve

retrieve != query

This project is extremely simplistic with regard to its vector search tech. pgvector is an open-source implementation of an _index_ (multiple algorithms, actually), whereas this uses Cloudflare's completely proprietary index via a single call.
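As a rough illustration of the index side of pgvector (a sketch via psycopg2, assuming a local Postgres with the pgvector extension installed; the connection string, table, and values are all made up):

    import psycopg2

    conn = psycopg2.connect("dbname=vectors")     # assumes local Postgres + pgvector
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))")
    # The index is the interesting part: HNSW here (pgvector also ships IVFFlat).
    cur.execute("CREATE INDEX IF NOT EXISTS items_hnsw ON items USING hnsw (embedding vector_cosine_ops)")
    cur.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")
    cur.execute("SELECT id FROM items ORDER BY embedding <=> '[2,3,4]' LIMIT 5")
    print(cur.fetchall())
    conn.commit()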


chroma (single-node) doesn't use SQLite for the vectors - it uses it for metadata search.


Cool project, I look forward to digging in a bit - I've been meaning to check out some of Cloudflare's dev offerings for a while.

Athena is already a name in the cloud database segment though - Amazon's managed Presto offering:

https://docs.aws.amazon.com/athena/latest/ug/what-is.html


Trademark aside: you'll probably never rank in search, which is also a pain.


Came here to say this.

Very cool project but unfortunately the name is immediately a trademark issue.


Indeed. Athena is a pretty popular database service and is trademarked by AWS:

https://aws.amazon.com/trademark-guidelines/

I would expect a cease and desist at some point.


From checking the trademark database, they registered their trademark only a few months ago (Dec. 21, 2023).

The first commit to AthenaDB was on Dec 11, 2023.

Is prior use of a trademark the important date, or the actual date of registration?


Unregistered trademarks are a thing too. In both cases it's time of use in the market that matters, not registration time.


Just because you could potentially win in court doesn’t mean you want to go toe to toe with Amazon legal, ya know?


The page says they hold the trademark to “Amazon Athena”. Does this come into conflict with a project named “AthenaDB”?


Even if you avoid legal issues, there's still the practical issues.

Fighting for ranking on Google. Conflicting tags on stackoverflow. Confused users sending you complaints about someone else's product. And so on.


Naming is hard. AthenaDB was the initial name of the MVP; I just never came up with a better one. Guess I'll have to do that now that I know AWS has another DB-adjacent product with that name.


For me, in Google at least, the Greek goddess wins, 50/50 shared with the movie.


Yeah, came here to say this, you’ll probably have problems with Amazon’s trademarked Athena database offering.


Indubitably, there might (p > 0.85) be an issue with the managed Presto database offered by Amazon, which is also named Athena btw.


Curious how people are using vector DBs at an enterprise level.

Let's say you have a team of 5 data scientists/developers working on a collection of GenAI features/tooling. Does it make sense to have one single vector DB where all documentation is embedded and powers all the apps, or do you make a bunch of niche databases tailored to each service?

Also, one of the things I've noticed is that these databases seem less optimized for update operations, so when user #1 embeds and saves 100 documents and then user #2 does the same with 10 overlapping, I'd guess that doubling of the similarity space would exclude new documents. How are people handling that?


I am not sure what you mean specifically by 'overlapping'. But high-dimensional vector space is really "big", and distances concentrate: everything ends up roughly equally far from everything else compared to low dimensions (this is the curse of dimensionality for the Euclidean norm), and that is something you already have to think about regardless of how similar the source documents are. From reading Wikipedia, it seems it's been argued that the curse is worst with independent and uniformly distributed features.
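A quick way to see that effect (my sketch, using numpy and i.i.d. uniform features as in the Wikipedia caveat): the gap between the nearest and farthest point shrinks relative to the distances themselves as dimension grows.

    import numpy as np

    rng = np.random.default_rng(0)
    for dim in (2, 10, 100, 1000):
        pts = rng.random((2000, dim))             # i.i.d. uniform features
        q = rng.random(dim)
        d = np.linalg.norm(pts - q, axis=1)       # Euclidean distances to a query
        print(f"dim={dim:4d}  relative contrast (max-min)/min = {(d.max() - d.min()) / d.min():.3f}")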


We have multiple teams working somewhat in silos, so we haven't really consolidated on a single enterprise solution yet. That said, the team I'm on has standardized on qdrant with different collections. We've also started using a sort of Hungarian notation in collection names, as we've just run into the problem of multiple embedding models.
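Something like this, presumably (a hypothetical naming scheme, not the poster's exact convention; the model name and dimension are made up): encode the embedding model and dimension in the collection name so a mismatch is obvious at a glance.

    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams

    client = QdrantClient(":memory:")             # in-memory instance for the sketch

    # "Hungarian" collection name: <dataset>__<embedding model>__<dimension>
    client.create_collection(
        collection_name="docs__all_minilm_l6_v2__384",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )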


Curious to hear what criteria (top 3) each team considered important when choosing their vector DB. There are so many vector databases available, and in my experience it's actually not the most critical component in the overall GenAI/RAG architecture, although it gets the most attention.


I'm finding smaller vector databases can be almost ephemeral if you avoid parsing: https://hushh-labs.github.io/hushh-labs-blog/posts/you_dont_... I can receive a query, encode the embedding, load the vector store, calculate KNN, and return rendered results in under 50 milliseconds. It's even faster if you simply cache the vector store. Really interested in this space and hoping to hear more ideas.
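Roughly this pattern, I'd guess (my reconstruction in numpy, not the linked post's code; the file layout and sizes are made up): keep the store small, cache it in memory, and do exact KNN per request.

    from functools import lru_cache
    import numpy as np

    @lru_cache(maxsize=1)
    def load_store(path="store.npz"):             # hypothetical on-disk layout
        data = np.load(path)
        return data["ids"], data["vectors"]

    def knn(query_vec, k=5):
        ids, vecs = load_store()
        scores = vecs @ query_vec / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(query_vec))
        top = np.argsort(-scores)[:k]
        return [(str(ids[i]), float(scores[i])) for i in top]

    # toy store so the sketch runs end to end
    np.savez("store.npz", ids=np.array(["a", "b", "c"]),
             vectors=np.random.rand(3, 8).astype(np.float32))
    print(knn(np.random.rand(8).astype(np.float32), k=2))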


Somewhat related: I actually think vector databases are not as important as people are led to believe. Read more here: https://vectara.com/blog/vector-database-do-you-really-need-...


Honest question: why not simply use the filesystem? Vector databases just store a pile of numbers.
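Taking the question literally (my sketch, not a recommendation; sizes and file name are made up): a flat file of float32s plus a brute-force scan is already a working vector "store". What a database adds on top is the ANN index, metadata filtering, and concurrent CRUD.

    import numpy as np

    dim = 384                                                   # hypothetical
    np.random.rand(1000, dim).astype(np.float32).tofile("vectors.f32")   # the "write" path

    vecs = np.memmap("vectors.f32", dtype=np.float32, mode="r").reshape(-1, dim)
    q = np.random.rand(dim).astype(np.float32)
    best = int(np.argmax(vecs @ q))               # full scan on every query
    print("nearest row:", best)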



