One thing I can confirm from this code is that, contrary to what I previously thought, vector search is not the main point of a vector database. Rather, it's read/write/update CRUD, just like a normal SQL database, except that the data are vectors instead of text and numbers.
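To make that concrete, here's a minimal sketch (my own illustration, not this project's code) of a vector "database" reduced to its CRUD core, where the stored values just happen to be vectors:

```python
import numpy as np

# A toy vector "database" reduced to its CRUD core: an id -> vector map.
store: dict[str, np.ndarray] = {}

def create(doc_id: str, vec: np.ndarray) -> None:
    store[doc_id] = vec

def read(doc_id: str) -> np.ndarray:
    return store[doc_id]

def update(doc_id: str, vec: np.ndarray) -> None:
    store[doc_id] = vec  # same as create; an upsert in most real systems

def delete(doc_id: str) -> None:
    del store[doc_id]

create("doc-1", np.random.rand(384))  # 384 dims is a common embedding size
```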
I think the confusion comes from the mix-up between the words "database", "store", and "index". A vector "store" is trivial: even for hundreds of millions of vectors you are still in the realm of what fits on a single disk. A vector "index" to enable efficient ANN search is not trivial for large numbers of high-dimensional vectors, and that is usually the proposed value-add of someone providing a vector "database", which combines the two. I think this is also how the words are understood more generally. This project is a wrapper over Cloudflare's infrastructure, which does provide a vector index, though it is not clear how well their index performs in real-world use cases.
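A rough illustration of that store/index split (my own sketch, nothing to do with Cloudflare's internals; hnswlib stands in for any ANN index): brute-force search over the "store" is a few lines of numpy, and the "index" is what gets layered on top to make it fast.

```python
import numpy as np
import hnswlib  # pip install hnswlib

dim, n = 384, 100_000
vectors = np.random.rand(n, dim).astype(np.float32)  # the "store": just an array
query = np.random.rand(dim).astype(np.float32)

# Brute-force search over the store: exact, but O(n * dim) per query.
scores = vectors @ query
top10_exact = np.argsort(-scores)[:10]

# The "index": an HNSW graph answering approximate NN queries in roughly
# logarithmic time, at the cost of build time, memory, and some recall.
index = hnswlib.Index(space="ip", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))
top10_approx, _ = index.knn_query(query, k=10)
```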
It's a stretch to call this building a vector database when this is just an API over Cloudflare's distributed database offerings.
This also uses a fixed embedding, which will not be compatible with all of the machine-learning projects people will want a vector database for. The chosen embedding only supports text, so building an image search, for example, wouldn't be possible.
Most vector databases use some local or external provider to get the embeddings and then some storage engine to store and retrieve them: pgvector leaning on PostgreSQL, chromadb on SQLite, or Pinecone originally on RocksDB (I believe they've since built their own engine).
This is no different, and it's still in its infancy, so one presumes they might add support for other methods of getting the embeddings, much as chromadb has.
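That pattern (swap the embedding provider, keep the storage engine) is easy to sketch. The names here are hypothetical, loosely modeled on chromadb's pluggable embedding-function idea:

```python
from typing import Protocol
import numpy as np

class EmbeddingProvider(Protocol):
    """Anything that turns texts into vectors: a local model, an API, etc."""
    def embed(self, texts: list[str]) -> np.ndarray: ...

class RandomProvider:
    """Stand-in provider for the sketch; a real one would call a model."""
    def __init__(self, dim: int = 384):
        self.dim = dim
    def embed(self, texts: list[str]) -> np.ndarray:
        return np.random.rand(len(texts), self.dim).astype(np.float32)

class VectorStore:
    """Storage engine kept separate from embedding, as pgvector/chromadb do."""
    def __init__(self, provider: EmbeddingProvider):
        self.provider = provider
        self.ids: list[str] = []
        self.matrix = None  # lazily built (n, dim) array

    def add(self, ids: list[str], texts: list[str]) -> None:
        vecs = self.provider.embed(texts)
        self.ids.extend(ids)
        self.matrix = vecs if self.matrix is None else np.vstack([self.matrix, vecs])

    def query(self, text: str, k: int = 5) -> list[str]:
        q = self.provider.embed([text])[0]
        scores = self.matrix @ q
        return [self.ids[i] for i in np.argsort(-scores)[:k]]
```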
This project is extremely simplistic in regards to its vector search tech. pgvector is an open-source implementation of an _index_ (multiple algorithms, actually), while this uses Cloudflare's completely proprietary index behind a single call.
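For contrast, here's roughly what choosing between pgvector's index algorithms looks like, driven from Python (a sketch based on pgvector's documented SQL; the table, column, and connection details are made up):

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(384))")

# pgvector ships multiple index algorithms; you pick one explicitly:
cur.execute("CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops)")
# ...or IVFFlat, with its own tuning knob:
# cur.execute("CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)")

# <=> is pgvector's cosine-distance operator; the planner can use the index.
cur.execute("SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT 5",
            ("[" + ",".join(["0.1"] * 384) + "]",))
print(cur.fetchall())
conn.commit()
```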
Naming is hard. AthenaDB was the initial name of the MVP; I just never came up with a better one. Guess I'll have to do that now that I know AWS has another DB-adjacent product with that name.
Curious how people are using vector DBs at an enterprise level.
Let's say you have a team of 5 data scientists/developers working on a collection of GenAI features/tooling. Does it make sense to have one single vector DB where all documentation is embedded and powers all the apps, or do you make a bunch of niche databases tailored to each service?
Also, one of the things I've noticed is that these databases seem less optimized for update operations. So when user #1 embeds and saves 100 documents and then user #2 does the same with 10 overlapping, I'd guess that doubling in the similarity space would exclude new documents. How are people handling that?
I am not sure what you mean specifically by "overlapping". But high-dimensional vector space is really "big" in a way that makes everything look comparatively close together: distances between random points concentrate, so nearest and farthest neighbors become nearly indistinguishable (this is the curse of dimensionality for the Euclidean norm), and that's already something one has to think about regardless of the similarity of the source documents. From reading Wikipedia, it seems it's been argued that the curse is worst with independent and uniformly distributed features.
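You can see that concentration effect numerically in a few lines (my illustration of the standard result, not a benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

# Distance concentration: as dimension grows, the nearest and farthest
# random neighbors of a query become nearly indistinguishable.
for dim in (2, 10, 100, 1000):
    points = rng.uniform(size=(10_000, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
# The contrast shrinks toward 0 as dim grows, for i.i.d. uniform features.
```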
We have multiple teams kind of working in silos, so we haven't really consolidated on a single enterprise solution yet. That said, the team I'm on has consolidated on using qdrant with different collections. We've also started using a sort of Hungarian notation in collection names, as we've just run into the problem of multiple embedding models.
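Something like this, to give the flavor (the specific naming scheme is my guess at the idea, not their actual convention; the client calls are the real qdrant-client API):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Encode the embedding model and its dimensionality into the collection
# name, so nobody queries a collection with vectors from the wrong model.
collections = {
    "docs__text-embedding-3-small__1536": 1536,
    "docs__all-MiniLM-L6-v2__384": 384,
}
for name, dim in collections.items():
    client.create_collection(
        collection_name=name,
        vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
    )
```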
Curious to hear what criteria each team considers most important (top 3) when picking their VDB. There are so many vector databases available, and in my experience it's actually not the most critical component in the overall GenAI/RAG architecture, although it gets the most attention.
I’m finding smaller vector databases can be almost ephemeral if you avoid parsing: https://hushh-labs.github.io/hushh-labs-blog/posts/you_dont_...
I can receive a query, encode the embedding, load the vector store, calculate KNN, and return rendered results in under 50 milliseconds. It’s even faster if you simply cache the vector store.
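A minimal version of that pipeline (my sketch of the shape, not their code; the file layout and the embedding function are stand-ins):

```python
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=1)
def load_store() -> tuple[np.ndarray, list[str]]:
    """Load the vector store once and keep it in memory (the 'cache' step)."""
    data = np.load("store.npz", allow_pickle=True)  # hypothetical file layout
    return data["vectors"].astype(np.float32), list(data["ids"])

def embed(query: str) -> np.ndarray:
    """Stand-in for whatever model encodes the query; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(query)) % 2**32)
    v = rng.standard_normal(384).astype(np.float32)
    return v / np.linalg.norm(v)

def search(query: str, k: int = 5) -> list[str]:
    vectors, ids = load_store()          # cached after the first call
    scores = vectors @ embed(query)      # cosine similarity if rows are normalized
    return [ids[i] for i in np.argsort(-scores)[:k]]
```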
Really interested in this space and hoping to hear more ideas.