Adding txtai to the list: https://github.com/neuml/txtai txtai is an all-in-one ...

simonw · on Aug 20, 2023

What are the benefits of a dedicated embeddings database over adding a vector index to an existing engine, like sqlite-vss or pg_vector or Elasticsearch?

Vector search still feels like more of an index type feature than a separate product to me.

whakim · on Aug 20, 2023

pg_vector doesn't do nearly as well (relative to a dedicated vector DB) in terms of speed and accuracy. That being said, I think it's still clearly the correct choice in a lot of cases. An order of magnitude speedup may not matter at all at small or medium scale. And pg_vector (or something else) is likely to continue to improve significantly. Contrast that with the costs (in terms of time and money) of running additional infrastructure.

dmezzetti · on Aug 20, 2023

pg_vector doesn't perform well compared to other methods, at least according to ANN-Benchmarks (https://ann-benchmarks.com/).

txtai is more than just a vector database. It also has a built-in graph component for topic modeling that utilizes the vector index to autogenerate relationships. It can store metadata in SQLite/DuckDB, with support for other databases coming. It supports running LLM prompts right alongside the data, similar to a stored procedure, through workflows. And it has built-in support for vectorizing data.
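One way to picture "utilizes the vector index to autogenerate relationships" is a similarity graph, where two items are linked whenever their embeddings are close enough. The sketch below is a generic illustration of that idea and is not txtai's implementation. The `similarity_graph` helper, the 0.8 threshold, and the three-dimensional toy vectors are all assumptions made for the example. A real system would query an ANN index for neighbors rather than compare every pair.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_graph(vectors, threshold=0.8):
    """Return an adjacency dict linking ids whose vectors are similar enough.

    Hypothetical helper: compares all pairs, which is O(n^2) and only
    suitable for small collections.
    """
    graph = {key: set() for key in vectors}
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2):
        if cosine(vec_a, vec_b) >= threshold:
            graph[id_a].add(id_b)
            graph[id_b].add(id_a)
    return graph


# Toy embeddings, made up for illustration only.
vectors = {
    "pgvector": [0.9, 0.2, 0.0],
    "sqlite-vss": [0.8, 0.3, 0.1],
    "sourdough": [0.0, 0.1, 0.9],
}

graph = similarity_graph(vectors)
for node, neighbors in graph.items():
    print(node, "->", sorted(neighbors))
```

With these toy vectors the two database entries end up linked to each other and the unrelated entry has no edges. Groups of connected nodes are what a topic-modeling step could then work from.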

For vector databases that simply store vectors, I agree that they're nothing more than a different index type.

askIoT · on Aug 21, 2023

I find pg_vector very useful, since Postgres also acts as the default DB for a ton of other functions. Interested to see whether folks think a combination of Postgres with something like Qdrant is the way to go. Are the benefits worth the trade-off in terms of ease of use and flexibility?

jvans · on Aug 20, 2023

I would make a distinction between adding an index to sqlite/postgres and adding one to elastic. The ANN algorithms usually require that the index fit into RAM, which can lead to high memory requirements on large datasets. Something like elastic will do a lot better because of its ability to scale horizontally.

I assume that is the rationale behind a dedicated database, but I generally feel the same way as you.
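To put the RAM requirement mentioned above into rough numbers, here is a back-of-the-envelope estimate for a graph-based ANN index held entirely in memory. It is a sketch under stated assumptions, not a measurement of any particular engine: float32 values, 768 dimensions, and 32 four-byte neighbor links per vector are all illustrative defaults, and `ann_index_bytes` is a hypothetical helper. Real indexes add further overhead on top of this.

```python
def ann_index_bytes(num_vectors, dim, bytes_per_value=4,
                    links_per_vector=32, bytes_per_link=4):
    """Rough lower bound on RAM for an in-memory graph-based ANN index.

    Counts only the raw vectors plus a fixed number of neighbor links
    per vector; all defaults are assumptions for illustration.
    """
    vector_bytes = num_vectors * dim * bytes_per_value
    graph_bytes = num_vectors * links_per_vector * bytes_per_link
    return vector_bytes + graph_bytes


for n in (1_000_000, 100_000_000):
    gib = ann_index_bytes(n, dim=768) / 2**30
    print(f"{n:>11,} vectors x 768 dims: ~{gib:.1f} GiB")
```

Under these assumptions one million vectors need about 3 GiB, which fits on a single machine, while one hundred million need roughly 300 GiB, which is where spreading the index across nodes starts to matter.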

ZephyrBlu · on Aug 21, 2023

Vector search seems to have very different usage patterns than a regular database, both in terms of writes and queries.

drittich · on Aug 21, 2023

Note you can also just use pretty much any SQL db, which is not fast but can be good enough. I typically get my results in about 250 ms. If your app is already using SQL it's nice to not introduce a new dependency.
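As a minimal sketch of the "any SQL db" approach, the example below stores embeddings as BLOBs in SQLite and ranks rows by cosine similarity with a full scan in Python. It is not the commenter's actual setup. The `docs` table, the helper names, and the three-dimensional toy vectors are assumptions for illustration; in practice the embeddings would come from a model and have hundreds of dimensions.

```python
import math
import sqlite3
import struct


def pack(vec):
    """Serialize a list of floats as a float32 BLOB."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob):
    """Deserialize a float32 BLOB back into a tuple of floats."""
    return struct.unpack(f"{len(blob) // 4}f", blob)


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)")

# Toy rows with made-up embeddings.
rows = [
    (1, "postgres tuning", [0.9, 0.1, 0.0]),
    (2, "vector indexes", [0.1, 0.9, 0.2]),
    (3, "baking bread", [0.0, 0.1, 0.9]),
]
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [(doc_id, text, pack(vec)) for doc_id, text, vec in rows],
)


def search(conn, query_vec, limit=2):
    """Brute-force search: score every row, return the top matches."""
    scored = [
        (cosine(query_vec, unpack(blob)), doc_id, text)
        for doc_id, text, blob in conn.execute("SELECT id, text, embedding FROM docs")
    ]
    return sorted(scored, reverse=True)[:limit]


results = search(conn, [0.0, 1.0, 0.1])
for score, doc_id, text in results:
    print(f"{score:.3f}  {doc_id}  {text}")
```

Every query reads the whole table, so latency grows linearly with row count. That is the trade-off being described: no new dependency, at the cost of speed.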

convolvatron · on Aug 20, 2023

Communication cost. You certainly get cache coherency and prefetching that you wouldn't get if you were following pointers. In a distributed context that's a big hit.

chaxor · on Aug 21, 2023

It was my understanding that txtai is not a vector database, but rather it uses databases. It pulls together a large set of tools that most people use together in a very nice way, such that the API is more consistent for researchers to show what they are doing.

dmezzetti · on Aug 21, 2023

That is a great question. txtai is indeed different than most on that list.

It's similar in that it writes data to vector index formats such as Faiss and Hnswlib. It has metadata filtering via SQLite/DuckDB to filter on additional fields.

It's different in that it can use other vector databases for its file format. And it has significant logic via workflows for data transformation. Then there is the graph component for topic modeling.

So that's where I came up with the term "embeddings database", which I consider a vector database and much more.