
My smooth brain might not understand this properly, but the idea is we generate embeddings, store them, then use retrieval each time we want to use them.

For simple things we might not need to worry about storage much: we can generate the embeddings and just cache them, or send them straight to retrieval as an array or something...

The storing of embeddings seems the hard part, do I need a special database or PG extension? Is there any reason I can't store them as blobs in SQLite if I don't have THAT much data, and I don't care too much about speed? Do generated embeddings ever 'expire'?




You'd have to update the embedding every time the data used to generate it changes. For example, if you had an embedding for user profiles and they updated their bio, you would want to make a new embedding.

I don't expect to have to change the embeddings for each icon all that often, so storing them seemed like a good choice. However, you probably don't need to cache the embedding for each search query, since so many queries are long-tail ones you'll rarely see twice.

The reason to use pgvector over blobs is if you want to use the distance functions in your queries.
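For example, a rough, untested sketch of what those distance functions buy you (the items table, dimensions, and connection string here are all made up):

    # Rough pgvector sketch via psycopg; table name and DSN are invented
    import psycopg  # pip install "psycopg[binary]"

    with psycopg.connect("dbname=app") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(id bigserial PRIMARY KEY, embedding vector(3))"
        )
        conn.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")
        # <=> is pgvector's cosine-distance operator: nearest neighbors first
        rows = conn.execute(
            "SELECT id FROM items ORDER BY embedding <=> '[2,3,4]' LIMIT 5"
        ).fetchall()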


Yes, you can shove the embeddings in a BLOB, but then you can't do the kinds of query operations you expect to be able to do with embeddings.


You can run similarity scores with a custom SQLite function.

I use a Python one usually, but it's also possible to build a much faster one in C: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
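A minimal sketch of the Python route, assuming the vectors are stored as JSON arrays in a TEXT column (the documents table here is made up):

    import json
    import math
    import sqlite3

    def cosine_similarity(a_json, b_json):
        # Assumes both arguments are JSON arrays of floats of equal length
        a = json.loads(a_json)
        b = json.loads(b_json)
        dot = sum(x * y for x, y in zip(a, b))
        mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / mag

    db = sqlite3.connect("embeddings.db")
    db.create_function("cosine_similarity", 2, cosine_similarity)
    # Hypothetical table: documents(id, embedding TEXT)
    query = json.dumps([0.1, 0.2, 0.3])
    rows = db.execute(
        "SELECT id, cosine_similarity(embedding, ?) AS score "
        "FROM documents ORDER BY score DESC LIMIT 10",
        (query,),
    ).fetchall()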


Right, like you could use it sort of like a cache and send the blobs to OpenAI to use their similarity API, but you couldn't really use SQL to do cosine similarity operations?

My understanding of what's going on at a technical level might be a bit limited.


Yes.

Although if you really wanted to, and normalized your data like a good little Edgar F. Codd devotee, you could write something like this:

    SELECT SUM(v.dot) / (SQRT(SUM(v.v1 * v.v1)) * SQRT(SUM(v.v2 * v.v2)))
    FROM (
      SELECT v1.dimension AS dim,
             v1.value AS v1,
             v2.value AS v2,
             v1.value * v2.value AS dot
      FROM vectors AS v1
      INNER JOIN vectors AS v2 ON v1.dimension = v2.dimension
      WHERE v1.vector_id = ? AND v2.vector_id = ?
    ) AS v;

This assumes one table called "vectors" with columns vector_id, dimension, and value, with (vector_id, dimension) as the primary key. The inner query grabs two vectors as separate columns with some self-join trickery and computes the product of each pair of components; the outer query then divides the summed dot product by the two vectors' norms to get the actual cosine similarity.

No, I have not tested this on an actual database engine; I probably screwed up the SQL somehow. And obviously it's easier to just have a database (or Postgres extension) that recognizes vector data as a distinct data type and gives you a dedicated cosine-similarity function.


Thanks for the explanation! Appreciate that you took the time to give an example. Makes a lot more sense why we reach for specific tools for this.


A KV store is both good enough and highly performant. I use Redis for storing embeddings and expire them after a while. Unless you have a highly specialized use case it’s not economical to persistently store chunk embeddings.

Redis also has vector search capability. However, the most popular answer you’ll get here is to use Postgres (pgvector).
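The expiry part is just a TTL on the key. A rough redis-py sketch (key names and TTL are whatever suits your app):

    import struct
    import redis  # pip install redis

    r = redis.Redis()

    def store_embedding(key, vector, ttl_seconds=7 * 86400):
        # Pack float32s into a compact blob; Redis drops the key after the TTL
        r.set(key, struct.pack(f"{len(vector)}f", *vector), ex=ttl_seconds)

    def load_embedding(key):
        blob = r.get(key)
        if blob is None:
            return None  # expired or never cached: recompute upstream
        return list(struct.unpack(f"{len(blob) // 4}f", blob))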


Redis sounds like a good option. I like that it’s not more infrastructure: I already have Redis set up for my app, so I’m not adding more to the stack.


Vector databases are used to store embeddings.


But why is that? I’m sure it’s the ‘best’ way to do things, but it also means more infrastructure which for simple apps isn’t worth the hassle.

I should use redis for queues but often I’ll just use a table in a SQLite database. For small scale projects I find it works fine, I’m wondering what an equivalent simple option for embeddings would be.


Perhaps sqlite-vss? It adds vector search to SQLite.

https://github.com/asg017/sqlite-vss


check out https://github.com/tembo-io/pg_vectorize - we're taking it a little bit beyond just the storage and index. The project uses pgvector for the indices and distance operators, but also adds a simpler API, hooks into pre-trained embedding models, and helps you keep embeddings updated as data changes/grows


Re storing vectors in BLOB columns: ya, if it's not a lot of data and it's fast enough for you, then there's no problem doing it like that. I'd even just store them in JSON/npy files first and see how long you can get away with it. Once that gets too slow, then try SQLite/redis/valkey, and when that gets too slow, look into pgvector or other vector database solutions.

For SQLite specifically, very large BLOB columns might affect query performance, especially for large embeddings. For example, a 1536-dimension vector from OpenAI would take 1536 * 4 = 6144 bytes of space, if stored in a compact BLOB format. That's larger than SQLite's default page size of 4096 bytes, so that extra data will spill into overflow pages. Which again, isn't too big of a deal, but if the original table had small values before, then table scans can be slower.

One solution is to move it to a separate table, e.g. alongside an original `users` table, you can make a new `CREATE TABLE users_embeddings(user_id, embedding)` table and just LEFT JOIN that when you need it, as sketched below. Or you can use new techniques like Matryoshka embeddings[0] or scalar/binary quantization[1] to reduce the size of individual vectors, at the cost of lower accuracy. Or you can bump the page size of your SQLite database with `PRAGMA page_size=8192`.
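A rough sketch of that separate-table approach (untested; table and column names are made up):

    import sqlite3
    import struct

    db = sqlite3.connect("app.db")
    # page_size only takes effect on a brand-new database (or after VACUUM)
    db.execute("PRAGMA page_size=8192")
    db.execute("CREATE TABLE IF NOT EXISTS users(id INTEGER PRIMARY KEY, name TEXT)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS users_embeddings("
        "user_id INTEGER PRIMARY KEY REFERENCES users(id), embedding BLOB)"
    )

    def pack(vector):
        # 1536 float32s -> 6144-byte blob, the compact format mentioned above
        return struct.pack(f"{len(vector)}f", *vector)

    db.execute("INSERT OR IGNORE INTO users VALUES (1, 'alice')")
    db.execute("INSERT OR REPLACE INTO users_embeddings VALUES (?, ?)",
               (1, pack([0.0] * 1536)))
    # The narrow users table stays fast to scan; join only when needed
    row = db.execute(
        "SELECT u.id, u.name, e.embedding FROM users u "
        "LEFT JOIN users_embeddings e ON e.user_id = u.id WHERE u.id = ?",
        (1,),
    ).fetchone()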

I also have a SQLite extension for vector search[2], but there's a number of usability/ergonomic issues with it. I'm making a new one that I hope to release soon, which will hopefully be a great middle ground between "store vectors in a .npy files" and "use pgvector".

Re "do embeddings ever expire": nope! As long as you have access to the same model, the same text input should give the same embedding output. It's not like LLMs that have temperatures/meta prompts/a million other dials that make outputs non-deterministic, most embedding models should be deterministic and should work forever.

[0] https://huggingface.co/blog/matryoshka

[1] https://huggingface.co/blog/embedding-quantization

[2] https://github.com/asg017/sqlite-vss
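One upshot of that determinism: if you do cache embeddings, you can key the entries on a hash of model + input text and never expire them. A tiny sketch:

    import hashlib

    def embedding_cache_key(model: str, text: str) -> str:
        # Same model + same input always yields the same embedding,
        # so a content hash is a safe permanent cache key (no TTL needed)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"emb:{model}:{digest}"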


This is very useful, appreciate the insight. Storing embeddings in a table and joining when needed feels like a really nice solution for what I'm trying to do.


I store them as blobs in SQLite. It works fine - depending on the model they take up 1-2KB each.



