
My smooth brain might not understand this properly, but the idea is we generate embeddings, store them, then use retrieval each time we want to use them.

For simple things we might not need to worry about storage much: we can generate the embeddings and just cache them, or send them straight to retrieval as an array or something...

The storing of embeddings seems the hard part, do I need a special database or PG extension? Is there any reason I can't store them as blobs in SQLite if I don't have THAT much data, and I don't care too much about speed? Do generated embeddings ever 'expire'?




You'd have to update the embedding every time the data used to generate it changes. For example, if you had an embedding for user profiles and they updated their bio, you would want to make a new embedding.

I don't expect to have to change the embeddings for each icon all that often, so storing them seemed like a good choice. However, you probably don't need to cache the embedding for each search query, since so many queries are long-tail ones you'll rarely see twice.

The reason to use pgvector over blobs is if you want to use the distance functions in your queries.
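For example, a rough, untested sketch of what those distance functions buy you (the items table, dimensions, and connection string here are all made up):

    # Rough pgvector sketch via psycopg; table name and DSN are invented
    import psycopg  # pip install "psycopg[binary]"

    with psycopg.connect("dbname=app") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(id bigserial PRIMARY KEY, embedding vector(3))"
        )
        conn.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]')")
        # <=> is pgvector's cosine-distance operator: nearest neighbors first
        rows = conn.execute(
            "SELECT id FROM items ORDER BY embedding <=> '[2,3,4]' LIMIT 5"
        ).fetchall()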


Yes, you can shove the embeddings in a BLOB, but then you can't do the kinds of query operations you expect to be able to do with embeddings.


You can run similarity scores with a custom SQLite function.

I use a Python one usually, but it's also possible to build a much faster one in C: https://simonwillison.net/2024/Mar/23/building-c-extensions-...
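A minimal sketch of the Python route, assuming the vectors are stored as JSON arrays in a TEXT column (the documents table here is made up):

    import json
    import math
    import sqlite3

    def cosine_similarity(a_json, b_json):
        # Assumes both arguments are JSON arrays of floats of equal length
        a = json.loads(a_json)
        b = json.loads(b_json)
        dot = sum(x * y for x, y in zip(a, b))
        mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
        return dot / mag

    db = sqlite3.connect("embeddings.db")
    db.create_function("cosine_similarity", 2, cosine_similarity)
    # Hypothetical table: documents(id, embedding TEXT)
    query = json.dumps([0.1, 0.2, 0.3])
    rows = db.execute(
        "SELECT id, cosine_similarity(embedding, ?) AS score "
        "FROM documents ORDER BY score DESC LIMIT 10",
        (query,),
    ).fetchall()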


Right, like you could use it sort of like a cache and send the blobs to OpenAI to use their similarity API, but you couldn't really use SQL to do cosine similarity operations?

My understanding of what's going on at a technical level might be a bit limited.


Yes.

Although if you really wanted to, and normalized your data like a good little Edgar F. Codd devotee, you could write something like this:

    SELECT SUM(v.dot) / (SQRT(SUM(v.v1 * v.v1)) * SQRT(SUM(v.v2 * v.v2)))
    FROM (
      SELECT v1.dimension AS dim,
             v1.value AS v1,
             v2.value AS v2,
             v1.value * v2.value AS dot
      FROM vectors AS v1
      INNER JOIN vectors AS v2 ON v1.dimension = v2.dimension
      WHERE v1.vector_id = ? AND v2.vector_id = ?
    ) AS v;

This assumes one table called "vectors" with columns vector_id, dimension, and value, with (vector_id, dimension) as the primary key. The inner query grabs two vectors as separate columns with some self-join trickery and computes the product of each pair of components; the outer query then divides the summed dot product by the two vectors' norms to get the actual cosine similarity.

No, I have not tested this on an actual database engine; I probably screwed up the SQL somehow. And obviously it's easier to just have a database (or Postgres extension) that recognizes vector data as a distinct data type and gives you a dedicated cosine-similarity function.


Thanks for the explanation! Appreciate that you took the time to give an example. Makes a lot more sense why we reach for specific tools for this.


A KV store is both good enough and highly performant. I use Redis for storing embeddings and expire them after a while. Unless you have a highly specialized use case it’s not economical to persistently store chunk embeddings.

Redis also has vector search capability. However, the most popular answer you’ll get here is to use Postgres (pgvector).
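The expiry part is just a TTL on the key. A rough redis-py sketch (key names and TTL are whatever suits your app):

    import struct
    import redis  # pip install redis

    r = redis.Redis()

    def store_embedding(key, vector, ttl_seconds=7 * 86400):
        # Pack float32s into a compact blob; Redis drops the key after the TTL
        r.set(key, struct.pack(f"{len(vector)}f", *vector), ex=ttl_seconds)

    def load_embedding(key):
        blob = r.get(key)
        if blob is None:
            return None  # expired or never cached: recompute upstream
        return list(struct.unpack(f"{len(blob) // 4}f", blob))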


Redis sounds like a good option. I like that it’s not more infrastructure: I already have Redis set up for my app, so I’m not adding more to the stack.


Vector databases are used to store embeddings.


But why is that? I’m sure it’s the ‘best’ way to do things, but it also means more infrastructure which for simple apps isn’t worth the hassle.

I should use redis for queues but often I’ll just use a table in a SQLite database. For small scale projects I find it works fine, I’m wondering what an equivalent simple option for embeddings would be.


Perhaps sqlite-vss? It adds vector search to SQLite.

https://github.com/asg017/sqlite-vss


check out https://github.com/tembo-io/pg_vectorize - we're taking it a little bit beyond just the storage and index. The project uses pgvector for the indices and distance operators, but also adds a simpler API, hooks into pre-trained embedding models, and helps you keep embeddings updated as data changes/grows


Re storing vectors in BLOB columns: ya, if it's not a lot of data and it's fast enough for you, then there's no problem doing it like that. I'd even just store them in JSON/npy files first and see how long you can get away with it. Once that gets too slow, then try SQLite/redis/valkey, and when that gets too slow, look into pgvector or other vector database solutions.

For SQLite specifically, very large BLOB columns might affect query performance, especially for large embeddings. For example, a 1536-dimension vector from OpenAI would take 1536 * 4 = 6144 bytes of space, if stored in a compact BLOB format. That's larger than SQLite's default page size of 4096 bytes, so that extra data will spill into overflow pages. Which again, isn't too big of a deal, but if the original table had small values before, then table scans can be slower.

One solution is to move it to a separate table, e.g. alongside an original `users` table, you can make a new `CREATE TABLE users_embeddings(user_id, embedding)` table and just LEFT JOIN that when you need it, as sketched below. Or you can use new techniques like Matryoshka embeddings[0] or scalar/binary quantization[1] to reduce the size of individual vectors, at the cost of lower accuracy. Or you can bump the page size of your SQLite database with `PRAGMA page_size=8192`.
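A rough sketch of that separate-table approach (untested; table and column names are made up):

    import sqlite3
    import struct

    db = sqlite3.connect("app.db")
    # page_size only takes effect on a brand-new database (or after VACUUM)
    db.execute("PRAGMA page_size=8192")
    db.execute("CREATE TABLE IF NOT EXISTS users(id INTEGER PRIMARY KEY, name TEXT)")
    db.execute(
        "CREATE TABLE IF NOT EXISTS users_embeddings("
        "user_id INTEGER PRIMARY KEY REFERENCES users(id), embedding BLOB)"
    )

    def pack(vector):
        # 1536 float32s -> 6144-byte blob, the compact format mentioned above
        return struct.pack(f"{len(vector)}f", *vector)

    db.execute("INSERT OR IGNORE INTO users VALUES (1, 'alice')")
    db.execute("INSERT OR REPLACE INTO users_embeddings VALUES (?, ?)",
               (1, pack([0.0] * 1536)))
    # The narrow users table stays fast to scan; join only when needed
    row = db.execute(
        "SELECT u.id, u.name, e.embedding FROM users u "
        "LEFT JOIN users_embeddings e ON e.user_id = u.id WHERE u.id = ?",
        (1,),
    ).fetchone()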

I also have a SQLite extension for vector search[2], but there's a number of usability/ergonomic issues with it. I'm making a new one that I hope to release soon, which will hopefully be a great middle ground between "store vectors in a .npy files" and "use pgvector".

Re "do embeddings ever expire": nope! As long as you have access to the same model, the same text input should give the same embedding output. It's not like LLMs that have temperatures/meta prompts/a million other dials that make outputs non-deterministic, most embedding models should be deterministic and should work forever.

[0] https://huggingface.co/blog/matryoshka

[1] https://huggingface.co/blog/embedding-quantization

[2] https://github.com/asg017/sqlite-vss
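One upshot of that determinism: if you do cache embeddings, you can key the entries on a hash of model + input text and never expire them. A tiny sketch:

    import hashlib

    def embedding_cache_key(model: str, text: str) -> str:
        # Same model + same input always yields the same embedding,
        # so a content hash is a safe permanent cache key (no TTL needed)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"emb:{model}:{digest}"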


This is very useful, appreciate the insight. Storing embeddings in a table and joining when needed feels like a really nice solution for what I'm trying to do.


I store them as blobs in SQLite. It works fine - depending on the model they take up 1-2KB each.



