One of my biggest annoyances with the modern AI tooling hype is the idea that you need a vector store just to work with embeddings. You don't.
The reason vector stores matter for production use cases is mostly latency for larger sets of data (100k+ records), but if you're working on a toy project just learning how to use embeddings, you can compute cosine distance with a couple lines of numpy by taking a dot product of a normalized query vector with a matrix of normalized records.
Best of all, it gives you a reason to use Python's @ operator, which with numpy matrices does a dot product.
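A minimal sketch of what that looks like, assuming the record and query embeddings are already normalized (the function and variable names here are just illustrative):

import numpy as np

# embeddings: (n_records, dim) matrix of normalized record vectors
# query: (dim,) normalized query vector
def top_k(query, embeddings, k=5):
    scores = embeddings @ query       # cosine similarity == dot product for normalized vectors
    idx = np.argsort(-scores)[:k]     # indices of the k most similar records
    return idx, scores[idx]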
It feels a bit like the hype that happened with "big data": people ended up creating Spark clusters to query a few million records, or using Hadoop for a dataset you could process with awk.
Professionally I've only ever worked with dataset sizes in the region of low millions and have never needed specialist tooling to cope.
I assume these tools do serve a purpose but perhaps one that only kicks in at a scale approaching billions.
I've been in the "mid-sized" area a lot where Numpy etc cannot handle it, so I had to go to Postgres or more specialized tooling like Spark. But I always started with the simple thing and only moved up if it didn't suffice.
Similarly, I read how Postgres won't scale for a backend application and I should use Citus, Spanner, or some NoSQL thing. But that day has not yet arrived.
Right on: I've used a single Postgres database on AWS to handle 1M+ concurrent users. If you're Google, sure, not gonna cut it, but for most people these things scale vertically a lot further than you'd expect (especially if, like me, you grew up in the pre-SSD days and couldn't get hundreds of gigs of RAM on a cloud instance).
Even when you do pass that point, you can often shard to achieve horizontal scalability to at least some degree, since the real heavy lifting is usually easy to break out on a per-user basis. Some apps won't permit that (if you've got cross-user joins then it's going to be a bit of a headache), but at that point you've at least earned the right to start building up a more complex stack and complicating your queries to let things grow horizontally.
Horizontal scaling is a huge headache, any way you cut it, and TBH going with something like Spanner is just as much of a headache because you have to understand its limitations extremely well if you want it to scale. It doesn't just magically make all your SQL infinitely scalable, things that are hard to shard are typically also hard to make fast on Spanner. What it's really good at is taking an app with huge traffic where a) all the hot queries would be easy to shard, but b) you don't want the complexity of adding sharding logic (+re-sharding, migration, failure handling, etc), and c) the tough to shard queries are low frequency enough that you don't really care if they're slow (I guess also d) you don't care that it's hella expensive compared to a normal Postgres or MySQL box). You still need to understand a lot more than when using a normal DB, but it can add a lot of value in those cases.
I can't even say whether or not Google benefits from Spanner, vs multiple Postgres DBs with application-level sharding. Reworking your systems to work with a horizontally-scaling DB is eerily similar to doing application-level sharding, and just because something is huge doesn't mean it's better with DB-level sharding.
The unique nice thing in Spanner is TrueTime, which enables the closest semblance of a multi-master DB by making an atomic clock the ground truth (see Generals' Problem). So you essentially don't have to worry about a regional failure causing unavailability (or inconsistency if you choose it) for one DB, since those clocks are a lot more reliable than machines. But there are probably downsides.
Numpy might not be able to handle a full O(n^2) comparison of vectors, but you can use a library with HNSW and get great performance on medium (and large) datasets.
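For example, a rough sketch with the hnswlib library (one common HNSW implementation; the dimension and parameter values below are just illustrative):

import hnswlib
import numpy as np

dim = 384
data = np.random.rand(100_000, dim).astype(np.float32)   # stand-in for real embeddings

index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=data.shape[0], ef_construction=200, M=16)
index.add_items(data, np.arange(data.shape[0]))
index.set_ef(50)                                          # query-time accuracy/speed trade-off

labels, distances = index.knn_query(data[:1], k=5)        # approximate nearest neighbours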
If I remember correctly, the simplest thing Numpy choked on was large sparse matrix multiplication. There are also other things like text search that Numpy won't help with and you don't want to do in Python if it's a large set.
This sentiment is pretty common, I guess. Outside of a niche, the massive scale for which the vast majority of data tech was designed doesn't exist, and KISS wins outright. Though I guess that's evolution: we want to test the limits in pursuit of grandeur before mastering the utility (e.g. the pyramids).
KISS doesn't get me employed, though. I narrowly missed being the chosen candidate for a State job that called for Apache Spark experience. I missed two questions relating to Spark and "what is a Parquet file?" but otherwise did great on the remaining behavioral questions (the hiring manager gave me feedback after I requested it). Too bad they did not have a question about processing data with command-line tools.
When I'm messing around, I normally have everything in a Pandas DataFrame already so I just add embeddings as a column and calculate cosine similarity on the fly. Even with a hundred thousand rows, it's fast enough to calculate before I can even move my eyes down on the screen to read the output.
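A sketch of that workflow, assuming a hypothetical DataFrame df with an "embedding" column of normalized vectors and a normalized query_embedding:

import numpy as np

emb = np.stack(df["embedding"].to_numpy())      # (n_rows, dim) matrix of row embeddings
df["similarity"] = emb @ query_embedding        # cosine similarity, since everything is normalized
top_matches = df.nlargest(10, "similarity")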
I regret ever messing around with Pinecone for my tiny and infrequently used set ups.
Yup. I was just playing around with this in JavaScript yesterday, and with ChatGPT's help it was surprisingly simple to go from text => embedding (via `openai.embeddings.create`) and then compare embedding similarity with cosine distance (which ChatGPT wrote for me): https://gist.github.com/christiangenco/3e23925885e3127f2c177...
Seems like the next standard feature in every app is going to be natural language search powered by embeddings.
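For comparison, the same text => embedding => similarity flow in Python looks roughly like this (a sketch; the model name is only an example, and it assumes the openai client is configured with an API key):

from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

docs = embed(["the cat sat on the mat", "dogs are great pets"])
query = embed(["feline resting place"])[0]
similarities = docs @ query   # OpenAI embeddings come normalized, so this is cosine similarity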
For posterity, OpenAI embeddings come pre-normalized so you can immediately dot-product.
Most embedding providers normalize by default, and SentenceTransformers has a normalize_embeddings parameter that does it for you. (It's a wrapper around PyTorch's F.normalize.)
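For instance (the model name is just an example):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["first sentence", "second sentence"], normalize_embeddings=True)
similarity = emb[0] @ emb[1]   # unit-length vectors, so the dot product is cosine similarity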
As an individual, I love the idea of pushing to simplify even further to understand these core concepts. For the ecosystem, I like that vector stores make these features accessible to environments outside of Python.
If you ask ChatGPT to give you a cosine similarity function that works against two arrays of floating-point numbers in any programming language, you'll get the code that you need.
Here's one in JavaScript (my prompt was "cosine similarity function for two javascript arrays of floating point numbers"):
function cosineSimilarity(vecA, vecB) {
  if (vecA.length !== vecB.length) {
    throw new Error("Vectors do not have the same dimensions");
  }
  let dotProduct = 0.0;
  let normA = 0.0;
  let normB = 0.0;
  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] ** 2;
    normB += vecB[i] ** 2;
  }
  if (normA === 0 || normB === 0) {
    throw new Error("One of the vectors is zero, cannot compute similarity");
  }
  // similarity = (A . B) / (|A| * |B|)
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
Vector stores really aren't necessary if you're dealing with less than a few hundred thousand vectors - load them up in a bunch of in-memory arrays and run a function like that against them using brute-force.
Even in production my guess is most teams would be better off just rolling their own embedding model (huggingface) + caching (redis/rocksdb) + FAISS (nearest neighbor) and be good to go. I suppose there is some expertise needed, but working with a vector database vendor has major drawbacks too.
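A rough sketch of the nearest-neighbor piece of that stack with FAISS (assuming float32 embeddings, normalized so that inner product equals cosine similarity; the sizes are just placeholders):

import faiss
import numpy as np

dim = 768
vectors = np.random.rand(50_000, dim).astype("float32")   # stand-in for cached embeddings
faiss.normalize_L2(vectors)                                # so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)   # exact inner-product search; swap in an ANN index at larger scale
index.add(vectors)

query = vectors[:1].copy()
scores, ids = index.search(query, 5)   # top-5 most similar vectors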
I think it was due to GloVe embeddings back then: I don't recall them ever being called GloVe vectors, although the "Ve" does stand for vector so it could have been RAS syndrome.