One of my biggest annoyances with the modern AI tooling hype is the idea that you need a vector store just to work with embeddings. You don't.

The reason vector stores are important for production use cases is mostly latency-related, for larger sets of data (100k+ records). But if you're working on a toy project just learning how to use embeddings, you can compute cosine distance with a couple lines of numpy by taking the dot product of a normalized query vector with a matrix of normalized records.

Best of all, it gives you a reason to use Python's @ operator, which with numpy matrices does a dot product.
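
A minimal sketch of that approach (the shapes and data below are made up; in practice the embeddings come from whatever model you're using):

    import numpy as np

    # Hypothetical data: in practice, `records` and `query` come from your embedding model
    records = np.random.randn(1000, 384)   # one row per document
    query = np.random.randn(384)

    # Normalize the rows and the query so a dot product equals cosine similarity
    records = records / np.linalg.norm(records, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)

    sims = records @ query                 # @ = dot product: one similarity per record
    top10 = np.argsort(-sims)[:10]         # indices of the 10 closest records
    print(top10)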




100k records is still pretty small!

It feels a bit like the hype that happened with "big data". People ended up creating Spark clusters to query a few million records, or using Hadoop for a dataset you could process with awk.

Professionally I've only ever worked with dataset sizes in the region of low millions and have never needed specialist tooling to cope.

I assume these tools do serve a purpose but perhaps one that only kicks in at a scale approaching billions.


I've been in the "mid-sized" area a lot where Numpy etc cannot handle it, so I had to go to Postgres or more specialized tooling like Spark. But I always started with the simple thing and only moved up if it didn't suffice.

Similarly, I read how Postgres won't scale for a backend application and I should use Citus, Spanner, or some NoSQL thing. But that day has not yet arrived.


Right on: I've used a single Postgres database on AWS to handle 1M+ concurrent users. If you're Google, sure, not gonna cut it, but for most people these things scale vertically a lot further than you'd expect (especially if, like me, you grew up in the pre-SSD days and couldn't get hundreds of gigs of RAM on a cloud instance).

Even when you do pass that point, you can often shard to achieve horizontal scalability to at least some degree, since the real heavy lifting is usually easy to break out on a per-user basis. Some apps won't permit that (if you've got cross-user joins then it's going to be a bit of a headache), but at that point you've at least earned the right to start building up a more complex stack and complicating your queries to let things grow horizontally.

Horizontal scaling is a huge headache, any way you cut it, and TBH going with something like Spanner is just as much of a headache because you have to understand its limitations extremely well if you want it to scale. It doesn't just magically make all your SQL infinitely scalable, things that are hard to shard are typically also hard to make fast on Spanner. What it's really good at is taking an app with huge traffic where a) all the hot queries would be easy to shard, but b) you don't want the complexity of adding sharding logic (+re-sharding, migration, failure handling, etc), and c) the tough to shard queries are low frequency enough that you don't really care if they're slow (I guess also d) you don't care that it's hella expensive compared to a normal Postgres or MySQL box). You still need to understand a lot more than when using a normal DB, but it can add a lot of value in those cases.


I can't even say whether or not Google benefits from Spanner, vs multiple Postgres DBs with application-level sharding. Reworking your systems to work with a horizontally-scaling DB is eerily similar to doing application-level sharding, and just because something is huge doesn't mean it's better with DB-level sharding.

The unique nice thing in Spanner is TrueTime, which enables the closest semblance of a multi-master DB by making an atomic clock the ground truth (see Generals' Problem). So you essentially don't have to worry about a regional failure causing unavailability (or inconsistency if you choose it) for one DB, since those clocks are a lot more reliable than machines. But there are probably downsides.


Numpy might not be able to handle a full O(n^2) comparison of vectors, but you can use a library with HNSW and get great performance on medium (and large) datasets.
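
Roughly, with hnswlib (one of the HNSW libraries; the data here is random just to show the shape of the API):

    import numpy as np
    import hnswlib

    dim = 384
    data = np.float32(np.random.randn(100_000, dim))   # made-up embeddings

    # HNSW index for approximate nearest-neighbor search on cosine distance
    index = hnswlib.Index(space='cosine', dim=dim)
    index.init_index(max_elements=data.shape[0], ef_construction=200, M=16)
    index.add_items(data, np.arange(data.shape[0]))
    index.set_ef(50)                                    # query-time speed/recall trade-off

    labels, distances = index.knn_query(data[:1], k=10) # 10 nearest neighbors of the first vector
    print(labels)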


If I remember correctly, the simplest thing Numpy choked on was large sparse matrix multiplication. There are also other things like text search that Numpy won't help with and you don't want to do in Python if it's a large set.


This sentiment is pretty common I guess. Outside of a niche, the massive scale for which the vast majority of this data tech was designed doesn't exist, and KISS wins outright. Though I guess that's evolution: we want to test the limits in pursuit of grandeur before mastering the utility (e.g., the pyramids).


KISS doesn't get me employed though. I narrowly missed being the chosen candidate for a State job that called for Apache Spark experience. I missed two questions relating to Spark and "what is a parquet file?" but otherwise did great on the remaining behavioral questions (the hiring manager gave me feedback after I requested it). Too bad they did not have a question about processing data with command-line tools.


yeah, glad the hype around big data is dead. Not a lot of solid numbers in here, but this post covers it well[0].

We have duckdb embedded in our product[1] and it works perfectly well for billions of rows of data without the Hadoop overhead.

0 - https://motherduck.com/blog/big-data-is-dead/

1 - https://www.definite.app/


When I'm messing around, I normally have everything in a Pandas DataFrame already so I just add embeddings as a column and calculate cosine similarity on the fly. Even with a hundred thousand rows, it's fast enough to calculate before I can even move my eyes down on the screen to read the output.
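
Roughly what that looks like (column names and data below are made up):

    import numpy as np
    import pandas as pd

    # Made-up frame: one embedding per row, stored in a regular column
    df = pd.DataFrame({
        "text": ["first doc", "second doc", "third doc"],
        "embedding": [np.random.randn(384) for _ in range(3)],
    })

    mat = np.stack(df["embedding"].to_list())
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)

    query = np.random.randn(384)
    query = query / np.linalg.norm(query)

    df["similarity"] = mat @ query   # cosine similarity per row
    print(df.sort_values("similarity", ascending=False))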

I regret ever messing around with Pinecone for my tiny and infrequently used set ups.


Could not agree more. For some reason Pandas seems to get phased out as developers advance.


Actually, I had a pretty good experience with Pinecone.


Yup. I was just playing around with this in Javascript yesterday and with ChatGPT's help it was surprisingly simple to go from text => embedding (via `openai.embeddings.create`) and then to compare embedding similarity with cosine distance (which ChatGPT wrote for me): https://gist.github.com/christiangenco/3e23925885e3127f2c177...

Seems like the next standard feature in every app is going to be natural language search powered by embeddings.


For posterity, OpenAI embeddings come pre-normalized so you can immediately dot-product.

Most embedding providers normalize by default, and SentenceTransformers has a normalize_embeddings parameter that does the same (it's a wrapper around PyTorch's F.normalize).
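
A quick sketch of that with SentenceTransformers (the model name is just a common small default, not the only option):

    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")

    docs = ["the cat sat on the mat", "dogs make great pets", "I had pizza for lunch"]
    doc_emb = model.encode(docs, normalize_embeddings=True)        # unit-length rows
    query_emb = model.encode("a feline on a rug", normalize_embeddings=True)

    sims = doc_emb @ query_emb   # plain dot product == cosine similarity after normalization
    print(docs[int(np.argmax(sims))])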


As an individual, I love the idea of pushing to simplify even further to understand these core concepts. For the ecosystem, I like that vector stores make these features accessible to environments outside of Python.


If you ask ChatGPT to give you a cosine similarity function that works against two arrays of floating-point numbers in any programming language, you'll get the code you need.

Here's one in JavaScript (my prompt was "cosine similarity function for two javascript arrays of floating point numbers"):

    function cosineSimilarity(vecA, vecB) {
        if (vecA.length !== vecB.length) {
            throw "Vectors do not have the same dimensions";
        }
        let dotProduct = 0.0;
        let normA = 0.0;
        let normB = 0.0;
        for (let i = 0; i < vecA.length; i++) {
            dotProduct += vecA[i] * vecB[i];
            normA += vecA[i] ** 2;
            normB += vecB[i] ** 2;
        }
        if (normA === 0 || normB === 0) {
            throw "One of the vectors is zero, cannot compute similarity";
        }
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }
Vector stores really aren't necessary if you're dealing with less than a few hundred thousand vectors - load them up in a bunch of in-memory arrays and run a function like that against them using brute-force.


i get triggered that the norm vars aren't ever really the norms but their squares


I love it!


Even in production my guess is most teams would be better off just rolling their own embedding model (huggingface) + caching (redis/rocksdb) + FAISS (nearest neighbor) and be good to go. I suppose there is some expertise needed, but working with a vector database vendor has major drawbacks too.
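
A rough sketch of the FAISS piece (random data and an exact inner-product index; at larger scale you'd likely swap in an approximate index like IndexHNSWFlat):

    import numpy as np
    import faiss

    dim = 384
    embeddings = np.float32(np.random.randn(100_000, dim))   # made-up corpus
    faiss.normalize_L2(embeddings)                            # normalize so inner product == cosine

    index = faiss.IndexFlatIP(dim)    # exact (brute-force) inner-product search
    index.add(embeddings)

    query = np.float32(np.random.randn(1, dim))
    faiss.normalize_L2(query)
    distances, ids = index.search(query, 10)                  # top-10 nearest neighbors
    print(ids)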


Or you just shove it into Postgres + pg_vector and just use the DBMS you already use anyway.


Using Postgres with pgvector is trivial and cheap. It's also available on AWS RDS.


Also on supabase!


hnswlib, usearch. Both handle tens of millions of vectors easily. The latter even without holding them in RAM.


Does anyone know the provenance for when vectors started to be called embeddings?


In an NLP context, earliest I could find was ICML 2008:

http://machinelearning.org/archive/icml2008/papers/391.pdf

I'm sure there are earlier instances, though - the strict mathematical definition of embedding has surely been around for a lot longer.

(interestingly, the word2vec papers don't use the term either, so I guess it didn't enter "common" usage until the mid-late 2010s)


I think it was due to GloVe embeddings back then: I don't recall them ever being called GloVe vectors, although the "Ve" does stand for vector so it could have been RAS syndrome.


>> https://nlp.stanford.edu/projects/glove/

A quick scan of the project website yields zero uses of 'embedding' and 23 of 'vector'.


It's how I remember it when I was working with them back in the day (word embeddings): I could be wrong.



