One of my biggest annoyances with the modern AI tooling hype is the idea that you need a vector store just to work with embeddings. You don't.
The reason vector stores matter for production use cases is mostly latency for larger sets of data (100k+ records), but if you're working on a toy project just learning how to use embeddings, you can compute cosine distance with a couple lines of numpy by taking a dot product of a normalized query vector with a matrix of normalized records.
Best of all, it gives you a reason to use Python's @ operator, which with numpy matrices does a dot product.
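A minimal sketch of what that looks like, assuming the record and query embeddings are already normalized (the function and variable names here are just illustrative):

import numpy as np

# embeddings: (n_records, dim) matrix of normalized record vectors
# query: (dim,) normalized query vector
def top_k(query, embeddings, k=5):
    scores = embeddings @ query       # cosine similarity == dot product for normalized vectors
    idx = np.argsort(-scores)[:k]     # indices of the k most similar records
    return idx, scores[idx]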
It feels a bit like the hype that happened with "big data": people ended up creating Spark clusters to query a few million records, or using Hadoop for a dataset you could process with awk.
Professionally I've only ever worked with dataset sizes in the region of low millions and have never needed specialist tooling to cope.
I assume these tools do serve a purpose but perhaps one that only kicks in at a scale approaching billions.
I've been in the "mid-sized" area a lot where Numpy etc cannot handle it, so I had to go to Postgres or more specialized tooling like Spark. But I always started with the simple thing and only moved up if it didn't suffice.
Similarly, I read how Postgres won't scale for a backend application and I should use Citus, Spanner, or some NoSQL thing. But that day has not yet arrived.
Right on: I've used a single Postgres database on AWS to handle 1M+ concurrent users. If you're Google, sure, not gonna cut it, but for most people these things scale vertically a lot further than you'd expect (especially if, like me, you grew up in the pre-SSD days and couldn't get hundreds of gigs of RAM on a cloud instance).
Even when you do pass that point, you can often shard to achieve horizontal scalability to at least some degree, since the real heavy lifting is usually easy to break out on a per-user basis. Some apps won't permit that (if you've got cross-user joins then it's going to be a bit of a headache), but at that point you've at least earned the right to start building up a more complex stack and complicating your queries to let things grow horizontally.
Horizontal scaling is a huge headache, any way you cut it, and TBH going with something like Spanner is just as much of a headache because you have to understand its limitations extremely well if you want it to scale. It doesn't just magically make all your SQL infinitely scalable, things that are hard to shard are typically also hard to make fast on Spanner. What it's really good at is taking an app with huge traffic where a) all the hot queries would be easy to shard, but b) you don't want the complexity of adding sharding logic (+re-sharding, migration, failure handling, etc), and c) the tough to shard queries are low frequency enough that you don't really care if they're slow (I guess also d) you don't care that it's hella expensive compared to a normal Postgres or MySQL box). You still need to understand a lot more than when using a normal DB, but it can add a lot of value in those cases.
I can't even say whether or not Google benefits from Spanner, vs multiple Postgres DBs with application-level sharding. Reworking your systems to work with a horizontally-scaling DB is eerily similar to doing application-level sharding, and just because something is huge doesn't mean it's better with DB-level sharding.
The unique nice thing in Spanner is TrueTime, which enables the closest semblance of a multi-master DB by making an atomic clock the ground truth (see Generals' Problem). So you essentially don't have to worry about a regional failure causing unavailability (or inconsistency if you choose it) for one DB, since those clocks are a lot more reliable than machines. But there are probably downsides.
Numpy might not be able to handle a full O(n^2) comparison of vectors, but you can use a library with HNSW and get great performance on medium (and large) datasets.
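For example, a rough sketch with the hnswlib library (one common HNSW implementation; the dimension and parameter values below are just illustrative):

import hnswlib
import numpy as np

dim = 384
data = np.random.rand(100_000, dim).astype(np.float32)   # stand-in for real embeddings

index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=data.shape[0], ef_construction=200, M=16)
index.add_items(data, np.arange(data.shape[0]))
index.set_ef(50)                                          # query-time accuracy/speed trade-off

labels, distances = index.knn_query(data[:1], k=5)        # approximate nearest neighbours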
If I remember correctly, the simplest thing Numpy choked on was large sparse matrix multiplication. There are also other things like text search that Numpy won't help with and you don't want to do in Python if it's a large set.
This sentiment is pretty common, I guess. Outside of a niche, the massive scale for which the vast majority of data tech was designed doesn't exist, and KISS wins outright. Though I guess that's evolution: we want to test the limits in pursuit of grandeur before mastering the utility (e.g. the pyramids).
KISS doesn't get me employed, though. I narrowly missed being the chosen candidate for a State job that called for Apache Spark experience. I missed two questions relating to Spark and "what is a Parquet file?" but otherwise did great on the remaining behavioral questions (the hiring manager gave me feedback after I requested it). Too bad they did not have a question about processing data with command-line tools.
When I'm messing around, I normally have everything in a Pandas DataFrame already so I just add embeddings as a column and calculate cosine similarity on the fly. Even with a hundred thousand rows, it's fast enough to calculate before I can even move my eyes down on the screen to read the output.
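A sketch of that workflow, assuming a hypothetical DataFrame df with an "embedding" column of normalized vectors and a normalized query_embedding:

import numpy as np

emb = np.stack(df["embedding"].to_numpy())      # (n_rows, dim) matrix of row embeddings
df["similarity"] = emb @ query_embedding        # cosine similarity, since everything is normalized
top_matches = df.nlargest(10, "similarity")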
I regret ever messing around with Pinecone for my tiny and infrequently used set ups.
Yup. I was just playing around with this in JavaScript yesterday, and with ChatGPT's help it was surprisingly simple to go from text => embedding (via `openai.embeddings.create`) and then compare embedding similarity with cosine distance (which ChatGPT wrote for me): https://gist.github.com/christiangenco/3e23925885e3127f2c177...
Seems like the next standard feature in every app is going to be natural language search powered by embeddings.
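For comparison, the same text => embedding => similarity flow in Python looks roughly like this (a sketch; the model name is only an example, and it assumes the openai client is configured with an API key):

from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

docs = embed(["the cat sat on the mat", "dogs are great pets"])
query = embed(["feline resting place"])[0]
similarities = docs @ query   # OpenAI embeddings come normalized, so this is cosine similarity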
For posterity, OpenAI embeddings come pre-normalized so you can immediately dot-product.
Most embedding providers normalize by default, and SentenceTransformers has a normalize_embeddings parameter that does it for you. (It's a wrapper around PyTorch's F.normalize.)
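For instance (the model name is just an example):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["first sentence", "second sentence"], normalize_embeddings=True)
similarity = emb[0] @ emb[1]   # unit-length vectors, so the dot product is cosine similarity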
As an individual, I love the idea of pushing to simplify even further to understand these core concepts. For the ecosystem, I like that vector stores make these features accessible to environments outside of Python.
If you ask ChatGPT to give you a cosine similarity function that works against two arrays of floating-point numbers in any programming language, you'll get the code that you need.
Here's one in JavaScript (my prompt was "cosine similarity function for two javascript arrays of floating point numbers"):
function cosineSimilarity(vecA, vecB) {
  if (vecA.length !== vecB.length) {
    throw new Error("Vectors do not have the same dimensions");
  }
  let dotProduct = 0.0;
  let normA = 0.0;
  let normB = 0.0;
  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] ** 2;
    normB += vecB[i] ** 2;
  }
  if (normA === 0 || normB === 0) {
    throw new Error("One of the vectors is zero, cannot compute similarity");
  }
  // similarity = (A . B) / (|A| * |B|)
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
Vector stores really aren't necessary if you're dealing with less than a few hundred thousand vectors - load them up in a bunch of in-memory arrays and run a function like that against them using brute-force.
Even in production my guess is most teams would be better off just rolling their own embedding model (huggingface) + caching (redis/rocksdb) + FAISS (nearest neighbor) and be good to go. I suppose there is some expertise needed, but working with a vector database vendor has major drawbacks too.
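A rough sketch of the nearest-neighbor piece of that stack with FAISS (assuming float32 embeddings, normalized so that inner product equals cosine similarity; the sizes are just placeholders):

import faiss
import numpy as np

dim = 768
vectors = np.random.rand(50_000, dim).astype("float32")   # stand-in for cached embeddings
faiss.normalize_L2(vectors)                                # so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)   # exact inner-product search; swap in an ANN index at larger scale
index.add(vectors)

query = vectors[:1].copy()
scores, ids = index.search(query, 5)   # top-5 most similar vectors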
I think it was due to GloVe embeddings back then: I don't recall them ever being called GloVe vectors, although the "Ve" does stand for vector so it could have been RAS syndrome.