Python, C, Assembly – Faster Cosine Similarity (ashvardanian.com)
203 points by ashvardanian 10 months ago | 55 comments



Well now I know I have a bunch of software I want to try https://github.com/unum-cloud/. The author is definitely brilliant.

I should benchmark https://github.com/ashvardanian/SimSIMD/blob/main/include/si... and see if it's any faster than what hnswlib currently uses for bruteforce IP search because I'm currently using hnswlib in something and it's become a bottleneck. Or maybe I'll just try usearch directly.


Agreed, the work that Ash and team are doing with Unum is excellent!


The hardest (still missing) part of efficient cosine distance computation is picking a good epsilon for the `sqrt` calculation and avoiding "division by zero" problems.

We have an open issue about it in USearch and a related one in SimSIMD itself, so if you have any suggestions, please share your insights - they would impact millions of devices using the library (directly on servers and mobile, and through projects like ClickHouse and some of the Google repos): https://github.com/unum-cloud/usearch/issues/320
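
For reference, the guard in question looks roughly like this in scalar form - the threshold and the convention for zero vectors are exactly what's being debated, and nothing below is the actual SimSIMD code:

    #include <math.h>

    /* Illustrative only: guard the final normalization of the cosine distance.
       `ab` is the dot product, `a2` and `b2` the squared magnitudes. */
    static double guarded_cosine_distance(double ab, double a2, double b2) {
        double denom = sqrt(a2) * sqrt(b2);
        if (denom < 1e-9)   /* made-up epsilon - picking it well is the open issue */
            return 0;       /* the convention for (near-)zero vectors is also debated */
        return 1 - ab / denom;
    }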


I just did a skim, and it looks like you're essentially seeing the limits of IEEE 754 resolution at that scale (10^-130). Doubles can theoretically cover a base-10 exponent range of roughly +/- 308, but they only carry about 17 significant decimal digits of precision.

Floats were designed to be mostly reliable within a mostly reasonable range of numbers for the kinds of systems we deal with at our scale of the universe. The scale probed here is far, far beyond the Planck scale if we're talking about metric measurements. We have the converse issue (even with 64-bit) at large scales: I could make a geometric vector in space from here to a rock on a planet 15 billion light years away, and another vector to an adjacent rock. I bet those would have a nondeterministic cosine issue too.

I guess a solution is FP 128, or beyond.


Kahan summation is also commonly used in such cases, but I believe there is room for improvement without hitting those extremes. First of all, we should tune the epsilon here: https://github.com/ashvardanian/SimSIMD/blob/f8ff727dcddcd14...

As for the 64-bit version, it's harder, as the higher-precision `rsqrt` approximations are only available with "AVX512ER". I'm not sure which CPUs support that, but it's not available on Sapphire Rapids.


Why would floats for cosine similarity need to support ranges beyond Planck scale?


Can you adjust vector magnitude prior to the normalization? If the max coordinate C is close enough to 0, divide all coordinates by C.
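
Something along these lines, as a scalar sketch (illustrative names; I haven't checked how it interacts with the precision issue in the linked ticket):

    #include <math.h>
    #include <stddef.h>

    /* Rescale by the largest absolute coordinate first, so the squared sum
       stays in a comfortable range before the sqrt. */
    static void normalize_rescaled(float *v, size_t n) {
        float max_abs = 0;
        for (size_t i = 0; i < n; ++i)
            if (fabsf(v[i]) > max_abs) max_abs = fabsf(v[i]);
        if (max_abs == 0) return;          /* zero vector: nothing to normalize */
        double sum2 = 0;
        for (size_t i = 0; i < n; ++i) {
            v[i] /= max_abs;               /* coordinates now lie in [-1, 1] */
            sum2 += (double)v[i] * v[i];
        }
        float inv_norm = (float)(1.0 / sqrt(sum2));
        for (size_t i = 0; i < n; ++i)
            v[i] *= inv_norm;
    }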


I wonder if this code could be sped up even further by taking into account the throughput versus the latency of the FMA operation. If the latency is 4 cycles and the throughput is 0.5 cycles per instruction, for example, that means that doing n dependent operations will cost 4n cycles, while n independent operations will cost only 0.5n cycles - an 8x speed-up. By dependent operations, I mean that the input of an operation depends on the output of the previous operation.

In the current final code, each loop iteration has 3 independent FMAs. And in each iteration, those 3 FMAs depend on the FMAs from the previous iteration.

So one thing that could speed up the computation is to do 6 independent FMAs per loop iteration instead of 3. This can be done as follows: instead of computing the cumulative sums from 0 to n-1, split them into sums from 0 to n/2-1 and from n/2 to n-1. The main loop then goes only from 0 to n/2-1, and at the k-th iteration we do the 2x3 FMAs corresponding to the entries at indices k and n/2+k. At the end, we add the two sets of cumulative sums together.
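
Roughly, in scalar form (only the dot-product accumulator is shown - the two squared-magnitude sums would be split the same way - and the intrinsics version would use vector FMAs instead of `fmaf`):

    #include <math.h>
    #include <stddef.h>

    /* Two independent accumulator chains: the FMA feeding sum_lo never has to
       wait on the FMA feeding sum_hi, so they can overlap in the pipeline.
       Assumes n is even, for brevity. */
    static float dot_two_chains(const float *a, const float *b, size_t n) {
        size_t half = n / 2;
        float sum_lo = 0, sum_hi = 0;
        for (size_t k = 0; k < half; ++k) {
            sum_lo = fmaf(a[k], b[k], sum_lo);                /* chain 1 */
            sum_hi = fmaf(a[half + k], b[half + k], sum_hi);  /* chain 2 */
        }
        return sum_lo + sum_hi;   /* combine the partial sums at the end */
    }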


I've had a similar hypothesis, but it didn't work out for me, only bloating the codebase. Please lmk if you have a different outcome :)


I don't know about this code in particular but if you're mostly doing math you tend to run out of architectural registers before you achieve the full throughput of the instructions you care about :(


That makes sense. I did observe a significant speed-up using this technique for evaluating a polynomial on several values. I used Horner's algorithm, and indeed the number of registers needed is very small in that case.

If the lack of registers is really the bottleneck, a variant might be to use zmm registers for some of the cumulative sums and ymm registers for the others. In that case the speed-up might be less spectacular, though.

Edit: actually, I just discovered that zmm registers overlap the ymm registers, so the only registers left are the ones from the FPU.


AVX512 added 16 more vector registers. Isn't 32 registers enough to unroll by 8?


Very nice job! I love these kinds of optimizations. I wonder if we'll get to a compiler that can do it by itself.

Looks like you're missing a square root in the zip-Python version. Does that affect runtime?


Oh, right, forgot to put it there. If anything, it would make the Python version only slower, and the point bolder :)


> C is the lowest-level hardware-agnostic programming language

Not really, plenty of other ones offer similar capabilities.

> Unlike C++, however, C doesn’t support “generics” or “template functions”.

It does support a light version of them via _Generic.


`_Generic` is more akin to type switches, and generics or templates generally refer to an ability to instantiate multiple functions on demand.
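
For illustration, roughly what `_Generic` buys you - selecting between functions you've already written by hand, rather than instantiating new ones (the names below are made up):

    #include <stdio.h>

    /* Every "overload" must already exist; _Generic only picks one by type. */
    static float dot_f32(const float *a, const float *b, int n) {
        float s = 0;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }

    static double dot_f64(const double *a, const double *b, int n) {
        double s = 0;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }

    #define dot(a, b, n) _Generic((a),                  \
        float *: dot_f32, const float *: dot_f32,       \
        double *: dot_f64, const double *: dot_f64)(a, b, n)

    int main(void) {
        float x[] = {1, 2, 3}, y[] = {4, 5, 6};
        double u[] = {1, 2, 3}, v[] = {4, 5, 6};
        printf("%f %f\n", dot(x, y, 3), dot(u, v, 3));
        return 0;
    }

The preprocessor plus `_Generic` gets you the dispatch, but each specialization still has to be written (or macro-expanded) by hand, which is the point being made above.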


An implementation detail of how generics are exposed in C.


That is not an implementation detail of the same sort of feature. It is just a totally different feature.

OK, not totally different, but different enough: you cannot just write a function with a type parameter and expect it to work with multiple different types in C the way you can in C++.


How the sausage gets made is not much of a problem for many.


Can someone explain why the numpy code ends up being so slow? Is the author actually doing things in a vectorized manner or just looping over np arrays instead of lists? It's hard to tell from the info given.


> Here’s a tip: If you’re using NumPy, go all in. Mixing it with regular Python can really slow you down!

It looks like they shoved numpy arrays into the place native Python containers used to occupy. They incurred more of a slowdown than you might naively have expected from that particular obvious error.

Unrelated to this article, numpy is often a lot slower than expected even when used "correctly." Major culprits include order of operations (compared to the ideal / forced by the API / when you assume vectorization automagically makes a bad order of operations fast), allocations, memory bandwidth, kernels optimized for a dimension other than the one you care about, kernels optimized for a use case other than the one you care about, and improper installation (i.e., not linking against fast BLAS/LAPACK libraries).

Most of those sins are rectifiable without any major change in coding habits by wrapping your nightmares in `jax.jit` or `torch.compile`, but if your devs are writing the sort of code that makes those things _necessary_ (rather than just nice-to-haves) then that probably still won't solve your problems.


Since NumPy is called out here, I wrote an example in KlongPy which is NumPy backed and got about 1ns/iter on my M2 Studio:

dot::{+/xy};mag::{(+/xx)^0.5};cossim::{dot(x;y)%mag(x)*mag(y)};cd::{1-cossim(x;y)}

https://github.com/briangu/klongpy/blob/main/examples/ml/cos...


Interesting! Haven't heard of KlongPy before!

We currently support several techniques if you want to provide custom distance functions to USearch, instead of using the default SimSIMD variants. At this point it's Numba, Cppyy, and PeachPy: https://unum-cloud.github.io/docs/usearch/python/index.html

Maybe KlongPy can be the next one ;)


I was mainly surprised how fast NumPy can be if you push it. KlongPy is effectively a NumPy DSL (array language). So vector thinking is natural.


This is a very nice writeup, and it clearly demonstrates the performance benefits to be gained from using the underlying hardware properly.

A couple of things I noticed:

    # calculating the dot product
    a = np.random.randn(1, 1536).astype("float32")
    b = np.random.randn(1536, 1).astype("float32")
    result = np.matmul(a, b)  # ~1.1 microseconds

My system configuration: Intel(R) Core(TM) i5-8300H CPU @ 2.30GHz (quad core); AVX2 (256-bit) instructions are supported.

The problem lies in calculating the magnitudes of the vectors, which I guess SciPy is doing as `np.sum(np.square(a))`. Any NumPy routine like `matmul` that goes through a BLAS library is very fast, but the other routines are just plain C for-loops. Magnitude calculation by passing the NumPy buffer to a C extension takes about 1.4 microseconds, and using SIMD instructions for the magnitude calculation is very straightforward - I get a speed-up of about 8x.

For critical calculations you are almost always better off passing the NumPy buffer to C code and fusing the operations there, optionally speeding it up with SIMD instructions for the hardware you want to target.
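
For what it's worth, the kind of fused kernel I mean looks roughly like this (plain C; the NumPy buffer / extension-module plumbing is omitted):

    #include <math.h>
    #include <stddef.h>

    /* One pass over both buffers: the dot product and both squared magnitudes
       are accumulated together, so the data is read only once. */
    static float cosine_distance_fused(const float *a, const float *b, size_t n) {
        float ab = 0, a2 = 0, b2 = 0;
        for (size_t i = 0; i < n; ++i) {
            ab += a[i] * b[i];
            a2 += a[i] * a[i];
            b2 += b[i] * b[i];
        }
        return 1 - ab / sqrtf(a2 * b2);
    }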

It should be `scipy.spatial.distance.cosine`, not `scipy.distance.spatial.cosine`, in your blog!


If the author of the article sees this, this link in the table of contents doesn't work:

https://ashvardanian.com/posts/python-c-assembly-comparison/...


Patched, thanks!


I think the speed of your pure Python version can be improved by a factor of 2..3 by making better use of some features in the standard library. Try:

    from math import hypot
    from operator import mul

    def cosine_distance(a, b):
        dot_product = sum(map(mul, a, b))
        magnitude_a = hypot(*a)
        magnitude_b = hypot(*b)
        cosine_similarity = dot_product / (magnitude_a * magnitude_b)
        return 1 - cosine_similarity
Also check out `math.sumprod`, available from Python 3.12 onwards.


That's some good benchmarking; however, doing this kind of optimization at scale is impossible, because:

1 - Going from correct to fast is simple, but the other direction is full of misery.

2 - Optimizations are super input- and hardware-specific.


1 - Very true, but in some ways it's helpful, as it provides an opportunity to highlight some of the poorly set up constraints in downstream applications - like changing the compiler used for a major project and seeing a thousand warnings at once. In this case, the applications that will suffer are the ones that compare floats and float arrays without absolute and relative tolerance intervals.

2 - Also true, but the SimSIMD library covers the vast majority of mobile, desktop, and server CPUs produced in recent years.


Great article! Shame it left me curious to know how a further implementation using more pure Assembly would’ve performed vs using intrinsics. Anyone know or is it easy to just assume “faster”?


I've generally found no/minimal change between assembly and intrinsics. Once I start using them I tend to look at the generated assembly to see what's actually being generated and to make sure the compiler isn't doing something surprising.


"As pointed on HackerNews, I forgot to apply the square root for magnitude_a and magnitude_b."

Does that mean the example above this sentence is fixed, or does it mean it is wrong?


The latency of the operation will be slightly higher if we add the `sqrt`, only making the difference bigger... bigger than 2500x.

PS: Even though it wasn't intended in that specific case, square roots are often avoided in vector search, most commonly in Euclidean distance computations. Even without the square root, the ordering of distances is preserved, so you can still use it to find the closest object - you just don't know how close it is :)


Excuse me?


Just want to thank the author for writing Python using snake_case, which is what a sensible, well-educated engineer should do.


Reminds me of an article about the optimization of matrix multiplication with IA that eventually showed better algorithms, for specific matrix sizes, than the one found decades ago and thought to be the best.


What’s IA stand for? This sounds like an interesting article, I wanna try to find it


If that's a typo and they meant to say "AI", then I may be remembering the article(s) being alluded to:

> AlphaTensor discovered algorithms that outperform the state-of-the-art complexity for many matrix sizes. Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.

https://towardsdatascience.com/how-deepmind-discovered-new-w...

https://www.nature.com/articles/s41586-022-05172-4


If I had to guess, I'd say it's a Frenchism ("Intelligence artificielle").


Could be Spanish too :)


Apropos of nothing, that is a horrible use of goto.

The code would be clearer as a do { } while(n); instead--it's just a bottom-tested loop, which isn't that uncommon. (I guess the author may just be unfamiliar with do/while loops?)
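
Something like this, for instance (a sketch with made-up names, not the actual SimSIMD tail loop):

    #include <stddef.h>

    /* Bottom-tested loop: the body always runs at least once, which is what the
       goto achieves. Assumes n > 0, as the tail is only entered with leftovers. */
    static float dot_tail(const float *a, const float *b, size_t n) {
        float sum = 0;
        do {
            sum += *a++ * *b++;
        } while (--n);
        return sum;
    }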


There are several `do {} while ()` loops in the SimSIMD codebase, mostly in the SVE kernels [1]. I avoid them in many other places - mostly where the conditions are complex, but also to avoid adding one more level of nesting for that scope.

[1]: https://github.com/ashvardanian/SimSIMD/blob/f8ff727dcddcd14...


I came to say something similar, though I would use:

  while (1) {
    ...
    if (!n) {
      ...
      return 1 - ab * rsqrt_a2 * rsqrt_b2;
    }
  }
Also, the "So let’s use the common Pythonic zip idiom:" version is missing the sqrt calls, so it doesn't give the same answer as the first version.


Do you really think it’s likely that someone who uses SIMD intrinsics is unfamiliar with do…while loops?

It’s more likely that compilers generate better autovectorised code with a goto, even though they’re semantically the same.


> It’s more likely that compilers generate better autovectorised code with a goto, even though they’re semantically the same.

No, it's not. Most compiler IRs work on a basic-block representation, where the compiler reduces all control flow to unconditional and conditional jumps. If they're not working on that level, then they likely have some form of structured control flow, where autovectorization will ignore gotos entirely because they're the definition of unstructured control flow, so it won't notice when the goto forms a loop that could be autovectorized.


Maybe, but in my experience GCC, and Clang to a lesser extent, is notoriously useless at autovectorisation unless you hold its hand all the way. Even things like changing an ‘int’ to a ‘bool’ can throw it off completely.


That matches my experience, and goes beyond GCC and Clang. Between 2018 and 2020 I was giving a lot of lectures on this topic and we did a bunch of case studies with Intel on their older ICC and what later became the OneAPI.

Short story: unless you are doing trivial data-parallel operations, like in SimSIMD, compilers are practically useless. As proof, I wrote what is now the StringZilla library (https://github.com/ashvardanian/stringzilla), and we spent weeks with an Intel team tuning the compiler, with no result. So if you are processing a lot of strings, or variable-length coded data like compression/decompression, hand-written SIMD kernels are pretty much unbeatable.


> Do you really think it’s likely that someone who uses SIMD intrinsic is unfamiliar with do…while loops?

You don't have to think them unfamiliar to recommend it. Sometimes very skilled people can forget to apply a particular style of control flow, even if they're otherwise experts - everybody makes mistakes.


I've just wasted a day of benchmarks, because in one of my major ongoing refactoring efforts a compilation flag wasn't propagated to the needed target - so I definitely make a lot of mistakes :)

As for the `do`/`while` keywords, it's just a preference, but feel free to recommend more patches - I'm always happy to see new contributors; it's a luxury in low-level projects :)


Am I understanding correctly that the Numpy one doesn't actually use Numpy to do the calculation, but just to store the data? If so that's a super disingenuous example. The speed of Numpy comes from being able to stay in C for the entire hot loop.

EDIT: I tested it myself, also using ChatGPT to code it for me.

>>> can you write a cosine distance function using numpy?

    def cosine_distance(vector1, vector2):
        dot_product = np.dot(vector1, vector2)
        norm_vector1 = np.linalg.norm(vector1)
        norm_vector2 = np.linalg.norm(vector2)
        cosine_similarity = dot_product / (norm_vector1 * norm_vector2)
        cosine_distance = 1 - cosine_similarity
        return cosine_distance
The original list one takes 161us on my machine, passing a Numpy array to the original function takes 750us, and using the native Numpy function takes 6.24us.

Using Numpy correctly makes it >100x faster!


Yes, the Numpy example is bad. He should try something like,

  1 - np.dot(a,b) / (np.linalg.norm(a) * np.linalg.norm(b))


I implemented cosine similarity too in the last week or two for something, and the 1- threw me for a loop until I looked at the math for another 15 seconds and realized that it didn't make much sense for "Cat" and "Cat" to have 0 similarity haha


So, following the numpy example, the author gives one using scipy. That should be the same as, if not faster than, a pure numpy implementation.


You'll probably want to use a function to normalize vectors if you don't want it to barf when you accidentally put something close to a 0 vector in there.



