Nearest neighbor indexes for similarity search

jamesbriggs · on Aug 12, 2021

Hey all, I'm the author - feel free to ask me any questions about the article!

lootsauce · on Aug 12, 2021

Are there any approaches to this problem that are distributed perhaps could take advantage of serverless functions type implementation?

bguberfain · on Aug 12, 2021

Thanks for sharing! I have a question: is there any option for sparse vectors (like tf-idf outputs)?

mumblemumble · on Aug 12, 2021

nmslib (https://github.com/nmslib/nmslib) supports sparse vectors for some of its spaces. It has fewer indexing methods than faiss, though.

https://github.com/nmslib/nmslib/blob/master/manual/spaces.m...

jamesbriggs · on Aug 12, 2021

Welcome! Unfortunately faiss doesn't support sparse vectors. If you're using sparse vectors I would either not use faiss (there are better options like Elasticsearch) - or convert them into dense for use with faiss

hintymad · on Aug 12, 2021

Do such vector-search databases have a sizable market, say > 10B? It looks only a few big players would need such technology: search engines, security companies, LexisNexis (maybe). But for most of companies, their dataset is small enough that an in-memory library like faiss may well just do the work. And if such database becomes so useful, I'd imagine that software like Solr or ES can simply replace their current implementation of ANN search with a better version.

_009 · on Aug 12, 2021

Apart from search, ANNs can be use for recommendations, classification, and other information retrieval problems.

Currently, ES and Solr, both based on Lucene, can't really manage vector representations, as they are mainly based on inverted indexes to n-grams.

ANNs potential applications extend to audio, bioinformatics, video, among any modality that can be represented as a vector. All you need is an encoder! How nice.

Faiss is definitely powerful. I have been running experiments using 80 million vectors that map to legal documents, and vectorizing protein-folds (using Alphafold). While it is an interesting technology, at this moment, perhaps for my usecases, I see it more as a lib or tool than a full-featured product like ES or Solr.

For instance, ATM, updating a Faiss index is a non trivial process, with many of the workflow tools you would expect in ES missing. There is also the problem of encoding the input into vectors, which takes a few milliseconds (do you batch, parallelize, are you ok with eventual consistency?).

I recently been found with pgvector (postgres + vector support) https://github.com/ankane/pgvector. Perhaps less performant, but easier to work with for teams. With support of migrations, ORM, sharing, and all the postgres goodies.

Another interesting/product-ready alternative is https://jina.ai.

And Google's ScaNN, https://www.youtube.com/watch?v=0SvrDtnUgV4

m-i-l · on Aug 12, 2021

> "Currently, ES and Solr, both based on Lucene, can't really manage vector representations"

Lucene does have an ANN implementation due in 9.0, based on HNSW - see https://issues.apache.org/jira/browse/LUCENE-9004 for details. See also https://issues.apache.org/jira/browse/SOLR-12890 and https://issues.apache.org/jira/browse/SOLR-14397 for Solr.

heipei · on Aug 12, 2021

Some more Open Source vector search engines for ANN: https://milvus.io/ https://vespa.ai/ https://vald.vdaas.org/ https://github.com/semi-technologies/weaviate https://opendistro.github.io/for-elasticsearch/features/knn....

hintymad · on Aug 12, 2021

> Apart from search, ANNs can be use for recommendations, classification, and other information retrieval problems.

Yeah, those are all good use cases. I was wondering about a different thing: will the demand for a distributed vector search service concentrate to a few big companies, as smaller companies can use a simpler solution so they don't really need to pay for the technology.

> Currently, ES and Solr, both based on Lucene, can't really manage vector representations, as they are mainly based on inverted indexes to n-grams.

ES has kNN plugin, which stores vectors separately in each segment in Lucene index. Plus, they can also use better storage formats and algorithms.

rpedela · on Aug 12, 2021

Google has a distributed embedding matching service in preview: https://cloud.google.com/vertex-ai/docs/matching-engine/over...

I guess it depends on what you mean by "simple". The algorithms are complex but there are good tools that implement them. I would imagine smaller companies would use off the shelf tooling, and I would argue that is simpler. Vector embeddings are so unbelievably powerful and often yield better results than classical methods with one of the good tools + pretrained embeddings.

Specifically for search, I use them to completely replace stemming, synonyms, etc in ES. I match the query's embedding to the document embeddings, find the top 1000 or so. Then I ask ES for the BM25 score for that top 1000. I combine the embedding match score with BM25, recency, etc for final rank. The results are so much better than using stemming, etc and it's overall simpler because I can use off the shelf tooling and the data pipeline is simpler.

hintymad · on Aug 12, 2021

> I match the query's embedding to the document embeddings,

I assume the doc size is relatively small, otherwise a document may contain too many different topics that make it hard to differentiate different queries?

rpedela · on Aug 12, 2021

For my search use case, documents are mostly single topic and less than 10 pages. However I have found embeddings still work surprisingly well for longer documents with a few topics in them. But yes, multi-topic documents can certainly be an issue. Segmentation by sentence, paragraph, or page can help here. I believe there are ML-based topic segmentation algorithms too, but that certainly starts making it less simple.

gk1 · on Aug 12, 2021

The moment you cross 10M items or 100 QPS then scaling such a system becomes non-trivial. That's not a high threshold for any enterprise software company handling customer data or any consumer tech company with >10M users. Once you add other requirements to the mix, such as index freshness and metadata filtering, the managed options where this is already built-in start to become compelling even at lower volumes.

Also, Pinecone (disclosure: I work there) has usage-based billing that starts at $72/month, so "paying for the technology" is not that scary.

sanxiyn · on Aug 12, 2021

Well, OpenSearch already has ANN search plugin. Implementation currently uses nmslib. I am using it and it works fine.

https://opensearch.org/docs/search-plugins/knn/index/

jjgreen · on Aug 12, 2021

I've been looking at Faiss, it's on its way to Debian ...

https://packages.qa.debian.org/f/faiss.html

PaulHoule · on Aug 12, 2021

In most of the cases that I've done vector similarity search the number of vectors was a million or less (say the 300,000 public domain images from the Metropolitan Museum of Art I am looking at now) and in that case in-memory search with naive algorithms works very well and more complex algorithms are more trouble than they are worth.

I've also done similarity search over spiky vectors (have most of the weight in a few dimensions even if they aren't really sparse) with a conventional database (MySQL) in which case I could search a small fraction of the database and have a "proof" that there aren't going to be any more documents that will make it into the top N.

Generally though the hyperdimensional indexes just aren't as good as 1-d indexes, just as the 1-d indexes are a bit better than 2-d and 3-d indexes. (E.g. consider the problem of deciding which lines to draw if you're drawing a map of a small piece of the New Mexico-Arizona border, where the border is a long straight line that has both vertexes far away.)

gk1 · on Aug 12, 2021

The moment you cross 10M items or 100 QPS then scaling such a system becomes non-trivial. Combine that with some additional features a typical company may want but won't find in the open-source libraries -- such as live index updates and metadata filtering -- and you have a good reason to look for managed solutions like Pinecone[0].

That's not a high threshold for any enterprise software company handling customer data or any consumer tech company with >10M users. Google, Elastic, and AWS thought so too because they all introduced (or planning to introduce, in the case of Elastic) vector search in their platforms.

[0] https://www.pinecone.io

IfOnlyYouKnew · on Aug 12, 2021

High-dimensional vectors are the most natural representation of a lot of the intermediary and final steps of ML pipelines.

It’s true that it might be more of a feature than a product. But it’s a feature that’s currently missing from a lot of products one would usually turn to.

ramraj07 · on Aug 12, 2021

With the advent of the whole data information complex (Tim cooks words not mine) more and more companies are getting access to a ton of data, so that’s there. Both in consumer and healthcare spaces.

fergie · on Aug 12, 2021

Aren’t all approaches to search indices essentially based on vector similarity?

I get that here they are applying "non-traditional" machine-learned magnitudes and scoring schemes, but couldn’t you do the same thing on, say, Solr given a little magic on the indexing and query pipelines?

PaulHoule · on Aug 12, 2021

I worked on a heavily hacked Solr installation that used a neural network that cooked documents down to 50-d vectors which were similarity searched and also used an "information theoretic" similarity function for the ordinary word frequency vectors before that was officially supported.

Performance in terms of indexing time and throughput was awful, but searches were quick enough and performance in terms of "write a paragraph about your invention and find related patents" was fantastic.

IfOnlyYouKnew · on Aug 12, 2021

Hmmm… no?

Consider word embedding, where finding the most similar words given some input, you need to do a NN over 300 dimensions. That’s a common task, and it’s largely unsolved. Postgres, for example, allows you to create indices over their „cube“ datatype, but those are slower than the naive brut-force approach.