Aren’t all approaches to search indices essentially based on vector similarity?
I get that here they are applying "non-traditional" machine-learned magnitudes and scoring schemes, but couldn’t you do the same thing on, say, Solr given a little magic on the indexing and query pipelines?
I worked on a heavily hacked Solr installation that used a neural network to cook documents down to 50-d vectors, which were then similarity searched. It also used an "information theoretic" similarity function for the ordinary word-frequency vectors, before that was officially supported.
Indexing time and throughput were awful, but searches were quick enough, and result quality on tasks like "write a paragraph about your invention and find related patents" was fantastic.
Consider word embeddings: to find the words most similar to some input, you need to do a nearest-neighbor (NN) search over 300 dimensions. That’s a common task, and it’s largely unsolved. Postgres, for example, lets you create indices over its "cube" datatype, but those are slower than the naive brute-force approach.
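For reference, here is roughly what I mean by the naive brute-force approach: score the query against every vector and keep the top k. This is only a sketch using the standard library. The names `cosine`, `nearest`, and `toy_vocab` are made up for illustration, the vectors are random stand-ins rather than real embeddings, and cosine similarity is my assumption about the metric.

```python
import heapq
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def nearest(query, vocab, k=5):
    """Brute-force k-NN: compare the query against every stored vector.

    vocab maps word -> vector. Returns up to k (word, similarity) pairs,
    most similar first. Cost is O(len(vocab) * dim) per query.
    """
    scored = ((cosine(query, vec), word) for word, vec in vocab.items())
    top = heapq.nlargest(k, scored)
    return [(word, sim) for sim, word in top]


def toy_vocab(n_words=1000, dim=300, seed=0):
    """Hypothetical stand-in for a real embedding table (random vectors)."""
    rng = random.Random(seed)
    return {
        f"w{i}": [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for i in range(n_words)
    }
```

Usage would be something like `nearest(vocab["w7"], vocab, k=3)`. Every query touches every vector, so the cost grows linearly with the vocabulary size. That linear scan is the baseline an index has to beat.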