BMX: A Freshly Baked Take on BM25 (mixedbread.ai)
91 points by breadislove 4 months ago | 11 comments



> Entropy-weighted similarity: We adjust the similarity scores between query tokens and related documents based on the entropy of each token.

Sounds a lot like BM25 weighted word embeddings (e.g. fastText).
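For intuition, here's a rough sketch of what "entropy-weighted" could mean (my own reading, not the actual formula from the BMX paper, which may differ): weight each query token by how concentrated its occurrences are across documents, so evenly-spread, uninformative tokens contribute less.

    import math

    def token_entropy(token, tokenized_docs):
        # Shannon entropy of the token's occurrence distribution over
        # documents, normalized to [0, 1]. Tokens spread evenly across
        # the corpus get high entropy; concentrated ones get low entropy.
        counts = [doc.count(token) for doc in tokenized_docs]
        total = sum(counts)
        if total == 0 or len(tokenized_docs) < 2:
            return 0.0
        probs = [c / total for c in counts if c > 0]
        h = -sum(p * math.log(p) for p in probs)
        return h / math.log(len(tokenized_docs))

    def entropy_weighted_score(query_tokens, doc, tokenized_docs):
        # Toy scorer: term frequency scaled by (1 - entropy). BMX
        # combines entropy with BM25-style scoring; the details differ.
        return sum((1.0 - token_entropy(t, tokenized_docs)) * doc.count(t)
                   for t in query_tokens)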


> Sounds a lot like BM25 weighted word embeddings (e.g. fastText).

If you're interested in this topic, I wrote an article on this method back in 2020: https://medium.com/towards-data-science/building-a-sentence-...
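The gist, roughly: compute a per-token BM25/IDF weight and use it as the weight in an average of fastText word vectors to get a sentence embedding. A minimal sketch (the smoothing and weighting in the article may differ):

    import math
    import numpy as np

    def idf_weights(tokenized_docs):
        # Smoothed BM25-style IDF: ln(1 + (N - df + 0.5) / (df + 0.5)).
        n = len(tokenized_docs)
        df = {}
        for doc in tokenized_docs:
            for t in set(doc):
                df[t] = df.get(t, 0) + 1
        return {t: math.log(1 + (n - d + 0.5) / (d + 0.5))
                for t, d in df.items()}

    def sentence_embedding(tokens, word_vectors, idf):
        # IDF-weighted average of word vectors. word_vectors is any
        # mapping from token to np.ndarray, e.g. loaded fastText vectors.
        pairs = [(word_vectors[t], idf.get(t, 0.0))
                 for t in tokens if t in word_vectors]
        if not pairs:
            return None
        vecs, weights = zip(*pairs)
        if sum(weights) == 0:
            return np.mean(vecs, axis=0)
        return np.average(np.array(vecs), axis=0, weights=weights)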


Thanks for sharing the article. How exactly are you combining BM25 and fastText? Are you combining the TF-IDF score + embedding distance? What are the weights for each of these?
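E.g., is it a simple convex combination of the normalized scores, something like this?

    import numpy as np

    def hybrid_scores(bm25_scores, embedding_sims, alpha=0.5):
        # Min-max normalize each signal, then mix. alpha is a free
        # parameter you'd tune on a validation set; 0.5 is arbitrary.
        def minmax(x):
            x = np.asarray(x, dtype=float)
            span = x.max() - x.min()
            return (x - x.min()) / span if span > 0 else np.zeros_like(x)
        return alpha * minmax(bm25_scores) + (1 - alpha) * minmax(embedding_sims)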


What about computational complexity? There seems to be a small improvement in the metrics, but I'm not sure it's enough to justify switching to BMX.


Neat. I wonder how GPT-4-based query expansion would compare with SPLADE or similar masked-language-model methods. Also, if you really want to go nuts, you can apply term expansion to the document corpus as well, along the lines of the sketch below.
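E.g., with an off-the-shelf BM25 (rank-bm25 here, not the BMX implementation), LLM-suggested expansion terms could be mixed in at a reduced weight:

    # pip install rank-bm25
    from rank_bm25 import BM25Okapi

    corpus = [doc.split() for doc in [
        "bm25 is a classic lexical ranking function",
        "dense embeddings capture semantic similarity",
        "query expansion adds related terms to a query",
    ]]
    bm25 = BM25Okapi(corpus)

    query = "lexical ranking".split()
    # Hand-written stand-ins for terms an LLM might suggest for this query.
    expansion = "bm25 keyword scoring".split()

    # Score original and expansion terms separately; downweight the latter.
    # The 0.3 mixing weight is arbitrary and would need tuning.
    scores = bm25.get_scores(query) + 0.3 * bm25.get_scores(expansion)
    print(scores)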


Very cool! Glad to see continued research in this direction. I’ve really enjoyed reading the Mixedbread blog. If you’re interested in retrieval topics, they’re doing some cool stuff.


baguetter library for the win!


Super cool! It's definitely a good choice for RAG systems.


Amazing! When will we have this in the major databases?



Gemischtes Brot! (German for "mixed bread!")



