BMX: A Freshly Baked Take on BM25 (mixedbread.ai)
91 points by breadislove 4 months ago | 11 comments



> Entropy-weighted similarity: We adjust the similarity scores between query tokens and related documents based on the entropy of each token.

Sounds a lot like BM25 weighted word embeddings (e.g. fastText).
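For intuition, here's a rough sketch of what "entropy-weighted" could mean (my own reading, not the actual formula from the BMX paper, which may differ): weight each query token by how concentrated its occurrences are across documents, so evenly-spread, uninformative tokens contribute less.

    import math

    def token_entropy(token, tokenized_docs):
        # Shannon entropy of the token's occurrence distribution over
        # documents, normalized to [0, 1]. Tokens spread evenly across
        # the corpus get high entropy; concentrated ones get low entropy.
        counts = [doc.count(token) for doc in tokenized_docs]
        total = sum(counts)
        if total == 0 or len(tokenized_docs) < 2:
            return 0.0
        probs = [c / total for c in counts if c > 0]
        h = -sum(p * math.log(p) for p in probs)
        return h / math.log(len(tokenized_docs))

    def entropy_weighted_score(query_tokens, doc, tokenized_docs):
        # Toy scorer: term frequency scaled by (1 - entropy). BMX
        # combines entropy with BM25-style scoring; the details differ.
        return sum((1.0 - token_entropy(t, tokenized_docs)) * doc.count(t)
                   for t in query_tokens)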


> Sounds a lot like BM25 weighted word embeddings (e.g. fastText).

If you're interested in this topic, I wrote an article on this method back in 2020: https://medium.com/towards-data-science/building-a-sentence-...
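The gist, roughly: compute a per-token BM25/IDF weight and use it as the weight in an average of fastText word vectors to get a sentence embedding. A minimal sketch (the smoothing and weighting in the article may differ):

    import math
    import numpy as np

    def idf_weights(tokenized_docs):
        # Smoothed BM25-style IDF: ln(1 + (N - df + 0.5) / (df + 0.5)).
        n = len(tokenized_docs)
        df = {}
        for doc in tokenized_docs:
            for t in set(doc):
                df[t] = df.get(t, 0) + 1
        return {t: math.log(1 + (n - d + 0.5) / (d + 0.5))
                for t, d in df.items()}

    def sentence_embedding(tokens, word_vectors, idf):
        # IDF-weighted average of word vectors. word_vectors is any
        # mapping from token to np.ndarray, e.g. loaded fastText vectors.
        pairs = [(word_vectors[t], idf.get(t, 0.0))
                 for t in tokens if t in word_vectors]
        if not pairs:
            return None
        vecs, weights = zip(*pairs)
        if sum(weights) == 0:
            return np.mean(vecs, axis=0)
        return np.average(np.array(vecs), axis=0, weights=weights)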


Thanks for sharing the article. How exactly are you combining BM25 and fastText? Are you combining the TF-IDF score + embedding distance? What are the weights for each of these?
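E.g., is it a simple convex combination of the normalized scores, something like this?

    import numpy as np

    def hybrid_scores(bm25_scores, embedding_sims, alpha=0.5):
        # Min-max normalize each signal, then mix. alpha is a free
        # parameter you'd tune on a validation set; 0.5 is arbitrary.
        def minmax(x):
            x = np.asarray(x, dtype=float)
            span = x.max() - x.min()
            return (x - x.min()) / span if span > 0 else np.zeros_like(x)
        return alpha * minmax(bm25_scores) + (1 - alpha) * minmax(embedding_sims)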


What about computational complexity? There seems to be a small improvement in the metrics, but I'm not sure it's enough to justify switching to BMX.


Neat. I wonder how GPT-4-based query expansion would compare with SPLADE or similar masked-language-model methods. Also, if you really want to go nuts, you can apply term expansion to the document corpus as well, along the lines of the sketch below.
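E.g., with an off-the-shelf BM25 (rank-bm25 here, not the BMX implementation), LLM-suggested expansion terms could be mixed in at a reduced weight:

    # pip install rank-bm25
    from rank_bm25 import BM25Okapi

    corpus = [doc.split() for doc in [
        "bm25 is a classic lexical ranking function",
        "dense embeddings capture semantic similarity",
        "query expansion adds related terms to a query",
    ]]
    bm25 = BM25Okapi(corpus)

    query = "lexical ranking".split()
    # Hand-written stand-ins for terms an LLM might suggest for this query.
    expansion = "bm25 keyword scoring".split()

    # Score original and expansion terms separately; downweight the latter.
    # The 0.3 mixing weight is arbitrary and would need tuning.
    scores = bm25.get_scores(query) + 0.3 * bm25.get_scores(expansion)
    print(scores)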


Very cool! Glad to see continued research in this direction. I’ve really enjoyed reading the Mixedbread blog. If you’re interested in retrieval topics, they’re doing some cool stuff.


baguetter library for the win!


Super cool! It's definitely a good choice for RAG systems.


Amazing! When will we have this in the major databases?



Gemischtes Brot! (German for "mixed bread!")



