Hacker News new | past | comments | ask | show | jobs | submit login

You can also get efficient FTS with this method if you implement indexing in user space and avoid BM25. The Lucene practical scoring function works well with this method in my experience: https://www.elastic.co/guide/en/elasticsearch/guide/current/...



Is there a dumbed down version of this indexing conversation for someone who understands b-trees, but BM25 or user space indexing?


Well, this post is in the context of an e2e encrypted DB, but it's subject to the same constraints: https://medium.com/@ZeroDB_/scalable-full-text-search-over-e...

If you understand btrees you understand the hardest part already :)

Basically, you need to design a search index that examines the fewest DB pages in order to find the result. The Lucene scoring method stores a mapping of term -> document[] sorted in relevance order. The main idea is that you can examine only the first n documents for each term in the search query in order to find the most relevant search results. Picking n is sort of tricky, but if your index is stored in this way it's possible to fulfill a large % of queries efficiently without downloading the whole index.

Here's a little Python implementation of what I mean by a "user space implementation". Note that it's a toy but it performs pretty well on some demo sklearn data sets: https://gist.github.com/petehunt/724eeb77189332db101ad7b0db8...




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: