Major props to the authors of this library. I re-built https://progscrape.com [1] on top of it last year, replacing an ancient Python2 AppEngine codebase that I had neglected for a while. It's a great library and insanely fast, as in indexing the entire library of 1M stories on a Raspberry Pi in seconds.
I'm able to host a service on a Pi at home with full-text search and a regular peak load of a few rps (not much, admittedly), with a CPU that barely spikes above a few percent. I've load tested searches on the Pi up to ~100rps and it held up. I keep thinking I should write up my experiences with it. It was pretty much a drop-in, super-useful library and the team was very responsive with bug reports, of which there were very few.
If you want to see how responsive the search is on such a small device, try clicking the labels on each story -- it's virtually instantaneous to query, and this is hitting up to 10 years * 12 months of search shards! https://progscrape.com/?search=javascript
I'd recommend looking at it over Lucene for modern projects. I am a big fan, as you might be able to tell. Given how well it scales on a tiny little ARM64, I'd wager your experiences on bigger iron will be even more fantastic.
It is a very nice library. I’m using it for a very work in progress incremental email backup CLI tool for email providers using JMAP.
I wanted users to be able to search their backups. As I’m using Rust Tantivy looked like just the right thing for the job. Indexing happens so fast for an email I did not bother to move the work to a separate thread. And search across thousands of emails seems to be no problem.
If anyone wants search for their Rust application they should take a look at Tantivy.
AFAIK, PostgreSQL doesn't provide a way to get the IDF of a term, which makes its ranking function pretty limited. tf-idf (and its varians, like Okapi BM25) are kinda table stakes for an information retrieval system IMO.
I'm not saying PostgreSQL's functionality is useless, but if you need ranking based on the relative frequency of a term in a corpus, then I don't believe PostgreSQL can handle that unless something has changed in the last few years. Usually the reason to use something like Lucene or Tantivy is precisely for its ranking support that incorporates inverse document frequency.
Postgres's FTS is actually quite solid! You can get very far with just the built-in tsvector. The ranking could be improved, though, which was one of the reasons for creating pg_search in the first place: https://github.com/paradedb/paradedb/tree/dev/pg_search (disclaimer: I work on pg_search @ ParadeDB)
Okay, but I didn't say it wasn't solid. I just said its ranking wasn't great because it lacks IDFs. It seems like we must be in violent agreement, given that you work on something that must be adding IDFs to PostgreSQL FTS. :P
I'm able to host a service on a Pi at home with full-text search and a regular peak load of a few rps (not much, admittedly), with a CPU that barely spikes above a few percent. I've load tested searches on the Pi up to ~100rps and it held up. I keep thinking I should write up my experiences with it. It was pretty much a drop-in, super-useful library and the team was very responsive with bug reports, of which there were very few.
If you want to see how responsive the search is on such a small device, try clicking the labels on each story -- it's virtually instantaneous to query, and this is hitting up to 10 years * 12 months of search shards! https://progscrape.com/?search=javascript
I'd recommend looking at it over Lucene for modern projects. I am a big fan, as you might be able to tell. Given how well it scales on a tiny little ARM64, I'd wager your experiences on bigger iron will be even more fantastic.
[1] https://github.com/progscrape/progscrape