While I really like their lightweight, SQL-like protocol instead of Elasticsearch's fat JSON, I really think that this project could have much more impact if it could be a drop-in replacement for ES.
Even if it offers only a fraction of the features offered by ES, that may be fair enough for at least half of the use-cases out there.
Sonic could have really had a strong selling point: "Use an ES-alternative that works fine in most of the real-world applications, but it's written in Rust and it only takes a fraction of the memory footprint required by ES, and it shouldn't require you to change your application code".
Instead, they are proposing yet another search protocol, that developers have to learn and adopt. That definitely increases the adoption barriers.
Since Elastic spitefully patched all of their client libraries to fail if the server is not a "genuine" ES server, I don't see what good a drop-in replacement with protocol compatibility would do.
Those libraries are open source, just nuke those restrictions and you're good to go. Is it the best way? Maybe not, but it's better than modifying your server responses (and, in the worst 1984 case, allowing Elastic to sue you). If you develop such a tool you can always put that distinction in your README.
I don't see how they can legally have any control over what a 3rd party's software outputs. And more importantly, how would they even enforce such restrictions?
same thing nintendo did on old consoles: the "DRM" was requiring the cartridge to display a nintendo logo at start-up. so if you included that, you could get sued.
my guess is because it's "X-Elastic-Product" you'd be falsely "saying" your product is made by Elastic, the company so they could sue over it.
A trademark does not forbid people from using a name, it only restricts how it can be used in marketing. I do not see how that would be applicable here.
Wow it's weird that this comes up, I'm actually running a site I am going to repost to HN today that I want to use as a testbed for search engines (kind of like an extension to my recent collaboration with supabase[0]).
Right now I've got the site going on just Postgres FTS + trigram and it's pretty darn fast, looks like I need to test sonic too.
Going to burn some midnight oil (in my timezone, anyway) and get it out -- though sonic isn't implemented yet!
Anyway to make this comment useful to people, here's my short list of engines that I want to run in parallel:
There isn't enough out there comparing all these for the simple typical fuzzy search/search box usecase, so I'm adapting a little podcast search site I made to try and use all of these at the same time. So far only Postgres though, will try and add Meilisearch today and post it!
Like other people are pointing out, most of these engines won't have all the features of ES (or more accurately Lucene) but I am pretty convinced that most of the time it doesn't actually matter and if someone is searching on your site excessively maybe there's a problem with your UX (unless you're a search engine or repository of information).
> and if someone is searching on your site excessively maybe there's a problem with your UX (unless you're a search engine or repository of information).
I don't understand this comment. Why would you search something that *isn't*, in some senses, a repository of information? I would say almost every website needs to have search in some sense, and it's *because* sites function as a repository of information that they need this search. Think about e.g. Stripe's documentation, or Github's repository / code search. HN is also another great example—I search for stories or comments all the time to try and remember something I read about recently or heard about last week, but couldn't quite remember. I'm hard-pressed to think of a web site I use regularly that *shouldn't* have full-text search, if I'm being honest.
I don't consider use cases like documentation a "repository" of information, but maybe this is just me phrasing it badly. In the literal sense sure it is, but when I think of a "repository of information" I think of wikipedia, amazon search items, etc.
The scale of a documentation site is a very different problem -- you can brute force it in ways that you can't at larger scales.
I agree that HN would be a case of the large repository, but even then what most people want out of HN search is pretty simple/basic keyword search. I think a decent non-frustrating HN search feature could be very basic and get by without most of the advanced features/rabbit holes available in search.
Basically I think most apps fall into the lighter search use case -- command palettes, search inside of apps with a small scale of information, etc.
My comment wasn't that apps shouldn't have full text search -- it was that most that have full text search don't need complex full text search with all the bells and whistles that lucene and other serious search engines provide. These up-and-comers might be enough for a bunch of apps for which search is not the main feature.
I think the sibling comment on this thread kind of disproves this to some extent. Sure, some simple in-memory search library might be good enough for a command palette or for "apps", but the fact that even tiny docs and "marketing" sites that try to implement site search are almost always outdone by google "site:example.com query" really goes to show how much value a full stemming / synonym clustering / syntax normalizing search engine can bring.
Are you aware of any that can be used client side like Lyra and supports faceted search?
I've been looking for a solution and cannot find it, even an algorithm and/or a data structure can be helpful. I attempted coming up with a solution myself but ended up with frustration when it came to making the facets dynamic and update as other filters are applied.
I read a couple of papers and one stood out [0], which introduces category theory as a solution to faceted filtering. I understood it in theory, but it still does not seem straightforward to implement, and I haven't attempted it yet.
There are some others but I can't find them at this moment -- a bunch of the other projects I find are somewhat abandoned, lunr is actually on my list of things to use (because it makes the most sense to just ship a pre-built index with the first like... 5 letters maybe of typeahead, no matter how fast the backend is)
For many years now I've had a small search engine project in my free-time pipeline, but I haven't even gotten to the crawling part yet, and I intend to sit down with the searching part after some of that.
If you decide to add Manticore Search to the list feel free to ping me at sergey@manticoresearch.com if you need help with preparing the ingestion scripts etc.
Oh! Damn it I forgot about manticore -- I had seen it before but forgot to include it.
Eventually all of these projects will be highlighted on Awesome F/OSS (https://awsmfoss.com), but for now I'm just going to dump my bookmarks here for other people, since I'm leaving awesome projects out:
There is no way I can get to running all of these (this project was supposed to be quick!!), but I will run the ones I noted earlier, and probably manticore too since it was high on my list since it's quite polished looking.
Xapian is great, especially when you need a C/C++ library rather than a separate service. Kinda like an sqlite for search. Some of my favorite tools like notmuch and recoll use it.
Yup this is exactly why I started Awesome F/OSS (https://awsmfoss.com) — I have similar lists for a lot of software and just need to make sure they’re somewhere else other than just my browser!
I've written a full-text search engine as well. I don't tout it as a replacement for Elasticsearch, but it does have a few advantages: it's fast; supports HTML documents; supports Polish inflection (via a full-blown morphological dictionary, not just a stemmer); and has a very compact on-disk format (pre-parsed HTML trees, Huffman-encoded over large alphabets). Oh, and it's 100% Clojure.
It is open-source already (MIT)! I just need to make other languages more easily pluggable, and factor out the search engine so that it can be used on its own. :)
>Also, Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)
Does this mean that it only ever finds at most N documents per word? Even searches for "A and B" would probably not find everything, even if less than N documents contain A and B, because they might have been removed with the sliding window already for A or B alone. Is that correct?
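To make the concern concrete, here is a toy model of a per-word sliding window. It is only a sketch under assumptions: the `WindowedIndex` class and its methods are hypothetical, and it assumes an "A and B" search intersects whatever each word's window still retains. Sonic's actual internals may work differently.

```python
from collections import defaultdict, deque


class WindowedIndex:
    """Toy inverted index keeping only the N most recent doc IDs per word."""

    def __init__(self, window):
        # Each word maps to a bounded deque; old IDs fall off the left side.
        self.postings = defaultdict(lambda: deque(maxlen=window))

    def push(self, doc_id, text):
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)

    def search_all(self, *words):
        """Return doc IDs still retained in the window of every given word."""
        sets = [set(self.postings.get(w.lower(), ())) for w in words]
        return set.intersection(*sets) if sets else set()


index = WindowedIndex(window=2)
index.push(1, "rust search")          # the only document containing both words
index.push(2, "rust compiler")
index.push(3, "rust borrow checker")  # evicts doc 1 from the "rust" window

both = index.search_all("rust", "search")
```

In this model `both` comes back empty even though only one document contains both words, which is fewer than the window size of 2. That is exactly the failure mode the question describes.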
Every time someone comes up with an alternative to a software behemoth like Elasticsearch, what they actually mean is: "An alternative to the 10% of functionality of $tool that are interesting to me".
This is surely an impressive engineering feat, but hardly a replacement for the myriad of query possibilities Elasticsearch offers.
Agreed, they do a good job of hedging it. I think OP was probably pre-empting the usual comments along the lines of "yep, $tool is super bloated, $smallerTool proves that those other guys building $tool are bad engineers."
To be fair, there is often a better reason to replace only a portion of ES' functionality, since doing so can save a lot of computation and space, than to replace ES itself, since it already exists and does a good job if what you need is the full kit.
I found myself last week reimplementing 10% of RoaringBitmap's functionality as a homebrew replacement, because doing so was 500% faster. Not that RB isn't great, but it's designed for a general problem space, and not my particular problem.
I don't think your opinion is wrong, but I do think ElasticSearch has a lot of features that many people consider bloat depending on their work, and scaling and doing general dev ops for ES can be an absolute slog. Light weight alternatives that cut down to a set of core features for some niche seem like a good idea to me.
It's totally fine that many people consider stuff bloat, but other people don't. I've built a highly specialised search engine for manufacturing companies on top of Elasticsearch, and I decidedly need vector queries, TF-IDF queries, geospatial range queries, and heaps of other, niche features you probably never used before.
Having a lightweight search engine is fine, but calling it an alternative to Elasticsearch is not doing either justice.
That's very assumptive of you. I have in fact used most of those features, and note I said their opinion was not wrong. In their readme they said it's a replacement for some use cases which is upfront and fine.
Vector queries aren't niche, Elastic however only tacked on a proper (non HNSW) implementation in the last year and a half. Geospatial isn't niche, anyone working with location data will work with those queries. TF-IDF is a basic ranking algo / signal.
Maybe Elasticsearch is good for you because they have all their features in aggregate. But I can name a tool that focuses specifically on each area and query type and is better for that specific subset of functionality.
So my point still stands, if all you need are specific features Elastic is too much. You need all of it and that's fine too.
I mean, Sonic doesn't store term frequency at all, so it can't do TF-IDF. It probably doesn't want to. If you need any ranking other than latest first, Sonic is not for you.
> [...] an alternative to a software behemoth like Elasticsearch, what they actually mean is: "An alternative to the 10% of functionality of $tool that are interesting to me"
Which is perfectly fine. A lot of tools become so general and bloated, that there are large groups that would be fine with many different 10% subsets of their features...
Kind of like how I don't need MS Word or OpenOffice Write, any simple text editing program with a few basic features (like printing, bold/italics, and word count) will do for my needs...
I'm not opposed to that, however, the chance of their 10% and my 10% overlapping is rather slim. Just like you only need basic formatting, and I require footnotes in my documents. Nothing wrong with either, but I'd be upset if you tried to sell me GEdit as a replacement for OpenOffice Write.
Honestly most of the "alternative to" programs do not meet the expectations they set by dropping a big known name. So much so, that I think people are doing FOSS a disservice by comparing to those they can't meaningfully overtake.
The only exceptions could be small single feature utilities.
To me it seems the “alternative to” part is more damaging in that sense than dropping a big name. The name is used to put a complicated piece of software in a context many people are familiar with. The same thing happens with the “Tinder/Uber/Airbnb for <x>…” type of services.
The friction is introduced where it’s not made crystal clear how it’s similar, and which concepts are different or missing altogether. Then it will cause unmet expectations.
Perhaps it’s better to say “inspired by …” or “similar to …” to make a more precise statement.
That is exactly what they are, and I don't think they hide it?! So, I don't know what the issue is. This is the kind of innovation that keeps us moving forward.
Projects like Meili Search are already coming for Elastic Search's lunch: https://www.meilisearch.com. I think there is a market for fast, light weight alternatives like Meili that offer up a fully featured open source experience.
With Elastic Search many of the features, security being one, are locked away behind commercial licenses. With Meili it seems they are, for the time being anyway, going with a proper open source version. I understand Elastic needs to earn money, and I get their licensing model to accomplish this. But Meili will probably steal away a good portion of customers interested in self hosting their search solution.
I’m not sure what this competitor will be but > ES will have the following properties:
- written in rust or maybe just C
- extremely lightweight and high performance
- single small binary that runs anywhere
- designed to run in Kubernetes from the ground up
- scales dynamically up/down
- zero downtime upgrades
- rigorous security built into the core offering
- fully open source
- wire compatibility with ES
I hope that ES themselves do this. There are pretty significant barriers to creating a serious competitor to ES (unlike something like MongoDB for example which seems to have a very limited role in the future).
and almost all of them won't offer an ES-compatible API. Off the top of my head I can think of manticore (https://manticoresearch.com/), which offers at least a subset of the elasticsearch API
well, and how do you solve excessive ram usage for a search engine? Generally you write the indexes/search trees to disk, which may or may not be ideal.
While I get the wants-and-needs since ElasticSearch has a voracious appetite for RAM, I get the feeling most people think search engines are a simple thing where you can just import some lib, fool around for a bit and call it a day.
The truth is that ElasticSearch/Solr/Lucene is orders of magnitude more complex and powerful than these "alternatives". All this is mostly fine as long as everyone is on the same page regarding the expectations.
Most people don't need ElasticSearch for their use cases on the surface, but I feel they expect top-notch mind-reading results and that requires something like ElasticSearch and someone who knows the field.
Having said all of that, Meilisearch and this are quite fine.
I migrated from typesense to Meilisearch on a project after I found it had much better search accuracy. I can't exactly explain why, but overall Meilisearch results feel more relevant by default.
There are actually benchmarks that allow measuring search relevancy objectively, e.g. BEIR[1]. The Manticore Search team made an effort to submit a PR to add it to the list. The results are here [2]. Unfortunately the BEIR team seems to be too busy to review a whole pile of PRs, including one about Vespa. Nevertheless it would be nice to have both Meilisearch and Typesense there too, since it would be interesting to see how those non-tf-idf based search engines perform compared to BM25-based and vector search engines.
I migrated in April 2021 (latest version of typesense & meilisearch at that time).
I don't have a public dataset, as it was a fairly large ecommerce catalog with close to 500k entries. And again, it was just my own perception, which is hard to define. I just found that Typesense was a bit off compared to Meilisearch on search accuracy, and of course that could be totally different today with a more recent release.
Got it, thank you for sharing that. Typesense was at v0.19.0 around that time. Two prominent issues we had in that version were how we handled matches across multiple fields and how we handled "keyword stuffing".
We're now at v0.24.rc, and we've iterated quite a lot on improving relevancy since then, as more users shared their datasets with us and gave us feedback over the last 1.5 years.
If you get a chance to try out Typesense again in the future, I'd love to hear how relevance feels with the latest version, out of the box for your dataset.
You need a dataset and an evaluation metric. The usual evaluation metric is NDCG (Normalized Discounted Cumulative Gain): https://en.wikipedia.org/wiki/NDCG
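For anyone who wants to try this on their own data, a minimal sketch of NDCG follows. It uses the linear-gain form of DCG and, as a simplifying assumption, computes the ideal ordering from the same list of judged results. The function names and the example relevance grades are made up for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each grade is discounted by log2(rank + 1)."""
    # enumerate() is 0-based, so rank 1 gets log2(2) = 1 (no discount).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the returned order divided by the DCG of the best possible order."""
    k = len(relevances) if k is None else k
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance grades (0 = irrelevant, 3 = perfect) in the order the engine returned them.
perfect_order = ndcg([3, 2, 1, 0])
reversed_order = ndcg([0, 1, 2, 3])
```

A ranking that returns the best documents first scores 1.0, and anything that buries them scores lower, so two engines can be compared on the same queries and judgments.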
I think the upshot is that if you have no idea what all the advanced features of ES even are then you probably don't need ES because it's not turnkey.
If you utter the phrase "I just want search" then it really is a matter of just using one of these lightweight projects and libs because your needs are simple.
There is also the other important thing: the "search engine" in elasticsearch is just "searching the content of documents", not more.
It won't show you which one is "best" (for a given value of "best"), just one that looks most similar to the input.
Trying to index anything that can contain any trace of SEO would be doomed to failure. It also won't tell you which of the sites got linked the most, or do the million other things web search engines do to give good results.
In the "just put documents in a DB and search them" case it is barely enough to look through a corporate knowledge database, and it still won't get nuances like "this page is linked from 20 other pages, maybe it should be higher?"
* We imported ~1,000,000 messages of dynamic length (some very long, eg. emails);
* Once imported, the search index weights 20MB (KV) + 1.4MB (FST) on disk;
This is almost unbelievably succinct! If you encode the document features into 8 bits per document, and thus completely forego the need to store the document ID by indexing them implicitly, that alone is 1 MB.
Getting meaningful search out of on average 21 bytes per document is seriously impressive.
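The back-of-the-envelope numbers above can be checked in a few lines. This assumes decimal megabytes (1 MB = 10^6 bytes) and takes the document count as exactly 1,000,000, although the README figure is approximate.

```python
MB = 1_000_000  # assumption: decimal megabytes

documents = 1_000_000
index_bytes = 20 * MB + 1_400_000  # 20MB (KV) + 1.4MB (FST)

# Average on-disk index size per imported message.
bytes_per_document = index_bytes / documents

# Size of a hypothetical array with 8 bits (1 byte) per document.
one_byte_per_document_mb = documents * 1 / MB
```

That works out to about 21.4 bytes per document, and 1 MB for the implicit 8-bits-per-document array.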
Wonder if this has anything to do with the sliding window:
> Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)
Default window looks like 1k documents. I read this as saying that super common words are basically dropped from the index (only 1k out of many thousands of docs retained), but I don’t know enough about the internals to be sure. Not sure if this actually hurts search results in practice, seems like an ok trade off for help docs at least.
It's definitely a great trade-off to make for efficiency, but it makes it inherently unusable for most of Elasticsearch's use cases.
Looking at it from a practical example such as log search (almost everyone I know has used kibana/logstash/elasticsearch at some point): you'd be able to search for things like tracingId/requestId but adding more filters such as logLevel, requestType or serviceName would be impossible
It has its niche, but calling it an elasticsearch alternative really is a stretch
I wonder how easy it would be to change "most recently pushed" to something like a redis sorted set where each document has a score and only the top N results are retained when sorted by their separate score value? That would allow you to sort by pageviews / popularity in a more useful way. But it fails entirely when looking for uncommon intersections of common words, which feels like it makes it useless for most actual full-text search use-cases :(
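A rough sketch of the "keep the top N by score" idea, using a bounded min-heap per word instead of a real Redis sorted set. The `TopNPostings` class is hypothetical and has nothing to do with Sonic's code. Unlike a sorted set, it does not handle re-scoring a document that is already stored.

```python
import heapq
from collections import defaultdict


class TopNPostings:
    """Per-word postings that retain only the N highest-scoring documents."""

    def __init__(self, n):
        self.n = n
        self.heaps = defaultdict(list)  # word -> min-heap of (score, doc_id)

    def push(self, word, doc_id, score):
        heap = self.heaps[word]
        if len(heap) < self.n:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            # The new entry beats the current lowest one, so swap it in.
            heapq.heapreplace(heap, (score, doc_id))

    def top(self, word):
        """Doc IDs for a word, highest score first."""
        return [doc_id for _, doc_id in sorted(self.heaps.get(word, []), reverse=True)]


postings = TopNPostings(n=2)
postings.push("rust", doc_id=1, score=10)   # e.g. pageviews
postings.push("rust", doc_id=2, score=500)
postings.push("rust", doc_id=3, score=50)   # evicts doc 1 (lowest score)
postings.push("rust", doc_id=4, score=5)    # too low, never stored

popular = postings.top("rust")
```

Here `popular` ends up as docs 2 and 3. The intersection problem is untouched: a document dropped from one word's top N still cannot be found through that word.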
Long ago I was searching for lightweight search engines that could run on the Edge, as ElasticSearch, while very popular, is also quite heavy and relies on the Lucene/JVM.
Apart from Sonic, I also found Tantivy [1] and Meilisearch [2]... all delightfully made in Rust.
My favorite, and the closest one to ElasticSearch (for its features) is probably Tantivy.
I'd recommend anyone to check out these three projects and choose what best fits your needs... it's awesome to see that more projects are becoming available by the day!
I've looked up tantivy and quickwit. Quickwit uses tantivy as the engine.
It has decoupled storage (awesome, only recently elastic announced something comparable) but is oriented towards log processing and explicitly warns against its use to power a user-facing site search.
Do you happen to know if there's anything like that with the same minimal footprint that can scale up and, importantly, down to serve the needs of highly variable traffic websites?
Right now I'm looking at something with clustering capabilities and decoupled storage (e.g on s3) like quickwit
One of the reasons for not using Quickwit for user facing search is the latency: for example, you pay 70ms of latency when you make a request on AWS S3... and generally you expect latency below that figure.
Decoupling compute and storage while keeping a very low latency may then be impossible, unless you end up caching all your data on disk :).
You can have a look at lnx (https://lnx.rs/) that is based on tantivy and is performing quite well. It's not yet distributed but the author Chillfish8 has some thoughts about how to do it.
Another interesting alternative: https://github.com/meilisearch/meilisearch - I'm using it in one of my (small) projects and I had a good experience with it, also very helpful community.
This is not a direct alternative to ElasticSearch. Tantivy is closer to an alternative to ElasticSearch since ES is built on top of Lucene. An alternative could be achieved if built on top of Tantivy.
Sonic here only returns document identifiers so you will never be able to get document information back. This is very useful though if all you want to do is index text data and then get the stored information from another data store.
I believe you must have misread what you quoted, because whatever point you're trying to make doesn't really follow from what you quoted.
They let you configure the number of expected results to cache for a given query; the number of cached results is configurable based on your use-case for the results (e.g. if your website only lists 100 results, don't store beyond that).
If more results than that for a given query are returned then they disregard additional results since you told it you won't make use of them. In essence, they're saving you from caching results that you'll never consume.
How you got from this to "just use /dev/null" is a mystery to me. It has to be a misread or misunderstanding.
The readme doesn't offer enough information to accept that it can be an alternative to elasticsearch. From what I can gather by skimming the information, it can only do word level matching and that it isn't some form of TF-IDF type index (as is Lucene, which stands behind Solr/ElasticSearch).
Nice.
I have done some tests with SQLite, and I find its index module very interesting, also because it offers stemming, which seems to be missing here: am I wrong?
SQLite has stemming only for English out-of-the-box, but I consider stemming a must for a good ES drop-in replacement.
It is great and works, but sonic has broader applications (I found it because it was actually being used as a way to index an existing SQLite database that pointed to file storage).
Just another plug for Lucene or the library route. I had a simple use case to offer a search/autocomplete API for the employee directory of ~50,000 records. The source of truth was only updated once a day. We ran a job that reindexed daily and published the index as a file (< 15 megabytes) to where the service could access it.
That service worked beautifully. Results were returned in 10-20ms and we only ever made software updates to handle the occasional CVE. It did, however, take quite a bit of fiddling initially to get the query results to match the user expectations. For example, weighting first vs last vs full name.
One of the features I like in ES that I haven’t seen in alternatives is “Percolate queries” (queries where you feed the service a document and it returns a list of queries that you’ve indexed that would match that document - basically inverting the whole process).
Does anyone know of any alternatives that support this use case?
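For anyone unfamiliar with the idea, the inverted process described above can be sketched as follows. This is a brute-force illustration that only supports "all of these terms" queries. The `Percolator` class is hypothetical and is not how Elasticsearch implements percolation.

```python
class Percolator:
    """Stores queries up front, then matches incoming documents against them."""

    def __init__(self):
        self.queries = {}  # query_id -> frozenset of required terms

    def register(self, query_id, terms):
        self.queries[query_id] = frozenset(t.lower() for t in terms)

    def percolate(self, document):
        """Return the IDs of all stored queries that this document satisfies."""
        words = set(document.lower().split())
        return sorted(qid for qid, terms in self.queries.items() if terms <= words)


alerts = Percolator()
alerts.register("rust-search", ["rust", "search"])
alerts.register("lucene", ["lucene"])
alerts.register("es-memory", ["elasticsearch", "memory"])

matched = alerts.percolate("Sonic is a search backend written in Rust")
```

The document is fed in and `matched` lists only the `rust-search` query, which is the typical shape of an alerting or saved-search feature.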
I am not sure if it can be called an "alternative". ElasticSearch has thousands of features and settings while this library seems to be just a simple inverted index implementation only for text search.
By the way, if you are looking for a lightweight "alternative" to ElasticSearch you might look at the sphinx search engine (although it doesn't have as many features as ES and it has been closed-source since version 3.0).
Using a 32-bit ID is an interesting choice. It means you can only index about 4 billion (2^32) objects per bucket. I wonder if using a varint encoding would give you even more savings while handling > 4 billion documents, at the cost of slightly more expensive serialization/deserialization (which should be negligible in the grand scheme of everything else being done).
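For reference, a minimal sketch of one common varint scheme (LEB128-style, 7 payload bits per byte, high bit set when more bytes follow). Whether this particular layout would suit Sonic's storage format is an assumption.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("varint encoding only supports non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data):
    """Decode one varint from the start of data; return (value, bytes_consumed)."""
    result = 0
    shift = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, i + 1
        shift += 7
    raise ValueError("truncated varint")


small = encode_varint(300)         # 2 bytes instead of a fixed 4
large = encode_varint(2**40)       # 6 bytes, beyond what a 32-bit ID can hold
value, consumed = decode_varint(large)
```

One caveat: IDs close to 2^32 take 5 bytes in this scheme instead of 4, so the savings depend on most IDs being small.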
I don't see the use case for Sonic. It doesn't support clusters, which is where ES really shines, and if footprint for a local instance is a concern I'd go for SQLite.
There are already millions of solutions for having entries in an inverted index that you can query within ms; none of them have the power of ES in terms of features, scaling and HA.
Does anyone have any recommendations of books or other resources that go over the theory behind full-text search? i.e. language processing, data encoding, on-disk storage and retrieval, etc.
You might want to check Redis-Stack - https://redis.io/docs/stack.
It's a stack on top of Redis, which comes bundled with RedisTimeSeries, RediSearch, and RedisJSON (also includes RedisGraph and RedisBloom).
About 2 weeks ago, I was searching for an alternative to Elastic for this exact use case. Funny how the world works, now I have my answer: "someone has built it".
You can have a look at Quickwit (https://quickwit.io), it's a search engine made for logs :). It's still pretty young and... there are way less features than in ES.