Sonic: Fast, lightweight and schema-less search backend (github.com/valeriansaliou)
575 points by rcarmo on Oct 24, 2022 | 155 comments



While I really like their lightweight, SQL-like protocol instead of Elasticsearch's fat JSON, I really think that this project could have much more impact if it could be a drop-in replacement for ES.

Even if it offers only a fraction of the features offered by ES, that may be fair enough for at least half of the use-cases out there.

Sonic could have really had a strong selling point: "Use an ES-alternative that works fine in most of the real-world applications, but it's written in Rust and it only takes a fraction of the memory footprint required by ES, and it shouldn't require you to change your application code".

Instead, they are proposing yet another search protocol, that developers have to learn and adopt. That definitely increases the adoption barriers.


Since Elastic spitefully patched all of their client libraries to fail if the server is not a "genuine" ES server, I don't see what good a drop-in replacement with protocol compatibility would do.

Go client: https://github.com/elastic/go-elasticsearch/blob/3985f2a1554...

Python client: https://github.com/elastic/elasticsearch-py/commit/e72aa3e24...


Is it prohibited to include `X-Elastic-Product: Elasticsearch` in the output of your server if the user instructs the server to do so? :)


Those libraries are open source: just nuke those restrictions and you're good to go. Is it the best way? Maybe not, but it's better than modifying your server responses (and, in the worst 1984 case, allowing Elastic to sue you). If you develop such a tool, you can always put that distinction in your README.


I don't see how they can legally have any control over what a 3rd party's software outputs. And more importantly, how would they even enforce such restrictions?


I imagine AWS can't put it on the headers of their managed service, and that's what it's about.


AWS has a compatibility mode that allows your cluster to report its version as 7.10.


same thing nintendo did on old consoles: the "DRM" was requiring the cartridge to display a nintendo logo at start-up. so if you included that, you could get sued.

my guess is because it's "X-Elastic-Product" you'd be falsely "saying" your product is made by Elastic, the company so they could sue over it.


I believe Elasticsearch is a trademark.


If it really does work this way, then we're all doomed.

https://stackoverflow.com/questions/1114254/why-do-all-brows...


If its use is so widespread, then it's probably no longer a trademark, in that context at least.


A trademark does not forbid people from using a name, it only restricts how it can be used in marketing. I do not see how that would be applicable here.


Are HTTP headers important or even relevant at all for branding trademark purposes?

Such a concern seems utterly ridiculous.


ElasticSearch is so much more than search. Sonic is very minimal in comparison, so a drop in replacement doesn't work here.

But yes, Sonic could replace lots of use cases.


Although not exactly the same, Elastic has an SQL query syntax which can be used now as well.

https://www.elastic.co/what-is/elasticsearch-sql


They’ve had it for a long time already. Was using it over a year and a half ago. Not really “new”


In fairness, the comment you're replying to didn't say it was new.


It's probably fairly easy to write an adapter here.


Wow it's weird that this comes up, I'm actually running a site I am going to repost to HN today that I want to use as a testbed for search engines (kind of like an extension to my recent collaboration with supabase[0]).

Right now I've got the site going on just Postgres FTS + trigram and it's pretty darn fast, looks like I need to test sonic too.

Going to burn some midnight oil (in my timezone, anyway) and get it out -- though sonic isn't implemented yet!

Anyway to make this comment useful to people, here's my short list of engines that I want to run in parallel:

- MeiliSearch (https://github.com/meilisearch/MeiliSearch)

- TypeSense (https://github.com/typesense/typesense)

- Lyra (https://github.com/LyraSearch/lyra)

- OpenSearch (https://github.com/opensearch-project/OpenSearch)

- ZincSearch (https://github.com/prabhatsharma/zinc)

- Sonic (https://github.com/valeriansaliou/sonic)

There isn't enough out there comparing all these for the simple typical fuzzy search/search box usecase, so I'm adapting a little podcast search site I made to try and use all of these at the same time. So far only Postgres though, will try and add Meilisearch today and post it!

Like other people are pointing out, most of these engines won't have all the features of ES (or more accurately Lucene) but I am pretty convinced that most of the time it doesn't actually matter and if someone is searching on your site excessively maybe there's a problem with your UX (unless you're a search engine or repository of information).

[0]: https://supabase.com/blog/postgres-full-text-search-vs-the-r...


> and if someone is searching on your site excessively maybe there's a problem with your UX (unless you're a search engine or repository of information).

I don't understand this comment. Why would you search something that *isn't*, in some senses, a repository of information? I would say almost every website needs to have search in some sense, and it's *because* sites function as a repository of information that they need this search. Think about e.g. Stripe's documentation, or Github's repository / code search. HN is also another great example—I search for stories or comments all the time to try and remember something I read about recently or heard about last week, but couldn't quite remember. I'm hard-pressed to think of a web site I use regularly that *shouldn't* have full-text search, if I'm being honest.


Most site searches are basically unusable. Either it isn't very good, is painfully slow, or both.

Just googling site:foo.com/baz <query> almost always produces better results.


I don't consider use cases like documentation a "repository" of information, but maybe this is just me phrasing it badly. In the literal sense sure it is, but when I think of a "repository of information" I think of wikipedia, amazon search items, etc.

The scale of a documentation site is a very different problem -- you can brute force it in ways that you can't at larger scales.

I agree that HN would be a case of the large repository, but even then what most people want out of HN search is pretty simple/basic keyword search. I think a decent non-frustrating HN search feature could be very basic and get by without most of the advanced features/rabbit holes available in search.

Basically I think most apps fall into the lighter search use case -- command palettes, search inside of apps with a small scale of information, etc.

My comment wasn't that apps shouldn't have full text search -- it was that most that have full text search don't need complex full text search with all the bells and whistles that lucene and other serious search engines provide. These up-and-comers might be enough for a bunch of apps for which search is not the main feature.


I think the sibling comment on this thread kind of disproves this to some extent. Sure, some simple in-memory search library might be good enough for a command palette or for "apps", but the fact that even tiny docs and "marketing" sites that try to implement site search are almost always outdone by google "site:example.com query" really goes to show how much value a full stemming / synonym clustering / syntax normalizing search engine can bring.


Hey that's a great list of tools.

Are you aware of any that can be used client side like Lyra and supports faceted search?

I've been looking for a solution and cannot find it, even an algorithm and/or a data structure can be helpful. I attempted coming up with a solution myself but ended up with frustration when it came to making the facets dynamic and update as other filters are applied.

I read a couple of papers and one stood out [0], which introduces category theory as a solution to faceted filtering. I understood it in theory, and it still does not seem straightforward to implement, but I haven't attempted it yet.

0. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5145200/#!po=28...
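For what it's worth, the "facets update as other filters are applied" part can be sketched with plain brute force. This is not the category-theory approach from the paper, just one common convention: each facet's counts are computed over the items that match every *other* active filter. The function names and the item shape (flat dicts) are assumptions made up for this example.

```python
from collections import Counter

def matches(item, filters, skip=None):
    """True if the item satisfies every active filter except `skip`."""
    return all(
        item.get(facet) in allowed
        for facet, allowed in filters.items()
        if facet != skip
    )

def facet_counts(items, facets, filters):
    """For each facet, count values among items matching the other filters."""
    return {
        facet: Counter(
            item[facet]
            for item in items
            if facet in item and matches(item, filters, skip=facet)
        )
        for facet in facets
    }

items = [
    {"color": "red", "size": "S"},
    {"color": "red", "size": "M"},
    {"color": "blue", "size": "M"},
]

# With color=red selected, the size counts shrink to red items only,
# while the color facet still shows what the other colors would give.
counts = facet_counts(items, ["color", "size"], {"color": {"red"}})
print(counts["color"])
print(counts["size"])
```

This rescans every item per facet, so it is only reasonable for small client-side datasets; anything bigger would presumably want per-value ID sets or bitmaps to intersect instead.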


So for client-side search, I generally know of Lunr.js:

https://lunrjs.com/docs/index.html

There are some others but I can't find them at this moment -- a bunch of the other projects I find are somewhat abandoned, lunr is actually on my list of things to use (because it makes the most sense to just ship a pre-built index with the first like... 5 letters maybe of typeahead, no matter how fast the backend is)


Thanks for the link. This unfortunately is not what I am looking for. Faceted filters are a different beast.


Thank you for this comparison. I would also like to know how Bleve Search (https://github.com/blevesearch/bleve) turns out.

For many years now I have had a small search engine project in my free-time pipeline, but I haven't even started on crawling yet, and I intend to sit down to the searching part after some of that.


You're right I should put bleve on there as well. This isn't even the whole list. Toshi (https://github.com/toshi-search/Toshi) is also out there...


If you decide to add Manticore Search to the list feel free to ping me at sergey@manticoresearch.com if you need help with preparing the ingestion scripts etc.


Oh! Damn it I forgot about manticore -- I had seen it before but forgot to include it.

Eventually all of these projects will be highlighted on Awesome F/OSS (https://awsmfoss.com), but for now I'm just going to dump my bookmarks here for other people, since I'm leaving awesome projects out:

Search Engines

AWS OpenSearch https://github.com/opensearch-project/OpenSearch

https://github.com/opensearch-project/OpenSearch-Dashboards

https://github.com/opensearch-project/perftop

https://github.com/go-ego/riot

https://groonga.org/ https://github.com/groonga/groonga

https://github.com/meilisearch/MeiliSearch

https://github.com/mosuka/bayard

https://github.com/nezaboodka/nevod

https://github.com/searx/searx

https://github.com/stryku/okon

https://github.com/toshi-search/Toshi

https://github.com/typesense/typesense

https://github.com/valeriansaliou/sonic

Algolia

https://github.com/marconi1992/algolite

https://quickwit.io/

https://github.com/quickwit-inc/quickwit

https://docs.meilisearch.com/

https://github.com/prabhatsharma/zinc

phalanx https://github.com/blugelabs/bluge https://github.com/mosuka/phalanx https://github.com/mosuka/blast

ManticoreSearch

https://github.com/manticoresoftware/manticoresearch

https://github.com/manticoresoftware/docker

https://manticoresearch.com/blog/manticore-alternative-to-el...

https://manticoresearch.com/

https://manual.manticoresearch.com/Introduction

https://forum.manticoresearch.com/t/manticore-search-cheatsh...

https://forum.manticoresearch.com/

Whoosh https://whoosh.readthedocs.io/en/latest/

https://pypi.org/project/Whoosh/

lyra https://github.com/nearform/lyra

https://nearform.github.io/lyra/

https://github.com/LyraSearch/lyra

https://lyrasearch.io/

flexsearch

https://github.com/nextapps-de/flexsearch#performance-benchm...

https://pagefind.app/docs/

Lucene

https://github.com/apache/lucene

https://lucene.apache.org/

ZincSearch

https://zincsearch.com/

Solr

https://solr.apache.org/

https://solr.apache.org/operator/

https://solr.apache.org/guide/solr/latest/getting-started/so...

https://github.com/apache/solr

https://solr.apache.org/guide/solr/latest/deployment-guide/s...

Konnu https://gitlab.com/shadowislord/konnu

Quickwit QuickWit + Clickhouse

https://clickhouse.com/docs/en/guides/developer/full-text-se...

https://clickhouse.com/docs/en/sql-reference/functions/strin...

There is no way I can get to running all of these (this project was supposed to be quick!!), but I will run the ones I noted earlier, and probably manticore too since it was high on my list since it's quite polished looking.


+ Milvus (https://github.com/milvus-io/milvus) for large scale similarity/semantic search.


+ xapian which has been around a while, and while gpl licensed, is quite capable https://xapian.org/


Xapian is great, especially when you need a C/C++ library rather than a separate service. Kinda like an sqlite for search. Some of my favorite tools like notmuch and recoll use it.


I'd encourage you to maintain and publish your list of search engines. Even if you aren't supporting them.

The list has value on its own, especially if you maintain it.


Yup this is exactly why I started Awesome F/OSS (https://awsmfoss.com) — I have similar lists for a lot of software and just need to make sure they’re somewhere else other than just my browser!


+ Weaviate for vector based search. Has a BSD-3 license. https://weaviate.io/developers/weaviate/current/


You can consider also lnx that is based on tantivy and is performing quite well (https://lnx.rs/).


Would it make sense to include Sqlite FTS5 in that mix?


It would, I did for the supabase post but... This is already way too much! I have no idea when I'll actually be able to get to all this as-is.

Waiting for meilisearch to ingest documents right now and the Show HN is going up.


Meili is still ingesting documents but we're live:

https://news.ycombinator.com/item?id=33321268

Maybe I should have used their batch thing instead.


Have you heard about Xapian?


Nope I hadn't, but it was mentioned here:

https://news.ycombinator.com/item?id=33318533


I've written a full-text search engine as well. I don't tout it as a replacement for Elasticsearch, but it does have a few advantages: it's fast; supports HTML documents; supports Polish inflection (via a full-blown morphological dictionary, not just a stemmer); and has a very compact on-disk format (pre-parsed HTML trees, Huffman-encoded over large alphabets). Oh, and it's 100% Clojure.

It underlies a concordancer GUI called Smyrna: https://github.com/nathell/smyrna, https://smyrna.danieljanus.pl

I haven't touched it in six years, other than a few small changes. But I do plan on revisiting it when time permits.


That’s very cool. I hope you consider open sourcing it so others can contribute.


It is open-source already (MIT)! I just need to make other languages more easily pluggable, and factor out the search engine so that it can be used on its own. :)


Could your stemmer be ported to Lucene? Might get more usage there.


>Also, Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)

Does this mean that it only ever finds at most N documents per word? Even searches for "A and B" would probably not find everything, even if less than N documents contain A and B, because they might have been removed with the sliding window already for A or B alone. Is that correct?
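To make the concern concrete, here is a toy model of a per-word sliding window. It is a sketch of the behaviour described in the quote, not Sonic's actual data structures; the class name, the whitespace tokenizer and the tiny window of 2 are all assumptions chosen to keep the example short.

```python
from collections import defaultdict, deque

class SlidingWindowIndex:
    """Toy inverted index keeping only the N most recent doc IDs per word."""

    def __init__(self, window=2):
        self.window = window
        # deque(maxlen=...) silently drops the oldest ID once it is full
        self.postings = defaultdict(lambda: deque(maxlen=window))

    def push(self, doc_id, text):
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)

    def search_all(self, *words):
        """AND query: doc IDs still retained for every one of the words."""
        sets = [set(self.postings.get(w.lower(), ())) for w in words]
        return set.intersection(*sets) if sets else set()

index = SlidingWindowIndex(window=2)
index.push(1, "rust search")          # the only doc containing both words
index.push(2, "rust compiler")
index.push(3, "rust borrow checker")  # pushes doc 1 out of the "rust" window
index.push(4, "search engine")

# Doc 1 matches both words, but it is no longer listed under "rust".
print(index.search_all("rust", "search"))
```

In this model only one document contains both words, far fewer than the window size, and the AND query still comes back empty, which is the situation the question describes.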


Huh? Yeah. I can keep my index size down by throwing results away as well.

Every time you think it’s somehow magic, someone has to dump a bucket of cold water over your head.


As far as I can tell, yes, this is correct.


Every time someone comes up with an alternative to a software behemoth like Elasticsearch, what they actually mean is: "An alternative to the 10% of functionality of $tool that are interesting to me".

This is surely an impressive engineering feat, but hardly a replacement for the myriad of query possibilities Elasticsearch offers.


From the opening line of the README:

"Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases."

Seems pretty up-front about it, and doesn't claim to be a full-featured alternative.


Agreed, they do a good job of hedging it. I think OP was probably pre-empting the usual comments along the lines of "yep, $tool is super bloated, $smallerTool proves that those other guys building $tool are bad engineers."


To be fair, there is often a better reason to replace only a portion of ES's functionality, since doing so can save a lot of computation and space, than to replace ES itself, since it already exists and does a good job if what you need is the full kit.

I found myself last week reimplementing 10% of RoaringBitmap's functionality as a homebrew replacement, because doing so was 500% faster. Not that RB isn't great, but it's designed for a general problem space, and not my particular problem.


can you explain more of your problem and solution? Or is your solution open/published anywhere?


Agreed, although to be fair to the actual author (assuming they didn't post this here) the readme is a lot more upfront about its capabilities.

> Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases.


I don't think your opinion is wrong, but I do think ElasticSearch has a lot of features that many people consider bloat depending on their work, and scaling and doing general dev ops for ES can be an absolute slog. Light weight alternatives that cut down to a set of core features for some niche seem like a good idea to me.


It's totally fine that many people consider stuff bloat, but other people don't. I've built a highly specialised search engine for manufacturing companies on top of Elasticsearch, and I decidedly need vector queries, TF-IDF queries, geospatial range queries, and heaps of other, niche features you probably never used before.

Having a lightweight search engine is fine, but calling it an alternative to Elasticsearch is not doing either justice.


That's very assumptive of you. I have in fact used most of those features, and note I said their opinion was not wrong. In their readme they said it's a replacement for some use cases which is upfront and fine.

Vector queries aren't niche, Elastic however only tacked on a proper (non HNSW) implementation in the last year and a half. Geospatial isn't niche, anyone working with location data will work with those queries. TF-IDF is a basic ranking algo / signal.

Maybe Elasticsearch is good for you because they have all their features in aggregate. But I can name a tool that focuses specifically on each area and query type and is better for that specific subset of functionality.

So my point still stands, if all you need are specific features Elastic is too much. You need all of it and that's fine too.


> But I can name a tool that focuses specifically on each area and query type and is better for that specific subset of functionality.

Please do name them, because I for one would like to never run ElasticSearch again for faceted, full-text and specialised search.


I mean, Sonic doesn't store term frequency at all, so it can't do TF-IDF. It probably doesn't want to. If you need any ranking other than latest first, Sonic is not for you.
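As a reminder of why term frequency has to be stored for this, here is a bare-bones TF-IDF scorer. It is a textbook-style sketch (relative term frequency times log(N/df), whitespace tokenization), not the formula any particular engine uses; real engines typically use tuned variants such as BM25.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document against the query with plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)  # the per-document term frequencies
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0 or not tokens:
                continue
            idf = math.log(n_docs / doc_freq)
            score += (term_counts[term] / len(tokens)) * idf
        scores.append(score)
    return scores

docs = ["rust rust search", "rust compiler", "python search"]
print(tf_idf_scores("rust", docs))
```

Without `term_counts`, the first two documents would be indistinguishable for the query "rust"; an index that only records "word appears in doc" has nothing to rank them by except insertion order.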


The problem with subsets is everyone wants a different subset. It's why popular software almost always bloats. Everyone wants some different features.


> ... alternative to a software behemoth like Elasticsearch, what they actually mean is: "An alternative to the 10% of functionality of $tool that are interesting to me"

Which is perfectly fine. A lot of tools become so general and bloated, that there are large groups that would be fine with many different 10% subsets of their features...

Kind of like how I don't need MS Word or OpenOffice Write, any simple text editing program with a few basic features (like printing, bold/italics, and word count) will do for my needs...


I'm not opposed to that, however, the chance of their 10% and my 10% overlapping is rather slim. Just like you only need basic formatting, and I require footnotes in my documents. Nothing wrong with either, but I'd be upset if you tried to sell me GEdit as a replacement for OpenOffice Write.


Honestly, most of the "alternative to" programs do not meet the expectations they set by dropping a big, well-known name. So much so that I think people are doing FOSS a disservice by comparing themselves to projects they can't meaningfully overtake.

The only exceptions could be small single feature utilities.


To me it seems the “alternative to” part is more damaging in that sense than dropping a big name. The name is used to put a complicated piece of software in a context many people are familiar with. The same thing happens with the “Tinder/Uber/Airbnb for <x>…” type of services.

The friction is introduced where it’s not made crystal clear how it’s similar, and which concepts are different or missing altogether. Then it will cause unmet expectations.

Perhaps it’s better to say “inspired by …” or “similar to …” to make a more precise statement.


My guess is that the majority of people using ES could actually use something simpler like this.


True, but most deployments are also just generic searching of records like Algolia rather than using all the low-level functionality.

Typesense is probably the most complete competitor in that regard: https://typesense.org/

Other alternatives here: https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...


That is exactly what they are, and I don't think they hide it?! So, I don't know what the issue is. This is the kind of innovation that keeps us moving forward.


I agree, but ES should re-write their core engine to be more lightweight, otherwise a viable competitor will emerge.


Projects like Meili Search are already coming for Elastic Search's lunch: https://www.meilisearch.com. I think there is a market for fast, lightweight alternatives like Meili that offer up a fully featured open source experience.

With Elastic Search many of the features, security being one, are locked away behind commercial licenses. With Meili it seems they are, for the time being anyway, going with a proper open source version. I understand Elastic needs to earn money, and I get their licensing model to accomplish this. But Meili will probably steal away a good portion of customers interested in self hosting their search solution.


I’m not sure what this competitor will be but > ES will have the following properties:

- written in rust or maybe just C

- extremely lightweight and high performance

- single small binary that runs anywhere

- designed to run in Kubernetes from the ground up

- scales dynamically up/down

- zero downtime upgrades

- rigorous security built into the core offering

- fully open source

- wire compatibility with ES

I hope that ES themselves do this. There are pretty significant barriers to creating a serious competitor to ES (unlike something like MongoDB for example which seems to have a very limited role in the future).


and almost all of them won't offer an ES-compatible API. Off the top of my head I can think of manticore (https://manticoresearch.com/), which offers at least a subset of the elasticsearch API


Quickwit does too (a subset anyway): https://github.com/quickwit-oss/quickwit


well, and how do you solve excessive ram usage for a search engine? Generally you write the indexes/search trees to disk, which may or may not be ideal.


tbh "Text Search" is a vague description for this kind of software, so I guess everyone goes with elasticsearch-like.


While I get the wants-and-needs since ElasticSearch has a voracious appetite for RAM, I get the feeling most people think search engines are a simple thing where you can just import some lib, fool around for a bit and call it a day.

The truth is that ElasticSearch/Solr/Lucene is orders of magnitude more complex and powerful than these "alternatives". All this is mostly fine as long as everyone is on the same page regarding the expectations.

Most people don't need ElasticSearch for their use cases on the surface, but I feel they expect top-notch mind-reading results and that requires something like ElasticSearch and someone who knows the field.

Having said all of that, Meilisearch and this are quite fine.


I would like to suggest https://typesense.org/ It has some features that make it a better choice than Meilisearch


Can you elaborate on said features?

I migrated from typesense to Meilisearch on a project after I found it had much better search accuracy. I can't exactly explain why, but overall Meilisearch results feel more relevant by default.


There are actually benchmarks that allow measuring search relevancy objectively, e.g. BEIR[1]. The Manticore Search team made an effort to submit a PR to include it in the list. The results are here [2]. Unfortunately the BEIR team seems to be too busy to review a whole pile of PRs, including one about Vespa. Nevertheless it would be nice to have both Meilisearch and Typesense there too, since it would be interesting to see what performance those non-tf-idf based search engines would show compared to BM25-based and vector search engines.

[1] https://github.com/beir-cellar/beir [2] https://docs.google.com/spreadsheets/d/1_ZyYkPJ_K0st9FJBrjbZ...


I work on Typesense. Mind if I ask which version of Typesense and Meilisearch you tried this on? And if this was on some public dataset I can use?

I’d love to take a closer look.


Hey jabo,

I migrated in April 2021 (latest version of typesense & meilisearch at that time).

I don't have a public dataset as it was a fairly large ecommerce catalog with close to ~500k entries. And again, it was just my own perception which is hard to define. I just found that Typesense was a bit off compared to Meilisearch on search accuracy, and of course could totally be different today with a more recent release.


Got it, thank you for sharing that. Typesense was at v0.19.0 around that time. Two prominent issues we had in that version were how we handled matches across multiple fields and how we handled "keyword stuffing".

We're now at v0.24.rc, and we've iterated quite a lot on improving relevancy since then, as more users shared their datasets with us and gave us feedback over the last 1.5 years.

If you get a chance to try out Typesense again in the future, I'd love to hear how relevance feels with the latest version, out of the box for your dataset.


Yeah there needs to be some kind of acid test that will compare these products on equal footing and show the pitfalls.


This is very very difficult, but Tantivy tried: see https://github.com/quickwit-oss/search-benchmark-game



That would be great. However if you wanted to benchmark relevance ranking, how would you do that?


You need a dataset and an evaluation metric. The usual evaluation metric is NDCG (Normalized Discounted Cumulative Gain): https://en.wikipedia.org/wiki/NDCG

An example dataset is BEIR (BEnchmarking Information Retrieval), published in NIPS 2021: https://github.com/beir-cellar/beir
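For readers who haven't met the metric, a minimal NDCG sketch follows. It assumes you already have graded relevance judgments for the results an engine returned, listed in the order the engine ranked them, and it uses the linear-gain form of DCG (gain = relevance); another common variant uses 2^rel - 1 as the gain. The function names are made up for this example.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance grades in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)
    ideal = ideal if k is None else ideal[:k]
    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0

# Relevance grades (3 = perfect, 0 = irrelevant) in the order each engine returned them.
engine_a = [3, 2, 1, 0]
engine_b = [0, 1, 2, 3]
print(ndcg(engine_a), ndcg(engine_b))
```

An engine that returns the same documents in a worse order gets a lower score, which is what makes the number usable for comparing relevance across engines on a shared dataset.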


I think the upshot is that if you have no idea what all the advanced features of ES even are then you probably don't need ES because it's not turnkey.

If you utter the phrase "I just want search" then it really is a matter of just using one of these lightweight projects and libs because your needs are simple.


There is also the other important thing: the "search engine" in elasticsearch is just "searching the content of documents", not more.

It won't show you which one is "best" (for a given value of "best"), just one that looks most similar to the input.

Trying to index anything that can contain any trace of SEO would be doomed to failure. It also won't tell you which of the sites got linked the most, or do the million other things web search engines do to give good results.

In the "just put documents in a DB and search them" case, it is barely enough to look through a corporate knowledge database, and it still won't get nuances like "this page is linked from 20 other pages, maybe it should be higher?"


* We imported ~1,000,000 messages of dynamic length (some very long, eg. emails);

* Once imported, the search index weights 20MB (KV) + 1.4MB (FST) on disk;

This is almost unbelievably succinct! If you encode the document features into 8 bits per document, and thus completely forego the need to store the document ID by indexing them implicitly, that alone is 1 MB.

Getting meaningful search out of on average 21 bytes per document is seriously impressive.

[For reference, this sentence is 42 bytes.]
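The back-of-the-envelope arithmetic, spelled out. This assumes decimal megabytes (1 MB = 1,000,000 bytes) and exactly 1,000,000 messages; the README figures are approximate, so treat the result as a rough average rather than a measurement.

```python
MB = 1_000_000  # assumption: decimal megabytes

kv_bytes = 20 * MB       # key-value store size quoted in the README
fst_bytes = 1_400_000    # 1.4 MB FST size quoted in the README
messages = 1_000_000     # "~1,000,000 messages"

bytes_per_message = (kv_bytes + fst_bytes) / messages
one_byte_per_doc_mb = messages * 1 / MB  # the "8 bits per document" baseline

print(round(bytes_per_message, 1), one_byte_per_doc_mb)
```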


Wonder if this has anything to do with the sliding window:

> Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)

Default window looks like 1k documents. I read this as saying that super common words are basically dropped from the index (only 1k out of many thousands of docs retained), but I don’t know enough about the internals to be sure. Not sure if this actually hurts search results in practice, seems like an ok trade off for help docs at least.


It's definitely a great trade-off to make for efficiency, but makes it inherently unusable for most of Elasticsearch's use cases.

Looking at it from a practical example such as log search (almost everyone I know has used kibana/logstash/elasticsearch at some point): you'd be able to search for things like tracingId/requestId but adding more filters such as logLevel, requestType or serviceName would be impossible

It has its niche, but calling it an elasticsearch alternative really is a stretch


Also the ability to weight fields when fetching results to boost relevancy, which is needed for a lot of my use cases.


I wonder how easy it would be to change "most recently pushed" to something like a redis sorted set where each document has a score and only the top N results are retained when sorted by their separate score value? That would allow you to sort by pageviews / popularity in a more useful way. But it fails entirely when looking for uncommon intersections of common words, which feels like it makes it useless for most actual full-text search use-cases :(
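A rough in-memory sketch of that idea, using a bounded min-heap per word as a stand-in for a Redis sorted set. Nothing here reflects how Sonic stores data; the class name, the `score` argument and the whitespace tokenizer are assumptions for illustration.

```python
import heapq
from collections import defaultdict

class TopNIndex:
    """Toy index keeping the N highest-scored doc IDs per word."""

    def __init__(self, n=2):
        self.n = n
        self.heaps = defaultdict(list)  # word -> min-heap of (score, doc_id)

    def push(self, doc_id, score, text):
        for word in set(text.lower().split()):
            heap = self.heaps[word]
            if len(heap) < self.n:
                heapq.heappush(heap, (score, doc_id))
            elif (score, doc_id) > heap[0]:
                # Evict the lowest-scored entry instead of the oldest one.
                heapq.heapreplace(heap, (score, doc_id))

    def search(self, word):
        """Doc IDs for one word, highest score first."""
        entries = self.heaps.get(word.lower(), [])
        return [doc_id for _, doc_id in sorted(entries, reverse=True)]

index = TopNIndex(n=2)
index.push(1, 10, "rust search")   # low pageviews
index.push(2, 50, "rust compiler")
index.push(3, 30, "rust book")     # evicts doc 1 from "rust"
print(index.search("rust"))
```

As the comment says, this only changes *which* documents survive per word; an intersection of two common words can still miss documents that were evicted from either list.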


Long ago I was searching for lightweight search engines that could run on the Edge, as ElasticSearch, while very popular, is also quite heavy and relies on Lucene/JVM.

Apart from Sonic, I also found Tantivy [1] and Meilisearch [2]... all delightfully made in Rust. My favorite, and the closest one to ElasticSearch (for its features) is probably Tantivy.

I'd recommend anyone to check out these three projects and choose what best fits your needs... it's awesome to see that more projects are becoming available by the day!

[1]: https://github.com/quickwit-oss/tantivy

[2]: https://github.com/meilisearch/meilisearch


I've looked up tantivy and quickwit. Quickwit uses tantivy as the engine. It has decoupled storage (awesome, only recently elastic announced something comparable) but is oriented towards log processing and explicitly warns against its use to power a user facing site search. Do you happen to know if there's anything like that with the same minimal footprint that can scale up and, importantly, down to serve the needs of highly variable traffic websites? Right now I'm looking at something with clustering capabilities and decoupled storage (e.g on s3) like quickwit


One of the reasons for not using Quickwit for user facing search is the latency: for example, you pay 70ms of latency when you make a request on AWS S3... and generally you expect latency below that figure. Decoupling compute and storage while keeping a very low latency may then be impossible unless you end up caching all your data on disk :).

You can have a look at lnx (https://lnx.rs/) that is based on tantivy and is performing quite well. It's not yet distributed but the author Chillfish8 has some thoughts about how to do it.


Thank you! I'll look into it


There is also sphinx search which was open source before 3.0 version.


And its open source continuation - Manticore Search [1]

[1] https://manticoresearch.com/


Do they support document access control like ES does?


Yes, Meilisearch supports ES-like document access control.


Another interesting alternative: https://github.com/meilisearch/meilisearch - I'm using it in one of my (small) projects and I had a good experience with it, also very helpful community.


This is not a direct alternative to ElasticSearch. Tantivy is closer to an alternative to Lucene, which ES is built on top of. An ES alternative could be built on top of Tantivy.

Sonic here only returns document identifiers so you will never be able to get document information back. This is very useful though if all you want to do is index text data and then get the stored information from another data store.
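A rough sketch of that "IDs from the search backend, rows from the real store" pattern, in case it helps. Everything here is an assumption for illustration: `search_ids` is a hypothetical stand-in for whatever client call returns the identifiers, and the `documents` table is made up.

```python
import sqlite3


def search_ids(query):
    """Hypothetical stand-in for a call to an ID-only search backend."""
    fake_index = {"rust": ["doc:2", "doc:1"]}
    return fake_index.get(query, [])


def hydrate(conn, ids):
    """Fetch full rows for the given IDs, preserving the search backend's order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title FROM documents WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # Skip IDs the database no longer has (e.g. deleted after indexing)
    return [by_id[i] for i in ids if i in by_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [("doc:1", "Rust in production"), ("doc:2", "Why Rust is fast")],
)
results = hydrate(conn, search_ids("rust"))
```

The extra round trip to the database is the price for not storing the documents twice.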


Quickwit is a search engine built on top of Tantivy (by the author of Tantivy): https://github.com/quickwit-oss/quickwit

Quickwit supports Elasticsearch compatible bulk indexing API.


> Sonic here only returns document identifiers

In many cases that is what you want because you have the data in a database and don't want to duplicate it in Elasticsearch.


> Sonic here only returns document identifiers so you will never be able to get document information back

Why would you want that anyway? Always thought it was silly to duplicate all your data which will be stored in a real database anyway


Here is one use case, though not one I am experienced with: if you index books, you want the search engine to return highlighted snippets like Google does.

Also, now that I think of it, typically logs/structured data is stored only in ES.


>Also, Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)

If you discard many potential hits, why not use /dev/null as the search engine?


I believe you must have misread what you quoted, because whatever point you're trying to make doesn't really follow from it.

They let you configure the number of expected results to cache for a given query. The number of cached results is configurable based on your use case for the results (e.g. if your website only lists 100 results, don't store beyond that).

If more results than that for a given query are returned then they disregard additional results since you told it you won't make use of them. In essence, they're saving you from caching results that you'll never consume.

How you got from this to "just use /dev/null" is a mystery to me. It has to be a misread or misunderstanding.
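For anyone trying to picture the retention behavior quoted from the README above, here is a toy model. It is a simplification under my own assumptions (whitespace tokenization, one window per word, the hypothetical names `SlidingWindowIndex`, `push`, `query`), not Sonic's actual implementation.

```python
from collections import defaultdict, deque


class SlidingWindowIndex:
    """Toy inverted index keeping only the N most recently pushed IDs per word."""

    def __init__(self, window: int) -> None:
        self.window = window
        # word -> bounded deque of doc IDs; the oldest ID falls off when full
        self._postings = defaultdict(lambda: deque(maxlen=window))

    def push(self, doc_id: int, text: str) -> None:
        for word in set(text.lower().split()):
            self._postings[word].append(doc_id)

    def query(self, *words: str) -> list:
        """Doc IDs retained for all given words, most recently pushed first."""
        if not words:
            return []
        retained = [set(self._postings.get(w.lower(), ())) for w in words]
        hits = set.intersection(*retained)
        first = self._postings.get(words[0].lower(), ())
        return [doc_id for doc_id in reversed(first) if doc_id in hits]


idx = SlidingWindowIndex(window=2)
idx.push(1, "fast search")
idx.push(2, "fast database")
idx.push(3, "fast cache")      # doc 1 now falls out of the window for "fast"
latest_fast = idx.query("fast")
both = idx.query("fast", "search")
```

In this model document 1 contains both words, yet the combined query comes back empty because it was already evicted from the "fast" window. Whether that matters depends on how large the window is relative to how common your words are.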


This thing looks like a very generic cache. You can of course use /dev/null as a degenerate cache, without any performance benefit though.


The readme doesn't offer enough information to accept that it can be an alternative to elasticsearch. From what I can gather by skimming the information, it can only do word level matching and that it isn't some form of TF-IDF type index (as is Lucene, which stands behind Solr/ElasticSearch).


Yes, it doesn't do any ranking at all. Results are returned in the reverse order of indexing.



Nice. I have done some tests with SQLite, and I find its full-text index module very interesting, also because it offers stemming, which seems to be missing here: am I wrong?

SQLite has stemming only for English out of the box, but I consider it a must for a good drop-in ES replacement.

My two cents


It is great and works, but sonic has broader applications (I found it because it was actually being used as a way to index an existing SQLite database that pointed to file storage).


Just another plug for Lucene or the library route. I had a simple use case to offer a search/autocomplete API for the employee directory of ~50,000 records. The source of truth was only updated once a day. We ran a job that reindexed daily and published the index as a file (< 15 megabytes) to where the service could access it.

That service worked beautifully. Results were returned in 10-20ms and we only ever made software updates to handle the occasional CVE. It did, however, take quite a bit of fiddling initially to get the query results to match the user expectations. For example, weighting first vs last vs full name.


One of the features I like in ES that I haven’t seen in alternatives is “Percolate queries” (queries where you feed the service a document and it returns a list of queries that you’ve indexed that would match that document - basically inverting the whole process).

Does anyone know of any alternatives that support this use case?

https://www.elastic.co/guide/en/elasticsearch/reference/mast...
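To make the "inverted" direction concrete, here is a minimal sketch of the idea. It assumes the simplest possible query model (a stored query is a set of terms that must all appear in the document) and uses hypothetical names (`Percolator`, `register`, `percolate`); it is not how ES implements percolation.

```python
class Percolator:
    """Store queries up front, then ask which of them a new document matches."""

    def __init__(self) -> None:
        self._queries = {}  # query_id -> frozenset of required terms

    def register(self, query_id: str, terms: str) -> None:
        self._queries[query_id] = frozenset(terms.lower().split())

    def percolate(self, document: str) -> list:
        doc_terms = set(document.lower().split())
        # A stored query matches when all of its terms occur in the document
        return sorted(
            query_id
            for query_id, terms in self._queries.items()
            if terms <= doc_terms
        )


p = Percolator()
p.register("rust-search", "rust search")
p.register("java", "java")
matches = p.percolate("A fast search backend written in Rust")
```

This scans every stored query per document, which is fine for a handful of alerts but is exactly the part a real engine has to optimize.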


Yes. Manticore Search does. Here's an interactive course[1] about it, it's a little bit outdated though. More info in the docs[2]

[1] https://play.manticoresearch.com/pq [2] https://manual.manticoresearch.com/Creating_an_index/Local_i...


Somewhat related, this guy: https://github.com/mosuka/ seems to be very passionate about search services.

He built two distributed search services:

- https://github.com/mosuka/phalanx, written in Go.

- https://github.com/mosuka/bayard, written in Rust.


Lots of (elastic)search alternatives now, I keep track here: https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...

Sonic is good. Typesense is probably what most are looking for as more of an Algolia-like setup: https://typesense.org/


I am not sure if it can be called an "alternative". ElasticSearch has thousands of features and settings while this library seems to be just a simple inverted index implementation only for text search.

By the way, if you are looking for a lightweight "alternative" to ElasticSearch, you might look at the Sphinx search engine (although it doesn't have as many features as ES has and it has been closed source since version 3.0).


> you might look at sphinx search engine

Manticore Search [1] forked from the latest open source version and has been continually improved for more than 5 years.

[1] https://manticoresearch.com/

> although it doesn't have as many features as ES has

Manticore, unlike Sphinx, is much closer to Elasticsearch in terms of feature set.


Using a 32-bit ID is an interesting choice. It means you can only index about 4 billion (2^32) documents per bucket. I wonder if using a varint encoding would give you even more savings while handling > 4 billion documents, at the price of slightly more expensive serialization/deserialization (which should be negligible in the grand scheme of everything else being done).
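To illustrate the varint idea, here is a sketch of the common unsigned LEB128-style scheme (7 payload bits per byte, high bit set when more bytes follow). The function names are my own, and nothing here reflects how Sonic actually stores IDs.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("varint encoding here only supports non-negative integers")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, offset: int = 0):
    """Decode one varint starting at `offset`; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


small = encode_varint(300)        # 2 bytes instead of a fixed 4
large = encode_varint(2**32)      # 5 bytes, but no 4-billion ceiling
value, next_offset = decode_varint(small + large)
```

Note the trade-off: IDs below 2^21 take 3 bytes or fewer, while IDs close to 2^32 take 5 bytes, one more than a fixed 32-bit field. The savings therefore depend on the ID distribution, and schemes like this usually pay off most when sorted IDs are delta-encoded first.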


What about relevancy?

There’s not much mention of that. I’m always on the lookout for something lightweight that improves on PostgreSQL full text.


Sonic doesn't do any ranking other than latest first.


Ouch


So I can use that to inject millions of logs daily and it will do sharding and rebalancing automatically?


No. Sonic is a single node server and not distributed.


Related:

Sonic: Fast, lightweight and schemaless search back end in Rust - https://news.ycombinator.com/item?id=19471471 - March 2019 (39 comments)


I don't see the use case for Sonic. It doesn't support clusters, which is where ES really shines, and if footprint for a local instance is a concern I'd go for SQLite.


Is Sonic clustered? Otherwise it's useless.

There are already millions of solutions for storing entries in an inverted index that you can query within milliseconds; none of them has the power of ES in terms of features, scaling and HA.


The first thing I looked for is how long it takes to delete a document from the index.

Looks like it rebuilds the whole index periodically and that's very processor intensive. The delete will be reflected after a rebuild.


Does anyone have any recommendations of books or other resources that go over the theory behind full-text search? i.e. language processing, data encoding, on-disk storage and retrieval, etc.



If you want a book, Managing Gigabytes is still pretty good.


Does anyone know of an alternative for the time series side of Elastic?


Manticore Search. Here's a blog post with detailed comparison [1]

[1] https://manticoresearch.com/blog/manticore-alternative-to-el...


You might want to check out Redis Stack - https://redis.io/docs/stack. It's a stack on top of Redis, which comes bundled with RedisTimeSeries, RediSearch, and RedisJSON (it also includes RedisGraph and RedisBloom).


A slight name-conflict with sanic: https://github.com/sanic-org/sanic


If anyone has been successful compiling this with VSCode on Win10 please let me know how you get CLANG/LLVM to play nicely with VSCode.

I'd like to avoid compiling LLVM from source if I can


About 2 weeks ago, I was searching for an alternative to Elastic for this exact use case. Funny how the world works, now I have my answer: "someone has built it".


Most of the time these ES replacements lack decent access control.

One thing is to find what you search, but the other is not to find what you aren't allowed to see.


Meilisearch supports ES-like document access control.


IANAL but it seems to me like using the name "sonic" and a hedgehog mascot together teeters on the edge of IP infringement


Now if it were drop-in capable and still more efficient, that would be impressive and I would count the days until Elastic buys you out.


There's also Meilisearch, which is another Elasticsearch alternative written in Rust.

Comparison to elasticsearch: https://docs.meilisearch.com/learn/what_is_meilisearch/compa...

Github: https://github.com/meilisearch/meilisearch

Website: https://www.meilisearch.com/


It would be nice if it could be a replacement for a logging stack; Elastic is super hungry for RAM.


You can have a look at Quickwit (https://quickwit.io), it's a search engine made for logs :). It's still pretty young and... there are far fewer features than in ES.

(disclaimer: I'm one of the cofounders)


I don't see any mention on the website of how it compares to Loki, which is surprising since it looks like a very similar model?


Look at ZincSearch. It's another lightweight Elastic alternative and they advertise logging as a use case.


I wish someone would write a full text engine that supports pluggable storage engines.


If they keep introducing hipster names like Deno and Sonic, no one will know what anything means anymore.


But does it scale?


No, it doesn't.



