Sonic: Fast, lightweight and schema-less search backend

blacklight · on Oct 24, 2022

While I really like their lightweight, SQL-like protocol instead of Elasticsearch's fat JSON, I really think that this project could have much more impact if it could be a drop-in replacement for ES.

Even if it offers only a fraction of the features offered by ES, that may be fair enough for at least half of the use-cases out there.

Sonic could have really had a strong selling point: "Use an ES-alternative that works fine in most of the real-world applications, but it's written in Rust and it only takes a fraction of the memory footprint required by ES, and it shouldn't require you to change your application code".

Instead, they are proposing yet another search protocol, that developers have to learn and adopt. That definitely increases the adoption barriers.

xvello · on Oct 24, 2022

Since Elastic spitefully patched all of their client libraries to fail if the server is not a "genuine" ES server, I don't see what good a drop-in replacement with protocol compatibility would do.

Go client: https://github.com/elastic/go-elasticsearch/blob/3985f2a1554...

Python client: https://github.com/elastic/elasticsearch-py/commit/e72aa3e24...

snikolaev · on Oct 24, 2022

Is it prohibited to include `X-Elastic-Product: Elasticsearch` in the output of your server if the user instructs the server to do so? :)
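
For anyone wondering what such a check amounts to, here is a minimal sketch of the kind of client-side product check being discussed. It is not the code from the official clients: the function name `looks_like_elasticsearch` and the plain-dict headers are assumptions for illustration, and only the header name and value come from the comment above.

```python
def looks_like_elasticsearch(headers: dict) -> bool:
    """Return True if the response headers carry the product marker.

    Hypothetical helper: HTTP header names are case-insensitive, so the
    lookup lowercases them before comparing.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized.get("x-elastic-product") == "Elasticsearch"
```

A client built this way would refuse any server that does not send the header, no matter how compatible the rest of the API is.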

AbraKdabra · on Oct 24, 2022

Those libraries are open source, just nuke those restrictions and you're good to go. Is it the best way? Maybe not, but it's better than modifying your server responses (and in the worst 1984 case, allowing Elastic to sue you), if you develop such a tool you can always put that distinction in your README.

hangonhn · on Oct 24, 2022

I don't see how they can legally have any control over what a 3rd party's software outputs. And more importantly, how would they even enforce such restrictions?

blowski · on Oct 24, 2022

I imagine AWS can't put it on the headers of their managed service, and that's what it's about.

sickmate · on Oct 25, 2022

AWS has a compatibility mode that allows your cluster to report its version as 7.10.

collegeburner · on Oct 24, 2022

same thing nintendo did on old consoles: the "DRM" was requiring the cartridge to display a nintendo logo at start-up. so if you included that, you could get sued.

my guess is because it's "X-Elastic-Product" you'd be falsely "saying" your product is made by Elastic, the company so they could sue over it.

yvan · on Oct 24, 2022

I believe Elasticsearch is a trademark.

mumblemumble · on Oct 24, 2022

If it really does work this way, then we're all doomed.

https://stackoverflow.com/questions/1114254/why-do-all-brows...

blowski · on Oct 25, 2022

If its use is so widespread, then it's probably no longer a trademark, in that context at least.

jeltz · on Oct 24, 2022

A trademark does not forbid people from using a name, it only restricts how it can be used in marketing. I do not see how that would be applicable here.

metadat · on Oct 24, 2022

Are HTTP headers important or even relevant at all for branding trademark purposes?

Such a concern seems utterly ridiculous.

leros · on Oct 24, 2022

ElasticSearch is so much more than search. Sonic is very minimal in comparison, so a drop in replacement doesn't work here.

But yes, Sonic could replace lots of use cases.

markandrewj · on Oct 24, 2022

Although not exactly the same, Elastic has an SQL query syntax which can be used now as well.

https://www.elastic.co/what-is/elasticsearch-sql

_boffin_ · on Oct 24, 2022

They’ve had it for a long time already. Was using it over a year and a half ago. Not really “new”

neodymiumphish · on Oct 25, 2022

In fairness, the comment you're replying to didn't say it was new.

tensor · on Oct 24, 2022

It's probably fairly easy to write an adapter here.

hardwaresofton · on Oct 24, 2022

Wow it's weird that this comes up, I'm actually running a site I am going to repost to HN today that I want to use as a testbed for search engines (kind of like an extension to my recent collaboration with supabase[0]).

Right now I've got the site going on just Postgres FTS + trigram and it's pretty darn fast, looks like I need to test sonic too.

Going to burn some midnight oil (in my timezone, anyway) and get it out -- though sonic isn't implemented yet!

Anyway to make this comment useful to people, here's my short list of engines that I want to run in parallel:

- MeiliSearch (https://github.com/meilisearch/MeiliSearch)

- TypeSense (https://github.com/typesense/typesense)

- Lyra (https://github.com/LyraSearch/lyra)

- OpenSearch (https://github.com/opensearch-project/OpenSearch)

- ZincSearch (https://github.com/prabhatsharma/zinc)

- Sonic (https://github.com/valeriansaliou/sonic)

There isn't enough out there comparing all these for the simple typical fuzzy search/search box usecase, so I'm adapting a little podcast search site I made to try and use all of these at the same time. So far only Postgres though, will try and add Meilisearch today and post it!

Like other people are pointing out, most of these engines won't have all the features of ES (or more accurately Lucene) but I am pretty convinced that most of the time it doesn't actually matter and if someone is searching on your site excessively maybe there's a problem with your UX (unless you're a search engine or repository of information).

[0]: https://supabase.com/blog/postgres-full-text-search-vs-the-r...

nightpool · on Oct 24, 2022

> and if someone is searching on your site excessively maybe there's a problem with your UX (unless you're a search engine or repository of information).

I don't understand this comment. Why would you search something that *isn't*, in some senses, a repository of information? I would say almost every website needs to have search in some sense, and it's *because* sites function as a repository of information that they need this search. Think about e.g. Stripe's documentation, or Github's repository / code search. HN is also another great example—I search for stories or comments all the time to try and remember something I read about recently or heard about last week, but couldn't quite remember. I'm hard-pressed to think of a web site I use regularly that *shouldn't* have full-text search, if I'm being honest.

TylerE · on Oct 24, 2022

Most site searches are basically unusable. Either it isn't very good, is painfully slow, or both.

Just googling site:foo.com/baz <query> almost always produces better results.

hardwaresofton · on Oct 24, 2022

I don't consider use cases like documentation a "repository" of information, but maybe this is just me phrasing it badly. In the literal sense sure it is, but when I think of a "repository of information" I think of wikipedia, amazon search items, etc.

The scale of a documentation site is a very different problem -- you can brute force it in ways that you can't at larger scales.

I agree that HN would be a case of the large repository, but even then what most people want out of HN search is pretty simple/basic keyword search. I think a decent non-frustrating HN search feature could be very basic and get by without most of the advanced features/rabbit holes available in search.

Basically I think most apps fall into the lighter search use case -- command palettes, search inside of apps with a small scale of information, etc.

My comment wasn't that apps shouldn't have full text search -- it was that most that have full text search don't need complex full text search with all the bells and whistles that lucene and other serious search engines provide. These up-and-comers might be enough for a bunch of apps for which search is not the main feature.

nightpool · on Oct 24, 2022

I think the sibling comment on this thread kind of disproves this to some extent. Sure, some simple in-memory search library might be good enough for a command palette or for "apps", but the fact that even tiny docs and "marketing" sites that try to implement site search are almost always outdone by google "site:example.com query" really goes to show how much value a full stemming / synonym clustering / syntax normalizing search engine can bring.

Bilal_io · on Oct 24, 2022

Hey that's a great list of tools.

Are you aware of any that can be used client side like Lyra and supports faceted search?

I've been looking for a solution and cannot find it, even an algorithm and/or a data structure can be helpful. I attempted coming up with a solution myself but ended up with frustration when it came to making the facets dynamic and update as other filters are applied.

I read a couple of papers and one stood out [0], which introduces category theory as a solution to faceted filtering. I understood it in theory, but it still does not seem straightforward to implement, and I haven't attempted it yet.

0. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5145200/#!po=28...
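
One common way to get facets that update as other filters are applied is plain brute force, not the category-theory approach from the paper: for each facet field, count values over the documents that match every active filter except the filter on that same field. A minimal sketch, where the document shape and the names `matches` and `facet_counts` are assumptions for illustration:

```python
from collections import Counter

def matches(doc, filters, skip_field=None):
    """True if doc satisfies every active filter, ignoring skip_field."""
    return all(
        doc.get(field) in allowed
        for field, allowed in filters.items()
        if field != skip_field
    )

def facet_counts(docs, facet_fields, filters):
    """Count facet values, leaving each facet's own filter out of its counts."""
    counts = {}
    for field in facet_fields:
        counts[field] = Counter(
            doc[field]
            for doc in docs
            if field in doc and matches(doc, filters, skip_field=field)
        )
    return counts

docs = [
    {"brand": "acme", "color": "red"},
    {"brand": "acme", "color": "blue"},
    {"brand": "zenith", "color": "red"},
]
# With color=red selected, brand counts shrink but color counts stay complete.
result = facet_counts(docs, ["brand", "color"], {"color": {"red"}})
```

Skipping a facet's own filter is what keeps the other options in that facet visible, so the user can switch from "red" to "blue" without clearing the filter first. This scans every document per facet, so it only suits small client-side datasets.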

hardwaresofton · on Oct 24, 2022

So for client-side search, I generally know of Lunr.js:

https://lunrjs.com/docs/index.html

There are some others but I can't find them at this moment -- a bunch of the other projects I find are somewhat abandoned, lunr is actually on my list of things to use (because it makes the most sense to just ship a pre-built index with the first like... 5 letters maybe of typeahead, no matter how fast the backend is)
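
The "ship a pre-built index for the first few letters" idea could look roughly like this. It is a sketch, not Lunr's API: `build_prefix_index`, the 5-character cutoff, and the cap on suggestions per prefix are all assumptions.

```python
def build_prefix_index(titles, max_prefix_len=5, max_suggestions=3):
    """Map every prefix (up to max_prefix_len chars) to a few titles."""
    index = {}
    for title in sorted(titles):
        key = title.lower()
        for length in range(1, min(len(key), max_prefix_len) + 1):
            bucket = index.setdefault(key[:length], [])
            if len(bucket) < max_suggestions:
                bucket.append(title)
    return index

index = build_prefix_index(["Sonic", "Solr", "Sphinx", "Typesense"])
suggestions = index.get("so", [])
```

The resulting dictionary can be serialized to JSON at build time and shipped with the page, so the first keystrokes never hit the backend.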

Bilal_io · on Oct 24, 2022

Thanks for the link. This unfortunately is not what I am looking for. Faceted filters are a different beast.

hawski · on Oct 24, 2022

Thank you for this comparison. I would also like to know how Bleve Search (https://github.com/blevesearch/bleve) turns out.

I have had a small search engine project in my free-time pipeline for many years now, but I haven't even gotten to crawling yet, and I intend to tackle the searching part after some of that.

hardwaresofton · on Oct 24, 2022

You're right I should put bleve on there as well. This isn't even the whole list. Toshi (https://github.com/toshi-search/Toshi) is also out there...

snikolaev · on Oct 24, 2022

If you decide to add Manticore Search to the list feel free to ping me at sergey@manticoresearch.com if you need help with preparing the ingestion scripts etc.

hardwaresofton · on Oct 24, 2022

Oh! Damn it I forgot about manticore -- I had seen it before but forgot to include it.

Eventually all of these projects will be highlighted on Awesome F/OSS (https://awsmfoss.com), but for now I'm just going to dump my bookmarks here for other people, since I'm leaving awesome projects out:

Search Engines

AWS OpenSearch https://github.com/opensearch-project/OpenSearch

https://github.com/opensearch-project/OpenSearch-Dashboards

https://github.com/opensearch-project/perftop

https://github.com/go-ego/riot

https://groonga.org/ https://github.com/groonga/groonga

https://github.com/meilisearch/MeiliSearch

https://github.com/mosuka/bayard

https://github.com/nezaboodka/nevod

https://github.com/searx/searx

https://github.com/stryku/okon

https://github.com/toshi-search/Toshi

https://github.com/typesense/typesense

https://github.com/valeriansaliou/sonic

Algolia

https://github.com/marconi1992/algolite

https://quickwit.io/

https://github.com/quickwit-inc/quickwit

https://docs.meilisearch.com/

https://github.com/prabhatsharma/zinc

phalanx https://github.com/blugelabs/bluge https://github.com/mosuka/phalanx https://github.com/mosuka/blast

ManticoreSearch

https://github.com/manticoresoftware/manticoresearch

https://github.com/manticoresoftware/docker

https://manticoresearch.com/blog/manticore-alternative-to-el...

https://manticoresearch.com/

https://manual.manticoresearch.com/Introduction

https://forum.manticoresearch.com/t/manticore-search-cheatsh...

https://forum.manticoresearch.com/

Whoosh https://whoosh.readthedocs.io/en/latest/

https://pypi.org/project/Whoosh/

lyra https://github.com/nearform/lyra

https://nearform.github.io/lyra/

https://github.com/LyraSearch/lyra

https://lyrasearch.io/

flexsearch

https://github.com/nextapps-de/flexsearch#performance-benchm...

https://pagefind.app/docs/

Lucene

https://github.com/apache/lucene

https://lucene.apache.org/

ZincSearch

https://zincsearch.com/

Solr

https://solr.apache.org/

https://solr.apache.org/operator/

https://solr.apache.org/guide/solr/latest/getting-started/so...

https://github.com/apache/solr

https://solr.apache.org/guide/solr/latest/deployment-guide/s...

Konnu https://gitlab.com/shadowislord/konnu

Quickwit QuickWit + Clickhouse

https://clickhouse.com/docs/en/guides/developer/full-text-se...

https://clickhouse.com/docs/en/sql-reference/functions/strin...

There is no way I can get to running all of these (this project was supposed to be quick!!), but I will run the ones I noted earlier, and probably manticore too since it was high on my list since it's quite polished looking.

fzliu · on Oct 24, 2022

+ Milvus (https://github.com/milvus-io/milvus) for large scale similarity/semantic search.

kapilvt · on Oct 24, 2022

+ xapian which has been around a while, and while gpl licensed, is quite capable https://xapian.org/

donio · on Oct 24, 2022

Xapian is great, especially when you need a C/C++ library rather than a separate service. Kinda like an sqlite for search. Some of my favorite tools like notmuch and recoll use it.

_tom_ · on Oct 24, 2022

I'd encourage you to maintain and publish your list of search engines. Even if you aren't supporting them.

The list has value on its own, especially if you maintain it.

hardwaresofton · on Oct 25, 2022

Yup this is exactly why I started Awesome F/OSS (https://awsmfoss.com) — I have similar lists for a lot of software and just need to make sure they’re somewhere else other than just my browser!

thirdtrigger · on Oct 24, 2022

+ Weaviate for vector based search. Has a BSD-3 license. https://weaviate.io/developers/weaviate/current/

francoismassot · on Oct 24, 2022

You can consider also lnx that is based on tantivy and is performing quite well (https://lnx.rs/).

MobiusHorizons · on Oct 24, 2022

Would it make sense to include Sqlite FTS5 in that mix?

hardwaresofton · on Oct 24, 2022

It would, I did for the supabase post but... This is already way too much! I have no idea when I'll actually be able to get to all this as-is.

Waiting for meilisearch to ingest documents right now and the Show HN is going up.

hardwaresofton · on Oct 24, 2022

Meili is still ingesting documents but we're live:

https://news.ycombinator.com/item?id=33321268

Maybe I should have used their batch thing instead.

jojo_ · on Oct 25, 2022

Have you heard about Xapian?

hardwaresofton · on Oct 25, 2022

Nope I hadn't, but it was mentioned here:

https://news.ycombinator.com/item?id=33318533

nathell · on Oct 24, 2022

I've written a full-text search engine as well. I don't tout it as a replacement for Elasticsearch, but it does have a few advantages: it's fast; supports HTML documents; supports Polish inflection (via a full-blown morphological dictionary, not just a stemmer); and has a very compact on-disk format (pre-parsed HTML trees, Huffman-encoded over large alphabets). Oh, and it's 100% Clojure.

It underlies a concordancer GUI called Smyrna: https://github.com/nathell/smyrna, https://smyrna.danieljanus.pl

I haven't touched it in six years, other than a few small changes. But I do plan on revisiting it when time permits.

johnebgd · on Oct 24, 2022

That’s very cool. I hope you consider open sourcing it so others can contribute.

nathell · on Oct 24, 2022

It is open-source already (MIT)! I just need to make other languages more easily pluggable, and factor out the search engine so that it can be used on its own. :)

_tom_ · on Oct 24, 2022

Could your stemmer be ported to Lucene? Might get more usage there.

atesti · on Oct 24, 2022

>Also, Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)

Does this mean that it only ever finds at most N documents per word? Even searches for "A and B" would probably not find everything, even if less than N documents contain A and B, because they might have been removed with the sliding window already for A or B alone. Is that correct?
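
The scenario in the question can be reproduced with a toy model. This is not Sonic's actual implementation, just a sketch of "keep the N most recent doc IDs per word"; the class name `WindowedIndex` and its methods are made up for illustration.

```python
from collections import defaultdict, deque

class WindowedIndex:
    """Toy inverted index keeping only the N newest doc IDs per word."""

    def __init__(self, window):
        self.postings = defaultdict(lambda: deque(maxlen=window))

    def push(self, doc_id, text):
        for word in set(text.lower().split()):
            self.postings[word].append(doc_id)  # oldest ID falls off when full

    def query_all(self, *words):
        """Doc IDs still retained for every one of the given words."""
        sets = [set(self.postings.get(w, ())) for w in words]
        return set.intersection(*sets) if sets else set()

index = WindowedIndex(window=2)
index.push(1, "a b")   # the only document containing both words
index.push(2, "a")
index.push(3, "a")     # doc 1 is now evicted from the window for "a"
found = index.query_all("a", "b")
```

Here `found` is empty even though only one document contains both words, which is well under the window size: doc 1 was pushed out of the postings for "a" by newer documents that never mention "b".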

Aeolun · on Oct 24, 2022

Huh? Yeah. I can keep my index size down by throwing results away as well.

Every time you think it’s somehow magic, someone has to dump a bucket of cold water over your head.

sanxiyn · on Oct 24, 2022

As far as I can tell, yes, this is correct.

9dev · on Oct 24, 2022

Every time someone comes up with an alternative to a software behemoth like Elasticsearch, what they actually mean is: "An alternative to the 10% of functionality of $tool that are interesting to me".

This is surely an impressive engineering feat, but hardly a replacement for the myriad of query possibilities Elasticsearch offers.

GrinningFool · on Oct 24, 2022

From the opening line of the README:

"Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases."

Seems pretty up-front about it, and doesn't claim to be a full-featured alternative.

lolinder · on Oct 24, 2022

Agreed, they do a good job of hedging it. I think OP was probably pre-empting the usual comments along the lines of "yep, $tool is super bloated, $smallerTool proves that those other guys building $tool are bad engineers."

marginalia_nu · on Oct 24, 2022

To be fair, there are often better reasons to replace only a portion of ES' functionality, since doing so can save a lot of computation and space, than to replace ES itself, since it already exists and does a good job if what you need is the full kit.

I found myself last week reimplementing 10% of RoaringBitmap's functionality as a homebrew replacement, because doing so was 500% faster. Not that RB isn't great, but it's designed for a general problem space, and not my particular problem.

unrealhoang · on Oct 25, 2022

can you explain more of your problem and solution? Or is your solution open/published anywhere?

jamil7 · on Oct 24, 2022

Agreed, although to be fair to the actual author (assuming they didn't post this here) the readme is a lot more upfront about its capabilities.

> Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases.

ianbutler · on Oct 24, 2022

I don't think your opinion is wrong, but I do think ElasticSearch has a lot of features that many people consider bloat depending on their work, and scaling and doing general dev ops for ES can be an absolute slog. Light weight alternatives that cut down to a set of core features for some niche seem like a good idea to me.

9dev · on Oct 24, 2022

It's totally fine that many people consider stuff bloat, but other people don't. I've built a highly specialised search engine for manufacturing companies on top of Elasticsearch, and I decidedly need vector queries, TF-IDF queries, geospatial range queries, and heaps of other, niche features you probably never used before.

Having a lightweight search engine is fine, but calling it an alternative to Elasticsearch is not doing either justice.

ianbutler · on Oct 24, 2022

That's very assumptive of you. I have in fact used most of those features, and note I said their opinion was not wrong. In their readme they said it's a replacement for some use cases which is upfront and fine.

Vector queries aren't niche, Elastic however only tacked on a proper (non HNSW) implementation in the last year and a half. Geospatial isn't niche, anyone working with location data will work with those queries. TF-IDF is a basic ranking algo / signal.

Maybe Elasticsearch is good for you because they have all their features in aggregate. But I can name a tool that focuses specifically on each area and query type and is better for that specific subset of functionality.

So my point still stands, if all you need are specific features Elastic is too much. You need all of it and that's fine too.

pbowyer · on Oct 24, 2022

> But I can name a tool that focuses specifically on each area and query type and is better for that specific subset of functionality.

Please do name them, because I for one would like to never run ElasticSearch again for faceted, full-text and specialised search.

sanxiyn · on Oct 24, 2022

I mean, Sonic doesn't store term frequency at all, so it can't do TF-IDF. It probably doesn't want to. If you need any ranking other than latest first, Sonic is not for you.
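
To see why stored term frequencies matter, here is a minimal TF-IDF sketch. It uses one textbook variant (tf as a fraction of the document, idf as log(N / df)); other weightings exist, and the function name and toy corpus are assumptions.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Classic tf * idf with idf = log(N / df); one of several variants."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    df = sum(1 for tokens in corpus if term in tokens)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [
    ["sonic", "is", "fast"],
    ["sonic", "sonic", "search"],
    ["elastic", "search"],
]
score_once = tf_idf("sonic", corpus[0], corpus)
score_twice = tf_idf("sonic", corpus[1], corpus)
```

The second document scores higher only because "sonic" occurs in it twice. An index that records just "this word appears in this document" has no way to tell the two apart.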

_tom_ · on Oct 24, 2022

The problem with subsets is everyone wants a different subset. It's why popular software almost always bloats. Everyone wants some different features.

coldtea · on Oct 24, 2022

> ... an alternative to a software behemoth like Elasticsearch, what they actually mean is: "An alternative to the 10% of functionality of $tool that are interesting to me"

Which is perfectly fine. A lot of tools become so general and bloated, that there are large groups that would be fine with many different 10% subsets of their features...

Kind of like how I don't need MS Word or OpenOffice Write, any simple text editing program with a few basic features (like printing, bold/italics, and word count) will do for my needs...

9dev · on Oct 24, 2022

I'm not opposed to that, however, the chance of their 10% and my 10% overlapping is rather slim. Just like you only need basic formatting, and I require footnotes in my documents. Nothing wrong with either, but I'd be upset if you tried to sell me GEdit as a replacement for OpenOffice Write.

RicoElectrico · on Oct 24, 2022

Honestly most of the "alternative to" programs do not meet expectations they set by dropping a big known name. So much so, that I think people are doing FOSS a disservice by comparing to those who they can't meaningfully overtake.

The only exceptions could be small single feature utilities.

graftak · on Oct 24, 2022

To me it seems the “alternative to” part is more damaging in that sense than dropping a big name. The name is used to put a complicated piece of software in a context many people are familiar with. The same thing happens with the “Tinder/Uber/Airbnb for <x>…” type of services.

The friction is introduced where it’s not made crystal clear how it’s similar, and which concepts are different or missing altogether. Then it will cause unmet expectations.

Perhaps it’s better to say “inspired by …” or “similar to …” to make a more precise statement.

tensor · on Oct 24, 2022

My guess is that the majority of people using ES could actually use something simpler like this.

manigandham · on Oct 24, 2022

True, but most deployments are also just generic searching of records like Algolia rather than using all the low-level functionality.

Typesense is probably the most complete competitor in that regard: https://typesense.org/

Other alternatives here: https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...

felipellrocha · on Oct 24, 2022

That is exactly what they are, and I don't think they hide it?! So, I don't know what the issue is. This is the kind of innovation that keeps us moving forward.

osigurdson · on Oct 24, 2022

I agree, but ES should re-write their core engine to be more lightweight, otherwise a viable competitor will emerge.

snorremd · on Oct 24, 2022

Projects like Meili Search are already coming for Elastic Search's lunch: https://www.meilisearch.com. I think there is a market for fast, light weight alternatives like Meili that offer up a fully featured open source experience.

With Elastic Search many of the features, security being one, are locked away behind commercial licenses. With Meili it seems they are, for the time being anyway, going with a proper open source version. I understand Elastic needs to earn money, and I get their licensing model to accomplish this. But Meili will probably steal away a good portion of customers interested in self hosting their search solution.

osigurdson · on Oct 24, 2022

I’m not sure what this competitor will be, but something better than ES will have the following properties:

- written in rust or maybe just C

- extremely lightweight and high performance

- single small binary that runs anywhere

- designed to run in Kubernetes from the ground up

- scales dynamically up/down

- zero downtime upgrades

- rigorous security built into the core offering

- fully open source

- wire compatibility with ES

I hope that ES themselves do this. There are pretty significant barriers to creating a serious competitor to ES (unlike something like MongoDB for example which seems to have a very limited role in the future).

rlex · on Oct 24, 2022

and almost all of them won't offer an ES-compatible API. Off the top of my head I can think of manticore (https://manticoresearch.com/), which offers at least a subset of the elasticsearch API

sanxiyn · on Oct 24, 2022

Quickwit does too (a subset anyway): https://github.com/quickwit-oss/quickwit

jethro_tell · on Oct 24, 2022

well, and how do you solve excessive ram usage for a search engine? Generally you write the indexes/search trees to disk, which may or may not be ideal.

papruapap · on Oct 24, 2022

tbh "Text Search" is a vague description for this kind of software, so I guess everyone goes with elasticsearch-like.

PedroBatista · on Oct 24, 2022

While I get the wants-and-needs since ElasticSearch has a voracious appetite for RAM, I get the feeling most people think search engines are a simple thing where you can just import some lib, fool around for a bit and call it a day.

The truth is that ElasticSearch/Solr/Lucene is orders of magnitude more complex and powerful than these "alternatives". All this is mostly fine as long as everyone is on the same page regarding the expectations.

Most people don't need ElasticSearch for their use cases on the surface, but I feel they expect top-notch mind-reading results and that requires something like ElasticSearch and someone who knows the field.

Having said all of that, Meilisearch and this are quite fine.

alessmar · on Oct 24, 2022

I would like to suggest https://typesense.org/ It has some features that make it a better choice than Meilisearch

paraboul · on Oct 24, 2022

Can you elaborate on said features?

I migrated from typesense to Meilisearch on a project after I found it had much better search accuracy. I can't exactly explain why, but overall Meilisearch results feel more relevant by default.

snikolaev · on Oct 24, 2022

There are actually benchmarks that allow measuring search relevancy objectively, e.g. BEIR[1]. The Manticore Search team made an effort to submit a PR to include it in the list. The results are here [2]. Unfortunately the BEIR team seems to be too busy to review a whole pile of PRs, including one about Vespa. Nevertheless it would be nice to have both Meilisearch and Typesense there too, since it's interesting what performance those non-tf-idf based search engines would show compared to BM25-based and vector search engines.

[1] https://github.com/beir-cellar/beir [2] https://docs.google.com/spreadsheets/d/1_ZyYkPJ_K0st9FJBrjbZ...

jabo · on Oct 24, 2022

I work on Typesense. Mind if I ask which version of Typesense and Meilisearch you tried this on? And if this was on some public dataset I can use?

I’d love to take a closer look.

paraboul · on Oct 24, 2022

Hey jabo,

I migrated in April 2021 (latest version of typesense & meilisearch at that time).

I don't have a public dataset as it was a fairly large ecommerce catalog with close to ~500k entries. And again, it was just my own perception which is hard to define. I just found that Typesense was a bit off compared to Meilisearch on search accuracy, and of course could totally be different today with a more recent release.

jabo · on Oct 24, 2022

Got it, thank you for sharing that. Typesense was at v0.19.0 around that time. Two prominent issues we had in that version were how we handled matches across multiple fields and how we handled "keyword stuffing".

We're now at v0.24.rc, and we've iterated quite a lot on improving relevancy since then, as more users shared their datasets with us and gave us feedback over the last 1.5 years.

If you get a chance to try out Typesense again in the future, I'd love to hear how relevance feels with the latest version, out of the box for your dataset.

keyle · on Oct 24, 2022

Yeah there needs to be some kind of acid test that will compare these products on equal footing and show the pitfalls.

sanxiyn · on Oct 24, 2022

This is very very difficult, but Tantivy tried: see https://github.com/quickwit-oss/search-benchmark-game

DeathArrow · on Oct 24, 2022

Here is a performance benchmark: https://db-benchmarks.com/test-hn/#manticore-search-columnar...

jasfi · on Oct 24, 2022

That would be great. However if you wanted to benchmark relevance ranking, how would you do that?

sanxiyn · on Oct 24, 2022

You need a dataset and an evaluation metric. The usual evaluation metric is NDCG(Normalized Discounted Cumulative Gain): https://en.wikipedia.org/wiki/NDCG

An example dataset is BEIR(BEnchmarking Information Retrieval), published in NIPS 2021: https://github.com/beir-cellar/beir
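
For a feel of what the metric does, here is a minimal NDCG sketch. It uses the linear-gain form (relevance divided by log2 of rank + 1); some implementations use 2^rel - 1 as the gain instead. The graded relevance lists are made-up examples.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: rel_i / log2(i + 1), ranks from 1."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance grades of the returned results, in the order the engine ranked them.
perfect = ndcg([3, 2, 1, 0])   # best results first
shuffled = ndcg([0, 1, 2, 3])  # best results last
```

A ranking that puts the most relevant documents first scores 1.0, and anything that buries them scores lower, which is what lets two engines be compared on the same judged dataset.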

Spivak · on Oct 24, 2022

I think the upshot is that if you have no idea what all the advanced features of ES even are then you probably don't need ES because it's not turnkey.

If you utter the phrase "I just want search" then it really is a matter of just using one of these lightweight projects and libs because your needs are simple.

ilyt · on Oct 24, 2022

There is also the other important thing: the "search engine" in elasticsearch is just "searching the content of documents", not more.

It won't show you which one is "best" (for a given value of "best"), just one that looks most similar to the input.

Trying to index anything that can contain any trace of SEO would be doomed to failure, it also won't tell you which of the sites got linked the most, and million other things other web search engines do to give good results.

In "just put documents in a DB and search them" it is barely enough to look through a corporate knowledge database, and it still won't get nuances like "this page is linked from 20 other pages, maybe it should be higher?"

marginalia_nu · on Oct 24, 2022

* We imported ~1,000,000 messages of dynamic length (some very long, eg. emails);

* Once imported, the search index weights 20MB (KV) + 1.4MB (FST) on disk;

This is almost unbelievably succinct! If you encode the document features into 8 bits per document, and thus completely forego the need to store the document ID by indexing them implicitly, that alone is 1 MB.

Getting meaningful search out of on average 21 bytes per document is seriously impressive.

[For reference, this sentence is 42 bytes.]
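
The arithmetic behind those figures, as a quick check. This assumes 1 MB = 10^6 bytes and takes the quoted message count as exactly one million.

```python
docs = 1_000_000

# 20 MB (KV) + 1.4 MB (FST), as quoted from the README.
index_bytes = (20 + 1.4) * 1_000_000
bytes_per_doc = index_bytes / docs

# A hypothetical 8-bit (1-byte) feature vector per document, no doc IDs stored.
one_byte_per_doc_mb = docs * 1 / 1_000_000
```

That comes to about 21.4 bytes of index per message, against 1 MB for the bare 8-bits-per-document baseline.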

mattb314 · on Oct 24, 2022

Wonder if this has anything to do with the sliding window:

> Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)

Default window looks like 1k documents. I read this as saying that super common words are basically dropped from the index (only 1k out of many thousands of docs retained), but I don’t know enough about the internals to be sure. Not sure if this actually hurts search results in practice, seems like an ok trade off for help docs at least.

411111111111111 · on Oct 24, 2022

It's definitely a great trade-off to make for efficiency, but it makes it inherently unusable for most of Elasticsearch's use cases.

Looking at it from a practical example such as log search (almost everyone I know has used kibana/logstash/elasticsearch at some point): you'd be able to search for things like tracingId/requestId but adding more filters such as logLevel, requestType or serviceName would be impossible

It has its niche, but calling it an elasticsearch alternative really is a stretch

rabuse · on Oct 24, 2022

Also the ability to weight fields when fetching results to boost relevancy, which is needed for a lot of my use cases.

nightpool · on Oct 24, 2022

I wonder how easy it would be to change "most recently pushed" to something like a redis sorted set where each document has a score and only the top N results are retained when sorted by their separate score value? That would allow you to sort by pageviews / popularity in a more useful way. But it fails entirely when looking for uncommon intersections of common words, which feels like it makes it useless for most actual full-text search use-cases :(
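A minimal sketch of that idea, assuming each document carries a numeric score and each word keeps only its top N documents by score. A bounded min-heap stands in for the Redis sorted set here; nothing in it reflects how Sonic actually stores postings. The example also shows the failure mode for rare intersections of common words.

```python
import heapq
from collections import defaultdict


class TopNIndex:
    """Toy index retaining only the `top_n` highest-scored documents per word."""

    def __init__(self, top_n: int):
        self.top_n = top_n
        self.postings = defaultdict(list)  # word -> min-heap of (score, doc_id)
        self.scores = {}                   # doc_id -> score

    def push(self, doc_id: int, score: float, text: str) -> None:
        self.scores[doc_id] = score
        for word in set(text.lower().split()):
            heap = self.postings[word]
            if len(heap) < self.top_n:
                heapq.heappush(heap, (score, doc_id))
            else:
                # Add the new entry, then drop whichever entry scores lowest.
                heapq.heappushpop(heap, (score, doc_id))

    def query(self, *words: str) -> list:
        doc_sets = [
            {doc_id for _, doc_id in self.postings.get(w.lower(), [])}
            for w in words
        ]
        hits = set.intersection(*doc_sets) if doc_sets else set()
        return sorted(hits, key=lambda d: self.scores[d], reverse=True)


index = TopNIndex(top_n=2)
index.push(1, 10, "red car")
index.push(2, 9, "red bike")
index.push(3, 8, "blue car")
index.push(5, 7, "blue bike")
index.push(4, 1, "red blue")  # the only document containing both words

print(index.query("red"))          # [1, 2]
print(index.query("red", "blue"))  # []: document 4 was evicted from both lists
```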

syrusakbary · on Oct 24, 2022

Long ago I was searching for lightweight search engines that could run on the Edge, as ElasticSearch, while very popular, is also quite heavy and relies on Lucene and the JVM.

Apart from Sonic, I also found Tantivy [1] and Meilisearch [2]... all delightfully made in Rust. My favorite, and the closest one to ElasticSearch (for its features) is probably Tantivy.

I'd recommend that anyone check out these three projects and choose whichever best fits their needs... it's awesome to see that more projects are becoming available by the day!

[1]: https://github.com/quickwit-oss/tantivy

[2]: https://github.com/meilisearch/meilisearch

alserio · on Oct 24, 2022

I've looked up tantivy and quickwit. Quickwit uses tantivy as the engine. It has decoupled storage (awesome, only recently elastic announced something comparable) but is oriented towards log processing and explicitly warns against using it to power a user-facing site search. Do you happen to know if there's anything like that with the same minimal footprint that can scale up and, importantly, down to serve the needs of websites with highly variable traffic? Right now I'm looking at something with clustering capabilities and decoupled storage (e.g. on S3) like quickwit.

francoismassot · on Oct 24, 2022

One of the reasons for not using Quickwit for user-facing search is the latency: for example, you pay 70ms of latency when you make a request on AWS S3... and generally you expect latency below that figure. Decoupling compute and storage while keeping a very low latency may then be impossible unless you end up caching all your data on disk :).

You can have a look at lnx (https://lnx.rs/) that is based on tantivy and is performing quite well. It's not yet distributed but the author Chillfish8 has some thoughts about how to do it.

alserio · on Oct 24, 2022

Thank you! I'll look into it

codedokode · on Oct 24, 2022

There is also Sphinx search, which was open source before version 3.0.

snikolaev · on Oct 24, 2022

And its open source continuation - Manticore Search [1]

[1] https://manticoresearch.com/

croes · on Oct 24, 2022

Do they support document access control like ES does?

sanxiyn · on Oct 24, 2022

Yes, Meilisearch supports ES-like document access control.

dewey · on Oct 24, 2022

Another interesting alternative: https://github.com/meilisearch/meilisearch - I'm using it in one of my (small) projects and I had a good experience with it, also very helpful community.

excsn · on Oct 24, 2022

This is not a direct alternative to ElasticSearch. Tantivy is closer to an alternative to ElasticSearch since ES is built on top of Lucene. An alternative could be achieved if built on top of Tantivy.

Sonic here only returns document identifiers so you will never be able to get document information back. This is very useful though if all you want to do is index text data and then get the stored information from another data store.

sanxiyn · on Oct 24, 2022

Quickwit is a search engine built on top of Tantivy (by the author of Tantivy): https://github.com/quickwit-oss/quickwit

Quickwit supports Elasticsearch compatible bulk indexing API.

codedokode · on Oct 24, 2022

> Sonic here only returns document identifiers

In many cases that is what you want because you have the data in a database and don't want to duplicate it in Elasticsearch.
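That pattern can be sketched as follows: the search layer returns identifiers only, and the application hydrates them from its own database. `search_ids` is a hypothetical stand-in backed by a plain dict, not Sonic's client API, and the table and rows are made up for illustration.

```python
import sqlite3

# The "real database" holding the documents (in-memory for this sketch).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
db.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        (1, "Intro to search", "inverted index basics"),
        (2, "Rust notes", "ownership and borrowing"),
        (3, "Index tuning", "keep the index small"),
    ],
)

# Stand-in for the search backend: word -> list of document IDs, nothing else.
index = {}
for doc_id, title, body in db.execute("SELECT id, title, body FROM articles"):
    for word in set(f"{title} {body}".lower().split()):
        index.setdefault(word, []).append(doc_id)


def search_ids(term):
    """Hypothetical ID-only search call; a real backend would go here."""
    return index.get(term.lower(), [])


def search(term):
    """Resolve the returned IDs against the database, preserving their order."""
    ids = search_ids(term)
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    rows = db.execute(
        f"SELECT id, title FROM articles WHERE id IN ({placeholders})", ids
    ).fetchall()
    titles = dict(rows)
    return [(doc_id, titles[doc_id]) for doc_id in ids if doc_id in titles]


print(search("index"))  # [(1, 'Intro to search'), (3, 'Index tuning')]
```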

counttheforks · on Oct 24, 2022

> Sonic here only returns document identifiers so you will never be able to get document information back

Why would you want that anyway? Always thought it was silly to duplicate all your data which will be stored in a real database anyway

excsn · on Oct 24, 2022

Here's a use case I am not experienced with: if you index books, you want the search engine to return highlighted snippets like Google does.

Also, now that I think of it, typically logs/structured data is stored only in ES.

DeathArrow · on Oct 24, 2022

>Also, Sonic only keeps the N most recently pushed results for a given word, in a sliding window way (the sliding window width can be configured)

If you discard many potential hits, why not use /dev/null as the search engine?

Someone1234 · on Oct 24, 2022

I believe you must have misread what you quoted, because whatever point you're trying to make doesn't really follow from what you quoted.

They let you configure the number of results to cache for a given query; the number of cached results is configurable based on your use case (e.g. if your website only lists 100 results, don't store beyond that).

If more results than that for a given query are returned then they disregard additional results since you told it you won't make use of them. In essence, they're saving you from caching results that you'll never consume.

How you got from this to "just use /dev/null" is a mystery to me. It has to be a misread or misunderstanding.

nine_k · on Oct 24, 2022

This thing looks like a very generic cache. You can of course use /dev/null as a degenerate cache, without any performance benefit though.

mhitza · on Oct 24, 2022

The readme doesn't offer enough information to accept that it can be an alternative to elasticsearch. From what I can gather by skimming the information, it can only do word level matching and that it isn't some form of TF-IDF type index (as is Lucene, which stands behind Solr/ElasticSearch).

sanxiyn · on Oct 24, 2022

Yes, it doesn't do any ranking at all. Results are returned in the reverse order of indexing.

cies · on Oct 24, 2022

Other:

https://www.meilisearch.com/

https://github.com/quickwit-oss/tantivy

https://github.com/toshi-search/Toshi

https://github.com/typesense/typesense

daitangio · on Oct 24, 2022

Nice. I have done some tests with SQLite, and I find its full-text index module very interesting, also because it offers stemming, which seems to be missing here: am I wrong?

SQLite has stemming only for English out of the box, but I think stemming is a must for a good ES drop-in replacement.

My two cents
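For anyone curious what the SQLite stemming mentioned above looks like, here is a minimal sketch using the FTS5 module with its built-in `porter` tokenizer. It assumes the SQLite library linked into your Python was compiled with FTS5, which is common but not guaranteed; the table name and rows are made up.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Assumption: this SQLite build includes FTS5. The porter tokenizer applies
# English stemming on top of the default tokenizer.
db.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='porter')")
db.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [("The indexer runs nightly",), ("Nothing relevant here",)],
)

# "running" and "runs" both stem to "run", so the first row matches.
rows = db.execute(
    "SELECT rowid, body FROM docs WHERE docs MATCH ?", ("running",)
).fetchall()
print(rows)
```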

rcarmo · on Oct 24, 2022

It is great and works, but sonic has broader applications (I found it because it was actually being used as a way to index an existing SQLite database that pointed to file storage).

thedougd · on Oct 24, 2022

Just another plug for Lucene or the library route. I had a simple use case to offer a search/autocomplete API for the employee directory of ~50,000 records. The source of truth was only updated once a day. We ran a job that reindexed daily and published the index as a file (< 15 megabytes) to where the service could access it.

That service worked beautifully. Results were returned in 10-20ms and we only ever made software updates to handle the occasional CVE. It did, however, take quite a bit of fiddling initially to get the query results to match the user expectations. For example, weighting first vs last vs full name.

erikcw · on Oct 24, 2022

One of the features I like in ES that I haven’t seen in alternatives is “Percolate queries” (queries where you feed the service a document and it returns a list of queries that you’ve indexed that would match that document - basically inverting the whole process).

Does anyone know of any alternatives that support this use case?

https://www.elastic.co/guide/en/elasticsearch/reference/mast...
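As a rough illustration of the inverted flow described above (store queries first, then match incoming documents against them), here is a deliberately simplified sketch. It treats each stored query as an AND of lowercase terms; this is an assumption for illustration and not how Elasticsearch's percolator is implemented.

```python
class Percolator:
    """Toy percolator: stored queries are matched against incoming documents."""

    def __init__(self):
        self.queries = {}  # query_id -> set of required terms

    def register(self, query_id: str, query_text: str) -> None:
        # Simplification: a query matches when all of its terms are present.
        self.queries[query_id] = set(query_text.lower().split())

    def percolate(self, document: str) -> list:
        words = set(document.lower().split())
        return sorted(
            query_id
            for query_id, terms in self.queries.items()
            if terms <= words  # every query term appears in the document
        )


percolator = Percolator()
percolator.register("rust-search", "rust search")
percolator.register("es-outage", "elasticsearch outage")

print(percolator.percolate("New search backend written in Rust"))  # ['rust-search']
```

A typical use is alerting: each saved search is a stored query, and every new document is percolated to find which users to notify.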

snikolaev · on Oct 24, 2022

Yes. Manticore Search does. Here's an interactive course[1] about it, it's a little bit outdated though. More info in the docs[2]

[1] https://play.manticoresearch.com/pq [2] https://manual.manticoresearch.com/Creating_an_index/Local_i...

didip · on Oct 24, 2022

Somewhat related, this guy: https://github.com/mosuka/ seems to be very passionate about search services.

He built two distributed search services:

- https://github.com/mosuka/phalanx, written in Go.

- https://github.com/mosuka/bayard, written in Rust.

manigandham · on Oct 24, 2022

Lots of (elastic)search alternatives now, I keep track here: https://gist.github.com/manigandham/58320ddb24fed654b57b4ba2...

Sonic is good. Typesense is probably what most are looking for as more of an Algolia-like setup: https://typesense.org/

codedokode · on Oct 24, 2022

I am not sure if it can be called an "alternative". ElasticSearch has thousands of features and settings while this library seems to be just a simple inverted index implementation only for text search.

By the way, if you are looking for a lightweight "alternative" to ElasticSearch you might look at sphinx search engine (although it doesn't have as many features as ES has and it has been closed-source since version 3.0).

snikolaev · on Oct 24, 2022

> you might look at sphinx search engine

Manticore Search [1] forked from the latest open source version and has been continually improved for more than 5 years.

[1] https://manticoresearch.com/

> although it doesn't have as many features as ES has

Manticore, unlike Sphinx, is much closer to Elasticsearch in terms of feature set.

vlovich123 · on Oct 24, 2022

Using a 32-bit ID is an interesting choice. It means you can only index about 4 billion (2^32) documents per bucket. I wonder if using a varint encoding would give you even more savings while handling > 4 billion documents, at the cost of slightly more expensive serialization/deserialization (which should be negligible in the grand scheme of everything else being done).
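For reference, a varint along those lines could look like the LEB128-style sketch below: 7 payload bits per byte, with the high bit marking "more bytes follow". This is a generic encoding, not Sonic's on-disk format, and the function names are made up for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer using 7 payload bits per byte."""
    if value < 0:
        raise ValueError("varint encoding here only supports non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes):
    """Decode one varint; returns (value, number_of_bytes_consumed)."""
    result = 0
    shift = 0
    for position, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, position + 1
        shift += 7
    raise ValueError("truncated varint")


for doc_id in (100, 300, 2**32 - 1, 2**40):
    encoded = encode_varint(doc_id)
    assert decode_varint(encoded) == (doc_id, len(encoded))
    print(doc_id, len(encoded), "bytes")
```

Small IDs take one or two bytes instead of a fixed four, while IDs at or above 2^28 need five bytes or more, so the savings depend on how the IDs are distributed.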

eric4smith · on Oct 24, 2022

What about relevancy?

There’s not much mention of that. I’m always on the lookout for something lightweight that improves on PostgreSQL full text.

sanxiyn · on Oct 24, 2022

Sonic doesn't do any ranking other than latest first.

eric4smith · on Oct 24, 2022

Thaxll · on Oct 24, 2022

So I can use that to inject millions of logs daily and it will do sharding and rebalancing automatically?

sanxiyn · on Oct 24, 2022

No. Sonic is a single node server and not distributed.

dang · on Oct 24, 2022

ggeorgovassilis · on Oct 26, 2022

I don't see the use case for Sonic. It doesn't support clusters, which is where ES really shines, and if footprint for a local instance is a concern I'd go for SQLite.

Thaxll · on Oct 25, 2022

Is Sonic clustered? Otherwise it's useless.

There are already millions of solutions for putting entries in an inverted index that you can query within milliseconds; none of them have the power of ES in terms of features, scaling and HA.

habibur · on Oct 24, 2022

The first thing I looked for is how long it takes to delete a document from the index.

Looks like it rebuilds the whole index periodically and that's very processor intensive. The delete will be reflected after a rebuild.

scottwick · on Oct 24, 2022

Does anyone have any recommendations of books or other resources that go over the theory behind full-text search? i.e. language processing, data encoding, on-disk storage and retrieval, etc.

snikolaev · on Oct 24, 2022

https://nlp.stanford.edu/IR-book/information-retrieval-book....

sanxiyn · on Oct 24, 2022

If you want a book, Managing Gigabytes is still pretty good.

speps · on Oct 24, 2022

Does anyone know of an alternative for the time series side of Elastic?

snikolaev · on Oct 24, 2022

Manticore Search. Here's a blog post with detailed comparison [1]

[1] https://manticoresearch.com/blog/manticore-alternative-to-el...

gkorland · on Oct 24, 2022

You might want to check Redis-Stack - https://redis.io/docs/stack. It's a stack on top of Redis, which comes bundled with RedisTimeSeries, RediSearch, and RedisJSON (also includes RedisGraph and RedisBloom).

hxl · on Oct 25, 2022

A slight name-conflict with sanic: https://github.com/sanic-org/sanic

AndrewKemendo · on Oct 24, 2022

If anyone has been successful compiling this with VSCode on Win10 please let me know how you get CLANG/LLVM to play nicely with VSCode.

I'd like to avoid compiling LLVM from source if I can

eerikkivistik · on Oct 24, 2022

About 2 weeks ago, I was searching for an alternative to Elastic for this exact use case. Funny how the world works, now I have my answer: "someone has built it".

croes · on Oct 24, 2022

Most of the time these ES replacements lack decent access control.

It's one thing to find what you search for, but it's another not to find what you aren't allowed to see.

sanxiyn · on Oct 24, 2022

Meilisearch supports ES-like document access control.

airstrike · on Oct 25, 2022

IANAL but it seems to me like using the name "sonic" and a hedgehog mascot together teeters on the edge of IP infringement

giancarlostoro · on Oct 24, 2022

Now if it were drop-in capable and still more efficient, that would be impressive and I would count the days until Elastic buys you out.

keroro · on Oct 24, 2022

There's also Meilisearch, which is another Elasticsearch alternative written in Rust.

Comparison to elasticsearch: https://docs.meilisearch.com/learn/what_is_meilisearch/compa...

Github: https://github.com/meilisearch/meilisearch

Website: https://www.meilisearch.com/

pavelevst · on Oct 24, 2022

It would be nice if it could be a replacement for the logging stack; Elastic is super hungry for RAM.

francoismassot · on Oct 24, 2022

You can have a look at Quickwit (https://quickwit.io), it's a search engine made for logs :). It's still pretty young and... there are way fewer features than in ES.

(disclaimer: I'm one of the cofounders)

GordonS · on Oct 31, 2022

I don't see any mention on the website on how it compares to Loki, which is surprising since it looks like a very similar model?

IceWreck · on Oct 24, 2022

Look at zincsearch. It's another lightweight Elastic alternative and they advertise logging as a use case.

endisneigh · on Oct 24, 2022

I wish someone would write a full text engine that supports pluggable storage engines.

pipeline_peak · on Oct 24, 2022

If they keep introducing hipster names like Deno and Sonic, no one will know what anything means anymore.

IYasha · on Oct 24, 2022

But does it scale?

sanxiyn · on Oct 24, 2022

No, it doesn't.