Hacker News new | past | comments | ask | show | jobs | submit login
Full text search over Postgres: Elasticsearch vs. alternatives (paradedb.com)
319 points by philippemnoel 5 months ago | hide | past | favorite | 121 comments



The BM25 thing is actually a really big deal.

BM25 is similar to TF/IDF. In both cases, the key idea is to consider statistics of the overall corpus as part of relevance calculations. If the user searches for "charities in new orleans" in a corpus where "new orleans" is only represented in a few documents, those should clearly rank highly. If the corpus has "new orleans" in almost every document then the term "charity" is more important.

PostgreSQL FTS cannot do this, because it doesn't maintain statistics for word frequencies across the entire corpus. This severely limits what it can implement in terms of relevance scoring - each result is scored based purely on if the search terms are present or not.

For comparison, SQLite FTS (which a lot of people are unaware of) actually does implement full index statistics, and SQLite FTS5 implements BM25 out of the box.


The problem with BM25 in a database, is that is can have unexpected outcomes for some common use cases.

Take multi-tenancy.

What if user 1 has many more documents than user 2, and uses "new orleans" a lot. But user 2 does not. User 2 does the search.

The db will first use FTS, and then filter. So user 1 will bias the results of user 2. Perhaps enough for user 2 to discover what words are in user 1 corpus.


If you're a Postgres pg_search user, partial indexes (https://docs.paradedb.com/search/full-text/index#partial-bm2...) can solve.


If you're creating one Postgres index per user, you're going to have a bad time really fast.


Would eat up inodes? Not sure how these work, does it create separate files for each one?


Index per tenant but yes, if you have hundreds or thousands of tenants that becomes a more difficult problem to manage but not at all unmanageable.


> Take multi-tenancy.

Using a SQLite DB per tenant is a good alternative to handle that: https://turso.tech/


If you care about not losing your data you should reconsider using turso.

https://news.ycombinator.com/item?id=38539595

Really, use some proven and mature tech for your data. Don't jump on every hype train. In case of SQLite self hosting is especially easy using https://litestream.io/.


SQLite has full text search with BM25?



AWS offers a solution which is OpenSearch serverless collections. A dedicated collection can be spun up for each user/tenant, e.g. «thing1-user345», «thing2-user123», instead of co-mingling the «user123» and «user345» in the same collection index. It increases the overall overhead but with the infrastructure repeatability, courtesy of IaaC, it is easy to roll out and process/consume «thing1-user123» and «thing2-user123» as discrete datasets.

Tangentially related. I have been finding multi-tenancy to have become more of a liability at worst or a nuisance at best due to the increasingly frequent customer demands to satisfy the data sovereignity, data confidentiality (each tenant wants their own, non-shared cryptography keys etc), data privacy, compliance, the right to forget/GDPR and similar requiremenets. For anything more complex than an anonymous online flower shop, it is simpler to partition the whole thing off – for each customer/tenant.


Any idea how “OpenSearch serverless collections” are implemented? I’m guessing that a “collection” is basically an ElasticSearch index, and “serverless” refers to some method of serializing/loading it on-demand with some cold start tradeoffs?


Basically opensearch and elasticsearch both offer serverless modes now. They work differently because all this was developed post fork. But they do similar things. The Elastic implementation is in early access mode right now, so it is not released yet. I saw a demo of this at one of their meetups in June. I think Opensearch actually moved a bit faster than Elastic on this after Elastic announced that they were working on this a few years ago.

The aim with both is to not have users worry about cluster sizes or cluster management any more, which from an operational point of view is a huge gain because most companies using this stuff aren't very good at doing that properly and the consequences are poor performance, outages, and even data loss when data volume grows.

Serverless essentially decouples indexing and querying traffic. All the nodes are transient and use their local disk only as a cache. Data at rest lives in S3 which becomes the single source of truth. So, if a new node comes up, it simply loads it's state from there and it doesn't have to coordinate with other nodes. There's no more possibility for a cluster to go red either. If a node goes down, it just gets replaced. Basically this makes use of the notion that lucene index segments are immutable after they are written to. There's a cleanup process running in the background that merges segments to clean things up but basically that just means you get a new file that then needs to be loaded by query nodes. I'm not 100% sure how write nodes coordinate segment creation and management. But I assume it involves using some kind of queueing.

So you get horizontal scalability for both reads and writes and you no longer have to worry about managing cluster state.

The tradeoff is that you have a bit increased latency before query nodes can see your incoming data because it has to hit S3 as part of the indexing process before read nodes can pick up new segments. Think multiple seconds before new data becomes visible. Both solutions are best suited for time series type use cases but they can also support regular search use cases. Any kind of use case where the same documents get updated regularly or where reading your own writes matters, things are not going to be great.

Amazon's implementation of course probably leverages a lot of their cloud stuff. Elastics implementation will eventually be available in their cloud solution on all supported cloud providers. Self hosting this is going to be challenging with either solution. So that's another tradeoff.


Thanks for the detailed response and insight. This is a great example of when meetups and in-person networking/collaboration can help you stay ahead of the curve.

It does sound like the solution glosses over some cold start problems that will surface with increasing regularity for more fragmented indexes. For example if you have one index per tenant (imagine GitHub search has one public index and then additionally one index per authenticated user containing only repos they are authorized to read), then each user will experience a cold start on their first authenticated search.

I bet these tradeoffs are not so bad, and in practice, are worth the savings. But I will be curious to follow the developments here and to see the limitations more clearly quantified.

(Also this doesn’t address the writes but I’m sure that’s solvable.)


> This is a great example of when meetups and in-person networking/collaboration can help you stay ahead of the curve.

Did I miss something? This is a comment on a publicly accessible website. How did you infer these benefits?


The context offered in their comment was based on a demo they saw at a meetup:

> I saw a demo of this at one of their meetups in June


> For anything more complex than an anonymous online flower shop, it is simpler to partition the whole thing off – for each customer/tenant.

Is this really a viable approach at the scale of B2B SaaS like Salesforce (or, contextually, Algolia)? They would end up with literally 10s of 1000s of DBs. That is surely cost-prohibitive.


Doesn't that affect BM25 with a solution like Elasticsearch as well? Or is that smart enough to apply filters to the corpus statistics before calculating relevance?

You could solve that in SQLite by giving each user their own separate FTS table - not impossibly complex, but would grow increasingly messy if you have 10s of thousands of users.


One way to address this in Elasticsearch would be to put each customers documents in their own index. Other than that, as far as I can tell it's not smart enough to apply filters first.

Also, shards can affect BM25 scoring: https://www.elastic.co/blog/practical-bm25-part-1-how-shards...


I wrote this blog :). Good to see it still getting use.

FYI for folks just skimming this, shards can affect scoring, but they don't have to. 2 mitigations: 1. The default in Elasticsearch has been 1 shard per index for a while, and many people (if not most) probably don't need more than 1 shard

2. You can do a dfs_query_then_fetch query, which adds a small amount of latency, but solves this problem

The fundamental tenant is accurate here that any time you want to break up term statistics (e.g. if you want each user to experience different relevance results for their own term stats) then yes, you need a separate index for that. I'd say that's largely not all that common though in practice.

A more common problem that warrants splitting indices is when you have mixed language content: the term "LA" in Spanish content adds very little information while it adds a reasonable amount of information in most English language documents (where is can mean Los Angeles). If you mix both content together, it can pollute term statistics for both. Considering how your segments of users and data will affect scoring is absolutely important as a general rule though, and part of why I'm super excited to be working on neutral retrieval now


Thanks for the clarifications! I've been spending the last 3 weeks deep in the weeds of TF/IDF scoring and was about to give up on Elastic Search when this got posted. The article has been eye opening!!!


It’s why you need “enough” data. Generally the distribution of teens is a power-law distribution. There’s a set of common terms and a very long tail of other terms. This will be the case across every shard if you do in fact have enough data per shard. This becomes a factor in certain distributed aggregations as well. YMMV of course.

In general you trade accuracy for speed. You tend to get a lot of speed for just a small amount of accuracy sacrifice.


Oh sure, but it is a side channel attack to be aware of.


It could be solved with an index partitioning feature, no idea if it already exists somewhere...


Yup, partial indexes have been a part of PG for decades, and partitioned tables (on tenant) will have different indexes per client automatically.

This would be fine.


ParadeDB supports partial indexes as well: https://docs.paradedb.com/search/full-text/index#partial-bm2...


The multi-tenancy problem actually already applies to almost every multitenant database in Postgres, Oracle and MySQL. Whether or not they use FTS. You just might not notice its impact in your case if query performance is "good enough".

War story time.

Awhile ago, I worked on a big SQL (Oracle) database, millions of rows per tenant, 10ks of tenants. Tenant data was highly dissimilar between most tenants in all sorts of ways, and the distributio n of data "shape" between tenants wasn't that spiky. Tenants were all over the place, and the common case wasn't that common. Tenants (multitenant tables and/or whole separate schemas) routinely got migrated between physical database hosts by hand (GoldenGate didn't exist at the beginning of this company, and was too expensive by the end).

A whole host of problems in this environment cropped up because, inside the DB, indexes and cache-like structures were shared between tenants. An index histogram for a DATETIME column of a multitenant table might indicate that 75% of the dates were in 2016. But it turns out that most of that 75% were the rows owned by a few big tenants, and the other 1000+ tenants on that database were overwhelmingly unlikely to have 2016 dates. As a result, query plans often sucked for queries that filtered on that date.

Breaking tables up by tenant didn't help: query plans, too, were cached by the DB. Issue was, they were cached by query text. So on a database with lots of separate (identical) logical schemas, a query plan would get built for some query when it first ran against a schema with one set of index histograms, and then another schema with totally different histograms would run the same query, pull the now-inappropriate plan out of the cache, and do something dumb. This was a crappy family of bug, in that it was happening all the time without major impact (a schema that was tiny overall is not going to cause customer/stability problems by running dumb query plans on tiny data), but cropped up unpredictably with much larger impact when customers loaded large amounts of data and/or rare queries happened to run from cache on a huge customer's schema.

The solve for the query plan issue? Prefix each query with the customer's ID in a comment, because the plan cacher was too dumb (or intended for this use case, who knows?) to strip comments. The SQL keys in the plan cache would end up looking like a zillion variants of "/* CUSTOMER-1A2BFC10 */ SELECT ....". I imagine this trick is commonplace, but folks at that gig felt clever for finding it out back in the bad old days.

All of which is to say: database multitenancy poses problems for far more than search. That's not an indictment of multitenancy in general, but does teach the valuable lesson that abandoning multitenancy, even as wacky and inefficient as that seems, should be considered as a first-class tool in the toolbox of solutions here. Database-on-demand solutions (e.g. things like neon.tech, or the ability to swiftly provision and detach an AWS Aurora replica to run queries, potentially removing everything but one tenant's data) are becoming increasingly popular, and those might reduce the pain of tenant-per-instance-ifying database layout.


BM25 is definitely a big deal when you're doing FTS. It's one of the reasons I've switched to Arango as my DB for the project I'm working on that needs FTS. The fact it also comes with graphs means I don't need to bang my head against CTEs either.

Not saying Arango can replace Postgres but for my needs, it's a much better fit AND it offers the FTS that I need out of the box.


Arango is sweet! We've actually talked to many people who switched from Postgres to Arango because of its better FTS. This was one of the reasons for creating ParadeDB in the first place. Many of the users who made this switch wished they could have stayed in Postgres without compromising on FTS.


I've learned information retrieval in university before ever using any full text search engine and I was honestly kinda surprised that most engines are literal textbook implementations. IR was one of those courses where I definitely expected the SotA to be further out than the third slide deck.


That’s probably because SotA in IR is proprietary implementations like Google Search, the most bread-and-butter trade secrets of their business. So they’re not publishing that (at least in any holistic manner, and even if they did, it’s not like one product could implement all its functionality including the complexities around the periphery). And what you’re left with is some pretty decent parity between industry and academic SotA, same as a lot of other areas of CS (databases, decentralized systems, etc). If anything, in most cases, implementation lags behind academic theory.


It kind of is now if you count deep-learning based solutions to an information need haha


That is true


It's been a while since I used tools in this space. Some questions:

Where does AI/semantic search/vector DBs fit into the state of the art today? Is it generally a total replacement for traditional text search tools like elastic, or does it sit side by side? Most, but not all, use-cases I can think of would prefer semantic search, but what are the considerations here?


As others mentioned you want to do combine both semantic and td-idf like searches. The thing is that searches that carry no semantic weight (e.g. a part number) or that have special meaning compared to the corpus used to train your embedding model (e.g. average Joe thinks about the building when seeing the word "bank", but if you work in a city planning firm you might only consider the bank of a river) fail spectacularly when using only semantic search.

Alternatively you can finetune your embedding model so that it was exposed to these words/meanings. However, the best (from personal experience) is doing both and use some sort of query rewriting on the full-text search to keep only the "keywords".


bm25 is often used along side vector search with a reciprocal rerank algorithm.


As others in this thread have mentioned, semantic/vector-based solutions tend to be much better when any of the following is true:

1. There are natural language questions being asked 2. There's ambiguity in any of the query terms but which there's more clarity if you understand the term in context of nearby terms (e.g. "I studied IT", "I saw the movie It", "What is it?") 3. There are multiple languages in either the query or the documents (many embedding models can embed to a very similar vector across languages) 4. Where you don't want to maintain giant synonym lists (e.g. user might search for or document might contain "movie" or "film" or "motion picture")

Whether you need any of those depends on your use case. If you are just doing e.g. part number search, you probably don't need any of this, and certainly not semantic/vector stuff.

But semantic/vector systems don't work well with terms that weren't trained in. e.g. "what color is part NY739DCP?" Fine tuning is bad at handling this (fine tuning is generally good for changing the format of the response, and generally not all that good for introducing or curtailing knowledge). Whether you need a keyword search system depends on whether your information are more of the "general knowledge" type or something more specific to your business.

Most companies have some need for both because they're building a search on "their" data/business, and you want to combine the results to make sure you're not duplicating. But I'll say I've seen a lot of companies get this sort of combination business wrong. Keeping 2 datastores completely in sync is wrought with failure, the models all have different limitations than the keyword system limitations which is good in some ways but can cause unexpected results in others, and the actual blending logic (be it RRF or whatever) can be difficult to implement "right."

I usually recommend folks look to use a single system for combining together these results as opposed to trying to invent yourself. Full disclosure: I work for Vectara (vectara.com) which has a complete pipeline that does combine both a neural/vector system and traditional keyword search, but I would recommend that folks look to combine these into a single system even if they didn't use our solution because it just takes so much operational complexity off the table.


There was some work toward this, not sure about its current status: https://github.com/postgrespro/rum

Badly needed.


RUM indices are a nice improvements, but they're not in Postgres core and likely won't ever be :( If you're looking for Elastic-level FTS in Postgres, your best bet is likely to overengineer pgvector+tsvector or simply to use ParadeDB pg_search


I wish SQLite FTS had a nicer to use API. Last time I looked (a few months ago) it was very confusing and not very user friendly.


I've built a bunch of Python (and CLI) code to try and help make that work better: https://sqlite-utils.datasette.io/en/stable/python-api.html#...

    import sqlite_utils
    
    db = sqlite_utils.Database("news.db")
    db["articles"].enable_fts(["headline", "body"])
    results = list(db["articles"].search("softball"))
Or with the CLI tool: https://sqlite-utils.datasette.io/en/stable/cli.html#configu...

    sqlite-utils enable-fts news.db articles headline body
    sqlite-utils search news.db articles softball


That's cool, maybe I can use it as a reference. Thank you!


I didn't realise it was such a big deal. I wrote a custom BM25 and (faceted? boosted?) search implementation using roaring bitmaps, memory mapped files and pebble to replace sqlite FTS when it couldn't cope anymore with my requirements (100+ GB search index).


Facets are possible in Postgres but it looks complex:

https://web.archive.org/web/20200815141031/https://roamanaly...

pg_search [0] not to be confused with the ruby gem of the same name [1]:

[0] - https://github.com/paradedb/paradedb/tree/dev/pg_search#over...

[1] - https://github.com/Casecommons/pg_search


I implemented faceted search on my own blog entirely on top of PostgreSQL. It works well, but that's because my blog has only 11,348 documents total.

Notes on how I built that here: https://simonwillison.net/2017/Oct/5/django-postgresql-facet...


That's a lot of posts for a blog :)


Faceting over even small result sets and producing the relevant counts is really hard for postgres, and caching tends to be impossible due to the combinatorics of the problem. There's some discussion around supporting better indexing options here that is relevant: https://www.postgresql.org/message-id/CAJqPDh9o%3DoxRP%3DxZh.... Practically, https://github.com/cybertec-postgresql/pgfaceting is the best bet for people who must use postgres even though it still uses a manual maintenance step.


If you care about search quality, postgres ends up being fairly limited. I've done stuff with it; it's alright for really simple stuff. But it's the wrong tool for the job generally.

If the order of the search results matters to you, you might want something that gives you some more tools to control that. And if you are not measuring search quality to begin with, you probably don't care enough to even know that you are missing the tools to do a better job.

I consult clients on this stuff professionally and I've seen companies do all sorts of silly shit. Mostly it's because they simply lack the in house expertise which is usually how they end up talking to me.

I've actually had to sit clients down and explain them their own business model. Usually I come in for some technical problem and then end up talking to product managers or senior managers about stuff like this because usually the real problem is at that level. The technical issues are just a symptom.

Here's a discussion I had with a client fairly recently (paraphrasing/exaggerating, obviously):

"Me: So your business model is that your users find shit on your web site (it has a very prominent search box at the top) and then some transaction happens that causes you to make money? Customer: yes. Me: so you make more money if your search works better and users find stuff they want. Customer: yes, we want to make more money! Me: congratulations, you are a search company! Customer: LOL whut?! Me: So, why aren't you doing the kinds of things that other search companies do to ensure you maximize profit? Like measuring how good your search is or generally giving a shit whether users can actually find what they are looking for. Customer: uhhhhh ???? Me: where's your search team? Customer: oh we don't have one, you do it! Me: how did you end up with what you currently have. Customer: oh that guy (some poor, overworked dev) over there picked solution X 3 years ago and we never gave it a second thought."

Honestly, some companies can't be helped and this was an example of a company that was kind of hopelessly flailing around and doing very sub optimal things at all levels in the company. And wasting lots of time and money in the process. Not realizing your revenue and competitiveness are literally defined by your search quality is never a good sign. You take different decisions if you do.


I certainly get the sentiment of your comment, but I'd be much more interested to hear your take on ParadeDB which the post is about?


It seems to be a slightly better version of what postgresql offers. I have no experience or insights in what it does as it seems to be invite only currently.

But it joins a long list of not quite Elasticsearch alternatives with a much smaller/narrower/ more limited feature set. It might be good enough for some.

As for the criticism regarding Elasticsearch:

- it's only unstable if you do it wrong. Plenty of very large companies run it successfully. Doing it right indeed requires some skill; and that's a problem. I've a decade plus of experience with this. I've seen a lot of people doing making all sorts of mistakes with this. And it's of course a complicated product to work with.

- ETL is indeed needed. But if you do that properly, that's actually a good feature. Optimizing your data for search is not optional if you want to offer good search experience. You need to do data enrichment, denormalization, etc. IMHO it's a rookie mistake not to bother with architecting a proper ETL solution. I'd recommend doing this even if you use posgres or paradedb.

- Freshness of data. That's only a problem if you do your ETL wrong. I've seen this happen; it's actually a common problem with new Elasticsearch users not really understanding how to architect this properly and taking short cuts. If you do it right, you should have your index updated within seconds of database updates and the ability to easily rebuild your indices from scratch when things evolve.

- Expensive to run. It can be; it depends. We spend about 200/month on Elastic Cloud for a modestly sized setup with a few million documents. Self hosting is an option as well. Scaling to larger sizes is possible. You get what you pay for. And you can turn this around: what's a good search worth to you? Put a number on it in $ if you make money by users finding stuff on your site or bouncing if they don't. And with a lot of cloud based solutions you trade off cost against convenience. Bare metal is a lot cheaper and faster generally.

- Expensive engineers. Definitely a challenge and a good reason for using things like Algolia or similar search as a service products. But on the other hand if your company's revenue depends on your search quality you might want to invest in this.


ParadeDB founder here. We are open-source: https://github.com/paradedb/paradedb. Very much not "invite only". We have hundreds of open-source deployments in production and several paying customers.

Our bring-your-own-cloud solution, which is our primary hosted service, is indeed in developer preview and if anyone is interested in using it, you can contact us. It will enable adding ParadeDB to AWS RDS/Aurora.


My apologies. That wasn't clear from your website though at a glance. But I did not spend a lot of time looking at it.


For internet user facing full-text search I would always prefer to use a separate tool and not a SQL database, because

- the fulltext tool, can and should hold only 'active' data

- as it has only active data, data size is usually much much smaller

- as data size is smaller, it better fits in RAM

- as data size is smaller, it can be probably run on poorer HW the full ACID db

- as the indexed data are mostly read-only, the VM where it runs can be relatively easily cloned (never seen a corruption till now)

- as FTS tools are usually schema-less, there is no outage during schema changes (compared to doing changes in ACID db)

- as the indexed data are mostly read-only, the can be easily backup-ed

- as the backups are smaller, restoring a backup can be very fast

- and there is no such thing as database upgrade outage, you just spin a new version, feed it with new data and than change the backends

- functionality and extensibility

There is probably more, but if one doesn't needs to do a fulltext search on whole database (and you usually don't), than its IMHO better to use separate tool, that doesn't comes with all the ACID constraints. Probably only downside is that you need to format data for the FTS and index them, but if you want run a serious full-text search, you will have to take almost the same steps in the database.

On a 15y old side project, I use SOLR for full-text search, serving 20-30k/request per day on a cheap VM, and PostgreSQL is used as primary data source. The PostgreSQL has had several longer outages - during major upgrades, because of disk corruption, because of failed schema migrations, because of 'problems' between the chair and keyboard etc... During that outages the full-text search always worked - it didn't had most recent data, but most users probably never noticed.


playing devils avocate

> the fulltext tool, can and should hold only 'active' data

very possible with postgres, too. instead of augmenting your primary table to support search, you would have a secondary/ephemeral table serving search duties

> as data size is smaller, it better fits in RAM

likewise, a standalone table for search helps here, containing only the relevant fields and attributes. this can be further optimized by using partial indexes.

> as FTS tools are usually schema-less, there is no outage during schema changes (compared to doing changes in ACID db)

postgresql can be used in this manner by using json/jsonb fields. instead of defining every field, just define one field and drop whatever you want in it.

> as the indexed data are mostly read-only, the can be easily backup-ed

same for postgres. the search table can be exported very easily as parquet, csv, etc.

> as the backups are smaller, restoring a backup can be very fast

tbh regardless of underlying mechanism, if your search index is based on upstream data it is likely easier to just rebuild it versus restoring a backup of throwaway data.

> The PostgreSQL has had several longer outages - during major upgrades, because of disk corruption, because of failed schema migrations, because of 'problems' between the chair and keyboard etc...

to be fair, these same issues can happen with elasticsearch or any other tool.

how big was your data in solr?


PostgreSQL is over 90GB, SOLR is under 1GB.

The PostgreSQL has to handle writes, reports, etc..., so I doubt it will cache as efficiently as full-text engine, you'll need to have full or partial replicas to distribute the load.

And, yes, I agree, almost all of this can be done with separate search table(s), but this table(s) will still live in a 'crowded house', so again replicas will be probably necessary at some point.

And using replicas brings new set of problems and costs ;-)

One client used MySQL for fulltext search, it was a single beefy RDS server, costing well over $1k per month and the costs kept raising. It was replaced with a single ~$100 EC2 machine running Meilisearch.


Meilisearch is good, probably sufficient for a vast majority of use cases.

In fact self hosting it is so easy and performant that I wonder how their paid saas is doing.


While agreeing somewhat with the post above, the answer isn't really so black and white but depends on your context, i.e. scale, app-complexity, search needs, data size etc.

>the fulltext tool, can and should hold only 'active' data

Same can be said about your DB. You can create separate tables, partitions to hold only active data. I assume materialized views are also there(but never used them for FTS). You can even choose to create a separate postgres instance but only use it for FTS data. The reason to do that might be to avoid coupling your business logic to another ORM/DSL and having your team t learn another query language and its gotchas.

> as data size is smaller, it better fits in RAM

> as data size is smaller, it better fits in RAM

> as the indexed data are mostly read-only, the VM where it runs can be relatively easily cloned

> as the indexed data are mostly read-only, the can be easily backup-ed

> as the backups are smaller, restoring a backup can be very fast

Once the pg tables are separate and relevant indexing, i assume PG can also keep most data in memory. There isn't anything stopping you from using a different instance of PG for FTS if needed.

> as FTS tools are usually schema-less, there is no outage during schema changes

True. But in practice for example ES does have schema(mappings, columns, indexes), and will have you re-index your rows/data in some cases rebuild your index entirely to be safe. There are field types and your querying will depend on the field types you choose. i remember even SOLR did, because i had to figure out Geospatial field types to do those queries, but haven't used it in a decade so can't say how things stand now.

https://www.elastic.co/guide/en/elasticsearch/reference/curr...

While the OPs point stands, in a sufficiently complex FTS search project you'll need all of the features and you'll have to deal with the following on search oriented DBs

- Schema migrations or some async jobs to re-index data. Infact it was worse than postgres because atleast in RDBMS migrations are well understood. In ES devs would change field types and expect everything to work without realizing only the new data was getting it. So we had to re-index entire indexes sometimes to get around this for each change in schema.

- At scale you'll have to tap into WAL logs via CDC/Debezium to ensure your data in your search index is up-to-date and no rows were missed. Which means dealing with robust queues/pub-sub.

- A whole another ORM or DSL for elasticsearch. If you don't use these, your queries will soon start to become a mish-mash of string concats or f-strings which is even worse for maintainability.

https://elasticsearch-py.readthedocs.io/en/v8.14.0/ https://elasticsearch-dsl.readthedocs.io/en/latest/

- Unless your search server is directly serving browser traffic, you'll add additional latency traversing hops. In some cases meilisearch, typesense might work here.

I usually recommend engineers(starting out on a new search product feature) to start with FTS on postgres and jump to another search DB as and when needed. FTS support has improved greatly on python frameworks like Django. I've made the other choice of jumping too soon to a separate search DB and come to regret it because it needed me to either build abstractions on top or use DSL sdk, then ensure the data in both is "synced" up and maintain observability/telemetry on this new DB and so on. The time/effort investment was not linear is and the ROI wasn't in the same range for the use-case i was working on.

I actually got more mileage out of search by just dumping small CSV datasets into S3 and downloading them in the browser and doing FTS client side via JS libs. This basically got me zero latency search, albeit for small enough per-user datasets.


Yes, it always depends on application and purpose.

But once you will have to deal with a real FTS load, as you say, you have to use separate instances and replication, use materialized views etc.. and you find your self almost halfway to implementing ETL pipeline and because of replicas, with more complicated setup than having a FTS tool. And than somebody finds out what vector search is, and ask you if there is an PG extension for it (yes it is).

So IMHO with FTS in database, you'll probably have to deal with the almost same problems as with external FTS (materialized views, triggers, reindexing, replication, migrations) but without all its features, and with constrains of ACID database (locks, transactions, writes)...

Btw. I've SOLR right behind the OpenResty, so no hops. With database there would be one more hop and bunch of SQL queries, because it doesn't speaks HTTP (although I'm sure there is an PG extension for that ;-)


> Because Elasticsearch is not a reliable data store, organizations that use Postgres typically extract, transform, and load (ETL) data from Postgres to Elasticsearch

I’ll admit haven’t kept up with this but is it still the case that Elasticsearch is “not a reliable data store”?

I remember there used to be a line in the Elasticsearch docs saying that Elasticseach shouldn’t be your primary data store or something to that effect. At some point they removed that verbiage, seemingly indicating more confidence in their reliability but I still hear people sticking with the previous guidance.


It got a lot better in the ~7 series IIRC when they added checksums to the on-disk files. I don't know if you still have to recover corruptions by hand, or whether the correct file gets copied in from a replica.

The replication protocols and leader election were IMO not battle-hardened or likely to pass Aphyr-style testing. It was pretty easy to get into a state where the source of truth was unclear.

Source: Ran an Elasticsearch hosting company in the 2010's. A little out of the loop, but not sure much has changed.


I'm not sure what the author was referring to, but in our stack ES is the only non-serverless tech we have to work with. I know there's a lot of hate in HN around serverless for many reasons but for us for several years, we've been able to scale without any worry of our systems being affected performance-wise (I know this won't last forever).

ES is not this way, we have to manage our nodes ourselves and figure out "that one node is failing, why?" type questions. I hear they're working on a serverless version, but honestly, I think we will be leaving ES before that happens.


Elastic’s serverless offering went GA recently: https://www.elastic.co/elasticsearch/serverless


It's still in technical preview.


> that one node is failing, why

We have a similar experience except we’re using AWS’ version, so any inquiry as to what happened ends with support saying “maybe if you upgrade your nodes this won’t happen again? idk”


And that's why we pay AWS the big bucks.


What are you considering replacing it with for full-text search?


Just read the article: „Elasticsearch’s lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging.“


ES is just unreliable. Can be running smoothly for a year and boom it falls over and you’re left scratching your head.


A problem here is that ES (and Solr too) are pathological with respect to garbage collection.

To make generational GC efficient, you want to have very short lived objects, or objects that live forever. Lots of moderately long-lived objects is the worst case scenario, as it causes permanent fragmentation of the old GC generation.

Lucene generally churns through a lot of strings while indexing, but it also builds a lot of caches that live on-heap for a few minutes. Because the minor GCs come fast and furious due to indexing, that means you have caches that last just long enough to get evicted into the old generation, only to become useless shortly thereafter.

The end result looks like a slow burning memory leak. I've seen the worst cases take down servers every hour or two, but this can accumulate over time on a slower fuse as well.


Can this be fixed with alternative GC’s or tuning?


It might be better by now with the newer GC options (ZGC, G1, Azul). For a while those had their own problems, but I'm a little out of the loop.

Tuning the older options wasn't really all that beneficial. We tried tuning, then running a custom build of OpenJDK to mess with the survivor space (which isn't that tunable via config), then ultimately settled on more aggressive (i.e. weekly) rolling restarts of servers.


I've been using and supporting ES (and OS lately) for well over a decade. Mostly its fine but a lot of users struggle with sizing their clusters properly (which is costly). Elasticsearch falling over is what happens when you don't do that. It scales fine until it doesn't and then you hit a brick wall and things get ugly.

Additionally, many companies learn the hard way that dynamic mapping is a bad idea because you might end up with hundreds of fields and a lot of memory overhead and garbage collection. I've fixed more than a few situations like this for clients that ended up with hundreds or thousands of fields, many shards and indices. Usually it's because they are just dumping a lot of data in there without thinking about how to optimize that for what they need.

A properly architected setup is not going to fall over randomly. But you need to know what you are doing and there are a lot of clients that I help that clearly don't have the in house expertise to do this properly and are a bit out of their depth.


That exact scenario just happened to me a few days ago.


What would you consider as replacement


SingleStore is pretty good at ES use cases. Low latency, scalability, real time ingest, full text search + millisecond query times.


Yes, SingleStore seems to be really good.


Thanks a lot


How did you fix it? What happened?


This seems to be a Lucene/SOLR problem. I used Lucene/SOLR years ago, and it died so often that we just auto-indexed nightly and had a re-index button in the UI.


They publicly track their resillency efforts here

https://www.elastic.co/guide/en/elasticsearch/resiliency/mas...

>7 its become quite reliable. It comes down to how you manage and maintain the cluster


I've been abusing it as a data store for many years. I don't recommend it but not because of a lack of reliability. It's actually fine but you need to know what you are doing and it's not exactly an optimal solution from a performance point of view.

The main issue is not lack of robustness but the fact that there are no transactions and it doesn't scale very well on the write side unless you use bulk inserts. You can work around some of the limitations with things like optimistic locking to guarantee that you aren't overwriting your own writes. Doing that requires a bit of boiler plate. For applications with low amounts of writes, it's actually not horrible if you do that. Otherwise, if you make sure you do regular backups (with snapshots), you should be fine. Additionally, you need to size your cluster properly. Things get bad when you run low on memory or disk. But that's true for any database.

If you are interested in doing this, my open source kotlin client for Elasticsearch and Opensearch (I support both) has an IndexRepository class which makes all this very easy to use and encapsulates a lot of the trickery and bookkeeping you need to do to get optimistic locking. I also have a very easy to use way to do bulk indexing that takes care of all the bookkeeping, retries, etc. without requiring a lot of boiler plate. And of course because this is Kotlin, it comes with nice to use kotlin DSLs.

You can emulate what it does with other clients for other languages but I'm not really aware of a lot of other projects that make this as easy. Frankly, most clients are a bit bare bones and require a lot of boiler plate. Getting rid of boiler plate was the key motivation for me to create this library.

https://github.com/jillesvangurp/kt-search


Solr. The correct answer is Solr. All the search features of Elasticsearch (both are built on Lucene, which provides the actual search implementation) without the problems that Elastic (the company) or Elasticsearch (the product) brings with it. 99% of companies using Elasticsearch would be far better of with Solr.


Solr will work perfectly if you want to do a standard search (db, shop) or need a customized approach, like custom transformers, ranking, etc. However, if you aim for "something that just works good enough" and has much better scaling capability and tooling, you can't beat ES/OpenSearch here (Solr scaling is much more manual process than ES/OS). And if you want to stay away from ElasticSearch drama, stick to OpenSearch - they are shipping excellent stuff these days.


Never heard of it before. I find that Apache often does not do well at promoting their products."


Solr is seen (probably unfairly) as passé by many people.

I used Solr back in 2008-2009 or so and it did a great job, but people didn't like that it used XML rather than JSON. Then we had a requirement for something like Elasticsearch's percolate query functionality, which Solr didn't support. So we switched, and subsequent projects I've joined have all used ES.

As I understand, Solr now has a JSON REST API and it's improved in other ways over the years, but ES has quite a bit more mindshare these days: https://books.google.com/ngrams/graph?content=solr%2Celastic...

So now in addition to the "does it do the job?" question, teams might also ask "can we hire and retain people to work with this technology rather than the more popular alternative?".



Manticore is seldom mentioned around here but it seems to be a really good alternative [1]. The fact that is the search engine Craigslist uses gives them a lot of credibility.

This may be a bit silly, but I wonder if the design of their blog and website makes them look a bit "old fashioned" or something.

--

1: https://manticoresearch.com/blog/manticore-alternative-to-el...


Funny you mentioned because I also get that vibe from the Manticore website. We do judge books by their covers afterall.


Manticore is primarily oriented towards the MySQL world


It uses the MySQL protocol and has some extra features for MySQL users like the Federated engine [1], so it should definitely be nice for MySQL users. But it also has an HTTP API [2] so sending data to manticore should be no different than sending it to Elastic, Meili or whatnot.

--

1: https://manual.manticoresearch.com/Extensions/FEDERATED

2: https://manual.manticoresearch.com/Connecting_to_the_server/...


Currently using meilisearch, the DX is a lot better than Elastic Search. Yeah I'm sure you can get better tuned results with ES but for simple full text search + facets you're setup and running a lot faster. Also request format tends to be stable between versions.

My main gripe is how slow to display their API documentation is. I don't know how they managed to make a text only website take 3 or 4 seconds per link.


> My main gripe is how slow to display their API documentation is. I don't know how they managed to make a text only website take 3 or 4 seconds per link.

Fortunately, the Meilisearch documentation website is no longer the slowest website of all times! https://x.com/striftcodes/status/1823637020121440305?s=46&t=...


I hate how slow the docs have become. Fwiw, we're working on fixing it :p


Typesense has been a godsend; amazing piece of tech and on top of that fairly easy to operate even in HA mode; if you can fit your dataset in memory I really recommend giving it a try


Yeah, but those are separate things to learn and operate.


Postgres has very limited features related to FTS, so it is useful only for very simple cases or for very limited budget. Elasticsearch is stable enough(but good devops and devs needed) and has numerous of features. Anyway Elasticsearch is not a "plug and play" solution and requires much work to be done for good search quality. Why I know this all? I'm running ecommerce search startup, based on Elasticsearch as primary storage.


Always great to see Postgres-based alternatives.

One clarification question - the blog post lists "lack of ACID transactions and MVCC can lead to data inconsistencies and loss, while its lack of relational properties and real-time consistency makes many database queries challenging" as the bad for ElasticSearch. What is pg_bm25's consistency model? It had been mentioned previously as offering "weak consistency" [0], which I interpret to have the same problems with transactions, MVCC, etc?

[0]: https://news.ycombinator.com/item?id=37864089


Good question. That was from a very old version of pg_bm25 (since renamed to pg_search). BM25 indexes are now strongly consistent.


Nice! I've seen other extensions that don't have transactional semantics, which runs counter to the norm for PG.

So since it was previously weakly consistent due to performance reasons, how does strong consistency affect transactional inserts/updates latency?


It adds a small overhead to transactions. The exact number depends on how much you're inserting and indexing, but is in the milliseconds.

We have adopted strong consistency because we've observed most of our customers run ParadeDB pg_search in a separate instance from their primary Postgres, to tune the instance differently and pick more optimized hardware. The data is sent from the primary Postgres to the ParadeDB instance via logical replication, which is Zero-ETL. In this model, there are no transactions powering the app executed against the ParadeDB instance, only search queries, so this is a non-issue.

We had previously adopted weak consistency because we expected our customers to run pg_search in their primary Postgres database, but observed this is not what they wanted. This is specifically for our mid-market/enterprise customers. We may bring back weak consistency as an optional feature eventually, as it would enable faster ingestion.


I've had good experiences with both Postgres + ES and also Postgres + Meilisearch. The latter one I only used for small side projects but it was very easy to get running (With a RoR app).


The analytical part is what usually makes the Elasticsearch (or opensearch) the choice - it's not that you have to search just the text, its that with that text you need to show 5 different categories some of the top values and the number of occurrences and permutations thereof if the user decides to clicky clicky and filter it more.


How does Elasticsearch (or the ELK stack) would help implement what you described?


ELK stack is unrelated, this is the analytical/OLAP use-case not the logging stack stuff. Although you can say that logging is a specific use-case of the analytical/OLAP.

Search for the discussion around "facets".

Basically the way to think of this is, elastic/solr are MapReduce with weak database semantics layered on top. The ingest stage is the "map", you ingest JSON documents and optionally perform some transformation on them. The query is the "reduce", and you are performing some analytical summation or search on top of this.

The point of having the "map" step is that on ingest, you can augment the document with various things like tokenizations or transforms, and then write those into the stored representation, like a generated column. This happens outside your relational DB and thus doesn't incur RDBMS storage space for the transformed (and duplicated!) data, or the generation expense at query time. Pushing that stuff inside your relational store is wasteful.

It's an OLAP store. You have your online/OLTP store, you extract your OLAP measurements as a JSON document and send it to the OLAP store. The OLAP store performs some additional augmentations, then searches on it.

What facets let you do is have "groups within groups". So you have company with multiple employees inside it, you could do "select * from sales where company = 'dell' group by product facet by employee". But it's actually a whole separate layer that runs on top of grouping (and actually behaves slightly differently!), and because of the way this is done (inverted indexes) this is actually incredibly efficient for many many groups.

It's built on a "full-scan every time" idiom rather than maintaining indexes etc... but it actually makes full-scans super performant, if you stay on the happy path. And because you're full-scanning every time, you can actually build the information an index would have contained as you go... hence "inverted index". You are collecting pointers to the rows that fit a particular filter, like an index, and aggregating on them as it's built.

And the really clever bit is that it can actually build little trees of these counts, representing the different slices through OLAP hyperspace you could take when navigating your object's dimensions. So you can know that if you filter by employee there are 8 different types of products, and if we filter by "EUV tool" then there is only 1 company in the category. Which is helpful for hinting that sidebar product navigation tree in stores etc.

A lot of times if you look carefully at the HTML/JS you can see evidence of this, it will often explicitly mention facets and sometimes expose Lucene syntax directly.


This is a very good explanation of what we've found keeps users on Elastic: The combination of FTS and fast facets/aggregates in the same query. At ParadeDB, we've implemented support this in what we call aggregations: https://docs.paradedb.com/search/full-text/aggregations


Semi-related: I guess some companies have already started augmenting their search with the use of LLMs. If you've worked on a similar project, what is your (very) high-level architecture? Did you see a noticeable difference in relevance and query time?


I haven't used LLMs but am using a hybrid approach with pg_vector: https://github.com/agoodway/vecto (some useful links in the README).

One way that comes to mind is using LLMs is to re-write the query before shoving it into the search engine (known as "query rewriting").

More full-text ideas here: https://gist.github.com/cpursley/c8fb81fe8a7e5df038158bdfe0f...


I believe the primary issue with using full-text search in PostgreSQL is its performance and scalability limitations.

Elasticsearch/OpenSearch offers more than just search functionality; they come with extensive ecosystems, including ELK for logging and processors for ETL pipelines. These platforms are powerful and provide out-of-the-box solutions for developers. However, as the article mentioned, scaling them up or out can be costly.

ParadeDB looks like is based on tantivy(https://github.com/quickwit-oss/tantivy) which is an impressive project. We leverage it to implement full-text indexing for GreptimeDB too (https://github.com/GreptimeTeam/greptimedb). Nevertheless, building full-text indexes with Tantivy is still resource-intensive. This is why Grafana Loki, which only indexes tags instead of building full-text indexes, is popular in the observability space. In GreptimeDB, we offer users the flexibility to choose whether to build full-text indexes for text fields, while always creating inverted indexes for tags.


https://github.com/infiniflow/infinity, dense vector + sparse vector + fulltext search(BM25) + late interact reranker(Colbert)


Solr still seems to be an option


Just upgraded an old system to solr 9. I've been very impressed with it.


Solr and ElasticSearch are both based on Lucene.


I remember Lucene, forgot that Solr is a derivative


AWS RDS support is key IMO, I have done a lot of consulting on infrastructure and most clients I had don't want to have third party services and don't want to manage their DBs.


Agreed! But for most of our customers, they want to separate their OLTP Postgres from their Search Postgres. This is because query patterns differ, and you want to tune Postgres and choose hardware differently. This also provides higher reliability and ensures search queries can't slow down transactions.

ParadeDB implements this by ingesting logical replication (WALs) from the primary Postgres. This gives a "closely coupled" solution, where you have the feel of a single Postgres database (ParadeDB is essentially an extra read replica in your Postgres HA setup). More here: https://docs.paradedb.com/replication/pg_search


Are there success cases of using ElasticSearch/OpenSearch as the primary data store in production? Just curious.


I've been using it for a small real estate app for like ~8 years. Zero problems, works like a dream. I use MySQL for some app-user data, but most of the bulk data is on ES.

Solo dev running a 3 node cluster on Hetzner.


Is this distributed as a PostgreSQL extension as well?


yeah looks like how they distribute too, basically pre installed and configured pg_search pg_datalake and pgvector


Correct. If you run your own Postgres, you can install pg_search directly, otherwise we suggest our users integrate with their existing Postgres deployments (say AWS RDS) via logical replication: https://docs.paradedb.com/replication/pg_search


> We’ve talked to many companies who have tried and regretted their decision to use Elasticsearch as their primary data store

oh no…




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: