
Hopefully this will let each of them compete on their own merits.

I’ve been tossing up moving our workloads to Elastic Cloud anyway, because AWS ES Service is a source of constant headaches for us. Feels like at least once a week a server ends up in a state where we can’t fix it, and AWS engineers have to manually fix their internal state.

Their standard response is “add more nodes”; well, we did that, and it is costing us an arm and a leg, and it didn’t fix the problems. (Plus, now we have new problems where networking blips appear to be causing quorum problems and sending the cluster into a death spiral.)

Purely based on the customer experience, it feels like Elastic Cloud has to be better; the whole licensing debacle has definitely turned me off Elastic, though.




I also had a serious problem with AWS Managed ES. In my case, some of the heap didn't free on every garbage collection, which would eventually result in cluster failure. This was likely a JVM misconfiguration and was most easily observed as a shrinking sawtooth pattern on the memory graphs. It resulted in a multi-day marathon of sleeplessness, keeping the cluster alive by continually rolling it every few hours. (We initially assumed we had done something wrong and investigated ourselves first... eventually I shunted traffic to both my own cluster and the managed cluster; my cluster did free heap as expected, and we successfully switched over without downtime, but wow was it ever hairy during high-traffic periods.)

We saved a bunch of money and gained performance by using our own cluster. That cluster hasn't gone down since... years later.

It's very difficult to debug these problems when you don't have direct access to Elasticsearch's configuration... what would normally take minutes to verify can take hours to isolate.
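
For what it's worth, the per-node heap numbers are at least visible over the REST API even on a managed cluster. Here's a minimal sketch of the kind of thing I mean, assuming the managed service exposes the standard _nodes/stats API (the endpoint URL and polling interval below are made up):

    import time
    import requests

    ES_URL = "https://my-cluster.example.com:9200"  # hypothetical endpoint

    def heap_used_percent():
        # GET /_nodes/stats/jvm returns per-node JVM memory stats,
        # including jvm.mem.heap_used_percent.
        resp = requests.get(f"{ES_URL}/_nodes/stats/jvm", timeout=10)
        resp.raise_for_status()
        return {n["name"]: n["jvm"]["mem"]["heap_used_percent"]
                for n in resp.json()["nodes"].values()}

    # Sample once a minute; if the low point after each collection keeps
    # creeping up (the shrinking sawtooth described above), heap is not
    # being reclaimed.
    while True:
        stamp = time.strftime("%H:%M:%S")
        for name, pct in sorted(heap_used_percent().items()):
            print(f"{stamp} {name}: heap {pct}%")
        time.sleep(60)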


For what it's worth, this is most likely not a JVM configuration issue and more likely an ES/OpenSearch issue.


I'm fairly confident it was, because I later found a ticket where very similar behavior was isolated to a jvm.options configuration problem.

Effectively, a newer config file was missing a jvm.options line, which changed the behavior on an older machine setup. I would not be surprised if AWS deployed a new config to an old environment.

Unfortunately, not having direct machine access, I could not confirm whether this was the case in the failing cluster.
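
One thing that can sometimes be checked without machine access is the set of JVM flags each node was actually started with, via the nodes info API. A rough sketch, assuming the endpoint is reachable and the managed service doesn't filter the _nodes/jvm endpoint (AWS may well restrict it; the URL is made up):

    import requests

    ES_URL = "https://my-cluster.example.com:9200"  # hypothetical endpoint

    # GET /_nodes/jvm reports, per node, the JVM version and the
    # input_arguments the process was started with (i.e. what jvm.options
    # actually resolved to).
    resp = requests.get(f"{ES_URL}/_nodes/jvm", timeout=10)
    resp.raise_for_status()

    flags_by_node = {
        n["name"]: set(n["jvm"].get("input_arguments", []))
        for n in resp.json()["nodes"].values()
    }

    # Flags present on some nodes but missing on others would point at a
    # jvm.options file that drifted between old and new machines.
    common = set.intersection(*flags_by_node.values())
    for name, flags in sorted(flags_by_node.items()):
        print(name, "extra flags:", sorted(flags - common) or "none")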


FWIW, that experience is exactly the same as what my team has... but we're on Elastic's hosting. Constant black-box breakages that can't be debugged, nodes getting restarted for opaque reasons... We've been considering switching to Amazon's hosting to avoid these.

Sounds like it might be a "grass is always greener" scenario... maybe I'll spend more time looking into self-hosting...


Ditto. Really want to support Elastic but have found their cloud offering just randomly breaks. Latest problem is we lost a node and the new node couldn't recover its shards from backups. Constant crashing until we upsized the instance to have more memory. Makes no sense and support wouldn't lift a finger to help.


Are you running dedicated master nodes? We switched away from "every node does everything" a few years ago, and it's been incredibly stable.
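
If you're not sure what a cluster is actually running, the _cat/nodes API lists each node's role letters; a dedicated master shows up with only the master role. A quick sketch (the endpoint URL is made up, and this assumes the managed service allows the cat APIs):

    import requests

    ES_URL = "https://my-cluster.example.com:9200"  # hypothetical endpoint

    resp = requests.get(
        f"{ES_URL}/_cat/nodes",
        params={"h": "name,node.role,master", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()

    for node in resp.json():
        roles = node["node.role"]
        # A dedicated master typically shows just "m" here; a role string
        # that also contains "d" means the node is holding data as well.
        print(f"{node['name']}: roles={roles} elected={node['master']}")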


I'm surprised they'd even let you run a cluster without dedicated master nodes. It's basic good practice and their docs emphasize this quite clearly.


I don't remember if it was that obvious 12 years ago... but yes, it is very clear these days.


We had similarly terrible experiences with AWS ES, so we moved to self-hosted Elastic. It's better, but still pricey, and it requires more man-hours dedicated to tuning.

Their move away from open-source has been unfortunate. For that and some other reasons we've ended up more impressed with Logz.io and Splunk SaaS.


What kind of tuning do you do? We spend time, but typically on upgrades... very rarely have we spent much time on tuning. I'm curious what kind of tuning people are doing beyond mlock and having enough memory/nodes?
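
For comparing notes, both of those baseline settings are visible over the API: bootstrap.memory_lock shows up as process.mlockall in the nodes info, next to the configured heap ceiling. A small sketch (the endpoint URL is made up):

    import requests

    ES_URL = "https://my-cluster.example.com:9200"  # hypothetical endpoint

    # GET /_nodes/process,jvm reports process.mlockall (true when
    # bootstrap.memory_lock actually took effect) and the JVM heap limit.
    resp = requests.get(f"{ES_URL}/_nodes/process,jvm", timeout=10)
    resp.raise_for_status()

    for n in resp.json()["nodes"].values():
        heap_gib = n["jvm"]["mem"]["heap_max_in_bytes"] / 2**30
        print(f"{n['name']}: mlockall={n['process']['mlockall']} "
              f"heap_max={heap_gib:.1f} GiB")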


I mean, logz.io is okay and all, but it's quite expensive and has a loooot of outages. I don't remember a week where we didn't have issues, the most common being log ingestion lag, which takes hours to fix. That, and their API is bad: you can only query over a two-day period, and their API keys allow full control over everything.


I think the AWS ES Service is the worst AWS service I've ever used. We have constant issues with it, and AWS staff don't seem all that comfortable working with it, which is rare in my experience. All in all, it's the most hands-on 'managed service' I've used.


Like the others have said, it definitely seems to be hit or miss. I inherited an old cluster running on Elastic Cloud and migrated it to AWS managed; it reduced monthly costs by nearly an order of magnitude, on top of going from a 5-15 minute outage per day for master node election to three outages over two years.

We've rewritten most of the interactions with the older version of ES into a new service that uses a self-managed OpenSearch cluster on Graviton instances, and it's the most stable Elasticsearch/Solr solution I've ever interacted with.


I hate to say it but your problem here is ES itself.

If you just need "Lucene but clustered", I highly suggest looking at Solr instead; its design is much more straightforward, and it has pretty much all of the most important indexing knobs that ES has.

If, however, you are tightly coupled to the ES API or use it with third-party systems, you are sort of up shit creek without a paddle...

I ran ES at large scale for many years, eventually gave up, and only use Solr or custom-built search engines these days.


We run our own ES cluster on EC2 instances; it's not very hard. The hardest part is just reading the docs carefully before upgrading... we also don't modify schemas or mappings very often. Oh, and we run 3 masters and 3 clients, and the rest are data nodes... since doing that, knock on wood, it's been super stable.


> Their standard response is “add more nodes”; well, we did that, and it is costing us an arm and a leg, and it didn’t fix the problems. (Plus, now we have new problems where networking blips appear to be causing quorum problems and sending the cluster into a death spiral.)

This. I have had similar experiences, and adding more nodes just amplified the issues; the nodes were also all barely loaded.



