Things to know before using AWS’s Elasticsearch Service (acloud.guru)
Something else people should know: AWS ES is on the Internet. You can't deploy it to a VPC yet, and you can only lock it down using IAM, which may or may not be good enough for your use case.

For those who prefer the ease provided by the AWS ES service, consider Elastic Cloud, which affords most of the same capabilities but is run by Elastic themselves (it was previously known as Found, which Elastic purchased a few years ago). There's also an Enterprise offering. If you're looking for a hosted Elasticsearch solution, it's probably better than what AWS is offering. Side note: they update about as often as Elastic releases, whereas AWS ES is consistently behind.

As Daniel Parker (acloud.guru) mentioned in the article comments, they went with Algolia [1] despite being AWS experts. He cited the uncertainty and complexity around AWS ES as the problem.

[1] https://www.algolia.com/

An attacker is only ever one RCE on one server away from being on your VPC subnet. You're going to have to set up authentication for internal applications anyway, although I suppose vulnerabilities in the login process are harder to exploit if you can't even get to it.

I'm curious about how the pricing compares. I'm not very satisfied with AWS ES, but managing ES manually doesn't seem like the most fun either. (In fairness it looks like there's not too many knobs to turn, but it's still another concern to have to personally deal with.)

Elastic Cloud doesn't require any more management than AWS. You just need to click some buttons to add capacity: https://www.elastic.co/cloud

The IAM authentication is really annoying. It's not supported by many client libraries, nor have I found an easy way to make arbitrary HTTP calls with signature v4.

The only other options are completely public or an IP-based whitelist, the latter of which is untenable in most cloud environments.
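For anyone wondering what Signature Version 4 signing actually involves, here is a minimal sketch using only the Python standard library. It is not taken from any client library; the function name `sigv4_headers` and its parameters are made up for illustration. It assumes long-lived credentials (no session token), signs only the `host` and `x-amz-date` headers, and assumes a simple path with no characters that need escaping.

```python
import datetime
import hashlib
import hmac
from urllib.parse import parse_qsl, quote, urlsplit


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_headers(method, url, region, access_key, secret_key,
                  body=b"", service="es", now=None):
    """Return the headers that sign one request (hypothetical helper).

    The HTTP client must send a Host header equal to the URL's host,
    because that value is part of what gets signed.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")
    parts = urlsplit(url)

    # Step 1: canonical request (method, path, sorted query, headers, body hash).
    # Assumption: the path is simple; special characters need extra care.
    canonical_uri = quote(parts.path or "/", safe="/-_.~")
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    canonical_query = "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}" for k, v in pairs
    )
    canonical_headers = f"host:{parts.netloc}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    canonical_request = "\n".join([
        method.upper(), canonical_uri, canonical_query,
        canonical_headers, signed_headers, hashlib.sha256(body).hexdigest(),
    ])

    # Step 2: string to sign, scoped to date/region/service.
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Step 3: derive the signing key by chained HMACs, then sign.
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _hmac(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"),
                         hashlib.sha256).hexdigest()

    return {
        "X-Amz-Date": amz_date,
        "Authorization": (
            f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
            f"SignedHeaders={signed_headers}, Signature={signature}"
        ),
    }
```

You would merge the returned headers into whatever HTTP call you are making. In practice a maintained signer is probably the safer choice; the sketch is only meant to show why generic HTTP clients can't talk to an IAM-protected endpoint out of the box.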

You can also use a signing proxy.

I wasn't aware of that option. I'll look into it.

A simple solution in this vein is to whitelist the EIP addresses of your NAT. This would give access to all resources in a private subnet (this is useful for Lambdas running in subnets).
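As a rough sketch of what that looks like, an IP-based access policy is just an IAM-style policy document with an `aws:SourceIp` condition. The helper name, the domain ARN, and the IP below are placeholders, not values from this thread.

```python
import json


def ip_access_policy(domain_arn, allowed_ips):
    """Build an access policy that allows requests only from the given IPs.

    Hypothetical helper: allowed_ips would be the Elastic IPs of your NAT.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "es:*",
                "Resource": f"{domain_arn}/*",
                "Condition": {
                    "IpAddress": {"aws:SourceIp": list(allowed_ips)}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN and a documentation-range IP, for illustration only.
    policy = ip_access_policy(
        "arn:aws:es:us-east-1:123456789012:domain/example",
        ["203.0.113.10"],
    )
    print(json.dumps(policy, indent=2))
```

Keep in mind that anything behind that NAT gets full access, so this trades fine-grained IAM control for convenience.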

>nor have I found an easy way to make arbitrary HTTP calls with signature v4.


Yep, that's precisely why I made awscurl "easy way to make calls to AWS".

It can be easily tested with AWS Elasticsearch.

It's a great tool man, I use it tonnes, thanks for making it!

This sounds good. Any feedback on cost? How is plugin support / security? Integration with IAM?

(Disclaimer: I work on Elastic's Cloud team)

While AWS ES can be cheaper in some configurations, Elastic Cloud is actually quite competitive in pricing for larger clusters when compared to AWS' ES-service. This post compares the two services, and there's an example price comparison at the end of the post: https://www.elastic.co/blog/hosted-elasticsearch-services-ro...

We support most official plugins, and if you get a gold or platinum subscription you can upload your own plugins. Elastic's X-Pack is included in every cluster, which includes security features like role based access control.

It's not possible for external service providers to integrate with IAM at this point.

One issue I've found with Elastic Cloud is that there doesn't seem to be a horizontal scale-out option other than multi-DC or getting bigger boxes. Is horizontal scaling in the works? Easy horizontal scaling seems like one of the better benefits of ES.

Or, alternatively, am I mistaken about how configuration works?

Per availability zone, Elastic Cloud currently scales vertically (in power-of-two increments) until a cluster hits 64GiB memory, at which point multiple 64GiB-nodes are added. While you can run Elasticsearch with e.g. two 8GiB nodes per zone, we prefer a single 16GiB node as there's fewer things that can go wrong. (If you want the second 8GiB node for redundancy, then that's exactly what our multi-zone HA configurations are for, and we encourage HA setups by making them less than twice (or thrice) the price, throwing in additional master-only nodes for free)
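To make the scaling rule above concrete, here is a small sketch of it. This is only an illustration of the description in this comment, not Elastic Cloud's actual logic; the function name, the 1GiB minimum, and rounding up to the next size are assumptions.

```python
def nodes_per_zone(requested_gib, max_node_gib=64):
    """Return the node sizes (GiB of memory) for one availability zone.

    Assumed rule: a single node grows in power-of-two steps up to
    max_node_gib; beyond that, capacity is added as whole max-size nodes.
    """
    if requested_gib <= 0:
        raise ValueError("requested_gib must be positive")
    if requested_gib <= max_node_gib:
        size = 1  # assumed smallest node size
        while size < requested_gib:
            size *= 2
        return [size]
    count = -(-requested_gib // max_node_gib)  # ceiling division
    return [max_node_gib] * count


if __name__ == "__main__":
    for gib in (8, 16, 64, 128):
        print(gib, "->", nodes_per_zone(gib))
```

So asking for 16GiB in a zone gives one 16GiB node rather than two 8GiB nodes, and horizontal growth only starts once the 64GiB ceiling is reached.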

(A bit of history: When Found (the company Elastic acquired and which is now Elastic Cloud) was in private beta in early 2012, we actually did allow custom cluster topologies. We ultimately disabled that as it was overwhelmingly used to make sub-optimal cluster configurations, such as 5 x 1GiB memory nodes)

Any idea if/when Elastic.co will support multiple user accounts and 2FA in the management portal? Not having those was a deal-breaker for us when we evaluated it awhile back, and was the sole reason we went with the less stable AWS service.

This is (understandably!) a common request, and both are on our roadmap.

My understanding is that this is going to change, and soon. Amazon has been poaching key elastic search employees, presumably with the idea of improving the service to be on par or better.

Elastic employee here, and this is the first I'm hearing of that. I don't know a single person that has left for Amazon, from any team. Certainly no "key elastic search employees".

Got a source?

Article is a bit naive about what it takes to run a shared service. Any API that AWS ES exposes has to be there forever, clearly pending_tasks had some risk of leaking internal implementation details that either couldn't be exposed, or that they didn't want customers building a dependency on.

Likewise with the doubling of nodes, this is obviously a blue-green style deployment. In-place updates would be quicker, but ES can get into all sorts of weird states that require manual debugging to fix; with blue-green, for most of the deployment you can simply flip back.

I've been pretty impressed with AWS ES compared to running it myself (other than the poor fit of IAM auth)

> Any API that AWS ES exposes has to be there forever, clearly pending_tasks had some risk of leaking internal implementation details that either couldn't be exposed, or that they didn't want customers building a dependency on.

If this is a reality of using cloud based ES then clearly it's something to seriously consider before using it - which is all the author is saying. The article is titled 'things to consider' not 'things AWS needs to fix'.

ES is a big, complex beast of a Java app. This is good advice regardless, from someone who has used both approaches (self-hosted vs AWS) in production.

I did not get the impression that he's saying that AWS can resolve this easily.

> Article is a bit naive about what it takes to run a shared service.

This is a bad assumption. Loggly is a shared service.

> Any API that AWS ES exposes has to be there forever

This is a bad assumption. No API is forever. Maybe you meant a different timescale. AWS has removed and made breaking changes to APIs over the years (e.g. random breaking change: https://forums.aws.amazon.com/message.jspa?messageID=513640).

Worth mentioning that elastic.co themselves run a hosted service on AWS that is of a high quality and has none of these flaws.

I can confirm the author's frustrations with AWS ES. Having set up clusters on my own (on EC2 hosts) and using the service... The latter is expensive, inefficient, behind on features, hard to integrate with, and generally just a really crappy piece of work (like almost every peripheral AWS service, i.e. anything but EC2, S3, and DynamoDB).

Elasticsearch is honestly pretty simple to set up; save yourself money and trouble and just do it.

I had the same experience with their code hosting products (CodeCommit). It was better to just set up an EC2 instance and manage it myself.

Seconded. ES is honestly one of the lowest-maintenance products I've ever deployed. It has a few quirks, but for the most part, it Just Works.

This appears to be somewhat out of date. If you use version 5.x then pending_tasks is available.


Good to know, I didn't realize 5.x has that API available. Why it's only available in 5.x makes no sense, since ES has had the API since at least 1.x.

Having a DevOps Engineer that wants me to go the AWS Dedicated Everything Route, I need articles like this to explain to him my fear that our problems will just change, not go away, by going that route, plus we'd be adding a fat layer of dependency.

Complexity never goes away... it just shifts. I dunno if that is a common saying or not but a former coworker of mine once said it and it's very true IMO.

I do infrastructure engineering for a small startup and really I think with any of these managed systems you need to step back and evaluate them within the context of TCO, lock-in, security, reliability, performance and flexibility/customizability. I've heard ES isn't that much of a PITA to manage on its own, but on the flip-side I'd never sign up a small team to run PgSQL at scale.

> I've heard ES isn't that much of a PITA to manage on its own

I just run ES for my logstash setup, and ES is lovely and rock-solid... except when it isn't. For example ES deciding to just silently refuse input when its disk is 90% full - that was a bit hard to find when it happened. ES looked alive, but hunting down the reason why it stopped wasn't trivial. I've had a couple of similar but lesser gotchas as well.
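A cheap guard against that particular gotcha is to alert on disk usage well before the cutoff. Here is a minimal sketch; the 90% limit comes from the comment above, while the 80% warning level and the function names are my own assumptions. Elasticsearch's disk watermark settings (`cluster.routing.allocation.disk.watermark.*`) may be the mechanism involved, but the comment doesn't say.

```python
import shutil


def disk_state(used_bytes, total_bytes, warn_at=0.80, stop_at=0.90):
    """Classify disk usage relative to the point where ES stopped taking input.

    warn_at is an assumed early-warning level; stop_at mirrors the 90%
    figure reported in the thread.
    """
    fraction = used_bytes / total_bytes
    if fraction >= stop_at:
        return "critical"
    if fraction >= warn_at:
        return "warning"
    return "ok"


def check_path(path="."):
    """Check the filesystem holding `path` (e.g. the ES data directory)."""
    usage = shutil.disk_usage(path)
    return disk_state(usage.used, usage.total)


if __name__ == "__main__":
    print(check_path("."))
```

Wire the result into whatever alerting you already have, so the first sign of trouble isn't a silently stalled pipeline.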

I guess you could say of my experience that it's not that much of a PITA (as you say), but it is still a bit of a PITA.

Disclaimer: if these things weren't a bit of a PITA, there'd be no need for us sysadmins, so I should be grateful...

Also fun is being an engineer who constantly has to explain this fear to VPs with a "Nobody ever got fired for using Amazon" mindset.

You (and other respondents to your comment) are right in that the problems will change rather than go away.

AWS has many cool toys and I use a subset of them every day. However, there's no way in hell I'd entertain the idea of going in fully for everything we do. Not only are there a bunch of inadequate services, they can also be nasty to debug and cause more problems than they're worth.

It sounds like you may have an inexperienced guy getting overly enthusiastic about what he could achieve instead of focusing on what's required (I don't mean to insult him, it just sounds like he may not know enough about infrastructure to be making these decisions properly). Being provider-agnostic (at least as much as you can be) is a way I currently see a lot of companies leveraging the great tools that cloud providers have while staying free enough to chop and change as the company's needs evolve.

Maybe point him towards things like Terraform and get him looking at what Google cloud and Azure can do as well as AWS?

Not all AWS services are created equal... Some are rock solid and others (cough Data Pipeline cough Kinesis Firehose cough) must have been written by interns.

I don't use data pipeline anymore, but 2 years ago when I was using it in my previous company there were a few moments that drove me nuts. I vaguely remember one day AWS upgraded data pipeline to a newer version and broke hundreds of our pipelines that write data to Redshift. We contacted AWS and they rolled that update back...

What issues did you have with kinesis firehose? I just deployed a couple of those and would like to know what to watch out for.

Lots of downtime. Probably 3-4 serious incidents within a 6 month period. We also had a high number of transient errors that AWS Support considered 'expected'.

One issue we've hit is that logrotate with copytruncate enabled breaks firehose, but afaik it's mostly been good.

Using copytruncate breaks a lot of software, not just Firehose. I generally discourage its use in favor of addressing whatever root cause is making you want to use it.
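To illustrate why, here is a toy tailer in Python. It is not how Firehose's agent works (I don't know its internals); it just shows the general failure mode: a reader that remembers a byte offset finds the file suddenly shorter than that offset once it has been truncated in place. The `Tailer` class and `demo` function are hypothetical.

```python
import os
import tempfile


class Tailer:
    """Reads bytes appended to a file, remembering how far it has read."""

    def __init__(self, path):
        self.path = path
        self.offset = 0

    def read_new(self):
        size = os.path.getsize(self.path)
        if size < self.offset:
            # The file shrank, so it was probably truncated in place.
            # Without this check the reader would seek past the end
            # and see nothing until the file outgrew the old offset.
            self.offset = 0
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            data = f.read()
        self.offset += len(data)
        return data


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "app.log")
        with open(log, "wb") as f:
            f.write(b"line 1\nline 2\n")
        tailer = Tailer(log)
        first = tailer.read_new()
        # Simulate the truncate step: opening with "wb" empties the file.
        with open(log, "wb") as f:
            f.write(b"line 3\n")
        second = tailer.read_new()
        return first, second


if __name__ == "__main__":
    print(demo())
```

Even with the shrink check, a file that regrows past the old offset between two polls would go unnoticed, and anything written between the copy and the truncate is lost, which is why fixing the root cause tends to be the better option.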

In my experience, any 'improve logging' ticket goes straight to the back of the dev's backlog. Followed the next week by complaints about the logging system not being all that good... :)

Why the hate for interns :'(

Regardless of how you look at it, writing software is a hard problem and takes experience to do reasonably well. There's nothing wrong with interns themselves, rather the practice that many companies seem to follow where unsupervised or poorly supervised intern code goes right into big products.

What issues did you have with AWS Data Pipelines?

Pipelines randomly failing with inscrutable error messages. High error rates.

Another point of frustration is that because these endpoints are locked down[0] you cannot fully use management tools like curator - https://github.com/elastic/curator


Curator support got better in the AWS ES 5.3 release a few weeks ago: https://github.com/elastic/curator/releases/tag/v5.1.0

"AWS ES 5.3 officially supports Curator now. Documentation has been updated to reflect this."

You still can't use Curator to take snapshots, because AWS ES doesn't expose the `/_snapshot/_status` endpoint.


If you are stuck on 5.1.2, the Talend fork of curator works.

The change is trivial, so I get the sense that Elastic is just fucking with Amazon.

Like almost everything else you can build on AWS managed services (RDS, Elasticache, API Gateway/Lambda, Kinesis, etc), if it's truly critical to your application's uptime, you should be managing it yourself.

But if your need for ES is to support a backend system that would make your life inconvenient for a while if there are problems, is relatively small and won't grow too fast, but isn't business-critical, then the AWS managed service is fine.

Good warning. Yeah, beta software released to production.

There is that, and also the fragility of ES. I was wondering what alternatives to ES are out there. I know of Solr only.

Solr has the added complexity of zookeeper. ES isn't bad, but in an MT context you really have to layer a lot on top for security and configurability.

It's possible it's not MT and they just didn't write the facade APIs. That'd be pretty crazy.

My biggest complaint would be lack of plugin support.

To be fair, ZK is great at its job, and that responsibility is something ES has had a lot of trouble replicating.

If you want to have consistency in a distributed system you need something like ZK. If ES does not have ZK it surely has something else, probably with different trade offs.

They have fixed most of the stability issues in 5.x. I suspect some of the problems people have had with AWS ES is actually using a pre-5.x version.

Our 2.x cluster is much more stable than our 5.1.2 cluster, despite the fact that our 5.x cluster is significantly smaller (in node count) than its older brother.

Also of note: Amazon's documentation on HTTP limits is wrong. There are some instance types listed as having a 100MB max payload that are actually limited to 10MB. We found that out when Logstash recorded a crapload of errors with the 10MB limit on what was allegedly a 100MB instance type.
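If you are writing your own indexer, one way to stay under a payload cap is to batch pre-serialized bulk entries by byte size. The sketch below is a generic illustration, not Logstash's behavior; the function name is made up, and it assumes each entry is an already-encoded, newline-terminated chunk of a bulk request body and that "10MB" means 10 * 1024 * 1024 bytes.

```python
from typing import Iterable, Iterator


def batch_by_size(entries: Iterable[bytes],
                  max_bytes: int = 10 * 1024 * 1024) -> Iterator[bytes]:
    """Yield request bodies whose total size never exceeds max_bytes.

    Each entry is assumed to be one complete, newline-terminated
    action/document pair, so entries are never split across bodies.
    """
    batch, size = [], 0
    for entry in entries:
        if len(entry) > max_bytes:
            raise ValueError("single entry exceeds the payload limit")
        if batch and size + len(entry) > max_bytes:
            yield b"".join(batch)
            batch, size = [], 0
        batch.append(entry)
        size += len(entry)
    if batch:
        yield b"".join(batch)


if __name__ == "__main__":
    docs = [b'{"index":{}}\n{"msg":"a"}\n', b'{"index":{}}\n{"msg":"b"}\n']
    for body in batch_by_size(docs, max_bytes=30):
        print(len(body), body)
```

Given that the documented limits turned out to be wrong, it is probably worth setting `max_bytes` somewhat below whatever the docs claim for your instance type.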

Well, after talking to some of their tech guys we determined that we have different definitions for the word "fixed".

Such as?

Such as disallowing configuration they do not like, having memory/resource leaks that they cannot find the root cause for, and a few other things I've forgotten already. G1GC is disallowed because they had a data corruption bug with it. This was a few months back; they might have changed some of it already. The question for me is what do I get using ES over Solr? If Solr's features are enough for our use cases, should I even try ES?

Here is their resiliency status: https://www.elastic.co/guide/en/elasticsearch/resiliency/cur...

It depends on your use case. If you are already familiar with Solr and it is good enough for your use case, then use it. Solr and ES are about the same feature-wise. Scaling is easier for ES because it is built-in. Here is a good comparison of their APIs.

part 1: http://opensourceconnections.com/blog/2015/12/15/solr-vs-ela...

part 2: http://opensourceconnections.com/blog/2016/01/22/solr-vs-ela...

I looked at using AWS Elasticsearch Service for a project but had to back out due to the lack of plugin support. Running Elasticsearch yourself, even in an HA setup, is actually fairly easy.

Does anyone have any experience of Compose.io's hosted Elasticsearch offering?

I have been using it on a new project the last couple of weeks and it seems to be working well.

Had a very similar experience with Redis on ElastiCache. When things go south, it's really hard to debug. You don't get access to logs, you don't get to change a lot of config parameters.

Had to provision our own EC2 instances.

It was 2 years ago though, things might be different now.

It gets even worse if you use AWS Elasticsearch for logs. Logs are usually high volume and it can quickly become a nightmare.

It used to be worse. The max EBS volume size was 512GB (with 15% reserved for Amazon) and a max cluster size of 20.

We hit that limit and had to ruthlessly prune live data.
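For a sense of scale, here is the back-of-the-envelope arithmetic for those old limits. It assumes "max cluster size of 20" means 20 data nodes with one maximum-size volume each and that the 15% reservation applies per volume; both are my reading of the comment, not documented facts.

```python
def usable_capacity_gb(volume_gb=512, reserved_percent=15, nodes=20):
    """Return (usable GB per node, usable GB for the whole cluster).

    Assumes every node carries one volume of volume_gb and loses
    reserved_percent of it to the service.
    """
    per_node = volume_gb * (100 - reserved_percent) / 100
    return per_node, per_node * nodes


if __name__ == "__main__":
    per_node, total = usable_capacity_gb()
    print(f"{per_node:.1f} GB per node, {total:.0f} GB per cluster")
```

That ceiling is before replicas, so with one replica per shard the room for unique data would be roughly half of the cluster figure.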

You can now add 1.5TB per node (with very large and expensive instance types) as well as scale past 20. Requesting the limit increase was a lot more difficult than most other limit increases.

I just want to get rid of the stupid proxy I have just so I can make it work... Amazon just let me put it in a VPC

Side note: reading anything on medium.com is frustrating, such a slow and janky site for a glorified blog.

Bookmarked to look at the next time my boss wants to lock us into yet another AWS service. Thank you.

There's also the bit about how adding a whitelisted IP for access takes like 20 FREAKING MINUTES to take effect.

Wow WTH Amazon? Take the training wheels off already or fix the defaults for this service.

Preferably both.

Brutal plug, but MarkLogic is a good alternative if you want a good search solution that runs and scales on AWS (and you can migrate to another cloud or on prem)

