"Remember redis is meant to hold ephemeral data in memory, so please don’t treat it as a DB. It’s best to design your system assuming that redis may or may not lose whatever is inside"
Of course I disagree with that statement. Many serious Redis use cases at big companies assume Redis is there and will hold your data. What they usually assume is that in certain cases, during failovers or other serious failures, you may lose some acknowledged writes that were sent immediately before. And most of the time even this does not happen. Note that many high performance applications using SQL DBs with a relaxed fsync configuration operate under the same assumption. The suggestion should be more: if Redis for you is just a volatile aid, use a given setup; if it's where you store data, use another setup.
Anyway, if you want to see a version of Redis where you can also store bank accounts, transactions, or any of the other most critical stuff you would put in a DB, make sure to check the news at Redis Conf 2020 (the conf is streamed and free).
EDIT: But even after that announcement, the biggest value of Redis is that it can be fast and provide a "best effort" consistency level that is adequate for a lot of use cases in practice. Many systems have been sacrificed at the temple of wanting to provide linearizability for all use cases.
This rubs me very very wrong. There is no maybe in data integrity. Redis wasn’t designed to be ACID and shouldn’t be treated as such just because “usually” it doesn’t lose data “under the right circumstances.”
If a sysadmin uses an RDBMS with relaxed fsync, they know what they are getting (and durable writes are not one of them, no "usually" about it). Same for Redis.
What I meant is that best-effort consistency can be super naive, or it can try hard, even without guaranteeing it, to avoid losing data in certain failure modes.
Building computer systems / architectures is always about compromises. If your requirements are that "everything must be perfect always", you will never deliver anything.
I came up with a "mental framework" that allowed me to get beyond these details early when I started architecting systems for companies.
1) Identify the possible malfunctions.
2) What mechanisms will I use to detect these malfunctions as early as possible?
3) Do I have a clear and concise recovery path for said malfunctions?
This frame of thought has taken me far (in terms of being able to accomplish a lot in my career so far) and allowed me to move forward without (a) extensive, time-consuming research into the specific tools I'm using for specific situations which may never happen, or (b) the extensive cost of hiring experts in these niche areas.
ACID guarantees feel like they can be distilled to "everything must be perfect always" - and you can ensure ACID and still ship, just by using an ACID database. It's a hard problem, but it's also a solved problem.
Sometimes you don't need ACID guarantees, which is part of the reason databases like Redis exist, and is why the compromises you're describing are important. But in this context it sounds like the framework you're describing would lead to rolling your own ACID for Redis instead of just using an ACID database to begin with.
I guess it's almost like speaking two different languages when talking about what a DB guarantees and what your application REQUIRES in the form of guarantees. Typically, in my experience, they're not the same thing. Or not ALWAYS the same thing. Within the problem space you're trying to come up with a solution for, I'd say it's highly unlikely "ACID" guarantees are sufficient to say categorically (with 100% accuracy) that "my application is perfect because: ACID".
We gladly paid 800 € for three hours of work fixing a botched stored procedure on IBM DB2. Gray-haired DBAs with the right blend of dev, ops, and most importantly business domain knowledge (something many people in IT seem to overlook) are worth gold.
I'm one of those people with moderate-throughput production applications using Redis as a primary store (500-2000 requests per second), in a multi-region active-active setup without Redis Enterprise, no less. It's a very real use case at a LOT of companies.
Favorite datastore to date. The only thing I wish for frequently is native global change-data-capture without doing something manual with streams (I've thought about writing a replication protocol consumer, but the way resyncs are handled is a little problematic).
Piggybacked off DynamoDB in AWS for now, unfortunately; so it's more "the entire system is active-active, not just the Redis bit". Reads always go to Redis in every region (no read-through), writes go to DynamoDB (in global table config). Then there's a worker in each region consuming the DynamoDB stream and keeping its region's Redis in sync.
The worker is a DIY stream consumer implementation. The stream -> Lambda integration works, but it's super limited (and ~300ms slower to get new events on average).
yep. in this specific use case there can be pretty fast reads after a write, and dynamo stream latency is not great even intra-region.
Via the lambda integration you should probably expect ~1000ms p99 latency intra-region, and ~2100ms p99 from, e.g, east-1 to west-2. Shave maybe 200-300ms off of those numbers if you do a DIY stream consumer without the aws provided lambda connector.
No, but often you set it to fsync every second and not at every write, if you are really concerned with performance in certain kinds of applications. Or, in practical terms, check the synchronous_commit config option of PostgreSQL. Basically, sometimes you have fsync enabled because you don't want to corrupt the log, but the writes will not wait for the WAL, so the last transactions can get lost on crash.
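To make that concrete, here is a minimal sketch of what such a relaxed-durability setup looks like on both sides; the specific values are illustrative, not recommendations:

```
# redis.conf (sketch): fsync the append-only file once per second,
# so a crash loses at most roughly the last second of acknowledged writes.
appendonly yes
appendfsync everysec

# postgresql.conf (sketch): keep fsync on so the WAL itself cannot be
# corrupted, but do not make commits wait for the WAL flush.
fsync = on
synchronous_commit = off
```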
I would not weaken my confidence in the underlying storage subsystem or the DB by playing with the fsync behavior, neither willy-nilly nor in a calculated way. I would instead just throw more money at the problem by investing in faster and faster disks -- you can do a lot with enterprise-grade NVMe now.
> "Remember redis (in AWS) holds ephemeral data in memory"
(because EC2 machines don't persist their disk data between start/stops and a lot of people don't realize that until it is too late - hey remember that Bitcoin exchange that got bitten by this?)
I suspect that’s what you meant but just for clarity, Instance storage does persist for reboots (which is effectively an OS reboot), but not if you stop/start (or terminate) the instance.
This was mostly applicable to older generations of EC2 instances (C3, M3, etc.)
EBS is now the default when creating an instance, and instance types of newer generations generally don't have instance store at all (except disk optimized instance types, and those with "d" in the name.)
Yeah, I think Amazon gave up on trying to communicate the meaning of ephemeral storage, and I think that's for the best tbh. It's too easy to have important data vanish that way. Something as simple as someone issuing a shutdown from the console is translated into an "instance stop" in AWS, which would purge the storage -- unfortunately often a lesson learned the hard way.
> Anyway, if you want to see a version of Redis where you can also store bank accounts, transactions, or any of the other most critical stuff you would put in a DB, make sure to check the news at Redis Conf 2020 (the conf is streamed and free).
He sort of implies it here; not for you, but it might create that impression for others, so I was clarifying for them.
Regardless of whether Redis is meant to hold ephemeral data in memory, it is well-known to anybody who has had to maintain a Redis instance as part of their deployment that Redis is only effective when used for ephemeral in-memory work.
I understand your defensiveness, but please understand that because of your massive experience with Redis, either you have never lost data with Redis, which means that nobody besides you understands how to run your software in production, or you have lost data with Redis, which means that it's somewhat hypocritical of you to insist that Redis is durable.
Your comment seems to suggest that everyone that's ever used Redis except antirez has lost data. I don't think that's true. I've used Redis since 2012 across dozens of production services and never lost data, and I don't have any particular skill that makes me special.
Note that Redis only provides best-effort consistency: it means that sometimes it can lose acknowledged data in special conditions (during failovers, or during restarts with a relaxed fsync policy). So it will never pass Jepsen tests in the default setup, but may pass them only when Redis is used with a linearizable algorithm on top of it that uses Redis as a state machine for a consensus protocol. But this does not mean that people haven't applied Redis to data storage with success. For instance, many *SQL failover solutions also can't pass Jepsen tests, yet people use them to store real-world data. There are a lot of applications where being able to really scale well and at a low cost (sometimes spinning up 1/10 or 1/100 of the nodes) makes it perfectly viable to pick a system that is designed for that and, as a price, will lose a window of acknowledged writes during failures, while trying hard to avoid it in common failure scenarios.
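If you want to shrink (not close) that window of acknowledged-but-lost writes, one hedge is to ask Redis to confirm replication before treating a write as done. A rough sketch with redis-py (the host, key, and replica count are made up for illustration); note that WAIT still does not make Redis linearizable or durable:

```python
import redis

r = redis.Redis(host="redis-primary.example.internal")  # hypothetical host

r.set("order:1234", "paid")
# Block until at least 1 replica has acknowledged the write, or 100 ms pass.
# The return value is the number of replicas that acknowledged; less than
# requested means the write is still at risk on failover.
acked = r.execute_command("WAIT", 1, 100)
if acked < 1:
    print("write not yet replicated; treat as at-risk")
```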
Just going from the last one I read for DGraph, it did extremely well. Pretty sure etcd did well.
They always have bugs somewhere, but there are huge differences between bugs that show up for very specific, niche cases, and normal "I wrote to the db and it dropped it".
"We found five safety issues in version 1.1.1—some known to Dgraph already—including reads observing transient null values, logical state corruption, and the loss of large windows of acknowledged inserts."
Loss of large windows of acknowledged inserts. Durability is hard.
As staticassertion is mentioning, some of the violations that were found were only around tablet moves, which happen only in certain cluster sizes and quite infrequently. Of course, Jepsen triggers those moves left-right-and-center to evoke some of those failure conditions; but that's not how tablet moves are supposed to work in real world conditions. This is different from other edge cases like process crashes, or machine failures, network partitions, clock skews, etc., which can and do happen. In those cases, Jepsen didn't find any violations.
We were planning to look into those tablet move issues and get them fixed up (shouldn't be that hard), but honestly, the chances of our users encountering them is so low that we de-prioritized that work over some of the other launches that we are doing.
But, we'll fix those up in the next few months, once we have more bandwidth.
I don't really feel like playing the quotes game... but, sure.
"All of the issues we found had to do with tablet migrations"
"ndeed, the work Dgraph has undertaken in the last 18 months has dramatically improved safety. In 1.0.2, Jepsen tests routinely observed safety issues even in healthy clusters. In 1.1.1, tests with healthy clusters, clock skew, process kills, and network partitions all passed. Only tablet moves appeared susceptible to safety problems."
No one is here to claim that anyone is getting through any kind of rigorous testing without bugs found. But there is a huge difference between "My extremely common write path + a partition = dropped transactional writes" and "Under very specific circumstances, with worst case testing, multiple partitions, and the db in a specific state, we drop writes".
There is an ocean between, say, mongodb's test results, and Dgraph's.
"If you use Redis as a queue, it can drop enqueued items. However, it can also re-enqueue items which were removed. "
"f you use Redis as a database, be prepared for clients to disagree about the state of the system. Batch operations will still be atomic (I think), but you’ll have no inter-write linearizability, which almost all applications implicitly rely on."
"Because Redis does not have a consensus protocol for writes, it can’t be CP. Because it relies on quorums to promote secondaries, it can’t be AP. What it can be is fast, and that’s an excellent property for a weakly consistent best-effort service, like a cache."
Again, Redis is a very different type of database, so expectations should be aligned. Further, this test is quite old.
But that's a huge difference from DGraph's results.
Basically, saying "Well no one does well on Jepsen" isn't really true. Lots of databases do well, but you have to adjust your definition of "do well".
This is definitely a post based in experience, but maybe not based on broad expertise in using these tools.
I don't want to pick apart the entire post, but I will say that the Lambda + API Gateway example is maybe the best example of jumping into the cloud-native world with blinders on. Just about every modern programming language has a toolset right now that will do the work of generating a CF template for you that creates an API gateway that forwards all HTTP routes to a single Lambda, and then that Lambda is responsible for handling the actual routing of that request. Examples include Zappa, ClaudiaJS, and Ruby on Jets just to name a few. I can't imagine providing a feature-rich web application in Lambda without this kind of abstraction.
Not knowing that such tooling exists, or explicitly choosing not to use such tooling, and experiencing pain as a result seems to be a common theme in this article. If you dive into architecting a system using AWS's product offerings without understanding the tradeoffs you're making, you will experience greater cost and greater friction -- hands down.
> I can't imagine providing a feature-rich web application in Lambda without this kind of abstraction.
This reminded me of a caveat about using some of these abstractions – which is that they are still subject to the limits and restrictions of the underlying platform.
We discovered this the hard way once when an automatically generated function name or something was over the limit in prod (this issue [1] describes a similar problem). We did not catch this in dev because "dev" is one character under "prod" and our autogenerated name in dev hadn't put us over the limit. That was an interesting exercise in leaky abstractions.
Yeah, missing the proxy pass option will definitely burn a prospective Lambda user.
I also felt a similar tinge when the post talks about KCL. Yes, you use it, and yes there are rules about output streams - but they all make sense when deployed in fairly complex environments. I can’t remember when I last used stdout for logging outside of greenfield development. Once it’s on prod everything goes to stderr and is highly structured.
Similar sentiments about Cognito. At some point of complexity you must have a server in between using the Admin* range of commands. If you want their “get going quickly” then yes, there are trade offs like WebViews. That said; I’ve never used Auth0 - so insert a Luddite warning here.
Another pretty basic Lambda thing the author didn’t mention is the alias feature. So rather than having users-api-dev, users-api-qa - we just have users-api and it has an alias for each environment. The environment variable management is not amazing but just using a key-value JSON {dev: {a: “bc”}} works fine.
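For what it's worth, a minimal sketch of that alias-plus-JSON-config pattern (the function name, alias names, and env var below are hypothetical): when the function is invoked through an alias, the alias appears as the last segment of the invoked function ARN, and you can key the config blob off it.

```python
import json
import os

# Hypothetical env var holding per-environment config, e.g.
# CONFIG_JSON='{"dev": {"table": "users-dev"}, "qa": {"table": "users-qa"}}'
ALL_CONFIG = json.loads(os.environ.get("CONFIG_JSON", "{}"))

def handler(event, context):
    # arn:aws:lambda:us-east-1:123456789012:function:users-api:qa -> "qa"
    arn_parts = context.invoked_function_arn.split(":")
    alias = arn_parts[7] if len(arn_parts) > 7 else "dev"  # fallback is a guess
    config = ALL_CONFIG.get(alias, {})
    return {"alias": alias, "table": config.get("table")}
```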
There’s also truly no reason one lambda function can’t have a collection of APIs (using the proxy method from API gateway). It’s literally no different from any other micro service setup. You can pick any abstraction for a function to cover.
In Lambda proxy integration, when a client submits an API request, API Gateway passes to the integrated Lambda function the raw request as-is.
...and:
You can set up a Lambda proxy integration for any API method. But a Lambda proxy integration is more potent when it is configured for an API method involving a generic proxy resource. The generic proxy resource can be denoted by a special templated path variable of {proxy+}, the catch-all ANY method placeholder, or both.
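In other words, with the {proxy+} catch-all route the whole API can be one function that does its own routing. A minimal sketch of what that handler sees and returns with the REST API proxy integration (the routes themselves are made up):

```python
import json

def handler(event, context):
    # The Lambda proxy integration passes the raw request through:
    # httpMethod, path, headers, queryStringParameters, body, etc.
    method = event["httpMethod"]
    path = event["path"]

    if method == "GET" and path == "/api/v1/users":      # hypothetical route
        payload, status = {"users": []}, 200
    elif method == "POST" and path == "/api/v1/users":   # hypothetical route
        payload, status = json.loads(event["body"] or "{}"), 201
    else:
        payload, status = {"error": "not found"}, 404

    # The proxy integration expects this response shape back.
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }
```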
> Not knowing that such tooling exists, or explicitly choosing not to use such tooling, and experiencing pain as a result seems to be a common theme in this article. If you dive into architecting a system using AWS's product offerings without understanding the tradeoffs you're making, you will experience greater cost and greater friction -- hands down.
Which I think is one of the underlying points of the article.
> Many a startup has fallen prey to ElastiCache. It usually happens when the team is under-staffed and rushing for a deadline and they type in Redis into the AWS console:
I think this is a good article in the sense of "these are things that may or will bite you if you've not put a bit of thought into them".
Even the replies on this story upstream hint at that, with people debating Redis' durability, based on this:
> It’s best to design your system assuming that redis may or may not lose whatever is inside.
completely ignoring the follow up:
> Clustering or HA via Sentinel, can all come later when you know you need it!
It's like the first sentence was read and people leaped to the keyboard to reply passionately.
The API Gateway v2 HTTP stuff makes this even easier[0]. I switched some old services over from the v1 stuff (using '{proxy}' routes) to the v2 API and I was a lot happier. I especially like that v2 has an option for auto-deploy. I'm sure the manual deployment stuff is really useful to some people, but for my little projects it was a real pain to have to remember to trigger a new deployment any time I made a change.
I have experience with a few of these. For those, the comments seem kind of "duh" quality.
So you say Kinesis is bad for message streams where each message should be handled by only one machine? Well duh, that's how it works. Kinesis is meant for every listener to be assured to get the entire stream. It was never designed for that, and I expect you'll have a bad time if you try to use it that way. SQS is obviously what you want if you're interested in guarantees that each message is handled once and only once.
One Lambda per route in a decently-sized restful webapp is ugly and unmanageable? Well duh. Don't do that. Do pretty much anything else instead. Seriously, anything else at all.
This sounds more like, if you want to deploy services on AWS, do either pay somebody who knows what they're doing, or spend some time checking out the services to make sure they are designed for what you want. Don't just pick a random AWS service and usage pattern from a Google search and start building around it without ever checking if it matches what you're trying to do.
> SQS is obviously what you want if you're interested in guarantees that each message is handled once and only once.
You mean SQS FIFO? SQS classic cannot guarantee exactly-once delivery or ingestion, due to possible failure at either the client or the server. SQS FIFO, however, can, but it requires both the client and the server to work in tandem to ensure exactly-once processing.
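For context, a bare-bones standard-queue consumer with boto3 looks like the sketch below (the queue URL is a placeholder). The delete only happens after processing, so a crash in between means the message comes back: at-least-once delivery, which is why the handler itself has to be idempotent unless you move to FIFO with deduplication.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

def process(body: str) -> None:
    # Must be idempotent: with a standard queue this can run more than once
    # for the same logical message.
    print("handling", body)

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        # Delete only after successful processing; if we die before this,
        # the message reappears after the visibility timeout.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```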
I'm not a fan of it, personally. The interface is lacking, configurability after launch is minimal at best, and the observability is EXTREMELY poor because you can't get per-topic metrics without paying extra.
But with ASP.NET Core it's effectively a Web API as a Lambda. It maps the event to a request and passes it through the asp.net framework as if it's a normal website.
The config for this is like 3 lines of code so if you don't want to host in a Lambda anymore, it's trivial to move to IIS, Windows Service, Linux.
I'd pick Terraform before CloudFormation, but there are things that are better with CF, such as rollbacks (they really do work as expected in almost all cases), and AutoScalingGroup's UpdatePolicy (rolling/replacing deploy of new launch configuration with healthchecks), which is not available outside of CF.
Or if you want to use AWS features on a deterministic timescale: CloudFormation frequently lags on support for new features - I’ve worked on multiple projects which used Terraform for months before the equivalent feature could be used in CloudFormation. For example, when ECS launched secrets support it took most of a year before CF was updated.
This is a big deal with classic CloudFormation because there was no escape hatch. Terraform can run arbitrary code if needed but you had to split CF stacks apart so you could have something else run in between. That’s probably better with CDK, of course.
At my old place we had some automation around this. Essentially our custom resources were mostly just the Python scripts plus any deviations from the basic custom resource definition. Made life a lot easier.
Good point: they added that feature after I stopped using it but it’s a good option for some cases, especially if you want to do things which the caller shouldn’t be allowed to directly perform.
Yes, it's a complex, poorly documented pile of shite. BUT. It does work as a reasonably secure OAuth2 thingamebob. However, I was told by my AWS account manager that Auth0 was the way forward, and I agree.
Cloudformation:
Meh, I have about 35k lines of active CF at the moment. It's much of a muchness. Unless you are using parameters with selectors, you are going to have a bad time. Hard-linking templates together (I assume that's what nested stacks are) is terrible. I've only briefly used Terraform, so I have no idea if it's much better.
CF _could be_ a lot better. Like compile-time validation, not just-in-time. That would stop a lot of anger when you realise you've spelt a CF parameter wrong (or the value fails validation) but only after you've waited ten minutes for it to spin up. That's frankly unforgivable.
Elasticache:
Yes. It's expensive.
KINESIS:
What a disappointment. Stupid naming conventions, terrible throttling and throughput. It's just horrific. What's worse is that they looked at SQS and thought: "this compares favourably". NATS.io is a great fit for certain use cases (no, Kafka is never the answer).
Lambda:
I don't actually get this myself. I made a REST API exclusively in Lambda. It meant that I could build a working prototype really quickly. Once proven, we ported it to FastAPI in an autoscaling group.
The API Gateway was heavily integrated into the Lambda spinup (controlled in CF), so I really don't see what the issue is. Also, it understands Swagger, so I struggle to understand the criticism.
The fact that some specific attributes or options cannot change after creation is hard too. Other than that, it's not too bad. But like you said, setting it up takes effort, but a lot of programming is getting to a non trivial hello world.
> CF _could be_ a lot better. Like compile time validation, not just in time.
I've recently finished a project using AWS CDK, which seems to do a certain amount of this. Just using TypeScript and having AWS resource interfaces be fully typed goes a long way toward finding a template mistake quickly.
I haven't seen a scenario where TF plan AND apply miss something, but I have definitely been in the scenario where a CF stack fails, and then the rollback fails, and then you're stuck with an undeletable resource and can only submit a ticket to AWS.
Ditto on both counts: we stopped using CF after hitting one of those irrecoverable bugs — usually deleting the resources manually and ignoring all the errors deleting the stack would recover after a cycle or two but we hit at least one case where that wasn’t true.
Curious to hear more details about your thoughts on this. I've done some pretty significant improvements around my team's use of it in the last few months and can't say I've had this experience. The difficulties with it really, to me, seem to be a case of batteries-not-included, speaking as someone who had never run it prior to last August.
The simpler the better. In my limited experience once we started fleshing out users to admins, managers, and users, in a multi-tenanted environment, we pretty quickly ran up against Cognito limitations which surprised me.
(Cognito groups seemed made for this, except they have a limit of 10k groups. We ended up storing a comma-separated list of ids in a custom cognito tag, which seemed awkward.)
> in a normal company salespeople might lose their jobs for saying that.
Quite a few times over my career I have had a salesperson (not from Amazon) recommend a competitor over their own product. In every single case my respect for the salesperson shot way up. In at least two cases I can recall this helped them close a sale.
A smart salesperson does not do everything possible to push their company's products... a smart salesperson solves their customers' problems.
> A smart salesperson does not do everything possible to push their company's products
Bingo. A bad product fit means a bad customer experience, which means a bad review or reputation.
The smaller the company, the more important referrals are from your customers. Sending a potential customer to a competitor will (potentially) earn goodwill and future referrals. At worst, they might not refer anyone your way, but at least they won't be badmouthing you either.
Unfortunately, large companies typically mean large customers, and the people with the buying power aren't the people who will be using the product... so neither party really cares all that much about how well the product fits. This is the old "nobody gets fired for choosing IBM" mentality.
The worst is when medium companies think they are big companies, and try to do that to small customers. I once saw a salesperson push hard for something that was very obviously too small to be worth our time, and the project management overhead would have led to blowing our potential customer's budget out of the water. In the end, they walked away without working with us, and with a pretty sour taste in their mouths from the pushiness of the sales guy.
We were making an API that takes images, does stuff on the GPU, and pushes back an answer.
It needed to be secure, fast, and easy to look after. If they had forced Cognito down my throat, and it stopped me from shipping on time, they would have missed out on $$$ of GPU time. I trusted that architect more, because they were honest, and actually helped. It makes me want to stay inside the expensive walled garden that is AWS even more.
Also, consider that the key to being successful in enterprise sales is all about relations. When that account rep leaves Amazon, they want to be able to use the relationship they have with you with whatever product they end up selling later.
I've also had AWS support go way outside the realm of what they officially support, to help us get the job done. Hell, I've had AWS support people help me debug problems in Terraform when it was pretty apparent that the issue was on the AWS side. "Pretend I'm doing this by hand."
I thought for sure I was going to find Elastic Beanstalk. It's great when it works, but when it doesn't, the list of ways it makes life hell is a long one.
Elasticache though? Pricey, but I recently got an alert at 1am that one of my core services was down and it was because Redis was out of memory. With Elasticache I doubled the size of the instance in place from my phone so I could go back to sleep. Fixed the key leak in the am and returned to the original instance size.
I joined a company that had their application on about a dozen environments in Elastic Beanstalk that would fail for no reason during deployments. When everything went fine it took about an hour to deploy; when stuff went wrong, say goodbye to half your day (at a minimum). The general solution to most deployment issues was to just terminate every instance but 1, deploy to 1 instance, and let the scaling policies kick in to replace the terminated instances. EB is absolute trash.
This is the problem. The ways in which it can fail can vary (sometimes you can't even pull logs from the web interface), and if it fails during a production deploy you may be left at half or no capacity.
When something inexplicably effs up, rebuild the environment and, more often than not, the problem disappears.
I'm not sure I'd consider this a "failure," but related to GP, I have had a number of issues maintaining Elastic Beanstalk environments, including:
- The single container Docker platform (not sure if this is an issue with other platforms) can cause the CloudWatch agent on the environments' EC2 instances to stop streaming logs to CloudWatch. This seems to occur when a Docker container fails, for example if the process it's managing stops (e.g., if a Node.js application triggers an exception that is not caught and exits). A new Docker container will be started, but the new container's log file sometimes does not automatically get attached to/monitored by the CloudWatch agent.
- The default CloudWatch alarms created by the environment can create a "boy who cried wolf" situation. For example, when updating the application version for an environment, EB will transition the environment's state from "OK" to "Info" or even "Warning," depending on the deployment policy. This is a regular operation, but CloudWatch will still send an email to the designated notification email address about the state change. If you monitor those emails for environment issues, this normal operation could cause overload, which might lead to ignoring the emails outright. This could be problematic if the environment state transitions to an actual problem state. You can create email client rules for this, but the structure of the alarm email doesn't make this very easy, at least in Outlook 365.
An annoying example of this is when your EB environment auto-scales up due to, for example, an increase in traffic. When the auto-scaling policy scales down your instances (due to normal operation of the policy), you'll get an email that your environment has transitioned into a "Warning" state because one or more of your environment's EC2 instances are being terminated. This looks scary in the CloudWatch email that is delivered, but you have to learn that it's just the ASG doing its thing, terminating unused instances as it's been configured to do. The emails, however, do not provide good context into what has led to the "Warning" state.
- The way environments handle configuration files stored in your application's .ebextensions/ directory can cause inconsistent application state between version deployments on existing/new EC2 instances. For example, if your auto-scaling policy creates a new EC2 instance, but your recently deployed application version doesn't specify some of the commands/settings applied during a previous update to your .ebextensions/ files that might have been deployed to existing EC2 instances, you run the risk of having inconsistent state across your application's EC2 instances. This can be solved by using the "immutable" deployment type, but that's not the default deployment type. It's an edge case, but it's still something that requires you to SSH into your EC2 instances, and possibly manually terminate older instances when you eventually figure out what's going on.
Having said all of that, I think EB is still a reasonable choice for small/beginner workloads: it gives you a number of things (automated deployment, auto-scaling, load balancing, logs, etc.) that you could get by doing things on your own, but it lets you get to production quickly. For mature applications, I think you could be better off managing these individual services yourself (EB is mostly just wiring together a number of AWS services with a few deployment and monitoring agents running on each EC2 instance). If you're comfortable with the components EB is managing for you and if you have a stable CI/CD pipeline, you get more flexibility than by bending EB against its will.
TLDR: I quit Elastic Beanstalk because they deploy on weekends and won't clean up their mistakes. And Beanstalk is far too buggy for a 9 year old service.
I struggled to get Elastic Beanstalk working well. The documentation is incomplete.
The Elastic Beanstalk console often shows servers as up when they're really down.
Once, my Elastic Beanstalk deployment stopped writing log files. After wasting many hours debugging, I went to AWS Loft and consulted with the support engineer. He had me log into the backing EC2 instance and debug. I was using Elastic Beanstalk so I would never have to log in to EC2 instances. He concluded that the application logs were not appearing due to a bug in the service. He promised to file a bug report.
The last straw was when Elastic Beanstalk team deployed a broken API on a Saturday. THEY DEPLOYED ON A SATURDAY!!! Their broken API added an invalid entry into the beanstalk config database for my account. All subsequent calls to the Beanstalk API failed with a 500 Server Error. I paid for an AWS Support subscription and filed a ticket. Their support engineer told me to install the AWS CLI and run some obscure commands to remove the invalid entry which their botched API deployment added. I asked them to do it and they refused. So I migrated off of Elastic Beanstalk to Docker. I have since migrated off AWS to Digital Ocean.
The section on Lambda missed the biggest problem: latency and cost.
For low-regularity calls, Lambda suffers badly from the cold start problem. The first call to Lambda must actually create the lambda instance, which can take hundreds of milliseconds. This problem also shows up when the level of requests exceeds the number of lambdas currently active, causing a cold start as the new lambda instance is created. It’s not uncommon to see some services inject fake requests with some regularity to ensure that there are enough warm lambda instances to avoid the cold start problem, which is silly.
Secondly, all lambda calls are charged by the millisecond, but all calls are rounded up to 100ms. So if your typical lambda call is 5ms, you might be paying for 20x more time than you’re actually using.
Both of these issues led my former team to use a regular ECS app rather than lambdas.
Your mileage may vary. I process tens of millions (if not hundreds of millions) of requests on Lambda each month. If you have meaningful volume, cold starts aren't a problem. And with my volume of traffic, Lambda use is a tiny fraction of my AWS bill ($30 max?). I'd be interested to know who is dissatisfied with the cost and paying more than $100/mo, and what you're running (and how that would be expensive for you, considering the kind of operation you must be running).
But also, if you don't have highly variable traffic, why would you use Lambda in the first place? If you have negligible traffic (enough to sit on the free tier), why not just use a single cheap EC2 instance? Lambda trades start time for lack of a server—it's shared resource utilization taken to the extreme. You're lowering your cost by letting AWS use the bare minimum to keep your service available, and that means turning your code off when it's not running. If you want to keep your code available at a hundred millisecond's notice, just have a server running.
I assume most of the folks running into this just couldn't be bothered to pay the $7/mo for a hobby Heroku dyno or run a dirt cheap EC2 instance. Really interested to hear from folks that find Lambda impractical for serious use cases.
Cold starts are only a problem if you’re latency sensitive; I forgot to specify that.
I’m speaking from the context of a large company that’s well beyond the free tier, for whom AWS bills matter. If you’re down in the free tier, setup time dominates all other concerns.
This I think misses the point of Lambda a little. If you want to keep your Lambda instances warm all the time, you should use ECS because that's what ECS is. Lambda is for intermittent workloads where latency is less important and the cost of spending 100ms invoking an instance is less than the cost of keeping an idle container up and running all the time.
I agree that intermittent, latency insensitive operations is exactly what Lambda is good for. However this is fairly disjoint with what the hype around serverless has been about, which is why I mention it.
> With the Lambda server-less paradigm, you end up with 1 lambda function per route
I’m not sure why this is the case. You could host all modules in the same lambda endpoint /api/v1/*
Using the same technology you would use with any other backend
I was going to say the same. For smaller apps, just build a monolith as a function and proxy all routes to it from API Gateway. If microservices based app, each resource or route mapped to a function like what he mentioned is common though.
You will end up needing to provision Lambda size for the worst case of any one function. You need to split them if you want to use resources efficiently according to what the functions do.
This has been amazing for the API Gateway Swagger files. There is quite a bit of duplication, and splitting it up lets us write the common stuff only once: error code handling, common request templates, etc.
I work in a security office managing AWS infrastructure resources across a large org and three of these services work great for us:
CloudFormation: this is an excellent resource when you need to tell teams across our org how you want them to set up resources. Rather than have numerous lengthy meetings where we tell them what to do, we just give them the CF template, super simple, and we guarantee everyone has the same setup.
Kinesis: works great for ingesting the data we had them set up resources for with the above mentioned CF script. I can't speak to the Java dependency the author mentioned. Not an issue for us. YMMV.
Lambda: Also works great with the CF template setup mentioned above. Super cheap to use. Maybe the difference between our implementation and the authors is the frequency and trigger used. Our lambda functions are all time based and run once a day, or maybe a few times a day. Super reliable, super easy, super cheap.
I think the overall thesis here is that the usefulness of AWS services depends on what you are using them for.
> Furthermore, in order to have multiple workers, Kinesis Streams require you to use multiple shards. Each worker will make claim to a shard.
It helps if you think of Kinesis as AWS Kafkaesque rather than a message queue, because then shards = partitions and how you work with a Kinesis stream makes a lot of sense.
Multiple concurrent consumers? You're going to need shards/partitions.
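To illustrate the shard/partition point, a stripped-down single-shard reader with boto3 might look like the following sketch (the stream name is made up); in a real deployment each worker would claim one shard this way, which is exactly the coordination a KCL lease table handles for you:

```python
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "example-events"  # hypothetical stream name

# A real worker would claim a shard via a lease/lock; here we just take the first one.
shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest retained record
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp.get("NextShardIterator")
    time.sleep(1)  # stay under the per-shard read limits
```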
Now, question is, as a replacement for homerolled Kafka in EC2, or AWS's Managed Kafka, is it cheaper?
So far, magic 8-ball says "benchmarking for your given use case needed". So far my experiments for our workload and patterns say: maybe, but more investigation is needed. I plan to roll out a side-by-side Kafka and Kinesis experiment on a given topic and ascertain the costs.
Although ultimately, like any distributed messaging system, you end up engineering to the foibles of the system - in Kinesis' case, it's the fact that it rounds each sent message (which could be a batch of records, or a single record) up to the nearest 5KB for billing that you have to engineer for.
Only reason I'm looking into it is cost engineering tbh.
As to the Kafka vs. Pulsar, I've been running a Kafka cluster for several years now, and was interested in Pulsar as a Kafka++, and have been evaluating it for a client who wants to choose between one of the two, and at the moment, I think Pulsar needs about another year before I'd recommend it.
It's exciting tech, but as is normal of recently open sourced projects, there's a lot of bugs being surfaced, and the documentation has a lot of unanswered questions, like, if I'm consuming a multi-partitioned topic in an exclusive subscription, what does that mean for ordering?
I think it has some great ideas (especially decoupling brokers from storage), but yeah, it feels a bit like Kafka pre 0.8, interesting, but you're taking on a lot of work adopting it at the moment.
I think saying it's analogous to Kafka 0.8 is reasonably apt. It's very stable and performant in the verticals it has been used in at Yahoo but not as widely integrated, understood and optimised as Kafka across a broader set of use cases.
That said, if what you are trying to do fits squarely in the box of things that do work well, it's considerably better than Kafka in a few key ways.
The first is the obvious separation of storage and compute/broker responsibilities, the benefits of which also led to the same design being used in LogDevice - a similar system designed and built completely independently and at roughly the same time at Facebook.
The second is selective acknowledgement, i.e the ability to acknowledge messages as processed by a consumer out of order rather than merely the offset of the latest message processed. This allows Pulsar to be more easily used as a workqueue without the multitude of hacks and layered infrastructure required to get the same out of Kafka.
Shared subscriptions/partitioning model. Compared to Kafka it's more flexible, less punishing, and less beholden to the architecture of consumers.
Finally, I would say tiering; for some it means nothing, but depending on your use case it can be defining. Pulsar can offload historical segments to long-term stable storage but still present a unified offset-like API to consume historical data.
So I think for some Pulsar provides enough benefit to make up for any shortcomings in integration as long as you have sufficient engineers able to debug/patch issues.
It's actually awfully simple. Take everything Kinesis does and Kafka does it better. Yes, everything.
Then on top of that, add Consumer Groups, which basically deal with the issues in the OP w.r.t. consuming a topic from multiple processes, along with providing administrative APIs to reset application offsets, inspect lag, etc.
Also a bunch of extra features like transactions etc but if you are comparing on the basis of Kinesis like features they aren't likely to matter as much as the core functionality - which is where Kinesis really gets destroyed anyway.
Normally I wouldn't take such an absolutist position but when it comes to Kinesis let me repeat in no uncertain terms.
You're missing something major. I have experience using Kafka and Kinesis. My current company only uses Kinesis and zero Kafka. Why? Because Kafka takes significant knowledge, experience, and time to set up, understand, and support. And if it fucks up and you don't have one or more "Kafka experts" on call, well, you're screwed.
Kinesis on the other hand, just works. Yes, it just works. I don't have to have a Kinesis expert on hand, I don't have to configure clusters myself or write Ansible/Puppet, etc. I have a few basic lines of Terraform to create my Kinesis streams and I push data to them and we got it working in minutes and we've had no issues.
Contrast this with my previous job, where we literally had to hire multiple Kafka experts at high salaries to maintain our Kafka clusters.
This is why you only use Kafka if you ABSOLUTELY NEED it.
I am about 90% of the way to ridding a larger company of Kinesis. You might not have needed a Kafka expert, but instead you had to create (likely poorly implemented) frameworks for consuming from Kinesis (when you could have just used Consumer Groups).
Also, Kinesis is insanely expensive, to the point that if your throughput is anything but trivial (in which case it doesn't matter which option you use, both will -just work-), it just becomes untenable.
I actually did miss something but it's not complexity, you can easily use hosted Kafka (AWS even provides it with MSK now). It's actually authentication. Kafka does have authentication but it doesn't easily tie into AWS or GCP authentication mechanisms without the help of a tool like Vault.
That said.. I don't think that minor benefit is enough to ever justify Kinesis over Kafka.
At tiny throughput maybe? Pretty sure you could run Kafka on small ec2 instances and still get great throughput, especially if you don't need much retention. (reducing retention reduces need for disk I/O as consumers can't fall far behind causing disks to seek).
I think for any reasonable workload, i.e 10k/s+ and/or throughput over 100mb/s Kinesis gets dumpstered for price every time.
As I'm evaluating the other way, I'd be curious if you see any major advantages of Kinesis over Kafka.
I'm not wedded to one or the other, just interested, as I said in another comment, I'm looking at Kinesis from a cost engineering basis alone, but if there's something I've overlooked, would be keen to know :)
I'm confused by what the author says about CloudFormation:
> Lack of Drift detection or reconciliation. With lack of drift detection comes great uncertainty.
Cloudformation has definitely had Drift Detection since 2018: https://aws.amazon.com/blogs/aws/new-cloudformation-drift-de... It's not everything you might want it to be, but it's not like Terraform will reconcile your drift automatically either, that I know of.
Terraform does indeed reconcile drift 'automatically' across all resources, by which I mean the plan will include changing everything that's drifted back to the specified configuration. That may not always be desirable, which is why building a good plan/apply process with approval is important. (Same goes for CloudFormation, though.)
Can't believe CloudWatch wasn't mentioned here. It could be such a useful product, but it is so messed up that there are probably multiple billion-dollar companies in this space.
I’ve heard rumblings from inside AWS that there is internal pressure not to make CloudWatch too powerful, lest it harm their partners (like Datadog, etc).
I’ve heard this a couple times after complaining to AWS engineers about CloudWatch shortcomings.
That said, CloudWatch has gotten tons better than a few years ago.
I don't think that's true, internally there was a long term plan to bring cloudwatch up to par with internal metrics and alarming systems. At the time when I left Amazon, there were some shims to use internal dashboards and alarms with cloudwatch data. The eventual goal was for the whole company to move to cloudwatch instead of maintaining two systems.
What I don't get was the broad internal push to building on AWS directly, which was great most of the time, but came with big downgrades in metrics and alarms if you were to use the AWS equivalent.
CloudWatch Logs has no ingestion rate limits. Using it is a real financial risk. Misbehaving servers can quickly generate huge AWS charges, up to $74,000 per day per account. Even 12 servers could produce $4320 in charges just over a weekend.
Also the CloudWatch Logs console is unusable for even simple tasks.
Datadog and its competitors have ingestion rate limits, but they are overall quite expensive. Self-hosted log analysis tools are exceedingly complex to set up and maintain: ElasticSearch + Kibana, Grafana + Loki.
Just curious (and I'll admit I'm biased, am co-founder @ Grafana Labs): what do you find exceedingly complex about setting up and maintaining Grafana + Loki? Both are single binaries and can be run without any dependencies.
I'm using Terraform to maintain identical staging and prod deployments. Grafana is difficult to deploy statelessly, so this adds yet another Terraform deployment stage:
1. Deploy host running dockerd
2. Deploy Grafana server
3. Configure Grafana server admin password, organizations, users, and passwords.
If Grafana would just support file-based configuration then a whole stage could be eliminated.
"Loki does not come with any included authentication layer. Operators are expected to run an authenticating reverse proxy in front of your services, such as NGINX using basic auth or an OAuth2 proxy."
Unfortunately, this means that any process that can write logs can read all logs. This violates the principle of least privilege, a core part of system security. Prometheus suffers from this, too. To put this into concrete terms: When someone uploads a malicious image through our app which exploits our image resizing server, they will obtain log-writing credentials. Those credentials should not allow them to read all the logs and steal the user data in them. That would be catastrophic for the users and the company.
ElasticSearch supposedly has ACL support, but it is mostly undocumented and full of foot-guns. For example, their security doc omits a necessary flag which enables password enforcement. After following the guide, I discovered that the passwords I had set up were not being checked. I immediately lost all confidence in ElasticSearch as a tool to safely store user data. I deleted it.
I used InfluxDB since it lets me create write-only user accounts. Unfortunately, Grafana's integration with InfluxDB is problematic.
The lack of useful debug logging in Grafana makes troubleshooting especially difficult.
They need to do away with CloudWatch/Logs API Call limits. They actually push users into using Kinesis if they need frequent access to their logs, which adds an unbelievable layer of complexity just for the privilege of avoiding even more latency from using S3 and writing their own scripts/code to collect and pull what they want.
I believe the title of the article should be "AWS Services You Should Avoid if you don't need them or if they are not the right fit for your specific application, for example, if you need social login you are better off with other alternatives more fit to the social login use-case; or for example, if Lambda is not the best fit for containing and deploying my code given my architectural decisions at the moment of this writing".
I'm honestly not familiar with most of the services mentioned from a user perspective, the AWS service I worked on didn't consume them at the time, and I work for another cloud platform now.
That said, this point gave me particular pause, because it brought in to question the other assertions:
> The application collected small json records, and stuffed them into Kinesis with the python boto3 api . On the other side, worker process running inside EC2/ECS were pulling these records with boto3 and processing them. We then discovered that retrieving records out of Kinesis Streams when you have multiple worker is non-trivial.
Yeah.. because that's really not what Kinesis is designed for. They even, rightly, point out that SQS is a better fit for that purpose. That hardly makes it a service to avoid.
That was the most annoying part of the article. A streaming data platform is not a task queue. Why do you think they built Kinesis in the first place when SQS had been around for years?
I wholeheartedly agree with the Cloudformation take.
Terraform is better in almost every way. If you use Cloudformation you'll end up writing a bunch of bash script wrappers or similar around it to make it actually do what you want.
Everyone who's tried both at scale has said the same thing.
I disagree. I spent time in Terraform a few years ago working with a client and Terraform had the ability to create but not tear down resources for some services. I was shocked -- check out the Github issues history. I ended up writing a "bunch of bash script wrappers or similar around it".
Yea, this is exactly what I was talking about. It's not to say that CloudFormation can't run into the same thing - deleting S3 buckets is difficult in any situation - but there are a lot of things Terraform doesn't/won't do, and often you're left making your own second layer of automation to work around it.
To be fair, that also happens a lot in Terraform. I'm taking over work at my org where people have wrapped Terraform in 2-3 different tools and templating engines. And then you have to work out how to robustly store and back up your TF states because piecing it back together if you lose your state, or someone else corrupts it, can be just as bad as the worst CF issues.
Overall, I think Terraform is better if you're deploying a lot of inter-connected resources, but Cloudformation makes a lot of things "just work" in ways that Terraform doesn't. I think of it like ECS vs. EKS/Your Own Kube Whatever. ECS is full of gotchas and limitations, but if you play along with it you get a lot of things "for free".
We used Kinesis in an ETL pipeline and ended up writing our own custom client for it because we didn't want to use the Java one. The product itself is not too horrible, but the fact that they refuse to create native clients for other languages makes it a no-go for any company that is not JVM-based. A managed Kafka on GCP, or Kafka directly on Kubernetes (there is a controller for that, if you pay Confluent, which makes the whole thing painless), is much better.
There are excellent community KCLs for other languages, like Twitch's kinsumer [1] for Go. We've been using it in production with a high-throughput stream for almost a year now without any issues.
While CloudFormation is annoying to use without extra tooling (like Sceptre -- makes management of dependencies between outputs and inputs go away easily, and it's extensible with plain Python), Terraform is horrible in its own ways. The AWS provider has many consistency issues, even glaring ones like not detecting that IAM policy documents have not changed when using lists in actions=[] and resources=[], and after having tried it, it makes me uncomfortable managing large landscapes with it.
Terraform mostly makes sense -- even though there is good tooling around CloudFormation, and CFN more often than not does the job well enough -- when you find out that CloudFormation is, like Terraform, just a driver for AWS APIs, likely developed by a dedicated team, and as such features of the underlying APIs are often not exposed in CloudFormation at release time. I noticed this especially when experimenting with EKS: there's eksctl, which integrates with CloudFormation (generates some stacks) but is utterly useless for integration because you can't import outputs or exports, so you have to hardcode all your SG IDs, VPC IDs, and Subnet IDs if you want to integrate with existing infrastructure (https://github.com/weaveworks/eksctl/issues/1141). Dealbreaker, waste of time, disappointing for an "official" CLI. Next, there's pure CloudFormation -- but no luck for you: to this day AWS doesn't support EKS endpoint access control settings through CloudFormation (https://github.com/aws/containers-roadmap/issues/242), a dealbreaker if you need that and are allergic to public control endpoints in your infrastructure. Looking at Terraform: it supports integration with existing CloudFormation infrastructure and can access CFN exports, it supports all the EKS settings you'd want, and it offers a consistent interface to these features, so you've got something to use, short of rolling your own.
Mind you, CloudFormation is extensible using Custom resources, but hacking around CloudFormation is likely not worth your time and something Amazon should do. Anyway, CloudFormation is likely one of the most tested and well-working parts of AWS, so I'd prefer it over third-party state management unless I find a very good reason to do so.
> so you have to hardcode all your SG IDs, VPC IDs, and Subnet IDs if you want to integrate with existing infrastructure (https://github.com/weaveworks/eksctl/issues/1141). Dealbreaker, waste of time, disappointing for an "official" CLI
You can reference Parameter Store and Secrets Manager entries in CF so you don’t have to hardcode values.
"What’s the problem (with Elasticache)? It is expensive, and no one notices for months."
So you just click on stuff without knowing how much it costs? Really?
> Secondly that cache.* prefix means this instance costs $0.216/hr instead of $0.126/hr, a 71.4% premium. Then you might think you need one for dev, qa, and prod
Ok and you pick a cache of the same size for dev/qa/prod? Again, really?
> But, not all data/records/events should go into Kinesis. It is not a general purpose enterprise event bus or queue.
Yes, there's SQS for that. But it seems another case of clicky-go-lucky. "Hey it says queue here so we'll just use this right?!"
> Lambdas are great for the following tasks:
Yes, agreed. I would generalize it as: it is good for specific small tasks, especially when plugging into the rest of the AWS infrastructure (reacting to/from SQS, S3 events, etc)
> Lambda is horrible for:
> A replacement for REST API endpoints.
Sigh
General advice: if something feels weirdly hard, then you're probably doing it wrong.
> the new YAML CF is better than the old JSON CF, but it’s still tough to read long complex stacks.
These days I would recommend using CDK if you can’t switch to Terraform but I would never under any circumstances recommend YAML since the odds approach certainty that the magic in the parser will cause a problem. Beyond the usual confusion around typing I’ve seen things like significant whitespace breaking functionality[1] and the shorthand types in a few cases make diffs messier. If you find an example in YAML, it takes a second to run it through cfn-flip and you’ll never have to deal with any of that.
For anyone using CF seriously, I highly recommend the cfn-python-lint tool and associated editor support. It will catch many of these cases before you burn a lengthy update cycle.
1. Fun fact: AWS’ own security alert CF stack fails the Security Hub CIS scanner because it adds extra white space to the metric filters.
> With the Lambda server-less paradigm, you end up with 1 lambda function per route.
Only if you use a framework that splits it up that way (which I agree is horrible). But nothing about Lambda _requires_ this architecture. In fact, API Gateway has a proxy mode that allows you to serve all requests from a single Lambda.
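A minimal sketch of that proxy-mode setup, assuming an Express app and the serverless-http package (both illustrative choices, not something the comment prescribes):

```typescript
// API Gateway sends every request ({proxy+} / ANY) to one Lambda, and an ordinary
// Express app does the routing. Routes are made up for illustration.
import express from 'express';
import serverless from 'serverless-http';

const app = express();
app.use(express.json());

app.get('/users/:id', (req, res) => {
  res.json({ id: req.params.id });
});

app.post('/orders', (req, res) => {
  res.status(201).json({ received: req.body });
});

// One exported handler serves every route; moving off Lambda later just means
// calling app.listen() in a normal daemon instead.
export const handler = serverless(app);
```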
To my surprise this is mostly solid advice despite the clickbaity title.
I've never used Cognito or ElastiCache, but for the other three (CloudFormation, Kinesis, and Lambda), I would agree with what he's actually saying, which is to use these services judiciously.
CloudFormation has some advantages over Terraform, but having used both, Terraform is much more usable for long-lived static resources. I have yet to try Pulumi, and I don't see any reason why I should to be honest. However, CloudFormation is required for automated build and deployment of Lambda functions--both Serverless Framework and AWS SAM use CloudFormation. I've also used CloudFormation StackSets to build out infrastructure for multi-account governance. In normal use though, Terraform is typically more usable, though both solutions can get pretty hairy.
By "Kinesis" the author seems to be referring to Kinesis Streams. In fairness I haven't built anything where Kinesis Streams would be a better use case than SNS/SQS, but that's just as much a statement about the projects I have happened to work on than it is about Kinesis Streams. I have used Kinesis Firehose, which provides a very scalable mechanism for, "multiple clients are intermittently spitting out a massive volume of data points, all of which need to be faithfully logged into S3 somewhere".
And finally, we come to Lambda. Lambda is a good fit for use cases where either you need to run some code on an intermittent basis or you want to prototype faster and don't want to mess around with provisioning and deploying servers. Lambda is a great place for little pieces of glue code that get triggered on an x-minute cron, or based on a CloudWatch event, or from SNS or SQS. It's good for serving an HTTP API that gets less than one request per second. These use cases are extremely common and I've encountered them a lot more than the author seems to, though again maybe that's just me. But for anything high-traffic, where the Lambda container is just going to stay hot, deploy it to EC2 instead (perhaps with some container magic in the middle); it's cheaper in the long run.
My team has invested a sizable amount of code to manage CloudFormation. At this point, we have a rather mature interface for dealing with CloudFormation, including:
- defining dependencies between stacks
- taking outputs from stacks and feeding them in as parameters to other stacks (not using that awful Import/Export Value crap they implemented)
- deploying as many stacks in parallel as possible and waiting for them to complete before deploying dependent stacks
- dealing with common failures and rollbacks, including handling known "continue update rollback" steps with predefined resources to skip
- pre/post stack create/update/delete actions to make API calls and perform other actions outside of CloudFormation
We basically built the missing pieces of CloudFormation ourselves and have managed to keep CloudFormation holding the definitive state rather than managing it ourselves (or paying Terraform to do it). We have about 500 stacks for a single deployment of our product.
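Not their actual tooling, but a minimal sketch of the outputs-as-parameters idea with the AWS SDK v3; stack, output, and parameter names are made up:

```typescript
// Read the outputs of an upstream stack and feed selected values into a dependent
// stack's parameters, instead of relying on Fn::ImportValue exports.
import {
  CloudFormationClient,
  DescribeStacksCommand,
  UpdateStackCommand,
} from '@aws-sdk/client-cloudformation';

const cfn = new CloudFormationClient({});

async function outputsOf(stackName: string): Promise<Record<string, string>> {
  const { Stacks } = await cfn.send(new DescribeStacksCommand({ StackName: stackName }));
  const outputs: Record<string, string> = {};
  for (const o of Stacks?.[0]?.Outputs ?? []) {
    if (o.OutputKey && o.OutputValue) outputs[o.OutputKey] = o.OutputValue;
  }
  return outputs;
}

export async function deployDependentStack(): Promise<void> {
  const network = await outputsOf('network-stack');
  await cfn.send(
    new UpdateStackCommand({
      StackName: 'app-stack',
      UsePreviousTemplate: true,
      Parameters: [
        { ParameterKey: 'VpcId', ParameterValue: network.VpcId },
        { ParameterKey: 'SubnetIds', ParameterValue: network.PrivateSubnetIds },
      ],
      Capabilities: ['CAPABILITY_NAMED_IAM'],
    }),
  );
}
```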
We also implemented a large number of Transform Macro lambdas to make composable templates _much_ easier.
There's a lot you can do with CloudFormation, but it takes some investment.
A few things about it that really drive me nuts, though:
- Lagging support for new features/resources
- Parameter count and parameter size limits
- Certain bugs with some resource types that are slow to get fixed (Redshift cluster password management and issues with Elasticsearch resizing triggering blue/green deployment come to mind).
- No ability to modify timeouts for some actions. For example, the timeout for a CustomResource is fixed and cannot be tuned -- if your Lambda never responds, CloudFormation will hang for up to 2 hours. We wrote our own Lambda wrapper just to guarantee a response if an unexpected failure occurs.
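A hedged sketch of a wrapper in that spirit: the response fields (ResponseURL, Status, PhysicalResourceId, and so on) are CloudFormation's documented custom resource contract, while the wrapper shape itself is illustrative and assumes a Node 18+ runtime for the global fetch:

```typescript
// Whatever the inner handler does, CloudFormation always gets a response on the
// pre-signed ResponseURL, so a crash cannot leave the stack hanging for hours.
type CfnEvent = {
  ResponseURL: string;
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
  PhysicalResourceId?: string;
};

async function respond(event: CfnEvent, status: 'SUCCESS' | 'FAILED', reason?: string) {
  await fetch(event.ResponseURL, {
    method: 'PUT',
    body: JSON.stringify({
      Status: status,
      Reason: reason ?? 'See CloudWatch Logs',
      PhysicalResourceId: event.PhysicalResourceId ?? event.LogicalResourceId,
      StackId: event.StackId,
      RequestId: event.RequestId,
      LogicalResourceId: event.LogicalResourceId,
    }),
  });
}

export function wrap(inner: (event: CfnEvent) => Promise<void>) {
  return async (event: CfnEvent): Promise<void> => {
    try {
      await inner(event);
      await respond(event, 'SUCCESS');
    } catch (err) {
      await respond(event, 'FAILED', String(err));
    }
  };
}
```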
My team has a case with Kinesis streams where multiple lambdas (and a firehose) process the same data and perform different actions on it. This can't be done with SQS, and there's no "pausing" with SNS. If, for some reason, a lambda has issues processing data, we have until the TRIM_HORIZON to fix the issue and resume processing.
I will say, though, that the KCL library is a PoS, and the few containerized services that are reading from Kinesis exhibit this annoying behavior of causing the iterator age to spike to the TRIM_HORIZON when performing a blue/green deployment of the containers.
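A hypothetical sketch of one such consumer: a Lambda wired to the stream through an event source mapping, decoding each record and doing its own thing while other Lambdas (and a Firehose) read the same stream in parallel:

```typescript
// Uses the @types/aws-lambda type definitions; the handler body is illustrative.
import type { KinesisStreamEvent } from 'aws-lambda';

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    const payload = JSON.parse(
      Buffer.from(record.kinesis.data, 'base64').toString('utf8'),
    );
    // This consumer's specific action goes here; a failure means the batch is
    // retried, and you have until the data ages out of the stream to fix it.
    console.log(payload);
  }
};
```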
This would be great as a Markdown repo of "When to avoid each AWS Service."
Many services beyond the ones listed have serious usage limitations or gotchas, and the listed ones have more limitations than those given. Workarounds would also be useful to know.
There are also ones that aren't in active development, like SimpleDB. It's not that it doesn't work, but you probably don't want to build anything new with it.
I don't get this attitude when applied to small things. Can't it just be considered complete and not in need of enhancements? If it's not supported to the point of not fixing security issues, that's completely valid.
SimpleDB has some significant limitations which don't exist in its successor, DynamoDB. For example, SimpleDB domains are limited to a size of 10 GB; no such limitation exists in DynamoDB.
Well stated. It's very clear to say it has a 10 GB limit rather than a vague "don't use this". SimpleDB and DynamoDB are at opposite ends of the scaling spectrum.
I actually agree with this in general. One of the open questions in my mind is what's a single-product business supposed to do when its product is feature complete.
Someone's shotgun-of-frustration post; sounds like the author is mad at everything in AWS. Lambda has its uses (I definitely wouldn't follow AWS's SAM model, and I wouldn't try to serve my critical web site from it), but Lambda does have a convenience factor.
Cloudformation isn’t perfect but it is well integrated within AWS and you can get support for it around the clock. Hashicorp quoted me $13,000 just for support for Terraform. AWS support isn’t cheap either but covers everything in AWS and their support is fantastic - like IBM in their heyday.
Athena is actively worse than hosting your own Hive on EC2. It'll time out queries and give you no results beyond charging you. Given the variable performance of S3, that's terrible.
You should use Cognito as a quick way to provide SAML/OAuth to a lot of apps, but it's not going to solve all your authentication woes. Nothing does (not even Auth0).
You should use CloudFormation if it's there and you're just trying to stand something up quickly. Shell scripts with AWS CLI works too. For more robust long term stuff, use Terraform, Pulumi, or custom Boto3 code. The point is to just start using the Infrastructure as Code pattern early, not to make it perfect. Terraform is surprisingly painful after a while, but it's easier to standardize your organization on large-scale. There's no great solutions in this space.
You should absolutely throw money at ElastiCache to quickly scale a cache up and down. If you're not a very experienced sysadmin (and fuck the entire tech industry for making that term a dirty word), i.e. if you are not experienced at administering systems, you'll be spending unnecessary time and energy standing it up and maintaining it. Your caching should ideally just be in your service, and scaling your services themselves should be sufficient, but whatever, Redis wasn't invented for people who write good code.
> But, not all data/records/events should go into Kinesis
Not all of anything should go into anything. Use it if it's convenient.
Lambda is very useful for batch jobs and CloudWatch-triggered maintenance tasks. I wouldn't rely on it for anything more than that; I would rather just deploy the same code to a Fargate Spot Instance and not deal with all of Lambda's bullshit.
Summary of the article's recommendations. Don't use:
- Cognito (authentication). Because: social login on mobile isn't native. Instead: use Auth0, OneLogin, Okta, or roll your own.
- Cloudformation (programmatically configure AWS). Because: various complexities. Instead: use Terraform.
- ElastiCache (managed Redis). Because: expensive. Instead: run Docker Redis in EC2.
- KINESIS (queue), as a general-purpose data queue. Good for: streaming data such as video processing/uploading. Bad for: a generic data queue, because it's difficult to route each event in the queue to one of multiple workers; Kinesis is meant for every listener to be assured of getting the entire stream. Instead: SNS/SQS (SQS FIFO) or a queuing framework that sits on top of Redis or a traditional database.
- Lambda (server-less), to implement a REST API. Good for: serving/redirecting requests to CloudFront, and reacting to events from SNS or SQS by running small asynchronous tasks. Bad for: a replacement for REST API endpoints, because it's too hard to work with a zillion lambdas. Instead: use a regular web framework (or, as other commenters say, route all requests to one lambda and then do more routing within that lambda).
I do not agree or disagree, I'm just summarizing; although I did add in some information from other comments here, too.
While I haven't personally had the opportunity to run into these issues, the feedback there shows a serious lack of ownership that I've never encountered elsewhere with AWS.
While I agree that medium should die in a fire, it has been my experience that someone has usually submitted the URL to archive.is and that is true in this case, too: https://archive.is/hz1RG
You may also enjoy trying an incognito window in the future
Regarding using Lambda for a REST API: just set up an API Gateway endpoint in proxy mode and run all your endpoints from one lambda. It will be much simpler, and if you use a decent framework, it will be really simple to convert your app into a regular daemon process if you ever decide to move off Lambda.
Regarding cloudformation, I view manually writing cloudformation json/yaml as an anti-pattern.
I would recommend looking at CDK, which is a framework for writing your CloudFormation stacks as actual real code, not markup.
It's a great tool, and I find it enormously more productive and expressive than terraform.
CDK allows you to write CDK apps in a variety of languages, but I would recommend just using TypeScript, since that's what it's written in.
I tried the Java bindings, but they were pretty clunky.
For a non-AWS-specific, CDK-like experience, you could also look at using Pulumi, but I haven't really used it much myself.
Trying to define infrastructure purely in declarative markup-based languages is such a waste of time.
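To make that concrete, a minimal sketch of what "stacks as actual code" buys you: a loop producing one queue-plus-alarm pair per environment, which in raw YAML would be copy/paste. Names and thresholds are illustrative.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

class QueuesStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);
    for (const env of ['dev', 'qa', 'prod']) {
      const queue = new sqs.Queue(this, `${env}-jobs`, {
        visibilityTimeout: cdk.Duration.minutes(5),
      });
      new cloudwatch.Alarm(this, `${env}-jobs-backlog`, {
        metric: queue.metricApproximateNumberOfMessagesVisible(),
        threshold: 1000,
        evaluationPeriods: 3,
      });
    }
  }
}

const app = new cdk.App();
new QueuesStack(app, 'queues');
// `cdk synth` renders this to plain CloudFormation; `cdk deploy` applies it.
```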
Even if you don't want to use the proxy route, you can point multiple routes at the same Lambda function, and every ok-ish routing module/framework will know which code path to follow.
This makes some good points about misuses of these AWS services, but the title is misleading. The article is actually more like "tempting but inadvisable use cases for AWS services".
My employer uses three of these heavily (ElastiCache, Kinesis and Lambda) and we get quite a bit of leverage out of them.
ElastiCache in particular surprised me. At first glance I mistook it for a transparent (and expensive) wrapper around sticking Redis on an EC2 instance, but if your usage is heavy enough to need multi-node clusters (e.g. read replicas or full Redis Cluster), its orchestration features are pretty useful. We can resize instances, fail over to a replica, and reshard clusters, with zero downtime, by clicking a button (or a one-line Terraform change). And never having to install security patches is nice too.
It certainly is expensive, though. (But if you're not willing to pay a premium for managed infra, what are you doing on AWS in the first place?)
Don't fret! You still have all the weird service limits to look forward to, which you'll hit both directly and indirectly when using CFN to maintain and extend a prod infrastructure YoY.
Just like the Redis replacement, AWS's MongoDB replacement, DocumentDB, comes up when you search for Mongo, but it is not an actual replacement even though the name indicates "compatibility."
As a rule of thumb, we always try to avoid such proprietary services because of the potential vendor lock-in, whether AWS or not.
The approach CFN used made sense when there was no alternative, but Terraform's approach is so much better. I wonder how AWS approaches this space in the future. CFN has great vendor lock-in and needs to remain supported for any customers who are locked in, but it's harmful for any new development to use CFN.
I get that AWS likes having infrastructure framework level lock-in, but they need to do better.
Does anyone have a feel for how hard AWS support, account reps, or training push CFN versus Terraform? I've been out of the AWS cloud for a while.
Support doesn't officially support third-party products, and support engineers may or may not know Terraform. Engineers definitely aren't trained on Terraform, and your mileage will vary a lot depending on the specific engineer and their background. It really depends on whether they've run across it in other jobs, personal projects, etc.
Customer obsession is a big deal for support engineers. So if they can, they'll try to help solve a Terraform issue. But it'll be on a personal, best-effort basis.
For such a mud-slinging article, I'm surprised the author didn't complain about the most pressing issue (IMO) with CloudFormation: there have been a few instances where things in AWS changed and Terraform was updated more quickly than CloudFormation was. It happens less and less now, though.
> Support doesn't officially support third-party products
This is not entirely correct. "AWS Support Business and Enterprise levels include limited support for common operating systems and common application stack components" as documented here: https://aws.amazon.com/premiumsupport/faqs/ under "Third-party software"
Having been in some calls with reps/solutions architect folks recently, the push hasn't been super hard. I don't think the CFN vs. Terraform discussion has ever come up.
Cloudformation is spot on. There is simply no excuse in 2020 to use it when cloud agnostic Terraform exists.
ElastiCache is expensive and the analysis is generally correct on that; however, if you do actually need a high-capacity, rock-solid Redis cluster and you can afford it, it does the job. Paying someone to set up and scale a cluster of that magnitude would probably cost a year's worth of usage. Of course, if you don't actually need to scale it yet, then just use the suggestions in the article and run it yourself.
Perhaps bad terminology on my part but I would argue that Terraform IS cloud agnostic, it is NOT cloud portable.
You can use Terraform with any cloud. You cannot seamlessly move your AWS configuration to Google Cloud but you can still write Google Cloud configuration with the same paradigm as your AWS configuration.
This is exactly the idea of Terraform - the workflow is portable and agnostic, but the full fidelity of each individual resource provider API is exposed.
it's not just the provisioner. it's not like different clouds have the same abstraction. for all intents and purposes you need to understand the underlying cloud and how to use the api/tooling around it when terraform fucks up
People ask that question all the time, and I don't get why. I hate HCL, I hate the verboseness, the limited capacity for abstraction, and that it's yet another proprietary format; everything about it reminds me of ANT.
I want infrastructure as code, not infra-as-yaml. I want types checked by the compiler, as is good and proper. Apparently many people don't feel that way though?
There's another one that I think qualifies for this list: Elastic Container Service. It's a half-baked Docker service. It seems like it's been abandoned now in favor of Kubernetes.
I don't really agree that it's been abandoned; nothing suggests that this is true. They're not known for just abandoning established services, and there's a public roadmap[0] that you can follow. Like all services, there are pros and cons. Up until very recently EKS lacked more or less everything ECS has in terms of tight integration with other AWS services, so there's that.
I love the 3rd party integration that comes with it.
I love the fact that I don't have to manage user credentials and everything else by myself.
However,
I hate that the UI is not polished, and there's very little customization allowed. As an example, from the sign in/up page, you can't add a link to go back to the home page.
I hate that if a user signs up via Google, and later on tries to reset the password via the Google email, Cognito silently drops the request.
I hate the js library. Hard to use and documentation is not great.
You can definitely use one Lambda function to serve all of the endpoints that the author mentions. You need a wildcard route in API gateway, and can easily use koa-serverless or express-serverless to get the route information when any request is received. That way you only need one function for your APIs. Even with a regular api not on lambda, you’ll still have to be aware of db connections.
I have no experience with Cognito, and the CloudFormation point I completely agree with. The rest of them don't make sense though. ElastiCache is expensive but offers a ton of features that a single Redis process on an EC2 instance will not. And while Kinesis and Lambda are bad for the use cases called out in the article, they each fit their niche pretty well.
Very Flask-like, and it includes CLI components to automate deployment. It even includes simple decorator-based event mapping for S3, SQS, scheduled tasks, etc.
I agree. My personal list of stuff I like using at Amazon:
1. S3 + CloudFront (but I still use Cloudflare for DDoS)
2. EC2
3. Aurora PostgreSQL
4. SQS in a few cases, but generally I use Redis for queueing
5. Lambda for tiny notifications, usually to SQS when S3 gets dinged
Unlike the author I'm not afraid of Lambda, but I generally prefer containers for apps.
The author misses one important point with Redis: you don't pay for traffic. If you host it yourself, you pay for traffic both ways if it crosses AZs. Dynamic scaling, no-downtime upgrades, etc. It works really well.
I do hope they will get tiered RIs for ElastiCache like they have for RDS.
> You dig into CLOUD WATCH logs. We have three sets of logs all intermixed together. This is unreadable.
I'm not following this. You don't read logs one by one; you filter by source, error code, time range, etc. It's been a while, but I'm also pretty sure CW has log groups.
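For example, a hypothetical sketch of that kind of filtering with the AWS SDK v3 (the log group name is made up):

```typescript
// Pull only ERROR events for a time range from one log group, instead of reading
// intermixed streams by eye.
import {
  CloudWatchLogsClient,
  FilterLogEventsCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});

export async function recentErrors(logGroupName: string): Promise<string[]> {
  const { events } = await logs.send(
    new FilterLogEventsCommand({
      logGroupName,
      filterPattern: 'ERROR',
      startTime: Date.now() - 60 * 60 * 1000, // last hour
      endTime: Date.now(),
    }),
  );
  return (events ?? []).map((e) => e.message ?? '');
}
```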
Hmm.
1) their VPN solution is overly expensive
2) their MongoDB document database (DocumentDB) is similarly expensive
3) Lambda I want to use, but other than the SNS/SQS-connected tooling I agree; as an API endpoint, Google's offering is better
4) Cognito: horrible docs...
One time I hosted an ElastiCache instance for a couple of days to test the Bull library, then forgot about it for a month even though I wasn't using it. Got a bill for $1,500 :)
We use the above-mentioned services and I don't agree with the assessment; we are cloud native, use Lambda heavily, and think it's an amazing service for moving faster!
I lack experience with AWS and clouds in general, but I wonder: what is the alternative to CF if I should avoid it? I can imagine alternatives to the rest, but I cannot come up with one for CF.
use CloudFormation (or Deployment Manager if on GCP). Terraform is a disaster, and you'll learn the hard way that you cannot trust it and that you'll have to understand the underlying cloud no matter what you do.
cloudformation is awesome and it works as advertised. I have literally not seen it fail catastrophically ever since it came out. it just works.
terraform? or terrafail as it’s known. it’s a disaster. it cannot keep track of the resources it creates. when in trouble, it throws its hands up in the air and you’re on your own. do yourself a favor and don’t learn the hard way - in production - that terraform cannot take your infra from point A to point B (and rollback in case something went wrong).
Lol, CloudFormation historically didn't support really big and mandatory chunks of AWS resources and resource parameters, and until last year there was no coverage roadmap (https://github.com/aws-cloudformation/aws-cloudformation-cov...), so it was no fun: you couldn't even, for example, enable RDS storage autoscaling via a CF template. That's why Terraform was created: AWS CF didn't support AWS's own resources.
hahahah. terraform has bugs that were opened 8 years ago and were swept under the rug countless times.
the fundamental problem is that terraform is not even a half baked tool (what version was it again? 0.12?) and people are betting the farm on it. guess i’ll go build more software while others are twirling their hands with the super HCL language (what’s even more aggravating is that Hashicorp got this part - the language - right with Vagrant but I guess reinventing the wheel and using go was sexier).
If you needed to set up an S3 bucket that triggers a Lambda function when an object is added, how would you do that? Last I looked, the S3 resource would get created lazily and the Lambda wouldn't get an S3 ARN, since the bucket didn't exist yet, and the stack would fail.
It's a super obvious way to use S3 and Lambda, but the docs recommend something insane.
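One hedged workaround, sketched with CDK rather than raw CloudFormation (my pick, not something the question or the docs prescribe): the construct library adds the bucket notification and the invoke permission for you, so the circular reference never has to be hand-built.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'UploadsStack');

// Placeholder handler; in practice this would be your real function code.
const fn = new lambda.Function(stack, 'OnUpload', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async () => {};'),
});

const bucket = new s3.Bucket(stack, 'Uploads');

// Wires up the S3 notification configuration and the Lambda resource policy together.
bucket.addEventNotification(s3.EventType.OBJECT_CREATED, new s3n.LambdaDestination(fn));
```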
it's not cool to use it. why use a service that's written, run and maintained by a dedicated team that also builds all the other services you use (if in AWS)? they don't know what they're doing, right?
better to use some randotool and talk big about IaC, when in reality the only people I have run across who loved Terraform either 1) used it on a really small scale or 2) don't really have a lot of experience with the cloud.
the only reason not to use CloudFormation is that you're not using AWS. Use whatever your cloud provides.
For everyone that has objections, an appeal to authority: I was building stuff on the cloud before you knew the cloud was a thing. I worked with AWS, GCP, Azure and [sad face] OpenStack extensively.
Terraform resources aren't meant to be externally modified. The configuration itself shouldn't have a life of its own; it should represent the whole behavior of the stack.
this is not about terraform “resources”. it’s about having a reliable way to work with your infrastructure. terraform is not that way (have you ever tried seeing what happens if the terraform process crashes/is killed or you lose network connectivity while terraforming?)
I've been using Okta for authentication in my SaaS app, and the docs and support experience have been great for me! Free for 1000 MAU. Haven't tried Cognito or Auth0.
Am I the only one that finds AWS' pricing structure as a whole too risky to use? Why use a service that can charge me a lot of money without being up front about it?
DynamoDB is fine as long as it fits your use case.
The trouble is that without a time machine, it's very tough to realize that your use case isn't the one DynamoDB is good at. The docs and Best Practices certainly suggest that anything is possible.
Do you need God Tier Scale (and understand enough about DDB to achieve it (and have the $$$))? DynamoDB is awesome! Otherwise... maybe consider a few other options...
You can use one Lambda to handle multiple API Gateway requests: route ANY to it, then have a bridge that connects to your favorite language's router. We do this all the time with NestJS. The downside is that you burn some cash when someone calls a 404 resource, since your NestJS app has to say it's missing, and that requires at least 100ms of Lambda time. You could probably get around this by generating a route map off the NestJS controller files.
Well, now you are comparing apples to oranges: Terraform's main focus is managing and provisioning infrastructure, while Ansible is more commonly used to do the actual instance provisioning and configuration.
In fact, Terraform + Ansible is a fairly common and powerful stack.
While it's possible to provision actual instances with terraform, it's not what you should be doing.
Let terraform handle the infrastructure, and have ansible or puppet or saltstack or cfengine or whatever provisioning tool to do the instance setup.
It's all about the right tool for the job, and managing your linux install with terraform is not that.
Yes terraform and ansible have different sweet spots. Terraform is best when managing stateless resources. It can handle stateful resources, but as always managing state gets tricky fast.
If what you are CRUD-ing doesn't map well to the Terraform concept of a resource, then don't use it. So I don't like to use it to create anything inside a VM.
Using Terraform to create an auto-scaling group is good, though, as it can be represented as a stateless resource; or rather, any state in it is transient.