"Remember redis is meant to hold ephemeral data in memory, so please don’t treat it as a DB. It’s best to design your system assuming that redis may or may not lose whatever is inside"
Of course I disagree with that statement. Many serious Redis use cases at big companies assume Redis is there and will hold your data. What they usually assume is that in certain cases, during failovers or other serious failures, you may lose some acknowledged writes that were sent immediately before. And most of the time even this does not happen. Note that many high performance applications using SQL DBs with a relaxed fsync configuration operate under the same assumption. The suggestion should be more: if Redis for you is just a volatile aid, use a given setup; if it's where you store data, use another setup.
Anyway, if you want to see a version of Redis where you can also store bank accounts, transactions, or any of the other most critical stuff you would put in a DB, make sure to check the news at Redis Conf 2020 (the conf is streamed and free).
EDIT: But even after that announcement, the biggest value of Redis is that it can be fast and provide a "best effort" consistency level that is adequate for a lot of use cases in practice. Many systems have been sacrificed at the temple of wanting to provide linearizability for all use cases.
This rubs me very very wrong. There is no maybe in data integrity. Redis wasn’t designed to be ACID and shouldn’t be treated as such just because “usually” it doesn’t lose data “under the right circumstances.”
If a sysadmin uses an RDBMS with relaxed fsync, they know what they are getting (and durable writes are not one of them, no "usually" about it). Same for Redis.
What I meant is that best-effort consistency can be super naive, or it can try hard, even without guaranteeing it, to avoid losing data in certain failure modes.
Building computer systems / architectures is always about compromises. If your requirements are that "everything must be perfect always", you will never deliver anything.
I came up with a "mental framework" that allowed me to get beyond these details early when I started architecting systems for companies.
1) Identify the possible malfunctions.
2) What mechanisms will I use to detect these malfunctions as early as possible?
3) Do I have a clear and concise recovery path for said malfunctions?
This frame of thought has taken me far (in terms of being able to accomplish a lot in my career so far) and allowed me to move forward without (a) extensive, time-consuming research into the specific tools I'm using for specific situations which may never happen, or (b) the extensive cost of hiring experts in these niche areas.
ACID guarantees feel like they can be distilled to "everything must be perfect always" - and you can ensure ACID and still ship, just by using an ACID database. It's a hard problem, but it's also a solved problem.
Sometimes you don't need ACID guarantees, which is part of the reason databases like Redis exist, and is why the compromises you're describing are important. But in this context it sounds like the framework you're describing would lead to rolling your own ACID for Redis instead of just using an ACID database to begin with.
I guess it's almost like speaking two different languages when talking about what a DB guarantees and what your application REQUIRES in the form of guarantees. Typically, in my experience, they're not the same thing. Or not ALWAYS the same thing. Within the problem space you're trying to come up with a solution for, I'd say it's highly unlikely "ACID" guarantees are sufficient to say categorically (with 100% accuracy) that "my application is perfect because: ACID".
We gladly paid 800 € for three hours of work fixing a botched stored procedure on IBM DB2. Gray-haired DBAs with the right blend of dev, ops, and most importantly business domain knowledge (something many people in IT seem to overlook) are worth gold.
I'm one of those people with moderate-throughput production applications using Redis as a primary store (500-2000 requests per second), in a multi-region active-active setup without Redis Enterprise, no less. It's a very real use case at a LOT of companies.
Favorite datastore to date. The only thing I wish for frequently is native global change-data-capture without doing something manual with streams (I've thought about writing a replication protocol consumer, but the way resyncs are handled is a little problematic).
Piggybacked off DynamoDB in AWS for now, unfortunately; so it's more "the entire system is active-active, not just the Redis bit". Reads always go to Redis in every region (no read-through), writes go to DynamoDB (in global table config). Then there's a worker in each region consuming the DynamoDB stream and keeping its region's Redis in sync.
The worker is a DIY stream consumer implementation. The stream -> Lambda integration works, but it's super limited (and ~300ms slower to get new events on average).
yep. in this specific use case there can be pretty fast reads after a write, and dynamo stream latency is not great even intra-region.
Via the lambda integration you should probably expect ~1000ms p99 latency intra-region, and ~2100ms p99 from, e.g, east-1 to west-2. Shave maybe 200-300ms off of those numbers if you do a DIY stream consumer without the aws provided lambda connector.
No, but often you set it to fsync every second and not at every write, if you are really concerned with performance in certain kinds of applications. Or, in practical terms, check the synchronous_commit config option of PostgreSQL. Basically, sometimes you have fsync enabled because you don't want to corrupt the log, but the writes will not wait for the WAL, so the last transactions can get lost on crash.
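To make that concrete, here is a minimal sketch of what such a relaxed-durability setup looks like on both sides; the specific values are illustrative, not recommendations:

```
# redis.conf (sketch): fsync the append-only file once per second,
# so a crash loses at most roughly the last second of acknowledged writes.
appendonly yes
appendfsync everysec

# postgresql.conf (sketch): keep fsync on so the WAL itself cannot be
# corrupted, but do not make commits wait for the WAL flush.
fsync = on
synchronous_commit = off
```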
I would not weaken my confidence in the underlying storage subsystem or the DB by playing with the fsync behavior, neither willy-nilly nor in a calculated way. I would instead just throw more money at the problem by investing in faster and faster disks -- you can do a lot with enterprise-grade NVMe now.
> "Remember redis (in AWS) holds ephemeral data in memory"
(because EC2 machines don't persist their disk data between start/stops and a lot of people don't realize that until it is too late - hey remember that Bitcoin exchange that got bitten by this?)
I suspect that’s what you meant but just for clarity, Instance storage does persist for reboots (which is effectively an OS reboot), but not if you stop/start (or terminate) the instance.
This was mostly applicable to older generations of EC2 instances (C3, M3, etc.)
EBS is now the default when creating an instance, and instance types of newer generations generally don't have instance store at all (except disk optimized instance types, and those with "d" in the name.)
Yeah, I think Amazon gave up on trying to communicate the meaning of ephemeral storage, and I think that's for the best tbh. It's too easy to have important data vanish that way. Something as simple as someone issuing a shutdown from the console is translated into an "instance stop" in AWS, which would purge the storage -- unfortunately often a lesson learned the hard way.
> Anyway, if you want to see a version of Redis where you can also store bank accounts, transactions, or any of the other most critical stuff you would put in a DB, make sure to check the news at Redis Conf 2020 (the conf is streamed and free).
He sort of implies it here; not for you, but it might create that impression for others, so I was clarifying for them.
Regardless of whether Redis is meant to hold ephemeral data in memory, it is well-known to anybody who has had to maintain a Redis instance as part of their deployment that Redis is only effective when used for ephemeral in-memory work.
I understand your defensiveness, but please understand that because of your massive experience with Redis, either you have never lost data with Redis, which means that nobody besides you understands how to run your software in production, or you have lost data with Redis, which means that it's somewhat hypocritical of you to insist that Redis is durable.
Your comment seems to suggest that everyone that's ever used Redis except antirez has lost data. I don't think that's true. I've used Redis since 2012 across dozens of production services and never lost data, and I don't have any particular skill that makes me special.
Note that Redis only provides best-effort consistency: it means that sometimes it can lose acknowledged data in special conditions (during failovers, or during restarts with a relaxed fsync policy). So it will never pass Jepsen tests in the default setup, but may pass them only when Redis is used with a linearizable algorithm on top of it that uses Redis as a state machine for a consensus protocol. But this does not mean that people haven't applied Redis to data storage with success. For instance, many *SQL failover solutions also can't pass Jepsen tests, yet people use them to store real-world data. There are a lot of applications where being able to really scale well and at a low cost (sometimes spinning up 1/10 or 1/100 of the nodes) makes it perfectly viable to pick a system that is designed for that and, as a price, will lose a window of acknowledged writes during failures, while trying hard to avoid it in common failure scenarios.
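If you want to shrink (not close) that window of acknowledged-but-lost writes, one hedge is to ask Redis to confirm replication before treating a write as done. A rough sketch with redis-py (the host, key, and replica count are made up for illustration); note that WAIT still does not make Redis linearizable or durable:

```python
import redis

r = redis.Redis(host="redis-primary.example.internal")  # hypothetical host

r.set("order:1234", "paid")
# Block until at least 1 replica has acknowledged the write, or 100 ms pass.
# The return value is the number of replicas that acknowledged; less than
# requested means the write is still at risk on failover.
acked = r.execute_command("WAIT", 1, 100)
if acked < 1:
    print("write not yet replicated; treat as at-risk")
```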
Just going from the last one I read for DGraph, it did extremely well. Pretty sure etcd did well.
They always have bugs somewhere, but there are huge differences between bugs that show up for very specific, niche cases, and normal "I wrote to the db and it dropped it".
"We found five safety issues in version 1.1.1—some known to Dgraph already—including reads observing transient null values, logical state corruption, and the loss of large windows of acknowledged inserts."
Loss of large windows of acknowledged inserts. Durability is hard.
As staticassertion is mentioning, some of the violations that were found were only around tablet moves, which happen only in certain cluster sizes and quite infrequently. Of course, Jepsen triggers those moves left-right-and-center to evoke some of those failure conditions; but that's not how tablet moves are supposed to work in real world conditions. This is different from other edge cases like process crashes, or machine failures, network partitions, clock skews, etc., which can and do happen. In those cases, Jepsen didn't find any violations.
We were planning to look into those tablet move issues and get them fixed up (shouldn't be that hard), but honestly, the chances of our users encountering them is so low that we de-prioritized that work over some of the other launches that we are doing.
But, we'll fix those up in the next few months, once we have more bandwidth.
I don't really feel like playing the quotes game... but, sure.
"All of the issues we found had to do with tablet migrations"
"ndeed, the work Dgraph has undertaken in the last 18 months has dramatically improved safety. In 1.0.2, Jepsen tests routinely observed safety issues even in healthy clusters. In 1.1.1, tests with healthy clusters, clock skew, process kills, and network partitions all passed. Only tablet moves appeared susceptible to safety problems."
No one is here to claim that anyone is getting through any kind of rigorous testing without bugs found. But there is a huge difference between "My extremely common write path + a partition = dropped transactional writes" and "Under very specific circumstances, with worst case testing, multiple partitions, and the db in a specific state, we drop writes".
There is an ocean between, say, mongodb's test results, and Dgraph's.
"If you use Redis as a queue, it can drop enqueued items. However, it can also re-enqueue items which were removed. "
"f you use Redis as a database, be prepared for clients to disagree about the state of the system. Batch operations will still be atomic (I think), but you’ll have no inter-write linearizability, which almost all applications implicitly rely on."
"Because Redis does not have a consensus protocol for writes, it can’t be CP. Because it relies on quorums to promote secondaries, it can’t be AP. What it can be is fast, and that’s an excellent property for a weakly consistent best-effort service, like a cache."
Again, Redis is a very different type of database, so expectations should be aligned. Further, this test is quite old.
But that's a huge difference from DGraph's results.
Basically, saying "Well no one does well on Jepsen" isn't really true. Lots of databases do well, but you have to adjust your definition of "do well".
This is definitely a post based in experience, but maybe not based on broad expertise in using these tools.
I don't want to pick apart the entire post, but I will say that the Lambda + API Gateway example is maybe the best example of jumping into the cloud-native world with blinders on. Just about every modern programming language has a toolset right now that will do the work of generating a CF template for you that creates an API gateway that forwards all HTTP routes to a single Lambda, and then that Lambda is responsible for handling the actual routing of that request. Examples include Zappa, ClaudiaJS, and Ruby on Jets just to name a few. I can't imagine providing a feature-rich web application in Lambda without this kind of abstraction.
Not knowing that such tooling exists, or explicitly choosing not to use such tooling, and experiencing pain as a result seems to be a common theme in this article. If you dive into architecting a system using AWS's product offerings without understanding the tradeoffs you're making, you will experience greater cost and greater friction -- hands down.
> I can't imagine providing a feature-rich web application in Lambda without this kind of abstraction.
This reminded me of a caveat about using some of these abstractions – which is that they are still subject to the limits and restrictions of the underlying platform.
We discovered this the hard way once when an automatically generated function name or something was over the limit in prod (this issue [1] describes a similar problem). We did not catch this in dev because "dev" is one character under "prod" and our autogenerated name in dev hadn't put us over the limit. That was an interesting exercise in leaky abstractions.
Yeah, missing the proxy pass option will definitely burn a prospective Lambda user.
I also felt a similar tinge when the post talks about KCL. Yes, you use it, and yes there are rules about output streams - but they all make sense when deployed in fairly complex environments. I can’t remember when I last used stdout for logging outside of greenfield development. Once it’s on prod everything goes to stderr and is highly structured.
Similar sentiments about Cognito. At some point of complexity you must have a server in between using the Admin* range of commands. If you want their “get going quickly” then yes, there are trade offs like WebViews. That said; I’ve never used Auth0 - so insert a Luddite warning here.
Another pretty basic Lambda thing the author didn’t mention is the alias feature. So rather than having users-api-dev, users-api-qa - we just have users-api and it has an alias for each environment. The environment variable management is not amazing but just using a key-value JSON {dev: {a: “bc”}} works fine.
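For what it's worth, a minimal sketch of that alias-plus-JSON-config pattern (the function name, alias names, and env var below are hypothetical): when the function is invoked through an alias, the alias appears as the last segment of the invoked function ARN, and you can key the config blob off it.

```python
import json
import os

# Hypothetical env var holding per-environment config, e.g.
# CONFIG_JSON='{"dev": {"table": "users-dev"}, "qa": {"table": "users-qa"}}'
ALL_CONFIG = json.loads(os.environ.get("CONFIG_JSON", "{}"))

def handler(event, context):
    # arn:aws:lambda:us-east-1:123456789012:function:users-api:qa -> "qa"
    arn_parts = context.invoked_function_arn.split(":")
    alias = arn_parts[7] if len(arn_parts) > 7 else "dev"  # fallback is a guess
    config = ALL_CONFIG.get(alias, {})
    return {"alias": alias, "table": config.get("table")}
```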
There’s also truly no reason one lambda function can’t have a collection of APIs (using the proxy method from API gateway). It’s literally no different from any other micro service setup. You can pick any abstraction for a function to cover.
In Lambda proxy integration, when a client submits an API request, API Gateway passes to the integrated Lambda function the raw request as-is.
...and:
You can set up a Lambda proxy integration for any API method. But a Lambda proxy integration is more potent when it is configured for an API method involving a generic proxy resource. The generic proxy resource can be denoted by a special templated path variable of {proxy+}, the catch-all ANY method placeholder, or both.
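In other words, with the {proxy+} catch-all route the whole API can be one function that does its own routing. A minimal sketch of what that handler sees and returns with the REST API proxy integration (the routes themselves are made up):

```python
import json

def handler(event, context):
    # The Lambda proxy integration passes the raw request through:
    # httpMethod, path, headers, queryStringParameters, body, etc.
    method = event["httpMethod"]
    path = event["path"]

    if method == "GET" and path == "/api/v1/users":      # hypothetical route
        payload, status = {"users": []}, 200
    elif method == "POST" and path == "/api/v1/users":   # hypothetical route
        payload, status = json.loads(event["body"] or "{}"), 201
    else:
        payload, status = {"error": "not found"}, 404

    # The proxy integration expects this response shape back.
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }
```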
> Not knowing that such tooling exists, or explicitly choosing not to use such tooling, and experiencing pain as a result seems to be a common theme in this article. If you dive into architecting a system using AWS's product offerings without understanding the tradeoffs you're making, you will experience greater cost and greater friction -- hands down.
Which I think is one of the underlying points of the article.
> Many a startup has fallen prey to ElastiCache. It usually happens when the team is under-staffed and rushing for a deadline and they type in Redis into the AWS console:
I think this is a good article in the sense of "these are things that may or will bite you if you've not put a bit of thought into them".
Even the replies on this story upstream hint at that, with people debating Redis' durability, based on this:
> It’s best to design your system assuming that redis may or may not lose whatever is inside.
completely ignoring the follow up:
> Clustering or HA via Sentinel, can all come later when you know you need it!
It's like the first sentence was read and people leaped to the keyboard to reply passionately.
The API Gateway v2 HTTP stuff makes this even easier[0]. I switched some old services over from the v1 stuff (using '{proxy}' routes) to the v2 API and I was a lot happier. I especially like that v2 has an option for auto-deploy. I'm sure the manual deployment stuff is really useful to some people, but for my little projects it was a real pain to have to remember to trigger a new deployment any time I made a change.
I have experience with a few of these. For those, the comments seem kind of "duh" quality.
So you say Kinesis is bad for message streams where each message should be handled by only one machine? Well duh, that's how it works. Kinesis is meant for every listener to be assured to get the entire stream. It was never designed for that, and I expect you'll have a bad time if you try to use it that way. SQS is obviously what you want if you're interested in guarantees that each message is handled once and only once.
One Lambda per route in a decently-sized restful webapp is ugly and unmanageable? Well duh. Don't do that. Do pretty much anything else instead. Seriously, anything else at all.
This sounds more like, if you want to deploy services on AWS, do either pay somebody who knows what they're doing, or spend some time checking out the services to make sure they are designed for what you want. Don't just pick a random AWS service and usage pattern from a Google search and start building around it without ever checking if it matches what you're trying to do.
> SQS is obviously what you want if you're interested in guarantees that each message is handled once and only once.
You mean SQS FIFO? SQS classic cannot guarantee exactly-once delivery or ingestion, due to possible failure at either the client or the server. SQS FIFO, however, can, but it requires both the client and the server to work in tandem to ensure exactly-once processing.
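For context, a bare-bones standard-queue consumer with boto3 looks like the sketch below (the queue URL is a placeholder). The delete only happens after processing, so a crash in between means the message comes back: at-least-once delivery, which is why the handler itself has to be idempotent unless you move to FIFO with deduplication.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

def process(body: str) -> None:
    # Must be idempotent: with a standard queue this can run more than once
    # for the same logical message.
    print("handling", body)

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        # Delete only after successful processing; if we die before this,
        # the message reappears after the visibility timeout.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```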
I'm not a fan of it, personally. The interface is lacking, configurability after launch is minimal at best, and the observability is EXTREMELY poor because you can't get per-topic metrics without paying extra.
But with ASP.NET Core it's effectively a Web API as a Lambda. It maps the event to a request and passes it through the asp.net framework as if it's a normal website.
The config for this is like 3 lines of code so if you don't want to host in a Lambda anymore, it's trivial to move to IIS, Windows Service, Linux.
I'd pick Terraform before CloudFormation, but there are things that are better with CF, such as rollbacks (they really do work as expected in almost all cases), and AutoScalingGroup's UpdatePolicy (rolling/replacing deploy of new launch configuration with healthchecks), which is not available outside of CF.
Or if you want to use AWS features on a deterministic timescale: CloudFormation frequently lags on support for new features - I’ve worked on multiple projects which used Terraform for months before the equivalent feature could be used in CloudFormation. For example, when ECS launched secrets support it took most of a year before CF was updated.
This is a big deal with classic CloudFormation because there was no escape hatch. Terraform can run arbitrary code if needed but you had to split CF stacks apart so you could have something else run in between. That’s probably better with CDK, of course.
At my old place we had some automation around this. Essentially our custom resources were mostly just the Python scripts plus any deviations from the basic custom resource definition. Made life a lot easier.
Good point: they added that feature after I stopped using it but it’s a good option for some cases, especially if you want to do things which the caller shouldn’t be allowed to directly perform.
Yes, it's a complex, poorly documented pile of shite. BUT. It does work as a reasonably secure OAuth2 thingamebob. However, I was told by my AWS account manager that Auth0 was the way forward, and I agree.
Cloudformation:
Meh, I have about 35k lines of active CF at the moment. It's much of a muchness. Unless you are using parameters with selectors, you are going to have a bad time. Hard-linking templates together (I assume that's what nested stacks are) is terrible. I've only briefly used Terraform, so I have no idea if it's much better.
CF _could be_ a lot better. Like compile-time validation, not just-in-time. That would stop a lot of anger when you realise you've spelt a CF parameter wrong (or the value fails validation) but only after you've waited ten minutes for it to spin up. That's frankly unforgivable.
Elasticache:
Yes. It's expensive.
KINESIS:
What a disappointment. Stupid naming conventions, terrible throttling and throughput. It's just horrific. What's worse is that they looked at SQS and thought: "this compares favourably". NATS.io is a great fit for certain use cases (no, Kafka is never the answer).
Lambda:
I don't actually get this myself. I made a REST API exclusively in Lambda. It meant that I could build a working prototype really quickly. Once proven, we ported it to FastAPI in an autoscaling group.
The API Gateway was heavily integrated into the Lambda spinup (controlled in CF), so I really don't see what the issue is. Also, it understands Swagger, so I struggle to understand the criticism.
The fact that some specific attributes or options cannot change after creation is hard too. Other than that, it's not too bad. But like you said, setting it up takes effort, but a lot of programming is getting to a non trivial hello world.
> CF _could be_ a lot better. Like compile time validation, not just in time.
I've recently finished a project using AWS CDK, which seems to do a certain amount of this. Just using TypeScript and having AWS resource interfaces be fully typed goes a long way toward finding a template mistake quickly.
I haven't seen a scenario where TF plan AND apply miss something, but I have definitely been in the scenario where a CF stack fails, and then the rollback fails, and then you're stuck with an undeletable resource and can only submit a ticket to AWS.
Ditto on both counts: we stopped using CF after hitting one of those irrecoverable bugs — usually deleting the resources manually and ignoring all the errors deleting the stack would recover after a cycle or two but we hit at least one case where that wasn’t true.
Curious to hear more details about your thoughts on this. I've done some pretty significant improvements around my team's use of it in the last few months and can't say I've had this experience. The difficulties with it really, to me, seem to be a case of batteries-not-included, speaking as someone who had never run it prior to last August.
The simpler the better. In my limited experience once we started fleshing out users to admins, managers, and users, in a multi-tenanted environment, we pretty quickly ran up against Cognito limitations which surprised me.
(Cognito groups seemed made for this, except they have a limit of 10k groups. We ended up storing a comma-separated list of ids in a custom cognito tag, which seemed awkward.)
> in a normal company salespeople might lose their jobs for saying that.
Quite a few times over my career I have had a salesperson (not from Amazon) recommend a competitor over their own product. In every single case my respect for the salesperson shot way up. In at least two cases I can recall this helped them close a sale.
A smart salesperson does not do everything possible to push their company's products... a smart salesperson solves their customers' problems.
> A smart salesperson does not do everything possible to push their company's products
Bingo. A bad product fit means a bad customer experience, which means a bad review or reputation.
The smaller the company, the more important referrals are from your customers. Sending a potential customer to a competitor will (potentially) earn goodwill and future referrals. At worst, they might not refer anyone your way, but at least they won't be badmouthing you either.
Unfortunately, large companies typically mean large customers, and the people with the buying power aren't the people who will be using the product... so neither party really cares all that much about how well the product fits. This is the old "nobody gets fired for choosing IBM" mentality.
The worst is when medium companies think they are big companies, and try to do that to small customers. I once saw a salesperson push hard for something that was very obviously too small to be worth our time, and the project management overhead would have led to blowing our potential customer's budget out of the water. In the end, they walked away without working with us, and with a pretty sour taste in their mouths from the pushiness of the sales guy.
We were making an API that takes images, does stuff on the GPU, and pushes back an answer.
It needed to be secure, fast, and easy to look after. If they had forced Cognito down my throat, and it stopped me from shipping on time, they would have missed out on $$$ of GPU time. I trusted that architect more, because they were honest, and actually helped. It makes me want to stay inside the expensive walled garden that is AWS even more.
Also, consider that the key to being successful in enterprise sales is all about relations. When that account rep leaves Amazon, they want to be able to use the relationship they have with you with whatever product they end up selling later.
I've also had AWS support go way outside the realm of what they officially support, to help us get the job done. Hell, I've had AWS support people help me debug problems in Terraform when it was pretty apparent that the issue was on the AWS side. "Pretend I'm doing this by hand."
I thought for sure I was going to find Elastic Beanstalk. It's great when it works, but when it doesn't, the list of ways it makes life hell is a long one.
Elasticache though? Pricey, but I recently got an alert at 1am that one of my core services was down and it was because Redis was out of memory. With Elasticache I doubled the size of the instance in place from my phone so I could go back to sleep. Fixed the key leak in the am and returned to the original instance size.
I joined a company that had their application on about a dozen environments in Elastic Beanstalk that would fail for no reason during deployments. When everything went fine it took about an hour to deploy; when stuff went wrong, say goodbye to half your day (at a minimum). The general solution to most deployment issues was to just terminate every instance but 1, deploy to 1 instance, and let the scaling policies kick in to replace the terminated instances. EB is absolute trash.
This is the problem. The ways in which it can fail can vary (sometimes you can't even pull logs from the web interface), and if it fails during a production deploy you may be left at half or no capacity.
When something inexplicably effs up, rebuild the environment and, more often than not, the problem disappears.
I'm not sure I'd consider this a "failure," but related to GP, I have had a number of issues maintaining Elastic Beanstalk environments, including:
- The single container Docker platform (not sure if this is an issue with other platforms) can cause the CloudWatch agent on the environments' EC2 instances to stop streaming logs to CloudWatch. This seems to occur when a Docker container fails, for example if the process it's managing stops (e.g., if a Node.js application triggers an exception that is not caught and exits). A new Docker container will be started, but the new container's log file sometimes does not automatically get attached to/monitored by the CloudWatch agent.
- The default CloudWatch alarms created by the environment can create a "boy who cried wolf" situation. For example, when updating the application version for an environment, EB will transition the environment's state from "OK" to "Info" or even "Warning," depending on the deployment policy. This is a regular operation, but CloudWatch will still send an email to the designated notification email address about the state change. If you monitor those emails for environment issues, this normal operation could cause overload, which might lead to ignoring the emails outright. This could be problematic if the environment state transitions to an actual problem state. You can create email client rules for this, but the structure of the alarm email doesn't make this very easy, at least in Outlook 365.
An annoying example of this is when your EB environment auto-scales up due to, for example, an increase in traffic. When the auto-scaling policy scales down your instances (due to normal operation of the policy), you'll get an email that your environment has transitioned into a "Warning" state because one or more of your environment's EC2 instances are being terminated. This looks scary in the CloudWatch email that is delivered, but you have to learn that it's just the ASG doing its thing, terminating unused instances as it's been configured to do. The emails, however, do not provide good context into what has led to the "Warning" state.
- The way environments handle configuration files stored in your application's .ebextensions/ directory can cause inconsistent application state between version deployments on existing/new EC2 instances. For example, if your auto-scaling policy creates a new EC2 instance, but your recently deployed application version doesn't specify some of the commands/settings applied during a previous update to your .ebextensions/ files that might have been deployed to existing EC2 instances, you run the risk of having inconsistent state across your application's EC2 instances. This can be solved by using the "immutable" deployment type, but that's not the default deployment type. It's an edge case, but it's still something that requires you to SSH into your EC2 instances, and possibly manually terminate older instances when you eventually figure out what's going on.
Having said all of that, I think EB is still a reasonable choice for small/beginner workloads: it gives you a number of things (automated deployment, auto-scaling, load balancing, logs, etc.) that you could get by doing things on your own, but it lets you get to production quickly. For mature applications, I think you could be better off managing these individual services yourself (EB is mostly just wiring together a number of AWS services with a few deployment and monitoring agents running on each EC2 instance). If you're comfortable with the components EB is managing for you and if you have a stable CI/CD pipeline, you get more flexibility than by bending EB against its will.
TLDR: I quit Elastic Beanstalk because they deploy on weekends and won't clean up their mistakes. And Beanstalk is far too buggy for a 9 year old service.
I struggled to get Elastic Beanstalk working well. The documentation is incomplete.
The Elastic Beanstalk console often shows servers as up when they're really down.
Once, my Elastic Beanstalk deployment stopped writing log files. After wasting many hours debugging, I went to AWS Loft and consulted with the support engineer. He had me log into the backing EC2 instance and debug. I was using Elastic Beanstalk so I would never have to log in to EC2 instances. He concluded that the application logs were not appearing due to a bug in the service. He promised to file a bug report.
The last straw was when Elastic Beanstalk team deployed a broken API on a Saturday. THEY DEPLOYED ON A SATURDAY!!! Their broken API added an invalid entry into the beanstalk config database for my account. All subsequent calls to the Beanstalk API failed with a 500 Server Error. I paid for an AWS Support subscription and filed a ticket. Their support engineer told me to install the AWS CLI and run some obscure commands to remove the invalid entry which their botched API deployment added. I asked them to do it and they refused. So I migrated off of Elastic Beanstalk to Docker. I have since migrated off AWS to Digital Ocean.
The section on Lambda missed the biggest problem: latency and cost.
For low-regularity calls, Lambda suffers badly from the cold start problem. The first call to Lambda must actually create the lambda instance, which can take hundreds of milliseconds. This problem also shows up when the level of requests exceeds the number of lambdas currently active, causing a cold start as the new lambda instance is created. It’s not uncommon to see some services inject fake requests with some regularity to ensure that there are enough warm lambda instances to avoid the cold start problem, which is silly.
Secondly, all lambda calls are charged by the millisecond, but all calls are rounded up to 100ms. So if your typical lambda call is 5ms, you might be paying for 20x more time than you’re actually using.
Both of these issues led my former team to use a regular ECS app rather than lambdas.
Your mileage may vary. I process tens of millions (if not hundreds of millions) of requests on Lambda each month. If you have meaningful volume, cold starts aren't a problem. And with my volume of traffic, Lambda use is a tiny fraction of my AWS bill ($30 max?). I'd be interested to know who is dissatisfied with the cost and paying more than $100/mo, and what you're running (and how that would be expensive for you, considering the kind of operation you must be running).
But also, if you don't have highly variable traffic, why would you use Lambda in the first place? If you have negligible traffic (enough to sit on the free tier), why not just use a single cheap EC2 instance? Lambda trades start time for lack of a server—it's shared resource utilization taken to the extreme. You're lowering your cost by letting AWS use the bare minimum to keep your service available, and that means turning your code off when it's not running. If you want to keep your code available at a hundred millisecond's notice, just have a server running.
I assume most of the folks running into this just couldn't be bothered to pay the $7/mo for a hobby Heroku dyno or run a dirt cheap EC2 instance. Really interested to hear from folks that find Lambda impractical for serious use cases.
Cold starts are only a problem if you’re latency sensitive; I forgot to specify that.
I’m speaking from the context of a large company that’s well beyond the free tier, for whom AWS bills matter. If you’re down in the free tier, setup time dominates all other concerns.
This I think misses the point of Lambda a little. If you want to keep your Lambda instances warm all the time, you should use ECS because that's what ECS is. Lambda is for intermittent workloads where latency is less important and the cost of spending 100ms invoking an instance is less than the cost of keeping an idle container up and running all the time.
I agree that intermittent, latency insensitive operations is exactly what Lambda is good for. However this is fairly disjoint with what the hype around serverless has been about, which is why I mention it.
> With the Lambda server-less paradigm, you end up with 1 lambda function per route
I’m not sure why this is the case. You could host all modules in the same lambda endpoint /api/v1/*
Using the same technology you would use with any other backend
I was going to say the same. For smaller apps, just build a monolith as a function and proxy all routes to it from API Gateway. If microservices based app, each resource or route mapped to a function like what he mentioned is common though.
You will end up needing to provision Lambda size for the worst case of any one function. You need to split them if you want to use resources efficiently according to what the functions do.
This has been amazing for the API Gateway Swagger files. There is quite a bit of duplication, and splitting it up lets us write the common stuff only once: error code handling, common request templates, etc.
I work in a security office managing AWS infrastructure resources across a large org and three of these services work great for us:
CloudFormation: this is an excellent resource when you need to tell teams across our org how you want them to set up resources. Rather than have numerous lengthy meetings where we tell them what to do, we just give them the CF template, super simple, and we guarantee everyone has the same setup.
Kinesis: works great for ingesting the data we had them set up resources for with the above mentioned CF script. I can't speak to the Java dependency the author mentioned. Not an issue for us. YMMV.
Lambda: Also works great with the CF template setup mentioned above. Super cheap to use. Maybe the difference between our implementation and the authors is the frequency and trigger used. Our lambda functions are all time based and run once a day, or maybe a few times a day. Super reliable, super easy, super cheap.
I think the overall thesis here is that the usefulness of AWS services depends on what you are using them for.
> Furthermore, in order to have multiple workers, Kinesis Streams require you to use multiple shards. Each worker will make claim to a shard.
It helps if you think of Kinesis as AWS Kafkaesque rather than a message queue, because then shards = partitions and how you work with a Kinesis stream makes a lot of sense.
Multiple concurrent consumers? You're going to need shards/partitions.
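To illustrate the shard/partition point, a stripped-down single-shard reader with boto3 might look like the following sketch (the stream name is made up); in a real deployment each worker would claim one shard this way, which is exactly the coordination a KCL lease table handles for you:

```python
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "example-events"  # hypothetical stream name

# A real worker would claim a shard via a lease/lock; here we just take the first one.
shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]

iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest retained record
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp.get("NextShardIterator")
    time.sleep(1)  # stay under the per-shard read limits
```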
Now, question is, as a replacement for homerolled Kafka in EC2, or AWS's Managed Kafka, is it cheaper?
So far, magic 8-ball says "benchmarking for your given use case needed". So far my experiments for our workload and patterns say: maybe, but more investigation is needed. I plan to roll out a side-by-side Kafka and Kinesis experiment on a given topic and ascertain the costs.
Although ultimately, like any distributed messaging system, you end up engineering to the foibles of the system - in Kinesis' case, it's the fact that it rounds each sent message (which could be a batch of records, or a single record) up to the nearest 5KB for billing that you have to engineer for.
Only reason I'm looking into it is cost engineering tbh.
As to the Kafka vs. Pulsar, I've been running a Kafka cluster for several years now, and was interested in Pulsar as a Kafka++, and have been evaluating it for a client who wants to choose between one of the two, and at the moment, I think Pulsar needs about another year before I'd recommend it.
It's exciting tech, but as is normal of recently open sourced projects, there's a lot of bugs being surfaced, and the documentation has a lot of unanswered questions, like, if I'm consuming a multi-partitioned topic in an exclusive subscription, what does that mean for ordering?
I think it has some great ideas (especially decoupling brokers from storage), but yeah, it feels a bit like Kafka pre 0.8, interesting, but you're taking on a lot of work adopting it at the moment.
I think saying it's analogous to Kafka 0.8 is reasonably apt. It's very stable and performant in the verticals it has been used in at Yahoo but not as widely integrated, understood and optimised as Kafka across a broader set of use cases.
That said, if what you are trying to do fits squarely in the box of things that do work well, it's considerably better than Kafka in a few key ways.
The first is the obvious separation of storage and compute/broker responsibilities, the benefits of which also led to the same design being used in LogDevice - a similar system designed and built completely independently and at roughly the same time at Facebook.
The second is selective acknowledgement, i.e the ability to acknowledge messages as processed by a consumer out of order rather than merely the offset of the latest message processed. This allows Pulsar to be more easily used as a workqueue without the multitude of hacks and layered infrastructure required to get the same out of Kafka.
Shared subscriptions/partitioning model. Compared to Kafka it's more flexible, less punishing, and less beholden to the architecture of consumers.
Finally, I would say tiering; for some it means nothing, but depending on your use case it can be defining. Pulsar can offload historical segments to long-term stable storage but still present a unified offset-like API to consume historical data.
So I think for some Pulsar provides enough benefit to make up for any shortcomings in integration as long as you have sufficient engineers able to debug/patch issues.
It's actually awfully simple. Take everything Kinesis does and Kafka does it better. Yes, everything.
Then on top of that, add Consumer Groups, which basically deal with the issues in the OP w.r.t. consuming a topic from multiple processes, along with providing administrative APIs to reset application offsets, inspect lag, etc.
Also a bunch of extra features like transactions etc but if you are comparing on the basis of Kinesis like features they aren't likely to matter as much as the core functionality - which is where Kinesis really gets destroyed anyway.
Normally I wouldn't take such an absolutist position but when it comes to Kinesis let me repeat in no uncertain terms.
You're missing something major. I have experience using Kafka and Kinesis. My current company only uses Kinesis and zero Kafka. Why? Because Kafka takes significant knowledge, experience, and time to set up, understand, and support. And if it fucks up and you don't have one or more "Kafka experts" on call, well, you're screwed.
Kinesis on the other hand, just works. Yes, it just works. I don't have to have a Kinesis expert on hand, I don't have to configure clusters myself or write Ansible/Puppet, etc. I have a few basic lines of Terraform to create my Kinesis streams and I push data to them and we got it working in minutes and we've had no issues.
Contrast this with my previous job, where we literally had to hire multiple Kafka experts at high salaries to maintain our Kafka clusters.
This is why you only use Kafka if you ABSOLUTELY NEED it.
I am about 90% of the way to ridding a larger company of Kinesis. You might not have needed a Kafka expert, but instead you had to create (likely poorly implemented) frameworks for consuming from Kinesis (when you could have just used Consumer Groups).
Also, Kinesis is insanely expensive, to the point that if your throughput is anything but trivial (in which case it doesn't matter which option you use, both will -just work-), it just becomes untenable.
I actually did miss something but it's not complexity, you can easily use hosted Kafka (AWS even provides it with MSK now). It's actually authentication. Kafka does have authentication but it doesn't easily tie into AWS or GCP authentication mechanisms without the help of a tool like Vault.
That said.. I don't think that minor benefit is enough to ever justify Kinesis over Kafka.
At tiny throughput maybe? Pretty sure you could run Kafka on small ec2 instances and still get great throughput, especially if you don't need much retention. (reducing retention reduces need for disk I/O as consumers can't fall far behind causing disks to seek).
I think for any reasonable workload, i.e 10k/s+ and/or throughput over 100mb/s Kinesis gets dumpstered for price every time.
As I'm evaluating the other way, I'd be curious if you see any major advantages of Kinesis over Kafka.
I'm not wedded to one or the other, just interested, as I said in another comment, I'm looking at Kinesis from a cost engineering basis alone, but if there's something I've overlooked, would be keen to know :)
I'm confused by what the author says about CloudFormation:
> Lack of Drift detection or reconciliation. With lack of drift detection comes great uncertainty.
Cloudformation has definitely had Drift Detection since 2018: https://aws.amazon.com/blogs/aws/new-cloudformation-drift-de... It's not everything you might want it to be, but it's not like Terraform will reconcile your drift automatically either, that I know of.
Terraform does indeed reconcile drift 'automatically' across all resources, by which I mean the plan will include changing everything that's drifted back to the specified configuration. That may not always be desirable, which is why building a good plan/apply process with approval is important. (Same goes for CloudFormation, though.)
Can't believe CloudWatch wasn't mentioned here. It could be such a useful product, but it is so messed up that there are probably multiple billion-dollar companies in this space.
I’ve heard rumblings from inside AWS that there is internal pressure not to make CloudWatch too powerful, lest it harm their partners (like Datadog, etc).
I’ve heard this a couple times after complaining to AWS engineers about CloudWatch shortcomings.
That said, CloudWatch has gotten tons better than a few years ago.
I don't think that's true, internally there was a long term plan to bring cloudwatch up to par with internal metrics and alarming systems. At the time when I left Amazon, there were some shims to use internal dashboards and alarms with cloudwatch data. The eventual goal was for the whole company to move to cloudwatch instead of maintaining two systems.
What I don't get was the broad internal push to building on AWS directly, which was great most of the time, but came with big downgrades in metrics and alarms if you were to use the AWS equivalent.
CloudWatch Logs has no ingestion rate limits. Using it is a real financial risk. Misbehaving servers can quickly generate huge AWS charges, up to $74,000 per day per account. Even 12 servers could produce $4320 in charges just over a weekend.
Also the CloudWatch Logs console is unusable for even simple tasks.
Datadog and its competitors have ingestion rate limits, but they are overall quite expensive. Self-hosted log analysis tools are exceedingly complex to set up and maintain: ElasticSearch + Kibana, Grafana + Loki.
Just curious (and I'll admit I'm biased, am co-founder @ Grafana Labs): what do you find exceedingly complex about setting up and maintaining Grafana + Loki? Both are single binaries and can be run without any dependencies.
I'm using Terraform to maintain identical staging and prod deployments. Grafana is difficult to deploy statelessly, so this adds yet another Terraform deployment stage:
1. Deploy host running dockerd
2. Deploy Grafana server
3. Configure Grafana server admin password, organizations, users, and passwords.
If Grafana would just support file-based configuration then a whole stage could be eliminated.
"Loki does not come with any included authentication layer. Operators are expected to run an authenticating reverse proxy in front of your services, such as NGINX using basic auth or an OAuth2 proxy."
Unfortunately, this means that any process that can write logs can read all logs. This violates the principle of least privilege, a core part of system security. Prometheus suffers from this, too. To put this into concrete terms: When someone uploads a malicious image through our app which exploits our image resizing server, they will obtain log-writing credentials. Those credentials should not allow them to read all the logs and steal the user data in them. That would be catastrophic for the users and the company.
ElasticSearch supposedly has ACL support, but it is mostly undocumented and full of foot-guns. For example, their security doc omits a necessary flag which enables password enforcement. After following the guide, I discovered that the passwords I had set up were not being checked. I immediately lost all confidence in ElasticSearch as a tool to safely store user data. I deleted it.
I used InfluxDB since it lets me create write-only user accounts. Unfortunately, Grafana's integration with InfluxDB is problematic.
The lack of useful debug logging in Grafana makes troubleshooting especially difficult.
They need to do away with CloudWatch/Logs API Call limits. They actually push users into using Kinesis if they need frequent access to their logs, which adds an unbelievable layer of complexity just for the privilege of avoiding even more latency from using S3 and writing their own scripts/code to collect and pull what they want.
I believe the title of the article should be "AWS Services You Should Avoid if you don't need them or if they are not the right fit for your specific application, for example, if you need social login you are better off with other alternatives more fit to the social login use-case; or for example, if Lambda is not the best fit for containing and deploying my code given my architectural decisions at the moment of this writing".
I'm honestly not familiar with most of the services mentioned from a user perspective, the AWS service I worked on didn't consume them at the time, and I work for another cloud platform now.
That said, this point gave me particular pause, because it brought in to question the other assertions:
> The application collected small json records, and stuffed them into Kinesis with the python boto3 api . On the other side, worker process running inside EC2/ECS were pulling these records with boto3 and processing them. We then discovered that retrieving records out of Kinesis Streams when you have multiple worker is non-trivial.
Yeah.. because that's really not what Kinesis is designed for. They even, rightly, point out that SQS is a better fit for that purpose. That hardly makes it a service to avoid.
That was the most annoying part of the article. A streaming data platform is not a task queue. Why do you think they built Kinesis in the first place when SQS had been around for years?
I wholeheartedly agree with the Cloudformation take.
Terraform is better in almost every way. If you use Cloudformation you'll end up writing a bunch of bash script wrappers or similar around it to make it actually do what you want.
Everyone who's tried both at scale has said the same thing.
I disagree. I spent time in Terraform a few years ago working with a client and Terraform had the ability to create but not tear down resources for some services. I was shocked -- check out the Github issues history. I ended up writing a "bunch of bash script wrappers or similar around it".
Yea, this is exactly what I was talking about. It's not to say that CloudFormation can't run into the same thing - deleting S3 buckets is difficult in any situation - but there are a lot of things Terraform doesn't/won't do, and often you're left making your own second layer of automation to work around it.
To be fair, that also happens a lot in Terraform. I'm taking over work at my org where people have wrapped Terraform in 2-3 different tools and templating engines. And then you have to work out how to robustly store and back up your TF states because piecing it back together if you lose your state, or someone else corrupts it, can be just as bad as the worst CF issues.
Overall, I think Terraform is better if you're deploying a lot of inter-connected resources, but Cloudformation makes a lot of things "just work" in ways that Terraform doesn't. I think of it like ECS vs. EKS/Your Own Kube Whatever. ECS is full of gotchas and limitations, but if you play along with it you get a lot of things "for free".
We used Kinesis in an ETL pipeline and ended up writing our own custom client for it because we didn't want to use the Java one. The product itself is not too horrible, but the fact that they refuse to create native clients for other languages makes it a no-go for any company that is not JVM-based. A managed Kafka on GCP, or Kafka directly on Kubernetes (there is a controller for that, if you pay Confluent, which makes the whole thing painless), is much better.
There are excellent community KCLs for other languages, like Twitch's kinsumer [1] for Go. We've been using it in production with a high-throughput stream for almost a year now without any issues.
While CloudFormation is annoying to use without extra tooling (like Sceptre -- makes management of dependencies between outputs and inputs go away easily, and it's extensible with plain Python), Terraform is horrible in its own ways. The AWS provider has many consistency issues, even glaring ones like not detecting that IAM policy documents have not changed when using lists in actions=[] and resources=[], and after having tried it, it makes me uncomfortable managing large landscapes with it.
Terraform mostly makes sense -- even though there is good tooling around CloudFormation, and CFN more often than not does the job well enough -- when you find out that CloudFormation is, like Terraform, just a driver for AWS APIs, likely developed by a dedicated team, and as such features of the underlying APIs are often not exposed in CloudFormation at release time. I noticed this especially when experimenting with EKS: there's eksctl, which integrates with CloudFormation (generates some stacks) but is utterly useless for integration because you can't import outputs or exports, so you have to hardcode all your SG IDs, VPC IDs, and Subnet IDs if you want to integrate with existing infrastructure (https://github.com/weaveworks/eksctl/issues/1141). Dealbreaker, waste of time, disappointing for an "official" CLI. Next, there's pure CloudFormation -- but no luck for you: to this day AWS doesn't support EKS endpoint access control settings through CloudFormation (https://github.com/aws/containers-roadmap/issues/242), a dealbreaker if you need that and are allergic to public control endpoints in your infrastructure. Looking at Terraform: it supports integration with existing CloudFormation infrastructure and can access CFN exports, it supports all the EKS settings you'd want, and it offers a consistent interface to these features, so you've got something to use, short of rolling your own.
Mind you, CloudFormation is extensible using Custom resources, but hacking around CloudFormation is likely not worth your time and something Amazon should do. Anyway, CloudFormation is likely one of the most tested and well-working parts of AWS, so I'd prefer it over third-party state management unless I find a very good reason to do so.
> so you have to hardcode all your SG IDs, VPC IDs, and Subnet IDs if you want to integrate with existing infrastructure (https://github.com/weaveworks/eksctl/issues/1141). Dealbreaker, waste of time, disappointing for an "official" CLI
You can reference Parameter Store and Secrets Manager entries in CF so you don’t have to hardcode values.
"What’s the problem (with Elasticache)? It is expensive, and no one notices for months."
So you just click on stuff without knowing how much it costs? Really?
> Secondly that cache.* prefix means this instance costs $0.216/hr instead of $0.126/hr, a 71.4% premium. Then you might think you need one for dev, qa, and prod
Ok and you pick a cache of the same size for dev/qa/prod? Again, really?
> But, not all data/records/events should go into Kinesis. It is not a general purpose enterprise event bus or queue.
Yes, there's SQS for that. But it seems another case of clicky-go-lucky. "Hey it says queue here so we'll just use this right?!"
> Lambdas are great for the following tasks:
Yes, agreed. I would generalize it as: it is good for specific small tasks, especially when plugging into the rest of the AWS infrastructure (reacting to/from SQS, S3 events, etc)
> Lambda is horrible for:
> A replacement for REST API endpoints.
Sigh
General advice: if something feels weirdly hard, then you're probably doing it wrong.
> the new YAML CF is better than the old JSON CF, but it’s still tough to read long complex stacks.
These days I would recommend using CDK if you can’t switch to Terraform but I would never under any circumstances recommend YAML since the odds approach certainty that the magic in the parser will cause a problem. Beyond the usual confusion around typing I’ve seen things like significant whitespace breaking functionality[1] and the shorthand types in a few cases make diffs messier. If you find an example in YAML, it takes a second to run it through cfn-flip and you’ll never have to deal with any of that.
For anyone using CF seriously, I highly recommend the cfn-python-lint tool and associated editor support. It will catch many of these cases before you burn a lengthy update cycle.
1. Fun fact: AWS’ own security alert CF stack fails the Security Hub CIS scanner because it adds extra white space to the metric filters.
> With the Lambda server-less paradigm, you end up with 1 lambda function per route.
Only if you use a framework that splits it up that way (which I agree is horrible). But nothing about Lambda _requires_ this architecture. In fact, API Gateway has a proxy mode that allows you to serve all requests from a single Lambda.
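A minimal sketch of that proxy-mode setup, assuming an Express app and the serverless-http package (both illustrative choices, not something the comment prescribes):

```typescript
// API Gateway sends every request ({proxy+} / ANY) to one Lambda, and an ordinary
// Express app does the routing. Routes are made up for illustration.
import express from 'express';
import serverless from 'serverless-http';

const app = express();
app.use(express.json());

app.get('/users/:id', (req, res) => {
  res.json({ id: req.params.id });
});

app.post('/orders', (req, res) => {
  res.status(201).json({ received: req.body });
});

// One exported handler serves every route; moving off Lambda later just means
// calling app.listen() in a normal daemon instead.
export const handler = serverless(app);
```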
To my surprise this is mostly solid advice despite the clickbaity title.
I've never used Cognito or ElastiCache, but for the other three (CloudFormation, Kinesis, and Lambda), I would agree with what he's actually saying, which is to use these services judiciously.
CloudFormation has some advantages over Terraform, but having used both, Terraform is much more usable for long-lived static resources. I have yet to try Pulumi, and I don't see any reason why I should to be honest. However, CloudFormation is required for automated build and deployment of Lambda functions--both Serverless Framework and AWS SAM use CloudFormation. I've also used CloudFormation StackSets to build out infrastructure for multi-account governance. In normal use though, Terraform is typically more usable, though both solutions can get pretty hairy.
By "Kinesis" the author seems to be referring to Kinesis Streams. In fairness I haven't built anything where Kinesis Streams would be a better use case than SNS/SQS, but that's just as much a statement about the projects I have happened to work on than it is about Kinesis Streams. I have used Kinesis Firehose, which provides a very scalable mechanism for, "multiple clients are intermittently spitting out a massive volume of data points, all of which need to be faithfully logged into S3 somewhere".
And finally, we come to Lambda. Lambda is a good fit for use cases where either you need to run some code on an intermittent basis or you want to prototype faster and don't want to mess around with provisioning and deploying servers. Lambda is a great place for little pieces of glue code that get triggered on an x-minute cron, or based on a CloudWatch event, or from SNS or SQS. It's good for serving an HTTP API that gets less than one request per second. These use cases are extremely common and I've encountered them a lot more than the author seems to, though again maybe that's just me. But for anything high-traffic, where the Lambda container is just going to stay hot, deploy it to EC2 instead (perhaps with some container magic in the middle); it's cheaper in the long run.
My team has invested a sizable amount of code to manage CloudFormation. At this point, we have a rather mature interface for dealing with CloudFormation, including:
- defining dependencies between stacks
- taking outputs from stacks and feeding them in as parameters to other stacks (not using that awful Import/Export Value crap they implemented)
- deploying as many stacks in parallel as possible and waiting for them to complete before deploying dependent stacks
- dealing with common failures and rollbacks, including handling known "continue update rollback" steps with predefined resources to skip
- pre/post stack create/update/delete actions to make API calls and perform other actions outside of CloudFormation
We basically built the missing pieces of CloudFormation ourselves and have managed to keep CloudFormation holding the definitive state rather than managing it ourselves (or paying Terraform to do it). We have about 500 stacks for a single deployment of our product.
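Not their actual tooling, but a minimal sketch of the outputs-as-parameters idea with the AWS SDK v3; stack, output, and parameter names are made up:

```typescript
// Read the outputs of an upstream stack and feed selected values into a dependent
// stack's parameters, instead of relying on Fn::ImportValue exports.
import {
  CloudFormationClient,
  DescribeStacksCommand,
  UpdateStackCommand,
} from '@aws-sdk/client-cloudformation';

const cfn = new CloudFormationClient({});

async function outputsOf(stackName: string): Promise<Record<string, string>> {
  const { Stacks } = await cfn.send(new DescribeStacksCommand({ StackName: stackName }));
  const outputs: Record<string, string> = {};
  for (const o of Stacks?.[0]?.Outputs ?? []) {
    if (o.OutputKey && o.OutputValue) outputs[o.OutputKey] = o.OutputValue;
  }
  return outputs;
}

export async function deployDependentStack(): Promise<void> {
  const network = await outputsOf('network-stack');
  await cfn.send(
    new UpdateStackCommand({
      StackName: 'app-stack',
      UsePreviousTemplate: true,
      Parameters: [
        { ParameterKey: 'VpcId', ParameterValue: network.VpcId },
        { ParameterKey: 'SubnetIds', ParameterValue: network.PrivateSubnetIds },
      ],
      Capabilities: ['CAPABILITY_NAMED_IAM'],
    }),
  );
}
```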
We also implemented a large number of Transform Macro lambdas to make composable templates _much_ easier.
There's a lot you can do with CloudFormation, but it takes some investment.
A few things about it that really drive me nuts, though:
- Lagging support for new features/resources
- Parameter count and parameter size limits
- Certain bugs with some resource types that are slow to get fixed (Redshift cluster password management and issues with Elasticsearch resizing triggering blue/green deployment come to mind).
- No ability to modify timeouts for some actions. For example, the timeout for a CustomResource is fixed and cannot be tuned -- if your Lambda never responds, CloudFormation will hang for up to 2 hours. We wrote our own Lambda wrapper just to guarantee a response if an unexpected failure occurs.
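A hedged sketch of a wrapper in that spirit: the response fields (ResponseURL, Status, PhysicalResourceId, and so on) are CloudFormation's documented custom resource contract, while the wrapper shape itself is illustrative and assumes a Node 18+ runtime for the global fetch:

```typescript
// Whatever the inner handler does, CloudFormation always gets a response on the
// pre-signed ResponseURL, so a crash cannot leave the stack hanging for hours.
type CfnEvent = {
  ResponseURL: string;
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
  PhysicalResourceId?: string;
};

async function respond(event: CfnEvent, status: 'SUCCESS' | 'FAILED', reason?: string) {
  await fetch(event.ResponseURL, {
    method: 'PUT',
    body: JSON.stringify({
      Status: status,
      Reason: reason ?? 'See CloudWatch Logs',
      PhysicalResourceId: event.PhysicalResourceId ?? event.LogicalResourceId,
      StackId: event.StackId,
      RequestId: event.RequestId,
      LogicalResourceId: event.LogicalResourceId,
    }),
  });
}

export function wrap(inner: (event: CfnEvent) => Promise<void>) {
  return async (event: CfnEvent): Promise<void> => {
    try {
      await inner(event);
      await respond(event, 'SUCCESS');
    } catch (err) {
      await respond(event, 'FAILED', String(err));
    }
  };
}
```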
My team has a case with Kinesis streams where multiple lambdas (and a firehose) process the same data and perform different actions on it. This can't be done with SQS, and there's no "pausing" with SNS. If, for some reason, a lambda has issues processing data, we have until the TRIM_HORIZON to fix the issue and resume processing.
I will say, though, that the KCL library is a PoS, and the few containerized services that are reading from Kinesis exhibit this annoying behavior of causing the iterator age to spike to the TRIM_HORIZON when performing a blue/green deployment of the containers.
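A hypothetical sketch of one such consumer: a Lambda wired to the stream through an event source mapping, decoding each record and doing its own thing while other Lambdas (and a Firehose) read the same stream in parallel:

```typescript
// Uses the @types/aws-lambda type definitions; the handler body is illustrative.
import type { KinesisStreamEvent } from 'aws-lambda';

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  for (const record of event.Records) {
    const payload = JSON.parse(
      Buffer.from(record.kinesis.data, 'base64').toString('utf8'),
    );
    // This consumer's specific action goes here; a failure means the batch is
    // retried, and you have until the data ages out of the stream to fix it.
    console.log(payload);
  }
};
```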
This would be great as a Markdown repo of "When to avoid each AWS Service."
Many services beyond the ones listed have serious usage limitations or gotchas, and the listed ones have more limitations than those given. Workarounds would also be useful to know.
There are also ones that aren't in active development, like SimpleDB. It's not that it doesn't work, but you probably don't want to build anything new with it.
I don't get this attitude when applied to small things. Can't it just be considered complete and not in need of enhancements? If it's not supported to the point of not fixing security issues, that's completely valid.
SimpleDB has some significant limitations which don't exist in its successor, DynamoDB. For example, SimpleDB domains are limited to a size of 10 GB; no such limitation exists in DynamoDB.
Well stated. It's very clear to say it has a 10 GB limit rather than a vague "don't use this". SimpleDB and DynamoDB are at opposite ends of the scaling spectrum.
I actually agree with this in general. One of the open questions in my mind is what's a single-product business supposed to do when its product is feature complete.
Someone's shotgun-of-frustration post; sounds like the author is mad at everything in AWS. Lambda has its uses (I definitely wouldn't follow AWS's SAM model, and I wouldn't try to serve my critical web site from it), but Lambda does have a convenience factor.
Cloudformation isn’t perfect but it is well integrated within AWS and you can get support for it around the clock. Hashicorp quoted me $13,000 just for support for Terraform. AWS support isn’t cheap either but covers everything in AWS and their support is fantastic - like IBM in their heyday.
Athena is actively worse than hosting your own Hive on EC2. It'll time out queries and give you no results beyond charging you. Given the variable performance of S3, that's terrible.
You should use Cognito as a quick way to provide SAML/OAuth to a lot of apps, but it's not going to solve all your authentication woes. Nothing does (not even Auth0).
You should use CloudFormation if it's there and you're just trying to stand something up quickly. Shell scripts with AWS CLI works too. For more robust long term stuff, use Terraform, Pulumi, or custom Boto3 code. The point is to just start using the Infrastructure as Code pattern early, not to make it perfect. Terraform is surprisingly painful after a while, but it's easier to standardize your organization on large-scale. There's no great solutions in this space.
You should absolutely throw money at ElastiCache to quickly scale a cache up and down. If you're not a very experienced sysadmin (and fuck the entire tech industry for making that term a dirty word), i.e. if you are not experienced at administering systems, you'll be spending unnecessary time and energy standing it up and maintaining it. Your caching should ideally just be in your service, and scaling your services themselves should be sufficient, but whatever, Redis wasn't invented for people who write good code.
> But, not all data/records/events should go into Kinesis
Not all of anything should go into anything. Use it if it's convenient.
Lambda is very useful for batch jobs and CloudWatch-triggered maintenance tasks. I wouldn't rely on it for anything more than that; I would rather just deploy the same code to a Fargate Spot Instance and not deal with all of Lambda's bullshit.
Summary of the article's recommendations. Don't use:
- Cognito (authentication). Because: social login on mobile isn't native. Instead: use Auth0, OneLogin, Okta, or roll your own.
- Cloudformation (programmatically configure AWS). Because: various complexities. Instead: use Terraform.
- ElastiCache (managed Redis). Because: expensive. Instead: run Docker Redis in EC2.
- KINESIS (queue), as a general-purpose data queue. Good for: streaming data such as video processing/uploading. Bad for: a generic data queue, because it's difficult to route each event in the queue to one of multiple workers; Kinesis is meant for every listener to be assured of getting the entire stream. Instead: SNS/SQS (SQS FIFO) or a queuing framework that sits on top of Redis or a traditional database.
- Lambda (server-less), to implement a REST API. Good for: serving/redirecting requests to CloudFront, and reacting to events from SNS or SQS by running small asynchronous tasks. Bad for: a replacement for REST API endpoints, because it's too hard to work with a zillion lambdas. Instead: use a regular web framework (or, as other commenters say, route all requests to one lambda and then do more routing within that lambda).
I do not agree or disagree, I'm just summarizing; although I did add in some information from other comments here, too.
While I haven't personally had the opportunity to run into these issues, the feedback there shows a serious lack of ownership that I've never encountered elsewhere with AWS.
While I agree that medium should die in a fire, it has been my experience that someone has usually submitted the URL to archive.is and that is true in this case, too: https://archive.is/hz1RG
You may also enjoy trying an incognito window in the future
Regarding using Lambda for a REST API: just set up an API Gateway endpoint in proxy mode and run all your endpoints from one lambda. It will be much simpler, and if you use a decent framework, it will be really simple to convert your app into a regular daemon process if you ever decide to move off Lambda.
Regarding cloudformation, I view manually writing cloudformation json/yaml as an anti-pattern.
I would recommend looking at CDK, which is a framework for writing your CloudFormation stacks as actual real code, not markup.
It's a great tool, and I find it enormously more productive and expressive than terraform.
CDK allows you to write CDK apps in a variety of languages, but I would recommend just using TypeScript, since that's what it's written in.
I tried the Java bindings, but they were pretty clunky.
For a non-AWS-specific, CDK-like experience, you could also look at using Pulumi, but I haven't really used it much myself.
Trying to define infrastructure purely in declarative markup-based languages is such a waste of time.
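To make that concrete, a minimal sketch of what "stacks as actual code" buys you: a loop producing one queue-plus-alarm pair per environment, which in raw YAML would be copy/paste. Names and thresholds are illustrative.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

class QueuesStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);
    for (const env of ['dev', 'qa', 'prod']) {
      const queue = new sqs.Queue(this, `${env}-jobs`, {
        visibilityTimeout: cdk.Duration.minutes(5),
      });
      new cloudwatch.Alarm(this, `${env}-jobs-backlog`, {
        metric: queue.metricApproximateNumberOfMessagesVisible(),
        threshold: 1000,
        evaluationPeriods: 3,
      });
    }
  }
}

const app = new cdk.App();
new QueuesStack(app, 'queues');
// `cdk synth` renders this to plain CloudFormation; `cdk deploy` applies it.
```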
Even if you don't want to use the proxy route, you can point multiple routes at the same Lambda function, and every ok-ish routing module/framework will know which code path to follow.
This makes some good points about misuses of these AWS services, but the title is misleading. The article is actually more like "tempting but inadvisable use cases for AWS services".
My employer uses three of these heavily (ElastiCache, Kinesis and Lambda) and we get quite a bit of leverage out of them.
ElastiCache in particular surprised me. At first glance I mistook it for a transparent (and expensive) wrapper around sticking Redis on an EC2 instance, but if your usage is heavy enough to need multi-node clusters (e.g. read replicas or full Redis Cluster), its orchestration features are pretty useful. We can resize instances, fail over to a replica, and reshard clusters, with zero downtime, by clicking a button (or a one-line Terraform change). And never having to install security patches is nice too.
It certainly is expensive, though. (But if you're not willing to pay a premium for managed infra, what are you doing on AWS in the first place?)
Don't fret! You still have all the weird service limits to look forward to, which you'll hit both directly and indirectly when using CFN to maintain and extend a prod infrastructure YoY.
Just like the Redis replacement, AWS's MongoDB replacement, DocumentDB, comes up when you search for Mongo, but it is not an actual replacement even though the name indicates "compatibility."
As a rule of thumb, we always try to avoid such proprietary services because of the potential vendor lock-in, whether AWS or not.
The approach CFN used made sense when there was no alternative, but Terraform's approach is so much better. I wonder how AWS approaches this space in the future. CFN has great vendor lock-in and needs to remain supported for any customers who are locked in, but it's harmful for any new development to use CFN.
I get that AWS likes having infrastructure framework level lock-in, but they need to do better.
Does anyone have a feel for how hard AWS support, account reps, or training push CFN versus Terraform? I've been out of the AWS cloud for a while.
Support doesn't officially support third-party products, and support engineers may or may not know Terraform. Engineers definitely aren't trained on Terraform, and your mileage will vary a lot depending on the specific engineer and their background. It really depends on whether they've run across it in other jobs, personal projects, etc.
Customer obsession is a big deal for support engineers. So if they can, they'll try to help solve a Terraform issue. But it'll be on a personal, best-effort basis.
For such a mud-slinging article, I'm surprised the author didn't complain about the most pressing issue (IMO) with CloudFormation: there have been a few instances where things in AWS changed and Terraform was updated more quickly than CloudFormation was. It happens less and less now, though.
> Support doesn't officially support third-party products
This is not entirely correct. "AWS Support Business and Enterprise levels include limited support for common operating systems and common application stack components" as documented here: https://aws.amazon.com/premiumsupport/faqs/ under "Third-party software"
Having been in some calls with reps/solutions architect folks recently, the push hasn't been super hard. I don't think the CFN vs. Terraform discussion has ever come up.
Cloudformation is spot on. There is simply no excuse in 2020 to use it when cloud agnostic Terraform exists.
ElastiCache is expensive and the analysis is generally correct on that; however, if you do actually need a high-capacity, rock-solid Redis cluster and you can afford it, it does the job. Paying someone to set up and scale a cluster of that magnitude would probably cost a year's worth of usage. Of course, if you don't actually need to scale it yet, then just use the suggestions in the article and run it yourself.
Perhaps bad terminology on my part but I would argue that Terraform IS cloud agnostic, it is NOT cloud portable.
You can use Terraform with any cloud. You cannot seamlessly move your AWS configuration to Google Cloud but you can still write Google Cloud configuration with the same paradigm as your AWS configuration.
This is exactly the idea of Terraform - the workflow is portable and agnostic, but the full fidelity of each individual resource provider API is exposed.
it's not just the provisioner. it's not like different clouds have the same abstraction. for all intents and purposes you need to understand the underlying cloud and how to use the api/tooling around it when terraform fucks up
People ask that question all the time, and I don't get why. I hate HCL, I hate the verboseness, the limited capacity for abstraction, and that it's yet another proprietary format; everything about it reminds me of ANT.
I want infrastructure as code, not infra-as-yaml. I want types checked by the compiler, as is good and proper. Apparently many people don't feel that way though?
There's another one that I think qualifies for this list: Elastic Container Service. It's a half-baked Docker service. It seems like it's been abandoned now in favor of Kubernetes.
I don't really agree that it's been abandoned; nothing suggests that this is true. They're not known for just abandoning established services, and there's a public roadmap[0] that you can follow. Like all services, there are pros and cons. Up until very recently EKS lacked more or less everything ECS has in terms of tight integration with other AWS services, so there's that.
I love the 3rd party integration that comes with it.
I love the fact that I don't have to manage user credentials and everything else by myself.
However,
I hate that the UI is not polished, and there's very little customization allowed. As an example, from the sign in/up page, you can't add a link to go back to the home page.
I hate that if a user signs up via Google, and later on tries to reset the password via the Google email, Cognito silently drops the request.
I hate the js library. Hard to use and documentation is not great.
You can definitely use one Lambda function to serve all of the endpoints that the author mentions. You need a wildcard route in API gateway, and can easily use koa-serverless or express-serverless to get the route information when any request is received. That way you only need one function for your APIs. Even with a regular api not on lambda, you’ll still have to be aware of db connections.
I have no experience with Cognito, and the CloudFormation point I completely agree with. The rest of them don't make sense though. ElastiCache is expensive but offers a ton of features that a single Redis process on an EC2 instance will not. And while Kinesis and Lambda are bad for the use cases called out in the article, they each fit their niche pretty well.
Very Flask-like, and it includes CLI components to automate deployment. It even includes simple decorator-based event mapping for S3, SQS, scheduled tasks, etc.
I agree. My personal list of stuff I like using at Amazon:
1. S3 + CloudFront (but I still use Cloudflare for DDoS)
2. EC2
3. Aurora PostgreSQL
4. SQS in a few cases, but generally I use Redis for queueing
5. Lambda for tiny notifications, usually to SQS when S3 gets dinged
Unlike the author I'm not afraid of Lambda, but I generally prefer containers for apps.
The author misses one important point with Redis: you don't pay for traffic. If you host it yourself, you pay for traffic both ways if it crosses AZs. Dynamic scaling, no-downtime upgrades, etc. It works really well.
I do hope they will get tiered RIs for ElastiCache like they have for RDS.
> You dig into CLOUD WATCH logs. We have three sets of logs all intermixed together. This is unreadable.
I'm not following this. You don't read logs one by one; you filter by source, error code, time range, etc. It's been a while, but I'm also pretty sure CW has log groups.
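For example, a hypothetical sketch of that kind of filtering with the AWS SDK v3 (the log group name is made up):

```typescript
// Pull only ERROR events for a time range from one log group, instead of reading
// intermixed streams by eye.
import {
  CloudWatchLogsClient,
  FilterLogEventsCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});

export async function recentErrors(logGroupName: string): Promise<string[]> {
  const { events } = await logs.send(
    new FilterLogEventsCommand({
      logGroupName,
      filterPattern: 'ERROR',
      startTime: Date.now() - 60 * 60 * 1000, // last hour
      endTime: Date.now(),
    }),
  );
  return (events ?? []).map((e) => e.message ?? '');
}
```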
Hmm.
1) their VPN solution is overly expensive
2) their MongoDB document database (DocumentDB) is similarly expensive
3) Lambda I want to use, but other than the SNS/SQS-connected tooling I agree; as an API endpoint, Google's offering is better
4) Cognito: horrible docs...
One time I hosted an ElastiCache instance for a couple of days to test the Bull library, then forgot about it for a month even though I wasn't using it. Got a bill for $1,500 :)
We use the above-mentioned services and I don't agree with the assessment; we are cloud native, use Lambda heavily, and think it's an amazing service for moving faster!
I lack experience with AWS and clouds in general, but I wonder: what is the alternative to CF if I should avoid it? I can imagine alternatives to the rest, but I cannot come up with one for CF.
use CloudFormation (or Deployment Manager if on GCP). Terraform is a disaster, and you'll learn the hard way that you cannot trust it and that you'll have to understand the underlying cloud no matter what you do.
cloudformation is awesome and it works as advertised. I have literally not seen it fail catastrophically ever since it came out. it just works.
terraform? or terrafail as it’s known. it’s a disaster. it cannot keep track of the resources it creates. when in trouble, it throws its hands up in the air and you’re on your own. do yourself a favor and don’t learn the hard way - in production - that terraform cannot take your infra from point A to point B (and rollback in case something went wrong).
Lol, CloudFormation historically didn't support really big and mandatory chunks of AWS resources and resource parameters, and until last year there was no coverage roadmap (https://github.com/aws-cloudformation/aws-cloudformation-cov...), so it was no fun: you couldn't even, for example, enable RDS storage autoscaling via a CF template. That's why Terraform was created: AWS CF didn't support AWS's own resources.
hahahah. terraform has bugs that were opened 8 years ago and were swept under the rug countless times.
the fundamental problem is that terraform is not even a half baked tool (what version was it again? 0.12?) and people are betting the farm on it. guess i’ll go build more software while others are twirling their hands with the super HCL language (what’s even more aggravating is that Hashicorp got this part - the language - right with Vagrant but I guess reinventing the wheel and using go was sexier).
If you needed to set up an S3 bucket that triggers a Lambda function when an object is added, how would you do that? Last I looked, the S3 resource would get created lazily and the Lambda wouldn't get an S3 ARN, since the bucket didn't exist yet, and the stack would fail.
It's a super obvious way to use S3 and Lambda, but the docs recommend something insane.
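One hedged workaround, sketched with CDK rather than raw CloudFormation (my pick, not something the question or the docs prescribe): the construct library adds the bucket notification and the invoke permission for you, so the circular reference never has to be hand-built.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'UploadsStack');

// Placeholder handler; in practice this would be your real function code.
const fn = new lambda.Function(stack, 'OnUpload', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async () => {};'),
});

const bucket = new s3.Bucket(stack, 'Uploads');

// Wires up the S3 notification configuration and the Lambda resource policy together.
bucket.addEventNotification(s3.EventType.OBJECT_CREATED, new s3n.LambdaDestination(fn));
```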
it's not cool to use it. why use a service that's written, run and maintained by a dedicated team that also builds all the other services you use (if in AWS)? they don't know what they're doing, right?
better to use some randotool and talk big about IaC, when in reality the only people I have run across who loved Terraform either 1) used it on a really small scale or 2) don't really have a lot of experience with the cloud.
the only reason not to use CloudFormation is that you're not using AWS. Use whatever your cloud provides.
For everyone that has objections, an appeal to authority: I was building stuff on the cloud before you knew the cloud was a thing. I worked with AWS, GCP, Azure and [sad face] OpenStack extensively.
Terraform resources aren't meant to be externally modified. The configuration itself shouldn't have a life of its own; it should represent the whole behavior of the stack.
this is not about terraform “resources”. it’s about having a reliable way to work with your infrastructure. terraform is not that way (have you ever tried seeing what happens if the terraform process crashes/is killed or you lose network connectivity while terraforming?)
I've been using Okta for authentication in my SaaS app, and the docs and support experience have been great for me! Free for 1000 MAU. Haven't tried Cognito or Auth0.
Am I the only one that finds AWS' pricing structure as a whole too risky to use? Why use a service that can charge me a lot of money without being up front about it?
DynamoDB is fine as long as it fits your use case.
The trouble is that without a time machine, it's very tough to realize that your use case isn't the one DynamoDB is good at. The docs and Best Practices certainly suggest that anything is possible.
Do you need God Tier Scale (and understand enough about DDB to achieve it (and have the $$$))? DynamoDB is awesome! Otherwise... maybe consider a few other options...
You can use one Lambda to handle multiple API Gateway requests: route ANY to it, then have a bridge that connects to your favorite language's router. We do this all the time with NestJS. The downside is that you burn some cash when someone calls a 404 resource, since your NestJS app has to say it's missing, and that requires at least 100ms of Lambda time. You could probably get around this by generating a route map off the NestJS controller files.
Well, now you are comparing apples to oranges: Terraform's main focus is managing and provisioning infrastructure, while Ansible is more commonly used to do the actual instance provisioning and configuration.
In fact, Terraform + Ansible is a fairly common and powerful stack.
While it's possible to provision actual instances with terraform, it's not what you should be doing.
Let terraform handle the infrastructure, and have ansible or puppet or saltstack or cfengine or whatever provisioning tool to do the instance setup.
It's all about the right tool for the job, and managing your linux install with terraform is not that.
Yes terraform and ansible have different sweet spots. Terraform is best when managing stateless resources. It can handle stateful resources, but as always managing state gets tricky fast.
If what you are CRUD-ing doesn't map well to the Terraform concept of a resource, then don't use it. So I don't like to use it to create anything inside a VM.
Using Terraform to create an auto-scaling group is good, though, as it can be represented as a stateless resource; or rather, any state in it is transient.