Exploring performance differences between Amazon Aurora and vanilla MySQL (plaid.com)
137 points by bjacokes on June 17, 2021 | 20 comments



This is a wonderful article. I recently discovered an Aurora gem that saved me from some flaky mysqldump runs: you can write query results directly to S3 with a `SELECT INTO OUTFILE S3` statement. According to the docs, this is an optimized operation when run from a read replica.

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...
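
For anyone curious, the statement looks roughly like this (the bucket URI and column names are placeholders; the exact grammar and required IAM setup are in the linked docs):

```sql
-- Export query results from an Aurora MySQL instance straight to S3.
-- 's3-us-east-1://my-bucket/exports/users' is a placeholder URI; the
-- cluster also needs an IAM role that allows writing to the bucket.
SELECT id, email, created_at
FROM users
INTO OUTFILE S3 's3-us-east-1://my-bucket/exports/users'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
```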


Every once in a while there's a well-written blog post about database internals. Uber's Postgres-to-MySQL switch saga produced a few of them. This one is pretty good too.


We worked closely with AWS on this (problem and blog) and they were great and quite transparent. Glad it's interesting/useful to you.


The simplest is probably read committed, especially if, like many ETL jobs, you're just going to grab everything in one read for further processing. Another option: do a read-committed read and omit the last 15 minutes of data if you're running long jobs, to avoid churn at the end of tables/logs.
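
A sketch of that second option, assuming a table with a `created_at` timestamp column (names are illustrative):

```sql
-- Run the ETL extract at READ COMMITTED and skip the hot tail of the table.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
SELECT *
FROM events
WHERE created_at < NOW() - INTERVAL 15 MINUTE;  -- omit the last 15 minutes
COMMIT;
```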

I see folks doing serializable reads for historic ETL jobs with one read in the transaction - why? Is there some history / tool issue I'm not familiar with?


For Aurora MySQL, the default for read-only replicas is repeatable read. As we mentioned towards the end of the post, read committed support appears to have been introduced to Aurora MySQL just last year. But you're right – now that it's supported, switching to read committed is by far the easiest fix.

No idea why people would be using serializable reads for ETL jobs though! :O
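
If I remember right, on a reader instance you first enable read committed via a session variable, then set the isolation level (double-check the docs for your Aurora MySQL version):

```sql
-- On an Aurora MySQL read replica (reader endpoint):
SET SESSION aurora_read_replica_read_committed = ON;
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
```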


My own guess was that some ETL jobs were really data integrity jobs - in which case folks got used to higher levels of isolation being necessary across many reads to avoid false positives on their cross check stuff maybe.


We had a similar problem where a running ETL job caused a production outage due to binlog pressure.

One thing that surprised us was that our TAM said that on a single-AZ, write-heavy workload, normal MySQL would have higher performance, because Aurora synchronously writes to storage servers in other AZs. On an immediate read-after-write workload, that means it takes longer to acquire locks.


> One thing that surprised us was that our TAM said that on a single-AZ, write-heavy workload, normal MySQL would have higher performance, because Aurora synchronously writes to storage servers in other AZs

What is surprising about a multi-AZ database having higher latency than one that runs in only one AZ?


From what I can tell, they provisioned their DB instance(s) in a single AZ, but weren't aware that Aurora automatically provisions its own storage and always uses multiple AZs. We touch on the separation of compute and storage in the post.

I think the surprise is that it's not possible to have a truly "single AZ" Aurora database, even though you might have thought you provisioned your DB instances that way.


I see. I haven’t used Aurora, but have had experience running write heavy workloads on RDS. EBS failures would regularly (like monthly) cause our write latency to spike up 3-5x. If Aurora’s storage layer architecture is more resilient to those types of problems, that seems like a huge win.


Hopefully it's not a surprise if you're using Aurora. The papers on the topic are very clear about how they scale the storage.


This seems plausible given our understanding of the database internals. In general we found our AWS contacts to be knowledgeable and forthcoming about complex tradeoffs between Aurora and vanilla MySQL, even if some of that information is hard or impossible to find in the docs.


I wonder why Aurora shares undo logs between replicas? It’s perfectly possible for the read replicas to each re-create their own copy of the undo logs, and retain those undo logs for different durations based on the different long-running queries on each replica.


It seems like the benefits of cloud infrastructure have normalized vendor lock-in.

I’ve never used Aurora because I don’t want to code anything to the idiosyncrasies of AWS (or any other cloud provider).


Aurora has compatibility layers and you interact with it as with normal MySQL, MariaDB or PgSQL. Of course there are some underlying differences, but the code and most of the tooling stay the same.


One huge difference we discovered is in locking. Do not expect Aurora to do you any kind of favors if you actually use those features.


Really great article! I have a question: you say to keep an eye on RollbackSegmentHistoryListLength, and I want to do that, but I don't know at what number it becomes something to worry about. There doesn't seem to be any guidance on AWS's site. I'm seeing ranges of 1,000 to 5,000 and sometimes 100,000.


Great question, although I'm not sure there's a concrete answer to it other than "it depends". You can think of that metric as representing the number of logs that haven't been garbage collected, so as it goes up, performance will get worse.

If you're seeing spikes in RollbackSegmentHistoryListLength that coincide with dips in DB performance, you've probably identified the culprit. In the scenario described in our post, that metric would have grown monotonically for the duration of the long-lived ETL query – probably a more overt problem than what you're describing with short spikes to 100,000.
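
If you want to watch the same thing from inside the database rather than through CloudWatch, InnoDB exposes its history list length in `information_schema` (the metric name may vary by MySQL version):

```sql
-- InnoDB history list length: undo log records not yet purged.
SELECT NAME, COUNT
FROM information_schema.INNODB_METRICS
WHERE NAME = 'trx_rseg_history_len';
```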


A number of our 100k spikes spanned about a day, and a cluster of them seem to coincide with serious performance issues we have encountered. We "solved" the problem by increasing the instance size, but I'm starting to see spikes that get larger and larger, so I suspect we will run into this issue again. But now I have something to report on and watch out for. Thank you!


H3 tags on this could really use a bump in size and contrast from the regular text.



