Migrating Large Heroku Postgres Instances to AWS Aurora Without Downtime (thecodinginterface.com)
124 points by sciguymcq on Nov 12, 2020 | 39 comments



Very thorough article! It's really nice to have this all in one place for reference later.

In skimming this page I wish the following had been more prominent:

> In the Real World for this process to work you need to request for the Heroku Data Support team to establish continuous WAL log shipping to an AWS S3 bucket along with a base physical backup using the WAL-E Python based library.

I kept searching in vain for where exactly you were able to get a WAL-E backup into the bucket and finally found that line.


This is a great write-up! I wrote a far less thorough blog post on the same topic a few months ago: https://vericred.github.io/2020/08/10/migrating-to-aurora.ht....

Overall, we're happy with our decision to migrate to Aurora. We feel it offers better performance, observability and scalability compared to heroku postgres. To echo some other comments:

1. Jesse at Heroku is great. I had never set up my own postgres instance before, and his guidance helped me a lot.

2. We used DMS, and it was OK, not great. Dealing with LOBs was a bit of a pain. In order to keep the migration time to a minimum we set up multiple DMS tasks and grouped tables into those tasks based on their max LOB size. Most of our tables didn't have any LOB columns, and they migrated quickly.

3. The ability to fast clone a database is nice. We can create a copy of our production database in about five minutes for use in non-prod environments. This would take a couple of hours in heroku.

4. The Aurora IO costs are ridiculous. Are there any good tools to see what exactly is causing so much I/O?


DMS under the covers is just another tool. The Postgres support for it lags pretty badly and is mostly an afterthought. It's much better to leverage something more native: either the WAL, or, if you're on a more modern version, logical replication.

For coming off Heroku, DMS isn't going to work because you don't have a replication slot, so your only approach is wal-e, or maybe wal-g?
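
To make that concrete, here's a rough sketch of what the WAL shipping side looks like on a Postgres instance you actually control (on Heroku you can't do this yourself; per the article, the Heroku Data Support team sets it up for you). This assumes wal-e is installed and configured with its S3 prefix and AWS credentials:

    -- Ship WAL segments to S3 via wal-e (wal-g is analogous):
    ALTER SYSTEM SET archive_mode = 'on';                    -- needs a restart
    ALTER SYSTEM SET archive_command = 'wal-e wal-push %p';  -- reload suffices
    SELECT pg_reload_conf();
    -- You also need a base physical backup in the bucket to
    -- restore from, e.g. via wal-e's backup-push command.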

Fast cloning is one nice feature of Aurora for sure, but when you go to Aurora you're starting to deviate more and more from native Postgres. Fast cloning can definitely be done without Aurora. We're working on bringing our clone times on Crunchy Bridge down to a few minutes, similar to Aurora, while still being 100% Postgres. Not there yet, but I suspect we're a few months away.


Hey, regarding 4: we came across this, but didn't find anything in the article particularly helpful for us:

https://aws.amazon.com/blogs/database/reducing-aurora-postgr...

(We bounced to normal RDS postgres and reduced our costs a lot while keeping performance similar; check out my other comments in this thread for details.)


And with RDS postgres, in a lot of cases you're probably OK with gp2 vs. the much more expensive io1 provisioned-IOPS storage, unless you have a fairly busy database (100s-1000s of qps).

The main difference is a 95% performance SLA on gp2 vs. >99% on io1.

(You get 3 IOPS per GB on gp2, so a 1 TB volume, for example, has a baseline of about 3,000 IOPS.)


io1 also has slightly better performance than gp2.


> 4. The Aurora IO costs are ridiculous. Are there any good tools to see what exactly is causing so much I/O?

I don't use Aurora so I don't know what's available there, but pg_stat_statements does have I/O statistics, so you can review it manually or use a tool like pganalyze (not sure if it supports Aurora) to see which queries are I/O intensive.
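
For the manual route, something like this works as a starting point (a sketch, assuming the pg_stat_statements extension is installed; the column names here are from Postgres 12-era versions):

    -- Top 10 queries by blocks actually read from disk:
    SELECT query,
           calls,
           shared_blks_read,
           shared_blks_hit,
           round(100.0 * shared_blks_hit
                 / nullif(shared_blks_hit + shared_blks_read, 0), 2) AS hit_pct
    FROM pg_stat_statements
    ORDER BY shared_blks_read DESC
    LIMIT 10;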


Not sure about cost and availability, but Performance Insights for RDS postgres will show waits on various resources (disk, CPU) and is fairly easy to use. If it's like RDS postgres, you can set up logging and use something like pgbadger (you'd need to adjust the parameter group's logging options to capture the right details).


In a similar vein, I recently purchased this ebook https://pawelurbanek.com/heroku-migrate-postgres-rds - I haven't finished it yet but I enjoy the small, edited format of it.


Nice!

Slight nitpick: Django should automatically create the hstore extension for you [1] during migrate whenever you use an HStoreField

[1] https://github.com/django/django/blob/f5e07601b233a50e6bcca4...

Edit: I might've been mistaken here in thinking the above migration operation was automatically applied, as the docs seem to suggest you need to add it yourself [2], but at least it's as easy as adding an operation ;)

[2] https://docs.djangoproject.com/en/3.1/ref/contrib/postgres/f...
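
For what it's worth, whichever way it gets applied, the operation just wraps a plain extension creation; I believe it boils down to the equivalent of this SQL, which you could also run by hand with a sufficiently privileged role:

    -- What Django's HStoreExtension() migration operation executes:
    CREATE EXTENSION IF NOT EXISTS hstore;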


The author gained a significant amount of Postgres-related knowledge while performing this migration.

Taking this into consideration, is it still worth using managed Aurora instead of a self-managed instance on EC2?


I currently manage, or am "lead dev" for, around a half dozen to 10 apps using Postgres as the database. For two of those I use a managed database service (AWS RDS / AWS Aurora Postgres) because they drive mission-critical, high-value products/services, and I wouldn't consider using a standalone install of Postgres. For other apps that are not so mission critical, I'm perfectly happy saving considerable money and running them on a VPS on AWS EC2 or Digital Ocean or Linode. So ultimately it depends on the app and the risk tolerance for Recovery Time Objective and Recovery Point Objective.


That's super interesting! I've heard of people doing the opposite for performance reasons (running on bare EC2) and for storage cost reasons. Do you notice a big performance difference between the EC2 apps and the RDS apps? Also, have you had any EBS-volume-related issues on RDS, or unexplained latency/downtime?


Worth noting that DO also has a hosted Postgres service.


If you have configuration management infrastructure set up, you're not gaining much. Tools like repmgr/barman (in a public cloud you'd probably use WAL-E/WAL-G) provide nearly all of the benefits.

PostgreSQL is generally low maintenance. The only case where this might not be true is when you're hitting performance bottlenecks and need to tune your database, optimize your tables, use a bigger instance, etc., but RDS doesn't save you from that.


RDS Postgres (actual postgres, not the Aurora product which can be "mostly postgres compatible") is another option, not just "EC2 self-managed instance".

I guess we'd have to look at price comparisons for the size you actually need. I think depending on load, the price differences may be nominal.


Thought I'd weigh in here: we moved from heroku to ec2 to aurora to RDS postgres, so I can probably speak to this a little more.

* EC2 self-managed is easily the cheapest; we had a solid setup with continuous backups and a read replica, so if cost is a factor it's easily the winner. However, there is a _lot_ of knowledge that goes with it, and when it comes down to it, you can pay someone else to handle that. This isn't just the setup cost: you need to factor in ongoing maintenance (the number of people at my company who could have done complicated things with the instance was probably me and one other engineer, and we didn't want to be permanently on call for this) and general risk.

* EC2 probably worked out to between 1/2 and 3/4 (probably 2/3) the price of the equivalent RDS (tough to say exactly, as I'd need to factor in all the ancillary costs that are more "bundled together" in RDS).

* RDS was much cheaper than aurora for our workload (almost 1/2 the price).

I think the main thing to remember is that EC2 is much cheaper than RDS, but RDS is cheaper than engineers. With that said, while we didn't need to do anything complicated other than the migration, the risk, possible engineering time, and bus factor didn't feel worth it to stay on EC2.


We've had a few people migrate from Heroku and RDS over to Crunchy Bridge and see a 2.5x performance improvement on warm cache and up to a 5-6x performance improvement on cold cache for the exact same $ spend. I can't easily say how that would compare to raw EC2 because it really depends on how you tune and configure. But there is definitely some opportunity to optimize vs. any stock install and setup.

The real thing about EC2 is that it'll be cheaper. But when you do have a bad day, say page corruption, you're going to have more than a bad day; more like a bad week/month trying to untangle it. I'm not sure how helpful RDS is in those cases. I can say that Heroku historically was solid (I was there way back in the day, having helped build the service). And there are some other options that can still deliver good performance and good support at a balanced price.


We've been on RDS for a while now but have recently had some issues that are pushing us to heavily consider EC2, especially since we've heard about considerable performance gains running on some of the bare-metal EC2 options out there vs. RDS.


We had equivalent performance (and we'd benchmarked both) for equivalent RDS and EC2 instances. (EDIT: to note, we didn't dig in heavily, just looked and saw it seemed the same; maybe there's something smart you can do to make EC2 a lot faster.)

The first thing to think about for RDS (or any postgres instance, really) is figuring out what the bottlenecks are. If your cache hit ratio is slightly low, or queries generally feel a little sluggish but CPU and RAM aren't too high, I'd recommend looking into increasing the provisioned IOPS of the machine (it requires no downtime; you can basically just make the storage faster).

If your issues are RAM/CPU being high, you might want to consider moving to a larger instance.

It can also be worth looking into the number of temp files being created by your queries and tweaking your work_mem setting (you can use log_temp_files to see if this is likely to help).

Other than that, remember that every connection to the db is a separate process, so you don't want too many. It could be worth looking at RDS Proxy or PgBouncer if you think you have a lot of connections to your db.
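
A minimal sketch of the first two checks in SQL (the thresholds are rough rules of thumb, not hard limits):

    -- Overall cache hit ratio; much below ~99% suggests faster storage
    -- (more provisioned IOPS) or more RAM could help:
    SELECT sum(blks_hit) * 100.0
           / nullif(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_pct
    FROM pg_stat_database;

    -- Temp file activity per database; lots of temp bytes suggests
    -- raising work_mem (confirm first with log_temp_files = 0):
    SELECT datname, temp_files, temp_bytes
    FROM pg_stat_database
    ORDER BY temp_bytes DESC;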


Might have been nice to see this post a year ago :).

My company migrated our ~2TB postgres db from heroku to AWS (originally aurora; we switched back to RDS postgres). I haven't had time for a blog post, but figured I'd post our learnings here, and the differences in our approach.

* We did the second step (ec2 -> aurora) via AWS DMS (Database Migration Service). The service may be a little more of a pain than it's worth, but it's what we settled on.

* We used wal-g rather than wal-e for the initial phase; it has slightly better performance, and we were happy with the tool.

* We wrote ansible playbooks to perform the heroku -> ec2 migration. It's definitely nice, and it makes you feel a little more secure to have run the db migration multiple times over; it would have been a giant PITA to do all that manually. (I hadn't used ansible much before; I have to say, I'm a fan! I'd used chef/puppet in bits and pieces before and was never that keen on them.)

* Aurora wasn't the right fit for us. Be careful when reading the pricing. The AWS pricing calculator asks how many million requests you have per month; this is _not_ how it's charged. If you read the docs, the IO charges apply to AuroraReadIO and AuroraWriteIO, which are not per db request but per db storage request: the number of requests times the number of db blocks each request fetches. For example, a single query that scans a million 8 KB blocks is billed as a million I/Os, not one.

* Our Aurora costs were through the roof (and variable and spiky, never something you want!). We migrated again, to RDS postgres, and by bumping the provisioned IOPs up we got equivalent performance (possibly better?) for half the price (and the price was stable). We had benchmarked everything but the cost before the db migration, relying instead on the AWS simple pricing calculator. The difference between predicted and actual storage IO costs was over 10x if my memory serves (making the entire DB double the price per month). BEWARE!

* AWS DMS is kinda buggy; last I checked, the table selection editor UI didn't work when you added/modified multiple fields. You can switch to a JSON editor, so generate the table selection once, then check it in the JSON editor in case the UI got borked.

* Speed wasn't great with AWS DMS either; we had to flip BatchApplyEnabled on and take a lot of care with it.

* There were definitely some shenanigans to do with LOBs, but I can't fully remember them.

* DMS does not update id sequences, so if you have any tables with incrementing ids, make sure that you set the sequences manually before switching dbs (a sketch of the fix follows this list).

* We used debian as our intermediate ec2 instance. We've historically used ubuntu for many ec2 instances, but the default ubuntu AMI image uses MBR for its partition (!!! in 2020!!!), and while we had some code to swap it to GPT while running, that felt a little too spicy for a db migration, so we used base debian, whose AMI has a sane GPT-partitioned disk. (Note: we needed to create partitions larger than what MBR can support, so this was a hard requirement.)

* The last learning I should add is that heroku support was actually really great during the whole process: very knowledgeable, and they gave us some example setup files for the ec2 instance. After that they basically leave it to you, but I can't speak highly enough of them (thanks Jesse!)

* One more for you: be careful about the version of libc you have. Even if the postgres versions are identical, subsequent libc versions can have slightly different string sorting, which can corrupt indexes that are migrated after copying the basebackup (a detection/repair sketch follows this list). We needed to rebuild an index or two due to this; not a huge deal, but an interesting gotcha. I swear I checked our locale settings 5 times when it went wrong until I cottoned on to it being the libc version.

There were probably more learnings than that, but that's what comes to mind.
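
On the sequences point, the manual fix is a one-liner per table (a sketch: 'users' and 'users_id_seq' are illustrative names; repeat for each serial/identity column in your schema):

    -- Bump the sequence to the current max id after the DMS copy:
    SELECT setval('users_id_seq', coalesce(max(id), 1)) FROM users;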
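
And on the libc point, one way to detect and repair an affected index (a sketch: 'my_text_idx' is an illustrative name; amcheck ships as a contrib extension, and REINDEX ... CONCURRENTLY needs Postgres 12+):

    -- Errors out if the btree ordering no longer matches the collation:
    CREATE EXTENSION IF NOT EXISTS amcheck;
    SELECT bt_index_check('my_text_idx');

    -- Rebuild an index that fails the check:
    REINDEX INDEX CONCURRENTLY my_text_idx;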

(EDIT: added final bullet point). (EDIT 2: added one more bullet)


I originally wanted to use DMS, but our DB failed the pre-migration assessment, which forced us to logical replication.

We also worked with Jesse. You're right, he is amazingly helpful. Unfortunately we got some pretty questionable help from others before he was assigned to our support ticket, but once that happened it was infinitely better.


Ahhhh... well, you might've dodged a bullet, to be honest! If I were to do it all again, there's a good chance I'd stick with the standard tools rather than DMS.

Thanks for the in depth article by the way, very well written!

Keep an eye on your AuroraStorageIO costs going forward; as we said, we halved our costs and increased performance by switching back to RDS and increasing the available PIOPs. I'm sure that'll be workload-sensitive though, so YMMV. We did a bunch of benchmarking, and aurora seemed nice apart from this.

We must've lucked out, as we got Jesse at the start.


Fantastic comment, thank you so much. I've been following Aurora for a while but a little curious (and skeptical) about sharp edges vs vanilla PG. If I may ask you a follow-on question, who do you think Aurora would be the right fit for?


Hmmmm... That's a good question.

The performance was great, I can't complain about it, but the unpredictable (and high) price was the issue.

In terms of aurora, I think it probably can scale better to much larger workloads. If we had put the time in, dug in, and optimised any queries that touch a large number of blocks (we do well, but I'm sure there are queries with some seq scans somewhere in our codebase), then the price wouldn't have been an issue and we'd happily have stayed with it.

In addition, we have a single large master and one replica. With my understanding of aurora's architecture, and how the storage is separated from the "postgres" instance, it could possibly do a lot better with drastically different architectures, e.g. a lot of smaller "postgres" instances. We didn't have time to investigate many other architectures, and in the end moving to RDS felt safest (we didn't want to spend too much time when we could be building core features).

Hope that answers your question! Some of the above is more an educated guess with a smattering of experience so please treat it as such.


On the other hand, vanilla postgres is very powerful, and you shouldn't underestimate how much you can win by cranking up the PIOPs for an RDS postgres instance.

Remember, always benchmark representative workloads.


I'm considering moving from a self-hosted Mongo setup to Aurora, either with PG or MySQL. Anyone here have opinions on Aurora?


It does the job, but aurora postgres lags behind mysql when it comes to features. Failovers are surprisingly slow, too. You're pretty much stuck with cloudwatch for monitoring, which has significant lag.


> aurora postgres lags behind mysql when it comes to features

And performance, on some workloads we tested. I would benchmark write performance very carefully before using Aurora Postgres (Aurora MySQL does not seem to suffer from the same issue on the same workload).

> Failovers are surprisingly slow, too.

The recommendation now is to use RDS Proxy for faster failovers.


Aurora (Postgres) looked like it would be great, and the performance for our use case was much better than RDS. But there was a showstopper for us with the replication: if the cluster leader fails, _every read replica will force a restart_ to elect a new leader, resulting in a minimum of ~30s of complete cluster downtime. In our situation we have no problem with a short downtime in write availability, but the fact that we can't maintain read availability via replicas is a deal breaker.


I'm no DBA, but is that not the entire point of a cluster/replicas?

Sure, maybe writes won't be accepted or replicated until a new master is elected, but you can't even read from the cluster?

I'm curious at this point: what is it exactly that you gain from a cluster? Is it only having to wait 30s vs. however long it takes for the master to come back online?


That was the mind-boggling part to me. After testing failover and noticing the behavior, we had a number of discussions with AWS reps and support engineers who confirmed it. And Aurora's failover documentation hints at it but downplays the severity.


The thing you gain from the replicas is removing read load from the primary during normal operations.


If you're thinking of going for the version that auto-sleeps when not in use, be aware there's a 30-40 second startup time from the first incoming request. If it's a database that only needs to be accessible in bursts, this can be really handy for cutting costs, as you're not paying for any read/write resources you're not using. Of course, if you do need it generally accessible all the time, then you'd want to provision it accordingly.


I've heard mixed reviews on Aurora MySQL. Some people move to it and see huge performance gains; others simply don't, or run into limitations.


Do you have some examples of limitations they faced?


Honestly, I don't, sorry.


I would suggest checking out other smaller cloud providers for managed databases over Aurora. Check out Aiven.io


That's impressive. We're struggling to find a way to "migrate" from AWS Aurora 1 to 2 (that is, from MySQL 5.6 to 5.7). There seems to be no (easy) way to upgrade without downtime for large databases.



