Our Journey to PostgreSQL 12 (coffeemeetsbagel.com)
215 points by tommyzli on Jan 25, 2021 | 114 comments



Amazing that the process went so smoothly and that there were so many resources for them to draw from. Jumping from 9 to 12 is quite a few major versions!

Also liked the couple of gotchas which go to show no matter how smooth a data migration is, there'll be some bumps.


> Jumping from 9 to 12 is quite a few major versions!

Just a little side note.

They were jumping from 9.6 to 12, not from 9.0 to 12.

Before Postgres 10 was released, the first two digits defined a "major" version. So from 9.6 to 12 it's three major releases (9.6 -> 10, 10 -> 11, 11 -> 12).


still, changes between 9.6 and 12 are _numerous_, both in features and performance: llvm based query compilation, CTE de-materialisation, proper procedures, and that's just off the top of my head.

i wish the process for upgrading postgres were easier/more dynamic. i'm sure plenty of people are still using versions 9.6 or earlier.


The changes you mention like llvm, CTE de-mat and such: Were these for "free" or did you have to adapt your code?

And if you didn't adapt: Would your code still run after upgrading (is it backwards compatible)?

I'm asking as a dev mostly working with MS-SQL and seriously considering moving to PostgreSQL one day.


I've done PG upgrades on a code base from ~7.x now through 11. For the most part it's pretty seamless; backwards-incompatible changes occasionally pop up, but they're called out in the release notes.

The biggest bites have been changing the database driver from pypgsql to psycopg2 and there was a binary column default wire format change at the 9/9.1 upgrade. In another job, I've got essentially the same code running against 9.5 and 12, and I don't notice the difference.

Performance improvements for a feature are almost always 'free', but obviously your code isn't just going to start using CTEs if you haven't been using them.


> The changes you mention like llvm, CTE de-mat and such: Were these for "free" or did you have to adapt your code?

The changes came "for free" in the sense that no code needed to change, but caution should be heeded whenever the query planner changes.

I remember a time when a MySQL 5.6->5.7 upgrade crippled our entire monitoring stack:

https://bugs.mysql.com/bug.php?id=87164

https://dba.stackexchange.com/questions/121841/mysql-5-7-inn...


whilst "free" in principle, there is obvious cost/risk in adopting features like those (CTE demat and llvm), if nothing else, in even trying them out.

testing the llvm-based optimiser (in postgres 11) showed little benefit, and in a few circumstances slowed things down. had i had more time to poke around, i'd have probably got it to a place where it represented an improvement, but, up until i left that role, it was a feature that remained switched off - postgres doesn't generally introduce features that represent "breaks" in code.
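
for anyone wanting to run the same experiment, JIT is governed by a few settings and can be tried per-session before committing cluster-wide. A rough sketch (the table name is made up):

    -- check the current state and the cost threshold that triggers JIT
    SHOW jit;
    SHOW jit_above_cost;
    
    -- try it for this session only and see whether the plan gains a "JIT:" section
    SET jit = on;
    EXPLAIN (ANALYZE) SELECT count(*) FROM big_table;  -- big_table is hypothetical
    
    -- if it doesn't pay off, keep it off cluster-wide
    ALTER SYSTEM SET jit = off;
    SELECT pg_reload_conf();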

having not had _that_ much experience with MS-SQL, i'd still recommend the switch. looking through the things in postgres that i take for granted vs. what's in MS-SQL, i don't think i could cope with not having postgres :)


This is a very difficult thing to do. Very impressive. I have so many questions but my number one is: were you able to evaluate alternatives to your existing vertical scaling based setup? For example, cockroachdb, multi-master postgres, using sharding instead of a single DB, etc. At that database size, you are well past the point in which a more advanced DB technology would theoretically help you scale and simplify your architecture, so I'm curious why you didn't go that route.


Distributed systems are hard. Multi master is particularly sticky, especially if the data doesn't have natural boundaries.

Once solved, though, horizontal scaling is nice, if more involved to maintain.


CockroachDB is pretty good at encapsulating the complexity of multi-master.

You'll have to accept that transactions can fail due to conflicts, so if they are interactive, you'll have to retry manually.
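
A rough sketch of what that manual retry looks like, using CockroachDB's documented client-side retry savepoint (the accounts table here is hypothetical):

    BEGIN;
    SAVEPOINT cockroach_restart;
    -- the actual work of the transaction
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;
    UPDATE accounts SET balance = balance + 10 WHERE id = 2;
    RELEASE SAVEPOINT cockroach_restart;
    COMMIT;
    -- if a statement fails with SQLSTATE 40001, the client issues
    -- ROLLBACK TO SAVEPOINT cockroach_restart; and re-runs the statements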

Edit: I'd like to hear criticism, instead of just seeing disapproval.


(as a downvoter) Distributed transactions don't scale; they are NOT efficient. You can't co-partition data in CockroachDB, so the only way is the slow way. Interleaved data doesn't count.


Are you suggesting it hits scaling boundaries earlier than single-node postgres? The TPC-C benchmark suggests fairly strong scaling up to medium-to-high double-digit node counts, and my understanding is that it's decently conflict-prone and very much relying on distributed transactions.

Of course a distributed architecture costs efficiency, but as long as it still scales further (and after some point, cheaper) than single-node alternatives, the efficiency loss is tolerable.

You can affect partitioning, but forcing it requires the paid version.


> Are you suggesting it hits scaling boundaries earlier than single-node postgres?

This is the NoSQL scenario all over again. You have to think in cost($)/query and about your data layout.

> Of course a distributed architecture costs efficiency, but as long as it still scales further (and after some point, cheaper) than single-node alternatives, the efficiency loss is tolerable.

The efficiency loss is tolerable - but the question is, when? For a primary-key get, the efficiency loss is, let's say, none. So you can start distributed from the start.

For other queries, especially multi-tenant workloads where you can put a tenant in 1 box, it may only start making sense AFTER scaling past a 2TB-memory node.

Imagine a select query doing a join: it has to read 1GB from an NVMe array or from the network.

Imagine a write that ends up as a distributed write: it needs to wait on 2x+ more servers across the network.


basically what paulryanrogers said.

We thought about migrating to Citus, but I don't have a good idea of how to shard our dataset efficiently.

If we were to shard by user id, then creating a match between two people would require cross-shard transactions and joins. Sharding by geography is also tough because people move around pretty frequently.


What kind of data is this?

My best guesses are

- either it is SaaS, in which case it should make sense to shard by customer

- or it is something-to-consumer (social networking?), in which case I guess you'll have to take a step back and see if you can sacrifice one of your current assumptions

... but I feel I'm missing something since what I am saying feels a bit trivial.


It’s a dating app.


Sharding a matching engine is indeed pretty hard, and requires redundancy and very deliberate data modelling choices.

That does seem like a fun exercise :).

The general rules of the game are: You can only scale up throughput of queries/transactions that only access 1 shard (some percentage going to 2 shards can be ok). You can only scale down response time of large operations that span across shards, since they are parallelized. You should only join distributed tables on their distribution column. You can join reference tables on any column.

The thing that comes to mind is to use a reference table for any user data that is used to find/score matches. Reference tables are replicated to every node and can be joined with distributed tables and each other using arbitrary join clauses, so joining by score or distance is not a problem, but you need to store the data multiple times.

One of the immediate benefits of reference tables is that reads can be load-balanced across the nodes, either by using a setting (citus.task_assignment_policy = 'round-robin') or using a distributed table as a routing/parallelization scheme.

    CREATE TABLE profiles (
        user_id bigint primary key,
        location geometry,
        profile jsonb
    );
    SELECT create_reference_table('profiles');
    
    CREATE TABLE users (
        user_id bigint primary key references profiles (user_id),
        name text,
        email text
    );
    SELECT create_distributed_table('users', 'user_id');
    
    -- replicate match_score function to all the nodes
    SELECT create_distributed_function('match_score(jsonb,jsonb)');
    
    -- look up profile of user 350, goes to 1 shard
    SELECT * FROM users u, profiles p WHERE u.user_id = p.user_id AND u.user_id = 350;
    
    -- find matches for user #240 within 5km, goes to 1 shard
    SELECT b.user_id, match_score(a.profile, b.profile) AS score
    FROM users u, profiles a, profiles b
    WHERE u.user_id = 240 AND u.user_id = a.user_id 
    AND match_score(a.profile,b.profile) > 0.9 AND st_distance(a.location,b.location) < 5000 
    ORDER BY score DESC LIMIT 10;
The advantage of having the distributed users table in the join is mainly that you divide the work in a way that keeps each worker node's cache relatively hot for a specific subset of users, though you'll still be scanning most of the data to find matches.

Where it gets a bit more interesting is if your dating site is opinionated / does not let you search, since you can then generate matches upfront in batches in parallel.

    CREATE TABLE match_candidates (
        user_id_a bigint references profiles (user_id),
        user_id_b bigint references profiles (user_id),
        score float,
        primary key (user_id_a, user_id_b)
    );
    SELECT create_distributed_table('match_candidates', 'user_id_a', colocate_with :='users');
    
    -- generate match candidates for all users in a distributed, parallel fashion
    -- will generate a match candidate in both directions, assuming score is commutative
    INSERT INTO match_candidates
    SELECT a.user_id, c.user_id, c.score
    FROM users u
    JOIN profiles a ON u.user_id = a.user_id,
    LATERAL (
        SELECT b.user_id, match_score(a.profile, b.profile) AS score
        FROM profiles b
        WHERE b.user_id <> a.user_id
        AND match_score(a.profile, b.profile) > 0.9
        AND st_distance(a.location, b.location) < 5000
        ORDER BY score DESC LIMIT 10   -- top 10 per user, not 10 in total
    ) c;
For interests/matches, it might make sense to have some redundancy in order to achieve reads that go to 1 shard as much as possible.

    CREATE TABLE interests (
        user_id_a bigint references profiles (user_id),
        user_id_b bigint references profiles (user_id),
        initiated_by_a bool,
        mutual bool,
        primary key (user_id_a, user_id_b)
    );
    SELECT create_distributed_table('interests', 'user_id_a', colocate_with :='users');
    
    -- 240 is interested in 350, insert into 2 shards (uses 2PC)
    BEGIN;
    INSERT INTO interests VALUES (240, 350, true, false);
    INSERT INTO interests VALUES (350, 240, false, false);
    END;
    
    -- people interested in #350, goes to 1 shard
    SELECT * FROM interests JOIN profiles ON (user_id_b = user_id) WHERE user_id_a = 350 AND NOT initiated_by_a;
    
    -- it's a match! update 2 shards (uses 2PC)
    BEGIN;
    UPDATE interests SET mutual = true WHERE user_id_a = 240 AND user_id_b = 350;
    UPDATE interests SET mutual = true WHERE user_id_a = 350 AND user_id_b = 240;
    END;
    
    -- people #240 is matched with, goes to 1 shard
    SELECT * FROM interests JOIN profiles ON (user_id_b = user_id) WHERE user_id_a = 240 AND mutual;
For data related to a specific match, you can perhaps use the smallest user ID as the distribution column to avoid the redundancy.

    CREATE TABLE messages (
        user_id_a bigint,
        user_id_b bigint,
        from_a bool,
        message_text text,
        message_time timestamptz default now(),
        message_id bigserial,
        primary key (user_id_a, user_id_b, message_id),
        foreign key (user_id_a, user_id_b) references interests (user_id_a, user_id_b) on delete cascade
    );
    SELECT create_distributed_table('messages', 'user_id_a', colocate_with :='interests');

    -- user 350 sends a message to 240, goes to 1 shard
    INSERT INTO messages VALUES (240, 350, false, 'hi #240!');
    
    -- user 240 sends a message to 350, goes to 1 shard
    INSERT INTO messages VALUES (240, 350, true, 'hi!');
    
    -- user 240 looks at chat with user 350, goes to 1 shard
    SELECT from_a, message_text, message_time
    FROM messages 
    WHERE user_id_a = 240 AND user_id_b = 350
    ORDER BY message_time DESC LIMIT 100;  
This exercise goes on for a while. You still get the benefits of PostgreSQL and ability to scale up throughput of common operations or scale down response time of batch operations, but it does require careful data model choices.

(Citus engineer who enjoys distributed systems puzzles)


Wow, I wasn't expecting such an in-depth response!

It sounds like making user profiles a reference table would solve the cross-shard join problem, but what would the performance implications look like?

The docs just say that it does a 2PC, which I'm assuming won't perform very well in a high-write workload


> As I mentioned earlier we run Postgres on i3.8xlarge instances in EC2, which come with about 7.6TB of NVMe storage.

Wait a second. You run your production database on ephemeral storage? Wow.

I see the replication setup and the S3 WAL archiving and whatnot but still... that's brave.


It's really a question of how many replicas you have, whether you're running with sync rep or not, and what your DR story is like. We've tested it a few times at previous employers and are currently exploring rolling it out for Crunchy Bridge. The NVMe storage is really nice - great performance, and the price balance is good as well. But it does come with nuances... I wouldn't let a user provision without HA, for example. In some cases - a standard app without crazy uptime requirements - having a standby or 2 is wasted cash. So it isn't for everyone, but can be for some people.


hi! Can you share how you do HA on postgres? Master/slave with monitoring and manual failover, or is that automatic? If so, is it reliable? What tooling do you use? Thanks!


We are living life on the edge to an extent, but we have 5 hot standbys across AZs and regular backups + WAL archives to S3.

May not be as durable as EBS, but it's enough for me to sleep soundly at night. And with a highly concurrent WAL-G download, it takes like an hour to catch up a new replica from scratch.


Fine, with enough replicas, you can sleep well at night. But what about 3 years of uptime without a reboot? Can you really enjoy your morning coffee without thinking about it? :)

Netflix went full ephemeral storage for their Cassandra clusters from the beginning, back when those were just spinning disks. Years later, they still insist on doing this, and had to come up with a creative solution to fix the uptime issue: https://netflixtechblog.medium.com/datastore-flash-upgrades-...


From the parent comment:

> it takes like an hour to catch up a new replica from scratch

That means it should be pretty easy to replace an instance - just create a new replica from scratch, then fail over (if you're replacing the current master/primary instance), and remove the old one.


Why can't you reboot? NVMe storage is local, not ephemeral.


Yes, you can reboot, but you have to update in-place, instead of rolling out a new OS image. Or you adopt the Netflix approach. There are also some additional restrictions, e.g. you can't change the machine type.

Anyway, from the other comment here, I think tommyzli might not have realized that a reboot is still possible, which would partially explain the 3 years uptime.


This was pretty common in AWS back in the late 00s. Performance usually sucked too much otherwise.


Even with prioritized IOPS I once had to resort to RAID0 and replicas to get needed performance under budget on EBS. Probably should have just bumped instance size and used local storage.


It's funny--we used to run Vertica on ephemeral nodes and actually found a performance improvement going to EBS, but that was pre-NVMe in AWS.

I wonder how big the delta was for CMB between EBS and ephemeral?


Silly question: they have 5.7TB in their database... How come? It's a dating app founded in 2012; I can understand that one can accumulate so much data in 11 years, but surely you can periodically archive "unused" data and move it out of your primary database, right? I mean, are the 5.7TB of data actually needed on a daily basis by their app?

(I assume data for analytical purposes is not stored in their primary DB, which is fair to assume I believe)


As someone who manages an order of magnitude greater database than that for an app founded in 2008 (and that's after multiple archivings of unused data, and no severe crimes like storing image blobs in the DB), I can tell you that, uhm, Coffee Meets Bagel is either not that successful, or is doing a very good job at managing their DB size.


5.7 TB is small by database standards. I work at a much smaller company and deal with "proportionally" much more data.

I don't know if it would be worth the engineering effort to try and archive old data in a way that still makes it transparently accessible to users that go looking for it--especially when modern databases ought to be able to scale up and out without manual archiving.


5.7 TB for an OLTP database is small?! I must be living in a different world. Obviously I know you can go that big, but I thought the number of use-cases would be limited.


Why does my browser routinely eat 8GB while it used to only require 32MB 25 years ago? because it can. Web services likewise come up with features and data to fill databases.

For $8/hr you can rent a DB with 500 GB of memory and 64 cores, complete with redundancy, automated backups, and failover. For the hourly rate of an oracle consultant you can rent a DB with 2TB of memory for the day.

Bear in mind that many of these workloads are trivially shardable (e.g. any table keyed off customer ID) and can be scaled across hundreds of DBs as required.


Just because data is shardable doesn't mean it's trivial. In your example with shards by user, a simple message sent between users in the app becomes a very non-trivial dance to do reliably.


If you recognize that there isn't a good automated solution for the DB to smartly join messages, then it becomes a fairly straightforward problem once again.

e.g. the simple solution is to denormalize the table and have each message keyed by recipient. In a dating app you'll roughly double your message count this way, and even in most messenger apps the proportion of messages sent person to person is likely the most significant.

A smarter solution is to key each conversation by a unique key in a sharded table and then store the set of "conversations" that a user is engaged in in the sharded users table. Fetching the messages for a user then becomes a simple 2 query process - fetch/filter the conversations, then fetch the messages. No duplication of messages, and likely just a few extra bytes per message for the key.
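
A rough sketch of that second layout in plain SQL (table and column names are made up; the sharding keys would be conversation_id for the first two tables and user_id for the last):

    -- conversations and their messages live together, keyed by conversation_id
    CREATE TABLE conversations (
        conversation_id bigint primary key,
        participant_ids bigint[]
    );
    CREATE TABLE conversation_messages (
        conversation_id bigint references conversations (conversation_id),
        sender_id bigint,
        message_text text,
        sent_at timestamptz default now()
    );
    -- each user row carries the set of conversation keys they belong to
    CREATE TABLE app_users (
        user_id bigint primary key,
        conversation_ids bigint[]
    );
    
    -- step 1: fetch the user's conversation keys (hits the user's shard)
    SELECT conversation_ids FROM app_users WHERE user_id = 240;
    -- step 2: fetch messages for those conversations (hits the conversation shards)
    SELECT * FROM conversation_messages
    WHERE conversation_id = ANY ('{1001,1002}'::bigint[])  -- keys returned by step 1
    ORDER BY sent_at DESC LIMIT 100;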

It would be great if the DB could manage the above application side sharded join internally, but we're unfortunately a few steps away from that today.


It doesn't matter how you arrange data, the moment you need to commit to 2 shards transactionally you are either having consistency trouble or performance trouble.

Both your schemas require writes to at least 2 shards transactionally.


Then you don't do chats on your primary RDBMS.


it's huge by database standards, i worked in large multinationals and dealt with some of their largest databases

5.7 TB is enormous by database standards; there is no way you can get good query performance on a 5.7 TB database without solid physical partitioning and heavily optimized queries, and most normal companies, even with a 200-500 GB database, use data marts to get good performance without a super complex architecture and geniuses working for you in the DB admin department

the more i think about it, the more i think that 5.7 TB would be unusably huge, and if you have this much data, most won't even bother to partition; the db will be broken into several (or hundreds of) smaller databases


5TB is not that large and it's not that difficult to get good performance. We operated a 50TB single instance of MySQL for 5 years with a tiny team before migrating to Vitess, and it was basically zero maintenance. We did partition our largest table, which just requires a little extra SQL, but is otherwise transparent.


You guys are talking past each other, as your workloads appear to be different. With a traditional RDBMS, size usually wouldn't be a bottleneck, as long as you can add enough disks to a single server.

Write operations per second, that's the metric I would care about. A 50TB instance with low amount of write operations can be zero maintenance, while a 500GB instance with high amount of write operations can be a real pain.


Thank you for validating my troubles with a write-heavy ~500GB db in the midst of a TB-measuring contest :)


Imagine there are 10m users. That's 600kb per user.


And you have to account for indexes, temporary tables used for data analysis, etc. And most of it is probably not compressed. So with that perspective it isn't that much data at all.


That's an incredibly large amount per user? I have worked on a couple online dating sites, including one that was fairly popular (Let's Date - which stiffed me for my last invoice before they went belly up grrr). Unless you're storing images in the database, it's really hard to generate 600k for a dating workload - even with indexes.

The only thing I can imagine generating 600k per user is putting something like "hit tracking" in the database. Which I've done - yes it adds up - but it's also relatively easy to move to some other kind of store.


If messages between users go into their main database, then that would be a pretty reasonable amount.


600k is a sizable book. The Adventures of Huckleberry Finn is 600k. And that's the average per-user; most users will never send or receive messages.

The only thing I can imagine is that they do an incredible amount of activity tracking.


That's really a good way of looking at it. I thought it sounded like a lot of data; well, 600kb is a lot of textual data, but who knows what they have stuffed into the database.

I worked for an e-commerce site with a few million customers, even more orders, and data duplication all over the place, and still we were using perhaps 200GB of database storage.


Not a lot, and there may be options to partition, but again, you can't comment unless you know the design.


Agreed. I ran a 1 billion dollar GMV e-commerce company and our primary OLTP database was around 60 GB. Everything else was moved to an OLAP database.


Purely a guess: people's images are stored in the database as blobs because it's "easier"


What was the strategy for moving things over? By age? Monitoring queries and determining what data isn’t being queried? Something else?


Nope.


Probably some excessive data hoarding of location information, and perhaps too much meta data for each request.

Anyways, seems more like a liability than an achievement


I think it's because they use Django and naturally it's a monolithic architecture. You will hardly find a database this big in the microservices world; instead there will be several smaller databases.

Now whether several smaller DBs are easier to manage compared to one big one is debatable. However, with a DB that huge, I would prefer having several smaller ones.


The fact that the answer isn't "move to RDS where Amazon solves the problem for us, which isn't our core business as a relationships app" seems to me to be a massive failing of the RDS offering and cloud services in general.


RDS can be.. expensive? Like by a lot?


If you want to do upgrades like this on RDS with minimal downtime you will end up doing the same process: Set up new servers, do logical replication, switch over.

The RDS update process is a single button and you have no way of knowing how long it will take. There are some tricks like turning off Multi AZ and taking a snapshot manually before starting the process but still - for large instances you could be waiting anywhere from 30 minutes to 3 hours for RDS to finish. With large instance types I have seen RDS take a full hour just to provision an instance, in the meantime you'll sit there hitting F5 not knowing if it will ever finish.


Only if they evaluated RDS and found it wanting. They don't even mention testing it.


It's not in the post, but I answered this in a separate thread. RDS doesn't let us provision as many IOPS as we need.

Apparently Aurora behaves differently, but I wasn't aware of that when we specced out the project.


Aurora charges $0.20 per 1 million requests...your IO would have gotten expensive. It's also still stuck on PostgreSQL 11.9.


RDS can do up to 64K PIOPS these days. If you need more than that, I would definitely advocate for re-architecture to split things out and/or shard. We're up to 40k PIOPS in one of our larger databases and fast approaching the time to make that jump ourselves..


I suspect the RDS limitations are left there to push you to Aurora. They control that and would be better equipped to make the most of their infrastructure and margins with it.


Love your app, easily the best experience of all dating apps. However, stability and notifications are atrocious. App notifications lead the data showing up in the app. Sometimes notifications fail altogether. You guys can dominate this space if you can fix these issues.


I love RDBMS over NoSql but the whole upgrade and schema change always is stressful. I miss the days when we could ask our DBA to deal with it. :)


At least with RDBMS, the database engine takes care of the actual data movement for you, after you issue a SQL command. With NoSQL, when you need to update your document format, you now need to handle the data migration yourself.
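
For instance, a schema change like the one below (hypothetical table and column) is a single statement and the engine decides how existing rows pick it up; on Postgres 11+ a constant default like this doesn't even rewrite the table:

    -- one DDL statement; no application-side backfill loop needed
    ALTER TABLE users ADD COLUMN is_verified boolean NOT NULL DEFAULT false;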


Given the limitations you had (move to a larger instance, downtime restrictions), you took the optimal path. Fantastic work! I am in the process of doing what you did, but across a couple hundred instances (I am using pg_upgrade for most, but will use an approach similar to yours where we can't afford downtime).


pg_upgrade would have worked fine given your requirements. Just using normal streaming replication to move the database over to new systems and then performing an in-place pg_upgrade there would be doable with most likely a couple of minutes of downtime, and it would be a much quicker and more robust process.


How would that have worked with multiple replicas cascading from the new primary? Streaming replication doesn't work across versions, so would we have had to build out a tree of new instances, then pg_upgrade them all at the same time?


I believe it is possible to repoint the replicas with pg_rewind and then repoint them to a new timeline. This is something we looked at at Heroku a long time back. It wasn't trivial to fix, but Heroku eventually improved some of this. This post drills into some of that - https://blog.keikooda.net/2017/10/18/battle-with-a-phantom-w...


I've done a few postgres upgrades, the first using pg_upgrade, and the last doing effectively what you did (it was even 9.x -> 12).

My experience was that it needs to rewrite all the data for some tables/indexes under some circumstances, and the db won't be available while that happens. So, unless your db can be down for the time it takes to rewrite all that data it isn't really an option. After having done the upgrade through streaming logical replication I'm not sure I would try pg_upgrade again.

I did the pg_upgrade style update a long time ago, so most details are fuzzy, but I remember setting up a string of replicas something like:

primary -> [read_replica, backups_replica]

read_replica -> [upgrade_replica]

upgrade_replica -> [read_replica_upgraded, backups_replica_upgraded]

This allowed us to do multiple practice rounds without putting any unnecessary load on the primary. I think we needed to re-initiate replication off 'upgrade_replica' after the upgrade, but we did the live update during low-load so the extra read load wasn't an issue.


The pg_upgrade documentation describes a way to use rsync to quickly replicate the upgraded contents of the new primary to the replicas. So you would first move to upgraded base VMs running the old version, streaming from the old primary, which can be done one host at a time if need be.

The new cluster can then be pg_upgraded and rsynced all at once.


Good questions. As logical replication matures it may someday be possible to replicate among versions.


Logical replication would be nice, but at the moment it's far too slow to replicate for heavy day-to-day use, other than doing a pg upgrade between primaries and replicas.


I've replicated across different PG versions, there is an extension called mimeo, which is fantastic for logical replication https://pgxn.org/dist/mimeo/1.5.1/doc/howto_mimeo.html


We tried pg_upgrade going from Postgres 10 to Postgres 12 and it didn't work. An individual instance was about 8 TiB in size. We left it running for over a day to see if it would complete. Instead, we used an approach similar to logical replication described in the article.

To be fair, our use case was probably close to pathological for pg_upgrade. We had lots of TOAST data and dozens to hundreds of indexes per table.


Using async replication and read replicas in a relational DB is a great way to play reverse Wheel of Fortune and go from ACID to just C. You must get some fun bug reports. A poster below mentions doing actions in the app and their side effects vanishing. At the end of the day it's a business decision, but that would not be fun to program against, though maybe some of it can be handled app/client-side with caching and causal consistency.

edit: For more on the nuances of Postgres tradeoffs for replication and transaction isolation: https://www.postgresql.org/docs/9.1/high-availability.html


Reading over this, it seems like there isn't an offsite backup done of the database? eg to have a copy of the data in a "safe place" off AWS infrastructure

If something goes wrong with their relationship with AWS, that could be business ending. :(


The termination clauses in their T&C's say they will give you access, post "for cause" termination, so long as you've paid your bill. Though I'm mindful that pulling a lot of data could take a long time.


That's still a lot of trust that nothing else wipes the account.

But this post wasn't about backups, so there might be a whole lot excluded from the diagram.


Very nice.

Did they migrate into Amazon RDS while doing this? For smaller projects I've stopped doing the self managed postgresql thing. The pricing is higher (75%?) for RDS for some use cases but can be worth it.

Going to try RDS Proxy next.


Thanks! We stuck with plain EC2. RDS has a limit of 80,000 provisioned IOPS and our read replicas on Postgres 9.6 would regularly hit near double that during peak


Did you consider lowering those IOPS with application-level and/or distributed in-memory cache and/or pub-sub notifications to let your app nodes not pester the database so much? Reasonably performant hand-written SQL (no ORM!), review of query plans, maybe shift the hot path into functions/procs?


Believe it or not these numbers are actually from _after_ me and some others spent a few weeks cleaning up our heavier queries


That's an impressive amount of IOPS. I have always been a EBS / pd-ssd guy, and mostly rely on memory to reduce the IOPS requirement. But as cloud providers typically charge a ridiculous amount of money for memory, a setup like yours with instance storage / local-ssd is an intriguing option.


That limit doesn’t apply to Aurora - did you consider that?


Apparently I'm living in the twilight zone because I have a vivid memory of reading the Aurora docs and seeing the same limit. Oh well, it's something to consider for the next upgrade.


So.......... caching?


> We then made the following changes to the subscriber database in order to speed up the synchronization: [...] Set fsync to off

I'm curious how much risk of data loss this added.

I guess the baseline is "we need to migrate before we run out of disk", i.e. you're either going to have data loss or a long period of unavailability if the migration cannot be carried out fast enough.


If fsync is turned back on and a manual sync call is issued before considering the replica valid there will be no risk from this.
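
Roughly, on the throwaway subscriber (a sketch; fsync can be flipped with a config reload, and the final flush of the OS page cache has to happen outside SQL, e.g. with sync or initdb --sync-only, which is the manual sync call mentioned above):

    -- during the initial copy: trade durability for speed
    ALTER SYSTEM SET fsync = off;
    SELECT pg_reload_conf();
    
    -- after the copy has caught up: restore durability and flush what Postgres owns
    ALTER SYSTEM SET fsync = on;
    SELECT pg_reload_conf();
    SHOW fsync;
    CHECKPOINT;  -- flushes PostgreSQL's dirty buffers; the OS-level sync is still separate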


ants_a is correct. Also, our NVMe storage is ephemeral so you aren't recovering from a power loss anyways :)


Disclaimer: I work at AWS, not on EC2.

Locally attached disks are not ephemeral to instance reboots/power failures. However, the disks are wiped after instance terminations. On the official EC2 product pages this is called "instance storage" not "ephemeral storage."


TIL, this is really good to know! Do you know offhand if this is a new feature, or have I just always been wrong


It's been that way for a long time - years at least. You can reboot but not 'stop' and of course not 'terminate.'


is pglogical better than a min-downtime pg_dump/psql? it seems like a lot of work to set up pglogical to migrate versions (or am i missing anything?)


With 5.7 TB of data, you're probably looking at something like 24 hours for a dump/restore, including index rebuilds


I read this and all I can think about is all that private information on some unmaintained database server.


Am I the only one who thinks it's bizarre that a structured query language defines so much of how we choose to architect and operate our systems?

Think about it for a sec: SQL is literally just a language to query and manipulate data. There's no reason that schema changes and data changes have to happen only through the one language, and only through one interface on one piece of software.

For whatever reason, this has just been how the most popular products have done it, and they largely just never changed their designs in 40 years. I like the language, and the general organization of the data is handy. But everything else about it is archaic.

Why fumble around with synchronization? 99% of the data in big datasets doesn't change. This doesn't even have to be "log-based", we just need to be able to ship the old, stable data and treat it almost like "cold storage".

Why is there a single point of entry into the data? You have to use the one database cluster to access the one database and the one set of tables. Why can't we expose that same data in multiple ways, using multiple pieces of software, on multiple endpoints?

Other protocols and languages have ways of dealing with these kinds of things. LDAP can refer you to a different endpoint to process what you need. Web servers can store, process, and retrieve the same content across many different endpoints in a variety of ways. Lots of technology exists that can easily replicate, snapshot, version-control, etc arbitrary pieces of data and expose them to any application using standard interfaces.

Why haven't we created a database yet which works more like the Unix operating system?


Are you kidding, LDAP referrals as a model of how to do it? I mean, I did a lot of LDAP work back in the day, and that's not a feature that saw a lot of action outside academia. Just write your own thing on top of HTTP, that's got referrals too!

There are practically many ways of talking to database systems, if it isn't too troubling that some SQL is often happening somewhere. Like, there's Hasura, postgrest, etc.; or Mongo has a variety of drivers that support different inputs.

One might consider the most unix'y database to be Berkeley DB/Sleepycat, but that is probably not what you wanted. ;)


> Why haven't we created a database yet which works more like the Unix operating system?

Not to be overly snarky, but have you tried? Database design is full of trade-offs.


When I was a kid and learning to program, I wrote some shitty databases for fun. I learned about the trade-offs and that it was easy to write a database that out-performed RDBMSes in specific criteria. But I hadn't thought of making them extensible.

I have a pet project I'm working on, which is a generic distributed system where each component is a microservice. It turns out there are lots of these things built already, mainly by systems engineers for obscure things (Airflow, Rundeck, Stackstorm being some examples). I'll probably think about how I can redesign my project with composable databases in mind. I don't expect I'll ever have a working product, but it'll be useful to think about this problem.


You're basically telling them to put in months of work to find out. Even if it's not too snarky, it's a ridiculous way to learn something that could be conveyed pretty well in a blog post or a chapter of a book.


I think that's a fair observation, but in this case, a dating app, a relational database and SQL seem like a great fit. The ends users are fairly literally SELECTING and JOINING with LIMITS and newly FOREIGN relationships and so forth :)


Why are we still using ASCII or Unicode character interfaces in shells? Because like SQL they work and are moderately well understood.

There are many query languages and having one common one as a base is useful to transfer skills. Think of it as an on ramp to more specific dialects or technologies.


You just described distributed databases -- which are overwhelmingly now deciding to use SQL as their interface of choice. You're completely hand waving over the fact that data is just a bunch of bits on disk grouped into pages. Everything above that fact is a tradeoff.


Actually not really. A Unix operating system can do everything I described with regular-old data, and it's not a distributed operating system. It simply has extensible standard interfaces.

Do you need a distributed database to read a .txt file with cat, Emacs, and Firefox? No. Why not? Because there's an I/O standard they all use. Does that .txt file have to live on a single filesystem, or disk? No. Why not? Because the storage mediums all have a standard virtual filesystem interface.

There is no reason databases cannot do exactly the same thing. It's just that nobody has made them do it yet. They've stuck with the exact same paradigm, and that then drives how all of us build our systems, with this archaic 40 year old model that requires heavy-lifting maintenance windows and a lot of praying.


You're interweaving several different issues here.

> Why fumble around with synchronization? 99% of the data in big datasets doesn't change. This doesn't even have to be "log-based", we just need to be able to ship the old, stable data and treat it almost like "cold storage".

This is not a feature of SQL, this is a feature of the database. Also, this sounds exactly like doing full-table replication to get the "old" data and then turning on log-based replication. You can do key-based replication if you really want to avoid log-based, but it's generally just a less efficient version of log-based replication.
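
Built-in logical replication (Postgres 10+) already works exactly that way: the subscription does a full initial copy of each table and then streams the change log. A minimal sketch, with hypothetical table and connection details:

    -- on the source database:
    CREATE PUBLICATION app_pub FOR TABLE users, messages;
    
    -- on the destination database:
    CREATE SUBSCRIPTION app_sub
        CONNECTION 'host=old-primary dbname=app user=replicator'
        PUBLICATION app_pub;
    -- copy_data defaults to true: bulk-copy existing rows first, then stream changes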

> Why is there a single point of entry into the data? You have to use the one database cluster to access the one database and the one set of tables. Why can't we expose that same data in multiple ways, using multiple pieces of software, on multiple endpoints?

You can. Postgres supports both Perl and Python extensions that run in the RDBMS process, iirc. Very few people use them because running in the RDBMS process means that you can break the RDBMS process in really bad ways, and it is very difficult to gain any benefits over just running a separate process that communicates over SQL.

So if you consider other processes that communicate with the database and then show views of that over other protocols, that describes most of the backend apps in the world.

There's also stuff like Presto[1] that allows you to run queries distributed over multiple databases, multiple types of databases, etc, etc, etc. In that case, conceptually, Presto is "the database" and all the records you refer to are remote.

1: https://prestodb.io/


> This is not a feature of SQL, this is a feature of the database

Yet they always seem tied together eh? Somehow the conventions are stuck together, and that then affects how our systems work.

> Postgres supports both Perl and Python extensions that run in the RDBMS process

But I'm talking about not having to use the RDBMS process. If I have a text file on the disk, I can use a million different programs to access it. I don't have to run one program through another program just to open, read, write, and close the file with any program. Why don't we design our databases to work this way?

> Very few people use them because running in the RDBMS process means that you can break the RDBMS process in really bad ways

Yes, it does sound bad. That's why I'd prefer an indirect method rather than having to wedge access through the RDBMS

> So if you consider other processes that communicate with the database and then show views of that over other protocols, that describes most of the backend apps in the world.

Yep! We architect entire systems-of-systems just because the model for our data management in an RDBMS is too rigid. We're building spaceships to get to the grocery store because we haven't yet figured out roads and cars.


You're not the only one. There are lots of better alternatives to SQL databases for most use cases (I'm lucky enough to have worked in some places where SQL datastores were the exception rather than the rule). But it takes a long time for cultural change to happen.


Would you mind mentioning some good options? I've always been interested in databases, but find it hard to know which ones to learn more about and when they'd actually be worth investing in (especially since it's hard to build knowledge from toy projects).


Honestly all the popular datastores are fine, though they all have their own foibles. Redis is fine if you just want a basic key-value store that's not HA (and not particularly durable). Cassandra is fine if you want HA (Riak had a better design, but is so much less popular that I probably wouldn't use it these days). Kafka is really good but a much bigger conceptual leap. Even MongoDB is pretty much fine - there are plenty of horror stories about data loss in earlier versions, but it pretty much works most of the time, which is the best you can ever really say about any datastore IME (even PostgreSQL).


Putting data into cold storage, spinning up multiple flexible access points with different datasets... Sounds like what Snowflake is doing, right? I don't really use it, but it looks neat from the outside. May be nice to bring some of that to OLTP.


Life is about tradeoffs. Complexity, latency, cost and so on. Things in general are much harder to implement correctly (see Jepsen tests) than to talk about in broad terms.


Not sure why you are downvoted, you made a lot of very valid points and I agree.

People get very comfortable very quickly, even tech savvy folks. Having to learn another language will scare many away, even though the effort might be the same - it's perceived as harder.



