Hacker News new | past | comments | ask | show | jobs | submit login
Postgres 15 Merge Command with Examples (crunchydata.com)
145 points by winslett on Oct 17, 2022 | hide | past | favorite | 27 comments



This will make complex upserts so much more simple, fantastic addition.

I really hope `RETURNING` support gets added to `MERGE` asap though (I believe it's been noted as a fairly trivial addition to come in future), then it'll be super powerful for doing bulk upserts that require post-processing.


returning is the best.


> prior to Postgres 15, you could use the "upsert" method with the ON CONFLICTclause.

> Now, MERGE can be used instead!

No mention of deadlocks in the article has me worried about thoroughness of the analysis.


Agreed - I think MERGE is great, but this post is missing the "Why not" part of the analysis.

In the Postgres community, MERGE has been talked about for a long time, but in my understanding, part of the reason why the Postgres team initially shipped INSERT ... ON CONFLICT (instead of straight up MERGE) is that it lets you have guarantees about the outcome of the statement (i.e. either INSERT or UPDATE, by use of speculative insertion handling), vs MERGE can cause unique constraint violations and other issues.

AFAIK, the generic syntax of MERGE does not allow for stricter guarantees, and therefore there will always be cases where one is better than the other.


This sounds really cool; maybe slightly cleaner than an INSERT ... ON CONFLICT statement. However, is that the only advantage?


Great to see!

My biggest gripe with ON CONFLICT upserts are the IDs (sequences) having gaps in them. Any good ways to prevent that?


This is my gripe as well. We used `ON CONFLICT ... DO NOTHING` for a table with ~10k entries that are frequently upserted.

By the time we reached a few thousand entries, we had primary keys in the millions. I personally don't care that there are gaps in the sequence, but gaps of hundreds of thousands definitely leaves a lot to he desired. I think you can circumvent this behaviour by changing from `DO NOTHING` to `DO UPDATE` and doing a dummy write, but that too leaves much to be desired.

I also discovered that `ON CONFLICT` doesn't really work with a high number of concurrent writes. We had to implement our own upsert logic using advisory locks.


If you're inserting a lot of data with a high likelihood of conflicts you're probably better off using `INSERT ... SELECT .. WHERE NOT EXISTS` instead, that way you can have gapless sequences.

    -- Will leave gaps
    INSERT INTO t (val) VALUES ('abc'), ('def') ON CONFLICT DO NOTHING;

    -- Won't leave gaps
    INSERT INTO t (val)
    SELECT * FROM (VALUES ('abc'), ('def')) AS tmp (val)
    WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.val = tmp.val);
https://www.db-fiddle.com/f/6s4uYA987owJxu3CEvfu8t/0


It's been a while but I'm pretty sure that's exactly what we ended up doing. The advisory lock was because downstream logic required returning the primary key — basically https://en.m.wikipedia.org/wiki/Double-checked_locking.

Thanks for sharing.


One of the weirdest outages I've ever had to debug was when one of our tables with millions of rows (but in the low millions, at least so we thought) overran the maximum value for a 32 bit signed integer, due to it being primarily updated with INSERT ON CONFLICT on a nightly basis. At the time we had no idea why the primary keys were climbing so high! That was a tough one to figure out.


re concurrent writes: The issue we ran into was deadlocks when we upserted batches of records within a transaction. E.g. transaction A updates row 1 then row 2 and transaction B updates row 2 then row 1. If you sort so that you update rows in the same order across all transactions, then it won't deadlock. E.g. sort records by primary key and then upsert. If you can't do that, then advisory locks seems like a good idea.


Don’t rely on sequences having no gaps?

Why do you even need that?


If they're not accounting for it, integer overflow could be a real problem. (And it's not clear they would account for it in a table that normally has 10s of thousands of entries.) Someone else mentioned it here too: https://news.ycombinator.com/item?id=33242669


bigint is the default data type for sequences and it's highly unlikely that somebody would reach an overflow with it.

I've worked on various database solutions, both rdbms and analytical and I find sequences to be one of the most misunderstood features in the industry. The only guarantee they make is that they generate unique values. Some of the newer distributed rdbms don't even guarantee they'll be monotonic.

Relying on them generating consecutive values is a sure way to get vendor lock-in to whatever database has made that guarantee.


The problem isn't that values aren't consecutive, it's that the sequence is exhausted quicker than you'd expect. Someone could define a PK as `int4 GENERATED ALWAYS AS IDENTITY` because their expected number of entries is below the cap, only for that cap to be exhausted due to the behavior (in fact, that's been mentioned a few times).

It 'violates' the principle of least surprise. Intuitively you'd expect `ON CONFLICT .... DO NOTHING` to do nothing, but it will increment the PK every time.


I wouldn’t be worried about the sequence itself running out, but the column it’s stored in in the table may not be large enough to handle those numbers.


I don’t rely on them having no gaps. If the sequence is an integer, it’s much quicker to wrap if the table gets a lot of upserts.


for i smell jank code


Based on some quick testing MERGE does not appear to cause id gaps, though that does make me wonder if it's able to do so by providing less strict guarantees than ON CONFLICT.


This only seems to be an issue from an ideological point of view and not a technical one. On small projects it bothered me that I can't immediately see if something is an old entry or a new one just by glancing at the ID.

In any bigger project it's just noise and IDs are not helpful information anyway, not even at a glance. It's better to just ignore them, use bigint and move on.


In SqlServer you have to use MERGE with HOLDLOCK to avoid issues, is it possible in Postgres? Is it so by default?


This is one of my favorite features in snowflake (I’m assuming they didn’t invent it) and am happy to see it in Postgres now.


Awesome. How the hell did it take 20 years to finally land?


> How the hell did it take 20 years to finally land?

Perhaps because people working for free can decide what they want to work on?

I think merge is cool, but it also easily replicated with what most of us do now for upserts in postgres using ON CONFLICT.


> people working for free

I don’t think that is an accurate characterisation of the Postgres core team


Lots of people are paid to work on Postgres.


Because it's very easy to get things wrong or cause foot guns for the user. The Postgres core team tends to err on the side of delay (or outright rejection) rather than release a feature that could lead to data loss.

https://www.mssqltips.com/sqlservertip/3074/use-caution-with...




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: