This is great! Given that SQLite is generally used for small projects and not in any sort of cluster or HA environment, pushing this kind of work into the database (the if/then/else block that you don't have to write in your code anymore) makes a ton of sense.
I actually like this feature more in SQLite than in Postgres. In Postgres I worry about pushing that kind of work into the database in clustered situations, as there are places where it can go wrong and it would be more likely that the code wouldn't be able to handle that failure properly (as opposed to having to write the if/then/else yourself, which if you're a good coder would also involve some try/catch and proper fallback behavior in case of an error).
Before upsert, how would you handle the examples in the article? Like, how would you "insert Alice's phone number unless it's already there, in which case update it"?
Edit: Downthread it was pointed out that you could just run an update and insert on every request and let one fail, which seems like a fair tradeoff for the extra network round trips.
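For concreteness, the update-plus-guarded-insert pattern being discussed looks roughly like this (the phonebook table and column names are made up here to match the article's example):
-- hypothetical schema: phonebook(name TEXT PRIMARY KEY, phone TEXT)
UPDATE phonebook SET phone = '555-0001' WHERE name = 'Alice';
INSERT INTO phonebook (name, phone)
SELECT 'Alice', '555-0001'
WHERE NOT EXISTS (SELECT 1 FROM phonebook WHERE name = 'Alice');
Both statements are run every time; at most one of them actually changes a row.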
Except that that's not concurrency safe in a transaction, because you might not yet see the row inside the NOT EXISTS() subquery if it hasn't yet committed.
> The update isn't inserting anything. Either the row exists already, or it doesn't. Maybe I'm missing something?
Rows can exist without you being able to see them.
Consider:
S1: BEGIN;
S2: BEGIN;
S1: INSERT name = 'Alice'
S2: UPDATE WHERE name = 'Alice' -> no row matches
S2: INSERT name = 'Alice' WHERE NOT EXISTS (name = 'Alice')
S2: <blocks, due to potential uniqueness violation>
S1: COMMIT;
S2: <resumes, raises uniqueness violation>
> If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row.
My understanding is that the update from S2 would be re-run and applied.
Specifically postgres calls out:
> The search condition of the command (the WHERE clause) is re-evaluated to see if the updated version of the row still matches the search condition.
> I haven't tested but I don't think that will happen on the default settings for postgres. I know for a fact that it won't on higher isolation levels.
Unfortunately you're wrong. I've now tested it, but I was also pretty confident before - I'm a postgres developer, and I worked on/committed the PG upsert implementation ;)
postgres[10287][1]=# CREATE TABLE data(key text unique);
CREATE TABLE
postgres[10287][1]=# BEGIN ISOLATION LEVEL SERIALIZABLE ;
BEGIN
postgres[10284][1]=# BEGIN ISOLATION LEVEL SERIALIZABLE ;
BEGIN
postgres[10287][1]*=# INSERT INTO data VALUES('alice');
INSERT 0 1
postgres[10284][1]*=# INSERT INTO data SELECT 'alice' WHERE NOT EXISTS(SELECT * FROM data WHERE key = 'alice');
postgres[10287][1]*=# COMMIT;
COMMIT
postgres[10284][1]*=#
ERROR: 23505: duplicate key value violates unique constraint "data_key_key"
DETAIL: Key (key)=(alice) already exists.
SCHEMA NAME: public
TABLE NAME: data
CONSTRAINT NAME: data_key_key
LOCATION: _bt_check_unique, nbtinsert.c:535
(the number in brackets in the prompt is the backend pid, allowing you to differentiate the two sessions).
> > If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row.
> My understanding is that the update from S2 would be re-run and applied.
> Specifically postgres calls out:
> > The search condition of the command (the WHERE clause) is re-evaluated to see if the updated version of the row still matches the search condition.
Those comments are about row-level locks - they're not the problem here. What you get is a constraint violation due to the unique constraint.
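For contrast, here is how the same interleaving plays out with ON CONFLICT, sketched in the S1/S2 style from earlier rather than as a captured session, and assuming the default READ COMMITTED isolation level, where the documented ON CONFLICT DO UPDATE guarantee is an atomic insert-or-update outcome (the SERIALIZABLE sessions above would need separate treatment):
S1: BEGIN;
S2: BEGIN;
S1: INSERT INTO data VALUES ('alice');
S2: INSERT INTO data VALUES ('alice') ON CONFLICT (key) DO NOTHING;
S2: <blocks, waiting on S1's pending insert>
S1: COMMIT;
S2: <resumes; the conflict action runs (DO NOTHING skips the row; DO UPDATE would update it) instead of raising a unique violation>
S2: COMMIT;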
Unique constraints have nothing to do with transactions. There's no promise that all columns that you run a transaction on will have a unique constraint.
Try the same thing with no constraint. You should see the transaction fail but not because of a unique constraint. Otherwise there's something incorrect.
No. ACID doesn't guarantee that you can't have constraint violations triggered by a concurrent session. That'd make any sort of efficient constraint pretty much impossible. It does guarantee, however, that the error is caught despite the concurrency.
Am I missing something? Wikipedia[1] says: "The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed sequentially." Oh, could the exception here be caused by the raw insert (without the where clause)? If you retried your example where both inserts were: INSERT INTO data SELECT 'alice' WHERE NOT EXISTS(SELECT * FROM data WHERE key = 'alice'); would that still cause the constraint error? If so, isn't that not equivalent to running the two queries sequentially, hence violating the isolation property? Have I misunderstood something?
In the master branch there is no file "Alice". Two branches get created. One of them runs sed via find on all files with the name "Alice" and then it creates a file called "Alice" if find doesn't find any files called Alice.
The second branch creates a file called "Alice".
Both want to merge now but unfortunately there is a merge conflict. In this case automatic conflict resolution is impossible so one of the branches has to be discarded (transaction rollback). You can now "rerun" the branch and everything should work on the next merge.
Conclusion: "update + insert where" is not equivalent to "upsert". The former can cause a transaction to fail and must be retried on failure. With "upsert" the database engine has enough information to handle the unique constraint.
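For comparison, the upsert itself is a single statement; the syntax below is SQLite's (3.24+) and also works in Postgres (9.5+), with a phone-book schema that's again just an illustration:
-- hypothetical schema: phonebook(name TEXT PRIMARY KEY, phone TEXT)
INSERT INTO phonebook (name, phone)
VALUES ('Alice', '555-0001')
ON CONFLICT (name) DO UPDATE SET phone = excluded.phone;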
Awesome! Thanks for the explanation! Sorry for the additional questions, but if we "rerun" the branch, shouldn't everything work as expected? (And in parallel, the answer to my above question is: yes, the raw insert is being rerun (after the insert where) which is causing the constraint violation; is that right?)
What you're not supposed to allow is letting both succeed, because then the clients would assume everything went fine. If one errored before the commit, then they didn't both execute successfully, and you're fine. Pretty much every approach to isolation has a good chunk of errors that aren't strictly necessary (deadlocks, spurious serialization errors, ...).
I believe the issue is a concurrent transaction (i.e. a separate one from the one you're currently in) may perform the INSERT and COMMIT that insert after your transaction's SELECT subquery returns nothing (thus allowing INSERT to proceed, by the WHERE clause), but before the actual INSERT itself is committed.
This means there is a window between the sub-SELECT and the INSERT itself where a secondary transaction can insert something, causing both your original update to do nothing, and the insert to fail. Thus you've lost the write entirely.
In that case the other effort (I'm reluctant to call it a transaction) wins. Maybe it was later and should win, maybe it was earlier and shouldn't. It takes a lot more care than seen here to preserve a meaning of "earlier" and "later" that isn't vulnerable to this issue. The way the schema is designed, there can be only one "Alice". I'm trying and failing to imagine a way in which both this effort and the other effort can fail.
I think the particular point made by Andres (and elsewhere in this thread) is really not about what the schema for a phone book is like, and more that "Accurately replicating what UPSERT does without it is exceedingly hard and you will probably get it wrong or suboptimal in a number of scenarios you can't possibly imagine until they happen".
Isolation levels will for example have impact on how transactions proceed and what results you see (REPEATABLE READ vs READ COMMITTED); the error handling (did your transaction actually fail from a concurrent INSERT succeeding, or did the update fail, or did something else happen entirely? UPSERT is often not a standalone operation in a transaction) must be carefully decided for each use-case or else you're likely to redo needless transactions or throw away good ones, and you have to work around incidental problems like latency using stored procedures to keep everything server-side, etc etc.
It's just not the kind of thing you want to replicate a number of places at each use site in a number of tools. It's something only the database knows how to manage properly by having "global knowledge".
Exactly this. It's so hard that it took PostgreSQL almost two decades to finally implement that feature and they had to support a lot of internal features first, like speculative insertion. If you want to do UPSERT 100% correctly and ACID, the list of edge cases just grows and grows.
I'm not 100% familiar with the system, but depending on the transaction levels available you could wrap it in, say, a serializable transaction, then select the rows (just to lock them), then do the update, then do the insert, then commit the transaction.
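Roughly, in Postgres syntax and using the same hypothetical phonebook table as the sketches above, something like this (just a sketch of that idea; note that SELECT ... FOR UPDATE only locks rows that already exist, so on its own it doesn't cover the case where the row is missing):
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- lock the row if it's already there
SELECT * FROM phonebook WHERE name = 'Alice' FOR UPDATE;
UPDATE phonebook SET phone = '555-0001' WHERE name = 'Alice';
INSERT INTO phonebook (name, phone)
SELECT 'Alice', '555-0001'
WHERE NOT EXISTS (SELECT 1 FROM phonebook WHERE name = 'Alice');
COMMIT;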
But we are talking about sqlite, right? By default the database is locked once a write occurs, so there could not be concurrent transactions writing to the database.
Yes, but did you just do an update and an insert every time, or did you have some code that checked the return values and skipped the extra unnecessary command?
They are run together every time. You could run a check but it would probably cost as much as the update, which does nothing when a record is not found.
In PostgreSQL (for example) you could use a CTE and the RETURNING clause to prevent two statements from having to run. I don't think sqlite supports that. The current version of PostgreSQL makes that unnecessary, as it has "upsert" as well.
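A sketch of that writable-CTE pattern (Postgres syntax, hypothetical phone-book names; the guarded INSERT still has the same race as the two-statement version, it just saves a round trip):
WITH updated AS (
    UPDATE phonebook
       SET phone = '555-0001'
     WHERE name = 'Alice'
 RETURNING name
)
INSERT INTO phonebook (name, phone)
SELECT 'Alice', '555-0001'
WHERE NOT EXISTS (SELECT 1 FROM updated);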
> You could run a check but it would probably cost as much as the update, which does nothing when a record is not found.
True, as long as generating the insert isn't itself a heavier burden on the caller side. Otherwise: start a txn, do the first, check if 0 rows were updated, then do the second.
You've chosen to trade off network round trips for extra work in the database and slightly longer row locks while both commands run, which seems like a fair tradeoff.
Well, even if you have an if after your update, that's not what he basically meant. What he meant is that some people would have written the if/else in SQL rather than in the language they were writing in?
You could also COUNT on a query/filter to see if it exists by row count, then INSERT or UPDATE based on whether there is an existing item for that filter/query. That is still two calls and can be safer; it can also be done in a transaction.
Lots of app-level code in ORMs that use ADD/EDIT or CREATE/UPDATE actions encapsulates this with just a SET, which counts the rows and then INSERTs if zero and UPDATEs if > 0.
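In rough SQL terms (the branch itself lives in the application/ORM layer, and the phone-book names are hypothetical):
-- step 1: does the row exist?
SELECT COUNT(*) FROM phonebook WHERE name = 'Alice';
-- step 2, decided in application/ORM code:
--   count = 0  ->  INSERT INTO phonebook (name, phone) VALUES ('Alice', '555-0001');
--   count > 0  ->  UPDATE phonebook SET phone = '555-0001' WHERE name = 'Alice';
Without a transaction (and adequate locking) this check-then-act sequence has the same window the rest of the thread is about.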
I believe you need to repeat that update one more time, in case the row wasn’t present during the first update, but was created by another session immediately before the insert. Unless you’re in a transaction—then you’d be fine as posted above.
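I.e. the non-transactional sequence becomes (sketched against the same hypothetical phone-book table):
UPDATE phonebook SET phone = '555-0001' WHERE name = 'Alice';
INSERT INTO phonebook (name, phone)
SELECT 'Alice', '555-0001'
WHERE NOT EXISTS (SELECT 1 FROM phonebook WHERE name = 'Alice');
-- repeat the update, for the case where another session inserted 'Alice'
-- between the first UPDATE and the guarded INSERT (making the INSERT a no-op)
UPDATE phonebook SET phone = '555-0001' WHERE name = 'Alice';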
I would say in this day of high scalability and microservices, a good engineer should be skilled at dealing with these types of errors, as they would be one of the most common types.
Edit: I find it interesting that this comment has fluctuated between -2 and +2 points. Apparently saying that engineers should be able to handle concurrency errors is controversial?
I bet that at least 9/10 experienced engineers in the space will come up with a solution that has at least one of a) incorrect behaviour under concurrency b) unnecessary deadlocks c) extreme slowness d) breakage with triggers, row level security etc.
> Apparently saying that engineers should be able to handle concurrency errors is controversial?
I didn't vote either way. But perhaps it's less the fact that they should be able to handle concurrency, and more that replicating the necessary non-trivial logic in several applications isn't a good plan.
I greatly enjoyed the addition of upsert to Postgres, but not in any high availability situation. Could you elaborate on the ways the failure modes of UPSERT aren't sufficient for HA?