This is great! Given that SQLite is generally used for small projects and not in any sort of cluster or HA environment, pushing this kind of work into the database (the if/then/else block that you don't have to write in your code anymore) makes a ton of sense.
I actually like this feature more in SQLite than in Postgres. In Postgres I worry about pushing that kind of work into the database in clustered situations, as there are places where it can go wrong and it would be more likely that the code wouldn't be able to handle that failure properly (as opposed to having to write the if/then/else yourself, which if you're a good coder would also involve some try/catch and proper fallback behavior in case of an error).
Before upsert, how would you handle the examples in the article? Like, how would you "insert Alice's phone number unless it's already there, in which case update it"?
Edit: Downthread it was pointed out that you could just run an update and insert on every request and let one fail, which seems like a fair tradeoff for the extra network round trips.
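For concreteness, the update-plus-guarded-insert pattern being discussed looks roughly like this (the phonebook table and column names are made up here to match the article's example):
-- hypothetical schema: phonebook(name TEXT PRIMARY KEY, phone TEXT)
UPDATE phonebook SET phone = '555-0001' WHERE name = 'Alice';
INSERT INTO phonebook (name, phone)
SELECT 'Alice', '555-0001'
WHERE NOT EXISTS (SELECT 1 FROM phonebook WHERE name = 'Alice');
Both statements are run every time; at most one of them actually changes a row.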
Except that that's not concurrency safe in a transaction, because you might not yet see the row inside the NOT EXISTS() subquery if it hasn't yet committed.
> The update isn't inserting anything. Either the row exists already, or it doesn't. Maybe I'm missing something?
Rows can exist without you being able to see them.
Consider:
S1: BEGIN;
S2: BEGIN;
S1: INSERT name = 'Alice'
S2: UPDATE WHERE name = 'Alice' -> no row matches
S2: INSERT name = 'Alice' WHERE NOT EXISTS (name = 'Alice')
S2: <blocks, due to potential uniqueness violation>
S1: COMMIT;
S2: <resumes, raises uniqueness violation>
> If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row.
My understanding is that the update from S2 would be re-run and applied.
Specifically postgres calls out:
> The search condition of the command (the WHERE clause) is re-evaluated to see if the updated version of the row still matches the search condition.
> I haven't tested but I don't think that will happen on the default settings for postgres. I know for a fact that it won't on higher isolation levels.
Unfortunately you're wrong. I've now tested it, but I was also pretty confident before - I'm a postgres developer, and I worked on/committed the PG upsert implementation ;)
postgres[10287][1]=# CREATE TABLE data(key text unique);
CREATE TABLE
postgres[10287][1]=# BEGIN ISOLATION LEVEL SERIALIZABLE ;
BEGIN
postgres[10284][1]=# BEGIN ISOLATION LEVEL SERIALIZABLE ;
BEGIN
postgres[10287][1]*=# INSERT INTO data VALUES('alice');
INSERT 0 1
postgres[10284][1]*=# INSERT INTO data SELECT 'alice' WHERE NOT EXISTS(SELECT * FROM data WHERE key = 'alice');
postgres[10287][1]*=# COMMIT;
COMMIT
postgres[10284][1]*=#
ERROR: 23505: duplicate key value violates unique constraint "data_key_key"
DETAIL: Key (key)=(alice) already exists.
SCHEMA NAME: public
TABLE NAME: data
CONSTRAINT NAME: data_key_key
LOCATION: _bt_check_unique, nbtinsert.c:535
(the number in brackets in the prompt is the backend pid, allowing you to differentiate the two sessions).
> > If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row.
> My understanding is that the update from S2 would be re-run and applied.
> Specifically postgres calls out:
> > The search condition of the command (the WHERE clause) is re-evaluated to see if the updated version of the row still matches the search condition.
Those comments are about row-level locks - they're not the problem here. What you get is a constraint violation due to the unique constraint.
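For contrast, here is how the same interleaving plays out with ON CONFLICT, sketched in the S1/S2 style from earlier rather than as a captured session, and assuming the default READ COMMITTED isolation level, where the documented ON CONFLICT DO UPDATE guarantee is an atomic insert-or-update outcome (the SERIALIZABLE sessions above would need separate treatment):
S1: BEGIN;
S2: BEGIN;
S1: INSERT INTO data VALUES ('alice');
S2: INSERT INTO data VALUES ('alice') ON CONFLICT (key) DO NOTHING;
S2: <blocks, waiting on S1's pending insert>
S1: COMMIT;
S2: <resumes; the conflict action runs (DO NOTHING skips the row; DO UPDATE would update it) instead of raising a unique violation>
S2: COMMIT;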
Unique constraints have nothing to do with transactions. There's no promise that all columns that you run a transaction on will have a unique constraint.
Try the same thing with no constraint. You should see the transaction fail but not because of a unique constraint. Otherwise there's something incorrect.
No. ACID doesn't guarantee that you can't have constraint violations triggered by a concurrent session. That'd make any sort of efficient constraint pretty much impossible. It does guarantee, however, that the error is caught despite the concurrency.
Am I missing something? Wikipedia[1] says: "The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed sequentially." Oh, could the exception here be caused by the raw insert (without the where clause)? If you retried your example where both inserts were: INSERT INTO data SELECT 'alice' WHERE NOT EXISTS(SELECT * FROM data WHERE key = 'alice'); would that still cause the constraint error? If so, isn't that not equivalent to running the two queries sequentially, hence violating the isolation property? Have I misunderstood something?
In the master branch there is no file "Alice". Two branches get created. One of them runs sed via find on all files with the name "Alice" and then it creates a file called "Alice" if find doesn't find any files called Alice.
The second branch creates a file called "Alice".
Both want to merge now but unfortunately there is a merge conflict. In this case automatic conflict resolution is impossible so one of the branches has to be discarded (transaction rollback). You can now "rerun" the branch and everything should work on the next merge.
Conclusion: "update + insert where" is not equivalent to "upsert". The former can cause a transaction to fail and must be retried on failure. With "upsert" the database engine has enough information to handle the unique constraint.
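For comparison, the upsert itself is a single statement; the syntax below is SQLite's (3.24+) and also works in Postgres (9.5+), with a phone-book schema that's again just an illustration:
-- hypothetical schema: phonebook(name TEXT PRIMARY KEY, phone TEXT)
INSERT INTO phonebook (name, phone)
VALUES ('Alice', '555-0001')
ON CONFLICT (name) DO UPDATE SET phone = excluded.phone;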
Awesome! Thanks for the explanation! Sorry for the additional questions, but if we "rerun" the branch, shouldn't everything work as expected? (And in parallel, the answer to my above question is: yes, the raw insert is being rerun (after the insert where) which is causing the constraint violation; is that right?)
What you're not supposed to allow is letting both succeed, because then the clients would assume everything went fine. If one errored before the commit, then they didn't both execute successfully, and you're fine. Pretty much every approach to isolation has a good chunk of errors that aren't strictly necessary (deadlocks, spurious serialization errors, ...).
I believe the issue is a concurrent transaction (i.e. a separate one from the one you're currently in) may perform the INSERT and COMMIT that insert after your transaction's SELECT subquery returns nothing (thus allowing INSERT to proceed, by the WHERE clause), but before the actual INSERT itself is committed.
This means there is a window between the sub-SELECT and the INSERT itself where a secondary transaction can insert something, causing both your original update to do nothing, and the insert to fail. Thus you've lost the write entirely.
In that case the other effort (I'm reluctant to call it a transaction) wins. Maybe it was later and should win, maybe it was earlier and shouldn't. It takes a lot more care than seen here to preserve a meaning of "earlier" and "later" that isn't vulnerable to this issue. The way the schema is designed, there can be only one "Alice". I'm trying and failing to imagine a way in which both this effort and the other effort can fail.
I think the particular point made by Andres (and elsewhere in this thread) is really not about what the schema for a phone book is like, and more that "Accurately replicating what UPSERT does without it is exceedingly hard and you will probably get it wrong or suboptimal in a number of scenarios you can't possibly imagine until they happen".
Isolation levels will for example have impact on how transactions proceed and what results you see (REPEATABLE READ vs READ COMMITTED); the error handling (did your transaction actually fail from a concurrent INSERT succeeding, or did the update fail, or did something else happen entirely? UPSERT is often not a standalone operation in a transaction) must be carefully decided for each use-case or else you're likely to redo needless transactions or throw away good ones, and you have to work around incidental problems like latency using stored procedures to keep everything server-side, etc etc.
It's just not the kind of thing you want to replicate a number of places at each use site in a number of tools. It's something only the database knows how to manage properly by having "global knowledge".
Exactly this. It's so hard that it took PostgreSQL almost two decades to finally implement that feature and they had to support a lot of internal features first, like speculative insertion. If you want to do UPSERT 100% correctly and ACID, the list of edge cases just grows and grows.
I'm not 100% familiar with the system, but depending on the transaction levels available you could wrap it in, say, a serializable transaction, then select the rows (just to lock them), then do the update, then do the insert, then commit the transaction.
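Roughly, in Postgres syntax and using the same hypothetical phonebook table as the sketches above, something like this (just a sketch of that idea; note that SELECT ... FOR UPDATE only locks rows that already exist, so on its own it doesn't cover the case where the row is missing):
BEGIN ISOLATION LEVEL SERIALIZABLE;
-- lock the row if it's already there
SELECT * FROM phonebook WHERE name = 'Alice' FOR UPDATE;
UPDATE phonebook SET phone = '555-0001' WHERE name = 'Alice';
INSERT INTO phonebook (name, phone)
SELECT 'Alice', '555-0001'
WHERE NOT EXISTS (SELECT 1 FROM phonebook WHERE name = 'Alice');
COMMIT;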
But we are talking about sqlite, right? By default the database is locked once a write occurs, so there could not be concurrent transactions writing to the database.
Yes, but did you just do an update and an insert every time, or did you have some code that checked the return values and skipped the extra unnecessary command?
They are run together every time. You could run a check but it would probably cost as much as the update, which does nothing when a record is not found.
In PostgreSQL (for example) you could use a CTE and the RETURNING clause to prevent two statements from having to run. I don't think sqlite supports that. The current version of PostgreSQL makes that unnecessary, as it has "upsert" as well.
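A sketch of that writable-CTE pattern (Postgres syntax, hypothetical phone-book names; the guarded INSERT still has the same race as the two-statement version, it just saves a round trip):
WITH updated AS (
    UPDATE phonebook
       SET phone = '555-0001'
     WHERE name = 'Alice'
 RETURNING name
)
INSERT INTO phonebook (name, phone)
SELECT 'Alice', '555-0001'
WHERE NOT EXISTS (SELECT 1 FROM updated);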
> You could run a check but it would probably cost as much as the update, which does nothing when a record is not found.
True, as long as generating the insert isn't itself a heavier burden on the caller side. Otherwise: start a txn, do the first, check if 0 rows were updated, then do the second.
You've chosen to trade off network round trips for extra work in the database and slightly longer row locks while both commands run, which seems like a fair tradeoff.
Well, even if you have an if after your update, that's not what he basically meant. What he meant is that some people would have written the if/else in SQL rather than in the language they were writing in?
You could also COUNT on a query/filter to see if it exists by row count, then INSERT or UPDATE based on whether there is an existing item for that filter/query. That is still two calls and can be safer; it can also be done in a transaction.
Lots of app-level code in ORMs that use ADD/EDIT or CREATE/UPDATE actions encapsulates this with just a SET, which counts the rows and then INSERTs if zero and UPDATEs if > 0.
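In rough SQL terms (the branch itself lives in the application/ORM layer, and the phone-book names are hypothetical):
-- step 1: does the row exist?
SELECT COUNT(*) FROM phonebook WHERE name = 'Alice';
-- step 2, decided in application/ORM code:
--   count = 0  ->  INSERT INTO phonebook (name, phone) VALUES ('Alice', '555-0001');
--   count > 0  ->  UPDATE phonebook SET phone = '555-0001' WHERE name = 'Alice';
Without a transaction (and adequate locking) this check-then-act sequence has the same window the rest of the thread is about.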
I believe you need to repeat that update one more time, in case the row wasn’t present during the first update, but was created by another session immediately before the insert. Unless you’re in a transaction—then you’d be fine as posted above.
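I.e. the non-transactional sequence becomes (sketched against the same hypothetical phone-book table):
UPDATE phonebook SET phone = '555-0001' WHERE name = 'Alice';
INSERT INTO phonebook (name, phone)
SELECT 'Alice', '555-0001'
WHERE NOT EXISTS (SELECT 1 FROM phonebook WHERE name = 'Alice');
-- repeat the update, for the case where another session inserted 'Alice'
-- between the first UPDATE and the guarded INSERT (making the INSERT a no-op)
UPDATE phonebook SET phone = '555-0001' WHERE name = 'Alice';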
I would say in this day of high scalability and microservices, a good engineer should be skilled at dealing with these types of errors, as they would be one of the most common types.
Edit: I find it interesting that this comment has fluctuated between -2 and +2 points. Apparently saying that engineers should be able to handle concurrency errors is controversial?
I bet that at least 9/10 experienced engineers in the space will come up with a solution that has at least one of a) incorrect behaviour under concurrency b) unnecessary deadlocks c) extreme slowness d) breakage with triggers, row level security etc.
> Apparently saying that engineers should be able to handle concurrency errors is controversial?
I didn't vote either way. But perhaps it's less the fact that they should be able to handle concurrency, and more that replicating the necessary non-trivial logic in several applications isn't a good plan.
I greatly enjoyed the addition of upsert to Postgres, but not in any high availability situation. Could you elaborate on the ways the failure modes of UPSERT aren't sufficient for HA?