
This is very cool. Simple enough to reason about the invariants easily. I guess one of the key insights is that each data item has a canonical server owner, which enforces consistency of writes to that item in a single place.

I have one question regarding the determination of the latest version of a set of peer data when overlapping transactions occurred.

Suppose initially s1/x=0 at server s1 and s2/y=0 at server s2. Client1 updates s1/x=10 and s2/y=10 at transaction v1, and at the same time client2 updates s1/x=20 and s2/y=20 at v2. Suppose the clients contact the servers in different orders and the update messages arrive at the servers in reverse order, such that s1's pending queue is [x.v1=10, x.v2=20] and s2's pending queue is [y.v2=20, y.v1=10].

After client1 and client2 send the make-good command, s1's good queue is [x.v1=10, x.v2=20] and s2's good queue is [y.v2=20, y.v1=10]. When a third client3 tries to read the latest value of x or y, what is the latest value of its peer data? It looks like, depending on which data item client3 starts with, it would get a different version of the peer data. E.g., starting with s1/x it gets s1/x.v2 == 20 and s2/y.v2 == 20, while starting with s2/y it gets s2/y.v1 == 10 and s1/x.v1 == 10.

Am I missing something, or is this the intended semantics for determining the latest values of peer data?




Post author (not OP) here. Thanks for the feedback!

> I guess one of the key insights is that each data has a canonical server owner which enforces the consistency of the writes of the data at a single place.

Well, this is the way I presented it, because it's easiest to understand. But, if you want a replicated system that provides HA, there are alternatives (http://www.bailis.org/blog/non-blocking-transactional-atomic...).

> When a third client3 tries to read the latest value of x or y, what is the latest value of its peer data? It looks like depending which data client3 starts with, it would get a different version of the peer data? ... Am I missing something or this is the semantic in determining the latest values of peer data?

Good question! This ultimately comes down to how you want to handle concurrent writes. Many distributed databases use a "last writer wins" strategy when reconciling concurrent updates (that is, the correct behavior is specified to be that the database serves the highest-timestamped version of a given data item). Now, in your example, the clients both started their writes at the same (real-world) time and used this time as a basis for their timestamp, so the "last" write is undefined. We need a way to break the "last writer wins" tie. In practice, this can be something like the client ID appended to the last few bits of the timestamp or even a hash of the value written--as long as the tie-breaker is deterministic (that is, different replicas don't decide different "winners"), it doesn't really matter which is chosen.

In practice, to avoid storing every value ever written, you'd want to provide some kind of "merge" function for multiple writes, and, in our implementation and often in practice, this is last writer wins (plus some deterministic tie-breaker for identically timestamped but distinct writes).
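A minimal sketch of such a deterministic tie-breaker (the tuple layout and names here are my own illustration, not the post's implementation):

```python
# Sketch of last-writer-wins with a deterministic tie-breaker.
# A write is a (timestamp, client_id, value) tuple; the client ID acts
# as the hypothetical tie-breaker "appended" after the timestamp, so
# every replica picks the same winner for identically timestamped writes.

def lww_winner(writes):
    """Return the value of the winning write under last-writer-wins."""
    # max() over (timestamp, client_id) is deterministic: replicas that
    # see the same set of writes always agree on the winner, regardless
    # of the order in which the writes arrived.
    ts, client_id, value = max(writes, key=lambda w: (w[0], w[1]))
    return value

# Two concurrent writes with identical timestamps:
writes = [(100, "client1", 10), (100, "client2", 20)]
print(lww_winner(writes))  # client2 wins the tie -> 20
```

The key property is that the result depends only on the set of writes, not on arrival order, so s1 and s2 in the example above would converge on the same version.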


It took me time, but I understood why the suggested race condition with two or more clients can't happen. But this requires that pending information is not "overwritten" and is preserved until becoming good. The x=10 is then never overwritten by the x=20, and the metadata allows distinguishing the two possible values of x. If a third client gets a good x=10, it will get the y value with the same metadata, which is 10, not 20 or 0.

The only thing that bugs me a bit in this algorithm is that the transaction time sequence is not guaranteed or preserved. Some clients may see x=y=10 and others x=y=20 during the transaction period. This happens if one client starts writing x=y=10 at s1 and the other y=x=20 at s2. Clients requesting x and y starting with s1 will get x=y=10, and those requesting x and y starting with s2 will get x=y=20. When the transactions complete, the values finally stored as good in x and y may be 10 or 20.

So the algorithm ensures consistency, which is the most important and useful property, but with many concurrent write transactions, the final DB content may be somewhat unpredictable.

This can be a problem when one needs to do operations like x-=1 and y-=1 on the database, as for seat reservations on a trip with two flights, for instance. How would this be done?


Good questions.

In the example configuration in the post (i.e., two servers with no replication), during the period between the start of a write and the end of its second phase, reads can return either value written (x=y=0 or x=y=1). Suppose we want to enforce the property that once one read returns the second write, all subsequent reads (that begin after this read) will return the second write (or a later write); this is called linearizability (http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf). Then, whenever servers serve from 'pending', they should move the writes to 'good'. This is safe because, if a client reads from 'pending', the write must be in 'good' elsewhere and is therefore stable.
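A toy sketch of that promotion rule (my own simplification, assuming the reader has already established that the pending write is stable elsewhere; the class and method names are hypothetical):

```python
# Sketch: a server that promotes a write from 'pending' to 'good' when a
# read observes it.  If a client legitimately reads a pending write, that
# write must already be 'good' on some other server, so promoting it
# locally is safe and makes reads linearizable: once the new value has
# been served, no later read can regress to the old one.

class Server:
    def __init__(self):
        self.good = {}     # key -> (version, value): committed writes
        self.pending = {}  # key -> (version, value): in-flight writes

    def write_pending(self, key, version, value):
        self.pending[key] = (version, value)

    def make_good(self, key, version):
        if key in self.pending and self.pending[key][0] == version:
            self.good[key] = self.pending.pop(key)

    def read_newest(self, key):
        # Serve the pending write if it is newer, and promote it to
        # 'good' so no subsequent read returns the older value.
        if key in self.pending and (key not in self.good
                or self.pending[key][0] > self.good[key][0]):
            self.good[key] = self.pending.pop(key)
        return self.good[key]

s = Server()
s.write_pending("x", 1, 10)
s.make_good("x", 1)
s.write_pending("x", 2, 20)
print(s.read_newest("x"))  # serves (2, 20) and promotes it to 'good'
print(s.read_newest("x"))  # still (2, 20): the read cannot regress
```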

As for how to order the writes, real-time clocks can provide a fairly useful timestamp mechanism. Databases like Cassandra use this real-time ordering for timestamping, which appears to work well in practice. Alternatively, you could use a distributed sequence number generator to totally order transactions. But real-time should be fine.

As you point out, this atomicity property doesn't address (all) application-level integrity constraints; it doesn't handle isolation between transactions. As I discussed in the post and in another comment (https://news.ycombinator.com/item?id=5784030), if your application-level integrity constraints are such that your updates are not commutative (that is, concurrent updates should not be allowed), then you'll need to block in order to guarantee database integrity is not violated. This is separate from atomicity, but it is important to remember. In your example above, two writers might both simultaneously reserve the last seat on a plane unless they synchronize.
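One common way to get that synchronization for a single item is a conditional (compare-and-swap style) update, so a decrement succeeds only against the version it read. A sketch under that assumption (the API below is hypothetical, not from the post, and ignores the two-flight atomicity question):

```python
# Sketch: conditional decrement for non-commutative updates such as
# reserving the last seat.  An update succeeds only if the value has not
# changed since it was read, so two concurrent reservations cannot both
# take the last seat; the loser retries (i.e., effectively blocks).

class Row:
    def __init__(self, value):
        self.version, self.value = 0, value

    def read(self):
        return self.version, self.value

    def compare_and_set(self, expected_version, new_value):
        if self.version != expected_version:
            return False          # someone else wrote first: reject
        self.version += 1
        self.value = new_value
        return True

def reserve_seat(row):
    while True:
        version, seats = row.read()
        if seats == 0:
            return False          # sold out
        if row.compare_and_set(version, seats - 1):
            return True           # our decrement won

flight = Row(1)                   # one seat left
print(reserve_seat(flight))       # True: seat taken
print(reserve_seat(flight))       # False: sold out
```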

Effectively, with non-commutative updates, you need greater isolation than is provided by the algorithm in the post (which effectively provides Read Committed isolation). Achieving greater isolation is possible but you'll lose the non-blocking property (again, due to the requirement to avoid concurrent updates via higher isolation like serializability rather than due to atomicity). But, for many applications like 2i and the multi-puts I mentioned at Facebook and Twitter, updates are commutative.


That was the question I was going to ask. I don't understand any possible means by which the system can resolve this issue. Traditional database systems always have one canonical value of row X at any point in time. The value can change across transactions, but there can never be multiple values of X at the same time. What happens as soon as client3 comes along and wants to read row X? Making a decision is akin to choosing which transaction you will accept as valid and which you will reject!

The paper describes this scheme as Read Committed, which doesn't generally make sense except in the context of a database with secondary indexes.


Post author here. Interesting take, but I'm not sure I agree, or perhaps I misunderstand.

In the initial example, I represented 'good' as a set for ease of understanding, but, in practice, unless a client specifically requests an older version of a data item, the system serves the latest value written to 'good'. That is, the system does not expose a read() that returns multiple values. Rather, clients can read_good(key) or read_by_version(key, timestamp), both of which return a single version/write.
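Concretely, the two read paths might look something like this (a sketch with hypothetical internals; only the read_good/read_by_version names come from the comment above):

```python
# Sketch: 'good' stores every committed version of a key, but neither
# read path ever returns more than one value.  read_good returns the
# latest committed write; read_by_version returns the newest write at
# or before a requested timestamp.

class VersionedStore:
    def __init__(self):
        self.good = {}  # key -> sorted list of (timestamp, value)

    def commit(self, key, timestamp, value):
        self.good.setdefault(key, []).append((timestamp, value))
        self.good[key].sort()

    def read_good(self, key):
        return self.good[key][-1]   # single latest committed write

    def read_by_version(self, key, timestamp):
        candidates = [w for w in self.good[key] if w[0] <= timestamp]
        return candidates[-1]       # newest write not after 'timestamp'

store = VersionedStore()
store.commit("x", 1, 10)
store.commit("x", 2, 20)
print(store.read_good("x"))           # (2, 20)
print(store.read_by_version("x", 1))  # (1, 10)
```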

This is different from deciding which "transaction you will accept as valid and which you will reject." Many database systems perform in-place updates, but they must either 1.) choose a winner across multiple writes (as I described below, distributed databases often employ what's called "last writer wins") or 2.) abort all but one of the conflicting writes. However, a large class of database systems (e.g., Oracle, Postgres) employ what's called multi-version concurrency control (MVCC), whereby the database stores multiple versions of each data item. The system has a total commit order on transactions, which determines what version a transaction should read() from the database. But, say, in Oracle, if:

1.) I start a transaction

2.) You start a transaction

3.) You modify variable X

4.) You commit

5.) I read X

Under what's known as Snapshot Isolation, I will read X as of the start of my transaction (i.e., I will not read your write to X even though it's "present" in the database). This is often accomplished via MVCC techniques.
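The five steps above can be sketched with a toy MVCC store (my own simplification for illustration, not Oracle's actual implementation):

```python
# Toy MVCC sketch of Snapshot Isolation: each transaction reads the
# newest version committed at or before the time it started, so a write
# committed after my transaction began stays invisible to me.

class MVCCStore:
    def __init__(self):
        self.clock = 0
        self.versions = {}  # key -> list of (commit_time, value)

    def begin(self):
        self.clock += 1
        return self.clock   # the transaction's snapshot timestamp

    def commit_write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))

    def read(self, key, snapshot):
        # Only versions committed at or before the snapshot are visible.
        visible = [v for t, v in self.versions.get(key, []) if t <= snapshot]
        return visible[-1] if visible else None

db = MVCCStore()
db.commit_write("X", "old")
mine = db.begin()             # 1.) I start a transaction
yours = db.begin()            # 2.) You start a transaction
db.commit_write("X", "new")   # 3.) + 4.) You modify X and commit
print(db.read("X", mine))     # 5.) I still read "old": your commit
                              #     happened after my snapshot
```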

> The paper describes this scheme as READ Committed which doesn't make generally make sense except in the context of a database with secondary indexes.

I tend to disagree. This is probably another conversation, but databases rarely guarantee serializable isolation (see http://www.bailis.org/blog/when-is-acid-acid-rarely/#acidtab...), and Read Committed is a fairly commonly deployed model. It's true that serializability is often required for correct operation. But, perhaps interestingly, many databases like Oracle 11g and SAP HANA do not provide it as an option (largely due to poor performance and deadlock avoidance), and, anecdotally, models like Read Committed are 2-3x faster than serializability.

I'm not entirely sure what you mean by applicability to secondary indexing (rather, I think there are other use cases, though I'm excited about 2i applications). However, I'm genuinely curious if I'm missing something.



