rystsov's comments | Hacker News

Author is here, AMA :)


Hi Kyle, thanks for Elle :) I want to use Elle to check long histories of transactions over a small set of keys with a read-dominant workload. The paper recommends using lists over registers, but when the history becomes long, reading the register's whole history on each request becomes too wasteful, and Elle's input becomes very large. E.g. when each read returns the whole register's history, the size of the history grows as O(n^2), compared to the case where reads return just the head.

So I'm curious: how would you describe Elle's ability to find violations using read-write registers with unique values vs. append-only lists?
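
(A back-of-the-envelope illustration of the growth, with a made-up n and assuming roughly one read per append:)

    # If every read returns the full list so far, and we do roughly one
    # read per append, total history size is 1 + 2 + ... + n = n(n+1)/2,
    # i.e. O(n^2); head-only reads keep it linear.
    n = 10_000                            # hypothetical appends to one key
    list_reads_total = n * (n + 1) // 2   # reads return the whole list
    head_reads_total = n                  # reads return just the head
    print(list_reads_total, head_reads_total)   # 50005000 vs 10000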


> E.g. when each read returns the whole register's history, the size of the history grows as O(n^2), compared to the case where reads return just the head.

If you look at Elle's transaction generators, you can cap the size of any individual key and use an uneven (e.g. exponential) distribution of key choices to get various update frequencies. That way keys stay reasonably small (I use 1-10K writes/key), some keys are updated frequently to catch race conditions, and others last hundreds of seconds to catch long-lasting errors.
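
(Not Elle's actual generator, which is Clojure; just a hypothetical Python sketch of the capping/skew idea, with made-up names and parameters:)

    import random

    MAX_WRITES_PER_KEY = 10_000   # hypothetical cap on one key's size
    writes = {}                   # key -> number of writes so far
    newest = 0                    # keys are retired by opening fresh ones

    def choose_key():
        """Bias choices toward recent keys (exponentially), so a few keys
        stay hot and catch races while older ones linger long enough to
        catch long-lasting errors; cap each key's total size."""
        global newest
        k = max(newest - int(random.expovariate(0.5)), 0)
        if writes.get(k, 0) >= MAX_WRITES_PER_KEY:
            newest += 1           # retire the full key, open a fresh one
            k = newest
        writes[k] = writes.get(k, 0) + 1
        return k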

> So I'm curious: how would you describe Elle's ability to find violations using read-write registers with unique values vs. append-only lists?

RW registers are significantly weaker, though I don't know how to quantify the difference. I've still caught errors with registers, but the grounds for inferring anomalies are a) less powerful and b) applicable only in certain circumstances; we talk about some of these details in the paper.


> Thus far, causal consistency has generally been limited to research projects ... MongoDB is one of the first commercial databases we know of which provides an implementation.

Cosmos DB has provided session consistency (which looks like another name for causal consistency) since at least 2014 [1].

Cosmos DB's session guarantees [2]: consistent prefix, monotonic reads, monotonic writes, read-your-writes, write-follows-reads.

Mongo DB's causal consistency guarantees [3]: monotonic reads, monotonic writes, read-your-writes, write-follows-reads.

I doubt that, four years later, this still qualifies as "one of the first".

[1] https://www.infoq.com/news/2014/08/microsoft-azure-documentd...

[2] https://docs.microsoft.com/en-us/azure/cosmos-db/consistency...

[3] https://docs.mongodb.com/manual/core/read-isolation-consiste...


Causal and session are definitely similar, but I'm not entirely sure if causal implies consistent prefix, and conversely, I think causal miiight have stronger implications than just the intersection of MR, MW, RYW, and WFR. Because we weren't entirely certain whether we could make that claim regarding Cosmos, we opted to be conservative.


I agree, it's hard for me too to be precise about naming in the academic sense. But the published paper "Writes: the dirty secret of causal consistency" says that both Cosmos DB and MongoDB have causal consistency, so I don't know. At least Cosmos DB and MongoDB provide the same session/causal guarantees.


...but the post didn't say it was first. Not even the part you quoted.


Well, the quote you reference says, "one of the first," doesn't it?


"we know of"


Was Cosmos DB tested by Jepsen?


It's off-topic, but yes, Cosmos DB has rigorous tests[1], including Jepsen (the tool).

[1] https://twitter.com/dharmashukla/status/869104163510034432


To clarify: no, Jepsen, as an organization, has not worked with CosmosDB.

I'm delighted they have rigorous tests, and I'm glad our tool has been helpful for them! We just can't say anything about those tests, because we haven't looked yet. Maybe someday!


I understand they developed TLA+ specifications to guide their design and test their consistency primitives.


CRDTs may help: if for each read (or write) you read from a quorum, merge (and modify), and write back to a quorum, then you'll always get right answers.


I'm not sure if I'm understanding what you're suggesting right, but it sounds like you're just quorum-ing for everything. You shouldn't need improved merging if you quorum both reads and writes, since it implies the majority agreeing on the contents of the log to begin with.


I agree that CRDTs don't help with data scalability (eventually all data converges on a single node); I just wanted to comment on the following phrase:

> CRDTs don't help in the case that you don't want to deliver wrong answers

Usually when people talk about CRDTs they assume that a client talks to a single node and a background process then replicates data to the other nodes. If a client contacts another node before this replication is done, they get "wrong answers". But nothing prevents a client from always talking to a quorum of nodes and doing the replication on its own; in this case we always get only right answers :)

However, we still need to merge if we do quorum-ing. Paxos/Raft-based systems basically take an optimistic lock and then quorum (see the prepare phase), so they suffer from contention and need a leader to serialize requests (the leader in Paxos/Raft is for liveness, not safety). But with CRDTs we can always merge, so concurrent requests are not a problem and we can achieve a better level of parallelism (request scalability).
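
A minimal sketch of what I mean, using a grow-only set (whose merge is just set union); the node objects and their get/put methods are hypothetical plumbing:

    # Client-driven quorum read/merge/write of a G-Set CRDT.
    def majority(nodes):
        return len(nodes) // 2 + 1

    def crdt_read(nodes):
        """Read a majority, merge with set union, write the merged state
        back to a majority, so later quorum reads can't miss it."""
        q = majority(nodes)
        replies = [n.get() for n in nodes[:q]]   # a real client would take
        merged = set().union(*replies)           # the first q to *respond*
        for n in nodes[:q]:
            n.put(merged)                        # read repair
        return merged

    def crdt_add(nodes, element):
        merged = crdt_read(nodes) | {element}    # merge, then modify
        for n in nodes[:majority(nodes)]:
            n.put(merged)
        return merged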


Quorum/quorum does not guarantee agreement or serializability even for a single key.

Imagine a write that commits to one replica but is delayed committing to two others. Some quorum reads will see it (but also see a tie with the old value...what breaks the tie? Certainly not a timestamp) and others will not see it at all, indefinitely.

It’s easy to end up in a state where every replica has a different value because of torn writes.


Sorry, I think either my fundamental understanding is off or I wasn't clear enough in how I was imagining this happening...

If you and a majority of nodes agree on a value like a CRDT op-set (more simplistically, just agreeing on the state of the log), how does that not guarantee agreement and serializability? It is impossible from that point on for another majority of nodes to have some other view of what happened. Consensus algorithms are built on exactly this quorum-intersection property.

One-copy serializability is exactly what would be achieved by having a read and a write quorum[0][1]. It intuitively makes sense to me (and maybe my intuition is wrong), but if you talk to a majority of nodes and ask "this is what we all have, right?" before every write and every read, you've got a guaranteed consistent state (of course, progress isn't guaranteed if partitions happen, nodes die, etc).

AFAIK quorums are the state of the art (and arguably the only relatively efficient option) for achieving serializability in a distributed system...

[0]: https://en.wikipedia.org/wiki/Quorum_(distributed_computing)...

[1]: https://arxiv.org/pdf/1406.7423.pdf
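
(A toy check of the voting arithmetic from [0]: with r + w greater than the total votes n, any read quorum and any write quorum must share at least one copy.)

    # Gifford-style voting: pick r + w > n so quorums always intersect.
    n = 5
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            if r + w > n:
                min_overlap = r + w - n   # pigeonhole: guaranteed shared copies
                assert min_overlap >= 1   # version numbers pick the current one
    # e.g. r = 3, w = 3, n = 5: any two quorums share at least one replica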


> If you talk to a majority of nodes and ask "this is what we all have, right?"

You cannot know this, because the transaction replication is racy and not atomic; it may have applied to only one node while you are doing your read. Whether you see it or not is luck. So you can have the following scenario (and in practice you will):

- TX commit begins

- TX replicated to node A

- Read from coordinator A' begins

- Read sees replica A (has tx) and replica B (does not)

- Read assumes A wins because of some kind of vector clock in the data value (choosing the older value doesn't make things better, just in case you are wondering)

- Read from coordinator B' begins

- Read sees B (no TX) and C (no TX)

- Read completes with the stale value; a serializability violation has occurred

- TX finishes replicating to B and C

This leaves aside write skew, torn transactions due to partitions, and all kinds of other problems.
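
(A toy replay of the interleaving above, with plain dicts standing in for replicas A, B, C; everything here is hypothetical:)

    # The value carries a version; quorum reads take the newest copy seen.
    A, B, C = {"v": 0}, {"v": 0}, {"v": 0}

    A["v"] = 1                      # TX commit begins; lands on A first

    read1 = max(A["v"], B["v"])     # coordinator A' reads {A, B} -> sees 1

    read2 = max(B["v"], C["v"])     # coordinator B' reads {B, C} -> sees 0

    assert (read1, read2) == (1, 0) # the later read is stale: not serializable

    B["v"] = 1; C["v"] = 1          # only now does replication finish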


I'm super confused: what you're describing isn't a quorum read/write scenario. What do you mean by "A wins"? Also, where is the prepare phase for 2-phase commit or any consensus algo? Replica A shouldn't be reporting an unacknowledged transaction to coordinators; the write hasn't been quorum-acknowledged. A TX is not considered committed until it reaches a majority of nodes. You are right that if you have a network partition you're in trouble, but that ends in lack of progress, not loss of serializability.

We must be talking about different things, because I can't find any literature concluding that serializability is impossible for distributed transactions. Can you point me to some?

Also, do you have any thoughts on the literature[0] that contradicts what you're saying? I'm not an expert, but unless I'm misreading the English, serializability is possible with quorum reads and writes.

> In a new algorithm for maintaining replicated data, every copy of a replicated file is assigned some number of votes. Every transaction collects a read quorum of rvotes to read a file, and a write quorum of wvotes to write a file, such that r+w is greater than the total number of votes assigned to the file. This ensures that there is a non-null intersection between every read quorum and every write quorum. Version numbers make it possible to determine which copies are current. The reliability and performance characteristics of a replicated file can be controlled by appropriately choosing r, w, and the file's voting configuration. The algorithm guarantees serial consistency, admits temporary copies in a natural way by the introduction of copies with no votes, and has been implemented in the context of an application system called Violet.

Has this paper been refuted? In addition to this there's literally the whole section on distributed serializability[1].

[0]: https://dl.acm.org/citation.cfm?doid=800215.806583

[1]: https://en.wikipedia.org/wiki/Serializability#Distributed_se...


> what you're describing isn't a quorum read/write scenario. What do you mean by "A wins"? Also, where is the prepare phase for 2-phase commit or any consensus algo?

There is no consensus; that requires a leader system. The paper you link appears to require a multi-phase lock; the quorum itself does not guarantee serializability. Explicit preparation via quorum can guarantee serializability (but not strict serializability, I don't think), but cleanup of locks is a big performance problem in practice.

> Replica A shouldn't be reporting an unacknowledged transaction to coordinators; the write hasn't been quorum-acknowledged.

Acknowledged by who? The replicas can't block on the other replicas; they just tell the coordinator when they applied the write.

This is worth a blog post.


"unlike Google Percolator, FoundationDB, or similar systems, FaunaDB places no constraints on replica distance and is practical to deploy at global internet latencies"

"For each batch of parallel transactions, they are inserted into a distributed, write-ahead transaction log"

"Replicas must achieve consensus for how to insert new transactions into the log. FaunaDB uses an optimized Raft implementation to achieve consensus."

There are constraints on running consensus across the world (Raft): it adds at least ~200 ms to serialize txs. Also, higher latency means a longer interval between heartbeats, and hence longer downtime if the leader is isolated, a known issue of leader-based consensus protocols (see the "There Is More Consensus in Egalitarian Parliaments" paper[1] or the "In search of a simple consensus algorithm" post[2]).

Google's Percolator doesn't depend on global consensus, only on a global TSO (timestamp oracle), which can be implemented in a way that:

- doesn't suffer from leader isolation (there is no leader)

- doesn't have a bottleneck (each node handles requests)

- doesn't touch disk on each request

The details are in the "Quorum clock: leaderless distributed clock" post[3]; a rough sketch follows the references.

[1] https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf

[2] http://rystsov.info/2017/02/15/simple-consensus.html

[3] http://rystsov.info/2018/10/01/tso.html
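
Very roughly, a hypothetical sketch of such a quorum-based TSO (the real design is in the post[3]; the node objects and method names here are made up):

    def majority(nodes):
        return len(nodes) // 2 + 1

    def next_timestamp(nodes):
        """Any node/client can serve a request: read the highest timestamp
        from a majority, bump it, and get a majority to accept the bump."""
        q = majority(nodes)
        while True:
            highest = max(n.read_ts() for n in nodes[:q])
            candidate = highest + 1
            # a replica accepts only a strictly larger timestamp, so two
            # concurrent clients can't both win a majority for one value
            acks = sum(n.try_advance(candidate) for n in nodes[:q])
            if acks >= q:
                return candidate
            # lost a race with a concurrent client; retry with fresh reads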


The Roaring Twenties were pretty liberal in the USSR too - https://en.wikipedia.org/wiki/LGBT_history_in_Russia#LGBT_Hi...


It would be interesting to repeat the research in St. Petersburg: the siege during WW2 lasted more than two years, and there were 642,000 civilian casualties.

https://en.wikipedia.org/wiki/Siege_of_Leningrad


The Holodomor even more so: 2.4-12 million people died of starvation deliberately caused by the Soviets.

[0] https://en.wikipedia.org/wiki/Holodomor


An update on Gryadka: Tobias Schottdorf and Greg Rogers independently explored it with TLA+ and didn't find any issues:

- https://tschottdorf.github.io/single-decree-paxos-tla-compar...

- https://medium.com/@grogepodge/tla-specification-for-gryadka...


I responded in the comments: http://tschottdorf.github.io/if-its-not-paxos-its-probably-w... The analysis is based on a false assumption about the read operation.


I don't know TLA+ yet, but I'll assist anybody with an explanation of how Gryadka works.

