120K distributed consistent writes per second with Calvin

itp · on Feb 24, 2017

This seems cool, and I sincerely wish them nothing but success. That said, I had a major sense of déjà vu while reading this post -- I worked at FoundationDB prior to the Apple acquisition, when we published a blog post with a very similar feel:

http://web.archive.org/web/20150325003241/http://blog.founda...

I'm not trying to make a comparison between a system I used to work on and one that I frankly know little to nothing about; rather, I'd suggest that building a system like this just isn't enough to be compelling on its own.

halestock · on Feb 24, 2017

Bah! I was so dissappointed when I heard about the Apple acquisition of FoundationDB. Will any of the technology behind it ever see the light of day?

itp · on Feb 24, 2017

Unfortunately I'm the last person to ask. While I did start at FoundationDB pretty early (second employee), I ceased to be involved at the point of the acquisition, and beyond that I've only heard a few rumors from former coworkers.

As a business it was always an ambitious effort, and I'm not sure what could or should have been done differently. But since then I've used a number of other systems and thought to myself "boy, I wish I had FDB right now."

wwilson · on Feb 24, 2017

Another former-FoundationDB guy here (hi Ian!), and I actually think the business case for Apple open sourcing is very strong. I'm a fan of the layered architecture we chose, but building efficient and powerful layers on top of the core key-value store is a serious engineering effort in its own right. By encouraging an open-source layer ecosystem (and operational and deployment tools), Apple could leverage its investment in the core technology more effectively.

Whether Apple's leadership agrees with me is another question. :)

evanweaver · on Feb 24, 2017

We specifically chose a monolithic architecture for FaunaDB, since performance improvements invariably come from breaking interface boundaries and sharing additional information. It's been working out well.

wwilson · on Feb 24, 2017

Yes, this is the argument that VoltDB made as well:

https://www.voltdb.com/blog/foundationdbs-lesson-fast-key-va...

My feelings on this topic are mixed. On the one hand, I think many of the specific examples chosen in that post are false (and have told John as much in person). On the other hand, the general point that you can squeeze out constant factor performance improvements by violating abstraction boundaries is obviously usually true.

Nevertheless, I still think this is a bad argument. While it's true that abstractions are rarely costless, they can often be made so cheap that the low-hanging performance fruit is elsewhere. And in particular, cheap enough that they're worth it when you consider all the other benefits that they bring.

When I built a query language and optimizer on top of FoundationDB, my inability to push type information down into the storage engine was about the last thing on my mind. Perhaps someday when I'd made everything else perfect it would've become a big pain (and perhaps someday we would've provided more mechanisms for piercing the various abstractions and providing workload hints), but in the meantime partitioning off all state in the system into a dedicated component that handled it extremely well made the combined distributed systems problem massively more tractable. The increased developer velocity and reduced bugginess in turn meant that I (as a user of the key-value store) could spend scarce engineering resources on other sorts of performance improvements that more than compensated for the theoretical overhead imposed by the abstraction.

I won't claim that a transactional ordered key-value store is the perfect database abstraction for every situation, but it's one that I've found myself missing a great deal since leaving Apple.

But I'm glad to hear that things are going well for you guys. Best of luck, this is a brutal business!

jhugg · on Feb 25, 2017

Hi Will. Thanks for the shout-out.

I still think many of the arguments in that blog post hold up for non-embedded KV stores. I think you can mitigate a lot by aggressively caching metadata, but eventually you end up moving the SQL engine closer and closer to the storage layer to get performance. And yeah, you end up more monolithic and testing gets harder. Sigh.

Some of this is workload dependent. If you're not touching many rows in your queries and transactions, then you can get away with a lot more. But if you give someone SQL, they're going to want to scan.

I wouldn't mind being proven wrong. Maybe Apple made FDB run SQL at legit speeds. I haven't seen much from public projects that work this way to change my mind yet.

> I won't claim that a transactional ordered key-value store is the perfect database abstraction for every situation, but it's one that I've found myself missing a great deal since leaving Apple.

How does Spanner not satisfy that itch? Not ordered matters?

wwilson · on Feb 25, 2017

> How does Spanner not satisfy that itch? Not ordered matters?

I was probably unclear in my previous comment. Spanner is great! (And Spanner is ordered). The particular aspect of FDB that I miss is what some of our old customers called "the bottom half of a database" or "a database construction kit". In fact FDB was an awesome modular building block for all kinds of distributed systems, not just databases. We hacked up prototypes for a whole bunch of these but sadly never got around to releasing them.

Spanner is a full-fledged enterprise grade database with opinions about your data model, query language, types, etc. For the vast majority of customers, that's much more useful than what FDB provided. But for me as somebody who enjoys kicking around silly new ideas for distributed systems, it's a bit less fun.

abdullin · on Feb 25, 2017

Monolithic databases (CockroachDB, Spanner etc) don't eliminate abstraction boundaries on the larger scale. They are simply aligned with the database boundary, pushing the burden of crossing them to the application logic. Software will still have to pay the price of crossing the gap with code complexity and performance (object-relational impedance mismatch comes to my mind).

It feels like the building block approach lets you achieve better design and performance for your entire application in the long run. Especially, if you can treat building blocks as blueprints and modify them to fit the task at hand.

evanweaver · on Feb 25, 2017

"When I built a query language and optimizer on top of FoundationDB, my inability to push type information down into the storage engine was about the last thing on my mind."

What was on your mind? What performance problems did you encounter?

abdullin · on Feb 25, 2017

Layer modeling techniques shared by FoundationDB (and still available in the internet archive) are still immensely helpful even without the database.

We are happily using them to implement and optimize our local storage on top of LMDB (another awesome database). However, these approaches could be applied to any other key-value database with transactions and lexicographically stored keys.

atombender · on Feb 25, 2017

What do you guys think about TiDB and CockroachDB, both of which are SQL layers on top of a distributed K/V store?

wwilson · on Feb 25, 2017

Full disclosure: I now work at Google on Cloud Spanner which competes with both products you mentioned. These are just my personal (and probably highly biased) opinions.

I have some concerns about CockroachDB on both the performance and the reliability fronts. But I hugely admire what they're trying to do and I've heard that they're rapidly improving in both areas. TiDB is an exciting project that I've heard great things about but have never tried myself. I think it's also relatively immature.

Honestly if I were starting a project right now and had neither FDB nor Spanner available to me, I'd probably try to push Postgres as far as I possibly could before considering anything else.

atombender · on Feb 25, 2017

Agreed, Postgres does scale very well for non-Google-sized apps, though a lingering issue is handling failover.

But if one does need a bit more horizontal scalability, there don't seem to be a lot of options if you also want atomic, transactional updates (though not necessarily strict transaction isolation). I have an app that is conceptually a versioned document store, where each document is the sum of all its "patches"; when you submit a batch of patches, the rule is that these are applied atomically, and that the latest version of document thereafter reflects your patch (optimistic locking and retries take care of serialization and concurrent conflicts). I'm using PostgreSQL right now, which does this beautifully, but with limited scalability. I've looked for a better option, but not come up with anything.

Redis would handle this, but it would work purely thanks to single-threaded; and I don't feel like Redis is safe as a primary data store for anything except caches and such. Cassandra might do it, using atomic batches, although its lack of isolation could be awkward to work around.

macdice · on Feb 25, 2017

What do you think Postgres should do in this area? It seems there are a bunch of approaches being explored by different teams. I'd be very interest to hear a Spanner person's take.

hvmonk · on Feb 24, 2017

Do you know what Apple is using it for, and at what scale?

evanweaver · on Feb 24, 2017

That's one of my favorite benchmarks!

You're right that distributed consistency is a beginning, not an end. We are painfully aware of all the startups that have died or are dying on this beach.

It's great to be scalable and consistent, but you have to be more than an operationally-better replacement for legacy SQL. That's one reason we built our own query language that plays to modern application development patterns (serverless, functional, change feeds, etc.) instead of the typical slow, never-quite-there, distributed SQL planner.

TranceMan · on Feb 24, 2017

Totally off topic - why does O2 UK block this without adult verification?

On mobile so cannot read article

g0del_was_wr0ng · on Feb 25, 2017

Including your 9x write amplification in the number of "consistent writes" doesn't count -- like at all. I'm amazed nobody called you out on this yet.

You're doing 3k batches per second with 4 logical writes each, right? So that is at most 3-12k writes per second using the way that every other distributed database benchmark and paper counts.

Or otherwise - if you continue counting writes in this special/misleading way - you'd have to multiply every other distributed db benchmark's performance numbers with a factor of 3-15x to get an apples-to-apples comparison.

The 12k batched writes/sek through what I assume is a paxos variant is still pretty impressive though! Good to get more competition/alternatives for zookeeper & friends!

evanweaver · on Feb 25, 2017

No, that's not write amplification. Replication and storage engine fanout are not included. Instead, that number is the number of logical partition updates per row, per transaction. This makes the test comparable to tests of key-value stores that can only update one key per transaction. The FoundationDB test mentioned elsewhere here was reported the same way.

If you want to include write amplification, then multiply by 6x again to account for the replicated log and the tables themselves.

g0del_was_wr0ng · on Feb 25, 2017

It's doing 12k rows in 3k user-issued write operations/transaction per second.

Counting any kind of "internal write effects" that result from a user write (i.e. write amplification) is obviously done to mislead in the benchmark and does not make it comparable to key-value stores.

12k writes/s is the number of rows that are written from a user perspective. So 12k/s is also the number you have to use when comparing it to key value stores. But of course, comparing Fauna with eventually consistent systems is not a really fair comparison. You don't make it fairer by misleading in your benchmark though.

Also, just because some other vendor posted a misleading benchmark on hn (I don't know if they did) that doesn't make it right or means you should do it. Just call them out on it too.

evanweaver · on Feb 25, 2017

Indexes aren't internal write effects, they are user-defined. But we will have additional benchmarks later on that focus on row commits only.

We tried to replicate a realistic workload rather than just target the best case or worst case performance profile.

g0del_was_wr0ng · on Feb 25, 2017

By that logic a write to a postgres table would also count as N+1 writes, where N is the number of secondary indexes defined on that table?

And also, by the same logic, replicating a write to three machines counts as three writes when the replication factor is user-defined, right?

evanweaver · on Feb 25, 2017

For purposes of this benchmark, the first yes, the second no. But neither of them are free.

g0del_was_wr0ng · on Feb 25, 2017

Well fair enough but that method of counting is not what everyone else does or assumes, so somebody just reading the title "120k writes per second" gets the wrong impression of whats going on.

(An uninitiated reader would assume you're comitting 120k rows per second from the title, whereas it's "only" 12k rows and "only" 3k actual operations over the wire. Still, 3-12k is pretty impressive)

rystsov · on Feb 26, 2017

Even if we focus on 120k writes and ignore the low transaction number then it's still an average result. They use c3.4xlarge to run 5 shards each with 3 replication factor. It's 1500 writes per core (120k / (5*16)). Etcd easily does 5227 writes (I believe it can do even more) on "Standard DS12 v2" so it's 1306 per core which is very similar.

eternalban · on Feb 25, 2017

> nobody called you out on this yet.

https://news.ycombinator.com/item?id=13728738

web007 · on Feb 24, 2017

This description is very misleading.

120,000 writes per second is accurate, talking about actual durable storage (disk) writes. But it's only 3,330 transactions, which should be the number that a user cares about.

I don't have proper data and I'm a bit rusty, but I feel like Cassandra could blow that away if you set similar consistency requirements on the client side (QUORUM on read, same for write?). Am I understanding this correctly, or does Fauna/Calvin give you something functionally better than what C* can do?

freels · on Feb 25, 2017

A more apples-to-apples comparison with Cassandra would be FaunaDB transactions and Cassandra's atomic batch mutations, or its PAXOS-based lightweight transactions as opposed to single-cell writes tested in most Cassandra benchmarks.

YMMV, but we've found the performance of Cassandra writing out similar-sized multi-row atomic batches at QUORUM to be similar in this hardware configuration.

FaunaDB transactions are quite a bit more powerful, as they can span multiple keys, use conditionals and read-modify-write logic, and still resolve with serializable semantics.

web007 · on Feb 25, 2017

That makes a lot more sense then. It's still a misleading statement to say "writes" vs "transactions" since you could (potentially) make fewer writes and support more transactions. The ratio between the two is a measure of efficiency, but only transactions matter to end-users.

evanweaver · on Feb 25, 2017

You're right that one number trades off another. I'm not sure that only transactions matters, though.

Tracking logical writes makes the test comparable to tests of key-value stores that can only update one key at time, which is pretty much every other distributed database.

qaq · on Feb 25, 2017

Maybe I am missing some special point but a decent PG box will do 1,000,000+ TPS vs 3,000+ TPS here. When pgXact lands it will do close to 2,000,000 TPS. So reading all the posts about the amazing new db "X" that can do about N times less than PG on a multi-node cluster I get confused why the numbers are being presented as some sort of achievement.

otterley · on Feb 25, 2017

That doesn't sound right. Even the newest NVMe devices can't do 1M writes per second; they're maxing out at around 330k IOPS.

The 1M "TPS" you're referring to is a read-only benchmark (e.g. http://akorotkov.github.io/blog/2016/05/09/scalability-towar...). Those are reads (most likely from the buffer cache), not writes or transactions in any real sense.

qaq · on Feb 25, 2017

330K iops for a single device you are very unlikely to be running a single device. There are Fusion IO models that can do 1M IOPS but they are on the exotic side. If you are optimising for throughput you can configure commit_delay so you will fsync multiple commits.

evanweaver · on Feb 25, 2017

FaunaDB is faster in an unclustered, unreplicated configuration too, but that's not what FaunaDB is really for.

We should track disk IOPS, though, so you can do an apples-to-apples comparison of how much low-level throughput the database is driving. I believe the instance store disks in the EC2 C3 hardware class can support about ~20K write IOPS each.

chillydawg · on Feb 25, 2017

Relaxing disk commits is a quick route to data loss. Might as well use mongo at that point.

qaq · on Feb 25, 2017

commit_delay does not relax anything it's just increases latency to group multiple commits

pg314 · on Feb 25, 2017

Like others have commented, those numbers seem to be too high for writes.

On the other hand, the Fauna numbers don't seem that impressive to me. On a mid-2011 Macbook Air, I get 2600 transactions per second (read-committed) in PostgreSQL 9.6. Setup is as follows:

  CREATE TABLE IF NOT EXISTS foo(a TEXT, b TEXT, c TEXT, d TEXT);
  CREATE INDEX IF NOT EXISTS idx_foo_a ON foo(a);
  CREATE INDEX IF NOT EXISTS idx_foo_b ON foo(b);
  CREATE INDEX IF NOT EXISTS idx_foo_c ON foo(c);
  CREATE INDEX IF NOT EXISTS idx_foo_d ON foo(d);

  -- prepared statement, the inserted strings are 4 chars wide
  INSERT INTO foo(a, b, c, d)
  VALUES
    ($1, $2, $3, $4),
    ($5, $6, $7, $8),
    ($9, $10, $11, $12),
    ($13, $14, $15, $16);

These numbers are for one thread doing the writing.

Am I missing something?

g0del_was_wr0ng · on Feb 25, 2017

Well, you're missing that in Faunas case the writes are durably stored on N machines. I.e. their system provides fault tolerance in case a machine fails. You can't really do the same thing with postgres (without trading off full ACID compliance).

pg314 · on Feb 25, 2017

True. Thanks for pointing that out. I think they probably should be putting the emphasis in their marketing on the fault tolerance i.s.o. the performance.

qaq · on Feb 25, 2017

you sure can get reasonable close with 2 phase commit, and starting with next release and quorum commit you will be able to do exactly that.

evanweaver · on Feb 25, 2017

Quorum commit is not the same as distributed strict serialization because replicas can desynchronize.

Additionally you have the bottleneck of a single master. Will it be possible to do a quorum read as well, and will every transaction on the master be doing it? Then you are starting to get closer, although many anomalies are still possible.

g0del_was_wr0ng · on Feb 25, 2017

I agree that your PG numbers don't sound likely/factual -- the reason for your confusion is probably that somebody gave you untrue performance numbers for postgres or you're not comparing the same things. Is the 1m+ TPS something you measured yourself or "heard from a friend"?

If you ran the benchmark yourself, how did you achieve 1m durable writes/sec on a postgres machine/instance? [It's quite an achievement] On what kind of crazy hardware? How large was each write/row? Did you use the postgres network protocol to perform the writes?

qaq · on Feb 25, 2017

By extrapolation writes are IO bound you don't need crazy expensive things to get to the needed number of IOPS Intel 750 is 230,000 random writes @ $320 per PCI-E SSD. 9 drive config is over 2,000,000 IOPS for less than 3K.

g0del_was_wr0ng · on Feb 25, 2017

Yeah it's a bit more complex than that... The disk "IOPS" number on the box doesn't translate 1:1 or even linearly to number of committed durable transactions per second. You should try this with postgres and see how it goes.

qaq · on Feb 25, 2017

If you land me 30K for the hardware I would be glad to do it :). For writing WAL it scales reasonably close to linear

zenithm · on Feb 24, 2017

This is a new one to me...the referenced paper is here: http://cs.yale.edu/homes/thomson/publications/calvin-sigmod1...

How does this algorithm compare to whatever Google Spanner does?

evanweaver · on Feb 24, 2017

That's a good and complicated question. They both are fully ACID-compliant systems. The biggest difference as a developer is that Calvin never blocks reads, contested or not. You get causally consistent single-replica reads with no coordination.

This makes the read performance equivalent to something like Cassandra at CONSISTENCY.ONE, without giving up the cross-partition write linearization of something like Spanner.

imglorp · on Feb 24, 2017

Can Calvin scale beyond the OP claim?

I've personally seen a Cassandra ring go to more than 2M ops/sec.

jchrisa · on Feb 24, 2017

Yes. This isn't near the top end, more like what happens when we run benchmarks on a reasonable web infrastructure class cluster. This is still only 5 machines in each datacenter.

imownbey · on Feb 24, 2017

"Calvin's primary trade-off is that it doesn't support session transactions, so it's not well suited for SQL. Instead, transactions must be submitted atomically. Session transactions in SQL were designed for analytics, specifically human beings sitting at a workstation. They are pure overhead in a high-throughput operational context."

Is this specifically for distributed SQL only? I think there are some scalable SQL systems that don't support sessions either.

jchrisa · on Feb 24, 2017

Calvin is a generalized consistency protocol, that we use in FaunaDB to support relational semantics (but not SQL) in our database.

Multi-query transactions can be useful, but the FaunaDB query language is functional, rather than declarative like SQL, so composing queries that can do everything you want is usually easier than SQL.

cakoose · on Feb 24, 2017

How would you perform the classic "transfer money from one account to another" operation?

Would you create a single operation that reads one record, checks that it's enough, then adds the amount to another record?

Or maybe you'd first read both accounts, then issue a conditional write operation that makes sure the data hasn't changed before doing the write?

freels · on Feb 24, 2017

FaunaDB's query language makes it straightforward to do it the first way. All queries are serializable, so any preconditions checked would gate transaction commit as you would expect, and read-modify-write style transactions work.

edit: here's an example in our Scala DSL:

  Let {
    val amount = 50
    val balanceA = Select("data" / "balance", Get(Ref("accountA"))
    val balanceB = Select("data" / "balance", Get(Ref("accountB"))
    If(Gteq(balanceA, amount),
      Do(
        Update(Ref("accountA"), Obj("data" -> Obj("balance" -> Subtract(balanceA, amount)))),
        Update(Ref("accountB"), Obj("data" -> Obj("balance" -> Add(balanceB, amount)))),
        "Transfer Success"
      ),
      "Insufficient Funds"
    )
  }

cakoose · on Feb 25, 2017

Is the wire format roughly isomorphic to the structure above? Or does the Scala library convert this code-like structure into something simpler/flatter?

(BTW, would be nice if I could read your API docs without signing up for an account.)

evanweaver · on Feb 25, 2017

It is isomorphic; right now it's layered onto JSON, but eventually we will support CBOR on the wire as well. Internally everything is CBOR with LZ4 block compression.

The docs will eventually be available without an account.

ddorian43 · on Feb 24, 2017

voltdb has no client-sessions though you have udf ones

olegkikin · on Feb 25, 2017

2011: Benchmarking Cassandra Scalability on AWS - Over a million writes per second

http://techblog.netflix.com/2011/11/benchmarking-cassandra-s...

Also a single SSD from 2015 is rated at 120K writes per second:

PM1725: http://www.samsung.com/semiconductor/global/file/insight/201...

lngnmn · on Feb 24, 2017

Consistent writes to a permanent storage or didn't happen.

evanweaver · on Feb 24, 2017

They are durable; will clarify.

lngnmn · on Feb 24, 2017

Otherwise it would be like Mongo did - we have put your data in the OS buffers - what could possibly go wrong?

evanweaver · on Feb 24, 2017

Haha no...it's the opposite of that.

rystsov · on Feb 24, 2017

Is it possible to download fauna to play with it on my own?

jchrisa · on Feb 24, 2017

You can sign up for the cloud version in a few seconds here: https://fauna.com/serverless-cloud-sign-up

rystsov · on Feb 25, 2017

With the cloud version, it's impossible to run jepsen-like tests to validate consistency and to observe cluster's behavior when the network is unstable and nodes tend to crush.

jchrisa · on Feb 25, 2017

We have some correctness results coming that should address your concerns. Watch this space for updates.

rystsov · on Feb 26, 2017

Will you release binaries or it's gonna be a pure cloud solution like DocumentDB or DynamoDB?

evanweaver · on Feb 26, 2017

We already have enterprise customers in production on-premises; there will also be a downloadable developer edition. There are no service or library dependencies other than the JVM.