So I assume this means cockroachDB currently (probably) meets its promised consistency levels. How does it compare to the usual default settings of PostgreSQL for example? - I think you get SERIALIZABLE, so the behaviour should be very reasonable.
I understand the txn/s numbers are in a pathological scenario, but will these transactions block the whole db from moving faster or have unrelated transactions still normal throughput?
I'm just glad we get linearizable single keys. I had to switch to HBase because it was the only choice at the time. Cassandra did not even give me read-your-own-writes. HBase also had issues while testing on the same host: delete+write led to deletes being reordered after the writes because they had the same timestamp.
I'm a bit disappointed in HN on this post: everyone only talks about the name, and 90% probably did not even read the article.
Aaanyway, here's my 2 cents: https://cr.yp.to/cdb.html is CDB for me :P - so better choose another acronym!
> How does it compare to the usual default settings of PostgreSQL for example? - I think you get SERIALIZABLE, so the behaviour should be very reasonable.
Not clear if you're referring to PostgreSQL or CockroachDB, but for the record, the PostgreSQL default transaction isolation level is READ COMMITTED, not SERIALIZABLE.
(Cockroach Labs CTO) CockroachDB provides two transaction isolation levels: SERIALIZABLE, which is our default and the highest of the four standard SQL isolation levels, and SNAPSHOT, which is a slightly weaker mode similar to (but not quite the same as) REPEATABLE READ. Unlike most databases which default to lower isolation modes like REPEATABLE READ or READ COMMITTED, we default to our highest setting because we don't think you should have to think about which anomalies are allowed by different modes, and we don't want to trade consistency for performance unless the application opts in.
All transaction interactions are localized to particular keys, so other transactions can proceed normally while there is contention in an unrelated part of the database.
(as for the acronym, we prefer CRDB instead of CDB)
I don't believe CockroachDB provides SERIALIZABLE consistency for the entire keyspace, rather for subsets of the keyspace that reside in the same Raft cluster.
In practice this means no cross key serializability without including a serialization token in transactions.
I may be mistaken, however; that is my understanding from reading the CockroachDB docs and Aphyr's post.
Cockroach Labs employee here. We provide serializable consistency for the whole keyspace. It's just that it's easier when you stay in one Raft ensemble as less can go wrong.
"For instance, on a cluster of five m3.large nodes, an even mixture of processes performing single-row inserts and selects over a few hundred rows pushed ~40 inserts and ~20 reads per second (steady-state)."
Jepsen is essentially a worst-case stress test for a consistent database: all the transactions conflict with each other. CockroachDB's optimistic concurrency control performs worse with this kind of workload than the pessimistic (lock-based) concurrency control seen in most non-distributed databases. But this high-contention scenario is not the norm for most databases. Most transactions don't conflict with each other and can proceed without waiting. Without contention, CockroachDB can serve thousands of reads or writes per second per node (and depending on access patterns, can scale linearly with the number of nodes).
And of course, we're continually working on performance (it's our main focus as we work towards 1.0), and things will get better on both the high- and low-contention scenarios. Just today we're about to land a major improvement for many high-contention tests, although it didn't speed up the jepsen tests as much as we had hoped.
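To illustrate why optimistic concurrency control suffers when every transaction touches the same key, here is a minimal sketch of a generic version-check-and-retry scheme. It is a toy, not CockroachDB's actual implementation; `Store`, `increment`, and the `interfere` hook are hypothetical names made up for this example.

```python
class Store:
    """Toy versioned key-value store (hypothetical, for illustration only)."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, 0))

    def commit(self, key, read_version, new_value):
        current_version, _ = self.read(key)
        if current_version != read_version:
            return False  # another writer committed first: abort
        self.data[key] = (current_version + 1, new_value)
        return True


def increment(store, key, interfere=None):
    """Optimistically increment `key`; return how many attempts it took."""
    attempts = 0
    while True:
        attempts += 1
        version, value = store.read(key)
        if interfere is not None and attempts == 1:
            interfere()  # simulate a conflicting writer between read and commit
        if store.commit(key, version, value + 1):
            return attempts


store = Store()
uncontended = increment(store, "counter")
contended = increment(store, "counter",
                      interfere=lambda: increment(store, "counter"))
print(uncontended, contended, store.read("counter"))
```

Without a conflict the transaction commits on the first try; with a conflicting writer it has to throw its work away and retry. When every transaction conflicts, as in the Jepsen tests, most of the work done is wasted in exactly this way.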
I've been testing with node.js's pg client (not even using the 'native' option) and on 1 node I get > 1,000 single inserts /sec. I'm seeing this scale linearly too. NB I'm testing on a pretty overloaded macbook with only 8gb RAM. On a decent-sized cluster with production-grade hardware and network, performance is really unlikely to be a bottleneck. Also, it'll significantly improve during this year, one imagines.
All databases tend to struggle under contention. If you are an application author building a high-scale app, you need to avoid contention by design. That single counter row you were planning to increment on every visit? Bad idea; maybe insert a new visit record instead.
The contention in this example is an aspect of the test design, not database design. You wouldn't try to build an app this way.
a.) this is a beta product where the team has been focusing on completeness and correctness before performance
b.) I was testing Cockroach from August through October--Cockroach Labs has put a good deal of work into performance in the last three months, but that work hasn't been merged into mainline yet. I expect things should be a good deal faster by 1.0.
c.) Jepsen does some intentionally pathological things to databases. For instance, we choose extremely small Cockroach shard sizes to force constant rebalancing of the keyspace under inserts. Don't treat this as representative of a production workload.
I agree that it can and should be faster, but the targets will depend on what you're going for. 20k linearizable reads/sec on EBS-backed instances is, uh, maybe a tad optimistic.
I'm a little confused about how read speed is 2x slower than write speed. With respect to 'correctness', you're drifting into pyrrhic victory or 'not even wrong' territory at that point.
When there are basic expectations of behavior that aren't being met, many of us would reject the idea that this code is 'correct'.
[Disclaimer: CockroachDB engineer here, working on performance and benchmarking]
IIRC, in the case that aphyr refers to for these specific numbers, the reads are scans that span multiple shards[1], while the writes are writes to single shards.
[1] even though aphyr says it's just a hundred rows, the tables are split into multiple shards because aphyr in this case was specifically testing our correctness in multi-shard contention scenarios. In production you wouldn't have multi-shard reads crop up until you were doing scans for tables that were hundreds of thousands of rows in size[2]. It's easier to picture if you think of this performance speed difference in the scenario where you are doing full table scans spanning multiple shards on multiple computers while the underlying rows are being rapidly mutated by contending write transactions. The transactions get constantly aborted to avoid giving a non-serializable result, and performance is suffering. We agree that the numbers in this contention scenario are too low, and we are actively working on high-contention performance (and performance in general) leading up to our 1.0 release[3].
[2] Specifically, we break up into multiple shards when a single shard exceeds 64 MB in size.
The benchmark isn't quite apples to apples. I'm used to people doing put/get benchmarks, not put/search workflows. Merging results from multiple machines is nothing to sneeze at, especially if you have replication between nodes.
I worked for a search engine dotbomb, and I had a constant worry that the core engineering team had not solved a very similar problem. But since the funding round failed I'll never know for sure what they had planned.
Are read-only transactions handled without contention? I understand that a read-write transaction needs to be aborted if a later conflicting write is committed. But read-only transactions can view a snapshot of what was committed before they begin - you trade a little bit of latency for being conflict free. It doesn't even have to be the default as long as you offer it as an option, you might want this if you know you are reading hot keys, or as a fallback if you've been aborted X times.
This was IMO the big innovation of true time in spanner - contention free reads at either now, or at some time in the past.
Yes, if your timestamp is far enough in the past (this is determined by the maximum clock offset configured for the cluster), and if you don't collide with writes of a long-running transaction (which may run at a timestamp that could influence what you read), you will not have to worry about contention. What you're suggesting are definitely options for reading consistent, if slightly out of date, information from a hot keyspace.
There's a blog post on time travel queries, which conveniently expose the required functionality as specified by the SQL standard.
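As a rough illustration of why a read at a past timestamp doesn't have to contend with newer writes, here is a toy multi-version store. This is a sketch under simplifying assumptions, not CockroachDB's storage layer: `MVCCStore` and `read_as_of` are hypothetical names, timestamps are plain integers, and it ignores the clock-uncertainty and long-running-transaction caveats mentioned above.

```python
import bisect


class MVCCStore:
    """Toy multi-version store: every write keeps its timestamp."""

    def __init__(self):
        self.timestamps = {}  # key -> sorted list of write timestamps
        self.values = {}      # key -> values, parallel to timestamps

    def write(self, key, ts, value):
        stamps = self.timestamps.setdefault(key, [])
        vals = self.values.setdefault(key, [])
        i = bisect.bisect_right(stamps, ts)
        stamps.insert(i, ts)
        vals.insert(i, value)

    def read_as_of(self, key, ts):
        """Return the newest value written at or before `ts`, else None."""
        stamps = self.timestamps.get(key, [])
        i = bisect.bisect_right(stamps, ts)
        return self.values[key][i - 1] if i else None


store = MVCCStore()
store.write("k", 10, "a")
store.write("k", 20, "b")
before = store.read_as_of("k", 15)
store.write("k", 30, "c")          # a later write arrives...
after = store.read_as_of("k", 15)  # ...but the historical read is unchanged
print(before, after, store.read_as_of("k", 30))
```

Because a later write never changes what a read at timestamp 15 returns, the reader has nothing to conflict with.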
As I mentioned in the article's introductory paragraphs, where Spanner forces a minimum latency on writes to ensure consistency, CockroachDB pushes that latency to reads which contend with a write.
Latency is one thing; throughput is a different animal. 2 reads/second per m3.large node really forces one to think there are some severe architectural issues.
You're not wrong. Whoever downvoted you is pushing an agenda and I'm not happy about it.
Reads per second is ultimately a measure of how many operations can be retired per second, not how long the operation takes. Anyone who has spent fifteen minutes learning about capacity planning should know that. 5 reads per second doesn't mean each read takes 200ms. It might mean 1 second per read spread across 5 threads of execution, or 3 seconds per read across 15 workers.
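The arithmetic behind this is just Little's law rearranged: throughput equals operations in flight divided by time per operation. A tiny sketch (the function name is made up for the example):

```python
def throughput(workers: int, latency_seconds: float) -> float:
    """Operations retired per second with `workers` operations in flight,
    each taking `latency_seconds` to complete."""
    return workers / latency_seconds


# All three configurations retire about 5 reads per second,
# with very different per-read latency.
fast_serial = throughput(1, 0.2)
five_workers = throughput(5, 1.0)
fifteen_workers = throughput(15, 3.0)
print(fast_serial, five_workers, fifteen_workers)
```

So a reads-per-second figure alone tells you nothing about how long any individual read took.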
The parent poster (and many others in this thread) are assuming that performance under this deliberately pathological test is reflective of performance in the real world.
Optimistic locking systems inherently perform poorly under contention. But they also perform better than pessimistic concurrency systems overall because in the real world we design applications to avoid contention.
As an example, the Google App Engine datastore runs zillions of QPS across petabytes of data in a massive distributed cluster. But if you build an app that does nothing but mutate a single piece of state over and over, you'll top out at a couple transactions per second. This is painful if you're trying to build a simple counter, but with minimal care you can build a system that scales to any sized dataset and traffic volume.
So the benchmark is only informative from the standpoint of determining which records should be split to avoid concurrent writes.
For instance you wouldn't expect a single user to make 5 comments or upvotes per second so storing data about recent activity with the user isn't a bottleneck that you need to design for. Storing data about responses with an item might also be okay as long as you don't plan to be HN or Reddit (Github, for example, would be just fine). But if you want to track activity globally (eg, managing the watch list notifications in github), you will need to design around that number.
Aphyr's test is not a benchmark. It is a test of database correctness under a carefully constructed set of pathological circumstances. It cannot be used to infer real-world performance behavior.
Yes, the general advice for users of the GAE datastore is to build Entity Groups around the data for a single user. That isn't absolute though, and it doesn't cause problems for watch lists; the Watch can be part of the User's EG rather than the Issue's EG. Or it can be its own EG. In practice this doesn't require as much consideration as you probably imagine.
The clock skew limit controls the maximum latency on reads--so yeah, if you're doing reads that contend with writes, improving clock resolution improves latency. Not entiiirely sure about the throughput impact though.
You wouldn't reduce the clock drift limit unless you had more accurate clocks, however. If all of your servers are on bare metal and synced to a high-resolution atomic or GPS clock, then you can probably reduce it from 250ms to 10ms or less, similar to Google's Spanner setup.
Even if they manage to multiply this by 100 on the final release, it's still way weaker than a regular sql db. I hope they have another selling point than performance.
The test is testing the worst-case scenario of everything needing locking. CockroachDB uses an optimistic "locking" approach, which makes this bad. But if your use case is strictly linearizable high-throughput reads, good luck finding anything that is not a single-machine database.
Your one machine's disk fails and you'll lose data if you have anything other than synchronous replication. And once you have synchronous replication let's hope it is on at least a different rack if in the same data-center. Preferably in a different geographical region actually. And now you no longer have "really fast" QPS.
All this assumes you have only so much data that can fit in one machine. If that's not the case you're in need of a database that spans more than one machine.
My thought exactly. It's already the case with faster DBs. E.g., most projects my customers have me work on will never need to scale to multiple instances. They run really fast on a single postgres instance.
One had some perf issues.
First he just moved the DB to avoid having it on the same machine as the web server. Saved him a year.
Then as the user base grew, he just bought a bigger server for the db.
Right now it works really fast for 400k u/d, and there are still 2 server offers more powerful than the one he currently has. By the time he needs to buy them, 2 new, even more incredible offers will probably exist. Which is a joke, because more traffic means he has more money anyway.
Having a cluster of nodes to scale is a lot of work. So unless you have 100 million users doing crazy things, what's the point?
You do understand that scalability is more than just concurrent users right ? Maybe not ?
There are many very small startups doing work in IoT, Analytics, Social, Finance, Health etc who have ridiculously challenging storage needs. Trying to vertically scale is simply not an option when you have a 50 node Spark cluster or 1M IoT devices all streaming data into your database. Vertically scaling your DB doesn't work as it is largely an I/O challenge not a CPU/Storage one.
But hey it's cute and lucky that you work on small projects. But many of us aren't and have to deal with these more exotic database solutions.
This is a very niche use case. The vast majority of projects neither have the user base to reach such numbers, nor are they I/O challenged. Most companies are not facebook or google.
Plus if you manage to do 50 writes/sec on this DB, I don't see how you plan to scale anything with it. You are back to using good old sharding with load balancers.
You're right, most people probably won't need cockroach db, but in my opinion, one of the nicest things about CDB is that it uses the postgres wire protocol.
So by all means, scale up your single DB server. But if you ever get the happy issue of even more growth, and needing to scale beyond a single server, then you "should" be able to migrate your data to a cockroachDB cluster, and not have to change much, if anything, in your application.
If your dataset fits comfortably on one postgres instance, and will continue to do so for your current architectural planning time horizon, then you have little need to use CockroachDB / Spanner.
These databases are designed for use-cases which require the consistency of a relational database, but cannot fit on a single instance.
You are misunderstanding this article. This is not a benchmark, this is a test of how correct the database is with distributed transactions and data in the worst conditions possible. These are not real-world performance numbers in any sense.
Why do you keep repeating 50 w/s when that's not an actual performance number? CDB will likely run with thousands of ops/sec per node.
Can you really not see why distributed databases are needed? High availability, (geo) replication, active/active, oversized data, concurrent users, and parallel queries are just a few of the reasons.
Distribution will always come with a cost, but the tradeoffs are for every application to make. We use a distributed SQL database and while it's faster than any single-node standard relational DB would be, speed isn't the reason we use it.
Can you elaborate on this point please? My reading of the article and docs is that perf is expected to scale linearly with the number of nodes (and therefore with dataset size).
Let's say you get 50 writes / sec per node. You need what, 1000 nodes to get to the playing fields of Postgres, with a simpler and cheaper setup ? Right now it's really not competitive, unless they improve the performance 1000 times. It makes no sense to buy and administrate many more machines to get the power you can have with one machine on another, proven tech.
If the DB can only handle 50 ops/sec then the point you are making here is valid.
But see https://news.ycombinator.com/item?id=13661349, that's a pathological worst-case number. You should find more accurate performance numbers for real-world workloads before making sweeping conclusions.
Your original comment was:
> Even if they manage to multiply this by 100 on the final release, it's still way weaker than a regular sql db
This is what my comment, and the sibling comments, are objecting to, and I don't think you've substantiated this claim. 100x perf is probably well in excess of 10k writes/sec/node, which is solid single-node performance (though you'd not run a one-node CockroachDB deployment). Even a 10x improvement would get the system to above 1k writes/sec/node, which would allow large clusters (O(100) nodes) to serve more data than a SQL instance could handle.
Obviously I'd prefer to be able to outperform a SQL instance on dataset size with 10 nodes, but for a large company, throwing 100 (or 1000) nodes at a business-critical dataset is not the end of the world.
At a given point in time yes, but to get this data set, you need to write it into the DB. A big dataset implies either that you have been receiving data for a very long time or that you ingested it quickly over a shorter period. It's usually the latter, which means you need fast writes. 50 writes/sec is terrible, even more so if improving it means buying more servers while your 30 euros/month postgres instance can handle much more than that.
What I found out the hard way is that there's a qualitative difference between 'can' and 'must' here that causes a lot of problems with the development cycle.
When the project can no longer fit onto a developer's box it changes a bunch of dynamics and often not for the better. Lots of regressions slip in, because developers start to believe that the glitches they see are caused by other people touching things they shouldn't.
With Mongo we had to start using clustering before we even got out of alpha, and since this was an on-premises application now we had to explain to customers why they had to buy two more machines. Awkward.
Plus I found that people using MongoDB tend to not formalize their data schema because the tool doesn't enforce it. But they do have a data schema.
Only:
- it's implicit, and you have to inspect the db and the code to understand it.
- there is no single source of truth, so any changes better be backed up by unit tests. And training is awkward. If some fields/values are rarely used, you can easily end up not knowing about them while "peeking at the data".
- the constraints are delegated to devs, so code making checks is scattered all around the place, which makes consistency suffer. I almost ALWAYS find duplicate or invalid entries in any mongo storage I have to work on. And of course again, changes and training are a pain. "Is this field mandatory?" is a question I asked way too often in those mongo gigs.
Things like redis suffer less from this because they are so simple it's hard to get wrong.
But mongo is a great software. It's powerful. You can use it for virtually anything, whatever the size, quantity, or shape. And so people do it.
The funny thing is, the best MongoDB users I met are coming from an SQL background. They know the pros and the cons, and use the flexibility of Mongo without making a mess.
But people often choose mongo as a way to avoid learning how to deal with DB.
Oh, I'll never use Mongo again. I think they believe their own PR, and I feel duped by the people who pushed it into our project against the reservations of more than half of the team. In fact I'd rather not work with anyone involved in that whole thing.
But I didn't want to turn the Cockroach scalability discussion into another roast of Mongo so I very purposefully only talked about the horizontal scaling vis a vis 'immediately have to scale' versus 'very capable of scaling' horizontally.
If you gave me a DB that always had to be clustered but I could run it as two or three processes on a single machine without exhausting memory, I would be fine with that as long as the tooling was pretty good. As a dev I need to simulate production environments, not model them exactly.
But if the whole thing only performs when it's effectively an in-memory DHT, there are other tools that already do that for me and don't bullshit me about their purpose.
That last statement is a bit off. Almost every developer knows how to use an SQL database. For 95% of things you want to do it's a very simple piece of technology.
The issue is that they are annoying to use. You have to create a schema, you have to manage the evolution of that schema (great another tool), you have to build out these complicated relational models and finally you have to duplicate that schema in code.
And so IMHO that's why people move towards MongoDB. It's amazing for prototyping and then people just well stick with it as there isn't a compelling enough reason to switch.
See, I would agree with you but I'm aware that we may be dating ourselves.
It's been so long since I touched SQL that I'm forgetting how. Which means that my coworkers haven't touched it either, and some of them that were junior are probably calling themselves 'senior' now, having hardly ever touched SQL. It's no wonder noSQL can waltz into that landscape.
But what's the old rule? If your code doesn't internally reflect the business model it's designed for, then you'll constantly wrestle with impedance mismatches and inefficiencies, and eventually the project will grind to a halt.
So what I'd rather see is a database that is designed to work the way modern, 'good' persistence frameworks advertise them to work. That would represent progress to me.
Like Spanner, Cockroach’s correctness depends on the strength of its clocks... Unlike Spanner, CockroachDB users are likely deploying on commodity hardware or the cloud, without GPS and atomic clocks for reference.
I'm curious, do any of the cloud providers offer instances with high-precision clocks? Maybe this is something one of the smaller cloud providers might want to offer to differentiate themselves - instances with GPS clocks.
The spanner announcement mentioned offering direct access to the TrueTime api sometime in the not too distant future.
Edit: Quote from Brewer's post: "Next steps
For a significantly deeper dive into the details, see the white paper also released today. It covers Spanner, consistency and availability in depth (including new data). It also looks at the role played by Google’s TrueTime system, which provides a globally synchronized clock. We intend to release TrueTime for direct use by Cloud customers in the future."
Bingo. I've been hoping Google (or, heck, any colo provider) would offer this for a few years. Semi-synchronous clocks enable significant performance optimizations!
> I'm curious, do any of the cloud providers offer instances with high-precision clocks?
Essentially that's what Google's doing by selling Spanner, but you don't get access to the clocks directly. I'd guess that as they get more experience selling Spanner, and they bring the price of the hardware down, they'll rent those directly too.
>I'm curious, do any of the cloud providers offer instances with high-precision clocks?
That's a great question.
Could be a little tricky with AWS for example. It's typical to run a database on a VPC (virtual private cloud), since that layer of the stack doesn't need to be exposed to the Internet. Unfortunately, that means the servers in the VPC can't get to the *.amazon.pool.ntp.org servers.
Skimming around, I don't see any of the cloud providers, VPS providers, or dedicated hosting companies offering anything special, like a local stratum 1 time source for each region/location.
In addition to being extremely precise, TrueTime is explicit about error in its measurements.
In other words, a call to TrueTime returns a lower and upper bound for the current time, which is important for systems like Spanner or CockroachDB when operations must be linearized.
In effect, Spanner sleeps on each operation for a duration of [latest_upper_bound - current_lower_bound] to ensure that operations are strictly ordered.
Your suggestion might improve typical clock synchronization, but still leaves defining the critical offset threshold to the operator, and importantly leaves little choice for software like CockroachDB other than to crash when it detects that the invariant is violated.
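A rough sketch of the interval-plus-wait idea described above, for anyone who wants to see it in code. Everything here is hypothetical: `FakeClock` is a deterministic stand-in for a clock with a known error bound (it is not the TrueTime API), and `commit_wait` is a simplification of what the Spanner paper describes.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    earliest: float  # lower bound on the true current time
    latest: float    # upper bound on the true current time


class FakeClock:
    """Deterministic stand-in for a clock with bounded uncertainty."""

    def __init__(self, now, epsilon):
        self.now = now
        self.epsilon = epsilon

    def interval(self):
        return Interval(self.now - self.epsilon, self.now + self.epsilon)

    def sleep(self, seconds):
        self.now += seconds  # a real clock would actually block here


def commit_wait(clock):
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = clock.interval().latest
    waited = 0.0
    while clock.interval().earliest <= commit_ts:
        # Sleep for the remaining gap (at least a microsecond to make progress).
        step = max(commit_ts - clock.interval().earliest, 1e-6)
        clock.sleep(step)
        waited += step
    return commit_ts, waited


clock = FakeClock(now=100.0, epsilon=0.005)
commit_ts, waited = commit_wait(clock)
print(commit_ts, waited)
```

The wait comes out at roughly twice the error bound, i.e. the width of the interval, which is why tighter clocks translate directly into lower latency.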
Wow. Probably fundraising and getting out in front of Google's new consumer Spanner service. I just started experimenting w/ cockroachdb and it sounds/seems great. They def are campaigning hard, landing a Jepsen test and a lot of posts/articles last couple.
I am pulling for them over G on principle. I infer they know people at Stripe where the Rethink team & Aphyr are so hopefully they can learn from them and build an awesome product. G has lost a lot of battles on UX, Docs, attitude and just general product. They have won a lot of battles on sheer size & resources. Be nice to have this stick around.
Hey man, didn't realize you left Stripe but did sort of notice Jepsen getting more serious. Jepsen has phenomenal reach:
Interesting and hilarious coverage of technical complexity which assumes just enough knowledge to engage a keen novice while seemingly engaging seasoned experts in the field.
Thanks for that. Your entertainment firm is a bit out of my wheelhouse, but it's such a crowded space and you end up getting really tied down. Maybe get acquihired for an Iditarod team, but I reckon this is more scalable.
CDB is a really cool db for the time I've been playing with it (just a few months I think).
That said, I'd really appreciate if they'd launch a hosted DBaaS. By that I mean, more in line with dynamodb or documentdb instead of say MySQL on AWS for example.
I know it's on their timeline, but dbs generally get selected during the start of the projects and unless absolutely necessary, companies don't want to change dbs.
CockroachDB employee here. DBaaS is definitely on our roadmap. There are a lot of moving pieces to building out DBaaS, so our focus first is on honing our core product. In the meantime, we are working on making it as easy as possible to deploy CockroachDB across different cloud vendors and with orchestration tools like Kubernetes.
> Should the clock offset between two nodes grow too large, transactions will no longer be consistent and all bets are off.
A concern I have is that there could be data corruption in this context, no? It sounds like on a per-cluster basis, i.e. within the same Raft cluster, there is a strict serializability guarantee. But without linearizability system-wide, you could end up with data corruption on multi-keyspace writes.
You're correct--large clock anomalies (or VM pauses, message delays, etc) can cause serializability anomalies. I don't exactly know what will happen, because our tests explicitly focused on staying within clock skew bounds.
To speculate: if clocks do go out of bounds, I don't... think... you can violate single-key serializability, because that goes through a single Raft cluster, but you can definitely see multi-key consistency anomalies. I think you could also get stale single-key reads, because the read path IIRC relies on a time-based lease. Might be other interesting issues if Cockroach is shuffling data from one Raft ensemble to another--I think the range info itself might have a time-based lease that could go stale as well. Perhaps a Cockroach Labs engineer could comment here?
Note that the time window for these anomalies is limited to a few seconds: nodes will kill themselves when they detect clock skew.
(Cockroach Labs CTO) The major issue with clock offsets is stale reads due to the time-based read lease. Pretty much everything else is independent of the clocks. Stale reads are still enough to get you in trouble (some of the jepsen tests will show serializability violations if you disable the clock-offset suicide and mess with the clocks enough), but it's tricky to actually get bad data written into the database, as opposed to returning a stale-but-consistent view of the data.
Of course anything good may eventually rise to success on its merits, but in case I'm not the only one with a subconscious aversion maybe the info below will help put some people one less click away. How about SurvivalDB? :)
CDB is a distributed SQL database built on a transactional and strongly-consistent key-value store. Scales horizontally; survives disk, machine, rack, and datacenter failures with low latency disruption and no intervention; strongly-consistent ACID transactions; SQL API for querying data. Inspired by Google’s Spanner and F1. Completely open source.
Absolutely the name is holding them back. In my case it prevents me from championing the product within my company. Going to the executives and saying "we're going to put our most precious data into Cockroach" simply won't fly. Any discussion of features will be for naught; I know the CEO will make a "no" decision as soon as he hears the name.
To any Cockroach Labs folks who might read this: please give me (and all of us) a name that I can evangelize without my audience cringing.
Everybody makes decisions using both emotion and reason. This is why successful products speak to both the heart and the mind of the customer. Said another way, this is why the marketing department is just as essential to a company's success as the engineering department.
I've worked with my company's executive team for quite a while, and they're not obtuse. They are human, however. When I go to them and suggest we trust our data to a new, up-and-coming database, I'm asking quite a lot. The cost of failure is extremely high, so I need to convince them at both an emotional and rational level that the risk is worth it. But when the conversation starts, "let me tell you about Cockroach..." I'm already starting from behind.
The reason so many people are fussing about CockroachDB's name is because there's such a huge dissonance between a product that looks fantastic on its technical merits but has a name that triggers terrible images. Have you ever lived in a house with roaches? That's the kind of thing that wedges in your mind in the "never again" category.
That people are both emotional and logical is a non sequitur. The responsibility of picking a database includes trying explicitly to discount how one feels about petty things like the name and moving straight to the technical discussion. That's part of what it means to do a good job picking a database.
All the best technical people I know would feel a visceral disgust at the thought of a product being discarded due to its name. Someone who was truly suited to attacking the problem of database selection would proactively make sure he wasn't letting nontechnical features into the discussion.
Whoever creates the successor to Linux should name it Holocaust OS or something just to mess with people, and laugh at those who due to its name thereby do their computing on an inferior system.
"But when the conversation starts, "let me tell you about Cockroach..." I'm already starting from behind."
Attempting to explain how brand recognition works: people browse the web, they see and hear things, and they remember unusual things, things they see often, things that stand out from all the noise, etc. After a while, when they have to make a judgement under pressure, in a hurry, or simply without enough information, they favor a familiar thing, one they can recall, over one they can hardly recall or not at all, no matter how neutral its name is. This is called familiarity bias, and it is also the reason behind the phrase "no such thing as bad publicity". So, you are definitely not starting from behind.
Except this is going to be the common response of management at basically any large, or older company. Not everyone here works at startups, and this isn't something that's easy to just immediately change.
This is an issue with a lot of weird or jokey project names, honestly.
"It must be a product from just another startup" is a very valid reason. At the end of the day, we only talk about mssql, mysql, postgres when it comes to rdbms.
Right, the implication being that nothing especially impressive about CockroachDB stands out, and so we're going to stick with proven options. That's a technical discussion!
Why are the people in charge of deciding what database to use the sort of people who would allow their disgust toward an animal whose name is shared by a database to block discussion of its technical features?
I agree completely. I love reading about databases, but apparently have been skipping reading about CockroachDB for that same reason subconsciously. I took a deep dive the other day and was amazed that I've seen the name dozens of times and have never bothered to ask what was special about it. It really is a fascinating database.
No, you have not been skipping it subconsciously or seeing it more often than others; you just feel like you did. People actually have an availability bias [1] that clouds their judgement about unusual things.
It's a terrible name, it also has the property that stupidly configured web filters could filter it due to string matching the first part of the word.
It also makes your employees look stupid when they tell their landlords, bank accounts, dates that they work at Cockroach Labs. Good luck explaining that you aren't an exterminator.
There's nothing wrong at all with being an exterminator – but in general you probably don't want people to assume you're something you're not. The name doesn't help with that.
I submit "Bunker" as a more approachable name. Not only does it have the connotation of "survives nuclear bombardment", but also suggests "keep your stuff safe here" and "something you could entrust with your life".
Agree 100%, ditto for suggested alternate name. I'd been doing the same for Rust. It's not a rational reason, sure, but I have to sell adoption to a broader userbase that cares about more than the technical merits.
I don't want to have to make the case for why someone's system needs rust and cockroaches.
I have to say, I think that if you have a problem with "Rust" then you lose credibility in your criticism of "Cockroach".
Contrary to most above, I think "Cockroach" is great. It's suitable. Any suit who has a problem with it should get back on their side of the office, and stop sticking their nose in things they know nothing about, especially when so evidently closed minded. There are too many people in that meeting. Hell, even "Google" is a crap name, sounding like a baby word, and "Mongo" can be outright offensive. I expect anyone who has an issue with the name "Cockroach" is probably not going to be too open to the cutting edge anyway, so wasted concern.
I'm wondering if all the people who can't get over the name CockroachDB are also people who insist that professionals should dress the part. Can't take someone seriously if they don't have their shirt tucked in?
I'm pretty sure I'm not alone in having a bias against the name "CockroachDB" and being okay with people who don't have their shirt tucked in. I get the comparison you're making, but "untucked shirt" connotes "casual" to me. Roaches connote "health hazard." I'm okay with eating at casual restaurants, but not casual restaurants with roaches.
If you have to explain "we want you to think about survivability, not health violations," then maybe the problem isn't your untucked shirt--it's the big grease stain covering its front. You can insist that's just part of your style all you want, but it's going to be a hard sell.
Yeah, I've definitely heard people tell me "It seems cool, but not cool enough for me to get over the subconscious revulsion to the name" when asking about CDB.
I appreciate the name, but I think I'd have an easier time convincing people to use a product of a different name....
It might be intentional as reverse psychology. They don't want employees or users who wouldn't use a project because it's named after an unpopular insect. Those users/employees could be fickle, unreasonable and low quality at other aspects of business/development and be a net loss for the company/project in the long term.
Completely agree. I specialize in distributed databases and I have zero interest in even learning about CockroachDB. Could be the greatest technology ever and I will never know. Is that irrational? Sure. But I'm a busy man with plenty of tech to learn about and limited time. I detest roaches and do not want to think about them at all when I am working. Might as well have named it with an ethnic slur.
I can only go off what you wrote, stating you are deliberately ignorant. I was just reiterating what you proudly said.
Do you really think saying you are a specialist, yet won't read about the topic you specialize in because you are scared of a benign name is reasonable?
I forgot to mention the ridiculous hyperbole comparing 'cockroach' to racial slurs.
My point is that the name isn't benign. And my post was deliberately somewhat hyperbolic. I actually have read a bit about Cockroach, and if I saw a growing ecosystem that seemed relevant to my work, I would pursue it more.
But as others have pointed out, the name creates unnecessary conflict and discomfort. In a world of perfectly rational humans, it wouldn't matter. In the real world, it makes it harder to get stakeholders on board. That's why I made the comparison to ethnic slurs. "F_U_DB" might be the greatest storage system imaginable. It doesn't change the fact that the name alone is going to create extra unpleasant work. (EDITED to fix an error)
"Somewhat hyperbolic" and "the complete opposite of truth" are not the same thing.
There are no such things as "perfectly rational humans", and if there were, they would be rational and would not experience discomfort because of a name, because that's irrational. A rational human would see the name as clever if anything, since cockroaches are quite enduring.
Perhaps to a large fraction of relevant stakeholders, the name is a signal of project immaturity and perhaps poor judgement.
Additionally, even if you aren't personally bothered by it, most would be aware that others feel this way, which certainly creates an uphill battle for the tool and diminishes the prospect of a vibrant ecosystem building up around it.
This is a red flag for anyone who might be considering including CockroachDB as an important part of their stack on any long-running project
It was not an ad hominem attack. He criticized you willfully not learning about the area you say you specialize in. It is completely valid criticism as your reasons for doing it are ridiculous and what you said looks very unprofessional.
To me it's not a deep-seated reason; it's just that when I hear cutesy-sounding names I don't take them seriously. They sound like the latest hipster thing that's going to be in bloom for a (figurative) week or two until the next cutesy thing steals everyone's attention. Even the language Pony has a mildly off-putting name (to me) despite hearing people rave about it. CouchDB? What the hell does that mean?
I admit it's a bias of mine and I could be losing out on learning amazing tools. But even non-profit, standardized languages and tools still have a brand name and image.
I agree that it's more neutral, if not random-sounding without knowing it's an acronym. JBOD is another neutral acronym.
I'm no marketing guru, but maybe cockroach could have been more palatable by calling it "RCH, some call it roach, because it was named after its persistence"?
their name is the stupidest startup blunder I've ever seen. Enterprise won't go near it, seriously. I cannot tell a client I'm using a "cockroach" database, no fucking way.
If we could go back and rename Windows Hitler would anyone be using it???
Cloning a reply I made earlier talking about that exact situation:
It's not my company or me, it's my clients. Here's a sort of long winded story explaining what I mean that the corporate world can't use anything with the name "cockroach". If you disagree after reading, I'm open for a friendly debate.
Average project, some 60 year old VP signing off on 100k of custom software with 8 pages of lawyer speak contracts is NOT going to allow "his" project to run off of "cockroach database". VP's/CEO's like to take personal ownership of projects they feel will go well, and ones done with us usually do. None of these people want ANY chance of their project getting criticized internally, so nobody will use CockroachDb. It's just not going to happen. Having successful projects is an easy way for these people to gather promotions and respect. Having any part of it do with "cockroach" could result in the whole project being an embarrassment.
Imagine the database goes down one day. One of their low level staff calls us(we have ongoing support contract). Problem gets rapidly escalated to development. One of our lead developers replies to the email thread, that by now has a third of the clients' company on it. "Yeah see the problem was the cockroach database that wasn't supposed to go down needed to be restarted" . Most of the staff think we're joking. Guarantee this becomes a big pun around their office, every time anything breaks in the app they say "maybe the cockroaches on the back end died again".
It's made worse by the fact that issues in software systems are commonly called "bugs". Well, our system doesn't have general bugs anymore, it has "cockroaches". Do you think this is an unlikely scenario? Because it's not. Besides the word "cockroach" this is exactly how urgent issues tend to be reported and fixed. Replace "cockroach" with "Microsoft" and I've probably violated my NDA. After that incident, whoever championed the project, probably our strongest ally in the company, resents us for making such a stupid mistake that resulted in his project becoming a joke. Maybe we aren't as great of a dev shop as he once thought? Maybe it's better to do open bidding next time so whoever wins he can't be blamed directly for what happens.
You could say their company culture sucks, maybe it does? But that doesn't change that I can't use it on any client projects because of the chance we'll lose hundreds of thousands over a stupid name.
There's plenty of other names that convey meaning but don't make lawyers nervous. Your attitude of "oh, well if you can't use it your company culture sucks" is EXACTLY why nobody will rename it and why corporate America won't touch it. Nobody cares about how cool it is to be edgy when millions of dollars are on the line.
If you plan on forcing clients to use it you might as well hold a going away party for any of your clients with more than 5M revenue.
>If we could go back and rename Windows Hitler would anyone be using it???
Intensely hyperbolic but... you're not wrong. "Cockroach" is an awful name for a database and I have a hard time envisioning "seen-it-all" enterprise IT managers taking it seriously.
I wish everyone would stop bikeshedding the damn name.
From my perspective I'm glad something that's trying to do this is coming into existence. A couple of years ago I was given the arduous task of ensuring that we never lost data. One of the requirements was: at any time, any server can (and will) fail. If you acknowledge that you have data then you MUST never lose it.
You cannot imagine how many database solutions that ruled out. Including basically _all_ noSQL data stores.
We settled on postgresql, because, despite going into toast tables and having to implement sharding on top, it was not only the most sane option, it could also be wrangled into basically doing the right thing with regard to fsyncing and ensuring data wasn't left in VFS.
So, Kudos to these guys, who, in a world of people who don't give a shit about consistency (except eventual consistency) are actually trying to further it.
For any database product, there are plenty of mature and robust technical tools for improvement and evaluation. I do not believe the company is less aware of them than any HN commenter. Therefore any technical feedback is only going to have limited impact.
But the naming feedback seriously helps them get an idea how their naming hurts their brand-acceptance.
"We settled on postgresql, because, despite going into toast tables and having to implement sharding on top"
This is a common mistake. The choice doesn't actually guarantee consistency at all. PostgreSQL only guarantees consistency as long as you don't try to communicate with it over a network, which you obviously do if you have shards. To address this problem we have things like 2pc, but they aren't very useful for an RDBMS on their own without the whole infrastructure of a distributed database.
The networking isn't a problem if you only acknowledge a write when the underlying database says so. You could have a write succeed before you crash. But you won't lose an acknowledged write.
No, but when your requirement is to "never lose a write regardless of which server fails" you have to make sure the mutation was acknowledged by more than one server.
So the least you have to do is to wait for a second server to confirm the write. But now what do you do when the network link between the two servers goes down in-between a write? The write might have already been applied on the first server or it might not have. The same goes for the second server. But you can't be sure either way.
For our thought experiment, let's assume that while the servers can't talk to each other anymore, some clients can still talk to each of the servers.
At this point, you either have to give up availability (i.e. don't accept any further writes until the fault is resolved) or consistency (i.e. potentially return unacknowledged/stale data from one of the nodes until the fault is resolved).
While returning unacknowledged/stale data is acceptable for some use cases, it does break ACID semantics. You can't get both ACID and "always-on" availability with a simple replication scheme like this.
If you want linearizability and HA, you need at least three servers and a more complex, quorum-based scheme. However, Postgres doesn't support that (AFAIK) - you have to use something like Zookeeper or CockroachDB. I think this is what zzzcpan meant.
(I realize that you only spoke about never losing a write and didn't say anything about updates/queries. If you don't care about ACID, postgres with synchronous replication is a good highly available solution.)
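To make the two-server versus three-server argument concrete, here is a minimal sketch of a majority-quorum acknowledgement rule. It is only an illustration: `Replica`, `quorum_write`, and the `reachable` flag are hypothetical names, the "network" is an in-memory boolean, and it deliberately ignores leader election, retries, and recovery, which a real system such as ZooKeeper or CockroachDB has to handle.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica; `reachable` simulates a network partition."""
    name: str
    reachable: bool = True
    log: list = field(default_factory=list)

    def append(self, record) -> bool:
        if not self.reachable:
            return False  # the coordinator gets no confirmation from this node
        self.log.append(record)
        return True


def quorum_write(replicas, record) -> bool:
    """Acknowledge the write only if a strict majority of replicas confirmed it."""
    acks = sum(1 for r in replicas if r.append(record))
    return acks > len(replicas) // 2


# Three servers, one partitioned away: 2 of 3 confirm, so the write is acknowledged.
three = [Replica("a"), Replica("b"), Replica("c")]
three[2].reachable = False
ok_with_three = quorum_write(three, "x=1")

# Two servers, one partitioned away: 1 of 2 is not a majority, so the write is refused.
two = [Replica("a"), Replica("b")]
two[1].reachable = False
ok_with_two = quorum_write(two, "x=1")
```

Note that in the two-server case the record still sits in the first replica's log even though it was never acknowledged, which is exactly the "might have been applied, might not" ambiguity described above.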
We don't have auto-healing HA; we have 32 master databases, each with replica databases underneath it using synchronous replication. Meaning things must be synced to both before the COMMIT OK is received by the client.
Then you do the sharding logic in the application.
No write can be sent back as being "OK" unless it's on disk on 2 servers which represent a vertical slice of our entire database structure.
We assume power-loss scenarios mostly, which means if it's on disk and not in vfs then we're fine- as power-loss is more likely than complete raid degradation or server disappearance, although the replicas help with that too.
You don't need quorum at all in this scenario, and no matter which database or client fails you will not lose data that you've acknowledged, even on immediate power loss to 50% of your entire infra.
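A rough sketch of the scheme as described, assuming a stable hash for routing and treating "on disk" as an in-memory list. Only the number 32 comes from the description above; `Node`, `Shard`, `shard_for`, and `put` are hypothetical names, and real PostgreSQL synchronous replication is configured on the server rather than written like this.

```python
import zlib

NUM_SHARDS = 32  # from the description above; everything else here is hypothetical


class Node:
    def __init__(self):
        self.disk = []
        self.up = True

    def write_and_fsync(self, row) -> bool:
        if not self.up:
            return False
        self.disk.append(row)  # stands in for a durable, fsynced write
        return True


class Shard:
    def __init__(self):
        self.master = Node()
        self.replica = Node()

    def commit(self, row) -> bool:
        # "COMMIT OK" is returned only when both copies are on disk.
        return self.master.write_and_fsync(row) and self.replica.write_and_fsync(row)


def shard_for(key: str) -> int:
    # Stable hash so every application server routes a given key identically.
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


shards = [Shard() for _ in range(NUM_SHARDS)]


def put(key: str, value) -> bool:
    """Sharding logic lives in the application: pick the shard, then commit."""
    return shards[shard_for(key)].commit((key, value))
```

With this rule an acknowledged write always exists on two machines, but a write is refused whenever either machine of the pair is down, so durability is bought with availability.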
So what do you do when the master for one of your shards fails? Do you drop all incoming writes for the shard on the floor? Or do you fail over to the shard's slave and promote the slave to the new master?
Since you said you don't have "auto healing HA" I assume you don't fail over, but discard/deny incoming writes until the master comes back up.
This is a valid approach, but, I don't see how it contradicts what I said at all:
- I said you can't get full ACID and HA failover at the same time with postgres
- Your scheme does not provide HA failover
I explicitly said that if you can forgo either full ACID or HA in case of a failure, postgres is fine.
It's not bikeshedding at all. It looks like a good solution, but many people have concerns about an inability to use it because of management/clients and the connotation of the name. That's a legitimate concern. Don't just hand wave it away.
It's not bikeshedding. Who the fuck wants to work for a company called Cockroach Labs? Seriously, sure your tech friends and recruiters might appreciate it on your resume but anyone outside of tech? Applying for an apartment, mortgage, car, lol good luck. And also not too sure that anyone outside the immediate tech/startup nerd scene would appreciate it either.
That aside why the hell do I want to think of cockroaches day in and day out when I am using their product? Seriously, is this someone's idea of a bad joke?
Nobody is going to give a flying fuck what your company is named on a mortgage application.
Honestly there are companies with _much_ worse names in active use like "Dick & Daughters"[0] or "Ismail ANUS"[1] who even got an appointment to the queen.
Seriously, it's bikeshedding, and the point of the DB is that you don't think about it all day.
I also don't think of women called Cassandra when I use the database of the same name. Bikeshedding. Seriously.
Dick & Daughters or Ismael ANUS are definitely not more vile than Cockroach. The former are names rooted in tradition, not some idea of an arrogant hipster joke. To call a company Cockroach is fucken arrogance. Plain and Simple.
Cockroach is just a creature. How is it arrogance. It reeks of arrogance to call yourself "Lionhead" or "Jaguar" or "Oracle". Some form of apex predator or all-knowing entity.
Also "Is Male Anus" is not rooted in tradition at all.
It would be bikeshedding if we were all, for no good reason and without prompting, suggesting names that we preferred, with no consensus emerging.
Instead what I see in this thread is a clear consensus that the current name is atrocious, even to the point of being detrimental to the long-term success of the project, and should therefore be changed.
Personally, I refuse to use or recommend this product as long as its current name remains. And to those who think I'm being petty, 1) I can afford to be petty given how many database options (NoSQL or otherwise) are out there, and 2) market forces should punish products with names like this, or else we'll eventually find ourselves debating the merits of HerpesDB vs. ManureDB.
I'm disappointed with the commentary here, I for one would think the discussions around building a distributed system with the consistency guarantees claimed here would be far more interesting than the name of said system.
CockroachDB might be the worst product name I've come across. I get it; cockroaches are extremely hardy, but they're also much more strongly associated with uncleanliness and disease.
EDIT: Also, cockroaches are BUGS! Bugs cause problems in computer systems.
I always thought MongoDB was a pretty horrendous, semi-racist name and they ended up dominating, for better or worse, the NoSQL market. Of course, I quickly found out that Mongo was meant to be related to term 'Humongous' and not 'Mongoloid'.
Yes but I'm not sure the comparison stands. PSQL was created in 1996 and it happens to be one of the 2-3 most common/famous databases ever made as far as I can tell.
If you're postgres, it doesn't really matter because you're already there.
People will remember if the product is good so if you're confident in your technology you don't need any gimmicky tactics.
They're just hurting themselves for no good reason other than feeling good about their bad sense of humor most people don't sympathize with.
Imagine any IT department in a large company: when they're deciding which technology to use, this definitely matters. I mean, funny names like "mongodb" are alright, but the first image when people hear about "cockroachdb" is not a good picture. And first impressions matter a lot. It's marketing 101.
--
Just think about real-life scenario of how you would recommend to a co-worker:
"Hey we should definitely use this new database. It's called couchdb!" vs "Hey we should definitely use this new database, it's called cockroachdb!"
Or,
"Hey we should check out this awesome way of representing API. It's called JSON-API!" vs "Hey we should check out this awesome way of representing API. It's called Swagger!"
Agreed, they can change it if they ever get big enough. But before that more memorable brand is going to make it feel more trustworthy than the competition.
To be a tad more concrete: GunDB can run in the browser as well as on servers; CockroachDB is a traditional, client/server database. GunDB is aiming for eventual consistency, but when I looked a year ago, it seemed like the algorithm they chose wasn't actually convergent--nodes could diverge because non-commutative updates could be applied in different orders on different nodes. CockroachDB, by contrast, is a serializable, single-key linearizable store. I don't think GunDB offers transparent sharding or SQL, both of which are important aspects of CockroachDB.
Aphyr, thanks! I appreciate seeing you here and replying. GUN Author here.
First, to answer @sroussey - Kyle is correct, GUN is an eventually consistent graph database that runs browser/server with extremely high availability, while CockroachDB is trying to be a strongly consistent, linearizable key/value SQL store. To read about what we guarantee/don't, check out these articles:
The last year we've focused on performance, and can now do 25M+ reads/sec (no cache miss, disk I/O performance isn't particularly interesting to us). More on that here: https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2... (we're not kidding about it being able to run in the browser)
With regards to commutative / non-commutative, @Aphyr, slight correction: Different machines will converge to the same value, it is just that you'd need to use a commutative CRDT on top of gun (we have one here: https://github.com/amark/gun/wiki/snippets-(v0.3.x)#counter ) for them to converge to their combined value.
Example: GUN treats primitives as atomic, so if you try to `gun.get('some').path('math').put(5 + 2)` it will converge to the atomic value of 7. So if two machines do `gun.get('some').path('math').put(currentValue + 2)` at the same time, both machines will converge to the fixed atomic value, not a commutative value - you need the CRDT for that.
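The difference between converging to an atomic value and converging to a combined value can be sketched outside of GUN. The following is an illustration only, not GUN's actual conflict-resolution algorithm: `AtomicRegister` is a hypothetical last-write-wins register with an assumed (timestamp, node id) tie-break, and `GCounter` is a textbook grow-only counter CRDT.

```python
class AtomicRegister:
    """Last-write-wins register: merge keeps the value with the highest (timestamp, node_id)."""

    def __init__(self):
        self.value, self.stamp = 0, (0, "")

    def put(self, value, timestamp, node_id):
        self.merge(value, (timestamp, node_id))

    def merge(self, value, stamp):
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp


class GCounter:
    """Grow-only counter CRDT: one slot per node, merge takes the per-slot maximum."""

    def __init__(self):
        self.slots = {}

    def add(self, node_id, amount):
        self.slots[node_id] = self.slots.get(node_id, 0) + amount

    def merge(self, other):
        for node_id, count in other.slots.items():
            self.slots[node_id] = max(self.slots.get(node_id, 0), count)

    @property
    def value(self):
        return sum(self.slots.values())


# Two machines both do put(currentValue + 2) at the same logical time.
reg_a, reg_b = AtomicRegister(), AtomicRegister()
reg_a.put(reg_a.value + 2, timestamp=1, node_id="a")
reg_b.put(reg_b.value + 2, timestamp=1, node_id="b")
a_state, b_state = (reg_a.value, reg_a.stamp), (reg_b.value, reg_b.stamp)
reg_a.merge(*b_state)
reg_b.merge(*a_state)

# The same two increments expressed through the counter CRDT.
cnt_a, cnt_b = GCounter(), GCounter()
cnt_a.add("a", 2)
cnt_b.add("b", 2)
cnt_a.merge(cnt_b)
cnt_b.merge(cnt_a)
```

Both registers agree on a single written value (one of the two increments is effectively absorbed), while both counters agree on the sum of the increments. Both structures converge; only the counter preserves every update.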
I mention this in the "What could go wrong?" section of the talk (linked above), and it is easy to add the necessary CRDT when you need it - actually, this was inspired by you @aphyr, from our discussion a few years ago.
Note: @Aphyr, we built a distributed testing framework ( https://github.com/gundb/panic-server ), and now that our "performance improvement" stage is done, in the near future we're going to start building distributed correctness tests (we're rolling out to a government client soon). We'll probably be in touch with you in the next half year or so. :)
Great work, keep it up. You are always a huge inspiration. <3
Edit: With regards to sharding / SQL, we don't have SQL support yet but somebody is currently building a prototype for it. Sharding, not built in, but peers store the data they request, which acts as a natural shard, but nothing fancy - for more info check out this article: https://github.com/amark/gun/wiki/sharding .
I'm going to be the dissenting voice here and say: Use whatever name you like as long as it isn't something like cuntbagDB.
CockroachDB accurately and immediately conveys meaning. It'll survive when nuclear war happens just like a cockroach. And I might work in the games industry but I would never have a hard time selling this. I would basically force them to use it and explain the benefits.
If your company won't adopt technology because it uses a bug as its name then your company has much greater problems than technology adoption.
It's not my company or me, it's my clients. Here's a sort of long winded story explaining what I mean that the corporate world can't use anything with the name "cockroach". If you disagree after reading, I'm open for a friendly debate.
Average project, some 60 year old VP signing off on 100k of custom software with 8 pages of lawyer speak contracts is NOT going to allow "his" project to run off of "cockroach database". VP's/CEO's like to take personal ownership of projects they feel will go well, and ones done with us usually do. None of these people want ANY chance of their project getting criticized internally, so nobody will use CockroachDb. It's just not going to happen. Having successful projects is an easy way for these people to gather promotions and respect. Having any part of it do with "cockroach" could result in the whole project being an embarrassment.
Imagine the database goes down one day. One of their low level staff calls us(we have ongoing support contract). Problem gets rapidly escalated to development. One of our lead developers replies to the email thread, that by now has a third of the clients' company on it. "Yeah see the problem was the cockroach database that wasn't supposed to go down needed to be restarted" . Most of the staff think we're joking. Guarantee this becomes a big pun around their office, every time anything breaks in the app they say "maybe the cockroaches on the back end died again".
It's made worse by the fact that issues in software systems are commonly called "bugs". Well, our system doesn't have general bugs anymore, it has "cockroaches". Do you think this is an unlikely scenario? Because it's not. Besides the word "cockroach" this is exactly how urgent issues tend to be reported and fixed. Replace "cockroach" with "Microsoft" and I've probably violated my NDA. After that incident, whoever championed the project, probably our strongest ally in the company, resents us for making such a stupid mistake that resulted in his project becoming a joke. Maybe we aren't as great of a dev shop as he once thought? Maybe it's better to do open bidding next time so whoever wins he can't be blamed directly for what happens.
You could say their company culture sucks, maybe it does? But that doesn't change that I can't use it on any client projects because of the chance we'll lose hundreds of thousands over a stupid name.
There's plenty of other names that convey meaning but don't make lawyers nervous. Your attitude of "oh, well if you can't use it your company culture sucks" is EXACTLY why nobody will rename it and why corporate America won't touch it. Nobody cares about how cool it is to be edgy when millions of dollars are on the line.
If you plan on forcing clients to use it you might as well hold a going away party for any of your clients with more than 5M revenue.
If your VP isn't trusting his technical directors to make appropriate choices when producing quality reliable software and infrastructure then I would say that is a problem with company culture.
I wouldn't have a problem selling the technical merits of a highly consistent database to a technical director. So everything else is moot to me.
I agree about the "bug" association, that is a negative stigma that I could see attached to it- but there are plenty of projects who have stupid names which pass by unnoticed.
Take distro release names: Beefy Miracle, Wheezy, Natty Narwhal.
And these terms are used internally too. "Oh, that machine was running Squeeze and not Jessie which is why we can't install the latest gdb" etc.
You're giving this much more credence than it will have in real life.
I had the same issue with another tool I used called "agent ransack". The solution is simple, offer the same product with a different name to corporate clients. They have a different download called "file locator pro" that's literally exactly the same, without the bad name.