
Horizontal scalability might be one.



But why scale something out when you can just run one postgres instance and get 10,000x the performance?


If your one machine's disk fails, you'll lose data unless you have synchronous replication. And once you have synchronous replication, let's hope the standby is at least on a different rack if it's in the same data center. Preferably in a different geographical region, actually. And now you no longer have "really fast" QPS.
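For what it's worth, a minimal sketch of checking that a Postgres primary really does have a synchronous standby, via the real pg_stat_replication view (host, user, and database below are made-up placeholders; assumes psycopg2):

    # Sketch: confirm synchronous replication on a Postgres primary.
    # Connection details here are hypothetical placeholders.
    import psycopg2

    conn = psycopg2.connect("host=primary.example.com dbname=app user=monitor")
    with conn.cursor() as cur:
        # pg_stat_replication lists connected standbys; sync_state is
        # 'sync' only for standbys participating in synchronous commit.
        cur.execute("SELECT application_name, sync_state FROM pg_stat_replication")
        for name, state in cur.fetchall():
            if state != "sync":
                print(f"warning: standby {name} is not synchronous ({state})")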

All this assumes you have only as much data as can fit on one machine. If that's not the case, you need a database that spans more than one machine.


My thought exactly. It's already the case with faster DBs. E.g., most projects my customers have me work on will never need to scale across multiple instances. They run really fast on a single postgres instance.

One of them had some perf issues.

First, he just moved the DB so it wasn't on the same machine as the web server. That bought him a year.

Then, as the user base grew, he just bought a bigger server for the DB.

Right now it runs really fast at 400k u/d, and there are still two server offerings more powerful than the one he currently has. By the time he needs to buy them, two new, even more impressive offerings will probably exist. Which is a joke, because more traffic means he has more money anyway.

Running a cluster of nodes to scale is a lot of work. So unless you have 100 million users doing crazy things, what's the point?


You do understand that scalability is about more than just concurrent users, right? Maybe not?

There are many very small startups doing work in IoT, analytics, social, finance, health, etc. that have ridiculously challenging storage needs. Vertical scaling is simply not an option when you have a 50-node Spark cluster or 1M IoT devices all streaming data into your database; it's largely an I/O challenge, not a CPU/storage one.

But hey, it's cute and lucky that you get to work on small projects. Many of us don't, and have to deal with these more exotic database solutions.


This is a very niche use case. The vast majority of projects neither have the user base to reach such numbers, nor are they I/O-challenged. Most companies are not Facebook or Google.

Plus, if you can only manage 50 writes/sec on this DB, I don't see how you plan to scale anything with it. You're back to using good old sharding with load balancers.


You're right, most people probably won't need CockroachDB, but in my opinion, one of the nicest things about CDB is that it uses the postgres wire protocol.

So by all means, scale up your single DB server. But if you ever get the happy problem of even more growth, and need to scale beyond a single server, then you "should" be able to migrate your data to a CockroachDB cluster and not have to change much, if anything, in your application.
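To make that concrete, a minimal sketch of what "not have to change much" can look like (hostnames, credentials, and the table are made-up placeholders; assumes psycopg2): the same driver code, pointed first at postgres and later at a CockroachDB node, which speaks the same wire protocol on its default SQL port 26257:

    # Sketch: identical driver code against postgres or CockroachDB.
    # All connection details here are hypothetical.
    import psycopg2

    # Before: a single postgres instance.
    # conn = psycopg2.connect("host=db.example.com port=5432 dbname=app user=app")

    # After: a CockroachDB cluster, same protocol, different port.
    conn = psycopg2.connect("host=crdb1.example.com port=26257 dbname=app user=app")

    with conn.cursor() as cur:
        cur.execute("SELECT id, email FROM users WHERE id = %s", (42,))
        print(cur.fetchone())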


If your dataset fits comfortably on one postgres instance, and will continue to do so for your current architectural planning time horizon, then you have little need to use CockroachDB / Spanner.

These databases are designed for use-cases which require the consistency of a relational database, but cannot fit on a single instance.


Right now, not really. Cockroach's current performance doesn't allow you to build up a big dataset.


You are misunderstanding the article. This is not a benchmark; it's a test of how correctly the database handles distributed transactions and data under the worst conditions possible. These are not real-world performance numbers in any sense.


You are misunderstanding these comments.

The problem is not just performance; it's that distributing has a huge cost in terms of servers and maintenance.

If you can write only 50 times a second, your data set won't get big enough to justify distributing it.

Put your millions of rows on one server and be done with it. Cheaper, faster, easier.

There is a tendency nowadays to make things distributed for the sake of it.

Distribution is a constraint, not a feature.


Why do you keep repeating 50 w/s when that's not an actual performance number? CDB will likely run with thousands of ops/sec per node.

Can you really not see why distributed databases are needed? High availability, (geo) replication, active/active, oversized data, concurrent users, and parallel queries are just a few of the reasons.

Distribution will always come with a cost, but the tradeoffs are for every application to make. We use a distributed SQL database and while it's faster than any single-node standard relational DB would be, speed isn't the reason we use it.


Can you elaborate on this point please? My reading of the article and docs is that perf is expected to scale linearly with the number of nodes (and therefore with dataset size).


Let's say you get 50 writes/sec per node. You need, what, 1,000 nodes to get to the playing field of Postgres, which is a simpler and cheaper setup? Right now it's really not competitive unless they improve the performance 1,000x. It makes no sense to buy and administer many more machines to get the power you can have with one machine on other, proven tech.


> Let's say you get 50 writes / sec per node.

If the DB can only handle 50 ops/sec then the point you are making here is valid.

But see https://news.ycombinator.com/item?id=13661349; that's a pathological worst-case number. You should find more accurate performance numbers for real-world workloads before drawing sweeping conclusions.

Your original comment was:

> Even if they manage to multiply this by 100 on the final release, it's still way weaker than a regular sql db

This is what my comment, and the sibling comments, are objecting to, and I don't think you've substantiated this claim. 100x perf is probably well in excess of 10k writes/sec/node, which is solid single-node performance (though you'd not run a one-node CockroachDB deployment). Even a 10x improvement would get the system to above 1k writes/sec/node, which would allow large clusters (O(100) nodes) to serve more data than a SQL instance could handle.
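Back-of-envelope, using the assumed (not measured) figures above:

    # Sketch: aggregate write throughput under the assumptions above.
    # These are illustrative numbers, not benchmark results.
    per_node = 1_000   # assumed writes/sec/node after a ~10x improvement
    nodes = 100        # an O(100)-node cluster
    print(per_node * nodes)  # 100,000 writes/sec aggregate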

Obviously I'd prefer to be able to outperform a SQL instance on dataset size with 10 nodes, but for a large company, throwing 100 (or 1000) nodes at a business-critical dataset is not the end of the world.


This statement makes absolutely no sense.

Performance is only loosely correlated with dataset size, and even less so in most distributed databases like CockroachDB.


At a given point in time, yes, but to get that dataset, you need to write it into the DB. A big dataset implies either that you have been receiving data for a very long time or that you ingested it quickly over a shorter period. It's usually the latter. Which means you need fast writes. 50 writes/sec is terrible, even more so if improving it means buying more servers while your 30-euros-a-month postgres instance can handle much more than that.


What I found out the hard way is that there's a qualitative difference between 'can' and 'must' here that causes a lot of problems with the development cycle.

When the project can no longer fit onto a developer's box it changes a bunch of dynamics and often not for the better. Lots of regressions slip in, because developers start to believe that the glitches they see are caused by other people touching things they shouldn't.

With Mongo we had to start using clustering before we even got out of alpha, and since this was an on-premises application now we had to explain to customers why they had to buy two more machines. Awkward.


Plus, I found that people using MongoDB tend not to formalize their data schema, because the tool doesn't enforce it. But they do have a data schema.

Only:

- it's implicit, and you have to inspect the db and the code to understand it.

- there is no single source of truth, so any changes had better be backed by unit tests. And training is awkward. If some fields/values are rarely used, you can easily end up not knowing about them while "peeking at the data".

- the constraints are delegated to the devs, so code making checks is scattered all over the place, which makes consistency suffer. I almost ALWAYS find duplicate or invalid entries in any mongo storage I have to work on. And of course, again, changes and training are a pain. "Is this field mandatory?" is a question I asked way too often on those mongo gigs (a small sketch of the resulting drift follows below).
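A minimal sketch of that drift (field names, values, and connection string are made up; assumes pymongo and a local mongod):

    # Sketch: nothing stops "user" documents from disagreeing about
    # the schema. All names and values here are hypothetical.
    from pymongo import MongoClient

    users = MongoClient("mongodb://localhost:27017").app.users
    users.insert_one({"email": "a@example.com", "age": 42})
    users.insert_one({"mail": "b@example.com"})                # typo'd field
    users.insert_one({"email": "c@example.com", "age": "42"})  # age as a string

    # "Is this field mandatory?" -- the only way to know is to scan:
    print(users.count_documents({"email": {"$exists": False}}))  # -> 1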

Things like redis suffer less from this because they are so simple it's hard to get wrong.

But mongo is great software. It's powerful. You can use it for virtually anything, whatever the size, quantity, or shape. And so people do.

The funny thing is, the best MongoDB users I've met come from an SQL background. They know the pros and cons, and use the flexibility of Mongo without making a mess.

But people often choose mongo as a way to avoid learning how to deal with a DB.


Oh, I'll never use Mongo again. I think they believe their own PR, and I feel duped by the people who pushed it into our project against the reservations of more than half of the team. In fact I'd rather not work with anyone involved in that whole thing.

But I didn't want to turn the Cockroach scalability discussion into another roast of Mongo, so I very purposefully talked only about horizontal scaling vis-à-vis 'immediately have to scale' versus 'very capable of scaling'.

If you gave me a DB that always had to be clustered but I could run it as two or three processes on a single machine without exhausting memory, I would be fine with that as long as the tooling was pretty good. As a dev I need to simulate production environments, not model them exactly.

But if the whole thing only performs when it's effectively an in-memory DHT, there are other tools that already do that for me and don't bullshit me about their purpose.


That last statement is a bit off. Almost every developer knows how to use an SQL database. For 95% of things you want to do it's a very simple piece of technology.

The issue is that they are annoying to use. You have to create a schema, you have to manage the evolution of that schema (great, another tool), you have to build out these complicated relational models, and finally you have to duplicate that schema in code.
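A minimal sketch of that duplication (table, column, and class names are made up): the same shape declared twice, once as DDL for the database and once as a class for the code, with nothing keeping the two in sync:

    # Sketch: the schema lives in two places that must be kept in
    # sync by hand. All names here are hypothetical.
    import sqlite3
    from dataclasses import dataclass

    DDL = "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"

    @dataclass
    class User:      # the same schema, duplicated in application code
        id: int
        email: str

    db = sqlite3.connect(":memory:")
    db.execute(DDL)
    db.execute("INSERT INTO users (id, email) VALUES (?, ?)", (1, "a@example.com"))
    print(User(*db.execute("SELECT id, email FROM users").fetchone()))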

And so, IMHO, that's why people move toward MongoDB. It's amazing for prototyping, and then people just, well, stick with it, as there isn't a compelling enough reason to switch.


See, I would agree with you but I'm aware that we may be dating ourselves.

It's been so long since I touched SQL that I'm forgetting how. Which means that my coworkers haven't touched it either, and some of them that were junior are probably calling themselves 'senior' now, having hardly ever touched SQL. It's no wonder noSQL can waltz into that landscape.

But what's the old rule? If your code doesn't internally reflect the business model it's designed for, then you'll constantly wrestle with impedance mismatches and inefficiencies, and eventually the project will grind to a halt.

So what I'd rather see is a database that is designed to work the way modern, 'good' persistence frameworks advertise them to work. That would represent progress to me.


> - it's implicit, and you have to inspect the db and the code to understand it.

Or run ToroDB Stampede [1] and look at the generated tables manually or via other tools like SchemaSpy [2].

[1] https://www.torodb.com/stampede/

[2] http://schemaspy.sourceforge.net/

(warning: I'm a ToroDB dev)


Scale != performance, in the narrow sense you seem to be considering.

Lots of data, parallel queries, high availability, replication, multi-master, etc. are all cases that require true distributed scalability.


Storage and availability. But I agree, it will not be an easy trade until they improve the performance significantly.



