For anyone else too lazy to read the slides to find out what NewSQL is, this is slide 13:
NewSQL: definition
"A DBMS that delivers the scalability and flexibility
promised by NoSQL while retaining the support for SQL
queries and/or ACID, or to improve performance for
appropriate workloads."
451 Group
It's a shame that most of these aren't open source.
At one point I had great hopes for RethinkDB. It seemed like a great fit when I first heard about it: it's among the newer databases, it's open source, and it's MVCC-based, distributed, sharded, and multi-master, with a neat query language that has some advanced features.
Unfortunately, while RethinkDB does feel very pleasant and modern, performance is pretty terrible, and it's quite lacking in some other areas.
We're still optimizing the performance of certain operations. There are definitely still a few rough edges, especially with analytical queries as you'd pointed out.
As far as basic operations (inserts, updates, retrieving documents etc.) are concerned, RethinkDB already has competitive performance and we hope to publish official benchmark results soon. As you know, good, comparable benchmarks aren't easy, and we're currently spending most of our time on improving RethinkDB rather than working on those. But they're definitely coming.
This is just a guess, but a query involving COUNT(*) might execute faster with PostgreSQL since it can use an index to make the count basically a constant-time operation, while RethinkDB doesn't currently support this. That is definitely a missing feature on our end.
If you get a chance, I'd be very happy to hear from you about the specific queries and the data you'd been testing, so we can look into those specific performance problems. My email is daniel at rethinkdb.com .
Benchmarks are indeed hard, which is why I only posted a very specific couple of numbers.
Postgres isn't using an index for the count. When you're essentially tallying an entire table it's a lot less efficient to traverse a B-Tree than to just sequentially stream the table, which is something Postgres is pretty good at.
Here's a sample explain output [1]. Mind you, that server was under heavy load (load avg 12) when I ran the query, which is why the numbers are higher than usual; with no competing CPU or I/O, it normally takes about 500ms.
That box has only 4GB of RAM, out of which it's using about 512MB for caching, and the OS is using about 1.5GB for the file page cache. As you can see from the plan, it's hitting 38,108 buffers, or 297MB of data, meaning it's plowing through approximately 148MB/sec to compute the aggregation.
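If anyone wants to reproduce that kind of measurement, here's a rough sketch with psycopg2; the table and column names are placeholders rather than my actual schema, and the plan you get will obviously depend on your data:

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # BUFFERS reports how many 8KB pages the query touched (38,108 pages is
    # roughly 297MB) and ANALYZE gives the actual runtime. For an unfiltered
    # count/GROUP BY, the planner normally picks a sequential scan plus a
    # hash aggregate rather than walking an index.
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) SELECT x, count(*) FROM y GROUP BY x")
    for (line,) in cur.fetchall():
        print(line)

    cur.close()
    conn.close()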
For comparison, RethinkDB chugs about 6GB of RAM before it manages to eke out some results from that query. I'm honestly not sure what it's doing.
Are you able to go into more detail about the performance of RethinkDB? I'm considering using it for a new project but haven't been able to find anything solid about performance except vague, hand-wavy "not very fast" anecdotes.
It's been a few weeks since I did a benchmark comparison, and I don't have a setup anymore that I can re-run to get real numbers.
RethinkDB is decent but not great at simple queries, like "select * where foo = 1". Unfortunately, and it's a rather big minus, Rethink doesn't have a planner, and every index has to be specified explicitly. This puts a burden on the application, which needs to be completely index-aware.
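To illustrate what being index-aware means in practice, here's a minimal sketch with the RethinkDB Python driver (the table, field, and index names are made up):

    import rethinkdb as r

    conn = r.connect("localhost", 28015)

    # Without an index, this "select * where foo = 1" equivalent is a table scan:
    list(r.table("y").filter({"foo": 1}).run(conn))

    # To use an index, you have to create it and then name it in the query
    # yourself; there's no planner that will pick it for you.
    r.table("y").index_create("foo").run(conn)
    r.table("y").index_wait("foo").run(conn)
    list(r.table("y").get_all(1, index="foo").run(conn))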
Where it really does break down is on aggregations. On a dataset of 800k semi-complex documents, the query "select x, count(*) from y group by x" (x here is a string with a cardinality of 89) took 4 seconds with Postgres, 84 seconds with Rethink on a test box.
In this scenario, Postgres was actually running in a slow VirtualBox VM, whereas Rethink was running natively; on our production servers, the same query takes about 500ms. Rethink also used about 6GB of the 8GB of memory allocated to it, whereas Postgres just streams the data through the FS cache -- so I think Rethink is just using a really bad query strategy.
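For reference, these are the two query shapes being compared, sketched with the Python drivers and placeholder names:

    import psycopg2
    import rethinkdb as r

    # Postgres: the planner picks the access path (a seq scan + hash aggregate here).
    pg = psycopg2.connect("dbname=test")
    cur = pg.cursor()
    cur.execute("SELECT x, count(*) FROM y GROUP BY x")
    print(cur.fetchall())

    # RethinkDB: the equivalent group/count, spelled out explicitly in ReQL.
    conn = r.connect("localhost", 28015)
    print(r.table("y").group("x").count().run(conn))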
Rethink's benefit, of course, is horizontal scaling, and it's supposed to be able to parallelize reads. But it has to work for the types of queries you give it. I strongly advise looking at your application's query needs, importing some test data, and seeing how those queries perform. If you don't need aggregations or complex joins, it might work for you.
Thank you for coming back to this thread to share your experience, much appreciated!
I did not realize RethinkDB lacks a query planner/optimizer. That is a huge downside. The docs didn't seem to make much effort to point out this limitation.
We consider ourselves NewSQL: http://schemafreedb.com/. We are looking into using NuoDB as one of our backend storage options when scalability is a concern; the default is MySQL with TokuDB.
Or use Postgres-XC, or MariaDB 10 with Galera and MaxScale. Postgres has JSON storage. MariaDB has dynamic and virtual columns, and the CONNECT engine, which can now directly create new or edit existing JSON files through the SQL interface instead of dedicated functions for JSON manipulation.
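The Postgres JSON route is pretty approachable; a minimal sketch (assuming Postgres 9.4's jsonb and made-up table/column names) looks like:

    import json
    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # Store schema-free documents in a jsonb column...
    cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, body jsonb)")
    cur.execute("INSERT INTO docs (body) VALUES (%s)",
                (json.dumps({"foo": 1, "x": "a"}),))

    # ...and query them with plain SQL; ->> pulls a field out as text, so
    # ordinary indexes and aggregates apply.
    cur.execute("SELECT body ->> 'x', count(*) FROM docs GROUP BY 1")
    print(cur.fetchall())
    conn.commit()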
With the recent SQL-like query language, I think Cassandra comes pretty close to satisfying the requirements as well, as long as you don't require multi-key transactions.
It's not bad, and Apache Spark adds SQL on top of some other NoSQL DBs, but they are still hard to use.
Cassandra, for example: you can't just put a secondary index on a table to enable sorting on it. There are secondary indexes, but they aren't useful for that (they don't perform well on high-cardinality data).
Want a different sort? Make a new table. And duplicate the data, since there are no JOINs.
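Concretely, the pattern ends up looking something like this with the DataStax Python driver (keyspace, table, and column names invented for illustration):

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")

    # Same events stored twice, once per query/sort order; only the
    # clustering columns differ between the two tables.
    session.execute("""
        CREATE TABLE IF NOT EXISTS events_by_time (
            user_id text, created_at timestamp, payload text,
            PRIMARY KEY (user_id, created_at)
        ) WITH CLUSTERING ORDER BY (created_at DESC)
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS events_by_type (
            user_id text, event_type text, created_at timestamp, payload text,
            PRIMARY KEY (user_id, event_type, created_at)
        )
    """)

    # With no JOINs, the application writes to both tables (denormalizes).
    now = datetime.utcnow()
    session.execute("INSERT INTO events_by_time (user_id, created_at, payload) "
                    "VALUES (%s, %s, %s)", ("alice", now, "login"))
    session.execute("INSERT INTO events_by_type (user_id, event_type, created_at, payload) "
                    "VALUES (%s, %s, %s, %s)", ("alice", "login", now, "login"))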
This is because the top priority for Cassandra is being scalable and always-on, first and foremost. New features are added only if they meet these goals. Therefore you have a guarantee that it will always scale and that it has no single point of failure, not even a temporary one, regardless of which feature you use.
Some other new database engines do it in the opposite order: first adding as many RDBMS features as possible and then making some of them kinda scalable (in some circumstances, if the phase of the moon is right, etc.). Sure, that might feel less cumbersome at the beginning on a single laptop (it feels like a good old RDBMS), but once you scale it out, it is far from being all roses.
BTW, duplicating (denormalizing) data is the way to go if you really need scalability. In the general case, joins or traditional indexes with references from the index leaves to the original data do not scale on distributed data sets.