RethinkDB 2.1.5 performance and scalability (rethinkdb.com)
155 points by knes on June 30, 2016 | 38 comments



We use RethinkDB for our home-grown monitoring system; it's incredible. We have several hundred clients connected, inserting monitoring data and doing some pretty complex queries. In the ~2 years we've been using RethinkDB, we have never experienced a failure.

If I could go back in time, I would use RethinkDB for our actual product.


Has the live query stuff performed well? I.e., subscribing to some arbitrary DB query for new records?


This is probably my favorite database to work with currently. Real time, relational document storage. Best of three worlds.


Not to mention the great push-based features!


Relational documents? That sounds like graphs but Rethink is not a graph database (compared to Neo4j, GUN, Orient, Arango, etc).

Do you have a link to explain what you mean?


It can do (performant) joins.

https://www.rethinkdb.com/docs/table-joins/
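To give a feel for it, here's a plain-JavaScript sketch of what an eqJoin conceptually does (this is illustrative data and a hand-rolled helper, not the RethinkDB driver):

```javascript
// Conceptual sketch of RethinkDB's eqJoin in plain JavaScript:
// pair each left-hand document with the right-hand document whose
// primary key equals the left's join field, then merge ("zip") them.
const authors = [
  { id: 'a1', name: 'alice' },
  { id: 'a2', name: 'bob' },
];
const posts = [
  { id: 'p1', authorId: 'a1', title: 'hello' },
  { id: 'p2', authorId: 'a2', title: 'world' },
];

function eqJoin(left, field, rightById) {
  return left
    .filter((doc) => rightById.has(doc[field]))
    .map((doc) => ({ ...doc, ...rightById.get(doc[field]) })); // "zip"
}

// Index the right-hand table by primary key, as the server would.
const authorsById = new Map(authors.map((a) => [a.id, a]));
const joined = eqJoin(posts, 'authorId', authorsById);
```

In actual ReQL the equivalent is roughly `r.table('posts').eqJoin('authorId', r.table('authors')).zip()`, per the linked docs.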


What would you say is the distinguishing feature of whether something "is a graph database"?

To put it another way, what do the databases you listed have in common with each other that they don't have in common with other NoSQL databases, besides marketing?


A graph database exposes an object model that fits graph terminology: nodes and edges. They (hopefully) are optimized for graph-traversal operations like hopping along a series of edges, which would be painful in a relational model.
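For instance, "hopping along a series of edges" can be sketched in a few lines of plain JavaScript over a made-up adjacency list (not any real database's internals):

```javascript
// Sketch: "hopping along a series of edges" over an adjacency list,
// the access pattern graph databases optimize for. In a relational
// model each hop would be another self-join or indexed lookup.
const edges = new Map([
  ['alice', ['bob']],
  ['bob', ['carol']],
  ['carol', ['dave']],
]);

// Follow outgoing edges n times from a starting node.
function hop(start, n) {
  let frontier = [start];
  for (let i = 0; i < n; i++) {
    frontier = frontier.flatMap((node) => edges.get(node) ?? []);
  }
  return frontier;
}

const twoHops = hop('alice', 2); // nodes reachable in exactly two hops
```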

In this situation, NoSQL is the marketing term, while a graph database is a clear concept with corresponding performance and capability expectations.


I'm not asking you to explain to me what a graph is.

I've found that it's easier and faster to import and traverse edges in SQLite or PostgreSQL than in some of the systems you listed. I haven't tried RethinkDB but it sounds promising. You use words like "hopefully" and "performance and capability expectations", but distinguishing databases based on their hopes and expectations is not at all a clear concept.

The main difference seems to be whether you call things "nodes and edges" or "documents and references" or "rows and joins" in the documentation.

(edit: removed some negativity, and I'm aware some remains)


The most common and relevant difference is that a graph database has a storage engine that is optimized for graph traversal. For example, in a RDBMS, another row is referenced via a foreign key which involves a lookup to an indexed column, while in a graph database, a node is often referenced by its storage location. This makes graph traversals much cheaper, and is also something that an RDBMS is unable to optimise for because of conflicts with the relational model.

Similarly, the query language is optimised for graph traversal types of queries, in a way that would not be possible in an RDBMS due to the relational model constraints, and also because some of the query operations would be extremely inefficient in an RDBMS storage engine.
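To make the claimed storage difference concrete, here's a toy plain-JavaScript model of the two lookup styles (a deliberately simplified analogy, not any engine's actual internals):

```javascript
// Toy illustration of the storage-level difference described above.
// RDBMS style: a foreign key is just a value that must be resolved
// through an index on every traversal.
const usersByPk = new Map([[42, { id: 42, name: 'alice' }]]);
const orderRow = { id: 1, userId: 42 };               // FK is just a value
const ownerViaIndex = usersByPk.get(orderRow.userId); // index lookup per hop

// Graph-store style: the edge holds a direct reference to the node's
// storage location, so traversal is pointer chasing, no index needed.
const userNode = { name: 'alice' };
const orderNode = { owner: userNode };                // edge = direct reference
const ownerDirect = orderNode.owner;                  // no index lookup
```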

> That sounds like graphs but Rethink is not a graph database (compared to Neo4j, GUN, Orient, Arango, etc).

With regards to the listed databases, some are, but the rest are not (and do not claim to be) graph databases.

Note: It's different people replying to you.


FWIW, I should mention that I have tried one DB that seems to qualify as a graph database, in that it actually does seem to import and traverse edges efficiently, which is Blazegraph. You tend to have to search past several other things calling themselves graph DBs to find that one.

(But I'm a bit afraid of using it. Any armchair lawyers want to tell me if code that uses a GPL database has to be GPLed itself?)


Don't worry about the GPL part; see e.g. MySQL, a GPL database in wide use.

As long as you access Blazegraph via the SPARQL or TinkerPop interfaces you will clearly be fine.


The capabilities and performance of graph databases do differ, and many aren't cheap, nor well known, so if you need one, you'll need to shop around.

Btw, Neo4J is definitely a graph database, although it might not be what you need.


I can't speak for the others, but the upcoming version of GUN (where I work) has some incredible performance benchmarks: https://github.com/amark/gun/wiki/100000-ops-sec-in-IE6-on-2... .

Yes, "documents and references" and "nodes and edges" is where things become fairly synonymous, which is why I asked about RethinkDB supporting relational documents. I honestly think they are doing themselves a disservice by calling it a "JOIN", since that sounds slow and scary. If they wrapped it in a nice API like ".navigate()" or ".traverse()", I think it would be reasonable for them to advertise as a graph database! Tip to anybody at RethinkDB in case they are listening.


I really like using rethinkdb. Currently, I am using thinky for node as an orm which works quite well for my purposes. If anyone else works with nodeJS I would be interested to know:

* how you use rethinkdb

* what you don't use it for, and why

* good examples resources like ORMs, repos, and data models.

Some things I like/use/reference:

- https://github.com/drhurdle/node-rethinkdb-auth-starter

- Yo Express, which provides a nicely structured MVC application with thinky, rethinkdb, and gulp

- https://www.airpair.com/javascript/posts/using-rethinkdb-wit...


While thinky is great and does a ton of heavy lifting for me, I've found it a little confusing when I need to write more advanced queries. I would personally recommend starting to learn ReQL + RethinkDB with the rethinkdbdash[0] driver first. It helps with managing connection pools and working with Promises, but still lets you play nice with the (very good) RethinkDB docs while ramping up.

[0]: https://github.com/neumino/rethinkdbdash


We had this experience as well. We started using Thinky for a few projects, but found that as soon as we needed more complex queries it became very awkward to work within Thinky to do them.

Eventually, we discovered ReQL by itself is just great to work with (especially with rethinkdbdash). Writing your own queries also has significant atomicity benefits over Thinky.


You can mix and match your own queries with Thinky helper functions. I do it all the time with .getJoin: write a more advanced ReQL query, then getJoin it into your other Thinky-defined models. Thinky is great for the modeling/relationship aspect, setting up default values based on returns from anonymous JS functions, ensuring indexes, and dealing with referential integrity when you want to recursively delete related documents. Managing all of that manually, while doable, would be a giant pain in the ass. Point is, Thinky and raw ReQL work together.


It's really not that bad to manage manually! The models we replaced with Joi validators; the relationships we weren't using anymore anyway because Thinky has some limitations there; default values are easy in Rethink; index management just became part of normal migrations; and referential integrity isn't actually guaranteed in Thinky since it's not atomic, so we found explicit chained promises more intuitive :)

Thinky and raw ReQL definitely work together, don't get me wrong! But as our project scaled we found 95% of our code had become raw ReQL for atomicity, performance, and relational reasons, so it just didn't make sense to use Thinky in future projects. Our original RethinkDB project still contains Thinky, but it'd be cleaner if we just made a complete switch at this point too.

To each their own though, of course!


Mostly agree about ORMs in general; I find that in JS they don't offer nearly as much as simple transforms against the direct objects, though they do provide the schema integrity that NoSQL/document data stores tend to skip over.


I'm using thinky too; I think it works fairly well and is a bit more lightweight than, e.g., mongoose for MongoDB.

I'm not using any real-time features yet which makes me a bit sad, but have been using it before real-time was their focus.

I'm the maintainer of the yeoman express generator, glad you like it!


What are some contraindicated use cases for RethinkDB?


Besides multi-document transactions, I've encountered these problematic use cases (discovered after project scope creep):

1. When I needed "more exotic" data types (e.g., but not limited to, Decimals) where you want query/analytics logic to happen in the database rather than the app.

2. OLAP types of use cases. I really wanted collation, sort order, query optimizers, the ability to "explore" the data in relational ways, and so on.


What did you use to solve these two use-cases?


Couldn't figure out a way to solve them reasonably (and within schedule) using RethinkDB, so I did a "master-slave" with the RethinkDB as the master and PostgreSQL as a slave replica. Then did all the queries and analytics on the PostgreSQL database. I think this is far from ideal in terms of server costs, but scope creep is never ideal.

I suppose using computed property indexes in RethinkDB might work as well, or generating indexes, sort order, and doing query optimisation outside of the DB, but I couldn't figure it out in time, and it also seemed way more complicated than replicating the data to PostgreSQL.

EDIT: The above is for the OLAP use cases. Some types of OLAP just don't go well with document databases. For the Decimal type, I stored it as a string and had a computed property that transformed the string to a float. The float can then be indexed and queried like a number. It was good, until all the other OLAP use cases crept in.
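The string-plus-computed-float trick looks roughly like this in plain JavaScript (field names are made up; in RethinkDB the derived value would back a secondary index):

```javascript
// Sketch of the workaround described above: keep the exact decimal as
// a string, and derive a (lossy) float used only for indexing/sorting.
const amountAsFloat = (doc) => parseFloat(doc.amount);

const invoices = [
  { id: 'a', amount: '10.50' }, // exact decimal kept as a string
  { id: 'b', amount: '2.25' },
];

// Range/sort queries run on the derived float; the string stays exact.
const sorted = [...invoices].sort(
  (x, y) => amountAsFloat(x) - amountAsFloat(y)
);
```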

Having said that, I found that building custom replicas with RethinkDB is easy because it emits data-change events. Never expected that feature to be an escape hatch.
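The replication loop is basically this shape (RethinkDB changefeed events really do carry old_val/new_val; the feed and replica here are mocked rather than a live cursor and a real PostgreSQL connection):

```javascript
// Sketch of the "escape hatch" above: consume data-change events and
// mirror them into a replica. The feed is mocked instead of a live
// r.table(...).changes() cursor.
const replica = new Map(); // stand-in for the PostgreSQL side

function applyChange(change) {
  if (change.new_val === null) {
    replica.delete(change.old_val.id);              // delete
  } else {
    replica.set(change.new_val.id, change.new_val); // insert/update
  }
}

const mockFeed = [
  { old_val: null, new_val: { id: 1, total: '9.99' } },  // insert
  { old_val: { id: 1, total: '9.99' },
    new_val: { id: 1, total: '19.99' } },                // update
  { old_val: { id: 1, total: '19.99' }, new_val: null }, // delete
];
mockFeed.forEach(applyChange);
```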


Also, what are some pain points of using RethinkDB, ie development problems, production problems?

RethinkDB looks really nice, was really easy to play around with it when I last did. I'm considering using it for a small project, but wondering what everyone thinks of the downsides, compared to the upsides. (there are pros and cons to everything :)).


Because of the lack of cross-document atomic operations, you might have to choose between at-most-once semantics and at-least-once semantics. This requires some fancy engineering to make it work for some use cases.
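One common shape of that "fancy engineering" is deduplicating retried writes on an idempotency key; a hand-rolled sketch (all names hypothetical):

```javascript
// Sketch of the tradeoff above: a retried write gives at-least-once
// delivery; deduplicating on an idempotency key makes it effectively
// once, without needing a cross-document transaction.
const store = new Map();

function applyOnce(op) {
  if (store.has(op.key)) return false; // duplicate delivery, drop it
  store.set(op.key, op.value);
  return true;
}

// At-least-once: the client retries because the first ack was lost.
applyOnce({ key: 'op-7', value: 'credit +10' });                 // applied
const retried = applyOnce({ key: 'op-7', value: 'credit +10' }); // deduped
```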

Minor failover is automatic, but major failover requires manual intervention to repair tables even when there's no data loss. That's a little confusing and not obvious.

The anonymous function query style can get confusing because it is partially executed client side and then executed server side. This can be a subtle source of bugs.
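The pitfall comes from the driver calling your anonymous function once, client-side, with a query term rather than a real document. A minimal mock of that mechanism (not the real driver's API) shows the bug class:

```javascript
// Sketch of the pitfall described above. A ReQL-style driver calls
// your anonymous function ONCE, client-side, passing a query term
// instead of a document; only term methods become part of the query.
function makeRowTerm() {
  return { gt: (n) => ({ type: 'GT', operand: n }) }; // mock query term
}

function buildFilter(fn) {
  return fn(makeRowTerm()); // runs on the client, at build time
}

// Correct: uses a term method, producing a server-side predicate.
const good = buildFilter((row) => row.gt(30));

// Subtle bug: a native comparison runs client-side against the term
// object itself; an object is never > 30, so this is always false.
const bad = buildFilter((row) => row > 30);
```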

Source: I am the creator and maintainer of the Elixir driver for RethinkDB.


> Minor failover is automatic, but major failover requires manual intervention to repair tables even when there's no data loss. That's a little confusing and not obvious.

Wasn't this fixed in RethinkDB 2.1? Or are you referring to something different?


Table configuration is a bit different from data. So if you lose too many of the table configuration voting replicas, you have to manually elect new ones.

Edit: it's better documented now, but still not obvious. https://rethinkdb.com/api/javascript/reconfigure/

I've never encountered it organically, but did during some benchmarks when I was dropping and adding a bunch of servers.


There are no transactions - updates to an individual document are atomic, but you are on your own if you need all-or-nothing updates against multiple documents.


Would it be an antipattern (design- and performance-wise) to have huge documents in RethinkDB? For example, using a single document for each user in a web app.


The database log is append-only (with GC afterwards), so a one-byte change to a large document rewrites the entire document.

From https://rethinkdb.com/docs/data-modeling/:

> Disadvantages of using embedded arrays: Deleting, adding or updating a post requires loading the entire posts array, modifying it, and writing the entire document back to disk. Because of the previous limitation, it’s best to keep the size of the posts array to no more than a few hundred documents.
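The two modelings the docs contrast look like this as plain objects (field names are made up for illustration):

```javascript
// Embedded: one change to any post means rewriting this whole
// document, per the append-only log behavior described above.
const embeddedAuthor = {
  id: 'a1',
  name: 'alice',
  posts: [{ title: 'hello' }, { title: 'world' }],
};

// Linked: each post is its own small document joined on authorId,
// so an edit rewrites only that one document.
const author = { id: 'a1', name: 'alice' };
const posts = [
  { id: 'p1', authorId: 'a1', title: 'hello' },
  { id: 'p2', authorId: 'a1', title: 'world' },
];
const postsFor = (aid) => posts.filter((p) => p.authorId === aid);
```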


I've found this to be the case. While RethinkDB has many features to help you drill in to nested content, you'll end up missing document-level features fairly soon and want to split things up.


I'll let someone else field that because I'm not knowledgeable enough to say one way or the other. The inter-document JOINs were one of the features that first drew me in so all I can say is it doesn't match my usage.


If you need ACID.


I would limit that to atomic commits across multiple records... IIRC, you can do an atomic commit to a single record, and the consistency and durability are tunable.

Overall, it's pretty nifty; hoping to find a chance to actually use it. Early on, the lack of geo-indexing and automatic failover were features I'd find important for a lot of setups (it now has these), but the lack of atomic commits across multiple tables/collections not so much.


I love how the rethink team keeps improving an otherwise wonderful document database. ReQL is also super nice. :)


Does anyone have real prod-scale numbers? Even order-of-magnitude concurrency figures would be helpful. We're obviously very wary of trying out new and young data stores.



