I really encourage everyone to just give this DB a spin; with Homebrew or their Ubuntu PPA it's a dead-simple install. I guarantee the first-touch experience will be among the best DB experiences you've had. Their built-in web UI is incredibly polished and handy.
Also, a nice +1 for RethinkDB: when I was importing way too much data for the (accidentally) way too small EC2 instance, the OOM killer kept knocking off RethinkDB, but not a single doc was corrupted. I've done similar stupid things with other DBs and... well, data was lost.
We're still working through it for analytics, I hope to do a write up sometime soon.
I'm kind of flabbergasted that there's no Java driver listed, either as an officially supported driver or a community one. I personally do most of my work in Scala now, but it would make the most sense to have a well-supported driver written in Java so any other JVM-based language can take advantage of it. Why no love for the JVM?
I would encourage everyone working on drivers for RethinkDB to publish them regardless of how complete they are, and advertise them as partial implementations. I know of a lot of people interested in working on a Java driver, and it would be great to see collaboration, even early on!
That's a great question. While we only have the bandwidth to officially support three client drivers (we've internally developed the Python, Ruby, and JavaScript drivers), an enormous number of projects have sprung up to build drivers for everything from Go to C# / .NET, PHP, Haskell, Erlang, and more.
As far as a JVM-based client driver, kclay in this thread mentioned that he's been building a Scala driver that is compatible with 1.5, and will update it for 1.6. There's also a thread on Google Groups (https://groups.google.com/forum/?fromgroups#!topic/rethinkdb...) detailing some efforts on a Java driver to date.
If you're interested in contributing, shoot me an email at mike [at] rethinkdb.com. I'd love to see a Java driver happen, and we'll do everything we can to support such a project.
slava @ rethink here. This is entirely our fault -- good docs are really hard, and we're working on it now. We should have much more comprehensive documentation up in a few weeks.
This has actually been one of my biggest frustrations -- there is an enormous amount of really cool stuff in Rethink that's relatively poorly documented, so people can't find out about it. I'm really looking forward to fixing this soon.
Hi Slava, this is a bit tangential to the topic at hand -- but I was wondering how RethinkDB plans to support itself...
It looks like an open source project with an enormous amount of amazing talent behind it, and YC backing; but considering that it's FOSS now, how do you plan on generating revenue?
Rethink is a VC-funded company. Our goal is to be the de facto database choice for every new web app. This is tongue-in-cheek -- https://github.com/rethinkdb/rethinkdb/issues/1000 -- but it should give you a sense of the direction of the project.
RethinkDB already has paying customers, and we'll soon be offering publicly available commercial support (not unlike JBoss, MySQL, 10Gen, etc.) There are also other revenue streams I can't talk about yet.
For anyone who's interested in the pilot program before commercial support becomes publicly available, shoot me an email -- slava@rethinkdb.com.
I have a design question that I didn't see answered in the docs. What's your consistency/atomicity model for secondary indexes?
It seems like it would be hard to guarantee that the document and its index entries are always in the same shard, and since RethinkDB doesn't seem to support general multi-document transactions, I'm curious how you go about updating them simultaneously.
Great question. In Rethink we guarantee that the document and its index entries are always in the same shard, so any change to the document that affects secondary indexes is consistent and atomic.
There are tradeoffs to this approach -- to read a secondary index the db has to go to every master/shard for the table, but there are lots of tricks we use to minimize the impact of this tradeoff.
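To make the idea concrete, here's a toy Python sketch of the routing scheme described above (this is an illustration of the principle, not RethinkDB's actual implementation): both the document and its secondary-index entries are routed by the primary key, so one shard can apply both changes in a single local operation, while index reads have to fan out to all shards.

```python
import hashlib

N_SHARDS = 4
# Each shard holds its documents plus the secondary-index entries
# for exactly those documents.
shards = [{"docs": {}, "index": {}} for _ in range(N_SHARDS)]

def shard_for(primary_key):
    # Route by primary key only -- never by the indexed value --
    # so doc and index entries always land on the same shard.
    h = hashlib.md5(primary_key.encode()).hexdigest()
    return shards[int(h, 16) % N_SHARDS]

def upsert(doc):
    shard = shard_for(doc["id"])
    old = shard["docs"].get(doc["id"])
    # Document write and index maintenance touch one shard,
    # so they can be applied atomically under a local lock.
    if old is not None:
        shard["index"].setdefault(old["city"], set()).discard(doc["id"])
    shard["docs"][doc["id"]] = doc
    shard["index"].setdefault(doc["city"], set()).add(doc["id"])

def read_index(city):
    # The tradeoff: an index read must consult every shard.
    out = set()
    for s in shards:
        out |= s["index"].get(city, set())
    return out

upsert({"id": "a1", "city": "SF"})
upsert({"id": "a1", "city": "NYC"})  # update: index stays consistent
```

The key point is that the index entry never lives on a different shard than its document, so there is no cross-shard transaction to coordinate on writes.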
Wow, the array operations are so good. I'm currently dealing with large sets of arrays in Mongo, and the whole $pull, $push, whatever business is incredibly frustrating. The array operations in RethinkDB look great by comparison.
For example, the "insertAt" operation is really helpful when you need ordered arrays and want to swap or change some elements. I'll definitely give it a try.
Thanks -- don't hesitate to let us know if there's something that's missing that you need. You can use the standard channels (http://rethinkdb.com/community/) or feel free to email me directly (slava@rethinkdb.com).
Since there is now basic authentication, I can finally finish up my Heroku add-on for RethinkDB this weekend. If anyone would like to help me test it, send me an email at cam@camrudnick.com.
I'm about to start a new project and had the usual Node.js + Mongo skeleton set up and ready to go. After reading the comments here, I will definitely be giving Rethink a try. It just feels like Rethink is headed in the right direction, whereas Mongo is slowing down (maybe focusing on enterprise/profitability stuff instead of features and fixes?). Just my 2 cents...
I have been working on a Java/Scala driver for Rethink, and I am now convinced that everyone who ever wants to define a wire protocol should use protocol buffers.
Although I haven't used Rethink in large-scale production yet, I have to say I like it about 10x better than Mongo. The query language alone beats mongo, but the server maintenance/clustering side of things is so much more well thought out.
Rethink has this air of being rock-solid about it too. The team worked really hard on making the core great at storing data, then built features around that. Mongo, after using it for many years both in production and personal projects, seems like it's all slapped together. The early-on technical decisions they made to get it out the door are still biting them in the ass to this day.
That's not to say there aren't some things Mongo is probably better at (especially considering how early-stage Rethink is) but overall I prefer Rethink hands down.
Hey, we'd actually really appreciate it if you could take a few minutes and share your thoughts on what Mongo is better at. Beating Mongo in everything isn't our explicit goal (we're forging our own path), but if there are things we can fix, we'd love to hear about them.
Now that you ask, it's hard to say. Mongo seems to have been built for "fire and forget" quick writes, but due to the global write lock, it fails at this considerably once you start sending it any real data. The answer, which they spend every waking moment promoting, is to shard... but sharding/rebalancing/etc. is far from automatic, and the one time I really had to do it in a pinch, I ended up doing manual sharding on top of Mongo because it couldn't figure out how to correctly distribute the data. This was about two years ago though, so things may have changed with the clustering. On top of this, the only real way to get any performance is to have the entire dataset fit into memory. Once again, this may be an older restriction.
Reading over your documentation and history, it seems you've succeeded in every way Mongo fails: MVCC for high-write workloads, homogeneous clustering (no config servers or router services), actual auto-sharding that doesn't make you want to gouge your eyes out, joins, a useful management interface baked into the DB... those are the main ones I can think of.
I'm currently working with a lot of crypto data, so a binary storage format would be really nice in Rethink, I guess Mongo wins there (but I'm still using Rethink and just Base64-encoding everything).
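For reference, the workaround I mean is just a thin wrapper like the following Python sketch (the `$binary` tag is my own convention, not anything RethinkDB defines):

```python
import base64
import os

def encode_blob(raw: bytes) -> dict:
    # Wrap binary data as a base64 string so it survives JSON
    # serialization; the tag field marks wrapped blobs.
    return {"$binary": base64.b64encode(raw).decode("ascii")}

def decode_blob(doc: dict) -> bytes:
    return base64.b64decode(doc["$binary"])

blob = os.urandom(32)
doc = encode_blob(blob)  # safe to embed in any JSON document
```

The downside is the ~33% size overhead of base64, which is why a native binary type would be nicer.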
There are one or two things I really do like about Mongo. For instance, the find-and-modify command on top of the built-in data structures makes it dead simple to build something like a queuing system. It can't actually handle the writes a real queuing system needs, but for lower-write stuff it's useful. For instance, if you wanted a set of servers to run shared cron jobs, they could all "check in" and atomically see whether another server is already running a given job (and if not, grab the job in the same operation that checks it, ensuring that no two servers run the same job at the same time). I'm not sure if Rethink supports anything like this at the moment (please correct me if I'm wrong).
Mongo also has GridFS (admittedly, I've never used it), but if I was going to do some sort of clustered filesystem, I'd use Riak, definitely not Mongo. Also, I believe Mongo now has full-text search, but once again, I've given up on primary databases having real/useful full-text search and would much rather use ElasticSearch to supplement the primary DB.
Once Rethink has a query optimizer (I know you're in the process of figuring this out and to what extent it will work), I don't think Mongo will have anything on Rethink. I'm certainly never going to choose Mongo over Rethink for another project. I may choose Postgres/MySQL for strictly relational data, but Mongo is pretty much dead to me at this point.
I'd rather suffer the consequences and growing pains of a newer DB that wins at just about everything it does than use something slightly more stable that has burned me a number of times.
Plus, I like your team better. I know none of the Mongo devs, but I've talked to many members of the Rethink team on many occasions and you've all been really helpful, supportive, and quick to fix any issues. I've said this before, but you guys aren't running around screaming about how great your DB is and how it will solve ALL OF YOUR PROBLEMS, get you a promotion, improve your sex life, etc etc. I feel like a lot of the reasons I've been burned by Mongo was because 10gen just wasn't straight with me or the teams I worked with when using Mongo. They'd much rather sell a support contract than actually see you succeed, it felt like.
Sorry for the book, TL;DR:
- Atomic "find and modify" is very useful in Mongo. Rethink may already support it via the QL.
- Binary storage type would be nice.
- Query optimizer (I know you're in the process of figuring this out).
- Packaged Linux/Mac Rethink binaries so people can download the latest version without a recompile (I do like this about Mongo)
I think that's about it. There may be many technical things I'm missing out of sheer ignorance of the internal workings of Rethink and Mongo. Maybe there are things Mongo is great at but Rethink isn't, but it would be news to me.
This is a great writeup and helps tremendously, thank you! Now, wrt specifics.
> the find-and-modify command
This has been baked into ReQL from day one and doesn't require a special command. Here's an example:
# 'jobs' table contains documents of the form
# { type: 'type', jobs: ['job1', 'job2'] }
r.table('jobs').filter({type: 'printer'}).update(function(row) {
  return r.branch(
    row('jobs').contains('job1'),              // if job1 is in the array
    {jobs: row('jobs').difference(['job1'])},  // atomically remove job1
    null                                       // else don't modify the document
  );
})
This isn't limited to arrays -- you can do all sorts of atomic find and modify this way within a single document. We clearly need to do a better job documenting this. I'll make sure that happens.
> Packaged linux/mac Rethink binaries
That's been available for a while -- http://rethinkdb.com/docs/install/. You can download a binary pkg for OS X if you don't want to use brew, and we support apt-get on Ubuntu/Debian via a private PPA. We'll be adding binary packages for more distros soon.
> binary storage format
Mongo wins on this one, but it's definitely on the horizon -- https://github.com/rethinkdb/rethinkdb/issues/137. The storage engine bits for this have been done for a long time, we just have to figure out a sane ReQL API (which is harder than it seems).
> Query optimizer
This is a somewhat longer-term project (i.e. we probably won't get to it by mid-fall), but we're definitely thinking about it. So far though, there hasn't been a real use case yet where it's a problem.
ReQL suggestion for binary: take binary input in the form of a no-args generator function (called repeatedly, it returns a series of binary blobs and then null as EOF), or, where the language permits, allow a language-specific abstraction of a read-once stream (such as an IO in Ruby) as a higher-level wrapper for that. Return binary data in the same form.
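In Python terms, the producer side of that proposal could look something like this (a sketch only; the driver API that would consume it is hypothetical, and Python's generator exhaustion plays the role of the null EOF sentinel):

```python
def chunked_reader(path, chunk_size=64 * 1024):
    # No-args-per-call generator: each iteration yields the next
    # binary blob; exhaustion (StopIteration) signals EOF in place
    # of an explicit null.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk

def consume(stream):
    # A consumer (e.g. a driver) just iterates until exhaustion.
    return b"".join(stream)
```

The same shape works in reverse for returning binary data: the driver hands back an iterator of blobs rather than one giant buffer.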
Just playing around a bit, even with soft durability, insertion speeds seem to be pretty slow (<100 inserts/s from Ruby using bulk inserts on an i7 MacBook Pro with an SSD).
Also: all the Ruby driver docs seem wrong. Did 1.6 change a lot of the APIs?
< 100 inserts/s with soft durability is far too low. Shoot an email to the Google Group (http://groups.google.com/group/rethinkdb) and we'll see if we can work this out!
Seems the speed really declines with the size of the inserted documents. My production data is a hash with 14-15 keys and string/int values of maybe 10 characters each.
Ah, I see. This is a known problem with the performance of the protobuf serialization library we're using in the Ruby driver (see https://github.com/rethinkdb/rethinkdb/issues/897). It bottlenecks the CPU and should be fixed in the next release.
In the meantime, you could try running a multithreaded/multiprocess script -- that would significantly increase throughput.
If you're working with small documents, it's possible you're experiencing a performance bottleneck other than protobuf serialization. Could you provide some more info on the queries you're running and an example document on our Google Group so the rest of the team can weigh in? (http://groups.google.com/group/rethinkdb)
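For example, a partitioned-batches pattern like this can spread the client-side serialization cost across workers (a Python sketch; `bulk_insert` here is a stand-in for something like `r.table('t').insert(batch).run(conn)`, and in a real script each worker would hold its own connection):

```python
from concurrent.futures import ThreadPoolExecutor

def bulk_insert(batch):
    # Stand-in for the real driver call, e.g.
    #   r.table('t').insert(batch).run(conn)
    # Returns the number of documents "inserted".
    return len(batch)

def parallel_insert(docs, batch_size=200, workers=4):
    # Split the documents into batches and insert them concurrently.
    batches = [docs[i:i + batch_size]
               for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(bulk_insert, batches))

docs = [{"id": i} for i in range(1000)]
```

If the bottleneck really is CPU-bound serialization in the Ruby driver, separate processes (rather than threads) are the safer bet; the partitioning logic is the same.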
Random sampling is such a simple but nice touch. I've built up large tables/collections before, only realized afterward that I needed random samples, and then had to go through and generate random seeds for entries myself.
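For anyone curious what doing this by hand looks like, one standard approach is reservoir sampling (plain Python, Algorithm R) -- a single pass that gives a uniform k-sample without knowing the table size up front. Having a built-in `sample` makes all of this unnecessary:

```python
import random

def reservoir_sample(rows, k):
    # Algorithm R: keep the first k rows, then replace a random
    # slot with probability k / (i + 1) for each later row i.
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = row
    return sample
```

The rows can come from any iterator (e.g. a table cursor), so this works without loading the whole table into memory.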
I've also been working on a Scala driver in stealth mode; I've been able to successfully send a subset of queries. My primary focus, though, has been on making sure queries are typed correctly in Scala, as well as concurrently engineering a good interface from typed objects to JSON data, so that you don't end up with Map[String, Any] or Array[Any] all over the place. My current approach adopts some of the ideas from Bryan O'Sullivan's excellent Aeson library in Haskell.
I have a strong conviction that they will! Look at all the hard work and the focus on the right stuff (durability, ReQL, the UI, auto-sharding + clustering, etc.).
There are a number of customers using RethinkDB in production now. We'll be announcing them closer to the end of the summer and will write up case studies (not in the business-y sense but in a tech-y sense) that will give you an idea of how Rethink helps people solve problems in the wild.
We'll be sending out an email to driver developers to help them with the Protobuf changes in 1.6 shortly. Shoot me an email (mike [at] rethinkdb.com) and I'll make sure you get on the list.
I have nothing against the database, but I wish they'd come up with a better name. At work I'd feel silly trying to pitch a piece of technology that sounds like a motivational slogan. That's not a criticism of their work, but sometimes names matter.
Doesn't matter in the long run; Mongo had similar issues in the early days. "Mongo? What?" -- just pretend the name was chosen by Steve Jobs or whoever, and it will add the "weight" to your voice next time you're pitching.
We generally try to adopt the philosophy of making the existing features really good before building new ones, and there is a lot of low-hanging fruit for us to pick for now.
I'd like to get geospatial support in, but it will take a bit of time before we can get to that. We'll likely get it in within a year, so it's a fairly long-term feature for now.
We ported to OS X in about three weeks, and we have an outstanding pull request for FreeBSD support (https://github.com/rethinkdb/rethinkdb/pull/688), so as long as it's a Unix-like operating system, there's a reasonable amount of work involved.
Porting to Windows is another level of complexity altogether.
slava @ rethink here. I would love to get a windows port (if only because it will give us the option to use Visual Studio's debugger). Unfortunately it's a rather non-trivial undertaking. I'd like to get this done some time within a year, but I can't make any promises.