I really encourage everyone to just give this DB a spin; with Homebrew or their Ubuntu PPA it's a dead-simple install. I guarantee the first-touch experience will be among the best DB experiences you've had. Their built-in web UI is incredibly polished and handy.
Also, a nice +1 for RethinkDB: when I was importing way too much data for the (accidentally) way too small EC2 instance, the OOM killer kept knocking off RethinkDB, but not a single doc was corrupted. I've done similar stupid things with other DBs and... well, data was lost.
We're still working through it for analytics, I hope to do a write up sometime soon.
I'm kind of flabbergasted that there's no Java driver listed, either as an officially supported driver or a community one. I personally do most of my work in Scala now, but it would make the most sense to have a well-supported driver written in Java so any other JVM-based language can take advantage of it. Why no love for the JVM?
I would encourage everyone working on drivers for RethinkDB to publish them regardless of how complete they are, and advertise them as partial implementations. I know of a lot of people interested in working on a Java driver, and it would be great to see collaboration, even early on!
That's a great question. While we only have the bandwidth to officially support three client drivers (we've internally developed the Python, Ruby, and JavaScript drivers), an enormous number of projects have sprung up to build drivers for everything from Go to C# / .NET, PHP, Haskell, Erlang, and more.
As far as a JVM-based client driver, kclay in this thread mentioned that he's been building a Scala driver that is compatible with 1.5, and will update it for 1.6. There's also a thread on Google Groups (https://groups.google.com/forum/?fromgroups#!topic/rethinkdb...) detailing some efforts on a Java driver to date.
If you're interested in contributing, shoot me an email at mike [at] rethinkdb.com. I'd love to see a Java driver happen, and we'll do everything we can to support such a project.
slava @ rethink here. This is entirely our fault -- good docs are really hard, and we're working on it now. We should have much more comprehensive documentation up in a few weeks.
This has actually been one of my biggest frustrations -- there is an enormous amount of really cool stuff in Rethink that's relatively poorly documented, so people can't find out about it. I'm really looking forward to fixing this soon.
Hi Slava, this is a bit tangential to the topic at hand -- but I was wondering how RethinkDB plans to support itself...
It looks like an open source project with an enormous amount of amazing talent behind it, and YC backing; but considering that it's FOSS now, how do you plan on generating revenue?
Rethink is a VC-funded company. Our goal is to be the de facto database choice for every new web app. This is tongue-in-cheek -- https://github.com/rethinkdb/rethinkdb/issues/1000 -- but it should give you a sense of the direction of the project.
RethinkDB already has paying customers, and we'll soon be offering publicly available commercial support (not unlike JBoss, MySQL, 10Gen, etc.) There are also other revenue streams I can't talk about yet.
For anyone who's interested in the pilot program before commercial support becomes publicly available, shoot me an email -- slava@rethinkdb.com.
I have a design question that I didn't see answered in the docs. What's your consistency/atomicity model for secondary indexes?
It seems like it would be hard to guarantee that the document and its index entries are always in the same shard, and since RethinkDB doesn't seem to support general multi-document transactions, I'm curious how you go about updating them simultaneously.
Great question. In Rethink we guarantee that the document and its index entries are always in the same shard, so any change to the document that affects secondary indexes is consistent and atomic.
There are tradeoffs to this approach -- to read a secondary index the db has to go to every master/shard for the table, but there are lots of tricks we use to minimize the impact of this tradeoff.
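To make the idea concrete, here's a toy Python sketch of the routing scheme described above (this is an illustration of the principle, not RethinkDB's actual implementation): both the document and its secondary-index entries are routed by the primary key, so one shard can apply both changes in a single local operation, while index reads have to fan out to all shards.

```python
import hashlib

N_SHARDS = 4
# Each shard holds its documents plus the secondary-index entries
# for exactly those documents.
shards = [{"docs": {}, "index": {}} for _ in range(N_SHARDS)]

def shard_for(primary_key):
    # Route by primary key only -- never by the indexed value --
    # so doc and index entries always land on the same shard.
    h = hashlib.md5(primary_key.encode()).hexdigest()
    return shards[int(h, 16) % N_SHARDS]

def upsert(doc):
    shard = shard_for(doc["id"])
    old = shard["docs"].get(doc["id"])
    # Document write and index maintenance touch one shard,
    # so they can be applied atomically under a local lock.
    if old is not None:
        shard["index"].setdefault(old["city"], set()).discard(doc["id"])
    shard["docs"][doc["id"]] = doc
    shard["index"].setdefault(doc["city"], set()).add(doc["id"])

def read_index(city):
    # The tradeoff: an index read must consult every shard.
    out = set()
    for s in shards:
        out |= s["index"].get(city, set())
    return out

upsert({"id": "a1", "city": "SF"})
upsert({"id": "a1", "city": "NYC"})  # update: index stays consistent
```

The key point is that the index entry never lives on a different shard than its document, so there is no cross-shard transaction to coordinate on writes.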
Wow, the array operations are so good. I'm currently dealing with large sets of arrays in Mongo, and the whole $pull, $push, whatever business is incredibly frustrating. The array operations in RethinkDB look great by comparison.
For example, the "insertAt" operation is really helpful when you need ordered arrays and want to swap or change some elements. I'll definitely give it a try.
Thanks -- don't hesitate to let us know if there's something that's missing that you need. You can use the standard channels (http://rethinkdb.com/community/) or feel free to email me directly (slava@rethinkdb.com).
Since there is now basic authentication, I can finally finish up my Heroku add-on for RethinkDB this weekend. If anyone would like to help me test it, send me an email at cam@camrudnick.com.
I'm about to start a new project and had the usual Node.js + Mongo skeleton set up and ready to go. After reading the comments here, I will definitely be giving Rethink a try. It just feels like Rethink is headed in the right direction, whereas Mongo is slowing down (maybe focusing on enterprise/profitability stuff instead of features and fixes?). Just my 2 cents...
I have been working on a Java/Scala driver for Rethink, and I am now convinced that everyone who ever wants to define a wire protocol should use protocol buffers.
Although I haven't used Rethink in large-scale production yet, I have to say I like it about 10x better than Mongo. The query language alone beats mongo, but the server maintenance/clustering side of things is so much more well thought out.
Rethink has this air of being rock-solid about it too. The team worked really hard on making the core great at storing data, then built features around that. Mongo, after using it for many years both in production and personal projects, seems like it's all slapped together. The early-on technical decisions they made to get it out the door are still biting them in the ass to this day.
That's not to say there aren't some things Mongo is probably better at (especially considering how early-stage Rethink is) but overall I prefer Rethink hands down.
Hey, we'd actually really appreciate it if you could take a few minutes and share your thoughts on what Mongo is better at. Beating Mongo in everything isn't our explicit goal (we're forging our own path), but if there are things we can fix, we'd love to hear about them.
Now that you ask, it's hard to say. Mongo seems to have been built for "fire and forget" quick writes, but due to the global write lock, it fails at this considerably once you start sending it any real data. The answer, which they spend every waking moment promoting, is to shard... but sharding/rebalancing/etc. is far from automatic, and the one time I really had to do it in a pinch, I ended up doing manual sharding on top of Mongo because it couldn't figure out how to correctly distribute the data. This was about two years ago though, so things may have changed with the clustering. On top of this, the only real way to get any performance is to have the entire dataset fit into memory. Once again, this may be an older restriction.
Reading over your documentation and history, it seems you've succeeded in every way Mongo fails: MVCC for high-write workloads, homogeneous clustering (no config servers or router services), actual auto-sharding that doesn't make you want to gouge your eyes out, joins, a useful management interface baked into the DB... those are the main ones I can think of.
I'm currently working with a lot of crypto data, so a binary storage format would be really nice in Rethink, I guess Mongo wins there (but I'm still using Rethink and just Base64-encoding everything).
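For reference, the workaround I mean is just a thin wrapper like the following Python sketch (the `$binary` tag is my own convention, not anything RethinkDB defines):

```python
import base64
import os

def encode_blob(raw: bytes) -> dict:
    # Wrap binary data as a base64 string so it survives JSON
    # serialization; the tag field marks wrapped blobs.
    return {"$binary": base64.b64encode(raw).decode("ascii")}

def decode_blob(doc: dict) -> bytes:
    return base64.b64decode(doc["$binary"])

blob = os.urandom(32)
doc = encode_blob(blob)  # safe to embed in any JSON document
```

The downside is the ~33% size overhead of base64, which is why a native binary type would be nicer.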
There are one or two things I really do like about Mongo. For instance, the find-and-modify command on top of the built-in data structures makes it dead simple to build something like a queuing system. It can't actually handle the writes a real queuing system needs, but for lower-write stuff it's useful. For instance, if you wanted a set of servers to run shared cron jobs, they could all "check in" and atomically see whether another server is already running a given job (and if not, grab the job in the same operation that checks it, ensuring that no two servers run the same job at the same time). I'm not sure if Rethink supports anything like this at the moment (please correct me if I'm wrong).
Mongo also has GridFS (admittedly, I've never used it), but if I was going to do some sort of clustered filesystem, I'd use Riak, definitely not Mongo. Also, I believe Mongo now has full-text search, but once again, I've given up on primary databases having real/useful full-text search and would much rather use ElasticSearch to supplement the primary DB.
Once Rethink has a query optimizer (I know you're in the process of figuring this out and to what extent it will work), I don't think Mongo will have anything on Rethink. I'm certainly never going to choose Mongo over Rethink for another project. I may choose Postgres/MySQL for strictly relational data, but Mongo is pretty much dead to me at this point.
I'd rather suffer the consequences and growing pains of a newer DB that wins at just about everything it does than use something slightly more stable that has burned me a number of times.
Plus, I like your team better. I know none of the Mongo devs, but I've talked to many members of the Rethink team on many occasions and you've all been really helpful, supportive, and quick to fix any issues. I've said this before, but you guys aren't running around screaming about how great your DB is and how it will solve ALL OF YOUR PROBLEMS, get you a promotion, improve your sex life, etc etc. I feel like a lot of the reasons I've been burned by Mongo was because 10gen just wasn't straight with me or the teams I worked with when using Mongo. They'd much rather sell a support contract than actually see you succeed, it felt like.
Sorry for the book, TL;DR:
- Atomic "find and modify" is very useful in Mongo. Rethink may already support it via the QL.
- Binary storage type would be nice.
- Query optimizer (I know you're in the process of figuring this out).
- Packaged Linux/Mac Rethink binaries so people can download the latest version without a recompile (I do like this about Mongo)
I think that's about it. There may be many technical things I'm missing out of sheer ignorance of the internal workings of Rethink and Mongo. Maybe there are things Mongo is great at but Rethink isn't, but it would be news to me.
This is a great writeup and helps tremendously, thank you! Now, wrt specifics.
> the find-and-modify command
This has been baked into ReQL from day one and doesn't require a special command. Here's an example:
# 'jobs' table contains documents of the form
# { type: 'type', jobs: ['job1', 'job2'] }
r.table('jobs').filter({type: 'printer'}).update(function(row) {
  return r.branch(
    row('jobs').contains('job1'),              // if job1 is in the array
    {jobs: row('jobs').difference(['job1'])},  // atomically remove job1
    null                                       // else don't modify the document
  );
})
This isn't limited to arrays -- you can do all sorts of atomic find and modify this way within a single document. We clearly need to do a better job documenting this. I'll make sure that happens.
> Packaged linux/mac Rethink binaries
That's been available for a while -- http://rethinkdb.com/docs/install/. You can download a binary pkg for OS X if you don't want to use brew, and we support apt-get on Ubuntu/Debian via a private PPA. We'll be adding binary packages for more distros soon.
> binary storage format
Mongo wins on this one, but it's definitely on the horizon -- https://github.com/rethinkdb/rethinkdb/issues/137. The storage engine bits for this have been done for a long time, we just have to figure out a sane ReQL API (which is harder than it seems).
> Query optimizer
This is a somewhat longer-term project (i.e. we probably won't get to it by mid-fall), but we're definitely thinking about it. So far though, there hasn't been a real use case yet where it's a problem.
ReQL suggestion for binary: take binary input in the form of a no-args generator function (called repeatedly, it returns a series of binary blobs and then null as EOF), or, where the language permits, allow a language-specific abstraction of a read-once stream (such as an IO in Ruby) as a higher-level wrapper for that. Return binary data in the same form.
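In Python terms, the producer side of that proposal could look something like this (a sketch only; the driver API that would consume it is hypothetical, and Python's generator exhaustion plays the role of the null EOF sentinel):

```python
def chunked_reader(path, chunk_size=64 * 1024):
    # No-args-per-call generator: each iteration yields the next
    # binary blob; exhaustion (StopIteration) signals EOF in place
    # of an explicit null.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk

def consume(stream):
    # A consumer (e.g. a driver) just iterates until exhaustion.
    return b"".join(stream)
```

The same shape works in reverse for returning binary data: the driver hands back an iterator of blobs rather than one giant buffer.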
Just playing around a bit, even with soft durability, insertion speeds seem to be pretty slow (<100 inserts/s from Ruby using bulk inserts on an i7 MacBook Pro with an SSD).
Also: all the Ruby driver docs seem wrong. Did 1.6 change a lot of the APIs?
< 100 inserts/s with soft durability is far too low. Shoot an email to the Google Group (http://groups.google.com/group/rethinkdb) and we'll see if we can work this out!
Seems the speed really declines with the size of the inserted documents. My production data is a hash with 14-15 keys and string/int values of maybe 10 characters each.
Ah, I see. This is a known problem with the performance of the protobuf serialization library we're using in the Ruby driver (see https://github.com/rethinkdb/rethinkdb/issues/897). It bottlenecks the CPU and should be fixed in the next release.
In the meantime, you could try running a multithreaded/multiprocess script -- that would significantly increase throughput.
If you're working with small documents, it's possible you're experiencing a performance bottleneck other than protobuf serialization. Could you provide some more info on the queries you're running and an example document on our Google Group so the rest of the team can weigh in? (http://groups.google.com/group/rethinkdb)
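For example, a partitioned-batches pattern like this can spread the client-side serialization cost across workers (a Python sketch; `bulk_insert` here is a stand-in for something like `r.table('t').insert(batch).run(conn)`, and in a real script each worker would hold its own connection):

```python
from concurrent.futures import ThreadPoolExecutor

def bulk_insert(batch):
    # Stand-in for the real driver call, e.g.
    #   r.table('t').insert(batch).run(conn)
    # Returns the number of documents "inserted".
    return len(batch)

def parallel_insert(docs, batch_size=200, workers=4):
    # Split the documents into batches and insert them concurrently.
    batches = [docs[i:i + batch_size]
               for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(bulk_insert, batches))

docs = [{"id": i} for i in range(1000)]
```

If the bottleneck really is CPU-bound serialization in the Ruby driver, separate processes (rather than threads) are the safer bet; the partitioning logic is the same.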
Random sampling is such a simple but nice touch. I've built up large tables/collections before, only realized afterward that I needed random samples, and then had to go through and generate random seeds for entries myself.
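For anyone curious what doing this by hand looks like, one standard approach is reservoir sampling (plain Python, Algorithm R) -- a single pass that gives a uniform k-sample without knowing the table size up front. Having a built-in `sample` makes all of this unnecessary:

```python
import random

def reservoir_sample(rows, k):
    # Algorithm R: keep the first k rows, then replace a random
    # slot with probability k / (i + 1) for each later row i.
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)
        else:
            j = random.randint(0, i)
            if j < k:
                sample[j] = row
    return sample
```

The rows can come from any iterator (e.g. a table cursor), so this works without loading the whole table into memory.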
I've also been working on a Scala driver in stealth mode; I've been able to successfully send a subset of queries. My primary focus, though, has been on making sure queries are typed correctly in Scala, as well as concurrently engineering a good interface from typed objects to JSON data, so that you don't end up with Map[String, Any] or Array[Any] all over the place. My current approach adopts some of the ideas from Bryan O'Sullivan's excellent Aeson library in Haskell.
I have a strong conviction that they will! Look at all the hard work and the focus on the right stuff (durability, ReQL, the UI, auto-sharding + clustering, etc.).
There are a number of customers using RethinkDB in production now. We'll be announcing them closer to the end of the summer and will write up case studies (not in the business-y sense but in a tech-y sense) that will give you an idea of how Rethink helps people solve problems in the wild.
We'll be sending out an email to driver developers to help them with the Protobuf changes in 1.6 shortly. Shoot me an email (mike [at] rethinkdb.com) and I'll make sure you get on the list.
I have nothing against the database, but I wish they'd come up with a better name. At work I'd feel silly trying to pitch a piece of technology that sounds like a motivational slogan. That's not a criticism of their work, but sometimes names matter.
Doesn't matter in the long run; Mongo had similar issues in the early days. "Mongo? What?" -- just pretend the name was chosen by Steve Jobs or whoever, and it will add the "weight" to your voice next time you're pitching.
We generally try to adopt the philosophy of making the existing features really good before building new ones, and there is a lot of low-hanging fruit for us to pick for now.
I'd like to get geospatial support in, but it will take a bit of time before we can get to that. We'll likely get it in within a year, so it's a fairly long-term feature for now.
We ported to OS X in about three weeks, and we have an outstanding pull request for FreeBSD support (https://github.com/rethinkdb/rethinkdb/pull/688), so as long as it's a Unix-like operating system, there's a reasonable amount of work involved.
Porting to Windows is another level of complexity altogether.
slava @ rethink here. I would love to get a windows port (if only because it will give us the option to use Visual Studio's debugger). Unfortunately it's a rather non-trivial undertaking. I'd like to get this done some time within a year, but I can't make any promises.