Hacker News
RethinkDB 1.6 is out: regex matching, new array operations, random sampling (rethinkdb.com)
164 points by coffeemug on June 13, 2013 | 70 comments



I really encourage everyone to just give this DB a spin, with homebrew or their Ubuntu PPA it is a dead simple install process. I guarantee that the first touch experience will be among the best DB experiences you've had. Their built-in web UI is so incredibly polished and handy.

Also, a nice +1 for RethinkDB: when I was importing way too much data for the (accidentally) way too small EC2 instance, the OOM killer kept knocking off RethinkDB, but not a single doc was corrupted. I've done similar stupid things with other DBs and... well, data was lost.

We're still working through it for analytics, I hope to do a write up sometime soon.


I'm kind of flabbergasted that there is no Java driver listed as either a primarily supported driver or a community one. I personally do most of my work in Scala now but it would make the most sense to have a well supported driver written in Java so any other JVM based language can take advantage of it. Why is there no love for the JVM?


Give me one or two more weeks and mine will be functional enough to use. I haven't put it in the community wiki because it's just not ready yet.

https://github.com/dkhenry/rethinkjava


dkhenry, thank you for your work on this driver!

I would encourage everyone working on drivers for RethinkDB to publish them regardless of how complete they are, and advertise them as partial implementations. I know of a lot of people interested in working on a Java driver, and it would be great to see collaboration, even early on!


That's a great question. While we only have the bandwidth to officially support three client drivers (we've internally developed the Python, Ruby, and JavaScript drivers), an enormous number of projects have sprung up to build drivers for everything from Go to C# / .NET, PHP, Haskell, Erlang, and more.

As far as a JVM-based client driver, kclay in this thread mentioned that he's been building a Scala driver that is compatible with 1.5, and will update it for 1.6. There's also a thread on Google Groups (https://groups.google.com/forum/?fromgroups#!topic/rethinkdb...) detailing some efforts on a Java driver to date.

If you're interested in contributing, shoot me an email at mike [at] rethinkdb.com. I'd love to see a Java driver happen, and we'll do everything we can to support such a project.


This is the first time I realized that the secondary indexes in rethink are expression based. That's amazingly useful.
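
To illustrate what "expression based" means here, a minimal plain-JavaScript sketch follows. It is not RethinkDB's API: `buildIndex` and the sample documents are hypothetical, and the only point is that the index key is computed by an arbitrary function of the document rather than read from a single stored field.

```javascript
// Hypothetical in-memory model of an expression-based secondary index.
// The key for each document is whatever the supplied function returns.
function buildIndex(docs, keyFn) {
  const index = new Map();
  for (const doc of docs) {
    const key = keyFn(doc);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc.id);
  }
  return index;
}

const users = [
  { id: 1, first: 'Ada', last: 'Lovelace' },
  { id: 2, first: 'Alan', last: 'Turing' },
  { id: 3, first: 'Ada', last: 'Yonath' },
];

// Index on a computed expression (lower-cased full name), not a stored field.
const byFullName = buildIndex(users, (u) => `${u.first} ${u.last}`.toLowerCase());
// Index on another expression: the first letter of the last name.
const byInitial = buildIndex(users, (u) => u.last[0]);

console.log(byFullName.get('alan turing')); // ids of matching documents
console.log(byInitial.get('L'));
```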


slava @ rethink here. This is entirely our fault -- good docs are really hard, and we're working on it now. We should have much more comprehensive documentation up in a few weeks.

This has actually been one of my biggest frustrations -- there is an enormous amount of really cool stuff in Rethink that's relatively poorly documented, so people can't find out about it. I'm really looking forward to fixing this soon.


Hi Slava, this is a bit tangential to the topic at hand -- but I was wondering how RethinkDB plans to support itself...

It looks like an open source project with an enormous amount of amazing talent behind it, and YC backing; but considering that it's FOSS now; how do you plan on generating revenue?


Rethink is a VC-funded company. Our goal is to be the de facto database choice for every new web app. This is tongue in cheek -- https://github.com/rethinkdb/rethinkdb/issues/1000 -- but should give you a sense of the direction for the project.

RethinkDB already has paying customers, and we'll soon be offering publicly available commercial support (not unlike JBoss, MySQL, 10Gen, etc.) There are also other revenue streams I can't talk about yet.

For anyone who's interested in the pilot program before commercial support becomes publicly available, shoot me an email -- slava@rethinkdb.com.


I think you should work with the Meteor folks to bring RethinkDB support to Meteor. There's a huge opportunity there.


We've been looking into it. I'd really like to get that done, but it's a non-trivial undertaking, and Meteor's DB APIs are still in flux.

I'll see what we can do to accelerate this -- possibly by enticing the community to pick this up.


Please consider this encouragement to do so, because MongoDB is an instant interest killer for Meteor right now. I just don't trust MongoDB at all.

If you post what needs done in detail, someone will surely pick it up. I'd certainly take a look and see if it was within my capabilities.


Good to hear!


I have a design question that I didn't see answered in the docs. What's your consistency/atomicity model for secondary indexes?

It seems like it would be hard to guarantee that the document and its index entries are always in the same shard, and since RethinkDB doesn't seem to support general multi-document transactions, I'm curious how you go about updating them simultaneously.


Great question. In Rethink we guarantee that the document and its index entries are always in the same shard, so any change to the document that affects secondary indexes is consistent and atomic.

There are tradeoffs to this approach -- to read a secondary index the db has to go to every master/shard for the table, but there are lots of tricks we use to minimize the impact of this tradeoff.
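
A toy model of that design, in plain JavaScript, may make the tradeoff concrete. Everything here is an assumption made for illustration (the `Shard` and `Table` classes, placement by `id % shardCount`, numeric ids); it does not reflect RethinkDB internals. It shows a write touching a document and its index entry inside one shard, and an index read fanning out to every shard.

```javascript
// Hypothetical model: each shard stores its documents together with the
// secondary-index entries for those same documents.
class Shard {
  constructor() {
    this.docs = new Map();   // primary key -> document
    this.index = new Map();  // index key -> Set of primary keys
  }

  // A write touches the document and its index entry inside one shard,
  // so no cross-shard coordination is needed.
  put(doc, keyFn) {
    const old = this.docs.get(doc.id);
    if (old !== undefined) {
      const oldKey = keyFn(old);
      const ids = this.index.get(oldKey);
      ids.delete(doc.id);
      if (ids.size === 0) this.index.delete(oldKey);
    }
    this.docs.set(doc.id, doc);
    const key = keyFn(doc);
    if (!this.index.has(key)) this.index.set(key, new Set());
    this.index.get(key).add(doc.id);
  }

  lookup(key) {
    const ids = this.index.get(key) || new Set();
    return [...ids].map((id) => this.docs.get(id));
  }
}

class Table {
  constructor(shardCount, keyFn) {
    this.shards = Array.from({ length: shardCount }, () => new Shard());
    this.keyFn = keyFn;
  }

  // Documents are placed by primary key (numeric ids assumed here).
  put(doc) {
    this.shards[doc.id % this.shards.length].put(doc, this.keyFn);
  }

  // The tradeoff: an index read has to ask every shard and merge the results.
  getByIndex(key) {
    return this.shards.flatMap((s) => s.lookup(key));
  }
}

const table = new Table(3, (doc) => doc.city);
table.put({ id: 1, city: 'Oslo' });
table.put({ id: 2, city: 'Lima' });
table.put({ id: 5, city: 'Oslo' });
table.put({ id: 2, city: 'Oslo' }); // update moves id 2 from 'Lima' to 'Oslo'

const osloIds = table.getByIndex('Oslo').map((d) => d.id).sort((a, b) => a - b);
console.log(osloIds);
```

Because the index entry always lives next to its document, the update of id 2 above never leaves a stale 'Lima' entry on another shard.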


I would really love to see information about the time complexity of DB operations.


I'm actually working on this right now. We should get the updated api docs in about two weeks.


[apropos nothing, but since you're official, the linked page uses "unilaterally" in an odd way (to mean "uniformly"). imho. afaict. etc.]


Great catch, we'll update the blog post!


Wow, the array operations are so good. I'm currently dealing with large sets of arrays in Mongo, and the whole $pull, $push, whatever things are incredibly frustrating. Seeing the array operations in RethinkDB is so great.

For example the "insertAt" operation is really helpful when you need ordered arrays and you want to swap or change some elements. I'll give it a try definitely.
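
As a rough illustration of why positional inserts help with ordered arrays, here is a plain-JavaScript sketch. The helpers `insertAt`, `deleteAt`, and `moveElement` are hypothetical local functions written for this example, not calls into the RethinkDB driver; they only model the effect on an array.

```javascript
// Hypothetical helpers modelling positional array edits in plain JavaScript.
// Each returns a new array and leaves the input untouched.
function insertAt(arr, index, value) {
  return [...arr.slice(0, index), value, ...arr.slice(index)];
}

function deleteAt(arr, index) {
  return [...arr.slice(0, index), ...arr.slice(index + 1)];
}

// Move one element to a new position by deleting and re-inserting it.
function moveElement(arr, from, to) {
  return insertAt(deleteAt(arr, from), to, arr[from]);
}

const playlist = ['intro', 'verse', 'outro'];
const withChorus = insertAt(playlist, 2, 'chorus');
const reordered = moveElement(withChorus, 0, 3);

console.log(withChorus);
console.log(reordered);
```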


Thanks -- don't hesitate to let us know if there's something that's missing that you need. You can use the standard channels (http://rethinkdb.com/community/) or feel free to email me directly (slava@rethinkdb.com).


Since there is now basic authentication, I can finally finish up my Heroku add-on for RethinkDB this weekend. If anyone would like to help me test it, send me an email at cam@camrudnick.com.


I've dropped you an email. Been hanging out for a Heroku RethinkDB add-on for a while!


I'm about to start a new project and had the usual nodejs + mongo skeleton set up and ready to go. After reading the comments here, I will definitely be giving rethink a try. It just feels like rethink is headed in the right direction, whereas mongo is slowing down (maybe focusing on enterprise / profitability stuff instead of features and fixes?). just my 2 cents...


I have been working on a Java/Scala Driver for Rethink and I am now convinced that everyone who ever wants to define a wire protocol should use protocol buffers.


I have fixed bugs in an IMAP protocol parser, and therefore could not possibly agree with you more vehemently.


RethinkDB vs. MongoDB? They seem comparable. What is the big picture difference?


Although I haven't used Rethink in large-scale production yet, I have to say I like it about 10x better than Mongo. The query language alone beats mongo, but the server maintenance/clustering side of things is so much more well thought out.

Rethink has this air of being rock-solid about it too. The team worked really hard on making the core great at storing data, then built features around that. Mongo, after using it for many years both in production and personal projects, seems like it's all slapped together. The early-on technical decisions they made to get it out the door are still biting them in the ass to this day.

That's not to say there aren't some things Mongo is probably better at (especially considering how early-stage Rethink is) but overall I prefer Rethink hands down.


Hey, we'd actually really appreciate it if you could take a few minutes and share your thoughts on what Mongo is better at. Beating Mongo in everything isn't our explicit goal (we're forging our own path), but if there are things we can fix, we'd love to hear about them.


Now that you ask, it's hard to say. Mongo seems to have been built for "fire and forget" quick writes, but due to the global write lock, it fails at this considerably once you start sending it any real data. The answer, which they spend every waking breath promoting, is to shard...but sharding/rebalancing/etc is far from automatic and the one time I really had to do it in a pinch, I ended up doing manual sharding on top of Mongo because it couldn't figure out how to correctly distribute the data. This was about two years ago though, so things may have changed with the clustering. On top of this, the only real way to get any performance is to have the entire dataset fit into memory. Once again, this may be an older restriction.

Reading over your guys' documentation and history, it seems that you have succeeded in every way Mongo fails: MVCC for high-write, homogeneous clustering (no config servers or route services), actual auto-sharding that doesn't make you want to gouge your eyes out, joins, a useful management interface baked into the DB...those are the main ones I can think of.

I'm currently working with a lot of crypto data, so a binary storage format would be really nice in Rethink, I guess Mongo wins there (but I'm still using Rethink and just Base64-encoding everything).
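
The Base64 workaround mentioned above can be sketched with Node.js built-ins only. The document shape (`id`, `blob`) is made up for the example and no database is involved; the JSON round trip merely stands in for storing and reading back the document.

```javascript
// Workaround sketch: store bytes as a Base64 string inside a JSON document.
const crypto = require('crypto');

const key = crypto.randomBytes(32);              // 32 random bytes of sample data
const doc = { id: 'key-1', blob: key.toString('base64') };

// The document is now plain JSON and survives a serialization round trip.
const stored = JSON.parse(JSON.stringify(doc));
const restored = Buffer.from(stored.blob, 'base64');

console.log(restored.equals(key));               // whether the bytes round-trip
console.log(doc.blob.length);                    // encoded size in characters
```

Base64 emits four characters for every three input bytes, so the stored string is roughly a third larger than the raw data, which is part of why a native binary type would be nicer.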

There are one or two things I really do like about Mongo. For instance, the find-and-modify command on top of the built-in data structures makes it dead simple to build something like a queuing system. It can't actually handle the writes a real queuing system needs, but for lower-write stuff it's useful. For example, if you wanted to have a set of servers run shared cron jobs, they could all "check in" and atomically see if another server is already running that job (and if not, grab the job in the same operation that checks it, ensuring that no two servers are running the same job at the same time). I'm not sure if Rethink supports anything like this at the moment (please correct me if I'm wrong).

Mongo also has GridFS (admittedly, I've never used it), but if I was going to do some sort of clustered filesystem, I'd use Riak, definitely not Mongo. Also, I believe Mongo now has full-text search, but once again, I've given up on primary databases having real/useful full-text search and would much rather use ElasticSearch to supplement the primary DB.

Once Rethink has a query optimizer (I know you are in the process of figuring this out and to what extent it will work), I don't think Mongo will have anything on Rethink. I'm certainly not ever going to choose Mongo over Rethink for another project. I may choose Postgres/Mysql for strictly relational data, but Mongo is pretty much dead to me at this point.

I'd rather suffer the consequences and growing pains of a newer DB that wins at just about everything it does than use something slightly more stable that has burned me a number of times.

Plus, I like your team better. I know none of the Mongo devs, but I've talked to many members of the Rethink team on many occasions and you've all been really helpful, supportive, and quick to fix any issues. I've said this before, but you guys aren't running around screaming about how great your DB is and how it will solve ALL OF YOUR PROBLEMS, get you a promotion, improve your sex life, etc etc. I feel like a lot of the reason I've been burned by Mongo is that 10gen just wasn't straight with me or the teams I worked with when using Mongo. They'd much rather sell a support contract than actually see you succeed, it felt like.

Sorry for the book, TL;DR:

- Atomic "find and modify" is very useful in Mongo. Rethink may already support it via the QL.

- Binary storage type would be nice.

- Query optimizer (I know you're in the process of figuring this out).

- Packaged linux/mac Rethink binaries so people can download the latest version without a recompile (I do like this about Mongo)

I think that's about it. There may be many technical things I'm missing out of sheer ignorance of the internal workings of Rethink and Mongo. Maybe there are things Mongo is great at but Rethink isn't, but it would be news to me.


This is a great writeup and helps tremendously, thank you! Now, wrt specifics.

> the find-and-modify command

This has been baked into ReQL from day one and doesn't require a special command. Here's an example:

  // 'jobs' table contains documents of the form
  // { type: 'type', jobs: ['job1', 'job2'] }
  r.table('jobs').filter({type: 'printer'}).update(function(row) {
    return r.branch(
      row('jobs').contains('job1'),             // if there is job1 in the array
      {jobs: row('jobs').difference(['job1'])}, // atomically remove job1
      null                                      // else don't modify the object
    );
  })
This isn't limited to arrays -- you can do all sorts of atomic find and modify this way within a single document. We clearly need to do a better job documenting this. I'll make sure that happens.
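
The shared cron job case raised earlier in the thread can be modelled the same way. The sketch below is plain JavaScript, not ReQL: `claimJob` and `update` are hypothetical functions that stand in for an update callback and its application to one document, and the `runningOn` field is invented for the example. It only shows the check-and-claim decision that would need to run atomically against a single document.

```javascript
// Hypothetical stand-in for an update callback applied to one document.
// It returns the fields to change, or null to leave the document untouched.
function claimJob(row, serverName) {
  if (row.runningOn === null) {
    return { runningOn: serverName };  // free: claim it in the same step
  }
  return null;                         // already claimed: change nothing
}

// Apply the callback the way a single-document update would.
function update(row, fn) {
  const changes = fn(row);
  return changes === null ? row : { ...row, ...changes };
}

let job = { id: 'nightly-report', runningOn: null };
job = update(job, (row) => claimJob(row, 'server-a'));
job = update(job, (row) => claimJob(row, 'server-b')); // loses the race

console.log(job.runningOn);
```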

> Packaged linux/mac Rethink binaries

That's been available for a while -- http://rethinkdb.com/docs/install/, you can download a binary pkg for OS X if you don't want to use brew, and we support apt-get on ubuntu/debian via a private PPA. We'll be adding binary packages for more distros soon.

> binary storage format

Mongo wins on this one, but it's definitely on the horizon -- https://github.com/rethinkdb/rethinkdb/issues/137. The storage engine bits for this have been done for a long time, we just have to figure out a sane ReQL API (which is harder than it seems).

> Query optimizer

This is a somewhat longer-term project (i.e. we probably won't get to it by mid-fall), but we're definitely thinking about it. So far though, there hasn't been a real use case yet where it's a problem.


ReQL for binary suggestion: take binary input in the form of a no-args generator function (called repeatedly, returns a series of binary blobs and then null as EOF) or where the language permits, allow a language specific abstraction of a read-once stream such as an IO in Ruby as a higher level wrapper for that. Return binary data in the same form.
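
A small sketch of the calling convention proposed above, in Node.js. The names `makeChunkReader` and `drain` are hypothetical and nothing here is part of any RethinkDB driver; it only demonstrates a no-argument function that returns successive binary chunks and then null as the end-of-data marker.

```javascript
// Hypothetical reader: a no-argument function that returns successive
// binary chunks and then null to signal end of data.
function makeChunkReader(buffer, chunkSize) {
  let offset = 0;
  return function next() {
    if (offset >= buffer.length) return null;
    const chunk = buffer.subarray(offset, offset + chunkSize);
    offset += chunkSize;
    return chunk;
  };
}

// A consumer calls the function repeatedly until it sees null.
function drain(next) {
  const chunks = [];
  for (let chunk = next(); chunk !== null; chunk = next()) {
    chunks.push(chunk);
  }
  return Buffer.concat(chunks);
}

const payload = Buffer.from('0123456789abcdef', 'utf8');
const reader = makeChunkReader(payload, 5);
const roundTrip = drain(reader);

console.log(roundTrip.equals(payload));
console.log(reader()); // the reader keeps returning null once exhausted
```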


Thanks, this is cool. I posted this suggestion as a comment on the issue (https://github.com/rethinkdb/rethinkdb/issues/137).


Take a look at these two writeups:

* A biased/big picture one -- http://rethinkdb.com/blog/mongodb-biased-comparison/

* An unbiased/technical one -- http://rethinkdb.com/docs/comparisons/mongodb/


The biased comparison still says secondary indexes are in development. Might want to update that.


Thanks for noticing this, we'll update it shortly.


Perfect. Thank you.


Just playing around a bit, even with soft durability, insertion speeds seem to be pretty slow (<100 inserts/s from Ruby using bulk inserts on an i7 MacBook Pro with an SSD).

Also: All the ruby driver doc seems wrong. 1.6 changed a lot of the APIs?


< 100 inserts/s with soft durability is far too low. Shoot us an email to the google group (http://groups.google.com/group/rethinkdb) and we'll see if we can work this out!


It seems the speed really declines with the size of the inserted documents. My production data is a hash with 14-15 keys and string/int values of maybe 10 characters each.

When I loop and keep generating these documents:

    to_insert = {}
    10.times {|i| to_insert["key#{i}"] = rand(33333).to_s * rand(6)  }
I get around 140/s. Without the insert() call, I reach 50,000+, so generating the documents doesn't seem to be the overhead.

With simple documents (3 keys, int values) I reach 350-400.

p.s. script @ https://gist.github.com/rb2k/5777997


Ah, I see. This is a known problem with the performance of the protobuf serialization library we're using in the Ruby driver (see https://github.com/rethinkdb/rethinkdb/issues/897). It bottlenecks the CPU and should be fixed in the next release.

In the meantime, you could try running a multithreaded/multiprocess script -- that would significantly increase throughput.

Sorry you ran into this -- it'll be fixed ASAP.


the driver doesn't seem to be threadsafe and just locks once I start using it in threads :(


Sorry -- the drivers are meant to be used in a way where you create a new connection in each thread. If you do it this way, things will work.


Which parts of the ruby driver docs seem inconsistent / wrong? If there's a documentation issue I'd love to fix this ASAP.


It's fixed by now, but everything on http://www.rethinkdb.com/docs/guides/drivers/ was wrong. The old driver e.g. took several arguments, the 1.6 one takes a hash.

The old version basically looked exactly like the python code.


Having the exact same problem, even with small documents. Using the NodeJS driver.


If you're working with small documents, it's possible you're experiencing a performance bottleneck other than protobuf serialization. Could you provide some more info on the queries you're running and an example document on our Google Group so the rest of the team can weigh in? (http://groups.google.com/group/rethinkdb)


Random sampling is such a simple but nice touch. I've built up large tables/collections and only afterwards realized I needed random samples, so I had to go through and generate random seeds for the entries.
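
One common way to draw a uniform sample without pre-assigned random seeds is reservoir sampling. The sketch below is plain JavaScript and only illustrates that technique; the thread does not say how RethinkDB implements its sampling, and `sample` here is a local function, not the driver's.

```javascript
// Reservoir sampling (Algorithm R): pick k items uniformly from a sequence
// in one pass, without storing a random key on every entry.
function sample(items, k, random = Math.random) {
  const reservoir = [];
  let seen = 0;
  for (const item of items) {
    seen += 1;
    if (reservoir.length < k) {
      reservoir.push(item);
    } else {
      // Keep the new item with probability k / seen, replacing a random slot.
      const j = Math.floor(random() * seen);
      if (j < k) reservoir[j] = item;
    }
  }
  return reservoir;
}

const rows = Array.from({ length: 1000 }, (_, i) => ({ id: i }));
const picked = sample(rows, 5);
console.log(picked.map((row) => row.id));
```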


I've also been working on a Scala driver in stealth mode, and I've been able to successfully send a subset of queries. My primary focus, though, has been on making sure queries are typed correctly in Scala, while also engineering a good interface from typed objects to JSON data, so that you don't end up with Map[String, Any] or Array[Any] all over the place. My current approach adopts some of the ideas from Bryan O'Sullivan's excellent Aeson library in Haskell.


Cool -- shoot us an e-mail to the google group (https://groups.google.com/forum/?fromgroups#!topic/rethinkdb...) and we'll be happy to help you out any way we can!


I hope one day RethinkDB will be as wide-spread as MongoDB is now.


I have a strong conviction that they will! Look at all the hard work they've put in and the focus on the right stuff (durability, ReQL, UI, auto sharding + clustering, etc.).


hey

i could not find a good article about rethinkdb in production.

Could someone maybe share some thoughts about using RethinkDB for a database-heavy e-commerce web shop?

thx!


There are a number of customers using RethinkDB in production now. We'll be announcing them closer to the end of the summer and will write up case studies (not in the business-y sense but in a tech-y sense) that will give you an idea of how Rethink helps people solve problems in the wild.


Thank you for your response. I'm looking forward to seeing it.


Great to hear Slava.


Just when I finish my Scala library... guess I need to find some time to add these and release it to the public.


We'll be sending out an email to driver developers to help them with the Protobuf changes in 1.6 shortly. Shoot me an email (mike [at] rethinkdb.com) and I'll make sure you get on the list.

We also have a Google Group for driver developers (https://groups.google.com/forum/?fromgroups#!forum/rethinkdb...) that you can always ask questions on.

Thanks for working to build a Scala driver, I'm excited to see it!


Please add a link here:

https://github.com/rethinkdb/rethinkdb/wiki/Community-contri...

Every time I see an announcement from RethinkDB I get excited to try it, and then I see there is no driver for Java/Scala.


I guess I'll polish it up and move it to github, been pushing to private bitbucket.


Damn, these guys are too quick for me, I haven't even updated my driver for version 1.5 yet!

This is great news and excellent progress. Keep up the great work, Rethink team.


Andrew, thanks for your hard work on cl-rethinkdb (https://github.com/orthecreedence/cl-rethinkdb -- anyone using Common Lisp should check out his client driver!)

Expect an email shortly briefing you on the protobuf changes for 1.6!


I have nothing against the database, but I wish they'd come up with a better name. At work I'd feel silly trying to pitch a piece of technology that sounds like a motivational slogan. That's not a criticism of their work, but sometimes names matter.


Doesn't matter in the long run, Mongo had similar issues in the early days. "Mongo? What?" -- just pretend the name was chosen by Steve Jobs or whoever and it will add the "weight" to your voice next time you're pitching.


Is there anything in the roadmap for geospatial index support?

Our major usecase for mongo is returning entries in ordered distance from a location and / or within some radius of a location.

Support for such queries would be seriously useful.
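
For reference, the query described above can be written as a brute-force scan in plain JavaScript. This is only a sketch of what the query computes: `haversineKm` and `nearby` are hypothetical helpers, the coordinates are approximate, and a real geospatial index would exist precisely to avoid scanning every entry like this.

```javascript
// Great-circle distance in kilometres using the haversine formula.
const EARTH_RADIUS_KM = 6371; // mean Earth radius, an approximation

function haversineKm(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLon = toRad(b.lon - a.lon);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Brute-force version of the query: filter by radius, then sort by distance.
function nearby(entries, origin, radiusKm) {
  return entries
    .map((e) => ({ ...e, distanceKm: haversineKm(origin, e) }))
    .filter((e) => e.distanceKm <= radiusKm)
    .sort((x, y) => x.distanceKm - y.distanceKm);
}

const places = [
  { name: 'Los Angeles', lat: 34.0522, lon: -118.2437 },
  { name: 'San Jose', lat: 37.3382, lon: -121.8863 },
  { name: 'Oakland', lat: 37.8044, lon: -122.2712 },
];
const sanFrancisco = { lat: 37.7749, lon: -122.4194 };

console.log(nearby(places, sanFrancisco, 100).map((p) => p.name));
```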


We generally try to adopt the philosophy of making the existing features really good before building new ones, and there is a lot of low-hanging fruit for us to pick for now.

I'd like to get geospatial support in, but it will take a bit of time before we can get to that. We'll likely get it in within a year, so it's a fairly long-term feature for now.


I understand that it's not a priority right now, but how hard would it be to port Rethink to other Unix-like operating systems?


We ported to OS X in about three weeks, and we have an outstanding pull request for FreeBSD support (https://github.com/rethinkdb/rethinkdb/pull/688), so as long as it's a Unix-like operating system, there's a reasonable amount of work involved.

Porting to Windows is another level of complexity altogether.


slava @ rethink here. I would love to get a windows port (if only because it will give us the option to use Visual Studio's debugger). Unfortunately it's a rather non-trivial undertaking. I'd like to get this done some time within a year, but I can't make any promises.


Right now there are much more useful issues than MS Windows support ;)



