I don't understand why databases like RethinkDB and MongoDB don't just provide a SQL interface. I get it -- they market themselves as NoSQL, but they are in fact providing almost exactly the same set of features that, say, MySQL provided back in the day: a simple, "fast" subset of SQL. Instead they use the lack of a SQL interface as a marketing gimmick... when in reality we have to learn a new query language... :(
> I don't understand why databases like RethinkDB and MongoDB don't just provide a SQL interface.
Actually, it's a really good question. I'm one of the ReQL designers at Rethink, and I was the one pushing for no SQL compatibility. Here is some of my reasoning (we could talk about this for days, though):
* Even SQL designers would tell you SQL isn't a very good
programming language. It even looks like Cobol! Imagine if
every language after Cobol decided to be backwards compatible
-- what sort of world would we live in?
* SQL is really bad for querying hierarchical data with lots of
empty columns. The quality of experience you get from learning
a new language designed for JSON, far outweighs the downsides
of learning it. We're assuming people will be using ReQL
fifteen years from now.
* Designing a language that embeds into your programming language
wasn't possible before -- but it is now. That means no more SQL
injection, no more string manipulation, no more heavy ORMs (see
the sketch after this list). Well worth it in my opinion.
* The chaining paradigm (largely pioneered and proven by jQuery)
is magical for getting people to intuitively understand how to
write complex queries. No more StackOverflow questions of "How
do I do X in SQL?" because there is now an intuitive
consistency to the language.
* SQL compatibility is very, very difficult. The standard is hard
to implement, and has lots of grey area around the edges. You
can go for full bug-by-bug compatibility, which would take
decades. Or you can go for basic compatibility, which confuses
people. They try to port their application, it works for a
while, and then breaks in some grey area in production, which
results in a terrible user experience.
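To make the embedding and chaining points concrete, here's a minimal sketch using the Python driver (the `posts` table and its fields are invented for illustration):

    import rethinkdb as r

    conn = r.connect('localhost', 28015)

    # the query is ordinary method chaining -- no strings to build,
    # escape, or re-parse, so SQL-style injection can't happen
    top_posts = r.table('posts') \
        .filter(r.row['author'] == 'alice') \
        .order_by(r.desc('num_comments')) \
        .limit(10) \
        .run(conn)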
As an example, I'd ponder why SQL has both the `where` keyword and the `having` keyword. This alone wouldn't be enough of a reason to design a new language, of course, but this sort of thing permeates SQL. It's 2014. We can do better.
I can tell at a glance what `where` and `having` mean - the former is a filter over a result set, and the latter is a filter over a grouping.
You're probably right that SQL should have defined this sort of composition earlier, so that you could sequence groupings and filters more easily, but don't fret: that future is here with common table expressions. Recursion is even supported!
I'm reminded of this abomination of a query I wrote in 2010 that produced a numbers table:
    DECLARE @N int = 1000000;
    WITH RecursiveRowGenerator (Row#, Iteration) AS (
        SELECT 1, 1
        UNION ALL
        SELECT Row# + Iteration, Iteration * 2
        FROM RecursiveRowGenerator
        WHERE Iteration * 2 < CEILING(SQRT(@N))
        UNION ALL
        SELECT Row# + (Iteration * 2), Iteration * 2
        FROM RecursiveRowGenerator
        WHERE Iteration * 2 < CEILING(SQRT(@N))
    ),
    SqrtNRows AS (
        SELECT *
        FROM RecursiveRowGenerator
        UNION ALL
        SELECT 0, 0
    )
    SELECT TOP(@N) 1 + A.Row# * POWER(2, CEILING(LOG(SQRT(@N)) / LOG(2))) + B.Row# AS Row#
    FROM SqrtNRows A, SqrtNRows B
    ORDER BY A.Row#, B.Row#;
Do you consider Rethink relational (in the Codd sense)? From a quick look at the API I would guess that it qualifies.
What I couldn't see at a quick glance was a way to constrain the data in a table or enforce foreign key constraints. I guess those are the trade offs against something like Postgres for schema flexibility and easy distribution. Is that a correct way to view it?
> Do you consider Rethink relational (in the Codd sense)?
Under the hood, yes. How much of that is exposed to the user is debatable.
> What I couldn't see at a quick glance was a way to constrain the data in a table or enforce foreign key constraints. I guess those are the trade offs against something like Postgres for schema flexibility and easy distribution. Is that a correct way to view it?
That's about right. Some of this is by design (lack of schema enforcement), and we might add features to do that later. There are no architectural constraints preventing us from doing that.
Foreign key constraints will probably never be added to Rethink (or at least not for a long, long time) because doing those efficiently in distributed systems is anywhere from hard to impossible.
Good arguments, though only the last one seems like a true objection.
I think, if I were implementing a database from the ground up, I'd indeed make my own query language -- SQL does suck -- but I would probably strive to accept SQL as an option, with the proviso that it's "just SQL syntax with ThisDB's semantics."
This would let people point their old applications, written under "the database is just an SQL-speaking dumb store for simple CRUD operations" assumptions (e.g. any ActiveRecord app), at the new DB and get somewhat-sensible results out, while still letting them write new applications against the new API.
The alternative is forcing any user with existing services to hack up some sort of application-level back-and-forth replication between their old DB and your new DB during the long-and-possibly-indefinite migration period, rather than just migrating in one fell swoop.
Background: I've done a fair bit of hiring for analysts and have worked in most facets of analytics/DW including vendor evaluation/selection and implementation.
I think that the lack of a SQL interface is a big negative. It's really hard to find analysts (not programmers) with good technical skills, or to train them to proficiency in new technologies. I would have a very hard time selling a platform that existing analysts wouldn't be able to use right away and that few people on the market could work with.
Also, designing a new language makes integration with other applications that much harder.
That's a great point. Our hypothesis is that great querying capabilities for analysts and great querying capabilities for programmers should look completely different, and be two totally different technologies. We'll eventually ship a solution for analysts, but I don't think a great database product should shoot for a lowest common denominator.
Although initially it seemed otherwise, your point about a distinct interface avoiding the confusion between complete SQL compatibility and the subset being offered makes a lot of sense. Also, personally I would prefer an imperative language for tasks where efficiency is the primary concern, compared to declarative SQL.
A properly designed language shouldn't have two different keywords for something that does ostensibly the same thing in different contexts. That's a mark of bad language design.
I don't think they are? Your ReQL to me looks like it's applying WHERE num_comments > 7 and then aggregating that.
I mean, regardless, your example SQL should be applying the > 7 against an aggregate (e.g. sum(num_comments)), not a field; that SQL does not work as written.
Whoops, that's a great catch -- it's a bug in the docs. Will fix momentarily. Thanks!! (that is indeed embarrassing)
In ReQL any command you call after `group` runs on each group. So once you've called `group`, you can run anything you could run on a table on each group and that just works.
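For example, here's a sketch in the Python driver (the `posts` table and its fields are hypothetical) of the same `filter` command doing duty both before and after `group`:

    import rethinkdb as r

    # pre-group filter -- plays the role of SQL's WHERE
    r.table('posts').filter(r.row['num_comments'] > 7).group('category').count()

    # post-group filter -- plays the role of HAVING: sum per group,
    # then ungroup and keep only groups whose total exceeds 7
    r.table('posts').group('category').sum('num_comments') \
        .ungroup().filter(r.row['reduction'] > 7)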
So, to be clear here, you have created two fundamental things (both called "filter") -- A pre-group and a post-group filter. Users must still understand the difference and when to utilize them.
SQL just happens to call those WHERE and HAVING instead of "filter" both times.
You're right, but it's not just `filter`. Any command that can run on a group can run on a full table and vice versa. For example:
    # get a sample of 3 elements from a table
    r.table('foo').sample(3)

    # get a sample of 3 elements from each group
    r.table('foo').group('category').sample(3)
You could say that we created two versions of `filter`, and `sample`, and every other command. But another way to say it is that we use polymorphism, which is widely considered an advantage in modern programming languages.
Yeah, it's definitely more expressive and terse; I just don't think your argument against WHERE/HAVING is incredibly strong. Someone could also just use subqueries and only WHERE, and achieve the exact same behavior that you offer (albeit with a ton more typing).
HAVING exists because it was created prior to subqueries/dynamic tables; otherwise we'd likely have just used:
    SELECT *
    FROM (SELECT category, SUM(num_comments) AS comments
          FROM posts
          GROUP BY category) AS temp
    WHERE comments > 7;
I don't think this is a "good language" / "bad language" dichotomy. SQL just sits at a different level of abstraction.
From the examples you have posted, it appears that ReQL is a lower-level abstraction than SQL. In SQL you specify what you want logically, and the DB turns this into a query plan. ReQL appears to be more like a query plan itself, where you explicitly specify the data flow from stage to stage of query evaluation.
As a more specific example of this, it appears that in many cases in ReQL the user specifies which index should be used in the query itself. SQL is more abstract than this; the idea is for the query planner to figure out which index(es) should be used.
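A hedged illustration (the `users` table and `last_name` index are invented): in ReQL the index is named right in the query, whereas SQL leaves that decision to the planner:

    import rethinkdb as r

    # ReQL: the index to use is part of the query itself
    r.table('users').get_all('smith', index='last_name')

    # the rough SQL equivalent leaves index selection to the planner:
    #   SELECT * FROM users WHERE last_name = 'smith';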
In theory, yes. In practice, the best language is one that maps the developer's thought process into working code in the most natural way. I don't know the community that ReQL is targeting, but for the current crop of data scientists, SQL is a more natural language in my opinion. (There is a reason why Hive is so popular among big companies, where you'd expect to find big data sets.)
I'm totally the opposite, and blame you entirely for provoking what follows! SQL at this point is past expiry.
The simple reason is we now have plenty of decent scripting languages to run as interactive prompts, all of which are far better at interfacing with the rest of the system than SQL is. Exposing a sane, direct API in those languages gets you much more than SQL, along with removing multiple layers of confusion: string generation, escaping, reparsing, then wondering whether it's actually doing what you want, etc.
I've been using LevelDB (via plyvel) a lot lately for data storage, and every time I end up having to use SQL (even indirectly via an ORM) it's painful by comparison, because you can just feel the control being taken away from you. Somewhere along the way you end up having to fire up the DB prompt for no good reason, such as adding strange DB-specific indexing flags to columns or setting up authorization and DB creation, making it yet another thing to go wrong during deployment.
Protocol buffers stored in LevelDB prove so easy to use compared with something like SQLite or PgSQL, just at an API level. The resulting code is simpler, cleaner, and much easier to reason about. If I need to move the data to another format, the code, again, is amazingly small.
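For a flavor of what that looks like, here's a rough sketch (the `Record` message, its `records_pb2` module, and the paths are made up; assumes plyvel and the protobuf runtime are installed):

    import plyvel
    from records_pb2 import Record  # hypothetical compiled .proto message

    db = plyvel.DB('/tmp/records', create_if_missing=True)

    # store: serialize the message and write it under a byte key
    rec = Record(id=1, name='example')
    db.put(b'record:1', rec.SerializeToString())

    # load: read the raw bytes back and parse them into a message
    loaded = Record()
    loaded.ParseFromString(db.get(b'record:1'))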
As someone who cares deeply about my app's data structure, seeing the acceptance of a world beyond SQL is one of the best developments of my career, and experience means I simply don't trust any SQL-based abstraction to give you the control to get it right.
>I'm totally the opposite, and blame you entirely for provoking what follows! SQL at this point is past expiry.
Actually it's the very opposite.
SQL is a higher level abstraction for all this ad-hoc junk, based on actual mathematical principles (relational algebra), and it's also declarative, instead of imperative.
Furthermore, all these NoSQL query systems now emerging are nothing new. They were tried in the '70s and early '80s, and people found out that they sucked. Before SQL, what we had was, well, NoSQL.
Wanting a flexible schema-less db or one that's denormalised for certain needs (like Google's or Facebook's) makes sense.
Replacing SQL and RDBMSs for the common tasks they are used for (company logistics and accounting, etc.) with NoSQL is a regression to the primitive past. I guess people not knowing (computing) history are doomed to repeat it.
One reason is that SQL only lets you work with flat relational tables. MongoDB and RethinkDB let you store complex JSON-like objects as values, and extending SQL to query and manipulate those kinds of structures is hard. The closest attempt I've seen is Google's BigQuery.
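For instance, here's a sketch of querying a nested document in ReQL's Python driver (the `users` table and its shape are hypothetical) -- something plain SQL has no natural syntax for:

    import rethinkdb as r

    # reach directly into a nested object; nothing is flattened into columns
    r.table('users').filter(r.row['address']['city'] == 'Mountain View')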
Another reason is that SQL does its best to decouple the query (what you're asking for) from the execution plan (how it actually gets evaluated). But for operations like joins, the precise way the query is evaluated can have a huge effect on performance. So it makes sense to design your query language to explicitly expose those knobs to the programmer.
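As a sketch of what "exposing the knobs" means (table and field names invented), ReQL's `eq_join` tells you exactly which join strategy runs -- an indexed equijoin -- rather than leaving it to a planner:

    import rethinkdb as r

    # join posts to users on users' primary key; the strategy is explicit
    r.table('posts').eq_join('author_id', r.table('users')).zip()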
I have no experience with RethinkDB, so maybe they did it right, but I have quite a bit with Mongo.
MongoDB's querying is nowhere near as powerful as SQL's. And understanding its limitations and pain points is essential when structuring your data. Otherwise you'll never really scale unless you've got data that's embarrassingly easy to query (e.g. filtering on one or two indexed fields).
Being able to write JS map-reduce queries is fine when you need to do stuff ad hoc, but none of it can be used in production at scale. Then there's the aggregation framework, which helps quite a bit but still doesn't offer the level of performance I'd expect from Postgres, MSSQL, or Oracle. Things get really yucky when you have to start unpacking arrays -- especially since there's a memory limit on $sort and $group. There are further issues with skipping being slow on large collections, because skip must step through every preceding document.
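For example, a typical "unpack an array" aggregation in PyMongo looks something like this (collection and field names are invented); $group buffers its state in memory, which is where that limit bites:

    from pymongo import MongoClient

    db = MongoClient().blog

    pipeline = [
        {'$unwind': '$tags'},                                # one doc per array element
        {'$group': {'_id': '$tags', 'count': {'$sum': 1}}},  # count posts per tag
        {'$sort': {'count': -1}},                            # order by count, descending
    ]
    # allowDiskUse lets $sort/$group spill to disk to dodge the memory cap
    results = db.posts.aggregate(pipeline, allowDiskUse=True)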
And to cope with these limitations you need to invest a ton of time into coding around them. The freedom of the database being schemaless? Gone. You must carefully structure data and decide up-front how it needs to be queried - or cope with poor performance.
And those complex queries, when written in ugly JSON, aren't the most human-readable things in the world. Certainly not any better than SQL.
I'd rather just invest the time in learning SQL than partake in the mental gymnastics necessary to squeeze high-performance non-trivial queries out of MongoDB.
I understand MongoDB is a big hit with people who use it for low-volume internal tools or people building MVPs. I can totally understand how it's an awesome tool for those. But rusty old SQL starts looking better when your app scales and the business demands change.
You're right, you haven't tried ReQL =]. As a developer of a RethinkDB driver and a fairly active user of the database, and also a previous fan of Mongo, I can tell you that ReQL kicks the shit out of MongoDB's querying.
SQL really is well-geared towards relational, flat database systems. Mongo's querying is an obtuse command-based language that always felt like it was adding hacks on hacks to get the data you wanted.
ReQL is almost like having your data in memory and running programmatic expressions on it that seamlessly melt into your native language. The lack of a query optimizer almost makes it better, because you have to think about your query plans and indexes instead of just firing a query off at the server and crossing your fingers. You're not running commands, you're processing data.
I don't have a lot of experience scaling Rethink, but I do know from reading the docs and architecture that it scales out better than most SQL servers will, and certainly scales up better than Mongo. I've tried to scale both MySQL and Mongo. Both are difficult and painful.
You can't conclude that because of Mongo's failures, SQL beats Rethink. Rethink is light years ahead of Mongo.
I'm not really sure how you could wedge SQL into a system that doesn't really have the concept of columns without it being, at the least, a distinctly different dialect.
And, really, if you've worked with some of the "higher level" interfaces to SQL like SQLAlchemy or Laravel's QueryBuilder, ReQL doesn't seem that alien; they all tend to work by method chaining as well. (But they usually don't support everything SQL does, unlike ReQL's interfaces.)
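For comparison, a quick sketch of that chaining style in SQLAlchemy (1.4+ Core style; the table and names are invented):

    from sqlalchemy import MetaData, Table, Column, Integer, String, select

    metadata = MetaData()
    users = Table('users', metadata,
                  Column('id', Integer, primary_key=True),
                  Column('name', String))

    # the same chaining feel as ReQL, but it compiles down to a SQL string
    query = select(users).where(users.c.name == 'alice').limit(10)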
Also: unlike my (admittedly limited) exposure to other NoSQL systems, RethinkDB supports relations and joins -- it's actually a lot easier for an SQL fiend to move to Rethink than any of its competitors.
I much prefer ReQL over SQL. It is much nicer, flows naturally in my opinion, and is composable. I'm using this composability in my rethink-miner tool to help me work with data and figure it out: I can write generic queries that I can layer on top of every other sub-query, which is very, very nice.
For now it probably keeps away a number of users they don't want to support anyway, but it makes sense to at least have an automatic translation layer in the long run.
RethinkDB is my favorite piece of my current tech stack. I tell everybody I know about it, and I'm soon to release some blog posts and speak at a Meetup showing off what it is good at and how to get started - I really just can't say enough good stuff about these guys.
A high level overview from someone who has used it in production since 1.10 (about six months):
Pros:
* ReQL is a beautiful DSL that makes querying and using my data simpler than anything I've ever had before.
* An amazing UI lets you quickly do the things you do the most (verify a query, cherry pick some results, add/drop tables, indexes, shards, etc).
* Unparalleled support. During almost any reasonable hour, a RethinkDB employee will quickly field any question you have on #rethinkdb (freenode), their User Group, or email.
* Quick releases. These guys ship major releases every 60 to 90 days. Each release offers huge features, performance improvements, and bug fixes. They do a great job of listening to what people want and implementing it fast.
Cons:
* Database is still technically "beta" - great for side projects and prototyping, but be mindful if you intend to use it in production.
* Works great on tables of a few GB, but performance really degrades at the next order of magnitude. It looks like there were major changes in 1.12 addressing this.
* Only three officially supported clients (JS, Python, Ruby). That's a good start for their target market, but it is limiting for some.
RethinkDB is an archetype for startups -- building what people want, shipping fast, always talking to customers, and clearly passionate about what they do. Even if you don't use their product, we can all learn from these guys. I hope they do well.
Ditto. Also appreciated is the upfront honesty about what their tech can and cannot do currently, and what they plan to do to address limitations / bugs.
I really love their ability to design sensible APIs, paired with their honesty; it gives me the confidence to start using this software right now, with the belief that it will eventually get to a very stable point.
Would not use it for any multi-million$ business just yet, but it's perfect for personal projects that require quick iteration and might become big in the future.
In the spirit of making what people want, we're planning 1.13 right now. If you have feedback on what features you want, shoot me an e-mail -- slava@rethinkdb.com, and I'll work hard to try and get them in.
Thanks for the great feedback! If you're speaking at a meetup, let me know (mike@rethinkdb.com) -- we'd be happy to send over T-shirts (and stickers, etc.), as well as spread the word about the event.
That's something I've been searching for, too. I actually opened this page wishing that someone had linked to a benchmark for large data sets. Large datasets are the reason I'm trying to move away from MongoDB completely; I've been very happy with Postgres, but I'm willing to give others a try.
We'll be publishing benchmarks and case studies for large data sets in the coming months. You might want to wait for those, because there will be tons of info, and lots of bugs that I'm sure are still lurking will have been ironed out. However, if you want to be one of the early adopters and give it a try, we'd absolutely love your feedback and will work hard to incorporate it into the product.
You can use our regular channels for feedback (rethinkdb.com/community), or shoot me an e-mail any time -- slava@rethinkdb.com.
I like their API and the webadmin interface, but I tried 1.10 with the node.js client (using the cpp protobuf backend) and in my test it was slow compared to MySQL (InnoDB). A simple select query returning 100 items took around 70 ms, while the same query took 3 ms with MySQL. I'm not sure whether I did something wrong or it's just the early stage. Also, I cannot get versions above 1.10 from the repo on CentOS.
Latency for range queries is a surprisingly tricky issue; there were a number of problems around it that we solved. Check out this issue for example -- https://github.com/rethinkdb/rethinkdb/issues/1766. Most of the problems have been resolved in 1.12, so we'd encourage you to give it a try again!
As far as the outdated CentOS build -- I haven't heard of this problem before, but I'll check with @atnnn who's in charge of packaging -- see https://github.com/rethinkdb/rethinkdb/issues/2176. We'll get this resolved ASAP.
EDIT: ok, @atnnn confirmed that 1.12 works on CentOS. If it doesn't for you, could you post more details on the GitHub issue? It would help immensely!
It's not trivial to do performance comparisons correctly; you'd have to provide a lot more detail about the tests to get proper feedback on them. Ideally the test would be something that anybody can download and run locally themselves.
As far as the package repository for your distro not having the latest version of something, this is very very common. Your best bet is to build your own package from source, or find an alternate repository to use where somebody is building packages of the newer versions.
I love the elegance of ReQL and how well it works in dynamic languages, but I still can't figure out a good way to port the lambda syntax used in Python into the Java driver.
What about lambda expressions in Java 8? I haven't programmed in Java in years, so unless there are limitations I'm not aware of, they seem like a great fit.
Without those, doing an elegant driver would be pretty hard, which unfortunately probably means we'll never get good ReQL support into legacy Java.
So the way all underlying communication is done is through protobufs. In Python, I know they hijack the expression, and they can then construct the protobuf from the passed-in Python lambda. In Java land we can't really do that, since lambdas are desugared at compile time to just be functions. In Java I can't even overload operators like you can in Scala or C++ to make it a little nicer. The result is you're pretty much left rolling the protobuf by hand.
The upcoming 1.13 release will introduce a pure JSON interface for queries (which official drivers will switch to). You'll be able to construct JSON directly, and send it to the server -- no protobufs. Would that make things easier?
It will make it on par with the interface for other document-oriented databases. I think it will be a lot more user-friendly than the protobuf, and it might be the best that Java can do.
To clarify, in 1.13 JSON will be a communication protocol for queries. It's not intended to be used by the end user, but by the client drivers. You can use it as an end user, but it isn't nearly as nice as native language drivers.
The change is meant to simplify driver development and packaging, and to improve performance. Protobuf is worse than JSON in almost all of these categories.
You may be able to utilize a lot of their code for converting the lambda bytecode... or not. Anyway, it's interesting to see how they're dealing with the problem you've run into.
> you no longer have to manually specify cache sizes for tables to prevent running over memory and into swap
I'm really glad you're addressing this. Does this auto-sizing apply to RethinkDB's memory usage in general? The last time I tried using it on a small VPS, I set my table's cache size really low, but I still ran out of memory whenever I ran queries on a large table.
Definitely looking forward to secondary index export/migration and to upgrades without the export-import cycle.
The autosizing only applies to the cache itself. There is a separate issue for a memory limit on query data structures (https://github.com/rethinkdb/rethinkdb/issues/1375). We'll try to address it soon, but the cache autosizing should take care of most problems people have seen before.
I'm also looking forward to stable formats and seamless migration, but it's a really hard problem. I think for the time being we'll introduce a stable branch, and users will have to pick between new features or stability. It's not ideal, but will give most people most of what they want.
RethinkDB is pretty awesome. It's a blast working with it, and the Python driver is quite well written, imo.
As the author of PyRethinkORM for Python, it was ridiculously easy to write compared to an ORM for, say, SQL, which was a major selling point for using RethinkDB behind my last several projects.
I'm fairly excited about the ARM port as I've been wanting to use Rethink on a few projects on my BeagleBone White/Black. The new map/reduce changes are pretty cool too.
Hey, sorry you (and everyone else) ran into this! This will go away in the next version (1.13), because we're removing the requirement for protobufs and supporting a pure JSON transport interface. It turns out to be faster, easier to set up, and paradoxically more space-efficient.
I know it's not an official driver, but I maintain the community Java driver here[1]. If it's missing something you need, feel free to open an issue and I can add it for you. I have been using RethinkDB with Java for a few projects, and it works well enough for me.
1. If you're dealing with JSON, a native JSON database provides a 100x better programming experience than a relational system.
2. Scalability out of the box. You can just add a node, reshard, and keeping going. It's much harder in relational systems.
If you have lots of null columns, hierarchical relationships, or eventual need for scale out, give RethinkDB a try.
If you have traditional rigid data or ACID requirements, stick with a relational DB.