Hacker News new | past | comments | ask | show | jobs | submit login

If you wrote your query with group and count, with no index, then there would be problems with the performance. RethinkDB generally does not do query optimization, except in specific ways (mostly about distributing where the query is run), unless that's changed very recently. You can write that query so that it executes with appropriate memory usage with a map and reduce operation.



Do you think map/reduce would result in performance near what I get from Postgres?


You would get the same behavior that Postgres's would be in terms of how data is traversed and aggregated -- that is, not by building a bunch of groups and counting them after the fact. I do think RethinkDB ought to be able to apply aggregations to group queries on the fly though... I'm not really up to date on that.

Postgres will still have better numbers, I'm sure. It has a schema for starters.


While I don't know RethinkDB is structured internally, I don't see any technical reason why a non-mapreduce group-by needs to load the entire table into memory instead of streaming it, or why a mapreduce group-by needs to be slow. M/R only becomes a slow algorithm once you involve shards and network traffic; any classical relational aggregation plan uses a kind of M/R anyway.

Postgres has a schema, of course, but it still needs to look up the column map (the ItemIdData) in each page as it scans it, the main difference being that this map is of fixed length, whereas in a schemaless page it would be variable-length.

Anyway, I'm hoping RethinkDB will get better at this. I sure like a lot about it.


Generally speaking RethinkDB doesn't query optimize, except in deterministic ways, unless they've changed policy on this. I don't see any reason why a plain group/aggregate query couldn't be evaluated appropriately -- I know it is when the grouping is done using an index, maybe it is now when the grouping is done otherwise (I don't know, but it would be sensible, I'm out of date).


(Also it would be nice if it did/does, because performance will still be terrible if you have too many groups, otherwise.)


I haven't used RethinkDB, but I would assume the answer is no. Choosing to use map/reduce is basically a declaration that performance is your lowest priority.


An optimally optimized query by Postgres would be effectively mapping and reducing.


And the point is that the converse is definitely not true.

Postgres knows about the structure of your data and where it's located, and can do something reasonably optimal. A generic map/reduce algorithm will have to calculate the same thing as Postgres eventually, but it'll have tons of overhead.

(Also, what is with the fad for running map/reduce in the core of the database? Why would this be a good idea? It was a terrible, performance-killing idea on both Mongo and Riak. Is RethinkDB just participating in this fad to be buzzword-compliant?)


While there have been some truly misguided mapreduce implementations, mapreduce is just a computation model that isn't inherently slower than others: A relational aggregation of the type you get with SQL like:

  select foo, count(*) from bar group by foo
...is essentially a mapreduce, although most databases probably don't use a reduce buffer larger than 2. (But they would benefit from it if they could use hardware vectorization, I believe.)

Mapreduce works great if you are already sequentially churning through a large subset of a table, which is typically the case with aggregations such as "count" and "sum". Where mapreduce is foolish is when you try using mapreduce for real-time queries that only seek to extract a tiny subset of the dataset.


There is no relevant knowledge that Postgres has that RethinkDB lacks that lets it evaluate the query more efficiently (besides maybe a row layout with fixed offsets so that it doesn't haven't parse documents, but that's not relevant to the reported problem). A generic map reduce certainly would have more overhead, obviously, but not running-out-of-memory overhead reported above, just the overhead of merging big documents.

The reason you run queries in "the core" of a database is because copying all the data outside the database and doing computations there would be far worse.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: