Well I opened with "I don't understand". I'm not saying that they could, I'm saying I don't understand why they can't.
I'd love for you to elaborate. As far as I can tell they are not using PostgreSQL as a relational database, but rather, as a column store. So why not use Cassandra or even Redis (if the amount of data can totally fit into RAM easily, maybe it can't)?
In fact I think Reddit moved to Cassandra... anyway, I am not an expert, I'm asking.
I can't speak much to Reddit. I know they moved some things to Cassandra, but as a user I'll say I haven't been impressed with their latency and uptime since.
As a developer, I can speak a bit about Disqus. I don't speak on their behalf, but I did work there for two years (ironically I was also the first to use Redis there[1]) so I can at least explain why I think using Redis for the whole site is a terrible idea. I'll also note upfront that where I currently work we use a ton of Cassandra, Redis, and very little MySQL on the side, so hopefully I won't be pegged as some kind of "RDBMS only" guy.
Anyway, some reasons:
1) Relational data. Disqus really is relational. You have users, users have posts, posts belong to a thread, threads belong to sites, sites belong to an account (imagine one account for the different CNN websites). And that's a very, very small subset of the number of tables and foreign keys involved. People don't realize how many features Disqus really has above and beyond "post a text blob to a thread."
Being able to write a query that uses joins is huge. The alternative in Redis/Cassandra is having to denormalize your data up front for every single way you may want to "query" it later. Oh, by the way, I promise you will forget a few ways and regret having to backfill/fix all the broken denormalizations.
Even if you don't forget to denormalize anything upfront, the biggest joke k/v and document stores ever played on the developer community was convincing them that they save development time by being "schema free". When Disqus wants to add a new feature it's often only a new JOIN/INDEX away. If you realize a year into your Redis deployment that you want to be able to tell a user how many comments they made per month in the year... what do you do? In Postgres you hit the datetime column index and call it a day.
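To make the "hit the datetime column index" point concrete, here is a rough sketch. It uses Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere; the posts table, its columns, and the index name are all made up for illustration and are not the real Disqus schema. In Postgres you would probably group by date_trunc('month', created_at) instead of strftime.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; table, columns, and index are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)"
)
conn.execute("CREATE INDEX posts_user_created ON posts (user_id, created_at)")
conn.executemany(
    "INSERT INTO posts (user_id, created_at) VALUES (?, ?)",
    [
        (1, "2012-01-05 10:00:00"),
        (1, "2012-01-20 12:30:00"),
        (1, "2012-03-02 09:15:00"),
        (2, "2012-01-07 08:00:00"),
    ],
)


def comments_per_month(conn, user_id, year):
    """Return [('YYYY-MM', count), ...] for one user within one calendar year."""
    start, end = f"{year}-01-01", f"{year + 1}-01-01"
    rows = conn.execute(
        """
        SELECT strftime('%Y-%m', created_at) AS month, COUNT(*)
        FROM posts
        WHERE user_id = ? AND created_at >= ? AND created_at < ?
        GROUP BY month
        ORDER BY month
        """,
        (user_id, start, end),
    )
    return rows.fetchall()


monthly = comments_per_month(conn, 1, 2012)
print(monthly)
```

The feature is one query against an index that already exists. Nothing has to be backfilled, which is the contrast with a k/v store where the per-month counters would have had to be maintained from day one.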
2) Memory (and cost). The Disqus network is actually pretty huge. Storing the entire dataset in RAM (Redis) would cost a lot more than using an efficient DB like Postgres that is a pro at moving data between disk and RAM. Cassandra would work better than Redis here, but the other problems I list still hold.
Also, as soon as you have to break from one Redis instance to two (either to scale CPU or to live on another box to increase available RAM) you lose a lot of server-side functionality like being able to union sets, or use the embedded Lua to fake 'queries' because now you have keys that live on separate systems. Before anyone says you should shard by "site", see my link below. I did just that, but you have to understand that Disqus is more than just "comments for my website". Say you shard by website, now how do I run a union across sets that involve a single user who has posted to 100 different websites? I can't. Back to backfilling and denormalizing tons of data that also needs to be resident in RAM and kept in sync.
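A toy illustration of the sharding problem, with plain Python dicts standing in for Redis instances. The key layout, the modulo sharding rule, and the function names are all hypothetical; this is only meant to show why the union stops being a single server-side call once keys are split by site.

```python
# Each dict stands in for one Redis instance; everything here is hypothetical.
NUM_SHARDS = 2
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(site_id):
    """Pick the instance that owns a site's keys (naive modulo sharding)."""
    return shards[site_id % NUM_SHARDS]


def posts_key(site_id, user_id):
    return f"site:{site_id}:user:{user_id}:posts"


def add_post(site_id, user_id, post_id):
    """Add a post id to the per-site, per-user set on the owning shard."""
    shard_for(site_id).setdefault(posts_key(site_id, user_id), set()).add(post_id)


def posts_by_user(user_id, site_ids):
    """Client-side union: one lookup per site, since no single instance
    holds all of the user's sets."""
    result = set()
    for site_id in site_ids:
        result |= shard_for(site_id).get(posts_key(site_id, user_id), set())
    return result


add_post(site_id=1, user_id=7, post_id=101)
add_post(site_id=2, user_id=7, post_id=102)
add_post(site_id=3, user_id=7, post_id=103)

all_posts = posts_by_user(7, [1, 2, 3])
print(all_posts)
```

Note that the caller also has to know which sites the user has posted to before it can fan out, and that list is itself another denormalized structure to keep in sync.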
---
I could add more, but I just realized that the linked talk probably spoke about the big sharded Postgres K/V type store that they built. Here's the thing: all of the core stuff (like from point 1) isn't stored in there. It's used when it can be, for scalability, but the majority of the app is still in a behemoth Postgres instance that is replicated many times over. As to why not use Redis for that part? I'd say because it's memory only and because Disqus has Postgres expertise. Also, it's not truly "key value" because it still has indexes for, say, datetimes or post_id or site_id, which make a lot of non-relational queries easy without having to denormalize. Now, why not use Cassandra for that? Well, I would. :)
So when I said switch to Redis I meant to replace the 'big sharded Postgres K/V type store that they use' not the part where they actually use relational features of the database.
I'm always curious about the idea of scaling UP versus OUT -- like you mention, going from one Redis instance to two muddies the waters. So why do it at all? (Maybe a year from now Redis Cluster will finally come out and solve this)
1TB of RAM is going to dip below five figures soon. I guess if you can't fit into that, it is moot.
"If you realize a year into your Redis deployment that you want to be able to tell a user how many comments they made per month in the year... what do you do? In Postgres you hit the datetime column index and call it a day."
I would do bloom filters, but point very well taken. No silver bullets. Thanks again for the reply.
Disqus wouldn't fit into 1TB of memory as a denormalized data set.
It doesn't even fit (indexed, at least) into 1TB of memory as a normalized data set.
At the scale we're at, you're required to make tradeoffs and come up with less than standard solutions to problems. Our solution, as many others have done before us, is to shard datasets (both Redis and SQL).
Congrats on the growth, that's a lot of comments! What I really want to know is how you guys solved being Googlebot friendly -- I have a foggy memory of a blog post or HN comment from around the time the new version came out that said something interesting about that would be shared in the future.