Hacker News new | past | comments | ask | show | jobs | submit login

Pretty sure Reddit has thousands of tables - last time I looked, it was really hard to see this but this is how it seemed like it was working. It has "thing"/"data" tables for every subreddit - created on the fly (a crime for which any DBA would have you put to death, normally). While I'm honored they use my library (SQLAlchemy) for relational database access, their actual usage of the relational DB couldn't be more...well... let's just say please don't imitate their style. If you want to build a Reddit, use Mongo/Cassandra or something like that. They'd very likely have done so themselves if NoSQL products were mature when they first developed their platform (I am vaguely recalling/guessing here on that one).

Edit: if any reddit devs want to correct me here, feel free, as I found the reddit source extremely difficult to follow back when I looked.




> It has "thing"/"data" tables for every subreddit - created on the fly (a crime for which any DBA would have you put to death, normally).

That's not correct. There in't "table" for a subreddit. There is a thing/data pair that stores metadata about a subreddit, and there is a thing/data pair for storing links. One of the properties of a link is the subreddit that it is in. Same with the comments. There is one thing/data pair for comments and the subreddit it is in is a property.

> They'd very likely have done so themselves if NoSQL products were mature when they first developed their platform (I am vaguely recalling/guessing here on that one).

Actually, still today I tell people that even if you want to do key/value, postgres is faster than any NoSql product currently available for doing key/value.


The main issue I run into with storing key/values in MySQL/Postgress is querying those values if they hold complex data (json, serialized data, etc) and the NoSQL products have features that let you dig into those meta fields when querying. (I haven't used a lot of NoSQL products yet, just dabbled with them)

Do you guys know of good techniques to do that with a traditional database like MySQL? I know MySQL has some XML parsing features if you store meta data as XML. I've experimented with it but never used it in production. I sometimes put JSON into a column for dynamic data. Usually I only do that for fields that I know won't need to be queried but that occasionally comes back to haunt me.


Don't know for MySQL, but postgres just added some support for JSON in the latest (beta) versions. Haven't used it and I don't know what it's capable of..


The support for JSON is only validation, otherwise it's just like a plain text column.

A better choice would be to use a hstore column, which is sort of like JSON but we different syntax. Postgres supports indexing hstore subfields, querying on them etc.


Postgres has XML, binary, JSON, geometry ...

And of course, hstore.


And array.


> That's not correct. There in't "table" for a subreddit. There is a thing/data pair that stores metadata about a subreddit, and there is a thing/data pair for storing links. One of the properties of a link is the subreddit that it is in. Same with the comments. There is one thing/data pair for comments and the subreddit it is in is a property.

Having a bit of trouble parsing this, but I think you mean that the "thing/data" tables are per type, where type is "subreddit", "comment", "links". Which would indicate a fixed schema.

Can you clarify if Reddit just has a fixed number of tables? I remember seeing some "table.create()" in there but I wasn't sure what I was looking at.


You are correct that reddit has a fixed number of tables (waves hands a bit). There is one table per type, so when a new type is created, a new table is created. That rarely happens.

The reason you see those create statements in the code is for bootstrapping for open source users -- the first time they run reddit, it creates the tables.


Every "thing" (a single table) has an attribute called "type". Type is a string, and is something like "comment" or "post" or "badge".

"table.create()" doesn't create a new table, I think it creates in index on "type", so you can treat it a bit like a separate table.

Yes, it is not a "real" database. It's a key-value store, which doesn't lose data.


> "table.create()" doesn't create a new table,

if its a sqlalchemy.schema.Table, create() emits DDL for "CREATE TABLE" to the database (trust me, I wrote it). I'm guessing "table" here is some other object local to the reddit codebase.

anyway, how many "thing" tables are there total?


How do you handle the size of the table (which I assume has many rows). Do the inserts start to slow down as it grows? How much RAM did you have for a master DB instance? Thanks for any input!


Reddit actually does use Cassandra, and you can read about their adventures and such with it here: http://blog.reddit.com/search/label/Cassandra


I guess now I know why reddit is so slow.

Among other problems, with a table structure like this it's hard to make good indexes.


No, it's not.

Everything (everything) is cached in memcachdb, which is also where they store their global variables.

It's a terrible design, but it's a web forum. All they need is good horizontal scaling.


Compared to what? Reddit is damn fast.

And this table structure has nothing to do with Reddit speed. The pages you get, 99.999% of them come from Cassandra and caches, pre-rendered.

So no, reddit is not slow, even less "so slow", and no, the table structure has nothing to do with it's speed.


Signed-in reddit is one of the slowest sites I use. It is, however, very fast for what it is doing. I have no complaints about the speed, only wishful thinking.


I concur. Signed-in reddit is very slow. Signed-out reddit is better but still feels slow. Perhaps I am spoiled by Facebook, Google & co.


Well this sort of explains the question I was thinking... which was how they would deal with such a massive table? At some point the indices will grow very large and need to be split up some how or they won't fit in RAM.


Postgres has feature called index partitioning, it solves such problems. There is also table partitioning. Creating/merging such tables and indexes should be done by your app automatically if you expect it to grow.


> created on the fly (a crime for which any DBA would have you put to death, normally)

Is the crime in creating tables on the fly? Or creating tables of identical structure on the fly?


A relational table is meant to store data, but is not meant to represent data by its own existence.

"Represent data by its own existence" means, if you wrote an app where every time a user created a new account, a brand new table called "user_account_XYZ" would be created to store that user's account information. This is the classic "tables-on-the-fly" antipattern. It's not the same as a horizontal sharding or replicated system where tables are structurally broken out for scaling purposes.

We of course have "on the fly" schemes for all kinds of things that are normally manually generated; some template languages automatically generate backing source files, cloud-based hosting spins up and tears down servers on demand, and database replication or sharding schemes will generate new schemas and tables to accommodate growth or change in the overall dataset. That's not what I mean when I say "on the fly" - I mean an application that relies upon creating new relational tables on the fly in order to represent data.


Both.

Except the second one would be a crime even if not on the fly.


I can't understand why this is a crime. Could you tell us (me) why it (creating new table on the fly) should be prohibited?


As the OP, would also appreciate Reddit dev input into what they're using now... I am summarizing a summary of a Reddit talk from 2010.


See my comment here: http://news.ycombinator.com/item?id=4468905

There isn't one table per subreddit.

I'm not a current dev, FYI, I left a year ago. But as far I know, it still works the same way as a year ago. Which is the same as it worked 3 years ago.


> If you want to build a Reddit, use Mongo/Cassandra or something like that.

Can they handle terabyte-scale data loads these days?


I can't really imagine mongo coping well with reddit, given the global write lock situation.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: