Pretty sure Reddit has thousands of tables - last time I looked it was really hard to tell, but this is how it seemed to work. It has "thing"/"data" tables for every subreddit - created on the fly (a crime for which any DBA would normally have you put to death). While I'm honored they use my library (SQLAlchemy) for relational database access, their actual usage of the relational DB couldn't be more... well... let's just say please don't imitate their style. If you want to build a Reddit, use Mongo/Cassandra or something like that. They'd very likely have done so themselves had NoSQL products been mature when they first developed their platform (I am vaguely recalling/guessing here on that one).
Edit: if any reddit devs want to correct me here, feel free, as I found the reddit source extremely difficult to follow back when I looked.
> It has "thing"/"data" tables for every subreddit - created on the fly (a crime for which any DBA would have you put to death, normally).
That's not correct. There isn't a "table" for each subreddit. There is a thing/data pair that stores metadata about subreddits, and there is a thing/data pair for storing links. One of the properties of a link is the subreddit it is in. Same with comments: there is one thing/data pair for comments, and the subreddit a comment is in is a property.
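To make the thing/data layout concrete, here's a toy sketch (my guess at the shape based on the description above - column and table names are made up, not reddit's actual schema; sqlite used for brevity):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# One thing/data pair per *type* (link, comment, subreddit), not per subreddit.
# "thing" holds the fixed columns; "data" holds arbitrary key/value properties.
con.executescript("""
    CREATE TABLE thing_link (
        thing_id INTEGER PRIMARY KEY,
        ups INTEGER,
        downs INTEGER,
        date TEXT
    );
    CREATE TABLE data_link (
        thing_id INTEGER REFERENCES thing_link(thing_id),
        key TEXT,
        value TEXT
    );
""")

con.execute("INSERT INTO thing_link VALUES (1, 10, 2, '2012-01-01')")
# the subreddit is just another property row, not another table
con.execute("INSERT INTO data_link VALUES (1, 'sr_id', '42')")
con.execute("INSERT INTO data_link VALUES (1, 'url', 'http://example.com')")

rows = con.execute(
    "SELECT key, value FROM data_link WHERE thing_id = 1"
).fetchall()
```

So a new subreddit means new rows, not new tables.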
> They'd very likely have done so themselves if NoSQL products were mature when they first developed their platform (I am vaguely recalling/guessing here on that one).
Actually, even today I tell people that if all you want to do is key/value, Postgres is faster than any NoSQL product currently available for doing key/value.
The main issue I run into with storing key/values in MySQL/Postgres is querying those values if they hold complex data (JSON, serialized data, etc.) - the NoSQL products have features that let you dig into those meta fields when querying. (I haven't used a lot of NoSQL products yet, just dabbled with them.)
Do you guys know of good techniques for doing that with a traditional database like MySQL? I know MySQL has some XML parsing features if you store metadata as XML. I've experimented with it but never used it in production. I sometimes put JSON into a column for dynamic data. Usually I only do that for fields that I know won't need to be queried, but that occasionally comes back to haunt me.
Don't know about MySQL, but Postgres just added some support for JSON in the latest (beta) versions. I haven't used it, so I don't know what it's capable of.
The support for JSON is only validation, otherwise it's just like a plain text column.
A better choice would be to use an hstore column, which is sort of like JSON but with a different syntax. Postgres supports indexing hstore subfields, querying on them, etc.
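Tangent on the "querying into a serialized column" question: newer versions of several relational databases grew JSON functions for exactly this. A minimal sketch using SQLite's json_extract (this assumes the JSON1 extension is available, which modern builds include by default; table and column names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, meta TEXT)")

# meta is just a plain text column holding serialized JSON
con.execute("""INSERT INTO items (meta) VALUES ('{"color": "red", "size": 5}')""")
con.execute("""INSERT INTO items (meta) VALUES ('{"color": "blue", "size": 9}')""")

# json_extract digs into the serialized blob at query time
row = con.execute(
    "SELECT id FROM items WHERE json_extract(meta, '$.color') = 'blue'"
).fetchone()
```

Note that without an expression index on json_extract(...), this is a full scan, so it's fine for occasional queries but not for hot paths.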
> That's not correct. There isn't a "table" for each subreddit. There is a thing/data pair that stores metadata about subreddits, and there is a thing/data pair for storing links. One of the properties of a link is the subreddit it is in. Same with comments: there is one thing/data pair for comments, and the subreddit a comment is in is a property.
Having a bit of trouble parsing this, but I think you mean that the "thing/data" tables are per type, where type is "subreddit", "comment", "links". Which would indicate a fixed schema.
Can you clarify if Reddit just has a fixed number of tables? I remember seeing some "table.create()" in there but I wasn't sure what I was looking at.
You are correct that reddit has a fixed number of tables (waves hands a bit). There is one table per type, so when a new type is created, a new table is created. That rarely happens.
The reason you see those create statements in the code is for bootstrapping for open source users -- the first time they run reddit, it creates the tables.
If it's a sqlalchemy.schema.Table, create() emits "CREATE TABLE" DDL to the database (trust me, I wrote it). I'm guessing "table" here is some other object local to the reddit codebase.
How do you handle the size of the table (which I assume has many rows)? Do the inserts start to slow down as it grows? How much RAM did you have for a master DB instance? Thanks for any input!
Signed-in reddit is one of the slowest sites I use. It is, however, very fast for what it is doing. I have no complaints about the speed, only wishful thinking.
Well, this sort of answers the question I was wondering about, which was how they deal with such a massive table. At some point the indices will grow very large and need to be split up somehow, or they won't fit in RAM.
Postgres has a feature called index partitioning, which solves such problems. There is also table partitioning. If you expect the data to grow, creating/merging such tables and indexes should be done automatically by your app.
A relational table is meant to store data, but is not meant to represent data by its own existence.
"Represent data by its own existence" means, if you wrote an app where every time a user created a new account, a brand new table called "user_account_XYZ" would be created to store that user's account information. This is the classic "tables-on-the-fly" antipattern. It's not the same as a horizontal sharding or replicated system where tables are structurally broken out for scaling purposes.
We of course have "on the fly" schemes for all kinds of things that are normally manually generated; some template languages automatically generate backing source files, cloud-based hosting spins up and tears down servers on demand, and database replication or sharding schemes will generate new schemas and tables to accommodate growth or change in the overall dataset. That's not what I mean when I say "on the fly" - I mean an application that relies upon creating new relational tables on the fly in order to represent data.
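To make the antipattern concrete, here's a toy version of both approaches (sqlite for brevity; names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Antipattern: the table's *existence* represents the data. The schema
# mutates at runtime, and "list all users" becomes a query against the
# system catalog rather than a SELECT.
def create_user_bad(con, user_id, email):
    con.execute(f"CREATE TABLE user_account_{user_id} (email TEXT)")
    con.execute(f"INSERT INTO user_account_{user_id} VALUES (?)", (email,))

# Relational design: one fixed table, one row per user.
con.execute(
    "CREATE TABLE user_account (user_id INTEGER PRIMARY KEY, email TEXT)"
)

def create_user_good(con, user_id, email):
    con.execute("INSERT INTO user_account VALUES (?, ?)", (user_id, email))

create_user_bad(con, 1, "a@example.com")
create_user_good(con, 2, "b@example.com")

tables = [
    r[0]
    for r in con.execute("SELECT name FROM sqlite_master WHERE type='table'")
]
```

In the bad version the number of tables grows with the number of users, which is exactly the "creating new relational tables on the fly in order to represent data" problem.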
I'm not a current dev, FYI - I left a year ago. But as far as I know, it still works the same way it did a year ago. Which is the same as it worked 3 years ago.