> Adding a column to 10 million rows takes locks and doesn’t work.
It does not take locks, other than for very briefly.
1. Make a new empty table that has the same structure as the table you wish to add a column to. Add your new column to the empty table.
2. Put triggers on the old table that, whenever a row is added or updated, make a copy of the row in the new table or update the copy already there.
3. Run a background process that goes through the old table doing dummy updates:
UPDATE table SET some_col = some_col WHERE ...
where the WHERE clause picks a small number of rows (e.g., just go through the primary key sequentially). Since you aren't actually modifying the table, all this does is trigger the trigger on the specified rows.
4. When you've hit everything with a dummy update, rename the current table to a temp name, and rename the new table to the current table. This is the only step that needs a lock.
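Concretely, a rough sketch of steps 1 through 4 in MySQL syntax. All table, column, and trigger names here are made up, and the copied columns (id, a, b) stand in for whatever the real table has:

  -- step 1: clone the structure and add the new column to the empty copy
  CREATE TABLE items_new LIKE items;
  ALTER TABLE items_new ADD COLUMN new_col INT NULL;

  -- step 2: keep the copy in sync; REPLACE either inserts the row or
  -- overwrites the copy already there
  CREATE TRIGGER items_copy_ins AFTER INSERT ON items FOR EACH ROW
    REPLACE INTO items_new (id, a, b) VALUES (NEW.id, NEW.a, NEW.b);
  CREATE TRIGGER items_copy_upd AFTER UPDATE ON items FOR EACH ROW
    REPLACE INTO items_new (id, a, b) VALUES (NEW.id, NEW.a, NEW.b);

  -- step 3: walk the primary key in small ranges; per the description
  -- above, the dummy update does nothing except fire the triggers
  UPDATE items SET a = a WHERE id BETWEEN 1 AND 10000;
  -- ...repeat for the next id range until the whole table is covered...

  -- step 4: the only step that needs a lock; both renames happen as one
  -- atomic operation
  RENAME TABLE items TO items_old, items_new TO items;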
There are tools for MySQL to automate much of this. There was a post either here or on Reddit a while back about this which linked to them. I'm sorry but I didn't save a link to it so you'll have to search if you want it.
Alternatively, you can use a database such as PostgreSQL, which stores metadata about tables in other tables, allowing you to not only add a column without locking the table but do so as part of a transaction with other changes that can all be rolled back atomically on failure.
PostgreSQL also supports concurrent index creation, so if you realize later you need an index on your amazingly large table you can have it built in the background while you are still using the table. (Managing indexes was another locking issue mentioned in the article.)
> PostgreSQL also supports concurrent index creation
I use this all the time, and am flabbergasted how people can do without it. I feel like migration frameworks should make it the default with Postgres.
It's too bad it can't be mixed with transactional DDL, but because indexes are not logical changes, I don't really care as much, even if it is dissatisfying.
So, all in all, for those who want to take advantage of this feature in Postgres:
Stop doing this:
CREATE INDEX foo ...
Start doing this:
CREATE INDEX foo CONCURRENTLY ...
For the cost of one keyword, your index additions can be a non-event.
"allowing you to not only add a column without locking the table"
To be more clear, the actual advantage is that adding a column in postgres is an O(1) operation if the default value is NULL. It still requires taking a lock, but for many workloads you won't notice it. You still need to be aware of it though, because it can cause problems if you have long-running transactions.
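For example (hypothetical names), in Postgres:

  -- completes almost instantly regardless of row count, since only the
  -- catalog is updated and no rows are rewritten; it still takes a brief
  -- exclusive lock on the table while the catalog change is made
  ALTER TABLE big_table ADD COLUMN new_col text;

  -- by contrast, in the Postgres versions of this era, adding the column
  -- with a non-NULL default rewrites every row:
  -- ALTER TABLE big_table ADD COLUMN other_col text DEFAULT 'x';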
Or (shameless plug) use tokudb. We have many online schema change features that work within mysql. You can add and remove columns while the database is running, without resorting to complicated tricks with triggers. You just add the column, it completes immediately, and you start using it right away. The work of changing existing rows happens in the background, in a transactionally consistent manner, so you never see the old schema again. It's pretty sweet.
FWIW, I often look at tokudb with envy, but there is no way I will switch to MySQL from PostgreSQL for it: there are so many other things I'd lose that I would rather put up with my other issues. That said, and I totally admit it was now over a year ago that I was looking into various database implementation companies (and thereby might have you confused with a different company), it seems to me like your technology would be amazing if used as a standalone index (as opposed to also storing the heap), and adding indexes to PostgreSQL is both simple and "part of the point of using it in the first place".
(I also would have expected the PostgreSQL market to be better for this kind of thing, although I'm not in this business, so I'd love to understand where I'm wrong: the general argument being that relying on a third-party commercial index is a much lower commitment--and thereby a simpler sell--than relying on a third-party commercial storage engine, in addition to how most companies using MySQL seem to either know so little about database technology that their criteria was simply "popularity" or, alternatively, so much about database technology that to them MySQL is perfectly fine and they don't really need your solution.)
For the moment, I can say that MySQL has a nicely defined storage engine API which is why we release that, and that if you need something lower-level, please come talk to us because that's not the only thing we develop.
However, part of the point of storing things in a write-optimized data structure is that you can afford to store "the heap" directly in the index, and get good performance on range queries. Anyway, if you're interested, email us and we can talk about how we might be able to help.
TokuDB is interesting but priced a little outside what I'd be willing to pay. For the same price I can throw more machines at the problem and use Cassandra.
> It does not take locks, other than for very briefly.
You are right. I think this is a misunderstanding in the original material: if one does not use a default expression (i.e. the default is NULL) then the implementation they are using -- Postgres -- will just twiddle the catalog information. My guess is that they did not know this detailed system artifact: adding a NULL-valued column to a table of billions of records is, for most purposes, as fast as adding it to an empty table.
No need to do such a fancy switchover of tables, although one will still need to backfill a desired default, as you have indicated in step 3.
For people wanting to then use this solution, the detailed explanation is:
1. Make the column without a default.
2. Modify your application logic, as you would otherwise already have to, to fill in the column during insert and maintain it during update, but not to rely on or use the column for updating other data.
3. Alter the column to add the default, to be used for new rows past that point. (You do this after step 2, as otherwise you won't know which values should be defaulted in step 4 and which values need special attention.)
4. Finally, update, in reasonably-sized batches, all of the old rows that are still NULL to the "correct value" (which might be the default, but might be a calculation based on other data).
(edit, as I forgot the conclusion:) At this point you will have all of your data filled in correctly, so you alter the column to add a NOT NULL constraint (if you had wanted that, which you probably did, as otherwise you likely wouldn't have needed a non-null default in the first place) and then go back into your application and go about business as usual, using your new column in ways that would affect other data.
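A minimal Postgres sketch of the whole sequence (table, column, and values are all hypothetical):

  -- step 1: no default, so only the catalog changes
  ALTER TABLE users ADD COLUMN plan text;

  -- (step 2 is the application change described above)

  -- step 3: the default applies only to rows inserted from this point on;
  -- existing rows keep their NULLs
  ALTER TABLE users ALTER COLUMN plan SET DEFAULT 'free';

  -- step 4: backfill the old rows in reasonably-sized batches
  UPDATE users SET plan = 'free'
   WHERE plan IS NULL AND id BETWEEN 1 AND 10000;
  -- ...repeat for subsequent id ranges...

  -- conclusion: once nothing is NULL any more, add the constraint
  ALTER TABLE users ALTER COLUMN plan SET NOT NULL;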
> It does not take locks, other than for very briefly.
A lock (even a very brief one) can be infeasible on a sufficiently high-traffic table, though there's always the possibility of, say, a 30-second maintenance window.
MySQL 5.5 supports an atomic table switch, meaning no lock is necessary at all.
This is called a http://en.wikipedia.org/wiki/Entity-attribute-value_model or http://en.wikipedia.org/wiki/Triplestore. I think the author is understating the price, though. There's a lot of existing software you could reuse if your data was stored in more conventional relations, and "manually enforce consistency" is a pipe dream. Your code has expectations about your data, so in the abstract you do still have a schema, and not writing it down merely prevents any tools from helping you keep your data sane over time. I've seen Notes databases decay to the point that not even the dev team could explain how a document got into its contradictory state nor how the apps should (much less currently would) handle it. The few people diligent enough to do completely correct work without a checkable schema, aren't the people who would be tempted to try.
I recently read The Pragmatic Programmers book SQL Antipatterns: Avoiding the Pitfalls of Database Programming (http://pragprog.com/book/bksqla/sql-antipatterns) and chapter 6 talks specifically about this kind of DB design (entity-attribute-value) and the benefits/pitfalls. It's a great read and I highly recommend it.
Yeah, I'd been thinking "Magento" from about halfway into that article.
Wordpress uses the EAV db pattern too (have a look at the wp_options and wp_meta tables).
Of course, Magento and Wordpress use it in fundamentally different ways - so this project I've got to build a combined search across a site with both Magento (EAV for products) and Wordpress (non-EAV for posts/pages) in use, is in that "it'll sit on the backburner until it becomes critical or somebody else solves it for me first" state…
It should also be noted that this isn't the only way to do this either.
The other three most common ways I've seen to store differently structured data in a consistent way are:
* Edge table
* Binary table
* Universal table
The edge table approach maps edges into key:value tables, one per data type. So you might have an integer data table that is ID<key>:Value<int>, and it only stores the data values and a key to identify each value. And then you have a thing table that stores Name<string>:Flag<type>:ID<key>, where the flag column says "int" and the ID column then contains the key of some row in your integer table.
There are as many tables as there are types of data, + 1 table for the definition of all things stored.
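A rough SQL sketch of the edge layout (all names hypothetical):

  -- one value table per data type
  CREATE TABLE int_values (
      id    bigint  PRIMARY KEY,
      value integer NOT NULL
  );
  CREATE TABLE text_values (
      id    bigint PRIMARY KEY,
      value text   NOT NULL
  );

  -- plus one "thing" table describing each stored attribute and saying
  -- which value table (and which row in it) holds the actual data
  CREATE TABLE things (
      name     text   NOT NULL,  -- attribute name
      type     text   NOT NULL,  -- 'int', 'text', ... selects the value table
      value_id bigint NOT NULL   -- key of the row in that value table
  );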
The binary approach groups values of the same name into a table of that name. So if you have a property called "username" you'd have a table called "username", and this approach usually still includes the edge approach such that the values are stored in tables with ID:Value structure by type of value. Effectively partitioning the value table by name of the property being stored.
There are as many tables as there are types of data, + the count of unique property names.
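Continuing the sketch, the binary layout splits those value tables up by property name:

  -- one table per property name, still keyed and typed as above
  CREATE TABLE username (
      id    bigint PRIMARY KEY,
      value text   NOT NULL
  );
  CREATE TABLE age (
      id    bigint  PRIMARY KEY,
      value integer NOT NULL
  );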
The universal table approach uses a single table to store all data, effectively containing all of the edges in a graph. It's a conceptual full outer join of all binary tables. Imagine 100's of columns, and for each row the vast majority of those are NULLs.
The triplestore can be thought of as a condensed version (not type-safe) of the universal table. More space efficient, but usually at the cost of the datatype safety offered by the database.
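And the last two, sketched the same way:

  -- universal table: one very wide table, mostly NULL in any given row
  CREATE TABLE universal (
      id       bigint PRIMARY KEY,
      username text,
      age      integer,
      color    text
      -- ...hundreds more nullable columns...
  );

  -- triplestore: the condensed, untyped version of the same idea
  CREATE TABLE triples (
      subject   bigint NOT NULL,
      predicate text   NOT NULL,
      object    text   NOT NULL   -- everything is text; type safety is lost
  );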
You can combine these methods to store any type of semi-structured data in a traditional RDBMS and these approaches are used a lot if you know where to look for evidence of them.
SharePoint uses a version of the universal table. Oracle (for the ORA_XML storage type) uses a combination of the above under the hood to map XML into traditional database tables.
They all have pros and cons relating to query speed (for certain types of query), storage efficiency, indexing, types, etc. And as always you need to know why you're choosing something.
SharePoint also partitions its universal table dynamically. But it starts by giving each "SiteCollection" its own database/tablespace. As SiteCollections can contain a single site, this is basically equivalent to Reddit creating new tables per subreddit.
Elsewhere (up or down on this page) I saw that the SQLAlchemy author advises not to do this kind of thing. I simply say don't do if you don't know strongly why you should do it. Don't cargo-cult... go and read up on the problem space and only implement one of these solutions when you know the effects of doing so.
This is one of those times when analysis paralysis can be your friend and prevent you from building a mess you didn't need. If anyone starts with one of these designs and you haven't got a decade of experience to know why... you are probably doing it wrong.
Thanks for the comment, that makes more sense. I was trying to figure out in my head how to do the lookups to render comments with votes and what comments the user voted on, all from one data table.
Also I guess the table schemas are slightly different for the different "thing" tables.
That quote is just painful to read, littered with FUD and not a single bit of evidence to back it up.
You should worry about the database because it's probably your canonical storage of data, which for most of us is the most important part of our product/service/whatever. A good schema enforces consistent data, invariants, and all sorts of other stuff that you don't want to be dealing with on a manual (and buggy) basis.
Schema updates do not need to be slow. They might not always be as elegant as you hope but the big databases are improving on that front, and as tzs mentions - there are tricks that can be employed. With the latest and greatest PG, I believe we're even starting to get event triggers, so it may well be possible to do schema updates with replication. I also have a feeling the binary replication in PG 9 and up can even do it out of the box, with hot standby to still allow responses. I'm not entirely convinced replication is a backup solution, so maybe that was an operations antipattern. That's some baseless assertion from me though :)
If deployments are a pain, work to alleviate the pain. They are pretty mechanical, even if involved, which lends itself very nicely to being automated.
Seriously, we're smart people, let's not throw at least 30 years of research out the window in favour of glorified entity-attribute-value schemas.
Pretty sure Reddit has thousands of tables - last time I looked, it was really hard to see this but this is how it seemed like it was working. It has "thing"/"data" tables for every subreddit - created on the fly (a crime for which any DBA would have you put to death, normally). While I'm honored they use my library (SQLAlchemy) for relational database access, their actual usage of the relational DB couldn't be more...well... let's just say please don't imitate their style. If you want to build a Reddit, use Mongo/Cassandra or something like that. They'd very likely have done so themselves if NoSQL products were mature when they first developed their platform (I am vaguely recalling/guessing here on that one).
Edit: if any reddit devs want to correct me here, feel free, as I found the reddit source extremely difficult to follow back when I looked.
> It has "thing"/"data" tables for every subreddit - created on the fly (a crime for which any DBA would have you put to death, normally).
That's not correct. There isn't a "table" for a subreddit. There is a thing/data pair that stores metadata about a subreddit, and there is a thing/data pair for storing links. One of the properties of a link is the subreddit that it is in. Same with the comments. There is one thing/data pair for comments and the subreddit it is in is a property.
> They'd very likely have done so themselves if NoSQL products were mature when they first developed their platform (I am vaguely recalling/guessing here on that one).
Actually, still today I tell people that even if you want to do key/value, postgres is faster than any NoSql product currently available for doing key/value.
The main issue I run into with storing key/values in MySQL/Postgres is querying those values if they hold complex data (JSON, serialized data, etc); the NoSQL products have features that let you dig into those meta fields when querying. (I haven't used a lot of NoSQL products yet, just dabbled with them.)
Do you guys know of good techniques to do that with a traditional database like MySQL? I know MySQL has some XML parsing features if you store meta data as XML. I've experimented with it but never used it in production. I sometimes put JSON into a column for dynamic data. Usually I only do that for fields that I know won't need to be queried but that occasionally comes back to haunt me.
Don't know for MySQL, but postgres just added some support for JSON in the latest (beta) versions. Haven't used it and I don't know what it's capable of..
The support for JSON is only validation, otherwise it's just like a plain text column.
A better choice would be to use an hstore column, which is sort of like JSON but with different syntax. Postgres supports indexing hstore subfields, querying on them, etc.
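A minimal hstore sketch (table and keys are hypothetical):

  CREATE EXTENSION IF NOT EXISTS hstore;

  CREATE TABLE records (
      id   serial PRIMARY KEY,
      meta hstore
  );

  -- a GIN index lets subfield queries avoid scanning the whole table
  CREATE INDEX records_meta_idx ON records USING GIN (meta);

  INSERT INTO records (meta) VALUES ('zip_code => 90210, color => red');

  -- pull out a single subfield, or filter on one
  SELECT meta -> 'color' FROM records;
  SELECT * FROM records WHERE meta @> 'zip_code => 90210';  -- can use the index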
> That's not correct. There in't "table" for a subreddit. There is a thing/data pair that stores metadata about a subreddit, and there is a thing/data pair for storing links. One of the properties of a link is the subreddit that it is in. Same with the comments. There is one thing/data pair for comments and the subreddit it is in is a property.
Having a bit of trouble parsing this, but I think you mean that the "thing/data" tables are per type, where type is "subreddit", "comment", "links". Which would indicate a fixed schema.
Can you clarify if Reddit just has a fixed number of tables? I remember seeing some "table.create()" in there but I wasn't sure what I was looking at.
You are correct that reddit has a fixed number of tables (waves hands a bit). There is one table per type, so when a new type is created, a new table is created. That rarely happens.
The reason you see those create statements in the code is for bootstrapping for open source users -- the first time they run reddit, it creates the tables.
if it's a sqlalchemy.schema.Table, create() emits DDL for "CREATE TABLE" to the database (trust me, I wrote it). I'm guessing "table" here is some other object local to the reddit codebase.
How do you handle the size of the table (which I assume has many rows). Do the inserts start to slow down as it grows? How much RAM did you have for a master DB instance? Thanks for any input!
Signed-in reddit is one of the slowest sites I use. It is, however, very fast for what it is doing. I have no complaints about the speed, only wishful thinking.
Well, this sort of explains the question I was thinking about... which was how they would deal with such a massive table? At some point the indices will grow very large and need to be split up somehow, or they won't fit in RAM.
Postgres has a feature called index partitioning, which solves such problems. There is also table partitioning. Creating/merging such tables and indexes should be done by your app automatically if you expect it to grow.
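A hedged sketch of the table-partitioning side: the declarative syntax below arrived in later Postgres releases (10+), while at the time this was done with table inheritance plus CHECK constraints to the same effect, and all names are made up:

  CREATE TABLE comments (
      id         bigint      NOT NULL,
      created_at timestamptz NOT NULL,
      body       text
  ) PARTITION BY RANGE (created_at);

  CREATE TABLE comments_2012 PARTITION OF comments
      FOR VALUES FROM ('2012-01-01') TO ('2013-01-01');

  -- each partition carries its own, much smaller index that can stay in RAM
  CREATE INDEX ON comments_2012 (id);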
A relational table is meant to store data, but is not meant to represent data by its own existence.
"Represent data by its own existence" means, if you wrote an app where every time a user created a new account, a brand new table called "user_account_XYZ" would be created to store that user's account information. This is the classic "tables-on-the-fly" antipattern. It's not the same as a horizontal sharding or replicated system where tables are structurally broken out for scaling purposes.
We of course have "on the fly" schemes for all kinds of things that are normally manually generated; some template languages automatically generate backing source files, cloud-based hosting spins up and tears down servers on demand, and database replication or sharding schemes will generate new schemas and tables to accommodate growth or change in the overall dataset. That's not what I mean when I say "on the fly" - I mean an application that relies upon creating new relational tables on the fly in order to represent data.
I'm not a current dev, FYI, I left a year ago. But as far I know, it still works the same way as a year ago. Which is the same as it worked 3 years ago.
My fear with headlines like this is that people with no business working at the scale that Reddit does will suddenly eschew years of best-practice SQL and be like "my dad's pizza shop CRM only needs two tables, JUST LIKE REDDIT".
Everyone please, use your brain before repeating such specific configurations. Reddit is quite exceptional. Your burgeoning to-do list app is not.
Umm, assuming the github repo is what they actually use (I assume so given how often it is committed to), there are two tables per object: Reddit_thing_link, reddit_data_link, reddit_rel_thing_savehide, etc etc
Datomic takes similar approach.
A unit of Entity, Attribute, Value and Time is called datom.
Although this approach doesn't require a schema, Datomic forces us to define one for some benefits.
This is roughly the path I'm taking on my current project. There are some things that have a consistent, "traditional" schema. However, many of the feature-important things are being stored as JSON blobs. This means that I don't have to worry about schema migrations when I add a new feature; instead, I just define its JSON schema. This works particularly well because I don't need to join on these items. The downside is that I do have to do occasional manual processing on them (sum up all values of type FOO) that would be trivial if everything were perfectly normalized.
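A rough sketch of that hybrid layout (the names are made up; jsonb arrived in later Postgres releases, earlier ones have the plainer json type):

  CREATE TABLE features (
      id         serial      PRIMARY KEY,
      user_id    integer     NOT NULL,            -- conventional, joinable columns
      created_at timestamptz NOT NULL DEFAULT now(),
      payload    jsonb                            -- feature-specific blob, no migration needed
  );

  -- the occasional manual processing mentioned above, e.g. summing a field
  SELECT sum((payload ->> 'foo')::numeric) FROM features;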
I looked at that (and have done that in the past), but I really missed some of the ORM features I got with relational. I wound up writing more glue code than I liked for things that would otherwise be free. This is my compromise world (and Postgres is getting better at handling its JSON).
I used to work at a company with a huge LDAP database that had been growing in uncontrolled and organic ways for years. If you've never worked with LDAP before, it's basically a big key-value store, with some optional schema enforcement. We spent an inordinate amount of time manually cleaning it up and trying to enforce some kind of data integrity and we were always bemoaning the lack of a real structural schema.
I think the kind of ad-hoc generic data typing described in this article is sometimes the right solution but it comes at a cost.
I wish I had read this article a year ago, before I started developing my latest app. I have that exact problem, where adding a column to a 10M+ row table takes over an hour. So instead I end up with "things" tables all over the place.
The last company I worked at made an application that used the Entity-Attribute-Value pattern in the database. The stated reason was to be dynamic, so we didn't have to worry about adding new columns and the associated downtime (assuming the DB got huge, which of course it would because this app would surely be a huge success). We had that problem on our main app (with 10s of millions of rows) where adding a column was always tricky, so I think management over-corrected. The other supposed win was that since the model didn't change, the code didn't need to be updated.
The data that was being stored fit into the relational model pretty well. But thanks to E.A.V. it was very difficult to query. The kinds of questions we often looked at (how many records from this zip code) would have been trivial without the E.A.V. Today you might use a NoSQL database (which were just starting to get noticed at the time), but in reality it fit into MySQL just fine.
The real sad part is, we never used that functionality in the 2-3 years after it was developed while I was there. The app wasn't big enough for adding columns to take much time at all. All that "flexibility" we needed? We didn't use it, because it would have taken additional time to implement the additional front-ends and update the other backend systems.
Even if we wanted to keep things smaller, we could have gone with a table per type of record (record_type_one, record_type_two, etc) instead of one big records table. That would have made schema changes easier.
And of course, the code did need to be updated. Always. Sure there were no new columns that might cause problems if they didn't have default values, but you can never make changes to an app without code changes. We still had to implement the new interface. We had to implement the code to post that new kind of record to the systems it got processed by. Making the code handle the DB changes would have taken less time than the day or two a designer might work on the front end. It never would have been a bottleneck.
It was designed to replace a growing set of systems that were all tiny forks of the same basic code base over a couple of years. Managing all that had become a mess, and it did need to be replaced. But there was an opportunity, and we ended up trying to reach for the stars when we should have aimed a bit lower.
The irony is that after the system had been in production for a few months, we noticed it had terrible performance that was getting worse with load.
It turned out the programmer who had written some parts of the system had it recalculating way too much data, things that didn't apply to what was going on and that couldn't have changed. It was probably an artifact from initial development (I'll do it this way to get it going, then cleanup later...). Once we caught and fixed that, it was much much faster.
That was the only optimization I remember it needing while I was there.