Hacker News
An Unorthodox Approach to Database Design : The Coming of the Shard (highscalability.com)
18 points by nickb on Aug 1, 2007 | hide | past | favorite | 9 comments



I've been working with a programming language called Erlang for about two years now, and its RAM-based database, Mnesia, almost forces you into this 'sharding' style.

In Erlang you also get the ability to have the same database on multiple computers or fragment the database into different servers that hold only a certain piece.

It certainly was a trick to unlearn the normalization techniques I had been using for years, but once you get the hang of it, the data really becomes more efficient.


The sharding approach seems to be gaining popularity. It seems like a good datapoint in favor of the anti-database crowd. Why even bother using an RDBMS when you end up having to take this approach in the end, anyway?


There's something familiar about this approach. It reminds me of JMP vs. LJMP commands in processor instruction sets that had partitioned memory, or a JMP command that was distance-limited (often to +/- 32k).

Effectively, sharding takes advantage of the locality of information. As soon as you need information outside of that locale, you need to make a more expensive request.


It's a bit more than that - sharding is fundamentally a way of spreading writes. You're limited in the number of writes you can make by the speed at which the disk rotates, so this quickly becomes a bottleneck, particularly in heavily-trafficked social sites. Sharding lets you spread your writes out among an arbitrary number of machines, so you can grow with traffic.

Big sites that are primarily read-hungry often don't need to shard at all, instead relying upon replication. Look at Craigslist - they have a single master database, about 20 slaves, and an archive cluster for postings that nobody looks at anymore. Or PlentyOfFish - 1 machine for billions of pageviews. Social sites with lots of user-created content need to shard much earlier - LiveJournal has a fraction of the users Craigslist does, but they had to start sharding around 2002/2003.

There're locality concerns as well, but they tend to be secondary to the write issue. Disks are big, and with good caching you can often avoid hitting the disk at all.
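The write-spreading idea above can be sketched in a few lines: hash a stable key (say, a user id) to pick one of N shards, so each user's writes always land on the same machine and the total write load spreads across all of them. This is just an illustrative sketch in Python; the shard names and helper are made up, not anything from the article.

```python
import hashlib

# Stand-ins for per-shard database connections (hypothetical names).
SHARDS = ["db0", "db1", "db2", "db3"]

def shard_for(user_id: str) -> str:
    """Deterministically map a user id to one shard.

    All reads and writes for that user go to the same machine,
    while different users' writes spread across all shards.
    """
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# Different users hash to (generally) different shards,
# so no single disk has to absorb every write.
for name in ("alice", "bob", "carol"):
    print(name, "->", shard_for(name))
```

The catch, as the locality point suggests, is that anything spanning two users on different shards now needs two requests.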


"Or PlentyOfFish - 1 machine for billions of pageviews. Social sites with lots of user-created content need to shard much earlier"

Is that really true? I find it hard to imagine...but something deep down inside of me says "Yes it's possible". Oh my...


Saying it's accurate is the kindest way to put it: http://news.ycombinator.com/item?id=23497


I think this is a re-invention of relatively well-known ways of optimizing databases. Still interesting to read about, however.


It used to be called horizontal partitioning before Google started using the sharding term with Google Filesystem. It's been around for a while. The new part seems to be the notion of pushing it down into the application code instead of paying Oracle or IBM a few million dollars for a DBMS that handles it automatically.
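"Pushing it down into the application code" often looks like a small partition map the app consults before every query - the kind of routing a commercial DBMS would otherwise do for you. A minimal sketch, with entirely hypothetical shard names and id boundaries:

```python
# Range-based partitioning maintained in application code.
# Each entry is (exclusive upper bound on id, shard name).
RANGES = [
    (1_000_000, "shard_a"),     # ids 0 .. 999,999
    (2_000_000, "shard_b"),     # ids 1,000,000 .. 1,999,999
    (float("inf"), "shard_c"),  # everything newer
]

def shard_for_id(item_id: int) -> str:
    """Walk the ranges in order and return the first shard that fits."""
    for upper, shard in RANGES:
        if item_id < upper:
            return shard

# e.g. shard_for_id(42) -> "shard_a"
```

The application then opens a connection to the returned shard and runs its query there; the trade-off versus hash-based routing is that ranges keep related ids together but can leave the newest shard taking most of the writes.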


How would one know what data to place in shards? Are there any "RDBMS to Shard" resources out there?




