I think in some cases this kind of appeal to authority can be valid.

Facebook has absolutely insane sparse matrices to handle. They handle enormous volumes of traffic querying very specific (read: not cacheable between users) datasets. Moreover, they've already invested mind-boggling amounts of capital into their stack. The same goes for Amazon with Dynamo. These people operate on scales that startups like us can't even comprehend, and they've found it worthwhile to write their own datastores for those scenarios. Moreover, their use of those databases has apparently contributed to their success. That, to me, is strongly suggestive evidence.

That, and HA/fault-tolerance is a no-brainer; Cassandra's scaling characteristics rock the socks off of any SQL DB I've used. The consistency tradeoff is well worth it for some use cases.
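
(The consistency tradeoff is usually framed in terms of replica counts: with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap on a fresh copy whenever R + W > N. A rough sketch of that arithmetic in Python, with made-up numbers rather than anyone's production settings:)

    # Quorum rule used by Dynamo-style stores (Cassandra included) to trade
    # consistency against latency/availability. Numbers are illustrative only.
    def read_sees_latest_write(n_replicas, write_acks, read_replicas):
        """True if any read must overlap at least one replica holding the latest write."""
        return read_replicas + write_acks > n_replicas

    print(read_sees_latest_write(3, 2, 2))  # True  -> quorum reads/writes: consistent
    print(read_sees_latest_write(3, 1, 1))  # False -> reads may return stale data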




Absolutely. Facebook. Google. All great examples of the need for different solutions. But I'm not sure about Digg. It seems like a very straightforward implementation would work for them. But given the small amount of information they've provided about their setup, it doesn't sound like they've ever gone for one.

Compare them to StackOverflow, which, by recent evidence, has about 10% of the traffic of Digg. They're running a very straightforward RDBMS configuration on rather pedestrian hardware. If Digg has a 50-node cluster (for example), StackOverflow should require at least a 5-node cluster.


StackOverflow runs on a 3.33 GHz quad-core x 2, 48 GB RAM, 6-drive RAID 10:

http://blog.stackoverflow.com/2009/12/stack-overflow-rack-gl...

So if Digg is 10x the size, they would need (assuming perfect scaling) a 3.33 GHz quad-core x 20, 480 GB RAM, 60-drive RAID 10.
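
(Back-of-the-envelope version of that arithmetic, naively assuming perfectly linear scaling, which real workloads never give you:)

    # Naive linear scale-up of the StackOverflow box to 10x the traffic.
    # Purely illustrative; real capacity planning never scales this cleanly.
    so_box = {"quad_core_cpus": 2, "ram_gb": 48, "raid10_drives": 6}
    traffic_multiplier = 10

    digg_guess = {part: amount * traffic_multiplier for part, amount in so_box.items()}
    print(digg_guess)  # {'quad_core_cpus': 20, 'ram_gb': 480, 'raid10_drives': 60}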

Oh, but you can't get more than 8 sockets in an x86 box, and Windows only runs on x86 now. So assuming you switch away from Windows, you'll need a Sun or IBM machine for that.

Sun's kit is only dual-core and the processors aren't as fast (either per GHz or in clock rate), so here is the 64-CPU model you'd need, and it already comes with 64 disks: "For a 64-processor gorilla, 2.4 GHz SPARCVI dual-core chips, 6 MB of on-chip L2 cache, 128 GB of memory and a 64 x 73 GB SAS drive raise the price tag to $10,100,320."

http://www.serverwatch.com/hreviews/article.php/3688771/Serv...

And don't forget you'll probably want to upgrade the 128 GB of memory it comes with, and you'll really need two of these (the second one for failover).


"Windows only runs on x86 now"?

PlentyOfFish.com is, according to these posts: http://highscalability.com/plentyoffish-architecture http://www.codinghorror.com/blog/2009/06/scaling-up-vs-scali... ...running: 512 GB of RAM, 32 CPUs, SQL Server 2008, and Windows 2008.

$10 million? Try ~$100,000. (Granted, the article you linked was from 2007.)

My company spends >$10k on fricking meetings to discuss whether they should spend $20k on a server (as well as the other technical details they are unqualified to be discussing). Of course, anyone who actually knows anything about it is not welcome at these meetings. :)


The 32 CPU quote [originally from POF's blog and repeated w/o correction elsewhere] is incorrect; the HP server in question has 8 quad-core _CPUs_, for a total of 32 _cores_.

So it's still a factor of 2.5 away from our hypothetical 80-core requirement, and not a refutation of the GP's claim that x86 maxes out at 8 sockets.


That doesn't make any sense. You're assuming that they have 10 times the content, need 10 times the processing power, and derive no advantage from caching (going from 48 GB RAM to 480 GB RAM).

That is not how capacity planning works at all.


The scenario is 10x the traffic, not 10x the content. And given that Digg is an update-heavy site because of all the voting (which affects how pages are rendered for your friends), assuming that caching is a magic wand here seems bogus.


I don't think the 50-node cluster mentioned in the article is for Digg -- it's for http://simplegeo.com. :)


Yeah, I'm a little surprised that Digg has moved away for performance reasons. Maybe their data model is fundamentally more complex than StackOverflow's? Or maybe SO has a better caching layer in front of the service?


Or maybe Digg's model is flawed. I don't know if it is or not, but from everything I'd read it was far from optimal. I'd love to see more about it, though. Now, relationship graph traversal is an issue for normalized relational systems, but in these cases things could be split pretty cleanly between articles and recommendations.

One big problem I see in these comparisons: when a NoSQL person claims that their box is processing 5000 req/sec, what does that mean? Are they denormalizing so much that it's equivalent to 500-1000 req/sec on an RDBMS?
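
(To make that concrete, a made-up example, not Digg's actual schema: a denormalized store can turn one logical vote into a write per follower, while the normalized version is a single row insert with the fan-out paid at read time:)

    # Hypothetical illustration of why raw req/sec numbers aren't comparable.
    # Not Digg's actual schema -- just a sketch of write amplification.
    followers = {"alice": ["bob", "carol", "dave"]}  # who follows whom

    def denormalized_vote(timelines, voter, article):
        # Dynamo/Cassandra-style: push the vote into every follower's timeline,
        # so reads are one lookup but one logical vote = N physical writes.
        writes = 0
        for follower in followers[voter]:
            timelines.setdefault(follower, []).append((voter, article))
            writes += 1
        return writes

    def normalized_vote(votes_table, voter, article):
        # RDBMS-style: one row insert; the fan-out is paid at read time via a join.
        votes_table.append((voter, article))
        return 1

    timelines, votes = {}, []
    print(denormalized_vote(timelines, "alice", "article-42"))  # 3 writes
    print(normalized_vote(votes, "alice", "article-42"))        # 1 write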

Another thing: when Digg was starting, their type of site was very novel. There wasn't much out there that approached the scale and growth they experienced. I'm sure that StackOverflow has been designed with scaling in mind.


I'm not that familiar with Digg (it's the same as Reddit though, right?), but two major things occur to me:

1) I'm much less likely to vote an answer/question up or down on SO than I am on Reddit. On SO, if I'm not asking or answering, I'm rarely causing any writes to the data store. On Reddit, I vote on most of what I look at. I could see this having a huge impact.

2) Obviously Reddit can do some caching, but I think SO can cache much larger pieces of data. As far as I know, everyone who goes to the main page of SO sees the exact same list of questions. On Reddit, each subreddit's top items can be cached, but they are mixed differently for each user.
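
(Roughly the difference, as a sketch; this is not either site's actual caching code:)

    # Illustrative only: how "everyone sees the same page" changes what you can cache.
    cache = {}

    def so_front_page(render):
        # One cached page serves every visitor.
        if "front_page" not in cache:
            cache["front_page"] = render()
        return cache["front_page"]

    def reddit_front_page(user_subreddits, top_items_for, mix):
        # Per-subreddit lists can be cached, but the final page is assembled
        # differently for each user, so the whole page can't be cached once.
        fragments = []
        for sub in user_subreddits:
            if sub not in cache:
                cache[sub] = top_items_for(sub)
            fragments.append(cache[sub])
        return mix(fragments)

    print(so_front_page(lambda: "<same html for everyone>"))
    print(reddit_front_page(["python", "programming"],
                            lambda sub: [sub + ": top story"],
                            lambda frags: [item for frag in frags for item in frag]))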


I'm not familiar with the Digg data model, but the SO data model is not particularly complicated, and any nested interactions between entities, if there even are such things, are, I'm sure, handled in non-real-time batch processing.

Facebook, on the other hand, is incredibly complex because of all the interactions between users, not to mention that the data is stored, if I recall correctly, in geographically disparate locations throughout the world. I don't have a link, but the shit that happens behind the scenes when you log on to your Facebook account is wild.


Facebook's primary data storage is MySQL supported by memcached. Cassandra is only used for messaging (inbox search?). This is at least from what I have read -- it could be outdated, though.
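
(From the way it's generally described publicly, the pattern there is a look-aside cache: try memcached first, fall back to MySQL on a miss, then populate the cache. A sketch with stand-ins, not Facebook's actual code:)

    # Look-aside (cache-aside) read pattern, sketched with stand-ins:
    # a dict plays the memcached tier and a stub function plays the MySQL query.
    memcached = {}

    def query_mysql(user_id):
        # Stub standing in for a real SELECT against the MySQL tier.
        return {"id": user_id, "name": "example user"}

    def get_user(user_id):
        key = "user:%d" % user_id
        user = memcached.get(key)          # 1. try the cache first
        if user is None:
            user = query_mysql(user_id)    # 2. on a miss, hit the database
            memcached[key] = user          # 3. populate the cache for the next reader
        return user

    print(get_user(42))  # first call misses and fills the cache
    print(get_user(42))  # second call is served from the cache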


Facebook is an especially pertinent example, considering how Friendster died due to scaling problems at a tiny fraction of the traffic and feature set that Facebook has.



