I think rewriting from scratch is the core of their problem and not really Cassandra. Gradually going over to Cassandra would have been a much better idea.
I've heard this time and time again, but what if your application is a genuine ball of mud? Would you really not advocate a rewrite for an unmaintainable spaghetti classic asp app still in production today?
Microsoft could be one example of a successful one. Their flagship Windows codebase that ran the 3.1->95->98->ME line was basically ditched with the rewrite-from-scratch NT (famously done by an ex-VMS team), which later had some APIs ported to it so that Windows 2000 and especially XP were close to drop-in replacements for the old line, while not really sharing much code. I think in retrospect that was probably a good idea: the NT rewrite put the codebase on much better footing than the aging, incrementally updated classic Windows codebase had been.
Solaris is another example of a rewrite that seems to have worked, though the rewrite did derive from a different set of existing code, not a total from-scratch job. But the classic SunOS 1.x-4.x codebase was ditched, and SunOS 5.x / "Solaris 2" replaced it.
You covered Windows and Linux - don't forget Apple's switch to OS X. They would have done a wholesale switch at some point even if they hadn't gone with something Unix-based, because the other alternative was Copland, the internal project to do a complete rewrite.
It's a good point that NT was a successful rewrite. However, it's worth noting how this was done. NT was originally aimed at a different market, there was an overlap of several years where the old system was still available, AND NT ran Windows 3.1 apps in their own subsystem, which contained ... the codebase of Windows 3.1.
I think the NT and OS X updates were absolutely required for architectural reasons that impact security; 98/ME and Mac OS 9 simply weren't suited to the types of threats on the internet (no support for dropping process privileges, file permissions, cutting off direct access to hardware, etc.). If you think XP is bad in 2010, think what ME would be like if it were still widely installed: one program crashing the whole machine, boot sector viruses, etc.
I would advocate an incremental rewrite of parts of the unmaintainable spaghetti classic asp until there's none of the spaghetti left. It's easier to rewrite part of a system than an entire system.
Release to production dozens, if not hundreds of times. Releases are non-events, rollbacks are non-events.
A system-wide ground-up rewrite with a big-bang switchover at the end is a classic clusterfuck recipe. It's a shame that so many people think it's a good idea, even in 2010.
> incremental rewrite of parts of the unmaintainable spaghetti classic
Sounds good in theory. In practice? Part of the problem with many big ball of mud systems is that all the parts depend on and talk to all the other parts. Want to fix that horrid DB schema? You'll have to rewrite all the code that talks to it, or rewrite it to talk to an intermediary. Want to rewrite that horrid bit of code called foobar_20040623 ("foobar" has been changed, but yes, I saw this in some PHP code...)? You'll have to find out everything it interacts with, and likely redo that too.
In practice? It's not easy, but nothing is. I see three options to the big ball of mud problem.
1. Wallow in it (work with existing structure). Sadly, this is what a lot of people do. I left a job once because that was the only way out of the ball of mud. I was afraid I would turn into a mud-person.
2. Slowly crawl out of it (incremental rewrite). This is hard, but doable. It involves setting up barriers to mitigate ripple effects (there's a small sketch of one such barrier below), automated testing so you can be comfortable with frequent releases, and tolerance for temporary imperfection (basically, you need to be willing to frequently release things that are only a tiny bit better than the status quo, even if that's not ideal). Not everyone is willing to accept this persistent imperfection and lack of conceptual consistency, especially when option #3 is more exciting and fun.
3. Try to leap out of it and land wherever (total rewrite). This is very, very hard, and prone to total failure or spending valuable money and time to effectively stand still in the market.
I've seen development teams leap directly out of one ball of mud into a different one. One where nobody even knew their way around anymore. How is that anything but a huge waste of time and resources?
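To make the "barriers" in option 2 a bit more concrete: the usual trick is a thin facade that new code goes through, so the legacy internals can be rewritten later without touching every caller. Here's a minimal Python sketch with made-up names (LegacySpaghetti, UserStore) - just an illustration of the shape, not anything from a real codebase:

```python
# Hypothetical sketch: a thin facade ("barrier") in front of legacy code.
# New code talks only to UserStore; the mess behind it can be rewritten
# piece by piece without rippling through every caller.

class LegacySpaghetti:
    """Stand-in for the existing ball of mud (tuples, odd naming, etc.)."""
    def __init__(self):
        self._rows = {1: ("1", "alice", "alice@example.com")}

    def fetch_user_row_v2_final(self, user_id):
        return self._rows[user_id]


class UserStore:
    """Single choke point for user reads - the only place that knows
    about the legacy representation."""
    def __init__(self, legacy):
        self._legacy = legacy

    def get_user(self, user_id):
        row = self._legacy.fetch_user_row_v2_final(user_id)
        return {"id": row[0], "name": row[1], "email": row[2]}


store = UserStore(LegacySpaghetti())
print(store.get_user(1))  # {'id': '1', 'name': 'alice', 'email': 'alice@example.com'}
```

Once every caller goes through the facade, the legacy guts become one module you can replace on its own schedule.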
Fascinating topic - lots of startup talk is fun because you get to do things from scratch, but this is very pertinent to the Real World.
In terms of "standing still in the market", I think the incremental rewrite contains a lot of that too, no? It's just spread over more time. You're still rewriting it, and during that time, you're not adding new features. An example might be retrofitting some testing code to a system that's never had test code. That could potentially be a fair amount of work, and given a constant pool of resources, it will take time away from "new stuff". Just that it's not so much of a quantum leap - you can still drop your new testing code and go implement some must-have feature if you need to, without saying you have to wait for the whole thing to be ready.
Sadly though, my experience in this is that the reason there's a ball of mud in the first place is a political/social one, so that any "dead time" is frowned upon.
In reality yes, it works, I've done it. We took a horrible, accreted web application, and rewrote it in stages over a period of about 12 months. At the same time we were making regular releases, and needless to say the site stayed up the whole time.
You just have to plan things carefully, work hard, and keep your head screwed on. (Just like with many things in life ...)
I agree, although it's a somewhat sensitive thing to write about if you're doing it in practice. I sure as hell don't want to be known as the guy who comes in and calls all of the existing code a pile of shit.
I guess I could "change the names to protect the innocent" and tell some stories about digging out of tight places incrementally. If it would convince even one development team that they didn't absolutely have to do a total rewrite, it would be worth it.
Another issue is whether you are just rewriting the code, or fundamentally changing your data store (as Digg have done).
Rewriting the code is fine - if it goes wrong, just back up to the old code, users won't notice the difference. You can do it on a page-by-page basis, just route URLs selectively to the new install.
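For what it's worth, the page-by-page routing really can be that dumb: a prefix table in front of the two installs, where rollback just means shrinking the list. A rough Python sketch with hypothetical paths and hostnames (nothing to do with Digg's actual setup):

```python
# Hypothetical sketch of page-by-page cutover: send a short list of URL
# prefixes to the rewritten app and everything else to the old install.
# Rolling back a page just means removing its prefix from the list.

NEW_APP = "http://new-app.internal"   # assumed internal hostnames
OLD_APP = "http://old-app.internal"

MIGRATED_PREFIXES = [
    "/login",      # pages already rewritten
    "/profile",
]

def pick_backend(path):
    """Return the backend that should serve this request path."""
    if any(path.startswith(prefix) for prefix in MIGRATED_PREFIXES):
        return NEW_APP
    return OLD_APP

assert pick_backend("/profile/42") == NEW_APP
assert pick_backend("/story/123") == OLD_APP   # untouched pages stay on the old code
```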
Migrating data is something that should be done only under the most extreme circumstances - and something will inevitably go horribly wrong, so be prepared to roll back.
Because migration is a scenario that may go horribly wrong, you should be prepared to tackle it in small pieces. So often (especially with relational databases; I don't have much experience with the NoSQL realm) everything is in one logical and physical data store, even for unrelated features. That makes any migration harder than it needs to be.
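One way to keep the pieces small is to migrate per feature or per table in bounded, verified batches instead of one giant copy. A toy Python sketch - in-memory dicts stand in for the real stores, so the verification step is obviously simplistic:

```python
# Hypothetical sketch: migrate one feature's data in small, verifiable
# batches rather than one big-bang copy. The dicts stand in for the real
# source and destination databases.

old_store = {f"comment:{i}": {"id": i, "text": f"hello {i}"} for i in range(1000)}
new_store = {}

def migrate_in_batches(src, dst, batch_size=100):
    keys = sorted(src)
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        for key in batch:
            dst[key] = src[key]          # copy this batch
        # Verify before moving on; a failure here keeps the blast radius
        # to one small batch instead of the whole data set.
        for key in batch:
            assert dst[key] == src[key], f"mismatch on {key}, stop and investigate"

migrate_in_batches(old_store, new_store)
print(len(new_store), "records migrated")  # 1000
```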
I'd say it depends on whether your engineering talent is vastly superior today to what it was when the code was written. Even then it's so risky. It's probably always better to do it incrementally, even if it takes twice as long, because you can maintain working software and fix bugs as you go.
The approach I would take is to get the minimum set of engineers who know the most about each major aspect of the code, and put their heads together on what the ideal architecture would be. But rather than building it from scratch, figure out how to implement just one of those pieces now. That way you can decrease entropy in the codebase piecemeal without chucking out all the code at once, which is no doubt full of forgotten assumptions that no one will remember until it's too late.
The other problem is that they have been hyping V4 for a long time by making beta testing invite-only and hard to get into. It would have been better if they could have had a public V4 beta running in parallel with V3, since it is a rewrite.
you can't do a parallel beta test of two functionally different sites. You would end up with two separate data stores, workflows, and ultimately a confused user base.
Never mind the fact that the user base is confused by the rollout.
And never mind the fact that Google has forever damaged the term "Beta" in the minds of general Web consumers.
> And never mind the fact that Google has forever damaged the term "Beta" in the minds of general Web consumers.
The general web consumer still thinks betas are a type of fish. Not really relevant to the median Digg user, though, since they are not the general Web user.
Considering the completely different backends of v3 and v4, I think this would have been incredibly hard to implement - at least if changes in one system were supposed to end up in the other.
In this case you would not just have to write scripts for a one-time migration of all the needed data from v3's MySQL to v4's Cassandra. No - you would have to build a mechanism that not only moves data in both directions but also works at near-realtime speed.
If you then need consistency of the transferred data, this quickly gets impossible. Try finding a way to ensure consistency between these two completely different architectures.
In the end, most of v3 would have needed to be rewritten for a parallel use to be possible, at which point you don't really gain much.
"not hard to implement" - sigh
Disclaimer: I don't work at digg and I don't know more about their backend than the rest of the public. I did, however, just get around to doing something like what you describe, and there it was "just" a different schema on the same database backend - and even that would have been hell.
In a way, the one-time migration might be harder than near-realtime bidirectional synchronization. With synchronization in place, you could move portions of users to the new system and back as needed. A sudden leap from one backend to another is like jumping over the Grand Canyon on a motorcycle. Personally, I would rather build a bridge.
For the concurrent configuration to work consistently, digg would probably have to port v3 to the new backend.
Now, in the current case, most of the issues apparently come from the non-working backend as opposed to the changed feature set.
So while they could have run the two versions in parallel, they would not have gained anything. Likely, this was their rationale behind not doing so in the first place.
No, for the current configuration to work, you would have to create a way to move data in both directions from the "old" backend to the "new" backend. Ideally, you would slice up the beast so you didn't have to do this with everything all at once.
It's not a system that would be ideal for something transactional like a bank, but it may have been possible for an organization like Digg.
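A crude version of that bridge is a dual-write wrapper: every write goes to both backends, each user's reads come from whichever backend they're currently assigned to, and a mismatch report tells you whether the new store can be trusted yet. A hedged Python sketch with in-memory stand-ins for the two backends - it deliberately ignores the hard parts the comments above mention (conflicts, lag, partial failures):

```python
# Hypothetical sketch of a dual-write "bridge" between an old and a new
# backend. Dicts stand in for MySQL and Cassandra; conflicts, replication
# lag and partial failures - the genuinely hard parts - are not handled.

old_backend = {}            # stand-in for the v3 store
new_backend = {}            # stand-in for the v4 store
users_on_new = {"alice"}    # cohort currently served from the new backend

def write(user, key, value):
    """Write to both stores so either side can take over."""
    old_backend[(user, key)] = value
    new_backend[(user, key)] = value

def read(user, key):
    """Read from whichever backend this user is currently assigned to."""
    store = new_backend if user in users_on_new else old_backend
    return store[(user, key)]

def mismatches():
    """Keys where the two stores disagree - a cheap trust signal."""
    return [k for k in old_backend if old_backend[k] != new_backend.get(k)]

write("alice", "bio", "hello")
write("bob", "bio", "hi")
print(read("alice", "bio"), read("bob", "bio"), mismatches())  # hello hi []
```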
No, what they did wrong was not making a proper two-way conversion tool for their database backup.
Starting from scratch is no problem if you have your core data covered; then you can always revert.
Don't ever develop yourself into a one-way street!!
This is my all time favorite Spolsky blog entry. This lesson has saved me a few times. The broader lesson is that new and shiny in software isn't better and is often worse.
Yep, big bang rewrites are always scary. Doing something more incremental would have been safer.
One way to "safely" do a big-bang change would have been to use both architectures in parallel and sync back and forth. That way, if the new architecture fails, you still have the old one.
Of course, that is lots of effort, and restricts new features (which may be a good thing anyway).
Is there any reason that Cassandra is the focus of this article? It is really silly and irresponsible to peg a nascent project like that without any reasoning or sources. I'm sure something changed besides just a Cassandra rollout, and wasn't Digg using it on v3 too?
I think Cassandra is pretty well tested. There have been lots of super-large-scale deployments. It just seems lame to blame it on that, but I guess maybe their anonymous sources inside Digg revealed it? But then we'd hope they'd know if the problem was with the datastore or the implementation.
I get that some VP suggested a new buzzwordy technology; they gave him enough rope to hang himself, he did, and he left the company with a broken pile of crap. It could happen - if you have a healthy company, you give trust to people. That it got this far doesn't speak well of the rest of management. It doesn't speak really well of the rest of the team either. Shouldn't there have been some circuit breakers or something?
Digg isn't a poor, bring-your-own-laptop startup. They've got resources, and they've had substantial investment. They can afford to build and test software, and I know of no real marketing reason they had to push something untested out. Rose could have gone out and said it wasn't done and was going to take more time.
How does Rose keep his job? Wasn't he this VP's boss?
And it's a single technology? That couldn't have been vetted and tested independently of all of digg? Really? And MongoDB, or Hadoop, or one of the dozens of other NoSQL stores wouldn't work, either? You do what you have to do, and there is never a 'truth' when VPs and CxOs are canned, but it doesn't all float with me. It just looks like another overvalued and under-talented company that got lucky - the blind squirrel found the nut - and there isn't going to be a second act.
Maybe it just seems lowbrow to me: name and finger the guy, blame the open-source tool you use, and never explain or elaborate why you launched anyway when you were fixing bugs in the tool at the 11th hour.
Perhaps, though I think my sour grapes specifically about digg are somewhat dead. What really makes me angry is the joke that is the VC industry, which is really the root cause of this. In April I talked to an associate at Andreessen Horowitz (where Kevin Rose is the mayor on foursquare), who said that digg, after many difficulties, was on the right track. From their perspective, engineers are cogs in the machine and the only thing that matters is the executives. I worked closely with the executives at digg, and they spent the vast majority of their time feathering their nests; the digg v4 rollout is just the logical conclusion of them draining value for five years.
Also I said this was a "rough guess." Comments on TC from people I know who worked there (there's been a mass exodus for the last two years) suggest I'm mostly correct.
Finally, I think "sour grapes" is a poor rejoinder. Especially since it's the standard PR line at digg.
No, I apologize. I didn't mean to sound like I was attacking you or suggest that you supported it. It just seemed like a good place to interject. It's such a transparent move on Digg's part.
Part of the reason that Cassandra may be the focus is that Kevin Rose places much of the blame for the stability problems on Cassandra's shoulders. Two or three minutes into the most recent Diggnation he talks about Cassandra, describes it as "very beta-stage software", and says that days before the launch at least some of their focus was on fixing "Cassandra problems" rather than issues with Digg v4.
If true, it's a lame excuse: blame should be placed on the people who decided to use beta software to power their very-important-to-their-paycheck website.
So if you watch the Revision3 video where they talk about the stability issues, Kevin throws Cassandra out there - nothing really specific, but when discussing downtime it's the only technical factor he mentions.
He calls it "still beta software" and states they were fixing bugs in Cassandra during the days leading up to the release of v4.
In reply to you and epoxy, sounds like a gross failure of engineering management. If you're having problems of this sort of foundational nature a few days before the planned release, it strongly suggests you should delay and figure out what's generally gone wrong with the project.
I actually wondered if they were running into issues with Cassandra. I'm not a NoSQL hater - but it's still pretty bleeding edge, and it always seemed like making it your core DB was super risky.
arg - really want more insight, maybe Quinn will elaborate now that he's gone.
Back then they seemed to have a rather sensible migration strategy (ie, basically running the new Cassandra back-end in parallel with the MySQL backend).
It seems to me that it was the v4 upgrade that broke, not Cassandra alone. It's possible their frustrations with Cassandra were more long term, and the fact the v4 upgrade didn't go well was the last straw.
For example, Digg has done a lot of work on Cassandra internals and tools. If you are using a new, open source product you kind of expect that, but it's possible the expense of that didn't seem like good value once v4 started to get into trouble.
For some reason this reminds me of the "no one has ever been fired for choosing IBM" slogan. I guess maybe there is some truth to it. It seems like new tech needs to start slowly in big places; only startups really have the freedom to risk it all.
That depends on how you define your terms. I'm not sure a company that has been around 6 years and has ~100 staff can be called a startup any more. On the other hand, I'm not sure a company that still survives more on the wishful thinking of investors than on the money it brings in after 6 years can be called a business, either, and certainly I wouldn't call it either big or mature. I'm honestly not sure what I would call that sort of organisation, though if I were an investor, I suspect the word "liability" would feature somewhere.
I have questions about what they're doing with all that staff, especially now that it's become apparent that Reddit is doing similar (if not larger) numbers with a handful of guys (literally, countable on one hand).
Evidently Digg is better at the monetization game than Reddit has been, so more sales staff I can understand... but devs?
I can understand that Digg needs more people than Reddit (even though their traffic numbers are about the same) but TEN TIMES more people? That's just not sensible. I'm sure Digg can make a nice profit without further growth if they just got rid of about half their staff.
I wonder if it's easier to get funding (and later position yourself to get bought out) if you show that you have quite a few people working for you.
Personally I have no idea, nor do I see any reason for that many people working for digg, but my understanding is that Digg (or rather Kevin Rose) has always been about having the perception of being big without actually being that big.
You know what's even better? Having the perception that you are big but without actually hiring 100 fucking people to do it. I mean if that's what you need to impress investors at a cocktail party, you're a pretty sad entrepreneur.
Tell that to investors. I see their position as precarious and they haven't IPO'd or been acquired. They have a lot of staff and have longevity, but I don't think that is enough for them to stop being a startup.
Companies like Etsy insist they are still startups, too - five years, $55 million in funding, tens of millions in revenue, and 130 employees later. I don't buy it, either. They are an established business, but routinely use the 'startup' label as an excuse.
Being a startup isn't about the company, it's about the product. There are startup teams at fortune 500 companies. A startup company is a company with one new product in development.
That is it! So the problem seems to be a complex one. Part of it, I think, is ordinary Java issues: what happens when there is not enough memory and the system starts using swap? What happens when the connection rate is too high while the system is doing heavy IO? What happens during recovery from a network outage or a replication failure, and so on?
The second part is the complexity of the software itself - not the complexity of the algorithms or tasks (it isn't rocket science), but the artificial complexity added by all those CrappyFactoryManager().GetSpecialShitFactory().instantiateANewCrap() chains and so on. It seems like no one can comprehend the whole mess.
On the other hand, this failure will probably bring some improvements, or at least more attention, to the Cassandra project, and everyone who uses it will benefit.
This man's reputation is on the line. I hope they release more details on what exactly it is that is causing the problem. As it stands he appears to be an unfortunate scapegoat.
Yeah, originally it was Slashdot but better - more current and user-powered (which was a new concept in the mid-2000s). What it turned into was a traffic generator for a small group of power users who fed the community the lowest common denominator of content.
They are planning to transition to Cassandra as the primary database. PostgreSQL is not used in a relational way on reddit -- it is used as a makeshift k-v store. You can take a look at this yourself as reddit is fully open-source: http://github.com/reddit .
I could swear an admin said it with some pretty fair definitiveness; I thought I remembered that it was ketralnis, but this was the best I could find from him, and it's not nearly as definite as my memory says.
According to reddit admins, some of the downtime in the last few months was due to Cassandra, and they were having some bad performance and stability issues. I can look it up if anyone is interested, but it will be a bit of work, since they stated it (several times) in the comment sections and perhaps once in a blog post too.
I screwed up our Cassandra deployment, and wrote about how I screwed it up. We were under-provisioned, and the version we were using didn't deal with the case of being overloaded in a graceful way. We're no longer under-provisioned, so I don't know if more recent versions deal with it better.
We've never claimed to have performance issues, I don't know where you're getting that one.
The cassandra issues were primarily ops failures, and secondarily an older version of Cassandra making it difficult to recover once it was overwhelmed. (Some of the resulting improvements in Cassandra are documented here: http://www.riptano.com/blog/whats-new-cassandra-065)
This could very well be; I remember one of the problems they were having was not having enough resources for cache, or something similar. But I could be wrong - I will ask them on Twitter and see if they can comment here on it.
This article is a fine example of the poor logic that pisses me off nearly every day. Just because Digg v4, which heavily uses Cassandra, can't be used as a Cassandra success story does not automatically mean you can assume the opposite - that it is a Cassandra failure story. There is no indication why Digg has been down so often, and there are really no conclusions to draw yet about the technology they use.
The article basically comes right out and says it. You just need to read between the lines a little bit.
From the article:
> Quinn was the main champion of moving over to Cassandra, say our sources. Now the site is taking a huge hit, at least in the short term, because of that decision and/or how it was implemented, and Quinn is paying for it with his job.
It's always a toss-up whether it was implemented correctly or not. The correct course of action, of course, would have been to slowly move the site over to the new technology piece by piece rather than doing a wholesale switchover. The risk is in the migration strategy, not the technology picked. They could have been equally stupid switching over to a new architecture with MySQL.
Reddit went through similar issues a few months back (downtime, slowness, etc.), but they overcame these issues without turfing people. My guess is Digg pushed the engineering VP out to make the investors happy rather than to actually move forward.
by doing this, they are laying the blame on one person so the media can stop hating Digg and start hating the ex-VP of Engineering who "killed Digg".
of course, whether he was actually responsible in some way is something we may never know. for all we know he may have been completely against releasing v4 but was vetoed by Rose et al. or on the other hand he may have overpromised and under-delivered, putting the company in jeopardy, in which case he deserves to be let go.
it's all speculation until we hear an official comment from either side.
The typical investor doesn't take the time to really understand the companies or technical issues, unfortunately. They see 'there was a problem, management fired the guy who was at fault' and are happy.
Unless Digg cannot roll out new features due to Cassandra (e.g., (a) Cassandra woes taking all the dev team's time, or (b) tech limitations preventing features), it seems highly unlikely that Cassandra is the reason Digg v4 is grating on users initially.
It's a change of features/product rather than technology that is the problem.
I think the folks at Digg were prepared for the heartburn associated with the content changes. They weren't ready for some very serious issues with the implementation. Digg has definitely seen a lot of downtime over the last week or two.
Digg has had some significant downtime during this transition (can't blame Cassandra - but can't clear it either). Issues with product features are one thing - but probably nothing got more people jumping to a different ship (reddit) than not being able to find a good lolcat when you're bored and your favorite news aggregator is down. Hmm - I wonder if fark has seen the same uptick as reddit has.
I'm really interested to see what comes of this and what went wrong. It sounds like (from reading his blog) they were making a lot of customizations to Cassandra?
They have made many feature additions to Cassandra.
They haven't said much about the details of why they are having trouble.
It could be a core Cassandra problem, something they added, or completely unrelated to Cassandra; But the internet doesn't care. It's drama at its finest.
"It could be a core Cassandra problem, something they added, or completely unrelated to Cassandra; But the internet doesn't care. It's drama at its finest"
Digg made a big deal about their move to Cassandra (just as Digg's move to Cassandra was used to legitimize Cassandra, and by association NoSQL, among a wide range of zealots), going back over a year.
The thing about talking big like that is that it often comes back to bite you in the ass if things don't go well.
If Digg quietly released a new version that worked more reliably and provided a better experience, they would have been in a perfect position to pontificate on technology.
OK, Digg is having a hard time. I feel like I'm witnessing a strangely bloodthirsty tone in some of the comments, though. I wish Digg the best and I hope they can prove all the skeptics wrong. It's going to be tough, though!
I wonder what issues they are running into that a dark launch would not have found. As I have discovered painfully in my own projects, making big changes without a rollback plan is usually a bad idea, and it sounds like this is no exception.
Digg v4 has added back the features I liked after taking them away - mostly the Upcoming section, which tends to have interesting things that never make it to the front page, and the setting that lets you default to Top News instead of things I have already read ('My News').
For most users, the change in features is what drove them away, not necessarily the spotty QoS (though that was pretty bad for a while).
I have to say I like the new Digg. It's sad this guy is getting his career stomped on over it. A lot of people complain that the new site lets big sites submit their own content automatically, but how was it any different with a middleman user submitting it himself?
Let me guess - the problem is about the difference between theory and practice.
In theory, Java is great and Cassandra is great. In practice, Java under heavy load is a disaster, because it was never designed for it, and Cassandra is just hype and propaganda.
Face reality - it doesn't work in production as it's supposed to, as a primary storage engine.
People at Digg aren't amateur idiots, so I think they did everything as described in the docs, but the damn thing just doesn't work.
What would you prefer to write web applications in? C++?
Most languages used in web programming are divorced from the hardware, and either use a virtual machine (e.g. Java, .NET) or an interpreter (PHP, Python, Perl, Ruby).
Virtually no one uses a low-level language for web programming, and the benchmarks say that the virtual machines are faster than the interpreters. Java is very fast and efficient once it's running; its initial start-up time is often slower than interpreted languages' due to JIT compilation, but servers rarely "start up". If you're really hitting a barrier with Java's performance, just as with other non-native languages, you can write the performance-critical section in C.