I think rewriting from scratch is the core of their problem and not really Cassandra. Gradually going over to Cassandra would have been a much better idea.
I've heard this time and time again, but what if your application is a genuine ball of mud? Would you really not advocate a rewrite for an unmaintainable spaghetti classic ASP app that's still in production today?
Microsoft could be one example of a successful one. Their flagship Windows codebase that ran the 3.1->95->98->ME line was basically ditched with the rewrite-from-scratch NT (famously done by an ex-VMS team), which later had some APIs ported to it so that Windows 2000 and especially XP were close to drop-in replacements for the old line, while not really sharing much code. In retrospect I think that was probably a good idea: the NT rewrite put the codebase on much better footing than the aging, incrementally updated classic Windows line.
Solaris is another example of a rewrite that seems to have worked, though the rewrite derived from a different set of existing code rather than being a total from-scratch job. But the classic SunOS 1.x-4.x codebase was ditched, and SunOS 5.x / "Solaris 2" replaced it.
You covered Windows and Solaris - don't forget Apple's switch to OS X. They would have done a wholesale switch at some point even if they hadn't gone with something Unix-based, because the other alternative was Copland, the internal project to do a complete rewrite.
It's a good point that NT was a successful rewrite. However, it's worth noting how this was done. NT was originally aimed at a different market, there was an overlap of several years where the old system was still available, AND NT ran Windows 3.1 apps in their own subsystem, which contained ... the codebase of Windows 3.1.
I think the NT and OS X updates were absolutely required for architectural reasons that impact security: 98/ME and Mac OS 9 simply weren't suited to the types of threats on the internet (no support for dropping process privileges, no file permissions, no way to cut off direct access to hardware, etc.). If you think XP is bad in 2010, think what ME would be like if it were still widely installed: one program crashing the whole machine, boot sector viruses, etc.
I would advocate an incremental rewrite of parts of the unmaintainable spaghetti classic ASP until there's none of the spaghetti left. It's easier to rewrite part of a system than an entire system.
Release to production dozens, if not hundreds, of times. Releases are non-events, rollbacks are non-events.
A system-wide ground-up rewrite with a big-bang switchover at the end is a classic clusterfuck recipe. It's a shame that so many people think it's a good idea, even in 2010.
> incremental rewrite of parts of the unmaintainable spaghetti classic
Sounds good in theory. In practice? Part of the problem with many big ball of mud systems is that all the parts depend on and talk to all the other parts. Want to fix that horrid DB schema? You'll have to rewrite all the code that talks to it, or rewrite it to talk to an intermediary. Want to rewrite that horrid bit of code called foobar_20040623 ("foobar" has been changed, but yes, I saw this in some PHP code...)? You'll have to find out everything it interacts with, and likely redo that too.
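For concreteness, the intermediary usually starts out as nothing more than a thin facade between callers and the old schema. A rough sketch in Python, with every name invented for illustration:

    # Rough sketch of an "intermediary": callers go through this facade
    # instead of touching the legacy tables directly. All names are made up.
    class StoryRepository:
        def __init__(self, legacy_db, new_db, migrated_ids):
            self.legacy_db = legacy_db        # the horrid old schema
            self.new_db = new_db              # the cleaned-up schema, filled in piecemeal
            self.migrated_ids = migrated_ids  # stories that already live in the new schema

        def get_story(self, story_id):
            if story_id in self.migrated_ids:
                return self.new_db.fetch_story(story_id)
            return self._from_legacy(self.legacy_db.fetch_story_row(story_id))

        def _from_legacy(self, row):
            # translate the legacy row into the shape the rest of the code expects,
            # so new callers never see the old schema again
            return {"id": row["story_id"], "title": row["title_txt"]}

The catch, of course, is exactly the one described above: every caller still has to be found and moved onto the facade.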
In practice? It's not easy, but nothing is. I see three options for dealing with the big ball of mud problem.
1. Wallow in it (work with existing structure). Sadly, this is what a lot of people do. I left a job once because that was the only way out of the ball of mud. I was afraid I would turn into a mud-person.
2. Slowly crawl out of it (incremental rewrite). This is hard, but do-able. It involves setting up barriers to mitigate ripple effects, automated testing to be comfortable with frequent releases, and tolerance for temporary imperfection (basically, you need to be willing to frequently release things that are only a tiny bit better than status quo, even if it's not ideal). Not everyone is willing to accept this persistent imperfection and lack of conceptual consistency, especially when option #3 is more exciting and fun.
3. Try to leap out of it and land wherever (total rewrite). This is very, very hard, and prone to total failure or spending valuable money and time to effectively stand still in the market.
I've seen development teams leap directly out of one ball of mud into a different one. One where nobody even knew their way around anymore. How is that anything but a huge waste of time and resources?
Fascinating topic - lots of startup talk is fun because you get to do things from scratch, but this is very pertinent to the Real World.
In terms of "standing still in the market", I think the incremental rewrite contains a lot of that too, no? It's just spread over more time. You're still rewriting it, and during that time, you're not adding new features. An example might be retrofitting some testing code to a system that's never had test code. That could potentially be a fair amount of work, and given a constant pool of resources, it will take time away from "new stuff". Just that it's not so much of a quantum leap - you can still drop your new testing code and go implement some must-have feature if you need to, without saying you have to wait for the whole thing to be ready.
Sadly though, my experience in this is that the reason there's a ball of mud in the first place is a political/social one, so that any "dead time" is frowned upon.
In reality, yes, it works; I've done it. We took a horrible, accreted web application and rewrote it in stages over a period of about 12 months. At the same time we were making regular releases, and needless to say the site stayed up the whole time.
You just have to plan things carefully, work hard, and keep your head screwed on. (Just like with many things in life ...)
I agree, although it's a somewhat sensitive thing to write about if you're doing it in practice. I sure as hell don't want to be known as the guy who comes in and calls all of the existing code a pile of shit.
I guess I could "change the names to protect the innocent" and tell some stories about digging out of tight places incrementally. If it would convince even one development team that they didn't absolutely have to do a total rewrite, it would be worth it.
Another issue is whether you are just rewriting the code, or fundamentally changing your data store (as Digg have done).
Rewriting the code is fine - if it goes wrong, just back up to the old code, users won't notice the difference. You can do it on a page-by-page basis, just route URLs selectively to the new install.
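As a sketch of that page-by-page routing (the paths and hostnames here are hypothetical, not Digg's actual setup), the front proxy or app just needs a per-URL decision about which install serves the request:

    # Minimal sketch: send only the rewritten pages to the new install,
    # everything else stays on the old code. Prefixes and hosts are invented.
    REWRITTEN_PREFIXES = ("/profile", "/comments")

    def pick_backend(path):
        if path.startswith(REWRITTEN_PREFIXES):
            return "http://new-app.internal"   # pages already rewritten
        return "http://old-app.internal"       # untouched legacy pages

    assert pick_backend("/profile/alice") == "http://new-app.internal"
    assert pick_backend("/story/12345") == "http://old-app.internal"

Rolling back a bad page is then just removing its prefix from the list.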
Migrating data is something that should be done under only the most extreme circumstances - and something will inevitably go horribly wrong, so be prepared to rollback.
Because migration is a scenario that may go horribly wrong, you should be prepared to tackle it in small pieces. So often (especially with relational databases; I don't have much experience with the NoSQL realm) everything is in one logical and physical data store, even for unrelated features. That makes any migration harder than it needs to be.
I'd say it depends on whether your engineering talent today is vastly superior to what it was when the code was written. Even then it's so risky. It's probably always better to do it incrementally, even if it takes twice as long, because you can maintain working software and fix bugs as you go.
The approach I would take is to get the minimum set of engineers who know the most about each major aspect of the code, and put their heads together on what the ideal architecture would be. But rather than building it from scratch, figure out how to implement just one of those pieces now. That way you can decrease entropy in the codebase piecemeal without chucking out all the code at once, which is no doubt full of forgotten assumptions that no one will remember until it's too late.
The other problem is that they have been hyping V4 for a long time by making beta testing invite-only and hard to get into. It would have been better if they could have had a public V4 beta running in parallel with V3, since it is a rewrite.
You can't do a parallel beta test of two functionally different sites. You would end up with two separate data stores, workflows, and ultimately a confused user base.
Never mind the fact that the user base is confused by the rollout.
And never mind the fact that Google has forever damaged the term "Beta" in the minds of general Web consumers.
> And never mind the fact that Google has forever damaged the term "Beta" in the minds of general Web consumers.
The general web consumer still thinks betas are a type of fish. Not really relevant to the median Digg user, though, since they are not the general Web user.
Considering the completely different backends of v3 and v4, I think this would have been incredibly hard to implement - at least if changes in one system were supposed to end up in the other.
In this case you would not just have to write scripts to do a one-time migration of all the needed data from v3's MySQL to v4's Cassandra.
You would have to build a mechanism that not only moves data in both directions, but also works at near-realtime speed.
If you then need consistency of the transferred data, this quickly gets impossible. Try finding a way to ensure consistency between these two completely different architectures.
In the end, most of v3 would have needed to be rewritten for a parallel use to be possible, at which point you don't really gain much.
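To make that concrete, even a toy version of such a mechanism (invented names, and ignoring the genuinely hard parts: conflicts, ordering, retries, and not echoing a change back to where it came from) already looks something like this:

    import queue, time

    changes = queue.Queue()   # every write on either side drops a change event here

    def record_change(source, key, value):
        changes.put({"source": source, "key": key, "value": value, "ts": time.time()})

    def sync_worker(write_to_mysql, write_to_cassandra):
        # Forward each change to the *other* store at near-realtime speed.
        # write_to_mysql / write_to_cassandra are placeholders for real writers.
        while True:
            change = changes.get()
            if change["source"] == "mysql":
                write_to_cassandra(change["key"], change["value"])
            else:
                write_to_mysql(change["key"], change["value"])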
"not hard to implement" - sigh
Disclaimer: I don't work at Digg and I don't know any more about their backend than the rest of the public. I did, however, just get around to doing something like what you describe, and there it was "just" a different schema on the same database backend - and even that would have been hell.
In a way, the one-time migration might be harder than near-realtime bidirectional synchronization. With the sync in place, you could move portions of users to the new system and back as needed. A sudden leap from one backend to another is like jumping the Grand Canyon on a motorcycle. Personally, I would rather build a bridge.
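A sketch of the "portions of users" part, with everything invented for illustration: a per-user flag decides which backend serves them, and moving a cohort back is just flipping the flag after re-syncing their data.

    # Hypothetical cohort routing: which backend serves a given user.
    ON_NEW_BACKEND = set()   # user ids currently served by the new system

    def backend_for(user_id):
        return "v4-cassandra" if user_id in ON_NEW_BACKEND else "v3-mysql"

    def move_cohort(user_ids, to_new=True):
        # In a real system you would copy or re-sync the cohort's data first;
        # here we only flip the routing flag.
        for uid in user_ids:
            if to_new:
                ON_NEW_BACKEND.add(uid)
            else:
                ON_NEW_BACKEND.discard(uid)   # rollback is flipping back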
For the concurrent configuration to work consistently, Digg would probably have had to port v3 to the new backend.
Now, in the current case, most of the issues apparently come from the non-working backend rather than from the changed feature set.
So while they could have run the two versions in parallel, they would not have gained anything. Likely, this was their rationale behind not doing so in the first place.
No, for the current configuration to work, you would have to create a way to move data in both directions between the "old" backend and the "new" backend. Ideally, you would slice up the beast so you didn't have to do this with everything all at once.
It's not a system that would be ideal for something transactional like a bank, but it may have been possible for an organization like Digg.
No, what they did wrong was not building a proper two-way conversion tool for their database backups.
Starting from scratch is no problem if you have your core data covered; then you can always revert.
Don't ever develop yourself into a one-way street!
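Concretely, "core data covered" can be as simple as keeping the canonical records in a backend-neutral dump that either system can re-import - a minimal sketch, with field handling and helpers invented:

    import json

    def export_core(records, path):
        # one JSON object per line: users, stories, votes, whatever is canonical
        with open(path, "w") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")

    def import_core(path, write_record):
        # write_record is whatever the target backend (old or new) needs
        with open(path) as f:
            for line in f:
                write_record(json.loads(line))

As long as that dump round-trips, switching backends stops being a one-way street.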
This is my all time favorite Spolsky blog entry. This lesson has saved me a few times. The broader lesson is that new and shiny in software isn't better and is often worse.
Yep, big bang rewrites are always scary. Doing something more incremental would have been safer.
One way to "safely" do a big-bang change would have been to use both architectures in parallel and sync back and forth. That way, if the new architecture fails, you still have the old one.
Of course, that is a lot of effort, and it restricts new features (which may be a good thing anyway).