Apple’s fix for corrupt binaries (marco.org)
117 points by shawndumas on July 6, 2012 | 37 comments



Good on Apple for fixing this quickly. Even the best team can let a bug like this slip through, and the best solution is a fast response. However, I'm confused by the blog post implying that the solution to unbricking the apps was somehow novel or praiseworthy. This is pretty much textbook for an auto-update system. You just bump the revision to force an application update; there shouldn't be any need to reinstall or muck with the data files. I'd honestly have been shocked if they hadn't been able to handle it just the way they did (which would have been noteworthy, but not in a good way).


However, I'm confused by the blog post implying that the solution to unbricking the apps was somehow novel or praiseworthy.

In the 4 years of the App Store we've never seen a distribution problem of this magnitude (it would be an absolute nightmare for any devs affected), and until now we really had no idea how Apple would respond to such a problem.

We've also never seen Apple unilaterally update a specific group of apps like this. Many App Store devs likely didn't think Apple would or could take such action, and those of us familiar with what it takes for a device to accept and install new bits are now wondering just how this is being done.

Are they manually bumping each affected app's version string? Is there some hidden field that forces a device to reinstall the same version of an app? It's curious.


There's a difference between being curious about the exact mechanics and considering the activity itself novel. Apple manages the updates and signs the app bundles that developers upload. They need to be able to redeploy packages and trigger updates just to deal with normal operational issues (e.g. malformed bundles, versioning mistakes, or reversions). Those are just the operating expectations of any system like this. There's nothing unusual about Apple being able to competently manage a relatively straightforward system that's such a big part of their business.


My theory: there's a version number that devs set on an app, and there's another version (say, an integer starting from zero) that Apple puts on apps. They bumped their version, leaving the app's version intact.
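Something along those lines might look like this (a minimal sketch of the theory, entirely speculative; the field names are invented, not anything Apple documents):

    struct AppRecord {
        let bundleID: String
        let developerVersion: String   // e.g. "1.4.2", set by the developer
        var storeRevision: Int         // hypothetical counter owned by Apple
    }

    // A device re-downloads whenever the store-owned counter moves,
    // even if developerVersion is unchanged.
    func deviceNeedsUpdate(installedRevision: Int, record: AppRecord) -> Bool {
        return record.storeRevision > installedRevision
    }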


I'd say that's a good theory, and if they baked it in from the beginning, it's a great solution to this sort of problem.

The fact that Apple is so strict about monotonically increasing dotted numbers for version strings led many of us to believe that they were being parsed for important things inside the App Store publishing platform.

But maybe they are just part of the UI. Apple is known to have strong opinions and strict adherence requirements about that, too.
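Either way, a monotonic comparison of dotted version strings is cheap to do; this is the kind of check people assumed was happening (a generic sketch, not Apple's actual parser):

    // Compare dot-separated components numerically, left to right.
    func isVersion(_ a: String, newerThan b: String) -> Bool {
        let lhs = a.split(separator: ".").map { Int($0) ?? 0 }
        let rhs = b.split(separator: ".").map { Int($0) ?? 0 }
        for i in 0..<max(lhs.count, rhs.count) {
            let l = i < lhs.count ? lhs[i] : 0
            let r = i < rhs.count ? rhs[i] : 0
            if l != r { return l > r }
        }
        return false   // equal versions are not "newer"
    }

    // isVersion("1.2.10", newerThan: "1.2.9")  -> true
    // isVersion("1.2",    newerThan: "1.2.0")  -> false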


Do they need to update any part of the app on the devices for this? I think it could be something as simple as "if a device checks for updates, check whether it (might have) gotten a faulty binary. If so, lie to it that there is a new version, and send it the version it already has, but use a correctly signed binary this time".
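On the store side that could be roughly this shape (pure speculation; every name here is invented, and how Apple actually decides which devices are affected is unknown):

    import Foundation

    struct UpdateCheck {
        let bundleID: String
        let installedVersion: String
        let mayHaveCorruptCopy: Bool   // e.g. downloaded during the incident window
    }

    struct UpdateResponse {
        let updateAvailable: Bool
        let version: String
        let signedBinaryURL: URL
    }

    func respond(to check: UpdateCheck,
                 latestVersion: String,
                 goodBinary: URL) -> UpdateResponse {
        // If the device might hold a corrupt copy, claim an update exists even
        // though the version is unchanged, and serve a correctly signed build.
        if check.mayHaveCorruptCopy {
            return UpdateResponse(updateAvailable: true,
                                  version: check.installedVersion,
                                  signedBinaryURL: goodBinary)
        }
        return UpdateResponse(updateAvailable: check.installedVersion != latestVersion,
                              version: latestVersion,
                              signedBinaryURL: goodBinary)
    }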


Marco used the word "interesting", not "novel". Nevertheless, it is "novel" in the context of the App Store because, as far as we know, it has never been used before, none of us knows how it works yet, and until now we didn't know it could be done at all.

And it is "interesting". Apple did not bump the user-visible version strings (which are set by the developer), which would have been easy and expected, but cause some minor confusion for devs.

They didn't even bump the semi-invisible fourth field in the version string, which almost no devs use and which would have solved the problem quickly.

They also did not just mark transfers as having failed and requeue the downloads, which would have been fairly unsurprising.

They were able to determine which apps, out of all the updates in the affected period, were corrupted. This shouldn't have been too hard, but it shows that they are optimizing their solution pretty well.

They were apparently unable to determine who got corrupt copies of the affected apps and who did not, or maybe they are just erring on the side of caution here. Since their reupdate mechanism seems to work so well, the extra caution costs nothing.

I don't know how many kinds of catastrophic failures you've recovered from, but at App Store scale, recovery is often hard. They were either prepared for this sort of problem, or figured something out quickly that resolves the issue quite cleanly, taking care of all the details.

Either one is impressive. So, nicely done.


I guess this might seem impressive if you're unfamiliar with automated client software updates. But this really is the norm when you're packaging, signing, distributing, and updating software bundles. You're going to treat the bundle you get as relatively opaque and not rely on its data. First, third-party data isn't reliable enough to trust. Second, it's just easier, because you own the metadata wrapper (for the signatures, etc.) and the update channel.
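Concretely, the client ends up trusting the wrapper (a store-generated signature plus a store-owned revision), never the contents of the developer's bundle. A minimal sketch of that shape, with invented names and a generic signature scheme rather than whatever the App Store actually uses:

    import Foundation
    import CryptoKit

    struct WrappedBundle {
        let payload: Data          // the developer's bundle, treated as opaque
        let signature: Data        // signature the store computed over payload
        let storeRevision: Int     // the update channel keys off this
    }

    func shouldInstall(_ wrapped: WrappedBundle,
                       installedRevision: Int,
                       storeKey: Curve25519.Signing.PublicKey) -> Bool {
        // Verify the store's signature over the opaque payload, then compare
        // the store-owned revision; nothing inside the payload is consulted.
        guard storeKey.isValidSignature(wrapped.signature, for: wrapped.payload) else {
            return false
        }
        return wrapped.storeRevision > installedRevision
    }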


It's a special case of automated client software updates, though.

Most of us are probably more familiar with systems like Firefox, Chrome, or Sparkle. The App Store mechanism is more complicated, and interesting.

The third-party data (the version string) that you don't want to trust is, in this case, very carefully screened by Apple as part of the submission process. It is trustworthy by the time it lands in the App Store, but since Apple owns the whole workflow, they have much better options.


He's praising it just because he didn't think of it, being too busy enjoying every news site parroting his whining yesterday.


What good would it have done for him to think of it? He couldn't fix the problem; only Apple could.


Does anybody have any insight into how a process like this is debugged at Apple? I've heard, for example, that at Amazon there is a process of blame-finding after a major issue/outage like this, whereas at Google I hear things are more post-mortem: let's fix the process that led to the human error involved in the outage, rather than blame the person who made the faulty commit.

Anybody know how things work inside Apple's culture?


"that at Amazon there is a process of blame-finding"

Ex-Amazonian here. It's important to note that Amazon's Cause of Error (COE) process is not about blame. It is about determining what happened, why it happened, and what concrete steps are being taken so that it does not happen again. Individuals are not blamed as part of this process and that's in the official rules. The goal is to iterate and avoid making the same mistakes again.


I've heard from a lot of ex-Amazonians that in practice there's a lot of blaming as part of the process, at least in part because of the compensation/promotion processes. But maybe that's changed recently?

Of course, I've also heard that there is a wide diversity of culture between teams, so maybe that plays into it, too.


Totally happens, even though it is not supposed to. If a dev didn't outright break the rules, then they shouldn't be blamed. The rules in this case would be something like "a peer must review before deploy" - if you bring down the site after breaking that rule, you're most likely going to get fired.

I've never had COEs come up in my or others' performance reviews.

If the boss who owns the COE "gets it" and has internalized the old-school Amazon culture, then there won't be blame. The bosses really take the hits here; they do get personally blamed for these things. If they can't stand between the team and the more senior management, then they are not doing it right.

If you and your boss mutually hate each other (which unfortunately I have seen) then it won't go well.


I'm sure it depends a lot on your team and your manager. Also, some people see a process like that and automatically assume blame is being assigned even if it isn't.


Nobody ever lives to tell the story...


This is Apple culture at its best: No one ever talks about it. All employees are so loyal to this company that you can't even imagine one speaking about such insights.


Loyalty? More like fear. I'm not saying that we should be privy to the inner workings of every company, but a lack of transparency is hardly a culture worthy of praise.


Why shouldn't it be? Just because people don't like it doesn't mean it's a bad thing... You don't have to put all your efforts and insights into blog posts!


Not sharing ways to do things is good? Another example of Apple being the antithesis of open source philosophy.


>All employees are so loyal to this company that you can't even imagine one speaking about such insights.

I think it is more fear of losing their job than loyalty. It is not uncommon to find that any leak will be tracked down and the leaker summarily fired.


Fear of what, getting fired? Is it hard for Internet-scale application engineers to get a job in the Valley these days?

Is it really so hard to imagine that employees at Apple feel loyalty to the company? Employee loyalty is not an uncommon thing.


Also, if you like to speak and write about how things work, I'm sure you have loads of opportunities to do that inside Apple.

It's just unprofessional to babble around about new tech like it's a cookie recipe.


Does anyone know what the actual problem was?


There was an error with the FairPlay DRM signing process that led to a lot of app binaries becoming corrupt; people would go to download updates from the App Store, and after doing so would be completely unable to launch the app. Not even a splash screen or a display of the freeze-dried screenshot from multitasking. Half a second of black, then kicked back out to Springboard.

Marco did a really good job of cataloging it because Instapaper was one of the affected applications: http://www.marco.org/2012/07/04/app-store-corrupt-binaries


Supposedly one server involved had somehow gotten munged and was producing garbage.


How did they fix/change the version string, or did they? I don't see how simply re-releasing a previously released version would cause affected users to update their binaries.

What am I missing?


This snafu, the Galaxy Nexus thing and the fact that the simple update to my highly rated and heavily used iOS app has been "waiting for review" for ten days made me start Android Dev. Sorry Apple. You took too long. Hire more reviewers already.


Can somebody please provide some context for this?



Thanks! That sounds bad, glad I'm not an iOS developer


Come on... Shit happens. And it was the first glitch in the App Store after 5 years of operations! They've sold 30 billion apps (and updated probably well over 200 billion).

:D


The first glitch? You're kidding, right?


Well, many, many things suck about the App Store and the submission process, but they are not that important. This one really ruined a lot of people's holiday: tons and tons of angry 1-star reviews, and if I'm not mistaken, people lost local data. If it had been a little more widespread, it could've been a real fiasco.


Minor issue, but the App Store did not launch when the iPhone did. According to someone on Wikipedia, the App Store isn't quite 4 years old yet. Still an impressive record if it truly is the first issue they've had.


There's a link in the very first sentence of the article...



