Yup, all those things are good. I could recount the counterarguments, but without the Google 'context' they wouldn't make a lot of sense to people on HN. And when you get right down to it, step way back and look hard, the elephant in the room, the monkey in the wrench, the uninvited love child, is that the arguments not making sense outside of Google is a symptom of a more pernicious issue.
When you get to the point where the arguments against something don't make sense outside the context of the internal path you've chosen, you have to ask if maybe you've crossed into an area where you're very much in danger of group think.
I respect what they have been able to achieve, and I certainly respect the vision of data center sized computers. I also experienced first hand the kinds of weird rationalization that accompanies trying to mentally (or culturally) accommodate an externally forced invariant for which the original principles have ceased to be relevant.
The 'Google Scale' problem is one such invariant.
For a large part of its existence there was never enough infrastructure to support things, and if you looked at Google's financial disclosures they were spending a billion dollars a quarter on new infrastructure. An x86 class server costs about $5k, do the math.
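Doing that math as a rough sketch. The big assumption here is mine: that the whole quarterly spend goes to $5k servers, which overstates the count since buildings, power, and networking come out of the same number.

```python
# Back-of-envelope only: assumes the entire quarterly infrastructure spend
# buys $5k servers, ignoring buildings, power, and networking.
quarterly_spend_usd = 1_000_000_000
server_cost_usd = 5_000

servers_per_quarter = quarterly_spend_usd // server_cost_usd
servers_per_year = servers_per_quarter * 4

print(f"{servers_per_quarter:,} servers per quarter")  # 200,000
print(f"{servers_per_year:,} servers per year")        # 800,000
```

Even if the real figure is a fraction of that, it is a lot of machines to absorb every quarter.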
When the Great Recession hit, that expansion stopped, but what was interesting was this: when Google had been growing, an app that was popular would develop a material footprint in the infrastructure, and that imprint would cause additional pain if it wasn't designed to fit in with everything else. However, just by growing so much infrastructure, Google had reached a point where some apps would never have a material impact on the total resources consumed; the infrastructure was just that big.
But the rules didn't change, you had to design to run on what constituted 'Google Scale' as defined in the present not by what it meant when that requirement was put in place.
So the difficulty of implementing the requirement scaled with the size of the infrastructure, but the kinds of products that were being conceived and deployed would never need that level of scale. Stalemate, and what was worse there wasn't anywhere inside the company to even have the conversation about whether or not the requirement still made sense. And that was the root of a lot of problems and a lot of people have reflected that up the chain.
When I was there this blog posting might have been a long missive to the internal list for miscellaneous discussions. It would no doubt become a centi-thread (over 100 responses), and detecting any change that it produced would be difficult at best (and positive change would never be credited to the person who pointed it out, only the people who changed would get any credit). And it's not that it wouldn't cause change: the social network inside worked through these issues with a plodding but deliberate slowness, and sometimes those internal groups within groups might reach out to the original instigator, but more often not. But what the change wouldn't feel like is "startup-y."
I suggested to Alan Eustace once that Google I/O was a really cool way of getting people on board with various API changes and understanding the direction folks were taking, how come we didn't have an internal version?
Lots of potential, lots of challenges. Fortunately lots of folks are dedicated to working through the challenges to make things better, and it's always great to have Larry pounding his fist on the table to add some urgency to an already frenetic environment. Sometimes the table pounding, though, just made the organization look like one of those table top football games where the vibrating game board caused the game pieces to move, somewhat randomly, around the board :-)
This sounds way more like "how to make Google a developer wonderland" than "how to make Google more startup like".
The author started at Google post-IPO. That's hardly "startup" time. In a startup it's not like resources are given out like candy and there's massive amounts of free-time for working on 20% projects.
It's true that there are fewer meetings at startups, but there are other things that typical hackers find unsavory: developers also have to do sales, sysadmin, customer support and generally be much more aware of where the money comes from.
The thing that makes a startup interesting is that the company is the project. Frequent meetings aren't as necessary because the goals are usually easy to understand and fairly one dimensional, e.g. "We have one product and we need to increase the number of people using it."
The way you'd make a big company more startup like would be to make teams far more autonomous and drastically increase the risk / reward gradients. Your product ships 6 months late? Your entire team is fired. You open up and sustain a new revenue channel that's paying big dividends? Everybody gets a $2 million bonus. Team leader disagrees with his boss? He can choose to do it his way, knowing he's risking his team's livelihoods.
I don't know if anybody's really crazy enough to try it in a large company.
Microsoft was using this model in the mid/late 90's. Teams like DirectX, Netshow (became Messenger) and Xbox were given wide latitude, and run more or less like startups, with appropriate rewards and consequences. Not sure to what extent this approach was company-wide, but maybe it should have been, because I think it worked pretty well for them at the time.
Eric Schmidt has mentioned several times that google is very fond (focused on he would say) of the "hockey stick" projects that are created by developers and not by the managers on up, and as users discover them, experience exponential growth due to the value they provide. I assume his strategy is to attempt to create an environment in which such projects can first be created, and then be successfully scaled to the masses if they prove popular. So the "use google infrastructure" mantra will be difficult to let go if you are in that mindset.
They need to either remove at least a major part of the overhead and constraints of these highly scalable systems, or create a well planned and organized transition path from something with low deployment and development overhead to something "google scale". This could be at least at first a completely non-technical thing, such as requiring teams to at least have a plan of how they will migrate their redis, solr or whatever data stores to whatever google is using at scale. But reading this makes me cringe, that sounds like a very frustrating environment to work at and sounds very ungoogle-like to me. It appears their public image diverges much farther from reality than I would have thought.
I'll speculate that this is a problem with all large companies: the price of failure for the team involved is huge (unemployment?), but the benefit of success is small (continued employment and maybe a bonus). If the incentives are so structured, it's no surprise that you get risk-averse behavior.
Very few teams actually get laid off just because the product fails. The only time it's happened in Google's history was Google Radio Automation, where they brought on a bunch of people with very specific domain knowledge of the radio industry and then had nowhere to put them when they decided they weren't going to get involved with radio after all. Most of the time (assuming they don't quit), the team just gets absorbed into other products.
I assume things are similar in other large tech companies.
I'll posit that the real problem with big companies is the opposite: the price of failure is mild (you get reassigned to another team and get to do more cool stuff) and the benefits of success are also mild (you might get a promotion and an iPad or something). When your actions have little effect on the outcome, your actions tend to regress to the mean. In a startup, where failure means you starve and success means you never have to work again, you have a much bigger incentive to go the extra distance.
From an innovation standpoint, it might be interesting to go with the model of high benefits of success, low cost of failure. That relative skew is what has arguably made the US the leader in this field in the macro.
Startup engineers in particular, given the market for them, actually only face a temporary starvation penalty. It's not like they can't get another job after they fail.
Doing this in a large company might lead to even more innovative products because startups are so under-resourced and governed by promises to investors that they risk becoming myopic and less agile.
the reward for successful skunkworks at a large co is typically promotion (serious promotion to very high-paying, visible roles). That can be very rewarding for certain personality types.
I have to agree from my experience at IBM. Those rewarded tend not to be rewarded much by industry standards (a year's salary at best) and the entire team or even the most meritorious contributors are rarely rewarded.
I know of at least one team that came up with something major and asked for their compensation to be tied to the product's success, only to have their idea moved to another team because "they should do it for IBM purely". Of course that product failed for lack of vision and focus.
Although I agree with some of your points, Google's cluster management system is years ahead of everyone else. It just makes so much more sense for that thing to exist given the wide heterogeneous workload at Google.
It is so much easier at Google to design something for scalability (which Google is mostly about) than at other companies, mostly thanks to the policies and infrastructures you criticized.
It is easy to just criticize without considering the implications of alternative policies.
For example, re: Switch to team-based distributed source control
I've worked at a pretty large software company that does this. The problem is lots of teams are working on similar things and results in duplicated effort.
Didn't even see this on HN. Here's the reply I left on your blog.
As someone who worked in the SRE and datacenter/cluster management teams during the same period you were there (2005-2010), I can confidently say that I agree with almost everything you’ve mentioned. If you think engineers on small projects have a hard time dealing with acquiring and managing cluster resources, try being on the team that has to resolve all of those requests. Because many of Google’s core infrastructure pieces are so inflexible and frankly not designed to be used as they are, they end up dying a death of a thousand cuts. Systemic design flaws lead to telling most teams “no” when they asked for even 5 machines worth of resources.
At the end of the day, Google has maybe 5 products that generate 99% of the revenue and operate at huge scale. Should they devote most of their attention and money to these products? Absolutely. Should they do this at the expense of all the small projects? Not if Larry wants the company to act like a startup.
Ultimately the limiting factor to Google’s agility will be its technology infrastructure, not its engineers.
> Ultimately the limiting factor to Google’s agility will be its technology infrastructure, not its engineers.
Isn't that a contradiction? It's the engineers who build the infrastructure. As such I would expect them to be the bottleneck, just like in every other company that has an effectively unlimited hardware budget.
Yishan Wong's (previously Facebook's Director of Engineering) suggestions there are interesting,
e.g.
"
1. Fire a broad swath of people in the executive and management staff
I've talked to quite a few extremely talented and didn't-leave-because-they-were-incompetent Xooglers over the past year at Sunfire (and via other avenues), many of them key early employees. A recurring refrain that I hear is that Google has been taken over by overly-political managers who have laid waste to a formerly meritocratic organization where good ideas get turned into good products, and this has been deadly in two ways: (1) truly good talented people who keep their heads down and get things done are motivated to leave the company and (2) the organization that remains becomes, by necessity, one that revolves around this internal politicking rather than productive endeavor and shipping products.
Identifying these people from above will be hard, because part of being politically skilled includes looking good to the people above you (and the more politically skilled they are, the better they look), so Larry should directly contact a thousand of the best ex-Googlers and ask them to anonymously name 5 people who are still at Google who should be fired, and using a histogram of the results, fire the top 100 names without letting those people "explain" their way out of it. Steve Jobs did something like this when he returned to Apple (except he just walked the halls firing anyone he thought sucked), and this is the data-driven equivalent: a thousand ex-Googlers is enough to even out any personal grudges, and the aggregate information is likely to be highly reliable about who has been climbing without regard for those below them or the good of the organization. "
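For what it's worth, the "histogram" step in that quote is just a vote tally. A minimal sketch, where the function name, the placeholder names, and the rule of one vote per name per respondent are all my assumptions rather than anything in the quoted answer:

```python
from collections import Counter

def top_named(ballots, k):
    """Tally how many respondents named each person and return the k most named.

    Hypothetical helper: each ballot is the list of names one respondent gave.
    A name repeated within one ballot counts once (an assumption).
    """
    tally = Counter()
    for ballot in ballots:
        tally.update(set(ballot))
    return tally.most_common(k)

# Placeholder data, not real names.
ballots = [["A", "B"], ["B", "C"], ["B", "A", "A"]]
print(top_named(ballots, 2))  # [('B', 3), ('A', 2)]
```

Counting each name once per respondent is what keeps a single grudge from dominating, which is the property the quote is relying on.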
LCE & SRE “blockers”. Having support for Launch Coordination & Site Reliability is great, but when these people say “you can’t launch unless…” then you know they’re being a hindrance, and not a help.
Isn't that the point? They are there to maintain a larger point of view than individual developers. Giving every person with their own agenda launch authority is disastrous (in a large organization): Of course my project is important. My project doesn't need review, I wrote it.
Actually, not. LCE's get involved with launches of new products, and have a very long list of requirements & procedures that you have to follow. It's well known that going through this process for a new launch can take weeks. This isn't "agile" for new services that haven't launched before.
Agreed that for google.com search (and AdWords & GMail) there should be a few more procedures in place, but applying those procedures to every new small launch is a huge blocker.
I remember PG saying something like "Every time a big company institutes a procedure to prevent something from ever happening again, they should consider the cost of that procedure to future development." I wish Google would take that to heart. I think that the executives do understand this, but many of the middle managers are too concerned with doing their job well to think about the negative impact that has on other people's jobs.
Actually, what I wish more companies did was have annual "feature killing" and "procedure killing" parties, where you reevaluate everything the company does, and if its cost is greater than its benefit, get rid of it. So "we'll institute a procedure so this never happens again" becomes "we'll try to institute a procedure so this doesn't happen again in the next year, see what its impact is on productivity, and if it saves more than it costs, we keep it."
I think this happens any time you separate the people who think about risk and the people who think about benefit. So long as there's a team with the job of managing risk, that's what they'll do. Balancing those two sides then becomes the job of someone up the stack who doesn't have the time to focus on all of the details.
There is a better chance that people are using Google products (like Mail and Docs) for a tad more critical stuff than Facebook. So the comparison may not be completely correct.
Yeah. I did, so I got it. (I was technically a SWE but I mostly did product stuff. A bit of it went over my head.)
I also worked at Yahoo, so I can compare what you said to "big internet companies" as well. The dedicated hardware thing can be a nightmare. Using open source stuff internally can be a nightmare too.
I tried to only use names that were commonly known infrastructure components at Google, so I don't think this is really an issue.
I've searched for each of the internal projects (i.e. "google blobstore") and if there were no reasonable results, I've removed the references. All the others should be well known, so I've left them.
I agree with nearly everything you had to say except for your bit about interviewing. At a properly functional startup people spend A LOT of time thinking about recruiting. You seem to want to reduce that load at google.
Rather than reducing the load, Google needs to better connect the work of doing good recruiting to individual/team success. The total disconnect between who does the interviewing and what team a hire ends up working on is the big problem, not the total amount of time/effort being devoted to recruiting.
Really good point. Mostly, I was amazed at the amount of time devoted to doing "manager-ish" things like writing up perf feedback, snippets, and also the laborious interview task.
Part of my frustration there was seeing myself doing 2+ interviews per week for most of my career, and my colleagues and coworkers doing 1/month or less. This isn't right. I also saw internal recruiters who had "favorites": googlers who were more lax in their feedback and so more likely to lead to a hire. This is wrong.
Interviewing and hiring is really, really important! But, I think that Google's distributed system actually bogs down everyone instead of doing the fast & easy thing.
How would a 3 person company hire their 4th and 5th people? Google needs to do that, at scale.
Having interviewed at several 3-person companies - the good ones have everyone in the company interview the candidate, then everyone reports back to the CEO and a single "no hire" nixes the candidate. That obviously won't scale to Google's size, nor should it. In an early-stage startup, it's crucial that everyone be able to work well with everyone else. In a 25,000 person company, not so much.
"One way Page tries to keep his finger on Google’s pulse is his insistence on signing off on every new hire—so far he’s vetted well over 30,000. For every candidate, he is given a compressed version of the lengthy packet created by the company’s hiring council, generated by custom software that allows Page to quickly scan the salient data. He gets a set every week and usually returns them with his approvals—or in some cases bounces—in three or four days. “It helps me to know what’s really going on,” he says"
He didn't review mine, because he was out of town for Thanksgiving at the time and I had another offer that was getting antsy. OTOH, it took like 3 VPs to sign off on it in his absence.
> How would a 3 person company hire their 4th and 5th people?
> Google needs to do that, at scale.
Yes, exactly. If you're doing 2+ interviews/week this should directly lead to you getting more high quality coworkers on your team, while the folks who are only doing 1/month end up with withering teams. Properly aligning this incentive is the key to solving the problem.
Even without working for Google, I can fully appreciate most of the issues and proposed solutions. Like in any big company, only the people "deep in the trenches" even see these issues, worry about them, and worry why management is so clueless that they don't even see the issue.
However, the key is to understand that these are simply superficial manifestations of a bigger issue: as the company grows in size, communication requirements grow exponentially, and even a slight mismatch in the talents of people will lead to serious issues. I don't think any of the recommendations will work at the scale of 20,000 engineers that Google has.
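To put a rough number on the communication claim: if you assume (my assumption, not the parent's) that every pair of people is a potential communication channel, the count is n(n-1)/2. That is quadratic rather than strictly exponential, but it still gets ugly fast.

```python
def pairwise_channels(n: int) -> int:
    """Number of distinct pairs among n people: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for headcount in (10, 100, 20_000):
    pairs = pairwise_channels(headcount)
    print(f"{headcount:>6} people -> {pairs:,} possible pairs")
# 10 people -> 45 pairs, 100 -> 4,950, 20,000 -> 199,990,000
```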
Take open source software for instance. For a 10 person start up, it works beautifully. Now try convincing 10 other startups to adopt exactly the same set of software, and you'll have a never ending religious war on hand. But you also cannot let everyone pick their own solution, for then the 10 different groups cannot integrate with each other.
Far too many people keep complaining (I complain where I work as well :), but the solution is not easy. By far the best approach is for people to realize that a company needs to grow only so much, and no more. However, human nature will not permit that.
The reason he puts it as "LOVE" is directly because products aren't allowed to launch unless they can prove they have sufficient capacity. And AFAIK the policy was put in place because of a few high profile failures where a popular service launched and then keeled over from high demand.
I don't necessarily think it's the right trade-off, but it's important to note that it is a trade-off, and other people absolutely love the effects of something that I find rather annoying as an engineer.
With all this talk about Amazon, I'd love to hear if Amazon actually has the same problems, and AWS is just a customer-facing product that you can't actually use internally.
AFAIK, the concept and core tools behind AWS were built for internal use in Amazon; making them available was a serendipitous byproduct. Their strong "you build it, you run it" manifesto would ease some of the problems that OP has pointed out, e.g. correcting other people's code.
I think that having "The Google-way" is certainly valuable to Google (and engineers). Because there are only a few ways to do things you can build something and then hand it off to another team (SRE) for day-to-day operations. It also makes security and machine management much easier to deal with.
I'm not saying you're wrong, just pointing out the flip-side.
Maybe Google should have a separate network with separate virtualized machines and no "Google stuff" for new medium-sized projects. Test whether they get traction there and then port them into "the Google-way" later.
Or maybe "the Google-way" just needs to be fixed up to make it easier. For example you could use AppEngine.
Agreed 100% on making dual infrastructure. Google Search and other high-reliability products need to stay on time-tested infrastructure.
Making a "wild-west" of infrastructure (a-la an in-house EC2) would really be the right way to go. And, as you said, put small & medium sized stuff there, and make porting over a moderate but not impossible task.
With respect to AppEngine, it's often stated as some kind of panacea solution for scalability for small applications. But, as a web application developer, I see AppEngine as "Googleisms on the outside". And by this, I mean that AppEngine is also a walled garden. Can you access a MongoDB instance from AppEngine? What about memcached for caching? Solr for document indexing and search? These are all things that are easy to do in a dedicated machine environment and greatly benefit small projects, but are impossible via anything Google builds.
What was fascinating for me was how much of this is applicable to Microsoft too (in some cases, scarily so). I guess every large tech company has similar dynamics, even when Google has tried really, really hard to be different.
It's interesting that the author wants to use stuff like Hadoop/Cassandra, products that are clones of Google's older gen. systems, and are certainly inferior. (Just download the latest Hadoop/Hbase, kill -9, and it all goes to hell with data loss.)
This changed my view of Google. I thought they kinda aced those points, being smart & pragmatic, and that that was what made them the place where the smartest folks in the industry wanted to work.
I'm unable to access it now, at 7:10 PM Pacific time. Luckily, if I just prepend the URL of this Google-critical blog post with "cache:" in Google Chrome, it pulls up a copy on Google's cache.
The site just hangs for me, so I'm guessing it got linked from somewhere else and is getting overloaded, not being killed for NDA. I'd think it would 404 if it was specifically pulled.
Google is pretty much dead and it doesn't even know it. They have produced nothing but high profile flops for some time now and all indications are that PageRank is no longer reliable. They have stemmed the tide a bit with blacklisting, but I am pretty sure they are approaching search the wrong way. As the web matures they should focus on ferreting out quality sites, not just boiling the ocean and indexing everything there is to index. A "like button" based search is going to eat their lunch; the question is who is going to get there first.
> A "like button" based search is going to eat their lunch
Only until the content farms build bots capable of liking each other. It's an eternal arms race. Google figures out a property that quality content has (links from other sites, URL matching keywords, "like button" clicks), and it's useful for a while until the content farms figure it out and morph their content to match those rules.
There seems to be a steady stream of these from ex-Googlers, which is frankly what I'd expect from a company of that size. What I'd be interested to know is what percentage of SWE's are leaving each year, and of course how do the people that are left feel about the company. The ones that I've heard from have a range of opinions from It's Ok to It's Great, but those are of course anecdotes.
Also, "behaviors that lead to success" are not necessarily the same as "behaviors of the successful". Lots of folks have effectively won the lottery.
I disagree with almost everything except the startup incubator, reducing the endless meetings, and getting started with hardware. Google has these things for a reason, and they have been more reliable than twitter, facebook, etc. and their technology actually works. So what if they restrict their engineers into using their own systems? They put reliability and dependability for their users first, and that's why we have all come to trust google's infrastructure, privacy, etc. way more than facebook. Would you put your corporate email on facebook?
It sounds like what they SHOULD do is streamline and document the system for their engineers better. There should be an internal project started to make their developers HAPPIER and more productive. For example, you have an idea for a new project? Here's the actionable checklist. Need to launch? Please make sure all of these are checked, then you can launch. Treat your developers like you treat your users.
As for capturing people into an incubator before they leave the company? I like that idea. Except of course, one has to wonder how much this will incentivize people to quit google, just to get more autonomy and a better deal :P Not to mention, that once acquired by google, the startups' technologies are just rewritten to live in the Google ecosystem, so this seems like a waste of money... except for possibly the IP licensing costs.
"I think it’s a great idea, and it needs to be made effective. 1 day per week isn’t reasonable (you can’t get enough done in just one day and it’s hard to carry momentum). 1 week per month would be great, but doesn’t do justice to your “main” project. Something needs to budge here, and engineers need to be encouraged to take large amounts of time exploring new ideas and new directions."
What about ~2 months per year? Like a mini-sabbatical.
BTW, can someone explain the first point, "Compiling & fixing other people’s code" ? The "world" here refers to other teams inside Google or external libraries?
Google's internal code management system won't let you check in code that fails pre-checkin tests. This is generally a 'best practice' sort of approach. The challenge is however if you modify a library, and in the process of modifying it you change a side effect, and other code's unit tests fail because they depended on the side effect, you have some choices:
1) You can go fix their code so that they don't have the dependency and then check-in (this can be laborious because they have to approve your changes to their code)
2) You can re-introduce the side effect so that their code continues to work.
3) You can try to get them to change their code (very hard since they probably have a bunch of other things going on that don't depend on you)
4) You can create a new library (or routine in the library) that has the semantics you want and deprecate the older version.
While this was extremely painful for folks who were working lower down in the system it was not as big an issue for folks on the upper levels. And in a perverse way it motivated good interface design.
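Option 4 above can be sketched roughly like this. The function names and the in-place-sort side effect are made up for illustration, and this is plain Python rather than anything Google-specific:

```python
import warnings

def normalize_v2(items):
    """New routine with the semantics we want: returns a new sorted list
    and leaves the caller's list untouched."""
    return sorted(items)

def normalize(items):
    """Old routine, kept so dependents' tests keep passing.

    It sorts the caller's list in place (the side effect other code
    came to rely on) and warns that it is deprecated.
    """
    warnings.warn(
        "normalize() is deprecated; use normalize_v2()",
        DeprecationWarning,
        stacklevel=2,
    )
    items.sort()
    return items

data = [3, 1, 2]
print(normalize_v2(data), data)  # [1, 2, 3] [3, 1, 2]
```

The old entry point keeps its behavior, so nothing downstream breaks at check-in time, and the warning gives dependents a nudge to migrate on their own schedule.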
Thanks for the typo corrections, they've been fixed.
"The world" refers to all the dependent google source code. This is primarily google-written code, but includes some open source packages that are used as dependencies.
Personally, I think Google should break itself up. It needs to treat its search & ads business totally differently than it treats its other "startup-ish" properties. These can/should be launched under different branding, or, like a startup, each could be launched under its own discrete branding.
I'm not sure it really matters if a site is "developed by Googlers". Why do the users care? If Google is playing favoritism with results, then it's probably important to disclose that, but otherwise, just let startups be startups.
I completely agree in principle. I think this is the advice Clayton Christensen would give.
The example that comes to mind (and I may have some of the details wrong) is HP finally breaking through with DeskJet after stagnating in innovation by making the group _completely_ separate. They moved the group to a new location geographically and gave them huge amounts of autonomy to make something outside the bubble / group-think of their LaserJet juggernaut.
I have a personal example of this as well. I worked as a PM for a real estate software firm that owned 70%+ of the market for desktop software for real estate appraisers. 300+ employees with a lot of engineers. They decided to move into the real estate agent segment with a new product, and to avoid the group think and "our company's bread winner and primary focus is on X", they started a satellite office in another state. It was like a startup with occasional oversight from some investors and board. The corporate values were the same, but our autonomy allowed us to truly innovate.
> I'm not sure it really matters if a site is "developed by Googlers". Why do the users care?
brand matters a lot, especially when trusting a web app with private data is concerned.
more than that there's brand loyalty. i realize this is an extreme example, but i don't use dropbox because it's the kind of internet-feature i want under my google account, not somewhere else. i literally don't want any company other than google to succeed in that space.
Sounds interesting, one point of yours that resonates with me is the push to develop everything only on Google Products.
The big paradox that brings up is: If all internally developed projects must be developed Google-scale, on Google products, why are so many of Google's most successful products acquisitions rooted in non-Google products?