LinkedIn adopts protocol buffers and reduces latency up to 60% (infoq.com)
200 points by ijidak on July 19, 2023 | 167 comments



“50 thousand API endpoints exposed by microservices”

Feels like their problem may just be that they went a bit too far on the microservices bandwagon.


Agreed. The headline leads people towards a deceptive conclusion. I've worked with both JSON and Protocol Buffers and there is no way that merely changing from JSON to PB would reduce latency by 60%; not on an average request. Clearly other refactorings were undertaken to achieve that speedup. I've done some raw performance comparison tests between JSON and PB and the speedup with PB was insignificant for most of the payloads I tried.

Does any developer here on HN really believe that JSON parsing (plus schema validation) is what adds the most latency to a request? It just doesn't add up that just switching to PB would deliver that speedup.

It reminds me of another headline I read a few years ago about a company thanking TypeScript for reducing bugs by 25% (I can't remember what the exact number was) after a rewrite... Come on. You rewrote the whole thing from scratch. What actually cut the bugs by 25% was the rewrite itself. People don't usually write something buggier the second time around...

It's a pattern with these big tech consultants. They keep pushing specific tech solutions as silver bullets then they deliver flawed analyses of the results to propagate the myth that they solved the problem in that particular way... When in fact, the problem was solved by pure coincidence due to completely different factors.


Parsing overhead is sensitive to the amount of work the service is doing. If the service is doing relatively little, and there's a lot of junk in your request, you are forced to parse a lot of data which you then throw away. JSON is particularly nasty because there's no way to skip over data you don't care about. To parse a string, you have to look at every byte. This can be parallelized (via SIMD), but there are limits. In contrast, protobuf sprays length-encoded tags all over the place which allows you to quickly handle large sequences of strings and bytes.
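
Roughly that contrast, as a sketch (toy buffers and made-up helper names, nobody's production code): to skip a JSON string you must walk every byte looking for the closing quote, while a protobuf length-delimited field announces its size up front.

    // Skipping a JSON string: scan every byte, honoring escapes.
    function skipJsonString(buf: Uint8Array, start: number): number {
      let i = start + 1;                      // past the opening quote
      while (buf[i] !== 0x22 /* '"' */) {
        if (buf[i] === 0x5c /* '\' */) i++;   // jump over the escaped character
        i++;
      }
      return i + 1;                           // index just past the closing quote
    }

    // Skipping a protobuf length-delimited field: read the varint length, then jump.
    function skipProtoLenField(buf: Uint8Array, start: number): number {
      let len = 0, shift = 0, i = start;
      while (buf[i] & 0x80) { len |= (buf[i++] & 0x7f) << shift; shift += 7; }
      len |= buf[i++] << shift;
      return i + len;                         // constant work after the length, regardless of payload size
    }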

If your cache is hot and has a good hit-rate, the majority of your overhead is likely parsing. If you microbatch 100 requests, you have to parse 100 requests before you can ship them to the database for lookup (or the machine learning inference service). If the service is good at batch-processing, then the parsing becomes the latency-sensitive part.

Note the caveat: the 60% is for large payloads. JSON contains a lot of repetition in the data, so you often see people add compression to JSON unknowingly, because their webserver is doing it behind their back. A fairly small request on the wire inflates to a large request in memory, and takes way more processing time.

That said, the statistician in me would like to have a distribution or interval rather than a number like "60%" because it is likely to vary. It's entirely possible that 60% is on the better end of what they are seeing (it's plausible in my book), but there are likely services where the improvement in latency is more mellow. If you want to reduce latency in a system, you should sample the distribution of processing latency. At least track the maximal latency over the last minute or so, preferably a couple of percentiles as well (95, 99, 99.9, ...).
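
Something like this is cheap to bolt on (a toy sketch with made-up names; real systems use histograms such as HDR so the sample window doesn't grow unbounded):

    class LatencyWindow {
      private samples: number[] = [];
      record(ms: number) { this.samples.push(ms); }
      percentile(p: number): number {
        const sorted = [...this.samples].sort((a, b) => a - b);
        return sorted[Math.max(0, Math.ceil((p / 100) * sorted.length) - 1)];
      }
      max(): number { return Math.max(...this.samples); }
      reset() { this.samples = []; }          // e.g. once a minute
    }

    // report w.percentile(95), w.percentile(99), w.percentile(99.9) and w.max()
    // alongside the mean, per service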


Either way, the conclusion made by the headline is cherry-picked and is misleading for anyone who is concerned about average situations.

From the article:

> The result of Protocol Buffers adoption was an average increase in throughput by 6.25% for responses and 1.77% for requests. The team also observed up to 60% latency reduction for large payloads.

It's very sneaky to describe throughput improvements using average requests/responses (which is what most people are interested in) but then switch to the 'worst case' request/response when describing latency... And doubly sneaky to then use that as the headline of the article.


I agree.

There are also a lot of alarm bells going off when you have reports of averages without reports of medians (quartiles, percentiles) and variance. Or even better: some kind of analysis of the distribution. A lot of data will be closer to a Poisson process or have multi-modality, and the average generally hides that detail.

What can happen is that you typically process requests around 10ms but you have a few outliers at 2500ms. Now the average is going to be somewhere between 10ms and 2500ms. If you have two modes, then the average can often end up in the middle of nowhere, say at 50ms. Yet you have 0 requests taking 50ms. They take either 10 or 2500.
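
To put made-up numbers on it:

    // 984 requests at 10 ms plus 16 outliers at 2500 ms:
    const samples = [...Array(984).fill(10), ...Array(16).fill(2500)];
    const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
    console.log(mean);   // ~49.8 ms, yet not a single request took anywhere near 50 ms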


Tail latency is really important. Usually more than average, because it's what drives timeouts in your system.


> Does any developer here on HN really believe that JSON parsing (plus schema validation) is what adds the most latency to a request? It just doesn't add up that just switching to PB would deliver that speedup.

I've absolutely had times when json serialising/deserialising was the vast majority of the request time.


100%. In fact, I've usually found in my career that serialization/deserialization is the largest latency contributor, except when the API call makes remote database requests or something equally expensive. For the critical path of, say, game state updates, it's best to keep as much as you can in-memory on-box, so ridiculously expensive operations like json deserialization really stand out by comparison. Even for enterprise web crud calls, it's quite important when you're part of a giant ensemble of web calls, some of which may be serially dependent, to worry about things like this.


Other sources say protobuf is 6x as fast as JSON though. I mean, with REST / JSON you get the overhead of converting data to JSON objects, JSON objects to text representation, text to gzip or whichever, then the transfer over TCP / HTTP (establish connection, handshake, etc.), the actual transfer, and then gzip -> text -> json -> data again. Protobuf is a lot more direct and compact, you get a lot of those things for free, and that's even before considering the schema and code generation, which is still far from ideal in the REST/JSON world.
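
For concreteness, the JSON path described above looks roughly like this (a Node-style sketch using the built-in zlib; every step is CPU you pay on both ends):

    import { gzipSync, gunzipSync } from "zlib";

    const payload = { id: 42, name: "Ada", skills: ["Go", "Kotlin"] };

    // sender: object -> JSON text -> gzip -> bytes on the wire
    const wire = gzipSync(Buffer.from(JSON.stringify(payload)));

    // receiver: bytes -> gunzip -> JSON text -> object
    const decoded = JSON.parse(gunzipSync(wire).toString("utf8"));

    // A protobuf message ships length-prefixed binary fields instead,
    // skipping the text step and often the compression step entirely.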

I don't believe that for the purposes of inter-service communication, REST/JSON can compete with protobuf.

Caveat: I only have experience with REST/JSON and a little GraphQL, I haven't had the opportunity to work with protobuf yet. I'm more of a front end developer unfortunately, and I try to talk people out of doing microservices as much as possible.


I don't buy these arguments about 'This cheap computation is 6x faster than this other alternative cheap computation'.

For example, in JavaScript, using standard 'for loops' with index numbers is a LOT faster (over 16 times faster for basic use cases) than using Array.forEach(), yet all the linters and consultants recommend using Array.forEach() instead of the standard for loop... What about latency??? Suddenly nobody cares about latency or performance.

The reason is that these operations which use marginal amounts of resources are pointless to refactor. If a function call which uses 1% of CPU time* (to service a standard request) has its performance improved by 'a whopping' 50%, then the program as a whole will use only 0.5% less CPU time than it did before.

* (where all the other operations use the remaining 99%)


> For example, in JavaScript, using standard 'for loops' with index numbers is a LOT faster (over 16 times faster for basic use cases) than using Array.forEach(), yet all the linters and consultants recommend using Array.forEach() instead of the standard for loop... What about latency???

Do you have proof of this claim? It smells like bs.


Not the person you are asking, but 16 times seems like a lot; however, it's going to depend heavily on how good the VM or JIT is at inlining.

Because `Array.forEach` makes one method call for each iteration, while `for` loops stay on the same frame. If the compiler can't inline that, it's a "major" overhead compared to simply jumping back to the top of the loop.

Googling for it, I found a benchmark showing a 3x performance difference: https://leanylabs.com/blog/js-forEach-map-reduce-vs-for-for_..., but they call `Array.push` in the loop, so it's possible the difference is even bigger in practice.


If you have Node.js installed, you can run it yourself. I did run the experiment; that's how I came up with the 16x number. I just had the two kinds of loops iterating about 10 million times and just assigned a variable inside them (in the same way). I recorded the start and end times before and after the 10 million iterations (in each case) and found that Array.forEach took 16x longer to execute. The difference could be even higher if the operation inside the loop was cheaper, since much of the cost of the loop depends on what logic is executed inside it. I kept it light (variable assignment is a cheap operation) but it's still an overly generous comparison. The real difference in performance of just the two kinds of loops is almost certainly greater than 16x.

Ideally you should compare running the loops without any logic inside them, but I was worried that optimizations in the JavaScript engine would cause it to just skip over the loops if they performed no computations and had no memory side effects. Anyway, this was beside the point I was trying to make. My point is already proven; it makes no sense to optimize cheap operations.
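
For anyone who wants to reproduce it, roughly this kind of harness (results will vary a lot by engine and by what the loop body does):

    const n = 10_000_000;
    const arr = new Array<number>(n).fill(1);
    let sink = 0;

    let t0 = performance.now();
    for (let i = 0; i < arr.length; i++) { sink = arr[i]; }
    console.log("for loop:", performance.now() - t0, "ms");

    t0 = performance.now();
    arr.forEach((v) => { sink = v; });
    console.log("forEach: ", performance.now() - t0, "ms");

    console.log(sink);   // keep the assignment observable so the loops aren't optimized away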


forEach() incurs a function-call overhead unless you optimize it away in a compiler. So the idea a loop would be faster by some multiplier is quite plausible.

To get an order of magnitude in difference, you'd have to construct a special case. I've seen a multiplier of about 3, but not 16.

As an aside: you use forEach() over for loops because most array processing operates on small arrays, where the marginal improvement of using a loop is limited. If you have a large array, you will eventually switch to a for loop if the loop body is relatively small. Likewise, when the request size is small, JSON works fine. But when your requests grows in size, the JSON overhead will eventually become a problem which needs a solution.

The underlying consideration is Amdahl's law.
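
Concretely (made-up numbers, same reasoning as the 1%-of-CPU example above):

    // Amdahl's law: overall speedup = 1 / ((1 - p) + p / s),
    // where p is the fraction of time in the part you sped up and s is its local speedup.
    const speedup = (p: number, s: number) => 1 / ((1 - p) + p / s);

    console.log(speedup(0.01, 2));   // ~1.005 -- a 2x faster loop that is 1% of runtime: ~0.5% overall
    console.log(speedup(0.60, 6));   // 2.0    -- 6x faster parsing that is 60% of a request: a real win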


forEach is faster than indexing for me on Firefox Android. Tried it with this: https://www.measurethat.net/Benchmarks/Show/13207/0/performa...


Devil’s advocate: I would generally expect that a rewrite would increase bugs, at least at first.


I would expect that the target for a rewrite would be better known than during the first write, and that morphing to hit a moving target provides more opportunities for bugs.

OTOH if there was a lot of pressure to get the rewrite done, that would be conducive to producing buggy code. I think management would be a bigger factor than any technical issues.


As per TFA, the 60% was for huge payloads, maybe p99 or something. The mean benefit was much lower, but it's arguable that it's the slow ones that you care about speeding up the most.


> Does any developer here on HN really believe that JSON parsing (plus schema validation) is what adds the most latency to a request? It just doesn't add up that just switching to PB would deliver that speedup.

50k API endpoints means that probably a lot of them are pretty simple. The simpler the API, the higher the percentage of its time that is spent in call overhead, which with JSON is parsing the entire input.


> I've worked with both JSON and Protocol Buffers and there is no way that merely changing from JSON to PB would reduce latency by 60%; not on an average request. Clearly other refactorings were undertaken to achieve that speedup.

The article notes that this is only for "large payloads", likely an edge case, and the average performance improvement is 6.25% for responses and 1.77% for requests. 1.77%!

I get that this is "at scale" but is the additional complexity of all this engineering worth that? How much more difficult is the code to work with now? If it's at all more difficult to reason about, that is going to add to more engineering hours down the road to work with it.

I assume tradeoffs like this were taken into account, and it was deemed that a <7% response improvement was worth it.


The 60% was just one request, not the average; it could have been a huge payload that tended to go between data centers or something.


There's no good reason to use JSON for internal communication long-term. Although I'm not a big fan of Protobufs either (too fragile and inflexible).


So what's your third option, if any?


There are many options, and I'd be uncomfortable suggesting any specific one without knowing the project's details (I'll list a few below). The reason why switching from JSON to Protobuf makes me raise an eyebrow is that it represents a switch from one extreme to another: a mainstream, flexible, text-based protocol to a specialized, rigid, binary protocol. When people make sudden moves like these, it's often misguided. I can almost hear the evangelists bashing how everything LinkedIn did with JSON is wrong, or something.

In terms of formats, you'd get an easier transition and more balance between flexibility and efficiency out of BSON, Avro, Thrift, MessagePack. There are also alternatives to Protobuf like FlatBuffers and Cap'n Proto. There's also CBOR, which is interesting.

There are also other ways of looking at the problem. How does Erlang serialize messages? It doesn't because it messages itself, so the message format is native to itself. And in fact I mostly lean in that direction, but it's not for everyone. Erlang is also dynamically typed, not the kind of language Protobuf and Cap'n Proto are aimed at, I suppose.


> How does Erlang serialize messages? It doesn't because it messages itself, so the message format is native to itself.

I don't get the difference you're drawing... the in-memory and on-the-wire representation of terms are different, so there's still serialization involved (term_to_binary/1). The format is documented and there are libraries for other languages.


They're different, but not that different as the functional message-oriented nature of Erlang means your entire state is typically in transferable terms, and the serialization format directly maps terms back and forth.

Technically Erlang could go much further, but much like multicore support took forever, I guess due to lack of funding, it doesn't. Things like:

1. When transferring between two compatible nodes, or processes on the same node, 'virtually' serialize/deserialize skipping both operations and transferring pointer ownership to the other process instead.

2. When transferring between compatible nodes on different servers, use internal formats closer to mem representation rather than fully serializing to standard ETF/BERT


Speaking as a current engineer at LinkedIn,

Big Tech companies will often conflate micro service with large distributed system.

These services are by no means at all micro.


As a former LinkedIn engineer, let me ask, what happened? LI was using protobufs and BJSON like ten years ago. That was part of the whole conversion from Java serialization to rest APIs. This article implies they tossed everything, went back to text-based JSON, and then said, “Well somabitch! Binary formats are faster than uncompressed text after all!”


There are 19,000 LinkedIn employees world wide according to my cursory Kagi. That means more than two endpoints for every employee. I feel like regardless of micro or macro that’s an enormous sprawl of entropy.


The term “endpoint” is pretty fuzzy. Often it means how many different URL handlers there are (e.g. a service exposing “customers” might have dozens of endpoints for searching them, listing them, creating, updating, and deleting them, etc.).

So number of endpoints can quickly grow to a big number.


Are we verbing kagi now?


I'd whynot this.


In which case I'll decisively nope that.


More seriously though, I wish we used generic terms instead of using brands. "Look up" (or lookup as a noun) or "search" can't be too hard to use. We don't need to turn our thoughts into ads at every corner.

(so, yes, I'll take your nope for my whynotting)

Now, if you are launching a search engine, you need to name it wisely for when people verb it. As a brand you probably want this to happen. Duck Duck Go was not very wise.


Given that’s already happened with Google, that’s a bit late. I’d rather support Kagi than reinforce Google's brand.


Curious to know how many macroservices or monoliths LinkedIn has currently


Our interlinked system of distributed monoliths includes one in AWS region af-south-1, another on the Moon, and a third orbiting Jupiter. All of these services are yours, except for the fourth on Europa. Attempt no POSTing there.


Convert all of my JSON to protobuf to get 60% performance improvement?

I'm sorry Dave, I'm afraid I can't do that.


Ok for Jupiter and Europa, you need some redundancy and to be close to your users, but AWS is a bit unexpected with LinkedIn belonging to MS.

Then again, GitHub still uses AWS IIRC.


I feel that wholly depends on how you’re classifying those.

What makes a service micro vs macro?


There's certainly a spectrum, but you can have huge clusters of even the most monolithic monolith.

The real semantic fun begins when you have a monolithic codebase/deployment that could do everything, but each instance gets assigned a specialized role it serves to its peers (I think I've seen someone describing that approach somewhere). I'd consider that micro if there are more than a handful of "roles", perhaps with some qualifier, but it would hopefully be easy to agree that no description is entirely wrong.


I’ve used this approach before; I wonder if there is a name for it?


I haven't, but if I had and if I was trying to place it in the conference circuit, I'd call it morphoservice.

(and the actual me who has never reached beyond consumer side in conferences sure hopes that conferences would resist unless a less ridiculous name was found)


we would call it different "installations"


It's micro if it is so small that you question why this wasn't done as a library call in the same process.


It’s micro if it pulls <1gig of dependencies </s>


I have seen 5-table microservices. A solo backend developer had created "3 micro services", and when I joined the team, their argument was that we had doubled in strength, so there was a series of architectural meetings to split everything nicely into 5 different services.


Hello world in NodeJS?


* clones hello_world repo

cd hello_world

npm install

* watches text scroll for five hours


How’s Ember.js doing these days at LinkedIn?


Pretty good tbh!


Do you really need that many endpoints even at LinkedIn scale? I’d expect a lot of them are due to engineers reinventing the wheel because of undocumented endpoints.


Never want to deprecate an endpoint, so you end up with /service/v1, /service/v2, /service/v3… /service/v37


Someone whose job it is to oversee development across the company just needs to ensure teams treat internal dependencies like they would external dependencies — allow for time to upgrade upstream services as part of the normal dev cycle, never get more than N versions behind, etc.

If you're on v37 of a service and you're forced to continue to support v1 (and 35 others), there's a problem somewhere.


For one, that there wasn't enough willingness to make backwards-incompatible changes (and clean up after them).

If it's internal APIs, they need to get on top of deprecating and removing older ones. This is one of the key points of Google's SWE book (at least the first part) and the benefits of a monorepo; if you change an API in a backwards incompatible way, you're also responsible for making sure every consumer is updated accordingly. If you don't, either you're left maintaining the now deprecated API, or you're forcing however many teams to stop what they're doing and put time into a change that you decided needed to happen.


> If you're on v37 of a service and you're forced to continue to support v1 (and 35 others), there's a problem somewhere.

I think you misunderstand.

v23 was built on v5, which is built on v1. Re-using the earlier logic was obviously better than duplicating it. v24 is used by an external system that nobody has any control over, so it’s impossible to change. All the other versions… well, no idea if anyone uses them, but everything works now, why invite disaster by removing any?


One of the problems with API versioning is that it’s really a contract for the whole shebang and not just a specific endpoint. You almost certainly want them to move in sync with each other.

So if you have an API with 10 endpoints, and one of them changes 10 times… you now have 100 endpoints.


It doesn't make sense, are you implying that every change changes the behaviour of every single endpoint?


No, more like “For our team’s next micro service epic are we using API v5 for it like we have been so far?”…”No we’ll upgrade to consuming API v10 as part of this epic”.

But maybe the only changes between API v5 and v10 were to 5% of the endpoints. But the other 95% of the endpoints got a new version number too. That way people can refer to “API v10” instead of “Here’s a table with a different version number for all 19,000 endpoints we’re consuming in this update on our micro service”.

It’s an organizational communication thing, not a technical thing. The “API v10” implies a singular contract. Otherwise how do you communicate different version numbers for 19,000 endpoints without major miscommunications? You couldn’t even reasonably double check the spreadsheet sent between teams. Instead it’s “just make sure to use v10”. Communication is clear this way.

Obviously this method has pros and cons, I’ve explained the pros. Also this is why chaos engineering can help by intermittently black-holing old API endpoints to encourage teams to move to new ones and finally remove the old versions entirely so you don’t ever get to 19,000 endpoints, which is the real problem.


Yes, this is what I meant. At least a service should be versioned as a single unit, it’s /api/myservice/v2/endpoint. But if you have 10 endpoints in your service and 10 versions, it’s still 10x10 even if most of them don’t change.

It would be a nightmare to consume something like /api/myservice/endpoint/v2. Needing v2 of the create endpoint but only v5 of the update? That would be ugly to try and work against. And actually there is no guarantee versions are even behavior compatible (although it would be stupid for it to wander too far). There can be cases where response objects don’t convey some info you need in some versions etc.

I was thinking of service as being the unit of "API" here rather than an API consisting of multiple services, "each service provides its own API" is how I was thinking of it. But I can see the usage of saying "this is our [overall] public/internal APIs" too. And I agree /api/v2/myservice would be a bit much if every service moved the global version counter every time a single endpoint was changed lol

(although I suppose you could make an argument for "timestamp" as a "point in time" API version, if you version the API globally. Sounds like it would cause friction as services try to roll out updates, but it's notionally possible at least.)


I was thinking along the same lines. It is easy to make it sound like you have a lot of endpoints, when the vast majority are likely API mappings pointing to the same underlying service.

API endpoints is almost as weird a metric as LoC. It does tell you something, but in a way that can be misleading.


For example if your URL scheme is /api/{version}/{path} then any new version will introduce lots of new endpoints. Most of them will work in the same way, but without checking source code you will never be sure.

Because of that I prefer to version each service instead of versioning whole API, but both of those strategies have pros and cons.


I’m trying to parse this too and am reaching the same conclusion. Basically, with the “100 endpoints” number, they seem to be implying a scenario where, if I have 10 endpoints and change endpoint 0v1 so it's now called endpoint 0v2, I must duplicate endpoints 1v1 through 9v1 to go alongside my endpoint 0v2 so I serve them all together. This doesn’t make sense to me, as there’s no reason to upgrade or duplicate the other nine endpoints just because I updated the first, as far as I can tell.

It really reaches absurdity when the first endpoint is on its tenth iteration (the others haven’t changed) and now you’re serving ten duplicate endpoints per version, or 100 total endpoints where 90 of them are duplicate of themselves.


If you're only incrementing when you change a particular call, then you end up with /api/myservice/create/v2 (or sometimes this is done via header) but v5 for the update call, and understanding what version goes to what becomes a cognitive overhead.

(and really the problem isn't basic CRUD endpoints, it's the ones with complex logic and structure where what's being built isn't necessarily the same thing over time.)

It's one thing when v2 and v5 are the latest, but if someone else comes through later and wants to bolt on a feature to a service that is trying to talk to v2/v5 when v3/v9 are the latest, you have to go back and look up a map of which endpoints are contemporary if you want to add a third call (v2/v5/v2) that is supposed to work together.

This can be done via swagger/etc but you are essentially just rebuilding that service versioning with an opaque "api publish date" built over the top.


Absolutely not.


I remember a talk about the Zalando infrastructure, and they had some 500 Kubernetes clusters; that's not even individual services. Sorry, but no one can convince me that is an appropriate or needed amount to sell clothes online.


why does it matter what you sell? the backend of a big online retailer is extremely complex. i agree that 500 sounds large, but (1) i'm not sure that's real and (2) we don't know if this doesn't cover other related businesses. the part about selling clothes baffles me.


A K8s cluster with 500 nodes sounds more sensible to me.


Maybe they just call every single data object they expose its own API endpoint, even if the vast majority come from the same program.


The final few sentences quietly mention that they’re going to migrate from Rest.li to gRPC, which seems like the bigger story here for anyone that’s adopted the open-sourced Rest.li for their own projects.


Former director at LinkedIn. There might be that many internal URL endpoints (I am skeptical though). There are far fewer services than that though, each service supports many APIs.


50 thousand endpoints, oh my gosh.


Good. Latency has always been a problem at LinkedIn. For me, as a recruiter, I want to send friend requests and job vacancies to a bunch of random users in the blink of an eye. This helps. Thank you LinkedIn.


Don't worry this gap will be filled by another microservice!


Shhh, don’t want this to be a new ai-supported feature as a service


Indeed, with LinkedIn and some browser console features I can accept as many friend requests from as many recruiters as I want in the blink of an eye.


Linkedin didn't reduce client latency by 60%. Instead they reduced latency of some microservice, which used to take ~15ms, by <10ms, which is not bad but the claim is flat out wrong.


Agreed. And "by up to" is a poor metric. What about P90 or even average?


It’s incomplete, not wrong. You inferred more than you should have from the title.


"We switched to $BUZZWORD and reduced $LOOSELY_UNDERSTOOD_METRIC by $STIMULATING_NUMBER !"

Never tell them it's just because of all the technical debt you finally had organizational will to pay off.


I think this is the correct answer.

To fix technical debt you often need a sexy sounding cover story.


Wish they'd adopt a code of conduct but at least they'll be able to mine my phone contacts 60% faster. I always feel held hostage by LinkedIn. I have to write a resume for every posting, and update LinkedIn and connect with every working person I've ever met on there. Otherwise I don't exist.

What a plague of a company.


> they'll be able to mine my phone contacts 60% faster

You give... random applications... access to your contact list?


I realize maybe a minority of HN even remembers this, but at some point ~10 years ago, LinkedIn would aggressively spam your contacts list. If you logged in on your phone they would somehow get your phone contacts. When you signed up they had you log in with your email address, and then would try to get you to spam your entire contacts list there too.

I think web integrations have gotten locked down since then.. not for altruistic reasons, but because Google and Apple don't want to give Microsoft access to the data they have on you, so yeah, it's at least harder to accidentally give LinkedIn access to your contacts list now, but the damage has already been done for a lot of people


Pretty sure on iOS they always had to get permission, and I've never given it. As for email, I think there was a wizard when I signed up and of course I canceled it. Don't sign up in a hurry ;)

And to come back to the current topic, their app felt so bloated that I removed it.


Contacts permission was introduced in iOS 6. Before that, an app could just read and write the contacts data.


Also some apps, like the old Zynga social games, would hold you hostage: you had to approve everything (all at once, not granular like it is today) to use or even install them.

It's been a while since early Android, but I'm pretty sure that's how it worked.


I remember being spammed by Zynga games on behalf of my Facebook contacts. So I never went near them, in the browser or on mobile.


A piece of anecdata: I've managed to live without it and still have no problems finding a new job. Yes, some job application sites have a mandatory "link to LinkedIn" field, but I just ignore those.


Yes, you can live without it. But you'll be missing out on a whole lot of jobs that are "LinkedIn only", in the form of posts, DMs, and people spontaneously discovering your profile and reaching out.

Some people will be fine finding jobs without it, a lot won't be that fortunate.


I'm sure I'm missing a lot of job offers, but I'm replying to a person who is afraid that without it, it's as if they didn't exist. I can assure you it's not like that at all. I have no problem finding new jobs, or new jobs finding me. I left my contact info with recruitment agencies, informing them they can contact me with offers exceeding a certain rate. They regularly send me offers and sometimes I pick one. The system I adopted guarantees my pay only increases with time. Maybe I could be earning even more if I used LI, but I value my privacy a lot.


I feel “LinkedIn only” is a class of job that many people aren’t interested in (biz/fintech maybe?).

I’ve literally never seen a LinkedIn-only job in a field I was interested in, let alone a job I wanted - but I don’t work in biz/fintech.


I've found my most interesting jobs on LI.

Sure, several of them, I could have found elsewhere, e.g. on a dedicated job board. But I wasn't at those other job boards, nor was the recruiter who brought them onto my radar.

If I had avoided LI, I would have missed out on some of my best gigs.


I am not in either of the fields. I get "confidential" offers from headhunters usually on Linkedin or Xing.


Yes, I refuse to engage with LinkedIn out of principle. I probably miss out on some opportunities, but I just tell myself that those jobs would have made me miserable anyway.


> I just tell myself that those jobs would have made me miserable anyway

This is certainly wrong because the usage of LinkedIn is very high. It's equivalent to saying, "I don't take recruiting calls from meat-eaters." I mean, if it's extremely important to you, go for it, but if not, you're doing yourself a disservice.


Well they said it’s out of principle so I read that as simply a rationalization that helps them stick to their principle(s)


How is that LinkedIn's fault? Sounds like you're complaining about other companies making LinkedIn mandatory.


Talked to a few LinkedIn engineers a while ago and was astonished at the size of some payloads they’re pushing over Kafka. In almost all production deployments I’ve done with different teams in the past, once we got rid of the big payloads, stored them on block storage, and used pointers in the messages instead, most pressure issues disappeared. Switching to protobuf sounds like an attempt to decrease the payload sizes without offloading the huge payloads… which almost never works, because hey, if we’re faster now, why not send more messages? That’s how it always goes…


Startup CTOs after reading the title: we gotta build from the ground up with Protobuf or we won't have any customers!


The article itself illustrates this nonsense attitude by mentioning 60% latency drops, but a) what does that translate to in ms, and b) were those savings relevant at all?

A cursory glance over the article leads to some references to P99 latencies dropping from around mid 20ms to around mid 10ms.

Those are irrelevant.

Does the 1% care if their requests take 10ms more to fulfill?

From the surface this sounds like meaningless micro-optimization. I'm glad someone got to promote themselves with this bullshit though.


Latencies add up. 1 end-user request may have lots of requests in the backend which run in serial… especially when they have “50,000 services.”


In other news binary serialization is smaller.


> 50 thousand API endpoints exposed by microservices.

So feature-wise, there's the profile page for each user, there's the search page with filters, there's the inbox, some recruitment tools I'd imagine, job posting, applying to jobs and such, along with search and recommendations for those too, plus analytics for each of the components above, and certainly some internal advertisement management platform.

What else I'm missing? So 50 thousand endpoints for all that?


Typically, the difference between equivalent JSON and Protobuf payloads that have been gzipped is quite small. It can even be that the gzipped Protobuf payload is larger than gzipped JSON one. Granted, there's less data to decompress out of it, which may mean a faster decompression stage, but I'm not sure that's a bottleneck in the first place.

So, do they not compress requests between microservices? Otherwise, how did they see such a reduction?
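
It's easy enough to measure for your own payloads; a sketch assuming the protobufjs npm package and a made-up schema:

    import { gzipSync } from "zlib";
    import * as protobuf from "protobufjs";

    const root = protobuf.parse(`
      syntax = "proto3";
      message Profile { string name = 1; repeated string skills = 2; }
    `).root;
    const Profile = root.lookupType("Profile");

    const data = { name: "Ada Lovelace", skills: ["math", "programming"] };
    const jsonBytes = Buffer.from(JSON.stringify(data));
    const protoBytes = Profile.encode(Profile.create(data)).finish();

    console.log("json gz: ", gzipSync(jsonBytes).length);
    console.log("proto gz:", gzipSync(protoBytes).length);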


I wonder why they didn’t use MS’ protobuf clone, bond: https://github.com/microsoft/bond

When I was at MS, it was definitely preferred over protobuf. IIRC it was also used internally as the wire format for gRPC. I guess LinkedIn is still kind of doing their own thing.


A likely reason is that Bond does not support their target languages, listed in the article as "Java, Kotlin, Scala, ObjC, Swift, JavaScript, Python, Go". Of those, Bond seems to support only Java and Python.


If it's available in Java, then it surely is usable from Kotlin and Scala.

The others I have no idea about though.


Bond over gRPC appears to have been deprecated. https://github.com/microsoft/bond/issues/1131


I think that’s referring to some specific libraries by MS that provided a batteries-included gRPC + bond setup. The standard gRPC libraries still allow you to plug in whatever serialization format you want, including bond.

But based on the complete lack of activity on the issue, gRPC + bond doesn’t seem like a popular choice. :)


How is it different from Google's protobuf?


For internal service-to-service communication, gRPC makes the most sense. The only problem with it is the code it generates. Personally, I am speaking from a Go point of view; I'm not sure how other languages fare, but the Go code is as far from optimal as humanly possible (essentially everything is a pointer and the schema is a gzipped blob). Not to mention the issue of empty map or slice vs nil, which is a big pain point in Go serialization in general.


The Python code is much, much worse.


Why does it make more sense than, for example, using HTTP/3 with an Avro payload and data structures generated from a JSON schema like Swagger?


Avro does not even come with deserialize-to-object in the libs. A terrible choice for interoperability.


What happens if you need to dynamically discover a service and invoke it? Or does it only work if both ends are known and you can push out change to both end of the wire?

Also, wouldn't something like MessagePack, ION, RION, or CBOR give you payload compression over JSON?


https://www.tbray.org/ongoing/When/201x/2019/11/17/Bits-On-t...

One thing about protobufs in a highly interconnected ball of mess: good luck revving the protocol in any non-trivial way. Schema-less encodings such as those you mention (as well as BSON, etc.) are really advantageous for loosely coupled interfaces and graceful message format migrations. IMO (opinion!) protobufs is popular simply because of the Google cargo cult, where anything Google tech is slavishly adopted without a great deal of introspection on applicability to the specific situation.


I always try to avoid these special formats. Most people have not used them and even if there are claims that the tools work on different platforms there are often some quirks. Another approach is to add compression and tweak it for performance.


The target service must be known and compatible. Looser coupling is sacrificed for higher performance.


What happens when you don't need to do that?


Nice to see people realize that unnecessary conversion to and from text is unnecessary.


First instinct is:

Duh! Of course going from a verbose text format like JSON to _any_ kind of binary format will lead to decreased everything.

Second instinct was:

Wait, were they using HTTP/2, HTTP/3? What JSON parser were they using? One of the fast ones, or one of the slow ones?

Third instinct was:

There is not enough information in the article to conclude "60%" of anything. It's not an apples to apples comparison. It's just an attention-seeking headline.

Then again, if it wasn't an attention seeking headline, it probably wouldn't have made it to HN. ;)


Well this is bizarre. They were using a binary JSON and protobufs 9 or 10 years ago, so I don’t know what the hell happened at that company.


At a company that large and that diverse, the biggest gain in using protocol buffers would be having an official schema and versioning for the messages. Serialization is nice (but not 60% improvement nice). But having some structure around messages is the big win.


Just use gRPC.


From rest.li's github page[0] -

At LinkedIn, we are focusing our efforts on advanced automation to enable a seamless, LinkedIn-wide migration from Rest.li to gRPC. gRPC will offer better performance, support for more programming languages, streaming, and a robust open source community. There is no active development at LinkedIn on new features for Rest.li. The repository will also be deprecated soon once we have migrated services to use gRPC. Refer to this blog[1] for more details on why we are moving to gRPC.

[0] - https://github.com/linkedin/rest.li

[1] - https://engineering.linkedin.com/blog/2023/linkedin-integrat...


from the article:

> Based on the learnings from the Protocol Buffers rollout, the team is planning to follow up with migration from Rest.li to gRPC, which also uses Protocol Buffers but additionally supports streaming and has a large community behind it.


Yep, REST makes no sense between microservices, for what are basically RPCs.


How is gRPC related? You send JSON or Protobuf over gRPC


No Swift support.



Thanks. It's not in their list of supported languages: https://grpc.io/docs/languages/


lol. at some point in time I ran protocol buffers over kafka. Fun times.

I liked avro better though.


I’ve seen protobufs as a custom encoding in the database before.

Don’t do this, it’s a terrible terrible idea.


It depends. I've seen it at Google - in bigtable, megastore and spanner - there are quirks - for example you (may) need a registry with all proto(s) the db may contain. In our case (Ads team), any other team using us could've placed their proto in our db, and during the ETL process (bigtable -> capacitor column db) you had to know the protos down to the leaf, as each leaf field gets made into a column (based on allow/disallow lists, as at some point we had over a million columns).

It's crazy, but worked.


Unfortunately, it really only works ergonomically at Google, because you need the global protodb and a proto-aware SQL dialect to make it usable for humans, and you need proto-aware indexing to make the retrieval speed usable for machines if you ever query by the contents of a proto field. Externally, the closest I've been able to get is using JSON-encoded protos as values, but that comes with its own problems.


You are absolutely right, but when it works - it's amazing...


What are some of the benefits of doing it this way?


Most database schemas at Google have exactly two "columns": a primary key column, and an "Info" column which contains a large protobuf. The difference between that and a simple K/V store, though, is that the storage engine is building indices and columnar data stores under the hood that make it perform as if it had been defined with all the fields as flattened columns, but you never actually have to declare a schema in any detail. That cuts down dramatically on tooling and release nonsense - there's nothing like a Rails migration script, because the only update you're likely to do is "sync to latest proto definition". It also means that you have all the semantics of a protocol buffer to define your data structure - oneofs to express optionality, and submessages to group fields together - which simply don't exist in a normal DBMS.


I can speak only from my experience. I was in a team doing ads evaluations, and we had multiple other teams specializing in various bits & pieces of the Ads infrastructure - for example one team would keep track of what amenities were on a hotel/motel, etc. - so in order to fill in all these mundane but intricate details - e.g. "bool has_pool" of sorts - they'd keep adding fields, or introducing new proto messages to fully capture that data. And then each team would own several if not hundreds of these protos, and these protos would be bundled as `oneof` (e.g. "union") or something like this in a bigger encapsulating structure - so in your row database (bigtable), each team would have a column that stores their proto (or empty). So we'd run a batch process to read each such column, and then treat all the fields in the proto stored there as data, and expand all these fields as individual columns over the columnar (read-only) db.

Later an analyst/statistician/linguist/etc. can query that columnar db for the data they are interested in. So that's what I remember (I left several years ago, so things might've changed), but pretty much, instead of the typical row-database approach of having a column for everything, you just have a column for your protobuf (a bit like storing HSON/JSON in postgres), and then have the ETL process mow down through each field in that "column" and create "columns" for each such field.

We had to do some workarounds though, as it was exporting too much, and it was not clear how we could know which fields would be asked about (actually there probably was a way, but it would've taken time to coordinate with some other team, so it was more (I think) about coordinating with internal "customers" to disallow exporting these).

But the cool thing is that if a customer team added a new field to their proto, our process would see it, and it would get expanded. If they deprecate a proto, there could be (not sure if there was, but it could be added) an option to no longer export it. But for this to work you need the "protodb", e.g. to introspect and reflect the actual names in order to generate the columns.


That’s very cool. Thanks so much for the detailed explanation


Why a bad idea


what's a good idea then?


Mind expanding? What happened


The article acts like this was an atomic switch, but certainly it was rolled out service by service over more than a year, in which case the comparison should be taken with a grain of salt.


Why did it take this long? Real question.


Some "rockstar engineers" who write tons of logic in their controllers are shivering right now.


why not flatbuffers, if you're looking to minimize wasted cpu?


Still feels extremely bloated though. Maybe their internal latency?


great


"They also didn’t want to limit the number of supported language stacks"

But that's exactly the problem with protocol buffers. In typical Google fashion, they never bothered to create support for the native language of their platform: Kotlin. And that was just the one example we encountered on a project. If your language isn't supported by protoc, then what?

Also touted was protocol buffers' "future-proofing" or resistance to version changes. We never realized this alleged benefit, and it wasn't clear how it was supposed to be delivered.

I dreaded the opacity of the format, and expected decodes to start failing in mysterious ways at some point... wasting days or weeks of my time. But I have to admit that this rarely if ever happened. In our case we were using C++ on one end and Swift on the other.

I don't get this rationale in the article: "While the size can be optimized using standard compression algorithms like gzip, compression and decompression consumes additional hardware resources." And unpacking protobufs doesn't?



Resistance to version changes in protocol buffers comes from practice and doctrine that is widespread in the user community, not entirely from the format itself. There is really only 1 point to keep in mind.

1) Never change the type or semantic meaning of a field.

It's really that simple.


I was referring to this oft-cited feature:

"Reason #2: Backward Compatibility For Free

Numbered fields in proto definitions obviate the need for version checks which is one of the explicitly stated motivations for the design and implementation of Protocol Buffers... With numbered fields, you never have to change the behavior of code going forward to maintain backward compatibility with older versions."


Pretty much, but a bit more here - https://protobuf.dev/programming-guides/dos-donts/


2) make all fields optional


How is the format opaque other than being a binary protocol? It has great documentation on encoding/decoding and a tool for inspecting the binary messages: https://protobuf.dev/programming-guides/encoding/


It isn't.


Could you use the Java implementation of protocol buffers from Kotlin?


You absolutely can. You can also use Square’s excellent Wire library.


You can but you’re going to have ? EVERYWHERE, which does drive the search for a native implementation that doesn’t make you deal with null everywhere


Is there another way you're supposed to represent optionals in Kotlin?


No, but it makes the code a mess because you’re handling optionals for every protobuf field deref. So you’ve got error handling all over the place. It looks like go’s error handling.


They really are optional though. This is inherent in the problem if you want to support arbitrary past and future versions. You need to decide what to do when the data is missing. Open file formats are a difficult problem. [1]

I might define a different class for a validated object, so version skew only needs to be dealt with at the system edge.

Maybe you don’t actually have that problem because you can make a closed-world assumption? For example, you control all writers and aren’t saving any old-format data, like in a database that has a schema. In that case you can make fields required.

[1] https://acko.net/blog/on-variance-and-extensibility/


Kotlin is not a "legal" backend language at Google.


I had seen a couple of talks on Kotlin where there were representatives from Google saying not only was it supported, but was the recommended language over Java for newer services that aren’t written in cpp/python/go.

Ref: https://youtu.be/hXfGybzWaiA


Your information is outdated. Kotlin is now a supported backend language at Google.



