Running Istio In Production (hellofresh.com)
70 points by cr_huber on March 5, 2020 | 44 comments



Lots of shallow dismissals in the comments on this, and it seems to be a trend every time microservices come up.

"N Microservices for $BUSINESS_DOMAIN ? Crazy!"

Can't we just take it on good faith that the engineers who write these blog posts have at least a modicum of competence and therefore their solution has _some_ merit (that is worth discussing) rather than just "Micro-services bad" dismissals?

<sideshow bob>Yes I realise this can also be taken as a shallow dismissal of the shallow dismissals</sideshow bob>


>> Can't we just take it on good faith that the engineers who write these blog posts have at least a modicum of competence and therefore their solution has some merit

On a technical resource, no. In a technical article, technical points need to be explained. Decisions need to be checked.

Unless your argument is "it cannot be bad because it must have been checked by many people", we are entirely in a position to provide any sort of constructive feedback.


If you want to give constructive feedback then have at it.

My comments were aimed at the shallow dismissals at the level to which I gave an example (and examples of which can be found in the comments on this post).

If a response to these articles is along the lines of "OMG they are using microservices, BAD!" then that is not constructive or useful, unless you are trying to invoke Cunningham's Law (or whatever the one was about getting help on a Linux mailing list by saying something cannot be done).


No, it’s a blog post about running a cutting-edge technology in production. Whether or not fundamental decisions were made correctly is very much in play.


The blog post is about "Running Istio in production", not "Why we chose to run Istio in production", so I don't think "they shouldn't be running Istio in production" is in scope.

_However_, if you want to have a constructive discussion about whether Istio is even the right choice here, then have at it if you think there is enough info to go into that. (Personally I don't think there is enough info in the blog post to have that discussion... since that is not the point of the blog.)

Again, my comment was mainly aimed at the shallow 1-sentence dismissal comments that crop up on these posts.


Of course it’s in scope; it’s an unproven technology that’s still quite immature. It’s the same as the “Hadoop tips” posts back in the day, where people would rightfully point out that Hadoop was a bad choice for the sample problem.


>> At HelloFresh we run hundreds of microservices that do everything from supply chain management and handling payments to saving customer preferences. Running microservices at scale is not without its own challenges and many companies are beginning to experience the pain of complexity.

Let's solve this problem by introducing another level of indirection and not solving the root cause(?).

At this point I really believe that software architects who don't code don't belong in this industry. If the implementers and operators are suffering, there should be a feedback channel.


Indeed, it is another level of indirection, and I’m all for it (having introduced a custom service mesh based on Envoy at the company I work for, which has ~100 microservices, though it took more like 2-3 months of me working solo).

It’s a great way to get a uniform metrics and troubleshooting experience.

Most important though, especially if you have statically compiled binaries as microservices, a service mesh lets you roll out improved routing logic easily because it’s all abstracted away, not encoded in client libraries in X number of languages. Same for tracing. Wanna add a field to all traces being generated? Go recompile and redeploy 100 services... or change one service mesh config.
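
(For illustration, in Istio terms, since that's what the article covers: adding a field to every trace can indeed be a one-resource change on newer releases that ship the Telemetry API. A minimal sketch; the tag name and value below are made up, not anything from the article.)

    # Hypothetical sketch: add one custom tag to every trace span mesh-wide.
    # Assumes an Istio release that ships the Telemetry API (telemetry.istio.io).
    apiVersion: telemetry.istio.io/v1alpha1
    kind: Telemetry
    metadata:
      name: mesh-default
      namespace: istio-system   # root namespace => applies mesh-wide
    spec:
      tracing:
        - customTags:
            deployment_region:       # illustrative tag name
              literal:
                value: "eu-west-1"   # illustrative value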

Other than that, Envoy can usually withstand much more traffic than the service it overlays, so you can use it to provide DoS protection in depth by rate-limiting at the service proxies everywhere. Saved us a couple of outage escalations already.


>> Other than that, Envoy can usually withstand much more traffic than the service it overlays

Service-to-service peer-to-peer communication creates problems of its own, and I don't even know what the benefits are. It does not guarantee you anything per se. All the redundancy and bandwidth improvements need to be... coded... like with any other approach.


I think an underlying feature of replacing DevOps with SRE is that SREs start out as software engineers and move into a more operations-focused role, which removes the whole "don't code" issue.


Having struggled with setting up kubernetes over the last month on my own, I’ve come to realize the absolute value of simplicity.

In the end, Helm just introduced more problems than it solved. Rather than applying configs haphazardly and relying on 3rd-party services, it was ultimately much simpler to just download the configs for whatever service was needed (the nginx ingress controller for me) and commit them to source control.

My biggest takeaway from the k8s community is that lots of people write terrible documentation, and other people write blog posts and SO answers without actually understanding how Kubernetes works under the hood.

There is still an ocean of depth to k8s that I don’t know yet, but I feel a lot stronger on the intermediate basics. I’m at a point where I’m being productive again.


This is actually the advice I’m giving every new systems engineer that joins our organization.

Learning how Kubernetes works is much easier if you first get a firm grasp of the basics and then start bolting stuff on, like Istio, Knative and all the other cool stickers “modern architects” wet-dream about.


Here's a worrying trend: software architects seem to talk in brand names instead of concepts these days.


Yes, absolutely. It is like an infection or something. Once you catch it, you can only communicate in the logos of hip tech brands.

Edit: punctuation


The people writing posts and the people running production workloads seem to form a non-overlapping Venn diagram when it comes to a huge portion of Kubernetes.


They're too busy! ;)


Istio solves lots of problems (pod-to-pod encryption, telemetry, etc).

It also enables lots of functionality when it comes to CD. It enables things like canary releases, testing, rollbacks, etc. in a simpler way by keeping it in the Kubernetes space (and not relying on slow external LBs).
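
A rough sketch of the kind of weighted canary routing this refers to; the service name, version labels and 90/10 split are illustrative, not from the article:

    # Hypothetical 90/10 canary split for a service named "payments".
    apiVersion: networking.istio.io/v1alpha3
    kind: DestinationRule
    metadata:
      name: payments
    spec:
      host: payments
      subsets:
        - name: stable
          labels:
            version: v1
        - name: canary
          labels:
            version: v2
    ---
    apiVersion: networking.istio.io/v1alpha3
    kind: VirtualService
    metadata:
      name: payments
    spec:
      hosts:
        - payments
      http:
        - route:
            - destination:
                host: payments
                subset: stable
              weight: 90
            - destination:
                host: payments
                subset: canary
              weight: 10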

You probably won't know why you need Istio till you need it.


I would like to read more about this, but the Part I that was linked is very, very thin on information. It's essentially just an introduction about the cloud tech in use, not anything substantial about the actual topic "running Istio in production".

Looking forward to upcoming parts!


Delivering pre-packaged meals to people requires _hundreds of microservices_. That is truly astounding.


It's not entirely unreasonable on its face; physical fulfillment is a complicated business domain. You might have services dealing with suppliers, customers, delivery vendors, regulatory compliance, payments, sales taxes, so on, so forth. You might have ML services to predict product availability or customer demand. Those subdomains might themselves be decomposed into API services, vendor gateways, background workers, etc. This doesn't factor in infra-related things like data stores, caches, etc.

Even if you only sipped the microservices kool-aid, I can easily see dozens of services.

Granted, it seems reasonable that a handful of monoliths could get the job done. Without having worked at HelloFresh, I'm inclined to think there's more to the story that we don't know. Maybe there's a good reason to have as many services as they do.


> Maybe there's a good reason to have as many services as they do.

Or maybe we're in a tech bubble and if you want engineers, you have to acquiesce to their demands of working with the latest shiny tools while they create mountains of technical debt, because if you don't, they'll just go to another startup that allows that behavior, or they'll go twiddle their thumbs at a FAANG while banking $300k+ total comp for their 5-years of experience.

I'm not trying to say one could replicate this business with a few scripts, but it sure as hell doesn't need a service mesh that looks like a mutated SARS-CoV3.


I would wager that it's because ex-Uber folks work there and carried their ways with them.


That's rather pointed, but also matches just about every system-design interview I've ever given to an Uber engineer.

Next time, I want to ask them how they keep track of all those services!

edit: replaced an incorrectly-used idiom


Tip - use a database.

Computers are awesome at automating things, and that goes for dev tooling as well.

If you’ve touched the ITSM space, you’re used to managing and maintaining many thousands of assets. A few hundred microservices is nothing, really.

My team uses what you could call a simplified CMDB (configuration management database), which is cross-referenced against service discovery.

The CMDB keeps info about every service, such as persistent data sources, VMs, etc., but most importantly relationships: domain, team, services and resources.

A microservice is basically a "CI" (configuration item) with a managed lifecycle.
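
For a sense of what such a record might look like, a purely hypothetical CI entry; the field names and values are illustrative, not the parent poster's actual schema:

    # Hypothetical shape of a single configuration item (CI) in a simplified CMDB.
    # The point is the relationships, not the exact fields.
    service: order-api
    domain: fulfillment
    team: checkout-squad
    lifecycle: production          # e.g. planned / production / deprecated
    resources:
      - type: postgres
        name: orders-db
      - type: queue
        name: order-events
    depends_on:
      - payments-api
      - customer-preferences
    discovery:
      dns: order-api.internal.example.com   # cross-referenced against service discovery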


Keeping track is one thing. Actually _running_ the Lernaean Hydra of an application in production is a whole other story. The amount of "housekeeping" you have to do to keep the thing afloat is astounding: cascading failures, distributed tracing, logging and diagnostics, metrics. Even the operational side of things requires a lot of attention. Presumably, each microservice would require at least a minimal level of admin-level tooling around it.


Logging and monitoring are part of the lifecycle. Use strict automated conventions to aid developer teams. "Always opt for convention before configuration" is our tooling motto! :)

Log shipping is something we already do from thousands of servers (you should too, at least!), so adding a shipper for a few hundred containers on a set of hosts is no big deal.

Fluent(d/bit) -> some kind of Elastic? There are a few reasonable patterns available that work and scale pretty well.
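
As a rough illustration of the one-shipper-per-host pattern, a minimal, hypothetical DaemonSet sketch; the image tag and namespace are assumptions, and the actual Fluent Bit pipeline config (inputs, parsers, Elasticsearch output) is omitted:

    # Hypothetical log-shipper DaemonSet: one Fluent Bit pod per node,
    # tailing /var/log and forwarding to wherever the pipeline config points.
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluent-bit
      namespace: logging
    spec:
      selector:
        matchLabels:
          app: fluent-bit
      template:
        metadata:
          labels:
            app: fluent-bit
        spec:
          containers:
            - name: fluent-bit
              image: fluent/fluent-bit:1.3   # illustrative tag
              volumeMounts:
                - name: varlog
                  mountPath: /var/log
                  readOnly: true
          volumes:
            - name: varlog
              hostPath:
                path: /var/log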

Failures and issues with the actual code - well, I might have been lucky... DDD with somewhat senior devs where no spaghetti action takes place. The tooling we keep usually seems to pinpoint issues fairly well.

We’re on the scale of roughly 40 devs, and my team of 3 supports them with tooling that handles service lifecycle and operational stuff.

It lets us be pretty flexible about what and how teams build and iterate on. I guess it requires a certain scale and experience though.


I work for a large webshop. I was amazed by how many engineers work here, but now, after two years, I understand why we need them. Handling millions of customers, sending millions of parcels, logistics, storage, customer support, etc. requires a lot of systems. It is definitely more complex than you would imagine at first glance.

All of those systems need each other's information. The shop needs to know whether a product is available and what its price is. Customer service needs to know how it was sent, the tracking code, etc.

We've built hundreds of microservices to manage this. This is definitely not something that is easy in a monolith. (We came from a monolithic architecture, and we're really happy we're now using microservices.)

We're now also rolling out Istio on GKE. So who knows, they might be on to something at Hello Fresh.


If you turn every function into a microservice, then you can have hundreds of microservices. I still don't see the benefits though.


Given where this is going, couldn't we just skip a couple of iterations of the hype cycle and simply port the BEAM runtime to work directly on machine clusters instead of machine code? This way we'll jump straight to one actor = one microservice = one VM.


Isn't that serverless, i.e. AWS Lambda?


You would also need an orchestrating layer on top of it.


I think it’s a safe assumption that any given problem domain has more complexity than is obvious at first glance.


This is one of those posts that brings out comments about things being over-architected for the business. My mind definitely went there.

But, at some point, companies that want to keep talented tech people need to let them go build what they want to build. Maybe those things are over-architected for what the company needs right now, but it's tough to say if that's a bigger risk than losing talented tech people.


No, that’s definitely the kind of talent you don’t want to have. Engineers too naive to realize they’ve over-engineered something are a cancer in an org. You end up with home-grown complex solutions to problems that nobody but the original team can grok.


I think this points at a fundamental question that creeps up in lots of different ways in our discussions of Silicon Valley. A founder needs to ask themselves: is your business really a software company? I'd suggest that, actually, HelloFresh's core business is about as complex as the mail-order business Pryce Pryce-Jones set up in 1861 to sell Welsh flannel. Which raises the question of why you're paying rockstar software engineers at all.

Obviously part of the answer is "Because I need a tech unicorn valuation". But maybe your marketing to investors shouldn't be what you're actually basing your hiring decisions on.


That kind of "talent" isn't really what drives a business forward.


I recently started investigating Istio, and though it will probably be an unpopular opinion here on HN, I honestly don't understand why it's not a core part of Kubernetes.

It's amazingly simple to configure, the docs are pretty OK, and the benefits seem huge. Mutual TLS within minutes? BAM, done. Don't worry about cert rotation; Citadel does that for you.
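
For reference, on Istio 1.5+ mesh-wide strict mTLS is a single resource (older releases used MeshPolicy); a minimal sketch:

    # Hypothetical sketch: enforce mutual TLS for all workloads in the mesh.
    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: istio-system   # root namespace => applies mesh-wide
    spec:
      mtls:
        mode: STRICT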

Block all egress, and whitelist what you need? Like, that's a killer feature! Also, the inability to do 90-10 canary releases with plain K8s baffles me. With Istio? Simple...
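
And a rough sketch of whitelisting a single external host, assuming the mesh is configured to block unregistered egress (outboundTrafficPolicy mode REGISTRY_ONLY); the host is just an example:

    # Hypothetical egress whitelist entry for one external API.
    apiVersion: networking.istio.io/v1alpha3
    kind: ServiceEntry
    metadata:
      name: allow-stripe-api      # illustrative name and host
    spec:
      hosts:
        - api.stripe.com
      location: MESH_EXTERNAL
      resolution: DNS
      ports:
        - number: 443
          name: https
          protocol: TLS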

I don't know, I'm sure I'll find the pains of Istio in the coming months, but in my dev cluster, it looks amazing.


I have a hunch various cloud providers have been aware of TLS-related frustrations for some time and are working on proprietary turn-key products to add to their own offerings at relatively high prices.

This quietly disincentivizes them from leaning into any OSS stacks that would take away from this future revenue stream.


Have you given Linkerd a try for comparison? Mutual TLS in 0 seconds (it's on by default) and a significantly lighter footprint. Canary traffic via SMI, etc.


It's only recently become 'stable', but Google now offers it in GKE, baked into their clusters, if you want.


Surprisingly, it is possible to build a profitable business in this space. I looked them up and they're (very slightly) profitable. They also operate in multiple countries, which likely adds complication to their infrastructure, so it's not as ridiculous as it sounds.


“Now, with our new service mesh—that only took a few months to roll out—a failure in our hello-fresh-left-pad microservice can be withstood with only a few hours of downtime” /s - Don’t go work for HelloFresh unless you hate your nights and weekends.


You can experience the same, if not worse, levels of pain and suffering with a monolithic application designed by the same Enterprise Architects.


Part 1





