Announcing Envoy: C++ L7 proxy and communication bus (lyft.com)
198 points by ryan_lane on Sept 14, 2016 | 32 comments



The info in https://lyft.github.io/envoy/docs/intro/comparison.html#prox... about Proxygen not supporting HTTP/2 is not correct. Proxygen has had HTTP/2 support for a while (https://github.com/facebook/proxygen/blob/master/proxygen/li...).

Disclaimer: I work on Proxygen at FB.


We will fix the docs, apologies. FYI, your README still says that "HTTP/2 support is in progress".


Thanks for pointing that out, we'll fix it on our side!


This post needs no disclaimer. You are posting facts, not opinions that might be considered biased. A "Source: I work on Proxygen at FB" might be understandable, but you already provided a source in the post.

Whether it's true or not, you look like you're just using the excuse to announce to everyone that you work at FB. As a person who works at Google, I understand the temptation... but you should fight it. It's super annoying.


Wow, with L7 routing on path (not just host) this does almost everything I'm using bud+fabio+consul to do. It's like Hystrix+sidecar-HAProxy in one.

The one thing I must have is SNI. The docs only have a short blurb [1]; does anyone know the full status of SNI support?

[1]: https://lyft.github.io/envoy/docs/intro/arch_overview/ssl.ht...

EDIT: Also, is there any kind of visualization for the resulting network topology? It looks like Envoy should know everything about who is talking to whom.


Currently we support SNI for client connections, not for server connections. There is no reason for this, just that we have not needed server-side SNI ourselves yet. Adding server SNI support would be a small change. Please file a GitHub issue and we can look into adding support (or can help you with the patch!).
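
For anyone unclear on the distinction being drawn here, a small Python illustration (nothing to do with Envoy's own configuration) of the two sides of SNI: the client sends the requested host name in its ClientHello, while a server uses a callback on that name to decide which certificate or behavior to use.

  import socket
  import ssl

  # Client-side SNI: the name passed here goes out in the TLS ClientHello.
  ctx = ssl.create_default_context()
  with socket.create_connection(("example.com", 443)) as raw:
      with ctx.wrap_socket(raw, server_hostname="example.com") as tls:
          print(tls.version())

  # Server-side SNI: inspect the requested name and react to it, e.g. by
  # swapping in a different SSLContext with the right certificate.
  server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)

  def on_sni(sock, server_name, original_ctx):
      print("client asked for", server_name)  # choose sock.context here

  server_ctx.set_servername_callback(on_sni)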


Visualization is going to be a big area of future investment for us. We already have some pretty cool tools internally and obviously lots of dashboards, etc. but we would love to have a dedicated UI for Envoy. If you know any good UI devs who would want to work on this please send them our way. :)


Thanks for all the info. Any insight into the service discovery issues described in the docs [1]?

  Many existing RPC systems treat service discovery as a
  fully consistent process. To this end, they use fully
  consistent leader election backing stores such as
  Zookeeper, etcd, Consul, etc. Our experience has been
  that operating these backing stores at scale is painful.
[1]: https://lyft.github.io/envoy/docs/intro/arch_overview/servic...


Mainly just years of experience at different companies watching ZK, etcd, etc. fall over at scale and require teams of people to maintain them.

We have had zero outages caused by our eventually consistent discovery system with active health checking (knock on wood), and haven't really touched the discovery service code in months. It just runs.

I'm not saying that a system using ZK, etc. can't be made to work. It certainly can since many companies do it. It's mostly that I think those solutions are actually making the overall problem a lot more complicated and prone to failure than it has to be.
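
To make the shape of that approach concrete, here is a rough Python sketch of eventually consistent discovery plus active health checking. The discovery endpoint, response format, and /healthcheck path are invented for illustration and are not Lyft's actual service.

  import json
  import threading
  import time
  import urllib.request

  DISCOVERY_URL = "http://discovery.internal/v1/hosts/my-service"  # hypothetical endpoint

  class EventuallyConsistentDiscovery:
      def __init__(self, poll_interval=30, check_interval=5):
          self.hosts = []        # last known host list; never cleared on a failed poll
          self.healthy = set()
          self.lock = threading.Lock()
          threading.Thread(target=self._poll, args=(poll_interval,), daemon=True).start()
          threading.Thread(target=self._check, args=(check_interval,), daemon=True).start()

      def _poll(self, interval):
          # Best-effort refresh: stale data is acceptable, losing the list is not.
          while True:
              try:
                  with urllib.request.urlopen(DISCOVERY_URL, timeout=2) as resp:
                      hosts = json.load(resp)["hosts"]
                  with self.lock:
                      self.hosts = hosts
              except Exception:
                  pass  # keep the last good list and try again later
              time.sleep(interval)

      def _check(self, interval):
          # Active health checking decides which hosts actually receive traffic.
          while True:
              with self.lock:
                  hosts = list(self.hosts)
              healthy = set()
              for host in hosts:
                  try:
                      urllib.request.urlopen("http://%s/healthcheck" % host, timeout=1)
                      healthy.add(host)
                  except Exception:
                      pass
              with self.lock:
                  self.healthy = healthy
              time.sleep(interval)

      def pick_hosts(self):
          with self.lock:
              return [h for h in self.hosts if h in self.healthy]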


This looks really neat! We are deploying a few new microservices every month now, and I'm afraid that things will get out of hand networking-wise (we too are using ELBs for both load balancing and service discovery), so I'm looking forward to trying Envoy. Thank you for open sourcing it, guys.


Pardon my ignorance, but would someone mind explaining, in a little more detail, when this software would be necessary, and perhaps what other tools do the same thing? Envoy seems like it does a lot, and I'm just trying to wrap my head around it.


Their docs actually compare it to a lot of other stuff like haproxy, nginx, Amazon ELB, and more.

https://lyft.github.io/envoy/docs/intro/comparison.html


I didn't see these, thank you!


Yeah, I read a fair bit of their docs before I happened across it. I feel like they should make that comparison more prominent. Even knowing it was in there, I had a little trouble finding it again.


This does seem incredibly useful for service oriented architectures. As I understand it, it's basically a per-application, per-machine monitoring library for quickly detecting problems up and down the network stack.

However it also does load balancing. But doesn't that defeat the purpose a little bit? If your monitoring tool is the same as your load balancing tool, then who's monitoring the load balancer? :) I might be misunderstanding the architecture here.


So correct me if I'm wrong, but it seems to maintain a web socket 'mesh' that it proxies all inter-service communication through. So whenever you need to speak to another service, you don't need to worry about the extra cost of creating and tearing down a new web socket. It also says that it handles automatic retries and global rate limiting (https://lyft.github.io/envoy/docs/intro/what_is_envoy.html).

Because all inter-service requests are going through Envoy, it is really easy to keep incredibly detailed stats about network health, request success rate & more.

Envoy performing the task of load balancing does not defeat the purpose, because it provides extremely detailed stats for ALL THE THINGS. They reported that it helped them find problems much more quickly, instead of having to check service code, EC2 networking, or the ELB separately. Essentially, by creating a single solution with better stats reporting across the board, troubleshooting seems like it would be easier.
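
For a concrete picture of that pattern (a hedged illustration only: the localhost port and the idea of naming the target service in the Host header are assumptions for the sketch, not Lyft's actual configuration), the application side of a sidecar mesh can be as simple as:

  import http.client

  def call_service(service, path):
      # Talk to the local sidecar; it owns the long-lived upstream connections,
      # retries, and per-service stats, so the app never dials the remote host.
      conn = http.client.HTTPConnection("127.0.0.1", 9001, timeout=2)
      conn.request("GET", path, headers={"Host": service})
      resp = conn.getresponse()
      return resp.status, resp.read()

  status, body = call_service("ratings", "/v1/ratings/42")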


This is hardly my field, but it was my understanding that this was the opposite. It's a load balancer that reports statistics so that you can figure out what's happening. That doesn't make your concern less valid, but it does change how everything is framed.


If there's anyone from Lyft here answering questions, I have a few.

#1: cost. Doesn't it basically cost double to move the request from the client application to the proxy, and then from the proxy to the backend?

#2: upgrades. What happens to the clients when the proxy is being rolled out?

#3: head-of-line blocking. If an application has two streams to the same backend, one stream which is low priority and one which is higher, how does the proxy handle that?


Hi,

I work at Lyft. To answer your questions:

1) There is added cost, though it varies depending on how many things Envoy is configured to do (e.g., logging, tracing, stats, rate limiting, health checking, etc.). Even in complex scenarios (Envoy being used to proxy both inbound connections into a service, as well as proxy outbound connections to Mongo or Dynamo), we measure Envoy overhead to be < 1ms, which for almost all applications is negligible. There are definitely certain cases where this might be prohibitive, but in general we find the common functionality we get (again stats, tracing, etc.) to be invaluable in a production setting.

2) Envoy supports hot restart (https://lyft.github.io/envoy/docs/intro/arch_overview/hot_re...), as well as graceful drain of existing connections, so there is limited/no impact to existing clients. There is one enhancement that we would like to make to our HTTP/2 graceful draining to make it even more seamless, but that is more complicated than I can type here. :)

3) Right now Envoy does not support HTTP/2 priority so all streams are treated equally. We are currently working on priority support at the routing layer, with different connection pools available for high and low priority traffic, as well as circuit breaking settings. In the future we will likely merge this back into a single HTTP/2 connection with proper priority support. In practice though, within the DC, head of line blocking at the TCP layer isn't too much of an issue.
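
As a rough illustration of the "separate connection pools per priority" idea in #3 (a sketch only, not Envoy code or its configuration), keeping the pools distinct means saturated low-priority traffic blocks only against itself:

  import queue

  class PriorityPools:
      """Toy per-priority connection pools; 'connections' are placeholders."""

      def __init__(self, high_size=8, low_size=2):
          self.pools = {"high": queue.Queue(maxsize=high_size),
                        "low": queue.Queue(maxsize=low_size)}
          for name, pool in self.pools.items():
              for i in range(pool.maxsize):
                  pool.put(("conn", name, i))

      def acquire(self, priority):
          # Blocks only when connections of the *same* priority are exhausted.
          return self.pools[priority].get()

      def release(self, priority, conn):
          self.pools[priority].put(conn)

  pools = PriorityPools()
  conn = pools.acquire("high")   # unaffected by whatever "low" traffic is doing
  pools.release("high", conn)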


I'm not at Lyft, but I've been looking into introducing this proxy in our environment to replace haproxy in smartstack (http://nerds.airbnb.com/smartstack-service-discovery-cloud/), and so have thought about these questions a bit.

#1: The cost of local proxying is very small, on the order of tenths of a millisecond. It's going to be hard to distinguish against the backdrop of the broader network latency. Also, in our environment we've been thinking of only running the proxy on the client end for now, to make the migration easier.

#2: As with haproxy, you leave the existing connections handled by the running proxy, and new connections are taken up by the new proxy. Unlike haproxy, Envoy actually supports live config reloading, so you don't have to pay that penalty on every config change.

#3: Multiple connections are multiplexed, but I'm not sure how you specify priority to the proxy. Do the docs specify a mechanism? I imagine both requests are served in parallel, in no particular order, and the data is buffered for the client.


Nicely done! I'm quite impressed, since my company has also had similar problems with a fine-grained service oriented architecture. Envoy seems to cover almost all of them, kudos.

However, I couldn't find a performance benchmark or any comparison against alternatives such as haproxy, nginx, etc. So I'm going to get my hands dirty now. ;)


> Envoy works with any application language. A single Envoy deployment can form a mesh between Java, C++, Go, PHP, Python, etc.

I find it odd that they did not include Rust in their list of preferred languages. Rust is safer than C++ and Go.


Rust barely registers outside HN and a few other web gathering places.

All those other languages have their niches and ecosystems and are safe enough.


The thing that struck me about language choice was the list of also-rans:

> very productive but not particularly well performing languages such as PHP, Python, Ruby, Scala, etc

Which seems to be missing a certain more-productive-than-C++ and very well-performing coffee-themed language.

I mean, you would never use Java for this, because although it could go fast enough, it would need way too much memory to do it. But I would have liked to see it dismissed for that reason rather than glossed over!


Lots of things are safer than C++ and Go. Many just don't see as much use.


I assume that's a list of languages that are commonly used to host web services or power databases.


Any explanation for not using Asio instead of libevent (or libuv)? Performance?


Wouldn't this require private keys to be sprinkled on all machines running it to inspect the traffic?


I work for Lyft.

For this we have a secret management system, called confidant (https://lyft.github.io/confidant/), that we use to distribute any necessary secrets. So, yes, you may need to have keys on every node (depending on your monitoring system), but assuming you securely distribute them, it's not a big deal.

This is, of course, a general problem that's not necessarily related to envoy.


This increases your attack surface. Any breach of one of those machines and the attacker can start doing MITM attacks. It also limits auto-scaling, assuming newly provisioned machines require manual approval of private key distribution (keys that stay in memory) via an HSM, and the same goes if the process dies. One way to limit key distribution is to embed the routing information you require in the SNI at a second LB layer that's shielded from public traffic (a rough sketch of the SNI parsing this relies on is below). This way your public machines don't hold any keys, limiting the damage if they get compromised.

I agree it's a general problem. But sometimes certain architectures would require more vulnerable approaches vs others.
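
The SNI passthrough approach described above works because the requested host name appears in cleartext in the TLS ClientHello, so a routing layer can peek at it without terminating TLS or holding any keys. A rough Python sketch of just that parsing step (illustrative only; a real implementation must handle records split across TCP segments and malformed input):

  import struct

  def extract_sni(client_hello):
      # Returns the SNI host name from a raw TLS ClientHello, or None.
      if len(client_hello) < 5 or client_hello[0] != 0x16:  # not a handshake record
          return None
      pos = 5                       # skip the 5-byte TLS record header
      if client_hello[pos] != 0x01:                         # not a ClientHello
          return None
      pos += 4                      # handshake type + 3-byte length
      pos += 2 + 32                 # client version + random
      session_id_len = client_hello[pos]
      pos += 1 + session_id_len
      cipher_len = struct.unpack("!H", client_hello[pos:pos + 2])[0]
      pos += 2 + cipher_len
      comp_len = client_hello[pos]
      pos += 1 + comp_len
      ext_total = struct.unpack("!H", client_hello[pos:pos + 2])[0]
      pos += 2
      end = pos + ext_total
      while pos + 4 <= end:
          ext_type, ext_len = struct.unpack("!HH", client_hello[pos:pos + 4])
          pos += 4
          if ext_type == 0x0000:    # server_name extension
              # 2-byte list length, 1-byte name type, 2-byte name length, name
              name_len = struct.unpack("!H", client_hello[pos + 3:pos + 5])[0]
              return client_hello[pos + 5:pos + 5 + name_len].decode("ascii", "replace")
          pos += ext_len
      return None

The passthrough layer then just forwards the raw bytes to whatever backend that name maps to, and TLS terminates only on machines that are allowed to hold keys.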


Just seems like another piece of code in search of a problem, and more ways for things to go wrong, because someone didn't take the time to research how things work now.


Do you have any relevant criticisms, or are we all supposed to just pretend we already know what your problem is?



