Concurrency Limits by Netflix

jsnell · on March 25, 2018

The blog post announcing this might be a better read:

https://medium.com/@NetflixTechBlog/performance-under-load-3...

(And while reading that and thinking "I swear I recently read something else talking about applying TCP congestion control to RPC queueing". And indeed I had: http://www.evanjones.ca/prevent-server-overload.html)

cperciva · on March 25, 2018

I swear I recently read something else talking about applying TCP congestion control to RPC queueing

I did this almost 10 years ago for accessing Amazon SimpleDB, too: http://www.daemonology.net/blog/2008-06-29-high-performance-...

Twirrim · on March 26, 2018

Had a co-worker implement something similar to this dynamic throttling in a back-end analytical process, based around access of a DynamoDB table that was accessed by multiple front end services.

The code on the back-end processing software would throttle itself back hard (about 50% of the speed it had reached) whenever it ran into a DynamoDB throttle message as a response, and then would ramp itself back up steadily. Combined with good retry logic on the front end services, it meant we could keep the DynamoDB table humming along at near maximum usage.

ramchip · on March 25, 2018

This reminds me a lot of the sbroker library in Erlang:

> SBroker is a framework to build fault tolerant task schedulers. It includes several built in schedulers based on TCP (and other network scheduling) algorithms.

> [...] in an ideal situation a target queue time would be chosen that keeps the system feeling responsive and clients would give up at a rate such that in the long term clients spend up to the target time in the queue. This is sojourn (queue waiting) time active queue management. CoDel and PIE are two state of the art active queue management algorithms with a target sojourn time, so should use those with defaults that keep systems feeling responsive to a user.

http://www.erlang-factory.com/euc2017/james-fish

https://github.com/fishcakez/sbroker

lkrubner · on March 26, 2018

I like the sounds of this:

"Executor -- The BlockingAdaptiveExecutor adapts the size of an internal thread pool to match the concurrency limit based on measured latencies of Runnable commands and will block when the limit has been reached."

I'm often surprised this kind of auto-scaling thread pool is not a more common thing in Java land.

tyrankh · on March 26, 2018

Agreed - not trying to trivialize the work here, but this thing seems solved with tools at hand. Perhaps I miss some of the subtlety in their usecase, though.

otterley · on March 26, 2018

Which tools are you referring to, exactly? Some citations would be helpful.

baybal2 · on March 26, 2018

PROTIP: Managing congestion control on application level is not a good business. A better idea is to leave it to the edge and CDN, while the app level uses computationally cheap optimistic algos since the com in between the app and your edge will go over your own high quality infrastructure.

This adds flexibility to allow for use of multiple algorithms in different load balancing regions (mobile as a lossy fabric is better to stay with conventional, desktop and server clients can use a some smarter throughput maximizing algo, and in countries with high percentages of connections being laggy DSLs, you can use something else )

otterley · on March 26, 2018

Can you refer us to any papers or analyses that support your claim?

baybal2 · on March 27, 2018

time0ut · on March 26, 2018

This library looks very interesting. I've used a similar approach for pulling batches of items from a queue (e.g. discover the optimal batch size and inter-poll wait time). There are plenty of other places we could benefit from something like this. I can't wait to try this out.

Netflix puts out some amazing Java libraries. I've had excellent results using Hystrix [0]. It has been an excellent addition to our systems.

[0] https://github.com/Netflix/Hystrix

Anderkent · on March 26, 2018

This is a pretty cool design if your requests to a given endpoint are supposed to all take about the same time. It's not easy to see how you'd adjust it for things with more variance; perhaps rather than using the fastest seen mtt, you could look at your p99 over the last N minutes, and see if it's been changing?

Bombthecat · on March 26, 2018

Jeah, I'm also thinking about this problem. Thought this could help...

Most of the problematic backedends are not those with response time of 20ms. On almost every request. The backedends with problems are those which could reply in 10ms or 2 minutes ...

nvarsj · on March 26, 2018

I think you bring up the main problem w/ using tcp vegas. It's not clear to me this will work with heterogenous requests. If the typical request time distribution is long tailed, it might never increase the window size.

elandau · on March 26, 2018

Even with heterogenous workload there normally is a uniform distribution of request types. Instead of generating complex statistics for average latency or tail latencies, especially for multimodal distributions, we just look at the minimum latencies as a proxy to identify queuing. So, when there is any queuing for whatever reason (increased RPS or latency in a dependent service) all latency measurements will show an increase, especially the minimum.

nvarsj · on March 27, 2018

"uniform distribution of request types" - okay, it makes sense in that context. Although if that assumption breaks down, your thread limits may become under or over provisioned.

I'm wondering though - how do you pick the right alpha and beta values? It seems like you need to do testing/validation to ensure you use the right values, right?

Sorry if I'm sounding critical by the way. I think this is a really cool project - thanks for open sourcing it!

fwef · on March 26, 2018

Would this work when connecting multiple threadpools (java executors)? Imagine I have a microservice where first threadpool downloads large files to disk (IO bound) and another threadpool that processes the downloaded data (CPU bound). Those two threadpools communicate with a bounded queue. Will using the concurrency-limit Executor allow me to get the best throughput in this scenario?

bvod · on March 26, 2018

Can anyone explain how they decide which requests to reject? The blog post just mentions that excess RPS gets rejected, but couldn't rejecting arbitrary requests cause other problems?

elandau · on March 26, 2018

Requests are rejected essentially when an atomic counter of inflight requests hits the limit. It's important to note that the library doesn't actually keep any kind of queue of requests. That's really not necessary because every system already has a ton of queues in the form of socket buffers, executor queues, etc...

Yes, the basic implementation does reject arbitrary requests. We do have a partitioned limit strategy (currently in the experimental state, which is why it wasn't brought up in the techblog). The partitioned limiters lets you guarantee a portion of the limit to certain types of requests. For example, let's say you want to give priority to live vs batch traffic. Live gets 90% of the limit, batch gets 10%. If live requests only account for 50% of the limit then batch can use up to the remaining 50%. But if all of a sudden there's sustained increase in live traffic you're guaranteed that live requests will only be rejected once the exceed 90% of the limit.

nvarsj · on March 26, 2018

My guess is they use a 0 / small queue in front of the request pool. If queue is full (indicating the server is at its concurrency limit), it returns a 429 (which is sort of weird - return a 503 instead). I don't think that is part of the library though - the library just provides the low level bricks.

ad-hominem · on March 26, 2018

I think their point is basically "It doesn't matter" - any client that sends a request which then gets rejected automatically retries and is bound to get a server that is up and has capacity. The retry happens so fast that even with just a naive retry implementation, the end-user won't even notice the interruption.

From https://medium.com/@NetflixTechBlog/performance-under-load-3...

> The discovered limit and number of concurrent requests can therefore vary from server to server, especially in a multi-tenant cloud environment. This can result in shedding by one server when there was enough capacity elsewhere. With that said, using client side load balancing a single client retry is nearly 100% successful at reaching an instance with available capacity. Better yet, there’s no longer a concern about retries causing DDOS and retry storms as services are able to shed traffic quickly in sub millisecond time with minimum impact to performance.

Edit: In terms of how they decide what to reject, from reading the blog post, there is a queue and there is a limit to how big the queue can be. Requests that come in while the queue is "full" get rejected immediately. They don't wait in the queue and timeout.

0354305 · on March 26, 2018

I'm not grasping the concept; Do I use this in production to manage concurrency, or do I use this in 'testing' to fine tune my system?

michaelmior · on March 26, 2018

You use this on production to allow either the server or client to self determine optimal concurrency levels.

99300432 · on March 26, 2018

Thanks

stephen123 · on March 26, 2018

I would have thought reactive streams or back pressure would be the usual way to deal with this issue.

Is this better ?

lllr_finger · on March 26, 2018

Netflix's circuit breaking OSS project, Hystrix, is commonly seen alongside Ratpack/rxjava where both reactive streams and back pressure are in play. I don't think you're wrong, but Netflix and others using their solutions like Hystrix and Hollow, are outside of what I'd consider "usual" problems and solutions.

nvarsj · on March 26, 2018

I don't think this library is meant to be used w/ reactive streams. It talks a lot about limiting number of concurrent threads, so it sounds more like traditional RPC with a request pool that they are trying to size to inform clients to back off (by returning 429).