Exponential Rate Limiting (dotat.at)
115 points by mooreds 64 days ago | 23 comments



Ha, I wrote one of these a year ago. The problem being solved was gracefully degrading a service if the request volume was too high.

Some additional tweaks you might care about:

1. It's easy to change the "numerator" being measured to something continuous rather than discrete. As a rate limiter in my application I found CPU seconds per second to be much more useful than requests per second. If you have more control over the low-level networking then queue depth or something might be more applicable, but you work with what you have.

2. It's easy to change the units of the configuration parameters. I haven't checked whether TFA already uses these units, but its "averaging period" can easily be re-cast as a half-life (in microseconds or hours or whatever's most convenient). A sketch of both tweaks follows.
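
A rough sketch of both tweaks in Python (the class and parameter names are made up, not from TFA): the quantity being smoothed is CPU seconds rather than a request count, and the decay speed is given as a half-life.

    class CpuRateEwma:
        """Exponentially smoothed CPU-seconds-per-second, parameterized by a half-life."""

        def __init__(self, half_life_seconds):
            self.half_life = half_life_seconds
            self.rate = 0.0          # smoothed CPU seconds per wall-clock second
            self.last_update = None  # time of the previous observation

        def observe(self, now, cpu_seconds):
            if self.last_update is None:
                self.last_update = now
            dt = max(now - self.last_update, 1e-9)
            # After one half-life the old estimate contributes exactly half:
            # alpha = 0.5 ** (dt / half_life) == exp(-dt * ln(2) / half_life)
            alpha = 0.5 ** (dt / self.half_life)
            self.rate = alpha * self.rate + (1.0 - alpha) * (cpu_seconds / dt)
            self.last_update = now
            return self.rate

    # e.g. ewma.observe(time.monotonic(), cpu_used_by_this_request)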

Anywho, this is a nice write-up on exponential smoothing. I like the attention to edge cases affecting rate-limiting specifically.

As a bit of a fun aside, the state-update equation here is an Euler update solving the exponential ODE, fusing in the new knowledge you have (new requests) as instantaneous tweaks. For rate-limiting in particular the exponential is probably a good choice (numerically stable, hardware instructions, works even when time gaps or other features are far outside of your intended configuration space, ...), but in other domains, like memoryless (still "stateful", obviously, but only requiring an amalgamation condensed into the very last state, as opposed to some O(n) thing) animation updates, you might prefer other properties like the ability to decay to exactly 0 in finite time. You can also recover leaky buckets and other similar algorithms with the right choice of ODE.
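
To spell that aside out a little (notation mine): between requests the smoothed rate y simply decays, and a request of cost c arriving after a gap Δt is fused in as an instantaneous observation c/Δt.

    \frac{dy}{dt} = -\frac{y}{\tau}
      \;\Longrightarrow\;
      y(t + \Delta t) = e^{-\Delta t/\tau}\, y(t)

    y \;\leftarrow\; e^{-\Delta t/\tau}\, y \;+\; \bigl(1 - e^{-\Delta t/\tau}\bigr)\,\frac{c}{\Delta t}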


> As a rate limiter in my application I found CPU seconds per second to be much more useful than requests per second.

This is really insightful and makes good sense. I've toyed for years with the idea of building a "Cost Benefit Cache" that basically tracked the request volume for a piece of data, the time to retrieve that data and the size of the data retrieved to make decisions about what to cache during system stress.

The idea was basically a cost benefit formula where the cost is the size of the data, so how much of the available cache space would it use, while the benefit is the seconds of savings from requests * average retrieval time. Requests per second alone doesn't really matter if the request is low/no impact.
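
A toy version of that score (hypothetical names, not from any real system): benefit is the retrieval time saved per second, cost is the cache space the entry would occupy.

    def cache_score(requests_per_second, avg_retrieval_seconds, size_bytes):
        """Seconds of retrieval work saved per second, per byte of cache used."""
        benefit = requests_per_second * avg_retrieval_seconds  # work avoided per second
        cost = max(size_bytes, 1)                              # space the entry would occupy
        return benefit / cost

    # Under memory pressure, evict the lowest-scoring entries first and only
    # admit new entries whose score beats the current eviction threshold.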


Google had a library like that, and one of the footguns is that requests that are malformed or otherwise do zero work have an incredibly low cost, which results in them being prioritized above all other requests. Every user had to configure some obscure options to set cost multipliers based on return code to try to correct for this, which felt like a pretty fundamental bug that was being hacked around.

Edit: this was actually for cost-based load balancing, not load shedding. You obviously can't load shed based on return code.


Interesting that “if the request is bad or there’s an error” is a difficult condition to implement universally.


Some things are really hard to appropriately abstract, but there's so much similarity we can't help but be tempted to try.

Something that I wrote at $WORK recently was a data-oriented exponentially smoothed reservoir sampling order statistic randomized splay tree.

All of those components exist individually. Even picking on two of them though; given a tree implementation it's trivial (as a human) to create an analogous order-statistic tree. In any pseudo-mainstream programming language though, especially at a system level, how do you tack on the "order-statistic" behavior to an arbitrary tree?

It's not too bad if you additionally require some interface the primary tree needs to adhere to. That interface is, necessarily, biased toward the needs of an order-statistic tree though. There isn't actually that much difference between implementing a tree which adheres to the interface and one which just does every operation manually. Plus, there are some allocation questions and other inefficiencies (or, alternatively, unsoundness) which that level of abstraction introduces and which you can avoid by writing the whole thing from scratch.

Bringing it back to the $GOOG library, the problem is that you have a lot of code which _looks_ like it's the same, but it's impossible to make it actually _be_ the same without heavy lifting for callers of that library.

Taking those monolithic frameworks (e.g., inadvisedly popular rate-limiting shenanigans) and turning them into libraries of helpful utilities often helps. General solutions are harder to come by.


I think the answer is for the library to not be quite so ambitious. This detail isn’t a footgun, bug or hack (to quote the previous comment) if it’s explicitly part of the interface.

The user would need to declare the signals to know a request should be deprioritized, or even fully handle that logic themselves. People will deride this as “more complex”, but it’s actually less complex than the solution that appears simple, but doesn’t really work.

After it gets enough usage, you’ll start noticing patterns and you can implement smaller abstractions to make things easier. You could have sensible defaults of codes to ignore, or integrate it with your error logging service.


When you've got 80,000 engineers you end up with lots of different opinions on what the wire protocol should look like.

* Sometimes, you want to fail closed - an error is returned for every failure.

* Sometimes, you want to fail open - ok is returned for errors.

* Sometimes, you want to separate "failure" from "denied", the code "worked", but it didn't do what the client wanted - think "payment declined", so you return "OK", with the actual result in the response.

Compare the different costs to charge a credit card that is expired vs one that works.

* The expired one can be checked before calling out to the processor - cheap, returns "OK + Declined".

* The success ends up with a call to the processor and takes a second or two, returns "OK + Success".


FYI, there is a ton of good work in storage and CDN systems that is very similar. Check out some of my favorites: Gil Einziger https://scholar.google.co.il/citations?user=kWivlnsAAAAJ and the ex-Coho guys https://www.usenix.net/system/files/conference/osdi14/osdi14...


CPU rate is only quasi-continuous. The fact of a thread running or not running is binary. You cannot be 63% on a CPU, that is an illusion of sampling and integration.

But the real trouble with CPU rate is that it represents the past, not the future. It tells you what happened to the last request and nothing about what will happen to the next request. A signal with actual predictive power is how many other requests are running or waiting to run.


Like everything, the devil's in the details.

In our case, non-idle cumulative thread CPU time (including system activity) is a nice metric. If it exceeds 70%, start sending some 503s (or 500s depending on the client). A single core can easily service 100B requests per day, a few times more than we needed to, so if we exceeded that by enough margin to saturate every core to 70% then the right course of action was some combination of (1) scale up, and (2) notify engineering that a moderately important design assumption was violated. We could still service 100B requests per day per core, but the excess resulted in a few 5xx errors for a minute while autoscaling kicked in. It's not good enough for every service, but it was absolutely what we needed and leaps and bounds better than what we had before, with a side-benefit of being "obviously" correct (simple code causes fewer unexpected bugs).

The core idea (tweaking the numerator in exponential smoothing) isn't tied to any particular, easily disputed definition of CPU time, though. It's that you're able to tailor the algorithm to your individual needs. A related metric is simply the wall-clock time from when a request is first queued until it is serviced (depending on your needs, perhaps compared to the actual processing time of that request). If you ever get to total_time >= 2 * processing_time, that likely indicates a queue about to exceed a critical threshold.
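
A minimal sketch of that queueing-delay check (hypothetical names; the 2x threshold is just the rule of thumb above, not a recommendation):

    def should_shed(enqueue_time, start_time, finish_time):
        """Shed load when requests wait in the queue as long as they take to process."""
        total_time = finish_time - enqueue_time       # queued + processed
        processing_time = finish_time - start_time    # processed only
        # total >= 2 * processing means queueing delay >= processing time,
        # i.e. work is arriving faster than it is being drained
        return total_time >= 2 * processing_time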

> how many other requests are running or waiting to run

I referenced that idea pretty explicitly: "If you have more control over the low-level networking then queue depth or something might be more applicable, but you work with what you have."

Yes, that matters. That doesn't make other approaches invalid. Even then, you usually want a smoothed version of that stat for automated decision-making.


An often overlooked extension of simple exponential weighting is that if you take the difference of two or three exponential averages, you can get some very nice behavior.

One particular behavior that may be desirable: you can build a system that allows flexibly defined short bursts but limits long-term request rates. That's really nice for chatty protocols where you might want to nearly ignore a burst of 100 requests as long as the burst takes less than a few milliseconds, but clamp down hard if there is a sustained single query per second over the last thirty seconds.
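
One way to get roughly that behavior (a sketch with made-up names; the difference-of-averages construction described above is more refined) is to keep a fast and a slow exponential average of the request rate and gate mainly on the slow one, which barely notices a millisecond-scale burst but catches a sustained one request per second:

    import math

    class DualEwmaLimiter:
        """Illustrative only: a fast and a slow exponential average of the request rate."""

        def __init__(self, fast_tau, slow_tau, burst_limit, sustained_limit):
            self.fast_tau, self.slow_tau = fast_tau, slow_tau   # e.g. 0.01 s and 30 s
            self.burst_limit, self.sustained_limit = burst_limit, sustained_limit
            self.fast = self.slow = 0.0
            self.last = None

        def allow(self, now, cost=1.0):
            dt = max(now - self.last, 1e-9) if self.last is not None else float("inf")
            self.last = now
            inst = cost / dt                       # instantaneous rate implied by this request
            a_fast = math.exp(-dt / self.fast_tau)
            a_slow = math.exp(-dt / self.slow_tau)
            self.fast = a_fast * self.fast + (1.0 - a_fast) * inst
            self.slow = a_slow * self.slow + (1.0 - a_slow) * inst
            # the slow average enforces the sustained limit and barely notices a
            # millisecond burst; the fast average only rejects extreme bursts
            return self.slow <= self.sustained_limit and self.fast <= self.burst_limit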

Differences in exponential decay can also be used to implement a good approximation of a Gaussian kernel of any desired size. In image processing, for instance, this allows a Gaussian blur whose cost doesn't depend on the scale of the blur (unlike a true convolution).


We were just this week talking at work about all the many and varied names that people make up for “weighted average between smoothed value and new sample”. :D


For WebSockets, using SocketCluster (https://socketcluster.io), it's possible to queue up all requests from the same client and then detect and respond to high backpressure spikes (e.g. by disconnecting the client and/or recording the incident).

You can combine different approaches like limiting the number of connections from a single IP within a certain timeframe and also limiting the backpressure.

The ability to detect and respond to backpressure spikes on a per-end-user basis is highly valuable because backpressure in SocketCluster takes into account the processing time of client requests.

A common strategy that spammers use is to identify and invoke the most expensive endpoints in your system.

Unfortunately, a lot of people still don't understand the value proposition of being able to process requests from clients in-stream and in-order. It's also good for preventing race conditions and makes your environment highly predictable.

In terms of security, many of the worst, most expensive hacks in history were the result of asynchronous events exploiting unexpected race conditions on the server side. The crypto industry has been plagued with those since its inception.

People seem to have gotten used to running chaotic, unpredictable systems, supporting impossibly complex permutations of concurrent events and trying to handle every possible edge case instead of constraining each stream to certain sequences.

I don't understand the industry's obsession with type safety in the face of far greater problems like this. Maybe we just need a new catchphrase: Concurrency safety? Async safety?

Queueing up message processing per-client doesn't add any noticeable delays from the user's perspective because most operations take a few milliseconds, which is unnoticeable once you factor in the latency between client and server. It's only noticeable when you want it to be; for example, when the user uploads a large chunk of data that requires heavy processing. You can also specify which streams can be parallel and which serial.


Hm, I really need to study this more. The early claim that it requires "complicated float arithmetic" took me by surprise. I'm used to fixed-time-period sampling, so you can just do:

   // Textbook version:
   //   average = alpha * new_value + (1 - alpha) * average
   // Which is equivalent to:
   //   average += (new_value - average) * alpha
   // To get rid of the float, define w = 1/alpha and make it an integer.
   // If w is a power of 2 (try 8), this is just a shift.
   average += (new_value - average) / w


Absolutely. You just need an integer fixed-point implementation for evenly spaced data.

For irregularly spaced data, you need exp(-t), but that isn't hard to do with fixed point.


Yeah, it’s easy when alpha is fixed. The complications — the exp() and ln() — come from updating the smoothed rate at varying intervals, whenever a request happens to arrive, so alpha has to be adjusted according to the interval to make sure the exponential decay happens at a fixed speed regardless of the interval.
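
For completeness, one place the ln() earns its keep is turning an over-limit smoothed rate back into a retry time: with averaging period T the rate decays like exp(-t/T), so it falls back below the limit after T * ln(rate/limit). A tiny sketch (the same formula also appears in the PostgreSQL snippet further down the thread):

    import math

    def retry_after(smoothed_rate, limit, period_seconds):
        """Time until an exponentially decaying rate drops back below the limit."""
        if smoothed_rate <= limit:
            return 0.0
        # solve rate * exp(-t / period) == limit for t
        return period_seconds * math.log(smoothed_rate / limit)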


I wrote an interactive notebook visualization and a fast-convergence optimization for one of these:

https://observablehq.com/@tomlarkworthy/rate-estimation


For the client side, when servers do this, look for a library like:

https://tenacity.readthedocs.io/en/latest/

This is one of the easiest to use when slinging third-party APIs in Python (a minimal example follows the feature list):

- Generic Decorator API

- Specify stop condition (i.e. limit by number of attempts)

- Specify wait condition (i.e. exponential backoff sleeping between attempts)

- Customize retrying on Exceptions

- Customize retrying on expected returned result

- Retry on coroutines

- Retry code block with context manager
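
A minimal example along those lines (the argument values are illustrative; see the tenacity docs for the full API):

    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

    @retry(
        stop=stop_after_attempt(5),                          # give up after 5 attempts
        wait=wait_exponential(multiplier=1, min=1, max=30),  # exponential backoff, capped at 30s
        retry=retry_if_exception_type(ConnectionError),      # only retry transient failures
    )
    def call_flaky_api():
        ...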


This is like a capacitor discharging. I had always considered this a variant of token bucket with exponential instead of linear decay.


The main difference from my point of view is the EWMA produces a measurement of the rate, whereas token bucket only produces the result of a comparison with the limit. If you don’t need the actual value of the rate then you can save a lot of time and space by using GCRA instead https://dotat.at/@/2024-08-30-gcra.html
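
For reference, a bare-bones GCRA sketch (my paraphrase of the textbook virtual-scheduling formulation, not code from the linked post):

    def gcra_allow(tat, now, emission_interval, burst_tolerance):
        """tat: the client's theoretical arrival time (use `now` for a new client).

        emission_interval = period / limit; burst_tolerance controls how far ahead
        of schedule a client may get. Returns (allowed, new_tat).
        """
        if tat - now > burst_tolerance:
            return False, tat                            # too far ahead of schedule: reject
        return True, max(tat, now) + emission_interval

    # One timestamp of state per client, and only an allow/deny answer --
    # no rate estimate, which is the trade-off described above.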


If doing a small amount of arithmetic and keeping 4 bytes (or less) additional memory per client is "a lot of time and space", you may need to rethink your priorities.

It is very, very unusual for any kind of request to anything to take a time within even a few orders of magnitude of how long the exponential averaging takes. The difference between a GCRA implementation and a complex multi-exponential average will be nanoseconds.


I've written an implementation in PostgreSQL.

You can use it like this:

    SELECT * FROM check_rate_limit('client1', 1.0, 10.0, INTERVAL '1 minute');

    -- Create a table to store client rate limiting data
    CREATE TABLE rate_limiter (
        client_id        VARCHAR(255) PRIMARY KEY,
        last_update_time TIMESTAMP WITH TIME ZONE NOT NULL,
        average_rate     DOUBLE PRECISION NOT NULL
    );

    -- Function to check and update rate limit
    CREATE OR REPLACE FUNCTION check_rate_limit(
        client_id   VARCHAR(255),
        cost        DOUBLE PRECISION DEFAULT 1.0,
        limit_value DOUBLE PRECISION DEFAULT 600.0,
        period      INTERVAL DEFAULT INTERVAL '1 hour'
    ) RETURNS TABLE (
        allowed           BOOLEAN,
        next_allowed_time TIMESTAMP WITH TIME ZONE
    ) AS $$
    DECLARE
        -- NOW() captured once per call (CURRENT_TIME is reserved, so avoid it as a variable name)
        now_ts             TIMESTAMP WITH TIME ZONE := NOW();
        time_interval      DOUBLE PRECISION;
        alpha              DOUBLE PRECISION;
        instantaneous_rate DOUBLE PRECISION;
        new_rate           DOUBLE PRECISION;
        client_data        RECORD;
    BEGIN
        -- Get client data, or use defaults if it doesn't exist
        SELECT * INTO client_data
        FROM rate_limiter
        WHERE rate_limiter.client_id = check_rate_limit.client_id;

        IF NOT FOUND THEN
            client_data := (client_id, now_ts - period, 0.0)::rate_limiter;
        END IF;

        -- Calculate the interval since the last update, in units of the averaging period
        time_interval := EXTRACT(EPOCH FROM (now_ts - client_data.last_update_time))
                         / EXTRACT(EPOCH FROM period);
        time_interval := GREATEST(time_interval, 1.0e-10);

        -- Calculate alpha (exponential smoothing weight)
        alpha := EXP(-time_interval);

        -- Calculate the instantaneous rate implied by this request
        instantaneous_rate := cost / time_interval;

        -- Calculate the new average rate
        new_rate := (1 - alpha) * instantaneous_rate + alpha * client_data.average_rate;

        -- Ensure rare requests are counted in full
        new_rate := GREATEST(new_rate, cost);

        -- Check whether the rate limit is exceeded
        IF new_rate > limit_value THEN
            -- Time for the smoothed rate to decay back below the limit
            next_allowed_time := now_ts + (period * LN(new_rate / limit_value));
            allowed := FALSE;
        ELSE
            -- Update client data
            INSERT INTO rate_limiter (client_id, last_update_time, average_rate)
            VALUES (client_id, now_ts, new_rate)
            ON CONFLICT (client_id) DO UPDATE
            SET last_update_time = EXCLUDED.last_update_time,
                average_rate = EXCLUDED.average_rate;

            next_allowed_time := now_ts;
            allowed := TRUE;
        END IF;

        RETURN NEXT;
    END;
    $$ LANGUAGE plpgsql;

give that a whirl


Found it surprising that PHP doesn't have nice equipment for rate limiting.

It seems to be an every-man-for-himself kind of project?

My joke was to randomly ignore a percentage of the actions. Rate-limit the rate limiting.



