Instrumenting Python GIL with eBPF (coroot.com)
118 points by lukastyrychtr 3 months ago | 33 comments



Wow, 36ms per second is only 3.6% of the time. Python waiting on the GIL is then a pretty overblown problem, as this is not very significant. I wonder if this measurement could be run on apps built with various frameworks. I expect that with uvloop and all, the percentage would be even lower.


  > only 3.6%
  > GIL is then a pretty overblown problem, as this is not very significant
How do you conclude this? 3.6% of the time spent being locked seems pretty significant. As another user notes, this is 3-4 fewer machines you need to pay for out of every 100 you currently use.

But the problem is worse than that. That's 3.6% of the time spent waiting on the lock, not a 3.6% slowdown relative to what your program would take without the GIL. During that same time other work could be running. I've seen plenty of mundane and ordinary tasks become 10x to 1000x faster when comparing serial code to even unoptimized parallel code. I'd be really careful about drawing this conclusion unless you actually have experience writing parallel and/or optimized code.


people who think 3.6% doesn't matter don't have any imagination. ballpark whatever you imagine is FB's compute cost vis-a-vis pytorch and take 3.6% of that and ask yourself whether you'd be happy passing up that kind of bonus/pay-bump/....lottery winnings. and of course for many pytorch workloads (smallish kernels in a highly multithreaded environment) i'm sure that number can be higher than 3.6%.


Probably, but more likely a lack of knowledge and an overestimation of their own knowledge. People are famously bad with statistics, but people are also famously overconfident in their understanding of numbers and data. I mean, cloud providers really try to get some extra 9s on the end of their uptime. (People are also __terrible__ with rates of change, e.g. a 90% increase from last year.)

But I think an interesting one is also literally the most common example in many stats textbooks, at least the ones that cover Bayesian stats: that if a test is 95% accurate, it does not mean that scoring positive on the test means there's a 95% chance you have whatever the test is testing for.
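
To make the textbook example concrete, a quick sketch of the arithmetic (the 1% prevalence is an assumed illustrative number, not from the test itself):

    # "95% accurate" test for a rare condition; 1% prevalence assumed for illustration.
    prevalence = 0.01
    sensitivity = 0.95        # P(test positive | have condition)
    specificity = 0.95        # P(test negative | don't have condition)

    p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    p_condition_given_positive = sensitivity * prevalence / p_positive
    print(f"P(condition | positive) = {p_condition_given_positive:.1%}")   # ~16%, not 95%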

I think there's a great irony in the latter: I find the tech and engineering communities project a lot of confidence, especially about understanding numbers, yet they very often make this mistake[0], and frequently want to establish meritocracies (advocating for standardized testing and leetcode). But understanding the former should result in thinking the latter is ineffective.[1]

--------------------------------

[0] I particularly like 3Blue1Brown's introduction to Bayes' theorem, as it includes the Kahneman and Tversky questions AND (most importantly) discusses how, when the questions are reframed, people's accuracy can dramatically swing from overwhelmingly wrong to overwhelmingly correct. I like this because the hardest thing in doing statistics (and much of science) comes down to exactly this: understanding the assumptions you are making, and more specifically, the hidden ones. https://www.youtube.com/watch?v=HZGCoVF3YvM

[1] I think this one is worth working out a little, so let's apply Bayes to a proxy problem (for illustration). It's a good way to start understanding how metrics can undermine meritocracies, and it illustrates how people lie with statistics despite never saying anything that is actually untrue (which is why it is so common, and why lying with stats/data is often unintentional).

Let's let monetary wealth represent our measure of success, assume that college entrance exams (and the ability to graduate) are purely meritocratic (i.e. "I went to Stanford" -> "I'm smart"), define a $100m net worth as ultra wealthy, assume that graduating from a college gives a person just intelligence and that the connections made there are not significantly influential to success (this one is actually key, but "left to the reader as an exercise"), and assume your final net worth is purely due to your own work/efforts. We're using this because of the frequency with which the awarding institution is used to imply one's value/worth/skill (your cousin/friend/boss/coworker who brags about being an "x" graduate, or parents showing off their kid's intelligence, etc.) and because admittance to these institutions correlates very strongly with standardized test scores.

The top 8 schools here are Harvard, MIT, Stanford, UPenn, Columbia, Yale, Cornell, and Princeton, which represent 7%, 5%, 5%, 4%, 4%, 4%, 3%, and 3% of the US ultra wealthy, respectively. Together these represent 35% of all of the US's ultra wealthy, of which there are 10,660 today (~28.5k globally). So 746, 533, 426, and 320 people for the 7%, 5%, 4%, and 3% shares, respectively. (Note we now have to assume there isn't double/triple counting: e.g. undergrad at Harvard, master's at MIT, PhD at Stanford.) Each year these schools graduate roughly 10k (Harvard is a bit confusing), 3.5k, 5.2k, and so on. Now I don't have data for when those people graduated from those schools, so we'll describe a function to help us estimate. Using Harvard, we have 746 people / (10k graduates * number of years). I was able to find that the median age of this group is >55 (but also evidence of a bimodal distribution), so let's naively assume a 10-year period, as this still gets us a pretty conservative, lower-bound estimate (as well as assuming no dupes). So now we have that Harvard represents 7% of US centimillionaires, but only (and this is an absurd overestimate) 0.7% of Harvard graduates are centimillionaires.

Now we see a framing issue. Harvard tries to show its value by advertising that it is the highest producer of centimillionaires, citing this 7% number. But when we consider context (i.e. the number of graduating people), we find that far fewer than 1% of graduates make it to that level of success, which suggests that a Harvard education has little to nothing to do with said success. Even if we try a clearer-cut case like MIT, where there are 2k engineering degrees awarded a year (200 arch, 100 hum/art/soc. sci, 1k management, 500 sci, etc.), then we have 533/(2k * 10), or about 2.7%, and this still does not imply that MIT has a large effect on you becoming ultra wealthy, because it is the exact same problem as the intro medical test example: the base population is large and the event you're testing for is rare. In fact, you can do this type of analysis for any test and mark of success and you'll probably end up with very similar results. The underlying reason is that there are many factors that actually lead to success and no single one of them is a significant indicator; it is the agglomeration of weakly influencing events that accumulate over time. Of course these numbers still have utility, and this doesn't suggest you shouldn't get a degree or that you shouldn't aim for a top school, but they have to be understood in context, and that context is often complex. Removing the context is how one lies with statistics/data, and the most common person that lie is told to is yourself.
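
A rough sketch of the base-rate arithmetic above, for anyone who wants to poke at it (every number is an estimate from this comment, not verified data):

    # Base-rate flip sketch; all figures are the comment's rough estimates.
    us_centimillionaires = 10_660
    harvard_share = 0.07                                    # share of US ultra wealthy
    harvard_centis = harvard_share * us_centimillionaires   # ~746 people
    grads_per_year = 10_000                                 # assumed graduates per year
    years = 10                                              # assumed window
    p_centi_given_harvard = harvard_centis / (grads_per_year * years)
    print(f"{harvard_share:.0%} of centimillionaires, "
          f"but only {p_centi_given_harvard:.1%} of graduates")   # 7% vs ~0.7%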


It may not be very significant in this context.

But the author literally starts by explaining that that's a typical argument for webservers because they're mostly I/O bound. Anyone working with code that's more CPU bound will have very different numbers, and interpret them differently.


> for webservers because they're mostly I/O bound

I'm not a web person, but can't you also gain I/O throughput from parallelization? Being I/O bound means waiting on responses, right? So parallelization should increase I/O throughput because you can make multiple asynchronous requests, and even if there is dependence you can often stage or do partial computation in the meantime (at least this is common in scientific computing). (And if it's disk, well, reading/writing to disk in parallel is far faster than serial, but idk why you'd read/write to disk with pure Python. Though it seems people do.) So wouldn't this have significant effects that are more than the 36ms that we see in TFA? Or am I missing something? Can someone explain why my guess is wrong?


The GIL doesn't prevent all types of parallelization. You have always been able to use threads or to use async code to make I/O happen in parallel.
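
A minimal sketch of that with threads (the URLs are placeholders): CPython releases the GIL during blocking socket calls, so a plain thread pool already overlaps the waits.

    # Sketch: threads overlap I/O waits even under the GIL, because the GIL is
    # released during blocking socket reads. URLs are placeholders.
    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    urls = ["https://example.com"] * 10           # placeholder endpoints

    def fetch(url):
        with urllib.request.urlopen(url) as resp:  # GIL released while waiting
            return len(resp.read())

    with ThreadPoolExecutor(max_workers=10) as pool:
        sizes = list(pool.map(fetch, urls))        # the 10 requests wait concurrently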


Are you sure, though, that being more CPU-bound implies more waiting on the GIL? CPU-bound Python in my experience means libraries, e.g. numpy, that are well designed and release the GIL.
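
Rough illustration of that point (array sizes are arbitrary): the heavy NumPy call releases the GIL, so even "CPU-bound" threaded Python can keep several cores busy.

    # Sketch: NumPy's BLAS-backed matrix multiply releases the GIL, so threads
    # doing this kind of work can run roughly in parallel despite it.
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def work(_):
        a = np.random.rand(1500, 1500)
        return (a @ a).sum()                      # GIL released inside the BLAS call

    with ThreadPoolExecutor(max_workers=4) as pool:
        totals = list(pool.map(work, range(4)))   # scales across cores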


If you are interested PEP703 describes the scenarios pretty well: https://peps.python.org/pep-0703/#motivation


I just wrote a post about how CPython is much faster without the GIL: https://news.ycombinator.com/item?id=40988244


I mean, only the threaded version, which is expected. For tons of cases Python without the GIL is not just slower, but significantly slower; "somewhere from 30-50%" according to one of the people working on this: https://news.ycombinator.com/item?id=40949628

All of this is why the GIL wasn't removed 20 years ago. There are real trade-offs here.


30-50% is an understatement. The latest beta is more than 100% slower in a simple benchmark:

https://news.ycombinator.com/item?id=41019626


How is single-threaded code slower without GIL?


Because in the --disable-gil build, things the GIL used to protect, like reference counting and access to dicts, freelists, etc., need atomic operations or locks, even when there is only a single thread.
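
A trivial single-threaded microbenchmark you could run under both a regular build and a --disable-gil (free-threaded) build to compare the per-operation overhead yourself; results vary a lot by version and platform, so this is only a comparison harness, not a claim about exact numbers:

    # Run the same script under a regular and a free-threaded build and compare.
    # Dict lookups in a tight loop stress refcounting and container access.
    import timeit

    setup = "d = {i: i for i in range(1000)}"
    stmt = "for i in range(1000): d[i]"
    print(timeit.timeit(stmt, setup=setup, number=10_000))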

This is the reason why previous attempts were rejected. But those attempts came from single individuals and not from a photo sharing website.

This matters if --disable-gil becomes the default in the future and is forced on everyone.


That cannot be the reason for a 30-50% slowdown. Uncontested locks are very fast.


They may be fast in C++, but not in the context of CPython. Here are the dirty details. Note that fine-grained locking has also been tried before:

https://dabeaz.blogspot.com/2011/08/inside-look-at-gil-remov...


Thanks for the link, that's an interesting read. Actually the referenced PyMutex is a good old pthread_mutex_t, the same one you'd use in C or C++. But I shouldn't have written so confidently: although uncontested locks are very fast, if the loop is tight enough, adding locks will be significant.

However, PEP 703 specifically points out that performance-critical container operations (__getitem__/iteration) avoid locking, so I'm still highly skeptical that those locks are the cause of the 30-50%.

https://peps.python.org/pep-0703/#optimistically-avoiding-lo...


The pthread_mutex_t is focused on compatibility at any cost. So while you're right that the C++ stdlib chooses this too, it's not actually a good choice for performance.

But I think you're right to be sceptical that this is somehow to blame for the Python perf hit.


One of the things this spends some time on that was already obsolete in 2011 is using a pool of locks. In 1994 locks were a limited OS resource; Python couldn't afford to sprinkle millions of them through the codebase. But long before 2011 Linux had the futex, so a lock only needs to be an aligned 32-bit integer. In 2012 Windows got a similar feature, and it can use bytes instead of 32-bit integers if you want.

If a Linux process wants a million locks, that's fine; that's just 4MB of RAM now.


That's roughly 52 minutes out of a full day, or about 6 hours a week, spent waiting on the lock! As the author states, it's highly context dependent, but in some cases this is a lot of time that could go to other work if it can be reduced.


I'm not sure your conclusion is a fair take. In the app I work on, GIL acquisition would easily take 2-3x longer than the postgres queries which would subsequently be issued.


If you're running your application on 100 servers, that's potentially 3-4 entire machines you no longer have to pay EC2/S3/etc fees for, or it's more scaling headroom because you've decreased load across your whole fleet by 3.6%.

It's "small", sure, but in production performance issues are often "death by a thousand cuts" situations, so a 3.6% reduction is a big win compared to optimizations that are often in the 1% range.


I think what would also be very interesting in this case is to get a latency histogram. This 36ms per second could be 1000 stalls of 36 microseconds, which might not be significant for your app, or it could be one giant 36ms stall... which can be an issue.
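
Toy illustration (made-up numbers) of why the distribution matters: the same 36ms/s budget gives a very different worst-case stall depending on how it's split.

    # Same total GIL wait per second of wall time, very different tails. Made-up data.
    even  = [0.036] * 1000      # 1000 stalls of 36 µs each, in ms
    spiky = [36.0]              # one 36 ms stall

    for name, stalls in [("even", even), ("spiky", spiky)]:
        print(name, "total:", round(sum(stalls), 1), "ms,",
              "worst stall:", max(stalls), "ms")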


I don't really agree with the characterization. That can matter at scale, and the other thing to consider is latency.


Throughput vs Latency


Here is another post describing an alternative approach to instrumenting the Python GIL: https://www.maartenbreddels.com/perf/jupyter/python/tracing/....


> Every Python developer has heard about the GIL

Sadly, that is not the world we live in.

I've cleaned up dozens of applications written by people with flawed understandings of threads, multiprocessing, and asyncio. I don't even blame the developers for this; it's a glaring language design problem.

If you need parallelism, Python is not the language you should reach for. Nobody ever takes my word for it until it's release day and the product is a broken pile of spaghetti code.


How do you feel about the GIL-free option coming in Python 3.13?


It looks like it will be some time before free-threading can be safely used in production. I don't want to have to worry about whether the underlying C code supports no-GIL mode, or trade off single-threaded performance. [0]

Maybe someday this will make parallelism in Python at least as sane as other languages, but in the meantime, I still want to use a compiled language any time performance matters and wait for the kinks in no-GIL to be ironed out.

[0] https://docs.python.org/3.13/whatsnew/3.13.html#free-threade...


Same. I "inherited" an app doing crazy slow sync IO from their "async" functions... Sure, there was often some async IO mixed in, too, but what's the point?


[flagged]



[flagged]


What browser do you use? I can see the link to the settings.


Chrome on Android, but it seems to be working now. Either they fixed it, or my browser was glitching earlier.



