> So rather than comparing mean performance, we'll compare minimum performance.
If I'm understanding correctly, the new test is based on a single data point from each group, rather than an aggregate statistic (like mean). I'm no statistician, but it seems like this data would have far too much variance and noise for this to be a useful test.
The minimum performer could be someone who had a sudden personal crisis. Or who had 10 competitors suddenly pop up. Or any number of other circumstances outside their control. The minimum performer is, almost by definition, an outlier. It doesn't seem rational to suppose that an outlier is representative of the group.
I can understand that statistically this test may be more rigorous. In practice I would expect it to be less rigorous. Because the assumption it makes (that a single outlier is representative of the group) seems even more dubious than the assumptions required for Paul's original idea.
The sample minimum (or maximum) is not an inherently unstable statistic. If there is sufficient density in the distribution near its minimum, the sample minimum can be quite robust. For example, consider that the maximum likelihood estimator for the upper bound of a uniform distribution is simply the sample maximum, and the minimum-variance unbiased estimator is also based on the sample maximum[1]. (This method was used by the Allies in World War 2 to estimate the total number of German tanks by sampling the serial numbers from destroyed tanks[2].)
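For anyone who wants to poke at the tank example, here is a minimal sketch of the sample-maximum estimator for serial numbers drawn without replacement from 1..N. The true population size, the sample size, and the function name `mvue_max` are all made up for illustration.

```python
import random

def mvue_max(serials):
    """Unbiased estimate of N from k distinct serials drawn from 1..N.

    Uses the sample maximum m: estimate = m + m/k - 1.
    """
    k = len(serials)
    m = max(serials)
    return m + m / k - 1

random.seed(0)
true_n = 1000  # hypothetical number of tanks
# Repeat the "capture 20 tanks" experiment many times and average the estimates.
estimates = [
    mvue_max(random.sample(range(1, true_n + 1), 20))
    for _ in range(2000)
]
mean_estimate = sum(estimates) / len(estimates)
```

Averaged over many repetitions, `mean_estimate` should land close to `true_n`, which is the sense in which the estimator is unbiased.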
Of course, a real thresholding process would not be perfect, so the lower bound of the distribution of accepted candidates would not be a perfect vertical cutoff as in the examples. Just like any process that adds additional variation to the data, this would reduce the statistical power. You could accept more bias in return for lower variance in your test by taking, say, the 5th percentile instead of the sample minimum as your test statistic. (You can think of the sample minimum as the zeroth percentile.)
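A rough simulation sketch of that trade-off, under assumptions I picked arbitrarily (true scores uniform above a hard cutoff at 0.5, Gaussian measurement noise with standard deviation 0.1, 200 accepted candidates per sample). It compares how much the sample minimum and the 5th percentile jump around from sample to sample.

```python
import random
import statistics

def simulate_spreads(n=200, trials=2000, noise_sd=0.1, seed=0):
    """Return (std of sample min, std of sample 5th percentile) across trials."""
    rng = random.Random(seed)
    minima, fifth_percentiles = [], []
    for _ in range(trials):
        # True score sits above a hard cutoff at 0.5; the observed score adds noise.
        observed = sorted(
            rng.uniform(0.5, 1.0) + rng.gauss(0.0, noise_sd) for _ in range(n)
        )
        minima.append(observed[0])
        fifth_percentiles.append(observed[int(0.05 * n)])
    return statistics.stdev(minima), statistics.stdev(fifth_percentiles)

spread_min, spread_p5 = simulate_spreads()
```

In this setup the 5th percentile varies less between samples than the minimum does, at the cost of no longer estimating the cutoff itself.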
The robustness of a sample min or max for a continuous distribution is basically proportional to the density of the distribution at that extremum. A steep or vertical drop-off at the edge of the distribution is the ideal case.
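A quick way to see the density point is to compare the sample minimum of a distribution with a vertical edge against one with a thinning tail. This sketch uses a Uniform(0, 1) and a normal with roughly the same standard deviation; both choices and the sample sizes are my own assumptions.

```python
import random
import statistics

def spread_of_minimum(draw, n=100, trials=2000):
    """Standard deviation of the sample minimum over repeated samples of size n."""
    minima = [min(draw() for _ in range(n)) for _ in range(trials)]
    return statistics.stdev(minima)

rng = random.Random(1)
# Uniform(0, 1): full density right up to the lower edge.
uniform_spread = spread_of_minimum(lambda: rng.uniform(0.0, 1.0))
# Normal with a similar overall spread: density fades out toward the minimum.
normal_spread = spread_of_minimum(lambda: rng.gauss(0.5, 0.2887))
```

The minimum of the uniform sample is far more stable than the minimum of the normal sample, even though the two populations have about the same overall variance.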
The article gives a formula for the statistical power of the hypothesis test derived from the sample min. It depends on the function h(x), whose purpose is to establish a lower bound on the density of the distribution at the min, and hence a lower bound on the robustness of the sample min as an estimator.
If I'm following his argument correctly, he assumes that there is a hard cut-off at some unknown point in the performance metric being measured, below which no-one is selected and above which all or most people are, and tests whether that hard cut-off point is the same for the two groups. It's going to be an accurate test in that scenario, but that's not a good model of reality. This is especially true when the performance metric being used is noisy and measured after the people are selected - which is exactly the scenario we're actually interested in!
Yup. The author mentions that noise is a serious problem for this method in the article, and talks about some ways he looked at trying to reduce it, but didn't come up with a good one.
That would probably be a sensible stat for measuring marginal applications to a school. But when it comes to measuring portfolio startup companies, none of the bottom decile are expected to be worth anything in the medium run, and all the results that actually matter are in the top quartile...
One other thing which both this and PG's original theory get wrong:
Their basic premise is wrong, if bias continues to exist after the selection event in question.
For example, if YC had (hypothetically) a real bias against black or women entrepreneurs, it is almost certain that future funding rounds, as well as all possible exit scenarios, would exhibit very much of the same bias.
In which case, the future "performance" of those candidates would be poor, and by PG's definition unbiased even though the only meaningful result is that YC is no more biased than subsequent performance evaluations.
Let's assume that bias does persist past selection through the duration of the program. Does that change the interpretation when you look at First Round Capital's data that shows its female founders outperforming the males by 63%? I don't think it does.
The test may not be sufficient to prove that you have no bias, but it may be good enough to prove that you do. When it does indicate bias, it seems likely to be correct.
To put it another way, if it is 1948 and the only three black people in Major League Baseball are all superstars, then the distribution of baseball skill among black players is extremely unbalanced or there is a lot of bias keeping the average and moderately-better-than-average black players out.
The interpretation is still somewhat unreliable, as in many cases there's still another hypothesis, which is that $UNDERREPRESENTED_MINORITY actually overperforms due to favourable treatment after the selection process.
Of course favourable treatment can't make people into superstar startup founders or baseball players (and I'm sure any special treatment afforded to black baseball players in the 1940s was the complete opposite of favourable). But more generally it can make an organisation with fair selection processes look like it sets a higher bar for $MINORITY because it addresses low numbers by being very keen to promote and very reluctant to fire/deselect members of said minority, so these kind of studies still have to be considered with care.
(Of course, even if an organisation is proactively treating a minority group favourably after selection, that doesn't mean that conscious or unconscious biases don't exist in the selection process.)
There's one thing I don't get though. Being biased often means valuing something that is common among those you favor, but rare among those you don't. So even if one group outperforms the other, why would that in a practical scenario prove that your next candidate should be of that group?
It seems entirely possible that if your program is "narrow" enough you could exhaust the pool of more successful candidates from a minority group. Of course this would be a lot more plausible if you're biased.
This isn't actually a flaw in the theory. The theory is designed to measure whether the selection process is inaccurately predicting outcomes. So an unbiased selection process will account for post-selection factors - including the fact that VCs might not like a person.
It's also worth noting that an unbiased prediction is not necessarily "fair" in the colloquial sense. For example, I've seen data suggesting that an unbiased prediction of college outcomes would actually penalize black applicants, since black applicants underperform relative to their SATs and college grades. (The person who had this data was very careful not to draw this conclusion in the publication - career limiting move, as they say.)
So a fair selection process which looks only at high school grades/SAT/etc might actually be biased as a statistical decision procedure.
The theory presumes a reliable way of measuring "actual performance"; that is, the quantity against which the selection process is supposedly biased. That's a limitation, but I wouldn't say it makes the test "wrong".
It does mean that maybe monetary earnings or anything else sensitive to later-round bias are not the thing to use to measure candidate performance, at least if you're doing this for the social utility.
Of course, if you're only in it to make money, and you're only in charge of the first round... then you really do want just an unbiased evaluation of the (biased) future earnings prospects. So in that case using raw earnings would be correct...
True - though this means only that the signal (outperformance by a group that is the target of discrimination) might not be present (if discrimination continues past initial selection as you point out).
However, the test may still be useful to help confirm bias. If outperformance is observed, you can infer one of 3 things is true:
1) there is bias at initial selection but not after (or at least reduced bias)
2) members of the outperforming group are simply stronger performers (different but still interesting)
3) there is no bias at selection but there are affirmative action effects after the initial selection (not obvious why this would be the case)
This seems so obvious that it's possible we're missing something. It seems, at best, "A Way to Diff Your Bias", which isn't the same thing as detecting your bias at all. No one has the goal (I hope) of aligning their negative biases with others.
It's reasonable to think that the bias would decrease over the company's lifetime. In early rounds there is little data on the company, so more decisions are made on hunches, and there's a lot of potential influence for bias. While there's potential for that in later rounds too, there's also a lot more objective information. The company is either making money or it isn't.
The importance of relationships to the funding round also plays a role. If you get as far as an IPO, it seems unlikely that the stock-buying public is going to stay away because the founders are female/black/etc.
Dropping outliers can be done when outliers cloud the analysis, but doing this in an analysis of startups is inane since startup investors' entire goal is to find outliers.
Possible. In this case, we're not looking for outliers or measuring based on financial success, but trying to tell if the VC is systematically biased anti-woman.
It's not clear that dropping outliers is a bad idea there. It's also not clear it's a good idea, granted.
Well, if you are trying to measure whether men founders or women founders you have funded on average make more, then you would have to include Uber. The real issue with the analysis is that results are unlikely to be statistically significant due to small samples and high variance, which means they are useless.
It's still BS. Outliers are a signal that you don't have a simple, nicely decaying distribution.
The right way to deal with outliers is to use a method that acknowledges their existence, not to ignore them. For example, if outliers destroy your OLS linear regression, it's because your error is not normal. That means you need to do Bayesian linear regression with a non-normal error term, not just throw them away.
Depends. Throwing outliers out without thinking is obviously wrong. In many instances outliers can be just invalid measurements and you should ignore them.
> In many instances outliers can be just invalid measurements and you should ignore them.
signal[i] = value[i] + noise[i].
If you know that value[i] == NaN, then by all means throw out signal[i]. If value[i] != NaN, then you're better off modeling error[i], and using that model to give you information about value[i] as yummyfajitas suggests.
This is trivial to see if noise[i] == 0, but for some reason becomes progressively harder for people as noise[i] increases.
Statistics 101. When you have samples you throw away the highest and lowest member, to counteract some random occurrence. The mean net worth of the patrons in any restaurant Carlos Slim frequents rises substantially when he is there.
Yes, because that is how mean net worth is defined. I don't see specifically what that argues against, except that mean is not the best indicator to use in all situations; perhaps a different indicator is appropriate, such as the income per patron by percentile. 100th percentile will be Carlos Slim, but 99th percentile and lower will be other patrons.
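A tiny illustration of the mean versus a percentile-style indicator, with invented net worths (twenty ordinary patrons and one extremely wealthy one):

```python
import statistics

regulars = [100_000] * 20                        # hypothetical ordinary patrons
with_billionaire = regulars + [50_000_000_000]   # one hypothetical extreme patron

mean_without = statistics.mean(regulars)
mean_with = statistics.mean(with_billionaire)
median_without = statistics.median(regulars)
median_with = statistics.median(with_billionaire)
```

The mean moves by several orders of magnitude when the extra patron walks in, while the median does not move at all. Neither number is wrong; they answer different questions.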
If Carlos Slim actually does frequent the casino, then his attendance is an important part of understanding the situation.
That, and the fact that outliers can often be discounted due to measurement/instrumentation error.
Moreover, the fact that Carlos entered your restaurant may be a significant event depending on the analysis that you're attempting to do. So you need to have a good rationale for dropping outliers, and you should probably also watch for bias when dropping outliers that don't support your hypothesis!
Very nicely done Chris, but the basic problem with Paul’s analysis is not the mathematics (this can be fixed as you have shown), but the underlying data. Any data set you could get to measure bias in the start-up world is too small and messy to tell you anything useful. No matter how sophisticated your analysis, if the data is garbage then all you will end up with is garbage.
This doesn't even consider the problem of data dredging, which First Round Capital engaged in.
Maybe First Round's data set is too small or messy to get meaningful results, but the entire startup world has plenty of data to potentially pull some meaningful conclusions about bias.
I wonder if all YC companies would be enough data points to learn something useful. Or maybe grab a large swath of VC-funded startups including First Round's investments and many other top firms.
Most of the raw stats in Chris's post were above my head, but I'd love to see this applied to a larger data set of fundings.
The basic problem is getting hold of good data. Most VC rightly consider their data in this area very valuable and they are not going to part with it easily. Not even Paul delved into YC’s data.
More fundamentally even if you could get enough data, the data is just too messy to analyse and draw any valid conclusions.
Yeah, a minima-ish statistic (even one less sensitive to noise) is probably not going to work for VC. Noise isn't the only issue - in VC the minima will always be zero unless a VC only picks winners.
A test in this general direction (but which handles noise) is much better suited for answering questions like "are colleges biased against Asians". In that case you have a pretty clear output (college GPA) which very rarely reaches zero.
> The idea is generally correct - bias in a decision process will be visible in post-decision distributions
I find what's wrong with the idea more fundamental, that it talks only about the 'selection process' but in fact bias that impacts success or failure can come at other points.
This is really important. Let's say the whole VC ecosystem is biased against redheads (just to pick a random group). What would happen is the redheads would underperform other groups as they were discriminated against at each stage of the VC lifecycle. They would not show up as a group that overperforms later. The only bias you can detect using Paul’s approach is bias that only applies at the initial stage and not later.
To take your example one step further, the redhead performance would be held up under pg's rubric as evidence that non-redheads are biased against and so redheads may fall into a vicious circle of deepening discrimination.
Just to cross the beams of pedantry here for a moment, a widespread and well known --- if less than serious or systemic --- cultural/social bias against red haired people, probably first coming into public consciousness in North America due to the infamous South Park 'Ginger' episode, has in fact primed you to select "redheads" as a non-contentious example of a plausibly ethnic group that might be discriminated against, something that every red-haired person knows, although you apparently do not. This means that the choice is statistically insensitive and the social methodology is poor.
Actually, choosing an identifiable group at random would be both socially and statistically unwise, as, following Pareto distribution, there are vastly more minority/extreme minority distinguishable groups of people than there are majority/significant minority ones; this means, firstly, that any group randomly selected with equal biasing between all groups has a high probability of being subject to actual discrimination, mooting any social benefit of choosing a group at random; secondly, that the generalizable qualities of the group chosen would therefore have a distribution with very little deviation (if I'm using my terms correctly) and would be highly predictable, thereby obviating any possible statistical benefit of doing so.
Alternatively, I'd be OK imagining that it was subconsciously chosen here not at random, but because I used this reference for an example in the previous thread.
Is South Park another journal worth reading? Are they open access?
I think this is the first post that's a DH5 on pg's How to disagree scale (http://paulgraham.com/disagree.html). Not only that, the OP is charitable enough to explicitly state why it's not a DH6:
Paul Graham wrote an article about an idea. The idea is generally correct - bias in a decision process will be visible in post-decision distributions, due to the existence of marginal candidates in one group but not the other. But the math was wrong. // That's ok! Very few ideas are perfect when they are first developed.
I'm not good enough at statistics to check that OP's math is sound, but this is the mindset of a scientist. OP reasons rigorously, finds a way to salvage the core insight and improves on it. As readers can see, it took quite a lot of work and prior knowledge to do.
If I were pg I would consider putting a link to this post on both the disagree.html and bias.html as a note for posterity.
To be fair pg's scale isn't perfect. DH1-DH5 is something you can improve if you want to refute a given argument. DH6 is about choosing the "main" argument you want to refute. There is nothing to improve between DH5 and DH6, especially if you agree with the main argument but you disagree with some minor argument.
I think this is a really good start for the most common types of bias. A few counter-examples that might slip through the cracks of this test:
Only examining the sample without looking at the population of applicants has its limits. Especially as multiple interviews become the norm, filters that don't affect the distribution of outcomes will be missed. For example, the person screening resumes might weed out anyone with an ethnic-sounding name. A different person, who is not biased, interviews the candidates. The quality of the candidates accepted will be the same, but the number of minority applicants will be smaller than it should be.
Measuring outcomes allows for external biases to distort the results. Start with a company that is biased against women, so that the average female founder is better than the average male. However, that same level of sexism exists in the market, such that the company's performance is hampered due to prejudice against the founder. The VC's bias would be hidden by the counter-bias in the market.
In the earlier thread, it seemed like some people were reaching different conclusions because they were using different definitions of "bias". I think my working definition would be something like "there existed in the actual applicant pool a subset of unfunded female founders who should have been statistically expected (given the information available to the VCs at the time of decision) to outperform an equal sized subset of male founders who did in fact receive funding".
Alternatively (and I don't think equivalently?) one could reasonably take bias to mean "Given their prejudices, if the same VC's had been blinded to the sex of the applicants, they would have made funding choices resulting in higher total returns than the sex-aware choices they actually made." I'm sure there are many other ways of defining "bias". Could you define what would need to be true for your test to show that "the VC process is biased against female founders"?
My definition is the same as yours - it's exactly about the existence of rejected women who are better than accepted men.
This particular test is terrible for VC since the min return in VC will always be zero. But if you build a noise-sensitive version for something like college admissions, what needs to be true is a) bias manifests as raising/lowering the bar for one group relative to another, and b) both groups have a significant number of members near the cutoff.
As an example of the type of bias this test would detect, consider U-Michigan's point system [1]. An extra +1.0 GPA was added to black applicants. I.e. an Asian person with 3.9 GPA and black person with 2.9 GPA were equivalent. This would result in Asian people having a higher min GPA than black people.
[1] They replaced the point system with vague human heuristics when the supreme court said point systems can't be racist, but vague heuristics can.
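A minimal sketch of how a bar shift like that shows up in the sample minimum. The 3.9 cutoff and the +1.0 bonus echo the numbers above; the uniform GPA distribution, the pool size, and the function name are assumptions of mine, and both groups are given identical applicant pools so the only difference is the bonus.

```python
import random

def min_admitted_gpa(bonus, cutoff=3.9, n_applicants=1000, seed=0):
    """Lowest GPA among admitted applicants when `bonus` points are added."""
    rng = random.Random(seed)
    # Hypothetical applicant pool: GPAs spread uniformly between 2.0 and 4.0.
    gpas = [rng.uniform(2.0, 4.0) for _ in range(n_applicants)]
    admitted = [g for g in gpas if g + bonus >= cutoff]
    return min(admitted)

min_without_bonus = min_admitted_gpa(bonus=0.0)
min_with_bonus = min_admitted_gpa(bonus=1.0)
```

The group without the bonus ends up with a minimum admitted GPA just above 3.9, the group with the bonus just above 2.9, which is the gap the test is designed to pick up.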
Maybe I'm missing something but this seems like a pretty perfect application for the bootstrap - a remarkably intuitive but powerful framework. Without loss of generality, imagine that you have two populations, A and B, and that you want to test some hypothesis about a statistic of A being different from a statistic of B (mean, in this case). Using the simplest form of the bootstrap you would do the following:
1. Pool and randomly label the data from A and B
2. Sample with replacement and form two partitions of the same cardinality as the original A and B groups
3. Compute the differences in mean
4. Rinse and repeat millions of times to form a distribution of mean differences
5. Check if the observed difference in means (from the true A/B labels) is statistically significant relative to the distribution found in (4)
This has some problems with fat tailed distributions but tends to work great otherwise. It's so simple that it avoids a host of pitfalls that can arise with other resampling schemes (what's being proposed is a type of resampling), and I love that it makes basically zero assumptions on the underlying data.
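The five steps above can be sketched roughly as follows. This is a sketch of the procedure as described, not a vetted implementation: the function name is hypothetical, it uses thousands of resamples rather than millions, and it reports a two-sided p-value with the usual +1 correction, which is my own choice.

```python
import random

def pooled_bootstrap_pvalue(a, b, n_resamples=10_000, seed=0):
    """Two-sided p-value for the difference in means between groups a and b."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)          # step 1: pool, ignoring the labels
    extreme = 0
    for _ in range(n_resamples):        # step 4: repeat many times
        # Step 2: resample with replacement into groups of the original sizes.
        fake_a = rng.choices(pooled, k=len(a))
        fake_b = rng.choices(pooled, k=len(b))
        # Step 3: difference in means under the "labels don't matter" assumption.
        diff = sum(fake_a) / len(fake_a) - sum(fake_b) / len(fake_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Step 5: how unusual is the observed difference?
    return (extreme + 1) / (n_resamples + 1)
```

Calling it with two lists of returns (or any other performance numbers) gives a p-value; a small value suggests the observed difference in means would be unusual if the group labels carried no information.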
That post shows that what PG is doing is a first-cut effort at a statistical hypothesis test, but one that is vague on assumptions and gives no information on the false alarm rate.
In particular, in my post, we get to compare sample averages without making a distribution assumption. Indeed, we make no distribution assumptions at all.
Yes, distributions exist, but that does not mean that we have to consider their details in all applications!
Come on guys, this is distribution-free statistical hypothesis testing, and we should be able to use that.
So the alleged flaw in pg's reasoning is his assumption that the best applicants from two large groups of humans should turn out to create equally successful startups on average, if the selection process is not biased.
Is this really such an unreasonable assumption, given that pg restricts the applicability of his bias test to groups of equal ability distribution and that we can assume that both groups have roughly the same amount of capital at their disposal?
The question is if the "equal ability" qualification is sufficient to make sure the distributions are roughly similar. But that is not a mathematical issue.
The point of the post is to relax the "equal ability distribution" assumption. If the distributions are identical, any disparity in outcomes must be caused by bias.
Honest question: suppose I pick a plausible sounding h(x), perform the proposed test, and get a vanishingly small p-value. So I feel pretty happy about rejecting the null hypothesis. But the null hypothesis is a conjunction: A group members accepted have cdf a(x), B group members accepted have cdf b(x), and h(x) satisfies various technical conditions related to a( ) and b( ). So when I reject the null, I'm saying the data were unlikely to be generated by the hypothesized process. Couldn't that simply be because I guessed the wrong form of h( )?
Thanks for the write-up Chris. Now I understand why I couldn't follow the path of logic you were laying out in our original discussion in PG's article's comments.
The main problem I was having is that you are assuming our observation variable is the latent skill or potential value variable (which you're calling x here). However, the article by PG was talking solely about the average of returns (let's call it y).
So the reason I was confused is that, assuming that the outcome of a startup is dependent only on x, we are really observing y ~ f(x) = \int_0^1 g(x)h(x)dx, where h is your cut-off criteria for x, g(x) is some unknown payoff distribution for a given skill level, and I'm assuming our x is in [0,1] without loss of generality. So in essence, the real problem here, even if you could see all of the individual returns for a given portfolio, is that you have to perform a very, very difficult deconvolution problem. And I'm pretty sure it's non-identifiable without some other information or additional parametric assumptions.
Thinking out loud a bit, let's assume that y is actually log(return), where a return of 1 is breaking even and 0 is losing everything. Since log(0) is undefined, most startups return 0, and very few exit for less than 1, I would think we could model this as a point-inflated normal distribution: p(y) = c * \delta_0 + (1-c) * N(\mu, \sigma^2). Given this, we could then model our latent parameters (c, \mu, \sigma) as being functions of x. Since the model is separable, we can even just look at the zeros and non-zeros in isolation. Then we can come up with a test from there, but I'm not really sure what that test would be at this point. Anyway, that's a completely different line of thinking, but it seems much more tractable in practice.
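To make the "look at the zeros and non-zeros in isolation" idea concrete, here is a minimal sketch that fits the three parameters from a list of returns. It ignores the dependence on x entirely, treats total losses as the point mass, and uses made-up true values for the simulated data; `fit_point_inflated` is a hypothetical name.

```python
import math
import random
import statistics

def fit_point_inflated(returns):
    """Estimate (c, mu, sigma) from raw returns.

    c is the share of total losses (return == 0); mu and sigma are the mean
    and standard deviation of log(return) among the survivors.
    """
    log_returns = [math.log(r) for r in returns if r > 0]
    c = 1 - len(log_returns) / len(returns)
    return c, statistics.fmean(log_returns), statistics.stdev(log_returns)

rng = random.Random(0)
true_c, true_mu, true_sigma = 0.6, 0.5, 1.0   # hypothetical parameters
simulated = [
    0.0 if rng.random() < true_c else math.exp(rng.gauss(true_mu, true_sigma))
    for _ in range(5000)
]
c_hat, mu_hat, sigma_hat = fit_point_inflated(simulated)
```

Because the model is separable, the zero share and the log-normal part are estimated independently, and on simulated data the estimates come back close to the values used to generate it. Turning this into a bias test is the part that remains open.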
It's since been dressed up with some mathiness, but this idea was originally proposed in the comment threads of pg's original article. [0] See the responses there for a few reasons why it just won't work.
To be concrete, assuming "performance" is measured as return on investment, min(performance) will always go to -100% (i.e., bankruptcy) with a large enough sample size.
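The arithmetic behind this is short. Assuming each investment independently ends in a total loss with some probability p (the 30% used below is an invented figure), the chance that a portfolio of n companies has a minimum of -100% is 1 - (1 - p)^n:

```python
def prob_min_is_total_loss(p_fail, n):
    """Probability that at least one of n independent investments goes to zero."""
    return 1 - (1 - p_fail) ** n

# Hypothetical 30% failure rate, portfolios of increasing size.
for n in (1, 10, 50):
    print(n, prob_min_is_total_loss(0.3, n))
```

With ten companies the probability is already above 97%, so the sample minimum carries almost no information about where the bar was set.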
To use maths/statistics to reason that YC is not biased against certain groups of applicants is amusing. But to even consider that technical female founders are weak candidates is disappointing.
The sample population was chosen by a specific type of group of partners. There are no female technical partners in the group. As a female technical founder, I am not interested in building a 'tea-making bot' or a sandwich-making bot, or in selling organic condoms. IMHO we have different views in looking at problems and solving them. Without having a female technical founder as a partner, YC would be perceived to be biased.
The algorithm of selecting promising candidates will vary once there is a variety of partners.
I think the more fundamental flaw in PG's argument, which is just as present here, is that it assumes the populations are otherwise identical. That's obviously not the case -- there's no random assignment for bias -- so this sort of test can't tell you anything direct about causation.
This test explicitly does NOT assume the populations are otherwise identical. See the graph right after Theorem 1 - it shows two unequal distributions satisfying the assumptions of the test. That's the whole point.
This still seems like nonsense to me. What if you reject each black candidate (independently) with probability 0.5, and then proceed to perform a fair interview process with all remaining candidates?
Surely the distribution of minimums would then be the same between all skin colours, but you end up employing half the number of black applicants that you should be.
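This scenario is easy to simulate. In the sketch below both groups draw skill from the same Uniform(0, 1) distribution and face the same cutoff; one group is additionally discarded by a coin flip before the interview. The cutoff, pool size, and function name are assumptions for illustration.

```python
import random

def run_process(pre_reject_prob, cutoff=0.8, n_applicants=10_000, seed=0):
    """Return (number accepted, minimum accepted skill) for one group."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_applicants):
        skill = rng.random()
        # Coin-flip rejection that has nothing to do with skill.
        if rng.random() < pre_reject_prob:
            continue
        # Fair interview: the same cutoff for everyone who gets this far.
        if skill >= cutoff:
            accepted.append(skill)
    return len(accepted), min(accepted)

count_a, min_a = run_process(pre_reject_prob=0.0)
count_b, min_b = run_process(pre_reject_prob=0.5)
```

The two minimums come out nearly identical while the second group's headcount is roughly halved, so a test built on the minimum would not flag this kind of bias.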
Posting this, since no one seems to have pointed it out:
> So rather than comparing mean performance, we'll compare minimum performance.
1. This is a useless metric for startup investors to use, since (almost surely) the minimum performance in every group of reasonable size will be 0 (the startup went out of business)... and this will be true even if the investor is biased.
2. The maximum statistic was rightly avoided here because for power-law distributed values (which startup returns are), you'd need to know the population sizes to estimate if the distribution of {A} was different than the distribution of {B}.
If you're willing to take on faith that both A and B have the same distribution, then the test is easy: is the acceptance rate for As significantly different than the acceptance rate for Bs? If you've invested in more than, say, 100 startups, you have a big enough sample to check this... this requires knowing the size of application pools, and who was accepted though.
3. I believe that in general it's not possible to determine a bias from the kind of aggregate statistic pg is discussing without at least some knowledge of the sample space.
For example, using OP's method, you will find that almost every selection process in the world is biased for you if you divide the world as {you} vs {non-yous} (you're doing significantly worse than the best non-you). And find that almost every selection process in the world is heavily biased against you if you use the minimum statistic (you're doing significantly better than the worst non-you). This is also true for smallish groups (eg {your friends} vs {not your friends}).
The same is true for PG's method -- it's highly unlikely that {you} fall exactly at the average value of {non-yous}, or that {your friends} fall exactly at the average of {not your friends}.
4. I believe that the math here is distracting from the core question.
Core question 1: Do men and women on average make the same choices?
If you believe that, then determining bias is easy: we already know who the investor funded. Is the number of men the investor funded different from the number of women? Yes? Then the investor is biased. This is much more direct than the kind of forensic accounting pg is proposing.
I suspect that pg didn't propose this test because pg doesn't believe that men and women on average make the same choices. He knows, for example, that the number of female applicants to YC is different than the number of male applicants (a gendered difference in behavior). Google "men and women career choices" or similar if you're interested in learning more, or better yet, read some first person accounts from FTM men about the cognitive effects of taking testosterone.
Since it's clear that there's a gendered difference before applying to YC, it seems very difficult to justify an assumption there would be no gendered difference in behavior after applying to YC (or any other investment firm, FirstRound in this case). Given that, the question we were asking becomes much more confusing... a simple bias towards ideas and plans you understand/agree with/are excited by is a gender bias in as much as your gender caused you to like the idea or plan. Removing that bias (supporting plans you understand less, agree with less, or are less excited by) seems like an obviously bad idea.
Returning to the problem: if we accept that this sort of "makes sense to me bias" can be observed when looking for gender biases, we are left in a really hard place. That bias seems to be both a good thing, and confounds the entire analysis. Unless you've controlled for the "makes sense" bias, such analysis will apply pressure for investors to waste money from their perspective. This seems obviously bad.
Core question 2: which biases do we want investors to have?
Investors who knowingly pass up good opportunities on the basis of the founder's gender are punishing themselves worse than any company they pass over -- their competitors who aren't gender biased will get higher returns, and so will have more money to invest in the future. This is to say that gender biases for startup investing are self-correcting. The investors already have their self-interest maximally aligned with not being sexist.
I don't pretend to know which companies are worth investing in more than any other smart technologist. I also don't pretend to know to what extent gender differences cause differences in returns, so my answer is: investors should be as biased (selective about investing) as they see fit. Startups are positive sum for society, and anyone who can find a way to fund more of them profitably is making the world better.
In large part, this is because I find it very unlikely that any modern investor is knowingly sexist -- I think it's much more likely that the sort of "makes sense" bias I discuss above is at play.
Of course, this is an early thought that came from first principles, so counterarguments solicited. Perhaps there is something deeply evil about passing over startups you don't feel comfortable investing in (assuming that comfort has any correlation with founder gender), or perhaps there's some easy fix which makes previously dicey-looking ideas from {other-gender} founders look like obviously good investments. (If you know what that idea is, I'd love to know it too).
5. Thanks to both pg and Chris for the fun math/philosophy problem. :)
This kind of thinking could be problematic. What would happen if someone compared the performance of whites and persons of African heritage at college?
They would detect if the acceptance criteria were biased. This is a good thing, since after you've measured something you know if a change is in order.
Lots of math in here premised on shaky foundations:
>Group A comes from a population where the chance of graduation is distributed uniformly from 0% to 100% and group B is from one where the chance is distributed uniformly from 10% to 90%
>The mean of group B is not lower because of bias (which would be reflected near x=80), but because the very best members of group B are simply not as good as the very best members of group A.
Yes, if we can assume some a priori knowledge about certain "groups" of people, then we can make a more "informed" decision. That's pretty much the definition of bias, isn't it? Paul Graham's point, as I understood it, was that those assumptions are often invalid. Therefore, bias could cause the market to undervalue someone or some company. Your counterpoint seems to be, "let's suppose those biases are legitimate."
>Unfortunately, using the mean as a test statistic is flawed - it only works when the pre-selection distribution of A and B is identical, at least beyond C
His argument is based on the proposition that different sexes/races have different market value profiles. He needs to demonstrate why that is the case before proceeding to heavy math.
> His argument is based on the proposition that different sexes/races have different market value profiles. He needs to demonstrate why that is the case before proceeding to heavy math.
Not really, his argument is that "PG's mean-post selection test (the 'PMST') is only valid if different sexes have the same distribution of abilities". If you or PG believe that the PMST is a valid way of showing that bias exists, the burden is on you to show that different sexes have the same distribution of abilities.
>You can use this technique whenever (a) you have at least a random sample of the applicants that were selected, (b) their subsequent performance is measured, and (c) the groups of applicants you're comparing have roughly equal distribution of ability.
So yes, OP is ignoring the entire premise of PG's argument.
OP acknowledges this... "Unfortunately, using the mean as a test statistic is flawed - it only works when the pre-selection distribution of A and B is identical, at least beyond C."
To me the rest of the article asks the question, "Requirement (c) is really strong; is there a way we can use post-selection statistics to determine bias while weakening (c)? What if we tried measuring the post-selection minimum instead of the mean?"
Also PG edited his essay to add that disclaimer only after WildUtah's comment, so it's possible that OP hasn't read the updated version.
No, he's not asserting any such thing. Again, it's a hypothetical. A counterexample that means, yes, the "mean" test is flawed. Because it doesn't work in all scenarios.
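For anyone who wants to check the hypothetical numerically, here is a rough sketch. It assumes the cutoff is 80 (as the quoted passage suggests) and that the selector applies the same cutoff to both groups, i.e. there is no bias. The function names, sample size, and seed are my own choices, not anything from the article.

```python
import random
import statistics


def select(population, cutoff):
    """Keep only candidates at or above the cutoff (same cutoff for everyone)."""
    return [x for x in population if x >= cutoff]


def simulate(n=100_000, cutoff=80.0, seed=0):
    """Draw both groups, apply one unbiased cutoff, and summarize the survivors."""
    rng = random.Random(seed)
    # Group A: ability uniform on [0, 100]; group B: uniform on [10, 90].
    group_a = [rng.uniform(0, 100) for _ in range(n)]
    group_b = [rng.uniform(10, 90) for _ in range(n)]

    selected_a = select(group_a, cutoff)
    selected_b = select(group_b, cutoff)

    return {
        "mean_a": statistics.mean(selected_a),
        "mean_b": statistics.mean(selected_b),
        "min_a": min(selected_a),
        "min_b": min(selected_b),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

With an identical cutoff, the post-selection means should land near 90 for A and 85 for B, so the mean test would flag "bias" that isn't there. The post-selection minimums should both sit just above 80, which is what the minimum-based test looks at. This only covers the idealized hard-cutoff case from the hypothetical; a noisy selection process would blur the minimums.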
If I'm understanding correctly, the new test is based on a single data point from each group, rather than an aggregate statistic (like mean). I'm no statistician, but it seems like this data would have far too much variance and noise for this to be a useful test.
The minimum performer could be someone who had a sudden personal crisis. Or who had 10 competitors suddenly pop up. Or any number of other circumstances outside their control. The minimum performer is, almost by definition, an outlier. It doesn't seem rational to suppose that an outlier is representative of the group.
I can understand that statistically this test may be more rigorous. In practice I would expect it to be less rigorous. Because the assumption it makes (that a single outlier is representative of the group) seems even more dubious than the assumptions required for Paul's original idea.