What makes me uncomfortable here is the obscure description of the issue and how the obscurity will affect beginners and young minds. Kids with an interest in data science are going to read this and find it baffling, and the references won't help much. They will get the impression that ethics in machine learning is some sort of abstruse field that they can't reason about on their own, so they need to be told what is ethical by experts.
Contrary to the cited Medium post, including race, integration, or racism-related factors as predictors of housing prices doesn't imply that it's okay for those drivers to exist in reality, or that the model is suitable for some real-world deployment where it might affect real prices. That is not the only use case; such a model could just as easily be used to imply that those factors are bad. Such a model could play a role in a disparate impact case fighting against racism.
I don't love how the dataset creators only included B and not the underlying untransformed value, and I agree that it's based on a questionable theory about how integration affects housing prices. These issues could be taken as sufficient cause to stop using the dataset, especially when better alternatives are available. But calling them ethical issues seems either puzzling or wrong. Problematizing something should ideally come with an accessible public explanation of why it is problematic. Ethics should not be obscure.
The cited Medium post says many things. Most of them seem to be background or diversions. I did my best to pick out the essential points and discuss them in my comment above. To the extent I could understand, those essential points seemed questionable.
The FairLearn authors here explain, like I did above, that the ethical issues depend on use case, and are not inherent to the data or the dataset developers' choice of transformation for the B variable.
This - it's really encoding the authors' model with some magic values in there. Not even ethics, just bad stats.
The correct step would be to back out the raw proportion of black population rather than use the output of that quadratic model.
Edit:
D'oh 62% and 64% black neighborhoods get the same B. Couple glasses of wine in already, math is hard...
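A quick sanity check (toy values, using the documented transform B = 1000(Bk - 0.63)^2 from the dataset description) makes the collision concrete:

    # Toy check: the dataset's transform maps different racial proportions
    # to the same encoded value, so the original proportion can't be backed out.
    def encode_b(bk):
        # bk is the proportion of black residents, in [0, 1]
        return 1000 * (bk - 0.63) ** 2

    print(round(encode_b(0.62), 6))  # 0.1
    print(round(encode_b(0.64), 6))  # 0.1 -- indistinguishable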
Read the code? You can try to import it but you'll get a message explaining the problems with it and a link to where to find it should you decide you'd like to use it anyway. It's the opposite of hiding.
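Roughly what that looks like in practice (hedged sketch; exact behaviour depends on your scikit-learn version: 1.0/1.1 still ship load_boston but emit a deprecation warning carrying this explanation, while 1.2+ remove it so the import itself fails with the message; the alternatives below are the ones the message points to):

    # Sketch only; details vary by scikit-learn version.
    try:
        from sklearn.datasets import load_boston
        boston = load_boston()  # deprecation warning with the explanation on 1.0/1.1
    except ImportError as exc:
        print(exc)              # on 1.2+ the import fails and the message explains why

    # Maintained alternatives recommended in that message:
    from sklearn.datasets import fetch_california_housing, fetch_openml
    california = fetch_california_housing(as_frame=True)
    ames = fetch_openml(name="house_prices", as_frame=True)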
It’s out of the scope of this library to publish all datasets in existence or highlight particular datasets that are relevant to particular societal problems. It’s literally just a few datasets so that you can play around with the ML library without downloading any external datasets. I think it’s fair to allow them to exercise reasonable discretion in their choice of which toy datasets to ship with their ML library.
Close, but not exactly. One of its variables is how far its integration differs from 63% Black, squared.
I.e., you cannot distinguish a 73% black neighborhood from a 53% black neighborhood with this variable.
It's a bizarre variable and I guess I could see purging the column or at least suggesting it not be used, but I don't really understand why you'd delete the rest of the (sample) dataset on this basis.
> Close, but not exactly. One of its variables is how far its integration differs from 63% Black, squared.
> I.e., you cannot distinguish a 73% black neighborhood from a 53% black neighborhood with this variable.
Isn't the point of that operation that a 1% black neighborhood and a 99% black neighborhood are both less integrated than a 50% black neighborhood? If you didn't do something like squaring, then wouldn't at least one of the former incorrectly register as more integrated than the latter?
> It's a bizarre variable and I guess I could see purging the column or at least suggesting it not be used, but I don't really understand why you'd delete the rest of the (sample) dataset on this basis.
Yeah, I mean, I guess I'd suggest a raw Bk column to preserve the original data and maybe just absolute value of difference instead of square of difference.
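For what it's worth, a tiny sketch of the three encodings being discussed (proportions invented; only the squared transform itself comes from the dataset description):

    # Compare the raw proportion, the absolute deviation from 0.63, and the
    # dataset's squared deviation. Both deviation measures collapse 0.53 and
    # 0.73 into one value; only the raw proportion keeps them distinct.
    proportions = [0.01, 0.50, 0.53, 0.63, 0.73, 0.99]
    for bk in proportions:
        abs_dev = abs(bk - 0.63)        # suggested alternative: absolute difference
        b = 1000 * (bk - 0.63) ** 2     # the column the dataset actually ships
        print(f"Bk={bk:.2f}  |Bk-0.63|={abs_dev:.2f}  B={b:.1f}")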
"Thus, any models trained using this data that do not take special care to process B will learn to use mathematically encoded racism as a factor in house price prediction."
House price prediction for prices in 1970s Boston, yes, where housing prices almost certainly reflected racist preferences. That seems like a (potentially) accurate model?
ML models could also learn that the correlation between B and price is negative (i.e., that integration improves house prices). But the critics of the dataset all suggest that B and price are positively correlated.
A confusion I sometimes have is where we should start pulling the string to unravel systematic racism. On some levels it can seem like, if the purpose of this dataset was to analyze the impact of these variables on the prices humans (who were and are racist) are willing to pay, and B had predictive power (because people were racist), then excluding it in general seems wrong. On the other hand, applying that across society is part of how we got redlining and, in general, a lot of the systematic part of systematic racism. Should we sort of handicap our models of human behavior to exclude race (and try to exclude proxies the model can find)?
Regardless, I think removing this dataset seems right, since it's (a) a toy, (b) dated, and (c) liable to not be handled with the care it needs. But say I'm trying to predict who will be the next famous actor: is it wrong to include variables that let the model pick up on how shallow and, say, fat-phobic (or, again, just plain racist) society is?
What's the point of such an incendiary comment? No, you aren't understanding it right. At worst you are offering a deliberately misleading interpretation. Here's what the link says:
The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.
The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.
The "B" variable measures how integrated a neighborhood is, and that snippet seems to be saying that its existence is the "ethical problem" that led them to purge the dataset. How is any of that different than what I said?
So, the issue is more subtle than that - if it was just a matter of including demographic data, there wouldn't be an issue. The problem is that the "B" column is _not_ measuring how integrated the neighborhood is- it is a transformed value whose calculation begins with data about integration levels, but the details of how and why that transformation is performed are super important. The calculation the original authors did to produce that column rests on a model about the relationship between a neighborhood's level of integration and its property values, and that model's assumptions are frankly racist and also factually incorrect (and were known to be incorrect at the time of its original publication back in the '70s). As a result, if one were to _use_ the "B" column in a model, one would be getting results that at best were wrong and useless, and at worst would make a model that literally encodes broken 70s-era ideas about how real estate and race interact in the US.
And the transformation itself is non-invertible, so it's not possible to recover the original values for about 7-8% of the rows in the dataset. The commit diff links to a thorough investigation of the data[1] in which the author takes a crack at linking up the ambiguous rows with the original 1970 Census data that supposedly went in to generating this dataset, and long story short, it looks like the original dataset's authors may have made some errors in their calculations on top of everything else.
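To make the ambiguity concrete, here's a hedged sketch of the inversion (my own arithmetic, not the code from the linked investigation): from a stored B, the only candidates for the original proportion are 0.63 ± sqrt(B/1000), and whenever both land in [0, 1] the row can't be disambiguated.

    import math

    def candidate_proportions(b):
        # Invert B = 1000 * (Bk - 0.63)^2; both square roots are possible.
        root = math.sqrt(b / 1000)
        return [p for p in (0.63 - root, 0.63 + root) if 0.0 <= p <= 1.0]

    print(candidate_proportions(360.0))  # one valid candidate (~0.03): recoverable
    print(candidate_proportions(22.5))   # two candidates (~0.48 and ~0.78): ambiguous
    # Rows are ambiguous whenever B <= 1000 * 0.37**2 = 136.9, i.e. whenever the
    # high branch 0.63 + root is still a valid proportion.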
Expressing incredulity over one piece of context without acknowledging the existence of other parts which modify it. There was a full detailed explanation at the medium blog, and all that was necessary to do was follow the citation given and read it. Repeatedly proclaiming confusion when already in possession of the explanatory material, rather than challenging that explanation, is not a credible posture.
> Please provide a source that says that "number of blacks in my neighbourhood" is a measure of "neighbourhood integration".
It isn't. Bk is "number of blacks in my neighbourhood" as you put it, and the whole point of using B instead of it was so that an all-black neighborhood wouldn't count as more integrated than one with a mix of races.
I don't see why non-invertibility matters. Lots of useful features are non-invertible.
Edit: and if you are dealing with real data sets or producing real datasets for analysis you will often have only approximations to the thing you want to measure. Determining whether your proxy variable is worth including or how to interpret your results in light of it are necessary skills to develop.
The feature is bad. The non-invertibility means that you cannot get back the original data that was used to generate the feature, and try to salvage it.
Sure, that makes it less useful. But why is that so bad that the entire dataset should be discarded and not used, even for uses that don't care about that particular part of the original data?
If you want the dataset, scikit even tells you how to get it. If you just want an example dataset, there are better ones. I mean, this seems somewhat like the Lena debacle: why insist on this particular dataset?
> The non-invertability is part of the problem, and he completely doesn't understand that.
I get that non-invertibility means you can't fully recover the original racial percentage, e.g., that a 48/52 split and a 78/22 split will both look exactly the same, since (.48-.63)^2 and (.78-.63)^2 are equal. I don't see why that totally taints the entire dataset.
Sadly, assumptions of good faith are easily exploited by bad actors (the classic term for this is "just asking questions") but I suppose you're right, I should not have assumed malice.
Is it just me or is this some horrifically bad English? I have a fairly strong math background and I'm struggling to figure out what the author meant by any of that.
As far as I can tell, it's something like, "The author made a bad variable. Also, the goal was to check air quality but the variable was bad." What does the subject being air quality impact have to do with anything there?
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
And I'm not sure how anyone can argue the dataset is worthy of being included. It is pretty offensive and misguided at minimum to argue that having more black people in your neighbourhood will depress housing prices, and for it to be solely because they are black and not to do with a range of other factors, e.g. socio-economic ones.
> It is pretty offensive and misguided at minimum to argue that having more black people in your neighbourhood will depress housing prices.
I think that's the wrong lens to look at this through. I'm happy to concede your statement about it being offensive is true (although I think from a purely statistical perspective, correlations with poverty, etc probably make the assumption correct. Before 2015 or so when we all lost it, it would only be racist to say there was a causal relationship between race and price, not a correlation). Anyway, that's all an aside.
It's the purging of a dataset, a toy dataset in this context, for a reason of political correctness that I don't support. If you look hard enough at anything, you can probably find a way to call it racist or some similar slur. If we start applying this lens to tools like scikit-learn, we go down a path I don't agree with, one that's completely performative in terms of actually addressing any wrongs and a continuing distraction from what could be useful work. Debating if and how racist this is is immaterial, imo, to whether or not we should erase everything that doesn't align with modern hypersensitivity about political correctness.
So even if you want to ignore the fact that the data was outright used to discriminate in the past, the data itself is actually flawed in several ways...
because this one is flawed and comes with some racist presumptions made by the people that originally correlated the data (the presumption being that black people moving into a neighborhood decreases property values... a particularly white-biased correlation without causation)
The data is the data. The data isn't suggesting that "having more black people in your neighbourhood will depress housing prices." That's your take on what a racist causal interpretation would look like.
That's actually a racist causal interpretation made about the data when it was originally compiled and analyzed... which raises a lot of alarm bells:
> At low to moderate levels of B, an increase in B should have a negative influence on housing value if Blacks are regarded as undesirable neighbors by Whites.
> At low to moderate levels of B, an increase in B should have a negative influence on housing value if Blacks are regarded as undesirable neighbors by Whites.
It's significantly more complicated. This presumption assumes that black people are responsible for depressing housing values and ignores a myriad of other factors (i.e., it assumes that this is caused by black people moving in, not white people moving out). It's such a narrow view of the problem that it makes me question the motivations of collating such data to begin with.
For example, it ignores the fact that on the whole black people are significantly poorer.
In Boston the average white family has a net worth of $200k+ while black families in the same city have a net worth of <$10 (that's not a typo, it's less than ten dollars). Poorer people by nature can not afford houses in more expensive neighborhoods, so naturally you have a concentration of black people in poorer neighborhoods... it's not because white people find black neighbors undesirable, it's that black people are disproportionately poorer.
This kind of oversimplification tends to perpetuate a lot of negative stereotypes about black people while hand-waving away the chronic issues black people face that create this kind of disparity.
That's actually a racist causal interpretation made about the data when it was originally analyzed:
> At low to moderate levels of B, an increase in B should have a negative influence on housing value if Blacks are regarded as undesirable neighbors by Whites.
In an incredibly whitewashed way that's very familiar to anyone who lived through the Jim Crow era, yes. Note that it doesn't say that white people leave because they're intolerant of black neighbors, but paints black neighbors as undesirable.
Either way, it's a dramatic oversimplification. Black people are also significantly poorer than white people, so it's also likely that black people can only afford to move into a neighborhood as property values decrease... yet the people who collated this data chose to make a different assumption.
I wonder if that was the overall proportion for the entire surrounding area at the time. If so, then B would be a measure of how different the racial makeup of a given subset is from the entire area.
Isn't that a measure of how integrated a neighborhood is? And even accepting for the sake of argument that such a variable is evil, why not just exclude it instead of ditching the whole dataset?
a) No. If 50% of the community is African-American and lives on one half, with everyone else on the other, then it would register as exactly integrated. Except of course it isn't.
b) It is not scikit-learn's responsibility to alter third party datasets.
> a) No. If 50% of the community is African-Americans and live on one half and everyone else on the other then it would be exactly integrated. Except of course it isn't.
IIUC, you're arguing that "whole-town" level aggregation is misleading. So if we get more granular, we could do it by neighborhood, street, building, apartment/car/shelter, bedroom, bed, bed @ time of day, etc.
Any one of those aggregation levels could hide interesting distinctions that could be made if only the data were reported with even more granularity.
So are you arguing against aggregation in general? Or just whole-town aggregation specifically?
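For what it's worth, a toy illustration of the parent's point (a), with invented block-level numbers: a town that is 50% black overall but fully segregated block by block gets the same town-level B as a town where every block is evenly mixed.

    # Two towns with the same town-wide proportion of black residents,
    # one fully segregated by block, one evenly mixed (numbers invented).
    town_segregated = [1.0, 1.0, 0.0, 0.0]
    town_mixed = [0.5, 0.5, 0.5, 0.5]

    def town_level_b(block_proportions):
        bk = sum(block_proportions) / len(block_proportions)  # town-wide proportion
        return 1000 * (bk - 0.63) ** 2

    print(town_level_b(town_segregated))  # ~16.9
    print(town_level_b(town_mixed))       # ~16.9 -- town-level B can't tell them apart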
But it’s a fact that—due to the very racism you’re pointing out—house prices tend to be lower the more minorities there are. That’s in part because of the racist policies of banks and the real estate industry. Ignoring it doesn’t do anyone a service. Now, whether it’s used/included responsibly in this dataset is another matter entirely.
I used the dataset with my students, as it is small and does not require preprocessing, like dummy coding or handling missing values. Students also brought the racial issue to my attention and it created a bit of a discussion. We eventually decided to simply change the definition to "birds by town" and moved on.
Think of all the children books that get rewritten. Read the new ones to your children and discuss the old ones when they are teenagers. I would have preferred if sklearn contributors had done the same and simply revised the description as opposed to removing the dataset.
This is an impressively responsible way of handling the situation, and I'd recommend that others read it as well. It identifies the specific problem with the dataset which led to its removal from the library (with references!), tells the user how to retrieve it if they really need it, and suggests alternatives.
Can someone ELI5 why this was removed? Is the problem just that the dataset includes a feature that references black people which might cause a model to draw a connection between black population and housing values? I thought it was pretty well accepted that (for a huge complex variety of historical reasons) black people tend to live in neighborhoods with lower valued houses. Or is there a deeper fairness issue that I'm missing?
The biggest problem is that the dataset contains an artificial feature that is not invertible. This is an issue because the biases of the author of the dataset are present in that feature, and you will never be able to "train" your way out of it because it is not invertible.
Great explanation, but consider adding blurbs about what makes a feature artificial, what 'invertibility' is, why it's important, and how you could (as you put it) train one's way out of the bias so long as the feature is invertible. Finally, bring it home by giving an example (fictitious or, even better, real-historical!) of a model that would be biased, but for the blessings of feature-invertibility and further training, and then explain that you couldn't do that with this dataset, because the artificial feature is not invertible.
That I think gets us a full ELI5, though I agree the common-sense cutoff is subjective.
(And I say this in a spirit of co-collaboration -- I'm fascinated by the problem of ELI5 something like this, as I wind up having these conversations ad-hoc in-the-wild with family and friends, as I work in the field. Finding simple language just is progress)
Have you ever talked to a five-year-old? They're very interested but ignorant and have minute attention spans. Have some self-respect, put on your big boy pants and at least make an effort to understand the adult version and ask some specifics about what you don't understand.
What you're asking for is a full-on Malcolm Gladwell style essay. Not only that, but all the points are discussed in the linked blog post. If one can't even be bothered to put in the effort to read it, isn't it a bit presumptuous to expect anyone to summarize it for you in elaborate detail!?
ELI5 would be: The authors wanted to find out if housing prices have something to do with bad air. They thought some people unfairly treat homes as cheaper if black people live in them, and in some cases as more expensive. Instead of including data about how many black people live in a neighborhood, they included a value that supports their opinion and can't be checked. Because scientists aren't supposed to base their conclusions on gut feelings, this makes the sample data an example of the problems that arise when working with data.
Five year olds are not particularly intellectual and wouldn't enjoy the discussion of biased models with feature-invertible data!
'Although models that make use of non-invertible, artificial features may still be biased, it's less likely. Using only invertible features eliminates some risks of bias, but not others.'
At this point I think one might have to explain the Naturalistic Fallacy (which accounts for much of the remaining possibility of bias in a model) but it starts to get into tit-for-tat hand-to-hand ontology: what 'bias' means, what different kinds of bias are possible, and how even 'unbiased data' can create a model that demonstrates behaviours that colloquially and idiomatically count as biased.
But one must cut the cloth of the universe somewhere.
Being non-invertible just means you can't get the original racial breakdown from the variable. It doesn't mean that you're stuck with any particular bias forever.
Instead of keeping the actual proportion of black residents, the dataset contains a presumed model, B, of how housing prices change as the black resident population moves away from 63%. The problem is that B gives you the same answer for 62% and 64% black population, so you can't recover the actual proportion.