What makes me uncomfortable here is the obscure description of the issue and how the obscurity will affect beginners and young minds. Kids with an interest in data science are going to read this and find it baffling, and the references won't help much. They will get the impression that ethics in machine learning is some sort of abstruse field that they can't reason about on their own, so they need to be told what is ethical by experts.
Contrary to the cited Medium post, including race, integration, or racism-related factors as predictors of housing prices doesn't imply that it's okay for those drivers to exist in reality, or that the model is suitable for some real-world deployment where it might affect real prices. That is not the only use case; such a model could just as easily be used to imply that those factors are bad. Such a model could play a role in a disparate impact case fighting against racism.
I don't love how the dataset creators only included B and not the underlying untransformed value, and I agree that it's based on a questionable theory about how integration affects housing prices. These issues could be taken as sufficient cause to stop using the dataset, especially when better alternatives are available. But calling them ethical issues seems either puzzling or wrong. Problematizing something should ideally come with an accessible public explanation of why it is problematic. Ethics should not be obscure.
The cited Medium post says many things. Most of them seem to be background or diversions. I did my best to pick out the essential points and discuss them in my comment above. To the extent I could understand, those essential points seemed questionable.
The FairLearn authors here explain, like I did above, that the ethical issues depend on use case, and are not inherent to the data or the dataset developers' choice of transformation for the B variable.
This - it's really encoding the authors' model with some magic values in there. Not even ethics, just bad stats.
The correct step would be to back out the raw proportion of black population rather than use the output of that quadratic model.
Edit:
D'oh 62% and 64% black neighborhoods get the same B. Couple glasses of wine in already, math is hard...
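A quick sanity check (toy values, using the documented transform B = 1000(Bk - 0.63)^2 from the dataset description) makes the collision concrete:

    # Toy check: the dataset's transform maps different racial proportions
    # to the same encoded value, so the original proportion can't be backed out.
    def encode_b(bk):
        # bk is the proportion of black residents, in [0, 1]
        return 1000 * (bk - 0.63) ** 2

    print(round(encode_b(0.62), 6))  # 0.1
    print(round(encode_b(0.64), 6))  # 0.1 -- indistinguishable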
Read the code? You can try to import it but you'll get a message explaining the problems with it and a link to where to find it should you decide you'd like to use it anyway. It's the opposite of hiding.
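Roughly what that looks like in practice (hedged sketch; exact behaviour depends on your scikit-learn version: 1.0/1.1 still ship load_boston but emit a deprecation warning carrying this explanation, while 1.2+ remove it so the import itself fails with the message; the alternatives below are the ones the message points to):

    # Sketch only; details vary by scikit-learn version.
    try:
        from sklearn.datasets import load_boston
        boston = load_boston()  # deprecation warning with the explanation on 1.0/1.1
    except ImportError as exc:
        print(exc)              # on 1.2+ the import fails and the message explains why

    # Maintained alternatives recommended in that message:
    from sklearn.datasets import fetch_california_housing, fetch_openml
    california = fetch_california_housing(as_frame=True)
    ames = fetch_openml(name="house_prices", as_frame=True)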
It’s out of the scope of this library to publish all datasets in existence or highlight particular datasets that are relevant to particular societal problems. It’s literally just a few datasets so that you can play around with the ML library without downloading any external datasets. I think it’s fair to allow them to exercise reasonable discretion in their choice of which toy datasets to ship with their ML library.
Close, but not exactly. One of its variables is how far its integration differs from 63% Black, squared.
I.e., you cannot distinguish a 73% black neighborhood from a 53% black neighborhood with this variable.
It's a bizarre variable and I guess I could see purging the column or at least suggesting it not be used, but I don't really understand why you'd delete the rest of the (sample) dataset on this basis.
> Close, but not exactly. One of its variables is how far its integration differs from 63% Black, squared.
> I.e., you cannot distinguish a 73% black neighborhood from a 53% black neighborhood with this variable.
Isn't the point of that operation that a 1% black neighborhood and a 99% black neighborhood are both less integrated than a 50% black neighborhood? If you didn't do something like squaring, then wouldn't at least one of the former incorrectly register as more integrated than the latter?
> It's a bizarre variable and I guess I could see purging the column or at least suggesting it not be used, but I don't really understand why you'd delete the rest of the (sample) dataset on this basis.
Yeah, I mean, I guess I'd suggest a raw Bk column to preserve the original data and maybe just absolute value of difference instead of square of difference.
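For what it's worth, a tiny sketch of the three encodings being discussed (proportions invented; only the squared transform itself comes from the dataset description):

    # Compare the raw proportion, the absolute deviation from 0.63, and the
    # dataset's squared deviation. Both deviation measures collapse 0.53 and
    # 0.73 into one value; only the raw proportion keeps them distinct.
    proportions = [0.01, 0.50, 0.53, 0.63, 0.73, 0.99]
    for bk in proportions:
        abs_dev = abs(bk - 0.63)        # suggested alternative: absolute difference
        b = 1000 * (bk - 0.63) ** 2     # the column the dataset actually ships
        print(f"Bk={bk:.2f}  |Bk-0.63|={abs_dev:.2f}  B={b:.1f}")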
"Thus, any models trained using this data that do not take special care to process B will learn to use mathematically encoded racism as a factor in house price prediction."
House price prediction for prices in 1970s Boston, yes, where housing prices almost certainly reflected racist preferences. That seems like a (potentially) accurate model?
ML models could also learn that the correlation between B and price is negative (i.e., that integration improves house prices). But the critics of the dataset all suggest that B and price are positively correlated.
A confusion I sometimes have is where we should start pulling the string to unravel systematic racism. On some levels it can seem like, if the purpose of this dataset was to analyze the impact of these variables on the prices humans (who were and are racist) are willing to pay, and B had predictive power (because people were racist), then excluding it in general seems wrong. On the other hand, applying that across society is part of how we got redlining and, in general, a lot of the systematic part of systematic racism. Should we sort of handicap our models of human behavior to exclude race (and try to exclude proxies the model can find)?
Regardless, I think removing this dataset seems right, since it's (a) a toy, (b) dated, and (c) liable to not be handled with the care it needs. But say I'm trying to predict who will be the next famous actor: is it wrong to include variables that let the model pick up on how shallow and, say, fat-phobic (or, again, just plain racist) society is?
What's the point of such an incendiary comment? No, you aren't understanding it right. At worst you are offering a deliberately misleading interpretation. Here's what the link says:
The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.
The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.
The "B" variable measures how integrated a neighborhood is, and that snippet seems to be saying that its existence is the "ethical problem" that led them to purge the dataset. How is any of that different than what I said?
So, the issue is more subtle than that - if it was just a matter of including demographic data, there wouldn't be an issue. The problem is that the "B" column is _not_ measuring how integrated the neighborhood is- it is a transformed value whose calculation begins with data about integration levels, but the details of how and why that transformation is performed are super important. The calculation the original authors did to produce that column rests on a model about the relationship between a neighborhood's level of integration and its property values, and that model's assumptions are frankly racist and also factually incorrect (and were known to be incorrect at the time of its original publication back in the '70s). As a result, if one were to _use_ the "B" column in a model, one would be getting results that at best were wrong and useless, and at worst would make a model that literally encodes broken 70s-era ideas about how real estate and race interact in the US.
And the transformation itself is non-invertible, so it's not possible to recover the original values for about 7-8% of the rows in the dataset. The commit diff links to a thorough investigation of the data[1] in which the author takes a crack at linking up the ambiguous rows with the original 1970 Census data that supposedly went in to generating this dataset, and long story short, it looks like the original dataset's authors may have made some errors in their calculations on top of everything else.
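To make the ambiguity concrete, here's a hedged sketch of the inversion (my own arithmetic, not the code from the linked investigation): from a stored B, the only candidates for the original proportion are 0.63 ± sqrt(B/1000), and whenever both land in [0, 1] the row can't be disambiguated.

    import math

    def candidate_proportions(b):
        # Invert B = 1000 * (Bk - 0.63)^2; both square roots are possible.
        root = math.sqrt(b / 1000)
        return [p for p in (0.63 - root, 0.63 + root) if 0.0 <= p <= 1.0]

    print(candidate_proportions(360.0))  # one valid candidate (~0.03): recoverable
    print(candidate_proportions(22.5))   # two candidates (~0.48 and ~0.78): ambiguous
    # Rows are ambiguous whenever B <= 1000 * 0.37**2 = 136.9, i.e. whenever the
    # high branch 0.63 + root is still a valid proportion.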
Expressing incredulity over one piece of context without acknowledging the existence of other parts which modify it. There was a full detailed explanation at the medium blog, and all that was necessary to do was follow the citation given and read it. Repeatedly proclaiming confusion when already in possession of the explanatory material, rather than challenging that explanation, is not a credible posture.
> Please provide a source that says that "number of blacks in my neighbourhood" is a measure of "neighbourhood integration".
It isn't. Bk is "number of blacks in my neighbourhood" as you put it, and the whole point of using B instead of it was so that an all-black neighborhood wouldn't count as more integrated than one with a mix of races.
I don't see why non-invertibility matters. Lots of useful features are non-invertible.
Edit: and if you are dealing with real data sets or producing real datasets for analysis you will often have only approximations to the thing you want to measure. Determining whether your proxy variable is worth including or how to interpret your results in light of it are necessary skills to develop.
The feature is bad. The non-invertibility means that you cannot get back the original data that was used to generate the feature, and try to salvage it.
Sure, that makes it less useful. But why is that so bad that the entire dataset should be discarded and not used, even for uses that don't care about that particular part of the original data?
If you want the dataset, scikit even tells you how to get it. If you just want an example dataset, there are better ones. I mean, this seems somewhat like the Lena debacle: why insist on this particular dataset?
> The non-invertability is part of the problem, and he completely doesn't understand that.
I get that non-invertibility means you can't fully recover the original racial percentage, e.g., that a 48/52 split and a 78/22 split will both look exactly the same, since (.48-.63)^2 and (.78-.63)^2 are equal. I don't see why that totally taints the entire dataset.
Sadly, assumptions of good faith are easily exploited by bad actors (the classic term for this is "just asking questions") but I suppose you're right, I should not have assumed malice.
Is it just me or is this some horrifically bad English? I have a fairly strong math background and I'm struggling to figure out what the author meant by any of that.
As far as I can tell, it's something like, "The author made a bad variable. Also, the goal was to check air quality but the variable was bad." What does the subject being air quality impact have to do with anything there?
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
And I'm not sure how anyone can argue the dataset is worthy of being included. It is pretty offensive and misguided at minimum to argue that having more black people in your neighbourhood will depress housing prices, and for it to be solely because they are black and not to do with a range of other factors, e.g. socio-economic ones.
> It is pretty offensive and misguided at minimum to argue that having more black people in your neighbourhood will depress housing prices.
I think that's the wrong lens to look at this through. I'm happy to concede your statement about it being offensive is true (although I think from a purely statistical perspective, correlations with poverty, etc probably make the assumption correct. Before 2015 or so when we all lost it, it would only be racist to say there was a causal relationship between race and price, not a correlation). Anyway, that's all an aside.
It's the purging of a dataset, a toy dataset in this context, for a reason of political correctness that I don't support. If you look hard enough at anything, you can probably find a way to call it racist or some similar slur. If we start applying this lens to tools like scikit-learn, we go down a path I don't agree with, one that's completely performative in terms of actually addressing any wrongs and a continuing distraction from what could be useful work. Debating if and how racist this is is immaterial, imo, to whether or not we should erase everything that doesn't align with modern hypersensitivity about political correctness.
So even if you want to ignore the fact that the data was outright used to discriminate in the past, the data itself is actually flawed in several ways...
because this one is flawed and comes with some racist presumptions made by the people that originally correlated the data (the presumption being that black people moving into a neighborhood decreases property values... a particularly white-biased correlation without causation)
The data is the data. The data isn't suggesting that "having more black people in your neighbourhood will depress housing prices." That's your take on what a racist causal interpretation would look like.
That's actually a racist causal interpretation made about the data when it was originally compiled and analyzed... which raises a lot of alarm bells:
> At low to moderate levels of B, an increase in B should have a negative influence on housing value if Blacks are regarded as undesirable neighbors by Whites.
> At low to moderate levels of B, an increase in B should have a negative influence on housing value if Blacks are regarded as undesirable neighbors by Whites.
It's significantly more complicated. This presumption assumes that black people are responsible for depressing housing values and ignores a myriad of other factors (i.e., it assumes that this is caused by black people moving in, not white people moving out). It's such a narrow view of the problem that it makes me question the motivations of collating such data to begin with.
For example, it ignores the fact that on the whole black people are significantly poorer.
In Boston the average white family has a net worth of $200k+ while black families in the same city have a net worth of <$10 (that's not a typo, it's less than ten dollars). Poorer people by nature can not afford houses in more expensive neighborhoods, so naturally you have a concentration of black people in poorer neighborhoods... it's not because white people find black neighbors undesirable, it's that black people are disproportionately poorer.
This kind of oversimplification tends to perpetuate a lot of negative stereotypes about black people while hand-waving away the chronic issues black people face that create this kind of disparity.
That's actually a racist causal interpretation made about the data when it was originally analyzed:
> At low to moderate levels of B, an increase in B should have a negative influence on housing value if Blacks are regarded as undesirable neighbors by Whites.
In an incredibly whitewashed way that's very familiar to anyone who lived through the Jim Crow era, yes. Note that it doesn't say that white people leave because they're intolerant of black neighbors, but paints black neighbors as undesirable.
Either way, it's a dramatic oversimplification. Black people are also significantly poorer than white people, so it's also likely that black people can only afford to move into a neighborhood as property values decrease... yet the people who collated this data chose to make a different assumption.
I wonder if that was the overall proportion for the entire surrounding area at the time. If so, then B would be a measure of how different the racial makeup of a given subset is from the entire area.
Isn't that a measure of how integrated a neighborhood is? And even accepting for the sake of argument that such a variable is evil, why not just exclude it instead of ditching the whole dataset?
a) No. If 50% of the community is African-American and lives on one half, with everyone else on the other, then it would register as exactly integrated. Except of course it isn't.
b) It is not scikit-learn's responsibility to alter third party datasets.
> a) No. If 50% of the community is African-Americans and live on one half and everyone else on the other then it would be exactly integrated. Except of course it isn't.
IIUC, you're arguing that "whole-town" level aggregation is misleading. So if we get more granular, we could do it by neighborhood, street, building, apartment/car/shelter, bedroom, bed, bed @ time of day, etc.
Any one of those aggregation levels could hide interesting distinctions that could be made if only the data were reported with even more granularity.
So are you arguing against aggregation in general? Or just whole-town aggregation specifically?
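For what it's worth, a toy illustration of the parent's point (a), with invented block-level numbers: a town that is 50% black overall but fully segregated block by block gets the same town-level B as a town where every block is evenly mixed.

    # Two towns with the same town-wide proportion of black residents,
    # one fully segregated by block, one evenly mixed (numbers invented).
    town_segregated = [1.0, 1.0, 0.0, 0.0]
    town_mixed = [0.5, 0.5, 0.5, 0.5]

    def town_level_b(block_proportions):
        bk = sum(block_proportions) / len(block_proportions)  # town-wide proportion
        return 1000 * (bk - 0.63) ** 2

    print(town_level_b(town_segregated))  # ~16.9
    print(town_level_b(town_mixed))       # ~16.9 -- town-level B can't tell them apart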
But it’s a fact that—due to the very racism you’re pointing out—house prices tend to be lower the more minorities there are. That’s in part because of the racist policies of banks and the real estate industry. Ignoring it doesn’t do anyone a service. Now, whether it’s used/included responsibly in this dataset is another matter entirely.
I used the dataset with my students, as it is small and does not require preprocessing, like dummy coding or handling missing values. Students also brought the racial issue to my attention and it created a bit of a discussion. We eventually decided to simply change the definition to "birds by town" and moved on.
Think of all the children books that get rewritten. Read the new ones to your children and discuss the old ones when they are teenagers. I would have preferred if sklearn contributors had done the same and simply revised the description as opposed to removing the dataset.
This is an impressively responsible way of handling the situation, and I'd recommend that others read it as well. It identifies the specific problem with the dataset which led to its removal from the library (with references!), tells the user how to retrieve it if they really need it, and suggests alternatives.
Can someone ELI5 why this was removed? Is the problem just that the dataset includes a feature that references black people which might cause a model to draw a connection between black population and housing values? I thought it was pretty well accepted that (for a huge complex variety of historical reasons) black people tend to live in neighborhoods with lower valued houses. Or is there a deeper fairness issue that I'm missing?
The biggest problem is that the dataset contains an artificial feature that is not invertible. This is an issue because the biases of the author of the dataset are present in that feature, and you will never be able to "train" your way out of it because it is not invertible.
Great explanation, but consider adding blurbs about what makes a feature artificial, what 'invertibility' is, why it's important, and how you could (as you put it) train one's way out of the bias so long as the feature is invertible. Finally, bring it home by giving an example (fictitious or, even better, real-historical!) of a model that would be biased, but for the blessings of feature-invertibility and further training, and then explain that you couldn't do that with this dataset, because the artificial feature is not invertible.
That I think gets us a full ELI5, though I agree the common-sense cutoff is subjective.
(And I say this in a spirit of co-collaboration -- I'm fascinated by the problem of ELI5 something like this, as I wind up having these conversations ad-hoc in-the-wild with family and friends, as I work in the field. Finding simple language just is progress)
Have you ever talked to a five-year-old? They're very interested but ignorant and have minute attention spans. Have some self-respect, put on your big boy pants and at least make an effort to understand the adult version and ask some specifics about what you don't understand.
What you're asking for is a full-on Malcolm Gladwell style essay. Not only that, but all the points are discussed in the linked blog post. If one can't even be bothered to put in the effort to read it, isn't it a bit presumptuous to expect anyone to summarize it for you in elaborate detail!?
ELI5 would be: The authors wanted to find out if housing prices have something to do with bad air. They thought some people unfairly treat homes as cheaper if black people live in them, and in some cases as more expensive. Instead of including data about how many black people live in a neighborhood, they included a value that supports their opinion and can't be checked. Because scientists aren't supposed to base their conclusions on gut feelings, this makes the sample data an example of the problems that arise when working with data.
Five year olds are not particularly intellectual and wouldn't enjoy the discussion of biased models with feature-invertible data!
'Although models that make use of non-invertible, artificial features may still be biased, it's less likely. Using only invertible features eliminates some risks of bias, but not others.'
At this point I think one might have to explain the Naturalistic Fallacy (which accounts for much of the remaining possibility of bias in a model) but it starts to get into tit-for-tat hand-to-hand ontology: what 'bias' means, what different kinds of bias are possible, and how even 'unbiased data' can create a model that demonstrates behaviours that colloquially and idiomatically count as biased.
But one must cut the cloth of the universe somewhere.
Being non-invertible just means you can't get the original racial breakdown from the variable. It doesn't mean that you're stuck with any particular bias forever.
Instead of keeping the actual proportion of black residents, the dataset contains a presumed model, B, of how housing prices change as the black resident population moves away from 63%. The problem is that B gives you the same answer for 62% and 64% black population, so you can't recover the actual proportion.