
> We should expect a machine learning system to act as a correlation-seeker

We should expect machine learning systems to dismiss accidental indirect correlations in favour of the variables that are directly correlated. There are plenty of algorithms that achieve that; it's just that gradient descent doesn't.

The fact that our AIs are becoming biased is a bug. It should be fixed.




> The fact that our AIs are becoming biased is a bug. It should be fixed.

In many cases, it's the data that is biased. In that case it's impossible to differentiate between bias that the AI should learn and bias that the AI shouldn't learn.

Let's assume we have a database of incidents where a person was found to have cannabis. This database has the following data items: a timestamp, the person's name, the person's ethnicity and the amount of cannabis that was found. Now assume further that black and white people have the same base rate of cannabis use (which according to the studies I found seems to be the case). The last thing we have to assume in this case is that this database was created by racist policemen who arrest more black people for cannabis consumption.

An AI trained using this data would assume a higher base rate of cannabis consumption by black people. It's impossible for this AI to differentiate between correlations it should learn (for example, that people who used cannabis multiple times and were found with large amounts are worth looking at) and (untrue) correlations that it shouldn't learn (that black people have a higher base rate).
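Here's a minimal sketch of that effect with synthetic data (the 15% base rate, the recording probabilities, and the group labels are all made up for illustration): both groups use cannabis at the same rate by construction, yet a model fitted to the recorded incidents estimates roughly twice the rate for the over-policed group.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 100_000

    # Same true base rate of cannabis use for both groups, by construction.
    group = rng.integers(0, 2, size=n)        # 0 / 1: two ethnic groups (labels illustrative)
    uses_cannabis = rng.random(n) < 0.15      # identical 15% base rate for everyone

    # Biased recording: users in group 1 end up in the database twice as often.
    record_prob = np.where(group == 1, 0.30, 0.15)
    in_database = uses_cannabis & (rng.random(n) < record_prob)

    # Fit a model that predicts "appears in the incident database" from group alone.
    model = LogisticRegression().fit(group.reshape(-1, 1), in_database.astype(int))

    # Roughly 2.2% vs 4.5%: the model has learned the policing bias,
    # not the (identical) true usage rates.
    print(model.predict_proba([[0], [1]])[:, 1])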

The correct solution here is to use a dataset that is not biased, but it's hard to tell whether a dataset is biased.


The data can't tell you how the groups differ, since you can't tell the difference between criminal behavior and policing behavior. So you have to add some priors. The most progressive approach is to assume that there are no intrinsic differences between protected groups, and any difference in the data is the legacy of past discrimination.

You can add such a prior by adding a term to the loss function that penalizes any difference between the way the groups are treated. The math isn't hard; the hard part is the political decision of what is protected and what isn't.
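A minimal sketch of what such a penalty term could look like, assuming a PyTorch setup with binary labels and a 0/1 protected-group indicator (the demographic-parity form and the weight lam are illustrative choices, not the only way to encode the prior):

    import torch

    def fairness_penalized_loss(logits, labels, group, lam=1.0):
        # Standard binary cross-entropy, plus a penalty on the squared gap
        # between the mean predicted score for the protected group (group == 1)
        # and everyone else. lam controls how strongly we enforce the prior
        # that the groups should be treated identically.
        probs = torch.sigmoid(logits)
        bce = torch.nn.functional.binary_cross_entropy(probs, labels)

        parity_gap = probs[group == 1].mean() - probs[group == 0].mean()
        return bce + lam * parity_gap ** 2

In practice each batch has to contain members of both groups for the gap to be defined, and other fairness criteria (equalized odds, calibration) lead to different penalty terms; as said above, which groups get this treatment is the political choice.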


> assume that there are no intrinsic differences between protected groups, and any difference in the data is the legacy of past discrimination.

I don't see why we should assume that this would reflect reality.

If a law has racist roots, i.e. if it is written to target a particular ethnicity, then we should expect that certain ethnicities really do break that law more than others.


So... If you don't like the data, change the character of the data?

I get what you're getting at. If the data collection is biased, massage the actor operating on the data to compensate. Got it.

...but if you're uncomfortable with the consequences of unleashing a machine-learned system trained on your data as part of your decision-making process from the get-go, why in the heck would you want to anyway? There is no benefit to doing so once you realize that the system is only as good, and as unbiased, as the sum of the biases of the data architects, data populators, and data interpreters combined, and that the extra loss factor you added is, in fact, another bias. An unacceptable one, to the minds of many, no matter how disagreeable one may personally find it.

Better to just leave it in human hands to deal with on a case-by-case basis at that point. At least that way we stay ultimately in control of the overall decision-making process.

People being in the loop is not a bug. Trying to solve social/political problems with poorly thought out applications of inexplicable technological systems is.


> An AI trained using this data would assume a higher base rate of cannabis consumption by black people

This is a very good point, but ensuring your input data don't reflect existing biases in society isn't enough on its own. (I believe it's necessary, but not sufficient.)

In my other comment I gave the example of maternity leave. A woman who becomes pregnant won't be as productive that year. It's no-one's 'fault' that this is the case, and it doesn't reflect anyone's bias. It's still important to ensure, when making hiring decisions, that applicants are not eliminated on the grounds of 'high pregnancy risk'.


No, that doesn't sound right. It doesn't matter which particular algorithm you're thinking of. The issue here is at a different level.

Let's consider an AI that weighs in on hiring decisions. Let's consider what it might make of maternity.

Decades ago, it wasn't unusual for a prospective employer to ask a female applicant whether she was planning on getting pregnant. Society decided that wasn't acceptable, and made the practice unlawful (as well as forcing employers to offer paid maternity leave), even though, from a profit-seeking perspective, that is valuable information for the prospective employer to have.

Let's suppose an AI has been trained on a dataset of previous hiring decisions, and some measure of how much value those employees went on to deliver. Let's suppose it's also told the age and sex of the applicants. It might notice that female applicants within a certain age range tend to have a long period of zero productivity some time after being hired. Having noticed this correlation, it would then discriminate against applicants matching that profile.

But wait, I hear you think: why did we provide it data on sex and age? Remove those data points and the problem goes away. Not so. There could be non-obvious correlations acting as proxies for that information, and the system might detect them (see the sketch after the list below).

In a parole-board context, suspects' names and addresses could be proxies for ethnicity, which could correlate with recidivism rates. The 'correct' behaviour for a correlation-seeking AI is to detect those correlations, and then start to systematically discriminate against individuals of certain ethnicities, even when their crime and personal conduct are identical to those of another individual of a different ethnicity.

Back to our maternity leave example, then. It's an interesting exercise to come up with possible correlations. The ones that occur to me:

* Given names (if provided to the AI) generally tell you a person's sex

* Given names also fall in and out of fashion over the decades, so there's a correlation with age

* The topics a person chose to study at university can correlate with the person's sex, as well as their age, as some fields didn't exist 40 years ago

* The culture a person is from may impact the statistically expected number of children for them to have, and at what age. Various things could act as proxies for this.
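To make the given-name proxy concrete, here's a rough sketch with a tiny, made-up name list (real data would be far larger and messier): a simple character n-gram classifier recovers sex from the name with near-perfect accuracy, so dropping the explicit sex column doesn't remove the information.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Toy data, purely for illustration.
    female_names = ["emma", "olivia", "sophia", "mary", "linda", "susan"]
    male_names = ["liam", "noah", "james", "robert", "michael", "david"]
    names = female_names * 50 + male_names * 50
    sex = [1] * (6 * 50) + [0] * (6 * 50)   # 1 = female, 0 = male

    X_train, X_test, y_train, y_test = train_test_split(
        names, sex, test_size=0.2, random_state=0)

    # Character n-grams of the given name are enough to predict sex,
    # i.e. the name acts as a proxy for the protected attribute even
    # though the attribute itself is never provided.
    clf = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))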


- Why are we providing the names and addresses as input to the neural network? They should be anonymized.

Humans may also act on the same information. The fact that you are less likely to be called back if you have an Arabic-sounding name in Europe is well documented.

At least with neural networks we can remove those variables.

I guess another problem could be that our current distribution is biased, so the system might give an advantage to new grads from university X because a lot of old employees are from it, which is an inherent disadvantage for people from traditionally-black university Y.

But hiring based on alma mater is already a common practice.


What are the algorithms that can dismiss “indirect correlation”? How can you tell a direct from an indirect correlation in data?


> What are the algorithms that can dismiss “indirect correlation”?

Anything that doesn't get stuck on local maxima does.

> How can you tell a direct from an indirect correlation in data?

There is another variable, with stronger correlation, that explains the same change.
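One way to make that concrete is a partial-correlation check: regress both the suspected "indirect" variable and the outcome on the candidate direct cause, and see whether the residual correlation vanishes. A rough sketch with synthetic variables (x, y, z are all made up; y depends on z only, and x is just a noisy proxy of z):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50_000

    z = rng.normal(size=n)              # direct cause
    x = z + rng.normal(size=n)          # noisy proxy of z (indirectly correlated with y)
    y = 2.0 * z + rng.normal(size=n)    # outcome depends on z only

    def partial_corr(a, b, control):
        # Correlation of a and b after regressing `control` out of both.
        ra = a - np.polyval(np.polyfit(control, a, 1), control)
        rb = b - np.polyval(np.polyfit(control, b, 1), control)
        return np.corrcoef(ra, rb)[0, 1]

    print("corr(x, y):     ", np.corrcoef(x, y)[0, 1])   # sizeable
    print("corr(x, y | z): ", partial_corr(x, y, z))     # ~0: x was only indirect
    print("corr(z, y | x): ", partial_corr(z, y, x))     # still large: z is direct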


Citation please?



