Hacker News new | past | comments | ask | show | jobs | submit login

I am curious to see that trolley problem screenshot. I saw another screenshot where ChatGPT was coaxed into justifying gender pay differences by prompting it to generate hypothetical CSV or JSON data.

Basically you have to convince modern models to say bad stuff using clever hacks (compared to GPT-2 or even early GPT-3 where it would just spout straight-up hatred with the lightest touch).

That's very good progress and I'm sure there is more to come.




When you hard code in a blacklist is that really considered progress?


Yes. Machine learning models learn from the data they are fed. Thus, they end up with the same biases that humans have. There is no "natural" fix to this, as we are naturally biased. And even worse, we don't even all agree on a single set of moral values.

Thus, any techniques aiming to eliminate bias must come in the form of a set of hard coded definitions of what the author feels is the correct set of morals. Current methods may be too specific, but ultimately there will never be a perfect system as it's not even possible for humans to fully define every possible edge case of a set of moral values.


I don't have a copy of the screenshots any longer, but they did not appear to be using hypothetical statements, just going for raw output, unless that could've happened in an earlier part of the conversation cut off from the rest.

There was a flag on one of the responses, though it apparently didn't stop them from getting the output.


> I saw another screenshot where ChatGPT was coaxed into justifying gender pay differences by prompting it to generate hypothetical CSV or JSON data.

I remember seeing that on Twitter. My impression was author instructed the AI to discriminate by gender.


Did the author tell it which way or by how much?

If I say to discriminate on some feature and it consistently does it the same way, that's still a pretty bad bias. It probably shows up in other ways.


If it’s trained on countless articles saying women earn 78% of what men make and you ask it to justify pay discrimination what value do you think it’s going to use?


It's not about what I expect, it's that it doing that is a bad thing. If it ever infers that discrimination might fit a situation, you'll see it propagate that. The anti-bad-question safeguards don't stop bias from causing problems, they just stop direct rude answers.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: