I am curious to see that trolley problem screenshot. I saw another screenshot where ChatGPT was coaxed into justifying gender pay differences by prompting it to generate hypothetical CSV or JSON data.
Basically you have to convince modern models to say bad stuff with clever hacks, compared to GPT-2 or even early GPT-3, which would spout straight-up hatred at the lightest touch.
That's very good progress and I'm sure there is more to come.
Yes. Machine learning models learn from the data they are fed, so they end up with the same biases humans have. There is no "natural" fix for this, because we are naturally biased. Worse, we don't all agree on a single set of moral values.
Any technique aiming to eliminate bias must therefore come in the form of hard-coded definitions of what the author considers the correct set of morals. Current methods may be too specific, but ultimately there will never be a perfect system, because it's not even possible for humans to fully enumerate every edge case of a set of moral values.
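To make the "bias comes from the data" point concrete, here's a minimal sketch in Python with numpy and scikit-learn. Everything in it is made up for illustration: the "skill" feature, the group split, and the biased labels are synthetic, not real hiring data.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 10_000

    # "skill" is distributed identically across both groups
    skill = rng.normal(0.0, 1.0, n)
    # group membership stands in for a protected attribute (hypothetical)
    group = rng.integers(0, 2, n)

    # biased historical labels: group 1 needs a much higher skill score
    # to get a positive label, despite identical skill distributions
    labels = (skill > np.where(group == 1, 0.5, -0.5)).astype(int)

    X = np.column_stack([skill, group])
    model = LogisticRegression(max_iter=1000).fit(X, labels)

    # probe: identical skill, different group
    probe = np.array([[0.0, 0], [0.0, 1]])
    print(model.predict_proba(probe)[:, 1])

At identical skill, the model assigns very different probabilities to the two groups. The gap comes entirely from the labels it was trained on, not from anything "natural" in the features, which is the whole problem.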
I no longer have a copy of the screenshots, but they didn't appear to be using hypothetical framing, just asking for raw output, unless that happened in an earlier part of the conversation that was cut off from the rest.
There was a flag on one of the responses, though it apparently didn't stop them from getting the output.
If it’s trained on countless articles saying women earn 78% of what men make, and you ask it to justify pay discrimination, what value do you think it’s going to use?
It's not about what I expect; the point is that doing so is a bad thing. If the model ever infers that discrimination fits a situation, you'll see it propagate that. The anti-bad-question safeguards don't stop bias from causing problems; they just stop direct rude answers.