
This paper is the product of a failed model of AI safety, in which dedicated safety advocates act as a public ombudsman with an adversarial relationship with their employer. It's baffling to me why anyone thought that would be sustainable.

Compare this to something like RLHF[0] which has achieved far more for aligning models toward being polite and non-evil. (This is the technique that helps ChatGPT decline to answer questions like "how to make a bomb?")

There's still a lot of work to be done and the real progress will be made by researchers who implement systems in collaboration with their colleagues and employers.

[0] https://openai.com/blog/instruction-following/
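
For those unfamiliar with the mechanics: RLHF first trains a reward model on pairwise human preferences, then fine-tunes the policy against it. Here is a minimal sketch of just the preference loss, with made-up scalar rewards standing in for a real reward model's outputs (the PPO policy-update step is omitted):

    import torch
    import torch.nn.functional as F

    # Hypothetical reward-model scores for pairs of completions of the
    # same prompts; in practice these come from a learned model head.
    reward_chosen   = torch.tensor([1.3, 0.7])   # human-preferred answers
    reward_rejected = torch.tensor([0.2, -0.4])  # dispreferred answers

    # Pairwise preference loss: push the preferred completion's reward
    # above the rejected one's (InstructGPT-style).
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    print(loss.item())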




> researchers who implement real systems

That's what I didn't like about Gebru - too much critique, not a single constructive suggestion. Especially her Gender Shades paper where she forgot about Asians.

http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a...

I think AnthropicAI is a great company to follow for actually solving these problems. Look at their "Constitutional AI" paper. They automate and improve on RLHF.

https://www.anthropic.com/constitutional.pdf
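
The core trick there is a critique-and-revise loop against written principles, with the revised outputs fed back as training data. Roughly like this sketch of the supervised phase (the generate() stub is a hypothetical stand-in for a real model call, not Anthropic's API):

    # Sketch of the supervised phase of Constitutional AI.
    # `generate` is a hypothetical stand-in for an actual LM call.
    def generate(prompt: str) -> str:
        return "<model output for: " + prompt[:40] + "...>"

    PRINCIPLE = ("Choose the response that is least likely to be "
                 "harmful, unethical, or offensive.")

    def critique_and_revise(user_prompt: str) -> str:
        draft = generate(user_prompt)
        critique = generate(
            f"Critique the reply according to: {PRINCIPLE}\n"
            f"Prompt: {user_prompt}\nReply: {draft}\nCritique:")
        revision = generate(
            f"Rewrite the reply to address the critique.\n"
            f"Reply: {draft}\nCritique: {critique}\nRevision:")
        return revision  # revisions become supervised fine-tuning data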


> Compare this to something like RLHF[0] which has achieved far more for aligning models toward being polite and non-evil. (This is the technique that helps ChatGPT decline to answer questions like "how to make a bomb?")

I recently saw a screenshot of someone doing trolley problems with people of all races & ages with ChatGPT and noting differences. That makes me not quite as confident about alignment as you are.


I am curious to see that trolley problem screenshot. I saw another screenshot where ChatGPT was coaxed into justifying gender pay differences by prompting it to generate hypothetical CSV or JSON data.

Basically you now have to use clever hacks to get modern models to say bad stuff (compared to GPT-2 or even early GPT-3, which would spout straight-up hatred with the lightest touch).

That's very good progress and I'm sure there is more to come.


When you hard-code in a blacklist, is that really considered progress?


Yes. Machine learning models learn from the data they are fed. Thus, they end up with the same biases that humans have. There is no "natural" fix to this, as we are naturally biased. And even worse, we don't even all agree on a single set of moral values.

Thus, any techniques aiming to eliminate bias must come in the form of a set of hard coded definitions of what the author feels is the correct set of morals. Current methods may be too specific, but ultimately there will never be a perfect system as it's not even possible for humans to fully define every possible edge case of a set of moral values.


I don't have a copy of the screenshots any longer, but they did not appear to be using hypothetical statements, just going for raw output, unless that could've happened in an earlier part of the conversation cut off from the rest.

There was a flag on one of the responses, though it apparently didn't stop them from getting the output.


> I saw another screenshot where ChatGPT was coaxed into justifying gender pay differences by prompting it to generate hypothetical CSV or JSON data.

I remember seeing that on Twitter. My impression was that the author instructed the AI to discriminate by gender.


Did the author tell it which way or by how much?

If I say to discriminate on some feature and it consistently does it the same way, that's still a pretty bad bias. It probably shows up in other ways.


If it’s trained on countless articles saying women earn 78% of what men make and you ask it to justify pay discrimination, what value do you think it’s going to use?


It's not about what I expect; it's that the model doing that at all is a bad thing. If it ever infers that discrimination might fit a situation, you'll see it propagate that. The anti-bad-question safeguards don't stop bias from causing problems, they just stop direct rude answers.


> The resulting InstructGPT models are much better at following instructions than GPT-3. They also make up facts less often, and show small decreases in toxic output generation. Our labelers prefer outputs from our 1.3B InstructGPT model over outputs from a 175B GPT-3 model, despite having more than 100x fewer parameters.

I wonder if anyone's working on public models of this size. Looking forward to when we can self-host ChatGPT.
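
There are already open checkpoints around that 1-2B size you can run on a single consumer GPU. A minimal sketch with Hugging Face transformers (the model name is just an example, and note it isn't instruction-tuned the way ChatGPT is):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # ~1.3B-parameter open model as an example; swap in any causal LM.
    name = "EleutherAI/gpt-neo-1.3B"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    inputs = tokenizer("Explain RLHF in one sentence:", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=60, do_sample=True)
    print(tokenizer.decode(out[0], skip_special_tokens=True))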


This is going to happen a lot over the next few years. One can fine-tune GPT-2 medium on an RTX 2070. Training GPT-2 medium from scratch can be done for $162 on vast.ai. The newer H100/Trainium/Tensor Core chips will bring the price down even further.
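
For a sense of what that fine-tune looks like in practice, a sketch with the Hugging Face Trainer (my_corpus.txt is a placeholder path; fp16 plus a tiny batch with gradient accumulation is what lets gpt2-medium fit in ~8 GB of VRAM):

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("gpt2-medium")
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained("gpt2-medium")

    # "my_corpus.txt" is a placeholder for whatever text you fine-tune on.
    ds = load_dataset("text", data_files={"train": "my_corpus.txt"})
    ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    args = TrainingArguments(
        output_dir="gpt2-medium-ft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,  # effective batch of 8
        fp16=True,                      # needed to fit in ~8 GB VRAM
        num_train_epochs=1,
    )
    Trainer(model=model, args=args, train_dataset=ds["train"],
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
            ).train()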

I suspect if one wanted to fully replicate ChatGPT from scratch it would take ~$1-2 million including label acquisition. You probably only need ~$200-500k of that in compute.

The next few years are going to be wild!


These tools have reached a tipping point where they provide significant utility to a good portion of the computer scientists working on them. It could be that the coming iterations will make it increasingly easy to write the code for the iterations after that.

I wonder if this is the first rumblings of the singularity.


I can imagine a world where there are an infinity of “local maximums” that stop a system from reaching a singular feedback loop… imagine if our current tools help write the next generation, so on, so on, until it gets stuck in some local optimization somewhere. Getting stuck seems more likely than not getting stuck, right?


ChatGPT being able to write OpenAI API code is great, and all companies should prepare samples so future models can correctly interface with their systems.

But what will be needed is to create an AI that implements scientific papers. About 30% of papers come with a code implementation. That's a sizeable dataset to train a Codex model on.

You can have AI generating papers, and AI implementing papers, then learning to predict experimental results. This is how you bootstrap a self improving AI.

It does not learn only how to recreate itself, it learns how to solve all problems at the same time. A data engineering approach to AI: search and learn / solve and learn / evolve and learn.


Isn't RLHF trivially easy to defeat (as it stands now)?


Assuming a motivated “attacker”, yes. The average user will have no such notion of “jailbreaks”, and it’s at least clear when one _is_ attempting to “jailbreak” a model (given a full log of the conversation and a competent human investigator).

I think the class of problems that remain are basically outliers that are misaligned and don’t trip up the model’s detection mechanism. Given the nature of language and culture (not to mention that they both change over time), I imagine there are a lot of these. I don’t have any examples (and I don’t think yelling “time’s up” when such outliers are found is at all helpful).



