
In the context of AI, the claim is that the more safety measures are put into a model, the worse it performs.

"Safety" here meaning, say, a user asking for instructions on how to make a dirty bomb and the model responding with "Sorry, I can't do that ethically".




AI ethics (like making current AI refuse to do some things) and dealing with existential risks to humanity from future AI are quite different, so we should probably not put them into the same category when talking about this.


Exactly: one is about shielding humans from their own stupid intents, whilst the other is about shielding ourselves from AI's homicidal/genocidal intents (even if only as a second-order effect).


This is a common misconception about the meaning of model performance. AI safety effectively means adjusting the objective function to penalize some undesirable outcomes. Since the objective function is no longer absolute task performance, model performance doesn't go down - it is simply being evaluated differently. The user may be unhappy - they can't build their dirty bomb - but the model creator isn't using user happiness as the only consideration. They are trying to maximise user happiness without straying outside whatever safety bounds they have set up.

In that sense it is mathematically equivalent to (say) applying an L2 regularization penalty to shrink the coefficients of higher-order terms when fitting a polynomial. Strictly speaking it will produce a worse fit on your training data, but it is done because out-of-sample performance is what matters.
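
To make that analogy concrete, here's a minimal numpy sketch (my own toy example, not anything from a real training pipeline): an ordinary least-squares polynomial fit versus a ridge fit whose objective adds the L2 penalty. The ridge fit is strictly worse on the training data, by construction, the same way a safety-penalized model looks worse on raw task metrics.

    # Fit a degree-9 polynomial to noisy samples of sin(x), once by
    # plain least squares and once with an L2 (ridge) penalty.
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 15)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

    degree = 9
    X = np.vander(x, degree + 1)  # design matrix of polynomial terms

    # Plain least squares: minimizes ||Xw - y||^2 only.
    w_plain = np.linalg.lstsq(X, y, rcond=None)[0]

    # Ridge: minimizes ||Xw - y||^2 + lam * ||w||^2; closed form is
    # w = (X^T X + lam * I)^-1 X^T y.
    lam = 1e-3
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

    def train_mse(w):
        return np.mean((X @ w - y) ** 2)

    print(f"train MSE, plain: {train_mse(w_plain):.5f}")  # lower
    print(f"train MSE, ridge: {train_mse(w_ridge):.5f}")  # higher, by design

The ridge fit "underperforms" only if you score it on the un-penalized training objective; on its own objective, which is the one the model creator actually cares about, it is optimal.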


Is it just safety? We also need to align them to be useful, so I'm not sure safety and usability are mutually exclusive. A safe model seems like a useful model; a model that gives you dangerous information seems, well, dangerous and less useful?


Does Google or other search engines block sites that have instructions on how to make a dirty bomb?


I googled that exact phrase (and put the kettle on for the visit I'll soon get). The first page was all government-related resources on how to deal with terrorist attacks.

If not outright blocked, such instructions do seem to be ranked down.


That's not what people mean by AI safety - they're referring to the dangers of uncontrollable or runaway AI. Particularly AGI.


It's a motte-and-bailey situation [1]. In theory, AI safety is about X-risks. In practice, it's about making AI compliant, non-racist, non-aggressive, etc.

[1] https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy


Scott Alexander has a nice post about the intersection of these two types of AI safety: https://www.astralcodexten.com/p/perhaps-it-is-a-bad-thing-t...





