AI ethics (like making current AI refuse to do some things) and dealing with existential risks to humanity from future AI are quite different, so we should probably not put them in the same category when talking about them.
Exactly, one is about shielding humans from their own stupid intents, whilst the other is about shielding ourselves from AI's homicidal/genocidal intents (even if only as a second-order effect).
This is a common misconception about what model performance means. AI safety effectively means adjusting the objective function to penalize some undesirable outcomes. Since the objective function is no longer raw task performance, model performance doesn't go down - it is simply being evaluated differently. The user may be unhappy - they can't build their dirty bomb - but the model creator isn't using user happiness as the only consideration. They are trying to maximise user happiness without straying outside whatever safety bounds they have set up.
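As a toy sketch of that idea (the names helpfulness, harm_score and safety_weight are purely illustrative, not any real training API), the creator's effective objective looks something like:

    # Toy sketch: the score being maximised is helpfulness minus a weighted
    # penalty for unsafe output, not helpfulness alone. All names illustrative.
    def training_objective(helpfulness: float, harm_score: float,
                           safety_weight: float = 10.0) -> float:
        # A response that helps the user but crosses the safety bound scores
        # poorly under this combined objective, even if the user liked it.
        return helpfulness - safety_weight * harm_score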
In that sense it is mathematically analogous to (say) applying an L2 regularization penalty to shrink the higher-order coefficients when fitting a polynomial. Strictly speaking it will produce a worse fit on your training data, but it is done because out-of-sample performance is what matters.
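To make the analogy concrete, here is a minimal sketch (assuming NumPy and a made-up noisy sine dataset) that fits a degree-9 polynomial with and without a ridge penalty; the regularized fit typically scores worse on the training points but better on held-out ones:

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.linspace(-1, 1, 15)
    y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, x_train.size)
    x_test = np.linspace(-1, 1, 200)
    y_test = np.sin(3 * x_test)

    def fit_poly(x, y, degree, l2=0.0):
        # Vandermonde design matrix (columns x^degree ... x^0) and the
        # closed-form ridge solution: (X'X + l2*I)^-1 X'y
        X = np.vander(x, degree + 1)
        return np.linalg.solve(X.T @ X + l2 * np.eye(degree + 1), X.T @ y)

    def mse(w, x, y):
        return np.mean((np.vander(x, w.size) @ w - y) ** 2)

    for l2 in (0.0, 1e-2):
        w = fit_poly(x_train, y_train, degree=9, l2=l2)
        print(f"l2={l2}: train MSE {mse(w, x_train, y_train):.4f}, "
              f"test MSE {mse(w, x_test, y_test):.4f}")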
Is it just safety? We also need to align them to be useful. So I'm not sure safety and usability are mutually exclusive. A safe model seems like a useful model; a model that gives you dangerous information seems, well, dangerous and less useful?
I googled that exact phrase (and put the kettle on for the visit I'll soon get). The first page was all government-related resources on how to deal with terrorist attacks.
If not outright blocked, such instructions do seem to be down-weighted.
It's a motte-and-bailey situation [1]. In theory, AI safety is about x-risks. In practice, it's about making AI compliant, non-racist, non-aggressive, etc.
Safety here being, say, a user asking for instructions on how to make a dirty bomb and the model responding with "Sorry, I can't do that ethically".