"This work is part of the third pillar of our approach to alignment research: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations."
On first look this is genius but it seems pretty tautological in a way. How do we know if the explainer is good?... Kinda leads to thinking about who watches the watchers...
The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation. They ask GPT-4 to guess the neuron's activations given an explanation and an input (the paper includes the full prompt used), and then they calculate the correlation between the actual and the simulated activations.
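To make that concrete, here's a rough sketch of the scoring step (my own illustration, not the paper's code; the activation arrays are made up):

```python
import numpy as np

def correlation_score(actual: np.ndarray, simulated: np.ndarray) -> float:
    """Score an explanation by how well the simulated activations
    track the real ones (the paper uses correlation for this)."""
    # Correlation is undefined for constant sequences, so guard against that.
    if actual.std() == 0 or simulated.std() == 0:
        return 0.0
    return float(np.corrcoef(actual, simulated)[0, 1])

# Hypothetical per-token activations for one text excerpt.
actual = np.array([0.0, 0.1, 2.3, 0.0, 1.8, 0.0])     # measured in the subject model
simulated = np.array([0.0, 0.0, 2.0, 0.2, 1.5, 0.1])  # GPT-4's guesses from the explanation
print(correlation_score(actual, simulated))            # close to 1.0 for a good explanation
```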
They discuss two issues with this methodology. First, explanations are ultimately for humans, so using GPT-4 to simulate a human reader, while necessary in practice, may cause divergence. They guard against this by asking humans whether they agree with the explanations, and showing that humans agree more with explanations that score higher in correlation.
Second, correlation is an imperfect measure of how faithfully neuron behavior is reproduced. To guard against this, they run the network with the neuron's activation replaced by the simulated activation, and show that the network's output stays closer to the original (measured by Jensen-Shannon divergence) when the correlation is higher.
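And a sketch of that second check, again just to make the idea concrete (the distributions here are made up; scipy's `jensenshannon` returns the JS distance, so it's squared to get the divergence):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def ablation_score(p_original: np.ndarray, p_simulated: np.ndarray) -> float:
    """Compare the network's next-token distribution against the distribution
    you get after replacing the neuron's activation with the simulated one.
    Smaller divergence = the explanation captures more of what the neuron does."""
    return jensenshannon(p_original, p_simulated) ** 2

# Hypothetical next-token distributions over a tiny 4-token vocabulary.
p_original = np.array([0.70, 0.20, 0.05, 0.05])   # unmodified network
p_simulated = np.array([0.65, 0.22, 0.08, 0.05])  # neuron replaced by its simulated activation
print(ablation_score(p_original, p_simulated))
```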
> The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation.
To be clear, this is only neuron activation strength on text inputs. We aren't doing any mechanistic modeling of whether the explanation of what a neuron does predicts the role that neuron plays within the internals of the network, even though most neurons likely have a role that can only be succinctly summarized in relation to the rest of the network.
It seems very easy to end up with explanations that correlate well with a neuron, but do not actually meaningfully explain what the neuron is doing.
Why is this genius? It's just the NN equivalent of making a new programming language and getting it to the point where its compiler can be written in itself.
The reliability question is of course the main issue. If you don't know how the system works, you can't assign a trust value to anything it comes up with, even if it seems like what it comes up with makes sense.
I love the epistemology related discussions AI inevitably surfaces. How can we know anything that isn't empirically evident and all that.
It seems NN output could be trusted in scenarios where a test exists. For example: "ChatGPT, design a house using [APP] and make sure the compiled plans comply with structural/electrical/design/etc. codes for area [X]."
But how is any information that isn't testable trusted? I'm open to the idea that ChatGPT is as credible as experts in the dismal sciences, given that the information cannot be proven or falsified and legitimacy is assigned by stringing together words that "make sense".
> But how is any information that isn't testable trusted? I'm open to the idea that ChatGPT is as credible as experts in the dismal sciences, given that the information cannot be proven or falsified and legitimacy is assigned by stringing together words that "make sense".
I understand that around the 1980s, the dream was that people could express knowledge in something like Prolog, including the test case, which could then be deterministically evaluated. This does really work, but surprisingly many things cannot be represented in terms of “facts”, which really limits its applicability.
I didn’t opt for Prolog electives in school (I did Haskell instead) so I honestly don’t know why so many “things” are unrepresentable as “facts”.
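For what it's worth, the "knowledge as facts plus rules, deterministically evaluated" idea looks roughly like this (a toy Python stand-in, not real Prolog):

```python
# Facts: things asserted to be true.
facts = {("parent", "alice", "bob"), ("parent", "bob", "carol")}

def grandparent(x, z):
    # Rule, Prolog-style: grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
    return any(("parent", x, y) in facts and ("parent", y, z) in facts
               for (_, _, y) in facts)

print(grandparent("alice", "carol"))  # True: derivable from the facts
print(grandparent("alice", "dave"))   # False: not derivable
```

The catch the parent comment points at: family trees decompose neatly into discrete facts like these, but a lot of knowledge (judgment calls, fuzzy categories, context-dependent claims) doesn't, and that's where this style of representation runs out of road.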
There is a longer-term problem of trusting the explainer system, but in the near term that isn't really a concern.
The bigger value here in the near term is _explicability_ rather than alignment per se. Good explicability might provide insights into the design and architecture of LLMs in general, and that in turn may enable better design of alignment schemes.
It doesn't have to lag, though. You could ask gpt-2 to explain gpt-2. The weights are just input data. The reason this wasn't done on gpt-3 or gpt-4 is just because a) they're much bigger, and b) they're deeper, so the roles of individual neurons are more attenuated.
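Worth noting that in the paper's setup the explainer never reads raw weights anyway; it reads text excerpts annotated with per-token activations, so in principle any model that can follow the prompt could play the explainer role. A rough sketch of how such a prompt might be assembled (my paraphrase, not the paper's exact prompt, which is included in the paper):

```python
def build_explainer_prompt(records):
    """Format top-activating excerpts as token/activation pairs, roughly
    the way the explainer model is shown a neuron's behavior."""
    lines = [
        "We're studying a neuron in a language model.",
        "Look at which tokens it activates on and summarize what it responds to.",
        "",
    ]
    for tokens, activations in records:
        lines.append("Excerpt:")
        lines += [f"  {tok!r}\t{act:.1f}" for tok, act in zip(tokens, activations)]
        lines.append("")
    lines.append("Explanation: this neuron activates on")
    return "\n".join(lines)

# Hypothetical activation records for one neuron.
records = [
    (["The", " Marvel", " movie", " was", " great"], [0.0, 8.2, 6.1, 0.0, 0.3]),
    (["I", " watched", " a", " film", " yesterday"], [0.0, 0.4, 0.0, 5.9, 0.0]),
]
print(build_explainer_prompt(records))
```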
I had similar thoughts about the general concept of using AI to automate AI Safety.
I really like their approach and I think it’s valuable. And in this particular case, they do have a way to score the explainer model.
And I think it could be very valuable for various AI Safety issues.
However, I don’t yet see how it can help with the potentially biggest danger: a superintelligent AGI being created that is not aligned with humans.
The newly created AGI might be 10x more intelligent than the explainer model, to such an extent that the explainer model is not capable of understanding any of the tactics deployed by the superintelligent AGI, the same way ants are most probably not capable of explaining the tactics deployed by humans, even if we gave them 100 years to figure it out.
You're correct to have a suspicion here. Hypothetically the explainer could omit a neuron or give a wrong explanation for the role of a neuron.
Imagine you're trying to understand a neural network, and you spend an enormous amount of time generating hypotheses and validating them.
Well, if the explainer gives you 90% correct hypotheses, that means you have 10 times less work to do producing hypotheses.
So if you have a solid way of testing an explanation, even if the explainer is evil, it's still useful.
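In other words, the trust lives in the test, not in the generator. Schematically (hypothetical `propose` and `score` stand in for the explainer and the simulate-and-correlate scoring pipeline):

```python
def vet_explanations(neuron, propose, score, threshold=0.5):
    """Keep only explanations that survive the independent scoring step.
    The proposer can be sloppy or even adversarial; a bad explanation
    just fails the test and costs nothing but compute."""
    return [(e, s) for e in propose(neuron)
            if (s := score(neuron, e)) >= threshold]

# Toy stand-ins for the explainer and the scorer.
propose = lambda n: ["fires on Marvel-related tokens", "fires on the letter e"]
score = lambda n, e: 0.8 if "Marvel" in e else 0.1
print(vet_explanations("neuron_42", propose, score))  # only the first candidate survives
```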
Using 'im feeling lucky' from the neuron viewer is a really cool way to explore different neurons, and then being able to navigate up and down through the net to related neurons is great too.