"This work is part of the third pillar of our approach to alignment research: we want to automate the alignment research work itself. A promising aspect of this approach is that it scales with the pace of AI development. As future models become increasingly intelligent and helpful as assistants, we will find better explanations."
On first look this is genius but it seems pretty tautological in a way. How do we know if the explainer is good?... Kinda leads to thinking about who watches the watchers...
The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation. They ask GPT-4 to guess the neuron's activations given an explanation and an input (the paper includes the full prompt used), and then they calculate the correlation between the actual and the simulated activations.
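To make that concrete, here's a rough sketch of the scoring step (my own illustration, not the paper's code; the activation arrays are made up):

```python
import numpy as np

def correlation_score(actual: np.ndarray, simulated: np.ndarray) -> float:
    """Score an explanation by how well the simulated activations
    track the real ones (the paper uses correlation for this)."""
    # Correlation is undefined for constant sequences, so guard against that.
    if actual.std() == 0 or simulated.std() == 0:
        return 0.0
    return float(np.corrcoef(actual, simulated)[0, 1])

# Hypothetical per-token activations for one text excerpt.
actual = np.array([0.0, 0.1, 2.3, 0.0, 1.8, 0.0])     # measured in the subject model
simulated = np.array([0.0, 0.0, 2.0, 0.2, 1.5, 0.1])  # GPT-4's guesses from the explanation
print(correlation_score(actual, simulated))            # close to 1.0 for a good explanation
```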
They discuss two issues with this methodology. First, explanations are ultimately for humans, so using GPT-4 to simulate a human reader, while necessary in practice, may cause divergence. They guard against this by asking humans whether they agree with the explanations, and showing that humans agree more with explanations that score higher in correlation.
Second, correlation is an imperfect measure of how faithfully neuron behavior is reproduced. To guard against this, they run the network with the neuron's activation replaced by the simulated activation, and show that the network's output stays closer to the original (measured by Jensen-Shannon divergence) when the correlation is higher.
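And a sketch of that second check, again just to make the idea concrete (the distributions here are made up; scipy's `jensenshannon` returns the JS distance, so it's squared to get the divergence):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def ablation_score(p_original: np.ndarray, p_simulated: np.ndarray) -> float:
    """Compare the network's next-token distribution against the distribution
    you get after replacing the neuron's activation with the simulated one.
    Smaller divergence = the explanation captures more of what the neuron does."""
    return jensenshannon(p_original, p_simulated) ** 2

# Hypothetical next-token distributions over a tiny 4-token vocabulary.
p_original = np.array([0.70, 0.20, 0.05, 0.05])   # unmodified network
p_simulated = np.array([0.65, 0.22, 0.08, 0.05])  # neuron replaced by its simulated activation
print(ablation_score(p_original, p_simulated))
```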
> The paper explains this in detail, but here is a summary: an explanation is good if you can recover actual neuron behavior from the explanation.
To be clear, this is only neuron activation strength on text inputs. We aren't doing any mechanistic modeling of whether the explanation of what a neuron does predicts the role that neuron plays within the internals of the network, even though most neurons likely have a role that can only be succinctly summarized in relation to the rest of the network.
It seems very easy to end up with explanations that correlate well with a neuron, but do not actually meaningfully explain what the neuron is doing.
Why is this genius? It's just the NN equivalent of making a new programming language and getting it to the point where its compiler can be written in itself.
The reliability question is of course the main issue. If you don't know how the system works, you can't assign a trust value to anything it comes up with, even if it seems like what it comes up with makes sense.
I love the epistemology related discussions AI inevitably surfaces. How can we know anything that isn't empirically evident and all that.
It seems NN output could be trusted in scenarios where a test exists. For example: "ChatGPT, design a house using [APP] and make sure the compiled plans comply with structural/electrical/design/etc. codes for area [X]."
But how is any information that isn't testable trusted? I'm open to the idea that ChatGPT is as credible as experts in the dismal sciences, given that the information cannot be proven or falsified and legitimacy is assigned by stringing together words that "make sense".
> But how is any information that isn't testable trusted? I'm open to the idea that ChatGPT is as credible as experts in the dismal sciences, given that the information cannot be proven or falsified and legitimacy is assigned by stringing together words that "make sense".
I understand that around the 1980s, the dream was that people could express knowledge in something like Prolog, including the test case, which could then be deterministically evaluated. This does really work, but surprisingly many things cannot be represented in terms of “facts”, which really limits its applicability.
I didn’t opt for Prolog electives in school (I did Haskell instead) so I honestly don’t know why so many “things” are unrepresentable as “facts”.
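For what it's worth, the "knowledge as facts plus rules, deterministically evaluated" idea looks roughly like this (a toy Python stand-in, not real Prolog):

```python
# Facts: things asserted to be true.
facts = {("parent", "alice", "bob"), ("parent", "bob", "carol")}

def grandparent(x, z):
    # Rule, Prolog-style: grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
    return any(("parent", x, y) in facts and ("parent", y, z) in facts
               for (_, _, y) in facts)

print(grandparent("alice", "carol"))  # True: derivable from the facts
print(grandparent("alice", "dave"))   # False: not derivable
```

The catch the parent comment points at: family trees decompose neatly into discrete facts like these, but a lot of knowledge (judgment calls, fuzzy categories, context-dependent claims) doesn't, and that's where this style of representation runs out of road.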
There is a longer-term problem of trusting the explainer system, but in the near term that isn't really a concern.
The bigger value here in the near term is _explicability_ rather than alignment per se. Good explicability might provide insights into the design and architecture of LLMs in general, and that in turn may enable better design of alignment schemes.
It doesn't have to lag, though. You could ask gpt-2 to explain gpt-2. The weights are just input data. The reason this wasn't done on gpt-3 or gpt-4 is just because a) they're much bigger, and b) they're deeper, so the roles of individual neurons are more attenuated.
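Worth noting that in the paper's setup the explainer never reads raw weights anyway; it reads text excerpts annotated with per-token activations, so in principle any model that can follow the prompt could play the explainer role. A rough sketch of how such a prompt might be assembled (my paraphrase, not the paper's exact prompt, which is included in the paper):

```python
def build_explainer_prompt(records):
    """Format top-activating excerpts as token/activation pairs, roughly
    the way the explainer model is shown a neuron's behavior."""
    lines = [
        "We're studying a neuron in a language model.",
        "Look at which tokens it activates on and summarize what it responds to.",
        "",
    ]
    for tokens, activations in records:
        lines.append("Excerpt:")
        lines += [f"  {tok!r}\t{act:.1f}" for tok, act in zip(tokens, activations)]
        lines.append("")
    lines.append("Explanation: this neuron activates on")
    return "\n".join(lines)

# Hypothetical activation records for one neuron.
records = [
    (["The", " Marvel", " movie", " was", " great"], [0.0, 8.2, 6.1, 0.0, 0.3]),
    (["I", " watched", " a", " film", " yesterday"], [0.0, 0.4, 0.0, 5.9, 0.0]),
]
print(build_explainer_prompt(records))
```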
I had similar thoughts about the general concept of using AI to automate AI Safety.
I really like their approach and I think it’s valuable. And in this particular case, they do have a way to score the explainer model.
And I think it could be very valuable for various AI Safety issues.
However, I don’t yet see how it can help with the potentially biggest danger: a superintelligent AGI being created that is not aligned with humans.
The newly created AGI might be 10x more intelligent than the explainer model, to such an extent that the explainer model is not capable of understanding any of the tactics deployed by the superintelligent AGI, the same way ants are most probably not capable of explaining the tactics deployed by humans, even if we gave them 100 years to figure it out.
You're correct to have a suspicion here. Hypothetically the explainer could omit a neuron or give a wrong explanation for the role of a neuron.
Imagine you're trying to understand a neural network, and you spend an enormous amount of time generating hypotheses and validating them.
Well, if the explainer gives you 90% correct hypotheses, that means you have 10 times less work to do producing hypotheses.
So if you have a solid way of testing an explanation, even if the explainer is evil, it's still useful.
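In other words, the trust lives in the test, not in the generator. Schematically (hypothetical `propose` and `score` stand in for the explainer and the simulate-and-correlate scoring pipeline):

```python
def vet_explanations(neuron, propose, score, threshold=0.5):
    """Keep only explanations that survive the independent scoring step.
    The proposer can be sloppy or even adversarial; a bad explanation
    just fails the test and costs nothing but compute."""
    return [(e, s) for e in propose(neuron)
            if (s := score(neuron, e)) >= threshold]

# Toy stand-ins for the explainer and the scorer.
propose = lambda n: ["fires on Marvel-related tokens", "fires on the letter e"]
score = lambda n, e: 0.8 if "Marvel" in e else 0.1
print(vet_explanations("neuron_42", propose, score))  # only the first candidate survives
```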
Using 'im feeling lucky' from the neuron viewer is a really cool way to explore different neurons, and then being able to navigate up and down through the net to related neurons is great too.