This week’s AI Research Review is Locating and Editing Factual Associations in GPT.
What’s Exciting About this Paper
In this paper, the authors show that facts in GPT can be localized and that individual facts can be changed.
Key Findings
This paper touches on an old question: can we localize where knowledge is stored in a network?
Obviously, GPT has learned facts about our world. For example, GPT predicts "Seattle" for the input text "The Space Needle is in downtown". So it has learned that the Space Needle is in Seattle.
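For readers who want to try this kind of probe themselves, here is a minimal sketch (assuming GPT-2 XL loaded via the HuggingFace transformers library; a smaller model may not predict "Seattle"):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

ids = tok("The Space Needle is in downtown", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Greedy next-token prediction; for GPT-2 XL this should come out as " Seattle".
print(tok.decode(logits[0, -1].argmax().item()))
```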
The question is: can we find out where this knowledge is located? And if so, can we modify it?
Localization of facts
To answer the first question, the authors did the following:
They run the network twice: once normally, and once with noise added to the input "The Space Needle", so that the model predicts something else, e.g. "Paris".
Then, for each hidden state in the corrupted forward pass, they replace it with the corresponding state from the clean pass and check whether the output flips back to "Seattle". If it does, they consider that hidden state important for the prediction.
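A rough sketch of this procedure (assuming GPT-2 XL via HuggingFace transformers; the noise scale, the restored layer, and the restored position below are illustrative choices, not the paper's exact setup):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

ids = tok("The Space Needle is in downtown", return_tensors="pt").input_ids
subject_pos = [0, 1, 2]                      # token positions of "The Space Needle"
target_id = tok(" Seattle").input_ids[0]

# 1) Clean run: record every transformer block's output hidden state.
clean_states = {}
def save_hook(layer):
    def fn(module, inputs, output):
        clean_states[layer] = output[0].detach()
    return fn

handles = [block.register_forward_hook(save_hook(i))
           for i, block in enumerate(model.transformer.h)]
with torch.no_grad():
    clean_logits = model(ids).logits
for h in handles:
    h.remove()

# 2) Corrupted run: add noise to the subject's input embeddings.
embeds = model.transformer.wte(ids).detach().clone()
embeds[0, subject_pos] += 0.1 * torch.randn_like(embeds[0, subject_pos])

# 3) Restore one clean hidden state (layer, position) during the corrupted
#    run and measure how much probability of " Seattle" comes back.
def restore_hook(layer, pos):
    def fn(module, inputs, output):
        patched = output[0].clone()
        patched[0, pos] = clean_states[layer][0, pos]  # copy the clean state back in
        return (patched,) + output[1:]
    return fn

layer, pos = len(model.transformer.h) // 2, ids.shape[1] - 1  # illustrative choice
handle = model.transformer.h[layer].register_forward_hook(restore_hook(layer, pos))
with torch.no_grad():
    patched_logits = model(inputs_embeds=embeds).logits
handle.remove()

p_clean = torch.softmax(clean_logits[0, -1], -1)[target_id].item()
p_restored = torch.softmax(patched_logits[0, -1], -1)[target_id].item()
print(f"p(' Seattle'): clean={p_clean:.3f}, corrupted+restored={p_restored:.3f}")
```

In the paper this restoration is done for every layer and every token position, which yields the causal-effect heatmaps discussed next.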
Localization of facts - findings
The figures show the causal effect on the prediction of (e) each hidden state, (f) only the MLP activations, and (g) only the attention activations.
The result is two clusters: an early site right after the subject tokens and a late site right before the model has to predict the output.
Based on these findings, they hypothesize that the MLP activations are where these facts are stored.
The MLP consists of an up projection and a down projection. They focus specifically on the down projection.
They think of this down-projection matrix as something comparable to a key-value store. The hypothesis is that the key corresponds to the subject "Space Needle" and the value to some fact about the subject, like location=Seattle. At this point, the network does not yet know that it should predict a location, so it is simply collecting facts about the subject.
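To make this hypothesis concrete, here is a toy sketch of the key-value view (shapes and names are made up for illustration; this is not the paper's code):

```python
import torch

d_model, d_mlp = 8, 32
W_up = torch.randn(d_mlp, d_model)     # up projection
W_down = torch.randn(d_model, d_mlp)   # down projection; its columns act as "values"

subject_repr = torch.randn(d_model)         # residual-stream vector at the subject token
key = torch.relu(W_up @ subject_repr)       # "key": the activation pattern for this subject
value = W_down @ key                        # "value": a weighted mix of W_down's columns
# Editing a fact then means changing W_down so that this particular key
# maps to a different value.
```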
How can we change a fact?
To answer the second question, the authors did the following:
Based on their findings, they want to change the down projection of the MLP that maximizes the causal effect measured in the previous step.
To do this, they construct a new key-value pair that they want the model to store.
The key is easy to get: they simply copy it from the forward pass.
The value is harder to get because it does not exist yet. So they set the desired output to "Paris" and, in a similar way to how one would craft an adversarial example, optimize backwards for the vector v that makes the output change to "Paris". This backpropagation does not change the network itself; it is only used to compute the vector v.
Then, they run a local optimization on one specific down-projection matrix so that the value stored for this key is changed.
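A minimal sketch of such a targeted edit, given a key k copied from the forward pass and an optimized value vector v (this simplified version just forces W @ k == v and omits the additional constraint the paper uses to leave other keys as undisturbed as possible):

```python
import torch

d_model, d_mlp = 8, 32
W = torch.randn(d_model, d_mlp)   # the chosen down-projection matrix
k = torch.randn(d_mlp)            # key for "Space Needle" (from the forward pass)
v = torch.randn(d_model)          # target value encoding e.g. location=Paris

residual = v - W @ k                                # what the current output is missing
W_edited = W + torch.outer(residual, k) / (k @ k)   # rank-one update

assert torch.allclose(W_edited @ k, v, atol=1e-4)   # the key now maps to the new value
```

Note that the update is rank one, so only this single matrix is touched.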
Which matrix should be picked?
As we can see in the picture, there is a whole region that contains the information, but it turns out that changing one particular matrix is enough. They pick the matrix where the causal effect peaks. This local change of a single matrix is enough to flip the output prediction to "Paris".
The authors assume this works because of the residual stream: the MLPs somehow write their facts into the stream, and if a later write overwrites an earlier one, the new fact wins.
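A purely illustrative toy version of that intuition (not from the paper):

```python
import torch

d_model = 8
stream = torch.zeros(d_model)            # residual stream at the relevant token
stream = stream + torch.randn(d_model)   # an early MLP writes its fact to the stream
stream = stream + torch.randn(d_model)   # the edited MLP adds its new fact on top
# Later layers only see the accumulated sum, so the edited write can dominate
# what was written before -- one reason a single-matrix change may be enough.
```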
Our Takeaways
MLPs seem to behave like key-value stores and facts seem to be located in them.
Facts can be changed by updating a single down projection matrix in one of the MLPs.
In the future, one might be able to update facts inside a language model at scale, instead of collecting an updated training dataset and retraining the model.
This is a step forward in understanding where knowledge is located inside a language model, but it also raises many more interesting questions about facts inside language models.
Credits to Yannic Kilcher and his YouTube video explaining the paper in detail!