What’s Exciting About This Paper
This work falls into the category of parameter-efficient fine-tuning, where the goal is to train as few parameters as possible while achieving almost the same accuracy as fine-tuning the whole model.
The authors propose a novel approach: freeze all of the parameters in the transformer encoder except the bias terms, and fine-tune on downstream tasks using only those bias parameters. The results are rather surprising: BitFit performs on par with a fully fine-tuned model on the GLUE benchmark tasks despite updating only 0.08% of the total parameters.
Key Findings
Although pre-trained transformer-based language models like BERT perform significantly better than earlier models on many NLP tasks, they are expensive to fine-tune and deploy in production. This has led researchers to develop more efficient fine-tuning techniques.
BitFit approaches this problem by freezing all the parameters of a pre-trained LM and updating only the bias terms. For small to medium-sized datasets, this strategy performs almost as well as a fully fine-tuned model and sometimes even outperforms it.
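To make the idea concrete, here is a minimal sketch of BitFit-style freezing in PyTorch with Hugging Face Transformers (not the authors' code; the model name and task head are placeholders). It relies on the fact that every bias tensor's name ends with "bias" in BERT's parameter naming.

```python
import torch
from transformers import BertForSequenceClassification

# Placeholder model/task; the freezing logic itself is model-agnostic for BERT-style encoders.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

trainable, total = 0, 0
for name, param in model.named_parameters():
    total += param.numel()
    # Keep only bias terms trainable; freeze every weight matrix and embedding.
    param.requires_grad = name.endswith("bias")
    if param.requires_grad:
        trainable += param.numel()

print(f"Trainable: {trainable:,} / {total:,} parameters ({100 * trainable / total:.2f}%)")

# Optimize only the unfrozen (bias) parameters in the usual training loop.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

In practice the randomly initialized, task-specific classification head is trained as well; to mirror that you could additionally unfreeze parameters whose names start with "classifier".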
Here is a comparison table between BitFit and Full-FT.
If we accept a small degradation in performance, we can go even further and update only the biases of the query projection and the second MLP layer, which together make up about 0.04% of the total parameters. Another question the authors ask is whether the bias terms are special, or whether any similarly sized subset of parameters would work just as well. To test this, they fine-tuned a randomly selected set of 100k parameters instead, and it performed significantly worse than BitFit.
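As a rough sketch of that smaller variant, one can simply tighten the name filter from the snippet above. Which Hugging Face parameter corresponds to the paper's "second MLP layer" is an assumption here (I map it to the intermediate dense layer); adjust the suffixes if a different sub-layer is meant.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Assumption: "query bias" -> attention.self.query.bias and
# "second MLP layer bias" -> intermediate.dense.bias in Hugging Face's BERT naming.
KEEP_SUFFIXES = ("attention.self.query.bias", "intermediate.dense.bias")

for name, param in model.named_parameters():
    param.requires_grad = name.endswith(KEEP_SUFFIXES)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {100 * trainable / total:.3f}% of all parameters")
```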
Our Takeaways
Fine-tuning only a small group of parameters opens the door to easier deployment. Since most of the parameters are unchanged, we can deploy a single base model and reuse it across different tasks, storing only a small set of task-specific parameters for each one. Having one reusable model for multiple tasks also consumes significantly less memory.
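One way to picture the deployment benefit: after BitFit training, each task can be shipped as a tiny "bias patch" applied on top of one shared pre-trained checkpoint. A sketch, with hypothetical file names:

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# ... assume the bias terms were fine-tuned on task A at this point ...

# Save only the bias tensors: a small per-task artifact next to one shared base model.
bias_patch = {n: p.detach().cpu() for n, p in model.named_parameters() if n.endswith("bias")}
torch.save(bias_patch, "task_a_biases.pt")  # hypothetical filename

# At serving time: load the shared base model once, then swap in a task's patch.
base = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
base.load_state_dict(torch.load("task_a_biases.pt"), strict=False)  # patch covers biases only
```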