This week’s Deep Learning Paper Recaps are Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition and Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition
Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition
What’s Exciting About this Paper
This paper proposes a novel method for generating targeted adversarial audio samples that are not only imperceptible to human listeners, but also robust to being played over the air.
Key Findings
The adversarial samples generated have the following properties (a rough sketch of how such an attack is optimized follows the list):
- Imperceptible: The attacked audio sounds so similar to the original that a human cannot tell the two apart.
- Robust: The attack remains effective even when the audio is played over the air, for example when a sample is played through a speaker, recorded by a microphone, and then fed to the model.
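For intuition, here is a minimal sketch of how a targeted audio attack is typically optimized: gradient descent on an additive perturbation. This is a simplification, not the paper's exact method (which bounds the perturbation with a psychoacoustic masking threshold and averages over simulated room acoustics for robustness); the `model.loss` method is a hypothetical stand-in for an ASR loss against the target transcript.

```python
import torch

def targeted_audio_attack(model, audio, target_ids,
                          eps=0.05, step_size=1e-3, num_steps=1000):
    # `model.loss` is a hypothetical stand-in: it should return a loss that
    # is low when the model transcribes the audio as `target_ids`.
    delta = torch.zeros_like(audio, requires_grad=True)
    for _ in range(num_steps):
        loss = model.loss(audio + delta, target_ids)
        loss.backward()
        with torch.no_grad():
            # Step the perturbation toward the target transcript.
            delta -= step_size * delta.grad.sign()
            # Crude imperceptibility constraint: keep the perturbation tiny.
            # The paper instead uses a frequency-masking threshold, and gains
            # robustness by optimizing over simulated playback/recording.
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (audio + delta).detach()
```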
Our Takeaways
While the imperceptible attack is stealthy and has a high success rate, the combined imperceptible + robust attack, currently at only a 50% success rate, still needs improvement. The attack appears to be weakened by resampling the audio.
Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition
What’s Exciting About this Paper
Fine-tuning the pretrained wav2vec model separately for each downstream task yields one large model per task, which is expensive to deploy. This paper shows that inserting adapters reduces the number of parameters that must be updated during fine-tuning: instead of roughly 90% of the parameters, only about 10% need to be fine-tuned, so the remaining 90% can be reused across downstream tasks.
Key Findings
The authors insert adapter layers into each transformer encoder block of the wav2vec model. Inside each adapter layer, a linear down-projection is followed by a linear up-projection, with a skip connection around the adapter.
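As a concrete illustration, here is a minimal PyTorch sketch of such an adapter. The bottleneck width, the nonlinearity between the two projections, and the exact placement inside the encoder block are our assumptions, not values from the paper.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: linear down-projection, nonlinearity,
    linear up-projection, with a skip connection around the block."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):  # illustrative sizes
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()  # choice of nonlinearity is an assumption
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x):
        # The skip connection lets the adapter learn a small residual
        # correction on top of the frozen encoder representation.
        return x + self.up(self.act(self.down(x)))
```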
The authors trained the entire model on English speech data. Then, they ran two experiments with French speech data (a parameter-freezing sketch follows the list):
- They fine-tuned the entire network (95.6% of the parameters)
- They fine-tuned only the adapter layers (9.2% of the parameters)
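The second setup amounts to freezing the pretrained weights and training only the adapters. A sketch of how that could look in PyTorch, assuming adapter parameters can be identified by name (the "adapter" substring is a hypothetical naming convention):

```python
def freeze_all_but_adapters(model):
    # Train only parameters whose name marks them as adapter weights;
    # the "adapter" naming convention here is hypothetical.
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Fine-tuning {trainable / total:.1%} of the parameters")
```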
Results are measured in word error rate (WER) on French test data and show similar performance for both experiments (a minimal WER implementation follows the list):
- 40.2% WER when fine-tuning the whole network
- 39.4% WER when fine-tuning only the adapter layers
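For readers unfamiliar with the metric, WER counts word-level substitutions, deletions, and insertions against the reference transcript, normalized by the reference length. A minimal implementation:

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    # Standard Levenshtein (edit) distance over words.
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(reference)][len(hypothesis)] / len(reference)

# wer("the cat sat on the mat".split(), "the cat sit on mat".split())
# -> 2 errors / 6 reference words ≈ 0.33
```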
Our Takeaways
Adapters show that fine-tuning the whole model for downstream tasks is not required. Instead, it is enough to fine-tune small adapter layers carefully inserted into the model.
Using adapters allows us to reuse roughly 90% of the parameters across downstream tasks rather than deploying one fully fine-tuned model per task.