July 7, 2022

Deep Learning Paper Recap - Language Models

This week’s Deep Learning Paper Recap is Prune Once For All: Sparse Pre-Trained Language Models

By Taufiquzzaman Peyash, Deep Learning Engineer

What’s Exciting About This Paper?

Model pruning is one of the key ways to compress a Deep Learning model, but pruning techniques typically differ based on the model architecture. This paper introduces an architecture-agnostic method for training sparse pre-trained language models.

This method lets us prune only once, during the pre-training phase, with no need to prune again during fine-tuning. The researchers also propose a fine-tuning mechanism that leverages knowledge distillation to achieve the best compression-to-accuracy ratio.
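
To get a feel for why this can be architecture-agnostic, note that magnitude pruning only needs a weight tensor, not any knowledge of the surrounding network. A minimal sketch of one-shot magnitude pruning on a single layer (our own simplification in PyTorch, not the authors' code) might look like this:

```python
import torch
import torch.nn as nn

def magnitude_prune_(module: nn.Linear, sparsity: float = 0.85) -> torch.Tensor:
    """Zero out the smallest-magnitude weights in place and return the binary mask."""
    weight = module.weight.data
    k = int(weight.numel() * sparsity)                      # number of weights to drop
    threshold = weight.abs().flatten().kthvalue(k).values   # magnitude cutoff
    mask = (weight.abs() > threshold).float()               # 1 = keep, 0 = prune
    weight.mul_(mask)                                       # apply the mask in place
    return mask

# The same call works for any module exposing a weight matrix, which is what
# makes magnitude-based pruning agnostic to the model architecture.
layer = nn.Linear(768, 768)
mask = magnitude_prune_(layer, sparsity=0.85)
print(f"actual sparsity: {1 - mask.mean().item():.2%}")
```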

Key Findings

Fine-tuning pruned (sparse) models usually leads to either poor results or a low sparsity ratio. That’s why modern pruning approaches like Gradual Magnitude Pruning (GMP) apply pruning during the fine-tuning phase.

The problem with this approach is that every time we fine-tune, we have to account for both the task and the model architecture when choosing a pruning technique.
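
For reference, GMP-style methods ramp the sparsity target up gradually over training steps rather than pruning everything at once. A small sketch of the commonly used cubic ramp (a generic illustration of GMP, not taken from this paper) could look like:

```python
def gmp_sparsity(step: int, total_steps: int,
                 initial_sparsity: float = 0.0,
                 final_sparsity: float = 0.90) -> float:
    """Cubic sparsity schedule often used in gradual magnitude pruning."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

# Sparsity rises quickly early in training and flattens out near the end.
for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, round(gmp_sparsity(step, total_steps=10_000), 3))
```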

With the proposed pre-training and fine-tuning mechanism, we can save time by pruning only once. Here is what the whole pipeline looks like:

[Figure: the Prune Once for All pre-training and fine-tuning pipeline (source: paper)]
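
Because the fine-tuning stage leans on knowledge distillation, here is a minimal sketch of a generic distillation loss that blends the hard-label task loss with the teacher's softened predictions. The temperature and weighting below are illustrative assumptions, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Mix the usual task loss with a KL term pulling the student toward the teacher."""
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```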

This technique achieves the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT. The best scores were achieved at 85% and 90% weight sparsity.

[Figure: accuracy results for the sparse pre-trained models (source: paper)]

They also tried Quantization-Aware Training (QAT) with 85% pruning, which produced a model that is both smaller and more accurate than the 90%-pruned model.
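
To give a rough sense of what QAT involves: the weights are passed through a simulated ("fake") quantizer during training so the model learns to tolerate int8 rounding before it is actually converted. The sketch below is a generic straight-through fake quantizer, not the exact setup used in the paper:

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    # The forward pass sees quantized values; gradients flow as if this were identity.
    return w + (w_q - w).detach()

# During QAT a layer's weights are fake-quantized on every forward pass, e.g.:
weight = torch.randn(768, 768, requires_grad=True)
out = torch.nn.functional.linear(torch.randn(4, 768), fake_quantize(weight))
out.sum().backward()  # gradients still reach the full-precision weights
```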

Our Takeaways

These pre-trained pruned models can be used to obtain fine-tuned pruned models without the burden of task-specific pruning.
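
In practice, starting from a sparse pre-trained checkpoint looks just like loading any other pre-trained model with Hugging Face Transformers. The checkpoint name below is a placeholder for whichever sparse pre-trained model you use:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint path -- substitute an actual sparse pre-trained model.
checkpoint = "path/to/sparse-pretrained-bert"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Fine-tune as usual; to preserve the sparsity, the pruning mask is typically
# re-applied (or the zeroed weights are kept frozen) during these updates.
```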

This approach saves us the time and effort of pruning the model ourselves, much like using any pre-trained Deep Learning model spares us from training from scratch. In this case, we simply start fine-tuning from the sparse pre-trained model.
