February 4, 2022

Review - Perceiver: General Perception with Iterative Attention

This week’s Deep Learning Paper Review is Perceiver: General Perception with Iterative Attention.

Dillon Pulliam

What’s Exciting About this Paper

Deep learning today is mainly focused on models built for a specific modality, such as language, vision, or speech. There are two reasons for this. First, it is easier to design an architecture for a single data type. Second, models are typically designed to exploit the structure of their data, as convolutional neural networks do with the spatial structure of images. In this work, the authors build upon the Transformer architecture and use an asymmetric attention mechanism to apply a single architecture to image, audio, video, and point cloud data, performing competitively with, and in some cases outperforming, architectures specific to these domains.

Key Findings

Scaling Transformers to high-dimensional audio and visual data has led to many custom architectures. Typically, these models use a number of low-level convolutional layers to reduce the dimensionality of the data before feeding the resulting higher-level features into the Transformer. This is often a requirement because self-attention has quadratic complexity in the number of tokens: every token attends to every other token.

For example, a 224x224 image contains 50,176 pixels, which is far more input than a model such as BERT can handle (it is limited to 512 input tokens). By using an asymmetric attention mechanism and a latent bottleneck, the authors are able to scale the Transformer to hundreds of thousands of inputs while keeping the number of model parameters to a minimum.
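To make the gap concrete, here is a small back-of-the-envelope sketch that counts attention score entries (one per query/key pair) for the cases mentioned above. The helper names are hypothetical, and the latent size of 512 is an illustrative assumption chosen to match the BERT token limit, not a setting taken from this review.

```python
def self_attention_pairs(num_tokens):
    """Score entries when every token attends to every other token."""
    return num_tokens * num_tokens


def cross_attention_pairs(num_inputs, num_latents):
    """Score entries when a small latent array attends to the inputs."""
    return num_inputs * num_latents


pixels = 224 * 224      # one token per pixel
bert_limit = 512        # BERT's maximum input length
num_latents = 512       # assumed latent size, for illustration only

print(pixels)                                      # 50176
print(self_attention_pairs(bert_limit))            # 262144
print(self_attention_pairs(pixels))                # 2517630976
print(cross_attention_pairs(pixels, num_latents))  # 25690112
```

Under these assumptions, full self-attention over raw pixels needs roughly 2.5 billion score entries per layer, while attending from 512 latents to the same pixels needs about 26 million, a 98x reduction.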

Source - Perceiver: General Perception with Iterative Attention

The image above shows how this technique works: a cross-attention module lets a small latent array attend to the much larger byte array, and the result is then processed by a Transformer operating entirely in the latent space. These two steps are repeated, so the latents attend to the input iteratively. Because the latent array is significantly smaller than the byte array, the compute requirements of the latent Transformer are decoupled from the input size, which allows deeper models to be used. Overall, this results in a total complexity of O(MN + LN²), where M is the number of elements in the byte array, N is the number of elements in the latent array, and L is the depth of the latent Transformer.
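The asymmetry can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not the authors' implementation: the learned query, key, and value projections are omitted (an assumption made for brevity), the toy sizes are arbitrary, and the function names `cross_attend` and `attention_cost` are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, byte_array):
    """Queries come from the latents; keys and values come from the inputs.

    Assumption of this sketch: no learned projections, so both arrays
    must share the same channel width.
    """
    scale = 1.0 / math.sqrt(len(latents[0]))
    output = []
    for query in latents:  # N queries
        # M scores per query, so N * M scores in total.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in byte_array]
        weights = softmax(scores)
        # Weighted sum of the input rows, which double as values here.
        mixed = [sum(w * row[c] for w, row in zip(weights, byte_array))
                 for c in range(len(query))]
        output.append(mixed)
    return output


def attention_cost(m, n, depth):
    """Score entries for one cross-attention plus `depth` latent self-attention layers."""
    return m * n + depth * n * n


random.seed(0)
M, N, C = 64, 8, 4  # toy sizes: inputs, latents, channels
byte_array = [[random.gauss(0, 1) for _ in range(C)] for _ in range(M)]
latents = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]

updated = cross_attend(latents, byte_array)
print(len(updated), len(updated[0]))  # 8 4
print(attention_cost(M, N, depth=6))  # 896
```

The output keeps the shape of the latent array regardless of M, which is why the layers that follow cost N² each rather than M².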

Our Takeaways

With this paper, data2vec, and similar work, deep learning is trending towards general-purpose models and algorithms that scale easily to any data type. Perceiver is a milestone along that path because it addresses one of the major challenges of the Transformer architecture: the quadratic cost of attention, which prevents it from being applied directly to high-dimensional data. It will be interesting to see future work that builds on the ideas of Perceiver and continues to push the limits of model generality.
