This week’s Deep Learning Paper Review is Perceiver: General Perception with Iterative Attention.
What’s Exciting About this Paper
Deep learning today mostly relies on models built for a specific modality such as language, vision, or speech. There are two reasons for this: first, it is easier to design an architecture for a single data type; second, models are typically designed to exploit the structure of that data, as convolutional neural networks do with images. In this work, the authors build on the Transformer architecture and use an asymmetric attention mechanism to apply a single architecture to image, audio, video, and point cloud data, with results that are competitive with, and in some cases better than, architectures designed specifically for these domains.
Key Findings
Scaling transformers to high-dimensional audio and visual data has led to many custom architectures. Typically, these models use several low-level convolutional layers to reduce the dimensionality of the data before feeding the resulting higher-level features into the transformer. This is often required because self-attention has quadratic complexity in the sequence length: every token attends to every other token.
For example, a 224x224 image contains 50,176 pixels, far more inputs than a model such as BERT can handle (it is limited to 512 input tokens). By using an asymmetric attention mechanism and a latent bottleneck, the authors scale the transformer to hundreds of thousands of inputs while keeping the number of model parameters small.
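To make the gap concrete, here is a rough back-of-the-envelope sketch that counts query-key pairs in one full self-attention layer. The helper name `self_attention_pairs` is made up for illustration, and the count ignores heads, channels, and constant factors.

```python
def self_attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs when every token attends to every token."""
    return num_tokens * num_tokens


image_tokens = 224 * 224   # one token per pixel
bert_tokens = 512          # BERT's input limit

image_pairs = self_attention_pairs(image_tokens)
bert_pairs = self_attention_pairs(bert_tokens)

print(image_tokens)               # 50176
print(image_pairs)                # 2517630976
print(bert_pairs)                 # 262144
print(image_pairs // bert_pairs)  # 9604
```

Under these assumptions, attending over raw pixels costs roughly 9,600 times more per layer than attending over a full BERT input.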
Source - Perceiver: General Perception with Iterative Attention
The image above shows how the technique works: a small latent array repeatedly cross-attends to the byte array (the input), and the latents are then processed by a transformer that operates entirely in the latent space. Because the latent array is much smaller than the byte array, the compute cost of the transformer is decoupled from the input size, which allows deeper models to be used. Overall, this results in a total complexity of O(MN + LN²), where M is the number of elements in the byte array, N is the number of elements in the latent array, and L is the depth of the transformer.
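The asymmetric attention step can be sketched in a few lines of NumPy. This is a minimal single-head illustration and not the authors' implementation: the random projection weights, the toy sizes (M = 1024, N = 32), and the depth L = 6 used in the cost estimate are all assumptions, and layer norm, residual connections, MLPs, and position encodings are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(latents, byte_array, w_q, w_k, w_v):
    """Queries come from the latents (N rows); keys and values come from the
    byte array (M rows). The score matrix is N x M rather than M x M."""
    q = latents @ w_q                         # (N, d)
    k = byte_array @ w_k                      # (M, d)
    v = byte_array @ w_v                      # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, M)
    weights = softmax(scores, axis=-1)        # each latent's weights sum to 1
    return weights @ v, weights               # (N, d), (N, M)


def attention_cost(m, n, depth):
    """Pair count matching O(MN + LN^2): one cross-attention step plus
    `depth` self-attention layers over the latents."""
    return m * n + depth * n * n


rng = np.random.default_rng(0)
M, C = 1024, 3    # toy byte array: 1024 elements with 3 channels
N, D = 32, 16     # toy latent array: 32 latents with 16 channels

byte_array = rng.normal(size=(M, C))
latents = rng.normal(size=(N, D))
w_q = rng.normal(size=(D, D))
w_k = rng.normal(size=(C, D))
w_v = rng.normal(size=(C, D))

updated, weights = cross_attention(latents, byte_array, w_q, w_k, w_v)
print(updated.shape, weights.shape)    # (32, 16) (32, 1024)

# Hypothetical setting: 224x224 image, 512 latents, depth 6.
print(attention_cost(50176, 512, 6))   # 27262976
print(6 * 50176 * 50176)               # 15105785856 for full self-attention
```

The output of the cross-attention has one row per latent regardless of M, which is why the later latent-space layers cost the same no matter how large the input is.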
Our Takeaways
With this paper, Data2vec, and similar work, deep learning is trending toward general-purpose models and algorithms that scale easily to any data type. Perceiver is a milestone along that path because it addresses one of the major obstacles to applying the transformer architecture directly to high-dimensional data: the quadratic cost of self-attention. It will be interesting to see future work that builds on the ideas of Perceiver and continues to push the limits of model generality.