The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
Updated on June 19, 2017
This is an AI-generated summary
The Transformer is a new network architecture based only on attention mechanisms.
The Transformer trains faster on translation tasks and is more parallelizable.
It outperforms earlier recurrent and convolutional models on the WMT 2014 English-to-German and English-to-French translation tasks.
The Transformer architecture uses an encoder-decoder structure built from stacked self-attention layers.
An attention function computes a weighted sum of the values, where the weight for each value comes from a compatibility function of the query with the corresponding key.
The Transformer uses Scaled Dot-Product Attention and Multi-Head Attention.
Position-wise Feed-Forward Networks, Embeddings and Softmax, and Positional Encoding are key components of this architecture.
The authors argue that Self-Attention improves computational efficiency and makes long-range dependencies easier to learn.
The Transformer was trained on the standard WMT 2014 datasets, in parallel across eight NVIDIA GPUs.
Sine and cosine functions were used for Positional Encoding.
The model's performance exceeded previous state-of-the-art models in translation tasks.
The Transformer also performs well on new tasks like English constituency parsing.
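The two mechanisms named above that are easiest to show concretely are Scaled Dot-Product Attention and the sine/cosine Positional Encoding. The following is a minimal sketch in plain Python, not the authors' implementation. The function names (`softmax`, `scaled_dot_product_attention`, `positional_encoding`) and the `base` parameter are illustrative assumptions. The scaling by the square root of the key dimension and the constant 10000 follow the formulas in the paper. Masking and the learned projections of Multi-Head Attention are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted sum of the value vectors.

    queries: list of vectors of length d_k
    keys:    list of vectors of length d_k (one per value)
    values:  list of vectors of length d_v
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Compatibility of this query with every key: dot product / sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the values, one output dimension at a time.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


def positional_encoding(num_positions, d_model, base=10000.0):
    """Sinusoidal table: sine on even dimensions, cosine on odd dimensions."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency base ** (-2k / d_model).
            angle = pos / (base ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


if __name__ == "__main__":
    q = [[1.0, 0.0]]
    k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(scaled_dot_product_attention(q, k, v))
    print(positional_encoding(2, 4))
```

A quick sanity check on the attention function: when every key is identical, the softmax weights are uniform, so each output is the plain average of the value vectors. A practical implementation would use batched matrix operations in a tensor library instead of Python loops.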
Special thanks to Denis Shilov for giving us a budget to generate this text.
We propose the Transformer, a new network architecture that relies solely on attention mechanisms and outperforms recurrent and convolutional models on machine translation tasks. The Transformer trains faster, is more parallelizable, and reaches 28.4 BLEU on WMT 2014 English-to-German and 41.0 BLEU on WMT 2014 English-to-French translation.
Recurrent language models and recurrent encoder-decoder architectures process a sequence one position at a time, and this inherently sequential computation makes parallelization within a training example difficult.