Transformer Implementation (Attention Is All You Need)

Deepak Saini · Published in Analytics Vidhya · Feb 4, 2021

A Transformer’s architecture is based purely on self-attention, dispensing with recurrent (RNN) sequence models entirely. The goal of self-attention is to build a representation of each sequence by relating its different positions to one another. Using this foundation, BERT breaks the familiar left-to-right constraint inherent in earlier text-analysis and representation models.

In this article, I’ll walk through the coding part of the Transformer as described in the paper “Attention Is All You Need”, and also cover some of the important theoretical aspects of the paper.

Transformer Architecture

Above is the architectural diagram of the Transformer. Basically, it contains multiple layers of Encoders and Decoders. I’m sure almost all of us have seen this architecture multiple times, so let’s break it into parts and try to understand each one of them.

Model Architecture

  • Embeddings
  • Positional Encoding
  • Encoder
  • Decoder
  • Multi-head Attention
  • Position-wise Feed-Forward Networks
  • Normalization layers
  • Softmax

Embeddings

Transformers Embedding

In the Transformer, the input embedding is added to the positional encoding before being sent for further processing.

What is Positional Encoding?

Since the Transformer does not use any recurrent network, there is no recurrence: all the input tokens are passed to the model at once. The model therefore needs some information about the order of the sequence. To provide this ordering information, the authors introduced the concept of positional encoding.

The positional encodings have the same dimension d_model as the embeddings so that the two can be summed.

In this work, the authors use sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) → for even dimensions

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)) → for odd dimensions

import math
import torch
import torch.nn as nn

# Use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Embeddings(nn.Module):
    """
    Implements embeddings of the words and adds their positional encodings.
    """
    def __init__(self, vocab_size, d_model, max_len=50):
        super(Embeddings, self).__init__()
        self.d_model = d_model
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pe = self.create_positional_encoding(max_len, self.d_model)
        self.dropout = nn.Dropout(0.1)

    def create_positional_encoding(self, max_len, d_model):
        pe = torch.zeros(max_len, d_model).to(device)
        for pos in range(max_len):            # for each position of the sequence
            for i in range(0, d_model, 2):    # for each pair of dimensions (assumes d_model is even)
                # exponent indexing follows the referenced notebook; it differs slightly from the paper's 2i/d_model
                pe[pos, i] = math.sin(pos / (10000 ** ((2 * i) / d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** ((2 * (i + 1)) / d_model)))
        pe = pe.unsqueeze(0)                  # add the batch dimension
        return pe

    def forward(self, encoded_words):
        embedding = self.embed(encoded_words) * math.sqrt(self.d_model)
        # pe is broadcast over the batch dimension of encoded_words
        embedding = embedding + self.pe[:, :embedding.size(1)]
        embedding = self.dropout(embedding)
        return embedding

Reference → https://github.com/fawazsammani/chatbot-transformer/blob/master/transformer%20chatbot.ipynb

Here d_model is the model dimension. The create_positional_encoding function initialises the positional-encoding values, while the forward function adds the positional encoding to the embedding on every forward pass.

Kindly refer to the article linked below for deeper insight into the reasoning behind positional encoding.

Encoder and Encoder Layers

Encoder Transformer

As shown in the image above, three different operations happen before the result is sent to the Decoder. Here Nx represents the number of encoder layers.

  1. Multi-Head Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. We call this particular attention “Scaled Dot-Product Attention”.

“Scaled Dot-Product Attention”
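As a minimal sketch, scaled dot-product attention can be written as a small helper function; the function name and the assumed tensor shape (batch, n_heads, seq_len, d_k) are illustrative choices of mine and may differ from the referenced notebook.

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V for tensors of shape (batch, n_heads, seq_len, d_k)."""
    d_k = query.size(-1)
    # Compatibility of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 are blocked with a large negative score
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)  # attention weights sum to 1 over the keys
    return torch.matmul(weights, value), weights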

In Multi-Head Attention (what we usually call self-attention in the Transformer when queries, keys and values come from the same sequence), the projected queries, keys and values are split into multiple heads (8 in the paper), and scaled dot-product attention is performed on each head independently before the results are concatenated.

Below is the implementation of the Multi-Head Attention layer. n_heads represents the number of attention heads we want; in our case, it is 8.

Multi-head Attention
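As a rough sketch of what such a layer looks like (reusing the scaled_dot_product_attention helper above; the class and argument names are illustrative and the notebook’s exact code may differ):

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads            # dimension of each head
        self.n_heads = n_heads
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Project, then split into heads: (batch, n_heads, seq_len, d_k)
        def split_heads(x, linear):
            return linear(x).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)

        q = split_heads(query, self.q_linear)
        k = split_heads(key, self.k_linear)
        v = split_heads(value, self.v_linear)

        # Scaled dot-product attention on every head independently
        context, weights = scaled_dot_product_attention(q, k, v, mask)

        # Concatenate the heads back to (batch, seq_len, d_model) and project
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)
        return self.out(context), weights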

2. Position-Wise Feed-Forward Layer

In addition to attention sub-layers, each of the layers in the encoder and decoder contains a fully connected feed-forward network, which is applied to each position separately and identically. This consists of two linear transformations with a ReLU activation in between.

While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1. To extract more information from the data, the input is first projected onto a larger hidden dimension (2048 in the paper) and then projected back down to d_model by the second linear layer.

Position-Wise feed-forward Network
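A minimal sketch of such a layer (the hidden size d_ff = 2048 follows the paper; the class name is illustrative):

import torch.nn as nn
import torch.nn.functional as F

class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # expand to the wider hidden dimension
        self.fc2 = nn.Linear(d_ff, d_model)   # project back down to d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Applied identically at every position: shape (batch, seq_len, d_model) is preserved
        return self.fc2(self.dropout(F.relu(self.fc1(x))))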

3. Encoder Stack Layers

In the Transformer, the input tokens are passed through multiple encoder layers to get the most benefit from the self-attention mechanism. By default, the authors use 6 encoder and 6 decoder layers.

One thing worth noting here: before the result is passed to the next layer, the sub-layer’s original input is added back to its output, so the next layer can leverage both the output of the previous encoder layer and the original signal. This is similar to ResNet skip connections.

Encoder Layer
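A sketch of an encoder layer along these lines, wiring the two sub-layers together with residual connections and layer normalization (it reuses the MultiHeadAttention and PositionWiseFeedForward sketches above; names remain illustrative):

import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Residual ("skip") connection around self-attention, then layer normalization
        attn_out, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Residual connection around the feed-forward sub-layer, then layer normalization
        ff_out = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_out))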

Finally, we need to call all these components from our Encoder class on the source input tokens, as sketched below after the note on ModuleList.

ModuleList → Holds submodules in a list.

ModuleList can be indexed like a regular Python list, but modules it contains are properly registered and will be visible by all Module methods.

ModuleList is similar to nn.Sequential, but with a slight difference. (https://discuss.pytorch.org/t/when-should-i-use-nn-modulelist-and-when-should-i-use-nn-sequential/5463)
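Putting it together, a sketch of the encoder stack built with nn.ModuleList (reusing the Embeddings and EncoderLayer classes above; the hyper-parameter defaults follow the paper, and the names are illustrative):

import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8, d_ff=2048, max_len=50):
        super().__init__()
        self.embed = Embeddings(vocab_size, d_model, max_len)
        # ModuleList registers every layer so their parameters are visible to the optimizer
        self.layers = nn.ModuleList(
            [EncoderLayer(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )

    def forward(self, src_tokens, src_mask=None):
        x = self.embed(src_tokens)   # token embedding + positional encoding
        for layer in self.layers:    # pass through the stacked encoder layers (Nx = n_layers)
            x = layer(x, src_mask)
        return x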

Decoder Architecture

Decoder Architecture

Functionalities in the Decoder are almost the same as in the Encoder, with slight changes such as masking of the outputs.

Why Masking?

Masking is needed to prevent the Decoder from looking at future tokens, so that the model predicts the next token using only the previous ones. To build the triangular mask, we use the torch APIs torch.tril and torch.triu.
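For illustration, a small sketch of how such a triangular mask can be built with torch.tril (the helper name is my own):

import torch

def subsequent_mask(size):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    # Shape (1, size, size); 1 = allowed, 0 = blocked (future token)
    return torch.tril(torch.ones(1, size, size)).bool()

print(subsequent_mask(4).int())
# tensor([[[1, 0, 0, 0],
#          [1, 1, 0, 0],
#          [1, 1, 1, 0],
#          [1, 1, 1, 1]]], dtype=torch.int32)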

Similar to the Encoder layer, the Decoder layer also uses multi-head attention and a position-wise feed-forward network.

Decoder Layer

Decoder Layer
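A sketch of a decoder layer in the same style: masked self-attention over the target tokens, then encoder-decoder attention over the encoder output, then the feed-forward network (reusing the classes sketched above; names are illustrative):

import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.src_attn = MultiHeadAttention(d_model, n_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, src_mask=None, tgt_mask=None):
        # Masked self-attention: the triangular tgt_mask hides future target tokens
        attn_out, _ = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Encoder-decoder attention: queries from the decoder, keys/values from the encoder output
        attn_out, attn_weights = self.src_attn(x, memory, memory, src_mask)
        x = self.norm2(x + self.dropout(attn_out))
        # Position-wise feed-forward with residual connection
        ff_out = self.feed_forward(x)
        return self.norm3(x + self.dropout(ff_out)), attn_weights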

Decoder

We also return the attention weights as an output, for visualization purposes.

Decoder
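A sketch of the decoder stack, which also returns the last layer’s encoder-decoder attention weights for visualization and ends with a (log-)softmax over the target vocabulary (again illustrative names; the notebook’s exact code may differ):

import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8, d_ff=2048, max_len=50):
        super().__init__()
        self.embed = Embeddings(vocab_size, d_model, max_len)
        self.layers = nn.ModuleList(
            [DecoderLayer(d_model, n_heads, d_ff) for _ in range(n_layers)]
        )
        self.logit = nn.Linear(d_model, vocab_size)   # projection onto the vocabulary

    def forward(self, tgt_tokens, memory, src_mask=None, tgt_mask=None):
        x = self.embed(tgt_tokens)
        for layer in self.layers:
            x, attn_weights = layer(x, memory, src_mask, tgt_mask)
        # Log-probabilities over the vocabulary; attention weights are kept for visualization
        return F.log_softmax(self.logit(x), dim=-1), attn_weights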

Finally, we need to call the Encoder and Decoder from another class. Below is the GitHub link to my full code.

https://github.com/deepak121993/END/blob/main/Transformer_from_Scratch.ipynb

The next article will cover BERT’s implementation and the use of its contextual embeddings for sentence similarity.

Special Thanks to Rohan and Team for the awesome lectures on Deep Learning NLP.

Keep Learning

References:

https://github.com/fawazsammani/chatbot-transformer/
