BERT Embedding for Classification

Deepak Saini · Analytics Vidhya · May 16, 2021

The recent advances in machine learning and the growing amounts of available data have had a great impact on the field of Natural Language Processing (NLP). They facilitated the development of new neural architectures and led to strong improvements on many NLP tasks, such as machine translation and text classification. One advancement of particular importance is the development of models that build good-quality, machine-readable representations of word meanings. These representations, often referred to as word embeddings, are vectors that can be used as features in neural models that process text data.

Types of embeddings

1. Static Word Embedding:

As the name suggests, these word embeddings are static in nature: each word gets a single, fixed, pre-trained vector that we can reuse while training our own models.

GloVe and word2vec are the best-known examples of static word embeddings. We mostly use the pre-trained versions, although in some scenarios we could also train these embeddings ourselves. Training them from scratch, however, requires a lot of GPU compute and time.
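For example, a minimal sketch of loading pre-trained GloVe vectors with gensim (gensim is not used in the rest of this post; it is just one convenient way to get static embeddings):

import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # pre-trained 100-dimensional GloVe vectors
print(glove["mouse"].shape)                   # (100,) -- one fixed vector per word
print(glove.most_similar("mouse", topn=3))    # the same vector regardless of context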

2. Contextual Embedding:

Contextual embeddings (e.g. ELMo, BERT) aim to learn a continuous (vector) representation for each word occurrence in a document, taking its surrounding context into account. These continuous representations can be used in downstream machine learning tasks.

But just how contextual are these contextualized representations?

Consider the word ‘mouse’. It has multiple word senses, one referring to a rodent and another to a device. Does BERT effectively create one representation of ‘mouse’ per word sense? Or does BERT create infinitely many representations of ‘mouse’, each highly specific to its context?

A traditional (static) embedding assigns a single vector to ‘mouse’, so it can at best capture an average of the two senses, whereas BERT gives us the context in real time as well: in a phrase like “the cheese-loving mouse”, the embedding is closer to the rodent sense, while in “click on the mouse” it is closer to the device sense.

How could we use Embedding for Classification?

If we don’t have much data for classification, we can use the embeddings directly, just as a Siamese network does for facial recognition.

This approach is often called one-shot (or few-shot) learning: we don’t have enough labeled data to train a classifier, so instead we compute the embeddings of the existing labeled examples and classify a new data point by how close its embedding is to theirs. Whichever labeled example is closest determines the category. For closeness of the embeddings, we use the cosine-similarity function (see the sketch below).
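A minimal sketch of this idea, assuming get_embedding() is a placeholder for the document-embedding routine built step by step in the rest of this post:

import torch

cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

def classify(new_text, labelled_examples, get_embedding):
    """labelled_examples: list of (text, label) pairs, a handful per class."""
    new_emb = get_embedding(new_text)
    best_label, best_score = None, -1.0
    for text, label in labelled_examples:
        score = cos(new_emb, get_embedding(text)).item()
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score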

I’ll be using the Hugging Face transformers library for this task (https://huggingface.co/transformers/).

In the use case below, I show how to create embeddings for long documents (more than 512 tokens), since BERT cannot take more than 512 tokens as input at once.

To work around this limit, I use a sliding-window technique so that we can extract an embedding for the whole document.

Loading Libraries

We are using the “bert-base-uncased” model. Always set output_hidden_states=True while initializing the model; otherwise, the model will not return the hidden states.
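A sketch of this setup (exact imports may vary slightly with your transformers version):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained(
    "bert-base-uncased",
    output_hidden_states=True,  # without this, hidden states are not returned
)
model.eval()  # inference only, no dropout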

Create tokens from the text with the tokenizer. With add_special_tokens=False, the special tokens ([CLS] and [SEP]) are not added to the token sequence; we will add them to each chunk ourselves.
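A sketch of the tokenization step, with text as a placeholder for the long document (the tokenizer may warn that the sequence is longer than 512 tokens; that is expected here):

text = "..."  # the long document we want to embed (placeholder)

encoded = tokenizer.encode_plus(
    text,
    add_special_tokens=False,  # we add [CLS]/[SEP] per chunk ourselves
    return_tensors="pt",
)
input_ids = encoded["input_ids"][0]           # shape: (num_tokens,)
attention_mask = encoded["attention_mask"][0]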

To overcome the 512-token constraint, we use the sliding-window technique: we split the input IDs and the attention-mask IDs into windows so that no window exceeds 512 tokens in total.

We split the tokens into chunks of length 510; the remaining two positions are reserved for the [CLS] and [SEP] tokens.
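One way to do this split, using PyTorch’s tensor split():

chunk_size = 510  # leave room for [CLS] and [SEP] in each 512-token window

input_id_chunks = input_ids.split(chunk_size)
mask_chunks = attention_mask.split(chunk_size)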

Now we can iterate over each chunk, pass it through our model, and generate its embeddings.

Here torch.tensor([101]) and torch.tensor([102]) represent the [CLS] and [SEP] tokens. After adding these tokens to each input chunk, we also need to check for padding; it is mostly needed in the last iteration, where the input length is less than 510.

In our case, padding is needed only in the last iteration. Below, we prepare our input IDs and attention-mask IDs before passing them into the model.

We convert them into long and int tensors before passing them into the model, as sketched below.
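A sketch of the per-chunk loop described in the last few paragraphs: add [CLS]/[SEP] (ids 101 and 102), pad the final chunk up to 512, convert to long/int tensors, and run the model (variable names are mine):

chunk_hidden_states = []

with torch.no_grad():
    for ids, mask in zip(input_id_chunks, mask_chunks):
        # prepend [CLS] (id 101) and append [SEP] (id 102)
        ids = torch.cat([torch.tensor([101]), ids, torch.tensor([102])])
        mask = torch.cat([torch.tensor([1]), mask, torch.tensor([1])])

        # pad only when the chunk is shorter than 512 (i.e. the last chunk)
        pad_len = 512 - ids.shape[0]
        if pad_len > 0:
            ids = torch.cat([ids, torch.zeros(pad_len, dtype=torch.long)])
            mask = torch.cat([mask, torch.zeros(pad_len, dtype=torch.long)])

        # long / int tensors with a batch dimension of 1
        input_dict = {
            "input_ids": ids.long().unsqueeze(0),
            "attention_mask": mask.int().unsqueeze(0),
        }
        outputs = model(**input_dict)
        # 13 tensors of shape [1, 512, 768] (outputs[2] in older transformers versions)
        chunk_hidden_states.append(outputs.hidden_states)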

The model returns the output as well as the hidden states (in older versions of transformers, as a tuple). In total, BERT returns 13 layers of hidden states (including the initial input-embedding layer). As per the research, the later layers contain the most information about the context of the corpus. That’s why we take only the last few hidden layers, here the last 4, to create the embedding for our corpus.

Below is the code to take only those last 4 layers of hidden states and average them.

hidden-state shape = [number of sentences (batch size), total tokens, number of hidden layers, number of features]

Taking the mean over the layers from index 9 onward (the last 4):

token_embedding[:, :, 9:, :] → gives the last 4 hidden layers (indices 9 to 12) for better context.
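A minimal sketch of this layer-averaging step for a single chunk:

hidden_states = chunk_hidden_states[0]                  # tuple of 13 tensors, each [1, 512, 768]
token_embedding = torch.stack(hidden_states, dim=0)     # [13, 1, 512, 768]
token_embedding = token_embedding.permute(1, 2, 0, 3)   # [1, 512, 13, 768]

last_four_layers = token_embedding[:, :, 9:, :]         # [1, 512, 4, 768] -- layers 9..12
chunk_embedding = last_four_layers.mean(dim=2)          # [1, 512, 768] -- one vector per token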

This is not yet the final embedding for our document, because we split the document into chunks of 510 tokens. So we iterate over each chunk and average its last 4 layers of embeddings, and finally we take the average over all the token embeddings as well, to get the final embedding of the document (see the sketch below).
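Putting it together, a sketch that averages the last 4 layers for every chunk and then averages over all tokens (for simplicity, padding tokens are included in the average):

def document_embedding(chunk_hidden_states):
    """Average the last 4 layers per chunk, then average over all tokens."""
    chunk_vectors = []
    for hidden_states in chunk_hidden_states:
        token_embedding = torch.stack(hidden_states, dim=0).permute(1, 2, 0, 3)
        # [512, 768] per chunk
        chunk_vectors.append(token_embedding[:, :, 9:, :].mean(dim=2).squeeze(0))
    all_tokens = torch.cat(chunk_vectors, dim=0)   # [total_tokens, 768]
    return all_tokens.mean(dim=0)                  # [768] -- the document embedding

doc_embedding = document_embedding(chunk_hidden_states)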

Once we have the final embedding of the document, we can easily find its similarity to other documents with the help of cosine similarity.

What is Cosine Similarity?

Cosine similarity is a metric used to measure how similar the documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. The cosine similarity is advantageous because even if the two similar documents are far apart by the Euclidean distance (due to the size of the document), chances are they may still be oriented closer together. The smaller the angle, the higher the cosine similarity.
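A direct implementation from the definition looks like this:

import torch

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||); values close to 1 mean a small angle
    return torch.dot(a, b) / (a.norm() * b.norm())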

Or we could use the built-in function from PyTorch:

cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

similarity = cos(sent1_embedding, sent2_embedding)

Thanks for Reading !!
