Bidirectional Encoder Representations from Transformers

  • While BERT is an older model, it’s still used as a baseline in NLP experiments
  • model released in 2018 by Google researchers
  • https://huggingface.co/google-bert
  • The easy way to understand BERT
    • Take only the encoder from the Transformer architecture and stack the encoder layers
    • It’s bidirectional (it attends to the whole sequence at once, rather than reading left-to-right like an RNN)
  • It is pretrained on the following tasks (short sketches of both follow after this list)
    • Masked language model (MLM)
      • provide input where some tokens are replaced with a [MASK] token
      • we ask BERT to fill in the blanks
    • Next sentence prediction (NSP)
      • BERT is provided 2 sentences A and B
      • BERT has to predict if B would follow A
  • Need to fine-tune it further for specific tasks
    • Named Entity Recognition (NER)
    • Question Answering (QA)
    • Sentence pair tasks
    • summarization
    • feature extraction/embeddings
    • and more…
  • BERT Variants
    • RoBERTa
    • DistilBERT
    • ALBERT
    • ELECTRA
    • DeBERTa
  • Multiple sizes
    • BERT Large 340M parameters
    • BERT Base 110M parameters
    • BERT Tiny 4M parameters
    • and ~24 other model sizes
  • Today, people work with BERT for
    • classification (labels) and sentiment analysis
    • analyzing text to reach a conclusion, rather than generating new text
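  • A minimal sketch of the MLM pretraining objective in action, using the HuggingFace “fill-mask” pipeline (the example sentence is an illustrative assumption):
from transformers import pipeline
 
# bert-base-uncased ships with its pretrained MLM head, so no fine-tuning is needed for this demo
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
 
# BERT predicts the most likely tokens for the [MASK] position
for prediction in fill_mask("The capital of France is [MASK]."):
    print(f"{prediction['token_str']}: {prediction['score']:.4f}")
 
  • A similar sketch of NSP, using BertForNextSentencePrediction (the two sentences are illustrative assumptions):
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction
 
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
 
# Sentence A and sentence B are encoded together, separated by [SEP]
encoding = tokenizer("I went to the store.", "I bought some milk.", return_tensors="pt")
 
with torch.no_grad():
    logits = nsp_model(**encoding).logits
 
# Index 0 = "B follows A", index 1 = "B is a random sentence"
probs = torch.softmax(logits, dim=-1)
print(f"P(B follows A) = {probs[0, 0].item():.4f}")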

Example of using BERT

  • sentiment analysis via the HuggingFace “sentiment-analysis” pipeline
from transformers import pipeline
 
# Creating a sentiment analysis pipeline from a pretrained BERT model
# Note: bert-base-uncased is only pretrained (MLM + NSP); the pipeline adds an
# untrained classification head, so predictions are not meaningful until the
# model is fine-tuned on a sentiment dataset (e.g. SST-2)
classifier = pipeline(
    "sentiment-analysis",
    model="bert-base-uncased",
    tokenizer="bert-base-uncased"
)
 
# Test sentences
sentences = [
	"I love using BERT for NLP tasks!",
	"I am not a fan of ice cream."
]
 
results = classifier(sentences)
 
for sentence, result in zip(sentences, results):
	print(f"sentence: {sentence}")
	print(f"prediction: {result['label']} | Score: {result['score']:.4f}")
	print()
 
  • With model="bert-base-uncased", HuggingFace downloads the pretrained BERT Base Uncased checkpoint and attaches a new, untrained classification head, so the pipeline warns that some weights are newly initialized and the scores are not yet meaningful
  • BERT needs to be fine-tuned for specific tasks beyond pretraining (see the sketch below)
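  • A minimal sketch of the starting point for such fine-tuning, using BertForSequenceClassification (the label and the single training example are illustrative assumptions; a real run would iterate over a labeled dataset such as SST-2):
import torch
from transformers import BertTokenizer, BertForSequenceClassification
 
# Load the pretrained encoder plus a fresh 2-label classification head
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
 
# One illustrative training step: tokenize a labeled example and compute the loss
inputs = tokenizer("I love using BERT for NLP tasks!", return_tensors="pt")
labels = torch.tensor([1])  # hypothetical label: 1 = positive
 
outputs = model(**inputs, labels=labels)
print("loss:", outputs.loss.item())  # this loss is what a fine-tuning loop would minimize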

BERT Lab

pip install torch transformers
 
import torch
from transformers import BertTokenizer, BertModel
  • BertTokenizer:
    • It converts raw text into input tokens that the BERT model can understand.
    • It also handles padding, truncation, and conversion into input IDs and attention masks.
  • BertModel:
    • It is the core BERT model that provides the hidden states (embeddings) for each token in the input text.
    • It outputs:
      • last_hidden_state: The embeddings for all tokens.
      • pooler_output: The [CLS] token’s hidden state passed through an extra linear + tanh layer, often used as a single vector representing the entire sentence.
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)
 
# converting the text into input IDs and attention masks
text = "I love AI!"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
input_ids = inputs["input_ids"] 
attention_mask = inputs["attention_mask"]
  • The tokenizer might break the text “I love AI!” into ['[CLS]', 'i', 'love', 'ai', '!', '[SEP]']
    • [CLS]: Special token added at the start. BERT uses this to understand the whole sentence.
    • [SEP]: Special token added at the end. Useful when comparing two sentences.
  • input_ids: Input IDs are the numbers that represent the tokens. BERT has a vocabulary where every word or subword has a unique number.
    • Ex. the text is converted to [101, 1045, 2293, 9937, 999, 102]
  • attention_mask: Tells BERT which tokens are real words and which are just padding (extra zeros that make all inputs in a batch the same length).
    • input_ids = [101, 1045, 2293, 9937, 999, 102, 0, 0]
    • attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]
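  • A quick way to inspect this mapping yourself (continuing the code above; the second, longer sentence is an illustrative assumption added just to force padding):
# Tokenize two sentences of different lengths so padding kicks in
batch = tokenizer(
    ["I love AI!", "I love AI and natural language processing!"],
    return_tensors="pt", padding=True, truncation=True
)
 
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(tokenizer.convert_ids_to_tokens(ids.tolist()))  # tokens, including [CLS], [SEP], [PAD]
    print(ids.tolist())   # the corresponding input IDs
    print(mask.tolist())  # 1 = real token, 0 = padding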
# getting BERT embeddings
with torch.no_grad():
  outputs = model(input_ids, attention_mask=attention_mask)
 
# the last hidden state (embeddings for each token)
last_hidden_state = outputs.last_hidden_state
 
# pooled output
pooled_output = outputs.pooler_output
 
print("Last Hidden State Shape:", last_hidden_state.shape)
print("Pooled Output Shape:", pooled_output.shape)
  • last_hidden_state
    • This is the output of BERT for every token in the input.
      • A tensor of shape [batch_size, sequence_length, hidden_size], holding one embedding per token.
      • Each token gets a 768-dimensional vector (for bert-base-uncased).
  • pooled_output
    • The [CLS] token’s hidden state passed through an extra linear + tanh layer, often used as a single vector representing the entire sentence.
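  • A short follow-up sketch (continuing the lab code above) showing two common ways to turn these outputs into a single sentence vector; the mean-pooling option is an extra assumption, not something the lab requires:
# Option 1: the [CLS] token's embedding is the first position of last_hidden_state
cls_embedding = last_hidden_state[:, 0, :]  # shape: [batch_size, 768]
 
# Option 2: mean-pool the token embeddings, ignoring padding via the attention mask
mask = attention_mask.unsqueeze(-1)  # shape: [batch_size, seq_len, 1]
mean_embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
 
print("CLS embedding shape:", cls_embedding.shape)          # torch.Size([1, 768])
print("Mean-pooled embedding shape:", mean_embedding.shape)  # torch.Size([1, 768])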