- Subword-based tokenization lies between word-based and character-based tokenization
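
A quick illustration of the three granularities on a single word. The subword split shown is hypothetical; the actual pieces depend on the trained vocabulary:

```python
word = "tokenization"

# Word-based: the whole word is one token (large vocabulary, many out-of-vocabulary words)
word_tokens = [word]                     # ['tokenization']

# Character-based: every character is a token (tiny vocabulary, very long sequences)
char_tokens = list(word)                 # ['t', 'o', 'k', 'e', 'n', ...]

# Subword-based: a frequent stem stays whole, the rarer suffix is split off
# (hypothetical split; '##' marks a word-internal piece, WordPiece convention)
subword_tokens = ["token", "##ization"]

print(word_tokens)
print(char_tokens)
print(subword_tokens)
```
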
- The algorithm rests on two guiding principles (illustrated in the sketch after this list)
- Frequently used words should not be split into smaller subwords
- Rare words should be decomposed into meaningful subwords
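
The sketch below shows both principles with a minimal greedy longest-match tokenizer (WordPiece-style matching). The vocabulary here is hypothetical and hand-written; real vocabularies are learned from corpus statistics by algorithms such as BPE, WordPiece, or Unigram:

```python
# Hypothetical, already-trained vocabulary; '##' marks word-internal pieces
VOCAB = {"the", "token", "##ization", "play", "##ing", "[UNK]"}

def tokenize(word: str) -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Find the longest vocabulary piece that matches at `start`
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # word-internal pieces carry the '##' prefix
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        if end == start:              # no piece matched: unknown word
            return ["[UNK]"]
        start = end
    return pieces

print(tokenize("the"))           # ['the']               frequent word kept whole
print(tokenize("tokenization"))  # ['token', '##ization'] rare word decomposed
print(tokenize("playing"))       # ['play', '##ing']
```
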
- Several algorithms implement subword-based tokenization (compared in the example after this list)
- Byte Pair Encoding (BPE): Used in models like GPT.
- WordPiece: Used in BERT.
- Unigram: Used in models like T5 (via SentencePiece).
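
One way to see the three algorithms side by side is to load the pretrained tokenizers of GPT-2, BERT, and T5 and run them on the same sentence. This assumes the Hugging Face `transformers` library is installed and the checkpoints can be downloaded:

```python
from transformers import AutoTokenizer

# Pretrained checkpoints whose tokenizers use each algorithm
# (assumes `pip install transformers` and network access on first run)
checkpoints = {
    "BPE (GPT-2)":      "gpt2",
    "WordPiece (BERT)": "bert-base-uncased",
    "Unigram (T5)":     "t5-small",
}

text = "Tokenization handles rare words gracefully."

for name, ckpt in checkpoints.items():
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    print(f"{name:17} -> {tokenizer.tokenize(text)}")
```

Note that each tokenizer marks word boundaries differently: GPT-2's byte-level BPE prefixes pieces that start a word with `Ġ`, WordPiece marks word-internal pieces with `##`, and T5's SentencePiece marks word starts with `▁`.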