• Splits text into individual characters (not words).
    • While the number of possible words is effectively unbounded, the set of characters is small and finite. (An advantage over word-based tokenization)
  • Out-of-vocabulary tokens are far less frequent
  • Libraries:
    • TensorFlow and PyTorch: Can implement custom character-level tokenizers.
    • Hugging Face: Supports character-level tokenization through specific models.
  • Limitations
    • A single character does not hold as much information as a word does (at least in English)
    • The resulting token sequences are very long, making them expensive for the model to process
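The points above can be sketched with a minimal character-level tokenizer. This is an illustrative example, not the API of any specific library: the vocabulary is just the set of distinct characters seen in a corpus, and encoding maps each character to an integer ID.

```python
class CharTokenizer:
    """Minimal character-level tokenizer sketch (hypothetical example)."""

    def __init__(self, corpus):
        # The vocabulary is the set of distinct characters: small and
        # finite, unlike a word-level vocabulary.
        chars = sorted(set(corpus))
        self.char_to_id = {c: i for i, c in enumerate(chars)}
        self.id_to_char = {i: c for c, i in self.char_to_id.items()}

    def encode(self, text):
        # Unknown characters are skipped here; a real tokenizer would
        # map them to a dedicated <unk> ID instead.
        return [self.char_to_id[c] for c in text if c in self.char_to_id]

    def decode(self, ids):
        return "".join(self.id_to_char[i] for i in ids)


tok = CharTokenizer("hello world")
ids = tok.encode("hello")
print(ids)               # one integer ID per character
print(tok.decode(ids))   # -> "hello"
```

Note that `encode` produces one token per character, so a sentence of a few words already yields dozens of tokens, which illustrates the long-sequence limitation.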