- Splitting raw text into words by spaces and/or punctuation
- Each word has a number of ID
- The number is higher if the word contains lots of contextual and semantic information in a sentence
- Limitations
- Very similar words have entirely different meanings (entirely different IDs)
- Too many words in English dictionary (leading to heavy model). We can ignore certain words we don’t really need, like taking only 10000 most frequent words in the text.
- Any other word will be “unknown”