Tokenization
Tokenization is the process of breaking text down into simpler units such as words, numbers, and punctuation.
For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, which are frequently whitespace characters such as spaces, tabs, and newlines.
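A minimal sketch of whitespace-based tokenization using Java's String.split (the sample sentence is illustrative):

    public class SimpleTokenizer {
        public static void main(String[] args) {
            String text = "Let's pause, and then reflect.";
            // Split on runs of whitespace; note that punctuation
            // stays attached to the neighboring word.
            String[] tokens = text.split("\\s+");
            for (String token : tokens) {
                System.out.println(token);
            }
        }
    }

A purely whitespace-based split leaves punctuation attached to tokens, which is one reason real tokenizers handle the complicating factors listed below.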
The tokenization process is complicated by a large number of factors such as:
- Language
- Text format – plain text, HTML, and other markup
- Stopwords
- Text expansion – acronyms and abbreviations
- Case – upper or lower
- Stemming/lemmatization
Uses of tokenizers
- Spell checking
- Processing simple searches
- Downstream NLP tasks such as part-of-speech (POS) tagging, sentence detection, and classification
Specifying the delimiter
These methods of java.util.Scanner control how tokens are extracted:
- useLocale – sets the locale, which affects how numbers are interpreted
- useDelimiter – sets the delimiter based on a String or a Pattern
- useRadix – sets the radix used when scanning numbers
- skip – skips input matching a pattern, ignoring delimiters
- findInLine – finds the next occurrence of a pattern within the current line, ignoring delimiters
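A short sketch of delimiter control: useDelimiter accepts a regular-expression pattern, here matching commas with optional surrounding spaces, or runs of whitespace (the input string is illustrative):

    import java.util.Scanner;

    public class DelimiterDemo {
        public static void main(String[] args) {
            Scanner scanner = new Scanner("apple, banana,cherry  date");
            // Replace the default whitespace delimiter with a pattern
            // matching commas and/or whitespace.
            scanner.useDelimiter("\\s*,\\s*|\\s+");
            while (scanner.hasNext()) {
                System.out.println(scanner.next());
            }
            scanner.close();
        }
    }

This prints apple, banana, cherry, and date, each on its own line.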
Understanding normalization
Normalization is a process that converts a list of words to a more uniform sequence. This is useful in preparing text for later processing.
e.g., converting text to lowercase with toLowerCase facilitates the searching process
Operations include:
- Changing characters to lowercase
- Expanding abbreviations
- Removing stopwords
- Stemming and lemmatization
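A minimal sketch combining two of these operations, lowercasing and stopword removal; the stopword set here is a tiny illustrative one (real systems use much larger lists):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class NormalizeDemo {
        // Illustrative stopword list, not a production one.
        private static final Set<String> STOPWORDS =
                Set.of("the", "a", "an", "of", "to", "over");

        public static void main(String[] args) {
            String text = "The Quick Brown Fox jumped over the lazy dog";
            List<String> tokens = Arrays.stream(text.split("\\s+"))
                    .map(String::toLowerCase)            // case normalization
                    .filter(t -> !STOPWORDS.contains(t)) // stopword removal
                    .collect(Collectors.toList());
            System.out.println(tokens); // [quick, brown, fox, jumped, lazy, dog]
        }
    }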
Stanford CoreNLP
Common annotators in the Stanford CoreNLP pipeline:
- tokenize – tokenization
- ssplit – sentence splitting
- pos – part-of-speech tagging
- lemma – lemmatization
- ner – named entity recognition (NER)
- parse – syntactic parsing
- dcoref – coreference resolution
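A sketch of running a pipeline with a subset of these annotators, assuming the stanford-corenlp jar and its models are on the classpath (the sample sentence is illustrative):

    import java.util.Properties;
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    public class CoreNlpDemo {
        public static void main(String[] args) {
            // The annotators property lists the stages to run, in order.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            Annotation document = new Annotation("The cats sat on the mats.");
            pipeline.annotate(document);

            // Walk the sentences and tokens produced by the pipeline.
            for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    System.out.printf("%s/%s/%s%n",
                            token.get(CoreAnnotations.TextAnnotation.class),
                            token.get(CoreAnnotations.PartOfSpeechAnnotation.class),
                            token.get(CoreAnnotations.LemmaAnnotation.class));
                }
            }
        }
    }

Each annotator depends on the ones before it; for example, pos requires tokenize and ssplit, and dcoref requires the full chain through parse.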