Chapter 3:
This chapter covers the skills needed to process raw text.
Some important points:
1. Accessing text from the web and from disk: APIs such as urlopen(), open(), read(), and write(), plus some string operations. Also some tools for processing HTML text.
2. Text processing with Unicode: file/terminal (specific encoding) -> in-memory Python program (Unicode) -> file/terminal (specific encoding). Text is decoded to Unicode on input and encoded again on output.
3. Regular expressions: re.search(), re.findall(), re.sub() (replace), re.split(), and so on (remember to prefix the pattern with the r character so it is treated as a raw string).
Another API in NLTK is nltk.regexp_tokenize(), which is similar to re.findall().
Regular expressions are useful for finding word stems and searching tokenized text.
4. Normalizing text and segmentation: stemmers, lemmatization, sentence segmentation, word segmentation.
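
A minimal sketch of points 2 and 3 together, using only the standard library: bytes are decoded to Unicode, processed with regular expressions, then encoded again. The sample string, the choice of Latin-2 and UTF-8, and the stem() helper are assumptions for illustration and are not taken from the chapter itself.

```python
import re

# Bytes as they might arrive from a file or terminal (here: Latin-2 encoded).
raw_bytes = "Pruska  Biblioteka\tPaństwowa".encode("latin2")

# Decode: specific encoding -> Unicode str held in memory.
text = raw_bytes.decode("latin2")

# Process in memory with regular expressions (note the r prefix on patterns).
tokens = re.findall(r"\w+", text)      # all runs of word characters
pieces = re.split(r"\s+", text)        # split on any whitespace
cleaned = re.sub(r"\s+", " ", text)    # collapse whitespace to one space


def stem(word):
    """Hypothetical helper: strip one common English suffix, if present."""
    match = re.search(r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$", word)
    return match.group(1)


# Encode: Unicode str -> specific encoding for output.
out_bytes = cleaned.encode("utf-8")
```

The non-greedy (.*?) in stem() keeps the stem as short as possible, so the suffix group gets a chance to match; a word with no listed suffix comes back unchanged.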