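This article covers English text only. Everything described here uses Python 3.7
and the NLTK library to walk through the text classification process.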
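Text tokenization
Tokenization breaks text into word-level units. English sentences are built from words separated by spaces, so tokenizing English is relatively simple. Several approaches are described below.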
Tokenizing with regular expressions
1. Splitting on spaces
import re
text = 'I was just a kid, and loved it very much! What a fantastic song!'
# Split on every single space; punctuation stays attached to the words (e.g. 'kid,')
print(re.split(r' ', text))
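Splitting on a single space has one pitfall worth seeing in isolation: consecutive spaces produce empty strings in the result. The sketch below contrasts it with the built-in str.split(), which splits on runs of whitespace. The helper names and the sample sentence are illustrative assumptions, not part of the original walkthrough.

```python
import re


def split_on_single_space(text):
    """Split on each individual space character (the approach shown above)."""
    return re.split(r' ', text)


def split_on_whitespace(text):
    """Split on runs of whitespace; str.split() with no arguments drops empty strings."""
    return text.split()


# Hypothetical sample with two spaces after "What"
sample = 'What  a fantastic song!'
print(split_on_single_space(sample))  # contains an empty string between 'What' and 'a'
print(split_on_whitespace(sample))    # no empty strings
```

In both cases the trailing '!' remains attached to 'song', which is why the punctuation-aware patterns in the next step are usually preferable.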
2. Tokenizing by matching symbols with re
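# Split on runs of non-word characters: punctuation is discarded, and a trailing
# empty string appears because the text ends with '!'
print(re.split(r'\W+', text))
# Keep punctuation: match a run of word characters, or a non-space character
# followed by any word characters
print(re.findall(r'\w+|\S\w*', text))
# Additionally keep words with internal hyphens or apostrophes as single tokens
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", text))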
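The sample sentence above contains no hyphens or apostrophes, so the last two patterns give the same tokens for it. The following sketch applies the same three patterns to a sentence where they differ. The function names and the sample sentence are assumptions made for illustration; filtering out empty strings in the first helper is also an addition, not something the original code does.

```python
import re


def tokenize_split(text):
    """Split on non-word runs and drop empty strings (the filtering is an added step)."""
    return [tok for tok in re.split(r'\W+', text) if tok]


def tokenize_simple(text):
    """Words, or a non-space character followed by word characters."""
    return re.findall(r'\w+|\S\w*', text)


def tokenize_extended(text):
    """As above, but words joined by '-' or "'" stay together as one token."""
    return re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", text)


# Hypothetical sample containing an apostrophe and a hyphen
sample = "It's a well-known song!"
print(tokenize_split(sample))     # ['It', 's', 'a', 'well', 'known', 'song']
print(tokenize_simple(sample))    # ['It', "'s", 'a', 'well', '-known', 'song', '!']
print(tokenize_extended(sample))  # ["It's", 'a', 'well-known', 'song', '!']
```

The third pattern is the only one that keeps "It's" and "well-known" intact, which usually matters for downstream features such as word counts.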
3. NLTK regular-expression tokenizer
import re
import nltk
text = 'I was just a kid, and loved it very much! What a fantastic song!'
pattern = r"""(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
|