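1. Download the web page: html = urlopen(url).read()
2. Strip the HTML: raw = nltk.clean_html(html) (NLTK 2 only; in NLTK 3 this function raises NotImplementedError and points to BeautifulSoup's get_text() instead)
3. Trim to the content of interest: raw = raw[750:3425]
4. Tokenize the text: tokens = nltk.wordpunct_tokenize(raw)
5. Select the tokens of interest: tokens = tokens[20:500]
6. Create an NLTK text: text = nltk.Text(tokens)
7. Normalize the words and build the vocabulary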
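
The pipeline can be sketched end to end as follows. This is a minimal illustration, not the original code: it swaps in the standard library's html.parser for nltk.clean_html (step 2) and uses the regular expression \w+|[^\w\s]+, the pattern behind nltk.wordpunct_tokenize, for step 4, so it runs without NLTK. The helper names (download, strip_html, tokenize, build_vocab), the sample HTML, the UTF-8 decoding, and the choice to keep only alphabetic tokens during normalization are all assumptions. The slices in steps 3 and 5 depend on the specific page and are left out, as is the nltk.Text wrapper from step 6.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def download(url):
    # Step 1: fetch the page. Decoding as UTF-8 is an assumption.
    return urlopen(url).read().decode("utf-8", errors="replace")


def strip_html(html):
    # Step 2: stand-in for nltk.clean_html, using only the standard library.
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(raw):
    # Step 4: runs of word characters, or runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", raw)


def build_vocab(tokens):
    # Step 7: lowercase, drop non-alphabetic tokens (an assumption),
    # then deduplicate and sort.
    words = [t.lower() for t in tokens if t.isalpha()]
    return sorted(set(words))


# Hypothetical sample page, used instead of a live download.
sample_html = (
    "<html><body><h1>Hello, World!</h1><p>Hello again.</p>"
    "<script>var x = 1;</script></body></html>"
)
raw = strip_html(sample_html)
tokens = tokenize(raw)
vocab = build_vocab(tokens)
print(tokens)  # ['Hello', ',', 'World', '!', 'Hello', 'again', '.']
print(vocab)   # ['again', 'hello', 'world']
```

For a real page, call download(url) first and pass the result to strip_html; the slice offsets for steps 3 and 5 then have to be found by inspecting the output by hand.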