Web Text Processing and Tokenization - CSDN Blog

Article link: https://blog.csdn.net/csdn_lzw/article/details/80411406

Accessing Text from the Web

1. E-books

import nltk
from urllib.request import urlopen

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
raw = urlopen(url).read()  # read() returns the response body as bytes
print(type(raw))
print(len(raw))
print(raw[:75])

Output

<class 'bytes'>
1201733
b'\xef\xbb\xbfThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsk'

In Python 3, the variable raw is a bytes object. It has to be decoded into a str before it can be passed to the tokenizer function nltk.word_tokenize().

c = str(raw, encoding='utf-8')  # decode the bytes into a str
print(type(c))

tokens = nltk.word_tokenize(c)  # split the string into tokens
print(type(tokens))
print(len(tokens))
print(tokens[:10])

Output

<class 'str'>
<class 'list'>
257726
['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
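
The first token in the output above is '\ufeffThe' because the file begins with a UTF-8 byte order mark (the bytes \xef\xbb\xbf visible in the raw output earlier). Below is a minimal sketch of one way to drop it, assuming Python's standard utf-8-sig codec suits this case. The short sample bytes literal is a hypothetical stand-in for the downloaded raw, so the snippet runs without network access:

```python
# Hypothetical stand-in for the first bytes of the downloaded file.
sample = b'\xef\xbb\xbfThe Project Gutenberg EBook of Crime and Punishment'

with_bom = sample.decode('utf-8')         # keeps the BOM as '\ufeff'
without_bom = sample.decode('utf-8-sig')  # strips a leading BOM if present

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```

Applied to the real download, this would amount to decoding raw with 'utf-8-sig' instead of 'utf-8' before tokenizing, so the first token would no longer carry the stray character.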

Convert the list of tokens into an nltk.Text object, and you can then work with it just as in Chapter 1.

text = nltk.Text(tokens)
print(type(text))

Output

<class 'nltk.text.Text'>
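
As a rough illustration of what working with such a Text object can look like, the sketch below wraps a small hand-written token list (a hypothetical stand-in for the tokens list above) in nltk.Text and uses slicing, count(), and vocab(). It assumes NLTK is installed; these particular calls should not need any downloaded corpus or tokenizer data.

```python
import nltk

# Hypothetical token list standing in for the tokens produced above.
sample_tokens = ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime',
                 'and', 'Punishment', ',', 'by', 'Fyodor', 'Dostoevsky',
                 'Crime', 'and', 'Punishment']

sample_text = nltk.Text(sample_tokens)

first_three = sample_text[:3]             # slicing returns a plain list of tokens
crime_count = sample_text.count('Crime')  # number of occurrences of one token
freq = sample_text.vocab()                # frequency distribution over all tokens

print(len(sample_text))
print(first_three)
print(crime_count)
print(freq['and'])
```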

Reading Local Files

Here document.txt is in the same directory as the .py file.

f = open('document.txt')
raw = f.read()
f.close()          # release the file handle once the content is read
print(type(raw))   # <class 'str'>
print(raw)

If the file is on the desktop instead, pass the full path as a raw string:

f = open(r'C:\Users\Administrator.LYH-20170315DBK\Desktop\document.txt')
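
A minimal sketch of a slightly more defensive way to read a local file: a with block closes the file automatically, and passing encoding explicitly avoids relying on the platform default, which may not be UTF-8 on Windows. The temporary file and its contents are hypothetical, created only so the snippet is self-contained.

```python
import os
import tempfile

# Create a throwaway file standing in for document.txt (hypothetical content).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'document.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('Time flies like an arrow.\nFruit flies like a banana.\n')

# Read it back; the file is closed as soon as the with block exits.
with open(path, encoding='utf-8') as f:
    local_raw = f.read()

print(type(local_raw))
print(local_raw)
```

The resulting string can then be passed to nltk.word_tokenize() in the same way as the decoded web text.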
