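It looks like 2554.txt has been renamed to 2554-0.txt, so I changed the code accordingly.
The length is now: 1176965
Some extra encoding characters show up too (note the \ufeff byte order mark at the start of the first token):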
>>> len(tokens)
257726
>>> tokens[:10]
['\ufeffThe', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
>>> text[1024:1062]
['an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K.', 'bridge', '.', 'He', 'had', 'successfully']
>>> text.collocations()
Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Ilya Petrovitch; Project
Gutenberg; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens
>>> raw.find("PART I")
5336
>>> raw.rfind("End of Project Gutenberg's Crime")
-1
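Two things in the session above look worth handling: the \ufeff stuck to the first token, and rfind() returning -1, presumably because the footer wording differs in the renamed file. Below is a minimal sketch of how both could be dealt with. It is my own guess at a workaround, not something from the book: the sample bytes are made up for illustration, find_any is a hypothetical helper, and the "*** END OF" marker is an assumption that should be checked against the real file.

```python
import codecs


def decode_without_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading byte order mark if present."""
    # The "utf-8-sig" codec strips the BOM that otherwise shows up as '\ufeff'.
    return data.decode("utf-8-sig")


def find_any(raw: str, markers) -> int:
    """Hypothetical helper: rfind() for the first marker that occurs, else -1."""
    for marker in markers:
        pos = raw.rfind(marker)
        if pos != -1:
            return pos
    return -1


# Made-up stand-in for the downloaded file, only to exercise the functions.
sample = codecs.BOM_UTF8 + (
    b"The Project Gutenberg EBook\n"
    b"PART I\nbody\n"
    b"*** END OF THIS PROJECT GUTENBERG EBOOK ***\n"
)

raw = decode_without_bom(sample)
start = raw.find("PART I")
# Try the old footer first, then fall back to an assumed alternative marker.
end = find_any(raw, ["End of Project Gutenberg's Crime", "*** END OF"])
body = raw[start:end] if start != -1 and end != -1 else raw
```

Checking for -1 before slicing matters here: raw[start:-1] would silently keep the whole footer and drop only the last character.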
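While dealing with web page encodings I ran into an encoding that was new to me, ISO-8859-1. I found it through this comment in the default code: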
# http.client.parse_headers() decodes as ISO-8859-1. Recover the
# original bytes and percent-encode non-ASCII bytes, and any special
# characters such as the space.
According to the encyclopedia entry, this encoding should generally be enough to cope with ordinary HTML.
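To convince myself what "recover the original bytes" in that comment means, here is a small sketch. It is not the library's actual code, and the sample path is made up. It relies on the fact that ISO-8859-1 maps every byte 0-255 to exactly one character, so decoding and re-encoding with it loses nothing.

```python
from urllib.parse import quote

# Made-up example: a path whose real encoding is UTF-8.
original = "/café menu".encode("utf-8")

# Decoding as ISO-8859-1 never fails, but non-ASCII text comes out garbled.
misdecoded = original.decode("iso-8859-1")

# Encoding back with the same codec recovers the original bytes exactly.
recovered = misdecoded.encode("iso-8859-1")

# Percent-encode the non-ASCII bytes and the space, keeping "/" as is.
quoted = quote(recovered, safe="/")
print(quoted)
```

Once the bytes are recovered they can also be decoded again with the correct codec, here recovered.decode("utf-8").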
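This chapter looks like it has a lot of content, but most of it is really Python 3 material, and not much new NLTK knowledge shows up. Encoding still deserves extra attention, since a new pitfall can turn up at any time.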