Opening a txt file in Python 3: reading sentences from a text file and adding them to a list

Latest recommended article published 2020-11-24 12:09:38

weixin_39798943


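This is a fairly hard problem, and there is no simple answer. You could try writing a regular expression that captures every known case, but complex regular expressions tend to be hard to maintain and debug. Several existing libraries can help with this. The most notable is the Natural Language Toolkit (NLTK), which ships with a number of built-in tokenizers. You can install it with pip:

pip install nltk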
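As an aside on the regular-expression route mentioned above, here is a minimal sketch of why it gets brittle. The pattern and the helper name `naive_split` are illustrative assumptions, not something from the original question: the pattern simply treats any `.`, `!` or `?` followed by whitespace as a sentence boundary.

```python
import re

# Assumption for illustration: a sentence ends at '.', '!' or '?'
# when that character is followed by whitespace.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+')


def naive_split(text):
    """Split text at end punctuation followed by whitespace."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


# Works on simple input.
print(naive_split("It works. Does it? Yes!"))

# Breaks on abbreviations: "Dr." is wrongly treated as a full sentence.
print(naive_split("Dr. Smith arrived. He sat down."))
```

Every such exception (abbreviations, initials, decimals, quoted periods) needs its own special case in the pattern, which is where the maintenance cost comes from.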

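After that, getting your sentences is fairly straightforward (though highly customizable). Here is a simple example that uses the sentence tokenizer NLTK provides:

import nltk

# sent_tokenize needs the Punkt data; if it is missing, run
# nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab').
with open('text.txt', 'r') as in_file:
    text = in_file.read()

# Split the whole text into a list of sentence strings.
sents = nltk.sent_tokenize(text)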

I'm not entirely clear on how your sentences are delimited, if not by ordinary punctuation, but running the code above on your text I get:

[

"I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list.",

"Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.'",

"within a sentence, so I couldn't just cutoff searching through a sentence at a period.",

"I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.\n\n"

]

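But it fails on inputs like:

["This is a sentence", "a period in the middle"]

while passing on inputs such as:

["This is a sentence with a period in the middle"]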

Still, I'm not sure you will do much better than what you get out of the box. From the nltk code:

A sentence tokenizer which uses an unsupervised algorithm to build

a model for abbreviation words, collocations, and words that start

sentences; and then uses that model to find sentence boundaries.

This approach has been shown to work well for many European

languages.

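So the nltk solution actually uses machine learning to build a model of sentences. That is much better than a regular expression, but still not perfect. Damn natural languages. >:(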
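Putting the pieces together for the original task (each sentence of a text file as one list element), a minimal sketch might look like the following. The helper names `split_sentences` and `read_sentences`, the UTF-8 default, and the regex fallback are assumptions added for illustration. The fallback only exists so the sketch still runs when NLTK or its Punkt data is not installed, and it is noticeably less accurate than the tokenizer.

```python
import re


def split_sentences(text):
    """Split text into sentences, preferring NLTK's Punkt tokenizer."""
    try:
        import nltk
        return nltk.sent_tokenize(text)
    except (ImportError, LookupError):
        # Assumption: NLTK or its Punkt data is unavailable, so fall back
        # to a naive split after '.', '!' or '?' followed by whitespace.
        return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]


def read_sentences(path, encoding='utf-8'):
    """Read a text file and return its sentences as a list of strings."""
    with open(path, 'r', encoding=encoding) as in_file:
        # Collapse line breaks and repeated spaces so that hard-wrapped
        # lines do not leave stray newlines inside or after a sentence.
        text = " ".join(in_file.read().split())
    return split_sentences(text)


if __name__ == "__main__":
    # 'text.txt' is the same placeholder file name used earlier.
    for sentence in read_sentences('text.txt'):
        print(repr(sentence))
```

Normalizing whitespace before tokenizing is a design choice: it avoids list elements that end in "\n\n", at the cost of discarding paragraph breaks.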

Hope this helps :)