Opening a txt file in Python 3: reading sentences from a text file and adding them to a list

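This is a fairly hard problem, and there is no easy answer. You could try writing a regular expression that captures all the known cases, but complex regular expressions tend to be hard to maintain and debug. A number of existing libraries can help with this. Most notable is The Natural Language Toolkit (NLTK), which has many tokenizers built in. You can install it with pip:

pip install nltk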

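After that, getting your sentences is fairly straightforward (though highly customizable). Here is a simple example using the provided sentence tokenizer:

import nltk

# sent_tokenize needs the Punkt models; download them once beforehand
# with nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab').
with open('text.txt', 'r') as in_file:
    text = in_file.read()

# Split the text into a list of sentence strings.
sents = nltk.sent_tokenize(text)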
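
If you want the read-then-split step as a reusable function, something along these lines might work. This is only a sketch: `read_sentences` is a name I made up, the tokenizer is passed in as any callable that turns a string into a list of sentences (for example `nltk.sent_tokenize`), and collapsing line breaks before tokenizing is my own assumption about how wrapped lines should be handled.

```python
from pathlib import Path
from typing import Callable, List


def read_sentences(path: str, tokenizer: Callable[[str], List[str]]) -> List[str]:
    """Read a whole text file and return its sentences as a list.

    `tokenizer` is any function mapping a string to a list of sentences.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Assumption: line breaks carry no meaning, so collapse all runs of
    # whitespace into single spaces to keep wrapped sentences whole.
    text = " ".join(text.split())
    return list(tokenizer(text))


# Intended use (requires nltk and its Punkt data to be installed):
#   import nltk
#   sents = read_sentences("text.txt", nltk.sent_tokenize)
```

Passing the tokenizer in keeps the file handling separate from the sentence splitting, so you can swap in a different tokenizer later without touching the rest.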

I'm not entirely clear on how your sentences are delimited, if not by normal punctuation, but running the code above on your text I get:

[

"I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list.",

"Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.'",

"within a sentence, so I couldn't just cutoff searching through a sentence at a period.",

"I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.\n\n"

]

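However, it fails on some input, splitting one sentence in two at a period in its middle:

["This is a sentence", "a period in the middle"]

while it gets other input right, for example:

["This is a sentence, with a period in the middle"]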

That said, I don't know whether you will do much better than what you get out of the box. From the nltk code:

A sentence tokenizer which uses an unsupervised algorithm to build

a model for abbreviation words, collocations, and words that start

sentences; and then uses that model to find sentence boundaries.

This approach has been shown to work well for many European

languages.

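So the nltk solution actually uses machine learning to build a model of sentences. That is much better than a regular expression, but still not perfect. Damn natural language. >:(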
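
To make the regular-expression comparison concrete, here is roughly what the hand-rolled route looks like. It is a sketch for illustration only: the function names and the tiny `ABBREVIATIONS` set are made up, and a real list of exceptions would have to be far longer, which is exactly the maintenance problem.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr", "mr", "mrs", "e.g", "i.e"}


def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_with_abbreviations(text):
    """Like naive_split, but re-join pieces that were cut after a known abbreviation."""
    sentences = []
    for piece in naive_split(text):
        if sentences and sentences[-1].endswith("."):
            last_word = sentences[-1].split()[-1].rstrip(".").lower()
            if last_word in ABBREVIATIONS:
                # The previous "sentence" ended in an abbreviation, so glue this piece on.
                sentences[-1] += " " + piece
                continue
        sentences.append(piece)
    return sentences


text = "Dr. Smith arrived. He sat down."
print(naive_split(text))               # ['Dr.', 'Smith arrived.', 'He sat down.']
print(split_with_abbreviations(text))  # ['Dr. Smith arrived.', 'He sat down.']
```

Every new exception (initials, decimals, quotations, ellipses) needs another special case, which is why a trained tokenizer is usually the more practical choice.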

Hope this helps :)
