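This is a pretty hard problem, and there isn't an easy answer. You could try to write a regular expression that captures all of the known cases, but complex regular expressions tend to be hard to maintain and debug. There are a number of existing libraries that can help you with this. Most notable is The Natural Language Toolkit (NLTK), which has many tokenizers built in. You can install it with pip:
pip install nltk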
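After that, getting your sentences is a fairly straightforward (although highly customizable) affair. Here's a simple example using the provided sentence tokenizer:
import nltk

# Read the whole file, then split it into a list of sentences.
# sent_tokenize needs NLTK's Punkt data, e.g. a one-time nltk.download('punkt')
# (newer NLTK releases ask for 'punkt_tab' instead).
with open('text.txt', 'r') as in_file:
    text = in_file.read()
sents = nltk.sent_tokenize(text)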
I'm not entirely clear how your sentences are delimited if not by normal punctuation, but running the above code on your text I get:
[
"I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list.",
"Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.'",
"within a sentence, so I couldn't just cutoff searching through a sentence at a period.",
"I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.\n\n"
]
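As the output shows, it fails on input where a period sits in the middle of a sentence, splitting it in two: ["This is a sentence", "with a period in the middle"]
while it passes on input such as: ["This is a sentence, with a period in the middle"]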
I don't know if you're going to get much better than that out of the box, though. From the nltk code:
A sentence tokenizer which uses an unsupervised algorithm to build
a model for abbreviation words, collocations, and words that start
sentences; and then uses that model to find sentence boundaries.
This approach has been shown to work well for many European
languages.
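So the nltk solution is actually using machine learning to build a model of a sentence. That is much better than a regular expression, but still not perfect. Damn natural languages. >:(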
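For comparison, here is a minimal sketch of the regular-expression approach mentioned at the top. The rule (end punctuation, then whitespace, then an uppercase letter) and the name naive_split are my own assumptions for illustration; this is not what NLTK does internally.

```python
import re

# Naive rule (an assumption for illustration, not NLTK's algorithm):
# a sentence ends at '.', '!' or '?' when followed by whitespace
# and then an uppercase letter.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def naive_split(text):
    """Split text into sentences using the naive regex rule above."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

# Plain punctuation is handled fine.
print(naive_split("It works here. Does it work there? Yes!"))
# ['It works here.', 'Does it work there?', 'Yes!']

# An abbreviation followed by a capitalized word breaks the rule.
print(naive_split("Dr. Smith arrived late. He apologized."))
# ['Dr.', 'Smith arrived late.', 'He apologized.']
```

Every abbreviation, initial, or quoted period needs another special case in the pattern, which is the maintenance problem described earlier.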
Hope this helps :)