spacy
import spacy
nlp = spacy.load('en')
sentences = nlp(u"That view is widespread in the custodial industry which has reacted with some alarm to the suggestion in the SIB paper.Instead, Mr Whitehill argues, the answer is to improve the drafting of sub-custodial agreements. Morgan Stanley asks its sub-contractors to guarantee them against wilful mismanagement, non-performance or negligence.In the event of such activities, Morgan Stanley will make its own clients whole and seek to recover from the sub-custodian. 'Increasingly, clients do ask for some protection,' he said. Morgan Stanley carries out six-monthly reviews of all its sub-custodial arrangements to reassure itself of the safety of its clients' money.")
for i, x in enumerate(sentences.sents):
print(i, x)
>>>
0 That view is widespread in the custodial industry which has reacted with some alarm to the suggestion in the SIB paper.
1 Instead, Mr Whitehill argues, the answer is to improve the drafting of sub-custodial agreements.
2 Morgan Stanley asks its sub-contractors to guarantee them against wilful mismanagement, non-performance or negligence.
3 In the event of such activities, Morgan Stanley will make its own clients whole and seek to recover from the sub-custodian. '
4 Increasingly, clients do ask for some protection,' he said.
5 Morgan Stanley carries out six-monthly reviews of all its sub-custodial arrangements to reassure itself of the safety of its clients' money.
但是用spacy的时候,运行速度很慢,而且如果文档很长的话,就会超出内容限制,然后报错。所以还是用nltk比较好
nltk
import nltk
content = "That view is widespread in the custodial industry which has reacted with some alarm to the suggestion in the SIB paper.Instead, Mr Whitehill argues, the answer is to improve the drafting of sub-custodial agreements. Morgan Stanley asks its sub-contractors to guarantee them against wilful mismanagement, non-performance or negligence.In the event of such activities, Morgan Stanley will make its own clients whole and seek to recover from the sub-custodian. 'Increasingly, clients do ask for some protection,' he said. Morgan Stanley carries out six-monthly reviews of all its sub-custodial arrangements to reassure itself of the safety of its clients' money."
sen_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sen_tokenizer.tokenize(content)
for i, x in enumerate(sentences.sents):
print(i, x)
>>>
0 That view is widespread in the custodial industry which has reacted with some alarm to the suggestion in the SIB paper.
1 Instead, Mr Whitehill argues, the answer is to improve the drafting of sub-custodial agreements.
2 Morgan Stanley asks its sub-contractors to guarantee them against wilful mismanagement, non-performance or negligence.
3 In the event of such activities, Morgan Stanley will make its own clients whole and seek to recover from the sub-custodian.
4 'Increasingly, clients do ask for some protection,' he said.
5 Morgan Stanley carries out six-monthly reviews of all its sub-custodial arrangements to reassure itself of the safety of its clients' money.
使用nltk要注意的一点
对于句子中的点号,如果点号后面没有换行符或者其他符号,而且直接跟了字母的话,会被认为是缩写,而不把它判为一个句子的结束符
像这样,e.g.
(表示举个例子
)就不会用来分割句子
import nltk
content = 'The evaluation noted that the employee had frequently exhibited irresponsible behavior (e.g., coming to work late, failing to complete projects). '
sen_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sen_tokenizer.tokenize(content)
for i, x in enumerate(sentences.sents):
print(i, x)
>>>
0 The evaluation noted that the employee had frequently exhibited irresponsible behavior (i.e., coming to work late, failing to complete projects).