nltk Chinese sentence splitting: how can NLTK's sentence segmentation be improved?

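This article shows how to train nltk's unsupervised PunktSentenceTokenizer on new text to improve English sentence segmentation. An example walks through the training process and shows that the trained model learns abbreviation types automatically.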

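The remarkable thing about the Kiss and Strunk (2006) Punkt algorithm is that it is unsupervised. So, given a new text, you should retrain the model on that text and then apply it, for example:

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters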

>>> text = "An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law. It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."

# Training a new model with the text.

>>> tokenizer = PunktSentenceTokenizer()

>>> tokenizer.train(text)

# It automatically learns the abbreviations.

>>> tokenizer._params.abbrev_types

{'f', 'fr', 'j'}

# Use the customized tokenizer.

>>> tokenizer.tokenize(text)

['An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952.', 'Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the schools of business and law.', "It was during the tenure of F. Henry J. McAnulty that Fr. Gallagher's ambitious plans were put to action."]
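When the text is too short for training to pick up every abbreviation, the abbreviations can also be supplied by hand through PunktParameters. The following is a minimal sketch of that alternative, not part of the original answer: the sample sentence and the abbreviation set are made up for illustration, and entries in abbrev_types are assumed to be lowercase and without the trailing period, matching the learned set shown above.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

# Hypothetical sample text, chosen only to exercise the "Fr." abbreviation.
sample = ("Fr. Vernon F. Gallagher proposed the plan in 1952. "
          "Fr. Gallagher's plans were later put to action.")

# Supply abbreviations manually instead of learning them from text.
# Assumption: entries are lowercase and omit the trailing period.
params = PunktParameters()
params.abbrev_types = {"fr", "f", "j"}

# Passing a PunktParameters object in place of training text makes the
# tokenizer use these parameters directly, without running the trainer.
custom_tokenizer = PunktSentenceTokenizer(params)

sentences = custom_tokenizer.tokenize(sample)
for sentence in sentences:
    print(sentence)
```

Training and manual parameters address different situations: training adapts to whatever abbreviations the corpus contains, while a hand-written set is useful when the corpus is small or the abbreviations are known in advance.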
