Python string slicing_ScienceNet: Extracting Sentences with Python - Lü Bo's blog post

weixin_39655362

Published 2020-11-24 05:20:57

Tags: python string slicing

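Separating the sentences in a passage is not easy. Sentences do not begin and end in a regular way, and periods also appear inside sentences, for example in abbreviations and decimal numbers. This makes it impossible to split sentences reliably with a single regular expression: sometimes it works, but most of the time it goes wrong. Here we use the nltk module to do the job.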
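To make the problem concrete, here is a minimal sketch of a naive rule that splits on any whitespace following a period, question mark, or exclamation mark. The sample text and the pattern are illustrative assumptions, not taken from the original post.

```python
import re

# Illustrative sample text (an assumption, shortened from the kind of
# paragraph discussed in this post).
text = "Mr. Smith paid 1.5 million dollars. Did he mind?"

# Naive rule: split on whitespace that follows '.', '?' or '!'.
naive = re.split(r"(?<=[.?!])\s+", text)

for piece in naive:
    print(piece)
```

The abbreviation "Mr." is treated as the end of a sentence, so the text comes back in three pieces although it contains only two sentences.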

Part 1: Using regular expressions

import re

paragraph = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. I say. What's wrong with you? I am confused by your activity."

# Match the whitespace that follows the end of a sentence, so that re.split can then split on it
rule = re.compile(r"(?

result = re.split(rule, paragraph)

for sentence in result:
    print(sentence)
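The pattern in the listing above is cut off after `(?`, so it cannot be run as shown. The sketch below swaps in a commonly used lookbehind-based pattern as an assumption; it is not necessarily the one the author used. It refuses to split after abbreviations shaped like "Mr." or "i.e." and otherwise splits on the whitespace that follows a period or question mark.

```python
import re

paragraph = "Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. I say. What's wrong with you? I am confused by your activity."

# Assumed pattern (the source pattern is truncated):
#   (?<!\w\.\w.)       not preceded by something like "i.e."
#   (?<![A-Z][a-z]\.)  not preceded by something like "Mr." or "Jr."
#   (?<=\.|\?)         preceded by a period or a question mark
#   \s                 the whitespace character to split on
rule = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

sentences = rule.split(paragraph)

for sentence in sentences:
    print(sentence)
```

With this assumed pattern the sample paragraph splits into eight sentences. The rule is still a heuristic: abbreviations of other shapes (for example "Prof.") would still be split.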

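If the paragraph itself contains double quotes, the string literal raises an error. In that case, wrap the text in triple double quotes or triple single quotes instead; I have tested this and it works. The regular expression also has to change accordingly. Below is the code that uses a regular expression to extract sentences from a text file.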
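Before turning to the file-based version, the remark about quotes can be sketched briefly. The sample text and the adjusted pattern below are assumptions for illustration: the pattern also accepts a closing quote between the sentence-ending punctuation and the whitespace, and it does not handle abbreviations.

```python
import re

# Triple quotes let the literal contain double quotes without escaping.
paragraph = '''He said, "It works." Then he left.'''

# Assumed pattern: split on whitespace that follows either
#   sentence punctuation plus a closing quote, or
#   sentence punctuation alone.
rule = re.compile(r"""(?<=[.?!]["'])\s+|(?<=[.?!])\s+""")

sentences = rule.split(paragraph)

for sentence in sentences:
    print(sentence)
```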

import re

# Open the text file, which must be saved in ANSI format.
# A text file saved in Unicode format doesn't work. I don't know why.
input = open('test.txt')
input_result = input.read()
rule = re.compile(r"(?
result = re.split(rule, input_result)
#for sentence in result:
#    print(sentence)
input.close()

# This call creates the ouput.txt file for you if it does not already exist.
output = open("ouput.txt", "a+")
for sentence in result:
    output.write(sentence)
    output.write("\n")
output.close()
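One likely explanation for the ANSI-only behaviour noted in the comments is that Windows Notepad's "Unicode" format is UTF-16, and Python 2's `open().read()` returns raw bytes that the pattern cannot match sensibly. The sketch below assumes Python 3, where `open` accepts an `encoding` argument. The helper name `split_file`, the sample text, and the pattern are all assumptions, since the source pattern is truncated.

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern; the one in the original listing is truncated.
SENTENCE_END = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")


def split_file(in_path, out_path, encoding="utf-8"):
    """Read in_path, split it into sentences, append one per line to out_path."""
    with open(in_path, encoding=encoding) as src:
        # Drop empty strings produced by consecutive whitespace.
        sentences = [s for s in SENTENCE_END.split(src.read()) if s]
    with open(out_path, "a", encoding="utf-8") as dst:
        for sentence in sentences:
            dst.write(sentence + "\n")
    return sentences


# Demo on a temporary UTF-16 file, standing in for a Notepad "Unicode" file.
with tempfile.TemporaryDirectory() as tmp:
    src_path = Path(tmp) / "test.txt"
    out_path = Path(tmp) / "ouput.txt"
    src_path.write_text("Did he mind? Adam Jones Jr. thinks he didn't.",
                        encoding="utf-16")
    sentences = split_file(src_path, out_path, encoding="utf-16")
    written = out_path.read_text(encoding="utf-8")

print(sentences)
```

The `with` blocks close both files even if an error occurs, so the explicit `close()` calls are no longer needed.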

Part 2: Extracting sentences from a string

from nltk import tokenize

paragraph = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

print(tokenize.sent_tokenize(paragraph))

Part 3: Extracting sentences from a text file

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Read the whole file, then close it before tokenizing
with open("test.txt") as fp:
    data = fp.read()

print('\n-----\n'.join(tokenizer.tokenize(data)))

Note: for now I have not managed to install the nltk module; the installer reports that a DLL file is missing!
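Since nltk may not be usable on every machine, one option is to fall back to a regular expression when it cannot be imported or its tokenizer data is missing. This is only a sketch: the helper name `split_sentences` and the fallback pattern are assumptions, not part of the original post.

```python
import re

# Assumed fallback pattern; it is not taken from the original post.
FALLBACK_RULE = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")


def split_sentences(text):
    """Split text with nltk when it is usable, otherwise with the regex."""
    try:
        from nltk import tokenize
        return tokenize.sent_tokenize(text)
    except (ImportError, LookupError):
        # ImportError: nltk is missing or failed to import.
        # LookupError: nltk is installed but its tokenizer data is not.
        return [s for s in FALLBACK_RULE.split(text) if s]


sentences = split_sentences(
    "Good morning Dr. Adams. The patient is waiting for you in room number 3."
)
print(sentences)
```

The regex path is less accurate than nltk on unusual abbreviations, so it works best as a stopgap until the installation problem is solved.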

To repost this article, please contact the original author for permission, and state that it comes from Lü Bo's ScienceNet blog.

Link: http://blog.sciencenet.cn/blog-645111-1013989.html
