Python: extracting sentences from a paragraph

I am new to Python and could use some help.

This is just a sample:

I have a list of dictionaries, each with the same keys:

list_dummy = [
    {'a': 1, 'b': "The house is great. I loved it.", 'e': "loved,the"},
    {'a': 3, 'b': "Building is white in colour. I liked it.", 'e': "colour"},
    {'a': 5, 'b': "She is looking pretty. She is in my college", 'e': "pretty"},
]

'b' holds the body text.

'e' holds the words (there can be more than one).

I want to extract the sentences from 'b' that contain one or more of the words in 'e'.

I need to first split the text into sentences with sent_tokenize and then extract the matching ones. sent_tokenize only takes a string as input. How should I proceed?

Solution

I can't get the nltk module working to test this, but as long as sent_tokenize() returns a list of sentence strings, something like this should do what you want (if I understood correctly):

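from nltk.tokenize import sent_tokenize

ans = []
for d in list_dummy:
    # Split the body text into a list of sentence strings
    tmp = sent_tokenize(d['b'])
    # Keep sentences containing any keyword (case-insensitive substring match)
    s = [x for x in tmp if any(w.upper() in x.upper() for w in d['e'].split(","))]
    ans += s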

This assumes that 'e' is always a comma-separated list and that you want a case-insensitive search. The ans variable will be a flat list of the sentences that contain a word from the 'e' value of their own dictionary.
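
To try the idea without installing NLTK, here is a minimal self-contained sketch. Note that `split_sentences` is a hypothetical stand-in for `sent_tokenize` (a naive regex split after sentence-ending punctuation, not NLTK's algorithm), and `extract_matching` is a name made up for illustration.

```python
import re

def split_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize: split after '.', '!' or '?'
    # followed by whitespace. Much cruder than NLTK (it breaks on abbreviations).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def extract_matching(records):
    # Return a flat list of sentences from 'b' that contain any keyword from 'e'
    # (case-insensitive substring match).
    ans = []
    for d in records:
        words = [w.strip().upper() for w in d['e'].split(",") if w.strip()]
        ans += [s for s in split_sentences(d['b'])
                if any(w in s.upper() for w in words)]
    return ans

list_dummy = [
    {'a': 1, 'b': "The house is great. I loved it.", 'e': "loved,the"},
    {'a': 3, 'b': "Building is white in colour. I liked it.", 'e': "colour"},
    {'a': 5, 'b': "She is looking pretty. She is in my college", 'e': "pretty"},
]

result = extract_matching(list_dummy)
print(result)
```

With real data you would swap `split_sentences` for `sent_tokenize`, since the rest of the function only relies on getting back a list of strings.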

EDIT

If you prefer regular expressions, you could use the re module:

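import re
from nltk.tokenize import sent_tokenize

ans = []
for d in list_dummy:
    b = sent_tokenize(d['b'])
    e = d['e'].split(",")
    # Escape each keyword and OR them together; search() matches anywhere in
    # the sentence, and IGNORECASE keeps this in line with the first version
    r = re.compile("|".join(re.escape(w) for w in e), re.IGNORECASE)
    # Appends one list of matching sentences per dictionary
    ans.append([x for x in b if r.search(x)])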
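
One caveat: matching on substrings means the keyword "the" would also select a sentence that only contains "There". If whole-word matching is what you want, a sketch using `\b` word boundaries might look like the following. `keyword_pattern` is a hypothetical helper name, not part of `re` or NLTK, and it assumes 'e' holds at least one non-empty keyword.

```python
import re

def keyword_pattern(csv_words):
    # Build a case-insensitive pattern that matches any keyword as a whole word.
    # Assumes csv_words contains at least one non-empty, comma-separated keyword.
    words = [re.escape(w.strip()) for w in csv_words.split(",") if w.strip()]
    return re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)

r = keyword_pattern("loved,the")
sentences = ["The house is great.", "There it is.", "I loved it."]

# "There it is." is skipped because "the" only appears inside "There"
matches = [s for s in sentences if r.search(s)]
print(matches)
```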
