Sequential Pattern Matching Algorithm in Python

Latest recommended article published on 2022-12-11 10:25:42

weixin_39889642


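Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_39889642/article/details/111453259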


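I found myself needing to implement a sequential pattern matching algorithm in Python. After searching for hours, I could not find any usable library or code snippet on the internet.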

Problem definition:

Implement a function sequential_pattern_match.

input: tokens (an ordered collection of strings)

output: a list of tuples, each tuple = (any subcollection of tokens, tag)

Domain experts will define the matching rules, usually using regex:

test(tokens) -> tag or None

Example:

input: ["Singapore", "Python", "User", "Group", "is", "here"]

output: [(["Singapore", "Python", "User", "Group"], "ORGANIZATION"), ("is", 'O'), ("here", 'O')]

"O" means no match.
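To make the `test(tokens) -> tag or None` contract concrete, here is a minimal sketch of a regex-driven rule function. The `RULES` table, the patterns in it, and the name `regex_test` are illustrative assumptions, not rules taken from this post.

```python
import re

# Hypothetical rule table: (compiled regex, tag). These patterns are
# assumptions chosen to mirror the example above.
RULES = [
    (re.compile(r"Singapore Python User Group"), "ORGANIZATION"),
    (re.compile(r"Singapore"), "LOCATION"),
    (re.compile(r"Python"), "LANGUAGE"),
]

def regex_test(tokens):
    """Return the tag of the first rule that matches the whole span, else None."""
    text = " ".join(tokens)
    for pattern, tag in RULES:
        # fullmatch requires the pattern to cover the entire joined span.
        if pattern.fullmatch(text):
            return tag
    return None

print(regex_test(["Singapore", "Python", "User", "Group"]))  # ORGANIZATION
print(regex_test(["is"]))  # None
```

Using `fullmatch` rather than `search` keeps a rule for "Singapore" from firing on a longer span that merely contains that word.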

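Conflict resolution rules:

Matches that appear first have higher priority.

e.g. if matches within "Singapore Property Sales" conflict, "Singapore Property" is matched first.

Longer matches have higher priority than shorter matches.

e.g. "Singapore Python User Group" as an ORGANIZATION takes priority over the separate matches "Singapore" as a LOCATION plus "Python" as a LANGUAGE.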
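The two rules can be sketched as a sort followed by a single pass, assuming the candidate matches are already known as `(start, end, tag)` token-index spans and that the first rule (earlier start) dominates the second (longer span). The function name `resolve_conflicts` and the candidate format are hypothetical, introduced only for illustration.

```python
def resolve_conflicts(candidates):
    """Pick non-overlapping matches: earliest start first, then longest."""
    chosen = []
    next_free = 0  # index of the first token not yet covered by a match
    # Sort by start ascending; for equal starts, longer spans come first
    # because (start - end) is more negative for longer spans.
    for start, end, tag in sorted(candidates, key=lambda c: (c[0], c[0] - c[1])):
        if start >= next_free:
            chosen.append((start, end, tag))
            next_free = end
    return chosen

# Indices refer to ["Singapore", "Python", "User", "Group", "is", "here"].
candidates = [
    (0, 1, "LOCATION"),      # "Singapore"
    (1, 2, "LANGUAGE"),      # "Python"
    (0, 4, "ORGANIZATION"),  # "Singapore Python User Group"
]
print(resolve_conflicts(candidates))  # [(0, 4, 'ORGANIZATION')]
```

Tokens not covered by any chosen span would then be tagged 'O'.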

Drawing on my knowledge of algorithms and data structures, here is my implementation:

    # The built-in filter() and map() are lazy in Python 3,
    # so itertools.ifilter/imap are not needed.

    MAX_PATTERN_LENGTH = 3

    def test(tokens):
        """Rule function: return a tag for the token span, or None if no rule matches."""
        length = len(tokens)
        if length == 1:
            if tokens[0] == "Nexium":
                return "MEDICINE"
            elif tokens[0] == "pain":
                return "SYMPTOM"
        elif length == 2:
            string = ' '.join(tokens)
            if string == "Barium Swallow":
                return "INTERVENTION"
            elif string == "Swallow Test":
                return "INTERVENTION"
        else:
            if ' '.join(tokens) == "pain in stomach":
                return "SYMPTOM"

    def _evaluate(tokens):
        # A tagged span, or a single unmatched token tagged 'O', or None.
        tag = test(tokens)
        if tag:
            return (tokens, tag)
        elif len(tokens) == 1:
            return (tokens, 'O')

    def _splits(tokens):
        # Yield (head, tail) pairs, longest head first.
        return ((tokens[:i], tokens[i:]) for i in range(min(len(tokens), MAX_PATTERN_LENGTH), 0, -1))

    def sequential_pattern_match(tokens):
        # Take the first split whose head matches; empty input yields [].
        return next(filter(bool, map(_halves_match, _splits(tokens))), [])

    def _halves_match(halves):
        result = _evaluate(halves[0])
        if result:
            # Recurse on the tail only when it is non-empty.
            return [result] + (halves[1] and sequential_pattern_match(halves[1]))

    if __name__ == "__main__":
        tokens = "I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split()
        output = sequential_pattern_match(tokens)
        slashTags = ' '.join(t + '/' + tag for tokens, tag in output for t in tokens)
        print(slashTags)
        assert slashTags == "I/O went/O to/O a/O clinic/O to/O do/O a/O Barium/INTERVENTION Swallow/INTERVENTION Test/O because/O I/O had/O pain/SYMPTOM in/SYMPTOM stomach/SYMPTOM after/O taking/O Nexium/MEDICINE"

        import timeit
        t = timeit.Timer(
            'sequential_pattern_match("I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split())',
            'from __main__ import sequential_pattern_match'
        )
        print(t.repeat(3, 10000))

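I don't think it can get much faster. Unfortunately, it is written in a functional style, which may not be a good fit for Python. Can you come up with a faster implementation in an OO or imperative style?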

(Note: I'm sure it would be faster if implemented in C, but for now I have no plan to use any language other than Python.)
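One possible imperative rewrite is sketched below. It is not benchmarked here, so whether it is actually faster is left open; what it does avoid is the recursion and the repeated copying of the remaining tail (`tokens[i:]`) at every position. The names `sequential_pattern_match_loop`, `lookup_test`, and `RULES` are hypothetical, and the dictionary lookup is an assumed stand-in that mirrors the rules of `test` above.

```python
# Assumed stand-in for the rule function: same phrases and tags as test().
RULES = {
    "Nexium": "MEDICINE",
    "pain": "SYMPTOM",
    "Barium Swallow": "INTERVENTION",
    "Swallow Test": "INTERVENTION",
    "pain in stomach": "SYMPTOM",
}

def lookup_test(tokens):
    """Return the tag for the joined span, or None if it is not in RULES."""
    return RULES.get(' '.join(tokens))

def sequential_pattern_match_loop(tokens, test, max_pattern_length=3):
    """Greedy left-to-right tagging: at each position try the longest span first."""
    output = []
    i = 0
    n = len(tokens)
    while i < n:
        # Try spans starting at i, from longest to shortest.
        for length in range(min(max_pattern_length, n - i), 0, -1):
            span = tokens[i:i + length]
            tag = test(span)
            if tag:
                break
        else:
            # No rule matched at this position: emit the single token as 'O'.
            span, tag = tokens[i:i + 1], 'O'
        output.append((span, tag))
        i += len(span)
    return output

sentence = "I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium"
result = sequential_pattern_match_loop(sentence.split(), lookup_test)
print(' '.join(t + '/' + tag for span, tag in result for t in span))
```

Passing the rule function as a parameter keeps the loop independent of any particular rule set, so it can be timed against the recursive version with the same `timeit` setup.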
