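I've found myself needing to implement a sequential pattern matching algorithm in Python. After several hours of searching, I couldn't find any usable library or code snippet on the internet.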
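Problem definition:
Implement a function sequential_pattern_match
input: tokens (an ordered collection of strings)
output: a list of tuples, each tuple = (any subcollection of tokens, tag)
Domain experts will define the matching rules, usually with regular expressions:
test(tokens) -> tag or None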
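To make the rule interface concrete, here is a minimal sketch of what a regex-based rule function could look like. The name `regex_test`, the `RULES` table, and the patterns in it are all hypothetical; they only illustrate the `test(tokens) -> tag or None` contract and are not the experts' actual rules.

```python
import re

# Hypothetical rule table: (regex over the space-joined tokens, tag).
# Rules are tried in order, so more specific rules go first.
RULES = [
    (re.compile(r"(?:\w+ )*User Group"), "ORGANIZATION"),
    (re.compile(r"Singapore"), "LOCATION"),
    (re.compile(r"Python"), "LANGUAGE"),
]


def regex_test(tokens):
    """Plays the role of test(tokens): return a tag, or None if no rule matches."""
    phrase = " ".join(tokens)
    for pattern, tag in RULES:
        # fullmatch requires the whole phrase to match, not just a prefix.
        if pattern.fullmatch(phrase):
            return tag
    return None
```

Under these assumed rules, `regex_test(["Singapore", "Python", "User", "Group"])` gives `"ORGANIZATION"`, while `regex_test(["is"])` gives `None`.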
Example:
input: ["Singapore", "Python", "User", "Group", "is", "here"]
output: [(["Singapore", "Python", "User", "Group"], "ORGANIZATION"), ("is", 'O'), ("here", 'O')]
'O' means no match.
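Conflict resolution rules:
A match that appears earlier has higher priority.
E.g. in "Singapore Property Sales", if the matches conflict, "Singapore Property" is matched in preference to "Property Sales".
A longer match has higher priority than a shorter match.
E.g. "Singapore Python User Group" as ORGANIZATION has higher priority than the separate matches "Singapore" as LOCATION + "Python" as LANGUAGE.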
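The two rules above can be sketched as a left-to-right scan that, at each position, tries the longest window first. This is only an illustration under stated assumptions: `greedy_match`, the exact-phrase `RULES` dictionary, and the tags given to "Singapore Property" and "Property Sales" are made up for the example and stand in for the expert-defined `test`.

```python
# Hypothetical exact-phrase rules standing in for the expert-defined test().
RULES = {
    "Singapore Python User Group": "ORGANIZATION",
    "Singapore": "LOCATION",
    "Python": "LANGUAGE",
    "Singapore Property": "ORGANIZATION",  # tag invented for illustration
    "Property Sales": "ACTIVITY",          # tag invented for illustration
}
MAX_PATTERN_LENGTH = 4  # longest phrase (in tokens) that any rule can match


def greedy_match(tokens):
    """Tag tokens left to right, preferring earlier and then longer matches."""
    result = []
    i = 0
    while i < len(tokens):
        # Longer match wins: try the widest window at position i first.
        for size in range(min(MAX_PATTERN_LENGTH, len(tokens) - i), 0, -1):
            window = tokens[i:i + size]
            tag = RULES.get(" ".join(window))
            if tag or size == 1:
                # A single unmatched token falls back to 'O'.
                result.append((window, tag or "O"))
                # Earlier match wins: consume these tokens before moving on.
                i += size
                break
    return result


print(greedy_match("Singapore Python User Group is here".split()))
print(greedy_match("Singapore Property Sales".split()))
```

Because the tokens of an accepted match are consumed immediately, "Property Sales" never gets a chance to compete once "Singapore Property" has been taken, which is the first rule; the descending window size is the second.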
With my expertise in algorithms and data structures, here is my implementation:

MAX_PATTERN_LENGTH = 3


def test(tokens):
    # Stand-in for the expert-defined rules: return a tag, or None if no rule matches.
    length = len(tokens)
    if length == 1:
        if tokens[0] == "Nexium":
            return "MEDICINE"
        elif tokens[0] == "pain":
            return "SYMPTOM"
    elif length == 2:
        string = ' '.join(tokens)
        if string == "Barium Swallow":
            return "INTERVENTION"
        elif string == "Swallow Test":
            return "INTERVENTION"
    else:
        if ' '.join(tokens) == "pain in stomach":
            return "SYMPTOM"


def _evaluate(tokens):
    tag = test(tokens)
    if tag:
        return (tokens, tag)
    elif len(tokens) == 1:
        # A single unmatched token is tagged 'O'.
        return (tokens, 'O')


def _splits(tokens):
    # Yield (head, tail) pairs, longest head first.
    return ((tokens[:i], tokens[i:]) for i in range(min(len(tokens), MAX_PATTERN_LENGTH), 0, -1))


def sequential_pattern_match(tokens):
    # Take the first split whose head evaluates to a result (filter and map are lazy).
    return next(filter(bool, map(_halves_match, _splits(tokens))))


def _halves_match(halves):
    result = _evaluate(halves[0])
    if result:
        # An empty tail short-circuits to [] and ends the recursion.
        return [result] + (halves[1] and sequential_pattern_match(halves[1]))


if __name__ == "__main__":
    tokens = "I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split()
    output = sequential_pattern_match(tokens)
    slashTags = ' '.join(t + '/' + tag for tokens, tag in output for t in tokens)
    print(slashTags)
    assert slashTags == "I/O went/O to/O a/O clinic/O to/O do/O a/O Barium/INTERVENTION Swallow/INTERVENTION Test/O because/O I/O had/O pain/SYMPTOM in/SYMPTOM stomach/SYMPTOM after/O taking/O Nexium/MEDICINE"

    import timeit
    t = timeit.Timer(
        'sequential_pattern_match("I went to a clinic to do a Barium Swallow Test because I had pain in stomach after taking Nexium".split())',
        'from __main__ import sequential_pattern_match'
    )
    print(t.repeat(3, 10000))

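I don't think it can get any faster. Unfortunately, it is written in a functional style, which may not be a good fit for Python. Can you come up with a faster implementation in an OO or imperative style?
(Note: I'm sure it would be faster if implemented in C, but at the moment I have no plans to use any language other than Python.)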