Alibaba Written Test: Query Slot Recognition for a Simplified Conversational Assistant

Author: 这个夏天在冬天

Tags: python, algorithms, semantic slots

Article link: https://blog.csdn.net/u012019376/article/details/82527302

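Yesterday I sat Alibaba's written test, and the first coding problem alone was enough to keep me busy. The time limit was not the issue; my brain was. Enough chatter, here is the substance.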

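Problem description

The current AI battle among internet companies uses the smart speaker as its entry point for opening up the market and winning users. A smart speaker relies on voice interaction, which requires semantic understanding of the query. For example, in "我要看章子怡的一代宗师" ("I want to watch Zhang Ziyi's The Grandmaster"), the system needs to identify the action "看" (watch), "章子怡", and "一代宗师". The usual approach is to build a knowledge base in which nouns are annotated with labels, for example: 章子怡 is an actor, and 一代宗师 is a movie. Some nouns also contain others, such as "周杰" and "周杰伦"; for these, the rule is to scan from left to right and prefer the longest matching string.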
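The longest-match rule can be illustrated with a small Python 3 sketch. The helper name `longest_match_at` and the two-entry vocabulary are hypothetical and chosen only for illustration; the problem statement does not prescribe any particular function.

```python
def longest_match_at(query, start, nouns):
    """Return the longest noun that begins at query[start], or None."""
    best = None
    for noun in nouns:
        # str.startswith accepts a start offset, so no slicing is needed.
        if query.startswith(noun, start):
            if best is None or len(noun) > len(best):
                best = noun
    return best


nouns = {"周杰", "周杰伦"}   # hypothetical vocabulary
query = "周杰伦周杰"
print(longest_match_at(query, 0, nouns))   # 周杰伦 (preferred over 周杰)
print(longest_match_at(query, 3, nouns))   # 周杰
```

At position 0 both nouns match, and the longer one wins; at position 3 only "周杰" fits in the remaining text.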

Input format:

The first line is a simplified knowledge base: <label1>_<noun1>|<noun2>|<noun3>;<label2>_<noun2>|<noun4>|<noun5>;.....

The second line is the query.
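As a minimal Python 3 sketch of reading the first input line, the function below turns the knowledge-base string into a mapping from each noun to its labels. The name `parse_knowledge_base` is hypothetical, and skipping empty groups and empty nouns is an assumption about how lenient the parser should be.

```python
def parse_knowledge_base(line):
    """Map each noun to the list of labels it appears under, in input order."""
    noun_to_labels = {}
    for group in line.split(";"):
        if not group:
            # Assumption: tolerate a trailing ";" instead of failing.
            continue
        # Everything before the first "_" is the label; the rest are nouns.
        label, _, nouns = group.partition("_")
        for noun in nouns.split("|"):
            if noun:
                noun_to_labels.setdefault(noun, []).append(label)
    return noun_to_labels


kb = parse_knowledge_base("singer_周杰|周杰伦;actor_周杰伦|孙俪")
print(kb["周杰伦"])   # ['singer', 'actor']
print(kb["孙俪"])     # ['actor']
```

Keeping the labels in a list preserves the order in which they appear in the knowledge base, which is the order the output format lists them in.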

Output format:

**** <noun1>/<label1> **** <noun2>/<label1>,<label2> ****

Example:

Input:

singer_周杰|周杰伦|刘德华|王力宏;song_冰雨|北京欢迎你|七里香;actor_周杰伦|孙俪

请播放周杰伦的七里香给周杰伦周杰孙俪听周杰王力宏

Output:

请播放周杰伦/singer,actor 的七里香/song 给周杰伦/singer,actor 周杰/singer 孙俪/actor 听周杰/singer 王力宏/singer

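Approach

Because matching gives priority to the leftmost, longest noun, the knowledge base is first converted into a mapping from each noun to its labels: {"<noun2>": ["<label1>", "<label2>"], ....}, with the keys (the nouns) sorted in reverse order. The sorted keys are then matched against the query one by one. When a key matches, its occurrences in the query are replaced with a numbered placeholder (so that a shorter substring tried later cannot overwrite a longer string that contains it), and the noun is recorded. This pass yields an ordered list of nouns. Finally, the list is traversed again and each placeholder in the query is replaced with the noun and its labels, giving the final output.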
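To see why the numbered placeholders matter, the Python 3 sketch below compares a naive in-place replacement with the two-pass placeholder idea. Both function names are hypothetical. Ordering the nouns by length (rather than in reverse key order, as described above) is an assumption made here so that any containing noun is tried before the nouns it contains.

```python
def naive_tag(query, noun_to_labels):
    """Replace nouns directly; a shorter noun can corrupt a longer one."""
    for noun in sorted(noun_to_labels, key=len, reverse=True):
        tagged = " {}/{} ".format(noun, ",".join(noun_to_labels[noun]))
        query = query.replace(noun, tagged)
    return " ".join(query.split())


def tag_with_placeholders(query, noun_to_labels):
    """Pass 1: noun -> numbered placeholder. Pass 2: placeholder -> tag."""
    found = []
    for noun in sorted(noun_to_labels, key=len, reverse=True):
        if noun in query:
            found.append(noun)
            # The placeholder hides the noun from shorter nouns tried later.
            query = query.replace(noun, "|&{}&|".format(len(found)))
    for number, noun in enumerate(found, start=1):
        tagged = " {}/{} ".format(noun, ",".join(noun_to_labels[noun]))
        query = query.replace("|&{}&|".format(number), tagged)
    return " ".join(query.split())


labels = {"周杰": ["singer"], "周杰伦": ["singer", "actor"]}
print(naive_tag("听周杰伦", labels))
# 听 周杰/singer 伦/singer,actor   (the longer noun was split apart)
print(tag_with_placeholders("听周杰伦和周杰", labels))
# 听 周杰伦/singer,actor 和 周杰/singer
```

The placeholder version assumes that no noun contains the characters used in the placeholder itself ("|", "&", digits); a noun that did would need a different marker.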

Code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-


def match_process():
    # Line 1: the knowledge base, e.g. "singer_周杰|周杰伦;actor_周杰伦".
    row1 = input()
    datas = {}  # noun -> list of labels, in knowledge-base order
    for entity_str in row1.split(";"):
        # Split on the first "_" only, so nouns may contain underscores.
        entity_name, entity_values = entity_str.split("_", 1)
        for entity_value in entity_values.split("|"):
            datas.setdefault(entity_value, []).append(entity_name)

    # Reverse lexicographic order puts a noun ahead of any of its own
    # prefixes, so "周杰伦" is tried before "周杰".
    entity_list = sorted(datas, reverse=True)

    # Line 2: the query.
    tmp_words = input()
    result = []  # nouns found, in the order their placeholders were issued
    for entity_value in entity_list:
        if entity_value in tmp_words:
            result.append(entity_value)
            # Swap the noun for a numbered placeholder so that a shorter
            # noun tried later cannot match inside it.
            tmp_words = tmp_words.replace(entity_value, "|&{}&|".format(len(result)))

    # Expand placeholder k into "noun/label1,label2", padded with spaces.
    for index, entity_value in enumerate(result, start=1):
        st = ",".join(datas[entity_value])
        new_str = " " + entity_value + "/" + st + " "
        tmp_words = tmp_words.replace("|&{}&|".format(index), new_str)

    # Collapse the runs of spaces left by adjacent nouns.
    print(" ".join(tmp_words.split()))


if __name__ == '__main__':
    match_process()

#singer_周杰|周杰伦|刘德华|王力宏;song_冰雨|北京欢迎你|七里香;actor_周杰伦|孙俪
# 请播放周杰伦的七里香给周杰伦周杰孙俪听周杰王力宏
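
For comparison, here is an alternative Python 3 sketch that scans the query from left to right in a single pass with a regular expression, instead of using placeholders. This is not the approach described above; the function name `tag_query` is hypothetical. The spacing follows the sample output (one space after each tagged noun), which is an assumption, since the output format line pads both sides.

```python
import re


def tag_query(kb_line, query):
    """Tag every knowledge-base noun in query, scanning left to right."""
    noun_to_labels = {}
    for group in kb_line.split(";"):
        label, _, nouns = group.partition("_")
        for noun in filter(None, nouns.split("|")):
            noun_to_labels.setdefault(noun, []).append(label)
    if not noun_to_labels:
        return query

    # Regex alternatives are tried in order at each position, so listing
    # longer nouns first makes the leftmost match also the longest one.
    ordered = sorted(noun_to_labels, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(noun) for noun in ordered))

    def annotate(match):
        noun = match.group(0)
        return "{}/{} ".format(noun, ",".join(noun_to_labels[noun]))

    return pattern.sub(annotate, query).rstrip()


kb_line = "singer_周杰|周杰伦|刘德华|王力宏;song_冰雨|北京欢迎你|七里香;actor_周杰伦|孙俪"
print(tag_query(kb_line, "请播放周杰伦的七里香给周杰伦周杰孙俪听周杰王力宏"))
```

Because the pattern consumes the text as it goes, a noun that has already been matched can never be matched again as part of a shorter one, so no placeholder step is needed.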
