Chinese Word-Order Recognition: Method 1

中国小宝

Categories: Data Mining, Notes

Article link: https://blog.csdn.net/weixin_39128119/article/details/84585402

I. Preface

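There are two approaches to word-order recognition: 1. match candidate strings against the words in a tokenizer's dictionary; 2. predict the order using large-scale word vectors.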

II. Main Idea

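This post implements the first approach. The main idea is as follows:

1. Take the input string and generate every permutation of its characters, producing a list of candidate "words";

2. Look up each candidate from step 1 in the tokenizer's dictionary and get its word frequency;

3. Return the candidate with the highest frequency as the correct word order.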
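To make the three steps concrete before bringing in a real tokenizer, here is a minimal sketch that runs them against a toy frequency table. The name `guess_word_order`, the table `TOY_FREQ`, and its counts are all made up for illustration; they are not part of jieba.

```python
from itertools import permutations

# Hypothetical toy frequency table; a real run would look words up in the
# tokenizer's dictionary instead. The counts are invented for illustration.
TOY_FREQ = {"图书馆": 300, "蜂蜜": 80, "蜜蜂": 120}


def guess_word_order(chars, freq=TOY_FREQ):
    # Step 1: every distinct ordering of the input characters
    candidates = {"".join(p) for p in permutations(chars)}
    # Step 2: keep only the orderings present in the dictionary
    known = {word: freq[word] for word in candidates if word in freq}
    if not known:
        return None
    # Step 3: the most frequent known ordering wins
    return max(known, key=known.get)


print(guess_word_order("馆图书"))  # 图书馆
print(guess_word_order("蜂蜜"))    # 蜜蜂 (higher toy count than 蜂蜜)
print(guess_word_order("书图"))    # None, no ordering is in the toy table
```

The second call shows why step 3 is needed: both orderings are valid words, so frequency is the only tie-breaker, and the method always prefers the more common word regardless of what the user meant.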

III. Implementation

import os

import jieba
from itertools import permutations

# Path to the dict.txt file bundled with the installed jieba package
JIEBA_DICT_PATH = os.path.join(os.path.dirname(jieba.__file__), 'dict.txt')


# Return every permutation of the input characters as a string
def get_all_possible_word(chars):
    return [''.join(p) for p in permutations(chars)]


# Return the index of the longest word in the list
def find_longest(words):
    longest = 0
    index = 0
    for i, word in enumerate(words):
        if len(word) > longest:
            longest = len(word)
            index = i
    return index


# Given a list of candidate words, return the one with the highest frequency in jieba's dictionary
def find_highest_frequency(possible_words):
    word_dict = dicts(JIEBA_DICT_PATH)
    possible_dict = {}
    for possible_word in possible_words:
        # Words missing from the dictionary count as frequency 0
        possible_dict[word_dict.get(possible_word, 0)] = possible_word
    sorted_items = sort_dict_by_key(possible_dict)
    # Keys are sorted in ascending order, so the last entry has the highest frequency
    return sorted_items[-1][1]


# Sort the input dictionary by key and return (key, value) pairs
def sort_dict_by_key(dic):
    return [(k, dic[k]) for k in sorted(dic.keys())]


# Convert dict.txt into a {word: frequency} dictionary
def dicts(filename):
    word_freq = {}
    # dict.txt is UTF-8; each line is "word frequency part-of-speech"
    with open(filename, encoding='utf-8') as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                word_freq[fields[0]] = int(fields[1])
    return word_freq

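# Word-order recognition
def recog_word_order(chars):
    length = len(chars)
    word_list = get_all_possible_word(chars)
    possible_words = []
    for word in word_list:
        # Full mode lists every dictionary word found in the string
        seg_list = jieba.lcut(word, cut_all=True)
        print(seg_list)  # debug output: segmentation of each permutation
        index = find_longest(seg_list)
        candidate = seg_list[index]
        # Keep the permutation only if jieba sees it as one whole word;
        # skip duplicates caused by repeated characters in the input
        if len(candidate) == length and candidate not in possible_words:
            possible_words.append(candidate)
    if len(possible_words) == 1:
        return possible_words[0]
    elif len(possible_words) > 1:
        return find_highest_frequency(possible_words)
    else:
        return "jieba does not include this word yet"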
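
One practical caveat: recog_word_order segments every permutation of the input, so the work grows factorially with the number of characters. The sketch below only counts the orderings, using the standard library; `count_distinct_orderings` is a hypothetical helper name, not something from the code above or from jieba.

```python
from itertools import permutations
from math import factorial

# Orderings generated for inputs of 2 to 8 distinct characters
for n in range(2, 9):
    print(n, factorial(n))


def count_distinct_orderings(chars):
    # Repeated characters yield duplicate permutations; a set removes them
    return len({"".join(p) for p in permutations(chars)})


print(count_distinct_orderings("abc"))  # 6
print(count_distinct_orderings("aab"))  # 3
```

At 8 distinct characters there are already 40320 strings to segment, so this approach is best suited to short words and idioms.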