A Simple Word Segmentation System

1. An enumeration-based Chinese word segmentation tool

  • Forward maximum matching
    • Example: 我们经常有意见分歧
    • Dictionary: ["我们", "经常", "有", "有意见", "意见", "分歧"]
    • Set the maximum window length max_len = 5
    • Run the forward matching algorithm:
      • (1)
        • [我们经常有]意见分歧 (not in the dictionary, shrink the window by one character)
        • [我们经常]有意见分歧 (not in the dictionary, shrink the window by one character)
        • [我们经]常有意见分歧 (not in the dictionary, shrink the window by one character)
        • [我们]经常有意见分歧 (in the dictionary, emit "我们")
      • (2)
        • [经常有意见]分歧 (not in the dictionary, shrink the window by one character)
        • [经常有意]见分歧 (not in the dictionary, shrink the window by one character)
        • [经常有]意见分歧 (not in the dictionary, shrink the window by one character)
        • [经常]有意见分歧 (in the dictionary, emit "经常")
      • ……
      • The final segmentation is ["我们", "经常", "有意见", "分歧"]
  • Backward maximum matching
    • Same example as above; the procedure is similar, but it scans from the end of the sentence
    • (1)
      • 我们经常[有意见分歧] (not in the dictionary, shrink the window by one character)
      • 我们经常有[意见分歧] (not in the dictionary, shrink the window by one character)
      • 我们经常有意[见分歧] (not in the dictionary, shrink the window by one character)
      • 我们经常有意见[分歧] (in the dictionary, emit "分歧")
    • (2)
      • 我们[经常有意见] (not in the dictionary, shrink the window by one character)
      • 我们经[常有意见] (not in the dictionary, shrink the window by one character)
      • 我们经常[有意见] (in the dictionary, emit "有意见")
    • ……
    • The final segmentation is ["我们", "经常", "有意见", "分歧"]
  • Building on the two methods above, for the given string "我们学习人工智能,人工智能是未来" we can list several possible segmentations:
    • [我们, 学习, 人工智能, 人工智能, 是, 未来]
    • [我们, 学习, 人工, 智能, 人工智能, 是, 未来]
    • [我们, 学习, 人工, 智能, 人工, 智能, 是, 未来]
    • [我们, 学习, 人工智能, 人工, 智能, 是, 未来] ……
    • We are also given the unigram probabilities: p(我们)=0.25, p(学习)=0.15, p(人工)=0.05, p(智能)=0.1, p(人工智能)=0.2, p(未来)=0.1, p(是)=0.15

  • We can then score every candidate segmentation. Under a unigram model the probability of a segmentation is the product of its word probabilities; taking the negative logarithm turns the product into a sum (and avoids numerical underflow), so the best segmentation is the one with the smallest score:

    • score(我们, 学习, 人工智能, 人工智能, 是, 未来) = -log p(我们) - log p(学习) - log p(人工智能) - log p(人工智能) - log p(是) - log p(未来)
    • score(我们, 学习, 人工, 智能, 人工智能, 是, 未来) = -log p(我们) - log p(学习) - log p(人工) - log p(智能) - log p(人工智能) - log p(是) - log p(未来)
    • score(我们, 学习, 人工, 智能, 人工, 智能, 是, 未来) = -log p(我们) - log p(学习) - log p(人工) - log p(智能) - log p(人工) - log p(智能) - log p(是) - log p(未来)
    • score(我们, 学习, 人工智能, 人工, 智能, 是, 未来) = -log p(我们) - log p(学习) - log p(人工智能) - log p(人工) - log p(智能) - log p(是) - log p(未来) ……
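
As a quick sanity check, the score of the first candidate can be worked out directly from the probabilities given above (base-10 logarithms, matching the code below; the choice of base only rescales the scores and does not change the ranking):

```python
import math

# unigram probabilities from the example above
p = {"我们": 0.25, "学习": 0.15, "人工智能": 0.2, "是": 0.15, "未来": 0.1}

seg = ["我们", "学习", "人工智能", "人工智能", "是", "未来"]
score = sum(-math.log10(p[w]) for w in seg)
print(round(score, 3))  # ≈ 4.648 — a lower score means a higher probability
```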
Putting it all together, here is the complete implementation: the two maximum-matching functions, candidate generation, and unigram scoring.

```python
import math

# dictionary of known words
dic_words = [
    "北京",
    "的", "天",
    "气", "天气",
    "真", "好", "真好", "啊", "真好啊",
    "今", "今天", "课程", "内容", "有",
    "很", "很有", "意思", "有意思", "课",
    "程", "经常", "意见", "意", "见",
    "有意见", "分歧", "分", "歧",
]

# unigram probability of each dictionary word
word_prob = {
    "北京": 0.03, "的": 0.08, "天": 0.005, "气": 0.005,
    "天气": 0.06, "真": 0.04, "好": 0.05, "真好": 0.04, "啊": 0.01, "真好啊": 0.02,
    "今": 0.01, "今天": 0.07, "课程": 0.06, "内容": 0.06, "有": 0.05, "很": 0.03,
    "很有": 0.04, "意思": 0.06, "有意思": 0.005, "课": 0.01,
    "程": 0.005, "经常": 0.08, "意见": 0.08, "意": 0.01, "见": 0.005,
    "有意见": 0.02, "分歧": 0.04, "分": 0.02, "歧": 0.005
}


def cal_score(segment: list) -> float:
    """Negative log10 probability of a segmentation (lower is better)."""
    score = 0.0
    for word in segment:
        score += -math.log10(word_prob[word])
    return score


def pre_segment_naive(words: str) -> list:
    """Forward maximum matching: repeatedly take the longest dictionary word
    (at most max_len characters) starting at the left end of the string."""
    segment = []
    max_len = 5
    while len(words) > 0:
        use_len = min(max_len, len(words))
        while use_len > 0:
            tmp_src = words[:use_len]
            if tmp_src in word_prob:
                segment.append(tmp_src)
                words = words[use_len:]
                break
            use_len -= 1
        if use_len == 0:      # no dictionary word starts here: give up
            break
    return segment


def post_segment_naive(words: str) -> list:
    """Backward maximum matching: the same idea, scanning from the right end."""
    segment = []
    max_len = 5
    while len(words) > 0:
        use_len = min(max_len, len(words))
        while use_len > 0:
            tmp_src = words[-use_len:]
            if tmp_src in word_prob:
                segment.append(tmp_src)
                words = words[:len(words) - use_len]
                break
            use_len -= 1
        if use_len == 0:      # no dictionary word ends here: give up
            break
    segment.reverse()         # words were collected right to left
    return segment


def word_segment_naive(input_str: str) -> list:
    """Split the input at every position, segment both halves with forward and
    backward maximum matching, and keep the candidate with the lowest score."""
    segments = []
    for i in range(len(input_str)):
        pre, post = input_str[:i], input_str[i:]

        # candidate from forward maximum matching
        s1, s2 = pre_segment_naive(pre), pre_segment_naive(post)
        # keep a candidate only if it reproduces the whole input string
        if "".join(s1) == pre and "".join(s2) == post:
            candidate = s1 + s2
            if candidate not in segments:
                segments.append(candidate)

        # candidate from backward maximum matching
        s1, s2 = post_segment_naive(pre), post_segment_naive(post)
        if "".join(s1) == pre and "".join(s2) == post:
            candidate = s1 + s2
            if candidate not in segments:
                segments.append(candidate)

    best_segment, best_score = [], float("inf")
    for seg in segments:
        score = cal_score(seg)
        if score < best_score:
            best_score = score
            best_segment = seg
    return best_segment


if __name__ == '__main__':
    print(word_segment_naive("北京的天气真好啊"))
    print(word_segment_naive("今天的课程内容很有意思"))
    print(word_segment_naive("经常有意见分歧"))
```

The output is:

['北京', '的', '天气', '真好啊']
['今天', '的', '课程', '内容', '很有', '意思']
['经常', '有意见', '分歧']
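
Note that word_segment_naive only splits the sentence at a single position and then relies on maximum matching for each half, so it does not actually examine every possible segmentation. For comparison, below is a minimal sketch of a true exhaustive enumeration over the same dic_words / word_prob (the names enumerate_segments and best_segment_exhaustive are ours, not part of the original code). It is exponential in the sentence length, which is exactly what the Viterbi formulation in Part 2 avoids.

```python
import math

def enumerate_segments(text: str, word_prob: dict, max_len: int = 5):
    """Recursively yield every way of splitting text into dictionary words."""
    if text == "":
        yield []
        return
    for k in range(1, min(max_len, len(text)) + 1):
        word = text[:k]
        if word in word_prob:
            for rest in enumerate_segments(text[k:], word_prob, max_len):
                yield [word] + rest

def best_segment_exhaustive(text: str, word_prob: dict) -> list:
    """Pick the candidate with the smallest negative log10 probability."""
    candidates = list(enumerate_segments(text, word_prob))
    if not candidates:
        return []
    return min(candidates,
               key=lambda seg: sum(-math.log10(word_prob[w]) for w in seg))

# best_segment_exhaustive("经常有意见分歧", word_prob) -> ['经常', '有意见', '分歧']
```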

2. Optimizing the pipeline with the Viterbi algorithm

  • Build a weighted directed graph (DAG) from the dictionary, the input sentence, and word_prob
  • Write a Viterbi-style algorithm to find the best path through the graph, i.e. the best segmentation of the sentence
  • Return the result

 

```python
import math

# same dictionary and unigram probabilities as in Part 1
dic_words = [
    "北京",
    "的", "天",
    "气", "天气",
    "真", "好", "真好", "啊", "真好啊",
    "今", "今天", "课程", "内容", "有",
    "很", "很有", "意思", "有意思", "课",
    "程", "经常", "意见", "意", "见",
    "有意见", "分歧", "分", "歧"
]

word_prob = {
    "北京": 0.03, "的": 0.08, "天": 0.005, "气": 0.005,
    "天气": 0.06, "真": 0.04, "好": 0.05, "真好": 0.04, "啊": 0.01, "真好啊": 0.02,
    "今": 0.01, "今天": 0.07, "课程": 0.06, "内容": 0.06, "有": 0.05, "很": 0.03,
    "很有": 0.04, "意思": 0.06, "有意思": 0.005, "课": 0.01,
    "程": 0.005, "经常": 0.08, "意见": 0.08, "意": 0.01, "见": 0.005,
    "有意见": 0.02, "分歧": 0.04, "分": 0.02, "歧": 0.005
}


def get_segment(words: str):
    """Greedy initial pass: grow a buffer character by character and emit it as
    soon as it is a dictionary word; if characters are left over at the end,
    merge them back into the previous pieces until a dictionary word is found."""
    segment = []
    tmp = ""
    for ch in words:
        tmp += ch
        if tmp in dic_words:
            segment.append(tmp)
            tmp = ""
    while len(tmp) > 0:
        tmp = segment[-1] + tmp
        del segment[-1]
        if tmp in dic_words:
            segment.append(tmp)
            tmp = ""
    return segment


def word_segment_viterbi(words: str):
    segment = get_segment(words)

    # graph[i] holds edges (j, word, weight): the word formed by segment[i:j]
    # with weight -log10(p(word)), i.e. a weighted DAG whose nodes are the
    # positions between the initial pieces
    graph = [[] for _ in range(len(segment) + 1)]

    # single-piece edges
    for i in range(len(segment)):
        graph[i].append((i + 1, segment[i], -1.0 * math.log10(word_prob[segment[i]])))

    # edges that merge several consecutive pieces into one dictionary word
    for i in range(len(segment)):
        word = segment[i]
        for j in range(i + 1, len(segment)):
            word += segment[j]
            if word in dic_words:
                graph[i].append((j + 1, word, -1.0 * math.log10(word_prob[word])))

    # dp[v]: best (smallest) accumulated weight found so far for node v
    # pre[v]: predecessor of v on that best path
    dp = [-1 for _ in range(len(segment) + 1)]
    pre = [-1 for _ in range(len(segment) + 1)]
    dfs(dp, pre, graph, 0, 0, 0.0)

    # walk back from the last node to recover the chosen words
    ret = []
    L = len(dp) - 1
    while True:
        R = L
        L = pre[R]
        if R == 0:
            break
        ret.append("".join(segment[L:R]))
    ret.reverse()
    return ret


def dfs(dp: list, pre: list, graph: list, u: int, v: int, val: float):
    """Brute-force shortest-path search over the DAG: a node is re-expanded
    whenever a cheaper path to it is found."""
    pre[v] = u
    dp[v] = val
    for nxt, _, w in graph[v]:
        if dp[nxt] == -1 or dp[nxt] > val + w:
            dfs(dp, pre, graph, v, nxt, val + w)


if __name__ == '__main__':
    print(word_segment_viterbi("北京的天气真好啊"))
    print(word_segment_viterbi("今天的课程内容很有意思"))
    print(word_segment_viterbi("经常有意见分歧"))
```

The output is:

['北京', '的', '天气', '真好啊']
['今天', '的', '课程', '内容', '很有', '意思']
['经常', '有意见', '分歧']
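
The dfs above explores paths recursively and may revisit nodes several times. For reference, here is a minimal sketch of the more standard Viterbi-style dynamic program over character positions (rather than over the greedy pieces from get_segment); the function name viterbi_segment and the max_len parameter are ours, and it assumes the same word_prob dictionary. Each position is processed once, so the run time is O(n · max_len).

```python
import math

def viterbi_segment(text: str, word_prob: dict, max_len: int = 5) -> list:
    """dp[i] = smallest -log10 probability of segmenting text[:i]."""
    n = len(text)
    INF = float("inf")
    dp = [INF] * (n + 1)
    back = [-1] * (n + 1)   # back[i]: start index of the word ending at i on the best path
    dp[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in word_prob and dp[j] - math.log10(word_prob[word]) < dp[i]:
                dp[i] = dp[j] - math.log10(word_prob[word])
                back[i] = j
    if dp[n] == INF:
        return []           # no full segmentation exists in the dictionary
    # follow the back pointers to recover the words
    result, i = [], n
    while i > 0:
        result.append(text[back[i]:i])
        i = back[i]
    result.reverse()
    return result

# viterbi_segment("经常有意见分歧", word_prob) -> ['经常', '有意见', '分歧']
```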
 
