Three Dictionary-Based Word Segmentation Algorithms

CIPCU

Last modified 2023-10-25 21:27:52

Category: Natural Language Processing. Tags: linux, operations, server

First published 2023-10-25 19:47:28

Article link: https://blog.csdn.net/2301_80340360/article/details/134042307

# View the mounted dataset directory. Changes under this directory are reverted automatically when the environment restarts.
!ls /home/aistudio/data
# Note: ls = list directory contents
# View the personal work directory. Changes under this directory persist across resets. Clean up unnecessary files promptly so the environment loads quickly.
!ls /home/aistudio/work
# For a persistent installation, use a persistent path, as in the example below:
!mkdir /home/aistudio/external-libraries
!pip install beautifulsoup4 -t /home/aistudio/external-libraries
# Note: mkdir = make directory; -t = target (the directory to install into)

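Dictionary-based word segmentation algorithms are also called rule-based word segmentation algorithms. Add appropriate comments to the program based on the code, analyze what each piece of code does, and write the analysis in the document.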

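import sys
# sys.path is a list of directories that Python searches for modules
sys.path.append('/home/aistudio/external-libraries')

def FMM(dict, sentence):    # Forward Maximum Matching
    fmmresult = []
    max_len = max([len(item) for item in dict])  # length of the longest word in the dictionary
    start = 0
    # FMM runs forward: start begins at the first character, and the loop ends when start reaches the end of the sentence
    while start != len(sentence):
        # index starts at start + max_len, clipped to the end of the sentence
        index = start + max_len
        if index > len(sentence):
            index = len(sentence)
        for _ in range(max_len):
            # Accept the candidate if it is in the dictionary or has shrunk to a single character
            if (sentence[start:index] in dict) or \
                    (len(sentence[start:index]) == 1):
                fmmresult.append(sentence[start:index])
                start = index  # a word has been split off; move start to index
                break
            index -= 1  # forward matching: move index one position toward the start of the sentence
    return fmmresult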
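
To make the matching loop above concrete, here is a minimal sketch that records every candidate substring tried at each position. The helper name `fmm_trace`, the use of a set for lookups, and the tiny dictionary are assumptions made for illustration; they are not part of the original exercise.

```python
def fmm_trace(words, sentence):
    """Forward maximum matching that also records each candidate tried.

    Hypothetical helper for illustration only.
    """
    word_set = set(words)  # set lookups avoid scanning a list on every test
    max_len = max(len(w) for w in word_set)
    tokens, trace = [], []
    start = 0
    while start < len(sentence):
        # Begin with the longest allowed window, clipped to the sentence end.
        end = min(start + max_len, len(sentence))
        # Shrink from the right until a dictionary word or one character is left.
        while end - start > 1 and sentence[start:end] not in word_set:
            trace.append((sentence[start:end], False))  # rejected candidate
            end -= 1
        trace.append((sentence[start:end], True))  # accepted candidate
        tokens.append(sentence[start:end])
        start = end
    return tokens, trace


tokens, trace = fmm_trace(['燕山大学', '读书', '在'], '在燕山大学读书')
for candidate, accepted in trace:
    print(candidate, 'accepted' if accepted else 'rejected')
print(tokens)
```

With this input, three candidates starting at '在' are rejected before the single character is accepted, while '燕山大学' and '读书' are accepted on the first try.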

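The code below performs reverse maximum matching (RMM): it scans the sentence from its end toward its start and, at each step, takes the longest dictionary word that ends at the current position: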

def RMM(dict, sentence):    # Reverse Maximum Matching
    rmmresult = []
    # length of the longest word in the dictionary
    max_len = max([len(item) for item in dict])
    start = len(sentence)
    # RMM runs backward: start begins at the end of the sentence, and the loop ends when start reaches the beginning
    while start != 0:
        # index starts at start - max_len, clipped to the beginning of the sentence
        index = start - max_len
        if index < 0:
            index = 0
        for _ in range(max_len):
            # Accept the candidate if it is in the dictionary or has shrunk to a single character
            if (sentence[index:start] in dict) or \
                    (len(sentence[index:start]) == 1):
                # print(sentence[index:start], end='/')
                rmmresult.insert(0, sentence[index:start])
                start = index  # a word has been split off; move start to index
                break
            index += 1  # reverse matching: move index one position toward the end of the sentence
    return rmmresult
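
Forward and reverse matching do not always agree, which is the reason for combining them. The sketch below uses compact re-implementations of both directions (the names `forward_max_match` and `reverse_max_match`, and the small dictionary, are assumptions for illustration) on an ambiguous sentence.

```python
def forward_max_match(words, sentence):
    """Compact forward maximum matching (illustrative re-implementation)."""
    word_set = set(words)
    max_len = max(len(w) for w in word_set)
    tokens, start = [], 0
    while start < len(sentence):
        end = min(start + max_len, len(sentence))
        # Shrink from the right until a dictionary word or one character is left.
        while end - start > 1 and sentence[start:end] not in word_set:
            end -= 1
        tokens.append(sentence[start:end])
        start = end
    return tokens


def reverse_max_match(words, sentence):
    """Compact reverse maximum matching (illustrative re-implementation)."""
    word_set = set(words)
    max_len = max(len(w) for w in word_set)
    tokens, end = [], len(sentence)
    while end > 0:
        start = max(end - max_len, 0)
        # Shrink from the left until a dictionary word or one character is left.
        while end - start > 1 and sentence[start:end] not in word_set:
            start += 1
        tokens.append(sentence[start:end])
        end = start
    tokens.reverse()  # tokens were collected from back to front
    return tokens


words = ['研究', '研究生', '生命', '的', '起源']
sentence = '研究生命的起源'
print(forward_max_match(words, sentence))  # ['研究生', '命', '的', '起源']
print(reverse_max_match(words, sentence))  # ['研究', '生命', '的', '起源']
```

The forward pass greedily takes '研究生' and leaves '命' stranded as a single character, while the reverse pass recovers '生命'.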

Describe what the following code segment does.

def BM(dict, sentence):    # Bi-directional Maximum Matching
    res1 = FMM(dict, sentence)  # res1 is the FMM result
    res2 = RMM(dict, sentence)  # res2 is the RMM result
    if len(res1) == len(res2):
        if res1 == res2:  # FMM and RMM agree, so return either one
            return res1
        else:
            # res1_sn and res2_sn count the single-character tokens in each result; return the result with fewer
            res1_sn = len([i for i in res1 if len(i) == 1])
            res2_sn = len([i for i in res2 if len(i) == 1])
            return res1 if res1_sn < res2_sn else res2
    else:
        # The token counts differ, so return the result with fewer tokens
        return res1 if len(res1) < len(res2) else res2
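
The selection rule in BM can be tried on its own, without running the matchers. The sketch below applies the same rule to two hand-written candidate segmentations; the helper name `pick_segmentation` and the two token lists are assumptions for illustration.

```python
def pick_segmentation(forward, backward):
    """Choose between two candidate segmentations (hypothetical helper)."""
    if len(forward) != len(backward):
        # Different token counts: prefer the segmentation with fewer tokens.
        return forward if len(forward) < len(backward) else backward
    if forward == backward:
        return forward
    # Same token count: prefer fewer single-character tokens.
    # On a tie, the backward result is returned, matching BM above.
    singles_forward = sum(len(t) == 1 for t in forward)
    singles_backward = sum(len(t) == 1 for t in backward)
    return forward if singles_forward < singles_backward else backward


forward = ['研究生', '命', '的', '起源']   # two single-character tokens
backward = ['研究', '生命', '的', '起源']  # one single-character token
print(pick_segmentation(forward, backward))
```

Both candidates contain four tokens, so the single-character count decides, and the backward candidate is chosen.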

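Add code that calls the three segmentation algorithms provided above to segment the sentence “我在燕山大学读书，专业是软件工程。” ("I study at Yanshan University, majoring in software engineering.").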

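# Add your code here

# The code should define the dictionary, define the sentence to segment, and call each of the three segmentation functions and print its result.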

dict = ['我', '在', '燕山大学', '读书', '专业', '是', '软件', '工程', '软件工程']
sentence = '我在燕山大学读书，专业是软件工程'

print("the results of FMM :\n", FMM(dict, sentence), end="\n")

print("the results of RMM :\n", RMM(dict, sentence), end="\n")

print("the results of BM :\n", BM(dict, sentence))
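
As a quick sanity check on a segmentation result, the sketch below verifies that the tokens concatenate back to the input and lists the tokens that are not dictionary words. The helper name `check_segmentation` is an assumption, and the token list is a hand-derived expectation for this dictionary and sentence rather than captured program output.

```python
def check_segmentation(tokens, sentence, words):
    """Return the tokens that are not dictionary words (hypothetical helper).

    Raises ValueError if the tokens do not concatenate back to the sentence.
    """
    if ''.join(tokens) != sentence:
        raise ValueError('segmentation does not reproduce the sentence')
    word_set = set(words)
    return [t for t in tokens if t not in word_set]


words = ['我', '在', '燕山大学', '读书', '专业', '是', '软件', '工程', '软件工程']
sentence = '我在燕山大学读书，专业是软件工程'
# Hand-derived expected segmentation for this dictionary and sentence.
tokens = ['我', '在', '燕山大学', '读书', '，', '专业', '是', '软件工程']

print('/'.join(tokens))
print(check_segmentation(tokens, sentence, words))  # tokens outside the dictionary
```

The full-width comma is not in the dictionary, so it comes through as a single-character token, which is the fallback case in the matching functions.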
