Word segmentation is the first step in Chinese NLP. In English, the word is already the natural unit of expression: an English text is simply a sequence of words separated by delimiters (spaces). In Chinese, the basic written unit is the character, yet the semantics of a text are still expressed through words. Processing Chinese text therefore requires segmentation, converting each sentence into a sequence of words. This is Chinese word segmentation: the computer must automatically identify the words in a sentence and insert boundary markers between them. The main difficulty is segmentation ambiguity. For example, "南京市长江大桥" (Nanjing Yangtze River Bridge) could be segmented as "南京市 / 长江大桥" (Nanjing City / Yangtze River Bridge) or as "南京 / 市长 / 江大桥" (Nanjing / mayor / Jiang Daqiao).
Chinese word segmentation techniques fall into three main categories: rule-based segmentation, statistical segmentation, and hybrid segmentation.
Rule-based segmentation relies on a manually built dictionary and matches substrings of the input against it according to a fixed strategy. It is simple and efficient to implement, but handles out-of-vocabulary (new) words poorly.
- Forward maximum matching (Maximum Match Method)
Basic idea: suppose the longest word in the dictionary has i characters. Take the first i characters of the current string as the match field and look it up in the dictionary. If the dictionary contains an i-character word equal to the match field, the match succeeds; otherwise, remove the last character of the match field and match the shortened field again. Repeat until a match succeeds, i.e. a word is cut off, or the remaining string has length zero.
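The greedy loop above can be sketched in a few lines. The toy dictionary and `max_len` value below are illustrative assumptions, not from the text:

```python
def fmm(text, dictionary, max_len=4):
    """Forward maximum match: repeatedly take the longest
    dictionary word that starts at the current position."""
    words = []
    while text:
        piece = text[:max_len]
        # shrink the candidate from the right until it is a known word
        while piece not in dictionary and len(piece) > 1:
            piece = piece[:-1]
        words.append(piece)  # an unknown single character falls through as-is
        text = text[len(piece):]
    return words

toy_dict = {"南京市", "南京", "市长", "长江大桥", "长江", "大桥"}
print(fmm("南京市长江大桥", toy_dict))  # ['南京市', '长江大桥']
```

With this particular dictionary the greedy left-to-right scan happens to resolve the ambiguity correctly: "南京市长" is not a word, so the candidate shrinks to "南京市", and the rest matches "长江大桥" whole.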
- Reverse maximum matching (Reverse Maximum Match Method)
Basic idea: again suppose the longest dictionary word has i characters. Scan from the end of the document, each time taking the last i characters as the match field. If the match fails, remove the first character of the match field and match again, until a match succeeds, i.e. a word is cut off, or the remaining string has length zero.
A practical implementation uses a reversed dictionary, in which every entry is stored with its characters in reverse order. The document is first reversed as well, and forward maximum matching is then applied to the reversed document against the reversed dictionary.
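Reverse matching can also be sketched directly on the original string, without building a reversed dictionary. The example uses the classic sentence "研究生命的起源" ("study the origin of life"), where forward and reverse matching disagree; the toy dictionary and `max_len` are illustrative assumptions:

```python
def rmm(text, dictionary, max_len=3):
    """Reverse maximum match: scan from the end of the string,
    taking the longest dictionary word that ends at the current position."""
    words = []
    while text:
        piece = text[-max_len:]
        # shrink the candidate from the left until it is a known word
        while piece not in dictionary and len(piece) > 1:
            piece = piece[1:]
        words.append(piece)
        text = text[:-len(piece)]
    return words[::-1]  # words were collected back-to-front

toy_dict = {"研究", "研究生", "生命", "的", "起源"}
print(rmm("研究生命的起源", toy_dict))  # ['研究', '生命', '的', '起源']
```

Forward matching on the same input greedily cuts "研究生" ("graduate student") first and is then forced into the single character "命", whereas the reverse scan recovers the intended "研究 / 生命 / 的 / 起源".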
- Bidirectional maximum matching (Bi-direction Maximum Match Method)
Basic idea: compare the segmentation produced by forward maximum matching with the one produced by reverse maximum matching and, following the maximum-matching principle, choose the one that contains fewer words. If both contain the same number of words: when the two segmentations are identical, either may be returned; when they differ, return the one with fewer single-character words.
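The selection rule can be sketched as a small self-contained function combining both scans. The toy dictionary and `max_len` are illustrative assumptions:

```python
def fmm(text, dictionary, max_len=3):
    """Forward maximum match (left to right)."""
    words = []
    while text:
        piece = text[:max_len]
        while piece not in dictionary and len(piece) > 1:
            piece = piece[:-1]
        words.append(piece)
        text = text[len(piece):]
    return words

def rmm(text, dictionary, max_len=3):
    """Reverse maximum match (right to left)."""
    words = []
    while text:
        piece = text[-max_len:]
        while piece not in dictionary and len(piece) > 1:
            piece = piece[1:]
        words.append(piece)
        text = text[:-len(piece)]
    return words[::-1]

def bimm(text, dictionary, max_len=3):
    """Bidirectional MM: prefer the segmentation with fewer words;
    on a tie, prefer the one with fewer single-character words."""
    f = fmm(text, dictionary, max_len)
    r = rmm(text, dictionary, max_len)
    if len(f) != len(r):
        return f if len(f) < len(r) else r
    singles = lambda ws: sum(len(w) == 1 for w in ws)
    return f if singles(f) <= singles(r) else r

toy_dict = {"研究", "研究生", "生命", "的", "起源"}
print(bimm("研究生命的起源", toy_dict))  # ['研究', '生命', '的', '起源']
```

Here both directions produce four words, but the forward result contains two single characters ("命", "的") against one in the reverse result, so the reverse segmentation is chosen.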
Code implementation (forward and reverse maximum matching over files):
```python
# -*- coding: utf-8 -*-
"""
Created on Fri Dec 25 09:32:58 2020
"""

class BIMM(object):
    def __init__(self, dic_path):
        # read the training text and build the dictionary
        with open(dic_path, 'r', encoding='utf-8') as f:
            file_content = f.read().split()
        self.dictionary = set(file_content)
        self.maximum = 5  # assumed maximum word length

    # forward segmentation
    def zcut(self, text_path, result_path):
        result_list = []
        with open(text_path, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        for line in lines:
            index = len(line)
            while index > 0:
                piece = line[0:self.maximum]
                # shrink the candidate from the right until it is a known word
                while piece not in self.dictionary:
                    if len(piece) > 1:
                        piece = piece[0:len(piece) - 1]
                    else:
                        break
                result_list.append(piece)
                line = line[len(piece):]
                index = len(line)
        with open(result_path, 'w', encoding='utf-8') as result:
            for t in result_list:
                if t == '\n':
                    result.write('\n')
                else:
                    result.write(t + " ")

    # reverse segmentation
    def fcut(self, text_path, result_path):
        with open(text_path, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        with open(result_path, 'w', encoding='utf-8') as result:
            for line in lines:
                result_list = []
                index = len(line)
                while index > 0:
                    piece = line[-self.maximum:]
                    # shrink the candidate from the left until it is a known word
                    while piece not in self.dictionary:
                        if len(piece) > 1:
                            piece = piece[-(len(piece) - 1):]
                        else:
                            break
                    result_list.append(piece)
                    line = line[:index - len(piece)]
                    index = len(line)
                # words were collected back-to-front; restore reading order
                result_list = result_list[::-1]
                for t in result_list:
                    if t == '\n':
                        result.write('\n')
                    else:
                        result.write(t + " ")


def main():
    # training corpus
    traindata = './segdata1/train.txt'
    # test corpus
    testdata = './segdata1/test.txt'
    # forward segmentation output
    zhengxiangresult = './segdata1/test_sc_zhengxiang.txt'
    # reverse segmentation output
    fanxiangresult = './segdata1/test_sc_fanxiang.txt'
    tokenizer = BIMM(traindata)
    # tokenizer.zcut(testdata, zhengxiangresult)
    tokenizer.fcut(testdata, fanxiangresult)


if __name__ == '__main__':
    main()
```
The bidirectional version refactors `zcut` and `fcut` to operate on single strings and adds `scut`, which compares the two results:

```python
# -*- coding: utf-8 -*-
"""
Created on Fri Dec 25 09:32:58 2020
@author: vip
"""

class BIMM(object):
    def __init__(self, dic_path):
        # read the training text and build the dictionary
        with open(dic_path, 'r', encoding='utf-8') as f:
            file_content = f.read().split()
        self.dictionary = set(file_content)
        self.maximum = 5  # assumed maximum word length

    # forward maximum matching on a single string
    def zcut(self, text):
        result_list = []
        index = len(text)
        while index > 0:
            piece = text[0:self.maximum]
            # shrink the candidate from the right until it is a known word
            while piece not in self.dictionary:
                if len(piece) > 1:
                    piece = piece[0:len(piece) - 1]
                else:
                    break
            result_list.append(piece)
            text = text[len(piece):]
            index = len(text)
        return result_list

    # reverse maximum matching on a single string
    def fcut(self, text):
        result_list = []
        index = len(text)
        while index > 0:
            piece = text[-self.maximum:]
            # shrink the candidate from the left until it is a known word
            while piece not in self.dictionary:
                if len(piece) > 1:
                    piece = piece[-(len(piece) - 1):]
                else:
                    break
            result_list.append(piece)
            text = text[:index - len(piece)]
            index = len(text)
        return result_list

    # bidirectional segmentation: compare forward and reverse results
    def scut(self, text_path, result_path):
        with open(text_path, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        with open(result_path, 'w', encoding='utf-8') as result:
            for line in lines:
                zr = self.zcut(line)
                fr = self.fcut(line)[::-1]  # restore reading order
                if len(zr) != len(fr):
                    # fewer words wins
                    result_list = zr if len(zr) < len(fr) else fr
                elif zr == fr:
                    result_list = zr
                else:
                    # same word count but different cuts:
                    # fewer single-character words wins
                    zr_count = sum(1 for z in zr if len(z) == 1)
                    fr_count = sum(1 for w in fr if len(w) == 1)
                    result_list = fr if zr_count > fr_count else zr
                for t in result_list:
                    if t == '\n':
                        result.write('\n')
                    else:
                        result.write(t + " ")


def main():
    # training corpus
    traindata = './segdata1/train.txt'
    # test corpus
    testdata = './segdata1/test.txt'
    # bidirectional segmentation output
    shuangxiangresult = './segdata1/test_sc_shuangxiang.txt'
    tokenizer = BIMM(traindata)
    tokenizer.scut(testdata, shuangxiangresult)


if __name__ == '__main__':
    main()
```