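Word segmentation
Common word segmentation methods include:
- Rule-based segmentation: forward matching, backward matching, bidirectional matching
- Statistical segmentation: language-model-based, sequence-model-based
- Hybrid segmentation: combines several of the methods above
Rule-based segmentation
Rule-based segmentation relies on a maintained dictionary: when a sentence is split, its substrings are matched against the dictionary entries one by one to decide where the word boundaries fall. It is a fairly mechanical way of segmenting text.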
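To make the dictionary-matching step concrete, here is a minimal sketch that lists every prefix of a sentence found in a small dictionary. The names `toy_dict` and `matching_prefixes` are illustrative assumptions and not part of the original code. Maximum matching keeps only the longest such prefix.

```python
# Hypothetical toy dictionary (illustration only); a set gives fast membership checks.
toy_dict = {"南京", "南京市", "南京市长", "市长", "长江大桥"}


def matching_prefixes(sentence, dictionary):
    """Return every prefix of `sentence` that is a dictionary word, shortest first."""
    # No dictionary word is longer than this, so longer prefixes need not be tried.
    longest = max(len(word) for word in dictionary)
    limit = min(longest, len(sentence))
    return [sentence[:n] for n in range(1, limit + 1) if sentence[:n] in dictionary]


print(matching_prefixes("南京市长江大桥", toy_dict))
```

Several prefixes can match the same position, so a rule-based segmenter needs a policy for choosing among them. Preferring the longest match is one such policy.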
my_dict = ["江大桥", "研究", "生命科学", "南京市", "研究生", "大桥", "科学", "课题", "南京市长", "生命", "长江大桥", "南京", "市长"]
max_length = max([len(word) for word in my_dict])
Forward matching: MM (maximum match)
def word_cut_mm(sentence):
    """Forward maximum matching."""
    sentence = sentence.strip()
    word_length = len(sentence)
    cut_word_list = []
    while word_length > 0:
        # Start with the longest candidate the dictionary could contain.
        max_cut_length = min(max_length, word_length)
        sub_sentence = sentence[:max_cut_length]
        while max_cut_length > 0:
            # Accept a dictionary hit, or fall back to a single character.
            if sub_sentence in my_dict or max_cut_length == 1:
                cut_word_list.append(sub_sentence)
                break
            else:
                # No match: shorten the candidate by one character and retry.
                max_cut_length = max_cut_length - 1
                sub_sentence = sentence[:max_cut_length]
        word_leng