Table of Contents
1. Data Preparation
1.1 Building the Corpus
If no corpus file (such as corpus.txt) is provided, one can be built from the training and test set data. The code is shown below (code file name: ):
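As a compact illustration of the same idea (merge several text files and keep each distinct line once), here is a minimal sketch written as a reusable function. The function name `build_corpus`, its parameters, the UTF-8 encoding, and the choice to keep lines in first-seen order are all assumptions made for this example; the original script uses a plain `set` and does not specify an encoding.

```python
from typing import Iterable


def build_corpus(input_paths: Iterable[str], output_path: str) -> int:
    """Merge text files into one corpus file, dropping duplicate lines.

    Hypothetical helper for illustration; returns the number of unique
    lines written.
    """
    seen = set()          # lines already collected
    unique_lines = []     # unique lines in first-seen order (an assumption)
    for path in input_paths:
        # Assumption: the data files are UTF-8 encoded.
        with open(path, 'r', encoding='utf-8') as f:
            for line in f:
                # The last line of a file may lack a trailing newline;
                # normalize it so that identical lines compare equal.
                if not line.endswith('\n'):
                    line += '\n'
                if line not in seen:
                    seen.add(line)
                    unique_lines.append(line)
    with open(output_path, 'w', encoding='utf-8') as f:
        f.writelines(unique_lines)
    return len(unique_lines)
```

With the paths used in this article, a call might look like `build_corpus(['../../data/raw/train.txt', '../../data/raw/test.txt'], '../../data/raw/corpus.txt')`. The script from the original post follows.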
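# Collect the distinct lines from both files; the set removes duplicates.
filtered_line = set()
with open('../../data/raw/train.txt', 'r') as f:
    line = f.readline()
    while line:
        # The last line may lack a trailing newline; add one so that
        # identical lines compare equal.
        if line[-1] != '\n':
            line += '\n'
        filtered_line.add(line)
        line = f.readline()
# Repeat the same steps for the test set.
with open('../../data/raw/test.txt', 'r') as f:
    line = f.readline()
    while line:
        if line[-1] != '\n':
            line += '\n'
        filtered_line.add(line)
        line = f.readline()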
with open('../../data/raw/corpus.txt', 'w')