Using the 20_newsgroup dataset as the training set and loading pretrained GloVe weights to train a model
Preprocessing the 20_newsgroup dataset
Load sample
Preview file folder
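First download news20.tar.gz and extract it into the directory where the Jupyter notebook runs. A folder named 20_newsgroup will appear; this is our training set. Inside it are 20 subfolders, one per newsgroup, each covering a different topic. Later on, each text is labeled according to the folder it came from.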
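The extraction step can also be done from inside the notebook. The following is a minimal sketch using the standard-library `tarfile` module; the helper name `extract_archive` is hypothetical, and it assumes news20.tar.gz has already been downloaded into the working directory.

```python
import os
import tarfile

def extract_archive(archive_path, dest_dir='.'):
    """Extract a .tar.gz archive into dest_dir and return its top-level entry names."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        tar.extractall(dest_dir)
        names = tar.getnames()
    # The first path component of every member is a top-level entry
    top_level = {os.path.normpath(n).split(os.sep)[0] for n in names}
    top_level.discard('.')
    return sorted(top_level)

# Assumption: the archive sits next to the notebook
if os.path.exists('news20.tar.gz'):
    print(extract_archive('news20.tar.gz'))
```

If the archive is laid out as described above, the printed list should contain the 20_newsgroup folder.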
Define the path to 20_newsgroup folder
import os
from os.path import join
Base_dir = '.'
Text_dir = join(Base_dir,'20_newsgroup')
print(Text_dir)
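Before loading anything, it can help to confirm that the path really points at the extracted dataset. A small sketch, assuming the same relative location as `Text_dir`; the helper `list_category_folders` is a hypothetical name introduced for illustration.

```python
import os

def list_category_folders(root):
    """Return the sorted names of the subdirectories directly under root."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )

text_dir = os.path.join('.', '20_newsgroup')  # same location as Text_dir
if os.path.isdir(text_dir):
    folders = list_category_folders(text_dir)
    # Print how many category folders were found, plus the first few names
    print(len(folders), folders[:3])
```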
Load data from all the child folders in 20_newsgroup
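texts, labels, labels_index = [], {}, []
# texts: after the loop below, a list holding the text of every file
# labels: a dict that assigns a number to each folder
# labels_index: the numeric label of each file's text; this list has the same length as texts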
for name in sorted(os.listdir(Text_dir)):
    # every folder under the root folder is labeled with a unique number
    labels[name] = len(labels)
    path = join(Text_dir, name)
    for fname in sorted(os.listdir(path)):
        if fname.isdigit():  # the training files all have purely numeric names
            fpath = join(path, fname)
            labels_index.append(labels[name])
            # skip header
            with open(fpath, encoding='latin-1') as f:
                t = f.read()
            i = t.find('\n\n')
            if i > 0:
                t = t[i:]  # drop each article's header, which is of no use here
            texts.append(t)
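To see what the header-skipping step does, the same `find('\n\n')` logic can be run on a single string. This is a minimal sketch: `strip_header` is a hypothetical helper and the sample message is made up for illustration, not taken from the dataset.

```python
def strip_header(raw):
    """Drop everything before the first blank line, mirroring the header skip in the loop."""
    i = raw.find('\n\n')
    # find() returns -1 when there is no blank line; the text is then kept whole
    return raw[i:] if i > 0 else raw

# Made-up message: two header lines, a blank line, then the body
sample = "From: someone@example.com\nSubject: demo\n\nBody of the post."
print(repr(strip_header(sample)))  # → '\n\nBody of the post.'
```

Note that the slice starts at the blank line itself, so the two newline characters stay at the front of the text; calling `.strip()` on the result would remove them if they are unwanted.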