Training a model on the 20_newsgroup dataset with pretrained GloVe weights

Jason24_Zeng

Category: DL. Tags: deep learning, tensorflow, natural language processing, python, machine learning

Original link: https://github.com/BrambleXu/nlp-beginner-guide-keras

Preprocessing the 20_newsgroup dataset
Building the Embedding layer with GloVe weights

Preprocessing the 20_newsgroup dataset

Load sample

Preview file folder

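First download news20.tar.gz and extract it into the directory where the Jupyter notebook runs. This produces a folder named 20_newsgroup, which is our training set. Inside it are 20 subfolders, one per newsgroup, each covering a different topic. Later, every text is labeled according to the subfolder it came from.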

Define the path to 20_newsgroup folder

import os
from os.path import join
Base_dir = '.'
Text_dir = join(Base_dir,'20_newsgroup')
print(Text_dir)
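
Before loading anything, it can help to confirm that the folder layout matches the description above. The following is a minimal sketch, not part of the original notebook: the helper name `count_documents` is hypothetical, and it assumes the archive was extracted to `./20_newsgroup` as in the path definition above.

```python
import os
from os.path import join


def count_documents(text_dir):
    """Map each subfolder name to the number of digit-named files inside it.

    Hypothetical helper for inspection only; not from the original notebook.
    """
    counts = {}
    for name in sorted(os.listdir(text_dir)):
        path = join(text_dir, name)
        if not os.path.isdir(path):
            continue  # ignore stray files that sit next to the newsgroup folders
        counts[name] = sum(1 for fname in os.listdir(path) if fname.isdigit())
    return counts


if __name__ == "__main__":
    # Assumed location, matching Base_dir = '.' in the notebook
    text_dir = join('.', '20_newsgroup')
    if os.path.isdir(text_dir):
        for name, n in count_documents(text_dir).items():
            print(f"{name}: {n} files")
    else:
        print(f"{text_dir} not found; extract news20.tar.gz here first")
```

Each key of the returned dict is one newsgroup folder, so `len(count_documents(Text_dir))` gives the number of classes the model will later have to distinguish.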

Load data from all the child folders in 20_newsgroup

texts, labels, labels_index = [], {}, []
# texts: after the loop below, a list holding the text of every file
# labels: a dict mapping each folder name to a numeric id
# labels_index: the numeric label of each file's text; this list has the same length as texts
for name in sorted(os.listdir(Text_dir)):
    # every folder under the root folder gets a unique number
    labels[name] = len(labels)
    path = join(Text_dir, name)
    for fname in sorted(os.listdir(path)):
        if fname.isdigit():  # the training files all have purely numeric names
            fpath = join(path, fname)
            labels_index.append(labels[name])
            # skip header
            with open(fpath, encoding='latin-1') as f:
                t = f.read()
            i = t.find('\n\n')
            if i > 0:
                t = t[i:]  # drop the header, which is not useful for training
            texts.append(t)
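
The loading loop can also be wrapped in a function so it is easy to rerun and to sanity-check. This is a sketch under assumptions rather than the original author's code: the function name `load_newsgroups` is hypothetical, and the `isdir` guard is an addition that skips any stray files in the root folder.

```python
import os
from os.path import join


def load_newsgroups(text_dir):
    """Return (texts, labels, labels_index) for a 20_newsgroup-style folder.

    texts: list of document bodies with the header removed
    labels: dict mapping folder name -> numeric id
    labels_index: numeric id for each entry of texts
    """
    texts, labels, labels_index = [], {}, []
    for name in sorted(os.listdir(text_dir)):
        path = join(text_dir, name)
        if not os.path.isdir(path):
            continue  # added guard: only folders count as classes
        labels[name] = len(labels)
        for fname in sorted(os.listdir(path)):
            if not fname.isdigit():
                continue  # keep only the digit-named article files
            with open(join(path, fname), encoding='latin-1') as f:
                t = f.read()
            i = t.find('\n\n')  # the first blank line ends the header
            if i > 0:
                t = t[i:]
            texts.append(t)
            labels_index.append(labels[name])
    return texts, labels, labels_index


if __name__ == "__main__":
    text_dir = join('.', '20_newsgroup')  # assumed location
    if os.path.isdir(text_dir):
        texts, labels, labels_index = load_newsgroups(text_dir)
        # The two lists must stay aligned, one label per document
        assert len(texts) == len(labels_index)
        print(f"Found {len(texts)} texts in {len(labels)} classes")
```

Checking that `texts` and `labels_index` have the same length is a cheap way to catch a misaligned append before the data reaches the tokenizer.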
