Dataset Exploration
In summary, it comes down to three steps:
- Read the dataset files
- Build the mappings between IDs and strings
- Pad the sequences
This post uses the THUCNews dataset.
1. Reading the dataset files
The `encoding` parameter mainly handles character-encoding problems; if no error is raised without it, it can be omitted.
```python
# Read the vocabulary
with open("./dataset/cnews/cnews.vocab.txt", encoding='UTF-8') as f:
    vocab = f.readlines()
vocab = [word.strip() for word in vocab]
# Read the data (the training set is fairly large, so the test set is used here)
with open("./dataset/cnews/cnews.test.txt", encoding='UTF-8') as f:
    test_data = f.readlines()
contents = []
labels = []
for line in test_data:
    label, content = line.strip().split("\t")
    contents.append(content)
    labels.append(label)
```
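The parsing step above can be sketched on synthetic data without the dataset on disk. The two sample lines below are made up for illustration, not taken from THUCNews; they only mimic the file's `label<TAB>content` layout:

```python
# Made-up sample lines in the same "label<TAB>content" format as cnews.test.txt
sample_lines = [
    "体育\t昨天的比赛非常精彩",
    "科技\t新款手机今日发布",
]

contents = []
labels = []
for line in sample_lines:
    # Each line holds the category label and the news text, separated by a tab
    label, content = line.strip().split("\t")
    contents.append(content)
    labels.append(label)

print(labels)    # the category labels
print(contents)  # the news text
```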
2. Building the mapping between characters and IDs
```python
# Build the word-to-ID index
# Trick: dict(zip(word_list, number_list)) pairs them up into a dict automatically
word_to_id = dict(zip(vocab, range(len(vocab))))
reverse_word_id = dict(zip(range(len(vocab)), vocab))
# Build the category-to-ID index
categories = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐']
categories_to_id = dict(zip(categories, range(len(categories))))
# Convert every character of each string in the dataset to its vocabulary ID
labels_id = [categories_to_id[label] for label in labels]
contents_id = []
for content in contents:
    contents_id.append([word_to_id[char] for char in content if char in word_to_id])
```
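The `dict(zip(...))` trick and the character-to-ID conversion can be seen on a toy example. The four-character vocabulary and the sample sentence below are assumptions made up for illustration:

```python
# Made-up toy vocabulary; index 0 is reserved for the padding character <PAD>
vocab = ["<PAD>", "天", "气", "好"]
word_to_id = dict(zip(vocab, range(len(vocab))))      # character -> ID
reverse_word_id = dict(zip(range(len(vocab)), vocab))  # ID -> character

# Characters missing from the vocabulary are silently dropped, as in step 2
content = "天气真好"
content_id = [word_to_id[char] for char in content if char in word_to_id]
print(content_id)  # [1, 2, 3] — '真' is not in the vocab and is skipped
```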
3. Padding the sequences
Since one-hot encoding wastes too much memory, sequence padding is used instead.
```python
import tensorflow as tf
from tensorflow import keras
test_data = contents_id
# test_data is already a list of ID lists; 0 represents the padding character <PAD>
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=0,
                                                       padding='post',
                                                       maxlen=1800)
```
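The effect of `pad_sequences` with `padding='post'` can be mimicked in plain Python. This is only a sketch, not the Keras implementation, and `pad_post` is a made-up helper name; note that Keras truncates over-long sequences from the front by default (`truncating='pre'`), which the sketch reproduces:

```python
# Sketch of what pad_sequences(..., value=0, padding='post') does.
# pad_post is a hypothetical helper, not a Keras API.
def pad_post(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # truncate from the front, matching Keras's default
        padded.append(seq + [value] * (maxlen - len(seq)))  # pad at the end
    return padded

print(pad_post([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[5, 8, 0, 0], [3, 4, 5, 6]]
```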
References:
https://tensorflow.google.cn/tutorials/keras/basic_text_classification
https://blog.csdn.net/u011439796/article/details/77692621