Dataset Exploration
In summary, it comes down to three steps:
- Read the dataset files
- Build the mappings between IDs and strings
- Pad the sequences
This post uses the THUCNews dataset.
1. Reading the dataset files
The `encoding` parameter mainly handles character-encoding problems; if no error is raised without it, it can be omitted.
```python
# Read the vocabulary
with open("./dataset/cnews/cnews.vocab.txt", encoding='UTF-8') as f:
    vocab = f.readlines()
vocab = [word.strip() for word in vocab]
# Read the data (the training set is fairly large, so the test set is used here)
with open("./dataset/cnews/cnews.test.txt", encoding='UTF-8') as f:
    test_data = f.readlines()
contents = []
labels = []
for line in test_data:
    label, content = line.strip().split("\t")
    contents.append(content)
    labels.append(label)
```
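The parsing step above can be sketched on synthetic data without the dataset on disk. The two sample lines below are made up for illustration, not taken from THUCNews; they only mimic the file's `label<TAB>content` layout:

```python
# Made-up sample lines in the same "label<TAB>content" format as cnews.test.txt
sample_lines = [
    "体育\t昨天的比赛非常精彩",
    "科技\t新款手机今日发布",
]

contents = []
labels = []
for line in sample_lines:
    # Each line holds the category label and the news text, separated by a tab
    label, content = line.strip().split("\t")
    contents.append(content)
    labels.append(label)

print(labels)    # the category labels
print(contents)  # the news text
```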
2. Building the mapping between characters and IDs
```python
# Build the word-to-ID index
# Trick: dict(zip(word_list, number_list)) pairs them up into a dict automatically
word_to_id = dict(zip(vocab, range(len(vocab))))
reverse_word_id = dict(zip(range(len(vocab)), vocab))
# Build the category-to-ID index
categories = ['体育', '财经', '房产', '家居', '教育', '科技', '时尚', '时政', '游戏', '娱乐']
categories_to_id = dict(zip(categories, range(len(categories))))
# Convert every character of each string in the dataset to its vocabulary ID
labels_id = [categories_to_id[label] for label in labels]
contents_id = []
for content in contents:
    contents_id.append([word_to_id[char] for char in content if char in word_to_id])
```
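The `dict(zip(...))` trick and the character-to-ID conversion can be seen on a toy example. The four-character vocabulary and the sample sentence below are assumptions made up for illustration:

```python
# Made-up toy vocabulary; index 0 is reserved for the padding character <PAD>
vocab = ["<PAD>", "天", "气", "好"]
word_to_id = dict(zip(vocab, range(len(vocab))))      # character -> ID
reverse_word_id = dict(zip(range(len(vocab)), vocab))  # ID -> character

# Characters missing from the vocabulary are silently dropped, as in step 2
content = "天气真好"
content_id = [word_to_id[char] for char in content if char in word_to_id]
print(content_id)  # [1, 2, 3] — '真' is not in the vocab and is skipped
```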
3. Padding the sequences
Since one-hot encoding wastes too much memory, sequence padding is used instead.
```python
import tensorflow as tf
from tensorflow import keras
test_data = contents_id
# test_data is already a list of ID lists; 0 represents the padding character <PAD>
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=0,
                                                       padding='post',
                                                       maxlen=1800)
```
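The effect of `pad_sequences` with `padding='post'` can be mimicked in plain Python. This is only a sketch, not the Keras implementation, and `pad_post` is a made-up helper name; note that Keras truncates over-long sequences from the front by default (`truncating='pre'`), which the sketch reproduces:

```python
# Sketch of what pad_sequences(..., value=0, padding='post') does.
# pad_post is a hypothetical helper, not a Keras API.
def pad_post(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # truncate from the front, matching Keras's default
        padded.append(seq + [value] * (maxlen - len(seq)))  # pad at the end
    return padded

print(pad_post([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[5, 8, 0, 0], [3, 4, 5, 6]]
```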
References:
https://tensorflow.google.cn/tutorials/keras/basic_text_classification
https://blog.csdn.net/u011439796/article/details/77692621