Text Classification in Practice
Dataset: a movie-review dataset for sentiment analysis (a classification task)
Word-vector model: either load pretrained word vectors or train your own
Sequence model: train an RNN to do the classification
What do we need?
(batch, max_len, embed_dim)
Meaning: (number of samples processed per batch; maximum length, i.e. the number of words in the current input; each word mapped to an n-dimensional vector)
For example: (64, 128, 300)
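The shape above can be sketched with plain NumPy: an embedding lookup turns a (batch, max_len) matrix of word IDs into a (batch, max_len, embed_dim) tensor. The vocabulary size here is a made-up number for illustration.

```python
import numpy as np

# Hypothetical sizes: vocab of 20000 words, batch of 64, sequences padded to 128,
# each word mapped to a 300-dim vector.
vocab_size, batch, max_len, embed_dim = 20000, 64, 128, 300

embedding_matrix = np.random.rand(vocab_size, embed_dim)          # one row per word ID
batch_of_ids = np.random.randint(0, vocab_size, size=(batch, max_len))

# Embedding lookup: index rows of the matrix by word ID.
vectors = embedding_matrix[batch_of_ids]
print(vectors.shape)  # (64, 128, 300)
```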
import os
import warnings
warnings.filterwarnings("ignore")
import tensorflow as tf
import numpy as np
import pprint
import logging
import time
from collections import Counter
from pathlib import Path
from tqdm import tqdm
Load the IMDB movie-review dataset (you can also download it manually and place it in the expected location):
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data()
Downloading data from
https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 2735s 157us/step
x_train.shape
(25000,)
The data comes in already converted to ID mappings; in general, raw data is read in as words and has to be converted to IDs by hand.
x_train[0]
[1, 13, 586, 851, 14, 31, 60, 23, 2863, 2364, 314]
Build the word-to-ID mapping table; all indices are shifted by 3 to leave room for special tokens.
_word2idx = tf.keras.datasets.imdb.get_word_index()
word2idx = {w: i + 3 for w, i in _word2idx.items()}
word2idx['<pad>'] = 0    # this task uses 3 special tokens
word2idx['<start>'] = 1
word2idx['<unk>'] = 2    # out-of-vocabulary words, e.g. place names
idx2word = {i: w for w, i in word2idx.items()}
Downloading data from
https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 44s 27us/step
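The index-shift convention above can be checked on a toy mapping (this is a made-up three-word index, not the real IMDB one): shifting by 3 frees IDs 0–2 for the special tokens, and the inverted table decodes ID sequences back to text.

```python
# Hypothetical raw word index, standing in for imdb.get_word_index().
_word2idx = {'the': 1, 'movie': 2, 'great': 3}

word2idx = {w: i + 3 for w, i in _word2idx.items()}
word2idx['<pad>'], word2idx['<start>'], word2idx['<unk>'] = 0, 1, 2
idx2word = {i: w for w, i in word2idx.items()}

# Decode an ID sequence the same way one would decode an IMDB review.
ids = [1, 4, 5, 6]
print(' '.join(idx2word[i] for i in ids))  # <start> the movie great
```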
Sort the samples by text length:
def sort_by_len(x, y):
    x, y = np.asarray(x), np.asarray(y)
    idx = sorted(range(len(x)), key=lambda i: len(x[i]))
    return x[idx], y[idx]
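A quick sanity check of sort_by_len on toy data (dtype=object is added here so recent NumPy versions accept ragged lists; the IMDB arrays are already object arrays):

```python
import numpy as np

def sort_by_len(x, y):
    x, y = np.asarray(x, dtype=object), np.asarray(y)
    idx = sorted(range(len(x)), key=lambda i: len(x[i]))
    return x[idx], y[idx]

# Toy data: three "reviews" of different lengths with their labels.
x = [[1, 2, 3], [4], [5, 6]]
y = [0, 1, 0]
xs, ys = sort_by_len(x, y)
print([len(s) for s in xs])  # [1, 2, 3]  -- shortest first
print(list(ys))              # [1, 0, 0]  -- labels follow their samples
```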
Save the intermediate results to disk (if the program crashes, we won't have to start over); what gets saved is the text, not the IDs.
x_train, y_train = sort_by_len(x_train, y_train)
x_test, y_test = sort_by_len(x_test, y_test)
def write_file(f_path, xs, ys):
    with open(f_path, 'w', encoding='utf-8') as f:
        for x, y in zip(xs, ys):
            # One "label \t text" line per sample; [1:] drops the leading <start> token.
            f.write(str(y) + '\t' + ' '.join([idx2word[i] for i in x][1:]) + '\n')
write_file('./data/train.txt', x_train, y_train)
write_file('./data/test.txt', x_test, y_test)
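A round-trip sketch of the tab-separated "label \t text" line format written above, using a temporary file and a toy mapping (the path and the three-entry idx2word are hypothetical):

```python
import os
import tempfile

idx2word = {1: '<start>', 4: 'great', 5: 'movie'}   # toy mapping for illustration
xs, ys = [[1, 4, 5]], [1]

path = os.path.join(tempfile.mkdtemp(), 'train.txt')
with open(path, 'w', encoding='utf-8') as f:
    for x, y in zip(xs, ys):
        # Drop the leading <start> token ([1:]) before writing, as above.
        f.write(str(y) + '\t' + ' '.join([idx2word[i] for i in x][1:]) + '\n')

# Read the line back and split it the way the vocab-building loop below does.
with open(path, encoding='utf-8') as f:
    label, words = f.readline().rstrip().split('\t')
print(label, words)  # 1 great movie
```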
Build the vocabulary based on word frequency:
counter = Counter()
with open('./data/train.txt', encoding='utf-8') as f:
    for line in f:
        line = line.rstrip()
        label, words = line.split('\t')
        words = words.split(' ')
        counter.update(words)

words = ['<pad>'] + [w for w, freq in counter.most_common() if freq >= 10]  # keep only sufficiently frequent words in the vocabulary
print('Vocab Size:', len(words))
Path('./vocab').mkdir(exist_ok=True)
with open('./vocab/word.txt', 'w', encoding='utf-8') as f:
    for w in words:
        f.write(w + '\n')
Vocab Size: 20598
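The frequency cutoff can be seen on a toy corpus (threshold 2 here instead of 10): Counter.most_common() yields words in descending frequency, and rare words fall below the cutoff.

```python
from collections import Counter

# Toy corpus standing in for ./data/train.txt.
counter = Counter()
for line in ["good movie", "good film", "good movie indeed"]:
    counter.update(line.split(' '))

# Keep <pad> at index 0, then frequent words in descending-frequency order.
words = ['<pad>'] + [w for w, freq in counter.most_common() if freq >= 2]
print(words)  # ['<pad>', 'good', 'movie']
```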
Build the new word2idx mapping from the saved vocabulary:
word2idx = {}
with open('./vocab/word.txt', encoding='utf-8') as f:
for i