TensorFlow 2.0.0, Lessons 63-71: RNN-Based Text Classification in Practice

Text Classification in Practice
Dataset construction: sentiment analysis (a classification task) on a movie-review dataset
Word-vector model: you can either load pre-trained word vectors or train your own
Sequence network model: train an RNN model to do the classification
What do we need?
(batch, max_len, word2vec)
Meaning: (number of samples processed at a time, maximum sequence length, i.e. the number of words in the current input, and the dimension each word is embedded into)
e.g. (64, 128, 300)
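
To make that shape concrete, here is a minimal sketch (the numbers are just the example values above, not anything required by the task):

import tensorflow as tf

# Hypothetical batch: 64 reviews, each padded/truncated to 128 tokens,
# every token embedded as a 300-dimensional vector.
dummy_batch = tf.zeros([64, 128, 300])
print(dummy_batch.shape)    # (64, 128, 300)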

import os
import warnings
warnings.filterwarnings("ignore")
import tensorflow as tf
import numpy as np
import pprint
import logging
import time
from collections import Counter
from pathlib import Path
from tqdm import tqdm

Load the movie-review dataset (IMDB); you can also download it manually and put it in the expected location.

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data()

Downloading data from
https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17465344/17464789 [==============================] - 2735s 157us/step
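
If the download is too slow, you can fetch imdb.npz yourself and drop it into the Keras cache directory (typically ~/.keras/datasets/); load_data takes a path argument relative to that cache, so a sketch of the manual-download route looks like this:

# Sketch: if imdb.npz has already been placed in ~/.keras/datasets/,
# load_data finds it there and skips the download.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(path='imdb.npz')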

x_train.shape

(25000,)

The loaded data has already been converted into ID mappings. In general, data is read in as raw words and has to be converted into IDs by hand.

x_train[0]

[1, 13, 586, 851, 14, 31, 60, 23, 2863, 2364, 314]

Word-to-ID mapping table; three slots are left free so that special tokens can be added.

_word2idx = tf.keras.datasets.imdb.get_word_index()
word2idx = {w: i + 3 for w, i in _word2idx.items()}
word2idx['<pad>'] = 0		# this task uses 3 special tokens
word2idx['<start>'] = 1
word2idx['<unk>'] = 2		# unknown words, e.g. place names
idx2word = {i: w for w, i in word2idx.items()}

Downloading data from
https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 44s 27us/step
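
As a quick sanity check of the mapping (and of the offset of 3), the first training review can be decoded back into words; this is just a sketch, and the .get fallback to '<unk>' guards against any ID missing from the table:

print(' '.join(idx2word.get(i, '<unk>') for i in x_train[0][:10]))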

Sort the samples by text length

def sort_by_len(x, y):
    x, y = np.asarray(x), np.asarray(y)
    idx = sorted(range(len(x)), key=lambda i: len(x[i]))
    return x[idx], y[idx]
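
Sorting by length puts samples of similar length next to each other, so when batches are later padded to the longest sequence in the batch, far less padding is wasted. A toy sketch of the behaviour (the data here is made up purely for illustration):

# Toy ragged data stored as an object array so numpy accepts unequal lengths.
toy_x = np.array([[1, 2, 3], [4], [5, 6]], dtype=object)
toy_y = np.array([0, 1, 0])
sx, sy = sort_by_len(toy_x, toy_y)
print([len(s) for s in sx])    # [1, 2, 3] -- shortest first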

Save the intermediate result to disk so that you do not have to redo everything if the program crashes; what is saved is the text, not the IDs.

x_train, y_train = sort_by_len(x_train, y_train)
x_test, y_test = sort_by_len(x_test, y_test)

def write_file(f_path, xs, ys):
    with open(f_path, 'w', encoding='utf-8') as f:
        for x, y in zip(xs, ys):
            # one sample per line: "label<TAB>text"; [1:] drops the leading <start> token
            f.write(str(y)+'\t'+' '.join([idx2word[i] for i in x][1:])+'\n')

Path('./data').mkdir(exist_ok=True)		# make sure the output directory exists
write_file('./data/train.txt', x_train, y_train)
write_file('./data/test.txt', x_test, y_test)
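
Each saved line has the layout "label<TAB>review text"; a quick peek at the first line confirms it (sketch):

with open('./data/train.txt', encoding='utf-8') as f:
    print(f.readline().rstrip())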

Build the vocabulary, based on word-frequency statistics

counter = Counter()
with open('./data/train.txt',encoding='utf-8') as f:
    for line in f:
        line = line.rstrip()
        label, words = line.split('\t')
        words = words.split(' ')
        counter.update(words)

words = ['<pad>'] + [w for w, freq in counter.most_common() if freq >= 10]		# keep only words that appear at least 10 times as the vocabulary
print('Vocab Size:', len(words))

Path('./vocab').mkdir(exist_ok=True)

with open('./vocab/word.txt', 'w',encoding='utf-8') as f:
    for w in words:
        f.write(w+'\n')

Vocab Size: 20598
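
The freq >= 10 cutoff drops rare words; a short sketch shows how much of the training text the remaining vocabulary still covers:

# Fraction of all token occurrences covered by the words kept in the vocabulary.
covered = sum(freq for w, freq in counter.items() if freq >= 10)
total = sum(counter.values())
print('Token coverage: %.2f%%' % (100 * covered / total))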

Build the new word-to-ID mapping table

word2idx = {}
with open('./vocab/word.txt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        line = line.rstrip()
        word2idx[line] = i
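
With the rebuilt table, any tokenized sentence can be turned into IDs. The sketch below maps words that are not in the vocabulary to len(word2idx), i.e. one extra "unknown" slot just past the known IDs; this is only one possible convention:

sentence = 'this movie was great'.split()
ids = [word2idx.get(w, len(word2idx)) for w in sentence]
print(ids)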