[Learning Frameworks from Official Examples: TensorFlow/Keras] Text Classification with GloVe Word Embeddings

阿芒Aris

Article link: https://blog.csdn.net/qq_44574333/article/details/109586980

Keras official example link
Tensorflow official example link
Paddle official example link
Pytorch official example link

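Note: this series only aims to help you quickly understand, learn, and independently use these frameworks for deep learning research; please study the theory on your own. The official classic examples of each framework are very well written and well worth studying. Once you fully understand them, modifying them is enough to handle most common related tasks.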

Summary: [Learning Frameworks from Official Examples: Keras] Text classification with GloVe word embeddings

Introduction

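This example shows how to do text classification with Keras starting from raw, unpreprocessed text, using pre-trained GloVe word embeddings. The dataset is 20 Newsgroups (newsgroup posts grouped into 20 topic categories), and a TextVectorization layer handles tokenization and indexing.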

1 Setup

Import the required packages

import numpy as np
import tensorflow as tf
from tensorflow import keras

2 Load the data: Newsgroup20 dataset

Download the dataset

The 20 Newsgroups archive used throughout this example is fetched and extracted by keras.utils.get_file in the loading step below, so no separate download command is needed.

Download the GloVe word embeddings

They can be downloaded locally from this link:
http://nlp.stanford.edu/data/glove.6B.zip

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip

Load the dataset

data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

Inspect the data

import os
import pathlib

data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])

View a sample document

print(open(data_dir / "comp.graphics" / "38987").read())

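As the printed sample shows, each file begins with header lines, and some of them leak the category of the text directly or indirectly, such as comp.graphics. So let's remove the header by dropping the first ten lines of every file.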

samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        # Use a context manager so each file handle is closed after reading
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        lines = content.split("\n")
        # Drop the first ten lines to remove the header
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

Shuffle the data and split it into training and validation sets

# Shuffle the data
seed = 2020
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
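
The shuffle above relies on a detail that is easy to miss: two RandomState objects created with the same seed produce the same permutation for two lists of equal length, so samples and labels stay paired. A minimal sketch with toy data (the "doc-i" strings are made up for illustration):

```python
import numpy as np

# Toy data: label i belongs to sample "doc-i".
samples = ["doc-%d" % i for i in range(10)]
labels = list(range(10))

# Re-create the generator with the same seed before each shuffle,
# so both lists are permuted in exactly the same way.
seed = 2020
np.random.RandomState(seed).shuffle(samples)
np.random.RandomState(seed).shuffle(labels)

# Every sample is still next to its own label.
aligned = all(s == "doc-%d" % l for s, l in zip(samples, labels))

# Hold out the last 20% as the validation set.
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]

print(aligned, len(train_samples), len(val_samples))
```

If the second RandomState were not re-created, the labels would be shuffled with a different permutation and the pairing would be lost.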

3 Create a vocabulary index

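Let's use a TextVectorization layer to index the vocabulary found in the dataset. Later, we will use the same layer instance to vectorize the samples.

The layer will only consider the 20,000 most frequent words, and will truncate or pad every sequence to exactly 200 tokens.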

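# Note: on TensorFlow 2.6 and later the same layer is also exposed as tensorflow.keras.layers.TextVectorization
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization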

vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)

# Fit the vocabulary on the training text
vectorizer.adapt(text_ds)

Inspect the vocabulary

vectorizer.get_vocabulary()[:10]

Build a {word: index} dictionary

voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]

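Check what the vectorizer does: it tokenizes the text, maps each token to its index in the vocabulary, and truncates or pads every sequence to the same length. A uniform length is needed so that samples can be stacked into batches for the CNN model that follows.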

output = vectorizer([["the cat sat on the mat"]])
output.numpy()
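
To make the three steps concrete, here is a rough pure-Python sketch of what a vectorizer of this kind does. It is only an approximation for illustration, not the Keras implementation: the helper names tokenize, build_vocabulary, and vectorize are hypothetical, and the punctuation handling is simplified. It assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, as in the vocabulary shown above.

```python
import re
from collections import Counter

PAD, OOV = "", "[UNK]"  # assumed to sit at indices 0 and 1


def tokenize(text):
    # Lowercase, strip punctuation, split on whitespace (simplified).
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocabulary(texts, max_tokens):
    # Keep the most frequent tokens, leaving room for PAD and OOV.
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return [PAD, OOV] + [w for w, _ in counts.most_common(max_tokens - 2)]


def vectorize(text, word_index, sequence_length):
    # Map tokens to indices (unknown words -> 1), then truncate or pad with 0.
    ids = [word_index.get(tok, 1) for tok in tokenize(text)][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


voc = build_vocabulary(["the cat sat on the mat", "the dog"], max_tokens=10)
word_index = dict(zip(voc, range(len(voc))))

padded = vectorize("The cat flew!", word_index, sequence_length=5)
truncated = vectorize("the cat sat on the mat", word_index, sequence_length=4)
print(padded)     # "flew" is unknown, two pad slots at the end
print(truncated)  # cut off after four tokens
```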

4 Load pre-trained word embeddings

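The GloVe archive contains text-encoded vectors of several sizes: 50, 100, 200, and 300 dimensions. We will use the 100-dimensional ones. Let's build a dictionary that maps each word (a string) to its NumPy vector representation. Adjust path_to_glove_file to wherever glove.6B.100d.txt was extracted on your machine.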

path_to_glove_file = "./input/glove.6B.100d.txt"

embeddings_index = {}
with open(path_to_glove_file,encoding='utf-8') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))
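
Each line of a GloVe text file is a word followed by its space-separated coefficients. The sketch below runs the same parsing loop on a tiny in-memory stand-in for the file; the two 3-dimensional vectors are made up for illustration and are not real GloVe values.

```python
import io

import numpy as np

# A made-up stand-in for glove.6B.100d.txt: "<word> <v1> <v2> ... <vN>" per line.
fake_glove_file = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")

embeddings_index = {}
for line in fake_glove_file:
    # Split off the word once; everything after it is the vector.
    word, coefs = line.split(maxsplit=1)
    embeddings_index[word] = np.fromstring(coefs, "f", sep=" ")

print(sorted(embeddings_index))
print(embeddings_index["cat"].shape, embeddings_index["cat"].dtype)
```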

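
embeddings_index is now a dictionary that maps each word to its 100-dimensional GloVe vector, i.e. 100 continuous values per word.

To use it with the Keras Embedding layer, we build an embedding matrix of the shape that layer expects, (num_tokens, embedding_dim), in which row i holds the vector of the word with index i in the vectorizer's vocabulary.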

num_tokens = len(voc) + 2 # +2 for '' (padding) and '[UNK]' (OOV)
embedding_dim = 100
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
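
A small worked example of the same loop may help show what ends up in the matrix. The vocabulary, the 3-dimensional vectors, and the word "zzyzx" below are made up for illustration; the point is that rows for padding, "[UNK]", and any word missing from the embeddings stay all-zero.

```python
import numpy as np

# Hypothetical toy vocabulary in vectorizer order: padding, OOV, then words.
voc = ["", "[UNK]", "the", "cat", "zzyzx"]
word_index = dict(zip(voc, range(len(voc))))

# Made-up 3-dimensional "pre-trained" vectors; "zzyzx" is deliberately absent.
embeddings_index = {
    "the": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "cat": np.array([-0.5, 0.0, 0.25], dtype="float32"),
}

num_tokens = len(voc) + 2
embedding_dim = 3
hits = 0
misses = 0

embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector  # row i <- vector of word i
        hits += 1
    else:
        misses += 1  # row i stays all-zero

print("Converted %d words (%d misses)" % (hits, misses))
print(embedding_matrix.shape)
```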

Next, load the pre-trained word-vector matrix into an Embedding layer.
Note: to fine-tune the word vectors during training, set the trainable argument to True.

from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False, # freeze the embedding layer so it is not trained
)

5 Build the model

from tensorflow.keras import layers

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

The summary reports 2,000,200 non-trainable params, which is exactly the size of our frozen Embedding layer: (20,000 + 2) × 100.
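
The numbers in the summary can be checked by hand. The sketch below assumes an input length of 200 (the vectorizer's output_sequence_length), a full 20,000-word vocabulary, 20 classes, and the Keras defaults of "valid" padding and a pooling stride equal to the pool size; the helper functions are hypothetical and written only for this arithmetic.

```python
def conv1d_out_len(length, kernel_size):
    # "valid" padding, stride 1
    return length - kernel_size + 1


def maxpool1d_out_len(length, pool_size):
    # "valid" padding, stride equal to pool_size
    return (length - pool_size) // pool_size + 1


def conv1d_params(in_channels, filters, kernel_size):
    return kernel_size * in_channels * filters + filters  # weights + biases


def dense_params(in_units, out_units):
    return in_units * out_units + out_units  # weights + biases


seq_len, embedding_dim, num_tokens, num_classes = 200, 100, 20000 + 2, 20

# Sequence length after each Conv1D(128, 5) / MaxPooling1D(5) step.
lengths = []
length = seq_len
for kind in ["conv", "pool", "conv", "pool", "conv"]:
    length = conv1d_out_len(length, 5) if kind == "conv" else maxpool1d_out_len(length, 5)
    lengths.append(length)

non_trainable = num_tokens * embedding_dim  # frozen Embedding layer
trainable = (
    conv1d_params(embedding_dim, 128, 5)
    + conv1d_params(128, 128, 5)
    + conv1d_params(128, 128, 5)
    + dense_params(128, 128)
    + dense_params(128, num_classes)
)

print(lengths)
print(non_trainable, trainable)
```

GlobalMaxPooling1D then reduces the remaining time steps to a single 128-dimensional vector, which is why the Dense layers do not depend on the sequence length.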

6 Train the model

Vectorize the text

x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

y_train = np.array(train_labels)
y_val = np.array(val_labels)

model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))
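
The model is compiled with sparse_categorical_crossentropy because y_train holds integer class indices rather than one-hot vectors. As a rough NumPy sketch of the idea (not the Keras implementation; the probabilities are made up), the loss for each sample is the negative log of the probability assigned to its true class, which equals ordinary categorical cross-entropy on one-hot labels:

```python
import numpy as np


def sparse_categorical_crossentropy(labels, probs):
    # Pick the predicted probability of the true class for every row.
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype="float64")
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(picked)


# Made-up softmax outputs for two samples over three classes.
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
labels = np.array([0, 1])  # integer class indices, like y_train

losses = sparse_categorical_crossentropy(labels, probs)
mean_loss = losses.mean()

# Same result from one-hot labels and the dense formula.
one_hot = np.eye(3)[labels]
dense_losses = -(one_hot * np.log(probs)).sum(axis=1)

print(losses, mean_loss)
```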

7 Export an end-to-end model

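Wrap the vectorizer and the trained model into an end-to-end model that takes raw strings as input and returns predictions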

string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model.predict(
    [["this message is about computer graphics and 3D modeling"]]
)

class_names[np.argmax(probabilities[0])]
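
The last line turns a row of class probabilities into a class name with argmax. A minimal sketch of that step, extended to list the runner-up as well; the three class names are a subset chosen for illustration and the probabilities are made up, not real model output:

```python
import numpy as np

class_names = ["alt.atheism", "comp.graphics", "sci.space"]  # illustrative subset
probabilities = np.array([[0.05, 0.85, 0.10]])  # made-up output, one row per input

# Most likely class for the first (and only) input.
best = class_names[np.argmax(probabilities[0])]

# Two most likely classes with their probabilities, highest first.
order = np.argsort(probabilities[0])[::-1][:2]
top_k = [(class_names[i], float(probabilities[0][i])) for i in order]

print(best)
print(top_k)
```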

8 Summary

The complete code is as follows

import os
import pathlib
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Embedding
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization


'''Load the data'''
data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)
data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)

samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        # Use a context manager so each file handle is closed after reading
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        lines = content.split("\n")
        # Drop the first ten lines to remove the header
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

'''Shuffle the data and split the dataset'''
# Shuffle the data
seed = 2020
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]

'''Vectorize the text'''
vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)

voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()
y_train = np.array(train_labels)
y_val = np.array(val_labels)

'''Load GloVe into the Embedding layer'''
path_to_glove_file = "./input/glove.6B.100d.txt"
embeddings_index = {}
with open(path_to_glove_file,encoding='utf-8') as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs
num_tokens = len(voc) + 2 # +2 for '' (padding) and '[UNK]' (OOV)
embedding_dim = 100
hits = 0
misses = 0
# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

"""Define the model"""
int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.compile(
    loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))

"""Predict with the model"""
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)

probabilities = end_to_end_model.predict(
    [["this message is about computer graphics and 3D modeling"]]
)
class_names[np.argmax(probabilities[0])]