【从官方案例学框架Tensorflow/Keras】基于词嵌入GloVe的文本分类
注:本系列仅帮助大家快速理解、学习并能独立使用相关框架进行深度学习的研究,理论部分还请自行学习补充,每个框架的官方经典案例写的都非常好,很值得进行学习使用。可以说在完全理解官方经典案例后加以修改便可以解决大多数常见的相关任务。
摘要:【从官方案例学框架Keras】基于词嵌入GloVe的文本分类
目录
Introduction
本例展示了从未预处理的原始数据中使用Keras进行文本分类,所用数据集为IMDB,电影情感分类,使用TextVectorization层来分词和索引
1 Setup
导入所需包
import numpy as np
import tensorflow as tf
from tensorflow import keras
2 Load the data: IMDB movie review sentiment classification
- 数据集下载
本地可通过此链接下载
https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
- glove词嵌入矩阵下载
本地可通过此链接下载
http://nlp.stanford.edu/data/glove.6B.zip
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
载入数据集
data_path = keras.utils.get_file(
"news20.tar.gz",
"http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
untar=True,
)
查看数据
import os
import pathlib
data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)
fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])
查看示例数据
print(open(data_dir / "comp.graphics" / "38987").read())
正如上面所示,部分文本的标题直接或间接的泄露了该文本的类别,如comp.graphics。因此,让我们去除标题部分
samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
class_names.append(dirname)
dirpath = data_dir / dirname
fnames = os.listdir(dirpath)
print("Processing %s, %d files found" % (dirname, len(fnames)))
for fname in fnames:
fpath = dirpath / fname
f = open(fpath, encoding="latin-1")
content = f.read()
lines = content.split("\n")
# 去除标题,去除前十行
lines = lines[10:]
content = "\n".join(lines)
samples.append(content)
labels.append(class_index)
class_index += 1
print("Classes:", class_names)
print("Number of samples:", len(samples))
打乱顺序,并划分训练集、验证集
# Shuffle the data
seed = 2020
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)
# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
3 Create a vocabulary index
让我们使用文本向量化来索引数据集中发现的词汇表。稍后,我们将使用同一层实例对样本进行向量化。
我们的层将只考虑最上面的20,000个单词,并将截断/填充使得序列长度是200个tokens
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
# 适配数据集
vectorizer.adapt(text_ds)
查看词汇表
vectorizer.get_vocabulary()[:10]
建立{词:索引}字典
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))
test = ["the", "cat", "sat", "on", "the", "mat"]
[word_index[w] for w in test]
查看vectorizer的效果,使用vectorizer可以分词,将词汇映射到词汇表中的索引,并对序列做截断/填充使得序列长度一致,适应后续模型使用,若长度不一致将无法使用后面的CNN模型
output = vectorizer([["the cat sat on the mat"]])
output.numpy()
4 Load pre-trained word embeddings
glove文件包含各种大小的文本编码向量:50维、100维、200维、300维。我们将使用100D的。让我们创建一个字典将单词(字符串)映射到它们的NumPy向量表示
path_to_glove_file = "./input/glove.6B.100d.txt"
embeddings_index = {}
with open(path_to_glove_file,encoding='utf-8') as f:
for line in f:
word, coefs = line.split(maxsplit=1)
coefs = np.fromstring(coefs, "f", sep=" ")
embeddings_index[word] = coefs
print("Found %s word vectors." % len(embeddings_index))
此时的embeddings_index便是词嵌入矩阵,每个词的值将由100个连续值组成,即100维的词向量
为了能用Keras Embedding层,我们将通过上面的embeddings_index做出与Keras Embedding形状一致的Embedding矩阵
num_tokens = len(voc) + 2 # +2是为了''和'[UNK]'
embedding_dim = 100
hits = 0
misses = 0
# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
# Words not found in embedding index will be all-zeros.
# This includes the representation for "padding" and "OOV"
embedding_matrix[i] = embedding_vector
hits += 1
else:
misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
接下来,导入预训练的词向量矩阵至Embedding层
注意:若需要微调词向量矩阵则将参数trainable改为True
from tensorflow.keras.layers import Embedding
embedding_layer = Embedding(
num_tokens,
embedding_dim,
embeddings_initializer=keras.initializers.Constant(embedding_matrix),
trainable=False, # 冻结embedding层,不做训练
)
5 Build the model
from tensorflow.keras import layers
int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()
可以看出来Non-trainable params为2,000,200正是我们的Embedding参数量
6 Train the model
向量化文本
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()
y_train = np.array(train_labels)
y_val = np.array(val_labels)
model.compile(
loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))
7 Export an end-to-end model
做成端到端模型去预测
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)
probabilities = end_to_end_model.predict(
[["this message is about computer graphics and 3D modeling"]]
)
class_names[np.argmax(probabilities[0])]
8 Summary
完整代码如下
import os
import pathlib
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Embedding
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
'''读取数据'''
data_path = keras.utils.get_file(
"news20.tar.gz",
"http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
untar=True,
)
data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)
samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
class_names.append(dirname)
dirpath = data_dir / dirname
fnames = os.listdir(dirpath)
print("Processing %s, %d files found" % (dirname, len(fnames)))
for fname in fnames:
fpath = dirpath / fname
f = open(fpath, encoding="latin-1")
content = f.read()
lines = content.split("\n")
# 去除前十行
lines = lines[10:]
content = "\n".join(lines)
samples.append(content)
labels.append(class_index)
class_index += 1
print("Classes:", class_names)
print("Number of samples:", len(samples))
'''打乱数据。划分数据集'''
# Shuffle the data
seed = 2020
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)
# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
'''文本向量化'''
vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
vectorizer.adapt(text_ds)
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))
x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()
y_train = np.array(train_labels)
y_val = np.array(val_labels)
'''导入GloVe至Embedding层'''
path_to_glove_file = "./input/glove.6B.100d.txt"
embeddings_index = {}
with open(path_to_glove_file,encoding='utf-8') as f:
for line in f:
word, coefs = line.split(maxsplit=1)
coefs = np.fromstring(coefs, "f", sep=" ")
embeddings_index[word] = coefs
num_tokens = len(voc) + 2 # +2是为了''和'[UNK]'
embedding_dim = 100
hits = 0
misses = 0
# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None:
# Words not found in embedding index will be all-zeros.
# This includes the representation for "padding" and "OOV"
embedding_matrix[i] = embedding_vector
hits += 1
else:
misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
embedding_layer = Embedding(
num_tokens,
embedding_dim,
embeddings_initializer=keras.initializers.Constant(embedding_matrix),
trainable=False,
)
"""定义模型"""
int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(len(class_names), activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.compile(
loss="sparse_categorical_crossentropy", optimizer="rmsprop", metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=20, validation_data=(x_val, y_val))
"""模型预测"""
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
preds = model(x)
end_to_end_model = keras.Model(string_input, preds)
probabilities = end_to_end_model.predict(
[["this message is about computer graphics and 3D modeling"]]
)
class_names[np.argmax(probabilities[0])]