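This is the second post in my NLP series. If you have not read the first one, "NLP Super-Detailed Beginner Quick-Start Guide (1): Common Functions", read it first to get familiar with the basic functions and processing methods.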
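1. Data preparation
The dataset we use is the News Headlines Dataset for Sarcasm Detection (the file Sarcasm_Headlines_Dataset_v2.json). Download it yourself before continuing.
The data contains three fields:
- Label (is_sarcastic): whether the headline is sarcastic
- Headline (headline): the news headline
- Link (article_link): the link to the corresponding news article
The dataset contains 28,619 records in total.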
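To make the record layout concrete, here is a minimal sketch of parsing data in this format. It assumes the file stores one JSON object per line. The two sample records and their example.com links are invented for illustration and are not taken from the dataset.

```python
import io
import json

# Two made-up records in the assumed one-JSON-object-per-line layout.
sample_file = io.StringIO(
    '{"is_sarcastic": 1, "headline": "area man wins argument with himself", '
    '"article_link": "https://example.com/a"}\n'
    '{"is_sarcastic": 0, "headline": "city council approves new budget", '
    '"article_link": "https://example.com/b"}\n'
)

# Parse each line independently, exactly one record per line.
records = [json.loads(line) for line in sample_file]

for record in records:
    # Every record carries the three fields described above.
    print(record["is_sarcastic"], record["headline"], record["article_link"])
```

Swapping the in-memory `sample_file` for an opened copy of the real file should give the same structure, one dictionary per headline.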
2. Reading the data
import json
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
# Initialize parameters
vocab_size = 10000
embedding_dim = 16
max_length = 100
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
training_size = 20000
# LoadData
sentences=[] # headline
labels=[] # is_sarcastic
urls=[] # article_link
# Each line of the file is one JSON object; the with-block closes the file for us
with open("Sarcasm_Headlines_Dataset_v2.json", 'r') as file:
    for line in file:
        dic = json.loads(line)
        sentences.append(dic['headline'])
        labels.append(dic['is_sarcastic'])
        urls.append(dic['article_link'])
3. Data preprocessing
3.1 Splitting into training and test sets
# Split into training and testing sets
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
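The slicing above is a plain sequential split: the first training_size items go to training and the rest to testing, so 28,619 headlines with training_size = 20000 leave 8,619 for testing. The sketch below shows the same idea on toy data. The helper name `split_data` and its optional shuffle are my own additions rather than part of the original pipeline; shuffling with a fixed seed is one possible choice if the file turns out to be ordered in some way.

```python
import random

def split_data(items, labels, train_size, shuffle=False, seed=42):
    """Split paired lists into train/test parts, keeping items and labels aligned."""
    pairs = list(zip(items, labels))
    if shuffle:
        # Optional step (an assumption, not in the original post):
        # shuffle the pairs together so labels stay attached to their items.
        random.Random(seed).shuffle(pairs)
    train, test = pairs[:train_size], pairs[train_size:]
    train_items = [item for item, _ in train]
    train_labels = [label for _, label in train]
    test_items = [item for item, _ in test]
    test_labels = [label for _, label in test]
    return train_items, test_items, train_labels, test_labels

# Toy data: ten fake headlines whose label is the parity of their number.
toy_items = [f"h{i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

tr_x, te_x, tr_y, te_y = split_data(toy_items, toy_labels, train_size=7)
print(len(tr_x), len(te_x))
```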
3.2 Preprocessing
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(training_padded[0])
print(training_padded.shape)
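To show roughly what the tokenizer and padding steps do, here is a simplified pure-Python sketch. It is not the Keras implementation: it only lowercases and splits on whitespace, whereas the real Tokenizer also filters punctuation. The function names `build_word_index`, `texts_to_sequences` and `pad_post` are hypothetical helpers for illustration.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Give the OOV token id 1, then number words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def texts_to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Replace each word by its id; unknown or too-rare words become the OOV id."""
    oov_id = word_index[oov_token]
    sequences = []
    for text in texts:
        ids = []
        for word in text.lower().split():
            idx = word_index.get(word)
            ids.append(idx if idx is not None and idx < num_words else oov_id)
        sequences.append(ids)
    return sequences

def pad_post(sequences, maxlen, value=0):
    """Cut sequences longer than maxlen at the end, then pad short ones at the end."""
    padded = []
    for seq in sequences:
        seq = seq[:maxlen]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

train_texts = ["the cat sat", "the dog sat down", "the end"]
word_index = build_word_index(train_texts)

# "bird" never appeared in the training texts, so it maps to the OOV id 1.
seqs = texts_to_sequences(["the bird sat"], word_index, num_words=100)
print(pad_post(seqs, maxlen=5))
```

The vocabulary is built from the training texts only, which mirrors calling fit_on_texts on training_sentences: words that first appear in the test set are treated as out-of-vocabulary.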
3.3 Data conversion
# Convert lists to NumPy arrays
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)
4. Defining and compiling the model
model = tf.keras.Sequential([
    # Note: newer Keras releases report input_length as deprecated; it can be omitted there
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
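tf.keras.layers.Embedding is a TensorFlow Keras layer that embeds the words of a discrete vocabulary into a continuous vector space. It maps each word, represented by its integer index, to a fixed-size dense vector (here embedding_dim = 16 values) and passes those vectors on to the following layers.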
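The lookup and the pooling layer that follows it can be sketched in plain Python. This is an illustration of the idea, not Keras code: the table sizes are toy values, the weights are random placeholders (in the real model they are learned during training), and `embed` and `global_average_pool` are hypothetical helper names.

```python
import random

# Toy sizes, much smaller than the 10000 x 16 table used in the model above.
toy_vocab_size, toy_embedding_dim = 10, 4

rng = random.Random(0)
# One row of toy_embedding_dim floats per token id.
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(toy_embedding_dim)]
    for _ in range(toy_vocab_size)
]

def embed(token_ids):
    """Embedding lookup: replace each token id by its row in the table."""
    return [embedding_table[t] for t in token_ids]

def global_average_pool(vectors):
    """Average the vectors over the sequence axis, one value per dimension."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

padded_ids = [2, 1, 3, 0, 0]          # one padded headline of length 5
vectors = embed(padded_ids)           # shape: 5 x 4
pooled = global_average_pool(vectors) # shape: 4

print(len(vectors), len(vectors[0]), len(pooled))
```

Because the model above sets no mask on the Embedding layer, padded positions (id 0) take part in the average as well, which this sketch reproduces.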
5. Training the model
num_epochs = 30 # number of training epochs
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)
6. Visualizing the training process
import matplotlib.pyplot as plt
def plot_graphs(history, string):
    # Plot a metric for the training set and its 'val_' counterpart per epoch
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
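7. Validating the model
Here we construct two headlines, preprocess them the same way as the training data, and pass them to the model for prediction.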
sentence = ["granny starting to fear spiders in the garden might be real", "game of thrones season finale showing this sunday night"]
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(model.predict(padded))
Result
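The model ends in a single sigmoid unit, so each prediction is a number between 0 and 1, and values closer to 1 mean the headline looks more sarcastic. A minimal sketch of turning such scores into labels follows. The probabilities are placeholder values chosen for illustration, not output from the model in this post, and the 0.5 threshold and the helper name `to_labels` are assumptions.

```python
def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to readable labels using a fixed threshold."""
    return ["sarcastic" if p >= threshold else "not sarcastic" for p in probabilities]

# Placeholder scores for two headlines (NOT real model output).
example_probs = [0.91, 0.08]

for prob, label in zip(example_probs, to_labels(example_probs)):
    print(f"{prob:.2f} -> {label}")
```

With real predictions, the array returned by model.predict has one row per headline and one column, so it would need flattening first, for example with model.predict(padded).ravel().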