A Super-Detailed NLP Quick-Start Guide for Beginners (2): My First Model, Sarcastic News Headlines

Article link: https://blog.csdn.net/m0_48300767/article/details/131024512

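This is the second post in the NLP series. If you haven't read the first one yet, take a look at "A Super-Detailed NLP Quick-Start Guide for Beginners (1): Common Functions" first to get familiar with the basic functions and processing methods.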

1. Data preparation

The dataset we use is the news headline sarcasm dataset. Please download it yourself before continuing.

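The data contains three fields:

Label (is_sarcastic): whether the headline is sarcastic
Headline (headline): the news headline
Link (article_link): link to the corresponding news article

The dataset contains 28,619 records in total.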
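
To make the three fields concrete, here is a minimal sketch of parsing a single record with json.loads. The record below is made up for illustration (its headline and link are not taken from the dataset); only the field names match the ones used in the loading code.

```python
import json

# Hypothetical record shaped like one line of the dataset file.
line = '{"is_sarcastic": 1, "headline": "example headline text", "article_link": "https://example.com/article"}'

record = json.loads(line)
print(record["is_sarcastic"])   # label: 1 = sarcastic, 0 = not sarcastic
print(record["headline"])       # the headline text
print(record["article_link"])   # link to the original article
```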

2. Reading the data

import json
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Initialize hyperparameters
vocab_size = 10000
embedding_dim = 16
max_length = 100
trunc_type='post'
padding_type='post'
oov_tok = "<OOV>"
training_size = 20000


# Load the data
sentences = []  # headline
labels = []     # is_sarcastic
urls = []       # article_link

# The file holds one JSON object per line; the with-block closes it automatically
with open("Sarcasm_Headlines_Dataset_v2.json", 'r') as file:
    for line in file:
        dic = json.loads(line)
        sentences.append(dic['headline'])
        labels.append(dic['is_sarcastic'])
        urls.append(dic['article_link'])

3. Data preprocessing

3.1 Splitting into training and test sets

# Split into training and testing sets
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
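
As a quick sanity check on the split sizes, the sketch below repeats the same slicing on a stand-in list. It assumes all 28,619 records were loaded; note that the split is purely positional, with no shuffling.

```python
records = list(range(28619))      # stand-in for the 28,619 headlines
training_size = 20000

train = records[:training_size]   # first 20,000 records
test = records[training_size:]    # everything after that
print(len(train), len(test))
```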

3.2 Preprocessing

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

word_index = tokenizer.word_index

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

print(training_padded[0])
print(training_padded.shape)
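
To show roughly what the tokenizer and pad_sequences do, here is a simplified pure-Python sketch. It is not the Keras implementation: it skips punctuation filtering and the num_words limit, and the helper names build_word_index and to_padded are hypothetical. It does keep the key behaviors used above: the OOV token gets index 1, more frequent words get smaller indices, and sequences are truncated and zero-padded at the end.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Give the OOV token index 1, then number words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index

def to_padded(texts, word_index, max_length, oov_token="<OOV>"):
    """Map words to indices, cut off the end, then pad the end with zeros."""
    rows = []
    for text in texts:
        seq = [word_index.get(w, word_index[oov_token]) for w in text.lower().split()]
        seq = seq[:max_length]                   # truncating='post'
        seq += [0] * (max_length - len(seq))     # padding='post'
        rows.append(seq)
    return rows

train = ["the cat sat", "the dog sat down"]      # hypothetical toy sentences
word_index = build_word_index(train)
padded = to_padded(["the bird sat"], word_index, max_length=5)
print(word_index)
print(padded)   # "bird" was never seen in training, so it maps to the OOV index
```

Because the word index is built from the training sentences only, any word that appears solely in the test set is replaced by the OOV index, just as "bird" is here.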


3.3 Data conversion

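# Convert lists to NumPy arrays (pad_sequences already returns arrays; the label lists are what actually change)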
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)
testing_padded = np.array(testing_padded)
testing_labels = np.array(testing_labels)

4. Defining and compiling the model

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

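tf.keras.layers.Embedding is a TensorFlow Keras layer that embeds words from a discrete vocabulary into a continuous vector space. It maps each word index to a vector of fixed size (embedding_dim, 16 here) and passes those vectors on to the following layers.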
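
The lookup can be pictured as indexing into a table with one row per vocabulary entry. The sketch below is pure Python, not Keras: random numbers stand in for the learned weights, and the token ids are hypothetical. It also mimics GlobalAveragePooling1D by averaging over the sequence, and counts the trainable parameters of the model defined above.

```python
import random

vocab_size, embedding_dim = 10000, 16

# Embedding table: one row of embedding_dim floats per token index.
# Random values stand in for the weights that training would learn.
random.seed(0)
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

token_ids = [2, 1, 3, 0, 0]                  # hypothetical padded sequence
vectors = [table[i] for i in token_ids]      # Embedding: index -> vector lookup

# GlobalAveragePooling1D: average over the sequence axis, one value per dimension.
pooled = [sum(col) / len(vectors) for col in zip(*vectors)]
print(len(vectors), len(vectors[0]), len(pooled))

# Trainable parameter counts, layer by layer.
embedding_params = vocab_size * embedding_dim   # one vector per vocabulary entry
dense1_params = embedding_dim * 24 + 24         # weights + biases of Dense(24)
dense2_params = 24 * 1 + 1                      # weights + bias of Dense(1)
print(embedding_params + dense1_params + dense2_params)
```

Almost all of the parameters sit in the embedding table, which is why vocab_size and embedding_dim dominate the model size.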

5. Training the model

num_epochs = 30  # number of training epochs
history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)


6. Visualizing the training process

import matplotlib.pyplot as plt
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()
  
plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
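
Besides plotting, the same curves can be inspected numerically, since history.history is a dictionary of per-epoch lists keyed by metric name. The sketch below uses made-up numbers in that shape (real values will differ) to find the epoch with the lowest validation loss. Validation loss rising while training loss keeps falling is a common sign of overfitting.

```python
# Hypothetical values shaped like history.history; real numbers will differ.
history_dict = {
    "loss":     [0.66, 0.43, 0.31, 0.25, 0.20],
    "val_loss": [0.58, 0.39, 0.35, 0.36, 0.39],
}

val_loss = history_dict["val_loss"]
# Index of the smallest validation loss, converted to a 1-based epoch number.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)
```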


7. Testing the model

Here we construct two headlines, preprocess them the same way as the training data, and hand them to the model for prediction.

sentence = ["granny starting to fear spiders in the garden might be real", "game of thrones season finale showing this sunday night"]
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
print(model.predict(padded))
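
Because the last layer is a sigmoid, each prediction is a number between 0 and 1 that can be read as the estimated probability that the headline is sarcastic. The sketch below turns such outputs into labels. The probabilities are made up, and the 0.5 threshold is an assumption rather than something fixed by the model.

```python
# Hypothetical outputs with the same (2, 1) shape as model.predict(padded).
predictions = [[0.91], [0.04]]

threshold = 0.5  # assumed cut-off; adjust to trade precision against recall
predicted_labels = ["sarcastic" if p[0] >= threshold else "not sarcastic"
                    for p in predictions]
print(predicted_labels)
```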