COVID-19 tweet sentiment classification, tweet-style text generation, and summarization with Transformers

Latest recommended article published on 2024-06-26 17:26:20

qq_48566899

Categories: Machine Learning, Natural Language Processing, python. Tags: transformer, classification, deep learning

Article link: https://blog.csdn.net/qq_48566899/article/details/121878610


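This blog post describes how to use deep learning for sentiment analysis and text generation. It first performs sentiment classification on a Kaggle dataset, covering data preprocessing, word clouds, model construction (a bidirectional LSTM), and model evaluation. It then trains a model that generates text continuing a seed text, and finally applies Transformers to text summarization, showing the models' test results. The overall workflow illustrates how deep learning can be used to understand and generate natural language.

Summary generated by CSDN using AI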

I. Sentiment classification of COVID-19 tweets

Data source
https://www.kaggle.com/datatattle/covid-19-nlp-text-classification

1. Load the data

import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

# Load the training split. If pandas raises a UnicodeDecodeError,
# passing encoding='latin1' to read_csv may help.
train = pd.read_csv('Corona_NLP_train.csv')
train.isnull().sum()

# Count tweets per sentiment label and plot each count against its own label
a = train['Sentiment'].value_counts()
fig = px.bar(x=a.index, y=a.values, labels={'x': 'Sentiment', 'y': 'count'})
fig.show()

2. Data preprocessing

import re
import nltk
from nltk.corpus import stopwords

# Encode the five sentiment labels as integer ids
train['Sentiment']=train['Sentiment'].map({'Positive':0,'Negative':1,'Neutral':2,'Extremely Positive':3,'Extremely Negative':4})

# nltk.download('stopwords')  # needed once if the stop-word list is not installed
stop_words = set(stopwords.words('english'))

def cleaner(tweet):
    tweet = re.sub(r'http\S+', ' ', tweet)  # remove urls
    tweet = re.sub(r'<.*?>',' ', tweet)     # remove html tags
    tweet = re.sub(r'\d+',' ', tweet)       # remove digits
    tweet = re.sub(r'#\w+',' ', tweet)      # remove hashtags
    tweet = re.sub(r'@\w+',' ', tweet)      # remove mentions
    tweet = tweet.split()
    tweet = " ".join([word for word in tweet if word not in stop_words])  # remove stop words
    return tweet

train['OriginalTweet']=train['OriginalTweet'].apply(lambda x:x.lower())
# Keep only words longer than three characters
train['OriginalTweet'] = train['OriginalTweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
train_cleaned = train['OriginalTweet'].apply(cleaner)
train_cleaned.head()

train['OriginalTweet']=train_cleaned
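
To see what the cleaning step does to a single tweet, here is a minimal, self-contained sketch. It assumes a tiny hand-written stop-word set (`STOP_WORDS`) in place of NLTK's full English list, and the sample tweet is made up; `clean_tweet` is a hypothetical name, not the `cleaner` function above, and it leaves out the "longer than three characters" filter.

```python
import re

# Hypothetical mini stop-word list; the article uses NLTK's full English list.
STOP_WORDS = {"the", "is", "at", "to", "and", "a", "of", "in"}

def clean_tweet(tweet: str) -> str:
    """Lower-case a tweet and strip URLs, HTML tags, digits, hashtags, mentions and stop words."""
    tweet = tweet.lower()
    tweet = re.sub(r"http\S+", " ", tweet)  # URLs
    tweet = re.sub(r"<.*?>", " ", tweet)    # HTML tags
    tweet = re.sub(r"\d+", " ", tweet)      # digits
    tweet = re.sub(r"#\w+", " ", tweet)     # hashtags
    tweet = re.sub(r"@\w+", " ", tweet)     # mentions
    return " ".join(w for w in tweet.split() if w not in STOP_WORDS)

# Made-up sample tweet
sample = "@Tesco shelves empty at 9am #coronavirus https://t.co/abc <b>Panic</b> buying is everywhere"
cleaned = clean_tweet(sample)
print(cleaned)  # shelves empty am panic buying everywhere
```

Note that removing digits turns "9am" into "am", which is one reason the article additionally drops very short words.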

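3. Draw word clouds to get an overview:

A word cloud, also called a text cloud, visually emphasizes the keywords that occur most often in a body of text, rendering them as a colorful, cloud-like image. It filters out most of the text so that a reader can grasp the main message of the data at a glance.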

from wordcloud import WordCloud

# Word cloud over all cleaned tweets
all_words = ' '.join(train_cleaned)
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

# Word cloud over the positive tweets only (Sentiment id 0);
# join with a space so words at tweet boundaries are not glued together
Positive = ' '.join(train['OriginalTweet'][train['Sentiment'] == 0])
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(Positive)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
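
The size of each word in a word cloud is driven by how often the word occurs. The sketch below shows that underlying frequency count with the standard library only; the three tweets are made up and merely stand in for `train_cleaned`.

```python
from collections import Counter

# Made-up cleaned tweets, standing in for train_cleaned.
tweets = [
    "grocery store shelves empty",
    "thank grocery store workers",
    "online shopping grocery delivery",
]

# Count every word across all tweets; the most frequent words would be drawn largest.
word_counts = Counter(word for tweet in tweets for word in tweet.split())
top_words = word_counts.most_common(2)
print(top_words)  # [('grocery', 3), ('store', 2)]
```

Printing the top entries of such a counter is a quick way to check that the cloud is not dominated by words that should have been removed during cleaning.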

4. Convert the text corpus to integer sequences.

The Tokenizer assigns an integer id to every word.

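from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_cleaned)
X = tokenizer.texts_to_sequences(train_cleaned)
# Pad to a common length so that X is a 2-D array and X.shape[1] is defined
X = pad_sequences(X, padding='post')
vocab_size = len(tokenizer.word_index)+1
print("Vocabulary size: {}".format(vocab_size))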
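
As a rough illustration of what the Tokenizer does, the following pure-Python sketch builds a word index ordered by frequency and converts texts to id sequences. It is a simplified stand-in, not the Keras implementation: `fit_word_index` and `texts_to_sequences` are hypothetical helpers, and punctuation filtering is left out.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign ids starting at 1, most frequent word first (0 is left free for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sorted() is stable, so words with equal counts keep their first-seen order
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word by its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index] for text in texts]

texts = ["stay home stay safe", "stay strong"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
vocab_size = len(word_index) + 1  # +1 for the padding id 0

print(word_index)  # {'stay': 1, 'home': 2, 'safe': 3, 'strong': 4}
print(sequences)   # [[1, 2, 1, 3], [1, 4]]
```

The `+ 1` in the vocabulary size mirrors the article's code: id 0 is never assigned to a word, so the embedding layer needs one extra row.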

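# Collapse the five sentiment levels into three classes
encoding = {'Extremely Negative': 0,'Negative': 0,'Neutral': 1,'Positive':2,'Extremely Positive': 2  }
labels = ['Negative', 'Neutral', 'Positive']
# train['Sentiment'] already holds the integer ids from step 2
# (0=Positive, 1=Negative, 2=Neutral, 3=Extremely Positive, 4=Extremely Negative)
y = train['Sentiment'].map({4: 0, 1: 0, 2: 1, 0: 2, 3: 2})
y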

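5. Build and train the deep model

Bidirectional: the layer reads the sequence both from start to end and from end to start, so every position sees context from both sides. This helps the model retain earlier words instead of forgetting them as it reads on.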

import tensorflow as tf
from tensorflow.keras import layers as Layers
from tensorflow.keras.losses import SparseCategoricalCrossentropy

EPOCHS =30
BATCH_SIZE = 32
embedding_dim = 16
units = 256
model = tf.keras.Sequential([
    Layers.Embedding(vocab_size, embedding_dim, input_length=X.shape[1]),
    Layers.Bidirectional(Layers.LSTM(units,return_sequences=True)),
    Layers.GlobalMaxPool1D(),
    Layers.Dropout(0.4),
    Layers.Dense(64, activation="relu"),
    Layers.Dropout(0.4),
    Layers.Dense(3)])  # three logits: Negative, Neutral, Positive
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True),optimizer='adam',metrics=['accuracy'])
model.summary()
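
The idea behind the Bidirectional wrapper can be sketched with a toy scalar recurrence. This is only an illustration under simplifying assumptions (a plain tanh RNN with made-up weights `w` and `u`, not an LSTM): one pass reads left to right, a second pass reads right to left, and the two states are paired per position.

```python
import math

def run_rnn(xs, w=0.5, u=0.9):
    """Toy scalar RNN: h_t = tanh(w * x_t + u * h_{t-1}). Returns the state at every step."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def run_bidirectional(xs):
    """Pair the left-to-right state with the right-to-left state at each position."""
    forward = run_rnn(xs)
    backward = run_rnn(xs[::-1])[::-1]  # read right to left, then re-align with positions
    return list(zip(forward, backward))

inputs = [1.0, 0.0, 0.0, -1.0]  # made-up input sequence
states = run_bidirectional(inputs)
for t, (f, b) in enumerate(states):
    print(t, round(f, 3), round(b, 3))
```

At position 0 the forward state has only seen the first input, while the backward state has already been influenced by the final `-1.0`. That extra context from the other side is what the bidirectional layer contributes.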

6. Fit the model

history=model.fit(X,y,epochs=EPOCHS,validation_split=0.12, batch_size=BATCH_SIZE)

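7. Plot accuracy against epoch

The curve shows when the model starts to converge, which can be used to choose the number of epochs. More epochs are not always better. Because of hardware limits, only a small number of epochs is shown here for illustration.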

fig = px.line(history.history, y=['accuracy'],
    labels={'index': 'epoch', 'value': 'accuracy'})
fig.show()




fig=px.line(history.history,y=['loss', 'val_loss'],labels={'index': 'epoch', 'value': 'loss'})
fig.show()
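
Reading the best epoch off the curve can also be done programmatically. The sketch below is one possible approach, assuming a patience-based stopping rule; `best_epoch` is a hypothetical helper and the validation losses are made-up numbers, not results from this experiment.

```python
def best_epoch(val_losses, patience=2):
    """Return (best_idx, stopped_at): the epoch with the lowest validation loss so far,
    and the epoch at which training would stop after `patience` epochs without improvement."""
    best, best_idx = float("inf"), -1
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx = loss, i
        elif i - best_idx >= patience:
            return best_idx, i
    return best_idx, len(val_losses) - 1

# Made-up validation losses, only to illustrate the shape of a typical curve.
val_loss = [0.95, 0.71, 0.60, 0.58, 0.61, 0.66, 0.72]
epoch, stopped_at = best_epoch(val_loss)
print(epoch, stopped_at)  # 3 5
```

In Keras the same idea is available as the `tf.keras.callbacks.EarlyStopping` callback (with `monitor`, `patience`, and `restore_best_weights` arguments), which can be passed to `model.fit` instead of fixing the epoch count by hand.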

8. Predict

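import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

test=pd.read_csv("Corona_NLP_test.csv")
X_test = test['OriginalTweet'].copy()
y_test = test['Sentiment'].copy()
X_test = X_test.apply(cleaner)
X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, padding='post')
y_test = y_test.map(encoding)  # collapse the five string labels into three classes
# predict_classes was removed from recent Keras releases; take the argmax of the logits instead
pred = np.argmax(model.predict(X_test), axis=1)
loss, acc = model.evaluate(X_test,y_test,verbose=0)
print('Test loss: {}'.format(loss))
print('Test Accuracy: {}'.format(acc))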

9. Test on a self-written sentence

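import numpy as np

x_test=pd.Series(['i believe we will get through the pandamic,i hope it will end as soon as possible'])
x_test = x_test.apply(cleaner)
x_test = tokenizer.texts_to_sequences(x_test)
x_test = pad_sequences(x_test, padding='post')
Y_pred = model.predict(x_test)
# The model outputs one logit per class, so pick the class with the largest logit
# (0 = Negative, 1 = Neutral, 2 = Positive)
Y_pred = np.argmax(Y_pred, axis=1)
Y_pred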
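
Because the classifier is compiled with `from_logits=True`, its raw outputs are unnormalized scores rather than probabilities. The following sketch, with made-up logits and a hypothetical `decode` helper, shows how such scores can be turned into probabilities with a softmax and mapped to a class name.

```python
import math

LABELS = ['Negative', 'Neutral', 'Positive']

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits):
    """Return the most probable label together with the full probability vector."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

logits = [-1.2, 0.3, 2.1]  # made-up model output for one sentence
label, probs = decode(logits)
print(label, [round(p, 3) for p in probs])
```

Taking the argmax works even when all logits are negative, for example `decode([-3, -1, -2])` yields `Neutral`, whereas comparing raw logits against a fixed threshold such as 0.5 would select no class at all in that case.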

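II. Generating tweet-style comments

1. Define the tokenizer object and prepare the training data

This data differs from the data above: in sentiment classification, x and y are the text and its sentiment label, whereas in text generation x and y are a piece of text and the word that follows it. A new tokenizer object is therefore created.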

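import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train['clean']=train_cleaned
data=''
# Use the first 1000 positive tweets (Sentiment id 0 after the mapping in section I)
for i in train[train['Sentiment']==0]['clean'][:1000]:
    data=data+i+'\n'
tokenizer = Tokenizer()
corpus = data.lower().split('\n')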

[Figure: contents of corpus]

[Figure: the id the Tokenizer assigns to each word]

tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
input_sequences = []
for line in corpus:
    line=line+"\n"
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
max_sequence_len = max([len(seq) for seq in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, padding='pre', maxlen=max_sequence_len))

2. Build <seed, next_word> training pairs

See the following blog post:
https://blog.csdn.net/qq_48566899/article/details/120695871

xs, labels = input_sequences[:,:-1], input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

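xs holds the seed, from which the next word (labels) is predicted.
For example, the first row of input_sequences uses 288 to predict the next word, 399,
and the next row uses 288 399 to predict the next word, 887.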
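
The construction of the training pairs can be reproduced on a tiny example without Keras. This is a minimal sketch: `make_training_pairs` is a hypothetical helper, the ids 288, 399 and 887 follow the example above, and the fourth id, 12, is made up.

```python
def make_training_pairs(token_lists):
    """Expand each tokenized line into (left-padded prefix, next word) pairs."""
    # Every prefix of length >= 2 becomes one training row
    prefixes = [tokens[:i + 1] for tokens in token_lists for i in range(1, len(tokens))]
    max_len = max(len(p) for p in prefixes)
    # Pad on the left with 0, like pad_sequences(..., padding='pre')
    padded = [[0] * (max_len - len(p)) + p for p in prefixes]
    xs = [row[:-1] for row in padded]      # seed: everything but the last id
    labels = [row[-1] for row in padded]   # target: the last id
    return xs, labels

xs, labels = make_training_pairs([[288, 399, 887, 12]])
print(xs)      # [[0, 0, 288], [0, 288, 399], [288, 399, 887]]
print(labels)  # [399, 887, 12]
```

Left padding keeps the most recent words at the end of every row, next to the position whose word is being predicted.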

3. Build and train the deep model

from tensorflow.keras import regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

embed_dim = 100
model = Sequential()
model.add(Embedding(total_words, embed_dim, input_length=max_sequence_len-1))
#model.add(Bidirectional(LSTM(128)))
model.add(Bidirectional(LSTM(96)))
model.add(Dropout(0.3))
# Dense expects an integer number of units, so use integer division
model.add(Dense(total_words//2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
history = model.fit(xs, ys, batch_size=64, epochs=30, verbose=1)

4. Plot accuracy against epoch

import matplotlib.pyplot as plt
def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.show()

plot_graphs(history, 'acc')
plot_graphs(history, 'loss')

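5. Feed in seed text and generate the words that follow

With a fixed seed, always picking the most probable word (max) yields the same text every time, so generation is deterministic. The generated text does not need to be identical on every run, so some randomness is wanted, but not complete randomness: emitting words at random from the vocabulary would be meaningless. To keep the output varied while still guided by the model, the next word is sampled from a multinomial distribution over the predicted probabilities.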

def predict_next_words(seed_text, next_words):
    temperature = 0.5  # values below 1 sharpen the distribution
    top_n = 5          # sample only among the top_n most probable words
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        predicted = model.predict(token_list, verbose=0)[0].astype('float64')
        predicted = predicted**(1/temperature)
        p = predicted/np.sum(predicted)
        p[np.argsort(p)[:-top_n]] = 0  # keep only the top_n most probable words
        p = p / np.sum(p)  # renormalize the probabilities
        predicted = np.random.choice(len(p), p=p)  # sample one word id
        # Map the sampled id back to its word (id 0 is padding and has no word)
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    print(seed_text)
    return seed_text

seed_text = "i hope the pandamic will end soon and"
next_words = 50
generated_text = predict_next_words(seed_text, next_words)
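
The sampling step inside the loop combines temperature scaling with top-k filtering. The standard-library sketch below isolates that step; `sample_top_k` is a hypothetical helper, and the probability vector and seed are made up for illustration.

```python
import random

def sample_top_k(probs, temperature=0.5, top_k=2, rng=random):
    """Sharpen probs with a temperature, keep the top_k entries, renormalize, and sample an index."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    keep = set(sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k])
    masked = [s if i in keep else 0.0 for i, s in enumerate(scaled)]
    total = sum(masked)
    weights = [m / total for m in masked]
    choice = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return choice, weights

probs = [0.5, 0.3, 0.15, 0.05]  # made-up next-word probabilities
rng = random.Random(0)          # fixed seed so the sketch is repeatable
idx, weights = sample_top_k(probs, rng=rng)
print(idx, [round(w, 3) for w in weights])
```

With a temperature of 0.5 the probabilities are squared before renormalizing, so the most likely word gains weight relative to the others. A temperature close to 0 approaches greedy argmax decoding, while a temperature of 1 leaves the model's distribution unchanged.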

III. Applying Transformers

If you run this on Kaggle, the following notebook is an example:
https://www.kaggle.com/tbhavnani/abstractive-summarization-using-transformers

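from transformers import pipeline

# Load the default summarization pipeline
classifier = pipeline('summarization')
data=''
su=train['OriginalTweet'].apply(cleaner)
# Concatenate the first 300 cleaned tweets into one article
for i in su[:300]:
    data=data+i+'\n'

article=data
# truncation=True cuts the input to the model's maximum length; without it, very long inputs may fail
print(classifier(article, max_length=130, min_length=30, truncation=True))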
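
Summarization models accept only a limited number of input tokens, so a long concatenation of tweets may be cut off. One possible workaround, sketched here under the assumption that word count is an acceptable proxy for token count, is to split the text into chunks and summarize each chunk separately. `chunk_words` is a hypothetical helper and the 300-word limit is an arbitrary choice.

```python
def chunk_words(text, max_words=300):
    """Split text into consecutive pieces of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Placeholder text of 700 words, standing in for the concatenated tweets
article = " ".join(f"tweet{i}" for i in range(700))
chunks = chunk_words(article, max_words=300)
print([len(c.split()) for c in chunks])  # [300, 300, 100]

# Each chunk could then be passed to the pipeline on its own, for example:
# summaries = [classifier(c, max_length=130, min_length=30)[0]['summary_text'] for c in chunks]
```

The per-chunk summaries can be joined, or summarized once more, to obtain a single overview of all the comments.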

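Result analysis

1. Inspect the word cloud to get a rough idea of what the comments are about.
[Figure: word cloud]

2. Use the sentiment classification model to judge whether an input sentence is positive, negative, or neutral.
Input sentence: 'i believe we will get through the pandamic,i hope it will end as soon as possible'
Predicted result: [Figure: model output]
which denotes Positive.

The prediction is correct: the model judges the input sentence 'i believe we will get through the pandamic,i hope it will end as soon as possible' to be positive.

3. Plot accuracy against epoch to see when the model starts to converge, and choose the number of epochs accordingly.
[Figure: accuracy against epoch]

4. Test results of the sentiment classification model
[Figure: test loss and accuracy]

5. Text generation results
[Figure: generated text]

6. Summary of the comments produced by the transformers pipeline
[{'summary_text': 'iraq’s covid- pandemic has caused a surge in consumer confidence . a u.s. gov. john mccarthy says he’s a “failure” in the uk . “i’m a big fan of covid,” he says. “it’s not a bad thing to be able to eat”'}]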

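Summary

1. A word cloud is a visualization of high-frequency keywords. By combining text, color, and shape it produces a striking visual effect while conveying useful information. It visually emphasizes the keywords that occur most often in online text and filters out most of the remaining text, so a reader can grasp the gist at a glance.
2. Cleaning comment data is tedious: special language and symbols such as URLs, emoji, and mentions have to be removed.
3. The data labels the sentiment of each comment on five levels. In this experiment they were merged into three classes: positive, neutral, and negative. Separating the comments by sentiment and combining this with word clouds gives a rough picture of people's attitudes and feelings.
The data suggests that negative sentiment centers on wanting more help and on people buying large amounts of toilet paper, food, and similar goods in grocery stores to prepare for quarantine. Positive sentiment centers on gratitude toward workers.
4. Hyperparameter tuning showed that an LSTM fits this task and that a bidirectional LSTM works better, reaching a test accuracy of 0.82. The model can judge whether people's online comments are positive or negative. Bidirectional: the layer reads the sequence both from start to end and from end to start, which helps it retain earlier words instead of forgetting them as it reads on.
5. Plotting accuracy against epoch shows when the model starts to converge and helps choose the number of epochs.
6. Text generation needs a seed segment as input. Generation then repeats the following steps:
Feed the segment into the neural network
The network outputs a probability for every word in the vocabulary
Sample the next word from these probabilities
Append the newly generated word to the end of the segment
7. Transformers provides models for natural language understanding and natural language generation. In practice it offers many BERT-style models, with upwards of a hundred million parameters. Transformers has more than 32 pretrained models and supports over 100 languages, which makes it very convenient to use. It helps reduce compute cost: a pretrained model can be used directly instead of training from scratch every time, saving computation time and production cost. These models can also be fine-tuned.
8. The summary produced by transformers suggests that people feel the pandemic has caused a surge in consumption, that being able to eat more during the pandemic is a good thing, and that many people think the UK government handled the pandemic poorly.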
