Python Application Case: Building an RNN with Word2Vec Embeddings for Text Classification on TensorFlow 2.3

Contents

I. Data Preparation

II. Experiment Workflow

1. Load the dataset

2. Tokenization

3. Train the Word2Vec model

4. Shift the comma off index 0

5. Configure the training set

6. Build the RNN

7. Source code


I. Data Preparation

Chinese data source: ChineseNLPCorpus — a collection of Chinese NLP datasets for everyday experiments (contributions and merge requests welcome).

Chinese stopword lists: GitHub - goto456/stopwords: commonly used Chinese stopword lists (the HIT list, the Baidu list, and others)

Stopword list mirror: Baidu Netdisk, extraction code: xbf1

Test corpus: user reviews collected from a food-delivery platform, roughly 4,000 positive and 8,000 negative — https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/waimai_10k/intro.ipynb

II. Experiment Workflow

1. Load the dataset

# Load the data
import pandas as pd

TB = pd.read_csv('waimai_10k.csv')
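
Before moving on, it is worth a quick sanity check that the file loaded as expected. A minimal sketch (the label and review column names match the code used later in this article):

# Inspect the loaded data: columns, a few rows, and the class balance
print(TB.columns.tolist())      # expected: ['label', 'review']
print(TB.head())
print(TB.label.value_counts())  # roughly 8,000 negative (0) vs 4,000 positive (1)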

2. Tokenization

# Tokenize with jieba
import jieba
# Tokenize every entry of the review column with a lambda
TB.review.apply(lambda sen: jieba.lcut(sen))
0                           [很快, ,, 好吃, ,, 味道, 足, ,, 量, 大]
1                                 [没有, 送水, 没有, 送水, 没有, 送水]
2                                     [非常, 快, ,, 态度, 好, 。]
3                      [方便, ,, 快捷, ,, 味道, 可口, ,, 快, 递给, 力]
4                             [菜, 味道, 很棒, !, 送餐, 很, 及时, !]
                               ...                        
11982             [以前, 几乎, 天天, 吃, ,, 现在, 调料, 什么, 都, 不放, ,]
11983    [昨天, 订, 凉皮, 两份, ,, 什么, 调料, 都, 没有, 放, ,, 就, 放, ...
11984                                  [凉皮, 太辣, ,, 吃不下, 都]
11985                       [本来, 迟到, 了, 还, 自己, 点, !, !, !]
11986    [肉夹馍, 不错, ,, 羊肉, 泡馍, 酱肉, 包, 很, 一般, 。, 凉面, 没, 想...
Name: review, Length: 11987, dtype: object

Attach the tokenized results to the DataFrame as a new column:

TB['text'] = TB.review.apply(lambda sen: jieba.lcut(sen))  # jieba tokenization
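
The stopword lists from the data-preparation section are not applied anywhere in the code that follows. If you wanted to filter them out before training the embeddings, a minimal sketch could look like this (cn_stopwords.txt is an assumed local filename for the downloaded goto456 list, one word per line):

# Hypothetical stopword filtering; 'cn_stopwords.txt' is an assumed filename
with open('cn_stopwords.txt', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)
TB['text'] = TB.text.apply(lambda toks: [t for t in toks if t not in stopwords])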

 

3. Train the Word2Vec model

# Train a Word2Vec model on the tokenized reviews
from gensim.models import Word2Vec

myWord2Vec = Word2Vec(sentences=TB.text, vector_size=250, sg=1, min_count=1)  # min_count=1: keep every token that appears at least once

print(myWord2Vec)
Word2Vec<vocab=11008, vector_size=250, alpha=0.025>

The vector size is chosen as roughly the average length of a review, about 250.
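
As a quick plausibility check on the embeddings, gensim's most_similar lists the nearest neighbours of a word in vector space (exact output varies between training runs):

# Words closest to "好吃" in the trained embedding space
print(myWord2Vec.wv.most_similar("好吃", topn=5))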

Get the word vector for ",":

print(myWord2Vec.wv.get_vector(","))
print(myWord2Vec.wv.get_vector(",").shape)
[-1.24408239e-02 -1.14726588e-01  2.47033477e-01  1.31037146e-01
 -1.94809809e-01 -2.16878414e-01  7.63326213e-02  4.79248613e-01
 -1.56891033e-01  2.79191583e-02  7.65182748e-02 -2.11410582e-01
 -1.21347949e-01  2.79119730e-01 -2.75594443e-01 -1.18430011e-01
  8.29209387e-03  1.05018750e-01 -1.04862735e-01  3.06678284e-02
  1.99599266e-01 -1.38059049e-03  1.08982801e-01 -1.00153044e-01
  9.64863002e-02 -1.78601265e-01  3.11386213e-02 -1.59458425e-02
 -3.22362274e-01  7.29258731e-02 -3.09222400e-01 -7.19489437e-03
 -4.57239114e-02 -2.33146802e-01  9.37329382e-02 -1.03120364e-01
  3.72090369e-01 -7.00908229e-02  1.09613158e-01 -1.77778825e-01
 -1.46992385e-01 -3.89091927e-03 -2.82386154e-01  2.91620661e-02
  3.23712789e-02 -9.60682556e-02 -1.46306649e-01 -3.62146832e-02
  2.99114317e-01 -2.33087912e-02 -2.16334328e-01 -1.30818877e-02
 -8.38805735e-02 -3.49560268e-02  2.84558326e-01 -3.80333401e-02
  2.01856103e-02  2.51057204e-02  6.37874380e-02  2.07886666e-01
 -1.71267241e-02 -1.15914047e-01 -5.19167334e-02 -1.29951477e-01
  1.70055345e-01  1.47780970e-01 -1.60399109e-01  1.00582980e-01
 -3.66804861e-02  1.54437721e-01 -5.74397705e-02  1.94721684e-01
  1.08824961e-01 -9.47291628e-02  2.60765135e-01  9.58999768e-02
  9.55417827e-02 -1.33736253e-01 -1.05333753e-01 -7.50511466e-03
  7.85463303e-03  1.53486669e-01  9.28018764e-02  2.18778417e-01
 -2.72222996e-01  1.75863147e-01  5.30329160e-02  2.72722572e-01
  1.48632646e-01  1.57090932e-01  1.08813553e-03 -2.42346451e-02
  7.34620020e-02 -2.49828592e-01  3.03890049e-01  1.82332903e-01
 -2.60147423e-01 -9.62509140e-02  7.84767733e-04 -4.74975770e-03
 -1.46705359e-01  2.48901963e-01 -8.41196924e-02  1.38966620e-01
  1.34686023e-01  7.81016797e-02 -1.38785005e-01  3.70077848e-01
 -1.21465199e-01 -2.89184988e-01 -1.96699828e-01 -5.97397722e-02
 -1.07141703e-01  1.90362617e-01  6.53386265e-02  1.25934884e-01
 -1.50998771e-01  2.07748681e-01 -2.77070343e-01 -6.67625815e-02
 -3.63008343e-02 -5.67756072e-02  1.51918661e-02 -2.63019860e-01
  5.58041148e-02  3.85505259e-02 -2.32700005e-01  8.43593627e-02
 -1.21860422e-01 -2.43539028e-02  2.69366652e-01 -3.07116359e-01
  1.22425035e-01 -1.41671598e-01  3.36526223e-02  1.11485884e-01
  2.23691970e-01  1.63073793e-01  1.47343911e-02  1.92776415e-02
  1.20959081e-01 -7.67013580e-02 -1.53942239e-02 -7.81191629e-04
 -4.70119193e-02  2.63462275e-01  1.02707744e-01 -2.24397644e-01
  3.55879039e-01  3.28966707e-01  6.27099350e-02 -3.04748159e-04
 -1.38461292e-01 -5.18774018e-02 -1.24785475e-01 -7.25892326e-03
 -3.93128721e-03  1.09467402e-01 -1.02222422e-02 -1.60950005e-01
 -5.32938587e-03  9.68365893e-02 -8.87990817e-02 -1.72120750e-01
 -1.39100805e-01  2.73072422e-02  7.71344304e-02 -1.54324412e-01
  2.29326412e-02  4.46925983e-02  1.65298477e-01  1.94667861e-01
 -1.70138150e-01  3.89234513e-01 -2.18917891e-01 -2.58248925e-01
 -1.38952816e-02 -9.74827781e-02  3.65327373e-02  7.83699527e-02
 -2.40603343e-01  1.53960615e-01 -1.98522862e-02 -4.31579016e-02
 -1.97944880e-01  2.33925949e-03 -2.65323758e-01 -1.44990176e-01
 -8.67372677e-02  1.24838695e-01  3.68875802e-01 -3.03966161e-02
  1.53521448e-01 -2.48228699e-01  2.11544726e-02  3.27512711e-01
  4.43835407e-02 -4.28637713e-02 -1.44940719e-01  9.76067707e-02
  9.13163200e-02  1.48789808e-01  1.23899817e-01 -2.12268770e-01
  4.46118385e-01 -9.15914923e-02  2.74303079e-01  1.43120006e-01
  8.95549133e-02  2.03270540e-02 -1.74899444e-01 -2.54027173e-02
  5.15332334e-02  2.74816100e-02  9.13057383e-03 -9.77091342e-02
  8.38149041e-02  1.91514775e-01 -1.86888173e-01 -1.63023531e-01
 -1.83735609e-01 -1.17754675e-01 -8.02150741e-03 -1.98843658e-01
  1.22469790e-01  1.40912622e-01  2.02520043e-01 -5.68287354e-03
 -2.00743854e-01  2.60207981e-01  2.59814672e-02 -1.36807635e-01
 -6.63179681e-02 -8.19087774e-03 -2.03191452e-02  1.97690874e-02
 -1.54288888e-01  9.06180143e-02  9.36938971e-02 -2.49418505e-02
 -6.81422949e-02 -8.33036602e-02  1.52526289e-01  2.16028109e-01
 -6.72417954e-02 -9.63013843e-02 -2.45497152e-01 -1.30919814e-01
 -1.97613224e-01  7.56083280e-02]
(250,)  # a one-dimensional vector of length 250

Get the index of "," in the Word2Vec vocabulary:

myWord2Vec.wv.get_index(",")
0

Get the vector for "好吃" (delicious) from the Word2Vec vocabulary:

vc = myWord2Vec.wv.get_vector("好吃")
vc
array([-3.69522065e-01,  4.37786151e-03,  2.29101434e-01,  2.26836745e-02,
        1.03728533e-01, -3.62663567e-01, -4.03029437e-04,  2.37036154e-01,
       -1.39683560e-01,  2.76549488e-01, -2.54234195e-01,  7.29166567e-02,
       -2.11284220e-01,  4.67608906e-02, -2.52396047e-01, -2.63711751e-01,
       -2.35222712e-01,  7.06711365e-03, -8.35900903e-02,  2.68125147e-01,
        2.60109067e-01, -9.49401781e-02,  8.23656991e-02, -2.44502723e-01,
        3.52774970e-02, -1.16491288e-01, -6.49365783e-02,  3.09367329e-01,
       -2.16241792e-01,  3.49693373e-02,  1.24673039e-01,  5.45583218e-02,
       -1.84918463e-01,  2.55738273e-02,  2.47410864e-01,  9.49822962e-02,
        7.86316395e-03,  4.48170342e-02, -5.82654290e-02, -2.20037512e-02,
       -6.77363798e-02,  5.71702197e-02, -2.90090889e-01,  3.57576795e-02,
        3.10779940e-02,  5.43692149e-02,  3.52953114e-02, -2.32062221e-01,
        2.77692825e-01, -3.50889921e-01,  2.49353498e-02, -2.32463554e-01,
        1.02144778e-02, -4.81871106e-02, -2.77454481e-02, -2.09912136e-02,
        1.68007299e-01,  4.28577334e-01,  8.22413638e-02, -2.08648387e-02,
       -1.61410779e-01, -6.30850866e-02, -7.60364160e-02, -1.31462157e-01,
        7.63839558e-02, -6.24827705e-02, -1.54609635e-01,  1.75234541e-01,
        2.56496780e-02,  3.28170657e-01, -2.54711062e-01, -3.70941833e-02,
        1.28647506e-01,  1.00567371e-01,  5.02138197e-01, -2.06861228e-01,
       -1.13421671e-01, -3.46682072e-01, -2.64279172e-02,  3.98371220e-01,
       -1.47127539e-01,  2.39393532e-01, -1.47403264e-02, -7.75861368e-02,
       -4.88856994e-02,  2.54989322e-02, -8.29336122e-02,  1.97217375e-01,
        2.57111341e-01,  4.95956577e-02, -2.17457078e-02, -1.21569186e-01,
        1.92515571e-02, -3.24660152e-01,  1.09084897e-01,  9.39435288e-02,
        4.78389822e-02,  1.64842695e-01,  3.04058865e-02, -2.18797356e-01,
       -1.45021111e-01,  1.88397482e-01, -2.95012265e-01, -1.20642401e-01,
        6.11534379e-02,  5.43345213e-02, -1.99843273e-01,  3.46804708e-01,
       -1.01921141e-01,  3.30776833e-02, -2.04752997e-01, -9.70176607e-03,
        1.61281645e-01,  3.83198053e-01, -5.69708832e-02, -6.97976351e-02,
        7.51625597e-02,  3.13534997e-02, -3.27581853e-01, -1.24401838e-01,
        1.38628513e-01, -1.72127619e-01, -2.05904514e-01, -2.89864033e-01,
       -1.08060231e-02, -1.04934357e-01, -5.57961799e-02,  1.10415285e-02,
       -1.15984771e-03, -9.48944092e-02,  2.55775452e-01, -3.69051218e-01,
        1.36772439e-01, -4.32525218e-01, -1.19043574e-01,  1.49537042e-01,
       -7.88062587e-02,  1.18424393e-01,  5.22552505e-02, -1.11135863e-01,
       -4.02813479e-02, -2.80203700e-01, -2.46382300e-02, -1.65774629e-01,
        1.25236720e-01, -2.89802421e-02,  1.30250052e-01, -6.22620821e-01,
        3.93101573e-01,  4.22589108e-02,  1.79253936e-01,  2.72062004e-01,
       -1.52948007e-01, -4.92342040e-02, -7.88181797e-02, -1.23133793e-01,
       -2.64385045e-01, -3.83840427e-02, -4.72662337e-02, -7.09735528e-02,
        1.78143620e-01, -2.36265883e-02, -4.06596698e-02,  1.62911341e-02,
       -9.47694704e-02, -1.77087173e-01, -1.11058123e-01, -2.25221500e-01,
        9.98197198e-02,  1.37549667e-02,  2.07858846e-01,  2.37657204e-01,
       -2.19980657e-01,  2.48171940e-01, -2.30442822e-01, -1.88807636e-01,
        1.20565832e-01, -2.44208932e-01,  3.07807066e-02,  2.19011486e-01,
       -1.40350983e-01,  2.00060144e-01, -2.96447068e-01, -1.71878472e-01,
       -7.59124830e-02,  3.90236527e-02, -4.67614859e-01,  2.79001892e-01,
       -3.47703435e-02,  9.52234678e-03,  4.21999544e-01, -2.82229602e-01,
        2.80508459e-01,  4.21625786e-02, -6.64850846e-02,  3.05039644e-01,
        2.34202385e-01, -1.11196794e-01, -1.48827776e-01,  4.42277305e-02,
       -8.03061575e-02, -2.08696332e-02, -5.20934202e-02, -2.13797569e-01,
        5.91897845e-01, -7.54428804e-02,  4.85218465e-01,  1.35757655e-01,
       -2.73129009e-02, -6.32495210e-02,  1.47155493e-01, -7.16560334e-02,
       -9.62558389e-03, -1.53998271e-01, -3.50465834e-01, -8.51925686e-02,
        2.09491044e-01, -7.80593306e-02, -1.61195740e-01,  3.95132303e-02,
       -7.20454976e-02, -3.86397056e-02,  6.18456192e-02, -1.03744216e-01,
        2.15368301e-01,  3.16945642e-01,  1.54146746e-01, -9.84106511e-02,
       -2.55556911e-01, -1.11188009e-01, -1.05821015e-02, -1.91050962e-01,
       -4.09577750e-02,  1.39795378e-01,  4.07037675e-01,  3.69239151e-02,
       -2.75765032e-01,  4.66491506e-02,  3.50173801e-01, -2.51376152e-01,
       -1.12376608e-01, -1.93271920e-01,  1.47691935e-01,  1.55679872e-02,
       -4.50120151e-01,  7.54415467e-02, -6.66278243e-01, -2.24472418e-01,
       -1.10847458e-01,  2.37356618e-01], dtype=float32)

Shape of the full set of word vectors in the model:

myWord2Vec.wv.vectors.shape
(11008, 250)

So the model contains 11,008 words.

List every word in the vocabulary:

myWord2Vec.wv.index_to_key

Get every word together with its index:

myWord2Vec.wv.key_to_index
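
For example (a quick illustration; the exact ordering depends on token frequency):

print(myWord2Vec.wv.index_to_key[:5])   # the five most frequent tokens
print(myWord2Vec.wv.key_to_index[","])  # 0 — the comma sits at index 0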

4. Shift the comma off index 0

As shown above, ',': 0 — index 0 of the vocabulary is occupied by the comma. Because reviews differ in length, shorter ones must be zero-padded, and those padding zeros would then actually be read as commas, which is undesirable. We therefore have to free up index 0 by shifting the whole vocabulary back one position. First, take the model's trained vectors as the embedding matrix:

embedding_matrix = myWord2Vec.wv.vectors  # one row per vocabulary word
embedding_matrix[0]  # index 0 currently holds the comma's vector
array([-1.24408239e-02, -1.14726588e-01,  2.47033477e-01,  1.31037146e-01,
       ...
       -1.97613224e-01,  7.56083280e-02], dtype=float32)

(the same comma vector printed in full in section 3)
import numpy as np

embedding_matrix = np.vstack([np.zeros(250), embedding_matrix])  # prepend an all-zero row

This stacks an all-zero vector on top of embedding_matrix: it creates a zero vector of length 250 and uses NumPy's vstack to vertically concatenate it with the original matrix, pushing every existing row down by one.

embedding_matrix.shape
(11009, 250)

With the all-zero row stacked on top of the vocabulary, the comma (and every other word) moves back one position, so the matrix now contains 11,009 vectors.

embedding_matrix[0]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

The vector at index 0 is now the all-zero vector.
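
A short sanity check confirms the shift: row 0 is the new padding vector, and row 1 now holds what used to be the comma's vector:

# Verify the shift: row 0 is all zeros, row 1 is the old comma vector
assert np.all(embedding_matrix[0] == 0)
assert np.allclose(embedding_matrix[1], myWord2Vec.wv.get_vector(","))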

len(TB.text)
11987

There are 11,987 reviews to process in total.

x_train = np.zeros([len(TB.text), 30], dtype="float64")

This creates a NumPy array named x_train with shape [len(TB.text), 30] and dtype "float64": one row per review (len(TB.text) of them) and 30 slots per row. Each review is truncated to its first 30 tokens for training.

x_train
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

5. Configure the training set

# Configure the training set
for i in range(len(TB.text)):
    for j in range(min(len(TB.text[i]), 30)):  # cap each review at 30 tokens
        x_train[i, j] = 1 + myWord2Vec.wv.get_index(TB.text[i][j])  # +1 because row 0 is the all-zero padding vector, not a word

This loop converts the text into index sequences: for each review it looks up the (shifted) vocabulary index of each of its first 30 tokens in the trained Word2Vec model myWord2Vec and stores those indices in x_train; positions beyond a review's length remain zero.

x_train
array([[5.800e+01, 1.000e+00, 1.700e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [2.400e+01, 3.419e+03, 2.400e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [4.700e+01, 3.800e+01, 1.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       ...,
       [3.810e+02, 9.610e+02, 6.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [3.870e+02, 3.780e+02, 2.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [5.780e+02, 2.100e+01, 1.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00]])
x_train[0]
array([ 58.,   1.,  17.,   1.,  15., 307.,   1.,  93., 109.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.])
y_train = TB.label
y_train
0        1
1        1
2        1
3        1
4        1
        ..
11982    0
11983    0
11984    0
11985    0
11986    0
Name: label, Length: 11987, dtype: int64

x_train is the array of token-index sequences for the reviews (padded/truncated to length 30), and y_train is the corresponding label for each review (1 for positive, 0 for negative).
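
For reference, the two nested loops above can also be expressed with Keras's built-in padding utility; the sketch below applies the same +1 index shift and keeps the first 30 tokens of each review (x_train_alt is an illustrative name):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# Map each review to shifted vocabulary indices, then pad/truncate to length 30
sequences = [[1 + myWord2Vec.wv.get_index(tok) for tok in toks] for toks in TB.text]
x_train_alt = pad_sequences(sequences, maxlen=30, padding='post', truncating='post')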

6. Build the RNN

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

RNN = keras.Sequential(name='RNN')

# Embedding layer: vocabulary size + 1 to account for the padding row at index 0
RNN.add(layers.Embedding(len(myWord2Vec.wv.key_to_index) + 1, 250))

# Recurrent layer with 64 units
RNN.add(layers.SimpleRNN(64))

# Two-way softmax output: positive vs negative
RNN.add(layers.Dense(2, activation='softmax'))

RNN.summary()

This builds the RNN. First an Embedding layer maps each word index to a 250-dimensional vector (the input dimension is the vocabulary size plus one for the padding row). Then comes a SimpleRNN layer with 64 units, followed by a fully connected layer with 2 output units and a softmax activation for classification. Finally, summary() prints the model structure.

Model: "RNN"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 250)         2752250   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 64)                20160     
_________________________________________________________________
dense (Dense)                (None, 2)                 130       
=================================================================
Total params: 2,772,540
Trainable params: 2,772,540
Non-trainable params: 0
_________________________________________________________________
# Row 0 of the embedding matrix is the all-zero padding vector and must not be trained
RNN.layers[0].set_weights([embedding_matrix])
RNN.layers[0].trainable = False
RNN.summary()

This initializes the first layer of the RNN (the word-embedding layer) with the pre-trained matrix embedding_matrix and marks it as non-trainable, so its weights are not updated during training.

Model: "RNN"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 250)         2752250   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 64)                20160     
_________________________________________________________________
dense (Dense)                (None, 2)                 130       
=================================================================
Total params: 2,772,540
Trainable params: 20,290
Non-trainable params: 2,752,250
_________________________________________________________________
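As an aside, Keras's Embedding layer also accepts mask_zero=True, which treats index 0 as padding and lets the downstream SimpleRNN skip those timesteps entirely; this fits naturally with the index shift used here. A variant of the embedding layer:

# Alternative: mask the padding index so the RNN ignores padded positions
layers.Embedding(len(myWord2Vec.wv.key_to_index) + 1, 250, mask_zero=True)
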
RNN.compile(optimizer='Adam',
            loss=keras.losses.sparse_categorical_crossentropy,
            metrics=['accuracy'])

The model is compiled with the Adam optimizer, sparse categorical cross-entropy as the loss function, and accuracy as the evaluation metric.

Model training

RNN.fit(x_train, y_train, epochs=50)
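
Note that this article both trains and evaluates on the full dataset. To watch for overfitting during training, one could instead hold out part of the data, e.g. (an optional variant, not what the run shown here used):

# Optional: reserve 20% of the samples for validation during training
RNN.fit(x_train, y_train, epochs=50, validation_split=0.2)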

Making predictions

predicts = RNN.predict(x_train)

Prediction results

classes_x = np.argmax(predicts, axis=1)
print(classes_x[:10])
[1 0 1 1 1 1 1 1 0 1]
y_train[:10]
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: label, dtype: int64
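
Comparing the predicted classes with the labels directly yields the training accuracy, which should match what evaluate() reports below:

# Fraction of training reviews classified correctly
print(np.mean(classes_x == y_train.values))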

Model evaluation

RNN.evaluate(x_train,y_train)
375/375 [==============================] - 0s 1ms/step - loss: 0.2434 - accuracy: 0.9121
[0.24341031908988953, 0.9120714068412781]
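
Since the classes are imbalanced (roughly 2:1 negative to positive), per-class metrics are more informative than overall accuracy; scikit-learn's classification_report is one option:

# Per-class precision/recall on the training data (requires scikit-learn)
from sklearn.metrics import classification_report
print(classification_report(y_train, classes_x))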

7. Source code

Link: https://pan.baidu.com/s/1bIuxrg89kil0T5GsjTMGEQ
Extraction code: m5f3
