I. Data Preparation
Chinese data source: ChineseNLPCorpus — Chinese NLP datasets, handy material for everyday experiments; contributions and merge requests are welcome.
Chinese stopword lists: GitHub - goto456/stopwords: commonly used Chinese stopword lists (the HIT list, the Baidu list, etc.)
Chinese stopword list mirror: Baidu Netdisk, extraction code: xbf1
Test corpus: user reviews collected from a food-delivery platform, 4,000 positive and about 8,000 negative — https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/waimai_10k/intro.ipynb
II. Experiment Workflow
1. Loading the dataset
# Load the data
import pandas as pd
TB = pd.read_csv('waimai_10k.csv')
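As a quick check on the class balance described above, a minimal sketch (the column names label and review come from the dataset itself):
# Inspect the loaded data: 'label' is 1 for a positive review, 0 for a
# negative one; 'review' holds the raw review text.
print(TB.shape)                 # expected: (11987, 2)
print(TB.label.value_counts())  # roughly 4000 positive vs. 8000 negative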
2. Word segmentation
# Segment with jieba
import jieba
# Use a lambda to segment every review in the column
TB.review.apply(lambda sen: jieba.lcut(sen))
0 [很快, ,, 好吃, ,, 味道, 足, ,, 量, 大]
1 [没有, 送水, 没有, 送水, 没有, 送水]
2 [非常, 快, ,, 态度, 好, 。]
3 [方便, ,, 快捷, ,, 味道, 可口, ,, 快, 递给, 力]
4 [菜, 味道, 很棒, !, 送餐, 很, 及时, !]
...
11982 [以前, 几乎, 天天, 吃, ,, 现在, 调料, 什么, 都, 不放, ,]
11983 [昨天, 订, 凉皮, 两份, ,, 什么, 调料, 都, 没有, 放, ,, 就, 放, ...
11984 [凉皮, 太辣, ,, 吃不下, 都]
11985 [本来, 迟到, 了, 还, 自己, 点, !, !, !]
11986 [肉夹馍, 不错, ,, 羊肉, 泡馍, 酱肉, 包, 很, 一般, 。, 凉面, 没, 想...
Name: review, Length: 11987, dtype: object
Attach the segmentation result to the data table as a new column:
TB['text'] = TB.review.apply(lambda sen: jieba.lcut(sen))  # jieba segmentation
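Note that the stopword lists from Section I are never actually used below; if you want to strip stopwords (and punctuation such as commas) before training, a minimal sketch, assuming stopwords.txt is one of the one-word-per-line files from the goto456/stopwords repo:
# Optional: remove stopwords after segmentation (not part of the original flow).
# 'stopwords.txt' is a hypothetical local copy of a goto456/stopwords list.
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)
TB['text'] = TB.text.apply(lambda words: [w for w in words if w not in stopwords])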

3. Building the Word2Vec model
# Import and train a Word2Vec model
from gensim.models import Word2Vec
myWord2Vec = Word2Vec(sentences=TB.text, vector_size=250, sg=1, min_count=1)  # min_count=1: keep every word that occurs at least once
print(myWord2Vec)
Word2Vec<vocab=11008, vector_size=250, alpha=0.025>
The vector size (vector_size) is set to roughly the average sentence length, about 250.
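A quick way to sanity-check the trained embeddings is gensim's standard similarity query (a sketch; the probe word "好吃" / "tasty" appears in the corpus):
# Nearest neighbours of "好吃"; semantically related food words here
# suggest the embeddings captured something useful.
for word, score in myWord2Vec.wv.most_similar("好吃", topn=5):
    print(word, round(score, 3))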
Get the word vector for "," and its shape:
print(myWord2Vec.wv.get_vector(","))
print(myWord2Vec.wv.get_vector(",").shape)
[-1.24408239e-02 -1.14726588e-01  2.47033477e-01  1.31037146e-01
 -1.94809809e-01 -2.16878414e-01  7.63326213e-02  4.79248613e-01
 ... (output truncated) ...
 -2.45497152e-01 -1.30919814e-01 -1.97613224e-01  7.56083280e-02]
(250,)
The shape (250,) confirms this is a single 250-dimensional vector.
Get the index of "," in the Word2Vec vocabulary:
myWord2Vec.wv.get_index(",")
0
Get the vector for "好吃" from the Word2Vec vocabulary:
vc = myWord2Vec.wv.get_vector("好吃")
vc
array([-3.69522065e-01,  4.37786151e-03,  2.29101434e-01,  2.26836745e-02,
        1.03728533e-01, -3.62663567e-01, -4.03029437e-04,  2.37036154e-01,
       ... (output truncated) ...
       -1.10847458e-01,  2.37356618e-01], dtype=float32)
The shape of the full set of word vectors in the model:
myWord2Vec.wv.vectors.shape
(11008, 250)
So the model contains 11,008 word vectors.
List all words in the vocabulary:
myWord2Vec.wv.index_to_key

Get every word together with its index:
myWord2Vec.wv.key_to_index
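index_to_key is a plain Python list ordered by corpus frequency, so the most common tokens come first; a short peek (a sketch):
# The vocabulary is frequency-sorted: index 0 holds the most common token,
# which in this corpus is the comma.
print(myWord2Vec.wv.index_to_key[:10])
print(myWord2Vec.wv.key_to_index["好吃"])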

4. Shifting the comma off index 0
From key_to_index we can see ',': 0 — index 0 of the vocabulary corresponds to the comma. When sequences of unequal length are padded with 0, the padding would in effect insert commas, which is not what we want. We must free up index 0 by shifting the whole vocabulary back one position.

embedding_matrix = myWord2Vec.wv.vectors  # the trained vectors, shape (11008, 250)
embedding_matrix[0]  # index 0 currently points at the comma
array([-1.24408239e-02, -1.14726588e-01,  2.47033477e-01,  1.31037146e-01,
       -1.94809809e-01, -2.16878414e-01,  7.63326213e-02,  4.79248613e-01,
       ... (output truncated) ...
       -1.97613224e-01,  7.56083280e-02], dtype=float32)
import numpy as np
embedding_matrix = np.vstack((np.zeros(250), embedding_matrix))
This stacks an all-zero row on top of embedding_matrix: it creates a zero vector of length 250 and uses numpy's vstack to place it vertically above the original matrix.
embedding_matrix.shape
(11009, 250)
By stacking an all-zero vector at the very top, the comma (and every other word) moves back one position, so the matrix now holds 11,009 vectors.
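To double-check the alignment after the shift, a small sketch: vocabulary index i should now correspond to row i + 1 of embedding_matrix.
# After prepending the zero row, vocabulary index i maps to matrix row i + 1.
i = myWord2Vec.wv.get_index("好吃")
assert np.allclose(embedding_matrix[i + 1], myWord2Vec.wv.get_vector("好吃"))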
embedding_matrix[0]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       ... (output truncated; 250 zeros in total) ...
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
The vector at index 0 is now all zeros.
len(TB.text)
11987
There are 11,987 sentences to process in total.
x_train = np.zeros([len(TB.text), 30], dtype="float64")
This creates a NumPy array x_train of shape [len(TB.text), 30] with dtype "float64": len(TB.text) is the number of samples, and 30 is the sequence length, i.e. each review is truncated to its first 30 tokens for training.
x_train
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
5. Building the training set
# Build the training set
for i in range(len(TB.text)):
    for j in range(min(len(TB.text[i]), 30)):  # cap each sentence at 30 tokens
        x_train[i, j] = 1 + myWord2Vec.wv.get_index(TB.text[i][j])  # +1 because row 0 is the all-zero padding vector, which takes no part in training
This loop converts each sentence into a sequence of vocabulary indices (shifted by 1 so that 0 is reserved for padding) and stores up to the first 30 of them in x_train. The word vectors themselves are looked up later by the model's embedding layer; myWord2Vec is the Word2Vec model trained above.
x_train
array([[5.800e+01, 1.000e+00, 1.700e+01, ..., 0.000e+00, 0.000e+00,
0.000e+00],
[2.400e+01, 3.419e+03, 2.400e+01, ..., 0.000e+00, 0.000e+00,
0.000e+00],
[4.700e+01, 3.800e+01, 1.000e+00, ..., 0.000e+00, 0.000e+00,
0.000e+00],
...,
[3.810e+02, 9.610e+02, 6.000e+00, ..., 0.000e+00, 0.000e+00,
0.000e+00],
[3.870e+02, 3.780e+02, 2.000e+00, ..., 0.000e+00, 0.000e+00,
0.000e+00],
[5.780e+02, 2.100e+01, 1.000e+00, ..., 0.000e+00, 0.000e+00,
0.000e+00]])
x_train[0]
array([ 58., 1., 17., 1., 15., 307., 1., 93., 109., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0.])
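The same index-and-pad step can also be written with Keras's standard pad_sequences utility (a sketch, not the author's original code; padding='post' matches the zero-padding at the end of each row):
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Map each token to its shifted index, then pad/truncate to length 30.
sequences = [[1 + myWord2Vec.wv.get_index(w) for w in words] for words in TB.text]
x_train_alt = pad_sequences(sequences, maxlen=30, padding='post', truncating='post')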
y_train = TB.label
y_train
0 1
1 1
2 1
3 1
4 1
..
11982 0
11983 0
11984 0
11985 0
11986 0
Name: label, Length: 11987, dtype: int64
x_train holds each review as an array of word indices (max length 30), and y_train holds each review's class (1 = positive, 0 = negative).
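Note that this walkthrough trains and evaluates on the same data; a held-out split gives a more honest estimate of generalization. A sketch with scikit-learn (an extra dependency not used in the original):
from sklearn.model_selection import train_test_split

# Hold out 20% of the reviews for evaluation, keeping the class ratio.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42, stratify=y_train)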
6. Building the RNN
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
RNN = keras.Sequential(name='RNN')
# Build the RNN
RNN.add(layers.Embedding(len(myWord2Vec.wv.key_to_index) + 1, 250))
RNN.add(layers.SimpleRNN(64))
RNN.add(layers.Dense(2, activation='softmax'))
RNN.summary()
This builds the RNN model: first an embedding layer that maps vocabulary indices to 250-dimensional vectors (vocabulary size + 1 to account for the padding row), then a SimpleRNN layer with 64 units, and finally a dense layer with 2 output units and softmax activation for classification. summary() prints the model structure.
Model: "RNN" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, None, 250) 2752250 _________________________________________________________________ simple_rnn (SimpleRNN) (None, 64) 20160 _________________________________________________________________ dense (Dense) (None, 2) 130 ================================================================= Total params: 2,772,540 Trainable params: 2,772,540 Non-trainable params: 0 _________________________________________________________________
# Load the pretrained vectors and freeze the embedding layer (row 0 stays the all-zero padding vector)
RNN.layers[0].set_weights([embedding_matrix])
RNN.layers[0].trainable = False
RNN.summary()
This initializes the first layer (the word-embedding layer) with the pretrained matrix embedding_matrix and marks it as non-trainable, so its weights are not updated during training.
Model: "RNN" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, None, 250) 2752250 _________________________________________________________________ simple_rnn (SimpleRNN) (None, 64) 20160 _________________________________________________________________ dense (Dense) (None, 2) 130 ================================================================= Total params: 2,772,540 Trainable params: 20,290 Non-trainable params: 2,752,250 _________________________________________________________________
RNN.compile(optimizer='Adam',
            loss=keras.losses.sparse_categorical_crossentropy,
            metrics=['accuracy'])
This compiles the RNN with the Adam optimizer, sparse categorical cross-entropy loss (the labels are plain integers, not one-hot vectors), and accuracy as the evaluation metric.
Train the model:
RNN.fit(x_train,y_train,epochs=50)

Run predictions on the training data:
predicts = RNN.predict(x_train)
Prediction results:
classes_x=np.argmax(predicts,axis=1)
print(classes_x[:10])
[1 0 1 1 1 1 1 1 0 1]
y_train[:10]
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    1
Name: label, dtype: int64
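Beyond eyeballing the first ten rows, the agreement can be quantified (a sketch; sklearn is again an assumed extra dependency):
from sklearn.metrics import accuracy_score, confusion_matrix

# Compare the argmax predictions with the true labels.
print(accuracy_score(y_train, classes_x))
print(confusion_matrix(y_train, classes_x))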
Model evaluation:
RNN.evaluate(x_train,y_train)
375/375 [==============================] - 0s 1ms/step - loss: 0.2434 - accuracy: 0.9121
[0.24341031908988953, 0.9120714068412781]
7. Source code
Link: https://pan.baidu.com/s/1bIuxrg89kil0T5GsjTMGEQ
Extraction code: m5f3