0 序
本文保证干货满满~
看完本文后,你只需要一个glove或者其他已经训练好的词库,也就是一个类似txt的文件,那么你就可以把一个英文单词用一个多维(如300维向量)表示出来!并且会带入到keras中训练一条龙服务~说专业点,这就是词嵌入。在之前 ,我们使用过keras自带的embedding层进行词嵌入,效果肯定是没有glove这些好的。
keras自带的词嵌入使用如下:
model = Sequential()
model.add(Embedding(2000,embed_dim, input_length=X.shape[1],dropout = 0.2))
其中2000代表不同的单词数,也就是有2000个不同的单词。embd_dim为单个单词变成的维度。
当然了,在此之前,我默认你已经对一个句子进行了fit_on_text以及text_to_sentence以及pad_sentence操作。如果这些基础知识不明白,建议先看
https://blog.csdn.net/ssswill/article/details/88533623
不要灰心。因为知识确实是一步一步来的。上面的链接我用keras自带的embedding层实现了对urdu小语种的情感分析。没有用一些高级的embedding库原因就是没有这种语言的训练好的的词库。。。
1.glove、fasttext几个G的txt文件里到底是个啥?
其实拿到这些训练好的库,我们第一时间就会想,这文件那么大,里面到底是什么东西?那么大打开都要小半天。好,接下来我来打开这些文件先看看里面是些啥东西,再做后面的步骤。
https://fasttext.cc/docs/en/english-vectors.html
https://nlp.stanford.edu/projects/glove/
上面的链接中包含了我们说的两种训练好的模型,或者说是词库。
#crawl-300d-2M.vec--> https://fasttext.cc/docs/en/english-vectors.html
#When pre-train embedding is helpful? https://www.aclweb.org/anthology/N18-2084
#There are many pretrained word embedding models:
#fasttext, GloVe, Word2Vec, etc
#crawl-300d-2M.vec is trained from Common Crawl (a website that collects almost everything)
#it has 2 million words. Each word is represent by a vector of 300 dimensions.
#https://nlp.stanford.edu/projects/glove/
#GloVe is similar to crawl-300d-2M.vec. Probably, they use different algorithms.
#glove.840B.300d.zip: Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
#tokens mean words. It has 2.2M different words and 840B (likely duplicated) words in total
上面是一些介绍。
在下载完任意一个词库之后,开始打开~
with open(file) as f:
for i in range(10):
print(f.readline())
我们打印前10行,输出是啥呢?就是这个!
, -0.082752 0.67204 -0.14987 -0.064983 0.056491 0.40228 0.0027747 -0.3311 -0.30691 2.0817 0.031819 0.013643 0.30265 0.0071297 -0.5819 -0.2774 -0.062254 1.1451 -0.24232 0.1235 -0.12243 0.33152 -0.006162 -0.30541 -0.13057 -0.054601 0.037083 -0.070552 0.5893 -0.30385 0.2898 -0.14653 -0.27052 0.37161 0.32031 -0.29125 0.0052483 -0.13212 -0.052736 0.087349 -0.26668 -0.16897 0.015162 -0.0083746 -0.14871 0.23413 -0.20719 -0.091386 0.40075 -0.17223 0.18145 0.37586 -0.28682 0.37289 -0.16185 0.18008 0.3032 -0.13216 0.18352 0.095759 0.094916 0.008289 0.11761 0.34046 0.03677 -0.29077 0.058303 -0.027814 0.082941 0.1862 -0.031494 0.27985 -0.074412 -0.13762 -0.21866 0.18138 0.040855 -0.113 0.24107 0.3657 -0.27525 -0.05684 0.34872 0.011884 0.14517 -0.71395 0.48497 0.14807 0.62287 0.20599 0.58379 -0.13438 0.40207 0.18311 0.28021 -0.42349 -0.25626 0.17715 -0.54095 0.16596 -0.036058 0.08499 -0.64989 0.075549 -0.28831 0.40626 -0.2802 0.094062 0.32406 0.28437 -0.26341 0.11553 0.071918 -0.47215 -0.18366 -0.34709 0.29964 -0.66514 0.002516 -0.42333 0.27512 0.36012 0.16311 0.23964 -0.05923 0.3261 0.20559 0.038677 -0.045816 0.089764 0.43151 -0.15954 0.08532 -0.26572 -0.15001 0.084286 -0.16714 -0.43004 0.060807 0.13121 -0.24112 0.66554 0.4453 -0.18019 -0.13919 0.56252 0.21457 -0.46443 -0.012211 0.029988 -0.051094 -0.20135 0.80788 0.47377 -0.057647 0.46216 0.16084 -0.20954 -0.05452 0.15572 -0.13712 0.12972 -0.011936 -0.003378 -0.13595 -0.080711 0.20065 0.054056 0.046816 0.059539 0.046265 0.17754 -0.31094 0.28119 -0.24355 0.085252 -0.21011 -0.19472 0.0027297 -0.46341 0.14789 -0.31517 -0.065939 0.036106 0.42903 -0.33759 0.16432 0.32568 -0.050392 -0.054297 0.24074 0.41923 0.13012 -0.17167 -0.37808 -0.23089 -0.019477 -0.29291 -0.30824 0.30297 -0.22659 0.081574 -0.18516 -0.21408 0.40616 -0.28974 0.074174 -0.17795 0.28595 -0.039626 -0.2339 -0.36054 -0.067503 -0.091065 0.23438 -0.0041331 0.003232 0.0072134 0.008697 0.21614 0.049904 0.35582 0.13748 0.073361 0.14166 0.2412 -0.013322 0.15613 0.083381 0.088146 -0.019357 0.43795 0.083961 0.45309 -0.50489 -0.10865 -0.2527 -0.18251 0.20441 0.13319 0.1294 0.050594 -0.15612 -0.39543 0.12538 0.24881 -0.1927 -0.31847 -0.12719 0.4341 0.31177 -0.0040946 -0.2094 -0.079961 0.1161 -0.050794 0.015266 -0.2803 -0.12486 0.23587 0.2339 -0.14023 0.028462 0.56923 -0.1649 -0.036429 0.010051 -0.17107 -0.042608 0.044965 -0.4393 -0.26137 0.30088 -0.060772 -0.45312 -0.19076 -0.20288 0.27694 -0.060888 0.11944 0.62206 -0.19343 0.47849 -0.30113 0.059389 0.074901 0.061068 -0.4662 0.40054 -0.19099 -0.14331 0.018267 -0.18643 0.20709 -0.35598 0.05338 -0.050821 -0.1918 -0.37846 -0.06589
. 0.012001 0.20751 -0.12578 -0.59325 0.12525 0.15975 0.13748 -0.33157 -0.13694 1.7893 -0.47094 0.70434 0.26673 -0.089961 -0.18168 0.067226 0.053347 1.5595 -0.2541 0.038413 -0.01409 0.056774 0.023434 0.024042 0.31703 0.19025 -0.37505 0.035603 0.1181 0.012032 -0.037566 -0.5046 -0.049261 0.092351 0.11031 -0.073062 0.33994 0.28239 0.13413 0.070128 -0.022099 -0.28103 0.49607 -0.48693 -0.090964 -0.1538 -0.38011 -0.014228 -0.19392 -0.11068 -0.014088 -0.17906 0.24509 -0.16878 -0.15351 -0.13808 0.02151 0.13699 0.0068061 -0.14915 -0.38169 0.12727 0.44007 0.32678 -0.46117 0.068687 0.34747 0.18827 -0.31837 0.4447 -0.2095 -0.26987 0.48945 0.15388 0.05295 -0.049831 0.11207 0.14881 -0.37003 0.30777 -0.33865 0.045149 -0.18987 0.26634 -0.26401 -0.47556 0.68381 -0.30653 0.24606 0.31611 -0.071098 0.030417 0.088119 0.045025 0.20125 -0.21618 -0.36371 -0.25948 -0.42398 -0.14305 -0.10208 0.21498 -0.21924 -0.17935 0.21546 0.13801 0.24504 -0.2559 0.054815 0.21307 0.2564 -0.25673 0.17961 -0.47638 -0.25181 -0.0091498 -0.054362 -0.21007 0.12597 -0.40795 -0.021164 0.20585 0.18925 -0.0051896 -0.51394 0.28862 -0.077748 -0.27676 0.46567 -0.14225 -0.17879 -0.4357 -0.32481 0.15034 -0.058367 0.49652 0.20472 0.019866 0.13326 0.12823 -1.0177 0.29007 0.28995 0.029994 -0.10763 0.28665 -0.24387 0.22905 -0.26249 -0.069269 -0.17889 0.21936 0.15146 0.04567 -0.050497 0.071482 -0.1027 -0.080705 0.30296 0.031302 0.26613 -0.0060951 0.10313 -0.39987 -0.043945 -0.057625 0.08702 -0.098152 0.22835 -0.005211 0.038075 0.01591 -0.20622 0.021853 0.0040426 -0.043063 -0.002294 -0.26097 -0.25802 -0.28158 -0.23118 -0.010404 -0.30102 -0.4042 0.014653 -0.10445 0.30377 -0.20957 0.3119 0.068272 0.1008 0.010423 0.54011 0.29865 0.12653 0.013761 0.21738 -0.39521 0.066633 0.50327 0.14913 -0.11554 0.010042 0.095698 0.16607 -0.18808 0.055019 0.026715 -0.3164 -0.046583 -0.051591 0.023475 -0.11007 0.085642 0.28394 0.040497 0.071986 0.14157 -0.021199 0.44718 0.20088 -0.12964 -0.067183 0.47614 0.13394 -0.17287 -0.37324 -0.17285 0.02683 -0.1316 0.09116 -0.46487 0.1274 -0.090159 -0.10552 0.068006 -0.13381 0.17056 0.089509 -0.23133 -0.27572 0.061534 -0.051646 0.28377 0.25286 -0.24139 -0.19905 0.12049 -0.1011 0.27392 0.27843 0.26449 -0.18292 -0.048961 0.19198 0.17192 0.33659 -0.20184 -0.34305 -0.24553 -0.15399 0.3945 0.22839 -0.25753 -0.25675 <