如何使用glove,fasttext等词库进行word embedding?(原理篇)

10 篇文章 1 订阅

0 序

本文保证干货满满~
看完本文后,你只需要一个glove或者其他已经训练好的词库,也就是一个类似txt的文件,那么你就可以把一个英文单词用一个多维(如300维向量)表示出来!并且会带入到keras中训练一条龙服务~说专业点,这就是词嵌入。在之前 ,我们使用过keras自带的embedding层进行词嵌入,效果肯定是没有glove这些好的。
keras自带的词嵌入使用如下:

model = Sequential()
model.add(Embedding(2000,embed_dim, input_length=X.shape[1],dropout = 0.2))

其中2000代表不同的单词数,也就是有2000个不同的单词。embd_dim为单个单词变成的维度。
当然了,在此之前,我默认你已经对一个句子进行了fit_on_text以及text_to_sentence以及pad_sentence操作。如果这些基础知识不明白,建议先看
https://blog.csdn.net/ssswill/article/details/88533623
不要灰心。因为知识确实是一步一步来的。上面的链接我用keras自带的embedding层实现了对urdu小语种的情感分析。没有用一些高级的embedding库原因就是没有这种语言的训练好的的词库。。。

1.glove、fasttext几个G的txt文件里到底是个啥?

其实拿到这些训练好的库,我们第一时间就会想,这文件那么大,里面到底是什么东西?那么大打开都要小半天。好,接下来我来打开这些文件先看看里面是些啥东西,再做后面的步骤。
https://fasttext.cc/docs/en/english-vectors.html
https://nlp.stanford.edu/projects/glove/
上面的链接中包含了我们说的两种训练好的模型,或者说是词库。

#crawl-300d-2M.vec--> https://fasttext.cc/docs/en/english-vectors.html
#When pre-train embedding is helpful? https://www.aclweb.org/anthology/N18-2084
#There are many pretrained word embedding models: 
#fasttext, GloVe, Word2Vec, etc
#crawl-300d-2M.vec is trained from Common Crawl (a website that collects almost everything)
#it has 2 million words. Each word is represent by a vector of 300 dimensions.

#https://nlp.stanford.edu/projects/glove/
#GloVe is similar to crawl-300d-2M.vec. Probably, they use different algorithms.
#glove.840B.300d.zip: Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download)
#tokens mean words. It has 2.2M different words and 840B (likely duplicated) words in total

上面是一些介绍。
在下载完任意一个词库之后,开始打开~

with open(file) as f:
    for i in range(10):
        print(f.readline())

我们打印前10行,输出是啥呢?就是这个!

, -0.082752 0.67204 -0.14987 -0.064983 0.056491 0.40228 0.0027747 -0.3311 -0.30691 2.0817 0.031819 0.013643 0.30265 0.0071297 -0.5819 -0.2774 -0.062254 1.1451 -0.24232 0.1235 -0.12243 0.33152 -0.006162 -0.30541 -0.13057 -0.054601 0.037083 -0.070552 0.5893 -0.30385 0.2898 -0.14653 -0.27052 0.37161 0.32031 -0.29125 0.0052483 -0.13212 -0.052736 0.087349 -0.26668 -0.16897 0.015162 -0.0083746 -0.14871 0.23413 -0.20719 -0.091386 0.40075 -0.17223 0.18145 0.37586 -0.28682 0.37289 -0.16185 0.18008 0.3032 -0.13216 0.18352 0.095759 0.094916 0.008289 0.11761 0.34046 0.03677 -0.29077 0.058303 -0.027814 0.082941 0.1862 -0.031494 0.27985 -0.074412 -0.13762 -0.21866 0.18138 0.040855 -0.113 0.24107 0.3657 -0.27525 -0.05684 0.34872 0.011884 0.14517 -0.71395 0.48497 0.14807 0.62287 0.20599 0.58379 -0.13438 0.40207 0.18311 0.28021 -0.42349 -0.25626 0.17715 -0.54095 0.16596 -0.036058 0.08499 -0.64989 0.075549 -0.28831 0.40626 -0.2802 0.094062 0.32406 0.28437 -0.26341 0.11553 0.071918 -0.47215 -0.18366 -0.34709 0.29964 -0.66514 0.002516 -0.42333 0.27512 0.36012 0.16311 0.23964 -0.05923 0.3261 0.20559 0.038677 -0.045816 0.089764 0.43151 -0.15954 0.08532 -0.26572 -0.15001 0.084286 -0.16714 -0.43004 0.060807 0.13121 -0.24112 0.66554 0.4453 -0.18019 -0.13919 0.56252 0.21457 -0.46443 -0.012211 0.029988 -0.051094 -0.20135 0.80788 0.47377 -0.057647 0.46216 0.16084 -0.20954 -0.05452 0.15572 -0.13712 0.12972 -0.011936 -0.003378 -0.13595 -0.080711 0.20065 0.054056 0.046816 0.059539 0.046265 0.17754 -0.31094 0.28119 -0.24355 0.085252 -0.21011 -0.19472 0.0027297 -0.46341 0.14789 -0.31517 -0.065939 0.036106 0.42903 -0.33759 0.16432 0.32568 -0.050392 -0.054297 0.24074 0.41923 0.13012 -0.17167 -0.37808 -0.23089 -0.019477 -0.29291 -0.30824 0.30297 -0.22659 0.081574 -0.18516 -0.21408 0.40616 -0.28974 0.074174 -0.17795 0.28595 -0.039626 -0.2339 -0.36054 -0.067503 -0.091065 0.23438 -0.0041331 0.003232 0.0072134 0.008697 0.21614 0.049904 0.35582 0.13748 0.073361 0.14166 0.2412 -0.013322 0.15613 0.083381 0.088146 -0.019357 0.43795 0.083961 0.45309 -0.50489 -0.10865 -0.2527 -0.18251 0.20441 0.13319 0.1294 0.050594 -0.15612 -0.39543 0.12538 0.24881 -0.1927 -0.31847 -0.12719 0.4341 0.31177 -0.0040946 -0.2094 -0.079961 0.1161 -0.050794 0.015266 -0.2803 -0.12486 0.23587 0.2339 -0.14023 0.028462 0.56923 -0.1649 -0.036429 0.010051 -0.17107 -0.042608 0.044965 -0.4393 -0.26137 0.30088 -0.060772 -0.45312 -0.19076 -0.20288 0.27694 -0.060888 0.11944 0.62206 -0.19343 0.47849 -0.30113 0.059389 0.074901 0.061068 -0.4662 0.40054 -0.19099 -0.14331 0.018267 -0.18643 0.20709 -0.35598 0.05338 -0.050821 -0.1918 -0.37846 -0.06589

. 0.012001 0.20751 -0.12578 -0.59325 0.12525 0.15975 0.13748 -0.33157 -0.13694 1.7893 -0.47094 0.70434 0.26673 -0.089961 -0.18168 0.067226 0.053347 1.5595 -0.2541 0.038413 -0.01409 0.056774 0.023434 0.024042 0.31703 0.19025 -0.37505 0.035603 0.1181 0.012032 -0.037566 -0.5046 -0.049261 0.092351 0.11031 -0.073062 0.33994 0.28239 0.13413 0.070128 -0.022099 -0.28103 0.49607 -0.48693 -0.090964 -0.1538 -0.38011 -0.014228 -0.19392 -0.11068 -0.014088 -0.17906 0.24509 -0.16878 -0.15351 -0.13808 0.02151 0.13699 0.0068061 -0.14915 -0.38169 0.12727 0.44007 0.32678 -0.46117 0.068687 0.34747 0.18827 -0.31837 0.4447 -0.2095 -0.26987 0.48945 0.15388 0.05295 -0.049831 0.11207 0.14881 -0.37003 0.30777 -0.33865 0.045149 -0.18987 0.26634 -0.26401 -0.47556 0.68381 -0.30653 0.24606 0.31611 -0.071098 0.030417 0.088119 0.045025 0.20125 -0.21618 -0.36371 -0.25948 -0.42398 -0.14305 -0.10208 0.21498 -0.21924 -0.17935 0.21546 0.13801 0.24504 -0.2559 0.054815 0.21307 0.2564 -0.25673 0.17961 -0.47638 -0.25181 -0.0091498 -0.054362 -0.21007 0.12597 -0.40795 -0.021164 0.20585 0.18925 -0.0051896 -0.51394 0.28862 -0.077748 -0.27676 0.46567 -0.14225 -0.17879 -0.4357 -0.32481 0.15034 -0.058367 0.49652 0.20472 0.019866 0.13326 0.12823 -1.0177 0.29007 0.28995 0.029994 -0.10763 0.28665 -0.24387 0.22905 -0.26249 -0.069269 -0.17889 0.21936 0.15146 0.04567 -0.050497 0.071482 -0.1027 -0.080705 0.30296 0.031302 0.26613 -0.0060951 0.10313 -0.39987 -0.043945 -0.057625 0.08702 -0.098152 0.22835 -0.005211 0.038075 0.01591 -0.20622 0.021853 0.0040426 -0.043063 -0.002294 -0.26097 -0.25802 -0.28158 -0.23118 -0.010404 -0.30102 -0.4042 0.014653 -0.10445 0.30377 -0.20957 0.3119 0.068272 0.1008 0.010423 0.54011 0.29865 0.12653 0.013761 0.21738 -0.39521 0.066633 0.50327 0.14913 -0.11554 0.010042 0.095698 0.16607 -0.18808 0.055019 0.026715 -0.3164 -0.046583 -0.051591 0.023475 -0.11007 0.085642 0.28394 0.040497 0.071986 0.14157 -0.021199 0.44718 0.20088 -0.12964 -0.067183 0.47614 0.13394 -0.17287 -0.37324 -0.17285 0.02683 -0.1316 0.09116 -0.46487 0.1274 -0.090159 -0.10552 0.068006 -0.13381 0.17056 0.089509 -0.23133 -0.27572 0.061534 -0.051646 0.28377 0.25286 -0.24139 -0.19905 0.12049 -0.1011 0.27392 0.27843 0.26449 -0.18292 -0.048961 0.19198 0.17192 0.33659 -0.20184 -0.34305 -0.24553 -0.15399 0.3945 0.22839 -0.25753 -0.25675 -0.37332 -0.23884 -0.048816 0.78323 0.18851 -0.26477 0.096566 0.062658 -0.30668 -0.43334 0.10006 0.21136 0.039459 -0.11077 0.24421 0.60942 -0.46646 0.086385 -0.39702 -0.23363 0.021307 -0.10778 -0.2281 0.50803 0.11567 0.16165 -0.066737 -0.29556 0.022612 -0.28135 0.0635 0.14019 0.13871 -0.36049 -0.035

the 0.27204 -0.06203 -0.1884 0.023225 -0.018158 0.0067192 -0.13877 0.17708 0.17709 2.5882 -0.35179 -0.17312 0.43285 -0.10708 0.15006 -0.19982 -0.19093 1.1871 -0.16207 -0.23538 0.003664 -0.19156 -0.085662 0.039199 -0.066449 -0.04209 -0.19122 0.011679 -0.37138 0.21886 0.0011423 0.4319 -0.14205 0.38059 0.30654 0.020167 -0.18316 -0.0065186 -0.0080549 -0.12063 0.027507 0.29839 -0.22896 -0.22882 0.14671 -0.076301 -0.1268 -0.0066651 -0.052795 0.14258 0.1561 0.05551 -0.16149 0.09629 -0.076533 -0.049971 -0.010195 -0.047641 -0.16679 -0.2394 0.0050141 -0.049175 0.013338 0.41923 -0.10104 0.015111 -0.077706 -0.13471 0.119 0.10802 0.21061 -0.051904 0.18527 0.17856 0.041293 -0.014385 -0.082567 -0.035483 -0.076173 -0.045367 0.089281 0.33672 -0.22099 -0.0067275 0.23983 -0.23147 -0.88592 0.091297 -0.012123 0.013233 -0.25799 -0.02972 0.016754 0.01369 0.32377 0.039546 0.042114 -0.088243 0.30318 0.087747 0.16346 -0.40485 -0.043845 -0.040697 0.20936 -0.77795 0.2997 0.2334 0.14891 -0.39037 -0.053086 0.062922 0.065663 -0.13906 0.094193 0.10344 -0.2797 0.28905 -0.32161 0.020687 0.063254 -0.23257 -0.4352 -0.017049 -0.32744 -0.047064 -0.075149 -0.18788 -0.015017 0.029342 -0.3527 -0.044278 -0.13507 -0.11644 -0.1043 0.1392 0.0039199 0.37603 0.067217 -0.37992 -1.1241 -0.057357 -0.16826 0.03941 0.2604 -0.023866 0.17963 0.13553 0.2139 0.052633 -0.25033 -0.11307 0.22234 0.066597 -0.11161 0.062438 -0.27972 0.19878 -0.36262 -1.0006e-05 -0.17262 0.29166 -0.15723 0.054295 0.06101 -0.39165 0.2766 0.057816 0.39709 0.025229 0.24672 -0.08905 0.15683 -0.2096 -0.22196 0.052394 -0.01136 0.050417 -0.14023 -0.042825 -0.031931 -0.21336 -0.20402 -0.23272 0.07449 0.088202 -0.11063 -0.33526 -0.014028 -0.29429 -0.086911 -0.1321 -0.43616 0.20513 0.0079362 0.48505 0.064237 0.14261 -0.43711 0.12783 -0.13111 0.24673 -0.27496 0.15896 0.43314 0.090286 0.24662 0.066463 -0.20099 0.1101 0.03644 0.17359 -0.15689 -0.086328 -0.17316 0.36975 -0.40317 -0.064814 -0.034166 -0.013773 0.062854 -0.17183 -0.12366 -0.034663 -0.22793 -0.23172 0.239 0.27473 0.15332 0.10661 -0.060982 -0.024805 -0.13478 0.17932 -0.37374 -0.02893 -0.11142 -0.08389 -0.055932 0.068039 -0.10783 0.1465 0.094617 -0.084554 0.067429 -0.3291 0.034082 -0.16747 -0.25997 -0.22917 0.020159 -0.02758 0.16136 -0.18538 0.037665 0.57603 0.20684 0.27941 0.16477 -0.018769 0.12062 0.069648 0.059022 -0.23154 0.24095 -0.3471 0.04854 -0.056502 0.41566 -0.43194 0.4823 -0.051759 -0.27285 -0.25893 0.16555 -0.1831 -0.06734 0.42457 0.010346 0.14237 0.25939 0.17123 -0.13821 -0.066846 0.015981 -0.30193 0.043579 -0.043102 0.35025 -0.19681 -0.4281 0.16899 0.22511 -0.28557 -0.1028 -0.018168 0.11407 0.13015 -0.18317 0.1323

and -0.18567 0.066008 -0.25209 -0.11725 0.26513 0.064908 0.12291 -0.093979 0.024321 2.4926 -0.017916 -0.071218 -0.24782 -0.26237 -0.2246 -0.21961 -0.12927 1.0867 -0.66072 -0.031617 -0.057328 0.056903 -0.27939 -0.39825 0.14251 -0.085146 -0.14779 0.055067 -0.0028687 -0.20917 -0.070735 0.22577 -0.15881 -0.10395 0.09711 -0.56251 -0.32929 -0.20853 0.0098711 0.049777 0.0014883 0.15884 0.042771 -0.0026956 -0.02462 -0.19213 -0.22556 0.10838 0.090086 -0.13291 0.32559 -0.17038 -0.1099 -0.23986 -0.024289 0.014656 -0.237 0.084828 -0.35982 -0.076746 0.048909 0.11431 -0.21013 0.24765 -0.017531 -0.14028 0.046191 0.22972 0.1175 0.12724 0.012992 0.4587 0.41085 0.039106 0.15713 -0.18376 0.26834 0.056662 0.16844 -0.053788 -0.091892 0.11193 -0.08681 -0.13324 0.15062 -0.31733 -0.22078 0.25038 0.34131 0.36419 -0.089514 -0.22193 0.24471 0.040091 0.47798 -0.029996 0.0019212 0.063511 -0.20417 -0.26478 0.20649 0.015573 -0.27722 -0.18861 -0.10289 -0.49773 0.14986 -0.010877 0.25085 -0.28117 0.18966 -0.065879 0.094753 -0.15338 -0.055071 -0.36747 0.24993 0.096527 0.23538 0.18405 0.052859 0.22967 0.12582 0.15536 -0.17275 0.33946 -0.10049 0.074948 -0.093575 -0.04049 -0.016922 -0.0058039 -0.18108 0.19537 0.45178 0.10965 0.2337 -0.09905 -0.078633 0.21678 -0.71231 -0.099759 0.33333 -0.1646 -0.091688 0.21056 0.023669 0.028922 0.1199 -0.12512 -0.026037 -0.062217 0.55816 0.0050273 -0.30888 0.038611 0.17568 -0.11163 -0.10815 -0.19444 0.29433 0.14519 -0.042878 0.18534 0.018891 -0.61883 0.13352 0.036007 0.33995 0.22109 -0.079328 0.071319 0.17678 0.16378 -0.23142 -0.1434 -0.098122 -0.019286 0.2356 -0.34013 -0.061007 -0.23208 -0.31152 0.10063 -0.15957 0.20183 -0.016345 -0.12303 0.022667 -0.20986 -0.20127 -0.087883 0.064731 0.10195 -0.1786 0.33056 0.21407 -0.32165 -0.17106 0.19407 -0.38618 -0.2148 -0.052254 0.023175 0.47389 0.18612 0.12711 0.20855 -0.10256 -0.12016 -0.40488 0.029695 -0.027419 -0.0085227 -0.11415 0.081134 -0.17228 0.19142 0.026514 0.043789 -0.12399 0.13354 0.10112 0.081682 -0.15085 0.0075806 -0.18971 0.24669 0.22491 0.35553 -0.3277 -0.21821 0.1402 0.28604 0.055226 -0.086544 0.02111 -0.19236 0.074245 0.076782 0.00081666 0.034097 -0.57719 0.10657 0.28134 -0.11964 -0.68281 -0.32893 -0.24442 -0.025847 0.0091273 0.2025 -0.050959 -0.11042 0.010962 0.076773 0.40048 -0.40739 -0.44773 0.31954 -0.036326 -0.012789 -0.17282 0.1476 0.2356 0.080642 -0.36528 -0.0083443 0.6239 -0.24379 0.019917 -0.28803 -0.010494 0.038412 -0.11718 -0.072462 0.16381 0.38488 -0.029783 0.23444 0.4532 0.14815 -0.027021 -0.073181 -0.1147 -0.0054545 0.47796 0.090912 0.094489 -0.36882 -0.59396 -0.097729 0.20072 0.17055 -0.0047356 -0.039709 0.32498 -0.023452 0.12302 0.3312

to 0.31924 0.06316 -0.27858 0.2612 0.079248 -0.21462 -0.10495 0.15495 -0.03353 2.4834 -0.50904 0.08749 0.21426 0.22151 -0.25234 -0.097544 -0.1927 1.3606 -0.11592 -0.10383 0.21929 0.11997 -0.11063 0.14212 -0.16643 0.21815 0.0042086 -0.070012 -0.23532 -0.26518 0.031248 0.16669 -0.089777 0.20059 0.31614 -0.5583 0.075735 0.27635 0.12741 -0.18185 -0.12722 0.024686 -0.077233 -0.48998 0.020355 0.0039164 0.1215 0.089723 -0.078975 0.081443 -0.099087 -0.055621 0.10737 -0.0044042 0.48496 0.11717 -0.017329 0.109 -0.35558 0.051084 0.15714 0.17961 -0.29711 0.033645 -0.025792 -0.013931 -0.23 -0.040306 0.22282 -0.013544 0.011554 0.3911 0.26533 -0.31012 0.40539 -0.042975 0.020811 -0.33033 0.19573 -0.037958 0.10274 -0.0013581 -0.44505 0.077886 0.08511 -0.20285 -0.19481 0.056933 0.53105 0.034154 -0.56996 -0.18469 0.093403 0.28044 -0.23349 0.10938 -0.014288 -0.274 0.034196 -0.098479 0.13268 0.19437 0.13463 -0.099059 0.040324 -0.66272 0.3571 0.15429 0.18598 0.087542 0.080538 -0.25121 0.24155 0.1783 0.036011 -0.027677 0.21161 -0.29107 -0.0083456 0.11317 0.31064 -0.10693 -0.27367 -0.039785 0.039881 0.034462 -0.16518 0.16115 0.060826 0.3075 -0.22398 0.14619 -0.2661 0.49732 -0.13996 -0.24287 0.039469 -0.084495 -0.24315 0.070701 -1.0136 -0.21733 -0.36878 -0.24973 0.17472 -0.011592 0.068561 -0.090411 0.21878 -0.2639 0.11904 0.14285 -0.18707 -0.13474 -0.13232 -0.26553 0.22947 -0.018215 0.0067383 -0.1019 0.10053 -0.1127 -0.13295 0.15951 0.14906 -0.095578 0.26992 0.011057 0.056568 0.021386 0.20215 0.00048589 0.5336 -0.22947 0.29275 0.17378 0.25423 -0.10976 0.058816 0.014616 -0.04306 0.10732 -0.028149 -0.19181 0.1025 -0.063892 0.012737 -0.12913 0.015037 0.26562 -0.017049 -0.060716 -0.094919 0.017775 0.13221 0.1683 -0.19323 -0.17612 0.075506 0.18939 0.12508 -0.1988 -0.16017 -0.21092 0.46933 0.044747 0.098349 0.011637 0.22281 -0.010837 -0.04833 -0.47335 -0.36811 -0.13592 -0.15086 0.25416 0.069531 0.14211 -0.26703 -0.1259 0.12076 -0.26117 0.033024 -0.034398 -0.13968 0.13446 -0.16709 0.15002 -0.13724 0.091226 -0.27718 0.020098 0.26919 0.43016 0.094019 -0.085496 -0.25192 -0.11645 -0.039734 0.0046738 0.54178 -0.16636 0.34546 0.098501 0.47819 -0.38428 -0.3238 -0.14822 -0.47817 0.16704 -0.064505 0.11834 -0.3448 0.096891 0.32309 0.41471 0.19463 -0.20891 -0.12223 -0.058298 -0.20268 0.2948 0.043397 0.10112 0.27177 -0.52124 -0.073794 0.044808 0.41388 0.088782 0.62255 -0.072391 0.090129 0.15428 0.023163 -0.13028 0.061762 0.33803 -0.091581 0.21039 0.05108 0.19184 0.10444 0.2138 -0.35091 -0.23702 0.038399 -0.10031 0.18359 0.025178 -0.12977 0.3713 0.18888 -0.0042738 -0.10645 -0.2581 -0.044629 0.082745 0.097801 0.25045

of 0.060216 0.21799 -0.04249 -0.38618 -0.15388 0.034635 0.22243 0.21718 0.0068483 2.4375 -0.27418 0.13572 0.31086 -0.063206 0.00038225 -0.18597 -0.19333 1.4447 -0.38541 -0.28549 0.075627 -0.036799 -0.46068 -0.016835 0.19821 -0.092746 0.18954 -0.00032648 -0.17081 0.50359 0.46256 0.26901 -0.12256 0.24713 0.069305 -0.20777 -0.4456 0.30223 -0.0098344 0.32772 0.11038 0.41271 -0.15854 -0.056983 0.38918 -0.21158 -0.13307 0.40406 0.1749 0.053949 0.10984 -0.18476 -0.054014 0.040112 -0.10175 0.12662 0.069709 -0.24071 -0.20995 -0.051381 0.28219 0.18598 -0.5018 0.27572 -0.18497 -0.18399 0.15696 -0.038444 -0.52238 0.22753 0.048672 -0.078837 0.065448 0.18399 0.40211 -0.12745 -0.12302 0.31072 0.099588 0.036047 -0.25946 0.36128 0.12748 -0.18667 0.16502 -0.3912 -0.67549 0.11291 0.040743 0.034973 -0.04091 -0.039791 -0.40544 -0.015867 0.10239 0.046868 -0.082776 0.015132 -0.14899 -0.25125 0.25244 -0.11851 -0.34127 0.016516 0.30405 -0.541 0.305 0.39065 0.42362 -0.41721 -0.054247 -0.26014 -0.14048 -0.14166 -0.02105 0.050822 -0.078053 0.45922 0.17598 -0.0157 0.09118 0.034263 -0.49995 0.028574 0.12068 0.19781 -0.013025 -0.22418 0.12503 0.14653 -0.23085 0.21987 -0.059321 -0.088169 -0.1252 0.0075112 -0.22421 0.6214 0.2009 -0.02899 -0.65073 0.0053506 -0.12073 0.20988 -0.1684 0.041826 0.054582 0.35247 0.2006 0.031903 -0.053307 -0.44009 0.22495 -0.30616 -0.32855 -0.015779 -0.13913 0.34309 -0.13569 -0.22276 0.14295 0.05501 -0.10616 0.23597 -0.20701 -0.30963 0.13528 -0.16144 0.29108 0.12301 0.2365 -0.26153 0.31022 0.20612 -0.19885 0.10971 -0.0018054 0.14621 0.15177 -0.4468 0.0067433 -0.028784 0.13821 -0.16566 -0.45517 0.016623 0.10703 -0.48399 0.040033 0.049625 -0.26454 -0.1468 0.13651 0.15261 0.067522 0.50405 -0.18848 0.15256 -0.26997 0.055578 0.047077 -0.17848 -0.33567 -0.03148 0.19107 0.18818 0.18778 0.18313 -0.364 -0.0054127 -0.15763 0.16386 -0.084828 -0.19838 -0.40454 0.41031 -0.41393 0.029771 0.10544 -0.11295 -0.068076 -0.22372 -0.19084 -0.080269 -0.38345 0.064712 0.23111 0.21408 0.28038 0.14221 -0.20696 0.015874 -0.14112 0.089859 -0.21533 -0.020105 0.22703 0.083425 -0.2958 0.018036 0.19885 0.17794 0.13688 -0.10302 0.029651 0.051271 -0.14787 -0.41824 0.019828 -0.26385 -0.074654 -0.015718 0.48094 0.12492 -0.11409 0.58127 0.095836 -0.095912 -0.057435 0.13883 0.10307 0.081362 -0.4669 0.50705 0.021685 -0.071623 -0.063827 -0.11154 0.61792 -0.56329 0.023565 0.18041 -0.2578 -0.50956 0.14737 -0.033317 -0.037053 0.24062 0.12641 -0.027091 0.4039 -0.02836 -0.022235 -0.11493 -0.2285 -0.05746 0.2952 -0.21914 -0.13307 -0.23647 -0.42484 0.11606 0.0048131 -0.39629 -0.26823 0.3292 -0.17597 0.11709 -0.16692 -0.094085

a 0.043798 0.024779 -0.20937 0.49745 0.36019 -0.37503 -0.052078 -0.60555 0.036744 2.2085 -0.23389 -0.06836 -0.22355 -0.053989 -0.15198 -0.17319 0.053355 1.6485 -0.047991 -0.085311 -0.15712 -0.64425 -0.39819 0.278 0.15364 0.031678 0.055414 0.015939 0.31851 -0.058979 0.038584 0.1077 0.1041 -0.077346 0.37396 -0.21482 0.3832 -0.27737 -0.18352 -0.83838 0.34124 0.58164 0.18543 -0.31028 0.17666 -0.069421 -0.34422 -0.13665 -0.10823 0.23637 -0.32923 0.61348 0.1972 0.087123 0.10785 0.3073 0.13757 0.30809 0.24331 -0.29422 -0.0098214 0.55675 -0.04888 0.099468 0.30543 -0.37597 -0.19525 0.046246 -0.036675 0.34023 0.14905 0.0978 -0.26664 0.056834 -0.043201 -0.23338 0.13111 -0.35742 -0.3607 0.30997 -0.19727 -0.1432 -0.16747 0.00042435 -0.1512 0.067562 -0.38644 0.025349 0.24918 -0.23955 -0.15615 0.49868 0.0082758 -0.1912 -0.14906 0.48757 -0.015281 0.010196 0.37642 -0.01946 -0.27835 0.16355 -0.24127 -0.21405 -0.21562 -0.79697 0.34321 0.093209 0.073977 -0.27147 0.20539 0.15061 0.020734 0.11267 0.028714 0.2967 -0.21267 0.43214 0.12788 0.29249 0.19056 -0.29113 -0.11382 -0.038242 -0.2029 0.18301 -0.16661 -0.27116 0.0012685 0.071704 -0.18583 0.08985 -0.039895 0.39479 0.0053211 -0.00061548 -0.27082 -0.089782 -0.2879 -0.14865 -1.3746 0.16515 0.20598 0.15252 0.034723 -0.38531 -0.094574 -0.19871 0.50239 -0.28702 -0.088727 0.056881 0.13634 0.19034 -0.19353 0.40506 -0.19317 0.22908 0.10055 -0.26895 -0.034727 -0.08401 0.057806 0.011076 -0.043349 -0.26917 -0.19333 0.22181 0.26123 -0.11761 0.10092 -0.15078 0.47153 0.11253 -0.26749 -0.038785 -0.03652 -0.089248 -0.24427 -0.041381 -0.021785 -0.35738 -0.063409 -0.53983 -0.010112 0.00041238 -0.097049 0.42628 -0.21349 -0.41055 -0.2494 -0.033571 -0.4954 0.15557 0.19882 0.10498 -0.24372 0.11429 -0.039279 -0.36258 0.10318 0.129 -0.41785 -0.041607 0.33522 0.073186 0.13362 0.010812 0.052645 0.18801 -0.30185 0.20333 -0.32258 -0.24673 0.21124 0.79132 -0.41539 0.3622 0.099852 -0.035378 -0.0419 -0.13851 -0.063255 0.13635 0.090863 -0.3994 0.099062 0.3221 -0.12256 -0.085906 -0.10218 0.2635 -0.18689 -0.1856 -0.43923 -0.325 -0.1991 0.17831 -0.27283 0.33473 0.082382 0.12825 0.39275 -0.034929 0.16148 -0.026713 0.40129 -0.39503 -0.064823 -0.08982 -0.066592 -0.34537 0.046283 0.36837 -0.024573 0.32213 0.30641 -0.28112 0.0066449 0.087743 -0.03417 0.60373 0.4212 -0.073349 0.26682 -0.1586 0.23765 -0.0062604 0.15236 -0.23409 0.31634 -0.08786 -0.15747 -0.24955 -0.18766 -0.096743 -0.27994 -0.24334 0.32643 0.29906 0.42763 0.22266 -0.17464 -0.019916 -0.31206 -0.34009 -0.14993 -0.28818 0.1475 -0.040503 -0.10347 0.0033634 0.2176 -0.20409 0.092415 0.080421 -0.061246 -0.30099 -0.14584 0.28188

in 0.089187 0.25792 0.26282 -0.029365 0.47187 -0.10389 -0.10013 0.08123 0.20883 2.5726 -0.67854 0.036121 0.13085 0.0012462 0.14769 0.26926 0.37144 1.3501 -0.11326 -0.23036 -0.26575 -0.18077 0.092455 -0.16215 0.15003 -0.34547 0.072295 0.40659 0.010021 -0.0079257 -0.11435 0.017008 -0.29789 0.19079 0.37112 -0.26588 0.16212 0.065469 -0.31781 -0.03226 0.081969 0.3445 -0.17362 -0.35745 0.054487 0.39941 0.13699 -0.022066 0.11025 -0.41898 0.1276 -0.095869 -0.17944 -0.17443 0.27302 -0.19464 0.26747 -0.28241 0.1638 -0.11518 0.013196 -0.10616 -0.36093 0.023634 0.13464 0.021652 -0.27094 -0.018737 0.10017 0.36071 -0.093951 0.47634 0.12874 0.0011868 0.1377 -0.14034 -0.1887 -0.16405 -0.15349 0.32347 -0.17616 0.3523 -0.023531 -0.19121 -0.054809 -0.099521 -0.30056 0.36632 -0.21509 0.074123 -0.20267 0.1286 -0.38111 -0.025482 0.45103 0.088633 0.36288 -0.23406 -0.086024 -0.50604 0.034242 0.43998 -0.083023 -0.11969 0.68686 -0.34115 0.21228 0.40039 0.26367 -0.37144 0.16206 -0.42854 0.078658 -0.2905 0.21727 -0.27484 0.35887 0.27055 -0.11326 -0.14848 -0.0050659 -0.076862 0.078621 -0.24922 0.42026 -0.069698 0.071595 0.0071665 0.27473 -0.15664 0.25713 -0.058461 -0.29733 -0.090996 0.5246 0.14889 -0.20883 -0.13004 -0.20022 0.4503 -0.34654 -0.26007 0.35247 -0.34757 0.033738 0.19907 -0.32912 -0.084689 0.65319 0.20954 0.079274 0.1086 0.0026466 -0.12843 -0.22811 0.051501 -0.27429 0.14505 -0.1843 -0.34825 -0.11701 0.34034 0.075848 0.08239 -0.39188 -0.022312 -0.080373 0.14477 0.29701 -0.10523 0.092893 0.029813 -0.11761 0.16308 0.098382 0.46152 -0.162 -0.2456 0.20293 -0.11344 0.057902 -0.19528 -0.20141 -0.22874 -0.014101 0.2637 -0.10028 -0.051896 0.18859 -0.17767 -0.11556 0.121 0.17303 0.11773 0.034837 0.28485 -0.30447 0.061024 -0.26442 -0.081135 -0.044524 -0.036931 -0.15217 0.29175 0.44926 -0.28875 0.33193 -0.01242 -0.18805 -0.19832 -0.19736 0.26893 0.11106 -0.67383 -0.1518 -0.16615 -0.16563 0.0093671 -0.15945 -0.33468 0.22038 -0.16724 -0.1535 -0.61782 -0.17258 0.088928 0.019411 0.18296 0.32967 -0.0024906 -0.09208 0.514 0.0042484 -0.084377 -0.71448 -0.22148 -0.04835 0.043761 -0.29376 -0.22287 0.18001 0.072197 0.46499 0.056466 0.40844 -0.23641 -0.038946 0.087363 -0.21901 -0.3231 -0.19989 -0.3128 -0.067656 -0.22596 0.090926 0.28365 0.31462 0.46082 -0.024871 -0.14605 0.30454 0.17704 -0.011311 0.26807 -0.032461 -0.16644 -0.15313 -0.20426 -0.3082 -0.2459 0.085848 -0.11767 -0.063056 -0.18133 -0.18629 -0.17694 0.29618 0.35987 0.0020102 0.38616 0.36712 -0.055112 -0.34733 -0.072678 -0.051119 -0.29069 0.053598 0.019587 0.16808 -0.27456 -0.097179 -0.054541 0.19229 -0.48128 -0.20304 0.19368 -0.32546 0.14421 -0.169 0.26501

" -0.075242 0.57337 -0.31908 -0.18484 0.88867 -0.27381 0.077588 0.13905 -0.47746 1.4442 -0.56159 0.085829 0.27504 0.1567 0.088067 0.038404 0.13146 0.80903 -0.16476 -0.26437 -0.25213 -0.10082 0.23976 -0.0017618 -0.14791 -0.042768 0.087014 0.4747 -0.0018207 -0.38313 -0.18743 -0.17626 -0.31186 0.11831 0.23195 -0.19336 -0.54827 -0.11649 -0.23389 -0.04854 -0.31656 0.04927 0.0074875 -0.3366 0.39277 -0.25234 -0.24983 0.19855 -0.038369 -0.13467 -0.11403 0.37989 -0.16665 -0.090585 0.22177 0.28434 0.25214 -0.03522 -0.0362 0.092851 -0.1177 -0.1239 0.75229 -0.33378 0.27406 -0.14008 0.12932 -0.081124 -0.23386 0.53654 0.45947 -0.14284 0.080147 0.12924 -0.24142 -0.50243 -0.16263 0.22523 0.528 0.089438 0.010835 -0.15715 -0.49085 -0.30451 -0.06281 -0.033262 -0.34762 -0.61321 0.12422 0.52602 -0.14759 -0.29829 -0.28189 -0.075447 -0.52833 0.20302 -0.36833 -0.095067 0.49005 -0.28781 -0.15406 0.2827 -0.085221 -0.24289 -0.22717 0.71889 -0.32832 0.1002 0.20946 0.59244 -0.41214 0.62329 -0.026238 -0.17192 0.49179 -0.0016847 -0.11607 -0.24131 0.11644 -0.1921 0.03934 -0.084155 -0.0996 0.022975 -0.31975 -0.050044 0.52722 0.2252 0.016595 0.20643 -0.0098633 0.31594 -0.078281 0.12947 -0.062864 0.52739 -0.09028 0.14762 0.043752 0.41388 -1.2517 0.16028 0.029431 0.37164 -0.14389 -0.090422 0.33523 0.65587 0.029611 -0.31848 0.15365 0.080007 -0.30963 -0.082044 0.25966 0.070255 -0.56391 -0.70405 -0.16871 0.025228 -0.31917 0.095542 0.03875 -0.27151 -0.45413 -0.27367 0.15582 -0.23491 0.017226 -0.12817 0.19779 -0.070536 -0.46543 -0.30177 0.126 0.28083 0.27455 -0.097111 -0.097692 -0.49537 -0.34073 -0.22939 -0.14542 0.036665 -0.71022 0.3501 -0.23481 0.73786 -0.1357 -0.15819 -0.25091 0.1612 -0.17855 0.19617 0.13535 0.29201 -0.33659 -0.34374 0.022833 0.72946 -0.14079 -0.23412 -0.23124 0.043603 -0.14916 -0.13349 0.26499 -0.072018 0.26515 -0.11531 -0.049195 -0.3157 0.16259 -0.21417 0.26391 0.19793 -0.14555 0.1195 -0.6928 -0.50451 0.21449 0.49722 -0.15937 0.07369 0.28543 -0.16188 0.043208 0.089521 -0.061378 0.005429 0.23594 -0.16455 0.6693 -0.68702 -0.02823 0.30058 -0.053056 -0.35263 -0.21279 0.2776 -0.067762 0.045954 0.23585 -0.18751 0.13449 -0.10923 -0.010326 -0.3995 -0.44222 0.0080668 0.094805 -0.10209 -0.17181 -0.15658 -0.12305 0.61632 0.050099 0.018423 -0.204 0.090514 0.19061 0.022614 0.28797 0.14858 0.18576 -0.025255 -0.061938 -0.20647 0.43831 -0.16193 0.013259 -0.40052 -0.20146 -0.42659 0.38983 -0.084942 0.13971 0.011352 0.025489 0.25282 -0.10832 -0.081995 0.29953 -0.22407 -0.11489 -0.37855 -0.52748 -0.17067 0.16029 0.29872 -0.035604 -0.022669 0.42531 0.063414 0.36213 -0.2128 -0.22615 0.328 -0.10934 -0.37948

: 0.008746 0.33214 -0.29175 -0.15119 -0.41842 -0.23931 -0.23458 -0.055618 -0.09896 0.75175 -0.66615 -0.10734 0.021663 -0.12194 0.022265 0.029731 0.036949 1.3326 -0.10886 -0.22681 -0.28436 0.021524 0.22749 -0.093169 -0.11529 0.51138 0.13868 -0.10885 -0.11482 -0.0074179 0.16234 0.0082633 -0.0023698 -0.39662 0.29591 0.22499 -0.46529 0.40232 0.027284 0.14321 0.034624 0.36936 -0.37351 0.22866 -0.29724 0.28951 -0.44012 0.47265 -0.070029 0.54446 0.30543 0.28181 0.063914 -0.30986 -0.40254 -0.032463 -0.39762 0.45387 0.075187 0.068059 0.12686 0.056289 0.29042 0.2362 0.34559 -0.14253 -0.016066 -0.058892 0.22277 0.31318 -0.37625 -0.044296 -0.017026 0.14938 0.87661 0.30364 -0.57488 -0.075509 -0.14493 0.16592 -0.67818 0.45022 -0.23441 -0.077216 0.32643 -0.1757 -0.0067939 -0.51045 0.56891 0.16143 0.18519 0.037305 -0.4579 -0.12869 0.19132 -0.38693 -0.1352 0.050239 0.36475 -0.061642 -0.181 -0.19424 0.46758 -0.25859 0.00027713 1.8061 -0.031111 -0.253 -0.043878 0.33484 0.21194 -0.16946 -0.012677 0.10138 -0.067128 0.2808 0.16923 -0.30368 -0.36514 0.18905 -0.36382 0.25917 0.18678 -0.054908 -0.068399 -0.083712 0.56264 -0.058912 -0.11257 -0.47151 0.62617 0.16101 0.17465 0.29054 -0.17968 0.17995 -0.28868 0.14772 -0.15869 0.12875 -0.050562 0.12813 0.018328 0.087067 -0.42093 0.26649 0.033762 0.46031 0.091036 -0.23591 0.37056 0.061394 -0.1232 -0.3872 0.31078 -0.40397 0.21185 0.14069 -0.32281 0.052793 -0.34045 -0.75221 -0.063703 0.094202 -0.45663 -0.53987 0.48825 -0.18757 -0.19803 0.40647 -0.24817 0.22975 0.21493 -0.48105 0.20716 0.32023 0.63723 -0.069866 0.3692 -0.15806 0.15572 0.36047 -0.010431 -0.27427 -0.087574 -0.37989 -0.1767 0.13558 0.056266 0.10345 -0.40615 0.11801 -0.32919 0.14333 -0.30102 -0.10898 -0.25298 0.33375 0.27642 0.71642 -0.091013 -0.002913 -0.19669 -0.39123 -0.056526 -0.1143 -0.28571 0.17814 -0.038271 -0.19628 -0.0057383 -0.68218 0.55404 -0.31276 -0.11263 -0.16157 -0.40151 0.40366 -0.21163 0.13927 0.32245 0.65676 0.039262 0.1051 -0.40708 -0.061696 0.30114 0.14276 0.24082 -0.29747 0.047918 0.3043 -0.15456 -0.27875 -0.39602 0.26501 -0.19017 0.054386 0.31772 0.44834 0.18064 -0.27069 0.15007 -0.037164 0.35867 0.25197 -0.42951 -0.080519 0.18769 0.35934 -0.12622 -0.034525 -0.44941 -0.27189 0.1923 0.3202 0.085719 -0.31613 0.12747 0.41687 -0.033986 0.16322 0.093101 0.012885 -0.13576 -0.50731 0.34072 0.01102 0.33894 0.043664 -0.22551 0.067386 -0.0061831 -0.10494 0.059349 0.43297 0.55025 0.30155 -0.1616 0.18268 -0.27236 -0.027163 0.61137 0.0027296 0.13913 0.051779 -0.19778 -0.03439 -0.088886 -0.096511 0.33936 -0.041628 -0.5592 0.22176 -0.41515 0.70059 -0.21371 -0.28677 -0.22663 -0.05087

我相信你已经明白了,里面不过如此嘛~就是一个单词(或者标点)加300个数字而已。
没错~就是那么简单,但是问题是如何用这个文件?我们处理的数据基本不会涵盖这里所有的单词的,所以必然需要做一些处理。换句话说,我们处理的数据必然要和这个大txt文件做一定的联系才可以!
好,下面来看看我们需要处理的数据是什么?

2.我们需要处理的数据又长啥样?

有人会说,你这是什么问题,我们处理的数据什么样我能不知道吗?不就是一句句话,然后一句句话是又一个个单词组成的,而且我们用了数字来表示他们。表示成了类似这种样子嘛:
过程如下:首先这样:在这里插入图片描述
然后这样:
在这里插入图片描述
再这样:
在这里插入图片描述
上面的过程如果看不懂,或者有疑问还请移步:
https://blog.csdn.net/ssswill/article/details/88533623
也就是刚刚那个链接。

需要说一下的是,上面的过程是urdu语,不是英语。但是二者过程是类似的。我想你应该明白我的意思。

好了!到这里,我们这个标题基本说清楚了,我们要处理的数据是由1,2,3.。。。这些数字表示的。

2.这两种表示方法我们已经知道了,那么怎么转换的?

ok。来看一下我们的数据是用1,2,3代表了"a","the"这些单词。而且这些我们可以通过:

tokenizer.word_index

查看对应关系的。
输出:
在这里插入图片描述
如果词向量表示的也有类似的字典就好了!(后面会讲怎么生成这个字典)最好长这样:
在这里插入图片描述
在这里插入图片描述
上面两个截图拼起来就是一个大字典了!哈~这下我们就有了两个大字典。
有了他们我们能做些什么?

3.有点懵了现在,现在两个字典有了,我们做词嵌入到底怎么做的啊?

好~这是个根本问题,或者这个问题换种问法,就是我们的词嵌入的目标是什么?

是生成一个权重矩阵哦~

嗯?矩阵?还权重矩阵?怎么说?
问这个问题只能说明我让你先看的我的那个链接你还没看~或者看的不仔细。。。
因为我在那篇博文里提及到:
在这里插入图片描述

明白了吗?做词嵌入就是要得到这个矩阵!

再说一些:
在这里插入图片描述
首先,一句话被表示成了(None,219,1)onehot之后是(none,219,2000),2000代表有2000种不同的单词。也就是tokenizer.word_index的长度。
之后来一个权重矩阵就行了。就变成了(none,219,128)了!就可以放到lstm里训练了!
我现在先透露一下:
权重矩阵有多少行呢?你有多少不同的单词就有多少行!
有多少列呢?你要把一个单词变成多少维就有多少列!在我们这里glove用的是300。
每一行又代表什么呢?
还记得这个吧!
在这里插入图片描述
对咯~第一行就是the的300维表示。
第二行就是to的300维表示。以此类推!!!
明白了吧!
最终矩阵长这样:
在这里插入图片描述
框出来的都是一行,第一行都是0,是因为你在pad_sentence时不是用0填充的吗,它不对应任何单词,所以在glove里找不到它,就用全0代替,不影响的。
第二行就是the的300维表示了~

4.end

看到了这里我觉得你挺厉害的了~因为写了很多,感谢阅读。我写那么多其实是和一个比赛有关的,一些理解我写在github里了,但是github上代码没有上传完,会陆续更新。但是如果觉得有帮助的话,还请到github里star一下!感谢!
https://github.com/willinseu/kaggle-Jigsaw-Unintended-Bias-in-Toxicity-Classification-solution
至于本文的代码篇,也就是如何生成上面的矩阵以及词典,会另开一篇博文说明,二者是独立的。

  • 6
    点赞
  • 25
    收藏
    觉得还不错? 一键收藏
  • 6
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 6
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值