Character-level word2vec

For its sequence-labeling (POS tagging) experiments, the paper "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" encodes each word's characters with a convolutional neural network to capture character-level information.

In the paper's experiments, the character embeddings have dimension 30 and are initialized uniformly in the range [-sqrt(3/dim), sqrt(3/dim)]. As a first step, I tried training character embeddings with word2vec.
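For reference, the paper's initialization range can be reproduced directly with numpy; a minimal sketch (the vocabulary size of 52 is a made-up example, dim=30 as in the paper):

```python
import numpy as np

dim = 30
bound = np.sqrt(3.0 / dim)  # the paper's sqrt(3/dim) bound

# Uniformly initialize a (vocab_size, dim) character embedding table
vocab_size = 52
emb = np.random.uniform(-bound, bound, size=(vocab_size, dim))
print(emb.shape)  # (52, 30)
```

This uniform range gives each embedding dimension a variance of 1/dim, a common heuristic for keeping initial activations small.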

# Train character-level vectors with word2vec
from gensim.models import Word2Vec

alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789,.)(; '
text = open('text').read().replace('\n', ' ').lower()
filtered = ''.join(ch for ch in text if ch in alphabet)  # keep only whitelisted characters
tokens = filtered.split(' ')
words = [t for t in tokens if len(t) >= 2]               # drop empty and single-character tokens
char_sequences = [list(w) for w in words]                # each word becomes a "sentence" of characters
print(char_sequences)
model = Word2Vec(char_sequences, vector_size=30, window=5, min_count=1)  # `size=` in gensim < 4
model.save('char_embeddings.vec')

The processed character sequences look like:
[screenshot of the printed character sequences in the original post]

A quick test of the resulting model:

print(model.wv['a'])
print(model.wv.most_similar('a', topn=5))
---------------------------------------
array([-0.01051879,  0.00305209,  0.00773612,  0.01362684,  0.01594807,
        0.01029609,  0.00346048,  0.00261297, -0.01034051,  0.00964036,
       -0.00509238,  0.0021358 , -0.00605083,  0.0087046 ,  0.00930654,
        0.01411205,  0.00340451, -0.0071094 , -0.00138468,  0.00443402,
        0.00809182, -0.00498053, -0.00288919,  0.01092559, -0.01460177,
       -0.00596451, -0.00200858, -0.01376272,  0.00229289,  0.01006972], dtype=float32)

[('w', 0.5829492211341858), ('c', 0.34324681758880615), ('k', 0.3245270252227783), ('u', 0.20812581479549408), ('i', 0.15292495489120483)]  
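most_similar ranks characters by cosine similarity between their vectors; this can be checked by hand. A numpy sketch with made-up 3-dimensional vectors:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product of the two vectors after length normalization
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0, 1.0])
w = np.array([1.0, 0.1, 0.9])   # nearly parallel to a
k = np.array([0.0, 1.0, 0.0])   # orthogonal to a
print(cosine(a, w) > cosine(a, k))  # True: the closer vector scores higher
```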