The paper "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" encodes character-level information with a convolutional neural network when doing sequence labeling tasks such as part-of-speech tagging.
In its experiments, the character embeddings have dimension 30 and are initialized uniformly in the range [-sqrt(3/dim), sqrt(3/dim)]. As a first experiment, I trained character embeddings with word2vec.
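Before the word2vec experiment, the paper's own initialization scheme can be sketched directly: each entry of the embedding table is drawn from a uniform distribution over [-sqrt(3/dim), sqrt(3/dim)]. The alphabet size below (42, matching the character set used later) is an illustrative assumption, not something the paper fixes.

```python
import numpy as np

# Uniform initialization of character embeddings per the paper:
# each entry drawn from U[-sqrt(3/dim), +sqrt(3/dim)], with dim = 30.
# alphabet_size = 42 is an illustrative assumption.
dim = 30
alphabet_size = 42
bound = np.sqrt(3.0 / dim)
char_embeddings = np.random.uniform(-bound, bound, size=(alphabet_size, dim))

print(char_embeddings.shape)  # (42, 30)
```

This random table would then be fine-tuned during training; the word2vec experiment below is an alternative way to get a non-random starting point.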
# Train character-level embeddings with word2vec
from gensim.models.word2vec import Word2Vec

alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789,.)(; '
text = open('text').read().replace('\n', ' ').lower()
# Keep only characters in the alphabet
filtered = ''.join(ch for ch in text if ch in alphabet)
tokens = filtered.split(' ')
words = [t for t in tokens if len(t) >= 2]
# Treat each word as a sequence of characters
char_sequences = [list(w) for w in words]
print(char_sequences)
# gensim >= 4.0 uses vector_size instead of size
model = Word2Vec(char_sequences, vector_size=30, window=5, min_count=1)
model.save('char_embeddings.vec')
The processed character sequences are:
Testing the resulting model:
# In gensim >= 4.0, vectors are accessed through model.wv
print(model.wv['a'])
print(model.wv.most_similar('a', topn=5))
---------------------------------------
array([-0.01051879, 0.00305209, 0.00773612, 0.01362684, 0.01594807,
0.01029609, 0.00346048, 0.00261297, -0.01034051, 0.00964036,
-0.00509238, 0.0021358 , -0.00605083, 0.0087046 , 0.00930654,
0.01411205, 0.00340451, -0.0071094 , -0.00138468, 0.00443402,
0.00809182, -0.00498053, -0.00288919, 0.01092559, -0.01460177,
-0.00596451, -0.00200858, -0.01376272, 0.00229289, 0.01006972], dtype=float32)
[('w', 0.5829492211341858), ('c', 0.34324681758880615), ('k', 0.3245270252227783), ('u', 0.20812581479549408), ('i', 0.15292495489120483)]