How to Use the Chinese-Word-Vectors Tool
The two most important ingredients in Chinese NLP are word segmentation and word embeddings. Several pretrained embedding files are already available, so we only need to load them and use them. Because decoding errors occur during loading, reading the embedding file with a plain open() call does not work. Below are three different ways to load Chinese-Word-Vectors, using sgns.zhihu.bigram-char.bz2 as the example.
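To see why the naive approach fails: the .bz2 file is compressed binary data, so opening it in text mode and decoding as UTF-8 typically raises a UnicodeDecodeError. A minimal sketch of the failure mode:

# Compressed bytes are not valid UTF-8 text, so this fails.
with open("sgns.zhihu.bigram-char.bz2", "r", encoding="utf-8") as f:
    f.readline()  # raises UnicodeDecodeError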
Method 1

from gensim.models.keyedvectors import KeyedVectors

# gensim decompresses .bz2 files transparently; unicode_errors='ignore'
# skips the occasional malformed byte sequence in the file.
w2v_model = KeyedVectors.load_word2vec_format("sgns.zhihu.bigram-char.bz2", binary=False, unicode_errors='ignore')
print(w2v_model)
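Once loaded, the object supports the usual gensim KeyedVectors queries. A short sketch (the query word is illustrative and assumed to be in the Zhihu vocabulary):

vector = w2v_model["北京"]                          # the embedding vector for a word
similar = w2v_model.most_similar("北京", topn=5)    # nearest neighbours by cosine similarity
print(vector.shape, similar)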
Method 2

import bz2

# Read the compressed file with the bz2 module and decode each line
# from UTF-8 by hand.
with bz2.open("sgns.zhihu.bigram-char.bz2", 'rb') as f:
    word_vecs = f.readlines()
word_vecs = [i.decode('utf-8') for i in word_vecs]
print(word_vecs)
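word_vecs is a list of raw text lines: the first line holds the vocabulary size and the vector dimension, and each following line is a word followed by its components, separated by spaces. A minimal sketch of turning those lines into a {word: vector} dict (the variable names are just illustrative):

import numpy as np

embeddings = {}
for line in word_vecs[1:]:            # skip the header line
    parts = line.split()
    embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)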
Method 3

import codecs
import numpy as np

# Parse the embedding file by hand, keeping only the first occurrence of
# each word. Note that codecs.open cannot decompress bz2, so this expects
# the decompressed text file. The first line gives the vocabulary size and
# the vector dimension; every other line is a word followed by its components.
def load_dense_drop_repeat(path):
    vocab_size, size = 0, 0
    vocab = {"itos": [], "stoi": {}}
    count = 0
    with codecs.open(path, "r", "utf-8", errors='ignore') as f:
        first_line = True
        for line in f:
            if first_line:
                first_line = False
                vocab_size = int(line.strip().split()[0])
                size = int(line.strip().split()[1])
                matrix = np.zeros(shape=(vocab_size, size), dtype=np.float32)
                continue
            vec = line.strip().split()
            if vec[0] not in vocab["stoi"]:    # drop repeated words
                vocab["stoi"][vec[0]] = count
                matrix[count, :] = np.array([float(x) for x in vec[1:]])
                count += 1
    for w in vocab["stoi"]:                    # dicts preserve insertion order
        vocab["itos"].append(w)
    return matrix, vocab, size, len(vocab["itos"])
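A short usage sketch, assuming the archive has first been decompressed (e.g. with bunzip2 -k, which leaves sgns.zhihu.bigram-char next to the original; the path here is an assumption):

matrix, vocab, size, vocab_len = load_dense_drop_repeat("sgns.zhihu.bigram-char")
print(size, vocab_len)               # vector dimension and number of unique words
idx = vocab["stoi"].get("北京")      # row index of an (illustrative) word
if idx is not None:
    print(matrix[idx][:5])           # first five components of its vector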
These three methods yield differently structured results: a gensim KeyedVectors object, a list of raw text lines, and a NumPy matrix plus vocabulary mappings, respectively. Which one to use depends on the situation at hand.