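word2vec training currently runs only on the CPU. When training on a large corpus, loading all of the text into memory at once will exhaust memory. One solution is to read the training corpus from local files during training, so that only one line is held in memory at a time. A function for reading local text this way is given below: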
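def sentence2words(sentence, stopWords=False, stopWords_set=None):
    # Split a line into words on whitespace.
    # stopWords and stopWords_set are accepted but not used here.
    words = []
    for word in sentence.split():
        words.append(word)
    return words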
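class MySentences(object):
    def __init__(self, list_csv):
        # list_csv: list of paths to the corpus files
        self.fns = list_csv

    def __iter__(self):
        # Reopen and stream every file on each pass, one line at a time.
        for fn in self.fns:
            with open(fn, 'r') as f:
                for line in f:
                    yield sentence2words(line.strip())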
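A class with an __iter__ method is used here, rather than a plain generator, because word2vec training needs to pass over the corpus more than once (once to build the vocabulary, then once per epoch). The following is a minimal, self-contained sketch of that behavior. The class name StreamedSentences, the file names a.txt and b.txt, and the explicit utf-8 encoding are assumptions made for illustration, and unlike MySentences above this sketch skips blank lines.

```python
import os
import tempfile


class StreamedSentences:
    """Re-iterable corpus: each call to __iter__ reopens the files."""

    def __init__(self, paths):
        self.paths = paths

    def __iter__(self):
        for path in self.paths:
            # encoding is an assumption; adjust to match the corpus
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    words = line.split()
                    if words:  # skip blank lines
                        yield words


# Build two tiny hypothetical corpus files in a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "hello world\n\nfoo bar baz\n"),
                   ("b.txt", "one two\n")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

sentences = StreamedSentences(paths)
first_pass = list(sentences)
second_pass = list(sentences)  # a plain generator would be empty here

# With gensim installed, the iterable could be passed in directly, e.g.
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=sentences, min_count=1)
```

Because the files are reopened on every pass, memory use stays roughly constant regardless of corpus size, at the cost of re-reading the files from disk for each epoch.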
list_csv is the list of input files. For example, given the training corpus files text1.txt and text2.txt, the calling code is as follows:
files1=[]
files1.append('text1.txt')
files1.ap