I ran into a few encoding errors, so here are notes on how I fixed some of them:
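① f = codecs.open('./sentence.txt', 'r', 'utf-8')
RuntimeError: you must first build vocabulary before training the model
On a Chinese-locale Windows system, open(file) defaults to GBK, so the file has to be read explicitly as UTF-8. For Chinese text, the vocabulary must be built first.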
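To illustrate reading with an explicit encoding, here is a minimal sketch that writes a small UTF-8 file and reads it back as lists of tokens. The file name `demo_sentence.txt`, the sample sentences, and the helper `read_sentences` are made up for this example and are not part of the original script.

```python
import os
import tempfile


def read_sentences(path):
    """Read a UTF-8 text file and return one list of tokens per non-empty line."""
    sentences = []
    # Pass the encoding explicitly so the result does not depend on the OS locale.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            tokens = line.split()  # split on any whitespace; also drops the newline
            if tokens:             # skip blank lines
                sentences.append(tokens)
    return sentences


# Hypothetical sample data: two pre-segmented Chinese sentences and one blank line.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_sentence.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("我 喜欢 自然 语言 处理\n\n今天 天气 很 好\n")

sentences = read_sentences(demo_path)
print(sentences)
```

If the list handed to the model ends up empty, for instance because every line was blank, no vocabulary can be built from it, which is one plausible way to run into the RuntimeError above.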
② s1 = ss.split(" ".encode(encoding='utf-8'))
TypeError: must be str or None, not bytes
ss is a str here, so split needs a str separator, not bytes. Use ss.split(" ") instead.
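The rule is that the separator must have the same type as the object being split. A small sketch with a made-up sample line (the variable names are illustrative only):

```python
line = "我 喜欢 编程"  # a str, as returned by a text-mode read

# str.split takes a str separator (or no argument for any whitespace).
tokens = line.split(" ")

# Mixing types fails: a str cannot be split on a bytes separator.
try:
    line.split(" ".encode("utf-8"))
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

# bytes.split works, but only with a bytes separator,
# and each piece must be decoded to get str tokens back.
raw = line.encode("utf-8")
decoded_tokens = [piece.decode("utf-8") for piece in raw.split(b" ")]

print(tokens, type(mixed_error).__name__, decoded_tokens)
```

In practice it is usually simpler to decode once when reading the file and work with str everywhere after that.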
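③ g = open('D:\Download\code\w2v\sentence.txt', 'rb', 'utf-8')
TypeError: an integer is required (got type str)
The third positional argument of open() is buffering, which must be an integer, so 'utf-8' is rejected there. Binary mode ('rb') cannot take an encoding in any case, because it returns raw bytes without decoding.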
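There are two consistent ways to get decoded text: open in text mode and pass the encoding by keyword, or open in binary mode and decode the bytes afterwards. The sketch below shows both, plus the two failure modes, on a temporary file; the file name and its contents are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical demo file; newline="\n" keeps the bytes identical on every OS.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_sentence.txt")
with open(demo_path, "w", encoding="utf-8", newline="\n") as f:
    f.write("你好 世界\n")

# Option 1: text mode, encoding passed by keyword.
with open(demo_path, "r", encoding="utf-8") as f:
    text_from_text_mode = f.read()

# Option 2: binary mode, then decode the bytes explicitly.
with open(demo_path, "rb") as f:
    text_from_binary_mode = f.read().decode("utf-8")

# Failure 1: the third positional argument is buffering, so a str is a TypeError.
try:
    open(demo_path, "rb", "utf-8")
    positional_error = None
except TypeError as exc:
    positional_error = exc

# Failure 2: even as a keyword, an encoding is not accepted in binary mode.
try:
    open(demo_path, "rb", encoding="utf-8")
    keyword_error = None
except ValueError as exc:
    keyword_error = exc

print(text_from_text_mode == text_from_binary_mode)
```

The exact wording of the TypeError message differs between Python versions, so the sketch only checks the exception type.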
Changed to:
import codecs
with codecs.open('./sentence.txt', 'r', 'utf-8') as f:
    sss = []
    while True:
        ss = f.readline().replace('\n', '').rstrip()  # these calls need ss to be a str
        if ss == '':