Building a dictionary with corpora from the Gensim package raises the following error:
Traceback (most recent call last):
  File "D:\BaiduNetdiskDownload\sample.py", line 38, in <module>
    dictionary = corpora.Dictionary(allwords)
  File "D:\soft\Python27\lib\site-packages\gensim\corpora\dictionary.py", line 79, in __init__
    self.add_documents(documents, prune_at=prune_at)
  File "D:\soft\Python27\lib\site-packages\gensim\corpora\dictionary.py", line 195, in add_documents
    self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
  File "D:\soft\Python27\lib\site-packages\gensim\corpora\dictionary.py", line 233, in doc2bow
    raise TypeError("doc2bow expects an array of unicode tokens on input, not a single string")
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
The code that triggers the error:
#### Tokenization
real_documents = []
stopwords = dt.LoadStopwords("stopwords.txt")
cut_documents = [list(jieba.cut(item_text, cut_all=False)) for item_text in tqdm(real_test_raw)]
allwords = []
for sentence in cut_documents:
    outstr = ''
    for word in sentence:
        if word not in stopwords:
            outstr += word
            outstr += ' '
            allwords.append(word)
    real_documents.append(outstr.strip())
### Save the tokenization result
#dt.SaveList2File("afterJson.txt", real_documents)
### Build the dictionary
#for i in allwords:
#    print type(i), i
# Fails: allwords is a flat list of tokens, not a list of documents
dictionary = corpora.Dictionary(allwords)
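Fix: corpora.Dictionary expects a list of documents, each of which is itself a list of tokens. allwords is a flat list of tokens, so every token (a single string) is treated as a document, which is what doc2bow complains about. Change the last line, dictionary = corpora.Dictionary(allwords), to the following: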
dictionary = corpora.Dictionary([allwords])
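
Note that wrapping allwords in a list makes the whole corpus a single document. If per-document counts are needed later (for example, one bag-of-words vector per text), a possible alternative is to keep one token list per document and pass that list of lists to corpora.Dictionary. The sketch below is only an illustration: filter_stopwords and the sample data are hypothetical names and values that stand in for the jieba output and the stopword file above, and the gensim step is skipped if gensim is not installed.

```python
from __future__ import print_function


def filter_stopwords(cut_documents, stopwords):
    """Drop stopwords but keep one token list per document."""
    return [[word for word in doc if word not in stopwords]
            for doc in cut_documents]


# Hypothetical, already tokenized sample data standing in for the jieba output.
cut_documents = [
    [u"gensim", u"is", u"a", u"topic", u"modeling", u"library"],
    [u"jieba", u"is", u"a", u"word", u"segmentation", u"library"],
]
stopwords = {u"is", u"a"}

# texts is a list of documents; each document is a list of tokens.
texts = filter_stopwords(cut_documents, stopwords)
print(texts)

try:
    from gensim import corpora
except ImportError:
    corpora = None  # gensim is not available; skip the dictionary step

if corpora is not None:
    # A list of token lists is the input shape that Dictionary accepts.
    dictionary = corpora.Dictionary(texts)
    # One bag-of-words vector per document.
    bows = [dictionary.doc2bow(tokens) for tokens in texts]
    print(len(dictionary), bows)
```

With this shape each text keeps its own token list, so doc2bow can be called once per document. The [allwords] fix above is still sufficient when only a single vocabulary over the whole corpus is needed.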