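I am learning to use TfidfVectorizer from scikit-learn to extract features from text data. I have a CSV file with a score (either +1 or -1) and a review (text). I load this data into a DataFrame so that I can run the vectorizer on it.
The code is as follows: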
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv("train_new.csv", names=['Score', 'Review'], sep=',')
# x = df['Review'] == np.nan
#
# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)
#
# print df.isnull().values.any()
v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['Review'])
This raises the following error:
ValueError: np.nan is an invalid document, expected byte or unicode string.
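The message means that at least one entry in the Review column is NaN, which is typically what pandas produces for an empty field when reading a CSV. The following is a minimal sketch for locating such rows; the small in-memory DataFrame is a hypothetical stand-in for train_new.csv. It also shows why a comparison such as `df['Review'] == np.nan` finds nothing: NaN never compares equal to anything, including itself.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_new.csv; the second review is missing.
df = pd.DataFrame({
    "Score": [1, -1, 1],
    "Review": ["good product", np.nan, "bad service"],
})

# Comparing with == np.nan is always False, so this mask flags no rows.
eq_mask = df["Review"] == np.nan

# isnull() is the reliable way to flag missing values.
null_mask = df["Review"].isnull()

print(eq_mask.any())   # whether the == comparison found anything
print(null_mask.sum()) # number of missing reviews
print(df[null_mask])   # the offending rows
```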
Solution: convert the column to Unicode strings before passing it to the vectorizer.
x = v.fit_transform(df['Review'].values.astype('U'))  # astype(str) would also work
The documentation shows why this is needed: fit_transform expects an iterable of strings, and NaN is a float.
fit_transform(raw_documents, y=None)
Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
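One side effect of the cast is worth knowing: astype('U') turns each NaN into the literal string 'nan', which the vectorizer then treats as an ordinary word. As a sketch of one possible alternative (not part of the original answer), missing reviews can be replaced with an empty string via fillna('') so that they contribute no terms. The DataFrame below is again a hypothetical stand-in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for train_new.csv; the second review is missing.
df = pd.DataFrame({
    "Score": [1, -1, 1],
    "Review": ["good product", np.nan, "bad service"],
})

# Casting to Unicode makes the error go away, but NaN becomes the token 'nan'.
v_cast = TfidfVectorizer()
v_cast.fit(df["Review"].values.astype("U"))
has_nan_token = "nan" in v_cast.vocabulary_

# Filling with an empty string keeps the row but gives it no terms.
v_fill = TfidfVectorizer()
x_fill = v_fill.fit_transform(df["Review"].fillna(""))

print(has_nan_token)
print(sorted(v_fill.vocabulary_))
print(x_fill.shape)
```

Whether to fill, cast, or drop the rows with dropna(subset=['Review']) depends on whether a review-less row should still count as a training example.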