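Following up on the suggestion about feature extraction, you can use TfidfVectorizer from scikit-learn to extract the important words from the tweets. With the default configuration plus a simple LogisticRegression, it gives me an accuracy of 0.8. Hope that helps.
Here is an example of how to use it for your problem: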
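import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the data; the CSV files have no header row
train_df_raw = pd.read_csv('train.csv', header=None, names=['label', 'tweet'])
test_df_raw = pd.read_csv('test.csv', header=None, names=['label', 'tweet'])

# Drop rows with a missing tweet, and drop label 2 from the test set
train_df_raw = train_df_raw[train_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['tweet'].notnull()]
test_df_raw = test_df_raw[test_df_raw['label'] != 2]

# Binarize the labels: 0 stays 0, every other label becomes 1
y_train = [x if x == 0 else 1 for x in train_df_raw['label'].tolist()]
y_test = [x if x == 0 else 1 for x in test_df_raw['label'].tolist()]
X_train = train_df_raw['tweet'].tolist()
X_test = test_df_raw['tweet'].tolist()

# Fit the vocabulary and IDF weights on the training tweets only
print('At vectorizer')
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)

# Reuse the fitted vectorizer on the test tweets (transform, not fit_transform)
print('At vectorizer for test data')
X_test = vectorizer.transform(X_test)

print('At classifier')
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print('Accuracy:', accuracy_score(y_test, predictions))
# Use a separate name so the imported confusion_matrix function is not shadowed
cm = confusion_matrix(y_test, predictions)
print(cm)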
Accuracy: 0.8
[[135  42]
 [ 30 153]]
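
As a quick sanity check on those numbers, the accuracy can be recomputed from the confusion matrix by hand. This is only a small sketch in plain Python: it assumes scikit-learn's usual layout (rows are true labels, columns are predicted labels, class 0 first), and the variable names are my own rather than anything from the code above.

```python
# Confusion matrix copied from the output above.
# Assumed layout (scikit-learn convention): rows = true label, columns = predicted label.
matrix = [[135, 42],
          [30, 153]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

# Accuracy is the share of tweets on the diagonal (correct predictions).
accuracy = (tn + tp) / total

# Precision and recall for class 1, which accuracy alone does not show.
precision = tp / (tp + fp)  # of tweets predicted as class 1, how many really are
recall = tp / (tp + fn)     # of tweets that really are class 1, how many were found

print(f"total={total} accuracy={accuracy:.3f}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

The diagonal gives 135 + 153 = 288 correct predictions out of 360 test tweets, which matches the reported 0.8. The off-diagonal counts (42 and 30) show that the errors are spread fairly evenly across the two classes.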