Implementing the classifiers is straightforward: almost everything we need can be called directly from the sklearn library. sklearn is the best!
Data preparation
The class-imbalance problem
In this problem, for example, there are roughly 400 samples labeled 0 but over 1500 labeled 1. Training a classifier directly on this data is problematic: a classifier that simply predicts 1 for every sample already achieves high accuracy while learning nothing about class 0.
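To see how misleading plain accuracy is here, a quick back-of-the-envelope check using the approximate class counts mentioned above:

```python
# A degenerate classifier that always predicts the majority class (1).
n_class0, n_class1 = 400, 1500  # approximate counts from this problem

# Every class-1 sample counts as "correct", every class-0 sample is wrong.
accuracy = n_class1 / (n_class0 + n_class1)
print(f"accuracy of always predicting 1: {accuracy:.3f}")  # → 0.789

# Yet its recall on the minority class (class 0) is exactly zero:
recall_class0 = 0 / n_class0
print(f"recall on class 0: {recall_class0}")  # → 0.0
```

This is why the code below reports the F-measure alongside accuracy.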
SMOTE oversampling
# First, split into training and test sets
import collections
from sklearn.model_selection import train_test_split
train_xx, test_xx, train_yy_before, test_yy = train_test_split(xx, yy, test_size=0.3)
# Use SMOTE oversampling to fix the class imbalance
# (resample the training set only, so the test set stays untouched)
from imblearn.over_sampling import SMOTE
smo = SMOTE()
train_xx, train_yy = smo.fit_resample(train_xx, train_yy_before)
print('Original data: {}'.format(collections.Counter(yy)))
print('Test set: {}'.format(collections.Counter(test_yy)))
print('Training set: {}'.format(collections.Counter(train_yy_before)))
print('Oversampled training set: {}'.format(collections.Counter(train_yy)))
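Under the hood, SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest same-class neighbors. A minimal pure-Python sketch of that interpolation step (my own illustration, not the actual imblearn implementation, which also does k-NN neighbor selection):

```python
import random

def smote_interpolate(x, neighbor, rng=random):
    """Create one synthetic sample on the segment between x and a
    same-class neighbor: new = x + gap * (neighbor - x), gap in [0, 1]."""
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

random.seed(0)
minority_a = [1.0, 2.0]
minority_b = [3.0, 4.0]
synthetic = smote_interpolate(minority_a, minority_b)
print(synthetic)  # a point somewhere on the segment between a and b
```

Because the synthetic points lie between real minority samples rather than being exact copies, SMOTE avoids the overfitting that naive duplication causes.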
EasyEnsembleClassifier
This ensemble classifier is supposed to handle imbalanced data, but in my experience it performed poorly, so I eventually switched back to SMOTE. Using EasyEnsembleClassifier is simple: just pass your existing classifier to it as the base estimator:
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
ee_ada = EasyEnsembleClassifier(n_estimators=20, base_estimator=AdaBoostClassifier())
# ee_ada = AdaBoostClassifier()
ee_ada.fit(train_xx, train_yy)
print("Accuracy: {}, F-measure: {}".format(ee_ada.score(test_xx, test_yy),
                                           f1_score(test_yy, ee_ada.predict(test_xx))))
plot_AUC(ee_ada, test_xx, test_yy)
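For context, EasyEnsembleClassifier attacks imbalance from the opposite direction to SMOTE: instead of oversampling the minority class, it trains each base estimator on a random undersample of the majority class so that every subset is balanced. A rough stdlib-only sketch of that resampling step (my own illustration, not imblearn's actual code):

```python
import random

def balanced_undersample(y, minority_label, rng=random):
    """Return indices for one balanced subset: all minority samples plus
    an equal-sized random draw from the majority class."""
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    majority_idx = [i for i, label in enumerate(y) if label != minority_label]
    sampled_majority = rng.sample(majority_idx, k=len(minority_idx))
    return minority_idx + sampled_majority

random.seed(42)
y = [1] * 15 + [0] * 4  # imbalanced labels: 15 ones, 4 zeros
subset = balanced_undersample(y, minority_label=0)
labels = [y[i] for i in subset]
print(labels.count(0), labels.count(1))  # → 4 4
```

Each of the `n_estimators` base models sees a different balanced subset, and their predictions are combined at the end.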
The individual classifiers
- AdaBoost
- XGBoost
- RandomForestClassifier (note: there is both a regressor and a classifier, don't mix them up)
- GradientBoostingClassifier
- DecisionTreeClassifier
- BernoulliNB
- svm
- VotingClassifier (an ensemble classifier; supports soft or hard voting)
They are all easy to use, following the same few steps:
- Import the relevant package
- Instantiate the classifier
- Train with the fit(x, y) function (y is the label)
- Predict with the predict() function
- Optionally, inspect the feature_importances_ attribute to see how much each feature contributes to the classification
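The steps above in one concrete example, using DecisionTreeClassifier on a tiny hand-made dataset (the data here is made up purely for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label depends only on the first feature.
X = [[0, 5], [1, 3], [0, 8], [1, 1], [0, 2], [1, 9]]
y = [0, 1, 0, 1, 0, 1]

# 1. instantiate  2. fit(x, y)  3. predict()
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
print(clf.predict([[0, 7], [1, 4]]))  # → [0 1]

# feature_importances_ shows that feature 0 carries all the signal here
print(clf.feature_importances_)  # → [1. 0.]
```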
Example code for VotingClassifier:
from sklearn.ensemble import VotingClassifier
clf = VotingClassifier(estimators=[
        ('xgboost', XGBClassifier(use_label_encoder=False,
                                  eval_metric=['logloss', 'auc', 'error'])),
        ('adaboost', AdaBoostClassifier()),
        # ('random_forest', RandomForestClassifier(n_estimators=30)),
        ('gradient_boost', GradientBoostingClassifier()),
        # ('decision_tree', tree.DecisionTreeClassifier())
    ],
    voting='soft')
ee_clf = EasyEnsembleClassifier(n_estimators=20, base_estimator=clf)
ee_clf.fit(train_xx, train_yy)
print("Accuracy: {}, F-measure: {}".format(ee_clf.score(test_xx, test_yy),
                                           f1_score(test_yy, ee_clf.predict(test_xx))))
plot_AUC(ee_clf, test_xx, test_yy)
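The voting='soft' option averages the predicted class probabilities across estimators, while 'hard' takes a majority vote over the predicted labels; the two can disagree. A stdlib-only sketch of the difference (these helper names are my own, not sklearn's):

```python
def hard_vote(labels):
    """Majority vote over the estimators' predicted labels."""
    return max(set(labels), key=labels.count)

def soft_vote(probas):
    """Average the per-class probability vectors, then pick the argmax."""
    n = len(probas)
    avg = [sum(p[c] for p in probas) / n for c in range(len(probas[0]))]
    return avg.index(max(avg))

# Three estimators, one sample, classes {0, 1}.
# Two weakly prefer class 0; one is very confident about class 1.
probas = [[0.6, 0.4], [0.55, 0.45], [0.05, 0.95]]
labels = [p.index(max(p)) for p in probas]  # [0, 0, 1]

print(hard_vote(labels))  # → 0  (majority of labels wins)
print(soft_vote(probas))  # → 1  (the confident estimator shifts the average)
```

Soft voting requires every base estimator to implement predict_proba, which is why it pairs naturally with the tree ensembles used above.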
All the relevant imports (duplicates removed):
from sklearn import svm
from sklearn import tree
from sklearn import metrics
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score, accuracy_score
from imblearn.ensemble import EasyEnsembleClassifier
import xgboost
from xgboost import XGBClassifier
Finally, the function for plotting the ROC curve with its AUC:
from sklearn import metrics
import matplotlib.pyplot as plt

def plot_AUC(model, X_test, y_test):
    probs = model.predict_proba(X_test)
    preds = probs[:, 1]  # probability of the positive class
    fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
    roc_auc = metrics.auc(fpr, tpr)
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()