This is the second notebook on finding donors for CharityML. The first, covering data preprocessing, is here: https://blog.csdn.net/xiuxiuxiu666/article/details/104301124
- Naive Predictor
By counting how many people earn more than $50,000 and how many do not, we find that most respondents earn no more than $50,000 a year. If we simply predicted "this person's income does not exceed $50,000" for everyone, we would achieve an accuracy above 50% without ever looking at the data. Such a predictor is called naive. Building a naive predictor is usually important because it establishes a baseline against which to judge whether a trained model actually performs well. The code below implements the opposite naive rule, always predicting that a person earns more than $50,000, so recall is 1 and precision equals the fraction of positive labels.
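The F-score used throughout is the F-beta score, which with β = 0.5 weights precision more heavily than recall; this is the formula the naive-predictor code below implements directly:

$$F_{\beta} = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

With β = 0.5, the score rewards a model more for avoiding false positives than for finding every positive.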
# If we always predict 1 (income > $50K):
TP = float(len(y_val[y_val == 1]))  # every actual positive is predicted positive
FP = float(len(y_val[y_val == 0]))  # every actual negative is a false positive
FN = 0                              # nothing is ever predicted negative

# Accuracy
accuracy = float(len(y_val[y_val == 1])) / len(y_val)

# Precision
precision = TP / (TP + FP)

# Recall
recall = TP / (TP + FN)

# F-score with beta = 0.5, using the formula above
fscore = (1 + 0.5**2) * ((precision * recall) / (0.5**2 * precision + recall))

# Print the results
print("Naive Predictor on validation data: \n \
    Accuracy score: {:.4f} \n \
    Precision: {:.4f} \n \
    Recall: {:.4f} \n \
    F-score: {:.4f}".format(accuracy, precision, recall, fscore))
Naive Predictor on validation data:
Accuracy score: 0.2478
Precision: 0.2478
Recall: 1.0000
F-score: 0.2917
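As a sanity check, the same naive metrics can be reproduced with sklearn's built-in scorers by scoring an all-ones prediction. This is a standalone sketch with toy labels, not the project's actual `y_val`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, fbeta_score

# Toy validation labels: 2 of 8 people earn more than $50K
y_val_toy = np.array([1, 0, 0, 0, 1, 0, 0, 0])

# The naive predictor always predicts 1 (income > $50K)
naive_pred = np.ones_like(y_val_toy)

accuracy = accuracy_score(y_val_toy, naive_pred)    # fraction of positives = 0.25
precision = precision_score(y_val_toy, naive_pred)  # TP / (TP + FP) = 0.25
recall = recall_score(y_val_toy, naive_pred)        # TP / (TP + FN) = 1.0
fscore = fbeta_score(y_val_toy, naive_pred, beta=0.5)

print(accuracy, precision, recall, fscore)
```

For an always-positive predictor, accuracy and precision both collapse to the positive-class rate, which is exactly the pattern in the output above (0.2478 for both).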
- Supervised Learning Models
- Creating a Training and Prediction Pipeline
# Import two evaluation metrics from sklearn - fbeta_score and accuracy_score
from sklearn.metrics import fbeta_score, accuracy_score
from time import time

def train_predict(learner, sample_size, X_train, y_train, X_val, y_val):
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: income training set
       - X_val: features validation set
       - y_val: income validation set
    '''
    results = {}

    # Fit the learner to the training data using slicing with 'sample_size'
    start = time()  # start time
    learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time()    # end time

    # Training time
    results['train_time'] = end - start

    # Get predictions on the validation set,
    # then predictions on the first 300 training points
    start = time()  # start time
    predictions_val = learner.predict(X_val)
    predictions_train = learner.predict(X_train[:300])
    end = time()    # end time

    # Prediction time
    results['pred_time'] = end - start

    # Accuracy on the first 300 training samples
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)

    # Accuracy on the validation set
    results['acc_val'] = accuracy_score(y_val, predictions_val)

    # F-score on the first 300 training samples
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)

    # F-score on the validation set
    results['f_val'] = fbeta_score(y_val, predictions_val, beta=0.5)

    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))

    # Return the results
    return results
- Model Evaluation
Choose six classification models of interest to explore.
Import the chosen supervised learning models; compute how many data points correspond to 1%, 10%, and 100% of the training data, and store these values in 'samples_1', 'samples_10', and 'samples_100'.
# Import three supervised learning models from sklearn
from sklearn import naive_bayes, tree, ensemble

# Initialize the three models
clf_A = naive_bayes.GaussianNB()                       # Gaussian Naive Bayes
clf_B = tree.DecisionTreeClassifier(random_state = 0)  # Decision Tree
clf_C = ensemble.AdaBoostClassifier(random_state = 0)  # Ensemble method: AdaBoost

# Compute how many points correspond to 1%, 10%, and 100% of the training data
samples_1 = int(X_train.shape[0] * 0.01)
samples_10 = int(X_train.shape[0] * 0.1)
samples_100 = X_train.shape[0]

# Collect the learners' results
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_val, y_val)

# Visualize the evaluation results for the three chosen models
vs.evaluate(results, accuracy, fscore)
# Import three more supervised learning models from sklearn
from sklearn import svm, linear_model, neighbors

# Initialize the three models
clf_A = svm.SVC(random_state = 0)                          # Support Vector Machine (SVM)
clf_B = linear_model.LogisticRegression(random_state = 0)  # Logistic Regression
clf_C = neighbors.KNeighborsClassifier()                   # K Nearest Neighbors

# Compute how many points correspond to 1%, 10%, and 100% of the training data
samples_1 = int(X_train.shape[0] * 0.01)
samples_10 = int(X_train.shape[0] * 0.1)
samples_100 = X_train.shape[0]

# Collect the learners' results
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_val, y_val)

# Visualize the evaluation results for the three chosen models
vs.evaluate(results, accuracy, fscore)
The results above show that logistic regression is the best fit for this problem: its validation accuracy is the best among the six models evaluated, and, with the accuracies otherwise close, it also has the shortest prediction time.
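The visual comparison above can also be done numerically by flattening the nested `results` dict that `train_predict` returns into a table. This is a standalone sketch: the two entries and all their numbers below are made up for illustration, not the notebook's actual measurements:

```python
import pandas as pd

# Hypothetical results in the nested format train_predict produces:
# results[model_name][i] for i = 0, 1, 2 (1%, 10%, 100% of the training data)
results = {
    'LogisticRegression': {2: {'pred_time': 0.01, 'acc_val': 0.85, 'f_val': 0.70}},
    'SVC':                {2: {'pred_time': 9.50, 'acc_val': 0.84, 'f_val': 0.69}},
}

# Collect the 100%-of-training-data scores (key 2) into one table,
# with one row per model, then rank by validation F-score
summary = pd.DataFrame({name: runs[2] for name, runs in results.items()}).T
ranked = summary.sort_values('f_val', ascending=False)
print(ranked)
```

Sorting by `f_val` (or `pred_time`) makes the "best accuracy, shortest prediction time" argument explicit rather than read off a chart.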
- Model Tuning
Tune the logistic regression model's parameters: l1/l2 regularization and the parameter C.
# Import 'GridSearchCV', 'make_scorer', and other necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Initialize the classifier
clf = LogisticRegression(random_state = 0)

# Create the list of parameters to tune
# (the liblinear solver supports both 'l1' and 'l2' penalties)
parameters = {'penalty': ('l1', 'l2'), 'C': (0.01, 0.1, 1, 10, 100)}

# Create an fbeta_score scoring object
scorer = make_scorer(fbeta_score, beta=0.5)

# Perform a grid search on the classifier, using 'scorer' as the scoring function
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# Fit the grid search object to the training data and find the best parameters
grid_obj.fit(X_train, y_train)

# Get the best estimator
best_clf = grid_obj.best_estimator_

# Make predictions with the untuned model and with the best estimator
predictions = (clf.fit(X_train, y_train)).predict(X_val)
best_predictions = best_clf.predict(X_val)

# Report the tuned model
print("best_clf\n------")
print(best_clf)

# Report the scores before and after tuning
print("\nUnoptimized model\n------")
print("Accuracy score on validation data: {:.4f}".format(accuracy_score(y_val, predictions)))
print("F-score on validation data: {:.4f}".format(fbeta_score(y_val, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the validation data: {:.4f}".format(accuracy_score(y_val, best_predictions)))
print("Final F-score on the validation data: {:.4f}".format(fbeta_score(y_val, best_predictions, beta = 0.5)))
best_clf
------
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l1', random_state=0, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
Unoptimized model
------
Accuracy score on validation data: 0.8536
F-score on validation data: 0.7182
Optimized Model
------
Final accuracy score on the validation data: 0.8545
Final F-score on the validation data: 0.7210
We can see:

| Metric | Unoptimized model | Optimized model |
|---|---|---|
| Accuracy | 0.8536 | 0.8545 |
| F-score | 0.7182 | 0.7210 |

The optimized model shows a small performance improvement over the unoptimized one.
- Extracting Important Features
Choose a supervised learning classifier from scikit-learn that has a `feature_importances_` attribute; this attribute ranks the features by how important they are to the algorithm's predictions.
# Import a supervised learning model that has 'feature_importances_'
from sklearn.ensemble import AdaBoostClassifier

# Train the supervised model on the training set
model = AdaBoostClassifier()
model.fit(X_train, y_train)

# Extract the feature importances
importances = model.feature_importances_

# Plot
vs.feature_plot(importances, X_train, y_train)
We can see that these five features have a large influence on the model's predictions: their weights sum to more than 0.5.
Next, we train the model on the same training set using only the five most important features, trying to shrink the feature space and simplify what the model has to learn.
# Import the cloning functionality
from sklearn.base import clone

# Reduce the feature space to the five most important features
X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]
X_val_reduced = X_val[X_val.columns.values[(np.argsort(importances)[::-1])[:5]]]

# Train the "best" model from the earlier grid search on the reduced data
clf_on_reduced = (clone(best_clf)).fit(X_train_reduced, y_train)

# Make new predictions
reduced_predictions = clf_on_reduced.predict(X_val_reduced)

# Report the final model's scores for each version of the data
print("Final Model trained on full data\n------")
print("Accuracy on validation data: {:.4f}".format(accuracy_score(y_val, best_predictions)))
print("F-score on validation data: {:.4f}".format(fbeta_score(y_val, best_predictions, beta = 0.5)))
print("\nFinal Model trained on reduced data\n------")
print("Accuracy on validation data: {:.4f}".format(accuracy_score(y_val, reduced_predictions)))
print("F-score on validation data: {:.4f}".format(fbeta_score(y_val, reduced_predictions, beta = 0.5)))
Final Model trained on full data
------
Accuracy on validation data: 0.8545
F-score on validation data: 0.7210
Final Model trained on reduced data
------
Accuracy on validation data: 0.8150
F-score on validation data: 0.6276
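The dense indexing expression used above to keep the top five features can be illustrated on a toy array. This is a standalone sketch with made-up feature names and importances, not the project's data:

```python
import numpy as np
import pandas as pd

# Toy importances for four hypothetical features
importances = np.array([0.10, 0.40, 0.05, 0.45])
X = pd.DataFrame(np.zeros((3, 4)), columns=['a', 'b', 'c', 'd'])

# np.argsort sorts ascending; [::-1] reverses to descending; [:2] keeps the top two
top2 = X.columns.values[(np.argsort(importances)[::-1])[:2]]
print(top2)  # ['d' 'b']

# Selecting by that column list yields the reduced DataFrame
X_reduced = X[top2]
```

The same pattern with `[:5]` produces `X_train_reduced` and `X_val_reduced` above.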
- Testing Your Model on the Test Set
# Test the model on the test data and report accuracy and F-score
predict_test = grid_obj.predict(X_test)
accuracy_score_test = accuracy_score(y_test, predict_test)
fbeta_score_test = fbeta_score(y_test, predict_test, beta=0.5)
print(accuracy_score_test, fbeta_score_test)
0.846323935876 0.703242430749