Finding Donors for CharityML (Using Several Supervised Learning Algorithms)

Kira_Tseng

Article link: https://blog.csdn.net/Kira_Tseng/article/details/101163041

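Finding Donors for CharityML

In this project, we use data collected from the 1994 U.S. Census and apply several supervised learning algorithms to model respondents' income accurately. Based on preliminary results, we then pick the best candidate algorithm and tune it further to best model the data. The goal is to build a model that accurately predicts whether a respondent earns more than $50,000 a year. This kind of task arises in non-profit organizations that depend on donations. Knowing people's income helps a non-profit decide how large a donation to request, or whether to reach out to a person at all. Although it is hard to determine an individual's general income bracket directly from public sources, we can (and this is exactly what we will do) infer it from other publicly available features.

The dataset for this project comes from the UCI Machine Learning Repository. It was donated by Ron Kohavi and Barry Becker after they published the article "Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid", which you can find in the online version provided by Ron Kohavi. The dataset explored here differs slightly from the original: the feature 'fnlwgt' has been removed, along with records that had missing or ill-formatted entries.

Exploring the Data

The code below loads the required Python libraries and imports the census data. The last column of the dataset, 'income', is the column to predict (whether the respondent's annual income is more than, or at most, $50,000); every other column in the census data is a feature describing the respondent.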

# Import the libraries needed for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames

# Import the supplementary visualization code visuals.py
import visuals as vs

# Render plots inline in the notebook
%matplotlib inline

# Load the census dataset
data = pd.read_csv("census.csv")

# Success - display the first record
display(data.head(n=1))

	age	workclass	education_level	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	income
0	39	State-gov	Bachelors	13.0	Never-married	Adm-clerical	Not-in-family	White	Male	2174.0	0.0	40.0	United-States	<=50K

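A First Look

First, we take a rough look at the dataset: how many respondents fall into each income class, and what percentage of them earn more than $50,000 a year.

The code cell below computes the following quantities:

The total number of records, 'n_records'.
The number of individuals making more than $50,000 annually, 'n_greater_50k'.
The number of individuals making at most $50,000 annually, 'n_at_most_50k'.
The percentage of individuals making more than $50,000 annually, 'greater_percent'.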

# Total number of records
n_records = len(data)

# Number of respondents whose income is more than $50,000
n_greater_50k = len(data.loc[data["income"] == ">50K"])

# Number of respondents whose income is at most $50,000
n_at_most_50k = len(data.loc[data["income"] == "<=50K"])

# Percentage of respondents whose income is more than $50,000
greater_percent = 100*(n_greater_50k/n_records)

# Print the results
print("Total number of records: {}".format(n_records))
print("Individuals making more than $50,000: {}".format(n_greater_50k))
print("Individuals making at most $50,000: {}".format(n_at_most_50k))
print("Percentage of individuals making more than $50,000: {:.2f}%".format(greater_percent))

Total number of records: 45222
Individuals making more than $50,000: 11208
Individuals making at most $50,000: 34014
Percentage of individuals making more than $50,000: 24.78%
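The same class counts can also be read off with pandas' value_counts. The sketch below is only an illustration: the small Series is a hypothetical stand-in for the 'income' column, not the census data.

```python
import pandas as pd

# Toy stand-in for the 'income' column (hypothetical values, not the census data)
income_toy = pd.Series(["<=50K", ">50K", "<=50K", "<=50K"], name="income")

counts = income_toy.value_counts()                # absolute count per class
shares = income_toy.value_counts(normalize=True)  # fraction per class

n_greater_50k_toy = int(counts[">50K"])
greater_percent_toy = 100 * shares[">50K"]

print(counts)
print("{:.2f}%".format(greater_percent_toy))
```

On the real data, data["income"].value_counts() should give the same two counts as the len(...) expressions above.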

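Preparing the Data

Before data can be used as input to machine learning algorithms, it often has to be cleaned, formatted, and restructured; this is typically called preprocessing. Fortunately, this dataset has no invalid or missing entries that we must handle. However, certain features have properties that require adjustment. This preprocessing can greatly improve the results and predictive power of nearly all learning algorithms.

Extracting Features and Labels

The 'income' column is the label we need: it records whether a person's annual income exceeds 50K. We therefore separate it from the rest of the data and store it on its own.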

# Split the data into features and the corresponding label
income_raw = data['income']
features_raw = data.drop('income', axis = 1)

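Transforming Skewed Continuous Features

A dataset may contain features whose values mostly lie near a single number but which also include a non-trivial number of values that are vastly larger or smaller than that number. Algorithms can be sensitive to such distributions and may underperform if the range is not properly normalized. Two features in the census dataset fit this description: 'capital-gain' and 'capital-loss'.

The code below plots a histogram of each of these two features.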

# Visualize the skewed features 'capital-gain' and 'capital-loss'
vs.distribution(features_raw)

[Figure: distributions of 'capital-gain' and 'capital-loss']

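For highly skewed features such as 'capital-gain' and 'capital-loss', it is common practice to apply a logarithmic transformation so that very large and very small values do not negatively affect the learning algorithm. A log transform also significantly reduces the range of values caused by outliers. Care is needed when applying it, however: the logarithm of 0 is undefined, so we first shift the values slightly above 0 (here, by adding 1) before taking the logarithm.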
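As a small illustration of the shift-by-one idea, the sketch below applies it to a handful of made-up values (they are assumptions for the example, not taken from the dataset) and compares the result with NumPy's np.log1p, which computes log(1 + x) directly.

```python
import numpy as np

# Toy values containing a zero and one very large outlier (hypothetical)
values = np.array([0.0, 9.0, 99.0, 99999.0])

shifted_log = np.log(values + 1)  # shift by 1 so that log(0) never occurs
log1p = np.log1p(values)          # NumPy's built-in equivalent of log(1 + x)

print(shifted_log)
print(np.allclose(shifted_log, log1p))
```

Zeros stay at zero after the transform, while the outlier is pulled much closer to the rest of the values.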

The code below performs the transformation and visualizes the result.

# Log-transform the skewed features
skewed = ['capital-gain', 'capital-loss']
features_raw[skewed] = data[skewed].apply(lambda x: np.log(x + 1))

# Visualize 'capital-gain' and 'capital-loss' after the log transform
vs.distribution(features_raw, transformed = True)

[Figure: distributions of 'capital-gain' and 'capital-loss' after the log transform]

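Normalizing Numerical Features

In addition to transforming highly skewed features, it is generally good practice to apply some form of scaling to numerical features. Scaling does not change the shape of a feature's distribution (such as 'capital-gain' or 'capital-loss' above); it does, however, ensure that a supervised learner treats every feature equally. Note that once scaling is applied, the data in its raw form no longer carries its original meaning, as the example below shows.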
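To make the scaling concrete, here is a minimal sketch of min-max scaling, x' = (x - min) / (max - min), computed by hand and with MinMaxScaler. The three ages are toy values chosen for the example, not a claim about the dataset's actual range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy 'age' column (hypothetical values)
ages = np.array([[17.0], [39.0], [90.0]])

# Manual min-max scaling, column by column
col_min = ages.min(axis=0)
col_max = ages.max(axis=0)
manual = (ages - col_min) / (col_max - col_min)

# The same computation through scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The smallest value maps to 0, the largest to 1, and everything else keeps its relative position in between.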

The code below normalizes each numerical feature, using sklearn.preprocessing.MinMaxScaler.

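from sklearn.preprocessing import MinMaxScaler

# Initialize a scaler and apply it to the numerical features
scaler = MinMaxScaler()
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
# Note: the scaler is fitted on the raw columns in 'data', so this assignment
# overwrites the log-transformed 'capital-gain' and 'capital-loss' in 'features_raw';
# pass features_raw[numerical] instead to keep the log transform.
features_raw[numerical] = scaler.fit_transform(data[numerical])

# Show an example of a record with scaling applied
display(features_raw.head(n = 1))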

D:\Software\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:334: DataConversionWarning: Data with input dtype int64, float64 were all converted to float64 by MinMaxScaler.
  return self.partial_fit(X, y)

	age	workclass	education_level	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country
0	0.30137	State-gov	Bachelors	0.8	Never-married	Adm-clerical	Not-in-family	White	Male	0.02174	0.0	0.397959	United-States

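Data Preprocessing

The table in the data exploration above shows that several features hold non-numeric values in every record. Learning algorithms typically expect numeric input, which requires non-numeric features (called categorical variables) to be converted. One popular way to convert categorical variables is one-hot encoding, which creates a "dummy" variable for each possible category of each non-numeric feature. For example, suppose someFeature has three possible values: A, B, or C. We then encode this feature into someFeature_A, someFeature_B, and someFeature_C.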

[Figure: one-hot encoding example for someFeature]
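The someFeature example can be reproduced with a few lines of pandas. This is only a sketch on a three-row toy frame; the row values B, C, A are an assumption made for the illustration.

```python
import pandas as pd

# Toy frame with one categorical column (hypothetical rows)
toy = pd.DataFrame({"someFeature": ["B", "C", "A"]})

# One dummy column is created per category
encoded_toy = pd.get_dummies(toy)

print(list(encoded_toy.columns))
print(encoded_toy.astype(int))
```

Each row has exactly one dummy column set, which is what makes the encoding "one-hot".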

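In addition, as with the non-numeric features, the non-numeric target label 'income' must be converted to numerical values for the learning algorithms to work. Because this label has only two possible categories ("<=50K" and ">50K"), one-hot encoding is unnecessary; we can simply encode the two categories as 0 and 1.

The code below does the following:

Uses pandas.get_dummies() to one-hot encode the 'features_raw' data.
Converts the target label 'income_raw' to numerical entries.
- Converts "<=50K" to 0 and ">50K" to 1.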

# One-hot encode the 'features_raw' data using pandas.get_dummies()
features = pd.get_dummies(features_raw)

# Encode 'income_raw' as numerical values
income = income_raw.replace(["<=50K", ">50K"], [0, 1])

# Print the number of features after one-hot encoding
encoded = list(features.columns)
print("{} total features after one-hot encoding.".format(len(encoded)))

# Print the encoded feature names
print(encoded)

103 total features after one-hot encoding.
['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_level_ 10th', 'education_level_ 11th', 'education_level_ 12th', 'education_level_ 1st-4th', 'education_level_ 5th-6th', 'education_level_ 7th-8th', 'education_level_ 9th', 'education_level_ Assoc-acdm', 'education_level_ Assoc-voc', 'education_level_ Bachelors', 'education_level_ Doctorate', 'education_level_ HS-grad', 'education_level_ Masters', 'education_level_ Preschool', 'education_level_ Prof-school', 'education_level_ Some-college', 'marital-status_ Divorced', 'marital-status_ Married-AF-spouse', 'marital-status_ Married-civ-spouse', 'marital-status_ Married-spouse-absent', 'marital-status_ Never-married', 'marital-status_ Separated', 'marital-status_ Widowed', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct', 'occupation_ Other-service', 'occupation_ Priv-house-serv', 'occupation_ Prof-specialty', 'occupation_ Protective-serv', 'occupation_ Sales', 'occupation_ Tech-support', 'occupation_ Transport-moving', 'relationship_ Husband', 'relationship_ Not-in-family', 'relationship_ Other-relative', 'relationship_ Own-child', 'relationship_ Unmarried', 'relationship_ Wife', 'race_ Amer-Indian-Eskimo', 'race_ Asian-Pac-Islander', 'race_ Black', 'race_ Other', 'race_ White', 'sex_ Female', 'sex_ Male', 'native-country_ Cambodia', 'native-country_ Canada', 'native-country_ China', 'native-country_ Columbia', 'native-country_ Cuba', 'native-country_ Dominican-Republic', 'native-country_ Ecuador', 'native-country_ El-Salvador', 'native-country_ England', 'native-country_ France', 'native-country_ Germany', 'native-country_ Greece', 'native-country_ Guatemala', 'native-country_ Haiti', 'native-country_ Holand-Netherlands', 'native-country_ Honduras', 'native-country_ Hong', 'native-country_ Hungary', 'native-country_ India', 'native-country_ Iran', 'native-country_ Ireland', 'native-country_ Italy', 'native-country_ Jamaica', 'native-country_ Japan', 'native-country_ Laos', 'native-country_ Mexico', 'native-country_ Nicaragua', 'native-country_ Outlying-US(Guam-USVI-etc)', 'native-country_ Peru', 'native-country_ Philippines', 'native-country_ Poland', 'native-country_ Portugal', 'native-country_ Puerto-Rico', 'native-country_ Scotland', 'native-country_ South', 'native-country_ Taiwan', 'native-country_ Thailand', 'native-country_ Trinadad&Tobago', 'native-country_ United-States', 'native-country_ Vietnam', 'native-country_ Yugoslavia']

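Shuffling and Splitting the Data

All categorical variables have now been converted into numerical features, and all numerical features have been normalized. As usual, we now split the data (both features and labels) into training and test sets: 80% of the data is used for training and 20% for testing. The training data is then split again, 80/20, into a training set and a validation set used to select and tune the model.

The code below performs the split.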

# Import train_test_split
from sklearn.model_selection import train_test_split

# Split 'features' and 'income' into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, income, test_size = 0.2, random_state = 0,
                                                    stratify = income)
# Split 'X_train' and 'y_train' further into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0,
                                                    stratify = y_train)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Validation set has {} samples.".format(X_val.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 28941 samples.
Validation set has 7236 samples.
Testing set has 9045 samples.
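Two nested 80/20 splits leave roughly 64% of the records for training, 16% for validation, and 20% for testing, and stratify keeps the class ratio the same in every part. The sketch below checks this on synthetic data; the 1,000 rows and the 25% positive rate are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 rows, 25% positives (hypothetical)
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 250 + [0] * 750)

# First split: 80% remainder, 20% test
X_rest, X_te, y_rest, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

# Second split: 80% of the remainder for training, 20% for validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=0, stratify=y_rest)

sizes = (len(X_tr), len(X_va), len(X_te))
positive_rates = (y_tr.mean(), y_va.mean(), y_te.mean())
print(sizes)
print(positive_rates)
```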

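Evaluating Model Performance

In this section, we try four different algorithms and determine which one models the data best. The four are a naive predictor and three supervised learners of my choice.

Metrics and the Naive Predictor

CharityML knows from its research that individuals who make more than $50,000 a year are the most likely to donate. For this reason, CharityML is particularly interested in accurately predicting who makes more than $50,000. Accuracy therefore seems an appropriate metric for evaluating a model. In addition, identifying someone who does not make more than $50,000 as someone who does would be detrimental to CharityML, since it is looking for individuals willing to donate. A model's ability to precisely predict those who make more than $50,000 (precision) is thus more important than its ability to find all of them (recall). We can use the F-beta score as a metric that takes both precision and recall into account: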

$F_{\beta} = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\left( \beta^2 \cdot precision \right) + recall}$

In particular, when $\beta = 0.5$, more emphasis is placed on precision. This is called the $F_{0.5}$ score (or F-score for short).
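To see how $\beta = 0.5$ shifts the weight toward precision, the formula can be evaluated directly. The helper f_beta and the precision/recall pairs below are hypothetical and exist only for this illustration.

```python
def f_beta(precision, recall, beta):
    """F-beta score computed directly from the formula above."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical models with mirrored precision and recall
high_precision = f_beta(precision=0.9, recall=0.5, beta=0.5)
high_recall = f_beta(precision=0.5, recall=0.9, beta=0.5)

print("High precision, low recall: {:.4f}".format(high_precision))
print("Low precision, high recall: {:.4f}".format(high_recall))
```

With $\beta = 0.5$, the model with the higher precision gets the higher score even though both have the same pair of values swapped.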

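Naive Predictor Performance

Looking at the number of people who make more than $50,000 and those who do not, it is clear that most respondents do not make more than $50,000. Simply predicting "this person does not make more than $50,000" would therefore achieve an accuracy above 50% (about 75% on this dataset) without ever looking at the data. Such a prediction is called naive. It is always important to consider a naive predictor for your data, because it establishes a baseline for judging whether a model performs well.

The code below computes the performance of a naive predictor that always predicts that an individual makes more than $50,000 (label 1), so every negative record becomes a false positive and there are no false negatives. The results are assigned to 'accuracy', 'precision', 'recall', and 'fscore', which are used later.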

# False positives: every actual 0 is predicted as 1
False_Positive = len(y_val[y_val == 0])
# True positives: every actual 1 is predicted as 1
True_Positive = len(y_val[y_val == 1])
# False negatives: none, because the predictor never outputs 0
False_Negative = 0

# Compute accuracy
accuracy = True_Positive/len(y_val)

# Compute precision
precision = True_Positive/(True_Positive + False_Positive)

# Compute recall
recall = True_Positive/(True_Positive + False_Negative)

# Compute the F-score using the formula above with beta = 0.5
fscore = (1+0.5*0.5)*(precision*recall)/((0.5*0.5*precision)+recall)

# Print the results
print("Naive Predictor on validation data: \n \
    Accuracy score: {:.4f} \n \
    Precision: {:.4f} \n \
    Recall: {:.4f} \n \
    F-score: {:.4f}".format(accuracy, precision, recall, fscore))

Naive Predictor on validation data: 
     Accuracy score: 0.2478 
     Precision: 0.2478 
     Recall: 1.0000 
     F-score: 0.2917
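The same always-predict-1 baseline can also be expressed with scikit-learn's DummyClassifier, which avoids counting true and false positives by hand. This is a sketch on a four-record toy label vector (an assumption for the example), not a rerun of the numbers above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, fbeta_score

# Toy labels with 25% positives (hypothetical)
y_toy = np.array([1, 0, 0, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# A baseline that predicts label 1 for every record
naive = DummyClassifier(strategy="constant", constant=1).fit(X_toy, y_toy)
pred = naive.predict(X_toy)

acc_toy = accuracy_score(y_toy, pred)
f05_toy = fbeta_score(y_toy, pred, beta=0.5)
print("Accuracy: {:.4f}, F-score: {:.4f}".format(acc_toy, f05_toy))
```

Because recall is 1 and precision equals the positive rate, the accuracy of this baseline equals its precision, which matches the pattern in the output above.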

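Supervised Learning Models

Model Application

I chose three supervised learning models for this problem.

Model 1

Model: Ensemble method, Random Forest

Strengths: Handles very high-dimensional data, trains quickly, and is robust to noise. It performs best when enough trees are grown.

Weaknesses: May not classify well on small or low-dimensional datasets. It performs worse on regression problems than on classification problems.

Reason for choosing it: Random forests are well suited to classification, and the problem in this project can be treated as a classification problem.

Model 2

Model: Logistic Regression

Strengths: Easy to use and interpret, and efficient to train. It performs best when the features are uncorrelated and the number of features is much smaller than the number of samples.

Weaknesses: Cannot model non-linear decision boundaries directly, and performs poorly when features are strongly correlated.

Reason for choosing it: Logistic regression is well suited to binary classification, and the features in this dataset appear to be only weakly correlated.

Model 3

Model: Support Vector Machine (SVM)

Strengths: Generalizes well, handles high-dimensional problems, and can solve non-linear problems. It performs best on small samples.

Weaknesses: Hard to apply to large training sets, and multi-class problems are difficult. It performs poorly when much feature data is missing or when the training set is large.

Reason for choosing it: SVMs handle classification problems well.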

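Creating a Training and Predicting Pipeline

To evaluate each chosen model properly, it is important to create a training and validation pipeline that quickly and effectively trains a model on training sets of different sizes and makes predictions on the validation set.

The code cell below does the following:

Imports fbeta_score and accuracy_score from sklearn.metrics.
Fits the learner to the sampled training data and records the training time.
Predicts on the first 300 training points and on the validation set, and records the total prediction time.
Computes the accuracy and F-score on the first 300 training points.
Computes the accuracy and F-score on the validation set.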

# Import two metrics from sklearn - fbeta_score and accuracy_score
from sklearn.metrics import fbeta_score, accuracy_score

def train_predict(learner, sample_size, X_train, y_train, X_val, y_val): 
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: income training set
       - X_val: features validation set
       - y_val: income validation set
    '''
    
    results = {}
    
    # Fit the learner to the training data using slicing with 'sample_size'
    start = time() # Get start time
    learner = learner.fit(X_train[:sample_size],y_train[:sample_size])
    end = time() # Get end time
    
    # Compute the training time
    results['train_time'] = end - start
    
    # Get predictions on the validation set,
    # then on the first 300 training samples
    start = time() # Get start time
    predictions_val = learner.predict(X_val)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time
    
    # Compute the total prediction time
    results['pred_time'] = end - start
            
    # Compute accuracy on the first 300 training samples
    results['acc_train'] = accuracy_score(y_train[:300],predictions_train)
        
    # Compute accuracy on the validation set
    results['acc_val'] = accuracy_score(y_val,predictions_val)
    
    # Compute F-score on the first 300 training samples
    results['f_train'] = fbeta_score(y_train[:300],predictions_train, beta=0.5)
        
    # Compute F-score on the validation set
    results['f_val'] = fbeta_score(y_val,predictions_val, beta=0.5)
       
    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
        
    # Return the results
    return results

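Initial Model Evaluation

The code cell below does the following:

Imports the three supervised learning models discussed above.
Initializes the three models and stores them in 'clf_A', 'clf_B', and 'clf_C'.
- Uses each model's default parameters; one model's parameters are tuned in a later section.
- Sets random_state (where the model has this parameter).
Computes the number of records equal to 1%, 10%, and 100% of the training data and stores the values in 'samples_1', 'samples_10', and 'samples_100'.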

# Import the three supervised learning models from sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Initialize the three models
clf_A = RandomForestClassifier(random_state=0)
clf_B = LogisticRegression(random_state=6)
clf_C = SVC(random_state=5)

# Compute the number of samples for 1%, 10%, and 100% of the training data
samples_1 = int(len(X_train)*0.01)
samples_10 = int(len(X_train)*0.1)
samples_100 = len(X_train)

# Collect the results for each learner
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_val, y_val)

# Visualize the metrics for the three chosen models
vs.evaluate(results, accuracy, fscore)

D:\Software\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)


RandomForestClassifier trained on 289 samples.
RandomForestClassifier trained on 2894 samples.
RandomForestClassifier trained on 28941 samples.
LogisticRegression trained on 289 samples.
LogisticRegression trained on 2894 samples.


D:\Software\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)


LogisticRegression trained on 28941 samples.
SVC trained on 289 samples.


D:\Software\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
D:\Software\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)
D:\Software\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)


SVC trained on 2894 samples.


D:\Software\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)


SVC trained on 28941 samples.

[Figure: performance metrics for the three supervised learning models]

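Improving Results

In this final section, we choose the best of the three supervised learning models for the census data. We then run a grid search over the entire training set (X_train and y_train), tuning at least one parameter, to obtain a better F-score than the untuned model.

Choosing the Best Model

Based on the evaluation metrics and the training and prediction times above, I consider logistic regression the most suitable model. Its training and prediction times are the shortest, its scores on the small training samples are on par with the other models, and its score when trained on the full training set is the highest.

Describing the Model in Layman's Terms

The model predicts whether a citizen is a potential donor by analyzing the features in the data. There are only two outcomes, "yes" and "no". The model first sets up an equation that combines a citizen's features into a score, then uses an iterative optimization method (such as gradient descent) to find the best parameters for that equation, and finally computes the probability that the citizen is a potential donor. If the probability is greater than 0.5, the result is "1": the citizen's income is >50K and the citizen is predicted to be a potential donor. If the probability is at most 0.5, the result is "0": the citizen's income is <=50K and the citizen is predicted not to be a potential donor.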
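The last two steps of that description, turning a score into a probability and thresholding it at 0.5, can be sketched in a few lines. The weights, bias, and feature rows below are made up for the illustration; they are not the parameters the fitted model learned.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and two hypothetical people with two features each
weights = np.array([2.0, -1.0])
bias = -0.5
X_people = np.array([[1.0, 0.2],
                     [0.1, 0.9]])

# Linear score -> probability -> 0/1 label with a 0.5 threshold
probabilities = sigmoid(X_people @ weights + bias)
labels = (probabilities > 0.5).astype(int)

print(probabilities)
print(labels)
```

Training is the part that chooses the weights and bias; prediction is only this scoring and thresholding step.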

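Model Tuning

We now tune the chosen model's parameters, using grid search (GridSearchCV) over the entire training set to try several values of an important parameter (six values of C in the code below).

The code cell below does the following:

Imports sklearn.model_selection.GridSearchCV and sklearn.metrics.make_scorer.
Initializes the chosen classifier and stores it in clf.
Creates a dictionary of the parameters to tune, for example: parameters = {'parameter' : [list of values]}.
Uses make_scorer to create an fbeta_score scoring object (with $\beta = 0.5$).
Sets up a grid search on the classifier clf with 'scorer' as the scoring function and stores it in grid_obj.
Fits the grid search object to the training data (X_train, y_train).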

# Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Initialize the classifier
clf = LogisticRegression(random_state = 10)

# Create the dictionary of parameters to tune
parameters = {'C':[0.05,1.0,5.0,10.0,50.0,100.0]}

# Create an fbeta_score scoring object
scorer = make_scorer(fbeta_score, beta=0.5)

# Set up the grid search on the classifier, using 'scorer' as the scoring function
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

# Fit the grid search object to the training data and find the best parameters
grid_obj.fit(X_train, y_train)

# Get the best estimator
best_clf = grid_obj.best_estimator_

# Make predictions with the untuned model and with the tuned model
predictions = (clf.fit(X_train, y_train)).predict(X_val)
best_predictions = best_clf.predict(X_val)

# Report the tuned model
print("best_clf\n------")
print(best_clf)

# Report the scores before and after tuning
print("\nUnoptimized model\n------")
print("Accuracy score on validation data: {:.4f}".format(accuracy_score(y_val, predictions)))
print("F-score on validation data: {:.4f}".format(fbeta_score(y_val, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the validation data: {:.4f}".format(accuracy_score(y_val, best_predictions)))
print("Final F-score on the validation data: {:.4f}".format(fbeta_score(y_val, best_predictions, beta = 0.5)))

D:\Software\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
D:\Software\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)


best_clf
------
LogisticRegression(C=5.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=10, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Unoptimized model
------
Accuracy score on validation data: 0.8536
F-score on validation data: 0.7182

Optimized Model
------
Final accuracy score on the validation data: 0.8542
Final F-score on the validation data: 0.7182

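Final Model Evaluation

Metric	Unoptimized Model	Optimized Model
Accuracy	0.8536	0.8542
F-score	0.7182	0.7182

Conclusion: The optimized model's accuracy is slightly higher than the unoptimized model's (0.8542 vs. 0.8536), while the F-score is unchanged to four decimal places (0.7182). The gain is very small, possibly because the tuned parameter has little effect on this model.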
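One way to check whether a parameter matters at all is to look at the cross-validated score for every grid point instead of only the winner. The sketch below does this on a small synthetic dataset from make_classification; the data, the three C values, max_iter=1000, and cv=3 are all assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Small synthetic binary classification problem (not the census data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.05, 1.0, 5.0]},
    scoring=make_scorer(fbeta_score, beta=0.5),
    cv=3,
)
grid.fit(X_toy, y_toy)

# Mean cross-validated F0.5 for each value of C
mean_scores = grid.cv_results_["mean_test_score"]
for C, score in zip(grid.cv_results_["param_C"], mean_scores):
    print("C={}: mean F0.5 = {:.4f}".format(C, score))
print(grid.best_params_)
```

If the mean scores are nearly identical across the grid, the parameter has little influence and tuning it cannot be expected to move the validation score much.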

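Feature Importance

An important task when applying supervised learning to a dataset such as the census data used here is determining which features provide the most predictive power. By focusing on the relationship between a few crucial features and the label, we simplify our understanding of the phenomenon, which is useful in most cases. In this project, that means identifying a small number of features that most strongly predict whether a respondent makes more than $50,000 a year.

Below, we use a scikit-learn classifier that has a 'feature_importances_' attribute, which holds an importance score for each feature. In the next code cell, this classifier is fitted to the training set and the attribute is used to determine the five most important features in the census data.

Observing Feature Relevance

When exploring the data, we saw that each record in the census dataset has thirteen available features.
Of these thirteen features, I believe the following five are the most important for prediction, ranked from most important (feature 1) to least important (feature 5).

Feature 1: capital-gain. Capital gains directly affect income level.
Feature 2: occupation. Income varies widely across occupations.
Feature 3: workclass. The higher the work class, the higher the income should be.
Feature 4: age. Middle age should be the peak period of wealth accumulation.
Feature 5: education_level. The higher the education level, the more likely a person is to earn a high income.

Extracting Feature Importance

I chose a supervised learning classifier from scikit-learn that has a 'feature_importances_' attribute, which scores the importance of each feature according to the chosen algorithm when making predictions.

The code cell below does the following:

Imports a supervised learning model from sklearn if it differs from the three models used earlier.
Trains the supervised model on the entire training set.
Extracts the feature importances using 'feature_importances_'.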

# Train a supervised learning model on the training set
model = RandomForestClassifier(max_depth=2, random_state=10)

model.fit(X_train,y_train)

# Extract the feature importances
importances = model.feature_importances_

# Plot
vs.feature_plot(importances, X_train, y_train)

D:\Software\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

[Figure: importances of the five most predictive features]

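Observations on Feature Importance

From the visualization above, I conclude:

The weights of these five features add up to more than 0.5.
The five features differ somewhat from my earlier guess.
A possible reason is that, after marrying, both spouses continue to work, and a dual income affects household income more directly.

Feature Selection

With fewer features to train on, we expect training and prediction times to be lower. The visualization above shows that the five most important features account for more than half of the total importance of all features in the data. This suggests that we can try to reduce the feature space and simplify the information the model has to learn.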
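Picking the top features and checking how much importance they cover comes down to a sort and a sum. The sketch below uses a made-up importance vector over six hypothetical features, so the names and numbers are assumptions for the example.

```python
import numpy as np

# Hypothetical feature names and importances (the importances sum to 1)
feature_names = np.array(["f0", "f1", "f2", "f3", "f4", "f5"])
importances_toy = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])

top_k = 3
order = np.argsort(importances_toy)[::-1]  # indices sorted from most to least important
top_idx = order[:top_k]

top_names = feature_names[top_idx]
top_share = importances_toy[top_idx].sum()  # fraction of total importance covered

print(top_names)
print("Share of total importance: {:.2f}".format(top_share))
```

The same argsort-and-slice pattern is what selects the five columns when the feature space is reduced.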

The code cell below takes the optimized model found earlier and trains it on the same training set using only the five most important features.

# Import functionality for cloning a model
from sklearn.base import clone

# Reduce the feature space to the five most important features
X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]
X_val_reduced = X_val[X_val.columns.values[(np.argsort(importances)[::-1])[:5]]]

# Train a clone of the "best" model found by the grid search
clf_on_reduced = (clone(best_clf)).fit(X_train_reduced, y_train)

# Make new predictions
reduced_predictions = clf_on_reduced.predict(X_val_reduced)

# Report the final model's scores for each version of the data
print("Final Model trained on full data\n------")
print("Accuracy on validation data: {:.4f}".format(accuracy_score(y_val, best_predictions)))
print("F-score on validation data: {:.4f}".format(fbeta_score(y_val, best_predictions, beta = 0.5)))
print("\nFinal Model trained on reduced data\n------")
print("Accuracy on validation data: {:.4f}".format(accuracy_score(y_val, reduced_predictions)))
print("F-score on validation data: {:.4f}".format(fbeta_score(y_val, reduced_predictions, beta = 0.5)))

Final Model trained on full data
------
Accuracy on validation data: 0.8542
F-score on validation data: 0.7182

Final Model trained on reduced data
------
Accuracy on validation data: 0.8208
F-score on validation data: 0.6418


D:\Software\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

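Effect of Feature Selection

Compared with using all features, the final model trained on only five features has a lower accuracy (0.8208 vs. 0.8542) and a lower F-score (0.6418 vs. 0.7182), but the drop is still acceptable.
If training time were a factor, I would consider using the reduced feature set as the training data.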

Testing the Model on the Test Set

# Evaluate the tuned model on the test set and report accuracy and F-score
y_test_pred = best_clf.predict(X_test)
print(accuracy_score(y_test, y_test_pred))
print(fbeta_score(y_test, y_test_pred, beta=0.5))

0.848424543946932
0.7064191315292637

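For a binary classification problem, logistic regression is a good first model to try. Because the features in this project's data appear to be only weakly correlated, the model achieves relatively high performance and good accuracy on the data. Finally, grid search and parameter tuning improved the accuracy slightly.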
