sklearn usage notes 1: Titanic

Latest recommended article published on 2022-10-15 16:10:40

宋老板的笔记

Tags: machine learning

Article link: https://blog.csdn.net/weixin_41684423/article/details/115540324


My memory is poor and I am afraid of forgetting things, so I write notes for myself as I go.

I. Seaborn and pyplot

For detailed usage, see the Git repository: https://github.com/jamjar102/kaggle_titanic

An excerpt from it:

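    import matplotlib.pyplot as plt
    import seaborn as sns

    # train is assumed to be the Titanic training DataFrame with 'Age' and 'Survived' columns
    facet = sns.FacetGrid(train, hue="Survived", aspect=2)
    facet.map(sns.kdeplot, 'Age', fill=True)  # age vs. survival; fill replaces the deprecated shade (seaborn >= 0.11)
    facet.set(xlim=(0, train['Age'].max()))
    facet.add_legend()
    plt.xlabel('Age')
    plt.ylabel('density')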

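1. First, sns.FacetGrid sets up the grid (the outline of the figure). 2. Then map draws the plot onto it.

kdeplot draws a kernel density estimate (KDE) plot, which is mainly used to show how the values of a feature are distributed.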
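To make "kernel density" less abstract, here is a minimal pure-Python sketch of a Gaussian KDE. The ages and the fixed bandwidth of 5.0 are made up for illustration; seaborn chooses its bandwidth automatically, so this is not what kdeplot runs internally, only the same idea.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with Gaussian kernels."""
    n = len(samples)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # One Gaussian bump per observation, centred on that observation.
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

# Hypothetical ages, only for illustration.
ages = [2, 4, 22, 24, 25, 27, 30, 31, 35, 58]
density = gaussian_kde(ages, bandwidth=5.0)

# Evaluate the curve on a grid from -40 to 140 in steps of 0.5;
# a KDE plot is a smooth curve of this kind.
step = 0.5
grid = [i * step for i in range(-80, 281)]
curve = [density(x) for x in grid]

# A density integrates to 1; a Riemann sum over the grid should be close to that.
area = sum(curve) * step
print(round(area, 3))
```

The curve is high where many observations sit close together (around 25 here) and low where they are sparse, which is why it reads like a smoothed histogram.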

Credits: https://blog.csdn.net/qq_39112101/article/details/86439415

https://www.jianshu.com/p/6a210c2ad3ad

II. pandas

1. To select the rows of a DataFrame that meet a condition, use loc:

    Female_Child_Group = all_data.loc[
        (all_data['FamilyGroup'] >= 2) & ((all_data['Age'] <= 12) | (all_data['Sex'] == 'female'))]
    Male_Adult_Group = all_data.loc[
        (all_data['FamilyGroup'] >= 2) & (all_data['Age'] > 12) & (all_data['Sex'] == 'male')]
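The same selection on a tiny made-up DataFrame, with the conditions named so the logic is easier to read. The rows are hypothetical; only the column names come from the snippet above.

```python
import pandas as pd

# Hypothetical rows; column names follow the snippet above.
all_data = pd.DataFrame({
    'FamilyGroup': [1, 3, 3, 2, 2],
    'Age':         [30, 8, 40, 35, 12],
    'Sex':         ['male', 'male', 'female', 'male', 'female'],
})

# Each comparison yields a boolean Series; & and | combine them element-wise.
# The parentheses matter because & and | bind tighter than comparisons.
in_family = all_data['FamilyGroup'] >= 2
is_child = all_data['Age'] <= 12
is_female = all_data['Sex'] == 'female'
is_adult_male = (all_data['Age'] > 12) & (all_data['Sex'] == 'male')

Female_Child_Group = all_data.loc[in_family & (is_child | is_female)]
Male_Adult_Group = all_data.loc[in_family & is_adult_male]

print(Female_Child_Group)
print(Male_Adult_Group)
```

Row 0 falls in neither group because its FamilyGroup is 1.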

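2. When a column contains repeated values, add a new column that records how many times each value occurs: build a dictionary from value_counts() and apply it to the column: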

    Ticket_Count = dict(all_data['Ticket'].value_counts())
    all_data['TicketGroup'] = all_data['Ticket'].apply(lambda x:Ticket_Count[x])
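A small check of the counting trick on made-up ticket numbers. The second column shows a variant using Series.map, which I believe gives the same result without the intermediate dictionary; the column name TicketGroup_alt is only for this sketch.

```python
import pandas as pd

# Hypothetical ticket numbers.
all_data = pd.DataFrame({'Ticket': ['A1', 'B2', 'A1', 'C3', 'A1', 'B2']})

# Approach from the notes: dictionary of counts, looked up once per row.
Ticket_Count = dict(all_data['Ticket'].value_counts())
all_data['TicketGroup'] = all_data['Ticket'].apply(lambda x: Ticket_Count[x])

# Variant: map each ticket directly through the value_counts() Series.
all_data['TicketGroup_alt'] = all_data['Ticket'].map(all_data['Ticket'].value_counts())

print(all_data)
```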

3. Check which columns still contain null values:

    print(all_data.isnull().sum()[all_data.isnull().sum() > 0])

4. Fill the null values in a column:

    all_data.loc[(all_data.Age.isnull()),'Age']=predictedAges
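Items 3 and 4 together on a toy DataFrame. The data is invented, and predictedAges here is just a hand-written array standing in for whatever model produced the real predictions; the only requirement is one value per missing row, in row order.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps.
all_data = pd.DataFrame({
    'Age':  [22.0, np.nan, 35.0, np.nan],
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Sex':  ['male', 'female', 'female', 'male'],
})

# Count nulls per column once, then keep only the columns that have any.
null_counts = all_data.isnull().sum()
print(null_counts[null_counts > 0])

# Stand-in for predictedAges: made-up values, one per row where Age is null.
predictedAges = np.array([28.0, 40.0])
all_data.loc[all_data['Age'].isnull(), 'Age'] = predictedAges

print(all_data['Age'].tolist())
```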

5. groupby + mean + value_counts(): take the mean of Survived per surname, then count how many surnames share each mean

    Female_Child = pd.DataFrame(Female_Child_Group.groupby('Surname')['Survived'].mean().value_counts())

groupby result:

    <pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EB880E4E48>

+ mean() result:

    Surname
    Abbott            1.0
    Abelseth          NaN
    Abelson           1.0
    Aks               1.0
    Allen             1.0
    ...
    Yasbeck           1.0
    Zabour            0.0
    de Messemaeker    1.0
    del Carlo         NaN
    van Billiard      NaN

+ value_counts() result:

    Female_Child    GroupCount
    1.000000        115
    0.000000        31
    0.750000        2
    0.333333        1
    0.142857        1
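The three stages can be replayed on a handful of invented rows (the surnames and values below are made up, not taken from the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical group; NaN in Survived stands for a passenger with no label.
Female_Child_Group = pd.DataFrame({
    'Surname':  ['Abbott', 'Abbott', 'Allen', 'Zabour', 'Zabour', 'Carlo'],
    'Survived': [1.0, 1.0, 1.0, 0.0, 0.0, np.nan],
})

grouped = Female_Child_Group.groupby('Surname')['Survived']  # lazy group object, nothing computed yet
rates = grouped.mean()          # survival rate per surname; NaN if every value is missing
counts = rates.value_counts()   # how many surnames have each rate (NaN is dropped by default)

print(rates)
print(counts)
```

In this toy data two surnames have a rate of 1.0 and one has 0.0, while the all-NaN surname does not appear in the counts.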

III. sklearn

1. Pipeline

    pipe = Pipeline([('select', SelectKBest(k=20)),
                     ('classify', RandomForestClassifier(random_state=10, max_features='sqrt'))])

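A Pipeline is built from a list of steps. Each step is a tuple containing a name (chosen by you) and an estimator instance.

make_pipeline creates a pipeline without requiring names; the steps are named automatically. Each automatic name is the lowercase class name, and if several steps share the same class a number is appended.

When fitted, the pipeline calls fit on each step in order (for example a scaler first), passing the transformed data on to the next step, and finally fits the last estimator. The advantage is that the pipeline is itself an estimator, so it can be used in cross_val_score or GridSearchCV.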

    select = SelectKBest(k=20)
    clf = RandomForestClassifier(random_state=10, warm_start=True,
                                 n_estimators=26,
                                 max_depth=6,
                                 max_features='sqrt')
    pipeline = make_pipeline(select, clf)
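A quick way to confirm the automatic naming and the cross_val_score point is to run both on synthetic data. make_classification below is only a stand-in for the Titanic features, and the StandardScaler pair exists purely to show the numbered names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)

pipeline = make_pipeline(
    SelectKBest(k=20),
    RandomForestClassifier(n_estimators=26, max_depth=6, max_features='sqrt', random_state=10),
)

# Automatic names are the lowercase class names.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Two steps of the same class get a numeric suffix.
twin_names = [name for name, _ in make_pipeline(StandardScaler(), StandardScaler()).steps]
print(twin_names)

# The pipeline is an estimator, so feature selection is re-fitted inside every fold.
scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(scores.mean())
```

Because selection happens inside each fold, the held-out fold never influences which features are kept.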

Credits: https://blog.csdn.net/elma_tww/article/details/88427695

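2. GridSearchCV (grid search)

Grid search tries every parameter combination in the grid and gives back the model that corresponds to the best one.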

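After the search has been fitted, the results are usually inspected with

    gsearch.best_params_, gsearch.best_score_

to see the best parameters and the best cross-validated score. With the default refit=True, the search object is also re-trained with those parameters, so it can be used directly for prediction.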

    pipe = Pipeline([('select', SelectKBest(k=20)),
                     ('classify', RandomForestClassifier(random_state=10, max_features='sqrt'))])

    param_test = {'classify__n_estimators': list(range(20, 50, 2)),
                  'classify__max_depth': list(range(3, 60, 3))}
    gsearch = GridSearchCV(estimator=pipe, param_grid=param_test, scoring='roc_auc', cv=10)

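Parameter notes:

(1) estimator

The estimator to tune, constructed with every parameter except the ones being searched. Either the estimator must provide a score method or scoring must be passed to GridSearchCV. Example: estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10).

(2) param_grid

The candidate values of the parameters to optimize, given as a dict or a list of dicts, for example param_grid=param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.

(3) scoring=None

The evaluation metric. The default is None, in which case the estimator's own score method is used. Otherwise pass a string naming a metric, such as scoring='roc_auc', or a callable with the signature scorer(estimator, X, y). Which metric is appropriate depends on the model. For how to choose a specific value, see section 3 of this article.

(4) cv=None

The cross-validation strategy. The default None uses 5-fold cross-validation (it was 3-fold before scikit-learn 0.22). It can also be an integer number of folds or an iterable that yields train/test splits.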
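Putting the four parameters together, here is a small end-to-end sketch. It assumes synthetic data from make_classification in place of the Titanic features, and a much smaller grid and cv=3 than in the notes so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)

pipe = Pipeline([('select', SelectKBest(k=20)),
                 ('classify', RandomForestClassifier(random_state=10, max_features='sqrt'))])

# Keys use <step name>__<parameter name> to reach inside the pipeline.
# Deliberately tiny grid: 2 x 2 = 4 candidates.
param_test = {'classify__n_estimators': [20, 30],
              'classify__max_depth': [3, 6]}

gsearch = GridSearchCV(estimator=pipe, param_grid=param_test, scoring='roc_auc', cv=3)
gsearch.fit(X, y)

print(gsearch.best_params_, gsearch.best_score_)

# With the default refit=True, best_estimator_ is the pipeline re-fitted on all
# of X, y with the best parameters, and gsearch.predict_proba delegates to it.
best_pipe = gsearch.best_estimator_
proba = gsearch.predict_proba(X[:5])
```

gsearch.cv_results_ holds the per-candidate scores if more than the single best result is needed.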

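Credits: https://blog.csdn.net/weixin_41988628/article/details/83098130

3. SelectKBest (TODO: I do not fully understand this yet; to be expanded later)

It takes two parameters, score_func and k. score_func is a function that scores each feature, and features are then picked from the highest score down. How many features get picked? That is what k limits; the default is 10.

If score_func is not specified, the default is f_classif, which scores features with ANOVA and only works for classification targets, so the default cannot be used for a regression task. Other scoring functions for classification include mutual information (mutual_info_classif) and the chi-squared test (chi2). There are also scoring functions for regression problems, such as f_regression and mutual_info_regression, among others; choose one to fit the specific problem.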
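To see score_func and k in action, a short sketch on synthetic data (make_classification and make_regression are stand-ins, not the Titanic set):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.feature_selection import SelectKBest, f_classif, f_regression

# Classification: f_classif (ANOVA F-value) is also what SelectKBest uses by default.
Xc, yc = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)
selector = SelectKBest(score_func=f_classif, k=20)
Xc_selected = selector.fit_transform(Xc, yc)

scores = selector.scores_       # one score per original feature
kept = selector.get_support()   # boolean mask marking the k highest-scoring features

# Regression: pass a regression scoring function explicitly.
Xr, yr = make_regression(n_samples=200, n_features=30, n_informative=5, random_state=0)
Xr_selected = SelectKBest(score_func=f_regression, k=10).fit_transform(Xr, yr)

print(Xc_selected.shape, Xr_selected.shape)
```

Every kept feature scores at least as high as every dropped one, which is all that "select the k best" means here.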

Credits: https://zhuanlan.zhihu.com/p/81345169