27. How can Pandas find the features that most influence the outcome?

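Use cases:

  • Feature selection in machine learning: removing uninformative features can improve model performance and shorten training time.
  • Data analysis: identifying the factors that contribute most to, for example, fluctuations in revenue.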

Worked example: which factors most affected survival in the sinking of the Titanic?

1. Import the required packages

import pandas as pd
import numpy as np

# Select the K features that most affect the result
from sklearn.feature_selection import SelectKBest

# Chi-squared test, passed to SelectKBest as its scoring function
from sklearn.feature_selection import chi2

2. Load the Titanic data

df = pd.read_csv("./datas/titanic/titanic_train.csv")
df.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
df = df[["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].copy()
df.head()
   PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare  Embarked
0            1         0       3    male  22.0      1      0   7.2500         S
1            2         1       1  female  38.0      1      0  71.2833         C
2            3         1       3  female  26.0      0      0   7.9250         S
3            4         1       1  female  35.0      1      0  53.1000         S
4            5         0       3    male  35.0      0      0   8.0500         S

3. Clean and transform the data

3.1 Check for columns with missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB
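
The `df.info()` output above shows missing values only indirectly, through the non-null counts (714 of 891 for Age, 889 of 891 for Embarked). As a small sketch, `isna().sum()` reports the number of missing values per column directly. The toy DataFrame below is made up for illustration and is not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), standing in for the Titanic DataFrame
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Keep only the columns that actually contain missing values
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```
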
3.2 Fill missing Age values with the median
df["Age"] = df["Age"].fillna(df["Age"].median())
df.head()
   PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare  Embarked
0            1         0       3    male  22.0      1      0   7.2500         S
1            2         1       1  female  38.0      1      0  71.2833         C
2            3         1       3  female  26.0      0      0   7.9250         S
3            4         1       1  female  35.0      1      0  53.1000         S
4            5         0       3    male  35.0      0      0   8.0500         S
3.3 Convert the Sex column to numbers
# Sex
df.Sex.unique()
array(['male', 'female'], dtype=object)
df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1
df.head()
   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0            1         0       3    0  22.0      1      0   7.2500         S
1            2         1       1    1  38.0      1      0  71.2833         C
2            3         1       3    1  26.0      0      0   7.9250         S
3            4         1       1    1  35.0      1      0  53.1000         S
4            5         0       3    0  35.0      0      0   8.0500         S
3.4 Fill missing values in the Embarked column and convert the strings to numbers
# Embarked
df.Embarked.unique()
array(['S', 'C', 'Q', nan], dtype=object)
# Fill missing values with 0
df["Embarked"] = df["Embarked"].fillna(0)

# Convert the strings to numbers
df.loc[df["Embarked"] == "S", "Embarked"] = 1
df.loc[df["Embarked"] == "C", "Embarked"] = 2
df.loc[df["Embarked"] == "Q", "Embarked"] = 3
df.head()
   PassengerId  Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0            1         0       3    0  22.0      1      0   7.2500         1
1            2         1       1    1  38.0      1      0  71.2833         2
2            3         1       3    1  26.0      0      0   7.9250         1
3            4         1       1    1  35.0      1      0  53.1000         1
4            5         0       3    0  35.0      0      0   8.0500         1
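
The two encoding steps above can also be written more compactly, and the choice of codes matters. The integer codes 0 to 3 used for Embarked imply an order and a magnitude that the ports do not really have, and the chi-squared score in step 5 depends on those numeric values. As a sketch of one common alternative (not what this walkthrough does), `Series.map` with a dictionary encodes Sex in one step, and `pd.get_dummies` turns Embarked into one 0/1 column per port. The rows below are made up for illustration, and the "Unknown" label is an arbitrary choice.

```python
import pandas as pd

# Made-up rows standing in for the Titanic columns (not the real data)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Embarked": ["S", "C", "Q", None, "S"],
})

# Same coding as above (male -> 0, female -> 1), done in one step
sex_codes = toy["Sex"].map({"male": 0, "female": 1})

# One 0/1 indicator column per port; missing ports get their own label
embarked_filled = toy["Embarked"].fillna("Unknown")
embarked_dummies = pd.get_dummies(embarked_filled, prefix="Embarked", dtype=int)

encoded = pd.concat([sex_codes, embarked_dummies], axis=1)
print(encoded)
```

With `map`, any value that is missing from the dictionary becomes NaN, so an unexpected category shows up as a missing value instead of silently staying a string.
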

4. Split the feature columns from the target column

y = df.pop("Survived")
X = df
X.head()
   PassengerId  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0            1       3    0  22.0      1      0   7.2500         1
1            2       1    1  38.0      1      0  71.2833         2
2            3       3    1  26.0      0      0   7.9250         1
3            4       1    1  35.0      1      0  53.1000         1
4            5       3    0  35.0      0      0   8.0500         1
y.head()
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
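
Note that `df.pop` removes `Survived` from `df` in place, so after the split `X` and `df` refer to the same object. If the original DataFrame should stay intact, a non-mutating split with `drop` is one option. A minimal sketch on made-up data (the names `toy`, `X_toy` and `y_toy` are illustrative):

```python
import pandas as pd

# Made-up rows (not the real Titanic data)
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.28, 7.93],
})

# Target column, selected without modifying toy
y_toy = toy["Survived"]

# Feature columns: a new DataFrame with the target removed
X_toy = toy.drop(columns=["Survived"])

print(list(X_toy.columns))
```
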

5. Use the chi-squared test to select the top-K features

# Select all features, so that the full importance ranking is visible
bestfeatures = SelectKBest(score_func=chi2, k=len(X.columns))
fit = bestfeatures.fit(X, y)
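
For context, scikit-learn's `chi2` does not build a contingency table of categories. It treats each non-negative feature as a count: for every class it sums the feature's values (observed), compares those sums with what would be expected if the feature were independent of the class, and adds up (observed - expected)^2 / expected. The sketch below reproduces that calculation by hand on made-up data and compares it with `chi2`; the arrays are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up data: 6 samples, 2 non-negative features, binary target
X_toy = np.array([
    [1.0, 10.0],
    [0.0, 12.0],
    [1.0, 50.0],
    [1.0, 60.0],
    [0.0, 11.0],
    [1.0, 55.0],
])
y_toy = np.array([0, 0, 1, 1, 0, 1])

# One-hot class membership, shape (n_samples, n_classes)
Y = np.column_stack([y_toy == 0, y_toy == 1]).astype(float)

observed = Y.T @ X_toy                          # per-class sum of each feature
feature_total = X_toy.sum(axis=0)               # overall sum of each feature
class_prob = Y.mean(axis=0)                     # share of samples in each class
expected = np.outer(class_prob, feature_total)  # sums expected under independence

manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, p_values = chi2(X_toy, y_toy)

print(manual_scores)
print(sklearn_scores)
```

Because the statistic is built from sums of feature values, `chi2` requires all features to be non-negative.
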

6. Print the features in order of importance

df_scores = pd.DataFrame(fit.scores_)
df_scores
             0
0     3.312934
1    30.873699
2   170.348127
3    21.649163
4     2.581865
5    10.097499
6  4518.319091
7     2.771019
df_columns = pd.DataFrame(X.columns)
df_columns
             0
0  PassengerId
1       Pclass
2          Sex
3          Age
4        SibSp
5        Parch
6         Fare
7     Embarked
# Combine the two DataFrames side by side
df_feature_scores = pd.concat([df_columns, df_scores], axis=1)
# Name the columns
df_feature_scores.columns = ['feature_name', 'Score']

# Inspect the result
df_feature_scores
  feature_name        Score
0  PassengerId     3.312934
1       Pclass    30.873699
2          Sex   170.348127
3          Age    21.649163
4        SibSp     2.581865
5        Parch    10.097499
6         Fare  4518.319091
7     Embarked     2.771019
df_feature_scores.sort_values(by="Score", ascending=False)
  feature_name        Score
6         Fare  4518.319091
2          Sex   170.348127
1       Pclass    30.873699
3          Age    21.649163
5        Parch    10.097499
0  PassengerId     3.312934
7     Embarked     2.771019
4        SibSp     2.581865
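
A caution when reading this ranking: because `chi2` works on summed feature values, the score grows with the scale of a feature. Multiplying a column by a constant multiplies its score by the same constant. That is likely one reason why Fare, whose values are much larger than the 0/1 Sex column, ends up so far ahead, and it also means the score of an identifier such as PassengerId carries little meaning. The sketch below demonstrates the scaling effect on made-up data (it is not a rerun of the Titanic analysis) and also shows a more compact way to build the sorted ranking with a `pd.Series`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

# Made-up data: one 0/1 feature and one large-valued feature
toy = pd.DataFrame({
    "flag": [1, 0, 1, 1, 0, 1],
    "amount": [10.0, 12.0, 50.0, 60.0, 11.0, 55.0],
})
target = pd.Series([0, 0, 1, 1, 0, 1])

# Scores on the original data
scores, _ = chi2(toy, target)

# Same data, with "amount" expressed in units that are 10x smaller
scaled = toy.assign(amount=toy["amount"] * 10)
scaled_scores, _ = chi2(scaled, target)

# Compact ranking: scores indexed by feature name, highest first
ranking = pd.Series(scores, index=toy.columns).sort_values(ascending=False)
print(ranking)

# Ratio of scores after and before rescaling
print(scaled_scores / scores)
```
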
