27. How do you find the features that most influence the outcome with Pandas?

hello rpa

Category: pandas. Tags: data analysis, python, pandas

Article link: https://blog.csdn.net/lvlinjier/article/details/112852949

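Use cases:

Feature selection in machine learning: removing uninformative features can improve model performance and reduce training time.
Data analysis: finding the factors that contribute most to a result, for example the biggest drivers of revenue fluctuations.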

Worked example: which factors most affected survival in the sinking of the Titanic?

1. Import the required packages

import pandas as pd
import numpy as np

# Selects the K features that most influence the outcome
from sklearn.feature_selection import SelectKBest

# Chi-squared test, passed to SelectKBest as the scoring function
from sklearn.feature_selection import chi2

2. Load the Titanic data

df = pd.read_csv("./datas/titanic/titanic_train.csv")
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

df = df[["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]].copy()
df.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	0	3	male	22.0	1	0	7.2500	S
1	2	1	1	female	38.0	1	0	71.2833	C
2	3	1	3	female	26.0	0	0	7.9250	S
3	4	1	1	female	35.0	1	0	53.1000	S
4	5	0	3	male	35.0	0	0	8.0500	S

3. Data cleaning and conversion

3.1 Check which columns contain missing values

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB
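Reading the non-null counts off `df.info()` works, but the missing values can also be counted directly. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the Titanic file); on the real data the same two lines would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; the values are invented for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column, then keep only the columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

In the article's data this kind of check would flag Age (714 of 891 non-null) and Embarked (889 of 891 non-null), matching the `df.info()` output above.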

3.2 Fill missing Age values with the median

df["Age"] = df["Age"].fillna(df["Age"].median())

df.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	0	3	male	22.0	1	0	7.2500	S
1	2	1	1	female	38.0	1	0	71.2833	C
2	3	1	3	female	26.0	0	0	7.9250	S
3	4	1	1	female	35.0	1	0	53.1000	S
4	5	0	3	male	35.0	0	0	8.0500	S
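The fill above uses the median rather than the mean. As a small illustration of why that choice can matter, here is a sketch on a made-up series with one large value (the numbers are assumptions, not Titanic data): the median is less affected by the outlier than the mean.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry and one unusually large value
ages = pd.Series([22.0, 38.0, 26.0, np.nan, 80.0])

# Median of the known values: (26 + 38) / 2
filled_median = ages.fillna(ages.median())

# Mean of the known values: (22 + 38 + 26 + 80) / 4
filled_mean = ages.fillna(ages.mean())

print(filled_median[3], filled_mean[3])
```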

3.3 Convert the Sex column to numbers

# Sex
df.Sex.unique()

array(['male', 'female'], dtype=object)

# Map each label to an integer code: male -> 0, female -> 1
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

df.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	0	3	0	22.0	1	0	7.2500	S
1	2	1	1	1	38.0	1	0	71.2833	C
2	3	1	3	1	26.0	0	0	7.9250	S
3	4	1	1	1	35.0	1	0	53.1000	S
4	5	0	3	0	35.0	0	0	8.0500	S

3.4 Fill missing values in the Embarked column and convert the strings to numbers

# Embarked
df.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

# Convert the strings to numbers: S -> 1, C -> 2, Q -> 3
df["Embarked"] = df["Embarked"].map({"S": 1, "C": 2, "Q": 3})

# Fill the missing values with 0 and store the codes as integers
df["Embarked"] = df["Embarked"].fillna(0).astype(int)

df.head()

	PassengerId	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	0	3	0	22.0	1	0	7.2500	1
1	2	1	1	1	38.0	1	0	71.2833	2
2	3	1	3	1	26.0	0	0	7.9250	1
3	4	1	1	1	35.0	1	0	53.1000	1
4	5	0	3	0	35.0	0	0	8.0500	1
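The codes 1, 2, 3 for Embarked impose an order on the ports that the data does not really have. One common alternative, shown here only as a hedged sketch and not as the article's method, is one-hot encoding with `pd.get_dummies`. The toy frame and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the two categorical columns; values are invented
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
})

# Binary column: a plain 0/1 code is enough
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Multi-valued column: one 0/1 indicator per port; a missing port gives all zeros
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)
encoded = pd.concat([toy.drop(columns="Embarked"), dummies], axis=1)
print(encoded)
```

With this encoding each port would get its own chi-squared score instead of a single score for an arbitrary numeric code.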

4. Separate the feature columns from the target column

y = df.pop("Survived")
X = df

X.head()

	PassengerId	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked
0	1	3	0	22.0	1	0	7.2500	1
1	2	1	1	38.0	1	0	71.2833	2
2	3	3	1	26.0	0	0	7.9250	1
3	4	1	1	35.0	1	0	53.1000	1
4	5	3	0	35.0	0	0	8.0500	1

y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
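Note that `df.pop("Survived")` removes the column from `df` in place, so `X` and `df` are the same object afterwards. If the original frame should stay intact, a non-mutating split is possible. A minimal sketch on a made-up frame (values are illustrative):

```python
import pandas as pd

# Toy frame; the values are invented for illustration
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925],
})

# Select the target and build the feature matrix without modifying `toy`
y = toy["Survived"]
X = toy.drop(columns="Survived")

print(list(X.columns), "Survived" in toy.columns)
```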

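5. Use the chi-squared test to select the top-K features

# Keep all features (k = number of columns) so that the full importance ranking is visible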
bestfeatures = SelectKBest(score_func=chi2, k=len(X.columns))
fit = bestfeatures.fit(X, y)
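To make the scores less of a black box, the statistic that scikit-learn's `chi2` computes can be reproduced by hand: for each feature it compares the per-class sum of the feature values (observed) with the sum expected if the feature were independent of the class. The sketch below uses a small made-up matrix (an assumption for illustration) and checks the manual result against `chi2`.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up non-negative features and binary labels
X = np.array([[1, 0, 3],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 2],
              [1, 0, 4],
              [0, 1, 1]], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])

# One-hot encode the labels: shape (n_samples, n_classes)
classes = np.unique(y)
Y = (y[:, None] == classes[None, :]).astype(float)

observed = Y.T @ X                    # per-class sum of each feature
feature_total = X.sum(axis=0)         # overall sum of each feature
class_prob = Y.mean(axis=0)           # share of samples in each class
expected = np.outer(class_prob, feature_total)

# Chi-squared statistic per feature: sum over classes of (O - E)^2 / E
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sk_scores, p_values = chi2(X, y)
print(manual_scores, sk_scores)
```

Because the computation sums raw feature values, `chi2` expects non-negative inputs; that holds for all columns used in this article.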

6. Print the features in order of importance

df_scores = pd.DataFrame(fit.scores_)
df_scores

	0
0	3.312934
1	30.873699
2	170.348127
3	21.649163
4	2.581865
5	10.097499
6	4518.319091
7	2.771019
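One caveat when reading these numbers: `chi2` treats feature values as counts or frequencies, so the statistic grows with the scale of a feature. That may partly explain why a wide-ranging column such as Fare scores far above the 0/1 column Sex, and why an identifier like PassengerId receives a non-trivial score. The following sketch uses made-up data (an assumption for illustration) to show that multiplying a column by 100 multiplies its score by 100 while the other column is unaffected.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up data: column 0 is a 0/1 flag, column 1 is a fare-like amount
X = np.array([[1.0, 7.25],
              [0.0, 71.28],
              [1.0, 7.93],
              [1.0, 53.10],
              [0.0, 8.05],
              [0.0, 8.46]])
y = np.array([0, 1, 1, 1, 0, 0])

scores, _ = chi2(X, y)

# Rescale only the second column, e.g. a change of currency unit
X_scaled = X.copy()
X_scaled[:, 1] *= 100
scores_scaled, _ = chi2(X_scaled, y)

print(scores, scores_scaled)
```

For that reason the ranking is easier to interpret when the features are on comparable scales, for example counts or binned categories.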

df_columns = pd.DataFrame(X.columns)
df_columns

	0
0	PassengerId
1	Pclass
2	Sex
3	Age
4	SibSp
5	Parch
6	Fare
7	Embarked

# Concatenate the two DataFrames side by side
df_feature_scores = pd.concat([df_columns, df_scores], axis=1)
# Name the columns
df_feature_scores.columns = ['feature_name', 'Score']

# Inspect the result
df_feature_scores

	feature_name	Score
0	PassengerId	3.312934
1	Pclass	30.873699
2	Sex	170.348127
3	Age	21.649163
4	SibSp	2.581865
5	Parch	10.097499
6	Fare	4518.319091
7	Embarked	2.771019

df_feature_scores.sort_values(by="Score", ascending=False)

	feature_name	Score
6	Fare	4518.319091
2	Sex	170.348127
1	Pclass	30.873699
3	Age	21.649163
5	Parch	10.097499
0	PassengerId	3.312934
7	Embarked	2.771019
4	SibSp	2.581865
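The same ranking can be built more compactly with a single `pd.Series`, and the fitted selector can report which columns it kept through `get_support()`. This is a minimal sketch on a made-up frame with hypothetical column names `a`, `b`, `c`; it uses `k=2` so that the selection itself is visible.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up non-negative features and binary target
X = pd.DataFrame({
    "a": [1, 0, 1, 0, 1, 0],
    "b": [0, 1, 1, 0, 0, 1],
    "c": [3, 1, 0, 2, 4, 1],
})
y = pd.Series([1, 0, 1, 0, 1, 0])

# Keep only the two highest-scoring features
selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# Scores indexed by feature name, highest first
ranking = pd.Series(selector.scores_, index=X.columns, name="Score")
ranking = ranking.sort_values(ascending=False)

# Names of the columns the selector kept (in original column order)
top_features = X.columns[selector.get_support()].tolist()
print(ranking)
print(top_features)
```

Calling `selector.transform(X)` would then return just those selected columns as an array for model training.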
