Common Basic Pandas Data-Processing Operations: Part 2

Author: 志存高远脚踏实地

Tags: Pandas data processing, Pandas missing-value handling, common Pandas data-processing operations

Article link: https://blog.csdn.net/weixin_44451032/article/details/99301298

The data used in this exercise is shown below. Leave a comment if you would like a copy of the data table to study with.

Reading and displaying the data

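# The data used in this section is the training set from Kaggle's Titanic survival competition
# Import the modules and read the data
import pandas as pd
import numpy as np
titanic = pd.read_csv('titanic_train.csv')
# Display the data (first 10 rows only)
titanic.head(10)
# PassengerId  passenger ID
# Survived     whether the passenger survived: 1 = survived, 0 = did not survive
# Pclass       passenger class (1, 2 or 3)
# Name         passenger name
# Sex          passenger sex
# Age          passenger age
# SibSp        number of siblings/spouses aboard
# Parch        number of parents/children aboard
# Ticket       ticket number
# Fare         ticket fare
# Cabin        cabin number
# Embarked     port of embarkation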

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C

The Cabin column contains many NaN values. Pandas uses NaN to represent missing values, and the isnull function tests whether a given value is missing.

# Count how many missing values the Age column has
age = titanic['Age']
age_null_number = 0
for this_age in age:
    if pd.isnull(this_age):
        age_null_number += 1
print(age_null_number)
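
The same count can be obtained without an explicit loop. The sketch below uses a small made-up Series in place of titanic['Age'] (the values are assumptions for illustration, not taken from the data set) and relies on the fact that summing a boolean Series counts its True entries.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for titanic['Age']; two entries are missing.
age = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])

# isnull() returns a boolean Series; summing it counts the True values.
age_null_number = int(age.isnull().sum())
print(age_null_number)
```

On the real data, titanic['Age'].isnull().sum() should give the same number as the loop above.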

When missing values are present, a calculation must exclude them; otherwise the result is itself missing (NaN).

# Compute the average age of all passengers
avg_age = sum(titanic['Age']) / len(titanic['Age'])
avg_age  # some passengers' ages are missing (NaN), so the result is NaN

nan

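To get the average age of all passengers, the NaN values have to be removed first. As with boolean indexing in NumPy, the boolean values returned by the NaN test can be used as an index.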

real_age = titanic['Age'][pd.isnull(titanic["Age"]) == False]
avg_real_age = sum(real_age) / len(real_age)
avg_real_age

29.69911764705882
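
To make the boolean-index step easier to follow, here is a minimal sketch on made-up ages (assumed values, not the Titanic file). It uses notnull(), which is simply the inverse of isnull(), so the mask does not need to be compared with False.

```python
import numpy as np
import pandas as pd

# Made-up ages; the middle one is missing.
age = pd.Series([20.0, np.nan, 40.0])

# Boolean mask: True where a value is present.
mask = age.notnull()

# Indexing with the mask keeps only the rows where the mask is True.
real_age = age[mask]
avg_real_age = sum(real_age) / len(real_age)
print(avg_real_age)
```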

The approach above does yield the average age, but it is cumbersome. pandas provides the mean function, which ignores NaN values by default.

mean_age = titanic['Age'].mean()
mean_age

29.69911764705882
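
The NaN-skipping behaviour of mean is controlled by its skipna parameter, which defaults to True. The following sketch, again on made-up values, contrasts the two settings.

```python
import numpy as np
import pandas as pd

# Made-up ages; one value is missing.
age = pd.Series([20.0, np.nan, 40.0])

mean_skipping_nan = age.mean()           # skipna=True is the default
mean_with_nan = age.mean(skipna=False)   # the NaN propagates into the result

print(mean_skipping_nan)
print(mean_with_nan)
```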

The average fare for each of the three passenger classes can be computed in the same way.

#mean fare for every pclass
passenger_classes = [1,2,3]
fare_by_class = {}
for this_class in passenger_classes:
    this_class_mean_fare = titanic['Fare'][(titanic['Pclass'] == this_class)].mean()
    fare_by_class[this_class] = this_class_mean_fare
fare_by_class

{1: 84.1546875, 2: 20.662183152173913, 3: 13.675550101832993}

The loop above computes the average fare per class, but pandas provides the pivot_table function, which does this directly.

# index    the column whose values become the index; results are grouped by it
# values   the column(s) to aggregate
# aggfunc  the aggregation function; defaults to the mean
fare_by_class = titanic.pivot_table(index='Pclass',values='Fare',aggfunc=np.average)
fare_by_class

	Fare
Pclass
1	84.154687
2	20.662183
3	13.675550
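
For a single grouping column, groupby gives the same numbers as pivot_table. This is a small sketch on a made-up table (the fares are assumptions chosen to make the means easy to check), not the Titanic data.

```python
import pandas as pd

# Made-up fares for three passenger classes.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 12.0],
})

# pivot_table returns a DataFrame indexed by Pclass.
by_pivot = toy.pivot_table(index="Pclass", values="Fare", aggfunc="mean")

# groupby on the same column returns a Series with the same index.
by_groupby = toy.groupby("Pclass")["Fare"].mean()

print(by_pivot)
print(by_groupby)
```

The main visible difference is the return type: a DataFrame from pivot_table and a Series from this groupby call.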

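Compute the survival rate for each passenger class. Because Survived is 0 or 1, its mean per class is the fraction of passengers who survived.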

every_pclass_survived = titanic.pivot_table(index='Pclass',values='Survived')
every_pclass_survived

	Survived
Pclass
1	0.629630
2	0.472826
3	0.242363

Compute the average passenger age for each class.

every_pclass_avg_age = titanic.pivot_table(index='Pclass',values='Age')
every_pclass_avg_age

	Age
Pclass
1	38.233441
2	29.877630
3	25.140620

Compute the total fare collected and the number of survivors for each of the three embarkation ports.

ports_embarked = titanic.pivot_table(index='Embarked',values=['Fare','Survived'],aggfunc='sum')
ports_embarked

	Fare	Survived
Embarked
C	10072.2962	93
Q	1022.2543	30
S	17439.3988	217
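
pivot_table also accepts a dict for aggfunc, so each column can get its own aggregation in one call. The sketch below uses a made-up table (all values are assumptions) to total the fares while averaging Survived.

```python
import pandas as pd

# Made-up passengers from two ports.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "S", "S"],
    "Fare": [10.0, 30.0, 5.0, 7.0],
    "Survived": [1, 0, 1, 1],
})

# A dict maps each value column to its own aggregation function.
summary = toy.pivot_table(
    index="Embarked",
    values=["Fare", "Survived"],
    aggfunc={"Fare": "sum", "Survived": "mean"},
)
print(summary)
```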

Count the passengers at each of the three embarkation ports.

every_port_passenger_number = titanic.pivot_table(index='Embarked',values='PassengerId',aggfunc='count')
every_port_passenger_number

	PassengerId
Embarked
C	168
Q	77
S	644

Compute the survival rate for each embarkation port.

every_port_survived_rate = ports_embarked['Survived'] / every_port_passenger_number['PassengerId']
every_port_survived_rate

Embarked
C    0.553571
Q    0.389610
S    0.336957
dtype: float64
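
Dividing survivors by passengers is the same as taking the mean of the 0/1 Survived column per port. A minimal sketch on made-up rows (assumed values, not the real counts) shows that both routes agree.

```python
import pandas as pd

# Made-up passengers: 2 from C, 1 from Q, 4 from S.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "Q", "S", "S", "S", "S"],
    "Survived": [1, 0, 1, 0, 0, 1, 0],
})

# Route 1: survivors divided by passenger count, as in the text above.
survived = toy.pivot_table(index="Embarked", values="Survived", aggfunc="sum")
count = toy.pivot_table(index="Embarked", values="Survived", aggfunc="count")
rate_by_division = survived["Survived"] / count["Survived"]

# Route 2: the mean of a 0/1 column is already the rate.
rate_by_mean = toy.groupby("Embarked")["Survived"].mean()

print(rate_by_division)
print(rate_by_mean)
```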

Handling missing data

# dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# axis : {0 or 'index', 1 or 'columns'}, default 0    Determine if rows or columns which contain missing values are removed.
# how : {'any', 'all'}, default 'any'    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
#     'any' : If any NA values are present, drop that row or column.
#     'all' : If all values are NA, drop that row or column.
# thresh : int, optional    Require that many non-NA values.
# subset : array-like, optional    Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
# inplace : bool, default False    If True, do operation inplace and return None.
titanic[0:5]

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

With the defaults axis=0 and how='any', dropna removes every row that contains a NaN value.

titanic.dropna().head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S
11	12	1	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S

Cabin has many missing values, so drop the Cabin column.

titanic.drop(axis=1,labels='Cabin').head()  # show only the first five rows

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	S

Drop every column that contains a NaN. This is rarely used in practice.

titanic.dropna(axis=1).head()

	PassengerId	Survived	Pclass	Name	Sex	SibSp	Parch	Ticket	Fare
0	1	0	3	Braund, Mr. Owen Harris	male	1	0	A/5 21171	7.2500
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	1	0	PC 17599	71.2833
2	3	1	3	Heikkinen, Miss. Laina	female	0	0	STON/O2. 3101282	7.9250
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	1	0	113803	53.1000
4	5	0	3	Allen, Mr. William Henry	male	0	0	373450	8.0500

Drop rows in which either Age or Sex is missing.

new_titanic = titanic.dropna(axis=0,subset=['Age','Sex'])
new_titanic.head(10)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C
10	11	1	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S

Drop rows in which both Age and Sex are missing.

new_titanic = titanic.dropna(axis=0,subset=['Age','Sex'],how='all')
new_titanic.head(10)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
5	6	0	3	Moran, Mr. James	male	NaN	0	0	330877	8.4583	NaN	Q
6	7	0	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S
7	8	0	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S
8	9	1	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S
9	10	1	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C
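
The difference between how='any', how='all' and thresh is easier to see on a table small enough to check by eye. The sketch below uses a made-up four-row DataFrame (an assumption for illustration, not the Titanic file).

```python
import numpy as np
import pandas as pd

# Made-up rows: row 2 has no values at all, row 3 is complete.
df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Sex": ["male", "female", None, "female"],
    "Cabin": [None, "C85", None, "C123"],
})

# how='any' (default): drop a row if Age or Sex is missing.
any_missing = df.dropna(subset=["Age", "Sex"])

# how='all': drop a row only if Age and Sex are both missing.
all_missing = df.dropna(subset=["Age", "Sex"], how="all")

# thresh=3: keep only rows with at least 3 non-missing values.
at_least_three = df.dropna(thresh=3)

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(at_least_three.index.tolist())
```

Note that dropna returns a new DataFrame and leaves df unchanged unless inplace=True is passed.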

Locating a single value in the DataFrame

row_index_3_name = titanic.loc[3,"Name"]
row_index_3_name

'Futrelle, Mrs. Jacques Heath (Lily May Peel)'

titanic['Name'][3]

'Futrelle, Mrs. Jacques Heath (Lily May Peel)'

titanic.loc[3]['Name']

'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
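
All three lookups above select by index label. The sketch below, on a made-up three-row table, contrasts label-based loc with position-based iloc, which only give different answers once the rows are no longer in label order.

```python
import pandas as pd

# Made-up table with index labels 10, 11, 12.
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Age": [30.0, 20.0, 25.0]},
    index=[10, 11, 12],
)

by_label = toy.loc[11, "Name"]    # row labelled 11
by_position = toy.iloc[1, 0]      # second row, first column

# After sorting by Age the row order changes but the labels travel with the rows.
sorted_toy = toy.sort_values("Age")
label_after_sort = sorted_toy.loc[11, "Name"]    # still the row labelled 11
position_after_sort = sorted_toy.iloc[1, 0]      # now whichever row is second

print(by_label, by_position, label_after_sort, position_after_sort)
```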

Sorting by the Age column

titanic.sort_values('Age',ascending=False).head()

	PassengerId	Survived	Pclass	Name	Sex	Age	Ticket	Fare	Cabin	Embarked
630	631	1	1	Barkworth, Mr. Algernon Henry Wilson	male	80.0	27042	30.0000	A23	S
851	852	0	3	Svensson, Mr. Johan	male	74.0	347060	7.7750	NaN	S
493	494	0	1	Artagaveytia, Mr. Ramon	male	71.0	PC 17609	49.5042	NaN	C
96	97	0	1	Goldschmidt, Mr. George B	male	71.0	PC 17754	34.6542	A5	C
116	117	0	3	Connors, Mr. Patrick	male	70.5	370369	7.7500	NaN	Q

After sorting, NaN values are placed last by default.

titanic.sort_values('Age',ascending=False).tail(5)

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
859	860	3	Razi, Mr. Raihed	male	NaN	0	0	2629	7.2292	NaN	C
863	864	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.5500	NaN	S
868	869	3	van Melkebeke, Mr. Philemon	male	NaN	0	0	345777	9.5000	NaN	S
878	879	3	Laleff, Mr. Kristo	male	NaN	0	0	349217	7.8958	NaN	S
888	889	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
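
Where the NaN rows end up is controlled by the na_position parameter of sort_values, which defaults to 'last' regardless of sort direction. A minimal sketch on made-up ages:

```python
import numpy as np
import pandas as pd

# Made-up ages; the row with index 1 is missing.
toy = pd.DataFrame({"Age": [22.0, np.nan, 80.0]})

# Default: NaN rows go to the end, even when sorting in descending order.
nan_last = toy.sort_values("Age", ascending=False)

# na_position='first' moves them to the top instead.
nan_first = toy.sort_values("Age", ascending=False, na_position="first")

print(nan_last.index.tolist())
print(nan_first.index.tolist())
```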

Reset the index after sorting. With drop=False, the old index is kept as a new column named index.

titanic.sort_values('Age',ascending=False).reset_index(drop = False).head()

	index	PassengerId	Survived	Pclass	Name	Sex	Age	Ticket	Fare	Cabin	Embarked
0	630	631	1	1	Barkworth, Mr. Algernon Henry Wilson	male	80.0	27042	30.0000	A23	S
1	851	852	0	3	Svensson, Mr. Johan	male	74.0	347060	7.7750	NaN	S
2	493	494	0	1	Artagaveytia, Mr. Ramon	male	71.0	PC 17609	49.5042	NaN	C
3	96	97	0	1	Goldschmidt, Mr. George B	male	71.0	PC 17754	34.6542	A5	C
4	116	117	0	3	Connors, Mr. Patrick	male	70.5	370369	7.7500	NaN	Q

Reset the index after sorting with drop=True, which discards the old index.

titanic.sort_values('Age',ascending=False).reset_index(drop=True).head()

	PassengerId	Survived	Pclass	Name	Sex	Age	Ticket	Fare	Cabin	Embarked
0	631	1	1	Barkworth, Mr. Algernon Henry Wilson	male	80.0	27042	30.0000	A23	S
1	852	0	3	Svensson, Mr. Johan	male	74.0	347060	7.7750	NaN	S
2	494	0	1	Artagaveytia, Mr. Ramon	male	71.0	PC 17609	49.5042	NaN	C
3	97	0	1	Goldschmidt, Mr. George B	male	71.0	PC 17754	34.6542	A5	C
4	117	0	3	Connors, Mr. Patrick	male	70.5	370369	7.7500	NaN	Q
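
The effect of drop is easiest to verify on a tiny table. This sketch uses made-up ages (an assumption for illustration) and compares the two settings of reset_index after a sort.

```python
import pandas as pd

# Made-up ages with the default index 0, 1, 2.
toy = pd.DataFrame({"Age": [22.0, 80.0, 35.0]})
sorted_toy = toy.sort_values("Age", ascending=False)

# drop=False (default): the old index becomes a column named 'index'.
kept = sorted_toy.reset_index()

# drop=True: the old index is thrown away.
dropped = sorted_toy.reset_index(drop=True)

print(kept)
print(dropped)
```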
