Pandas基本常用数据处理操作——第二节

Pandas基本常用数据处理操作——第二节

以下是本次实验使用的数据,如需要数据表学习的请留言
在这里插入图片描述

读取数据并显示

#本节使用的数据表示kaggle泰坦尼克拯救比赛的数据
#导入模块  读取数据
import pandas as pd
import numpy as np
titanic = pd.read_csv('titanic_train.csv')
#显示数据  仅显示前10行
titanic.head(10)
#passengerid  表示游客id
#survived  表示是否被救 1表示成功被救   0表示没有获救
#Pclass  表示船舱等级
#name  游客名字
#sex  表示游客性别
#age  游客年龄、
#sibsp  表示家里兄弟姐妹的个数
#parch   表示家里父母和孩子的个数
#ticket  船票编码
#fare  船票
#cabin  船舱位置
#embarked  登船地点
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
5603Moran, Mr. JamesmaleNaN003308778.4583NaNQ
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
7803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS
8913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female27.00234774211.1333NaNS
91012Nasser, Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC

可以看到cabin中有很多 NaN值 Pandas用NaN值表示缺失值 可以使用isnull这个函数来判断某一个值是否是缺失值

#查看Age这一列有多少个缺失值
age = titanic['Age']
age_null_number = 0
for this_age in age:
    if pd.isnull(this_age):
        age_null_number += 1
print(age_null_number)
177

当存在缺失值的时候要计算某一项数据时不能包括缺失值,否则最后结果也是缺失值

#计算所有游客的年龄
avg_age = sum(titanic['Age']) / len(titanic['Age'])
avg_age  #因为游客中有的人的年龄确实为nan所以计算时会返回nan
nan

想要得到所有游客的平均年龄,必须去掉nan值,这时类似在numpy中用的逻辑索引可以通过判断是否为nan值返回的逻辑值索引

real_age = titanic['Age'][pd.isnull(titanic["Age"]) == False]
avg_real_age = sum(real_age) / len(real_age)
avg_real_age
29.69911764705882

显然通过上述方法完全可以得到平均年龄,但是比较繁琐,pandas提供了mean函数可以直接忽略nan值进行计算

mean_age = titanic['Age'].mean()
mean_age
29.69911764705882

还可以分别计算三种等级船舱的平均价格

#mean fare for every pclass
passenger_classes = [1,2,3]
fare_by_class = {}
for this_class in passenger_classes:
    this_class_mean_fare = titanic['Fare'][(titanic['Pclass'] == this_class)].mean()
    fare_by_class[this_class] = this_class_mean_fare
fare_by_class
{1: 84.1546875, 2: 20.662183152173913, 3: 13.675550101832993}

上述方法显然可以计算出每种船舱的平均价格 但是pandas提供了一个pivot_table函数可以直接计算

#index 表示索引值 最后的计算结果按照index值分组
#values   表示想要进行计算的列
# aggfunc  表示计算的方法  默认计算均值
fare_by_class = titanic.pivot_table(index='Pclass',values='Fare',aggfunc=np.average)
fare_by_class
Fare
Pclass
184.154687
220.662183
313.675550

计算三种等级的船舱每种船舱平均被救的人数

every_pclass_survived = titanic.pivot_table(index='Pclass',values='Survived')
every_pclass_survived
Survived
Pclass
10.629630
20.472826
30.242363

计算每种等级船舱游客的平均年龄

every_pclass_avg_age = titanic.pivot_table(index='Pclass',values='Age')
every_pclass_avg_age
Age
Pclass
138.233441
229.877630
325.140620

计算三种登船地点的收钱总数和被救人数

ports_embarked = titanic.pivot_table(index='Embarked',values=['Fare','Survived'],aggfunc=np.sum)
ports_embarked
FareSurvived
Embarked
C10072.296293
Q1022.254330
S17439.3988217

分别计算三种登船地点的总人数

every_port_passenger_number = titanic.pivot_table(index='Embarked',values='PassengerId',aggfunc=np.count_nonzero)
every_port_passenger_number
PassengerId
Embarked
C168
Q77
S644

计算三种登船地点的获救率

every_port_survived_rate = ports_embarked['Survived'] / every_port_passenger_number['PassengerId']
every_port_survived_rate
Embarked
C    0.553571
Q    0.389610
S    0.336957
dtype: float64

对缺失数据进行处理

#dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# axis : {0 or 'index', 1 or 'columns'}, default 0    Determine if rows or columns which contain missing values ar eremoved.
#how : {'any', 'all'}, default 'any'  Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
#'any' : If any NA values are present, drop that row or column.
# 'all' : If all values are NA, drop that row or column.
#subset : array-like, optional   Labels along other axis to consider, e.g. if you are dropping rows
#int, optional    Require that many non-NA values.
#inplace : bool, default False   If True, do operation inplace and return None.
titanic[0:5]
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

默认axis = 0,how = ‘any’ 使用drop去掉含有nan值的行

titanic.dropna().head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
101113Sandstrom, Miss. Marguerite Rutfemale4.011PP 954916.7000G6S
111211Bonnell, Miss. Elizabethfemale58.00011378326.5500C103S

可以看到Cabin有很多缺失值 去掉Cabin这一列

titanic.drop(axis= 1,labels='Cabin').head()  #只显示前五行
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500S
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250S
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000S
4503Allen, Mr. William Henrymale35.0003734508.0500S

去掉所有含有nan的列 但是这在实际操作中是很少用到的

titanic.dropna(axis=1).head()
PassengerIdSurvivedPclassNameSexSibSpParchTicketFare
0103Braund, Mr. Owen Harrismale10A/5 211717.2500
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female10PC 1759971.2833
2313Heikkinen, Miss. Lainafemale00STON/O2. 31012827.9250
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female1011380353.1000
4503Allen, Mr. William Henrymale003734508.0500

去掉age和sex两列只要有一个为缺失值的行

new_titantic = titanic.dropna(axis=0,subset=['Age','Sex'])
new_titantic.head(10)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
7803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS
8913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female27.00234774211.1333NaNS
91012Nasser, Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC
101113Sandstrom, Miss. Marguerite Rutfemale4.011PP 954916.7000G6S

去掉age和sex都为为缺失值的行

new_titantic = titanic.dropna(axis=0,subset=['Age','Sex'],how='all')
new_titantic.head(10)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
5603Moran, Mr. JamesmaleNaN003308778.4583NaNQ
6701McCarthy, Mr. Timothy Jmale54.0001746351.8625E46S
7803Palsson, Master. Gosta Leonardmale2.03134990921.0750NaNS
8913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female27.00234774211.1333NaNS
91012Nasser, Mrs. Nicholas (Adele Achem)female14.01023773630.0708NaNC

定位csv文件中的某一个值

row_index_3_name = titanic.loc[3,"Name"]
row_index_3_name
'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
titanic['Name'][3]
'Futrelle, Mrs. Jacques Heath (Lily May Peel)'
titanic.loc[3]['Name']
'Futrelle, Mrs. Jacques Heath (Lily May Peel)'

对年龄这一列排序

titanic.sort_values('Age',ascending=False).head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
63063111Barkworth, Mr. Algernon Henry Wilsonmale80.0002704230.0000A23S
85185203Svensson, Mr. Johanmale74.0003470607.7750NaNS
49349401Artagaveytia, Mr. Ramonmale71.000PC 1760949.5042NaNC
969701Goldschmidt, Mr. George Bmale71.000PC 1775434.6542A5C
11611703Connors, Mr. Patrickmale70.5003703697.7500NaNQ

排序后NaN值默认排在最后

titanic.sort_values('Age',ascending=False).tail(5)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
85986003Razi, Mr. RaihedmaleNaN0026297.2292NaNC
86386403Sage, Miss. Dorothy Edith "Dolly"femaleNaN82CA. 234369.5500NaNS
86886903van Melkebeke, Mr. PhilemonmaleNaN003457779.5000NaNS
87887903Laleff, Mr. KristomaleNaN003492177.8958NaNS
88888903Johnston, Miss. Catherine Helen "Carrie"femaleNaN12W./C. 660723.4500NaNS

排序之后重置索引值 由于drop = False实际上是重新生成了一列索引值

titanic.sort_values('Age',ascending=False).reset_index(drop = False).head()
indexPassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
063063111Barkworth, Mr. Algernon Henry Wilsonmale80.0002704230.0000A23S
185185203Svensson, Mr. Johanmale74.0003470607.7750NaNS
249349401Artagaveytia, Mr. Ramonmale71.000PC 1760949.5042NaNC
3969701Goldschmidt, Mr. George Bmale71.000PC 1775434.6542A5C
411611703Connors, Mr. Patrickmale70.5003703697.7500NaNQ

排序之后重置索引值 drop = True 丢弃原来的索引值

titanic.sort_values('Age',ascending=False).reset_index(drop=True).head()
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
063111Barkworth, Mr. Algernon Henry Wilsonmale80.0002704230.0000A23S
185203Svensson, Mr. Johanmale74.0003470607.7750NaNS
249401Artagaveytia, Mr. Ramonmale71.000PC 1760949.5042NaNC
39701Goldschmidt, Mr. George Bmale71.000PC 1775434.6542A5C
411703Connors, Mr. Patrickmale70.5003703697.7500NaNQ
  • 1
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值