[Pandas] 2.5. Air Crash Data, Revisited

The code repository with all the examples is on GitHub.

1 Import libraries

import pandas as pd
import numpy as np

2 Load the data

data = pd.read_csv('./dataset/air1908.csv')
data.head()
# output
Date	Time	Location	Operator	Flight #	Route	Type	Registration	cn/In	Aboard	Fatalities	Ground	Summary
0	09/17/1908	17:18	Fort Myer, Virginia	Military - U.S. Army	NaN	Demonstration	Wright Flyer III	NaN	1	2.0	1.0	0.0	During a demonstration flight, a U.S. Army fly...
1	07/12/1912	06:30	AtlantiCity, New Jersey	Military - U.S. Navy	NaN	Test flight	Dirigible	NaN	NaN	5.0	5.0	0.0	First U.S. dirigible Akron exploded just offsh...
2	08/06/1913	NaN	Victoria, British Columbia, Canada	Private	-	NaN	Curtiss seaplane	NaN	NaN	1.0	1.0	0.0	The first fatal airplane accident in Canada oc...
3	09/09/1913	18:30	Over the North Sea	Military - German Navy	NaN	NaN	Zeppelin L-1 (airship)	NaN	NaN	20.0	14.0	0.0	The airship flew into a thunderstorm and encou...
4	10/17/1913	10:30	Near Johannisthal, Germany	Military - German Navy	NaN	NaN	Zeppelin L-2 (airship)	NaN	NaN	30.0	30.0	0.0	Hydrogen gas which was being vented was sucked...

data.columns
# output
Index(['Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route', 'Type',
       'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground', 'Summary'],
      dtype='object')
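Since only four of these columns are used later, the selection could also happen at load time. A minimal sketch, assuming a small in-memory CSV (the two rows and the Summary text are illustrative stand-ins, not the real air1908.csv):

```python
import io

import pandas as pd

# Hypothetical two-row excerpt standing in for ./dataset/air1908.csv
csv_text = """Date,Location,Type,Aboard,Fatalities,Summary
09/17/1908,"Fort Myer, Virginia",Wright Flyer III,2,1,demo flight
07/12/1912,"AtlantiCity, New Jersey",Dirigible,5,5,test flight
"""

# usecols keeps only the listed columns while parsing
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=["Location", "Type", "Aboard", "Fatalities"])
print(df.columns.tolist())
```

With the real file, the same usecols argument would be passed alongside the file path.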

3 Data preprocessing

Select columns

data = data[['Location','Type','Aboard','Fatalities']]
data.head()
# output
    Location	Type	Aboard	Fatalities
0	Fort Myer, Virginia	Wright Flyer III	2.0	1.0
1	AtlantiCity, New Jersey	Dirigible	5.0	5.0
2	Victoria, British Columbia, Canada	Curtiss seaplane	1.0	1.0
3	Over the North Sea	Zeppelin L-1 (airship)	20.0	14.0
4	Near Johannisthal, Germany	Zeppelin L-2 (airship)	30.0	30.0

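3.1 Handling missing values

any(axis=1) flags a row as True if it contains at least one NaN value. The axis must be passed by keyword, because recent pandas versions no longer accept it positionally.

data.isnull().any(axis=1).sum()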
# output
66
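The row-wise count says how many rows are affected, but not which columns are responsible. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the dataset) that compares the row-wise check with a per-column count:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the crash data
toy = pd.DataFrame({
    "Location": ["A, X", None, "C, Z", "D, X"],
    "Type": ["T1", "T2", None, "T4"],
    "Aboard": [2.0, np.nan, 5.0, 1.0],
})

rows_with_nan = toy.isnull().any(axis=1).sum()  # rows with at least one NaN
nan_per_column = toy.isnull().sum()             # NaN count for each column
print(rows_with_nan)
print(nan_per_column)
```

Here two rows contain a NaN, while each of the three columns has exactly one missing value.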

Make sure neither 'Location' nor 'Type' contains NaN by dropping every row where either one is missing.

data = data.dropna(subset=['Location','Type'])
data.isnull().any(axis=1).sum()
# output
19
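The effect of subset can be seen on a small example. This is a sketch with made-up values: rows are dropped only when one of the listed columns is missing, so NaN in the other columns survives.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "Location": ["A, X", None, "C, Z"],
    "Type": ["T1", "T2", "T3"],
    "Aboard": [np.nan, 3.0, 5.0],
})

# Drop only rows where Location or Type is missing; NaN in Aboard is kept
kept = toy.dropna(subset=["Location", "Type"])
print(kept)
```

Row 1 is removed because its Location is missing; row 0 stays even though its Aboard value is NaN.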

Fill the remaining NaN values with 0

data = data.fillna(0)  # replace every remaining NaN with 0
data
# output
    Location	Type	Aboard	Fatalities
0	Fort Myer, Virginia	Wright Flyer III	2.0	1.0
1	AtlantiCity, New Jersey	Dirigible	5.0	5.0
2	Victoria, British Columbia, Canada	Curtiss seaplane	1.0	1.0
3	Over the North Sea	Zeppelin L-1 (airship)	20.0	14.0
4	Near Johannisthal, Germany	Zeppelin L-2 (airship)	30.0	30.0
...	...	...	...	...
5263	Near Madiun, Indonesia	Lockheed C-130 Hercules	112.0	98.0
5264	Near Isiro, DemocratiRepubliCongo	Antonov An-26	4.0	4.0
5265	AtlantiOcean, 570 miles northeast of Natal, Br...	Airbus A330-203	228.0	228.0
5266	Near Port Hope Simpson, Newfoundland, Canada	Britten-Norman BN-2A-27 Islander	1.0	1.0
5267	State of Arunachal Pradesh, India	Antonov An-32	13.0	13.0
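fillna returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch with made-up values that also shows per-column fill values passed as a dict:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({"Aboard": [2.0, np.nan], "Fatalities": [np.nan, 1.0]})

# A dict maps each column to its own fill value
filled = toy.fillna({"Aboard": 0, "Fatalities": 0})

print(toy.isnull().sum().sum())     # the original still has its NaN values
print(filled.isnull().sum().sum())  # the returned copy has none
```

Using a dict makes it easy to later choose a different fill value for one column without touching the others.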

3.2 Handling duplicates

Count the duplicated rows

data.duplicated().sum()
# output
10

Inspect the duplicated rows

data[data.duplicated()]
# output
    Location	Type	Aboard	Fatalities
33	Cleveland, Ohio	De Havilland DH-4	1.0	1.0
51	Cleveland, Ohio	De Havilland DH-4	1.0	1.0
71	Barcelona, Spain	Breguet 14	1.0	1.0
147	Over the Gulf of Finland	Junkers F-13	6.0	6.0
410	Mexico	Lockheed Orion	1.0	1.0
701	Kunming, China	Douglas C-47	3.0	3.0
1046	North Sea	Douglas DC-3	7.0	7.0
2208	Off Phu Quoc, Vietnam	Lockheed P-3B Orion	12.0	12.0
2771	Willow, Alaska	Piper PA-18	3.0	3.0
2863	Moscow, Russia	Tupolev TU-124	61.0	61.0
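By default, duplicated() marks only the second and later copies, so the first occurrence of each row does not appear in the listing above. A small sketch (a toy frame reusing two Location/Type pairs from the output) that shows the keep=False variant, which marks every copy:

```python
import pandas as pd

# Toy frame: the first two rows are identical
toy = pd.DataFrame({
    "Location": ["Cleveland, Ohio", "Cleveland, Ohio", "Barcelona, Spain"],
    "Type": ["De Havilland DH-4", "De Havilland DH-4", "Breguet 14"],
})

later_copies = toy[toy.duplicated()]          # default keep='first'
all_copies = toy[toy.duplicated(keep=False)]  # every member of a duplicate group
print(later_copies.index.tolist())
print(all_copies.index.tolist())
```

Looking at all copies side by side can help decide whether the rows are true duplicates or separate crashes that happen to share the four selected columns.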

Drop the duplicated rows

data.drop_duplicates(inplace=True)
data.info()
# output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5211 entries, 0 to 5267
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Location    5211 non-null   object 
 1   Type        5211 non-null   object 
 2   Aboard      5192 non-null   float64
 3   Fatalities  5202 non-null   float64
dtypes: float64(2), object(2)
memory usage: 203.6+ KB

data.head()
# output
    Location	Type	Aboard	Fatalities
0	Fort Myer, Virginia	Wright Flyer III	2.0	1.0
1	AtlantiCity, New Jersey	Dirigible	5.0	5.0
2	Victoria, British Columbia, Canada	Curtiss seaplane	1.0	1.0
3	Over the North Sea	Zeppelin L-1 (airship)	20.0	14.0
4	Near Johannisthal, Germany	Zeppelin L-2 (airship)	30.0	30.0
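After dropping rows, the index keeps its original labels (5211 entries labelled 0 to 5267 above), which leaves gaps. If a contiguous index is preferred, one option is to reset it. A sketch on made-up values:

```python
import pandas as pd

# Toy frame (hypothetical values) with one duplicated row in the middle
toy = pd.DataFrame({"Location": ["A", "A", "B"], "Type": ["T", "T", "U"]})

deduped = toy.drop_duplicates()
renumbered = deduped.reset_index(drop=True)  # drop=True discards the old labels
print(deduped.index.tolist())
print(renumbered.index.tolist())
```

Whether to reset depends on whether the original row labels are still needed to trace rows back to the CSV.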

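4 Data analysis

Extract the country name

Split Location on commas; the last element of each list is the country (or, for U.S. entries, the state). Appending .str[-1] picks out that element, as in the steps that follow.

data.Location.str.split(',')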
# output
0                                  [Fort Myer,  Virginia]
1                              [AtlantiCity,  New Jersey]
2                  [Victoria,  British Columbia,  Canada]
3                                    [Over the North Sea]
4                           [Near Johannisthal,  Germany]
                              ...                        
5263                            [Near Madiun,  Indonesia]
5264                 [Near Isiro,  DemocratiRepubliCongo]
5265    [AtlantiOcean,  570 miles northeast of Natal, ...
5266     [Near Port Hope Simpson,  Newfoundland,  Canada]
5267                 [State of Arunachal Pradesh,  India]
Name: Location, Length: 5211, dtype: object

The 20 regions with the most crashes

data.Location.str.split(',').str[-1].value_counts()[:20].index
# output
Index([' Brazil', ' Alaska', ' Russia', ' Colombia', ' Canada', ' California',
       ' France', ' England', ' India', ' Indonesia', ' Mexico', ' China',
       ' Germany', ' Italy', ' Australia', ' Philippines', ' Spain',
       ' New York', ' USSR', ' Venezuela'],
      dtype='object')
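The labels above carry a leading space because the split happens on ',' only. A minimal sketch (three Location strings taken from the output earlier) of stripping that whitespace with .str.strip():

```python
import pandas as pd

loc = pd.Series([
    "Fort Myer, Virginia",
    "Near Johannisthal, Germany",
    "Over the North Sea",
])

# Take the last comma-separated piece, then remove surrounding whitespace
region = loc.str.split(",").str[-1].str.strip()
print(region.tolist())
```

Entries without a comma, such as "Over the North Sea", come back unchanged, so they are counted as their own "region".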

Share of all crashes accounted for by the top 20 regions (countries or U.S. states)

s = data.Location.str.split(',').str[-1].value_counts()
s[:20].sum()/s.sum()
# output
0.39666090961427747
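The same ratio can be read off directly with value_counts(normalize=True), which returns relative frequencies. A sketch on a made-up series of region labels, using the top 2 instead of the top 20:

```python
import pandas as pd

# Toy region labels (hypothetical data)
region = pd.Series(["Brazil", "Brazil", "Alaska", "Russia",
                    "Russia", "Russia", "Peru", "Chile"])

counts = region.value_counts()
share_manual = counts.iloc[:2].sum() / counts.sum()

# normalize=True returns fractions that sum to 1
share_norm = region.value_counts(normalize=True).iloc[:2].sum()
print(share_manual, share_norm)
```

Both expressions give 5/8 here, since the two most frequent labels cover five of the eight entries.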

Share of all crashes accounted for by the top 20 manufacturers

t = data.Type.str.split().str[0].value_counts()  # split() with no argument splits on whitespace
t[:20].sum()/t.sum()
# output
0.7299942429476108
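Taking the first whitespace-separated token is a rough proxy for the manufacturer. A small sketch (Type strings of the kind seen in the data above; "Boeing 707" is an illustrative value) that shows where the proxy falls short:

```python
import pandas as pd

types = pd.Series([
    "Douglas DC-3",
    "Douglas C-47",
    "De Havilland DH-4",
    "Boeing 707",
])

# First token of each Type string
brand = types.str.split().str[0]
print(brand.value_counts())
```

Multi-word names are truncated: "De Havilland" is counted as "De". For a ranking by share this is usually tolerable, but the labels should not be read as exact manufacturer names.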

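Compute the survival rate for crashes involving Boeing aircraft

data_b = data[data.Type.str.contains('Boeing')]
data_b.Aboard.replace(0.0, 1)  # preview Aboard with 0 replaced by 1 (an aircraft carries at least one person); replace() returns a new Series and does not modify data_b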
# output
137       2.0
140       2.0
206       1.0
228       3.0
231       1.0
        ...  
5226     90.0
5227      3.0
5231     88.0
5251    134.0
5261      7.0
Name: Aboard, Length: 378, dtype: float64

((data_b.Aboard - data_b.Fatalities)/data_b.Aboard).mean()
# output
0.24761622144180062
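Two details are worth spelling out. To make the 0-to-1 replacement take effect, its result has to be assigned back. Also, the figure above is the mean of per-crash survival rates, which weights a 2-person flight the same as a 200-person flight. A minimal sketch on made-up numbers (the `boeing` frame is hypothetical) that compares this mean with a pooled rate over all passengers:

```python
import pandas as pd

# Toy frame (hypothetical values); the first row has Aboard == 0
boeing = pd.DataFrame({
    "Aboard": [0.0, 10.0, 100.0],
    "Fatalities": [0.0, 10.0, 50.0],
})

# Assign the result back so the replacement persists
boeing = boeing.assign(Aboard=boeing["Aboard"].replace(0.0, 1))

# Mean of per-crash survival rates: every crash has equal weight
per_crash = (boeing["Aboard"] - boeing["Fatalities"]) / boeing["Aboard"]
mean_of_rates = per_crash.mean()

# Pooled survival rate: every person aboard has equal weight
pooled = 1 - boeing["Fatalities"].sum() / boeing["Aboard"].sum()
print(mean_of_rates, pooled)
```

The two numbers answer different questions, so which one to report depends on whether the unit of interest is the crash or the passenger.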