The repository containing all the examples is on GitHub.
1 Import libraries
import pandas as pd
import numpy as np
2 Load the data
data = pd.read_csv('./dataset/air1908.csv')
data.head()
# output
   Date        Time   Location                            Operator                Flight #  Route          Type                    Registration  cn/In  Aboard  Fatalities  Ground  Summary
0  09/17/1908  17:18  Fort Myer, Virginia                 Military - U.S. Army    NaN       Demonstration  Wright Flyer III        NaN           1      2.0     1.0         0.0     During a demonstration flight, a U.S. Army fly...
1  07/12/1912  06:30  AtlantiCity, New Jersey             Military - U.S. Navy    NaN       Test flight    Dirigible               NaN           NaN    5.0     5.0         0.0     First U.S. dirigible Akron exploded just offsh...
2  08/06/1913  NaN    Victoria, British Columbia, Canada  Private                 -         NaN            Curtiss seaplane        NaN           NaN    1.0     1.0         0.0     The first fatal airplane accident in Canada oc...
3  09/09/1913  18:30  Over the North Sea                  Military - German Navy  NaN       NaN            Zeppelin L-1 (airship)  NaN           NaN    20.0    14.0        0.0     The airship flew into a thunderstorm and encou...
4  10/17/1913  10:30  Near Johannisthal, Germany          Military - German Navy  NaN       NaN            Zeppelin L-2 (airship)  NaN           NaN    30.0    30.0        0.0     Hydrogen gas which was being vented was sucked...
data.columns
# output
Index(['Date', 'Time', 'Location', 'Operator', 'Flight #', 'Route', 'Type',
'Registration', 'cn/In', 'Aboard', 'Fatalities', 'Ground', 'Summary'],
dtype='object')
3 Data preprocessing
Select the columns of interest
data = data[['Location','Type','Aboard','Fatalities']]
data.head()
# output
   Location                            Type                    Aboard  Fatalities
0  Fort Myer, Virginia                 Wright Flyer III        2.0     1.0
1  AtlantiCity, New Jersey             Dirigible               5.0     5.0
2  Victoria, British Columbia, Canada  Curtiss seaplane        1.0     1.0
3  Over the North Sea                  Zeppelin L-1 (airship)  20.0    14.0
4  Near Johannisthal, Germany          Zeppelin L-2 (airship)  30.0    30.0
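As an aside, the column selection can also happen at load time. The sketch below is a minimal illustration that assumes a tiny in-memory CSV (invented here so the snippet runs without the dataset file) and uses the usecols parameter of pd.read_csv to keep only the four columns of interest.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file; this content is made up for illustration
csv_text = (
    "Date,Location,Type,Aboard,Fatalities\n"
    '09/17/1908,"Fort Myer, Virginia",Wright Flyer III,2,1\n'
)

# usecols keeps only the listed columns while parsing
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Location", "Type", "Aboard", "Fatalities"],
)
print(df.columns.tolist())
```

With the real file, the first argument would be the path used earlier instead of the StringIO object.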
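3.1 Handling missing values
any(axis=1) marks a row as True if it contains at least one NaN, so summing the result counts the rows with missing values
data.isnull().any(axis=1).sum()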
# output
66
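To make the row-wise check concrete, here is a minimal sketch on a toy frame (the values are invented, not taken from the crash dataset). It is written with the keyword form axis=1, which recent pandas versions require.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Location": ["A, X", None, "C, Z"],
    "Aboard": [2.0, 5.0, np.nan],
})

# True for every row that holds at least one missing value
row_has_nan = toy.isnull().any(axis=1)

# Summing booleans counts the True entries
n_rows_with_nan = int(row_has_nan.sum())
print(row_has_nan.tolist(), n_rows_with_nan)
```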
Make sure neither 'Location' nor 'Type' contains NaN by dropping the rows where either one is missing
data = data.dropna(subset=['Location','Type'])
data.isnull().any(axis=1).sum()
# output
19
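The effect of the subset argument can be seen on a small example. The sketch below uses an invented toy frame: rows with NaN in Location or Type are dropped, while a NaN in Aboard alone does not cause a drop, which is why some rows with missing values can remain afterwards.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Location": ["A, X", None, "C, Z", "D, W"],
    "Type": ["T1", "T2", None, "T4"],
    "Aboard": [2.0, 5.0, 1.0, np.nan],
})

# Only NaN in the listed columns triggers a drop
kept = toy.dropna(subset=["Location", "Type"])

# Rows that still contain a NaN in some other column
remaining = int(kept.isnull().any(axis=1).sum())
print(kept.index.tolist(), remaining)
```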
Fill the remaining NaN values with 0
data.fillna(0, inplace=True) # replace NaN with 0
# output
      Location                                           Type                              Aboard  Fatalities
0     Fort Myer, Virginia                                Wright Flyer III                  2.0     1.0
1     AtlantiCity, New Jersey                            Dirigible                         5.0     5.0
2     Victoria, British Columbia, Canada                 Curtiss seaplane                  1.0     1.0
3     Over the North Sea                                 Zeppelin L-1 (airship)            20.0    14.0
4     Near Johannisthal, Germany                         Zeppelin L-2 (airship)            30.0    30.0
...   ...                                                ...                               ...     ...
5263  Near Madiun, Indonesia                             Lockheed C-130 Hercules           112.0   98.0
5264  Near Isiro, DemocratiRepubliCongo                  Antonov An-26                     4.0     4.0
5265  AtlantiOcean, 570 miles northeast of Natal, Br...  Airbus A330-203                   228.0   228.0
5266  Near Port Hope Simpson, Newfoundland, Canada       Britten-Norman BN-2A-27 Islander  1.0     1.0
5267  State of Arunachal Pradesh, India                  Antonov An-32                     13.0    13.0
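One alternative to filling in place is to assign the result of fillna to a new name, which leaves the original frame untouched. The sketch below assumes a toy frame with invented values and passes a dict so that only the named columns are filled.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Aboard": [2.0, np.nan],
    "Fatalities": [np.nan, 1.0],
})

# A dict maps each column to its own fill value; fillna returns a new frame
filled = toy.fillna({"Aboard": 0, "Fatalities": 0})

print(filled["Aboard"].tolist())
print(int(toy.isnull().sum().sum()))  # the original frame still has its NaN
```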
3.2 Handling duplicates
Check whether there are duplicate rows
data.duplicated().sum()
# output
10
View the duplicate rows
data[data.duplicated()]
# output
      Location                  Type                 Aboard  Fatalities
33    Cleveland, Ohio           De Havilland DH-4    1.0     1.0
51    Cleveland, Ohio           De Havilland DH-4    1.0     1.0
71    Barcelona, Spain          Breguet 14           1.0     1.0
147   Over the Gulf of Finland  Junkers F-13         6.0     6.0
410   Mexico                    Lockheed Orion       1.0     1.0
701   Kunming, China            Douglas C-47         3.0     3.0
1046  North Sea                 Douglas DC-3         7.0     7.0
2208  Off Phu Quoc, Vietnam     Lockheed P-3B Orion  12.0    12.0
2771  Willow, Alaska            Piper PA-18          3.0     3.0
2863  Moscow, Russia            Tupolev TU-124       61.0    61.0
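By default, duplicated() flags only the second and later copies of a row, so the first occurrence does not appear in the listing. The sketch below, on a three-row toy frame, contrasts the default with keep=False, which flags every copy.

```python
import pandas as pd

# Toy frame for illustration; the first two rows are identical
toy = pd.DataFrame({
    "Location": ["Cleveland, Ohio", "Cleveland, Ohio", "Barcelona, Spain"],
    "Type": ["De Havilland DH-4", "De Havilland DH-4", "Breguet 14"],
})

# Default keep='first': the first occurrence is not marked as a duplicate
later_copies = toy[toy.duplicated()]

# keep=False marks every row that has an identical twin
all_copies = toy[toy.duplicated(keep=False)]

print(later_copies.index.tolist(), all_copies.index.tolist())
```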
Drop the duplicate rows
data.drop_duplicates(inplace=True)
data.info()
# output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5211 entries, 0 to 5267
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Location    5211 non-null   object
 1   Type        5211 non-null   object
 2   Aboard      5192 non-null   float64
 3   Fatalities  5202 non-null   float64
dtypes: float64(2), object(2)
memory usage: 203.6+ KB
data.head()
# output
   Location                            Type                    Aboard  Fatalities
0  Fort Myer, Virginia                 Wright Flyer III        2.0     1.0
1  AtlantiCity, New Jersey             Dirigible               5.0     5.0
2  Victoria, British Columbia, Canada  Curtiss seaplane        1.0     1.0
3  Over the North Sea                  Zeppelin L-1 (airship)  20.0    14.0
4  Near Johannisthal, Germany          Zeppelin L-2 (airship)  30.0    30.0
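Dropping rows leaves gaps in the index labels (the info() output reports 5211 entries with labels running up to 5267). If contiguous labels are wanted, one option is reset_index(drop=True), sketched here on an invented toy frame.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Type": ["a", "a", "b", "a"],
    "Aboard": [1.0, 1.0, 2.0, 1.0],
})

# Rows 1 and 3 repeat row 0, so only labels 0 and 2 survive
deduped = toy.drop_duplicates()

# drop=True discards the old labels instead of keeping them as a column
renumbered = deduped.reset_index(drop=True)

print(deduped.index.tolist(), renumbered.index.tolist())
```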
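4 Data analysis
Extract the country name
Split Location on commas; the last piece, taken with .str[-1] in the steps that follow, is the country (or, for U.S. locations, the state)
data.Location.str.split(',')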
# output
0 [Fort Myer, Virginia]
1 [AtlantiCity, New Jersey]
2 [Victoria, British Columbia, Canada]
3 [Over the North Sea]
4 [Near Johannisthal, Germany]
...
5263 [Near Madiun, Indonesia]
5264 [Near Isiro, DemocratiRepubliCongo]
5265 [AtlantiOcean, 570 miles northeast of Natal, ...
5266 [Near Port Hope Simpson, Newfoundland, Canada]
5267 [State of Arunachal Pradesh, India]
Name: Location, Length: 5211, dtype: object
Top 20 regions by number of crashes
data.Location.str.split(',').str[-1].value_counts()[:20].index
# output
Index([' Brazil', ' Alaska', ' Russia', ' Colombia', ' Canada', ' California',
' France', ' England', ' India', ' Indonesia', ' Mexico', ' China',
' Germany', ' Italy', ' Australia', ' Philippines', ' Spain',
' New York', ' USSR', ' Venezuela'],
dtype='object')
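The region names above carry a leading space, because each location is split on the comma alone. A minimal sketch of one way to clean this up with .str.strip(), using three location strings from the sample rows:

```python
import pandas as pd

locations = pd.Series([
    "Fort Myer, Virginia",
    "Near Johannisthal, Germany",
    "Over the North Sea",  # no comma: the whole string is the last piece
])

# Last comma-separated piece, still with its leading space
last_part = locations.str.split(",").str[-1]

# Remove surrounding whitespace so ' Germany' and 'Germany' count as one value
cleaned = last_part.str.strip()

print(last_part.tolist())
print(cleaned.tolist())
```

Locations without a comma, such as "Over the North Sea", pass through unchanged, so they end up counted as their own "region".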
Share of all crashes accounted for by the top 20 countries or regions (the list above mixes countries with U.S. states)
s = data.Location.str.split(',').str[-1].value_counts()
s[:20].sum()/s.sum()
# output
0.39666090961427747
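The same kind of share can be read off directly with value_counts(normalize=True), which returns fractions instead of counts. The sketch below uses a six-element toy Series (invented values) and the top 2 instead of the top 20, and checks that both routes agree.

```python
import pandas as pd

# Toy data, invented for illustration only
regions = pd.Series(["Brazil", "Brazil", "Alaska", "Russia", "Brazil", "Alaska"])

# Route 1: counts, then divide the top-2 total by the grand total
counts = regions.value_counts()
share_top2 = counts.iloc[:2].sum() / counts.sum()

# Route 2: let pandas normalize the counts to fractions first
share_top2_alt = regions.value_counts(normalize=True).iloc[:2].sum()

print(share_top2, share_top2_alt)
```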
Share of all crashes accounted for by the top 20 manufacturers
t = data.Type.str.split().str[0].value_counts() # split() with no argument splits on whitespace
t[:20].sum()/t.sum()
# output
0.7299942429476108
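Taking the first word of Type as the manufacturer is an approximation: names made of several words are cut short. The sketch below illustrates this on four toy type strings, where "De Havilland" collapses to "De".

```python
import pandas as pd

# Illustrative aircraft type strings
types = pd.Series([
    "Boeing 747",
    "De Havilland DH-4",
    "Douglas DC-3",
    "Douglas C-47",
])

# First whitespace-separated token of each type string
first_word = types.str.split().str[0]

brand_counts = first_word.value_counts()
print(first_word.tolist())
print(int(brand_counts["Douglas"]))
```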
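Compute the crash survival rate for Boeing aircraft
data_b = data[data.Type.str.contains('Boeing')]
data_b.Aboard.replace(0.0, 1) # replace 0 in Aboard with 1, since at least one person is on board; replace() returns a new Series, so assign the result back to keep it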
# output
137 2.0
140 2.0
206 1.0
228 3.0
231 1.0
...
5226 90.0
5227 3.0
5231 88.0
5251 134.0
5261 7.0
Name: Aboard, Length: 378, dtype: float64
((data_b.Aboard - data_b.Fatalities)/data_b.Aboard).mean()
# output
0.24761622144180062
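The figure above is the mean of per-crash survival ratios, so a two-person crash weighs as much as a full airliner. The sketch below, on invented toy numbers, contrasts that with a pooled rate over all people aboard. As an assumption of this sketch (not the approach used above), a recorded 0 aboard is treated as missing so that the ratio is skipped instead of dividing by zero.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Aboard": [2.0, 100.0, 0.0],
    "Fatalities": [1.0, 100.0, 0.0],
})

# Treat 0 aboard as missing; NaN ratios are ignored by mean()
aboard = toy["Aboard"].replace(0.0, np.nan)
per_crash = (aboard - toy["Fatalities"]) / aboard

# Mean of per-crash ratios: every crash has equal weight
mean_of_ratios = per_crash.mean()

# Pooled rate: survivors divided by everyone aboard, across all crashes
pooled = (toy["Aboard"].sum() - toy["Fatalities"].sum()) / toy["Aboard"].sum()

print(mean_of_ratios, pooled)
```

The two numbers answer different questions: the first describes a typical crash, the second the chance of survival for a person picked at random from everyone aboard.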