引入相关库
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
参考网站
link='https://projects.fivethirtyeight.com/flights/'
数据获取
半个月的14列20万行的美国航班数据
df=pd.read_csv('../homework/usa_flights.csv')
df.shape
(201664, 14)
’flight_date‘:航班飞行时间;’unique_carrier‘:航空公司号;’origin’:出发机场;‘dest’:到达机场;‘arr_delay’:实际到达目的地的延迟时间;‘cancelled’:是否被取消;‘distance’:距离;‘actual_elapsed_time’:实际飞行时间,这里只考虑‘arr_delay’,不考虑其他delay
df.head()
flight_date | unique_carrier | flight_num | origin | dest | arr_delay | cancelled | distance | carrier_delay | weather_delay | late_aircraft_delay | nas_delay | security_delay | actual_elapsed_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 02/01/2015 0:00 | AA | 1 | JFK | LAX | -19.0 | 0 | 2475 | NaN | NaN | NaN | NaN | NaN | 381.0 |
1 | 03/01/2015 0:00 | AA | 1 | JFK | LAX | -39.0 | 0 | 2475 | NaN | NaN | NaN | NaN | NaN | 358.0 |
2 | 04/01/2015 0:00 | AA | 1 | JFK | LAX | -12.0 | 0 | 2475 | NaN | NaN | NaN | NaN | NaN | 385.0 |
3 | 05/01/2015 0:00 | AA | 1 | JFK | LAX | -8.0 | 0 | 2475 | NaN | NaN | NaN | NaN | NaN | 389.0 |
4 | 06/01/2015 0:00 | AA | 1 | JFK | LAX | 25.0 | 0 | 2475 | 0.0 | 0.0 | 0.0 | 25.0 | 0.0 | 424.0 |
df.tail()
flight_date | unique_carrier | flight_num | origin | dest | arr_delay | cancelled | distance | carrier_delay | weather_delay | late_aircraft_delay | nas_delay | security_delay | actual_elapsed_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
201659 | 10/01/2015 0:00 | NK | 188 | OAK | LAS | -16.0 | 0 | 407 | NaN | NaN | NaN | NaN | NaN | 77.0 |
201660 | 11/01/2015 0:00 | NK | 188 | OAK | LAS | -4.0 | 0 | 407 | NaN | NaN | NaN | NaN | NaN | 87.0 |
201661 | 12/01/2015 0:00 | NK | 188 | OAK | LAS | -7.0 | 0 | 407 | NaN | NaN | NaN | NaN | NaN | 82.0 |
201662 | 13/01/2015 0:00 | NK | 188 | OAK | LAS | 23.0 | 0 | 407 | 3.0 | 0.0 | 0.0 | 20.0 | 0.0 | 103.0 |
201663 | 14/01/2015 0:00 | NK | 188 | OAK | LAS | -7.0 | 0 | 407 | NaN | NaN | NaN | NaN | NaN | 82.0 |
对’arr_delay’做一个降序的排序
df.sort_values('arr_delay',ascending=False)
flight_date | unique_carrier | flight_num | origin | dest | arr_delay | cancelled | distance | carrier_delay | weather_delay | late_aircraft_delay | nas_delay | security_delay | actual_elapsed_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11073 | 11/01/2015 0:00 | AA | 1595 | AUS | DFW | 1444.0 | 0 | 190 | 1444.0 | 0.0 | 0.0 | 0.0 | 0.0 | 59.0 |
10214 | 13/01/2015 0:00 | AA | 1487 | OMA | DFW | 1392.0 | 0 | 583 | 1392.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117.0 |
12430 | 03/01/2015 0:00 | AA | 1677 | MEM | DFW | 1384.0 | 0 | 432 | 1380.0 | 0.0 | 0.0 | 4.0 | 0.0 | 104.0 |
8443 | 04/01/2015 0:00 | AA | 1279 | OMA | DFW | 1237.0 | 0 | 583 | 1222.0 | 0.0 | 15.0 | 0.0 | 0.0 | 102.0 |
10328 | 05/01/2015 0:00 | AA | 1495 | EGE | DFW | 1187.0 | 0 | 721 | 1019.0 | 0.0 | 168.0 | 0.0 | 0.0 | 127.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
201554 | 06/01/2015 0:00 | NK | 174 | FLL | LGA | NaN | 1 | 1076 | NaN | NaN | NaN | NaN | NaN | NaN |
201594 | 07/01/2015 0:00 | NK | 180 | FLL | LGA | NaN | 1 | 1076 | NaN | NaN | NaN | NaN | NaN | NaN |
201642 | 06/01/2015 0:00 | NK | 188 | LGA | MYR | NaN | 1 | 563 | NaN | NaN | NaN | NaN | NaN | NaN |
201643 | 07/01/2015 0:00 | NK | 188 | LGA | MYR | NaN | 1 | 563 | NaN | NaN | NaN | NaN | NaN | NaN |
201648 | 12/01/2015 0:00 | NK | 188 | LGA | MYR | NaN | 1 | 563 | NaN | NaN | NaN | NaN | NaN | NaN |
201664 rows × 14 columns
获取延误时间最长Top10
df.sort_values('arr_delay',ascending=False)[:10]
flight_date | unique_carrier | flight_num | origin | dest | arr_delay | cancelled | distance | carrier_delay | weather_delay | late_aircraft_delay | nas_delay | security_delay | actual_elapsed_time | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11073 | 11/01/2015 0:00 | AA | 1595 | AUS | DFW | 1444.0 | 0 | 190 | 1444.0 | 0.0 | 0.0 | 0.0 | 0.0 | 59.0 |
10214 | 13/01/2015 0:00 | AA | 1487 | OMA | DFW | 1392.0 | 0 | 583 | 1392.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117.0 |
12430 | 03/01/2015 0:00 | AA | 1677 | MEM | DFW | 1384.0 | 0 | 432 | 1380.0 | 0.0 | 0.0 | 4.0 | 0.0 | 104.0 |
8443 | 04/01/2015 0:00 | AA | 1279 | OMA | DFW | 1237.0 | 0 | 583 | 1222.0 | 0.0 | 15.0 | 0.0 | 0.0 | 102.0 |
10328 | 05/01/2015 0:00 | AA | 1495 | EGE | DFW | 1187.0 | 0 | 721 | 1019.0 | 0.0 | 168.0 | 0.0 | 0.0 | 127.0 |
36570 | 04/01/2015 0:00 | DL | 1435 | MIA | MSP | 1174.0 | 0 | 1501 | 1174.0 | 0.0 | 0.0 | 0.0 | 0.0 | 231.0 |
36495 | 04/01/2015 0:00 | DL | 1367 | ROC | ATL | 1138.0 | 0 | 749 | 1112.0 | 0.0 | 0.0 | 26.0 | 0.0 | 171.0 |
59072 | 14/01/2015 0:00 | DL | 1687 | SAN | MSP | 1084.0 | 0 | 1532 | 1070.0 | 0.0 | 0.0 | 14.0 | 0.0 | 240.0 |
32173 | 05/01/2015 0:00 | AA | 970 | LAS | LAX | 1042.0 | 0 | 236 | 1033.0 | 0.0 | 9.0 | 0.0 | 0.0 | 66.0 |
56488 | 12/01/2015 0:00 | DL | 2117 | ATL | COS | 1016.0 | 0 | 1184 | 1016.0 | 0.0 | 0.0 | 0.0 | 0.0 | 193.0 |
2.计算延误和没有延误所占比例
'arr_delay’大于0表示没有延误,小于0表示有延误
可以使用value_counts统计某一列数值数量的多少
df['cancelled'].value_counts()
0 196873
1 4791
Name: cancelled, dtype: int64
通过apply方法传入lambda表达式来区分’arr_delay‘是否大于零
df['delayed']=df['arr_delay'].apply(lambda x:x>0)
df.head()
flight_date | unique_carrier | flight_num | origin | dest | arr_delay | cancelled | distance | carrier_delay | weather_delay | late_aircraft_delay | nas_delay | security_delay | actual_elapsed_time | delayed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 02/01/2015 0:00 | AA | 1 | JFK | LAX | -19.0 | 0 | 2475 | NaN | NaN | NaN | NaN | NaN | 381.0 | False |
1 | 03/01/2015 0:00 | AA | 1 | JFK | LAX | -39.0 | 0 | 2475 | NaN | NaN | NaN | NaN | NaN | 358.0 | False |
2 | 04/01/2015 0:00 | AA | 1 | JFK | LAX | -12.0 | 0 | 2475 | NaN | NaN | NaN | NaN | NaN | 385.0 | False |
3 | 05/01/2015 0:00 | AA | 1 | JFK | LAX | -8.0 | 0 | 2475 | NaN | NaN | NaN | NaN | NaN | 389.0 | False |
4 | 06/01/2015 0:00 | AA | 1 | JFK | LAX | 25.0 | 0 | 2475 | 0.0 | 0.0 | 0.0 | 25.0 | 0.0 | 424.0 | True |
统计延误和没有延误航班具体占了多少个
delay_data=df['delayed'].value_counts()
结果是一个Series数据
type(delay_data)
pandas.core.series.Series
delay_data
False 103037
True 98627
Name: delayed, dtype: int64
延误航班占比的计算方法
delay_data[1]/(delay_data[0]+delay_data[1])
0.4890659711202793
3.每一个航空公司延误的情况
根据’unique_carrier’和’delayed’这两个columns做groupby
delay_group=df.groupby(['unique_carrier','delayed'])
delay_group
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023C9F77A048>
显示整个delay_group的数据的情况,显示为一个多级Series
delay_group.size()
unique_carrier delayed
AA False 8912
True 9841
AS False 3527
True 2104
B6 False 4832
True 4401
DL False 17719
True 9803
EV False 10596
True 11371
F9 False 1103
True 1848
HA False 1351
True 1354
MQ False 4692
True 8060
NK False 1550
True 2133
OO False 9977
True 10804
UA False 7885
True 8624
US False 7850
True 6353
VX False 1254
True 781
WN False 21789
True 21150
dtype: int64
通过unstack把这个二级Series转化为DataFrame
df_delay=delay_group.size().unstack()
df_delay
delayed | False | True |
---|---|---|
unique_carrier | ||
AA | 8912 | 9841 |
AS | 3527 | 2104 |
B6 | 4832 | 4401 |
DL | 17719 | 9803 |
EV | 10596 | 11371 |
F9 | 1103 | 1848 |
HA | 1351 | 1354 |
MQ | 4692 | 8060 |
NK | 1550 | 2133 |
OO | 9977 | 10804 |
UA | 7885 | 8624 |
US | 7850 | 6353 |
VX | 1254 | 781 |
WN | 21789 | 21150 |
引入画图相关库,画图显示
import matplotlib.pyplot as plot
df_delay.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x23c9f4c2ac8>
传入一些参数改变画图的形式,变成柱状图,显示更加直观
df_delay.plot(kind='barh',stacked=True,figsize=[16,6],colormap='winter')
<matplotlib.axes._subplots.AxesSubplot at 0x23ca7f8fb48>
4.透视表功能
创建一个透视表,显示每一天的航班的数据情况
flights_by_carrier=df.pivot_table(index='flight_date',columns='unique_carrier',values='flight_num',aggfunc='count')
flights_by_carrier
unique_carrier | AA | AS | B6 | DL | EV | F9 | HA | MQ | NK | OO | UA | US | VX | WN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
flight_date | ||||||||||||||
02/01/2015 0:00 | 1545 | 477 | 759 | 2271 | 1824 | 254 | 224 | 1046 | 287 | 1763 | 1420 | 1177 | 176 | 3518 |
03/01/2015 0:00 | 1453 | 449 | 711 | 2031 | 1744 | 192 | 202 | 937 | 285 | 1681 | 1233 | 1028 | 160 | 3328 |
04/01/2015 0:00 | 1534 | 458 | 759 | 2258 | 1833 | 249 | 206 | 1027 | 284 | 1731 | 1283 | 1158 | 169 | 3403 |
05/01/2015 0:00 | 1532 | 433 | 754 | 2212 | 1811 | 264 | 209 | 1039 | 288 | 1737 | 1432 | 1157 | 174 | 3506 |
06/01/2015 0:00 | 1400 | 415 | 692 | 2054 | 1686 | 249 | 202 | 966 | 279 | 1527 | 1294 | 1003 | 152 | 3396 |
07/01/2015 0:00 | 1420 | 419 | 694 | 2107 | 1705 | 228 | 208 | 998 | 284 | 1557 | 1259 | 1143 | 151 | 3398 |
08/01/2015 0:00 | 1447 | 444 | 731 | 2214 | 1759 | 224 | 208 | 1009 | 284 | 1625 | 1331 | 1171 | 164 | 3398 |
09/01/2015 0:00 | 1448 | 447 | 730 | 2204 | 1767 | 229 | 222 | 1010 | 289 | 1631 | 1331 | 1169 | 167 | 3364 |
10/01/2015 0:00 | 1285 | 412 | 641 | 1584 | 1133 | 168 | 201 | 768 | 285 | 1270 | 908 | 912 | 109 | 2668 |
11/01/2015 0:00 | 1426 | 432 | 711 | 2082 | 1585 | 222 | 206 | 979 | 286 | 1558 | 1162 | 1061 | 156 | 2888 |
12/01/2015 0:00 | 1445 | 426 | 718 | 2280 | 1754 | 236 | 209 | 1003 | 281 | 1628 | 1337 | 1086 | 161 | 3324 |
13/01/2015 0:00 | 1404 | 407 | 666 | 2079 | 1678 | 210 | 202 | 972 | 272 | 1514 | 1261 | 996 | 145 | 3374 |
14/01/2015 0:00 | 1414 | 412 | 667 | 2146 | 1688 | 226 | 206 | 998 | 279 | 1559 | 1258 | 1142 | 151 | 3374 |