分组和透视实战

引入相关库

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

参考网站

link='https://projects.fivethirtyeight.com/flights/'

数据获取

半个月的14列20万行的美国航班数据

df=pd.read_csv('../homework/usa_flights.csv')
df.shape
(201664, 14)

’flight_date‘:航班飞行时间;’unique_carrier‘:航空公司号;’origin’:出发机场;‘dest’:到达机场;‘arr_delay’:实际到达目的地的延迟时间;‘cancelled’:是否被取消;‘distance’:距离;‘actual_elapsed_time’:实际飞行时间,这里只考虑‘arr_delay’,不考虑其他delay

df.head()
flight_dateunique_carrierflight_numorigindestarr_delaycancelleddistancecarrier_delayweather_delaylate_aircraft_delaynas_delaysecurity_delayactual_elapsed_time
002/01/2015 0:00AA1JFKLAX-19.002475NaNNaNNaNNaNNaN381.0
103/01/2015 0:00AA1JFKLAX-39.002475NaNNaNNaNNaNNaN358.0
204/01/2015 0:00AA1JFKLAX-12.002475NaNNaNNaNNaNNaN385.0
305/01/2015 0:00AA1JFKLAX-8.002475NaNNaNNaNNaNNaN389.0
406/01/2015 0:00AA1JFKLAX25.0024750.00.00.025.00.0424.0
df.tail()
flight_dateunique_carrierflight_numorigindestarr_delaycancelleddistancecarrier_delayweather_delaylate_aircraft_delaynas_delaysecurity_delayactual_elapsed_time
20165910/01/2015 0:00NK188OAKLAS-16.00407NaNNaNNaNNaNNaN77.0
20166011/01/2015 0:00NK188OAKLAS-4.00407NaNNaNNaNNaNNaN87.0
20166112/01/2015 0:00NK188OAKLAS-7.00407NaNNaNNaNNaNNaN82.0
20166213/01/2015 0:00NK188OAKLAS23.004073.00.00.020.00.0103.0
20166314/01/2015 0:00NK188OAKLAS-7.00407NaNNaNNaNNaNNaN82.0

对’arr_delay’做一个降序的排序

df.sort_values('arr_delay',ascending=False)
flight_dateunique_carrierflight_numorigindestarr_delaycancelleddistancecarrier_delayweather_delaylate_aircraft_delaynas_delaysecurity_delayactual_elapsed_time
1107311/01/2015 0:00AA1595AUSDFW1444.001901444.00.00.00.00.059.0
1021413/01/2015 0:00AA1487OMADFW1392.005831392.00.00.00.00.0117.0
1243003/01/2015 0:00AA1677MEMDFW1384.004321380.00.00.04.00.0104.0
844304/01/2015 0:00AA1279OMADFW1237.005831222.00.015.00.00.0102.0
1032805/01/2015 0:00AA1495EGEDFW1187.007211019.00.0168.00.00.0127.0
.............................................
20155406/01/2015 0:00NK174FLLLGANaN11076NaNNaNNaNNaNNaNNaN
20159407/01/2015 0:00NK180FLLLGANaN11076NaNNaNNaNNaNNaNNaN
20164206/01/2015 0:00NK188LGAMYRNaN1563NaNNaNNaNNaNNaNNaN
20164307/01/2015 0:00NK188LGAMYRNaN1563NaNNaNNaNNaNNaNNaN
20164812/01/2015 0:00NK188LGAMYRNaN1563NaNNaNNaNNaNNaNNaN

201664 rows × 14 columns

获取延误时间最长Top10

df.sort_values('arr_delay',ascending=False)[:10]
flight_dateunique_carrierflight_numorigindestarr_delaycancelleddistancecarrier_delayweather_delaylate_aircraft_delaynas_delaysecurity_delayactual_elapsed_time
1107311/01/2015 0:00AA1595AUSDFW1444.001901444.00.00.00.00.059.0
1021413/01/2015 0:00AA1487OMADFW1392.005831392.00.00.00.00.0117.0
1243003/01/2015 0:00AA1677MEMDFW1384.004321380.00.00.04.00.0104.0
844304/01/2015 0:00AA1279OMADFW1237.005831222.00.015.00.00.0102.0
1032805/01/2015 0:00AA1495EGEDFW1187.007211019.00.0168.00.00.0127.0
3657004/01/2015 0:00DL1435MIAMSP1174.0015011174.00.00.00.00.0231.0
3649504/01/2015 0:00DL1367ROCATL1138.007491112.00.00.026.00.0171.0
5907214/01/2015 0:00DL1687SANMSP1084.0015321070.00.00.014.00.0240.0
3217305/01/2015 0:00AA970LASLAX1042.002361033.00.09.00.00.066.0
5648812/01/2015 0:00DL2117ATLCOS1016.0011841016.00.00.00.00.0193.0

2.计算延误和没有延误所占比例

'arr_delay’大于0表示没有延误,小于0表示有延误
可以使用value_counts统计某一列数值数量的多少

df['cancelled'].value_counts()
0    196873
1      4791
Name: cancelled, dtype: int64

通过apply方法传入lambda表达式来区分’arr_delay‘是否大于零

df['delayed']=df['arr_delay'].apply(lambda x:x>0)
df.head()
flight_dateunique_carrierflight_numorigindestarr_delaycancelleddistancecarrier_delayweather_delaylate_aircraft_delaynas_delaysecurity_delayactual_elapsed_timedelayed
002/01/2015 0:00AA1JFKLAX-19.002475NaNNaNNaNNaNNaN381.0False
103/01/2015 0:00AA1JFKLAX-39.002475NaNNaNNaNNaNNaN358.0False
204/01/2015 0:00AA1JFKLAX-12.002475NaNNaNNaNNaNNaN385.0False
305/01/2015 0:00AA1JFKLAX-8.002475NaNNaNNaNNaNNaN389.0False
406/01/2015 0:00AA1JFKLAX25.0024750.00.00.025.00.0424.0True

统计延误和没有延误航班具体占了多少个

delay_data=df['delayed'].value_counts()

结果是一个Series数据

type(delay_data)
pandas.core.series.Series
delay_data
False    103037
True      98627
Name: delayed, dtype: int64

延误航班占比的计算方法

delay_data[1]/(delay_data[0]+delay_data[1])
0.4890659711202793

3.每一个航空公司延误的情况

根据’unique_carrier’和’delayed’这两个columns做groupby

delay_group=df.groupby(['unique_carrier','delayed'])
delay_group
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023C9F77A048>

显示整个delay_group的数据的情况,显示为一个多级Series

delay_group.size()
unique_carrier  delayed
AA              False       8912
                True        9841
AS              False       3527
                True        2104
B6              False       4832
                True        4401
DL              False      17719
                True        9803
EV              False      10596
                True       11371
F9              False       1103
                True        1848
HA              False       1351
                True        1354
MQ              False       4692
                True        8060
NK              False       1550
                True        2133
OO              False       9977
                True       10804
UA              False       7885
                True        8624
US              False       7850
                True        6353
VX              False       1254
                True         781
WN              False      21789
                True       21150
dtype: int64

通过unstack把这个二级Series转化为DataFrame

df_delay=delay_group.size().unstack()
df_delay
delayedFalseTrue
unique_carrier
AA89129841
AS35272104
B648324401
DL177199803
EV1059611371
F911031848
HA13511354
MQ46928060
NK15502133
OO997710804
UA78858624
US78506353
VX1254781
WN2178921150

引入画图相关库,画图显示

import matplotlib.pyplot as plot
df_delay.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x23c9f4c2ac8>

在这里插入图片描述

传入一些参数改变画图的形式,变成柱状图,显示更加直观

df_delay.plot(kind='barh',stacked=True,figsize=[16,6],colormap='winter')
<matplotlib.axes._subplots.AxesSubplot at 0x23ca7f8fb48>

在这里插入图片描述

4.透视表功能

创建一个透视表,显示每一天的航班的数据情况

flights_by_carrier=df.pivot_table(index='flight_date',columns='unique_carrier',values='flight_num',aggfunc='count')
flights_by_carrier
unique_carrierAAASB6DLEVF9HAMQNKOOUAUSVXWN
flight_date
02/01/2015 0:0015454777592271182425422410462871763142011771763518
03/01/2015 0:001453449711203117441922029372851681123310281603328
04/01/2015 0:0015344587592258183324920610272841731128311581693403
05/01/2015 0:0015324337542212181126420910392881737143211571743506
06/01/2015 0:001400415692205416862492029662791527129410031523396
07/01/2015 0:001420419694210717052282089982841557125911431513398
08/01/2015 0:0014474447312214175922420810092841625133111711643398
09/01/2015 0:0014484477302204176722922210102891631133111691673364
10/01/2015 0:0012854126411584113316820176828512709089121092668
11/01/2015 0:001426432711208215852222069792861558116210611562888
12/01/2015 0:0014454267182280175423620910032811628133710861613324
13/01/2015 0:00140440766620791678210202972272151412619961453374
14/01/2015 0:001414412667214616882262069982791559125811421513374
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值