6. 综合练习(1)

6. 综合练习

import numpy as np
import pandas as pd
# 不用print,直接显示结果
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# 显示所有列
pd.set_option('display.max_columns', 600)


一、2002 年-2018 年上海机动车牌照拍卖

(1) 哪一次拍卖的中标率首次小于 5%?

(考察点:创建列,索引)

(2) 按年统计拍卖最低价的下列统计量:最大值、均值、 0.75 分位数,要求显示在同一张表上。

建议先完成第三问,再做第二问
(考察点:创建列,索引,分组,统计量,分列)

(3) 将第一列时间列拆分成两个列,一列为年份(格式为 20××),另一列为月份(英语缩写),添加到列表作为第一第二列,并将原表第一列删除,其他列依次向后顺延。

(考察点:拆分列)

(4) 现在将表格行索引设为多级索引,外层为年份,内层为原表格第二至第五列的变量名,列索引为月份。

(考察点:多级索引)

(5) 一般而言某个月最低价与上月最低价的差额,会与该月均值与上月均值的差额具有相同的正负号,哪些拍卖时间不具有这个特点?

(考察点:统计量,分组,合并,索引排序)

(6) 将某一个月牌照发行量与其前两个月发行量均值的差额定义为发行增益,最初的两个月用 0 填充,求发行增益极值出现的时间。

包含极大值极小值
(考察点:统计量,索引,排序,差分)

# MVL = Motor Vehicle License
MVL = pd.read_csv('./2002年-2018年上海机动车拍照拍卖.csv')
MVL.head()
# MVL.info()
DateTotal number of license issuedlowest priceavg priceTotal number of applicants
02-Jan140013600147353718
12-Feb180013100140574590
22-Mar200014300146625190
32-Apr230016000163344806
42-May235017800183574665

(1) 哪一次拍卖的中标率首次小于 5%?

(考察点:创建列,索引)

MVL[(MVL['Total number of license issued']/MVL['Total number of applicants'])<=0.05]
Total number of license issuedlowest priceavg priceTotal number of applicants
年份月份
2001May74827900079099156007
Jun74418000080020172205
Jul75318310083171166302
Aug74548260082642166939
Oct77638530085424170995
Nov75148460084703169159
Dec76988450084572179133
Feb83638320083244196470
Mar83108310083148221109
Apr118298510085127256897
May115988500085058277889
Jun115468440084483275438
Jul114758720087235240750
Aug115498690086946251188
Feb101578820088240251717
Mar103568780087916262010
Apr121968980089850252273
May103169010090209270197
Jun103128940089532244349
Jul103259220092250269189
Aug105589160091629256083
Sep124139130091415250566
Oct113889350093540244868
Nov110029310093130226911
Mar98558810088176217056

(2) 按年统计拍卖最低价的下列统计量:最大值、均值、 0.75 分位数,要求显示在同一张表上。

建议先完成第三问,再做第二问
(考察点:创建列,索引,分组,统计量,分列)


# for name,group in MVL.groupby('年份'):
    
MVL.groupby('年份')['lowest price '].agg(['max','mean','quantile'])
maxmeanquantile
年份
20019350071640.74074175750.0
20023080020316.66666719700.0
20033850031983.33333333600.0
20044420029408.33333328650.0
20053790031908.33333333000.0
20063990037058.33333337750.0
20075380045691.66666745850.0
20083730029945.45454532600.0
20093690031333.33333331050.0

(3) 将第一列时间列拆分成两个列,一列为年份(格式为 20××),另一列为月份(英语缩写),添加到列表作为第一第二列,并将原表第一列删除,其他列依次向后顺延。

(考察点:拆分列)

MVL['年份']=MVL['Date'].apply(lambda x:int(str(x)[:-4])+2000)
MVL['月份']=MVL['Date'].apply(lambda x:str(x)[-3:])
MVL.drop(columns='Date',inplace=True)
MVL=MVL[['年份','月份','Total number of license issued','lowest price ','avg price','Total number of applicants']]
# MVL.reorder_levels(['年份','月份','Total number of license issued','lowest price','avg price','Total number of applicants'],axis=0).head()
# MVL=MVL.set_index(['年份','月份'])
MVL.head()
年份月份Total number of license issuedlowest priceavg priceTotal number of applicants
02002Jan140013600147353718
12002Feb180013100140574590
22002Mar200014300146625190
32002Apr230016000163344806
42002May235017800183574665

(4) 现在将表格行索引设为多级索引,外层为年份,内层为原表格第二至第五列的变量名,列索引为月份。

(考察点:多级索引)

# MVL.set_index(['年份','Total number of license issued','lowest price ','avg price','Total number of applicants'])
# pd.crosstab(index=MVL['年份','Total number of license issued','lowest price ','avg price','Total number of applicants'],columns=MVL['月份'])
# MVL.pivot_table(index=['年份','Total number of license issued','lowest price ','avg price','Total number of applicants'],columns='月份',values=1).head()
MVL_m=MVL.melt(id_vars=['年份','月份'],value_vars=['Total number of license issued','lowest price ','avg price','Total number of applicants'])
MVL_m=MVL_m.set_index(['年份','variable'])

pd.pivot_table(MVL_m,index=['年份','variable'],columns='月份',values='value')




月份AprAugDecFebJanJulJunMarMayNovOctSep
年份variable
2002Total number of applicants4806.04640.03525.04590.03718.03774.04502.05190.04665.04021.04661.04393.0
Total number of license issued2300.03000.03600.01800.01400.03000.02800.02000.02350.03200.03200.03200.0
avg price16334.021601.027848.014057.014735.020904.020178.014662.018357.031721.027040.024040.0
lowest price16000.021000.027800.013100.013600.019800.019600.014300.017800.030800.026400.023600.0
2003Total number of applicants8794.09315.010491.012030.09442.011929.015507.011219.014634.09849.09383.08532.0
..........................................
2017lowest price89800.091600.092800.088200.087600.092200.089400.087800.090100.093100.093500.091300.0
2018Total number of applicants204980.0192755.0165442.0220831.0226316.0202337.0209672.0217056.0198627.0177355.0181861.0189142.0
Total number of license issued11916.010402.012850.011098.012183.010395.010775.09855.010216.011766.010728.012712.0
avg price87089.088365.087508.087660.087936.088380.087900.088176.089018.087374.088070.087410.0
lowest price86900.088300.087400.087600.087900.088300.087800.088100.089000.087300.088000.087300.0

68 rows × 12 columns

(5) 一般而言某个月最低价与上月最低价的差额,会与该月均值与上月均值的差额具有相同的正负号,哪些拍卖时间不具有这个特点?

(考察点:统计量,分组,合并,索引排序)

MVL['LOW_Difference']=MVL[['lowest price ']].diff()
MVL['AVG_Difference']=MVL[['avg price']].diff()
MVL[MVL['LOW_Difference']*MVL['AVG_Difference']<0]
# df.diff(2)
年份月份Total number of license issuedlowest priceavg priceTotal number of applicantsLOW_DifferenceAVG_Difference
212003Oct4500328003484293834000.0-3886.0
222003Nov504233100342849849300.0-558.0
292004Jun62331780021001192337000.0-13225.0
362005Jan550028500325206208-800.02238.0
372005Feb3800317003242589493200.0-95.0
442005Sep67002650028927109721500.0-6978.0
522006May450037700381398301200.0-187.0
562006Sep650037000416017064-2900.01142.0
602007Jan600038500409746587-1300.0456.0
612007Feb350039100404735056600.0-501.0
712007Dec7500500005604210356-3800.01725.0
1282012Oct9500652006670819921-500.0283.0

(6) 将某一个月牌照发行量与其前两个月发行量均值的差额定义为发行增益,最初的两个月用 0 填充,求发行增益极值出现的时间。

包含极大值极小值
(考察点:统计量,索引,排序,差分)

MVL.sort_index()
MVL['m1']=MVL['Total number of license issued'].diff()
MVL['m2']=MVL['Total number of license issued'].diff(2)
MVL['re']=(MVL['m1']+MVL['m2'])/2
MVL
MVL['re'].idxmax()
MVL.loc[72]
MVL[MVL['re']==MVL['re'].min()]
年份月份Total number of license issuedlowest priceavg priceTotal number of applicantsLOW_DifferenceAVG_Differencem1m2re
02002Jan140013600147353718NaNNaNNaNNaNNaN
12002Feb180013100140574590-500.0-678.0400.0NaNNaN
22002Mar2000143001466251901200.0605.0200.0600.0400.0
32002Apr2300160001633448061700.01672.0300.0500.0400.0
42002May2350178001835746651800.02023.050.0350.0200.0
....................................
1982018Aug1040288300883651927550.0-15.07.0-373.0-183.0
1992018Sep127128730087410189142-1000.0-955.02310.02317.02313.5
2002018Oct107288800088070181861700.0660.0-1984.0326.0-829.0
2012018Nov117668730087374177355-700.0-696.01038.0-946.046.0
2022018Dec128508740087508165442100.0134.01084.02122.01603.0

203 rows × 11 columns

年份月份Total number of license issuedlowest priceavg priceTotal number of applicantsLOW_DifferenceAVG_Differencem1m2re
02002Jan140013600147353718NaNNaNNaNNaNNaN
12002Feb180013100140574590-500.0-678.0400.0NaNNaN
22002Mar2000143001466251901200.0605.0200.0600.0400.0
32002Apr2300160001633448061700.01672.0300.0500.0400.0
42002May2350178001835746651800.02023.050.0350.0200.0
....................................
1982018Aug1040288300883651927550.0-15.07.0-373.0-183.0
1992018Sep127128730087410189142-1000.0-955.02310.02317.02313.5
2002018Oct107288800088070181861700.0660.0-1984.0326.0-829.0
2012018Nov117668730087374177355-700.0-696.01038.0-946.046.0
2022018Dec128508740087508165442100.0134.01084.02122.01603.0

203 rows × 11 columns

72






年份                                 2008
月份                                  Jan
Total number of license issued    16000
lowest price                       8100
avg price                         23370
Total number of applicants        20539
LOW_Difference                   -41900
AVG_Difference                   -32672
m1                                 8500
m2                                 8500
re                                 8500
Name: 72, dtype: object
年份月份Total number of license issuedlowest priceavg priceTotal number of applicantsLOW_DifferenceAVG_Differencem1m2re
742008Apr90003730037659370726000.05490.0-300.0-7000.0-3650.0

二、2007 年-2019 年俄罗斯机场货运航班运载量

(1) 求每年货运航班总运量。

(考察点:统计量,分组)

(2) 每年记录的机场都是相同的吗?

(考察点:分组,查看类别值)

(3) 按年计算 2010 年-2015 年全年货运量记录为 0 的机场航班比例。

(考察点:分组,统计量,筛选)

(4) 若某机场至少存在 5 年或以上满足所有月运量记录都为 0,则将其所有年份的记录信息从表中删除,并返回处理后的表格

(考察点:数据删除,分组,筛选,索引)

(5) 采用一种合理的方式将所有机场划分为东南西北四个分区,并给出 2017年-2019 年货运总量最大的区域。

提示:最后一列 →_→
(考察点:分组)

(6) 在统计学中常常用秩代表排名,现在规定某个机场某年某个月的秩为该机场该月在当年所有月份中货运量的排名(例如 *** 机场 19 年 1 月运量在整个 19 年 12 个月中排名第一,则秩为 1),那么判断某月运量情况的相对大小的秩方法为将所有机场在该月的秩排名相加,并将这个量定义为每一个月的秩综合指数,请根据上述定义计算 2016 年 12 个月的秩综合指数。

(考察点:分组,合并,排序)

# RAS = Russian airport shipping
RAS = pd.read_csv('./2007年-2019年俄罗斯货运航班运载量.csv')
RAS.head()
RAS.info()
Airport nameYearJanuaryFebruaryMarchAprilMayJuneJulyAugustSeptemberOctoberNovemberDecemberWhole yearAirport coordinates
0Abakan201944.7066.2172.775.82100.3478.3863.8873.0666.7475.44110.589.8917.57(Decimal('91.399735'), Decimal('53.751351'))
1Aikhal20190.000.000.00.000.000.000.000.000.000.000.00.00.00(Decimal('111.543324'), Decimal('65.957161'))
2Loss20190.000.000.00.000.000.000.000.000.000.000.00.00.00(Decimal('125.398355'), Decimal('58.602489'))
3Amderma20190.000.000.00.000.000.000.000.000.000.000.00.00.00(Decimal('61.577429'), Decimal('69.759076'))
4Anadyr (Carbon)201981.63143.01260.9304.36122.00106.8784.99130.00102.00118.0094.0199.01746.76(Decimal('177.738273'), Decimal('64.713433'))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3711 entries, 0 to 3710
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Airport name         3711 non-null   object 
 1   Year                 3711 non-null   int64  
 2   January              3711 non-null   float64
 3   February             3711 non-null   float64
 4   March                3711 non-null   float64
 5   April                3711 non-null   float64
 6   May                  3711 non-null   float64
 7   June                 3711 non-null   float64
 8   July                 3711 non-null   float64
 9   August               3711 non-null   float64
 10  September            3711 non-null   float64
 11  October              3711 non-null   float64
 12  November             3711 non-null   float64
 13  December             3711 non-null   float64
 14  Whole year           3711 non-null   float64
 15  Airport coordinates  3711 non-null   object 
dtypes: float64(13), int64(1), object(2)
memory usage: 464.0+ KB

(1) 求每年货运航班总运量。

(考察点:统计量,分组)

RAS.groupby('Year')['Whole year'].sum()
Year
2007    659438.23
2008    664682.46
2009    560809.77
2010    693033.98
2011    818691.71
2012    846388.03
2013    792337.08
2014    729457.12
2015    630208.97
2016    679370.15
2017    773662.28
2018    767095.28
2019    764606.27
Name: Whole year, dtype: float64

(2) 每年记录的机场都是相同的吗?

(考察点:分组,查看类别值)


temp=pd.DataFrame()
for name ,group in RAS.groupby('Year'):
    print(name,temp.equals(pd.DataFrame(group['Airport name'].value_counts())))
    temp=pd.DataFrame(group['Airport name'].value_counts())
#     display(pd.DataFrame(group['Airport name'].value_counts()))
#     print(name )
#     display(group['Airport name'].value_counts())
temp
2007 False
2008 True
2009 True
2010 True
2011 True
2012 True
2013 True
2014 True
2015 True
2016 True
2017 True
2018 False
2019 False
Airport name
Usinsk2
Nyagan2
Vorkuta2
Ust-Tsilma2
Nerungri (Chulman)1
......
Keperveem1
Blagoveshchensk (Ignatevo)1
Nogliki1
Sovetskaya Gavan1
Мотыгино1

247 rows × 1 columns

(3) 按年计算 2010 年-2015 年全年货运量记录为 0 的机场航班比例。

(考察点:分组,统计量,筛选)

RAS_10_15=RAS[RAS['Year'].isin([2010,2011,2012,2013,2014,2015])]
RAS_10_15.head()
for name ,group in RAS_10_15.groupby('Year'):
    print(name,group[group['Whole year']==0].count()/group.count())
Airport nameYearJanuaryFebruaryMarchAprilMayJuneJulyAugustSeptemberOctoberNovemberDecemberWhole yearAirport coordinates
1083Abakan201537.7047.9754.6782.1268.81112.9555.8395.20137.7972.1363.6778.30907.14(Decimal('91.399735'), Decimal('53.751351'))
1084Aikhal20150.000.000.000.000.000.000.000.000.000.000.000.000.00(Decimal('111.543324'), Decimal('65.957161'))
1085Loss20150.000.000.000.000.000.000.000.000.000.000.000.000.00(Decimal('125.398355'), Decimal('58.602489'))
1086Amderma20150.000.000.000.000.000.000.000.000.000.000.000.000.00(Decimal('61.577429'), Decimal('69.759076'))
1087Anadyr2015124.31254.19340.37286.06156.55124.9589.0572.29118.1692.89158.39316.942134.15(Decimal('177.738273'), Decimal('64.713433'))
2010 Airport name           0.767123
Year                   0.767123
January                0.767123
February               0.767123
March                  0.767123
April                  0.767123
May                    0.767123
June                   0.767123
July                   0.767123
August                 0.767123
September              0.767123
October                0.767123
November               0.767123
December               0.767123
Whole year             0.767123
Airport coordinates    0.767123
dtype: float64
2011 Airport name           0.770548
Year                   0.770548
January                0.770548
February               0.770548
March                  0.770548
April                  0.770548
May                    0.770548
June                   0.770548
July                   0.770548
August                 0.770548
September              0.770548
October                0.770548
November               0.770548
December               0.770548
Whole year             0.770548
Airport coordinates    0.770548
dtype: float64
2012 Airport name           0.770548
Year                   0.770548
January                0.770548
February               0.770548
March                  0.770548
April                  0.770548
May                    0.770548
June                   0.770548
July                   0.770548
August                 0.770548
September              0.770548
October                0.770548
November               0.770548
December               0.770548
Whole year             0.770548
Airport coordinates    0.770548
dtype: float64
2013 Airport name           0.770548
Year                   0.770548
January                0.770548
February               0.770548
March                  0.770548
April                  0.770548
May                    0.770548
June                   0.770548
July                   0.770548
August                 0.770548
September              0.770548
October                0.770548
November               0.770548
December               0.770548
Whole year             0.770548
Airport coordinates    0.770548
dtype: float64
2014 Airport name           0.770548
Year                   0.770548
January                0.770548
February               0.770548
March                  0.770548
April                  0.770548
May                    0.770548
June                   0.770548
July                   0.770548
August                 0.770548
September              0.770548
October                0.770548
November               0.770548
December               0.770548
Whole year             0.770548
Airport coordinates    0.770548
dtype: float64
2015 Airport name           0.770548
Year                   0.770548
January                0.770548
February               0.770548
March                  0.770548
April                  0.770548
May                    0.770548
June                   0.770548
July                   0.770548
August                 0.770548
September              0.770548
October                0.770548
November               0.770548
December               0.770548
Whole year             0.770548
Airport coordinates    0.770548
dtype: float64

(4) 若某机场至少存在 5 年或以上满足所有月运量记录都为 0,则将其所有年份的记录信息从表中删除,并返回处理后的表格

(考察点:数据删除,分组,筛选,索引)

RAS_0=RAS[RAS['Whole year']==0]
RAS_0.head()
RAS_0=RAS_0[['Airport name','Whole year','Year']]
temp=RAS_0.groupby('Airport name')['Year'].count()>=5
type(temp)
print('temp',temp)
RAS[~RAS['Airport name'].isin(temp.index)]#== False 
print(RAS_0['Airport name'].isin(temp.index) )
# RAS_0.loc[temp.values]['Airport name']
# RAS_0.groupby('Airport name').head(1)#.count()

Airport nameYearJanuaryFebruaryMarchAprilMayJuneJulyAugustSeptemberOctoberNovemberDecemberWhole yearAirport coordinates
1Aikhal20190.00.00.00.00.00.00.00.00.00.00.00.00.0(Decimal('111.543324'), Decimal('65.957161'))
2Loss20190.00.00.00.00.00.00.00.00.00.00.00.00.0(Decimal('125.398355'), Decimal('58.602489'))
3Amderma20190.00.00.00.00.00.00.00.00.00.00.00.00.0(Decimal('61.577429'), Decimal('69.759076'))
6Apatite (Khibiny)20190.00.00.00.00.00.00.00.00.00.00.00.00.0(Decimal('33.581999'), Decimal('67.459641'))
7Arkhangelsk (Vaskovo)20190.00.00.00.00.00.00.00.00.00.00.00.00.0(Decimal('40.706789'), Decimal('64.592645'))
pandas.core.series.Series



temp Airport name
Achinsk              True
Aikhal               True
Amderma              True
Antypayuta           True
Apatite (Khibiny)    True
                     ... 
Лешуконское          True
Мотыгино             True
Нюрба                True
Среднеколымск        True
Таксимо              True
Name: Year, Length: 230, dtype: bool
Airport nameYearJanuaryFebruaryMarchAprilMayJuneJulyAugustSeptemberOctoberNovemberDecemberWhole yearAirport coordinates
0Abakan201944.7066.2172.7075.82100.3478.3863.8873.0666.7475.44110.5089.80917.57(Decimal('91.399735'), Decimal('53.751351'))
4Anadyr (Carbon)201981.63143.01260.90304.36122.00106.8784.99130.00102.00118.0094.00199.001746.76(Decimal('177.738273'), Decimal('64.713433'))
5Anapa (Vitjazevo)201945.9253.1554.0054.7252.0067.45172.3172.5770.0063.0069.0082.10856.22(Decimal('37.341511'), Decimal('45.003748'))
8Arkhangelsk (Talagy)201985.61118.70131.39144.82137.95140.18128.56135.68124.75139.60210.27307.101804.61(Decimal('40.714892'), Decimal('64.596138'))
9Astrakhan (Narimanovo)201951.7561.0865.6071.8471.3863.95164.8679.4685.2187.2379.0699.16980.58(Decimal('47.999896'), Decimal('46.287344'))
...................................................
3693Reads (tub)200755.9680.0985.90154.54162.71107.5180.14138.71133.19188.97228.84184.001600.56(Decimal('113.306492'), Decimal('52.020464'))
3705Yuzhno-(Khomutovo)2007710.80970.001330.301352.301324.401613.001450.701815.601902.301903.201666.101632.1017670.80(Decimal('142.723677'), Decimal('46.886967'))
3706Yakutsk2007583.70707.80851.801018.00950.80900.001154.901137.841485.501382.501488.001916.6013577.44(Decimal('129.750225'), Decimal('62.086594'))
3708Yamburg20073.550.163.375.324.316.306.883.604.134.934.178.8755.59(Decimal('75.097783'), Decimal('67.980026'))
3709Yaroslavl (Tunoshna)2007847.001482.901325.401235.97629.00838.001211.30915.001249.601650.501822.602055.6015262.87(Decimal('40.170054'), Decimal('57.56231'))

795 rows × 16 columns

1       True
2       True
3       True
6       True
7       True
        ... 
3702    True
3703    True
3704    True
3707    True
3710    True
Name: Airport name, Length: 2807, dtype: bool

(5) 采用一种合理的方式将所有机场划分为东南西北四个分区,并给出 2017年-2019 年货运总量最大的区域。

提示:最后一列 →_→
(考察点:分组)

RAS['x']=RAS['Airport coordinates'].apply(lambda x:str(x).split(',')[0][10:-2] if str(x)!='Not found'else 'Not found')
RAS['y']=RAS['Airport coordinates'].apply(lambda x:str(x).split(',')[1][10:-3] if str(x)!='Not found'else 'Not found')
RAS=RAS[(RAS['x']!='Not found')&(RAS['x']!='Abakan')]
RAS.head()
Airport nameYearJanuaryFebruaryMarchAprilMayJuneJulyAugustSeptemberOctoberNovemberDecemberWhole yearAirport coordinatesxy
0Abakan201944.7066.2172.775.82100.3478.3863.8873.0666.7475.44110.589.8917.57(Decimal('91.399735'), Decimal('53.751351'))91.39973553.751351
1Aikhal20190.000.000.00.000.000.000.000.000.000.000.00.00.00(Decimal('111.543324'), Decimal('65.957161'))111.54332465.957161
2Loss20190.000.000.00.000.000.000.000.000.000.000.00.00.00(Decimal('125.398355'), Decimal('58.602489'))125.39835558.602489
3Amderma20190.000.000.00.000.000.000.000.000.000.000.00.00.00(Decimal('61.577429'), Decimal('69.759076'))61.57742969.759076
4Anadyr (Carbon)201981.63143.01260.9304.36122.00106.8784.99130.00102.00118.0094.0199.01746.76(Decimal('177.738273'), Decimal('64.713433'))177.73827364.713433
RAS['x']=RAS['Airport coordinates'].apply(lambda x:str(x).split(',')[0][10:-2] if str(x)!='Not found'else 'Not found')
# RAS['x']=RAS['Airport coordinates'].apply(lambda x:print(str(x).split(',')) )
RAS['y']=RAS['Airport coordinates'].apply(lambda x:str(x).split(',')[1][10:-3] if str(x)!='Not found'else 'Not found')
RAS=RAS[(RAS['x']!='Not found')&(RAS['x']!='Abakan')]
RAS=RAS[RAS['y']!='Not found']#&(RAS['y']!='Abakan')
# RAS['X']=pd.cut(RAS['x'],bins=[RAS['x'].min(),RAS['x'].mean(),RAS['x'].max()])
# RAS['Y']=pd.cut(RAS['y'],bins=[RAS['y'].min(),RAS['y'].mean(),RAS['y'].max()])
# RAS['region']=RAS['X']+RAS['Y']
RAS.head()
# RAS[RAS['x']=='Abakan']
# RAS['x'].astype(np.int64).mean()
Airport nameYearJanuaryFebruaryMarchAprilMayJuneJulyAugustSeptemberOctoberNovemberDecemberWhole yearAirport coordinatesxy
0Abakan201944.7066.2172.775.82100.3478.3863.8873.0666.7475.44110.589.8917.57(Decimal('91.399735'), Decimal('53.751351'))91.39973553.751351
1Aikhal20190.000.000.00.000.000.000.000.000.000.000.00.00.00(Decimal('111.543324'), Decimal('65.957161'))111.54332465.957161
2Loss20190.000.000.00.000.000.000.000.000.000.000.00.00.00(Decimal('125.398355'), Decimal('58.602489'))125.39835558.602489
3Amderma20190.000.000.00.000.000.000.000.000.000.000.00.00.00(Decimal('61.577429'), Decimal('69.759076'))61.57742969.759076
4Anadyr (Carbon)201981.63143.01260.9304.36122.00106.8784.99130.00102.00118.0094.0199.01746.76(Decimal('177.738273'), Decimal('64.713433'))177.73827364.713433

(6) 在统计学中常常用秩代表排名,现在规定某个机场某年某个月的秩为该机场该月在当年所有月份中货运量的排名(例如 *** 机场 19 年 1 月运量在整个 19 年 12 个月中排名第一,则秩为 1),那么判断某月运量情况的相对大小的秩方法为将所有机场在该月的秩排名相加,并将这个量定义为每一个月的秩综合指数,请根据上述定义计算 2016 年 12 个月的秩综合指数。

(考察点:分组,合并,排序)

RAS_6=RAS[RAS.Year==2016]
RAS_6=RAS_6.melt(id_vars=['Airport name'],value_vars=['January','February','March','April','May','June','July','August','September','October','November','December'],value_name='month')
RAS_6=RAS_6.join(RAS_6.groupby('Airport name').rank(method='min'),rsuffix='_rank').sort_values(by=['Airport name','month'])
RAS_6
pd.pivot_table(RAS_6,columns='variable',values='month_rank',aggfunc='sum')#'value_rank'
Airport namevariablemonthmonth_rank
0AbakanJanuary34.101.0
292AbakanFebruary45.412.0
584AbakanMarch58.973.0
1752AbakanJuly64.314.0
876AbakanApril72.715.0
...............
2265ТаксимоAugust0.001.0
2557ТаксимоSeptember0.001.0
2849ТаксимоOctober0.001.0
3141ТаксимоNovember0.001.0
3433ТаксимоDecember0.001.0

3504 rows × 4 columns

variableAprilAugustDecemberFebruaryJanuaryJulyJuneMarchMayNovemberOctoberSeptember
month_rank701.0703.0905.0507.0402.0603.0633.0628.0631.0824.0771.0736.0

三、在美国的传播

(1) 用 corr() 函数计算县(每行都是一个县)人口与表中最后一天记录日期死亡数的相关系数。

(考察点:corr函数)

(2) 截止到 4 月 1 日,统计每个州零感染县的比例。

(考察点:分组,筛选,创建列)

(3) 请找出最早出确证病例的三个县。

(考察点:筛选,转换,索引)

(4) 按州统计单日死亡增加数,并给出哪个州在哪一天确诊数增加最大(这里指的是在所有州和所有天两个指标一起算,不是分别算)。

(考察点:分组,索引,差分,转换,筛选)

(5) 现需对每个州编制确证与死亡表,第一列为时间,并且起始时间为该州开始出现死亡比例的那一天,第二列和第三列分别为确证数和死亡数, 每个州需要保存为一个单独的 csv 文件,文件名为“州名.csv”。

(考察点:分组,索引,转换,循环,文件写入输出)

(6) 现需对 4 月 1 日至 4 月 10 日编制新增确证数与新增死亡数表,第一列为州名,第二列和第三列分别为新增确证数和新增死亡数,分别保存为十个单独的 csv 文件,文件名为“日期.csv”。

(考察点:分组,索引,转换,循环,文件写入输出)

# USCOV = COVID-19 in US
USCOV_diagnose = pd.read_csv('./美国确证数.csv')
USCOV_diagnose.head()
USCOV_diagnose.info()

USCOV_death = pd.read_csv('./美国死亡数.csv')
USCOV_death.head()
USCOV_death.info()

UIDiso2iso3code3FIPSAdmin2Province_StateCountry_RegionLatLong_Combined_Key2020/1/222020/1/232020/1/242020/1/252020/1/262020/1/272020/1/282020/1/292020/1/302020/1/312020/2/12020/2/22020/2/32020/2/42020/2/52020/2/62020/2/72020/2/82020/2/92020/2/102020/2/112020/2/122020/2/132020/2/142020/2/152020/2/162020/2/172020/2/182020/2/192020/2/202020/2/212020/2/222020/2/232020/2/242020/2/252020/2/262020/2/272020/2/282020/2/292020/3/12020/3/22020/3/32020/3/42020/3/52020/3/62020/3/72020/3/82020/3/92020/3/102020/3/112020/3/122020/3/132020/3/142020/3/152020/3/162020/3/172020/3/182020/3/192020/3/202020/3/212020/3/222020/3/232020/3/242020/3/252020/3/262020/3/272020/3/282020/3/292020/3/302020/3/312020/4/12020/4/22020/4/32020/4/42020/4/52020/4/62020/4/72020/4/82020/4/92020/4/102020/4/112020/4/122020/4/132020/4/142020/4/152020/4/162020/4/172020/4/182020/4/192020/4/202020/4/212020/4/222020/4/232020/4/242020/4/252020/4/26
084001001USUSA8401001AutaugaAlabamaUS32.539527-86.644082Autauga, Alabama, US0000000000000000000000000000000000000000000000000000000000000014666667810121212121212151719191923242626252628303233363637
184001003USUSA8401003BaldwinAlabamaUS30.727750-87.722071Baldwin, Alabama, US00000000000000000000000000000000000000000000000000000111112223445510151819202428292938424456596671728791101103109112117123132143147147161
284001005USUSA8401005BarbourAlabamaUS31.868263-85.387129Barbour, Alabama, US000000000000000000000000000000000000000000000000000000000000000000000000122233499101011121415182022282930323233
384001007USUSA8401007BibbAlabamaUS32.996421-87.125115Bibb, Alabama, US00000000000000000000000000000000000000000000000000000000000000000000233444578991113161717182224262832323433343438
484001009USUSA8401009BlountAlabamaUS33.982109-86.567906Blount, Alabama, US00000000000000000000000000000000000000000000000000000000000000012455555691010101010111212131416171820202122262931313134
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3142 entries, 0 to 3141
Columns: 107 entries, UID to 2020/4/26
dtypes: float64(2), int64(99), object(6)
memory usage: 2.6+ MB
UIDiso2iso3code3FIPSAdmin2Province_StateCountry_RegionLatLong_Combined_KeyPopulation2020/1/222020/1/232020/1/242020/1/252020/1/262020/1/272020/1/282020/1/292020/1/302020/1/312020/2/12020/2/22020/2/32020/2/42020/2/52020/2/62020/2/72020/2/82020/2/92020/2/102020/2/112020/2/122020/2/132020/2/142020/2/152020/2/162020/2/172020/2/182020/2/192020/2/202020/2/212020/2/222020/2/232020/2/242020/2/252020/2/262020/2/272020/2/282020/2/292020/3/12020/3/22020/3/32020/3/42020/3/52020/3/62020/3/72020/3/82020/3/92020/3/102020/3/112020/3/122020/3/132020/3/142020/3/152020/3/162020/3/172020/3/182020/3/192020/3/202020/3/212020/3/222020/3/232020/3/242020/3/252020/3/262020/3/272020/3/282020/3/292020/3/302020/3/312020/4/12020/4/22020/4/32020/4/42020/4/52020/4/62020/4/72020/4/82020/4/92020/4/102020/4/112020/4/122020/4/132020/4/142020/4/152020/4/162020/4/172020/4/182020/4/192020/4/202020/4/212020/4/222020/4/232020/4/242020/4/252020/4/26
084001001USUSA8401001AutaugaAlabamaUS32.539527-86.644082Autauga, Alabama, US55869000000000000000000000000000000000000000000000000000000000000000000000000000011111111112221122222
184001003USUSA8401003BaldwinAlabamaUS30.727750-87.722071Baldwin, Alabama, US223234000000000000000000000000000000000000000000000000000000000000000000011111111111111112222223333333
284001005USUSA8401005BarbourAlabamaUS31.868263-85.387129Barbour, Alabama, US24686000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
384001007USUSA8401007BibbAlabamaUS32.996421-87.125115Bibb, Alabama, US22394000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
484001009USUSA8401009BlountAlabamaUS33.982109-86.567906Blount, Alabama, US57826000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3142 entries, 0 to 3141
Columns: 108 entries, UID to 2020/4/26
dtypes: float64(2), int64(100), object(6)
memory usage: 2.6+ MB

(1) 用 corr() 函数计算县(每行都是一个县)人口与表中最后一天记录日期死亡数的相关系数。

(考察点:corr函数)

  • data.corr() #相关系数矩阵,即给出了任意两个变量之间的相关系数
  • data.corr()[u’A’] #只显示“A”与其他间的相关系数
  • data[u’A’].corr(data[u’B’]) #A与B两者的相关系数
USCOV_death[u'Population'].corr(USCOV_death[u'2020/4/26'])
0.4038441973480701

(2) 截止到 4 月 1 日,统计每个州零感染县的比例。

(考察点:分组,筛选,创建列)

# for name ,group in USCOV_diagnose.groupby('Province_State'):
#     temp=[]
# #     display( group.loc[:,'2020/1/22':'2020/4/1'])
#     for i in group.loc[:,'2020/1/22':'2020/4/1']:
#         display(i)
# #     print(group[group['sum']==0].count()/group.count())
(USCOV_diagnose[USCOV_diagnose['2020/4/1']==0].groupby('Province_State')['UID'].count()/USCOV_diagnose.groupby('Province_State')['UID'].count()).fillna(0)
Province_State
Alabama                 0.119403
Alaska                  0.793103
Arizona                 0.000000
Arkansas                0.293333
California              0.137931
Colorado                0.218750
Connecticut             0.000000
Delaware                0.000000
District of Columbia    0.000000
Florida                 0.164179
Georgia                 0.125786
Hawaii                  0.200000
Idaho                   0.386364
Illinois                0.480392
Indiana                 0.108696
Iowa                    0.404040
Kansas                  0.609524
Kentucky                0.441667
Louisiana               0.062500
Maine                   0.250000
Maryland                0.041667
Massachusetts           0.142857
Michigan                0.192771
Minnesota               0.367816
Mississippi             0.060976
Missouri                0.391304
Montana                 0.625000
Nebraska                0.752688
Nevada                  0.470588
New Hampshire           0.100000
New Jersey              0.000000
New Mexico              0.424242
New York                0.080645
North Carolina          0.180000
North Dakota            0.547170
Ohio                    0.181818
Oklahoma                0.376623
Oregon                  0.277778
Pennsylvania            0.104478
Rhode Island            0.000000
South Carolina          0.065217
South Dakota            0.560606
Tennessee               0.115789
Texas                   0.452756
Utah                    0.482759
Vermont                 0.142857
Virginia                0.270677
Washington              0.128205
West Virginia           0.472727
Wisconsin               0.319444
Wyoming                 0.347826
Name: UID, dtype: float64

(3) 请找出最早出确证病例的三个县。

(考察点:筛选,转换,索引)

USCOV_diagnose.sort_values(by=['2020/1/22','2020/1/23','2020/1/24','2020/1/25','2020/1/26'],ascending=False)
UIDiso2iso3code3FIPSAdmin2Province_StateCountry_RegionLatLong_Combined_Key2020/1/222020/1/232020/1/242020/1/252020/1/262020/1/272020/1/282020/1/292020/1/302020/1/312020/2/12020/2/22020/2/32020/2/42020/2/52020/2/62020/2/72020/2/82020/2/92020/2/102020/2/112020/2/122020/2/132020/2/142020/2/152020/2/162020/2/172020/2/182020/2/192020/2/202020/2/212020/2/222020/2/232020/2/242020/2/252020/2/262020/2/272020/2/282020/2/292020/3/12020/3/22020/3/32020/3/42020/3/52020/3/62020/3/72020/3/82020/3/92020/3/102020/3/112020/3/122020/3/132020/3/142020/3/152020/3/162020/3/172020/3/182020/3/192020/3/202020/3/212020/3/222020/3/232020/3/242020/3/252020/3/262020/3/272020/3/282020/3/292020/3/302020/3/312020/4/12020/4/22020/4/32020/4/42020/4/52020/4/62020/4/72020/4/82020/4/92020/4/102020/4/112020/4/122020/4/132020/4/142020/4/152020/4/162020/4/172020/4/182020/4/192020/4/202020/4/212020/4/222020/4/232020/4/242020/4/252020/4/26
296984053033USUSA84053033KingWashingtonUS47.491379-121.834613King, Washington, US11111111111111111111111111111111111111691421315158718383116190270328387387488569562693793934104011701170135915771577207721592161233023302656278728983167333134863688388641174262442644264549462046974902490251745174529353795532563757395863
61084017031USUSA84017031CookIllinoisUS41.841448-87.816588Cook, Illinois, US0011111112222222222222222222222222222223444556771122274050506210717827827854880592211941418141822392613344537274496515255756111743980348728950910520114151247213417145851547416323173061808719391203952127222101231812454625811276162905830574
10384004013USUSA8404013MaricopaArizonaUS33.348359-112.491815Maricopa, Arizona, US000011111111111111111111111111111111111111112222233344891122344981139199251299399454545690788871961104911711326143314951559168917411891196020202056214622642404249125892636273828462970311632343359
20484006037USUSA8406037Los AngelesCaliforniaUS34.308284-118.228241Los Angeles, California, US0000111111111111111111111111111111111111117111314141420273240535394144190231292292407536662812122914651465182924743019351840454566460559556377693675597955844384538894943310047105171085411400120211234113823151531644717537185451913319567
21584006059USUSA8406059OrangeCaliforniaUS33.701475-117.764600Orange, California, US00001111111111111111111111111111111111111133333455691414172229536578951251521872563214034314645026066567117868348829311016107911381221127712831299137614251501155616361676169117531827184519692074
....................................................................................................................................................................................................................................................................................................................................
313784056037USUSA84056037SweetwaterWyomingUS41.659439-108.882788Sweetwater, Wyoming, US000000000000000000000000000000000000000000000000000000000000000111111223345556667799101010101010161616161616
313884056039USUSA84056039TetonWyomingUS43.935225-110.589080Teton, Wyoming, US0000000000000000000000000000000000000000000000000000000001222225710131416202629323639404144455053565657585961626262929393959595
313984056041USUSA84056041UintaWyomingUS41.287818-110.547578Uinta, Wyoming, US000000000000000000000000000000000000000000000000000000000000000000000001122333344444446666777777
314084056043USUSA84056043WashakieWyomingUS43.904516-107.680187Washakie, Wyoming, US000000000000000000000000000000000000000000000000000000000000000001111111224445555555565555888888
314184056045USUSA84056045WestonWyomingUS43.839612-104.567488Weston, Wyoming, US000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

3142 rows × 107 columns

for col in USCOV_diagnose.loc[:,'2020/1/22':'2020/4/26'].columns: 
    if USCOV_diagnose[USCOV_diagnose[col] != 0][col].count() >= 3:
        USCOV_diagnose[USCOV_diagnose[col] != 0]
        break    
UIDiso2iso3code3FIPSAdmin2Province_StateCountry_RegionLatLong_Combined_Key2020/1/222020/1/232020/1/242020/1/252020/1/262020/1/272020/1/282020/1/292020/1/302020/1/312020/2/12020/2/22020/2/32020/2/42020/2/52020/2/62020/2/72020/2/82020/2/92020/2/102020/2/112020/2/122020/2/132020/2/142020/2/152020/2/162020/2/172020/2/182020/2/192020/2/202020/2/212020/2/222020/2/232020/2/242020/2/252020/2/262020/2/272020/2/282020/2/292020/3/12020/3/22020/3/32020/3/42020/3/52020/3/62020/3/72020/3/82020/3/92020/3/102020/3/112020/3/122020/3/132020/3/142020/3/152020/3/162020/3/172020/3/182020/3/192020/3/202020/3/212020/3/222020/3/232020/3/242020/3/252020/3/262020/3/272020/3/282020/3/292020/3/302020/3/312020/4/12020/4/22020/4/32020/4/42020/4/52020/4/62020/4/72020/4/82020/4/92020/4/102020/4/112020/4/122020/4/132020/4/142020/4/152020/4/162020/4/172020/4/182020/4/192020/4/202020/4/212020/4/222020/4/232020/4/242020/4/252020/4/26
10384004013USUSA8404013MaricopaArizonaUS33.348359-112.491815Maricopa, Arizona, US000011111111111111111111111111111111111111112222233344891122344981139199251299399454545690788871961104911711326143314951559168917411891196020202056214622642404249125892636273828462970311632343359
20484006037USUSA8406037Los AngelesCaliforniaUS34.308284-118.228241Los Angeles, California, US0000111111111111111111111111111111111111117111314141420273240535394144190231292292407536662812122914651465182924743019351840454566460559556377693675597955844384538894943310047105171085411400120211234113823151531644717537185451913319567
21584006059USUSA8406059OrangeCaliforniaUS33.701475-117.764600Orange, California, US00001111111111111111111111111111111111111133333455691414172229536578951251521872563214034314645026066567117868348829311016107911381221127712831299137614251501155616361676169117531827184519692074
61084017031USUSA84017031CookIllinoisUS41.841448-87.816588Cook, Illinois, US0011111112222222222222222222222222222223444556771122274050506210717827827854880592211941418141822392613344537274496515255756111743980348728950910520114151247213417145851547416323173061808719391203952127222101231812454625811276162905830574
296984053033USUSA84053033KingWashingtonUS47.491379-121.834613King, Washington, US11111111111111111111111111111111111111691421315158718383116190270328387387488569562693793934104011701170135915771577207721592161233023302656278728983167333134863688388641174262442644264549462046974902490251745174529353795532563757395863

(4) 按州统计单日死亡增加数,并给出哪个州在哪一天确诊数增加最大(这里指的是在所有州和所有天两个指标一起算,不是分别算)。

(考察点:分组,索引,差分,转换,筛选)

USCOV_death.groupby('Province_State')[:,'2020/1/22':'2020/4/1']
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值
>