Comprehensive Exercises

1. Shanghai Vehicle License Plate Auctions, 2002-2018

import numpy as np
import pandas as pd
# 读取数据
df1 = pd.read_csv('2002年-2018年上海机动车拍照拍卖.csv')
df1
       Date  Total number of license issued  lowest price   avg price  Total number of applicants
0     2-Jan                            1400          13600       14735                        3718
1     2-Feb                            1800          13100       14057                        4590
2     2-Mar                            2000          14300       14662                        5190
3     2-Apr                            2300          16000       16334                        4806
4     2-May                            2350          17800       18357                        4665
..      ...                             ...            ...         ...                         ...
198  18-Aug                           10402          88300       88365                      192755
199  18-Sep                           12712          87300       87410                      189142
200  18-Oct                           10728          88000       88070                      181861
201  18-Nov                           11766          87300       87374                      177355
202  18-Dec                           12850          87400       87508                      165442

203 rows × 5 columns

(1) Which auction was the first with a winning rate below 5%?

df1[(df1['Total number of license issued']/df1['Total number of applicants'])<0.05]
       Date  Total number of license issued  lowest price   avg price  Total number of applicants
159  15-May                            7482          79000       79099                      156007
160  15-Jun                            7441          80000       80020                      172205
161  15-Jul                            7531          83100       83171                      166302
162  15-Aug                            7454          82600       82642                      166939
164  15-Oct                            7763          85300       85424                      170995
165  15-Nov                            7514          84600       84703                      169159
166  15-Dec                            7698          84500       84572                      179133
168  16-Feb                            8363          83200       83244                      196470
169  16-Mar                            8310          83100       83148                      221109
170  16-Apr                           11829          85100       85127                      256897
171  16-May                           11598          85000       85058                      277889
172  16-Jun                           11546          84400       84483                      275438
173  16-Jul                           11475          87200       87235                      240750
174  16-Aug                           11549          86900       86946                      251188
180  17-Feb                           10157          88200       88240                      251717
181  17-Mar                           10356          87800       87916                      262010
182  17-Apr                           12196          89800       89850                      252273
183  17-May                           10316          90100       90209                      270197
184  17-Jun                           10312          89400       89532                      244349
185  17-Jul                           10325          92200       92250                      269189
186  17-Aug                           10558          91600       91629                      256083
187  17-Sep                           12413          91300       91415                      250566
188  17-Oct                           11388          93500       93540                      244868
189  17-Nov                           11002          93100       93130                      226911
193  18-Mar                            9855          88100       88176                      217056

Answer: 15-May (May 2015).
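The filter above lists every sub-5% auction; to pull out just the first one, `idxmax` on the boolean Series returns the position of the first True. A minimal sketch on a hypothetical miniature of the table:

```python
import pandas as pd

# Hypothetical miniature of df1: three auctions around the 5% threshold
df = pd.DataFrame({
    'Date': ['15-Apr', '15-May', '15-Jun'],
    'Total number of license issued': [7500, 7482, 7441],
    'Total number of applicants': [140000, 156007, 172205],
})

rate = df['Total number of license issued'] / df['Total number of applicants']
# idxmax on a boolean Series gives the index of the first True
first = df.loc[(rate < 0.05).idxmax(), 'Date']
print(first)
```

If no auction were below 5%, `idxmax` would still return the first index, so guarding with `.any()` is prudent on real data.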

(2) For each year, report the following statistics of the lowest bid price, all in a single table: maximum, mean, and 0.75 quantile.

df1['year'] = df1['Date'].apply(lambda x:'20' + x[:2] if x[2]=='-' else '200' + x[0])
df1.head()
    Date  Total number of license issued  lowest price   avg price  Total number of applicants  year
0  2-Jan                            1400          13600       14735                        3718  2002
1  2-Feb                            1800          13100       14057                        4590  2002
2  2-Mar                            2000          14300       14662                        5190  2002
3  2-Apr                            2300          16000       16334                        4806  2002
4  2-May                            2350          17800       18357                        4665  2002
np.quantile?
df1.groupby('year')['lowest price '].agg([max,np.mean,lambda x:x.quantile(0.75)]).rename(columns={'<lambda_0>':'0.75分位数'})
        max          mean  0.75分位数
year
2002  30800  20316.666667     24300
2003  38500  31983.333333     36300
2004  44200  29408.333333     38400
2005  37900  31908.333333     35600
2006  39900  37058.333333     39525
2007  53800  45691.666667     48950
2008  37300  29945.454545     34150
2009  36900  31333.333333     34150
2010  44900  38008.333333     41825
2011  53800  47958.333333     51000
2012  68900  61108.333333     65325
2013  90800  79125.000000     82550
2014  74600  73816.666667     74000
2015  85300  80575.000000     83450
2016  88600  85733.333333     87475
2017  93500  90616.666667     92350
2018  89000  87825.000000     88150

(3) Split the first (date) column into two columns, one for the year (in 20×× format) and one for the month (English abbreviation); add them to the table as the first and second columns, delete the original first column, and shift the remaining columns back accordingly.

df1['month'] = df1['Date'].apply(lambda x:x[-3:])
df1.head()
    Date  Total number of license issued  lowest price   avg price  Total number of applicants  year month
0  2-Jan                            1400          13600       14735                        3718  2002   Jan
1  2-Feb                            1800          13100       14057                        4590  2002   Feb
2  2-Mar                            2000          14300       14662                        5190  2002   Mar
3  2-Apr                            2300          16000       16334                        4806  2002   Apr
4  2-May                            2350          17800       18357                        4665  2002   May
df1 = df1.loc[:,['year', 'month', 'Total number of license issued', 'lowest price ', 'avg price',
       'Total number of applicants']]
df1.head()
   year month  Total number of license issued  lowest price   avg price  Total number of applicants
0  2002   Jan                            1400          13600       14735                        3718
1  2002   Feb                            1800          13100       14057                        4590
2  2002   Mar                            2000          14300       14662                        5190
3  2002   Apr                            2300          16000       16334                        4806
4  2002   May                            2350          17800       18357                        4665

(4) Now set the row index to a MultiIndex: the outer level is the year, the inner level is the variable names of the original second through fifth columns, and the column index is the month.

df1.set_index(['year','Total number of license issued', 'lowest price ',
       'avg price', 'Total number of applicants'])
                                                                              month
year  Total number of license issued  lowest price   avg price  Total number of applicants
2002   1400                                   13600       14735        3718    Jan
       1800                                   13100       14057        4590    Feb
       2000                                   14300       14662        5190    Mar
       2300                                   16000       16334        4806    Apr
       2350                                   17800       18357        4665    May
...                                             ...         ...         ...    ...
2018  10402                                   88300       88365      192755    Aug
      12712                                   87300       87410      189142    Sep
      10728                                   88000       88070      181861    Oct
      11766                                   87300       87374      177355    Nov
      12850                                   87400       87508      165442    Dec

203 rows × 1 columns
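The `set_index` call above only moves the value columns into the index; the layout the question actually asks for (year outer, variable names inner, months as columns) takes an `unstack`/`stack` pair. A sketch on a hypothetical two-month slice (the trailing space in 'lowest price ' mirrors the real column name):

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2002, 2002],
    'month': ['Jan', 'Feb'],
    'lowest price ': [13600, 13100],
    'avg price': [14735, 14057],
})

res = (df.set_index(['year', 'month'])
         .unstack('month')   # columns become (variable, month) pairs
         .stack(level=0))    # variable names drop into the inner row index
print(res)
```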

(5) In general, the difference between a month's lowest price and the previous month's lowest price has the same sign as the corresponding difference in average prices. Which auction dates do not follow this pattern?

df1[df1['lowest price '].diff()*df1['avg price'].diff()<0][['year','month']]
     year month
21   2003   Oct
22   2003   Nov
29   2004   Jun
36   2005   Jan
37   2005   Feb
44   2005   Sep
52   2006   May
56   2006   Sep
60   2007   Jan
61   2007   Feb
71   2007   Dec
128  2012   Oct
1282012Oct
df1.columns
Index(['year', 'month', 'Total number of license issued', 'lowest price ',
       'avg price', 'Total number of applicants'],
      dtype='object')

(6) Define a month's issuance gain as that month's license issuance minus the mean issuance of the previous two months, with the first two months filled with 0. Find the dates at which the issuance gain attains its extrema.

avg = (df1['Total number of license issued'].values[1:-1]+df1['Total number of license issued'].values[:-2])/2
avg = np.array([df1.loc[0,'Total number of license issued'],df1.loc[1,'Total number of license issued']]+avg.tolist())
diff = df1['Total number of license issued'].values - avg
diff
array([ 0.0000e+00,  0.0000e+00,  4.0000e+02,  4.0000e+02,  2.0000e+02,
        4.7500e+02,  4.2500e+02,  1.0000e+02,  2.0000e+02,  1.0000e+02,
        0.0000e+00,  4.0000e+02, -4.0000e+02, -3.0000e+02,  0.0000e+00,
        3.0000e+02,  6.5000e+02,  1.9500e+03,  1.3500e+03, -1.2500e+03,
        1.4000e+03, -1.0750e+03, -5.3300e+02,  5.0000e+00,  9.1000e+01,
       -8.8000e+01, -1.0000e+02,  7.0000e+02,  1.3770e+03,  2.1950e+02,
        2.2000e+02,  3.8350e+02, -6.0000e+01, -1.2000e+02, -2.0000e+01,
       -1.1000e+03, -5.5000e+02, -1.7000e+03, -6.5000e+02,  1.1000e+03,
        1.3330e+03,  2.7350e+02,  5.6450e+02,  8.2100e+02,  1.2250e+02,
       -7.6450e+02, -6.5000e+02, -1.5000e+02, -7.0000e+02, -1.5500e+03,
        1.0000e+02,  8.5000e+02, -2.5000e+02, -2.5000e+02,  1.0000e+03,
        1.2000e+03,  6.5000e+02,  1.5000e+02, -5.0000e+02,  2.5000e+02,
       -2.5000e+02, -2.7500e+03, -7.5000e+02,  1.7500e+03,  1.2500e+03,
        2.5000e+02, -5.0000e+02,  2.2500e+03,  1.7500e+03, -7.5000e+02,
       -5.0000e+02,  0.0000e+00,  8.5000e+03, -2.4500e+03, -3.6500e+03,
       -9.5000e+02, -9.0000e+02, -1.1500e+03, -1.2500e+03,  1.0000e+02,
       -1.2500e+03, -2.5000e+02, -7.5000e+02,  2.0000e+02,  3.5000e+02,
        8.0000e+02,  9.0000e+02,  9.5000e+02,  1.1500e+03,  4.0000e+02,
        0.0000e+00,  5.0000e+02, -2.5000e+02, -2.5000e+02,  0.0000e+00,
        0.0000e+00, -5.0000e+02,  2.5000e+02,  7.5000e+02,  2.5000e+02,
        7.0000e+02,  1.5000e+02, -1.0000e+02,  0.0000e+00,  0.0000e+00,
       -5.0000e+02,  2.5000e+02, -7.5000e+02, -1.0000e+03,  2.5000e+02,
        2.5000e+02,  1.0000e+03,  5.0000e+02,  0.0000e+00,  0.0000e+00,
        5.0000e+02, -2.5000e+02, -2.5000e+02, -5.0000e+02, -7.5000e+02,
       -2.5000e+02,  0.0000e+00,  5.0000e+02,  1.0500e+03,  6.0000e+02,
        1.0000e+02,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
        0.0000e+00, -5.0000e+02, -2.5000e+02,  0.0000e+00,  2.0000e+03,
       -1.0000e+03, -1.0000e+03,  0.0000e+00,  0.0000e+00,  0.0000e+00,
        1.0000e+03, -1.0000e+03, -7.5000e+02, -4.0000e+02, -9.0000e+02,
       -3.5000e+02,  8.0000e+02, -4.0000e+02, -4.0000e+02,  0.0000e+00,
        0.0000e+00,  9.0000e+02, -4.5000e+02, -4.5000e+02,  4.7000e+01,
        5.6650e+02, -6.5500e+01, -4.1550e+02,  7.5850e+02, -3.6500e+02,
       -4.4400e+02,  6.9500e+01, -3.2000e+01,  1.2345e+03, -3.2750e+02,
       -7.3100e+02,  5.9500e+01,  1.8030e+03, -1.9050e+02, -5.7600e+02,
        3.4925e+03,  1.5285e+03, -1.6750e+02, -9.7000e+01,  3.8500e+01,
        1.3770e+03, -5.9800e+02, -7.0600e+02,  6.7600e+02,  3.1000e+02,
       -2.0810e+03, -8.3000e+02,  1.9395e+03, -9.6000e+02, -9.4400e+02,
        1.1000e+01,  2.3950e+02,  1.9715e+03, -9.7500e+01, -8.9850e+02,
        9.5200e+02,  6.0850e+02, -1.0670e+03, -1.7855e+03,  1.4395e+03,
       -6.6950e+02, -2.9100e+02, -1.0050e+02, -1.8300e+02,  2.3135e+03,
       -8.2900e+02,  4.6000e+01,  1.6030e+03])
d1 = [np.nan] +diff[:-1].tolist()
d2 = diff[1:].tolist() +[np.nan]
df1[(diff-d1)*(diff-d2)>0][['year','month']]
/home/myth/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in greater
     year month
4    2002   May
5    2002   Jun
7    2002   Aug
8    2002   Sep
10   2002   Nov
..    ...   ...
195  2018   May
197  2018   Jul
198  2018   Aug
199  2018   Sep
200  2018   Oct

95 rows × 2 columns
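The NumPy splicing above can be written directly on the Series with `rolling`; this sketch (on a hypothetical short issuance series) also reads off where the gain attains its maximum and minimum:

```python
import pandas as pd

# Hypothetical stand-in for df1['Total number of license issued']
issued = pd.Series([1400, 1800, 2000, 2300, 1000])

# gain = this month minus the mean of the previous two months,
# with the first two months filled with 0
gain = (issued - issued.rolling(2).mean().shift(1)).fillna(0)
print(gain.tolist())
print(gain.idxmax(), gain.idxmin())
```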

2. Cargo Flight Traffic at Russian Airports, 2007-2019

df2 = pd.read_csv('2007年-2019年俄罗斯货运航班运载量.csv')
df2.head()
            Airport name  Year  January  February   March   April     May    June    July  August  September  October  November  December  Whole year                            Airport coordinates
0                 Abakan  2019    44.70     66.21   72.70   75.82  100.34   78.38   63.88   73.06      66.74    75.44    110.50     89.80      917.57   (Decimal('91.399735'), Decimal('53.751351'))
1                 Aikhal  2019     0.00      0.00    0.00    0.00    0.00    0.00    0.00    0.00       0.00     0.00      0.00      0.00        0.00  (Decimal('111.543324'), Decimal('65.957161'))
2                   Loss  2019     0.00      0.00    0.00    0.00    0.00    0.00    0.00    0.00       0.00     0.00      0.00      0.00        0.00  (Decimal('125.398355'), Decimal('58.602489'))
3                Amderma  2019     0.00      0.00    0.00    0.00    0.00    0.00    0.00    0.00       0.00     0.00      0.00      0.00        0.00   (Decimal('61.577429'), Decimal('69.759076'))
4        Anadyr (Carbon)  2019    81.63    143.01  260.90  304.36  122.00  106.87   84.99  130.00     102.00   118.00     94.00    199.00     1746.76  (Decimal('177.738273'), Decimal('64.713433'))

(1) Compute the total cargo volume of all flights for each year.

df2.groupby('Year')['Whole year'].sum()
Year
2007    659438.23
2008    664682.46
2009    560809.77
2010    693033.98
2011    818691.71
2012    846388.03
2013    792337.08
2014    729457.12
2015    630208.97
2016    679370.15
2017    773662.28
2018    767095.28
2019    764606.27
Name: Whole year, dtype: float64

(2) Are the recorded airports the same in every year?

df2.groupby('Year')['Airport name'].unique()
Year
2007    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2008    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2009    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2010    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2011    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2012    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2013    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2014    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2015    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2016    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2017    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2018    [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2019    [Abakan, Aikhal, Loss, Amderma, Anadyr (Carbon...
Name: Airport name, dtype: object

No; the set of airports recorded differs across years.
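Rather than eyeballing truncated lists, counting distinct airports per year (or differencing the per-year sets) settles it. A sketch on hypothetical rows:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2007, 2007, 2008, 2008, 2008],
                   'Airport name': ['Abakan', 'Aikhal', 'Abakan', 'Aikhal', 'Anapa']})

counts = df.groupby('Year')['Airport name'].nunique()
print(counts.nunique() == 1)      # a single distinct count would be necessary
                                  # (though not sufficient) for identical sets

sets = df.groupby('Year')['Airport name'].apply(set)
print(sets[2008] - sets[2007])    # airports recorded in 2008 but not in 2007
```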

(3) For each year from 2010 through 2015, compute the proportion of airport records whose whole-year cargo volume is 0.

subs = df2[df2['Year']>=2010]
subs = subs[subs['Year']<=2015]
subs[subs['Whole year']==0].groupby('Year')['Whole year'].count()/subs.groupby('Year')['Whole year'].count()
Year
2010    0.767123
2011    0.770548
2012    0.770548
2013    0.770548
2014    0.770548
2015    0.770548
Name: Whole year, dtype: float64
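The count-over-count division can be collapsed: the mean of a boolean mask is exactly the proportion of True values. A sketch on hypothetical rows:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2010, 2010, 2011, 2011, 2016],
                   'Whole year': [0.0, 12.5, 0.0, 0.0, 3.0]})

subs = df[df['Year'].between(2010, 2015)]
# proportion of zero records per year = mean of the boolean mask
prop = subs['Whole year'].eq(0).groupby(subs['Year']).mean()
print(prop)
```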

(4) If an airport has at least five years in which all monthly volume records are 0, remove all of its records from the table, and return the processed table.

# count the all-zero years per airport and keep only airports with five or more
cnt = df2[df2['Whole year']==0].groupby('Airport name')['Whole year'].count()
ind = cnt[cnt>=5].index
df2 = df2[~df2['Airport name'].isin(ind)]
df2.reset_index(drop=True).head()
             Airport name  Year  January  February   March   April     May    June    July  August  September  October  November  December  Whole year                            Airport coordinates
0                  Abakan  2019    44.70     66.21   72.70   75.82  100.34   78.38   63.88   73.06      66.74    75.44    110.50     89.80      917.57   (Decimal('91.399735'), Decimal('53.751351'))
1         Anadyr (Carbon)  2019    81.63    143.01  260.90  304.36  122.00  106.87   84.99  130.00     102.00   118.00     94.00    199.00     1746.76  (Decimal('177.738273'), Decimal('64.713433'))
2       Anapa (Vitjazevo)  2019    45.92     53.15   54.00   54.72   52.00   67.45  172.31   72.57      70.00    63.00     69.00     82.10      856.22   (Decimal('37.341511'), Decimal('45.003748'))
3    Arkhangelsk (Talagy)  2019    85.61    118.70  131.39  144.82  137.95  140.18  128.56  135.68     124.75   139.60    210.27    307.10     1804.61   (Decimal('40.714892'), Decimal('64.596138'))
4  Astrakhan (Narimanovo)  2019    51.75     61.08   65.60   71.84   71.38   63.95  164.86   79.46      85.21    87.23     79.06     99.16      980.58   (Decimal('47.999896'), Decimal('46.287344'))
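The same rule can be expressed in one step with `groupby(...).filter`, which keeps only the groups passing a predicate. A sketch on hypothetical rows (the threshold of 5 comes from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'Airport name': ['A']*6 + ['B']*6,
    'Whole year':   [0, 0, 0, 0, 0, 1.2,   0, 3.4, 5.6, 0, 7.8, 9.0],
})

# keep airports with fewer than five all-zero years
kept = df.groupby('Airport name').filter(
    lambda g: g['Whole year'].eq(0).sum() < 5
)
print(kept['Airport name'].unique())
```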

(5) Partition all airports into four regions (east, south, west, north) in a reasonable way, and report the region with the largest total cargo volume over 2017-2019.

A question here: does the northeast count as east or as north?
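One defensible convention that dodges the northeast question: split at the median longitude and latitude, so every airport lands in exactly one of the four quadrants NE/NW/SE/SW. A sketch on hypothetical coordinates (the real column stores Decimal tuples, which would need the same kind of string parsing):

```python
import pandas as pd

df = pd.DataFrame({
    'Airport name': ['A', 'B', 'C', 'D'],
    'Year': [2017, 2017, 2018, 2019],
    'Whole year': [10.0, 20.0, 30.0, 40.0],
    'Airport coordinates': ['(91.4, 53.8)', '(111.5, 66.0)',
                            '(37.3, 45.0)', '(177.7, 64.7)'],
})

recent = df[df['Year'].between(2017, 2019)]
coords = (recent['Airport coordinates'].str.strip('()')
          .str.split(', ', expand=True).astype(float))
lon, lat = coords[0], coords[1]

# north/south and east/west by comparison with the medians
ns = pd.Series('S', index=recent.index).mask(lat >= lat.median(), 'N')
ew = pd.Series('W', index=recent.index).mask(lon >= lon.median(), 'E')
totals = recent.groupby(ns + ew)['Whole year'].sum()
print(totals.idxmax())
```

Taking the medians over unique airports rather than over rows would be a refinement on real data.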

(6) In statistics, ranks are often used for ordering. Define the rank of an airport in a given month of a given year as the ranking of that month's cargo volume among that airport's twelve months of the year (e.g. if airport ***'s volume in January 2019 ranks first among the twelve months of 2019, its rank is 1). The rank method for judging a month's relative traffic is then to add up the ranks of all airports for that month; call this sum the month's rank composite index. Compute the rank composite index for the twelve months of 2016.

df20 = df2[df2['Year']==2016]
# Dead end: referencing .rank without calling it just returns the method
df2.groupby('Airport name').rank
# Dead end: apply() requires a function argument
df2.groupby('Airport name')['Whole year'].apply()

TypeError: apply() missing 1 required positional argument: 'func'

# How many yearly rows does each airport have?
df2.groupby('Airport name').count()
                      Year  January  February  ...  Whole year  Airport coordinates
Airport name
Abakan                  13       13        13  ...          13                   13
Anadyr                  12       12        12  ...          12                   12
Anadyr (Carbon)          1        1         1  ...           1                    1
Anapa (Vitjazevo)       13       13        13  ...          13                   13
Arkhangelsk (Talagy)    13       13        13  ...          13                   13
...                    ...      ...       ...  ...         ...                  ...
Yamburg                 13       13        13  ...          13                   13
Yaroslavl (Tunoshna)    13       13        13  ...          13                   13
Yuzhno-(Khomutovo)      13       13        13  ...          13                   13
Zabaykalsk              13       13        13  ...          13                   13
red                     13       13        13  ...          13                   13

67 rows × 15 columns

# Abakan as an example: one row per year, 2007 through 2019
df2[df2['Airport name']=='Abakan']
    Airport name  Year  January  February   March   April     May    June    July  August  September  October  November  December  Whole year  ...
0         Abakan  2019    44.70     66.21   72.70   75.82  100.34   78.38   63.88   73.06      66.74    75.44    110.50     89.80      917.57  ...
63        Abakan  2018    63.39     60.98   69.46   88.62   83.81   64.99   49.64   70.26      87.35    76.78     80.19     98.93      894.40  ...
124       Abakan  2017    48.36     77.08   95.33   90.28   91.93   57.13   82.59   81.20      76.05   139.67     71.58     88.84     1000.04  ...
185       Abakan  2016    34.10     45.41   58.97   72.71   91.66   78.49   64.31  127.29      89.07    74.99     78.66     94.01      909.67  ...
246       Abakan  2015    37.70     47.97   54.67   82.12   68.81  112.95   55.83   95.20     137.79    72.13     63.67     78.30      907.14  ...
307       Abakan  2014    51.90     69.42   83.36   69.71   75.50   92.16   67.50   89.22      76.95    78.56    153.68     83.38      991.34  ...
368       Abakan  2013    39.74    227.16   62.97   77.41  104.34   55.71   65.95   78.71      75.68    70.30     81.57     83.34     1022.88  ...
429       Abakan  2012    39.24     39.24   60.58   55.53   56.51   57.35   64.47   75.88      66.40   120.62     64.38     81.25      781.45  ...
490       Abakan  2011    53.30     59.40  162.40   72.40   90.10   92.10   81.30   73.20      77.00    84.30     91.17    124.56     1061.23  ...
551       Abakan  2010    35.10    192.20   55.20  108.70  118.80   69.30   87.70   77.00     181.40    88.50     83.30    110.60     1207.80  ...
612       Abakan  2009    40.10     36.90   50.30   47.00   58.20   77.60   83.90  404.70      65.10    68.20    182.20    362.50     1476.70  ...
673       Abakan  2008    65.60     69.00   75.20   68.80  158.70   99.60   88.10   77.30      75.40    90.30     67.20     51.30      986.50  ...
734       Abakan  2007   742.20     71.30   72.80   83.50  102.20   83.30   67.70   85.60      88.30    73.70     72.10     92.20     1634.90  ...
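What the question defines is a row-wise ranking: each airport's twelve months are ranked among themselves (rank 1 for the busiest month, per the example in the question), then the ranks are summed over airports month by month. A sketch on a hypothetical 2016 slice with three stand-in month columns:

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar']   # stand-ins for the 12 month columns

df16 = pd.DataFrame({'Airport name': ['A', 'B'],
                     'Jan': [10.0, 5.0],
                     'Feb': [30.0, 1.0],
                     'Mar': [20.0, 9.0]})

# rank each airport's months among themselves; rank 1 = busiest month
ranks = df16[months].rank(axis=1, ascending=False)
index16 = ranks.sum()            # rank composite index, one value per month
print(index16.tolist())
```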

3. The Spread of COVID-19 in the United States

df3q = pd.read_csv('美国确证数.csv')
df3s = pd.read_csv('美国死亡数.csv')
df3q.head()
        UID iso2 iso3  code3  FIPS   Admin2 Province_State Country_Region        Lat      Long_  ...  2020/4/17  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26
0  84001001   US  USA    840  1001  Autauga        Alabama             US  32.539527 -86.644082  ...         26         25         26         28         30         32         33         36         36         37
1  84001003   US  USA    840  1003  Baldwin        Alabama             US  30.727750 -87.722071  ...        103        109        112        117        123        132        143        147        147        161
2  84001005   US  USA    840  1005  Barbour        Alabama             US  31.868263 -85.387129  ...         15         18         20         22         28         29         30         32         32         33
3  84001007   US  USA    840  1007     Bibb        Alabama             US  32.996421 -87.125115  ...         24         26         28         32         32         34         33         34         34         38
4  84001009   US  USA    840  1009   Blount        Alabama             US  33.982109 -86.567906  ...         20         20         21         22         26         29         31         31         31         34

5 rows × 107 columns

df3s.head()
        UID iso2 iso3  code3  FIPS   Admin2 Province_State Country_Region        Lat      Long_  ...  2020/4/17  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26
0  84001001   US  USA    840  1001  Autauga        Alabama             US  32.539527 -86.644082  ...          2          2          2          1          1          2          2          2          2          2
1  84001003   US  USA    840  1003  Baldwin        Alabama             US  30.727750 -87.722071  ...          2          2          2          3          3          3          3          3          3          3
2  84001005   US  USA    840  1005  Barbour        Alabama             US  31.868263 -85.387129  ...          0          0          0          0          0          0          0          0          0          0
3  84001007   US  USA    840  1007     Bibb        Alabama             US  32.996421 -87.125115  ...          0          0          0          0          0          0          0          0          0          0
4  84001009   US  USA    840  1009   Blount        Alabama             US  33.982109 -86.567906  ...          0          0          0          0          0          0          0          0          0          0

5 rows × 108 columns

df3s.columns[:20]
Index(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
       'Country_Region', 'Lat', 'Long_', 'Combined_Key', 'Population',
       '2020/1/22', '2020/1/23', '2020/1/24', '2020/1/25', '2020/1/26',
       '2020/1/27', '2020/1/28', '2020/1/29'],
      dtype='object')

(1) Use corr() to compute the correlation coefficient between county population (each row is a county) and the death count on the last recorded date in the table.

df3s[['Population','2020/4/26']].corr()
             Population  2020/4/26
Population     1.000000   0.403844
2020/4/26      0.403844   1.000000

(2) As of April 1, compute for each state the proportion of counties with zero infections.

df3q['zero41'] = df3q.iloc[:,12:83].max(axis=1)   # date columns 2020/1/22 through 2020/4/1
(df3q[df3q['zero41']==0].groupby('Province_State')['UID'].count()/df3q.groupby('Province_State')['UID'].count())
Province_State
Alabama                 0.104478
Alaska                  0.793103
Arizona                      NaN
Arkansas                0.293333
California              0.137931
Colorado                0.218750
Connecticut                  NaN
Delaware                     NaN
District of Columbia         NaN
Florida                 0.164179
Georgia                 0.119497
Hawaii                  0.200000
Idaho                   0.386364
Illinois                0.450980
Indiana                 0.108696
Iowa                    0.404040
Kansas                  0.600000
Kentucky                0.433333
Louisiana               0.046875
Maine                   0.187500
Maryland                0.041667
Massachusetts           0.142857
Michigan                0.168675
Minnesota               0.356322
Mississippi             0.060976
Missouri                0.391304
Montana                 0.625000
Nebraska                0.741935
Nevada                  0.470588
New Hampshire                NaN
New Jersey                   NaN
New Mexico              0.424242
New York                0.080645
North Carolina          0.150000
North Dakota            0.528302
Ohio                    0.181818
Oklahoma                0.376623
Oregon                  0.277778
Pennsylvania            0.104478
Rhode Island                 NaN
South Carolina          0.043478
South Dakota            0.530303
Tennessee               0.105263
Texas                   0.437008
Utah                    0.448276
Vermont                 0.142857
Virginia                0.248120
Washington              0.102564
West Virginia           0.472727
Wisconsin               0.319444
Wyoming                 0.347826
Name: UID, dtype: float64
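The NaN rows above are states where no county stayed at zero, so the numerator count is missing rather than the ratio being undefined; taking the mean of the boolean mask per state yields 0 there instead. A sketch on a hypothetical slice:

```python
import pandas as pd

# Hypothetical counties: max cases through April 1 per county
df = pd.DataFrame({'Province_State': ['Arizona', 'Arizona', 'Alaska', 'Alaska'],
                   'zero41': [3, 7, 0, 9]})

# proportion of zero-infection counties = mean of the boolean mask
prop = df['zero41'].eq(0).groupby(df['Province_State']).mean()
print(prop)
```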

(3) Find the three counties where confirmed cases appeared earliest.

df3q.sort_values(by=df3q.columns[12:].tolist(),ascending=False)[:3]
           UID iso2 iso3  code3   FIPS       Admin2 Province_State Country_Region        Lat       Long_  ...  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26  zero41
2969  84053033   US  USA    840  53033         King     Washington             US  47.491379 -121.834613  ...       4902       5174       5174       5293       5379       5532       5637       5739       5863    2656
610   84017031   US  USA    840  17031         Cook       Illinois             US  41.841448  -87.816588  ...      20395      21272      22101      23181      24546      25811      27616      29058      30574    5575
204   84006037   US  USA    840   6037  Los Angeles     California             US  34.308284 -118.228241  ...      12021      12341      13823      15153      16447      17537      18545      19133      19567    4045

3 rows × 108 columns
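Sorting on every date column at once is fragile; since the columns are cumulative, the first column where a county turns positive is its first-case date, which `idxmax` over a boolean frame reads off directly. A sketch on hypothetical cumulative columns:

```python
import pandas as pd

df = pd.DataFrame({'Admin2': ['King', 'Cook', 'Los Angeles', 'Autauga'],
                   '2020/1/22': [1, 0, 0, 0],
                   '2020/1/23': [1, 0, 1, 0],
                   '2020/1/24': [1, 1, 1, 0],
                   '2020/1/25': [1, 1, 1, 1]})

dates = df.columns[1:]
# first column that is > 0 per row (counties with no cases at all
# would need masking out first)
first_case = pd.to_datetime(df[dates].gt(0).idxmax(axis=1))
earliest = df.loc[first_case.nsmallest(3).index, 'Admin2']
print(earliest.tolist())
```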

(4) Compute the single-day increase in deaths by state, and report which state on which day had the largest increase in confirmed cases (the maximum is taken jointly over all states and all days, not separately).

df3q.groupby('Province_State').sum().iloc[:,6:].diff(axis=1).max(axis=1).idxmax()
'New York'
df3q.groupby('Province_State').sum().iloc[:,6:].diff(axis=1).max(axis=0).idxmax()
'2020/4/15'
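The two cells above take the maximum in two separate passes; stacking the daily-increase frame yields the (state, day) pair from one `idxmax`. A sketch on a hypothetical two-state frame of cumulative counts:

```python
import pandas as pd

# Hypothetical cumulative confirmed counts per state (columns are dates)
daily = pd.DataFrame({'2020/4/14': [100, 30], '2020/4/15': [900, 50]},
                     index=['New York', 'Texas'])

inc = daily.diff(axis=1)          # single-day increases
state, day = inc.stack().idxmax() # one idxmax over the (state, day) pairs
print(state, day)
```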

(5) For each state, compile a confirmed-and-deaths table whose first column is the date, starting from the first day the state began to record deaths, and whose second and third columns are the confirmed and death counts. Save each state as a separate csv file named "StateName.csv".

for state in df3s['Province_State'].unique():
    df_q = df3q[df3q['Province_State']==state].iloc[:,12:].sum()
    df_s = df3s[df3s['Province_State']==state].iloc[:,12:].sum()
    df_q = df_q[df_s.cumsum()>0].reset_index()
    df_s = df_s[df_s.cumsum()>0].reset_index()
    df_q.columns=['Date','确诊人数']
    df_s.columns=['Date','死亡人数']
    pd.merge(df_q,df_s,on='Date',how='inner').to_csv(state+'.csv')

(6) For each day from April 1 to April 10, compile a table of newly confirmed cases and new deaths, whose first column is the state name and whose second and third columns are the new confirmed and new death counts. Save ten separate csv files named "Date.csv".

for date in range(10):
    # df3s has one extra numeric column (Population), so after groupby().sum()
    # its date columns start one position later than df3q's
    df_q = df3q.groupby('Province_State').sum().iloc[:,75:].iloc[:,date]
    df_s = df3s.groupby('Province_State').sum().iloc[:,76:].iloc[:,date]
    df_r = pd.DataFrame(df_q).assign(s=df_s).reset_index()
    df_r.columns = ['州名','确诊人数','死亡人数']
    df_r.to_csv('4月'+str(date+1)+'日.csv')   # files for April 1 .. April 10
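The loop above writes each day's cumulative column, while the question asks for new (新增) counts; differencing the date columns first gives those. A sketch on hypothetical cumulative state totals:

```python
import pandas as pd

# Hypothetical cumulative confirmed counts per state (columns are dates)
cum = pd.DataFrame({'2020/3/31': [5, 0], '2020/4/1': [8, 2], '2020/4/2': [9, 2]},
                   index=['Alabama', 'Alaska'])

new = cum.diff(axis=1)     # daily new counts = difference of cumulative columns
apr1 = new['2020/4/1']     # the column that would go into 4月1日.csv
print(apr1.tolist())
```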