一、Shanghai Vehicle License Plate Auctions, 2002-2018
import numpy as np
import pandas as pd

df1 = pd.read_csv('2002年-2018年上海机动车拍照拍卖.csv')
df1
       Date  Total number of license issued  lowest price  avg price  Total number of applicants
0     2-Jan                            1400         13600      14735                        3718
1     2-Feb                            1800         13100      14057                        4590
2     2-Mar                            2000         14300      14662                        5190
3     2-Apr                            2300         16000      16334                        4806
4     2-May                            2350         17800      18357                        4665
..      ...                             ...           ...        ...                         ...
198  18-Aug                           10402         88300      88365                      192755
199  18-Sep                           12712         87300      87410                      189142
200  18-Oct                           10728         88000      88070                      181861
201  18-Nov                           11766         87300      87374                      177355
202  18-Dec                           12850         87400      87508                      165442

203 rows × 5 columns
(1) In which auction did the winning rate first fall below 5%?
df1[(df1['Total number of license issued'] / df1['Total number of applicants']) < 0.05]
       Date  Total number of license issued  lowest price  avg price  Total number of applicants
159  15-May                            7482         79000      79099                      156007
160  15-Jun                            7441         80000      80020                      172205
161  15-Jul                            7531         83100      83171                      166302
162  15-Aug                            7454         82600      82642                      166939
164  15-Oct                            7763         85300      85424                      170995
165  15-Nov                            7514         84600      84703                      169159
166  15-Dec                            7698         84500      84572                      179133
168  16-Feb                            8363         83200      83244                      196470
169  16-Mar                            8310         83100      83148                      221109
170  16-Apr                           11829         85100      85127                      256897
171  16-May                           11598         85000      85058                      277889
172  16-Jun                           11546         84400      84483                      275438
173  16-Jul                           11475         87200      87235                      240750
174  16-Aug                           11549         86900      86946                      251188
180  17-Feb                           10157         88200      88240                      251717
181  17-Mar                           10356         87800      87916                      262010
182  17-Apr                           12196         89800      89850                      252273
183  17-May                           10316         90100      90209                      270197
184  17-Jun                           10312         89400      89532                      244349
185  17-Jul                           10325         92200      92250                      269189
186  17-Aug                           10558         91600      91629                      256083
187  17-Sep                           12413         91300      91415                      250566
188  17-Oct                           11388         93500      93540                      244868
189  17-Nov                           11002         93100      93130                      226911
193  18-Mar                            9855         88100      88176                      217056
The first such auction is 15-May (May 2015).
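To return the single answer programmatically rather than reading it off the filtered table, `idxmax` on the boolean mask gives the first row where the rate drops below 5% (`idxmax` returns the first `True`; it assumes at least one `True` exists). A sketch on a small stand-in for `df1`:

```python
import pandas as pd

# a tiny stand-in for df1; the real values come from the CSV
df = pd.DataFrame({
    'Date': ['15-Apr', '15-May', '15-Jun'],
    'Total number of license issued': [7800, 7482, 7441],
    'Total number of applicants': [152298, 156007, 172205],
})
rate = df['Total number of license issued'] / df['Total number of applicants']
first = df.loc[(rate < 0.05).idxmax(), 'Date']  # first row whose rate is below 5%
print(first)
```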
(2) By year, compute the following statistics of the lowest auction price: maximum, mean, and 0.75 quantile, shown together in one table.
df1['year'] = df1['Date'].apply(lambda x: '20' + x[:2] if x[2] == '-' else '200' + x[0])  # '18-Aug' -> '2018', '2-Jan' -> '2002'
df1.head()
    Date  Total number of license issued  lowest price  avg price  Total number of applicants  year
0  2-Jan                            1400         13600      14735                        3718  2002
1  2-Feb                            1800         13100      14057                        4590  2002
2  2-Mar                            2000         14300      14662                        5190  2002
3  2-Apr                            2300         16000      16334                        4806  2002
4  2-May                            2350         17800      18357                        4665  2002
np.quantile?
df1.groupby('year')['lowest price '].agg([max, np.mean, lambda x: x.quantile(0.75)]).rename(columns={'<lambda_0>': '0.75分位数'})  # note the trailing space in the column name
        max          mean  0.75分位数
year
2002  30800  20316.666667     24300
2003  38500  31983.333333     36300
2004  44200  29408.333333     38400
2005  37900  31908.333333     35600
2006  39900  37058.333333     39525
2007  53800  45691.666667     48950
2008  37300  29945.454545     34150
2009  36900  31333.333333     34150
2010  44900  38008.333333     41825
2011  53800  47958.333333     51000
2012  68900  61108.333333     65325
2013  90800  79125.000000     82550
2014  74600  73816.666667     74000
2015  85300  80575.000000     83450
2016  88600  85733.333333     87475
2017  93500  90616.666667     92350
2018  89000  87825.000000     88150
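Since pandas 0.25, named aggregation avoids the `<lambda_0>` rename step. A sketch on a synthetic stand-in for the year/lowest-price columns:

```python
import pandas as pd

# synthetic stand-in: two years of lowest prices
df = pd.DataFrame({
    'year': ['2002'] * 3 + ['2003'] * 3,
    'lowest price ': [13600, 13100, 14300, 38500, 30000, 32000],
})
# keyword form of agg names the output columns directly
out = df.groupby('year')['lowest price '].agg(
    max='max',
    mean='mean',
    q75=lambda x: x.quantile(0.75),
)
print(out)
```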
(3) Split the first (date) column into two columns: the year (formatted as 20××) and the month (English abbreviation). Insert them as the first and second columns, drop the original first column, and shift the remaining columns back accordingly.
df1['month'] = df1['Date'].apply(lambda x: x[-3:])
df1.head()
    Date  Total number of license issued  lowest price  avg price  Total number of applicants  year month
0  2-Jan                            1400         13600      14735                        3718  2002   Jan
1  2-Feb                            1800         13100      14057                        4590  2002   Feb
2  2-Mar                            2000         14300      14662                        5190  2002   Mar
3  2-Apr                            2300         16000      16334                        4806  2002   Apr
4  2-May                            2350         17800      18357                        4665  2002   May
df1 = df1.loc[:, ['year', 'month', 'Total number of license issued', 'lowest price ',
                  'avg price', 'Total number of applicants']]
df1.head()
   year month  Total number of license issued  lowest price  avg price  Total number of applicants
0  2002   Jan                            1400         13600      14735                        3718
1  2002   Feb                            1800         13100      14057                        4590
2  2002   Mar                            2000         14300      14662                        5190
3  2002   Apr                            2300         16000      16334                        4806
4  2002   May                            2350         17800      18357                        4665
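An alternative for steps (2)-(3): split the `Date` column once with `str.split` and zero-pad the year digits, instead of indexing into the string. A sketch on a hypothetical sample of the `Date` values:

```python
import pandas as pd

# hypothetical sample of the original 'Date' column
date = pd.Series(['2-Jan', '15-May', '18-Dec'])
parts = date.str.split('-', expand=True)  # column 0: year digits, column 1: month
year = '20' + parts[0].str.zfill(2)       # '2' -> '02' -> '2002'; '18' -> '2018'
month = parts[1]
print(year.tolist(), month.tolist())
```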
(4) Now turn the row index into a multi-level index: the outer level is the year, the inner level holds the variable names of columns 2-5 of the original table, and the column index is the months.
df1.set_index(['year', 'Total number of license issued', 'lowest price ',
               'avg price', 'Total number of applicants'])
                                                                                month
year Total number of license issued lowest price  avg price Total number of applicants
2002 1400                           13600         14735     3718                  Jan
     1800                           13100         14057     4590                  Feb
     2000                           14300         14662     5190                  Mar
     2300                           16000         16334     4806                  Apr
     2350                           17800         18357     4665                  May
...                                                                               ...
2018 10402                          88300         88365     192755                Aug
     12712                          87300         87410     189142                Sep
     10728                          88000         88070     181861                Oct
     11766                          87300         87374     177355                Nov
     12850                          87400         87508     165442                Dec

203 rows × 1 columns
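The cell above puts the data values themselves into the index, so the result still has `month` as its only column rather than the requested layout (year and variable name as row levels, months as columns). One way to get that layout, sketched on a minimal stand-in frame, is `stack` followed by `unstack`:

```python
import pandas as pd

# minimal stand-in for the reordered df1 (year, month, then value columns)
df = pd.DataFrame({
    'year': ['2002', '2002', '2003'],
    'month': ['Jan', 'Feb', 'Jan'],
    'Total number of license issued': [1400, 1800, 9000],
    'lowest price ': [13600, 13100, 32000],
})
# stack -> long form indexed by (year, month, variable); unstack moves month to columns
wide = df.set_index(['year', 'month']).stack().unstack('month')
print(wide)
```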
(5) In general, the difference between a month's lowest price and the previous month's lowest price has the same sign as the corresponding difference in average price. Which auction dates do not follow this pattern?
df1[df1['lowest price '].diff() * df1['avg price'].diff() < 0][['year', 'month']]
     year month
21   2003   Oct
22   2003   Nov
29   2004   Jun
36   2005   Jan
37   2005   Feb
44   2005   Sep
52   2006   May
56   2006   Sep
60   2007   Jan
61   2007   Feb
71   2007   Dec
128  2012   Oct
df1.columns
Index(['year', 'month', 'Total number of license issued', 'lowest price ',
'avg price', 'Total number of applicants'],
dtype='object')
(6) Define a month's issuance gain as the difference between that month's license issuance and the mean issuance of the previous two months (fill the first two months with 0). Find the dates at which the extrema of the issuance gain occur.
issued = df1['Total number of license issued'].values
avg = (issued[1:-1] + issued[:-2]) / 2                 # mean of the two preceding months
avg = np.array([issued[0], issued[1]] + avg.tolist())  # first two months: gain forced to 0
diff = issued - avg
diff
array([ 0.0000e+00, 0.0000e+00, 4.0000e+02, 4.0000e+02, 2.0000e+02,
4.7500e+02, 4.2500e+02, 1.0000e+02, 2.0000e+02, 1.0000e+02,
0.0000e+00, 4.0000e+02, -4.0000e+02, -3.0000e+02, 0.0000e+00,
3.0000e+02, 6.5000e+02, 1.9500e+03, 1.3500e+03, -1.2500e+03,
1.4000e+03, -1.0750e+03, -5.3300e+02, 5.0000e+00, 9.1000e+01,
-8.8000e+01, -1.0000e+02, 7.0000e+02, 1.3770e+03, 2.1950e+02,
2.2000e+02, 3.8350e+02, -6.0000e+01, -1.2000e+02, -2.0000e+01,
-1.1000e+03, -5.5000e+02, -1.7000e+03, -6.5000e+02, 1.1000e+03,
1.3330e+03, 2.7350e+02, 5.6450e+02, 8.2100e+02, 1.2250e+02,
-7.6450e+02, -6.5000e+02, -1.5000e+02, -7.0000e+02, -1.5500e+03,
1.0000e+02, 8.5000e+02, -2.5000e+02, -2.5000e+02, 1.0000e+03,
1.2000e+03, 6.5000e+02, 1.5000e+02, -5.0000e+02, 2.5000e+02,
-2.5000e+02, -2.7500e+03, -7.5000e+02, 1.7500e+03, 1.2500e+03,
2.5000e+02, -5.0000e+02, 2.2500e+03, 1.7500e+03, -7.5000e+02,
-5.0000e+02, 0.0000e+00, 8.5000e+03, -2.4500e+03, -3.6500e+03,
-9.5000e+02, -9.0000e+02, -1.1500e+03, -1.2500e+03, 1.0000e+02,
-1.2500e+03, -2.5000e+02, -7.5000e+02, 2.0000e+02, 3.5000e+02,
8.0000e+02, 9.0000e+02, 9.5000e+02, 1.1500e+03, 4.0000e+02,
0.0000e+00, 5.0000e+02, -2.5000e+02, -2.5000e+02, 0.0000e+00,
0.0000e+00, -5.0000e+02, 2.5000e+02, 7.5000e+02, 2.5000e+02,
7.0000e+02, 1.5000e+02, -1.0000e+02, 0.0000e+00, 0.0000e+00,
-5.0000e+02, 2.5000e+02, -7.5000e+02, -1.0000e+03, 2.5000e+02,
2.5000e+02, 1.0000e+03, 5.0000e+02, 0.0000e+00, 0.0000e+00,
5.0000e+02, -2.5000e+02, -2.5000e+02, -5.0000e+02, -7.5000e+02,
-2.5000e+02, 0.0000e+00, 5.0000e+02, 1.0500e+03, 6.0000e+02,
1.0000e+02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, -5.0000e+02, -2.5000e+02, 0.0000e+00, 2.0000e+03,
-1.0000e+03, -1.0000e+03, 0.0000e+00, 0.0000e+00, 0.0000e+00,
1.0000e+03, -1.0000e+03, -7.5000e+02, -4.0000e+02, -9.0000e+02,
-3.5000e+02, 8.0000e+02, -4.0000e+02, -4.0000e+02, 0.0000e+00,
0.0000e+00, 9.0000e+02, -4.5000e+02, -4.5000e+02, 4.7000e+01,
5.6650e+02, -6.5500e+01, -4.1550e+02, 7.5850e+02, -3.6500e+02,
-4.4400e+02, 6.9500e+01, -3.2000e+01, 1.2345e+03, -3.2750e+02,
-7.3100e+02, 5.9500e+01, 1.8030e+03, -1.9050e+02, -5.7600e+02,
3.4925e+03, 1.5285e+03, -1.6750e+02, -9.7000e+01, 3.8500e+01,
1.3770e+03, -5.9800e+02, -7.0600e+02, 6.7600e+02, 3.1000e+02,
-2.0810e+03, -8.3000e+02, 1.9395e+03, -9.6000e+02, -9.4400e+02,
1.1000e+01, 2.3950e+02, 1.9715e+03, -9.7500e+01, -8.9850e+02,
9.5200e+02, 6.0850e+02, -1.0670e+03, -1.7855e+03, 1.4395e+03,
-6.6950e+02, -2.9100e+02, -1.0050e+02, -1.8300e+02, 2.3135e+03,
-8.2900e+02, 4.6000e+01, 1.6030e+03])
d1 = [np.nan] + diff[:-1].tolist()  # previous gain
d2 = diff[1:].tolist() + [np.nan]   # next gain
df1[(diff - d1) * (diff - d2) > 0][['year', 'month']]  # strict local extrema; NaN endpoints trigger the warning below
/home/myth/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in greater
This is separate from the ipykernel package so we can avoid doing imports until
     year month
4    2002   May
5    2002   Jun
7    2002   Aug
8    2002   Sep
10   2002   Nov
..    ...   ...
195  2018   May
197  2018   Jul
198  2018   Aug
199  2018   Sep
200  2018   Oct

95 rows × 2 columns
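The NumPy construction above can also be written with pandas alone: the mean of the two preceding months is a rolling window of size 2 shifted by one. A sketch on the first five issuance values; the result matches the head of the `diff` array printed above:

```python
import pandas as pd

# stand-in for the head of the issuance column
x = pd.Series([1400, 1800, 2000, 2300, 2350])
# gain: this month minus the mean of the previous two months; first two months -> 0
gain = (x - x.rolling(2).mean().shift()).fillna(0)
print(gain.tolist())
```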
二、Cargo Volume of Russian Airport Freight Flights, 2007-2019
df2 = pd.read_csv('2007年-2019年俄罗斯货运航班运载量.csv')
df2.head()
      Airport name  Year  January  February   March   April     May    June   July  August  September  October  November  December  Whole year                              Airport coordinates
0           Abakan  2019    44.70     66.21    72.7   75.82  100.34   78.38  63.88   73.06      66.74    75.44     110.5      89.8      917.57   (Decimal('91.399735'), Decimal('53.751351'))
1           Aikhal  2019     0.00      0.00     0.0    0.00    0.00    0.00   0.00    0.00       0.00     0.00       0.0       0.0        0.00  (Decimal('111.543324'), Decimal('65.957161'))
2             Loss  2019     0.00      0.00     0.0    0.00    0.00    0.00   0.00    0.00       0.00     0.00       0.0       0.0        0.00  (Decimal('125.398355'), Decimal('58.602489'))
3          Amderma  2019     0.00      0.00     0.0    0.00    0.00    0.00   0.00    0.00       0.00     0.00       0.0       0.0        0.00   (Decimal('61.577429'), Decimal('69.759076'))
4  Anadyr (Carbon)  2019    81.63    143.01   260.9  304.36  122.00  106.87  84.99  130.00     102.00   118.00      94.0     199.0     1746.76  (Decimal('177.738273'), Decimal('64.713433'))
(1) Compute the total cargo flight volume for each year.
df2.groupby('Year')['Whole year'].sum()
Year
2007 659438.23
2008 664682.46
2009 560809.77
2010 693033.98
2011 818691.71
2012 846388.03
2013 792337.08
2014 729457.12
2015 630208.97
2016 679370.15
2017 773662.28
2018 767095.28
2019 764606.27
Name: Whole year, dtype: float64
(2) Are the recorded airports the same every year?
df2.groupby('Year')['Airport name'].unique()
Year
2007 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2008 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2009 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2010 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2011 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2012 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2013 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2014 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2015 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2016 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2017 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2018 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2019 [Abakan, Aikhal, Loss, Amderma, Anadyr (Carbon...
Name: Airport name, dtype: object
No, they are not the same: for example, the 2019 list contains Anadyr (Carbon) while earlier years list Anadyr.
(3) For 2010-2015, compute by year the proportion of airport records whose whole-year cargo volume is 0.
subs = df2[(df2['Year'] >= 2010) & (df2['Year'] <= 2015)]
subs[subs['Whole year'] == 0].groupby('Year')['Whole year'].count() / subs.groupby('Year')['Whole year'].count()
Year
2010 0.767123
2011 0.770548
2012 0.770548
2013 0.770548
2014 0.770548
2015 0.770548
Name: Whole year, dtype: float64
(4) If an airport has 5 or more years in which every monthly volume record is 0, delete all of its records (for every year) from the table and return the processed table.
# airports with 5 or more years of zero whole-year volume (a zero annual sum
# implies every monthly record is zero, since volumes are non-negative)
zero_years = df2[df2['Whole year'] == 0].groupby('Airport name')['Whole year'].count()
ind = zero_years[zero_years >= 5].index
df2 = df2[~df2['Airport name'].isin(ind)]
df2.reset_index(drop=True).head()
             Airport name  Year  January  February   March   April     May    June    July  August  September  October  November  December  Whole year                              Airport coordinates
0                  Abakan  2019    44.70     66.21   72.70   75.82  100.34   78.38   63.88   73.06      66.74    75.44    110.50     89.80      917.57   (Decimal('91.399735'), Decimal('53.751351'))
1         Anadyr (Carbon)  2019    81.63    143.01  260.90  304.36  122.00  106.87   84.99  130.00     102.00   118.00     94.00    199.00     1746.76  (Decimal('177.738273'), Decimal('64.713433'))
2       Anapa (Vitjazevo)  2019    45.92     53.15   54.00   54.72   52.00   67.45  172.31   72.57      70.00    63.00     69.00     82.10      856.22   (Decimal('37.341511'), Decimal('45.003748'))
3    Arkhangelsk (Talagy)  2019    85.61    118.70  131.39  144.82  137.95  140.18  128.56  135.68     124.75   139.60    210.27    307.10     1804.61   (Decimal('40.714892'), Decimal('64.596138'))
4  Astrakhan (Narimanovo)  2019    51.75     61.08   65.60   71.84   71.38   63.95  164.86   79.46      85.21    87.23     79.06     99.16      980.58   (Decimal('47.999896'), Decimal('46.287344'))
(5) Partition all airports into four regions (east, south, west, north) in a reasonable way, and report which region had the largest total cargo volume in 2017-2019.
A question here: should the northeast count as east or as north?
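One reasonable (and entirely hypothetical) scheme: parse the longitude/latitude out of the `Airport coordinates` strings, split airports into quadrants around the median airport position, and resolve the ambiguity above by sending the northeast quadrant to East and the northwest quadrant to North. Sketched on a four-row stand-in for `df2`:

```python
import re
import pandas as pd

def parse_coords(s):
    """Extract (lon, lat) from strings like "(Decimal('91.39'), Decimal('53.75'))"."""
    lon, lat = (float(v) for v in re.findall(r"-?\d+\.?\d*", s))
    return lon, lat

# hypothetical mini-table standing in for df2
demo = pd.DataFrame({
    'Airport name': ['A', 'B', 'C', 'D'],
    'Year': [2017, 2018, 2019, 2017],
    'Whole year': [100.0, 200.0, 50.0, 300.0],
    'Airport coordinates': [
        "(Decimal('30.0'), Decimal('60.0'))",
        "(Decimal('100.0'), Decimal('60.0'))",
        "(Decimal('30.0'), Decimal('50.0'))",
        "(Decimal('100.0'), Decimal('50.0'))",
    ],
})
coords = demo['Airport coordinates'].apply(parse_coords)
demo['lon'] = coords.str[0]
demo['lat'] = coords.str[1]
lon_mid, lat_mid = demo['lon'].median(), demo['lat'].median()

def region(row):
    # quadrants around the median position; NE -> East, NW -> North
    if row['lat'] >= lat_mid:
        return 'North' if row['lon'] < lon_mid else 'East'
    return 'West' if row['lon'] < lon_mid else 'South'

demo['region'] = demo.apply(region, axis=1)
totals = demo[demo['Year'].between(2017, 2019)].groupby('region')['Whole year'].sum()
print(totals.idxmax())
```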
(6) In statistics, ranks are often used to represent position in an ordering. Define the rank of an airport for a given month as the ranking of that month's cargo volume among all 12 months of that year for that airport (for example, if airport ***'s January 2019 volume ranks first among the 12 months of 2019, its rank is 1). To judge the relative size of a month's traffic, add up the ranks of all airports for that month, and call this sum the month's composite rank index. Using this definition, compute the composite rank index for the 12 months of 2016.
df20 = df2[df2['Year'] == 2016]
df20.groupby('Airport name')['Whole year'].rank()  # each airport has one 2016 row, so every rank is 1
185    1.0
186    1.0
187    1.0
188    1.0
189    1.0
      ...
241    1.0
242    1.0
243    1.0
244    1.0
245    1.0
Name: Whole year, Length: 61, dtype: float64
df2.groupby('Airport name')['Whole year'].apply(lambda x: x)
0        917.57
1       1746.76
2        856.22
3       1804.61
4        980.58
         ...
790     1600.56
791    17670.80
792    13577.44
793       55.59
794    15262.87
Name: Whole year, Length: 795, dtype: float64
df2.groupby('Airport name').rank
df2.groupby('Airport name').count()
                      Year  January  February  ...  Whole year  Airport coordinates
Airport name
Abakan                  13       13        13  ...          13                   13
Anadyr                  12       12        12  ...          12                   12
Anadyr (Carbon)          1        1         1  ...           1                    1
Anapa (Vitjazevo)       13       13        13  ...          13                   13
Arkhangelsk (Talagy)    13       13        13  ...          13                   13
...                    ...      ...       ...  ...         ...                  ...
Yamburg                 13       13        13  ...          13                   13
Yaroslavl (Tunoshna)    13       13        13  ...          13                   13
Yuzhno-(Khomutovo)      13       13        13  ...          13                   13
Zabaykalsk              13       13        13  ...          13                   13
red                     13       13        13  ...          13                   13

67 rows × 15 columns
df2.groupby('Airport name')['Whole year'].apply()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-221-a5c4db71690e> in <module>()
----> 1 df2.groupby('Airport name')['Whole year'].apply()

TypeError: apply() missing 1 required positional argument: 'func'
pd.DataFrame(list(df2.groupby('Airport name')['Whole year']))
                       0                                                  1
0                 Abakan  0       917.57  63      894.40  124    1000.04 1...
1                 Anadyr  64     1931.77  125    2164.34  186    2019.00 2...
2        Anadyr (Carbon)  1    1746.76  Name: Whole year, dtype: float64
3      Anapa (Vitjazevo)  2     856.22  65      794.47  126     729.05 187 ...
4   Arkhangelsk (Talagy)  3    1804.61  66     2019.51  127    2508.08 1...
..                   ...                                                ...
62               Yamburg  61       1.31  122       1.73  183       1.31 244 ...
63  Yaroslavl (Tunoshna)  62     186.67  123    2124.46  184    5811.9...
64    Yuzhno-(Khomutovo)  59   10298.31  120   10483.65  181   10556.1...
65            Zabaykalsk  22     446.53  83      527.91  144     782.18 2...
66                   red  21      63.63  82       14.81  143      13.33 204 ...

67 rows × 2 columns
df2[df2['Airport name'] == 'Abakan']
    Airport name  Year  January  February   March  April     May    June   July  August  September  October  November  December  Whole year                            Airport coordinates
0         Abakan  2019    44.70     66.21   72.80  83.50  102.20   83.30  67.70   85.60      88.30    73.70     72.10     92.20      917.57  (Decimal('91.399735'), Decimal('53.751351'))
63        Abakan  2018    63.39     60.98   69.46  88.62   83.81   64.99  49.64   70.26      87.35    76.78     80.19     98.93      894.40  (Decimal('91.399735'), Decimal('53.751351'))
124       Abakan  2017    48.36     77.08   95.33  90.28   91.93   57.13  82.59   81.20      76.05   139.67     71.58     88.84     1000.04  (Decimal('91.399735'), Decimal('53.751351'))
185       Abakan  2016    34.10     45.41   58.97  72.71   91.66   78.49  64.31  127.29      89.07    74.99     78.66     94.01      909.67  (Decimal('91.399735'), Decimal('53.751351'))
246       Abakan  2015    37.70     47.97   54.67  82.12   68.81  112.95  55.83   95.20     137.79    72.13     63.67     78.30      907.14  (Decimal('91.399735'), Decimal('53.751351'))
307       Abakan  2014    51.90     69.42   83.36  69.71   75.50   92.16  67.50   89.22      76.95    78.56    153.68     83.38      991.34  (Decimal('91.399735'), Decimal('53.751351'))
368       Abakan  2013    39.74    227.16   62.97  77.41  104.34   55.71  65.95   78.71      75.68    70.30     81.57     83.34     1022.88  (Decimal('91.399735'), Decimal('53.751351'))
429       Abakan  2012    39.24     39.24   60.58  55.53   56.51   57.35  64.47   75.88      66.40   120.62     64.38     81.25      781.45  (Decimal('91.399735'), Decimal('53.751351'))
490       Abakan  2011    53.30     59.40  162.40  72.40   90.10   92.10  81.30   73.20      77.00    84.30     91.17    124.56     1061.23  (Decimal('91.399735'), Decimal('53.751351'))
551       Abakan  2010    35.10    192.20   55.20 108.70  118.80   69.30  87.70   77.00     181.40    88.50     83.30    110.60     1207.80  (Decimal('91.399735'), Decimal('53.751351'))
612       Abakan  2009    40.10     36.90   50.30  47.00   58.20   77.60  83.90  404.70      65.10    68.20    182.20    362.50     1476.70  (Decimal('91.399735'), Decimal('53.751351'))
673       Abakan  2008    65.60     69.00   75.20  68.80  158.70   99.60  88.10   77.30      75.40    90.30     67.20     51.30      986.50  (Decimal('91.399735'), Decimal('53.751351'))
734       Abakan  2007   742.20     71.30   72.80  83.50  102.20   83.30  67.70   85.60      88.30    73.70     72.10     92.20     1634.90  (Decimal('91.399735'), Decimal('53.751351'))
df2.rank?
df2.groupby('Airport name').rank
三、The Spread of COVID-19 in the United States
df3q = pd.read_csv('美国确证数.csv')
df3s = pd.read_csv('美国死亡数.csv')
df3q.head()
        UID iso2 iso3  code3  FIPS   Admin2 Province_State Country_Region        Lat      Long_  ...  2020/4/17  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26
0  84001001   US  USA    840  1001  Autauga        Alabama             US  32.539527 -86.644082  ...         26         25         26         28         30         32         33         36         36         37
1  84001003   US  USA    840  1003  Baldwin        Alabama             US  30.727750 -87.722071  ...        103        109        112        117        123        132        143        147        147        161
2  84001005   US  USA    840  1005  Barbour        Alabama             US  31.868263 -85.387129  ...         15         18         20         22         28         29         30         32         32         33
3  84001007   US  USA    840  1007     Bibb        Alabama             US  32.996421 -87.125115  ...         24         26         28         32         32         34         33         34         34         38
4  84001009   US  USA    840  1009   Blount        Alabama             US  33.982109 -86.567906  ...         20         20         21         22         26         29         31         31         31         34

5 rows × 107 columns
df3s.head()
        UID iso2 iso3  code3  FIPS   Admin2 Province_State Country_Region        Lat      Long_  ...  2020/4/17  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26
0  84001001   US  USA    840  1001  Autauga        Alabama             US  32.539527 -86.644082  ...          2          2          2          1          1          2          2          2          2          2
1  84001003   US  USA    840  1003  Baldwin        Alabama             US  30.727750 -87.722071  ...          2          2          2          3          3          3          3          3          3          3
2  84001005   US  USA    840  1005  Barbour        Alabama             US  31.868263 -85.387129  ...          0          0          0          0          0          0          0          0          0          0
3  84001007   US  USA    840  1007     Bibb        Alabama             US  32.996421 -87.125115  ...          0          0          0          0          0          0          0          0          0          0
4  84001009   US  USA    840  1009   Blount        Alabama             US  33.982109 -86.567906  ...          0          0          0          0          0          0          0          0          0          0

5 rows × 108 columns
df3s.columns[:20]
Index(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
'Country_Region', 'Lat', 'Long_', 'Combined_Key', 'Population',
'2020/1/22', '2020/1/23', '2020/1/24', '2020/1/25', '2020/1/26',
'2020/1/27', '2020/1/28', '2020/1/29'],
dtype='object')
(1) Use corr() to compute the correlation coefficient between county population (each row is a county) and the death count on the last recorded date in the table.
df3s[['Population', '2020/4/26']].corr()
            Population  2020/4/26
Population    1.000000   0.403844
2020/4/26     0.403844   1.000000
(2) As of April 1, compute the proportion of zero-infection counties in each state.
df3q['zero41'] = df3q.iloc[:, 12:82].max(axis=1)  # per-county max over the date columns up to 2020/4/1
df3q[df3q['zero41'] == 0].groupby('Province_State')['UID'].count() / df3q.groupby('Province_State')['UID'].count()
Province_State
Alabama 0.104478
Alaska 0.793103
Arizona NaN
Arkansas 0.293333
California 0.137931
Colorado 0.218750
Connecticut NaN
Delaware NaN
District of Columbia NaN
Florida 0.164179
Georgia 0.119497
Hawaii 0.200000
Idaho 0.386364
Illinois 0.450980
Indiana 0.108696
Iowa 0.404040
Kansas 0.600000
Kentucky 0.433333
Louisiana 0.046875
Maine 0.187500
Maryland 0.041667
Massachusetts 0.142857
Michigan 0.168675
Minnesota 0.356322
Mississippi 0.060976
Missouri 0.391304
Montana 0.625000
Nebraska 0.741935
Nevada 0.470588
New Hampshire NaN
New Jersey NaN
New Mexico 0.424242
New York 0.080645
North Carolina 0.150000
North Dakota 0.528302
Ohio 0.181818
Oklahoma 0.376623
Oregon 0.277778
Pennsylvania 0.104478
Rhode Island NaN
South Carolina 0.043478
South Dakota 0.530303
Tennessee 0.105263
Texas 0.437008
Utah 0.448276
Vermont 0.142857
Virginia 0.248120
Washington 0.102564
West Virginia 0.472727
Wisconsin 0.319444
Wyoming 0.347826
Name: UID, dtype: float64
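As an aside, the ratio of counts above returns NaN for states that have no zero-infection county (Arizona, Connecticut, ...). Taking the per-state mean of a boolean mask yields 0.0 for those states instead. A sketch with a hypothetical `max_by_apr1` column standing in for `df3q['zero41']`:

```python
import pandas as pd

# toy stand-in: two states, one with no zero-infection counties
demo = pd.DataFrame({
    'Province_State': ['Alabama', 'Alabama', 'Arizona'],
    'max_by_apr1': [0, 12, 5],  # hypothetical per-county max up to 2020/4/1
})
# the mean of a boolean mask is the proportion of True values,
# and it is 0.0 (not NaN) for states with no zero-infection county
prop = (demo['max_by_apr1'] == 0).groupby(demo['Province_State']).mean()
print(prop)
```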
(3) Find the three counties where confirmed cases appeared earliest.
df3q.sort_values(by=df3q.columns[12:].tolist(), ascending=False)[:3]
           UID iso2 iso3  code3   FIPS       Admin2 Province_State Country_Region        Lat       Long_  ...  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26  zero41
2969  84053033   US  USA    840  53033         King     Washington             US  47.491379 -121.834613  ...       4902       5174       5174       5293       5379       5532       5637       5739       5863    2656
610   84017031   US  USA    840  17031         Cook       Illinois             US  41.841448  -87.816588  ...      20395      21272      22101      23181      24546      25811      27616      29058      30574    5575
204   84006037   US  USA    840   6037  Los Angeles     California             US  34.308284 -118.228241  ...      12021      12341      13823      15153      16447      17537      18545      19133      19567    4045

3 rows × 108 columns
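`sort_values` over every date column works because the columns are in chronological order, but a more literal reading of "earliest" is the first date on which each county's cumulative count becomes positive. A sketch on toy data; on the real table the date columns of `df3q` would replace `dates`:

```python
import pandas as pd

# toy cumulative-confirmed table: three counties, four days
demo = pd.DataFrame({
    'Admin2': ['King', 'Cook', 'Autauga'],
    '2020/1/22': [1, 0, 0],
    '2020/1/24': [1, 1, 0],
    '2020/1/26': [2, 1, 1],
    '2020/1/28': [3, 2, 1],
})
dates = demo.columns[1:]
has_case = demo[dates].gt(0)
first_day = has_case.idxmax(axis=1)        # first date column that is True per row
first_day[~has_case.any(axis=1)] = None    # guard counties with no cases at all
print(demo.assign(first_day=first_day).sort_values('first_day').head(3))
```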
(4) Compute the single-day increase in deaths by state, and report which state, on which day, had the largest increase in confirmed cases (the maximum is taken jointly over all states and all days, not separately for each).
df3q.groupby('Province_State').sum().iloc[:, 6:].diff(axis=1).max(axis=1).idxmax()
'New York'
df3q.groupby('Province_State').sum().iloc[:, 6:].diff(axis=1).max(axis=0).idxmax()
'2020/4/15'
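The two cells above locate the state and the date separately; stacking the differenced frame lets a single `idxmax` maximize over both axes at once, which is what the question asks. A sketch on toy cumulative counts:

```python
import pandas as pd

# toy cumulative-confirmed counts per state (columns are days)
demo = pd.DataFrame(
    {'2020/4/14': [100, 50], '2020/4/15': [400, 60], '2020/4/16': [450, 90]},
    index=['New York', 'Ohio'])
daily_increase = demo.diff(axis=1)
# stack collapses (state, day) into one index; idxmax returns the pair
state, day = daily_increase.stack().idxmax()
print(state, day)
```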
(5) For each state, compile a table of confirmed cases and deaths. The first column is the date, starting from the first day the state recorded a death; the second and third columns are the confirmed and death counts. Save each state as a separate CSV file named "<state name>.csv".
for state in df3s['Province_State'].unique():
    df_q = df3q[df3q['Province_State'] == state].iloc[:, 12:].sum()
    df_s = df3s[df3s['Province_State'] == state].iloc[:, 12:].sum()
    df_q = df_q[df_s.cumsum() > 0].reset_index()
    df_s = df_s[df_s.cumsum() > 0].reset_index()
    df_q.columns = ['Date', '确诊人数']
    df_s.columns = ['Date', '死亡人数']
    pd.merge(df_q, df_s, on='Date', how='inner').to_csv(state + '.csv')
(6) For April 1 through April 10, compile a table of newly confirmed cases and new deaths for each day. The first column is the state name; the second and third columns are the newly confirmed and newly dead counts. Save the ten days as ten separate CSV files, each named "<date>.csv".
# new counts are differences of the cumulative series; select date columns by
# label, since the extra Population column in df3s shifts its dates by one
# position relative to df3q (108 vs 107 columns)
q_new = df3q.groupby('Province_State').sum().iloc[:, 5:].diff(axis=1)
s_new = df3s.groupby('Province_State').sum().iloc[:, 6:].diff(axis=1)
for day in range(1, 11):
    date = '2020/4/%d' % day
    df_r = pd.DataFrame(q_new[date]).assign(s=s_new[date]).reset_index()
    df_r.columns = ['州名', '确诊人数', '死亡人数']
    df_r.to_csv('4月%d日.csv' % day)