一、Shanghai Vehicle License Plate Auctions, 2002-2018
import numpy as np
import pandas as pd

df1 = pd.read_csv('2002年-2018年上海机动车拍照拍卖.csv')
df1
       Date  Total number of license issued  lowest price  avg price  Total number of applicants
0     2-Jan                            1400         13600      14735                        3718
1     2-Feb                            1800         13100      14057                        4590
2     2-Mar                            2000         14300      14662                        5190
3     2-Apr                            2300         16000      16334                        4806
4     2-May                            2350         17800      18357                        4665
..      ...                             ...           ...        ...                         ...
198  18-Aug                           10402         88300      88365                      192755
199  18-Sep                           12712         87300      87410                      189142
200  18-Oct                           10728         88000      88070                      181861
201  18-Nov                           11766         87300      87374                      177355
202  18-Dec                           12850         87400      87508                      165442

203 rows × 5 columns
(1) In which auction did the winning rate first fall below 5%?
df1[(df1['Total number of license issued'] / df1['Total number of applicants']) < 0.05]
       Date  Total number of license issued  lowest price  avg price  Total number of applicants
159  15-May                            7482         79000      79099                      156007
160  15-Jun                            7441         80000      80020                      172205
161  15-Jul                            7531         83100      83171                      166302
162  15-Aug                            7454         82600      82642                      166939
164  15-Oct                            7763         85300      85424                      170995
165  15-Nov                            7514         84600      84703                      169159
166  15-Dec                            7698         84500      84572                      179133
168  16-Feb                            8363         83200      83244                      196470
169  16-Mar                            8310         83100      83148                      221109
170  16-Apr                           11829         85100      85127                      256897
171  16-May                           11598         85000      85058                      277889
172  16-Jun                           11546         84400      84483                      275438
173  16-Jul                           11475         87200      87235                      240750
174  16-Aug                           11549         86900      86946                      251188
180  17-Feb                           10157         88200      88240                      251717
181  17-Mar                           10356         87800      87916                      262010
182  17-Apr                           12196         89800      89850                      252273
183  17-May                           10316         90100      90209                      270197
184  17-Jun                           10312         89400      89532                      244349
185  17-Jul                           10325         92200      92250                      269189
186  17-Aug                           10558         91600      91629                      256083
187  17-Sep                           12413         91300      91415                      250566
188  17-Oct                           11388         93500      93540                      244868
189  17-Nov                           11002         93100      93130                      226911
193  18-Mar                            9855         88100      88176                      217056
The first such auction is 15-May (May 2015).
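To return the single answer programmatically rather than reading it off the filtered table, `idxmax` on the boolean mask gives the first row where the rate drops below 5% (`idxmax` returns the first `True`; it assumes at least one `True` exists). A sketch on a small stand-in for `df1`:

```python
import pandas as pd

# a tiny stand-in for df1; the real values come from the CSV
df = pd.DataFrame({
    'Date': ['15-Apr', '15-May', '15-Jun'],
    'Total number of license issued': [7800, 7482, 7441],
    'Total number of applicants': [152298, 156007, 172205],
})
rate = df['Total number of license issued'] / df['Total number of applicants']
first = df.loc[(rate < 0.05).idxmax(), 'Date']  # first row whose rate is below 5%
print(first)
```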
(2) By year, compute the following statistics of the lowest auction price: maximum, mean, and 0.75 quantile, shown together in one table.
df1['year'] = df1['Date'].apply(lambda x: '20' + x[:2] if x[2] == '-' else '200' + x[0])  # '18-Aug' -> '2018', '2-Jan' -> '2002'
df1.head()
    Date  Total number of license issued  lowest price  avg price  Total number of applicants  year
0  2-Jan                            1400         13600      14735                        3718  2002
1  2-Feb                            1800         13100      14057                        4590  2002
2  2-Mar                            2000         14300      14662                        5190  2002
3  2-Apr                            2300         16000      16334                        4806  2002
4  2-May                            2350         17800      18357                        4665  2002
np.quantile?
df1.groupby('year')['lowest price '].agg([max, np.mean, lambda x: x.quantile(0.75)]).rename(columns={'<lambda_0>': '0.75分位数'})  # note the trailing space in the column name
        max          mean  0.75分位数
year
2002  30800  20316.666667     24300
2003  38500  31983.333333     36300
2004  44200  29408.333333     38400
2005  37900  31908.333333     35600
2006  39900  37058.333333     39525
2007  53800  45691.666667     48950
2008  37300  29945.454545     34150
2009  36900  31333.333333     34150
2010  44900  38008.333333     41825
2011  53800  47958.333333     51000
2012  68900  61108.333333     65325
2013  90800  79125.000000     82550
2014  74600  73816.666667     74000
2015  85300  80575.000000     83450
2016  88600  85733.333333     87475
2017  93500  90616.666667     92350
2018  89000  87825.000000     88150
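Since pandas 0.25, named aggregation avoids the `<lambda_0>` rename step. A sketch on a synthetic stand-in for the year/lowest-price columns:

```python
import pandas as pd

# synthetic stand-in: two years of lowest prices
df = pd.DataFrame({
    'year': ['2002'] * 3 + ['2003'] * 3,
    'lowest price ': [13600, 13100, 14300, 38500, 30000, 32000],
})
# keyword form of agg names the output columns directly
out = df.groupby('year')['lowest price '].agg(
    max='max',
    mean='mean',
    q75=lambda x: x.quantile(0.75),
)
print(out)
```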
(3) Split the first (date) column into two columns: the year (formatted as 20××) and the month (English abbreviation). Insert them as the first and second columns, drop the original first column, and shift the remaining columns back accordingly.
df1['month'] = df1['Date'].apply(lambda x: x[-3:])
df1.head()
    Date  Total number of license issued  lowest price  avg price  Total number of applicants  year month
0  2-Jan                            1400         13600      14735                        3718  2002   Jan
1  2-Feb                            1800         13100      14057                        4590  2002   Feb
2  2-Mar                            2000         14300      14662                        5190  2002   Mar
3  2-Apr                            2300         16000      16334                        4806  2002   Apr
4  2-May                            2350         17800      18357                        4665  2002   May
df1 = df1.loc[:, ['year', 'month', 'Total number of license issued', 'lowest price ',
                  'avg price', 'Total number of applicants']]
df1.head()
   year month  Total number of license issued  lowest price  avg price  Total number of applicants
0  2002   Jan                            1400         13600      14735                        3718
1  2002   Feb                            1800         13100      14057                        4590
2  2002   Mar                            2000         14300      14662                        5190
3  2002   Apr                            2300         16000      16334                        4806
4  2002   May                            2350         17800      18357                        4665
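An alternative for steps (2)-(3): split the `Date` column once with `str.split` and zero-pad the year digits, instead of indexing into the string. A sketch on a hypothetical sample of the `Date` values:

```python
import pandas as pd

# hypothetical sample of the original 'Date' column
date = pd.Series(['2-Jan', '15-May', '18-Dec'])
parts = date.str.split('-', expand=True)  # column 0: year digits, column 1: month
year = '20' + parts[0].str.zfill(2)       # '2' -> '02' -> '2002'; '18' -> '2018'
month = parts[1]
print(year.tolist(), month.tolist())
```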
(4) Now turn the row index into a multi-level index: the outer level is the year, the inner level holds the variable names of columns 2-5 of the original table, and the column index is the months.
df1.set_index(['year', 'Total number of license issued', 'lowest price ',
               'avg price', 'Total number of applicants'])
                                                                                month
year Total number of license issued lowest price  avg price Total number of applicants
2002 1400                           13600         14735     3718                  Jan
     1800                           13100         14057     4590                  Feb
     2000                           14300         14662     5190                  Mar
     2300                           16000         16334     4806                  Apr
     2350                           17800         18357     4665                  May
...                                                                               ...
2018 10402                          88300         88365     192755                Aug
     12712                          87300         87410     189142                Sep
     10728                          88000         88070     181861                Oct
     11766                          87300         87374     177355                Nov
     12850                          87400         87508     165442                Dec

203 rows × 1 columns
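The cell above puts the data values themselves into the index, so the result still has `month` as its only column rather than the requested layout (year and variable name as row levels, months as columns). One way to get that layout, sketched on a minimal stand-in frame, is `stack` followed by `unstack`:

```python
import pandas as pd

# minimal stand-in for the reordered df1 (year, month, then value columns)
df = pd.DataFrame({
    'year': ['2002', '2002', '2003'],
    'month': ['Jan', 'Feb', 'Jan'],
    'Total number of license issued': [1400, 1800, 9000],
    'lowest price ': [13600, 13100, 32000],
})
# stack -> long form indexed by (year, month, variable); unstack moves month to columns
wide = df.set_index(['year', 'month']).stack().unstack('month')
print(wide)
```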
(5) In general, the difference between a month's lowest price and the previous month's lowest price has the same sign as the corresponding difference in average price. Which auction dates do not follow this pattern?
df1[df1['lowest price '].diff() * df1['avg price'].diff() < 0][['year', 'month']]
     year month
21   2003   Oct
22   2003   Nov
29   2004   Jun
36   2005   Jan
37   2005   Feb
44   2005   Sep
52   2006   May
56   2006   Sep
60   2007   Jan
61   2007   Feb
71   2007   Dec
128  2012   Oct
df1.columns
Index(['year', 'month', 'Total number of license issued', 'lowest price ',
'avg price', 'Total number of applicants'],
dtype='object')
(6) Define a month's issuance gain as the difference between that month's license issuance and the mean issuance of the previous two months (fill the first two months with 0). Find the dates at which the extrema of the issuance gain occur.
issued = df1['Total number of license issued'].values
avg = (issued[1:-1] + issued[:-2]) / 2                 # mean of the two preceding months
avg = np.array([issued[0], issued[1]] + avg.tolist())  # first two months: gain forced to 0
diff = issued - avg
diff
array([ 0.0000e+00, 0.0000e+00, 4.0000e+02, 4.0000e+02, 2.0000e+02,
4.7500e+02, 4.2500e+02, 1.0000e+02, 2.0000e+02, 1.0000e+02,
0.0000e+00, 4.0000e+02, -4.0000e+02, -3.0000e+02, 0.0000e+00,
3.0000e+02, 6.5000e+02, 1.9500e+03, 1.3500e+03, -1.2500e+03,
1.4000e+03, -1.0750e+03, -5.3300e+02, 5.0000e+00, 9.1000e+01,
-8.8000e+01, -1.0000e+02, 7.0000e+02, 1.3770e+03, 2.1950e+02,
2.2000e+02, 3.8350e+02, -6.0000e+01, -1.2000e+02, -2.0000e+01,
-1.1000e+03, -5.5000e+02, -1.7000e+03, -6.5000e+02, 1.1000e+03,
1.3330e+03, 2.7350e+02, 5.6450e+02, 8.2100e+02, 1.2250e+02,
-7.6450e+02, -6.5000e+02, -1.5000e+02, -7.0000e+02, -1.5500e+03,
1.0000e+02, 8.5000e+02, -2.5000e+02, -2.5000e+02, 1.0000e+03,
1.2000e+03, 6.5000e+02, 1.5000e+02, -5.0000e+02, 2.5000e+02,
-2.5000e+02, -2.7500e+03, -7.5000e+02, 1.7500e+03, 1.2500e+03,
2.5000e+02, -5.0000e+02, 2.2500e+03, 1.7500e+03, -7.5000e+02,
-5.0000e+02, 0.0000e+00, 8.5000e+03, -2.4500e+03, -3.6500e+03,
-9.5000e+02, -9.0000e+02, -1.1500e+03, -1.2500e+03, 1.0000e+02,
-1.2500e+03, -2.5000e+02, -7.5000e+02, 2.0000e+02, 3.5000e+02,
8.0000e+02, 9.0000e+02, 9.5000e+02, 1.1500e+03, 4.0000e+02,
0.0000e+00, 5.0000e+02, -2.5000e+02, -2.5000e+02, 0.0000e+00,
0.0000e+00, -5.0000e+02, 2.5000e+02, 7.5000e+02, 2.5000e+02,
7.0000e+02, 1.5000e+02, -1.0000e+02, 0.0000e+00, 0.0000e+00,
-5.0000e+02, 2.5000e+02, -7.5000e+02, -1.0000e+03, 2.5000e+02,
2.5000e+02, 1.0000e+03, 5.0000e+02, 0.0000e+00, 0.0000e+00,
5.0000e+02, -2.5000e+02, -2.5000e+02, -5.0000e+02, -7.5000e+02,
-2.5000e+02, 0.0000e+00, 5.0000e+02, 1.0500e+03, 6.0000e+02,
1.0000e+02, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, -5.0000e+02, -2.5000e+02, 0.0000e+00, 2.0000e+03,
-1.0000e+03, -1.0000e+03, 0.0000e+00, 0.0000e+00, 0.0000e+00,
1.0000e+03, -1.0000e+03, -7.5000e+02, -4.0000e+02, -9.0000e+02,
-3.5000e+02, 8.0000e+02, -4.0000e+02, -4.0000e+02, 0.0000e+00,
0.0000e+00, 9.0000e+02, -4.5000e+02, -4.5000e+02, 4.7000e+01,
5.6650e+02, -6.5500e+01, -4.1550e+02, 7.5850e+02, -3.6500e+02,
-4.4400e+02, 6.9500e+01, -3.2000e+01, 1.2345e+03, -3.2750e+02,
-7.3100e+02, 5.9500e+01, 1.8030e+03, -1.9050e+02, -5.7600e+02,
3.4925e+03, 1.5285e+03, -1.6750e+02, -9.7000e+01, 3.8500e+01,
1.3770e+03, -5.9800e+02, -7.0600e+02, 6.7600e+02, 3.1000e+02,
-2.0810e+03, -8.3000e+02, 1.9395e+03, -9.6000e+02, -9.4400e+02,
1.1000e+01, 2.3950e+02, 1.9715e+03, -9.7500e+01, -8.9850e+02,
9.5200e+02, 6.0850e+02, -1.0670e+03, -1.7855e+03, 1.4395e+03,
-6.6950e+02, -2.9100e+02, -1.0050e+02, -1.8300e+02, 2.3135e+03,
-8.2900e+02, 4.6000e+01, 1.6030e+03])
d1 = [np.nan] + diff[:-1].tolist()  # previous gain
d2 = diff[1:].tolist() + [np.nan]   # next gain
df1[(diff - d1) * (diff - d2) > 0][['year', 'month']]  # strict local extrema; NaN endpoints trigger the warning below
/home/myth/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning: invalid value encountered in greater
This is separate from the ipykernel package so we can avoid doing imports until
     year month
4    2002   May
5    2002   Jun
7    2002   Aug
8    2002   Sep
10   2002   Nov
..    ...   ...
195  2018   May
197  2018   Jul
198  2018   Aug
199  2018   Sep
200  2018   Oct

95 rows × 2 columns
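The NumPy construction above can also be written with pandas alone: the mean of the two preceding months is a rolling window of size 2 shifted by one. A sketch on the first five issuance values; the result matches the head of the `diff` array printed above:

```python
import pandas as pd

# stand-in for the head of the issuance column
x = pd.Series([1400, 1800, 2000, 2300, 2350])
# gain: this month minus the mean of the previous two months; first two months -> 0
gain = (x - x.rolling(2).mean().shift()).fillna(0)
print(gain.tolist())
```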
二、Cargo Volume of Russian Airport Freight Flights, 2007-2019
df2 = pd.read_csv('2007年-2019年俄罗斯货运航班运载量.csv')
df2.head()
      Airport name  Year  January  February   March   April     May    June   July  August  September  October  November  December  Whole year                              Airport coordinates
0           Abakan  2019    44.70     66.21    72.7   75.82  100.34   78.38  63.88   73.06      66.74    75.44     110.5      89.8      917.57   (Decimal('91.399735'), Decimal('53.751351'))
1           Aikhal  2019     0.00      0.00     0.0    0.00    0.00    0.00   0.00    0.00       0.00     0.00       0.0       0.0        0.00  (Decimal('111.543324'), Decimal('65.957161'))
2             Loss  2019     0.00      0.00     0.0    0.00    0.00    0.00   0.00    0.00       0.00     0.00       0.0       0.0        0.00  (Decimal('125.398355'), Decimal('58.602489'))
3          Amderma  2019     0.00      0.00     0.0    0.00    0.00    0.00   0.00    0.00       0.00     0.00       0.0       0.0        0.00   (Decimal('61.577429'), Decimal('69.759076'))
4  Anadyr (Carbon)  2019    81.63    143.01   260.9  304.36  122.00  106.87  84.99  130.00     102.00   118.00      94.0     199.0     1746.76  (Decimal('177.738273'), Decimal('64.713433'))
(1) Compute the total cargo flight volume for each year.
df2.groupby('Year')['Whole year'].sum()
Year
2007 659438.23
2008 664682.46
2009 560809.77
2010 693033.98
2011 818691.71
2012 846388.03
2013 792337.08
2014 729457.12
2015 630208.97
2016 679370.15
2017 773662.28
2018 767095.28
2019 764606.27
Name: Whole year, dtype: float64
(2) Are the recorded airports the same every year?
df2.groupby('Year')['Airport name'].unique()
Year
2007 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2008 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2009 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2010 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2011 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2012 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2013 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2014 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2015 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2016 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2017 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2018 [Abakan, Aikhal, Loss, Amderma, Anadyr, Anapa ...
2019 [Abakan, Aikhal, Loss, Amderma, Anadyr (Carbon...
Name: Airport name, dtype: object
No, they are not the same: for example, the 2019 list contains Anadyr (Carbon) while earlier years list Anadyr.
(3) For 2010-2015, compute by year the proportion of airport records whose whole-year cargo volume is 0.
subs = df2[(df2['Year'] >= 2010) & (df2['Year'] <= 2015)]
subs[subs['Whole year'] == 0].groupby('Year')['Whole year'].count() / subs.groupby('Year')['Whole year'].count()
Year
2010 0.767123
2011 0.770548
2012 0.770548
2013 0.770548
2014 0.770548
2015 0.770548
Name: Whole year, dtype: float64
(4) If an airport has 5 or more years in which every monthly volume record is 0, delete all of its records (for every year) from the table and return the processed table.
# airports with 5 or more years of zero whole-year volume (a zero annual sum
# implies every monthly record is zero, since volumes are non-negative)
zero_years = df2[df2['Whole year'] == 0].groupby('Airport name')['Whole year'].count()
ind = zero_years[zero_years >= 5].index
df2 = df2[~df2['Airport name'].isin(ind)]
df2.reset_index(drop=True).head()
             Airport name  Year  January  February   March   April     May    June    July  August  September  October  November  December  Whole year                              Airport coordinates
0                  Abakan  2019    44.70     66.21   72.70   75.82  100.34   78.38   63.88   73.06      66.74    75.44    110.50     89.80      917.57   (Decimal('91.399735'), Decimal('53.751351'))
1         Anadyr (Carbon)  2019    81.63    143.01  260.90  304.36  122.00  106.87   84.99  130.00     102.00   118.00     94.00    199.00     1746.76  (Decimal('177.738273'), Decimal('64.713433'))
2       Anapa (Vitjazevo)  2019    45.92     53.15   54.00   54.72   52.00   67.45  172.31   72.57      70.00    63.00     69.00     82.10      856.22   (Decimal('37.341511'), Decimal('45.003748'))
3    Arkhangelsk (Talagy)  2019    85.61    118.70  131.39  144.82  137.95  140.18  128.56  135.68     124.75   139.60    210.27    307.10     1804.61   (Decimal('40.714892'), Decimal('64.596138'))
4  Astrakhan (Narimanovo)  2019    51.75     61.08   65.60   71.84   71.38   63.95  164.86   79.46      85.21    87.23     79.06     99.16      980.58   (Decimal('47.999896'), Decimal('46.287344'))
(5) Partition all airports into four regions (east, south, west, north) in a reasonable way, and report which region had the largest total cargo volume in 2017-2019.
A question here: should the northeast count as east or as north?
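One reasonable (and entirely hypothetical) scheme: parse the longitude/latitude out of the `Airport coordinates` strings, split airports into quadrants around the median airport position, and resolve the ambiguity above by sending the northeast quadrant to East and the northwest quadrant to North. Sketched on a four-row stand-in for `df2`:

```python
import re
import pandas as pd

def parse_coords(s):
    """Extract (lon, lat) from strings like "(Decimal('91.39'), Decimal('53.75'))"."""
    lon, lat = (float(v) for v in re.findall(r"-?\d+\.?\d*", s))
    return lon, lat

# hypothetical mini-table standing in for df2
demo = pd.DataFrame({
    'Airport name': ['A', 'B', 'C', 'D'],
    'Year': [2017, 2018, 2019, 2017],
    'Whole year': [100.0, 200.0, 50.0, 300.0],
    'Airport coordinates': [
        "(Decimal('30.0'), Decimal('60.0'))",
        "(Decimal('100.0'), Decimal('60.0'))",
        "(Decimal('30.0'), Decimal('50.0'))",
        "(Decimal('100.0'), Decimal('50.0'))",
    ],
})
coords = demo['Airport coordinates'].apply(parse_coords)
demo['lon'] = coords.str[0]
demo['lat'] = coords.str[1]
lon_mid, lat_mid = demo['lon'].median(), demo['lat'].median()

def region(row):
    # quadrants around the median position; NE -> East, NW -> North
    if row['lat'] >= lat_mid:
        return 'North' if row['lon'] < lon_mid else 'East'
    return 'West' if row['lon'] < lon_mid else 'South'

demo['region'] = demo.apply(region, axis=1)
totals = demo[demo['Year'].between(2017, 2019)].groupby('region')['Whole year'].sum()
print(totals.idxmax())
```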
(6) In statistics, ranks are often used to represent position in an ordering. Define the rank of an airport for a given month as the ranking of that month's cargo volume among all 12 months of that year for that airport (for example, if airport ***'s January 2019 volume ranks first among the 12 months of 2019, its rank is 1). To judge the relative size of a month's traffic, add up the ranks of all airports for that month, and call this sum the month's composite rank index. Using this definition, compute the composite rank index for the 12 months of 2016.
df20 = df2[df2['Year'] == 2016]
df20.groupby('Airport name')['Whole year'].rank()  # each airport has one 2016 row, so every rank is 1
185    1.0
186    1.0
187    1.0
188    1.0
189    1.0
      ...
241    1.0
242    1.0
243    1.0
244    1.0
245    1.0
Name: Whole year, Length: 61, dtype: float64
df2.groupby('Airport name')['Whole year'].apply(lambda x: x)
0        917.57
1       1746.76
2        856.22
3       1804.61
4        980.58
         ...
790     1600.56
791    17670.80
792    13577.44
793       55.59
794    15262.87
Name: Whole year, Length: 795, dtype: float64
df2.groupby('Airport name').rank
df2.groupby('Airport name').count()
                      Year  January  February  ...  Whole year  Airport coordinates
Airport name
Abakan                  13       13        13  ...          13                   13
Anadyr                  12       12        12  ...          12                   12
Anadyr (Carbon)          1        1         1  ...           1                    1
Anapa (Vitjazevo)       13       13        13  ...          13                   13
Arkhangelsk (Talagy)    13       13        13  ...          13                   13
...                    ...      ...       ...  ...         ...                  ...
Yamburg                 13       13        13  ...          13                   13
Yaroslavl (Tunoshna)    13       13        13  ...          13                   13
Yuzhno-(Khomutovo)      13       13        13  ...          13                   13
Zabaykalsk              13       13        13  ...          13                   13
red                     13       13        13  ...          13                   13

67 rows × 15 columns
df2.groupby('Airport name')['Whole year'].apply()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-221-a5c4db71690e> in <module>()
----> 1 df2.groupby('Airport name')['Whole year'].apply()

TypeError: apply() missing 1 required positional argument: 'func'
pd.DataFrame(list(df2.groupby('Airport name')['Whole year']))
                       0                                                  1
0                 Abakan  0       917.57  63      894.40  124    1000.04 1...
1                 Anadyr  64     1931.77  125    2164.34  186    2019.00 2...
2        Anadyr (Carbon)  1    1746.76  Name: Whole year, dtype: float64
3      Anapa (Vitjazevo)  2     856.22  65      794.47  126     729.05 187 ...
4   Arkhangelsk (Talagy)  3    1804.61  66     2019.51  127    2508.08 1...
..                   ...                                                ...
62               Yamburg  61       1.31  122       1.73  183       1.31 244 ...
63  Yaroslavl (Tunoshna)  62     186.67  123    2124.46  184    5811.9...
64    Yuzhno-(Khomutovo)  59   10298.31  120   10483.65  181   10556.1...
65            Zabaykalsk  22     446.53  83      527.91  144     782.18 2...
66                   red  21      63.63  82       14.81  143      13.33 204 ...

67 rows × 2 columns
df2[df2['Airport name'] == 'Abakan']
    Airport name  Year  January  February   March  April     May    June   July  August  September  October  November  December  Whole year                            Airport coordinates
0         Abakan  2019    44.70     66.21   72.80  83.50  102.20   83.30  67.70   85.60      88.30    73.70     72.10     92.20      917.57  (Decimal('91.399735'), Decimal('53.751351'))
63        Abakan  2018    63.39     60.98   69.46  88.62   83.81   64.99  49.64   70.26      87.35    76.78     80.19     98.93      894.40  (Decimal('91.399735'), Decimal('53.751351'))
124       Abakan  2017    48.36     77.08   95.33  90.28   91.93   57.13  82.59   81.20      76.05   139.67     71.58     88.84     1000.04  (Decimal('91.399735'), Decimal('53.751351'))
185       Abakan  2016    34.10     45.41   58.97  72.71   91.66   78.49  64.31  127.29      89.07    74.99     78.66     94.01      909.67  (Decimal('91.399735'), Decimal('53.751351'))
246       Abakan  2015    37.70     47.97   54.67  82.12   68.81  112.95  55.83   95.20     137.79    72.13     63.67     78.30      907.14  (Decimal('91.399735'), Decimal('53.751351'))
307       Abakan  2014    51.90     69.42   83.36  69.71   75.50   92.16  67.50   89.22      76.95    78.56    153.68     83.38      991.34  (Decimal('91.399735'), Decimal('53.751351'))
368       Abakan  2013    39.74    227.16   62.97  77.41  104.34   55.71  65.95   78.71      75.68    70.30     81.57     83.34     1022.88  (Decimal('91.399735'), Decimal('53.751351'))
429       Abakan  2012    39.24     39.24   60.58  55.53   56.51   57.35  64.47   75.88      66.40   120.62     64.38     81.25      781.45  (Decimal('91.399735'), Decimal('53.751351'))
490       Abakan  2011    53.30     59.40  162.40  72.40   90.10   92.10  81.30   73.20      77.00    84.30     91.17    124.56     1061.23  (Decimal('91.399735'), Decimal('53.751351'))
551       Abakan  2010    35.10    192.20   55.20 108.70  118.80   69.30  87.70   77.00     181.40    88.50     83.30    110.60     1207.80  (Decimal('91.399735'), Decimal('53.751351'))
612       Abakan  2009    40.10     36.90   50.30  47.00   58.20   77.60  83.90  404.70      65.10    68.20    182.20    362.50     1476.70  (Decimal('91.399735'), Decimal('53.751351'))
673       Abakan  2008    65.60     69.00   75.20  68.80  158.70   99.60  88.10   77.30      75.40    90.30     67.20     51.30      986.50  (Decimal('91.399735'), Decimal('53.751351'))
734       Abakan  2007   742.20     71.30   72.80  83.50  102.20   83.30  67.70   85.60      88.30    73.70     72.10     92.20     1634.90  (Decimal('91.399735'), Decimal('53.751351'))
df2.rank?
df2.groupby('Airport name').rank
三、The Spread of COVID-19 in the United States
df3q = pd.read_csv('美国确证数.csv')
df3s = pd.read_csv('美国死亡数.csv')
df3q.head()
        UID iso2 iso3  code3  FIPS   Admin2 Province_State Country_Region        Lat      Long_  ...  2020/4/17  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26
0  84001001   US  USA    840  1001  Autauga        Alabama             US  32.539527 -86.644082  ...         26         25         26         28         30         32         33         36         36         37
1  84001003   US  USA    840  1003  Baldwin        Alabama             US  30.727750 -87.722071  ...        103        109        112        117        123        132        143        147        147        161
2  84001005   US  USA    840  1005  Barbour        Alabama             US  31.868263 -85.387129  ...         15         18         20         22         28         29         30         32         32         33
3  84001007   US  USA    840  1007     Bibb        Alabama             US  32.996421 -87.125115  ...         24         26         28         32         32         34         33         34         34         38
4  84001009   US  USA    840  1009   Blount        Alabama             US  33.982109 -86.567906  ...         20         20         21         22         26         29         31         31         31         34

5 rows × 107 columns
df3s.head()
        UID iso2 iso3  code3  FIPS   Admin2 Province_State Country_Region        Lat      Long_  ...  2020/4/17  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26
0  84001001   US  USA    840  1001  Autauga        Alabama             US  32.539527 -86.644082  ...          2          2          2          1          1          2          2          2          2          2
1  84001003   US  USA    840  1003  Baldwin        Alabama             US  30.727750 -87.722071  ...          2          2          2          3          3          3          3          3          3          3
2  84001005   US  USA    840  1005  Barbour        Alabama             US  31.868263 -85.387129  ...          0          0          0          0          0          0          0          0          0          0
3  84001007   US  USA    840  1007     Bibb        Alabama             US  32.996421 -87.125115  ...          0          0          0          0          0          0          0          0          0          0
4  84001009   US  USA    840  1009   Blount        Alabama             US  33.982109 -86.567906  ...          0          0          0          0          0          0          0          0          0          0

5 rows × 108 columns
df3s.columns[:20]
Index(['UID', 'iso2', 'iso3', 'code3', 'FIPS', 'Admin2', 'Province_State',
'Country_Region', 'Lat', 'Long_', 'Combined_Key', 'Population',
'2020/1/22', '2020/1/23', '2020/1/24', '2020/1/25', '2020/1/26',
'2020/1/27', '2020/1/28', '2020/1/29'],
dtype='object')
(1) Use corr() to compute the correlation coefficient between county population (each row is a county) and the death count on the last recorded date in the table.
df3s[['Population', '2020/4/26']].corr()
            Population  2020/4/26
Population    1.000000   0.403844
2020/4/26     0.403844   1.000000
(2) As of April 1, compute the proportion of zero-infection counties in each state.
df3q['zero41'] = df3q.iloc[:, 12:82].max(axis=1)  # per-county max over the date columns up to 2020/4/1
df3q[df3q['zero41'] == 0].groupby('Province_State')['UID'].count() / df3q.groupby('Province_State')['UID'].count()
Province_State
Alabama 0.104478
Alaska 0.793103
Arizona NaN
Arkansas 0.293333
California 0.137931
Colorado 0.218750
Connecticut NaN
Delaware NaN
District of Columbia NaN
Florida 0.164179
Georgia 0.119497
Hawaii 0.200000
Idaho 0.386364
Illinois 0.450980
Indiana 0.108696
Iowa 0.404040
Kansas 0.600000
Kentucky 0.433333
Louisiana 0.046875
Maine 0.187500
Maryland 0.041667
Massachusetts 0.142857
Michigan 0.168675
Minnesota 0.356322
Mississippi 0.060976
Missouri 0.391304
Montana 0.625000
Nebraska 0.741935
Nevada 0.470588
New Hampshire NaN
New Jersey NaN
New Mexico 0.424242
New York 0.080645
North Carolina 0.150000
North Dakota 0.528302
Ohio 0.181818
Oklahoma 0.376623
Oregon 0.277778
Pennsylvania 0.104478
Rhode Island NaN
South Carolina 0.043478
South Dakota 0.530303
Tennessee 0.105263
Texas 0.437008
Utah 0.448276
Vermont 0.142857
Virginia 0.248120
Washington 0.102564
West Virginia 0.472727
Wisconsin 0.319444
Wyoming 0.347826
Name: UID, dtype: float64
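As an aside, the ratio of counts above returns NaN for states that have no zero-infection county (Arizona, Connecticut, ...). Taking the per-state mean of a boolean mask yields 0.0 for those states instead. A sketch with a hypothetical `max_by_apr1` column standing in for `df3q['zero41']`:

```python
import pandas as pd

# toy stand-in: two states, one with no zero-infection counties
demo = pd.DataFrame({
    'Province_State': ['Alabama', 'Alabama', 'Arizona'],
    'max_by_apr1': [0, 12, 5],  # hypothetical per-county max up to 2020/4/1
})
# the mean of a boolean mask is the proportion of True values,
# and it is 0.0 (not NaN) for states with no zero-infection county
prop = (demo['max_by_apr1'] == 0).groupby(demo['Province_State']).mean()
print(prop)
```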
(3) Find the three counties where confirmed cases appeared earliest.
df3q.sort_values(by=df3q.columns[12:].tolist(), ascending=False)[:3]
           UID iso2 iso3  code3   FIPS       Admin2 Province_State Country_Region        Lat       Long_  ...  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26  zero41
2969  84053033   US  USA    840  53033         King     Washington             US  47.491379 -121.834613  ...       4902       5174       5174       5293       5379       5532       5637       5739       5863    2656
610   84017031   US  USA    840  17031         Cook       Illinois             US  41.841448  -87.816588  ...      20395      21272      22101      23181      24546      25811      27616      29058      30574    5575
204   84006037   US  USA    840   6037  Los Angeles     California             US  34.308284 -118.228241  ...      12021      12341      13823      15153      16447      17537      18545      19133      19567    4045

3 rows × 108 columns
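`sort_values` over every date column works because the columns are in chronological order, but a more literal reading of "earliest" is the first date on which each county's cumulative count becomes positive. A sketch on toy data; on the real table the date columns of `df3q` would replace `dates`:

```python
import pandas as pd

# toy cumulative-confirmed table: three counties, four days
demo = pd.DataFrame({
    'Admin2': ['King', 'Cook', 'Autauga'],
    '2020/1/22': [1, 0, 0],
    '2020/1/24': [1, 1, 0],
    '2020/1/26': [2, 1, 1],
    '2020/1/28': [3, 2, 1],
})
dates = demo.columns[1:]
has_case = demo[dates].gt(0)
first_day = has_case.idxmax(axis=1)        # first date column that is True per row
first_day[~has_case.any(axis=1)] = None    # guard counties with no cases at all
print(demo.assign(first_day=first_day).sort_values('first_day').head(3))
```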
(4) Compute the single-day increase in deaths by state, and report which state, on which day, had the largest increase in confirmed cases (the maximum is taken jointly over all states and all days, not separately for each).
df3q.groupby('Province_State').sum().iloc[:, 6:].diff(axis=1).max(axis=1).idxmax()
'New York'
df3q.groupby('Province_State').sum().iloc[:, 6:].diff(axis=1).max(axis=0).idxmax()
'2020/4/15'
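The two cells above locate the state and the date separately; stacking the differenced frame lets a single `idxmax` maximize over both axes at once, which is what the question asks. A sketch on toy cumulative counts:

```python
import pandas as pd

# toy cumulative-confirmed counts per state (columns are days)
demo = pd.DataFrame(
    {'2020/4/14': [100, 50], '2020/4/15': [400, 60], '2020/4/16': [450, 90]},
    index=['New York', 'Ohio'])
daily_increase = demo.diff(axis=1)
# stack collapses (state, day) into one index; idxmax returns the pair
state, day = daily_increase.stack().idxmax()
print(state, day)
```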
(5) For each state, compile a table of confirmed cases and deaths. The first column is the date, starting from the first day the state recorded a death; the second and third columns are the confirmed and death counts. Save each state as a separate CSV file named "<state name>.csv".
for state in df3s['Province_State'].unique():
    df_q = df3q[df3q['Province_State'] == state].iloc[:, 12:].sum()
    df_s = df3s[df3s['Province_State'] == state].iloc[:, 12:].sum()
    df_q = df_q[df_s.cumsum() > 0].reset_index()
    df_s = df_s[df_s.cumsum() > 0].reset_index()
    df_q.columns = ['Date', '确诊人数']
    df_s.columns = ['Date', '死亡人数']
    pd.merge(df_q, df_s, on='Date', how='inner').to_csv(state + '.csv')
(6) For April 1 through April 10, compile a table of newly confirmed cases and new deaths for each day. The first column is the state name; the second and third columns are the newly confirmed and newly dead counts. Save the ten days as ten separate CSV files, each named "<date>.csv".
# new counts are differences of the cumulative series; select date columns by
# label, since the extra Population column in df3s shifts its dates by one
# position relative to df3q (108 vs 107 columns)
q_new = df3q.groupby('Province_State').sum().iloc[:, 5:].diff(axis=1)
s_new = df3s.groupby('Province_State').sum().iloc[:, 6:].diff(axis=1)
for day in range(1, 11):
    date = '2020/4/%d' % day
    df_r = pd.DataFrame(q_new[date]).assign(s=s_new[date]).reset_index()
    df_r.columns = ['州名', '确诊人数', '死亡人数']
    df_r.to_csv('4月%d日.csv' % day)