Pandas Learning - Practice Problems

A record of a few pandas practice problems and their solutions.

# Comprehensive Exercise 1: Shanghai vehicle license plate auctions, 2002-2018
import pandas as pd
df = pd.read_csv('2002年-2018年上海机动车拍照拍卖.csv')
df.head()
    Date  Total number of license issued  lowest price   avg price  Total number of applicants
0  2-Jan                            1400          13600      14735                        3718
1  2-Feb                            1800          13100      14057                        4590
2  2-Mar                            2000          14300      14662                        5190
3  2-Apr                            2300          16000      16334                        4806
4  2-May                            2350          17800      18357                        4665
df['中标率'] = df['Total number of license issued']/df['Total number of applicants']
df.sort_values(['Date'])
       Date  Total number of license issued  lowest price   avg price  Total number of applicants       中标率
98   10-Apr                            8500          41000      41637                       17313  0.490961
102  10-Aug                            9000          39800      40169                       16855  0.533966
106  10-Dec                            9000          10400      15970                       11224  0.801853
96   10-Feb                            7500          38300      38620                       18810  0.398724
95   10-Jan                            8000          37800      38311                       18975  0.421607
..      ...                             ...            ...        ...                         ...       ...
85    9-Mar                            6000          26600      27552                       18575  0.323015
87    9-May                            7200          28500      29100                       16471  0.437132
93    9-Nov                            8000          34900      35317                       21902  0.365263
92    9-Oct                            8000          33900      34402                       22006  0.363537
91    9-Sep                            8500          27200      29500                       14906  0.570240

203 rows × 6 columns

# (1) Find the first month in which the winning rate (中标率) was below 0.05
for index,row in df.iterrows():
    if row['中标率']<0.05:
        print(row['Date'])
        break
15-May
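For reference, the same answer can be read off without an explicit loop; this is just a sketch of a vectorized alternative, assuming at least one month satisfies the condition:
# Boolean indexing: take the Date of every row with 中标率 < 0.05, then keep the first one
df.loc[df['中标率'] < 0.05, 'Date'].iloc[0]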
# (3) Split the first (date) column into two new columns: one for the year (format 20xx) and one for the month (English abbreviation).
# Insert them as the first and second columns, drop the original date column, and shift the remaining columns back accordingly.
df['年份'] = df['Date'].apply(lambda x: 2000+int(x.split('-')[0]))
df['月份'] = df['Date'].apply(lambda x: x.split('-')[1])
newcolumns = ['年份','月份']+list(df.columns[1:-3])
df1 = df.reindex(columns=newcolumns).copy()
df1.head()
     年份   月份  Total number of license issued  lowest price   avg price  Total number of applicants
0  2002  Jan                            1400          13600      14735                        3718
1  2002  Feb                            1800          13100      14057                        4590
2  2002  Mar                            2000          14300      14662                        5190
3  2002  Apr                            2300          16000      16334                        4806
4  2002  May                            2350          17800      18357                        4665
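The split can also be done with the vectorized string accessor instead of apply; a sketch, assuming every Date follows the "y-Mon" pattern seen above (it produces the same values as the apply version):
parts = df['Date'].str.split('-', expand=True)   # column 0: year digits, column 1: month abbreviation
years = 2000 + parts[0].astype(int)
months = parts[1]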
# (2) For each year, compute the following statistics of the lowest auction price: maximum, mean and 0.75 quantile, shown together in one table.
import numpy as np
grouped = df1['lowest price '].groupby(df1['年份']).agg([max,np.mean])
grouped['0.75分位数'] = df1['lowest price '].groupby(df1['年份']).quantile(q=0.75)
grouped
        max          mean  0.75分位数
年份
2002  30800  20316.666667    24300.0
2003  38500  31983.333333    36300.0
2004  44200  29408.333333    38400.0
2005  37900  31908.333333    35600.0
2006  39900  37058.333333    39525.0
2007  53800  45691.666667    48950.0
2008  37300  29945.454545    34150.0
2009  36900  31333.333333    34150.0
2010  44900  38008.333333    41825.0
2011  53800  47958.333333    51000.0
2012  68900  61108.333333    65325.0
2013  90800  79125.000000    82550.0
2014  74600  73816.666667    74000.0
2015  85300  80575.000000    83450.0
2016  88600  85733.333333    87475.0
2017  93500  90616.666667    92350.0
2018  89000  87825.000000    88150.0
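The three statistics can also be requested in a single agg call; a sketch, assuming pandas >= 0.25 for named aggregation:
df1.groupby('年份')['lowest price '].agg(
    max='max',
    mean='mean',
    **{'0.75分位数': lambda x: x.quantile(0.75)}   # lambda so the quantile level is fixed at 0.75
)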
# (4) Multi-level index
new_index_columns = ['年份']+list(df.columns[1:-3])
df2 = df1.set_index(new_index_columns)
df2.head()
                                                                              月份
年份   Total number of license issued  lowest price   avg price  Total number of applicants
2002  1400                            13600          14735       3718                       Jan
      1800                            13100          14057       4590                       Feb
      2000                            14300          14662       5190                       Mar
      2300                            16000          16334       4806                       Apr
      2350                            17800          18357       4665                       May
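If the goal is simply a two-level index on the time information, a more common construction would index by year and month instead; a sketch:
df2_alt = df1.set_index(['年份', '月份'])   # hierarchical index: year as the outer level, month as the inner level
df2_alt.head()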
# (5) In general, the difference between a month's lowest price and the previous month's lowest price has the same sign as the difference between the two months' average prices. Which auction months do not have this property?
# Join two tables: the left table takes rows from the second row to the end, the right table takes all rows, then use an outer join
df3 = df1[['年份','月份','lowest price ','avg price']].copy()
df3 = df3.iloc[1:].reset_index()[['月份','lowest price ','avg price']].join(df3,rsuffix='_lastmonth',how='outer')
df3[((df3['lowest price ']-df3['lowest price _lastmonth'])*(df3['avg price']-df3['avg price_lastmonth']))<0][['年份','月份']]
      年份   月份
20   2003  Oct
21   2003  Nov
28   2004  Jun
35   2004  Jan
36   2005  Feb
43   2005  Sep
51   2006  May
55   2006  Sep
59   2006  Jan
60   2007  Feb
70   2007  Dec
127  2012  Oct
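Because the left table is shifted up by one row before the join, each result row pairs the 年份 of the earlier month with the 月份 of the following month, so the labels can look offset. A diff-based check avoids the join and labels the violating month consistently; a sketch, assuming df1 is still in chronological order:
low_diff = df1['lowest price '].diff()    # change in lowest price versus the previous month
avg_diff = df1['avg price'].diff()        # change in average price versus the previous month
df1[(low_diff * avg_diff) < 0][['年份', '月份']]   # months where the two changes have opposite signs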
# (6) Define a month's issuance gain (发行增益) as that month's number of licenses issued minus the mean issuance of the two preceding months; fill the first two months with 0, and find when the gain reaches its maximum and minimum.
df4 = df1.copy()
mean_value = df4['Total number of license issued'].rolling(2).mean()
mean_value.index = range(1,len(mean_value)+1)
df4['mean-value-of-last-two-months']=mean_value[:202]
df4 = df4.fillna(0)  # assign back, otherwise the missing means of the first two months stay NaN
df4['发行增益'] = df4['Total number of license issued']-df4['mean-value-of-last-two-months']
df4.iloc[[df4['发行增益'].idxmax(),df4['发行增益'].idxmin()]]
      年份   月份  Total number of license issued  lowest price   avg price  Total number of applicants  mean-value-of-last-two-months   发行增益
72  2008  Jan                           16000           8100      23370                       20539                         7500.0  8500.0
74  2008  Apr                            9000          37300      37659                       37072                        12650.0 -3650.0
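The same gain can be computed directly with rolling and shift, which makes the "mean of the two preceding months" explicit and fills the first two months with 0 as required; a sketch:
issued = df1['Total number of license issued']
prev2_mean = issued.rolling(2).mean().shift(1).fillna(0)   # mean of the two preceding months, 0 for the first two rows
gain = issued - prev2_mean
df1.loc[[gain.idxmax(), gain.idxmin()]]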
# Comprehensive Exercise 2: Cargo volume carried by Russian airport freight flights, 2007-2019
airport_data = pd.read_csv('2007年-2019年俄罗斯货运航班运载量.csv')
airport_data.head()
      Airport name  Year  January  February   March   April     May    June   July  August  September  October  November  December  Whole year                             Airport coordinates
0           Abakan  2019    44.70     66.21   72.70   75.82  100.34   78.38  63.88   73.06      66.74    75.44    110.50     89.80      917.57   (Decimal('91.399735'), Decimal('53.751351'))
1           Aikhal  2019     0.00      0.00    0.00    0.00    0.00    0.00   0.00    0.00       0.00     0.00      0.00      0.00        0.00  (Decimal('111.543324'), Decimal('65.957161'))
2             Loss  2019     0.00      0.00    0.00    0.00    0.00    0.00   0.00    0.00       0.00     0.00      0.00      0.00        0.00  (Decimal('125.398355'), Decimal('58.602489'))
3          Amderma  2019     0.00      0.00    0.00    0.00    0.00    0.00   0.00    0.00       0.00     0.00      0.00      0.00        0.00   (Decimal('61.577429'), Decimal('69.759076'))
4  Anadyr (Carbon)  2019    81.63    143.01  260.90  304.36  122.00  106.87  84.99  130.00     102.00   118.00     94.00    199.00     1746.76  (Decimal('177.738273'), Decimal('64.713433'))
airport_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3711 entries, 0 to 3710
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Airport name         3711 non-null   object 
 1   Year                 3711 non-null   int64  
 2   January              3711 non-null   float64
 3   February             3711 non-null   float64
 4   March                3711 non-null   float64
 5   April                3711 non-null   float64
 6   May                  3711 non-null   float64
 7   June                 3711 non-null   float64
 8   July                 3711 non-null   float64
 9   August               3711 non-null   float64
 10  September            3711 non-null   float64
 11  October              3711 non-null   float64
 12  November             3711 non-null   float64
 13  December             3711 non-null   float64
 14  Whole year           3711 non-null   float64
 15  Airport coordinates  3711 non-null   object 
dtypes: float64(13), int64(1), object(2)
memory usage: 464.0+ KB
# (1) Total cargo volume carried by freight flights each year.
airport_data.groupby('Year')['Whole year'].sum()
Year
2007    659438.23
2008    664682.46
2009    560809.77
2010    693033.98
2011    818691.71
2012    846388.03
2013    792337.08
2014    729457.12
2015    630208.97
2016    679370.15
2017    773662.28
2018    767095.28
2019    764606.27
Name: Whole year, dtype: float64
# (2) Are the same airports recorded every year?
airport_data.groupby('Year')['Airport name'].count()
# Not the same: the count drops to 248 in 2018 and 251 in 2019, so the recorded airports differ across years
Year
2007    292
2008    292
2009    292
2010    292
2011    292
2012    292
2013    292
2014    292
2015    292
2016    292
2017    292
2018    248
2019    251
Name: Airport name, dtype: int64
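Counting only shows that the number of airports changes; to confirm that the airport lists themselves differ, the sets of names per year could be compared. A sketch:
sets_by_year = airport_data.groupby('Year')['Airport name'].agg(set)   # set of airport names recorded each year
all(s == sets_by_year.iloc[0] for s in sets_by_year)                   # True only if every year lists exactly the same airports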
# (3) For each year from 2010 to 2015, compute the proportion of airports whose whole-year cargo volume is recorded as 0.

airport_data1 = airport_data[(airport_data.Year<=2015)&(airport_data.Year>=2010)]
airport_data1.groupby('Year').apply(lambda x:len(x[x['Whole year']==0])/len(x))
Year
2010    0.767123
2011    0.770548
2012    0.770548
2013    0.770548
2014    0.770548
2015    0.770548
dtype: float64
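Equivalently, the proportion is just the mean of a boolean mask; a slightly shorter sketch:
airport_data1.groupby('Year')['Whole year'].apply(lambda x: (x == 0).mean())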
# (4) If an airport has 5 or more years in which every monthly volume is recorded as 0, delete all of its records from the table and return the cleaned table.
# That is, first find the airports with at least 5 zero-volume years
airport_data2 = airport_data.copy()
airport_data2 = airport_data2.set_index('Airport name')
drop_row = pd.DataFrame(airport_data2.groupby('Airport name')['Whole year'].apply(lambda x:len(x[x==0])>=5))
airport_data2.drop(drop_row[drop_row['Whole year']==True].index).head()
                        Year  January  February   March   April     May    June    July  August  September  October  November  December  Whole year                             Airport coordinates
Airport name
Abakan                  2019    44.70     66.21   72.70   75.82  100.34   78.38   63.88   73.06      66.74    75.44    110.50     89.80      917.57   (Decimal('91.399735'), Decimal('53.751351'))
Anadyr (Carbon)         2019    81.63    143.01  260.90  304.36  122.00  106.87   84.99  130.00     102.00   118.00     94.00    199.00     1746.76  (Decimal('177.738273'), Decimal('64.713433'))
Anapa (Vitjazevo)       2019    45.92     53.15   54.00   54.72   52.00   67.45  172.31   72.57      70.00    63.00     69.00     82.10      856.22   (Decimal('37.341511'), Decimal('45.003748'))
Arkhangelsk (Talagy)    2019    85.61    118.70  131.39  144.82  137.95  140.18  128.56  135.68     124.75   139.60    210.27    307.10     1804.61   (Decimal('40.714892'), Decimal('64.596138'))
Astrakhan (Narimanovo)  2019    51.75     61.08   65.60   71.84   71.38   63.95  164.86   79.46      85.21    87.23     79.06     99.16      980.58   (Decimal('47.999896'), Decimal('46.287344'))
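groupby(...).filter can express the same rule ("keep an airport only if it has fewer than 5 zero-volume years") in one step; a sketch:
kept = airport_data.groupby('Airport name').filter(lambda g: (g['Whole year'] == 0).sum() < 5)
kept.head()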
# (6) In statistics, rank is commonly used to represent ordering. Define the rank of an airport's month as the ranking of that month's cargo volume among all months of that year for that airport
# (for example, if an airport's January 2019 volume is the highest of the 12 months of 2019, its rank is 1).
# To measure the relative size of a month's traffic, add up the ranks of all airports for that month and call the sum that month's composite rank index.
# Compute the composite rank index for each of the 12 months of 2016.
airport_data3=airport_data[airport_data.Year==2016]

month = airport_data3.columns[2:14]
Rank = pd.DataFrame(index=airport_data3.index, columns=month)
for index in airport_data3.index:
    month_list = airport_data3.loc[index,month].sort_values(ascending=False)
    rank = 1
    for a in month_list.index:
        Rank.loc[index, a] = rank
        rank = rank + 1

print(Rank.apply(lambda x: x.sum()))

January      3406
February     3076
March        2730
April        2432
May          2276
June         2047
July         1854
August       1527
September    1269
October      1009
November      728
December      422
dtype: int64
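The double loop can be replaced by DataFrame.rank along the columns; a sketch (method='first' mirrors the loop's tie-breaking by column order, though other rank methods would handle ties slightly differently):
ranks = airport_data3[month].rank(axis=1, ascending=False, method='first')   # rank 1 = busiest month of the year
ranks.sum().astype(int)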
# Comprehensive Exercise 3: US county-level COVID-19 deaths and confirmed cases
df_dead = pd.read_csv('美国死亡数.csv')
df_confirm = pd.read_csv('美国确证数.csv')
df_dead.head()
        UID iso2 iso3  code3  FIPS   Admin2 Province_State Country_Region        Lat      Long_  ...  2020/4/17  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26
0  84001001   US  USA    840  1001  Autauga        Alabama             US  32.539527 -86.644082  ...          2          2          2          1          1          2          2          2          2          2
1  84001003   US  USA    840  1003  Baldwin        Alabama             US  30.727750 -87.722071  ...          2          2          2          3          3          3          3          3          3          3
2  84001005   US  USA    840  1005  Barbour        Alabama             US  31.868263 -85.387129  ...          0          0          0          0          0          0          0          0          0          0
3  84001007   US  USA    840  1007     Bibb        Alabama             US  32.996421 -87.125115  ...          0          0          0          0          0          0          0          0          0          0
4  84001009   US  USA    840  1009   Blount        Alabama             US  33.982109 -86.567906  ...          0          0          0          0          0          0          0          0          0          0

5 rows × 108 columns

df_confirm.head()
        UID iso2 iso3  code3  FIPS   Admin2 Province_State Country_Region        Lat      Long_  ...  2020/4/17  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26
0  84001001   US  USA    840  1001  Autauga        Alabama             US  32.539527 -86.644082  ...         26         25         26         28         30         32         33         36         36         37
1  84001003   US  USA    840  1003  Baldwin        Alabama             US  30.727750 -87.722071  ...        103        109        112        117        123        132        143        147        147        161
2  84001005   US  USA    840  1005  Barbour        Alabama             US  31.868263 -85.387129  ...         15         18         20         22         28         29         30         32         32         33
3  84001007   US  USA    840  1007     Bibb        Alabama             US  32.996421 -87.125115  ...         24         26         28         32         32         34         33         34         34         38
4  84001009   US  USA    840  1009   Blount        Alabama             US  33.982109 -86.567906  ...         20         20         21         22         26         29         31         31         31         34

5 rows × 107 columns

list(df_dead.columns)
['UID',
 'iso2',
 'iso3',
 'code3',
 'FIPS',
 'Admin2',
 'Province_State',
 'Country_Region',
 'Lat',
 'Long_',
 'Combined_Key',
 'Population',
 '2020/1/22',
 '2020/1/23',
 '2020/1/24',
 '2020/1/25',
 '2020/1/26',
 '2020/1/27',
 '2020/1/28',
 '2020/1/29',
 '2020/1/30',
 '2020/1/31',
 '2020/2/1',
 '2020/2/2',
 '2020/2/3',
 '2020/2/4',
 '2020/2/5',
 '2020/2/6',
 '2020/2/7',
 '2020/2/8',
 '2020/2/9',
 '2020/2/10',
 '2020/2/11',
 '2020/2/12',
 '2020/2/13',
 '2020/2/14',
 '2020/2/15',
 '2020/2/16',
 '2020/2/17',
 '2020/2/18',
 '2020/2/19',
 '2020/2/20',
 '2020/2/21',
 '2020/2/22',
 '2020/2/23',
 '2020/2/24',
 '2020/2/25',
 '2020/2/26',
 '2020/2/27',
 '2020/2/28',
 '2020/2/29',
 '2020/3/1',
 '2020/3/2',
 '2020/3/3',
 '2020/3/4',
 '2020/3/5',
 '2020/3/6',
 '2020/3/7',
 '2020/3/8',
 '2020/3/9',
 '2020/3/10',
 '2020/3/11',
 '2020/3/12',
 '2020/3/13',
 '2020/3/14',
 '2020/3/15',
 '2020/3/16',
 '2020/3/17',
 '2020/3/18',
 '2020/3/19',
 '2020/3/20',
 '2020/3/21',
 '2020/3/22',
 '2020/3/23',
 '2020/3/24',
 '2020/3/25',
 '2020/3/26',
 '2020/3/27',
 '2020/3/28',
 '2020/3/29',
 '2020/3/30',
 '2020/3/31',
 '2020/4/1',
 '2020/4/2',
 '2020/4/3',
 '2020/4/4',
 '2020/4/5',
 '2020/4/6',
 '2020/4/7',
 '2020/4/8',
 '2020/4/9',
 '2020/4/10',
 '2020/4/11',
 '2020/4/12',
 '2020/4/13',
 '2020/4/14',
 '2020/4/15',
 '2020/4/16',
 '2020/4/17',
 '2020/4/18',
 '2020/4/19',
 '2020/4/20',
 '2020/4/21',
 '2020/4/22',
 '2020/4/23',
 '2020/4/24',
 '2020/4/25',
 '2020/4/26']
# (1) Use corr to compute the correlation coefficient between county population and the death count on the last day (2020/4/26)
df_dead[['Population','2020/4/26']].corr()
            Population  2020/4/26
Population    1.000000   0.403844
2020/4/26     0.403844   1.000000
# (2) As of April 1, compute the proportion of counties with zero infections in each state
# Note that the counts in the data are cumulative
df_confirm1 = df_confirm[['Admin2', 'Province_State', '2020/4/1']]
df_confirm1.groupby('Province_State')['2020/4/1'].apply(lambda x: len(x[x==0]) / len(x) )
Province_State
Alabama                 0.119403
Alaska                  0.793103
Arizona                 0.000000
Arkansas                0.293333
California              0.137931
Colorado                0.218750
Connecticut             0.000000
Delaware                0.000000
District of Columbia    0.000000
Florida                 0.164179
Georgia                 0.125786
Hawaii                  0.200000
Idaho                   0.386364
Illinois                0.480392
Indiana                 0.108696
Iowa                    0.404040
Kansas                  0.609524
Kentucky                0.441667
Louisiana               0.062500
Maine                   0.250000
Maryland                0.041667
Massachusetts           0.142857
Michigan                0.192771
Minnesota               0.367816
Mississippi             0.060976
Missouri                0.391304
Montana                 0.625000
Nebraska                0.752688
Nevada                  0.470588
New Hampshire           0.100000
New Jersey              0.000000
New Mexico              0.424242
New York                0.080645
North Carolina          0.180000
North Dakota            0.547170
Ohio                    0.181818
Oklahoma                0.376623
Oregon                  0.277778
Pennsylvania            0.104478
Rhode Island            0.000000
South Carolina          0.065217
South Dakota            0.560606
Tennessee               0.115789
Texas                   0.452756
Utah                    0.482759
Vermont                 0.142857
Virginia                0.270677
Washington              0.128205
West Virginia           0.472727
Wisconsin               0.319444
Wyoming                 0.347826
Name: 2020/4/1, dtype: float64
# (3) Find the three counties where confirmed cases appeared earliest
day = list(df_confirm.columns[11:])
df_confirm.set_index(['Province_State','UID'])[day].sort_values(by=day,ascending=False).head(10)
# From the sorted table we can see:
# 2020/1/22: Washington 84053033 reports its first case
# 2020/1/24: Illinois 84017031 reports its first case
# 2020/1/26: California 84006037 and 84006059 report their first cases
# 2020/1/26: Arizona 84004013 reports its first case
                         2020/1/22  2020/1/23  2020/1/24  2020/1/25  2020/1/26  2020/1/27  2020/1/28  2020/1/29  2020/1/30  2020/1/31  ...  2020/4/17  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26
Province_State UID
Washington     84053033          1          1          1          1          1          1          1          1          1          1  ...       4902       4902       5174       5174       5293       5379       5532       5637       5739       5863
Illinois       84017031          0          0          1          1          1          1          1          1          1          2  ...      19391      20395      21272      22101      23181      24546      25811      27616      29058      30574
California     84006037          0          0          0          0          1          1          1          1          1          1  ...      11400      12021      12341      13823      15153      16447      17537      18545      19133      19567
               84006059          0          0          0          0          1          1          1          1          1          1  ...       1501       1556       1636       1676       1691       1753       1827       1845       1969       2074
Arizona        84004013          0          0          0          0          1          1          1          1          1          1  ...       2404       2491       2589       2636       2738       2846       2970       3116       3234       3359
California     84006085          0          0          0          0          0          0          0          0          0          1  ...       1870       1870       1870       1922       1922       1962       1987       2018       2040       2084
Massachusetts  84025025          0          0          0          0          0          0          0          0          0          0  ...       7272       7696       8074       8074       8669       9060       9739      10724      11218      11543
California     84006069          0          0          0          0          0          0          0          0          0          0  ...         44         44         44         44         44         45         47         47         47         47
               84006073          0          0          0          0          0          0          0          0          0          0  ...       2087       2158       2213       2268       2325       2491       2643       2826       2943       2943
               84006067          0          0          0          0          0          0          0          0          0          0  ...        896        914        925        940        954        971        987       1019       1019       1037

10 rows × 96 columns
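Instead of eyeballing the sorted table, the first reporting date of each county can be computed directly; a sketch that takes, for each county, the first date column whose cumulative count is nonzero:
cases = df_confirm.set_index(['Province_State', 'Admin2'])[day]
first_day = cases.gt(0).idxmax(axis=1)                           # first date column with a nonzero count
first_day = pd.to_datetime(first_day[cases.gt(0).any(axis=1)])   # drop counties that never report a case
first_day.sort_values().head(5)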


# (4) Compute the daily increase in deaths for each state, and report which state saw the largest single-day increase and on which day (taken over all states and all days together, not per state).
df_dead3 = df_dead.set_index('Province_State').groupby(level=0)[day].sum()
df_dead3=df_dead3.iloc[:,1:]-df_dead3.iloc[:,:-1].rename(columns=dict(zip(day[:-1],day[1:])))
df_dead3=df_dead3.sort_values(by=day[1:],ascending=False)
df_dead3.head()
                2020/1/23  2020/1/24  2020/1/25  2020/1/26  2020/1/27  2020/1/28  2020/1/29  2020/1/30  2020/1/31  2020/2/1  ...  2020/4/17  2020/4/18  2020/4/19  2020/4/20  2020/4/21  2020/4/22  2020/4/23  2020/4/24  2020/4/25  2020/4/26
Province_State
Washington              0          0          0          0          0          0          0          0          0         0  ...         24         10         23          7         21         22         25         11         15         12
California              0          0          0          0          0          0          0          0          0         0  ...         81        103         35         48         57        139        112         88         68         36
Florida                 0          0          0          0          0          0          0          0          0         0  ...         57         23         26         48         45         26         94         59          9         20
New Jersey              0          0          0          0          0          0          0          0          0         0  ...        322        231        294        134        246        302         76        257        231         24
South Dakota            0          0          0          0          0          0          0          0          0         0  ...          0          0          0          0          1          1          0          1          0          1

5 rows × 95 columns

df_dead3.idxmax(axis=1)
Province_State
Washington               2020/4/6
California              2020/4/22
Florida                 2020/4/23
New Jersey              2020/4/22
South Dakota             2020/4/6
New York                 2020/4/7
Georgia                  2020/4/7
Kansas                  2020/4/11
Colorado                2020/4/24
Louisiana               2020/4/14
Virginia                2020/4/22
Oregon                   2020/4/9
Indiana                 2020/4/21
Kentucky                2020/4/16
South Carolina           2020/4/9
Nevada                  2020/4/12
Texas                   2020/4/23
Illinois                2020/4/18
Connecticut             2020/4/20
Missouri                2020/4/21
Michigan                2020/4/22
Pennsylvania            2020/4/21
Wisconsin                2020/4/5
Maryland                2020/4/18
Vermont                  2020/4/9
Oklahoma                2020/4/21
Mississippi             2020/4/21
Ohio                    2020/4/22
Massachusetts           2020/4/25
Tennessee                2020/4/2
District of Columbia    2020/4/14
Arizona                 2020/4/18
Minnesota               2020/4/26
Utah                    2020/4/17
New Hampshire            2020/4/3
Arkansas                 2020/4/9
Hawaii                  2020/4/11
North Carolina          2020/4/17
Alabama                 2020/4/21
Iowa                    2020/4/24
New Mexico              2020/4/22
Idaho                   2020/4/10
Delaware                2020/4/11
Maine                    2020/4/7
North Dakota            2020/4/21
Nebraska                2020/4/23
Montana                 2020/3/30
Alaska                   2020/4/7
West Virginia           2020/4/23
Rhode Island             2020/4/3
Wyoming                 2020/4/13
dtype: object
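idxmax above only gives each state's own worst day. Since the question asks for the single largest increase over all states and all days jointly, the increase table can be stacked into one long Series first; a sketch:
daily_inc = df_dead3.stack()         # one value per (state, day) pair
daily_inc.idxmax(), daily_inc.max()  # the (state, day) with the largest single-day increase, and its value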