第一阶段学习总结:Datawhale组织的Pandas组队学习暂告一段落。快乐pandas以自主学习、助教答疑、定期打卡等形式,课程资料编写比较详细,配合思考题和练习巩固,无奈于本人较菜,对所学的知识掌握还不够透彻,完成这次综合练习断断续续花了两天时间,后面一定要抽时间复盘之前所学的内容。
全面学习基础内容有助于建立体系,了解各大功能模块,在使用时才能有所了解,找准方向。因此后面会继续参加pandas(下)的学习,争取学完这期教程。
题目快速跳转:
import numpy as np
import pandas as pd
import re
一、2002 年-2018 年上海机动车拍照拍卖
vehicle = pd.read_csv('数据集/2002年-2018年上海机动车拍照拍卖.csv')
vehicle.head()
Date | Total number of license issued | lowest price | avg price | Total number of applicants | |
---|---|---|---|---|---|
0 | 2-Jan | 1400 | 13600 | 14735 | 3718 |
1 | 2-Feb | 1800 | 13100 | 14057 | 4590 |
2 | 2-Mar | 2000 | 14300 | 14662 | 5190 |
3 | 2-Apr | 2300 | 16000 | 16334 | 4806 |
4 | 2-May | 2350 | 17800 | 18357 | 4665 |
(1)
问:哪一次拍卖的中标率首次小于 5%?
vehicle_1 = vehicle.assign(probability=vehicle['Total number of license issued'] / vehicle['Total number of applicants'])
vehicle_1[vehicle_1['probability']<0.05].head(1)
Date | Total number of license issued | lowest price | avg price | Total number of applicants | probability | |
---|---|---|---|---|---|---|
159 | 15-May | 7482 | 79000 | 79099 | 156007 | 0.047959 |
(2)
问:按年统计拍卖最低价的下列统计量:最大值、均值、0.75 分位数,要求显示在同一张表上。
vehicle_2 = vehicle.assign(year=vehicle['Date'].apply(lambda x: x.split('-')[0]))
vehicle_2.head()
groupby_ = vehicle_2.groupby('year')
new_dataframe = pd.DataFrame()
new_dataframe['max'] = groupby_['lowest price'].max()
new_dataframe['mean'] = groupby_['lowest price'].mean()
new_dataframe['quantile'] = groupby_['lowest price'].quantile(0.75)
new_dataframe
max | mean | quantile | |
---|---|---|---|
year | |||
10 | 44900 | 38008.333333 | 41825.0 |
11 | 53800 | 47958.333333 | 51000.0 |
12 | 68900 | 61108.333333 | 65325.0 |
13 | 90800 | 79125.000000 | 82550.0 |
14 | 74600 | 73816.666667 | 74000.0 |
15 | 85300 | 80575.000000 | 83450.0 |
16 | 88600 | 85733.333333 | 87475.0 |
17 | 93500 | 90616.666667 | 92350.0 |
18 | 89000 | 87825.000000 | 88150.0 |
2 | 30800 | 20316.666667 | 24300.0 |
3 | 38500 | 31983.333333 | 36300.0 |
4 | 44200 | 29408.333333 | 38400.0 |
5 | 37900 | 31908.333333 | 35600.0 |
6 | 39900 | 37058.333333 | 39525.0 |
7 | 53800 | 45691.666667 | 48950.0 |
8 | 37300 | 29945.454545 | 34150.0 |
9 | 36900 | 31333.333333 | 34150.0 |
(3)
问:将第一列时间列拆分成两个列,一列为年份(格式为 20××) ,另一列为月份(英语缩写) ,添加到列表作为第一第二列,并将原表第一列删除,其他列依次向后顺延。
Year = vehicle['Date'].apply(lambda x: str(20) + ("%02d" % int(x.split('-')[0])))
Month = vehicle['Date'].apply(lambda x: x.split('-')[1])
vehicle_3 = vehicle.drop(columns='Date')
vehicle_3.insert(0, 'Year', Year)
vehicle_3.insert(1, 'Month', Month)
vehicle_3.head()
Year | Month | Total number of license issued | lowest price | avg price | Total number of applicants | |
---|---|---|---|---|---|---|
0 | 2002 | Jan | 1400 | 13600 | 14735 | 3718 |
1 | 2002 | Feb | 1800 | 13100 | 14057 | 4590 |
2 | 2002 | Mar | 2000 | 14300 | 14662 | 5190 |
3 | 2002 | Apr | 2300 | 16000 | 16334 | 4806 |
4 | 2002 | May | 2350 | 17800 | 18357 | 4665 |
(4)
问:现在将表格行索引设为多级索引,外层为年份,内层为原表格第二至第五列的变量名,列索引为月份。
vehicle_4 = vehicle_3.set_index(['Year', 'Total number of license issued', 'lowest price', 'avg price', 'Total number of applicants'])
vehicle_4.head()
Month | |||||
---|---|---|---|---|---|
Year | Total number of license issued | lowest price | avg price | Total number of applicants | |
2002 | 1400 | 13600 | 14735 | 3718 | Jan |
1800 | 13100 | 14057 | 4590 | Feb | |
2000 | 14300 | 14662 | 5190 | Mar | |
2300 | 16000 | 16334 | 4806 | Apr | |
2350 | 17800 | 18357 | 4665 | May |
(5)
问:一般而言某个月最低价与上月最低价的差额,会与该月均值与上月均值的差额具有相同的正负号,哪些拍卖时间不具有这个特点?
vehicle_5 = vehicle[['Date','lowest price','avg price']]
for index in range(1, len(vehicle_5)):
if ((vehicle_5.iloc[index]['lowest price'] - vehicle_5.iloc[index-1]['lowest price'])\
* (vehicle_5.iloc[index]['avg price'] - vehicle_5.iloc[index-1]['avg price'])) < 0:
print(vehicle_5.iloc[index]['Date'])
3-Oct
3-Nov
4-Jun
5-Jan
5-Feb
5-Sep
6-May
6-Sep
7-Jan
7-Feb
7-Dec
12-Oct
(6)
问:将某一个月牌照发行量与其前两个月发行量均值的差额定义为发行增益,最初的两个月用 0 填充,求发行增益极值出现的时间。
vehicle_6 = vehicle[['Date','Total number of license issued']]
gain = [0,0]
string = 'Total number of license issued'
for index in