第1关:数据探索与可视化
任务描述
本关任务:编写python代码,完成一天中不同时间段的平均租赁数量的可视化功能。
相关知识
为了完成本关任务,你需要掌握:
读取数据
数据探索与可视化
import pandas as pd
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
#********* Begin *********#
train_df = pd.read_csv('./step1/bike_train.csv')
train_df['hour'] = train_df.datetime.apply(lambda x:x.split()[1].split(':')[0]).astype('int')
group_hour=train_df.groupby(train_df.hour)
hour_mean=group_hour[['count','registered','casual']].mean()
fig=plt.figure(figsize=(10,10))
plt.plot(hour_mean['count'])
plt.title('average count per hour')
plt.savefig('./step1/result/plot.png')
plt.show()
#********* End *********#
第2关:特征工程
任务描述
本关任务:编写python代码,完成时间细化的功能。
相关知识
为了完成本关任务,你需要掌握:
相关性分析
特征选择
import pandas as pd
import numpy as np
from datetime import datetime
def transform_data(train_df):
#********* Begin *********#
train_df['date'] = train_df.datetime.apply(lambda x: x.split()[0])
train_df['hour'] = train_df.datetime.apply(lambda x: x.split()[1].split(':')[0]).astype('int')
train_df['year'] = train_df.datetime.apply(lambda x: x.split()[0].split('-')[0]).astype('int')
train_df['month'] = train_df.datetime.apply(lambda x: x.split()[0].split('-')[1]).astype('int')
train_df['weekday'] = train_df.date.apply(lambda x: datetime.strptime(x, '%Y-%m-%d').isoweekday())
return train_df
#********* End **********#
第3关:租赁需求预估
任务描述
本关任务:编写python代码,实现租赁需求预估。
相关知识
为了完成本关任务,你需要掌握:
独热编码
sklearn机器学习算法的使用
生成预测结果
#********* Begin *********#
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.linear_model import Ridge
train_df = pd.read_csv('./step3/bike_train.csv')
# 舍弃掉异常count
train_df=train_df[np.abs(train_df['count']-train_df['count'].mean())<=3*train_df['count'].std()]
# 训练集的时间数据处理
train_df['date']=train_df.datetime.apply(lambda x:x.split()[0])
train_df['hour']=train_df.datetime.apply(lambda x:x.split()[1].split(':')[0]).astype('int')
train_df['year']=train_df.datetime.apply(lambda x:x.split()[0].split('-')[0]).astype('int')
train_df['month']=train_df.datetime.apply(lambda x:x.split()[0].split('-')[1]).astype('int')
train_df['weekday']=train_df.date.apply( lambda x : datetime.strptime(x,'%Y-%m-%d').isoweekday())
# 独热编码
train_df_back=train_df
dummies_month = pd.get_dummies(train_df['month'], prefix='month')
dummies_year = pd.get_dummies(train_df['year'], prefix='year')
dummies_season = pd.get_dummies(train_df['season'], prefix='season')
dummies_weather = pd.get_dummies(train_df['weather'], prefix='weather')
train_df_back = pd.concat([train_df, dummies_month,dummies_year, dummies_season,dummies_weather], axis = 1)
train_label = train_df_back['count']
train_df_back = train_df_back.drop(['datetime', 'season', 'weather', 'atemp', 'date', 'month', 'count'], axis=1)
test_df = pd.read_csv('./step3/bike_test.csv')
# 测试集的时间数据处理
test_df['date']=test_df.datetime.apply(lambda x:x.split()[0])
test_df['hour']=test_df.datetime.apply(lambda x:x.split()[1].split(':')[0]).astype('int')
test_df['year']=test_df.datetime.apply(lambda x:x.split()[0].split('-')[0]).astype('int')
test_df['month']=test_df.datetime.apply(lambda x:x.split()[0].split('-')[1]).astype('int')
test_df['weekday']=test_df.date.apply( lambda x : datetime.strptime(x,'%Y-%m-%d').isoweekday())
# 独热编码
test_df_back=test_df
dummies_month = pd.get_dummies(test_df['month'], prefix='month')
dummies_year = pd.get_dummies(test_df['year'], prefix='year')
dummies_season = pd.get_dummies(test_df['season'], prefix='season')
dummies_weather = pd.get_dummies(test_df['weather'], prefix='weather')
test_df_back = pd.concat([test_df, dummies_month,dummies_year, dummies_season,dummies_weather], axis = 1)
test_df_back = test_df_back.drop(['datetime', 'season', 'weather', 'atemp', 'date', 'month'], axis=1)
clf = Ridge(alpha=1.0)
# 训练
clf.fit(train_df_back, train_label)
# 预测
count = clf.predict(test_df_back)
# 保存结果
result = pd.DataFrame({'datetime':test_df['datetime'], 'count':count})
result.to_csv('./step3/result.csv', index=False)
#********* End *********#