Datawhale Insights
Topic: data mining, iFLYTEK competition
Competition: Hotel Accommodation Price Prediction Challenge
Type: data mining
Competition link👇:
https://challenge.xfyun.cn/topic/info?type=accommodation-price&ch=vWxQGFU
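Background
With the domestic epidemic currently under control, the domestic and international tourism markets are gradually recovering and the market structure is returning to normal. The recovery of tourism should in turn boost the hotel industry and push the market in a positive direction.
To broaden the possibilities of travel, hotel accommodation platforms keep offering more distinctive, personalized experiences. Differences between regions in transport, holidays, quality, and location all affect accommodation prices, so predicting those prices from known data is a considerable challenge.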
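Task
The task is to train a model on the training data, which describes hotels and their rooms, and to predict the room price for each sample in the test set.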
Dataset
The competition data consist of a training set and a test set and contain 15 fields, of which target is the value to predict.
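Field | Description |
---|---|
id | Sample identifier |
host_id | Hotel id |
neighbourhood_group | Neighbourhood group |
neighbourhood | Neighbourhood |
room_type | Room type |
minimum_nights | Minimum number of nights |
number_of_reviews | Number of reviews |
last_review | Date of the most recent review |
reviews_per_month | Number of reviews per month |
calculated_host_listings_count | Number of orders for the hotel |
availability | Number of days in the next 365 on which the room can be booked |
region_1_id | Hotel region ID 1 |
region_2_id | Hotel region ID 2 |
region_3_id | Hotel region ID 3 |
target | Accommodation price (anonymized) |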
Evaluation metric
Submissions are scored by mean absolute error (MAE); the lower the MAE, the better. Reference evaluation code:
from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)
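To make the metric concrete, here is a minimal sketch that computes MAE by hand for the same four values, without scikit-learn. The helper name `mae` is made up for this illustration.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average of |y_true - y_pred|."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

# Absolute errors are 0.5, 0.5, 0 and 1, so the mean is 2 / 4.
score = mae(y_true, y_pred)
print(score)  # 0.5
```

Because every error counts in proportion to its size, a handful of very expensive rooms can dominate the score, which is one reason to look closely at the price distribution.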
Approach
This is a typical regression task. It is worth rescaling the label before modelling and adding some feature engineering:
# Load the datasets and rescale the label
import pandas as pd
import numpy as np
train_df = pd.read_csv('./酒店住宿价格预测挑战赛公开数据/train.csv')
test_df = pd.read_csv('./酒店住宿价格预测挑战赛公开数据/test.csv')
train_df['target'] = np.log1p(train_df['target'])
from sklearn.model_selection import cross_val_predict, cross_validate
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error
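The label is rescaled with `np.log1p`, so predictions have to be mapped back with its inverse before submission. The sketch below uses made-up prices (not competition data) to show the round trip and how strongly the transform compresses large values.

```python
import numpy as np

# Made-up prices for illustration only, including a zero and a large outlier.
prices = np.array([0.0, 50.0, 120.0, 10000.0])

scaled = np.log1p(prices)    # log(1 + x), well defined at x = 0
restored = np.expm1(scaled)  # exp(x) - 1, the inverse of log1p

print(scaled.round(3))
print(np.allclose(restored, prices))
```

Note that a model fitted on the log scale minimizes error on that scale, which is not the same objective as MAE on the original prices; whether the transform helps the leaderboard score is something to check by validation.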
# Feature engineering on the training set
train_df['last_review_isnull'] = train_df['last_review'].isnull()
train_df['last_review_year'] = pd.to_datetime(train_df['last_review']).dt.year
train_df['neighbourhood_group_mean'] = train_df['neighbourhood_group'].map(train_df.groupby(['neighbourhood_group'])['target'].mean())
train_df['neighbourhood_group_counts'] = train_df['neighbourhood_group'].map(train_df['neighbourhood_group'].value_counts())
train_df['room_type_mean'] = train_df['room_type'].map(train_df.groupby(['room_type'])['target'].mean())
train_df['room_type_counts'] = train_df['room_type'].map(train_df['room_type'].value_counts())
train_df['region_1_id_mean'] = train_df['region_1_id'].map(train_df.groupby(['region_1_id'])['target'].mean())
train_df['region_1_counts'] = train_df['region_1_id'].map(train_df['region_1_id'].value_counts())
train_df['region_2_id_mean'] = train_df['region_2_id'].map(train_df.groupby(['region_2_id'])['target'].mean())
train_df['region_2_counts'] = train_df['region_2_id'].map(train_df['region_2_id'].value_counts())
train_df['region_3_id_mean'] = train_df['region_3_id'].map(train_df.groupby(['region_3_id'])['target'].mean())
train_df['region_3_counts'] = train_df['region_3_id'].map(train_df['region_3_id'].value_counts())
train_df['availability_month'] = train_df['availability'] // 30
train_df['availability_week'] = train_df['availability'] // 7
train_df['reviews_per_month_count'] = train_df['reviews_per_month'] * train_df['calculated_host_listings_count']
train_df['room_type_calculated_host_listings_count'] = train_df['room_type'].map(train_df.groupby(['room_type'])['calculated_host_listings_count'].sum())
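The `*_mean` features above are computed from the same rows they are assigned to, so each training row's feature contains a little of its own target. One common alternative, which is not part of the original baseline, is out-of-fold encoding: the mean for a row is computed only from rows in other folds. A minimal sketch on a toy frame, where the function name `oof_target_mean` and the toy data are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_mean(df, col, target, n_splits=5, seed=0):
    """Per-category mean of `target`, computed for each row from the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        means = fit_part.groupby(col)[target].mean()
        # Categories absent from the fitting folds fall back to the fitting-fold mean.
        vals = df[col].iloc[enc_idx].map(means).fillna(fit_part[target].mean())
        encoded.iloc[enc_idx] = vals.to_numpy()
    return encoded

# Toy data: two room types with clearly separated targets.
toy = pd.DataFrame({
    "room_type": [0] * 10 + [1] * 10,
    "target": np.arange(20, dtype=float),
})
toy["room_type_mean_oof"] = oof_target_mean(toy, "room_type", "target")
print(toy.head())
```

With this scheme the test-set feature would still be mapped from statistics over the full training set, as the baseline already does.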
# Feature engineering on the test set (group statistics are taken from the training set)
test_df['last_review_isnull'] = test_df['last_review'].isnull()
test_df['last_review_year'] = pd.to_datetime(test_df['last_review']).dt.year
test_df['neighbourhood_group_mean'] = test_df['neighbourhood_group'].map(train_df.groupby(['neighbourhood_group'])['target'].mean())
test_df['neighbourhood_group_counts'] = test_df['neighbourhood_group'].map(train_df['neighbourhood_group'].value_counts())
test_df['room_type_mean'] = test_df['room_type'].map(train_df.groupby(['room_type'])['target'].mean())
test_df['room_type_counts'] = test_df['room_type'].map(train_df['room_type'].value_counts())
test_df['region_1_id_mean'] = test_df['region_1_id'].map(train_df.groupby(['region_1_id'])['target'].mean())
test_df['region_1_counts'] = test_df['region_1_id'].map(train_df['region_1_id'].value_counts())
test_df['region_2_id_mean'] = test_df['region_2_id'].map(train_df.groupby(['region_2_id'])['target'].mean())
test_df['region_2_counts'] = test_df['region_2_id'].map(train_df['region_2_id'].value_counts())
test_df['region_3_id_mean'] = test_df['region_3_id'].map(train_df.groupby(['region_3_id'])['target'].mean())
test_df['region_3_counts'] = test_df['region_3_id'].map(train_df['region_3_id'].value_counts())
test_df['availability_month'] = test_df['availability'] // 30
test_df['availability_week'] = test_df['availability'] // 7
test_df['reviews_per_month_count'] = test_df['reviews_per_month'] * test_df['calculated_host_listings_count']
test_df['room_type_calculated_host_listings_count'] = test_df['room_type'].map(train_df.groupby(['room_type'])['calculated_host_listings_count'].sum())
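The train and test blocks repeat the same `map` pattern, always with statistics taken from the training set. A small helper can express that once and also makes visible what happens to categories that never occur in training. This is a sketch: the helper name `add_group_stats` is hypothetical, and it names the count column `<col>_counts`, which differs slightly from the baseline's `region_1_counts` style.

```python
import pandas as pd

def add_group_stats(train, other, col, target="target"):
    """Add `<col>_mean` and `<col>_counts` to a copy of `other`, using `train` statistics only."""
    out = other.copy()
    out[f"{col}_mean"] = out[col].map(train.groupby(col)[target].mean())
    out[f"{col}_counts"] = out[col].map(train[col].value_counts())
    return out

# Toy frames for illustration; room_type 2 never appears in the training data.
toy_train = pd.DataFrame({"room_type": [0, 0, 1], "target": [1.0, 3.0, 5.0]})
toy_test = pd.DataFrame({"room_type": [0, 1, 2]})

toy_test = add_group_stats(toy_train, toy_test, "room_type")
print(toy_test)
```

The unseen category ends up with NaN in both new columns. LightGBM, XGBoost, and CatBoost accept missing values in numeric features, so the baseline runs either way, but it is worth checking how many test rows are affected.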
# Train the models with cross-validation (cross_validate uses 5 folds by default)
cat_val = cross_validate(
CatBoostRegressor(verbose=0,n_estimators=1000),
train_df.drop(['id', 'target', 'last_review'], axis=1),
train_df['target'],
return_estimator=True
)
lgb_val = cross_validate(
LGBMRegressor(verbose=0, force_row_wise=True),
train_df.drop(['id', 'target', 'last_review'], axis=1),
train_df['target'],
return_estimator=True
)
xgb_val = cross_validate(
XGBRegressor(),
train_df.drop(['id', 'target', 'last_review'], axis=1),
train_df['target'],
return_estimator=True
)
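Called without a `scoring` argument, `cross_validate` falls back to the estimator's own `score` method, which for scikit-learn-style regressors is typically R², not the competition metric. The sketch below shows one way to get per-fold MAE on the original price scale: wrap the model in `TransformedTargetRegressor` so that the log transform is applied and inverted inside each fold. The data are synthetic and the random forest is only a stand-in for the boosted models; with this wrapper the raw, untransformed target would be passed in.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic, positive, right-skewed "prices" for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.exp(1.0 + 0.5 * X[:, 0] + 0.1 * rng.normal(size=200))

# The wrapper fits on log1p(y) and returns predictions already mapped back with expm1.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)

cv = cross_validate(
    model, X, y,
    cv=5,
    scoring="neg_mean_absolute_error",  # scikit-learn reports errors as negative scores
    return_estimator=True,
)
fold_mae = -cv["test_score"]
print(fold_mae, fold_mae.mean())
```

Comparing the mean of `fold_mae` across models or feature sets gives a local estimate that is directly comparable to the leaderboard metric.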
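# Predict on the test set
pred = np.zeros(len(test_df))
# for clf in cat_val['estimator'] + lgb_val['estimator'] + xgb_val['estimator']:
for clf in cat_val['estimator']:
    pred += clf.predict(test_df.drop(['id', 'last_review'], axis=1))
pred /= len(cat_val['estimator'])  # average over the fold models (5 with the default cv)
pred = np.expm1(pred)  # undo the log1p applied to the target
# The ids are hard-coded as 30000..39999, as in the original baseline
pd.DataFrame({'id': range(30000, 40000), 'target': pred}).to_csv('a.csv', index=False)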
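The prediction step above averages the fold models' log-scale predictions and then inverts the label transform. The same idea can be written as a small function that divides by however many models it is given, which keeps the average correct if models from several libraries are combined. This is a sketch: `average_fold_predictions` and `ConstantModel` are made-up names, and the constant model merely stands in for a fitted regressor.

```python
import numpy as np

def average_fold_predictions(estimators, X):
    """Average log-scale predictions across fitted models, then undo log1p."""
    if len(estimators) == 0:
        raise ValueError("need at least one fitted estimator")
    log_pred = np.mean([est.predict(X) for est in estimators], axis=0)
    return np.expm1(log_pred)

class ConstantModel:
    """Hypothetical stand-in for a fitted regressor; always predicts `value`."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

X_demo = np.zeros((3, 2))
models = [ConstantModel(np.log1p(99.0)), ConstantModel(np.log1p(99.0))]
pred = average_fold_predictions(models, X_demo)
print(pred)
```

Averaging on the log scale and then exponentiating behaves like a geometric mean of the individual price predictions, so it tends to be slightly lower than averaging the prices directly.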
Full code👇:
https://github.com/datawhalechina/competition-baseline/tree/master/competition/%E7%A7%91%E5%A4%A7%E8%AE%AF%E9%A3%9EAI%E5%BC%80%E5%8F%91%E8%80%85%E5%A4%A7%E8%B5%9B2023
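Join the program
Datawhale, iFLYTEK, and Tianchi are jointly running an AI summer camp built around the latest iFLYTEK and Tianchi competitions. Registration for the third session closes on August 15; scan the QR code to apply.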