iFLYTEK: Hotel Price Prediction Challenge

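Datawhale Tutorial

Topic: data mining, iFLYTEK competition

Competition name: Hotel Accommodation Price Prediction Challenge
Competition type: data mining
Competition link 👇: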

https://challenge.xfyun.cn/topic/info?type=accommodation-price&ch=vWxQGFU

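Background

With the domestic epidemic prevention and control situation currently stable, tourism markets at home and abroad are gradually recovering and the market landscape is returning to normal. The recovery of tourism should in turn boost the hotel industry and move the market in a positive direction.

Accommodation platforms keep offering more distinctive, personalized experiences to broaden the possibilities of travel. Differences in transport between regions, holidays, quality, and location all affect accommodation prices, so predicting hotel prices from known data is a substantial challenge.

Task

The task is to train a model on the training data, using the hotel-related information provided, and to predict the room prices for the hotels in the test set.

Dataset

The competition data consists of a training set and a test set and contains 15 fields, of which target is the field to predict.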

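Field	Description
id	Sample identifier
host_id	Hotel ID
neighbourhood_group	Neighbourhood group
neighbourhood	Neighbourhood
room_type	Room type
minimum_nights	Minimum number of nights
number_of_reviews	Number of reviews
last_review	Date of the most recent review
reviews_per_month	Number of reviews per month
calculated_host_listings_count	Number of orders for the hotel
availability	Number of days the room can be booked in the next 365 days
region_1_id	Hotel region ID 1
region_2_id	Hotel region ID 2
region_3_id	Hotel region ID 3
target	Hotel accommodation price (desensitized)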

Evaluation metric

Submissions are scored by mean absolute error (MAE); the lower the MAE, the better. Reference evaluation code:

from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)
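
To make the metric concrete, here is a minimal sketch that computes MAE by hand with NumPy on the same toy values. The helper name `mae` is ours for illustration and is not part of scikit-learn.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average of |y_true - y_pred|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
score = mae(y_true, y_pred)
print(score)  # 0.5
```

The absolute errors are 0.5, 0.5, 0 and 1, so their mean is 0.5, the same value `mean_absolute_error` returns for these inputs.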

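Approach

This is a typical regression competition. Consider scaling the label before modeling (the baseline uses log1p), and adding some feature engineering:

# Load the datasets and scale the label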
import pandas as pd
import numpy as np
train_df = pd.read_csv('./酒店住宿价格预测挑战赛公开数据/train.csv')
test_df = pd.read_csv('./酒店住宿价格预测挑战赛公开数据/test.csv')
train_df['target'] = np.log1p(train_df['target'])
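
The label scaling can be looked at in isolation. The sketch below uses made-up prices (not competition data) to show that `np.log1p` compresses large values and that `np.expm1` is its inverse, which is the transform that has to be applied to predictions before they are submitted.

```python
import numpy as np

# Made-up prices, for illustration only
prices = np.array([0.0, 50.0, 120.0, 3000.0])

scaled = np.log1p(prices)    # log(1 + price); defined at price == 0
restored = np.expm1(scaled)  # exp(x) - 1, the inverse of log1p

print(scaled)
print(restored)
```

`log1p` is used instead of a plain logarithm so that a price of 0 maps to 0 and does not produce negative infinity.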

from sklearn.model_selection import cross_val_predict, cross_validate
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Feature engineering on the training set
train_df['last_review_isnull'] = train_df['last_review'].isnull()
train_df['last_review_year'] = pd.to_datetime(train_df['last_review']).dt.year

train_df['neighbourhood_group_mean'] = train_df['neighbourhood_group'].map(train_df.groupby(['neighbourhood_group'])['target'].mean())
train_df['neighbourhood_group_counts'] = train_df['neighbourhood_group'].map(train_df['neighbourhood_group'].value_counts())

train_df['room_type_mean'] = train_df['room_type'].map(train_df.groupby(['room_type'])['target'].mean())
train_df['room_type_counts'] = train_df['room_type'].map(train_df['room_type'].value_counts())

train_df['region_1_id_mean'] = train_df['region_1_id'].map(train_df.groupby(['region_1_id'])['target'].mean())
train_df['region_1_counts'] = train_df['region_1_id'].map(train_df['region_1_id'].value_counts())

train_df['region_2_id_mean'] = train_df['region_2_id'].map(train_df.groupby(['region_2_id'])['target'].mean())
train_df['region_2_counts'] = train_df['region_2_id'].map(train_df['region_2_id'].value_counts())

train_df['region_3_id_mean'] = train_df['region_3_id'].map(train_df.groupby(['region_3_id'])['target'].mean())
train_df['region_3_counts'] = train_df['region_3_id'].map(train_df['region_3_id'].value_counts())

train_df['availability_month'] = train_df['availability'] // 30
train_df['availability_week'] = train_df['availability'] // 7

train_df['reviews_per_month_count'] = train_df['reviews_per_month'] * train_df['calculated_host_listings_count'] 
train_df['room_type_calculated_host_listings_count'] = train_df['room_type'].map(train_df.groupby(['room_type'])['calculated_host_listings_count'].sum())

# Same feature engineering on the test set (target statistics and counts come from the training set)
test_df['last_review_isnull'] = test_df['last_review'].isnull()
test_df['last_review_year'] = pd.to_datetime(test_df['last_review']).dt.year

test_df['neighbourhood_group_mean'] = test_df['neighbourhood_group'].map(train_df.groupby(['neighbourhood_group'])['target'].mean())
test_df['neighbourhood_group_counts'] = test_df['neighbourhood_group'].map(train_df['neighbourhood_group'].value_counts())

test_df['room_type_mean'] = test_df['room_type'].map(train_df.groupby(['room_type'])['target'].mean())
test_df['room_type_counts'] = test_df['room_type'].map(train_df['room_type'].value_counts())

test_df['region_1_id_mean'] = test_df['region_1_id'].map(train_df.groupby(['region_1_id'])['target'].mean())
test_df['region_1_counts'] = test_df['region_1_id'].map(train_df['region_1_id'].value_counts())

test_df['region_2_id_mean'] = test_df['region_2_id'].map(train_df.groupby(['region_2_id'])['target'].mean())
test_df['region_2_counts'] = test_df['region_2_id'].map(train_df['region_2_id'].value_counts())

test_df['region_3_id_mean'] = test_df['region_3_id'].map(train_df.groupby(['region_3_id'])['target'].mean())
test_df['region_3_counts'] = test_df['region_3_id'].map(train_df['region_3_id'].value_counts())

test_df['availability_month'] = test_df['availability'] // 30
test_df['availability_week'] = test_df['availability'] // 7

test_df['reviews_per_month_count'] = test_df['reviews_per_month'] * test_df['calculated_host_listings_count'] 
test_df['room_type_calculated_host_listings_count'] = test_df['room_type'].map(train_df.groupby(['room_type'])['calculated_host_listings_count'].sum())
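
The `*_mean` features above are target encodings computed over the whole training set, so each training row's own label contributes to its feature value, which can make cross-validation scores look better than they are. One common alternative, shown here as a sketch and not as the baseline's method, is out-of-fold encoding: each row is encoded with category means computed from the other folds only. The helper `oof_target_mean` and the toy frame are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def oof_target_mean(df, col, target="target", n_splits=5, seed=0):
    """Out-of-fold mean of `target` per category of `col` (hypothetical helper)."""
    n = len(df)
    encoded = np.full(n, np.nan)
    shuffled = np.random.default_rng(seed).permutation(n)
    for enc_idx in np.array_split(shuffled, n_splits):
        # Fit the category means on every row that is NOT in this fold
        fit_idx = np.setdiff1d(np.arange(n), enc_idx)
        means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded[enc_idx] = df[col].iloc[enc_idx].map(means).to_numpy()
    # Categories unseen in the fitting folds fall back to the global mean
    encoded = np.where(np.isnan(encoded), df[target].mean(), encoded)
    return pd.Series(encoded, index=df.index)

# Toy frame standing in for train_df (values are made up)
demo = pd.DataFrame({
    "room_type": [0, 0, 0, 1, 1, 1, 0, 1, 0, 1],
    "target": [1.0, 1.2, 0.8, 2.0, 2.2, 1.8, 1.1, 2.1, 0.9, 1.9],
})
demo["room_type_mean_oof"] = oof_target_mean(demo, "room_type")
print(demo)
```

For the test set nothing changes: it can still be encoded with means computed from the full training set, as the baseline does.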

# Train the models with cross-validation (5 folds by default); keep the fitted estimator from each fold
cat_val = cross_validate(
    CatBoostRegressor(verbose=0,n_estimators=1000),
    train_df.drop(['id', 'target', 'last_review'], axis=1),
    train_df['target'],
    return_estimator=True
)

lgb_val = cross_validate(
    LGBMRegressor(verbose=0, force_row_wise=True),
    train_df.drop(['id', 'target', 'last_review'], axis=1),
    train_df['target'],
    return_estimator=True
)

xgb_val = cross_validate(
    XGBRegressor(),
    train_df.drop(['id', 'target', 'last_review'], axis=1),
    train_df['target'],
    return_estimator=True
)
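
When no `scoring` argument is given, `cross_validate` uses the estimator's default `score` method, which for regressors is R² and not the competition metric. The sketch below shows one way MAE on the original price scale could be tracked instead. It uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in so that it runs without the competition files; `mae_on_price_scale` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_validate

def mae_on_price_scale(y_true_log, y_pred_log):
    """MAE after undoing the log1p label scaling (hypothetical helper)."""
    return mean_absolute_error(np.expm1(y_true_log), np.expm1(y_pred_log))

# greater_is_better=False makes scikit-learn report the negated MAE
price_mae = make_scorer(mae_on_price_scale, greater_is_better=False)

# Synthetic stand-in for the engineered features and the log1p-scaled target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
price = np.exp(3.0 + X[:, 0] + 0.1 * rng.normal(size=200))
y_log = np.log1p(price)

cv = cross_validate(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    X, y_log, cv=5, scoring=price_mae, return_estimator=True,
)
fold_mae = -cv["test_score"]  # flip the sign back to a positive MAE
print(fold_mae.mean())
```

Passing `scoring='neg_mean_absolute_error'` is shorter, but it would report MAE on the log scale, not on prices.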

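# Predict on the test set
test_X = test_df.drop(['id', 'last_review'], axis=1)
pred = np.zeros(len(test_df))
# for clf in cat_val['estimator'] + lgb_val['estimator'] + xgb_val['estimator']:
estimators = cat_val['estimator']
for clf in estimators:
    pred += clf.predict(test_X)
# Average over however many fold models were used (5 with the default cv)
pred /= len(estimators)
# Undo the log1p label scaling
pred = np.expm1(pred)
pd.DataFrame({'id': test_df['id'], 'target': pred}).to_csv('a.csv', index=False)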
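
The averaging step generalizes to any number of fold models. Here is a minimal sketch, with a hypothetical `ConstantModel` standing in for fitted regressors, that averages log-scale predictions and then inverts the log1p transform.

```python
import numpy as np

class ConstantModel:
    """Hypothetical stand-in for a regressor fitted on log1p(price)."""
    def __init__(self, log_value):
        self.log_value = log_value

    def predict(self, X):
        return np.full(len(X), self.log_value)

def average_fold_predictions(estimators, X):
    """Average log-scale predictions across models, then undo log1p."""
    stacked = np.column_stack([est.predict(X) for est in estimators])
    return np.expm1(stacked.mean(axis=1))

# Two models that predict prices of 9 and 99 on the log1p scale
models = [ConstantModel(np.log1p(9.0)), ConstantModel(np.log1p(99.0))]
X_demo = np.zeros((3, 2))
blended = average_fold_predictions(models, X_demo)
print(blended)
```

Because the average is taken in log space, the blended value is a geometric-style mean of (1 + price), about 30.6 here, which is lower than the arithmetic mean of the two prices (54).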

Full code 👇:

https://github.com/datawhalechina/competition-baseline/tree/master/competition/%E7%A7%91%E5%A4%A7%E8%AE%AF%E9%A3%9EAI%E5%BC%80%E5%8F%91%E8%80%85%E5%A4%A7%E8%B5%9B2023

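Join the program

Datawhale, iFLYTEK, and Tianchi are jointly running an AI summer camp built on the latest iFLYTEK and Tianchi competitions. Registration for the third session closes on August 15.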