Hotel booking - Exploratory Data Analysis (EDA), Part 1 (seaborn, matplotlib, pyecharts)

Latest recommended article published on 2024-04-26 15:28:07

北极星~

Category: EDA. Tags: data analysis, python, machine learning

Article link: https://blog.csdn.net/Ghost__l/article/details/107667860

Exploratory Data Analysis

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
from matplotlib.patches import  Rectangle,Circle
from matplotlib.collections import PatchCollection
%matplotlib inline


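plt.rcParams['font.sans-serif'] = ["SimHei"]  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly

def figure_polaris(x, y):
    """Create an x-by-y inch figure with the darkgrid style and horizontal grid lines only."""
    sns.set(style="darkgrid")
    fig, ax = plt.subplots(figsize=(x, y))
    ax.xaxis.grid(False)
    ax.yaxis.grid(True, which='major')  # y-axis grid lines at the major ticks
    for item in ['top', 'right', 'left']:
        ax.spines[item].set_visible(False)  # hide the frame on these sides
    return fig, ax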

Loading the data and inspecting basic information

data_origin = pd.read_csv('hotel_bookings.csv')
data_origin.head(5).T

	0	1	2	3	4
hotel	Resort Hotel	Resort Hotel	Resort Hotel	Resort Hotel	Resort Hotel
is_canceled	0	0	0	0	0
lead_time	342	737	7	13	14
arrival_date_year	2015	2015	2015	2015	2015
arrival_date_month	July	July	July	July	July
arrival_date_week_number	27	27	27	27	27
arrival_date_day_of_month	1	1	1	1	1
stays_in_weekend_nights	0	0	0	0	0
stays_in_week_nights	0	0	1	1	2
adults	2	2	1	1	2
children	0	0	0	0	0
babies	0	0	0	0	0
meal	BB	BB	BB	BB	BB
country	PRT	PRT	GBR	GBR	GBR
market_segment	Direct	Direct	Direct	Corporate	Online TA
distribution_channel	Direct	Direct	Direct	Corporate	TA/TO
is_repeated_guest	0	0	0	0	0
previous_cancellations	0	0	0	0	0
previous_bookings_not_canceled	0	0	0	0	0
reserved_room_type	C	C	A	A	A
assigned_room_type	C	C	C	A	A
booking_changes	3	4	0	0	0
deposit_type	No Deposit	No Deposit	No Deposit	No Deposit	No Deposit
agent	NaN	NaN	NaN	304	240
company	NaN	NaN	NaN	NaN	NaN
days_in_waiting_list	0	0	0	0	0
customer_type	Transient	Transient	Transient	Transient	Transient
adr	0	0	75	75	98
required_car_parking_spaces	0	0	0	0	0
total_of_special_requests	0	0	0	0	1
reservation_status	Check-Out	Check-Out	Check-Out	Check-Out	Check-Out
reservation_status_date	2015-07-01	2015-07-01	2015-07-02	2015-07-02	2015-07-03

print('Shape of dataset:', data_origin.shape)
print('Size of dataset:', data_origin.size)
data_origin.info()

Shape of dataset: (119390, 32)
Size of dataset: 3820480
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
hotel                             119390 non-null object
is_canceled                       119390 non-null int64
lead_time                         119390 non-null int64
arrival_date_year                 119390 non-null int64
arrival_date_month                119390 non-null object
arrival_date_week_number          119390 non-null int64
arrival_date_day_of_month         119390 non-null int64
stays_in_weekend_nights           119390 non-null int64
stays_in_week_nights              119390 non-null int64
adults                            119390 non-null int64
children                          119386 non-null float64
babies                            119390 non-null int64
meal                              119390 non-null object
country                           118902 non-null object
market_segment                    119390 non-null object
distribution_channel              119390 non-null object
is_repeated_guest                 119390 non-null int64
previous_cancellations            119390 non-null int64
previous_bookings_not_canceled    119390 non-null int64
reserved_room_type                119390 non-null object
assigned_room_type                119390 non-null object
booking_changes                   119390 non-null int64
deposit_type                      119390 non-null object
agent                             103050 non-null float64
company                           6797 non-null float64
days_in_waiting_list              119390 non-null int64
customer_type                     119390 non-null object
adr                               119390 non-null float64
required_car_parking_spaces       119390 non-null int64
total_of_special_requests         119390 non-null int64
reservation_status                119390 non-null object
reservation_status_date           119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

data_origin.describe().T

	count	mean	std	min	25%	50%	75%	max
is_canceled	119390.0	0.370416	0.482918	0.00	0.00	0.000	1.0	1.0
lead_time	119390.0	104.011416	106.863097	0.00	18.00	69.000	160.0	737.0
arrival_date_year	119390.0	2016.156554	0.707476	2015.00	2016.00	2016.000	2017.0	2017.0
arrival_date_week_number	119390.0	27.165173	13.605138	1.00	16.00	28.000	38.0	53.0
arrival_date_day_of_month	119390.0	15.798241	8.780829	1.00	8.00	16.000	23.0	31.0
stays_in_weekend_nights	119390.0	0.927599	0.998613	0.00	0.00	1.000	2.0	19.0
stays_in_week_nights	119390.0	2.500302	1.908286	0.00	1.00	2.000	3.0	50.0
adults	119390.0	1.856403	0.579261	0.00	2.00	2.000	2.0	55.0
children	119386.0	0.103890	0.398561	0.00	0.00	0.000	0.0	10.0
babies	119390.0	0.007949	0.097436	0.00	0.00	0.000	0.0	10.0
is_repeated_guest	119390.0	0.031912	0.175767	0.00	0.00	0.000	0.0	1.0
previous_cancellations	119390.0	0.087118	0.844336	0.00	0.00	0.000	0.0	26.0
previous_bookings_not_canceled	119390.0	0.137097	1.497437	0.00	0.00	0.000	0.0	72.0
booking_changes	119390.0	0.221124	0.652306	0.00	0.00	0.000	0.0	21.0
agent	103050.0	86.693382	110.774548	1.00	9.00	14.000	229.0	535.0
company	6797.0	189.266735	131.655015	6.00	62.00	179.000	270.0	543.0
days_in_waiting_list	119390.0	2.321149	17.594721	0.00	0.00	0.000	0.0	391.0
adr	119390.0	101.831122	50.535790	-6.38	69.29	94.575	126.0	5400.0
required_car_parking_spaces	119390.0	0.062518	0.245291	0.00	0.00	0.000	0.0	8.0
total_of_special_requests	119390.0	0.571363	0.792798	0.00	0.00	0.000	1.0	5.0

1. Data preprocessing

data_origin.isnull().sum()  # count the missing values in each column

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64

`agent` and `company` have a large number of missing values (16340 and 112593 of 119390 rows), so we drop these two columns.

data_origin = data_origin.drop(['agent', 'company'], axis=1)  # drop the two columns with many missing values
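
The decision to drop these two columns can be checked with a little arithmetic on the missing-value counts printed above. The sketch below uses only those counts; the 10% cutoff (`DROP_THRESHOLD`) is a hypothetical value picked for illustration, since the article does not state one.

```python
# Missing-value counts copied from the isnull().sum() output above.
TOTAL_ROWS = 119390
missing_counts = {"children": 4, "country": 488, "agent": 16340, "company": 112593}

# Hypothetical cutoff chosen for illustration; the article does not state one.
DROP_THRESHOLD = 0.10

# Fraction of rows with a missing value, per column.
missing_ratio = {col: n / TOTAL_ROWS for col, n in missing_counts.items()}
columns_to_drop = [col for col, r in missing_ratio.items() if r > DROP_THRESHOLD]

for col, r in missing_ratio.items():
    print(f"{col:<10} {r:.2%}")
print("drop:", columns_to_drop)
```

Under this assumed cutoff only `agent` and `company` are selected, while `children` and `country` are missing in well under 1% of rows and can be filled instead.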

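Exploratory Data Analysis

2. Relationship between country of origin (country) and cancellations

2.1 Country codes and names of the main countries

2.2 Main analysis

`country` has missing values, so those need to be handled first. The plot below shows that `country` has a very large number of categories with a long-tailed distribution, so the missing values can be replaced with `unknown`. We then select the main countries (covering about 90% of orders) and plot them.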

figure_polaris(15,7)
# sns.set(style="darkgrid")
sns.countplot(x=data_origin['country'])  # the keyword x= works across seaborn versions

<matplotlib.axes._subplots.AxesSubplot at 0x2cc23a15088>

data_new = data_origin
list1 = list(data_new[data_new["is_canceled"]==1].groupby(["country"])["country"]
                                 .count().sort_values(ascending=False))
list2 = list(data_new[data_new["is_canceled"]==0].groupby(["country"])["country"]
                                 .count().sort_values(ascending=False))

len1 = 10
len2 = 15
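print(
    "Share of canceled orders covered by the selected countries: %.2f%%" % (sum(list1[:len1])*100/sum(list1)),
    "\nShare of non-canceled orders covered by the selected countries: %.2f%%" % (sum(list2[:len2])*100/sum(list2)),
    "\nCanceled orders from countries not selected:", sum(list1)-sum(list1[:len1]),
    "\nNon-canceled orders from countries not selected:", sum(list2)-sum(list2[:len2]),
)

Share of canceled orders covered by the selected countries: 88.80%
Share of non-canceled orders covered by the selected countries: 89.66%
Canceled orders from countries not selected: 4953
Non-canceled orders from countries not selected: 7770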

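2.2.1 Data processing

Data processing step 1: from the canceled orders, select the main countries (the top `len1`):
`cancael_index_temp`: names of the main countries
`cancael_value_temp`: order counts of the main countries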

cancael_index_temp = list(data_new[data_new["is_canceled"]==1].groupby(["country"])["country"]
                    .count().sort_values(ascending=False).index)[:len1]+["other country"]
cancael_value_temp = list(data_new[data_new["is_canceled"]==1].groupby(["country"])["country"]
                    .count().sort_values(ascending=False))[:len1] + [(sum(list1)-sum(list1[:len1]))]

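Data processing step 2: from the non-canceled orders, select the main countries (the top `len2`):
`uncancael_index_temp`: names of the main countries
`uncancael_value_temp`: order counts of the main countries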

uncancael_index_temp = list(data_new[data_new["is_canceled"]==0].groupby(["country"])["country"]
                    .count().sort_values(ascending=False).index)[:len2]+["other country"]
uncancael_value_temp = list(data_new[data_new["is_canceled"]==0].groupby(["country"])["country"]
                    .count().sort_values(ascending=False))[:len2]+ [(sum(list2)-sum(list2[:len2]))]

## Merge into single lists so they can be passed straight to the plotting code
country_iscanceled_index = (uncancael_index_temp + cancael_index_temp)
country_iscanceled_value = (uncancael_value_temp + cancael_value_temp)

## Share of non-canceled and canceled orders
temp = data_new.groupby(["is_canceled"])["is_canceled"].count().values
country_iscanceled_inner = list(temp / sum(temp))
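
Both selection steps above follow the same pattern: keep the N most frequent countries and fold the long tail into a single "other country" bucket. A minimal standard-library sketch of that pattern is shown below. The helper name `top_n_with_other` and the toy data are made up for illustration, and mapping missing values to `unknown` is the handling the text proposes, not something the original code does.

```python
from collections import Counter

def top_n_with_other(values, n, missing_label="unknown", other_label="other country"):
    """Return (labels, counts) for the n most common values plus one 'other' bucket."""
    # Assumption: missing values are represented as None and relabeled first.
    cleaned = [missing_label if v is None else v for v in values]
    counts = Counter(cleaned)
    top = counts.most_common(n)
    labels = [label for label, _ in top] + [other_label]
    # Everything outside the top n is summed into the final bucket.
    tail_total = sum(counts.values()) - sum(c for _, c in top)
    return labels, [c for _, c in top] + [tail_total]

# Toy data, made up for illustration only.
sample = ["PRT", "PRT", "PRT", "GBR", "GBR", "FRA", None, "ESP"]
labels, counts = top_n_with_other(sample, n=2)
print(labels, counts)
```

Writing the pattern once as a function avoids repeating the groupby/sort chain for the canceled and non-canceled subsets.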

2.3 Plotting

import pyecharts.options as opts
from pyecharts.charts import Pie
from pyecharts.commons.utils import JsCode
# Inner ring: share of non-canceled and canceled orders
inner_x_data = ["Not canceled", "Canceled"]
inner_y_data = country_iscanceled_inner
inner_data_pair = [list(z) for z in zip(inner_x_data, inner_y_data)]

# Outer ring: order counts per country, non-canceled first, then canceled
outer_x_data = country_iscanceled_index
outer_y_data = country_iscanceled_value
outer_data_pair = [list(z) for z in zip(outer_x_data, outer_y_data)]

(
    Pie(init_opts=opts.InitOpts())
    .add(
        series_name="Booking status:",
        data_pair=inner_data_pair,
        radius=[0, "30%"],
        center=["55%","50%"],
        label_opts=opts.LabelOpts(position="inner",formatter="{b} \n\n {d}%"),
    )
    .add(
        series_name="Country of origin:",
        data_pair=outer_data_pair,
        radius=["31%","50%"],
        center=["55%","50%"],
        label_opts=opts.LabelOpts(position="outer"),
    )
    .set_colors(['#44a0d6',"#fc7716","#74c476","#9e9ac8","#4c72b0","#ee854a","#6acc64",
                 "#d65f5f","#8c613c","#dc7ec0","#797979","#d5bb67","#82c6e2","#faceb6",
                 "#fae9b6","#e3fab6","#b6faf6","#d6b6fa"])

    .set_global_opts(
        tooltip_opts =opts.TooltipOpts(formatter=" {a} <br/> {b}  {d}%",axis_pointer_type = "cross",),
        legend_opts =opts.LegendOpts(type_='scroll',orient='vertical',pos_left="5%",pos_top= 'middle'),
        title_opts=opts.TitleOpts(title="Cancellation status & guest country of origin",pos_left="center")
    )
    .render_notebook()
)

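[Figure: nested pie chart with cancellation status in the inner ring and guest country of origin in the outer ring]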

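2.4 Analysis

The chart shows that these hotels in Portugal mainly host domestic guests (unsurprisingly), followed by guests from the United Kingdom. Among guests from Portugal (PRT), a relatively large share of orders is canceled, while guests from the United Kingdom cancel comparatively rarely.

3. Relationship between lead time (lead_time) and cancellations

3.1 Violin plot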

# `arrival_date_year` vs `lead_time` vs `is_canceled` exploration with violin plot
data = data_origin
figure_polaris(15,10)
# plt.figure(figsize=(15,10))
sns.violinplot(x='arrival_date_year', y='lead_time', hue="is_canceled", data=data, palette="Set3", bw=.2,
               cut=2, linewidth=2, inner='box', split=True)
sns.despine(left=True)
plt.title('Arrival Year VS Lead Time vs Canceled Situation', weight='bold', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Lead Time', fontsize=16)

Text(0, 0.5, 'Lead Time')

Regression model

# Examine how lead time affects cancellations
import seaborn as sns
# group data by lead_time:
lead_cancel_data = data.groupby("lead_time")["is_canceled"].describe()
# use only lead times with at least 10 bookings for the graph:
lead_cancel_data_10 = lead_cancel_data.loc[lead_cancel_data["count"] >= 10]

# show figure:
plt.figure(figsize=(15, 7))

# x: lead time in days, y: cancellation rate in percent
x = pd.Series(lead_cancel_data_10.index, name="x_var")
y = pd.Series(lead_cancel_data_10["mean"].values * 100, name="y_var")
sns.regplot(x=x, y=y)
plt.title("Effect of lead time on cancellation", fontsize=16)
plt.xlabel("Lead time", fontsize=16)
plt.ylabel("Cancellations [%]", fontsize=16)
plt.show()

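Analysis: violin plot

The plot shows the relationship between arrival year (arrival_date_year), lead time (lead_time), and cancellation status (is_canceled).
Lead time is the number of days between the date the booking was entered and the arrival date.

Among guests who did not cancel, the distribution of lead times is fairly stable. Among guests who canceled, lead times tend to be longer.

Conclusion

Guests who book further in advance are more likely to cancel.

Analysis: regression plot

Bookings made only a few days before arrival are rarely canceled. The cancellation rate rises as the lead time grows, and bookings made about a year in advance have an even higher cancellation rate, which matches common sense.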
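
The quantity on the y-axis of the regression plot is the cancellation rate for each lead time, kept only for lead times with enough bookings. A minimal standard-library sketch of that aggregation follows. The function name and the toy data are made up for illustration; the threshold of 10 bookings matches the filter in the code above.

```python
from collections import defaultdict

def cancel_rate_by_lead_time(bookings, min_count=10):
    """Map each lead time to its cancellation rate in percent, skipping sparse groups."""
    groups = defaultdict(list)
    for lead_time, is_canceled in bookings:
        groups[lead_time].append(is_canceled)
    return {
        lead: 100 * sum(flags) / len(flags)  # mean of the 0/1 flags, as a percentage
        for lead, flags in sorted(groups.items())
        if len(flags) >= min_count
    }

# Toy data: (lead_time, is_canceled) pairs, made up for illustration only.
toy = [(1, 0)] * 9 + [(1, 1)] * 1 + [(200, 1)] * 6 + [(200, 0)] * 4 + [(700, 1)] * 3
rates = cancel_rate_by_lead_time(toy)
print(rates)
```

The group with only three bookings is dropped, which is why very long lead times with few bookings do not distort the fitted line.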

4. Average price per person & number of guests

adr: Average Daily Rate of each booking, defined as the sum of all lodging transactions divided by the total number of staying nights.

data = data_origin
# Look at the distribution of adr
plt.subplots(figsize=(15,5))
plt.scatter(data['adr'].index,data['adr'].values)

<matplotlib.collections.PathCollection at 0x2cc2b19fc48>

The plot shows one outlier in the data; print its value to inspect it.

# Use idxmax to find the row label of the largest adr, then .loc to read that value
data.loc[data['adr'].idxmax(), 'adr']

5400.0

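This is a City Hotel order whose adr is 5400, far above every other value. With this many orders, a single outlier has little effect on the mean, but for the sake of rigor we remove it in later steps.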
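
How little the outlier moves the overall mean can be estimated from the summary statistics alone. The sketch below uses the row count, mean, and maximum of `adr` from the `describe()` output earlier. It covers the mean over all rows only; a per-month or per-hotel group is smaller, so the effect there would be larger.

```python
# Values copied from the describe() output above.
n_rows = 119390
adr_mean = 101.831122
adr_outlier = 5400.0

# Mean after dropping one value: (n * mean - outlier) / (n - 1).
mean_without_outlier = (n_rows * adr_mean - adr_outlier) / (n_rows - 1)
shift = adr_mean - mean_without_outlier

print(f"mean without outlier: {mean_without_outlier:.4f}")
print(f"shift: {shift:.4f}")
```

To actually drop the row in pandas, a filter along the lines of `data = data[data['adr'] < 5000]` would do, where the cutoff of 5000 is an assumption and not a value from the article.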

## Replace each month name with its number so the plots sort in calendar order
import calendar
month2num = {calendar.month_name[i]: i for i in range(1, 13)}
data['arrival_date_month'] = data['arrival_date_month'].replace(month2num)

data_new0 = data
# .copy() so the new columns are added to an independent frame, not a view of data
data_new = data_new0[data_new0["is_canceled"] == 0].copy()
data_new["people"] = data_new['adults'] + data_new["children"]
data_new["people"] = data_new["people"].replace(0.0, 1)  ## treat 0 guests as 1 so the per-person price does not become inf
data_new["adr_per"] = data_new['adr'] / (data_new["people"]*(data_new["stays_in_weekend_nights"]+data_new["stays_in_week_nights"]))
# data_new["adr_per"] = data_new["adr_per"].replace(float('inf'),0)

# Select the column before .mean() so that non-numeric columns are not averaged
data_aver_price_city_temp = data_new[data_new["hotel"]=="City Hotel"].groupby("arrival_date_month")['adr_per'].mean()
data_aver_price_resort_temp = data_new[data_new["hotel"]=="Resort Hotel"].groupby("arrival_date_month")['adr_per'].mean()

data_count_price_city_temp = data_new[data_new["hotel"]=="City Hotel"].groupby("arrival_date_month")['people'].count()
data_count_price_resort_temp = data_new[data_new["hotel"]=="Resort Hotel"].groupby("arrival_date_month")['people'].count()

## Arrange the data for plotting
data_aver_city = pd.DataFrame({"arrival_date_month": list(data_aver_price_city_temp.index),
                                "hotel": "City Hotel",
                                "average_cost": list(data_aver_price_city_temp.values),
                                "count":data_count_price_city_temp  })
data_aver_resort = pd.DataFrame({"arrival_date_month": list(data_aver_price_resort_temp.index),
                                "hotel": "Resort Hotel",
                                "average_cost": list(data_aver_price_resort_temp.values),
                                "count":data_count_price_resort_temp })
data_average_hotel = pd.concat([data_aver_city, data_aver_resort], ignore_index=True)
# data_average_hotel

# sns.set_style("whitegrid")
sns.set(style="darkgrid")
## Set the colors

fig, ax = plt.subplots(2,1,figsize=(15, 9))
sns.barplot(x="arrival_date_month", y="average_cost", hue="hotel", palette=sns.color_palette(), ax=ax[0],data=data_average_hotel)

ax[0].set_xlabel(' ', fontsize=15)
ax[0].set_ylabel('Cost', fontsize=15)
ax[0].set_title('Average Cost of Every Month', fontsize=18)

sns.barplot(x="arrival_date_month", y="count", hue="hotel", palette=sns.color_palette(),ax=ax[1],data=data_average_hotel)
# plt.xlabel("Month", fontsize=16)
ax[1].set_xlabel('Month', fontsize=15)
ax[1].set_ylabel('Count', fontsize=15)
ax[1].set_title('Order Count of Every Month', fontsize=18)

Text(0.5, 1.0, 'Order Count of Every Month')

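Analysis

A quick search shows that Portugal's peak tourist season runs from June to September each year. Taking Lisbon (the capital of Portugal) as an example, the beaches are busiest in July and August. The Resort Hotel's guest volume (orders × guests per order) peaks in July and August, which matches this. The "Average Cost of Every Month" chart also shows that July and August have the highest average price per person of the year, which suggests that many tourists come here on holiday in these two months.

5. Cancellation status and meal choice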

# data['country']=data['country'].replace(np.nan,'PRT')
data["meal"] = data["meal"].replace(np.nan,"SC")
# data.drop(data_new.index[zero_guests], inplace=True)
meal_data = data[["hotel", "is_canceled", "meal"]]
# meal_data

plt.figure(figsize=(15, 10))
plt.subplot(1,2,1)
plt.pie(meal_data.loc[meal_data["is_canceled"]==0, "meal"].value_counts(), 
        labels=meal_data.loc[meal_data["is_canceled"]==0, "meal"].value_counts().index, 
       autopct="%.2f%%")
plt.title("Meal Choice of Uncanceled People", fontsize=16)
plt.legend(loc="upper right")
 
plt.subplot(1,2,2)
plt.pie(meal_data.loc[meal_data["is_canceled"]==1, "meal"].value_counts(), 
        labels=meal_data.loc[meal_data["is_canceled"]==1, "meal"].value_counts().index, 
       autopct="%.2f%%")
plt.title("Meal Choice of Canceled People", fontsize=16)
plt.legend(loc="upper right")

<matplotlib.legend.Legend at 0x2cc295b51c8>

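Analysis

The two charts are almost identical, which indicates that guests who canceled and guests who did not cancel make essentially the same meal choices.

6. Cancellations by hotel type

## Data processing
data_new = data_origin
## Count the canceled and non-canceled records for each hotel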
uncancel_hotel_count = data_new[data_new["is_canceled"]==0].groupby(["hotel"])["hotel"].count()
cancel_hotel_count = data_new[data_new["is_canceled"]==1].groupby(["hotel"])["hotel"].count()
x1 = uncancel_hotel_count['City Hotel']
x2 = uncancel_hotel_count['Resort Hotel']
y1 = cancel_hotel_count['City Hotel']
y2 = cancel_hotel_count['Resort Hotel']

data_Uncanceled_percent = (x1+x2)/(x1+x2+y1+y2) *100
data_Canceled_percent   = (y1+y2)/(x1+x2+y1+y2) *100

data_City_percent       = (x1+y1)/(x1+x2+y1+y2) *100
data_Resort_percent     = (x2+y2)/(x1+x2+y1+y2) *100

data_Uncanceled_City    = x1/ (x1+x2+y1+y2)*100
data_Uncanceled_Resort  = x2/ (x1+x2+y1+y2)*100

data_Canceled_City      = y1/ (x1+x2+y1+y2)*100
data_Canceled_Resort    = y2/ (x1+x2+y1+y2)*100

## Plotting
import pyecharts.options as opts
from pyecharts.charts import Pie
from pyecharts.commons.utils import JsCode

inner_x_data = ["Not canceled", "Canceled"]
inner_y_data = [data_Uncanceled_percent, data_Canceled_percent]
inner_data_pair = [list(z) for z in zip(inner_x_data, inner_y_data)]

outer_x_data = ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"]
outer_y_data = [data_Uncanceled_City,data_Uncanceled_Resort, data_Canceled_City,data_Canceled_Resort]
outer_data_pair = [list(z) for z in zip(outer_x_data, outer_y_data)]

inner2_x_data = ["City Hotel", "Resort Hotel"]
inner2_y_data = [data_City_percent, data_Resort_percent]
inner2_data_pair = [list(z) for z in zip(inner2_x_data, inner2_y_data)]

outer2_x_data = ["Not canceled", "Canceled", "Not canceled", "Canceled"]
outer2_y_data = [data_Uncanceled_City,data_Canceled_City, data_Uncanceled_Resort,data_Canceled_Resort]
outer2_data_pair = [list(z) for z in zip(outer2_x_data, outer2_y_data)]

(
    Pie(init_opts=opts.InitOpts())
    .add(
        series_name="1-is_canceled:",
        data_pair=inner_data_pair,
        radius=[0, "35%"],
        center=["30%", "50%"],
    )
    
    .add(
        series_name="1-hotel:",
        data_pair=outer_data_pair,
        center=["30%", "50%"],
        radius=["36%","65%"],
        label_opts=opts.LabelOpts(position="inner"),
    )
 ##----------------- pie2 -------------------------------##  
    .add(
        series_name="2-hotel:",
        data_pair=inner2_data_pair,
        radius=[0, "35%"],
        center=["70%", "50%"],
    )
    .add(
        series_name="2-is_canceled:",
        data_pair=outer2_data_pair,
        center=["70%", "50%"],
        radius=["36%","65%"],
        label_opts=opts.LabelOpts(position="inner"),
    )
    .set_colors(['#6baed6',"#fd8d3c", "#74c476", "#9e9ac8"])
    .set_global_opts(tooltip_opts=opts.TooltipOpts(formatter=" {a} <br/> {b}  {d}%",
#                     precision
                      axis_pointer_type = "cross",
                   ), )
    .set_series_opts(label_opts=opts.LabelOpts(position="inner",formatter="{b} \n\n {d}%"),)
    .render_notebook()
)

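[Figure: two nested pie charts. Left: cancellation status (inner ring) split by hotel type (outer ring). Right: hotel type (inner ring) split by cancellation status (outer ring).]

Analysis:

Overall:

By cancellation status, about $63\%$ of orders were not canceled (the mean of is_canceled is 0.370416).

By hotel type, more than $66\%$ of orders are for the City Hotel; the Resort Hotel accounts for a smaller share.

Among canceled orders, the City Hotel has about $3$ times as many as the Resort Hotel, accounting for about $3/4$ ($74.8\%$) of all cancellations. This is mainly because the City Hotel makes up about $66\%$ of all orders. Even so, $74.8\%$ is higher than that $66\%$ share, which indicates that `the City Hotel has a higher cancellation rate`.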
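
The comparison of shares above can be turned into a per-hotel cancellation rate with Bayes' rule. The sketch below uses the rounded shares quoted in the analysis together with the overall cancellation rate from `describe()`, so the results are approximations, not exact counts from the dataset.

```python
# Shares quoted in the analysis above (rounded, so the results are approximate).
p_cancel = 0.370416          # overall cancellation rate: mean of is_canceled in describe()
p_city = 0.66                # City Hotel share of all orders
p_city_given_cancel = 0.748  # City Hotel share of canceled orders

# Bayes' rule: P(cancel | hotel) = P(hotel | cancel) * P(cancel) / P(hotel)
city_cancel_rate = p_city_given_cancel * p_cancel / p_city
resort_cancel_rate = (1 - p_city_given_cancel) * p_cancel / (1 - p_city)

print(f"City Hotel cancellation rate:   {city_cancel_rate:.1%}")
print(f"Resort Hotel cancellation rate: {resort_cancel_rate:.1%}")
```

With these rounded inputs the City Hotel comes out at roughly 42% and the Resort Hotel at roughly 27%, so the gap between the per-hotel rates is wider than the 66% versus 74.8% share comparison suggests at first glance.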

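7. Predicting whether a guest will cancel

Step 1: compute the correlation between each feature and "is_canceled". Categorical variables cannot take part in this calculation.

data_new = data_origin
# numeric_only=True skips the categorical (object) columns
cancel_corr = data_new.corr(numeric_only=True)["is_canceled"]
cancel_corr.abs().sort_values(ascending=False)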

is_canceled                       1.000000
lead_time                         0.293123
total_of_special_requests         0.234658
required_car_parking_spaces       0.195498
booking_changes                   0.144381
previous_cancellations            0.110133
is_repeated_guest                 0.084793
adults                            0.060017
previous_bookings_not_canceled    0.057358
days_in_waiting_list              0.054186
adr                               0.047557
babies                            0.032491
stays_in_week_nights              0.024765
arrival_date_year                 0.016660
arrival_date_month                0.011022
arrival_date_week_number          0.008148
arrival_date_day_of_month         0.006130
children                          0.005048
stays_in_weekend_nights           0.001791
Name: is_canceled, dtype: float64
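
pandas' `corr()` computes the Pearson correlation coefficient by default. The sketch below spells out that formula with the standard library for a numeric feature against a 0/1 target. The function name `pearson` and the toy data are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy data, made up for illustration only.
lead_time = [5, 10, 40, 120, 200, 300]
is_canceled = [0, 0, 0, 1, 1, 1]
r = pearson(lead_time, is_canceled)
print(round(r, 3))
```

Because the table above sorts by absolute value, the sign of each correlation is hidden; inspecting `cancel_corr` before calling `.abs()` shows which features push toward cancellation and which push away from it.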

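Step 2: model training
Build base models with a decision tree, a random forest, logistic regression, and an XGBoost classifier (XGBClassifier), and see which one trains best.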

# for ML:
from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier  # random forest
from xgboost import XGBClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import eli5 # Feature importance evaluation

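# Manually choose the columns to include.
# To make the model more general and to prevent leakage, some columns are excluded
# (booking changes, days in waiting list, arrival year, assigned room type, reservation status, country).
# Including country would improve accuracy, but it might also make the model less general.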
num_features = ["lead_time","total_of_special_requests","required_car_parking_spaces", 
                 "previous_cancellations","is_repeated_guest","adults","previous_bookings_not_canceled",
                "adr","babies","stays_in_weekend_nights","arrival_date_week_number","arrival_date_day_of_month",
                "children","stays_in_week_nights"]

cat_features = ["hotel","arrival_date_month","meal","market_segment",
                "distribution_channel","reserved_room_type","deposit_type","customer_type"]
# Separate the features from the target
features = num_features + cat_features
X = data_new.drop(["is_canceled"], axis=1)[features]
y = data_new["is_canceled"]

# Preprocess numerical features:
# for most numerical columns, except the dates, 0 is the most logical fill value,
# and no date values are missing here.
num_transformer = SimpleImputer(strategy="constant", fill_value=0)

# Preprocess categorical features:
cat_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
                                  ("onehot", OneHotEncoder(handle_unknown='ignore'))])

# Bundle the preprocessing for numerical and categorical features:
preprocessor = ColumnTransformer(transformers=[("num", num_transformer, num_features),
                                               ("cat", cat_transformer, cat_features)])

# Define the models to test:
base_models = [("DT_model", DecisionTreeClassifier(random_state=42)),
               ("RF_model", RandomForestClassifier(random_state=42,n_jobs=-1)),
               ("LR_model", LogisticRegression(random_state=42,solver='liblinear')),  # liblinear ignores n_jobs
               ("XGB_model", XGBClassifier(random_state=42, n_jobs=-1))]

# Split the data into `kfolds` parts for cross validation;
# shuffle ensures a random distribution of the data:
kfolds = 4 # 4 = 75% train, 25% validation
split = KFold(n_splits=kfolds, shuffle=True, random_state=42)

# Preprocess, fit, predict, and score for each model:
for name, model in base_models:
    # Pack the preprocessing and the model into one pipeline:
    model_steps = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])

    # Get the cross-validation scores for each model:
    cv_results = cross_val_score(model_steps,
                                 X, y,
                                 cv=split,
                                 scoring="accuracy",
                                 n_jobs=-1)
    # output:
    min_score = round(min(cv_results), 4)
    max_score = round(max(cv_results), 4)
    mean_score = round(np.mean(cv_results), 4)
    std_dev = round(np.std(cv_results), 4)
    print(f"{name} cross validation accuracy score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")

DT_model cross validation accuracy score: 0.8196 +/- 0.0024 (std) min: 0.8157, max: 0.822
RF_model cross validation accuracy score: 0.8521 +/- 0.0018 (std) min: 0.8494, max: 0.8542
LR_model cross validation accuracy score: 0.8085 +/- 0.0018 (std) min: 0.806, max: 0.8108
XGB_model cross validation accuracy score: 0.8403 +/- 0.0007 (std) min: 0.8394, max: 0.8413

Result: the random forest (RF) model has the highest accuracy, with a mean cross-validation score of 0.8521.
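
The scores above come from 4-fold cross validation, where each fold serves once as the validation set (25%) while the other three folds are used for training (75%). The sketch below illustrates that split with the standard library only. It is not scikit-learn's `KFold` implementation and it shuffles differently; the function name `kfold_indices` is made up for illustration.

```python
import random

def kfold_indices(n_samples, n_splits, seed=42):
    """Yield (train_idx, valid_idx) pairs for shuffled k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size

folds = list(kfold_indices(n_samples=20, n_splits=4))
for train, valid in folds:
    print(len(train), len(valid))
```

Every sample lands in exactly one validation fold, so the mean of the four fold scores uses each row once for evaluation.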

※ Parts of the code and ideas draw on the following articles

CSDN: kaggle: Hotel booking demand (酒店预订需求) by 牛牛liunian

Kaggle: house-booking by swapnilwagh061993

Kaggle: Hotel bookings ML project - kernel688ef04346 by somepro