Hotel booking: Exploratory Data Analysis (EDA), Part 1 (seaborn, matplotlib, pyecharts)


import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
from matplotlib.patches import  Rectangle,Circle
from matplotlib.collections import PatchCollection
%matplotlib inline


plt.rcParams['font.sans-serif']=["SimHei"] # font that can render Chinese labels
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly

def figure_polaris(x,y):
    sns.set(style="darkgrid")
    fig,ax=plt.subplots(figsize=(x,y))
    ax.xaxis.grid(False)
    ax.yaxis.grid(True, which='major') # y-axis grid lines follow the major ticks
    for item in ['top', 'right', 'left']:
        ax.spines[item].set_visible(False) # hide the frame on these sides

Loading the data and a first look

data_origin = pd.read_csv('hotel_bookings.csv')
data_origin.head(5).T
                                           0             1             2             3             4
hotel                           Resort Hotel  Resort Hotel  Resort Hotel  Resort Hotel  Resort Hotel
is_canceled                                0             0             0             0             0
lead_time                                342           737             7            13            14
arrival_date_year                       2015          2015          2015          2015          2015
arrival_date_month                      July          July          July          July          July
arrival_date_week_number                  27            27            27            27            27
arrival_date_day_of_month                  1             1             1             1             1
stays_in_weekend_nights                    0             0             0             0             0
stays_in_week_nights                       0             0             1             1             2
adults                                     2             2             1             1             2
children                                   0             0             0             0             0
babies                                     0             0             0             0             0
meal                                      BB            BB            BB            BB            BB
country                                  PRT           PRT           GBR           GBR           GBR
market_segment                        Direct        Direct        Direct     Corporate     Online TA
distribution_channel                  Direct        Direct        Direct     Corporate         TA/TO
is_repeated_guest                          0             0             0             0             0
previous_cancellations                     0             0             0             0             0
previous_bookings_not_canceled             0             0             0             0             0
reserved_room_type                         C             C             A             A             A
assigned_room_type                         C             C             C             A             A
booking_changes                            3             4             0             0             0
deposit_type                      No Deposit    No Deposit    No Deposit    No Deposit    No Deposit
agent                                    NaN           NaN           NaN           304           240
company                                  NaN           NaN           NaN           NaN           NaN
days_in_waiting_list                       0             0             0             0             0
customer_type                      Transient     Transient     Transient     Transient     Transient
adr                                        0             0            75            75            98
required_car_parking_spaces                0             0             0             0             0
total_of_special_requests                  0             0             0             0             1
reservation_status                 Check-Out     Check-Out     Check-Out     Check-Out     Check-Out
reservation_status_date           2015-07-01    2015-07-01    2015-07-02    2015-07-02    2015-07-03
print('Shape of dataset:',data_origin.shape)
print('Size of dataset: ',data_origin.size)
data_origin.info()
Shape of dataset: (119390, 32)
Size of dataset:  3820480
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
hotel                             119390 non-null object
is_canceled                       119390 non-null int64
lead_time                         119390 non-null int64
arrival_date_year                 119390 non-null int64
arrival_date_month                119390 non-null object
arrival_date_week_number          119390 non-null int64
arrival_date_day_of_month         119390 non-null int64
stays_in_weekend_nights           119390 non-null int64
stays_in_week_nights              119390 non-null int64
adults                            119390 non-null int64
children                          119386 non-null float64
babies                            119390 non-null int64
meal                              119390 non-null object
country                           118902 non-null object
market_segment                    119390 non-null object
distribution_channel              119390 non-null object
is_repeated_guest                 119390 non-null int64
previous_cancellations            119390 non-null int64
previous_bookings_not_canceled    119390 non-null int64
reserved_room_type                119390 non-null object
assigned_room_type                119390 non-null object
booking_changes                   119390 non-null int64
deposit_type                      119390 non-null object
agent                             103050 non-null float64
company                           6797 non-null float64
days_in_waiting_list              119390 non-null int64
customer_type                     119390 non-null object
adr                               119390 non-null float64
required_car_parking_spaces       119390 non-null int64
total_of_special_requests         119390 non-null int64
reservation_status                119390 non-null object
reservation_status_date           119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
data_origin.describe().T
                                     count        mean         std         min         25%         50%         75%         max
is_canceled                       119390.0    0.370416    0.482918        0.00        0.00       0.000         1.0         1.0
lead_time                         119390.0  104.011416  106.863097        0.00       18.00      69.000       160.0       737.0
arrival_date_year                 119390.0 2016.156554    0.707476     2015.00     2016.00    2016.000      2017.0      2017.0
arrival_date_week_number          119390.0   27.165173   13.605138        1.00       16.00      28.000        38.0        53.0
arrival_date_day_of_month         119390.0   15.798241    8.780829        1.00        8.00      16.000        23.0        31.0
stays_in_weekend_nights           119390.0    0.927599    0.998613        0.00        0.00       1.000         2.0        19.0
stays_in_week_nights              119390.0    2.500302    1.908286        0.00        1.00       2.000         3.0        50.0
adults                            119390.0    1.856403    0.579261        0.00        2.00       2.000         2.0        55.0
children                          119386.0    0.103890    0.398561        0.00        0.00       0.000         0.0        10.0
babies                            119390.0    0.007949    0.097436        0.00        0.00       0.000         0.0        10.0
is_repeated_guest                 119390.0    0.031912    0.175767        0.00        0.00       0.000         0.0         1.0
previous_cancellations            119390.0    0.087118    0.844336        0.00        0.00       0.000         0.0        26.0
previous_bookings_not_canceled    119390.0    0.137097    1.497437        0.00        0.00       0.000         0.0        72.0
booking_changes                   119390.0    0.221124    0.652306        0.00        0.00       0.000         0.0        21.0
agent                             103050.0   86.693382  110.774548        1.00        9.00      14.000       229.0       535.0
company                             6797.0  189.266735  131.655015        6.00       62.00     179.000       270.0       543.0
days_in_waiting_list              119390.0    2.321149   17.594721        0.00        0.00       0.000         0.0       391.0
adr                               119390.0  101.831122   50.535790       -6.38       69.29      94.575       126.0      5400.0
required_car_parking_spaces       119390.0    0.062518    0.245291        0.00        0.00       0.000         0.0         8.0
total_of_special_requests         119390.0    0.571363    0.792798        0.00        0.00       0.000         1.0         5.0

1. Data preprocessing

data_origin.isnull().sum() # count the missing values in each column
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64

agent and company have by far the most missing values (16340 and 112593 of 119390 rows), so we drop these two columns.

data_origin = data_origin.drop(['agent','company'],axis = 1) ## drop the two columns with many missing values
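
The decision above can also be expressed as a rule on the share of missing values per column. The following is only a sketch: the helper name `drop_sparse_columns`, its `max_missing` parameter, and the toy frame are assumptions for illustration and do not appear in the original notebook. From the counts above, agent is missing in 16340/119390 ≈ 13.7% of rows and company in 112593/119390 ≈ 94.3%, so a cutoff of 10% would drop both.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.10):
    """Drop columns whose share of missing values exceeds max_missing (hypothetical helper)."""
    missing_share = df.isnull().mean()  # fraction of NaN per column
    to_drop = missing_share[missing_share > max_missing].index.tolist()
    return df.drop(columns=to_drop), missing_share

# Tiny stand-in frame; in the notebook one would pass data_origin instead.
toy = pd.DataFrame({
    "lead_time": [342, 737, 7, 13],
    "agent": [np.nan, np.nan, 304.0, 240.0],  # 50% missing
    "children": [0.0, 0.0, np.nan, 0.0],      # 25% missing
})
cleaned, share = drop_sparse_columns(toy, max_missing=0.30)
print(share)
print(cleaned.columns.tolist())
```

Whether a column such as agent is better dropped or imputed depends on the later analysis; the cutoff here is a judgment call, not a fixed rule.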

Exploratory data analysis

Correlation plot

data = data_origin
# Correlation matrix, restricted to the numeric columns
data_corr = data.select_dtypes(include='number').corr()
width, height = data_corr.shape
labels = data_corr.columns
patches,colors=[],[]
# Draw one circle per cell; the radius grows with |correlation|
for x in range(width):
    for y in range(height):
        datum = data_corr.iloc[x, y]
        d = np.abs(datum)
        patch = Circle((x, y), radius=d/4+0.2)
        colors.append(datum)
        patches.append(patch)
fig,ax=plt.subplots(figsize=(13,10))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
coll = PatchCollection(patches,array=np.array(colors),cmap=cmap)
ax.add_collection(coll)
# Set the axis limits
ax.set_xlim(-0.5, width-.5)
ax.set_ylim(-0.5, height-.5)
# Draw the separator lines between cells
for i in range(0, width):
    plt.axvline(i+.5, color="gray",linewidth=0.5)
    plt.axhline(i+.5, color="gray",linewidth=0.5)
# Add the axis labels: column index along x, "name---index" along y
for i in range(0, width):
    plt.text(i, -.6 ,str(i) ,fontsize=15,horizontalalignment="center")
    plt.text(-.6, i ,labels[i]+ '---'+str(i),fontsize=15,horizontalalignment="right")
# Draw the outer border (ymin/ymax and xmin/xmax are axes fractions, so the defaults already span the full axis)
plt.axvline(-.5, color="grey",lw=2)
plt.axvline(width-.5, color="grey", lw=2)
plt.axhline(height-.5, color="grey",lw=2)
plt.axhline(-.5, color="grey",lw=2)
# Add the colorbar
cbar=plt.colorbar(coll, ax=ax)
ax.invert_yaxis()
plt.axis("off")
# plt.savefig('corr1.png', dpi=1000, transparent=False)
plt.show()
# Replace missing country values with 'unknown'
data_origin['country']=data_origin['country'].replace(np.nan,'unknown')

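2. Country of origin vs. cancellation

2.1 Country codes and names of the main countries

2.2 Selecting the main countries

country has missing values, so we deal with those first. The count plot below shows that country has a very large number of categories with a long-tailed distribution, so the missing values can simply be replaced with unknown. We then keep only the main countries (about 90% of bookings) for plotting.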
figure_polaris(15,7)
# sns.set(style="darkgrid")
sns.countplot(x=data_origin['country'])
<matplotlib.axes._subplots.AxesSubplot at 0x2cc23a15088>
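data_new = data_origin
# Bookings per country, sorted in descending order: list1 for canceled, list2 for non-canceled bookings
list1 = list(data_new[data_new["is_canceled"]==1].groupby(["country"])["country"]
                                 .count().sort_values(ascending=False))
list2 = list(data_new[data_new["is_canceled"]==0].groupby(["country"])["country"]
                                 .count().sort_values(ascending=False))
len1 = 10  # number of top countries kept among canceled bookings
len2 = 15  # number of top countries kept among non-canceled bookings
print(
      "Share of canceled bookings covered by the selected countries:  %.2f%% " % (sum(list1[:len1])*100/sum(list1)),
    "\nShare of non-canceled bookings covered by the selected countries:  %.2f%% " % (sum(list2[:len2])*100/sum(list2)),
    "\nCanceled bookings from countries not selected:",sum(list1)-sum(list1[:len1]),
    "\nNon-canceled bookings from countries not selected:",sum(list2)-sum(list2[:len2]),
    )
Share of canceled bookings covered by the selected countries:  88.80%  
Share of non-canceled bookings covered by the selected countries:  89.66%  
Canceled bookings from countries not selected: 4953 
Non-canceled bookings from countries not selected: 7770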

2.2.1 Data preparation

Step 1: from the canceled bookings, take the top len1 countries and group the rest as "other country":
cancael_index_temp: names of the main countries
cancael_value_temp: booking counts of the main countries
cancael_index_temp = list(data_new[data_new["is_canceled"]==1].groupby(["country"])["country"]
                    .count().sort_values(ascending=False).index)[:len1]+["other country"]
cancael_value_temp = list(data_new[data_new["is_canceled"]==1].groupby(["country"])["country"]
                    .count().sort_values(ascending=False))[:len1] + [(sum(list1)-sum(list1[:len1]))]
Step 2: from the non-canceled bookings, take the top len2 countries and group the rest as "other country":
uncancael_index_temp: names of the main countries
uncancael_value_temp: booking counts of the main countries
uncancael_index_temp = list(data_new[data_new["is_canceled"]==0].groupby(["country"])["country"]
                    .count().sort_values(ascending=False).index)[:len2]+["other country"]
uncancael_value_temp = list(data_new[data_new["is_canceled"]==0].groupby(["country"])["country"]
                    .count().sort_values(ascending=False))[:len2]+ [(sum(list2)-sum(list2[:len2]))]
## Merge into one list each, so they can be passed straight to the plotting call
country_iscanceled_index = (uncancael_index_temp + cancael_index_temp)
country_iscanceled_value = (uncancael_value_temp + cancael_value_temp)
## Share of non-canceled and canceled bookings
temp = data_new.groupby(["is_canceled"])["is_canceled"].count().values
country_iscanceled_inner = list(temp / sum(temp))
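
The "top N plus other" selection is repeated for the canceled and non-canceled bookings above. As a sketch only, it could be wrapped in one helper; `top_n_with_other` and the toy series are hypothetical names introduced here, and `value_counts` is swapped in for the `groupby(...).count().sort_values(...)` chain. Note that `value_counts` skips NaN by default, which matches the notebook only because country was already filled with 'unknown'.

```python
import pandas as pd

def top_n_with_other(values, n, other_label="other country"):
    """Return (labels, counts) for the n most frequent values plus one 'other' bucket."""
    counts = values.value_counts()  # sorted in descending order by default
    top = counts.iloc[:n]
    labels = top.index.tolist() + [other_label]
    sizes = top.tolist() + [int(counts.iloc[n:].sum())]
    return labels, sizes

# Toy country column; in the notebook this would be e.g.
# data_new.loc[data_new["is_canceled"] == 1, "country"].
toy = pd.Series(["PRT"] * 5 + ["GBR"] * 3 + ["FRA"] * 2 + ["ESP", "DEU"])
labels, sizes = top_n_with_other(toy, n=2)
coverage = sum(sizes[:2]) / len(toy)  # share of rows covered by the selected countries
print(labels, sizes, f"{coverage:.2%}")
```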

2.3 Plotting

import pyecharts.options as opts
from pyecharts.charts import Pie
from pyecharts.commons.utils import JsCode
# inner ring: [share of non-canceled bookings, share of canceled bookings]
inner_x_data = ["Not canceled", "Canceled"]
inner_y_data = country_iscanceled_inner
inner_data_pair = [list(z) for z in zip(inner_x_data, inner_y_data)]

# outer ring: booking counts per country, non-canceled first, then canceled
outer_x_data = country_iscanceled_index
outer_y_data = country_iscanceled_value
outer_data_pair = [list(z) for z in zip(outer_x_data, outer_y_data)]

(
    Pie(init_opts=opts.InitOpts())
    .add(
        series_name="Booking status:",
        data_pair=inner_data_pair,
        radius=[0, "30%"],
        center=["55%","50%"],
        label_opts=opts.LabelOpts(position="inner",formatter="{b} \n\n {d}%"),
    )
    .add(
        series_name="Country:",
        data_pair=outer_data_pair,
        radius=["31%","50%"],
        center=["55%","50%"],
        label_opts=opts.LabelOpts(position="outer"),
    )
    .set_colors(['#44a0d6',"#fc7716","#74c476","#9e9ac8","#4c72b0","#ee854a","#6acc64",
                 "#d65f5f","#8c613c","#dc7ec0","#797979","#d5bb67","#82c6e2","#faceb6",
                 "#fae9b6","#e3fab6","#b6faf6","#d6b6fa"])

    .set_global_opts(
        tooltip_opts =opts.TooltipOpts(formatter=" {a} <br/> {b}  {d}%",axis_pointer_type = "cross",),
        legend_opts =opts.LegendOpts(type_='scroll',orient='vertical',pos_left="5%",pos_top= 'middle'),
        title_opts=opts.TitleOpts(title="Cancellation status & guest country distribution",pos_left="center")
    )
    .render_notebook()
)

[Figure: nested pie chart, cancellation status (inner ring) and guest country (outer ring)]

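2.4 Analysis

The chart shows that these hotels in Portugal mainly host domestic guests (unsurprisingly), followed by guests from the United Kingdom. Among guests from Portugal (PRT), canceled bookings make up a large share, whereas guests from the UK cancel comparatively rarely.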
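
The pie chart compares absolute counts, so a per-country cancellation rate makes the statement above easier to check. This is a minimal sketch on made-up rows (the toy frame is an assumption, not data from the notebook); since is_canceled is 0/1, its group mean is the cancellation rate.

```python
import pandas as pd

# Made-up bookings for illustration; the notebook would use data_origin instead.
toy = pd.DataFrame({
    "country":     ["PRT", "PRT", "PRT", "PRT", "GBR", "GBR", "GBR", "GBR", "GBR"],
    "is_canceled": [1,     1,     1,     0,     0,     0,     0,     1,     0],
})
rate_by_country = (
    toy.groupby("country")["is_canceled"]
       .agg(bookings="count", cancel_rate="mean")  # mean of a 0/1 flag = rate
       .sort_values("bookings", ascending=False)
)
print(rate_by_country)
```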

3. Lead time vs. cancellation

3.1 Violin plot

# `arrival_date_year` vs `lead_time` vs `is_canceled` exploration with violin plot
data = data_origin
figure_polaris(15,10)
# plt.figure(figsize=(15,10))
sns.violinplot(x='arrival_date_year', y ='lead_time', hue="is_canceled", data=data, palette="Set3", bw=.2,
               cut=2, linewidth=2, inner='box', split=True)
sns.despine(left=True)
plt.title('Arrival Year VS Lead Time vs Canceled Situation', weight='bold', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Lead Time', fontsize=16)
Text(0, 0.5, 'Lead Time')

Regression plot

# Examine how lead time (days between booking and arrival) affects cancellation
# group data for lead_time:
lead_cancel_data = data.groupby("lead_time")["is_canceled"].describe()
# use only lead times with at least 10 bookings for the graph:
lead_cancel_data_10 = lead_cancel_data.loc[lead_cancel_data["count"] >= 10]

# show figure:
plt.figure(figsize=(15, 7))

# x: lead time in days, y: cancellation rate in percent
x = pd.Series(lead_cancel_data_10.index, name="x_var")
y = pd.Series(lead_cancel_data_10["mean"].values * 100, name="y_var")
sns.regplot(x=x, y=y)
plt.title("Effect of lead time on cancellation", fontsize=16)
plt.xlabel("Lead time", fontsize=16)
plt.ylabel("Cancellations [%]", fontsize=16)
plt.show()

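Analysis: violin plot

The plot shows the relationship between arrival year (arrival_date_year), lead time (lead_time), and cancellation (is_canceled).
Lead time is the number of days between the date the booking was entered and the arrival date.

Among guests who did not cancel, the lead-time distribution is fairly stable. Among guests who canceled, lead times tend to be longer.

Conclusion

Guests who book further in advance are more likely to cancel.

Analysis: regression plot

Bookings made only a few days before arrival are rarely canceled. The cancellation rate rises as the lead time grows, and bookings made about a year ahead have a higher cancellation rate still, which matches common sense.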
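
The same trend can be read off a small table by grouping lead times into buckets. The sketch below uses invented rows and bucket edges (both are assumptions for illustration); `pd.cut` is swapped in for the per-day grouping used in the regression plot.

```python
import numpy as np
import pandas as pd

# Invented bookings; the notebook would use data[["lead_time", "is_canceled"]].
toy = pd.DataFrame({
    "lead_time":   [0, 3, 5, 10, 20, 25, 45, 60, 80, 120, 150, 200, 300, 400],
    "is_canceled": [0, 0, 0, 0,  1,  0,  0,  1,  0,  1,   0,   1,   1,   1],
})
# Hypothetical bucket edges in days; right=False makes intervals [left, right).
bins = [0, 7, 30, 90, 180, 365, np.inf]
toy["lead_bucket"] = pd.cut(toy["lead_time"], bins=bins, right=False)
rate_by_bucket = toy.groupby("lead_bucket", observed=True)["is_canceled"].mean()
print(rate_by_bucket)
```

Buckets with very few bookings give noisy rates, which is the same reason the regression plot keeps only lead times with at least 10 bookings.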

4. Per-person price and number of arrivals

adr: Average Daily Rate of a booking, defined as the sum of all lodging transactions divided by the total number of staying nights

data = data_origin
# Look at the distribution of adr
plt.subplots(figsize=(15,5))
plt.scatter(data['adr'].index,data['adr'].values)
<matplotlib.collections.PathCollection at 0x2cc2b19fc48>

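The scatter plot shows one outlier; print its value to take a look.

# Use .loc with the row label of the largest adr to fetch that value
data.loc[data['adr'].idxmax(), 'adr']
5400.0

This is a City Hotel booking with an adr of 5400, far above every other value. With more than a hundred thousand bookings this single outlier has little effect on the mean, but for rigor we remove it later.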
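
One way to inspect and remove such a row is sketched below on a toy frame (the rows are invented, apart from reusing the 5400 value). `idxmax` returns the row label of the largest adr, and `drop` returns a new frame without that row.

```python
import pandas as pd

# Invented rows; the notebook would operate on data instead.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "adr":   [75.0, 5400.0, 98.0, 101.5],
})
outlier_idx = toy["adr"].idxmax()      # row label of the largest adr
print(toy.loc[outlier_idx])            # inspect the whole row, not only adr
trimmed = toy.drop(index=outlier_idx)  # new frame without the outlier
print(toy["adr"].mean(), trimmed["adr"].mean())
```

On four rows the mean shifts a lot; on the full dataset the shift is far smaller, as noted above.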

## Replace month names with their numbers so the plots sort correctly
import calendar
for i in range(1,13):
    data['arrival_date_month']=data['arrival_date_month'].replace(calendar.month_name[i],i)
data_new0 = data
# .copy() so the new columns go into an independent frame rather than a view of data
data_new = data_new0[data_new0["is_canceled"] == 0].copy()
data_new["people"] = data_new['adults'] + data_new["children"]
data_new["people"] = data_new["people"].replace(0.0,1) ## replace a total of 0 people with 1 so the per-person price does not become inf
## note: a stay of 0 total nights still makes the denominator 0
data_new["adr_per"] = data_new['adr'] / (data_new["people"]*(data_new["stays_in_weekend_nights"]+data_new["stays_in_week_nights"]))
# data_new["adr_per"] = data_new["adr_per"].replace(float('inf'),0)

# Select the column before .mean() so non-numeric columns are not averaged
data_aver_price_city_temp = data_new[data_new["hotel"]=="City Hotel"].groupby("arrival_date_month")['adr_per'].mean()
data_aver_price_resort_temp = data_new[data_new["hotel"]=="Resort Hotel"].groupby("arrival_date_month")['adr_per'].mean()

data_count_price_city_temp = data_new[data_new["hotel"]=="City Hotel"].groupby("arrival_date_month")['people'].count()
data_count_price_resort_temp = data_new[data_new["hotel"]=="Resort Hotel"].groupby("arrival_date_month")['people'].count()
## Arrange the data for plotting
data_aver_city = pd.DataFrame({"arrival_date_month": list(data_aver_price_city_temp.index),
                                "hotel": "City Hotel",
                                "average_cost": list(data_aver_price_city_temp.values),
                                "count":data_count_price_city_temp  })
data_aver_resort = pd.DataFrame({"arrival_date_month": list(data_aver_price_resort_temp.index),
                                "hotel": "Resort Hotel",
                                "average_cost": list(data_aver_price_resort_temp.values),
                                "count":data_count_price_resort_temp })
data_average_hotel = pd.concat([data_aver_city, data_aver_resort], ignore_index=True)
# data_average_hotel
# sns.set_style("whitegrid")
sns.set(style="darkgrid")
## Top panel: average per-person cost; bottom panel: number of bookings

fig, ax = plt.subplots(2,1,figsize=(15, 9))
sns.barplot(x="arrival_date_month", y="average_cost", hue="hotel", palette=sns.color_palette(), ax=ax[0],data=data_average_hotel)

ax[0].set_xlabel(' ', fontsize=15)
ax[0].set_ylabel('Cost', fontsize=15)
ax[0].set_title('Average Cost of Every Month', fontsize=18)

sns.barplot(x="arrival_date_month", y="count", hue="hotel", palette=sns.color_palette(),ax=ax[1],data=data_average_hotel)
# plt.xlabel("Month", fontsize=16)
ax[1].set_xlabel('Month', fontsize=15)
ax[1].set_ylabel('Count', fontsize=15)
ax[1].set_title('Order Count of Every Month', fontsize=18)
Text(0.5, 1.0, 'Order Count of Every Month')

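Analysis

Portugal's peak tourist season runs from June to September each year. Taking Lisbon (the capital) as an example, the beaches are busiest in July and August, and this matches the Resort Hotel's order count peaking in July and August (the chart counts bookings, not individual guests). The average-cost chart also shows that July and August have the highest average per-person price of the year, which suggests that many tourists come here on holiday in these two months.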
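
Because `count()` on the people column counts rows, the lower chart shows bookings per month. If the number of guests is wanted instead, the people column would be summed. A small sketch with invented rows shows that the two measures can rank months differently:

```python
import pandas as pd

# Invented bookings; in the notebook the source would be data_new.
toy = pd.DataFrame({
    "arrival_date_month": [7, 7, 7, 8, 8],
    "people":             [2, 2, 1, 3, 4],
})
per_month = toy.groupby("arrival_date_month")["people"].agg(
    bookings="count",  # number of rows (orders)
    guests="sum",      # number of people across those orders
)
print(per_month)
```

Here month 7 has more bookings but month 8 has more guests.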

5. Cancellation and meal choice

# data['country']=data['country'].replace(np.nan,'PRT')
data["meal"] = data["meal"].replace(np.nan,"SC")
# data.drop(data_new.index[zero_guests], inplace=True)
meal_data = data[["hotel", "is_canceled", "meal"]]
# meal_data

plt.figure(figsize=(15, 10))
plt.subplot(1,2,1)
plt.pie(meal_data.loc[meal_data["is_canceled"]==0, "meal"].value_counts(), 
        labels=meal_data.loc[meal_data["is_canceled"]==0, "meal"].value_counts().index, 
       autopct="%.2f%%")
plt.title("Meal Choice of Uncanceled People", fontsize=16)
plt.legend(loc="upper right")
 
plt.subplot(1,2,2)
plt.pie(meal_data.loc[meal_data["is_canceled"]==1, "meal"].value_counts(), 
        labels=meal_data.loc[meal_data["is_canceled"]==1, "meal"].value_counts().index, 
       autopct="%.2f%%")
plt.title("Meal Choice of Canceled People", fontsize=16)
plt.legend(loc="upper right")

<matplotlib.legend.Legend at 0x2cc295b51c8>

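Analysis

The two pie charts are almost identical, which shows that guests who canceled and guests who did not made essentially the same meal choices.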
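
The visual comparison of the two pies can be backed by a table of shares. The sketch below uses `pd.crosstab` with `normalize="index"` on invented rows (an assumption for illustration); each row of the result sums to 1, so the rows are directly comparable.

```python
import pandas as pd

# Invented rows; the notebook would use meal_data instead.
toy = pd.DataFrame({
    "is_canceled": [0, 0, 0, 0, 1, 1, 1, 1],
    "meal":        ["BB", "BB", "BB", "HB", "BB", "BB", "BB", "SC"],
})
# Share of each meal type within non-canceled (0) and canceled (1) bookings
meal_share = pd.crosstab(toy["is_canceled"], toy["meal"], normalize="index")
# Largest absolute difference in share between the two groups
max_gap = (meal_share.loc[0] - meal_share.loc[1]).abs().max()
print(meal_share)
print(max_gap)
```

A small `max_gap` on the real data would support the statement that the two groups choose meals in nearly the same proportions.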

6. Cancellation by hotel type

## Data preparation
data_new = data_origin
## Count the non-canceled and canceled records per hotel
uncancel_hotel_count = data_new[data_new["is_canceled"]==0].groupby(["hotel"])["hotel"].count()
cancel_hotel_count = data_new[data_new["is_canceled"]==1].groupby(["hotel"])["hotel"].count()
x1 = uncancel_hotel_count['City Hotel']
x2 = uncancel_hotel_count['Resort Hotel']
y1 = cancel_hotel_count['City Hotel']
y2 = cancel_hotel_count['Resort Hotel']

## All percentages below are relative to the total number of bookings
data_Uncanceled_percent = (x1+x2)/(x1+x2+y1+y2) *100
data_Canceled_percent   = (y1+y2)/(x1+x2+y1+y2) *100

data_City_percent       = (x1+y1)/(x1+x2+y1+y2) *100
data_Resort_percent     = (x2+y2)/(x1+x2+y1+y2) *100

data_Uncanceled_City    = x1/ (x1+x2+y1+y2)*100
data_Uncanceled_Resort  = x2/ (x1+x2+y1+y2)*100

data_Canceled_City      = y1/ (x1+x2+y1+y2)*100
data_Canceled_Resort    = y2/ (x1+x2+y1+y2)*100
## Plotting
import pyecharts.options as opts
from pyecharts.charts import Pie
from pyecharts.commons.utils import JsCode

inner_x_data = ["Not canceled", "Canceled"]
inner_y_data = [data_Uncanceled_percent, data_Canceled_percent]
inner_data_pair = [list(z) for z in zip(inner_x_data, inner_y_data)]

outer_x_data = ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"]
outer_y_data = [data_Uncanceled_City,data_Uncanceled_Resort, data_Canceled_City,data_Canceled_Resort]
outer_data_pair = [list(z) for z in zip(outer_x_data, outer_y_data)]

inner2_x_data = ["City Hotel", "Resort Hotel"]
inner2_y_data = [data_City_percent, data_Resort_percent]
inner2_data_pair = [list(z) for z in zip(inner2_x_data, inner2_y_data)]

outer2_x_data = [ "Not canceled", "Canceled", "Not canceled", "Canceled"]
outer2_y_data = [data_Uncanceled_City,data_Canceled_City, data_Uncanceled_Resort,data_Canceled_Resort]
outer2_data_pair = [list(z) for z in zip(outer2_x_data, outer2_y_data)]

(
    Pie(init_opts=opts.InitOpts())
    ## pie 1: cancellation status inside, hotel type outside
    .add(
        series_name="1-is_canceled:",
        data_pair=inner_data_pair,
        radius=[0, "35%"],
        center=["30%", "50%"],
    )

    .add(
        series_name="1-hotel:",
        data_pair=outer_data_pair,
        center=["30%", "50%"],
        radius=["36%","65%"],
        label_opts=opts.LabelOpts(position="inner"),
    )
    ## pie 2: hotel type inside, cancellation status outside
    .add(
        series_name="2-hotel:",
        data_pair=inner2_data_pair,
        radius=[0, "35%"],
        center=["70%", "50%"],
    )
    .add(
        series_name="2-is_canceled:",
        data_pair=outer2_data_pair,
        center=["70%", "50%"],
        radius=["36%","65%"],
        label_opts=opts.LabelOpts(position="inner"),
    )
    .set_colors(['#6baed6',"#fd8d3c", "#74c476", "#9e9ac8"])
    .set_global_opts(tooltip_opts=opts.TooltipOpts(formatter=" {a} <br/> {b}  {d}%",
                      axis_pointer_type = "cross",
                   ), )
    .set_series_opts(label_opts=opts.LabelOpts(position="inner",formatter="{b} \n\n {d}%"),)
    .render_notebook()
)

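[Figure: two nested pie charts, cancellation status by hotel type (left) and hotel type by cancellation status (right)]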

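Analysis:

Overall:

By cancellation status, about $63\%$ of bookings were not canceled (the mean of is_canceled is 0.370416).

By hotel type, more than $66\%$ of bookings are for the City Hotel; the Resort Hotel accounts for a smaller share.


Among canceled bookings, the City Hotel has about $3$ times as many as the Resort Hotel, roughly $3/4$ ($74.8\%$) of all cancellations. Most of this is explained by volume, since the City Hotel makes up about $66\%$ of all bookings. Still, $74.8\%$ is higher than $66\%$, which indicates that `the City Hotel has a somewhat higher cancellation rate`.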
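
The comparison of 74.8% against 66% can be turned into per-hotel cancellation rates with Bayes' rule, P(cancel | hotel) = P(hotel | cancel) · P(cancel) / P(hotel). The sketch below plugs in the rounded figures quoted in the text and the mean of is_canceled from the describe() output; the variable names are introduced here for illustration, and the results are only as precise as those rounded inputs.

```python
# Inputs taken (rounded) from the text and the describe() output above.
p_cancel = 0.370416          # overall share of canceled bookings
p_city = 0.66                # share of all bookings that are City Hotel
p_city_given_cancel = 0.748  # share of canceled bookings that are City Hotel

# Bayes' rule: P(cancel | hotel) = P(hotel | cancel) * P(cancel) / P(hotel)
rate_city = p_city_given_cancel * p_cancel / p_city
rate_resort = (1 - p_city_given_cancel) * p_cancel / (1 - p_city)
print(f"City Hotel cancellation rate:   {rate_city:.1%}")
print(f"Resort Hotel cancellation rate: {rate_resort:.1%}")
```

With these inputs the City Hotel comes out at roughly 42% and the Resort Hotel at roughly 27%. On the full dataset the same rates could be read directly from `data_new.groupby("hotel")["is_canceled"].mean()`.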

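7. Predicting whether a guest will cancel

Step 1: compute the correlation of each feature with "is_canceled". Categorical variables cannot take part in this calculation.

data_new = data_origin
# restrict to numeric columns before computing correlations
cancel_corr = data_new.select_dtypes(include='number').corr()["is_canceled"]
cancel_corr.abs().sort_values(ascending=False)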
is_canceled                       1.000000
lead_time                         0.293123
total_of_special_requests         0.234658
required_car_parking_spaces       0.195498
booking_changes                   0.144381
previous_cancellations            0.110133
is_repeated_guest                 0.084793
adults                            0.060017
previous_bookings_not_canceled    0.057358
days_in_waiting_list              0.054186
adr                               0.047557
babies                            0.032491
stays_in_week_nights              0.024765
arrival_date_year                 0.016660
arrival_date_month                0.011022
arrival_date_week_number          0.008148
arrival_date_day_of_month         0.006130
children                          0.005048
stays_in_weekend_nights           0.001791
Name: is_canceled, dtype: float64
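
Categorical columns are left out of the table above, but they can be brought into the same kind of comparison by one-hot encoding them first. This is a sketch on invented rows, not the author's method: `pd.get_dummies` creates one 0/1 column per category, and `corrwith` correlates each of them with is_canceled.

```python
import pandas as pd

# Invented rows for illustration; the notebook would use data_new instead.
toy = pd.DataFrame({
    "deposit_type": ["No Deposit"] * 6 + ["Non Refund"] * 4,
    "is_canceled":  [0, 0, 0, 0, 1, 0, 1, 1, 1, 0],
})
# One 0/1 indicator column per category, cast to float for the correlation
dummies = pd.get_dummies(toy["deposit_type"]).astype(float)
corr = dummies.corrwith(toy["is_canceled"])
print(corr.sort_values(ascending=False))
```

With only two categories the two indicator columns are mirror images, so their correlations have equal size and opposite sign.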

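Step 2: model training
Build base models with a decision tree, a random forest, logistic regression, and an XGBoost classifier, and see which one gives the best result.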

# for ML:
from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier  # random forest
from xgboost import XGBClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import eli5 # Feature importance evaluation

# Manually choose the columns to include.
# To keep the model general and prevent leakage, these columns are excluded:
# booking_changes, days_in_waiting_list, arrival_date_year, assigned_room_type, reservation_status, country.
# Including country would improve accuracy, but it might also make the model less general.
num_features = ["lead_time","total_of_special_requests","required_car_parking_spaces", 
                 "previous_cancellations","is_repeated_guest","adults","previous_bookings_not_canceled",
                "adr","babies","stays_in_weekend_nights","arrival_date_week_number","arrival_date_day_of_month",
                "children","stays_in_week_nights"]

cat_features = ["hotel","arrival_date_month","meal","market_segment",
                "distribution_channel","reserved_room_type","deposit_type","customer_type"]
# Separate the features from the target
features = num_features + cat_features
X = data_new.drop(["is_canceled"], axis=1)[features]
y = data_new["is_canceled"]

# Preprocess the numerical features:
# for most numerical columns, other than the dates, 0 is the most logical fill value;
# no date values are missing here.
num_transformer = SimpleImputer(strategy="constant")

# Preprocess the categorical features:
cat_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
                                  ("onehot", OneHotEncoder(handle_unknown='ignore'))])

# Bundle the preprocessing for numerical and categorical features:
preprocessor = ColumnTransformer(transformers=[("num", num_transformer, num_features),
                                               ("cat", cat_transformer, cat_features)])

# Define the models to test:
base_models = [("DT_model", DecisionTreeClassifier(random_state=42)),
               ("RF_model", RandomForestClassifier(random_state=42,n_jobs=-1)),
               ("LR_model", LogisticRegression(random_state=42,n_jobs=-1,solver='liblinear')),
               ("XGB_model", XGBClassifier(random_state=42, n_jobs=-1))]

# Split the data into 'kfolds' parts for cross-validation,
# with shuffle to ensure a random distribution of the data:
kfolds = 4 # 4 = 75% train, 25% validation
split = KFold(n_splits=kfolds, shuffle=True, random_state=42)

# Preprocess, fit, predict, and score for each model:
for name, model in base_models:
    # Pack the preprocessing and the model into one pipeline:
    model_steps = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])
    
    # Get the cross-validation scores for each model:
    cv_results = cross_val_score(model_steps, 
                                 X, y, 
                                 cv=split,
                                 scoring="accuracy",
                                 n_jobs=-1)
    # output:
    min_score = round(min(cv_results), 4)
    max_score = round(max(cv_results), 4)
    mean_score = round(np.mean(cv_results), 4)
    std_dev = round(np.std(cv_results), 4)
    print(f"{name} cross validation accuracy score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")


DT_model cross validation accuracy score: 0.8196 +/- 0.0024 (std) min: 0.8157, max: 0.822
RF_model cross validation accuracy score: 0.8521 +/- 0.0018 (std) min: 0.8494, max: 0.8542
LR_model cross validation accuracy score: 0.8085 +/- 0.0018 (std) min: 0.806, max: 0.8108
XGB_model cross validation accuracy score: 0.8403 +/- 0.0007 (std) min: 0.8394, max: 0.8413

Result: the random forest (RF) has the highest accuracy, with a mean cross-validation score of 0.8521.
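
To put these accuracies in context, they can be compared with a trivial baseline that always predicts the majority class. The sketch below is plain arithmetic on numbers already shown in this post (the mean of is_canceled and the four cross-validation means); the variable names are introduced here for illustration.

```python
cancel_rate = 0.370416  # mean of is_canceled from the describe() output
# Always predicting "not canceled" is right for every non-canceled booking
baseline_accuracy = max(cancel_rate, 1 - cancel_rate)

# Mean cross-validation accuracies reported above
cv_accuracy = {"DT_model": 0.8196, "RF_model": 0.8521,
               "LR_model": 0.8085, "XGB_model": 0.8403}
lift = {name: round(score - baseline_accuracy, 4) for name, score in cv_accuracy.items()}
best = max(cv_accuracy, key=cv_accuracy.get)
print(f"baseline: {baseline_accuracy:.4f}")
print(lift)
print("best model:", best)
```

All four models clear the baseline of about 0.63 by a wide margin, and the ranking between them is unchanged.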

※ Parts of the code and the approach draw on the following articles

CSDN: kaggle: Hotel booking demand (酒店预订需求) by 牛牛liunian

Kaggle: house-booking by swapnilwagh061993

Kaggle: Hotel bookings ML project - kernel688ef04346 by somepro
