Hotel Booking Demand

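Contents

Introduction

1. Import libraries

2. Read the data

3. Inspect the data

4. Field meanings

5. Handling missing values

6. Data visualization

6.1. Hotel bookings by month

6.2. Room rates by month

6.3. Popular room types

6.4. Bookings and cancellations by sales channel

6.5. Where guests come from

6.6. How long guests usually stay

6.7. Repeat-guest rate

7. Correlation analysis

8. Feature processing

9. Splitting the training and test sets

10. Building the models

10.1. Logistic regression

10.2. Decision tree

10.3. Random forest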


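Introduction

   This hotel booking demand project uses Python to preprocess the data and to analyze the hotels' room-type supply, how demand changes across time periods, the core customer groups, and the factors that influence cancellations. It visualizes the results and then builds models that predict booking cancellations with logistic regression, decision tree, and random forest algorithms.

   Dataset: hotel_bookings.csv

1. Import libraries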

import pandas as pd
import numpy as np
from pandas import Series,DataFrame
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import plotly.express as px
import os
import folium
from sklearn.preprocessing import LabelEncoder 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline

 

2. Read the data

os.getcwd()
os.chdir(r'C:\Users\zhuqing\dataProject\hotel-booking-demand\data')
df=pd.read_csv('hotel_bookings.csv')
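
Changing the working directory ties the notebook to one machine. As a minimal sketch, the same file can be read by passing a full path to pd.read_csv instead. The helper name load_bookings and its data_dir argument are hypothetical and are not part of the original notebook; data_dir is whichever folder holds hotel_bookings.csv.

```python
from pathlib import Path

import pandas as pd


def load_bookings(data_dir, filename="hotel_bookings.csv"):
    """Read the bookings CSV from data_dir without calling os.chdir.

    load_bookings and data_dir are illustrative names (assumptions).
    """
    csv_path = Path(data_dir) / filename
    if not csv_path.is_file():
        # Fail early with the full path so a wrong folder is easy to spot.
        raise FileNotFoundError(f"expected the dataset at {csv_path}")
    return pd.read_csv(csv_path)
```

With this helper, df=load_bookings(r'...\data') would replace the os.chdir and pd.read_csv pair, and the rest of the notebook stays the same.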

3. Inspect the data

df.head()

df.describe()

df.info()

List the unique values of each string (object-dtype) column

category_columns=[ x for x in df.columns if df[x].dtype=='O']
for col in category_columns:
    print('{0} values : {1}\n'.format(col,df[col].unique()))
    print('-'*20)

The output is as follows:

hotel values : ['Resort Hotel' 'City Hotel']

--------------------
arrival_date_month values : ['July' 'August' 'September' 'October' 'November' 'December' 'January'
 'February' 'March' 'April' 'May' 'June']

--------------------
meal values : ['BB' 'FB' 'HB' 'SC' 'Undefined']

--------------------
country values : ['PRT' 'GBR' 'USA' 'ESP' 'IRL' 'FRA' nan 'ROU' 'NOR' 'OMN' 'ARG' 'POL'
 'DEU' 'BEL' 'CHE' 'CN' 'GRC' 'ITA' 'NLD' 'DNK' 'RUS' 'SWE' 'AUS' 'EST'
 'CZE' 'BRA' 'FIN' 'MOZ' 'BWA' 'LUX' 'SVN' 'ALB' 'IND' 'CHN' 'MEX' 'MAR'
 'UKR' 'SMR' 'LVA' 'PRI' 'SRB' 'CHL' 'AUT' 'BLR' 'LTU' 'TUR' 'ZAF' 'AGO'
 'ISR' 'CYM' 'ZMB' 'CPV' 'ZWE' 'DZA' 'KOR' 'CRI' 'HUN' 'ARE' 'TUN' 'JAM'
 'HRV' 'HKG' 'IRN' 'GEO' 'AND' 'GIB' 'URY' 'JEY' 'CAF' 'CYP' 'COL' 'GGY'
 'KWT' 'NGA' 'MDV' 'VEN' 'SVK' 'FJI' 'KAZ' 'PAK' 'IDN' 'LBN' 'PHL' 'SEN'
 'SYC' 'AZE' 'BHR' 'NZL' 'THA' 'DOM' 'MKD' 'MYS' 'ARM' 'JPN' 'LKA' 'CUB'
 'CMR' 'BIH' 'MUS' 'COM' 'SUR' 'UGA' 'BGR' 'CIV' 'JOR' 'SYR' 'SGP' 'BDI'
 'SAU' 'VNM' 'PLW' 'QAT' 'EGY' 'PER' 'MLT' 'MWI' 'ECU' 'MDG' 'ISL' 'UZB'
 'NPL' 'BHS' 'MAC' 'TGO' 'TWN' 'DJI' 'STP' 'KNA' 'ETH' 'IRQ' 'HND' 'RWA'
 'KHM' 'MCO' 'BGD' 'IMN' 'TJK' 'NIC' 'BEN' 'VGB' 'TZA' 'GAB' 'GHA' 'TMP'
 'GLP' 'KEN' 'LIE' 'GNB' 'MNE' 'UMI' 'MYT' 'FRO' 'MMR' 'PAN' 'BFA' 'LBY'
 'MLI' 'NAM' 'BOL' 'PRY' 'BRB' 'ABW' 'AIA' 'SLV' 'DMA' 'PYF' 'GUY' 'LCA'
 'ATA' 'GTM' 'ASM' 'MRT' 'NCL' 'KIR' 'SDN' 'ATF' 'SLE' 'LAO']

--------------------
market_segment values : ['Direct' 'Corporate' 'Online TA' 'Offline TA/TO' 'Complementary' 'Groups'
 'Undefined' 'Aviation']

--------------------
distribution_channel values : ['Direct' 'Corporate' 'TA/TO' 'Undefined' 'GDS']

--------------------
reserved_room_type values : ['C' 'A' 'D' 'E' 'G' 'F' 'H' 'L' 'P' 'B']

--------------------
assigned_room_type values : ['C' 'A' 'D' 'E' 'G' 'F' 'I' 'B' 'H' 'P' 'L' 'K']

--------------------
deposit_type values : ['No Deposit' 'Refundable' 'Non Refund']

--------------------
customer_type values : ['Transient' 'Contract' 'Transient-Party' 'Group']

--------------------
reservation_status values : ['Check-Out' 'Canceled' 'No-Show']

--------------------
reservation_status_date values : ['2015-07-01' '2015-07-02' '2015-07-03' '2015-05-06' '2015-04-22'
 '2015-06-23' '2015-07-05' '2015-07-06' '2015-07-07' '2015-07-08'
 '2015-05-11' '2015-07-15' '2015-07-16' '2015-05-29' '2015-05-19'
 '2015-06-19' '2015-05-23' '2015-05-18' '2015-07-09' '2015-06-02'
 '2015-07-13' '2015-07-04' '2015-06-29' '2015-06-16' '2015-06-18'
 '2015-06-12' '2015-06-09' '2015-05-26' '2015-07-11' '2015-07-12'
 '2015-07-17' '2015-04-15' '2015-05-13' '2015-07-10' '2015-05-20'
 '2015-05-12' '2015-07-14' '2015-06-17' '2015-05-01' '2015-03-30'
 '2015-07-19' '2015-06-03' '2015-06-26' '2015-05-14' '2015-07-20'
 '2015-05-07' '2015-05-28' '2015-04-13' '2015-03-25' '2015-07-21'
 '2015-06-27' '2015-07-18' '2015-07-23' '2015-06-08' '2015-06-22'
 '2015-06-24' '2015-03-05' '2015-06-01' '2015-04-24' '2015-07-22'
 '2015-05-27' '2015-04-06' '2015-04-11' '2015-07-25' '2015-07-28'
 '2015-07-29' '2015-06-25' '2015-07-24' '2015-06-05' '2015-06-30'
 '2015-06-13' '2015-06-11' '2015-07-30' '2015-07-27' '2015-04-29'
 '2015-06-04' '2015-07-26' '2015-08-01' '2015-08-02' '2015-06-15'
 '2015-04-23' '2015-07-31' '2015-05-25' '2015-08-03' '2015-04-17'
 '2015-08-04' '2015-08-06' '2015-05-15' '2015-05-09' '2015-03-17'
 '2015-05-22' '2015-08-07' '2015-04-04' '2015-08-05' '2015-08-08'
 '2015-08-10' '2015-05-04' '2015-06-06' '2015-08-09' '2015-08-15'
 '2015-08-11' '2015-03-28' '2015-08-14' '2015-08-12' '2015-08-16'
 '2015-05-16' '2015-08-21' '2015-08-13' '2015-08-17' '2015-04-20'
 '2015-08-18' '2015-08-23' '2015-08-22' '2015-08-19' '2015-08-20'
 '2015-08-29' '2015-03-31' '2015-05-30' '2015-08-25' '2015-04-14'
 '2015-08-24' '2015-03-24' '2015-05-21' '2015-08-28' '2015-08-26'
 '2015-08-27' '2015-08-30' '2015-08-31' '2015-09-06' '2015-09-03'
 '2015-09-04' '2015-09-02' '2015-09-01' '2015-09-05' '2015-06-20'
 '2015-09-07' '2015-09-10' '2015-09-11' '2015-09-08' '2015-09-09'
 '2015-09-13' '2015-09-15' '2015-04-10' '2015-01-02' '2014-11-18'
 '2015-09-12' '2015-09-17' '2015-09-14' '2015-04-07' '2015-09-19'
 '2015-09-16' '2015-09-20' '2015-01-18' '2015-10-23' '2015-01-22'
 '2015-01-01' '2015-09-22' '2015-09-24' '2015-09-18' '2015-09-21'
 '2015-09-30' '2015-09-25' '2015-09-27' '2015-09-28' '2015-10-12'
 '2015-09-29' '2015-09-23' '2015-10-01' '2015-09-26' '2015-04-18'
 '2015-10-02' '2015-10-04' '2015-10-08' '2015-10-03' '2015-10-07'
 '2015-10-09' '2015-10-11' '2015-10-05' '2015-10-06' '2015-10-10'
 '2015-10-14' '2015-10-15' '2015-10-18' '2015-10-13' '2015-10-20'
 '2015-10-19' '2015-10-31' '2015-10-16' '2015-10-21' '2015-10-22'
 '2015-10-17' '2015-10-24' '2015-10-25' '2015-10-28' '2015-10-27'
 '2015-10-26' '2015-10-30' '2015-11-05' '2015-10-29' '2015-11-03'
 '2015-11-07' '2015-11-04' '2015-11-01' '2015-11-02' '2015-11-17'
 '2015-11-06' '2015-11-10' '2015-11-08' '2015-11-09' '2015-11-15'
 '2015-11-16' '2015-11-11' '2015-11-12' '2015-11-14' '2015-11-13'
 '2015-11-18' '2015-11-22' '2015-11-19' '2015-11-21' '2015-11-20'
 '2015-11-24' '2015-11-25' '2015-11-23' '2015-11-28' '2015-11-26'
 '2015-11-27' '2015-11-29' '2015-12-04' '2015-12-01' '2015-12-06'
 '2015-12-08' '2015-12-02' '2015-12-03' '2015-12-31' '2015-12-05'
 '2015-12-10' '2015-12-17' '2015-11-30' '2015-12-12' '2015-12-07'
 '2016-01-05' '2015-12-11' '2015-12-13' '2015-12-15' '2015-12-16'
 '2015-12-19' '2015-12-18' '2015-12-26' '2015-12-27' '2015-12-22'
 '2015-12-23' '2015-12-24' '2015-12-29' '2015-12-28' '2015-12-20'
 '2015-12-30' '2016-01-02' '2016-01-01' '2015-12-25' '2016-01-03'
 '2016-01-04' '2016-01-11' '2016-01-07' '2015-12-21' '2016-01-09'
 '2016-01-10' '2016-01-08' '2016-01-06' '2016-01-12' '2016-01-13'
 '2016-01-23' '2016-02-09' '2016-01-15' '2016-01-16' '2016-01-17'
 '2016-01-19' '2016-01-18' '2016-01-21' '2016-01-24' '2016-01-22'
 '2016-01-29' '2016-01-27' '2016-01-25' '2016-03-08' '2016-01-26'
 '2016-01-20' '2016-01-30' '2016-02-01' '2016-02-02' '2016-02-08'
 '2016-02-07' '2016-01-28' '2016-02-05' '2016-02-03' '2016-02-13'
 '2016-02-10' '2016-02-04' '2016-02-12' '2016-02-11' '2016-02-16'
 '2016-02-14' '2016-02-15' '2016-02-20' '2016-02-06' '2016-01-14'
 '2016-02-17' '2016-02-21' '2016-02-24' '2016-02-25' '2016-02-19'
 '2016-02-18' '2016-02-26' '2016-02-23' '2016-03-05' '2016-02-22'
 '2016-02-27' '2016-03-03' '2016-03-24' '2016-03-04' '2016-02-29'
 '2016-03-01' '2016-03-02' '2016-03-30' '2016-03-07' '2016-03-14'
 '2016-03-21' '2016-03-09' '2016-03-12' '2016-03-22' '2016-03-10'
 '2016-03-11' '2016-03-20' '2016-03-15' '2016-03-17' '2016-03-16'
 '2016-03-19' '2016-03-27' '2016-03-18' '2016-03-26' '2016-03-31'
 '2016-03-28' '2016-03-29' '2016-04-01' '2016-03-23' '2016-04-02'
 '2016-03-25' '2016-03-13' '2016-04-04' '2016-04-03' '2016-04-05'
 '2016-04-08' '2016-04-06' '2016-04-09' '2016-04-12' '2016-04-16'
 '2016-04-17' '2016-04-27' '2016-04-14' '2016-04-18' '2016-04-21'
 '2016-04-19' '2016-04-20' '2016-04-10' '2016-04-13' '2016-04-11'
 '2016-04-07' '2016-04-15' '2016-04-22' '2016-04-23' '2016-04-26'
 '2016-04-28' '2016-04-24' '2016-04-25' '2016-04-29' '2016-04-30'
 '2016-05-01' '2016-05-10' '2016-05-02' '2016-05-07' '2016-05-08'
 '2016-05-12' '2016-05-04' '2016-05-06' '2016-05-03' '2016-05-09'
 '2016-05-05' '2016-05-13' '2016-05-14' '2016-05-18' '2016-05-19'
 '2016-05-15' '2016-05-16' '2016-05-11' '2016-05-21' '2016-05-22'
 '2016-05-20' '2016-05-24' '2016-05-25' '2016-05-26' '2016-05-23'
 '2016-05-27' '2016-05-17' '2016-05-29' '2016-05-28' '2016-05-30'
 '2016-05-31' '2016-06-01' '2016-06-03' '2016-06-08' '2016-06-02'
 '2016-06-05' '2016-06-06' '2016-06-13' '2016-06-07' '2016-06-10'
 '2016-06-11' '2016-06-16' '2016-06-12' '2016-06-14' '2016-06-17'
 '2016-06-04' '2016-06-18' '2016-06-21' '2016-06-09' '2016-06-24'
 '2016-06-20' '2016-06-25' '2016-06-22' '2016-06-26' '2016-06-23'
 '2016-07-01' '2016-06-15' '2016-06-28' '2016-07-02' '2016-06-19'
 '2016-06-27' '2016-07-04' '2016-06-30' '2016-07-05' '2016-07-08'
 '2016-07-09' '2016-07-07' '2016-07-12' '2016-06-29' '2016-07-10'
 '2016-07-15' '2016-07-03' '2016-07-16' '2016-07-14' '2016-07-18'
 '2016-07-13' '2016-07-06' '2016-07-20' '2016-07-21' '2016-07-23'
 '2016-07-19' '2016-07-11' '2016-07-28' '2016-07-17' '2016-07-25'
 '2016-07-22' '2016-07-29' '2016-08-03' '2016-08-02' '2016-08-04'
 '2016-08-08' '2016-08-10' '2016-08-01' '2016-08-06' '2016-03-06'
 '2016-08-05' '2016-07-26' '2016-08-07' '2016-07-30' '2016-07-24'
 '2016-08-12' '2016-07-27' '2016-08-13' '2016-08-18' '2016-08-16'
 '2016-08-15' '2016-08-17' '2016-08-11' '2016-07-31' '2016-08-19'
 '2016-09-01' '2016-08-23' '2016-08-26' '2016-08-20' '2016-08-21'
 '2016-09-04' '2016-08-22' '2016-08-27' '2016-08-25' '2016-08-09'
 '2016-09-05' '2016-08-24' '2016-09-10' '2016-08-29' '2016-09-09'
 '2016-08-30' '2016-09-13' '2016-08-31' '2016-09-14' '2016-09-12'
 '2016-09-15' '2016-08-14' '2016-09-02' '2016-09-08' '2016-09-19'
 '2016-09-16' '2016-09-07' '2016-09-21' '2016-09-06' '2016-09-22'
 '2016-09-17' '2016-09-20' '2016-09-03' '2016-09-26' '2016-09-23'
 '2016-09-18' '2016-09-29' '2016-10-02' '2016-10-01' '2016-09-27'
 '2016-09-25' '2016-10-05' '2016-09-11' '2016-09-30' '2016-10-09'
 '2016-10-03' '2016-10-06' '2016-10-11' '2016-09-24' '2016-10-13'
 '2016-09-28' '2016-10-08' '2016-10-07' '2016-10-16' '2016-08-28'
 '2016-10-17' '2016-10-18' '2016-10-10' '2016-10-04' '2016-10-15'
 '2016-10-19' '2016-10-21' '2016-10-12' '2016-10-24' '2016-10-26'
 '2016-10-23' '2016-10-20' '2016-10-25' '2016-10-27' '2016-10-28'
 '2016-10-30' '2016-10-29' '2016-11-01' '2016-11-04' '2016-10-14'
 '2016-11-07' '2016-11-03' '2016-11-10' '2016-11-14' '2016-11-02'
 '2016-10-31' '2016-11-11' '2016-11-08' '2016-11-05' '2016-11-25'
 '2016-11-09' '2016-11-20' '2016-11-21' '2016-10-22' '2016-11-22'
 '2016-11-16' '2016-11-23' '2016-11-17' '2016-11-06' '2016-11-15'
 '2016-11-13' '2016-11-12' '2016-11-27' '2016-11-19' '2016-11-30'
 '2016-11-18' '2016-12-02' '2016-12-04' '2016-11-29' '2016-12-07'
 '2016-11-28' '2016-12-03' '2016-12-06' '2016-11-24' '2016-12-08'
 '2016-12-05' '2016-12-10' '2016-12-13' '2016-12-14' '2016-12-16'
 '2016-12-15' '2016-12-17' '2016-12-19' '2016-12-21' '2016-12-20'
 '2016-12-22' '2016-12-23' '2016-12-24' '2016-12-01' '2016-12-27'
 '2016-12-29' '2016-12-30' '2016-12-12' '2017-01-02' '2016-12-11'
 '2017-01-03' '2017-01-04' '2017-01-01' '2016-12-26' '2017-01-06'
 '2016-12-28' '2016-12-18' '2017-01-10' '2017-01-11' '2017-01-07'
 '2017-01-12' '2017-01-16' '2017-01-14' '2017-01-13' '2017-01-05'
 '2017-01-17' '2017-01-20' '2016-12-09' '2017-01-26' '2016-12-31'
 '2017-01-23' '2017-01-27' '2017-01-28' '2017-01-19' '2017-01-25'
 '2017-01-24' '2017-01-29' '2017-01-18' '2016-12-25' '2017-01-15'
 '2017-01-21' '2017-02-01' '2017-02-02' '2017-01-31' '2017-02-03'
 '2017-02-04' '2017-02-06' '2017-02-07' '2017-02-08' '2017-01-30'
 '2017-02-09' '2017-01-09' '2017-02-11' '2017-02-10' '2017-02-12'
 '2017-02-13' '2017-02-14' '2017-02-16' '2017-02-17' '2017-02-18'
 '2017-02-19' '2017-02-20' '2017-02-15' '2017-02-21' '2017-02-22'
 '2017-02-26' '2017-02-23' '2017-02-24' '2017-02-25' '2017-02-28'
 '2017-03-05' '2017-02-27' '2017-03-03' '2017-03-06' '2017-03-02'
 '2017-03-08' '2017-03-09' '2017-03-10' '2017-03-07' '2017-03-12'
 '2017-03-13' '2017-03-14' '2017-03-01' '2017-03-18' '2017-03-17'
 '2017-03-24' '2017-03-22' '2017-03-26' '2017-03-27' '2017-03-11'
 '2017-03-28' '2017-03-29' '2017-03-30' '2017-03-31' '2017-03-19'
 '2017-01-22' '2017-04-02' '2017-03-20' '2017-04-03' '2017-01-08'
 '2017-03-23' '2017-04-05' '2017-02-05' '2017-04-04' '2017-03-15'
 '2017-04-07' '2017-03-25' '2017-04-08' '2017-04-06' '2017-03-21'
 '2017-04-10' '2017-04-01' '2017-04-11' '2017-04-13' '2017-04-15'
 '2017-04-12' '2017-03-04' '2017-04-19' '2017-04-22' '2017-04-20'
 '2017-05-02' '2017-04-09' '2017-04-23' '2017-04-24' '2017-04-16'
 '2017-04-28' '2017-04-18' '2017-04-26' '2017-04-25' '2017-04-17'
 '2017-04-21' '2017-05-03' '2017-05-04' '2017-03-16' '2017-05-05'
 '2017-04-29' '2017-04-14' '2017-05-08' '2017-04-27' '2017-05-11'
 '2017-05-01' '2017-05-10' '2017-05-13' '2017-05-06' '2017-05-14'
 '2017-05-16' '2017-04-30' '2017-05-15' '2017-05-07' '2017-05-09'
 '2017-05-17' '2017-05-21' '2017-05-12' '2017-05-22' '2017-05-24'
 '2017-05-23' '2017-05-25' '2017-05-26' '2017-05-28' '2017-05-27'
 '2017-05-29' '2017-05-19' '2017-05-31' '2017-05-20' '2017-06-01'
 '2017-05-30' '2017-06-02' '2016-11-26' '2017-06-04' '2017-06-05'
 '2017-06-06' '2017-06-07' '2017-05-18' '2017-06-09' '2017-06-10'
 '2017-06-11' '2017-06-12' '2017-06-14' '2017-06-08' '2017-06-16'
 '2017-06-13' '2017-06-03' '2017-06-24' '2017-06-20' '2017-06-19'
 '2017-06-21' '2017-06-26' '2017-06-27' '2017-06-22' '2017-06-28'
 '2017-06-15' '2017-06-29' '2017-06-30' '2017-06-18' '2017-07-04'
 '2017-07-08' '2017-07-05' '2017-07-03' '2017-07-07' '2017-07-01'
 '2017-07-06' '2017-07-11' '2017-07-12' '2017-06-23' '2017-07-13'
 '2017-07-02' '2017-07-10' '2017-07-14' '2017-07-15' '2017-07-16'
 '2017-07-18' '2017-07-17' '2017-07-19' '2017-07-20' '2017-07-21'
 '2017-06-25' '2017-06-17' '2017-07-24' '2017-07-26' '2017-07-09'
 '2017-07-27' '2017-07-28' '2017-07-31' '2017-07-29' '2017-07-22'
 '2017-08-02' '2017-08-01' '2017-08-03' '2017-08-04' '2017-07-25'
 '2017-07-23' '2017-08-09' '2017-08-10' '2017-07-30' '2017-08-07'
 '2017-08-13' '2017-08-05' '2017-08-14' '2017-08-08' '2017-08-16'
 '2017-08-17' '2017-08-15' '2017-08-18' '2017-08-20' '2017-08-22'
 '2017-08-06' '2017-08-25' '2017-08-26' '2017-08-23' '2017-08-11'
 '2017-08-27' '2017-08-21' '2017-08-29' '2017-08-31' '2017-08-12'
 '2017-08-19' '2016-01-31' '2017-09-01' '2017-08-28' '2015-04-03'
 '2015-01-21' '2015-01-28' '2015-01-29' '2015-01-30' '2015-02-02'
 '2015-02-05' '2015-02-06' '2015-02-09' '2015-02-10' '2015-02-11'
 '2015-02-12' '2015-02-19' '2015-02-20' '2015-02-23' '2015-02-24'
 '2015-02-25' '2015-02-26' '2015-02-27' '2015-03-03' '2015-03-04'
 '2015-03-06' '2015-03-09' '2015-03-11' '2015-03-12' '2015-03-18'
 '2015-04-02' '2015-06-14' '2015-04-08' '2015-04-16' '2015-04-25'
 '2015-04-28' '2015-05-08' '2017-09-06' '2016-02-28' '2015-12-09'
 '2015-12-14' '2017-09-09' '2017-09-02' '2017-08-24' '2017-08-30'
 '2017-09-03' '2017-09-04' '2017-09-05' '2017-09-07' '2017-09-08'
 '2017-09-10' '2017-09-12' '2017-09-14' '2015-04-30' '2015-04-21'
 '2015-04-05' '2015-03-13' '2015-05-05' '2015-03-29' '2015-06-10'
 '2015-04-27' '2014-10-17' '2015-01-20' '2015-02-17' '2015-03-10'
 '2015-03-23']

--------------------
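
Printing every unique value works for short categories, but it becomes hard to read for a column such as reservation_status_date. The sketch below is one possible alternative: it reports the number of unique values and only a short sample per column. The function name summarize_categories, the max_shown parameter, and the toy frame are assumptions made for illustration.

```python
import pandas as pd


def summarize_categories(frame, max_shown=5):
    """Return one row per string column: unique count plus a short sample."""
    rows = []
    for col in frame.columns:
        is_text = (pd.api.types.is_object_dtype(frame[col])
                   or pd.api.types.is_string_dtype(frame[col]))
        if not is_text:
            continue
        values = frame[col].dropna().unique()
        rows.append({
            "column": col,
            "n_unique": len(values),
            "sample": list(values[:max_shown]),
        })
    return pd.DataFrame(rows, columns=["column", "n_unique", "sample"])


# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "reservation_status_date": ["2015-07-01", "2015-07-02", "2015-07-03"],
    "adults": [2, 1, 2],
})
summary = summarize_categories(toy, max_shown=2)
print(summary)
```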

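4. Field meanings

1. Hotel type (hotel): Resort Hotel or City Hotel.

2. Cancelled (is_canceled): whether the booking was cancelled; yes = 1, no = 0.

3. Lead time (lead_time): number of days between the date the booking was entered and the arrival date.

4, 5, 6, 7. Arrival year, arrival month, week number of the arrival date, and day of the month of arrival (arrival_date_year, arrival_date_month, arrival_date_week_number, arrival_date_day_of_month).

8. Weekend nights (stays_in_weekend_nights): number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel.

9. Week nights (stays_in_week_nights): number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel.

10, 11, 12. Adults, children, babies (adults, children, babies).

13. Meal (meal): type of meal booked. The categories are the standard hospitality meal packages: Undefined / SC: no meal package; BB: bed and breakfast; HB: half board (breakfast and one other meal, usually dinner); FB: full board (breakfast, lunch, and dinner).

14. Country (country): the guest's country of origin.

15. Market segment (market_segment): market segment designation. In the categories, "TA" means "travel agents" and "TO" means "tour operators". Examples: Online TA, Offline TA/TO, and others.

16. Distribution channel (distribution_channel): booking distribution channel. "TA" means "travel agents" and "TO" means "tour operators". Examples: TA/TO, Direct, and others.

17. Repeated guest (is_repeated_guest): whether the booking came from a repeated guest; yes = 1, no = 0.

18. Previous cancellations (previous_cancellations): number of earlier bookings that the customer cancelled before the current booking.

19. Previous bookings not cancelled (previous_bookings_not_canceled): number of earlier bookings that the customer did not cancel before the current booking.

20. Reserved room type (reserved_room_type): code of the room type reserved.

21. Assigned room type (assigned_room_type): code of the room type assigned to the booking. The assigned room type sometimes differs from the reserved one for hotel operation reasons (for example, overbooking) or at the customer's request.

22. Booking changes (booking_changes): number of changes or amendments made to the booking from the moment it was entered until check-in or cancellation.

23. Deposit type (deposit_type): whether the customer made a deposit to guarantee the booking. The variable takes one of three categories: No Deposit: no deposit was made; Non Refund: a deposit was made for the full cost of the stay; Refundable: a deposit was made for less than the total cost of the stay.

24. Agent (agent): ID of the travel agency that made the booking.

25. Company (company): ID of the company or entity that made the booking or is responsible for paying for it.

26. Days in waiting list (days_in_waiting_list): number of days the booking spent on the waiting list before it was confirmed to the customer.

27. Customer type (customer_type): type of booking, one of four categories: 1. Contract: the booking has an allotment or another type of contract associated with it; 2. Group: the booking is associated with a group; 3. Transient: the booking is not part of a group or contract and is not associated with other transient bookings; 4. Transient-Party: the booking is transient but is associated with at least one other transient booking.

28. Average daily rate (adr): the sum of all lodging transactions divided by the total number of nights stayed.

29. Required car parking spaces (required_car_parking_spaces): number of car parking spaces required by the customer.

30. Special requests (total_of_special_requests): number of special requests made by the customer (for example, twin bed or high floor).

31. Reservation status (reservation_status): last status of the booking, one of three categories: Canceled: the customer cancelled the booking; Check-Out: the customer checked in and has already departed; No-Show: the customer did not check in and did not tell the hotel why.

32. Reservation status date (reservation_status_date): date on which the last status was set. It can be combined with reservation_status to see when a booking was cancelled or when the customer checked out.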
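
Fields 3 to 7 can be combined into real dates. The following is a rough sketch, under the assumption that lead_time counts whole days: it assembles an arrival date from the year, month, and day columns, then subtracts lead_time to estimate when the booking was entered. The column names arrival_date and booking_date, the MONTHS list, and the toy rows are introduced here for illustration and do not appear in the original notebook.

```python
import pandas as pd

MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Toy rows; the values are made up.
toy = pd.DataFrame({
    "arrival_date_year": [2015, 2016],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1, 29],
    "lead_time": [7, 30],
})

# Map month names to numbers explicitly so the result does not depend on the locale.
month_num = toy["arrival_date_month"].map({m: i + 1 for i, m in enumerate(MONTHS)})

# pd.to_datetime can assemble dates from columns named year, month, day.
parts = pd.DataFrame({
    "year": toy["arrival_date_year"],
    "month": month_num,
    "day": toy["arrival_date_day_of_month"],
})
toy["arrival_date"] = pd.to_datetime(parts)

# Estimated booking date = arrival date minus lead time in days.
toy["booking_date"] = toy["arrival_date"] - pd.to_timedelta(toy["lead_time"], unit="D")
print(toy[["arrival_date", "booking_date"]])
```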

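5. Handling missing values

Count the missing values in each column and their share of all rows

pd.DataFrame({'null value ':df.isna().sum() , 'null value percent ':df.isna().sum()/df.shape[0]})

 The result is as follows

The company and agent columns are missing a large share of their values, so drop both columns

drop_columns=['company','agent']
df.drop(drop_columns,axis=1,inplace=True)

Fill the missing values in the children column with 0

df['children']=df['children'].fillna(0)

Fill the missing values in the country column with the mode

df['country']=df['country'].fillna(df['country'].mode()[0])

Drop the invalid rows in which the number of guests is 0

#rows where adults, children and babies are all 0
no_guests=(df['adults']==0)&(df['children']==0)&(df['babies']==0)
df=df[~no_guests]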
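
To check that the cleaning steps behave as intended, they can be run on a small frame first. The sketch below repeats the same four steps on made-up rows and then counts the remaining missing values. The names toy, cleaned, no_guests, and remaining_nulls are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows that exercise each cleaning rule.
toy = pd.DataFrame({
    "adults": [2, 0, 1, 2],
    "children": [np.nan, 0.0, 1.0, 0.0],
    "babies": [0, 0, 0, 0],
    "country": ["PRT", "PRT", None, "GBR"],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [np.nan, 9.0, np.nan, np.nan],
})

# 1. Drop the two sparse columns.
cleaned = toy.drop(columns=["company", "agent"])
# 2. Missing children counts become 0.
cleaned["children"] = cleaned["children"].fillna(0)
# 3. Missing countries become the most frequent country.
cleaned["country"] = cleaned["country"].fillna(cleaned["country"].mode()[0])
# 4. Remove bookings with no guests at all.
no_guests = (cleaned["adults"] == 0) & (cleaned["children"] == 0) & (cleaned["babies"] == 0)
cleaned = cleaned[~no_guests]

remaining_nulls = int(cleaned.isna().sum().sum())
print(len(cleaned), remaining_nulls)
```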

6. Data visualization

Analyze the factors that affect hotel booking cancellations

6.1. Hotel bookings by month

data=df[df['is_canceled']==0].copy()#keep only the bookings that were not cancelled; copy so that columns can be added later
#arrival_date_month is a string column, so add a numeric column arrival_date_month_num and sort by it to get calendar order
data_assigned_count=data.groupby(['arrival_date_month','hotel'])['is_canceled'].agg('count').reset_index()
data_assigned_count.rename(columns={'is_canceled':'assigned_count'},inplace=True)
data_assigned_count['arrival_date_month_num']=data_assigned_count['arrival_date_month'].map({'January':1,'February':2,'March' :3,'April':4,'May' :5, 'June' :6,\
                                      'July' :7,'August':8,'September':9,'October':10,'November':11,'December':12})
data_assigned_count.sort_values('arrival_date_month_num',ascending=True,inplace=True)#sort by month
data_assigned_count

Plot the "Hotel bookings by month" chart

px.bar(data_assigned_count,x='arrival_date_month',y='assigned_count',color ='hotel',barmode='group',title='Hotel bookings by month')

Conclusion: both City Hotel and Resort Hotel receive the most bookings in July and August. Overall, City Hotel has more bookings than Resort Hotel in every month.
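
An alternative to the helper column arrival_date_month_num is to make the month an ordered categorical, so that grouping and sorting follow calendar order directly. This is only a sketch on made-up rows; MONTH_ORDER, toy, and monthly are names introduced here.

```python
import pandas as pd

MONTH_ORDER = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Made-up bookings standing in for the non-cancelled rows.
toy = pd.DataFrame({
    "arrival_date_month": ["August", "January", "July", "August", "January"],
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"],
})

# An ordered categorical sorts by position in MONTH_ORDER, not alphabetically.
toy["arrival_date_month"] = pd.Categorical(
    toy["arrival_date_month"], categories=MONTH_ORDER, ordered=True
)

monthly = (
    toy.groupby(["arrival_date_month", "hotel"], observed=True)
    .size()
    .reset_index(name="assigned_count")
    .sort_values(["arrival_date_month", "hotel"])
)
print(monthly)
```

Plotly generally draws categories in the order they appear in the data, so passing the sorted frame to px.bar should keep the months in calendar order.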

6.2. Room rates by month

Group the data by month (arrival_date_month) and hotel type (hotel), take the mean of the room rate adr in each group, and sort the result by the numeric month arrival_date_month_num

data_adr=data.groupby(['arrival_date_month','hotel'])['adr'].agg('mean').reset_index()
data_adr['arrival_date_month_num']=data_adr['arrival_date_month'].map({'January':1,'February':2,'March' :3,'April':4,'May' :5, 'June' :6,\
                                      'July' :7,'August':8,'September':9,'October':10,'November':11,'December':12})
data_adr.sort_values('arrival_date_month_num',ascending=True,inplace=True)
data_adr.head()

Plot the "Room rates by month" line chart

px.line(data_adr,x='arrival_date_month',y='adr',color='hotel',title='Room rates by month')

Conclusion: Resort Hotel's room rate is clearly higher than City Hotel's in July and August, and lower during the rest of the year.
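
The comparison between the two hotel types can also be read from a table rather than from the chart. As a sketch, a pivot table puts the two hotels side by side and a difference column shows which one is more expensive in each month. The rates below are invented, and the column name resort_minus_city is an assumption.

```python
import pandas as pd

# Invented rates for two months.
toy = pd.DataFrame({
    "arrival_date_month": ["July", "July", "July", "January", "January"],
    "hotel": ["Resort Hotel", "City Hotel", "Resort Hotel", "Resort Hotel", "City Hotel"],
    "adr": [150.0, 110.0, 170.0, 50.0, 80.0],
})

# One row per month, one column per hotel, mean adr in each cell.
wide = toy.pivot_table(index="arrival_date_month", columns="hotel",
                       values="adr", aggfunc="mean")

# Positive values mean Resort Hotel is more expensive that month.
wide["resort_minus_city"] = wide["Resort Hotel"] - wide["City Hotel"]
print(wide)
```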

6.3. Popular room types

Group by hotel type (hotel) and room type (assigned_room_type) and count the bookings in each group. Rename the count column to assigned_count

data_room_type=data.groupby(['hotel','assigned_room_type']).count().iloc[:,0].reset_index()
data_room_type.rename(columns={'is_canceled':'assigned_count'},inplace=True)
data_room_type.head()

Plot the "Number of stays by room type" bar chart

px.bar(data_room_type,x='assigned_room_type',y='assigned_count',color='hotel',barmode='group',title='Number of stays by room type')

Plot the "Price by room type" box plot

px.box(data.sort_values('assigned_room_type'),x='assigned_room_type',y='adr',color='hotel',title='Price by room type')

Conclusion: the most popular room types are A and D, and both are priced relatively low.
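
The popularity and price observations can be put in one table by aggregating the count and the median rate per room type together. This is a sketch on invented rows; room_summary and its column names are assumptions.

```python
import pandas as pd

# Invented bookings.
toy = pd.DataFrame({
    "assigned_room_type": ["A", "A", "D", "G", "A"],
    "adr": [80.0, 90.0, 100.0, 200.0, 70.0],
})

# Named aggregation: number of bookings and median rate per room type.
room_summary = (
    toy.groupby("assigned_room_type")
    .agg(assigned_count=("adr", "size"), median_adr=("adr", "median"))
    .sort_values("assigned_count", ascending=False)
)
print(room_summary)
```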

6.4. Bookings and cancellations by sales channel

Count the non-cancelled bookings per channel and name the count column number of distribution

#non-cancelled bookings per channel
data_distribution_not=df[df['is_canceled']==0]['distribution_channel'].value_counts()
#name the Series before converting it, because the default name of a value_counts() result differs between pandas versions
data_distribution_not=data_distribution_not.rename('number of distribution').to_frame()
data_distribution_not

Count the cancelled bookings per channel, name the count column canceled number of distribution, and drop the Undefined row

#cancelled bookings per channel
data_distribution_canceled=df[df['is_canceled']==1]['distribution_channel'].value_counts()
data_distribution_canceled=data_distribution_canceled.rename('canceled number of distribution').to_frame()
data_distribution_canceled.drop('Undefined',inplace=True,errors='ignore')#errors='ignore' avoids a KeyError when no Undefined row exists
data_distribution_canceled

Join the two DataFrames and compute the share of cancelled bookings in each channel

#share of cancelled bookings in each channel
data_distribution=pd.merge(data_distribution_not,data_distribution_canceled,left_index=True,right_index=True)
data_distribution['canceled percentage']= data_distribution['canceled number of distribution']/(data_distribution['number of distribution']+data_distribution['canceled number of distribution'])*100
data_distribution

Plot the "Bookings and cancellations by sales channel" bar chart

px.bar(data_distribution,x=data_distribution.index,y=['number of distribution','canceled number of distribution'],title='Bookings and cancellations by sales channel')

Conclusion: most bookings come from the TA/TO channel, and its cancellation rate is also far higher than that of the other channels.
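
The cancellation share per channel can also be computed in a single groupby, without building and merging two frames. The sketch below uses invented rows; channel_stats and its column names total and canceled are assumptions.

```python
import pandas as pd

# Invented bookings: 4 via TA/TO, 4 Direct, 1 Corporate.
toy = pd.DataFrame({
    "distribution_channel": ["TA/TO"] * 4 + ["Direct"] * 4 + ["Corporate"],
    "is_canceled": [1, 1, 0, 0, 0, 0, 0, 1, 0],
})

# count gives all bookings per channel, sum gives the cancelled ones (is_canceled is 0/1).
channel_stats = toy.groupby("distribution_channel")["is_canceled"].agg(
    total="count", canceled="sum"
)
channel_stats["canceled percentage"] = channel_stats["canceled"] / channel_stats["total"] * 100
print(channel_stats)
```

One difference from the merge approach: an inner merge drops a channel that has no cancelled bookings, while the groupby keeps it with a share of 0.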

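6.5. Where guests come from

Count the non-cancelled bookings by country

#rename_axis and reset_index(name=...) give the same column names regardless of the pandas version
data_guest=df[df['is_canceled']==0]['country'].value_counts().rename_axis('country').reset_index(name='number of guest')
data_guest.head()

Plot a choropleth map

px.choropleth(data_guest,locations='country',color='number of guest',hover_name='country')
 

Conclusion: guests mainly come from Europe.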
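
A map gives a visual impression. To put numbers on it, the counts can be turned into percentage shares. A minimal sketch with invented country codes follows; toy_country and share are names assumed for illustration.

```python
import pandas as pd

# Invented country codes for six bookings.
toy_country = pd.Series(["PRT", "PRT", "GBR", "FRA", "PRT", "GBR"], name="country")

# normalize=True returns fractions; multiply by 100 for percentages.
share = toy_country.value_counts(normalize=True).mul(100).round(1)
print(share)
```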

6.6. How long guests usually stay

Total nights = week nights (stays_in_week_nights) + weekend nights (stays_in_weekend_nights). Add a total_nights column, group by hotel type (hotel) and total_nights, count the bookings in each group, and rename the count column to number of stay

data['total_nights']=data['stays_in_week_nights']+data['stays_in_weekend_nights']
data_stay=data.groupby(['hotel','total_nights']).count().reset_index().iloc[:,:3]
data_stay.rename(columns={'is_canceled':'number of stay'},inplace=True)
data_stay

Plot a bar chart

px.bar(data_stay,'total_nights','number of stay',color='hotel',barmode='group',range_x=[0,30])

Conclusion: guests typically stay 1 to 4 nights.
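
The "1 to 4 nights" reading of the chart can be quantified as the fraction of bookings whose total stay falls in that range. The sketch below uses invented night counts; share_1_to_4 is an assumed name.

```python
import pandas as pd

# Invented stays.
toy = pd.DataFrame({
    "stays_in_week_nights": [1, 2, 3, 5, 0, 2],
    "stays_in_weekend_nights": [0, 1, 1, 2, 1, 2],
})

total_nights = toy["stays_in_week_nights"] + toy["stays_in_weekend_nights"]

# between() is inclusive on both ends; the mean of a boolean Series is a fraction.
share_1_to_4 = total_nights.between(1, 4).mean()
print(f"{share_1_to_4:.1%} of bookings last 1 to 4 nights")
```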

6.7. Repeat-guest rate

Overall repeat-guest rate

total_percentage=round(df[df['is_repeated_guest']==1]['is_repeated_guest'].count()/df.shape[0],3)

Repeat-guest rate of each hotel

resort_percentage=round(df[(df['is_repeated_guest']==1)&(df['hotel']=='Resort Hotel')]['is_repeated_guest'].count()/(df[df['hotel']=='Resort Hotel'].shape[0]),3)
city_percentage=round(df[(df['is_repeated_guest']==1)&(df['hotel']=='City Hotel')]['is_repeated_guest'].count()/(df[df['hotel']=='City Hotel'].shape[0]),3)
data_repeat=pd.DataFrame({'total_percentage':[total_percentage],'resort_percentage':[resort_percentage],'city_percentage':[city_percentage]})
data_repeat
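
Because is_repeated_guest is a 0/1 flag, its mean is already the repeat-guest rate, so the same figures can be obtained more compactly. A sketch on invented rows follows; total_rate and by_hotel are assumed names.

```python
import pandas as pd

# Invented bookings.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel",
              "City Hotel", "City Hotel", "City Hotel"],
    "is_repeated_guest": [1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column = share of rows equal to 1.
total_rate = round(toy["is_repeated_guest"].mean(), 3)
by_hotel = toy.groupby("hotel")["is_repeated_guest"].mean().round(3)
print(total_rate)
print(by_hotel)
```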

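7. Correlation analysis

plt.figure(figsize=(18,12))
#numeric_only=True restricts the correlation to numeric columns; pandas 2.0 and later raise an error on string columns otherwise
corr=df.corr(numeric_only=True)
sns.heatmap(corr,annot=True)

Sort the features by the absolute value of their correlation with is_canceled

corr_canceled=corr['is_canceled'].abs().sort_values(ascending=False)
corr_canceled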
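
The ranking step can be tried on a small frame to see what it returns. The sketch below assumes pandas 1.5 or later (for the numeric_only argument) and uses invented values; the target's correlation with itself is dropped so that it does not top the list.

```python
import pandas as pd

# Invented rows; hotel is a string column and must be ignored by corr().
toy = pd.DataFrame({
    "is_canceled": [0, 0, 1, 1, 0, 1],
    "lead_time": [5, 10, 200, 150, 20, 300],
    "total_of_special_requests": [2, 1, 0, 0, 3, 0],
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel",
              "City Hotel", "Resort Hotel", "City Hotel"],
})

corr = toy.corr(numeric_only=True)

# Rank the features by |correlation| with the target, excluding the target itself.
ranking = (
    corr["is_canceled"]
    .drop("is_canceled")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```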

8. Feature processing

Drop the weakly correlated features

drop_columns=['stays_in_weekend_nights','children','arrival_date_day_of_month','arrival_date_week_number','arrival_date_year','reservation_status', 'country']
df=df.drop(columns=drop_columns)

Convert reservation_status_date to a datetime and add the columns year, month, day

df['reservation_status_date']=pd.to_datetime(df['reservation_status_date'])
df['year']=df['reservation_status_date'].dt.year
df['month']=df['reservation_status_date'].dt.month
df['day']=df['reservation_status_date'].dt.day
df.drop(['reservation_status_date'],axis=1,inplace=True)

List the string (object-dtype) columns

category_columns=[ x for x in df.columns if df[x].dtype=='O']
category_columns

Use label encoding to number the object-dtype columns, converting the string data to numeric values

le=LabelEncoder()
for col in category_columns:
    df[col]=le.fit_transform(df[col])

cat_df=df[category_columns]
cat_df.head()
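
Refitting one LabelEncoder in a loop works, but only the mapping of the last column survives. If the original labels are needed later (for example, to read a prediction back as a hotel name), one option is to keep one encoder per column. This is a sketch on invented rows; the encoders dictionary is an assumption and is not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented rows with two string columns.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "meal": ["BB", "HB", "BB"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in ["hotel", "meal"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder assigns integers in sorted label order, so the mapping can be inverted.
decoded_hotel = encoders["hotel"].inverse_transform(encoded["hotel"])
print(encoded)
print(list(decoded_hotel))
```

Label encoding imposes an arbitrary order on the categories. Tree-based models are usually not affected by this, while for logistic regression one-hot encoding is a common alternative.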

Take the variance of the numeric df to see how spread out the data is

num_df=df.drop(columns=category_columns)
num_df.var().sort_values(ascending=False)

Apply a log transform to the high-variance columns. Fill the missing adr values that the log transform produces with the mean

num_df['lead_time']=np.log(num_df['lead_time']+1)
num_df['adr']=np.log(num_df['adr']+1)
num_df['days_in_waiting_list']=np.log(num_df['days_in_waiting_list']+1)
num_df['day']=np.log(num_df['day']+1)
num_df['month']=np.log(num_df['month']+1)
num_df['adr']=num_df['adr'].fillna(num_df['adr'].mean())

After the log transform the variances are smaller

num_df.var().sort_values(ascending=False)
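
The log of a negative number is undefined, so a row whose adr is below -1 turns into NaN under log(adr+1). That is presumably why adr needs filling after the transform. The sketch below shows the effect on invented rates, using np.log1p, which computes log(1+x).

```python
import numpy as np
import pandas as pd

# Invented rates, including one negative value.
adr = pd.Series([0.0, 99.0, -6.38, 250.0])

# log1p(x) = log(1 + x); the negative entry yields NaN (the warning is silenced here).
with np.errstate(invalid="ignore"):
    logged = np.log1p(adr)

n_missing = int(logged.isna().sum())

# Same repair as in the text: replace NaN with the mean of the valid values.
filled = logged.fillna(logged.mean())
print(n_missing)
print(filled)
```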

Concatenate the numeric df and the encoded string df

all_df=pd.concat([num_df,cat_df],axis=1)

9. Splitting the training and test sets

#sample features
X=all_df.drop(columns='is_canceled')
#sample labels
y=all_df['is_canceled']
#split into training and test sets
X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.7)
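
Without a fixed seed, every run produces a different split and slightly different scores. As a sketch, train_test_split also accepts random_state for reproducibility and stratify to keep the share of cancelled bookings equal in both parts. The data below is synthetic, and X_toy, y_toy and the seed values are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 30% positive labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    train_size=0.7,
    stratify=y_toy,     # keep the 30/70 class ratio in both parts
    random_state=42,    # same split on every run
)
print(y_tr.mean(), y_te.mean())
```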

10. Building the models

10.1. Logistic regression

#train the model
lr=LogisticRegression()
lr.fit(X_train,y_train)
#predict the classes
y_pred=lr.predict(X_test)
#prediction accuracy
accuracy_score(y_test,y_pred)

Accuracy score

10.2. Decision tree

dtc=DecisionTreeClassifier()
dtc.fit(X_train,y_train)
y_pred=dtc.predict(X_test)
accuracy_score(y_test,y_pred)

Accuracy score

10.3. Random forest

rfc= RandomForestClassifier()
rfc.fit(X_train,y_train)
y_pred=rfc.predict(X_test)
accuracy_score(y_test,y_pred)

Accuracy score
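
The three models follow the same fit, predict, score pattern, so they can be compared in one loop. The sketch below runs on synthetic data from make_classification rather than on the hotel data, so its scores say nothing about the real dataset. It also records a confusion matrix, since accuracy alone hides whether the errors are missed cancellations or false alarms. The names models and results, and the settings max_iter=1000 and random_state=0, are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for X and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.7, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, y_hat),
        # rows are true classes, columns are predicted classes
        "confusion": confusion_matrix(y_te, y_hat),
    }

for name, res in results.items():
    print(name, round(res["accuracy"], 3))
    print(res["confusion"])
```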

 

 

 
