1. Project Overview
Project purpose: practice in data analysis
Data source: Kaggle
Source code, dataset, and field descriptions (Baidu Cloud link):
https://pan.baidu.com/s/1HY_6OWC247bH-Z7cRJaYdg
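Extraction code: vd3t
Analysis goals of this project:
- Run a basic analysis of the data: booking demand, occupancy rate, guests, booking duration, room type comparison, and so on
- Analyze whether previously cancelled bookings can be used to predict the likelihood of a hotel booking
2. Basic Analysis of the Data
Preparation (import the required packages and the dataset)
# Ignore all warnings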
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
data = pd.read_csv('./hotel_booking_demand.csv')
data.shape  # (119390, 32): 119390 rows, 32 features
data.head()
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
5 rows × 32 columns
data.info()
Data preprocessing
data.isnull().sum()[data.isnull().sum()!=0]  # check which columns have missing values
children 4
country 488
agent 16340
company 112593
dtype: int64
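Raw counts are easier to judge next to the share of rows they represent: with 119390 rows, the 112593 missing `company` values are roughly 94% of the column, while the 4 missing `children` values are negligible. The following is a minimal sketch of such a summary, not part of the original notebook. The helper name `missing_summary` and the toy frame are assumptions made for illustration; on the real data one would pass `data` instead.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the hotel data (values are invented for illustration).
toy = pd.DataFrame({
    "children": [0.0, 1.0, np.nan, 0.0],
    "country": ["PRT", None, "GBR", "PRT"],
    "adults": [2, 2, 1, 2],
})


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isnull().sum()        # computed once, then reused
    counts = counts[counts != 0]      # keep only columns with missing values
    return pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    }).sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Computing `isnull().sum()` once and filtering the result also avoids scanning the whole frame twice.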
# company has too many missing values, so drop the column
data1 = data.drop('company', axis=1)
# agent indicates whether a travel agency was involved; fill missing values with 0
data1["agent"]=data1["agent"].fillna(0)
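The two steps above (dropping a mostly empty column, then filling a missing ID with 0) can also be driven by the missing ratio rather than by naming the column by hand. This is only a sketch on invented data: the frame `toy` and the cutoff `MAX_MISSING_RATIO` are assumptions, and the original notebook simply drops `company` explicitly.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented and only mimic the columns discussed here.
toy = pd.DataFrame({
    "agent": [304.0, np.nan, 240.0, np.nan],
    "company": [np.nan, np.nan, np.nan, 110.0],
    "adr": [75.0, 75.0, 98.0, 0.0],
})

# Hypothetical cutoff: drop columns with more than half of their values missing.
MAX_MISSING_RATIO = 0.5

missing_ratio = toy.isnull().mean()   # fraction of missing values per column
to_drop = missing_ratio[missing_ratio > MAX_MISSING_RATIO].index
cleaned = toy.drop(columns=to_drop)   # returns a new frame, toy is unchanged

# A missing agent is read as "no agency involved" and encoded as 0.
cleaned["agent"] = cleaned["agent"].fillna(0)

print(cleaned)
```

Here `agent` sits exactly at the cutoff and is kept, while `company` (75% missing in the toy data) is dropped. Any real threshold is a judgment call that depends on how the column will be used.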
# children is the number of accompanying children, a discrete value, so fill with the mode
data1["children"]=dat