%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rc("font", family="SimHei", size=12)  # use a font with CJK glyphs so Chinese labels render
sns.set_style("whitegrid")
Exploratory Data Analysis (EDA)
1. Data Cleaning
1.1 Handling null and NA values
data = pd.read_csv('hotel_booking_demand.csv')
data.head()
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
5 rows × 32 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
hotel 119390 non-null object
is_canceled 119390 non-null int64
lead_time 119390 non-null int64
arrival_date_year 119390 non-null int64
arrival_date_month 119390 non-null object
arrival_date_week_number 119390 non-null int64
arrival_date_day_of_month 119390 non-null int64
stays_in_weekend_nights 119390 non-null int64
stays_in_week_nights 119390 non-null int64
adults 119390 non-null int64
children 119386 non-null float64
babies 119390 non-null int64
meal 119390 non-null object
country 118902 non-null object
market_segment 119390 non-null object
distribution_channel 119390 non-null object
is_repeated_guest 119390 non-null int64
previous_cancellations 119390 non-null int64
previous_bookings_not_canceled 119390 non-null int64
reserved_room_type 119390 non-null object
assigned_room_type 119390 non-null object
booking_changes 119390 non-null int64
deposit_type 119390 non-null object
agent 103050 non-null float64
company 6797 non-null float64
days_in_waiting_list 119390 non-null int64
customer_type 119390 non-null object
adr 119390 non-null float64
required_car_parking_spaces 119390 non-null int64
total_of_special_requests 119390 non-null int64
reservation_status 119390 non-null object
reservation_status_date 119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
data.isnull().sum()  # count missing values per column
hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 4
babies 0
meal 0
country 488
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 16340
company 112593
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
dtype: int64
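The raw counts are easier to judge as a share of all rows. A minimal sketch of that calculation follows; `missing_summary` is a hypothetical helper name, and the toy frame only stands in for `data`, with made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Missing-value count and share per column (hypothetical helper)."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "share": counts / len(df),  # fraction of rows that are missing
    })
    # Keep only columns that actually have missing values, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "company": [np.nan, np.nan, np.nan, 40.0],
    "agent": [np.nan, 304.0, 240.0, 9.0],
    "adr": [0.0, 75.0, 75.0, 98.0],
})
print(missing_summary(toy))
```

Calling `missing_summary(data)` on the real frame should list only the columns that have missing values, with the most affected column first.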
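- Common ways to handle missing values: drop them, or fill them with zero, the mean, the mode, and so on.
- The output above shows that children, country, agent, and company all have missing values. company is missing in 112,593 of the 119,390 rows (about 94%).
- Handling chosen for each of the four columns:
  - company: too many values are missing, so drop the column.
  - children: only 4 values are missing, so fill with the mode.
  - country: fill with the mode.
  - agent (travel agency): fill with 0, meaning the booking does not belong to any agency.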
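The four rules above could be sketched as follows. This is an illustration under assumptions, not necessarily the notebook's own code: `clean_missing` is a hypothetical name, and the toy frame uses made-up values in place of `data`.

```python
import numpy as np
import pandas as pd


def clean_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Drop company, mode-fill children and country, zero-fill agent (hypothetical helper)."""
    out = df.drop(columns=["company"])  # returns a new frame; df is left unchanged
    for col in ["children", "country"]:
        # mode() ignores NaN and may return several values on ties; take the first.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    # 0 acts as a sentinel for "not booked through any agent".
    out["agent"] = out["agent"].fillna(0)
    return out


# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "children": [0.0, 0.0, np.nan, 1.0],
    "country": ["PRT", None, "PRT", "GBR"],
    "agent": [np.nan, 304.0, 240.0, np.nan],
    "company": [np.nan, np.nan, np.nan, 110.0],
})
cleaned = clean_missing(toy)
print(cleaned.isnull().sum())
```

Using 0 for agent assumes that 0 is not already a real agent ID in the data; checking `(data["agent"] == 0).sum()` before filling would confirm that.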
df