1. Project Background
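This dataset records various kinds of information about customer behaviour on a travel website. Analysing it helps us understand users' travel habits, their preferences, and how they interact with travel content, which is a valuable reference for the site's marketing, user-experience optimisation, and new-service development. With these insights, a travel company can meet customer needs more effectively, improve service quality, and strengthen user engagement and loyalty.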
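This project examines users from three angles: purchase-behaviour analysis, cluster analysis, and a random forest model. It also explores the main factors that influence whether a user buys a ticket.
2. Data Description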
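| Variable | Description |
| --- | --- |
| UserID | Unique ID of the user |
| Taken_product | Whether the user buys an air ticket next month (target variable) |
| Yearly_avg_view_on_travel_page | Average number of views per year on travel-related pages |
| preferred_device | Preferred device the user logs in with |
| total_likes_on_outstation_checkin_given | Total likes the user gave to out-of-station check-ins in the past year |
| yearly_avg_Outstation_checkins | Average number of out-of-station check-ins per year |
| member_in_family | Total number of family members mentioned in the user's account |
| preferred_location_type | Preferred type of travel destination |
| Yearly_avg_comment_on_travel_page | Average number of comments per year on travel-related pages |
| total_likes_on_outofstation_checkin_received | Total likes the user received on out-of-station check-ins in the past year |
| week_since_last_outstation_checkin | Number of weeks since the user's last out-of-station check-in update |
| following_company_page | Whether the customer follows the company page (yes or no) |
| montly_avg_comment_on_company_page | Average number of comments per month on the company page |
| working_flag | Whether the customer is employed |
| travelling_network_rating | Rating that indicates whether the user has close friends who like travelling; 1 is the highest, 4 the lowest |
| Adult_flag | Age status of the customer (the values range from 0 to 3, so my guess is that it relates to adult status rather than simply flagging whether the customer is an adult) |
| Daily_Avg_mins_spend_on_traveling_page | Average time per day the user spends on the company's travel page |

3. Python Library Imports and Data Loading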
In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc
In [2]:
# Load the data
data = pd.read_csv("/home/mw/input/data5466/Customer behaviour Tourism.csv")
4. Data Preview and Processing
4.1 Data Preview
In [3]:
# Check the dimensions of the data
data.shape
(11760, 17)
In [4]:
# Show column types and non-null counts
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11760 entries, 0 to 11759
Data columns (total 17 columns):
UserID                                          11760 non-null int64
Taken_product                                   11760 non-null object
Yearly_avg_view_on_travel_page                  11179 non-null float64
preferred_device                                11707 non-null object
total_likes_on_outstation_checkin_given         11379 non-null float64
yearly_avg_Outstation_checkins                  11685 non-null object
member_in_family                                11760 non-null object
preferred_location_type                         11729 non-null object
Yearly_avg_comment_on_travel_page               11554 non-null float64
total_likes_on_outofstation_checkin_received    11760 non-null int64
week_since_last_outstation_checkin              11760 non-null int64
following_company_page                          11657 non-null object
montly_avg_comment_on_company_page              11760 non-null int64
working_flag                                    11760 non-null object
travelling_network_rating                       11760 non-null int64
Adult_flag                                      11759 non-null float64
Daily_Avg_mins_spend_on_traveling_page          11759 non-null float64
dtypes: float64(5), int64(5), object(7)
memory usage: 1.5+ MB
In [5]:
# Count the missing values in each column
data.isna().sum()
UserID                                            0
Taken_product                                     0
Yearly_avg_view_on_travel_page                  581
preferred_device                                 53
total_likes_on_outstation_checkin_given         381
yearly_avg_Outstation_checkins                   75
member_in_family                                  0
preferred_location_type                          31
Yearly_avg_comment_on_travel_page               206
total_likes_on_outofstation_checkin_received      0
week_since_last_outstation_checkin                0
following_company_page                          103
montly_avg_comment_on_company_page                0
working_flag                                      0
travelling_network_rating                         0
Adult_flag                                        1
Daily_Avg_mins_spend_on_traveling_page            1
dtype: int64
In [6]:
# Count duplicate rows
data.duplicated().sum()
0
4.2 Data Processing
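The per-column missing counts from the preview can overlap within a row, so before dropping anything it may be worth checking how many rows actually contain at least one missing value. The following is a minimal sketch on a small synthetic frame; the toy values and the helper name `missing_row_report` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_row_report(df: pd.DataFrame) -> dict:
    """Summarise how many rows a plain dropna() would remove."""
    rows_with_na = int(df.isna().any(axis=1).sum())
    return {
        "total_rows": len(df),
        "rows_with_na": rows_with_na,
        "share_dropped": rows_with_na / len(df) if len(df) else 0.0,
    }


# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Yearly_avg_view_on_travel_page": [307.0, np.nan, 367.0, 270.0],
    "preferred_device": ["iOS", "Android", None, "Laptop"],
    "Adult_flag": [0.0, 1.0, 1.0, 0.0],
})

report = missing_row_report(toy)
print(report)
```

On the real data, calling `missing_row_report(data)` before the `dropna` cell would show what share of the 11,760 rows is lost by deleting incomplete rows.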
In [7]:
# Drop rows with missing values
data.dropna(inplace=True)
In [8]:
# Check the missing values again
data.isna().sum()
UserID                                          0
Taken_product                                   0
Yearly_avg_view_on_travel_page                  0
preferred_device                                0
total_likes_on_outstation_checkin_given         0
yearly_avg_Outstation_checkins                  0
member_in_family                                0
preferred_location_type                         0
Yearly_avg_comment_on_travel_page               0
total_likes_on_outofstation_checkin_received    0
week_since_last_outstation_checkin              0
following_company_page                          0
montly_avg_comment_on_company_page              0
working_flag                                    0
travelling_network_rating                       0
Adult_flag                                      0
Daily_Avg_mins_spend_on_traveling_page          0
dtype: int64
In [9]:
# Inspect the unique values of selected features (the data is fairly messy)
characteristic = ['Taken_product', 'preferred_device', 'yearly_avg_Outstation_checkins',
                  'member_in_family', 'preferred_location_type',
                  'following_company_page', 'working_flag']
for i in characteristic:
    print(f'{i}:')
    print(data[i].unique())
    print('-'*50)
Taken_product:
['Yes' 'No']
--------------------------------------------------
preferred_device:
['iOS and Android' 'iOS' 'ANDROID' 'Android' 'Android OS' 'Other' 'Others'
 'Tab' 'Laptop' 'Mobile']
--------------------------------------------------
yearly_avg_Outstation_checkins:
['1' '23' '16' '26' '19' '24' '21' '11' '15' '10' '25' '12' '18' '29' '22'
 '20' '28' '14' '27' '13' '17' '*' '5' '8' '2' '3' '9' '7' '6' '4']
--------------------------------------------------
member_in_family:
['2' '1' '4' '3' 'Three' '5' '10']
--------------------------------------------------
preferred_location_type:
['Financial' 'Other' 'Medical' 'Game' 'Entertainment' 'Social media'
 'Tour and Travel' 'Movie' 'OTT' 'Tour Travel' 'Beach' 'Historical site'
 'Big Cities' 'Trekking' 'Hill Stations']
--------------------------------------------------
following_company_page:
['Yes' 'No' '1' '0']
--------------------------------------------------
working_flag:
['No' 'Yes']
--------------------------------------------------
**Observations:**
1. preferred_device contains both Other and Others, which should be unified as Others. ANDROID, Android, and Android OS should be unified as Android. Tab is a tablet and Mobile presumably means a mobile phone; normally these would be cleaned up as well, but they may refer to specific systems, so they are left unchanged here.
2. yearly_avg_Outstation_checkins contains the value '*'. Rows with this anomalous symbol are removed, and the column is converted to int.
3. member_in_family contains Three. It is replaced with 3, and the column is converted to int.
4. In preferred_location_type, Tour and Travel and Tour Travel mean the same thing, so they are unified as Tour and Travel.
5. following_company_page contains Yes, No, 1, and 0. Yes is replaced with 1 and No with 0.
6. In Taken_product and working_flag, Yes and No are replaced with 1 and 0 respectively.
In [10]:
# 1. Normalise the preferred login device labels
data['preferred_device'] = data['preferred_device'].replace(
    {'Other': 'Others', 'ANDROID': 'Android', 'Android OS': 'Android'})
# 2. Drop rows whose yearly check-in count is '*', then convert to int
#    (.copy() avoids SettingWithCopyWarning on the assignments that follow)
data = data[data['yearly_avg_Outstation_checkins'] != '*'].copy()
data['yearly_avg_Outstation_checkins'] = data['yearly_avg_Outstation_checkins'].astype(int)
# 3. Fix the family member count and convert to int
data['member_in_family'] = data['member_in_family'].replace({'Three': '3'})
data['member_in_family'] = data['member_in_family'].astype(int)
# 4. Unify the preferred location type labels
data['preferred_location_type'] = data['preferred_location_type'].replace(
    {'Tour Travel': 'Tour and Travel'})
# 5. Map Yes/No to 1/0 for following_company_page
data['following_company_page'] = data['following_company_page'].replace({'Yes': '1', 'No': '0'})
# 6. Map Yes/No to 1/0 for Taken_product and working_flag
data['Taken_product'] = data['Taken_product'].replace({'Yes': '1', 'No': '0'})
data['working_flag'] = data['working_flag'].replace({'Yes': '1', 'No': '0'})
In [11]:
# Convert UserID to string
data['UserID'] = data['UserID'].astype(str)
# Convert Taken_product to a categorical type
data['Taken_product'] = data['Taken_product'].astype('category')
# Convert following_company_page to a categorical type
data['following_company_page'] = data['following_company_page'].astype('category')
# Convert working_flag to a categorical type
data['working_flag'] = data['working_flag'].astype('category')
# Convert travelling_network_rating to a categorical type
data['travelling_network_rating'] = data['travelling_network_rating'].astype('category')
# Convert Adult_flag to a categorical type
data['Adult_flag'] = data['Adult_flag'].astype('category')
# Check the data types again after the conversions
data.dtypes
UserID                                            object
Taken_product                                   category
Yearly_avg_view_on_travel_page                   float64
preferred_device                                  object
total_likes_on_outstation_checkin_given          float64
yearly_avg_Outstation_checkins                     int64
member_in_family                                   int64
preferred_location_type                           object
Yearly_avg_comment_on_travel_page                float64
total_likes_on_outofstation_checkin_received       int64
week_since_last_outstation_checkin                 int64
following_company_page                          category
montly_avg_comment_on_company_page                 int64
working_flag                                    category
travelling_network_rating                       category
Adult_flag                                      category
Daily_Avg_mins_spend_on_traveling_page           float64
dtype: object
In [12]:
# Preview the processed data
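The cleaning rules from section 4.2 can also be tried out on a small synthetic frame, which makes it easy to confirm that each column ends up with only canonical labels. This is a minimal sketch under assumptions: the toy rows are made up to mimic the inconsistencies found in the unique-value check, and the explicit `.copy()` after filtering is an added precaution rather than something the original notebook is known to use.

```python
import pandas as pd

# Synthetic rows that mimic the inconsistencies found earlier (made-up values).
toy = pd.DataFrame({
    "preferred_device": ["Other", "ANDROID", "Android OS", "iOS"],
    "yearly_avg_Outstation_checkins": ["1", "*", "23", "5"],
    "member_in_family": ["2", "Three", "10", "1"],
    "preferred_location_type": ["Tour Travel", "Beach", "Tour and Travel", "OTT"],
    "following_company_page": ["Yes", "No", "1", "0"],
})

# Filter out the '*' rows and take an explicit copy so that later
# assignments modify an independent frame.
clean = toy[toy["yearly_avg_Outstation_checkins"] != "*"].copy()

clean["preferred_device"] = clean["preferred_device"].replace(
    {"Other": "Others", "ANDROID": "Android", "Android OS": "Android"}
)
clean["yearly_avg_Outstation_checkins"] = (
    clean["yearly_avg_Outstation_checkins"].astype(int)
)
clean["member_in_family"] = (
    clean["member_in_family"].replace({"Three": "3"}).astype(int)
)
clean["preferred_location_type"] = clean["preferred_location_type"].replace(
    {"Tour Travel": "Tour and Travel"}
)
clean["following_company_page"] = clean["following_company_page"].replace(
    {"Yes": "1", "No": "0"}
)

# Each cleaned column should now contain only the canonical labels.
print(sorted(clean["preferred_device"].unique()))
print(sorted(clean["following_company_page"].unique()))
```

Running the same `unique()` checks on the real `data` after cleaning is a quick way to confirm that no unexpected label survived.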