Booking Cluster Analysis

Latest recommended article published on 2021-03-23 00:12:04

缘源园

Category: Data Analysis. Tags: python, data analysis, machine learning

Article link: https://blog.csdn.net/weixin_48135624/article/details/113004179


Case background

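Airbnb serves a broad and varied range of travel scenarios around the world. Through its app and website, as well as through its marketing channels, it collects very comprehensive user behavior data.
Using this data to identify potential target customer segments and to devise matching marketing strategies is an important cornerstone of Airbnb's growth.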

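Field	Description	Field	Description
id	Unique user ID	android	Booked through the Android app
date_account_created	Registration date	moweb	Booked through the mobile website
date_first_booking	Date of first booking	web	Booked through the desktop website
gender	Gender	ios	Booked through the iOS app
age	Age	Language_EN	Uses the English interface
Married	Married	Language_ZH	Uses the Chinese interface
Children	Number of children	Country_US	Destination is the United States
		Country_EUR	Destination is Europe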

import pandas as pd
airbnb=pd.read_csv('airbnb.csv')
# Preview the first rows of the user data
airbnb.head()

	age	date_account_created	date_first_booking	gender	Language_EN	Country_US	android	moweb	web	Married	Children
0	56	9/28/2010	8/2/2010	F	1	1	1	0	1	1	1
1	42	12/5/2011	9/8/2012	F	1	0	0	1	1	0	1
2	41	9/14/2010	2/18/2010	U	1	1	1	0	1	0	2
3	46	1/2/2010	1/5/2010	F	1	1	1	0	1	0	2
4	47	1/3/2010	1/13/2010	F	1	1	1	0	1	1	3

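airbnb.info()  # check the data types
# Variable categories: personal information, relationship with Airbnb, app language, destination country, booking channel
# There are two date variables, which are processed later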
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67936 entries, 0 to 67935
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   age                   67936 non-null  int64 
 1   date_account_created  67936 non-null  object
 2   date_first_booking    67936 non-null  object
 3   gender                67936 non-null  object
 4   Language_EN           67936 non-null  int64 
 5   Language_ZH           67936 non-null  int64 
 6   Country_US            67936 non-null  int64 
 7   Country_EUR           67936 non-null  int64 
 8   android               67936 non-null  int64 
 9   moweb                 67936 non-null  int64 
 10  web                   67936 non-null  int64 
 11  ios                   67936 non-null  int64 
 12  Married               67936 non-null  int64 
 13  Children              67936 non-null  int64 
dtypes: int64(11), object(3)
memory usage: 7.3+ MB

airbnb.columns  # list all current columns
Output:
Index(['age', 'date_account_created', 'date_first_booking', 'gender',
       'Language_EN', 'Language_ZH', 'Country_US', 'Country_EUR', 'android',
       'moweb', 'web', 'ios', 'Married', 'Children'],
      dtype='object')

Univariate analysis

airbnb.describe()

	age	Language_EN	Language_ZH	Country_US	Country_EUR	android	moweb	web	ios	Married	Children
count	67936.000000	67936.000000	67936.000000	67936.000000	67936.000000	67936.000000	67936.000000	67936.000000	67936.000000	67936.000000	67936.000000
mean	47.874249	0.974476	0.005947	0.713907	0.159091	0.658355	0.340423	0.895828	0.067534	0.790155	1.536696
std	146.090906	0.157711	0.076886	0.451937	0.365764	0.474265	0.473855	0.305485	0.250947	0.407201	0.836273
min	2.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	28.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	1.000000	1.000000
50%	33.000000	1.000000	0.000000	1.000000	0.000000	1.000000	0.000000	1.000000	0.000000	1.000000	1.000000
75%	42.000000	1.000000	0.000000	1.000000	0.000000	1.000000	1.000000	1.000000	0.000000	1.000000	2.000000
max	2014.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	3.000000

The minimum age is 2 and the maximum is 2014, which are clearly data anomalies, so the data needs cleaning. Here we keep only users aged 18 to 70.

airbnb=airbnb[airbnb['age']<=70]
airbnb=airbnb[airbnb['age']>=18]
airbnb.age.describe()

Output:
count    65982.000000
mean        35.758449
std         10.501463
min         18.000000
25%         28.000000
50%         33.000000
75%         41.000000
max         70.000000
Name: age, dtype: float64
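
The two filtering steps above can also be written as a single condition. A minimal sketch on a made-up sample (the `toy` frame and its values are illustrative, not the article's data), using `Series.between`, which is inclusive on both ends by default:

```python
import pandas as pd

# Hypothetical sample containing the same kind of outliers as the real data
toy = pd.DataFrame({"age": [2, 17, 18, 33, 70, 71, 2014]})

# between() is inclusive on both ends by default, so 18 and 70 are kept
toy_clean = toy[toy["age"].between(18, 70)]
print(toy_clean["age"].tolist())
```

Writing the range as one expression makes the retained interval easier to read and to change later.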

# Convert the registration date and the first-booking date to datetime format
airbnb['date_account_created']=pd.to_datetime(airbnb['date_account_created'])
airbnb['date_first_booking']=pd.to_datetime(airbnb['date_first_booking'])
airbnb.info()

# date_account_created and date_first_booking have changed from object to datetime64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 65982 entries, 0 to 67935
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   age                   65982 non-null  int64         
 1   date_account_created  65982 non-null  datetime64[ns]
 2   date_first_booking    65982 non-null  datetime64[ns]
 3   gender                65982 non-null  object        
 4   Language_EN           65982 non-null  int64         
 5   Language_ZH           65982 non-null  int64         
 6   Country_US            65982 non-null  int64         
 7   Country_EUR           65982 non-null  int64         
 8   android               65982 non-null  int64         
 9   moweb                 65982 non-null  int64         
 10  web                   65982 non-null  int64         
 11  ios                   65982 non-null  int64         
 12  Married               65982 non-null  int64         
 13  Children              65982 non-null  int64         
dtypes: datetime64[ns](2), int64(11), object(1)
memory usage: 7.6+ MB
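
The dates in this file are written month/day/year (for example `9/28/2010`). As a hedged sketch, passing an explicit `format` to `pd.to_datetime` removes any ambiguity between month-first and day-first parsing; the `toy` frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical sample in the same month/day/year layout as the CSV
toy = pd.DataFrame({"date_account_created": ["9/28/2010", "12/5/2011"]})

# An explicit format avoids guessing between month-first and day-first dates
toy["date_account_created"] = pd.to_datetime(
    toy["date_account_created"], format="%m/%d/%Y"
)
print(toy["date_account_created"].dt.month.tolist())
```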

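# Extract the year from each date and subtract it from 2019 to create new variables such as year_since_account_created, i.e. the number of years from registration to the reference year (2019)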
airbnb['year_since_account_created']=airbnb['date_account_created'].apply(lambda x:2019-x.year)
airbnb['year_since_first_booking']=airbnb['date_first_booking'].apply(lambda x:2019-x.year)
airbnb.describe()
# Account age ranges from 5 to 9 years; time since the first booking ranges from 4 to 9 years

	age	Language_EN	Language_ZH	Country_US	Country_EUR	android	moweb	web	ios	Married	Children	year_since_account_created	year_since_first_booking
count	65982.000000	65982.000000	65982.000000	65982.000000	65982.000000	65982.000000	65982.000000	65982.000000	65982.000000	65982.000000	65982.000000	65982.000000	65982.000000
mean	35.758449	0.974129	0.006062	0.714998	0.158407	0.653208	0.345534	0.894668	0.068579	0.790322	1.535510	6.035282	5.906641
std	10.501463	0.158751	0.077625	0.451419	0.365125	0.475952	0.475546	0.306983	0.252739	0.407082	0.837236	0.965382	0.995412
min	18.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	5.000000	4.000000
25%	28.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	1.000000	1.000000	5.000000	5.000000
50%	33.000000	1.000000	0.000000	1.000000	0.000000	1.000000	0.000000	1.000000	0.000000	1.000000	1.000000	6.000000	6.000000
75%	41.000000	1.000000	0.000000	1.000000	0.000000	1.000000	1.000000	1.000000	0.000000	1.000000	2.000000	7.000000	6.000000
max	70.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	3.000000	9.000000	9.000000
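
The year difference can also be computed without `apply`, using the vectorized `.dt.year` accessor. A small sketch, where `REFERENCE_YEAR` and the `toy` frame are illustrative names and values rather than part of the original code:

```python
import pandas as pd

REFERENCE_YEAR = 2019  # the article measures tenure relative to 2019

# Hypothetical sample of already-parsed booking dates
toy = pd.DataFrame(
    {"date_first_booking": pd.to_datetime(["2010-08-02", "2014-01-05"])}
)

# .dt.year works on the whole column at once, no lambda needed
toy["year_since_first_booking"] = REFERENCE_YEAR - toy["date_first_booking"].dt.year
print(toy["year_since_first_booking"].tolist())
```

Keeping the reference year in one named constant makes it easy to rerun the analysis for a different year.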

Convert the categorical variable (gender) into dummy variables

airbnb=pd.get_dummies(airbnb)   # get_dummies is the pandas way to do one-hot encoding
airbnb.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 65982 entries, 0 to 67935
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   age                         65982 non-null  int64         
 1   date_account_created        65982 non-null  datetime64[ns]
 2   date_first_booking          65982 non-null  datetime64[ns]
 3   Language_EN                 65982 non-null  int64         
 4   Language_ZH                 65982 non-null  int64         
 5   Country_US                  65982 non-null  int64         
 6   Country_EUR                 65982 non-null  int64         
 7   android                     65982 non-null  int64         
 8   moweb                       65982 non-null  int64         
 9   web                         65982 non-null  int64         
 10  ios                         65982 non-null  int64         
 11  Married                     65982 non-null  int64         
 12  Children                    65982 non-null  int64         
 13  year_since_account_created  65982 non-null  int64         
 14  year_since_first_booking    65982 non-null  int64         
 15  gender_F                    65982 non-null  uint8         
 16  gender_M                    65982 non-null  uint8         
 17  gender_U                    65982 non-null  uint8         
dtypes: datetime64[ns](2), int64(13), uint8(3)
memory usage: 8.2 MB
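
Calling `pd.get_dummies` on the whole frame encodes every object-typed column it finds. As a hedged alternative, the columns to encode can be named explicitly; the sketch below uses a made-up `toy` frame and requests integer dummies via `dtype=int`, since the default dummy dtype differs between pandas versions:

```python
import pandas as pd

# Hypothetical sample with one numeric and one categorical column
toy = pd.DataFrame({"age": [56, 42, 41], "gender": ["F", "M", "U"]})

# Encode only the named column; other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)
print(encoded.columns.tolist())
```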

Drop the two date variables; they can be selected by dtype

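airbnb.drop(airbnb.select_dtypes(['datetime64']).columns,inplace=True,axis=1)   # after the drop, the derived columns (years since first booking and years since account creation) remain
# inplace=True modifies the DataFrame itself instead of returning a new one
# axis=1 refers to the column axis, so the labels passed in are treated as column names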

airbnb.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 65982 entries, 0 to 67935
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   age                         65982 non-null  int64
 1   Language_EN                 65982 non-null  int64
 2   Language_ZH                 65982 non-null  int64
 3   Country_US                  65982 non-null  int64
 4   Country_EUR                 65982 non-null  int64
 5   android                     65982 non-null  int64
 6   moweb                       65982 non-null  int64
 7   web                         65982 non-null  int64
 8   ios                         65982 non-null  int64
 9   Married                     65982 non-null  int64
 10  Children                    65982 non-null  int64
 11  year_since_account_created  65982 non-null  int64
 12  year_since_first_booking    65982 non-null  int64
 13  gender_F                    65982 non-null  uint8
 14  gender_M                    65982 non-null  uint8
 15  gender_U                    65982 non-null  uint8
dtypes: int64(13), uint8(3)
memory usage: 7.2 MB

Select five variables as the clustering dimensions

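airbnb_5=airbnb[['age','web','moweb','ios','android']].copy()  # copy so a cluster column can be added later without a SettingWithCopyWarning
# Standardization: rescale each feature to mean 0 and variance 1 (this changes the scale, not the shape, of the distribution)
# Standardize the data with scale from sklearn's preprocessing module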
from sklearn.preprocessing import scale
x=pd.DataFrame(scale(airbnb_5))
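
To make the standardization step concrete, here is a minimal pure-Python sketch of the z-score formula that `scale` applies to each column: subtract the mean, then divide by the population standard deviation. The `standardize` helper and the sample ages are illustrative only.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0), as sklearn's scale uses
    return [(v - mu) / sigma for v in values]

ages = [20, 30, 40]  # made-up ages
z = standardize(ages)
print(z)
```

After this transformation, age (tens of years) and the 0/1 channel flags are on comparable scales, so no single feature dominates the Euclidean distances that KMeans relies on.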

Run the cluster analysis

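# Build the model with sklearn's cluster module
from sklearn import cluster
# Start by trying 3 clusters
model=cluster.KMeans(n_clusters=3,random_state=10)  # n_clusters=3 asks for three groups; random_state fixes the random seed so the result is reproducible
model.fit(x)
# Extract the labels and inspect the assignment
airbnb_5['cluster']=model.labels_   # store each row's cluster label as a new column
airbnb_5.head(10)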

Output:
	age	web	moweb	ios	android	cluster
0	56	1	0	0	1	1
1	42	1	1	0	0	2
2	41	1	0	0	1	1
3	46	1	0	0	1	1
4	47	1	0	0	1	1
5	50	1	0	0	1	1
6	46	1	0	0	1	1
7	36	1	0	0	1	1
8	33	1	0	0	1	1
9	31	1	0	0	1	1

# Use groupby to see how well each variable separates the clusters
airbnb_5.groupby(['cluster'])['age'].describe()

	count	mean	std	min	25%	50%	75%	max
cluster
0	4525.0	32.979006	8.440972	18.0	27.0	31.0	37.0	70.0
1	40473.0	36.531638	11.063559	18.0	29.0	34.0	42.0	70.0
2	20984.0	34.866517	9.576614	18.0	28.0	33.0	39.0	70.0

airbnb_5.groupby(['cluster'])['ios'].describe()

	count	mean	std	min	25%	50%	75%	max
cluster
0	4525.0	1.0	0.0	1.0	1.0	1.0	1.0	1.0
1	40473.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	20984.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
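
Rather than describing one variable at a time, the per-cluster means of all variables can be viewed in a single table. A minimal sketch on made-up data (the `toy` frame, `profile`, and `sizes` are illustrative names):

```python
import pandas as pd

# Hypothetical labelled sample: two clusters of two users each
toy = pd.DataFrame({
    "age": [25, 35, 45, 55],
    "ios": [1, 1, 0, 0],
    "android": [0, 0, 1, 1],
    "cluster": [0, 0, 1, 1],
})

# One row per cluster, one column per variable
profile = toy.groupby("cluster").mean()
# Number of users in each cluster
sizes = toy["cluster"].value_counts().sort_index()
print(profile)
print(sizes)
```

For 0/1 flags such as `ios`, the cluster mean reads directly as the share of users in that cluster who booked through that channel.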

Model evaluation and optimization

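from sklearn import metrics  # import sklearn's metrics module
x_cluster=model.fit_predict(x)  # fit the model and return each row's cluster label
score=metrics.silhouette_score(x,x_cluster)  # ranges from -1 to 1; higher means rows sit closer to their own cluster than to neighboring clusters
print(score)
0.6304549142727769
centers=pd.DataFrame(model.cluster_centers_)
centers.to_csv('center_3.csv')
# Split the users into 5 groups
model=cluster.KMeans(n_clusters=5,random_state=10)
model.fit(x)
centers=pd.DataFrame(model.cluster_centers_)
centers.to_csv('center_5.csv')
# Score reported for this step. The code above does not recompute silhouette_score from the
# 5-cluster labels, so this value may still reflect the 3-cluster labels in x_cluster.
0.6304549142727731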
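
Comparing candidate cluster counts only works if the silhouette score is recomputed from each model's own labels. A minimal sketch on synthetic data: `x_demo`, `blob_centers`, and `scores` are illustrative names, and the three well-separated blobs stand in for the standardized matrix `x`, so the numbers say nothing about the Airbnb data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic, well-separated blobs (an assumption for illustration only)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
x_demo = np.vstack(
    [c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers]
)

scores = {}
for k in range(2, 6):
    model_k = KMeans(n_clusters=k, n_init=10, random_state=10)
    labels_k = model_k.fit_predict(x_demo)      # labels belonging to this k
    scores[k] = silhouette_score(x_demo, labels_k)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On a table with tens of thousands of rows the full silhouette computation is expensive because it compares every pair of rows; `silhouette_score` accepts a `sample_size` argument to score a random subsample instead.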

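# The silhouette coefficient, metrics.silhouette_score(x,x_cluster), measures the quality of a clustering:
# a higher score means rows are close to their own cluster and far from the others; a lower score means the clusters overlap
# model.cluster_centers_ returns the cluster centers, which are used to interpret the result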
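
For intuition about what the silhouette coefficient measures, here is a small pure-Python sketch of the per-point formula s = (b - a) / max(a, b), where a is the mean distance to the other members of the point's own cluster and b is the smallest mean distance to any other cluster. The `silhouette_values` helper and the four sample points are illustrative, not the library's implementation.

```python
from math import dist

def silhouette_values(points, labels):
    """Silhouette value per point: (b - a) / max(a, b)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    result = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            result.append(0.0)  # usual convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        result.append((b - a) / max(a, b))
    return result

points = [(0, 0), (0, 1), (10, 0), (10, 1)]  # two tight, distant pairs
labels = [0, 0, 1, 1]
values = silhouette_values(points, labels)
overall = sum(values) / len(values)  # the overall score is the mean over points
print(overall)
```

Because the two pairs are far apart relative to their internal spread, every point scores close to 1.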

Cluster centers for the 3-group solution

	age	web	moweb	ios	android
0	-0.084934709	0.230108024	1.372846487	-0.271346118	-1.372434556
1	0.073627358	0.206536201	-0.726610207	-0.271346118	0.726088466
2	-0.264674063	-2.914414684	0.132659495	3.685330034	-0.129903148

Cluster centers for the 5-group solution

	age	web	moweb	ios	android
0	1.567477405	0.34312207	-0.726610207	-0.271346118	0.728632193
1	-0.073690494	0.34312207	1.37625372	-0.271346118	-1.372434556
2	-0.264674063	-2.914414684	0.132659495	3.685330034	-0.129903148
3	-0.240134866	-2.914414684	-0.124800902	-0.271346118	0.055424414
4	-0.450412767	0.34312207	-0.726610207	-0.271346118	0.728632193
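
The centers above are expressed in standardized units, which makes them hard to read directly. A hedged sketch of converting them back: multiply by the column's standard deviation and add its mean. The means and standard deviations below are copied from the `describe()` output after cleaning; those are sample standard deviations, while `scale` uses the population version, a difference that is negligible at this sample size. The `unstandardize` helper is an illustrative name.

```python
def unstandardize(z, mean, std):
    """Invert z = (value - mean) / std."""
    return z * std + mean

# (mean, std) per column, taken from the describe() output after cleaning
stats = {
    "age": (35.758449, 10.501463),
    "web": (0.894668, 0.306983),
    "ios": (0.068579, 0.252739),
}
# Group 0 of the 5-group centers table (age, web, and ios columns only)
center_0 = {"age": 1.567477405, "web": 0.34312207, "ios": -0.271346118}

restored = {name: unstandardize(center_0[name], *stats[name]) for name in center_0}
print(restored)
```

Read this way, group 0 of the 5-group solution works out to an average age of about 52, with essentially all of its users booking on the web and none on iOS.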