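Case background
- Airbnb serves a wide range of travel scenarios worldwide. Through its app and website, as well as its various marketing channels, it collects comprehensive user behavior data.
- Using this data to identify potential target segments and to design matching marketing strategies is a cornerstone of Airbnb's growth.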
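Field | Description |
---|---|
id | Unique user ID |
date_account_created | Account registration date |
date_first_booking | Date of first booking |
Gender | Gender |
Age | Age |
Married | Married |
Children | Number of children |
Android | Booked in the Android app |
Moweb | Booked on the mobile website |
Web | Booked on the desktop website |
iOS | Booked in the iOS app |
Language_en | Uses the English interface |
Language_Zh | Uses the Chinese interface |
Country_us | Destination is the United States |
Country_eu | Destination is Europe |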
import pandas as pd
airbnb=pd.read_csv('airbnb.csv')
# Preview the user data
airbnb.head()
 | age | date_account_created | date_first_booking | gender | Language_EN | Language_ZH | Country_US | Country_EUR | android | moweb | web | ios | Married | Children |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 56 | 9/28/2010 | 8/2/2010 | F | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 |
1 | 42 | 12/5/2011 | 9/8/2012 | F | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
2 | 41 | 9/14/2010 | 2/18/2010 | U | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 2 |
3 | 46 | 1/2/2010 | 1/5/2010 | F | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 2 |
4 | 47 | 1/3/2010 | 1/13/2010 | F | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 3 |
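airbnb.info() # inspect the column dtypes
# Variable categories: personal information, relationship with Airbnb, app language, destination country, booking channel
# There are two date variables, which are processed later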
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67936 entries, 0 to 67935
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 67936 non-null int64
1 date_account_created 67936 non-null object
2 date_first_booking 67936 non-null object
3 gender 67936 non-null object
4 Language_EN 67936 non-null int64
5 Language_ZH 67936 non-null int64
6 Country_US 67936 non-null int64
7 Country_EUR 67936 non-null int64
8 android 67936 non-null int64
9 moweb 67936 non-null int64
10 web 67936 non-null int64
11 ios 67936 non-null int64
12 Married 67936 non-null int64
13 Children 67936 non-null int64
dtypes: int64(11), object(3)
memory usage: 7.3+ MB
airbnb.columns # list all current columns
Output:
Index(['age', 'date_account_created', 'date_first_booking', 'gender',
'Language_EN', 'Language_ZH', 'Country_US', 'Country_EUR', 'android',
'moweb', 'web', 'ios', 'Married', 'Children'],
dtype='object')
Univariate analysis
airbnb.describe()
 | age | Language_EN | Language_ZH | Country_US | Country_EUR | android | moweb | web | ios | Married | Children |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 67936.000000 | 67936.000000 | 67936.000000 | 67936.000000 | 67936.000000 | 67936.000000 | 67936.000000 | 67936.000000 | 67936.000000 | 67936.000000 | 67936.000000 |
mean | 47.874249 | 0.974476 | 0.005947 | 0.713907 | 0.159091 | 0.658355 | 0.340423 | 0.895828 | 0.067534 | 0.790155 | 1.536696 |
std | 146.090906 | 0.157711 | 0.076886 | 0.451937 | 0.365764 | 0.474265 | 0.473855 | 0.305485 | 0.250947 | 0.407201 | 0.836273 |
min | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 28.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 |
50% | 33.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 |
75% | 42.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 2.000000 |
max | 2014.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 3.000000 |
The minimum age is 2 and the maximum is 2014, which are clearly data errors. As a cleaning step, we keep only users aged 18 to 70.
airbnb=airbnb[airbnb['age']<=70]
airbnb=airbnb[airbnb['age']>=18]
airbnb.age.describe()
Output:
count 65982.000000
mean 35.758449
std 10.501463
min 18.000000
25% 28.000000
50% 33.000000
75% 41.000000
max 70.000000
Name: age, dtype: float64
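The two-step age filter above can also be written as a single inclusive range check. The following is a minimal sketch on a toy DataFrame with made-up ages (not the Airbnb file). The `.copy()` call is an addition not present in the original code; it is there so that later column assignments act on an independent DataFrame rather than a slice.

```python
import pandas as pd

# Toy data (hypothetical values) including implausible ages like those found above
toy = pd.DataFrame({"age": [2, 18, 33, 70, 71, 2014]})

# between() is inclusive on both ends by default, so 18 and 70 are kept
filtered = toy[toy["age"].between(18, 70)].copy()

print(filtered["age"].tolist())
```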
# Convert the registration date and first booking date to datetime
airbnb['date_account_created']=pd.to_datetime(airbnb['date_account_created'])
airbnb['date_first_booking']=pd.to_datetime(airbnb['date_first_booking'])
airbnb.info()
# date_account_created and date_first_booking change from object to datetime64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 65982 entries, 0 to 67935
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 65982 non-null int64
1 date_account_created 65982 non-null datetime64[ns]
2 date_first_booking 65982 non-null datetime64[ns]
3 gender 65982 non-null object
4 Language_EN 65982 non-null int64
5 Language_ZH 65982 non-null int64
6 Country_US 65982 non-null int64
7 Country_EUR 65982 non-null int64
8 android 65982 non-null int64
9 moweb 65982 non-null int64
10 web 65982 non-null int64
11 ios 65982 non-null int64
12 Married 65982 non-null int64
13 Children 65982 non-null int64
dtypes: datetime64[ns](2), int64(11), object(1)
memory usage: 7.6+ MB
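When the date layout is known, passing an explicit `format` to `pd.to_datetime` avoids any ambiguity between month-first and day-first parsing. The sketch below assumes the month/day/year layout suggested by sample values such as 9/28/2010, and uses a toy DataFrame rather than the real file.

```python
import pandas as pd

# Toy rows in the same month/day/year layout as the sample shown earlier
toy = pd.DataFrame({"date_account_created": ["9/28/2010", "12/5/2011", "1/2/2010"]})

# An explicit format makes the month-first interpretation unambiguous
toy["date_account_created"] = pd.to_datetime(
    toy["date_account_created"], format="%m/%d/%Y"
)

print(toy.dtypes)
print(toy["date_account_created"].dt.month.tolist())
```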
# Extract the year and subtract it from 2019 to create year_since_account_created, the number of years between registration and 2019
airbnb['year_since_account_created']=airbnb['date_account_created'].apply(lambda x:2019-x.year)
airbnb['year_since_first_booking']=airbnb['date_first_booking'].apply(lambda x:2019-x.year)
airbnb.describe()
# Time since registration ranges from 5 to 9 years; time since first booking ranges from 4 to 9 years
 | age | Language_EN | Language_ZH | Country_US | Country_EUR | android | moweb | web | ios | Married | Children | year_since_account_created | year_since_first_booking |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 65982.000000 | 65982.000000 | 65982.000000 | 65982.000000 | 65982.000000 | 65982.000000 | 65982.000000 | 65982.000000 | 65982.000000 | 65982.000000 | 65982.000000 | 65982.000000 | 65982.000000 |
mean | 35.758449 | 0.974129 | 0.006062 | 0.714998 | 0.158407 | 0.653208 | 0.345534 | 0.894668 | 0.068579 | 0.790322 | 1.535510 | 6.035282 | 5.906641 |
std | 10.501463 | 0.158751 | 0.077625 | 0.451419 | 0.365125 | 0.475952 | 0.475546 | 0.306983 | 0.252739 | 0.407082 | 0.837236 | 0.965382 | 0.995412 |
min | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 4.000000 |
25% | 28.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 5.000000 | 5.000000 |
50% | 33.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 6.000000 | 6.000000 |
75% | 41.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1.000000 | 2.000000 | 7.000000 | 6.000000 |
max | 70.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 3.000000 | 9.000000 | 9.000000 |
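The year difference computed above with `apply` and a lambda can also be expressed with the vectorized `.dt.year` accessor. This is a small sketch on two made-up dates; `REFERENCE_YEAR` is a name introduced here for illustration and simply holds the 2019 reference year used in the text.

```python
import pandas as pd

REFERENCE_YEAR = 2019  # reference year used in the text (name is illustrative)

# Toy dates (hypothetical), already parsed as datetimes
toy = pd.DataFrame(
    {"date_account_created": pd.to_datetime(["2010-09-28", "2014-06-30"])}
)

# .dt.year extracts the calendar year of every row without a Python-level loop
toy["year_since_account_created"] = REFERENCE_YEAR - toy["date_account_created"].dt.year

print(toy["year_since_account_created"].tolist())
```

Note that this counts calendar-year differences only, as the original code does; it does not account for the month or day.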
Convert the categorical variable (gender) to dummy variables
airbnb=pd.get_dummies(airbnb) # get_dummies is the pandas way of doing one-hot encoding
airbnb.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 65982 entries, 0 to 67935
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 65982 non-null int64
1 date_account_created 65982 non-null datetime64[ns]
2 date_first_booking 65982 non-null datetime64[ns]
3 Language_EN 65982 non-null int64
4 Language_ZH 65982 non-null int64
5 Country_US 65982 non-null int64
6 Country_EUR 65982 non-null int64
7 android 65982 non-null int64
8 moweb 65982 non-null int64
9 web 65982 non-null int64
10 ios 65982 non-null int64
11 Married 65982 non-null int64
12 Children 65982 non-null int64
13 year_since_account_created 65982 non-null int64
14 year_since_first_booking 65982 non-null int64
15 gender_F 65982 non-null uint8
16 gender_M 65982 non-null uint8
17 gender_U 65982 non-null uint8
dtypes: datetime64[ns](2), int64(13), uint8(3)
memory usage: 8.2 MB
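Calling `get_dummies` on the whole DataFrame encodes every object-typed column. Naming the column explicitly makes the intent clearer and protects against accidentally encoding other text columns. The sketch below uses toy data; the `dtype="uint8"` argument is an assumption added to reproduce the integer dummies shown above, since recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Toy data (hypothetical) with the same three gender codes as the real file
toy = pd.DataFrame({"age": [56, 42, 41], "gender": ["F", "M", "U"]})

# Encode only the gender column; other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["gender"], dtype="uint8")

print(encoded.columns.tolist())
print(encoded)
```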
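Drop the two date variables; columns can be selected for dropping by dtype
airbnb.drop(columns=airbnb.select_dtypes(['datetime64']).columns,inplace=True) # after dropping, the years since first booking and years since account creation remain
# inplace=True modifies the original DataFrame directly instead of returning a new one
# columns=... drops along the column axis, equivalent to passing the labels with axis=1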
airbnb.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 65982 entries, 0 to 67935
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 65982 non-null int64
1 Language_EN 65982 non-null int64
2 Language_ZH 65982 non-null int64
3 Country_US 65982 non-null int64
4 Country_EUR 65982 non-null int64
5 android 65982 non-null int64
6 moweb 65982 non-null int64
7 web 65982 non-null int64
8 ios 65982 non-null int64
9 Married 65982 non-null int64
10 Children 65982 non-null int64
11 year_since_account_created 65982 non-null int64
12 year_since_first_booking 65982 non-null int64
13 gender_F 65982 non-null uint8
14 gender_M 65982 non-null uint8
15 gender_U 65982 non-null uint8
dtypes: int64(13), uint8(3)
memory usage: 7.2 MB
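Select five variables as the clustering dimensions
airbnb_5=airbnb[['age','web','moweb','ios','android']].copy() # copy so that a cluster column can be added later without a SettingWithCopyWarning
# Standardization rescales each feature to mean 0 and variance 1; it shifts and scales the values but does not change the shape of their distribution
# Standardize the data with scale from sklearn.preprocessing
from sklearn.preprocessing import scale
x=pd.DataFrame(scale(airbnb_5))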
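To make the standardization step concrete, the following sketch computes z-scores by hand on toy data using the population standard deviation (`ddof=0`), which is the convention `scale` follows. The values are made up. The binary `ios` column still takes only two distinct values afterwards, which shows that standardizing shifts and rescales a column rather than reshaping it.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numeric column and one binary channel flag
toy = pd.DataFrame({"age": [20.0, 30.0, 40.0], "ios": [0.0, 0.0, 1.0]})

# z = (value - column mean) / column population std
z = (toy - toy.mean()) / toy.std(ddof=0)

print(z)
print("distinct ios values after scaling:", z["ios"].nunique())
```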
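Cluster analysis
# Build the model with sklearn's cluster module
from sklearn import cluster
# Start with 3 clusters
model=cluster.KMeans(n_clusters=3,random_state=10) # n_clusters=3 splits the data into three groups; random_state fixes the random seed so the result is reproducible
model.fit(x)
# Extract the labels and inspect the assignments
airbnb_5['cluster']=model.labels_ # add the cluster labels as a new column
airbnb_5.head(10)
Output: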
age web moweb ios android cluster
0 56 1 0 0 1 1
1 42 1 1 0 0 2
2 41 1 0 0 1 1
3 46 1 0 0 1 1
4 47 1 0 0 1 1
5 50 1 0 0 1 1
6 46 1 0 0 1 1
7 36 1 0 0 1 1
8 33 1 0 0 1 1
9 31 1 0 0 1 1
# Use groupby to assess how well each variable separates the clusters
airbnb_5.groupby(['cluster'])['age'].describe()
count mean std min 25% 50% 75% max
cluster
0 4525.0 32.979006 8.440972 18.0 27.0 31.0 37.0 70.0
1 40473.0 36.531638 11.063559 18.0 29.0 34.0 42.0 70.0
2 20984.0 34.866517 9.576614 18.0 28.0 33.0 39.0 70.0
airbnb_5.groupby(['cluster'])['ios'].describe()
count mean std min 25% 50% 75% max
cluster
0 4525.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
1 40473.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 20984.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
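Instead of describing one variable at a time, a per-cluster mean of all variables gives a compact profile of each group. This is a minimal sketch on toy data with hypothetical cluster labels, not the fitted Airbnb model.

```python
import pandas as pd

# Toy data (hypothetical) with a cluster label already attached
toy = pd.DataFrame({
    "age": [25, 35, 45, 30],
    "ios": [1, 0, 0, 1],
    "android": [0, 1, 1, 0],
    "cluster": [0, 1, 1, 0],
})

# One row per cluster, one column per variable
profile = toy.groupby("cluster").mean()

print(profile)
```

For the 0/1 channel flags, the mean reads directly as the share of users in that cluster who booked through the channel.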
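Model evaluation and optimization
from sklearn import metrics # import sklearn's metrics module
x_cluster=model.fit_predict(x) # cluster label assigned to each sample
score=metrics.silhouette_score(x,x_cluster) # ranges from -1 to 1; higher means samples sit closer to their own cluster and farther from the others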
print(score)
0.6304549142727769
centers=pd.DataFrame(model.cluster_centers_)
centers.to_csv('center_3.csv')
# Split the users into 5 groups
model=cluster.KMeans(n_clusters=5,random_state=10)
model.fit(x)
centers=pd.DataFrame(model.cluster_centers_)
centers.to_csv('center_5.csv')
0.6304549142727731
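# The silhouette coefficient, metrics.silhouette_score(x,x_cluster), measures the quality of a clustering
# A higher score means samples are closer to their own cluster; a lower score means they are farther from it
# model.cluster_centers_ gives the cluster centers, which are used to interpret the result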
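To compare different cluster counts, the silhouette score has to be recomputed from the labels of each fitted model. The sketch below shows one way to do that in a loop. It runs on synthetic data from `make_blobs` with three hypothetical, well-separated centers, not on the Airbnb data, and `n_init=10` is set explicitly as an assumption to keep behavior consistent across scikit-learn versions.

```python
from sklearn import cluster, metrics
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated groups (hypothetical centers)
points, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=10,
)

scores = {}
for k in range(2, 6):
    model_k = cluster.KMeans(n_clusters=k, random_state=10, n_init=10)
    labels_k = model_k.fit_predict(points)  # labels from this model, not an earlier one
    scores[k] = metrics.silhouette_score(points, labels_k)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 3))
print("best k:", best_k)
```

On large datasets the silhouette score is expensive to compute; `silhouette_score` accepts a `sample_size` argument to evaluate it on a random subsample.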
- Cluster centers for the three-group solution
 | age | web | moweb | ios | android |
---|---|---|---|---|---|
0 | -0.084934709 | 0.230108024 | 1.372846487 | -0.271346118 | -1.372434556 |
1 | 0.073627358 | 0.206536201 | -0.726610207 | -0.271346118 | 0.726088466 |
2 | -0.264674063 | -2.914414684 | 0.132659495 | 3.685330034 | -0.129903148 |
- Cluster centers for the five-group solution
 | age | web | moweb | ios | android |
---|---|---|---|---|---|
0 | 1.567477405 | 0.34312207 | -0.726610207 | -0.271346118 | 0.728632193 |
1 | -0.073690494 | 0.34312207 | 1.37625372 | -0.271346118 | -1.372434556 |
2 | -0.264674063 | -2.914414684 | 0.132659495 | 3.685330034 | -0.129903148 |
3 | -0.240134866 | -2.914414684 | -0.124800902 | -0.271346118 | 0.055424414 |
4 | -0.450412767 | 0.34312207 | -0.726610207 | -0.271346118 | 0.728632193 |
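The centers above are in standardized units, so they are easier to interpret after converting back to the original scale with value = z × std + mean. The sketch below does this for the age coordinate of the five-group centers, using the age mean and standard deviation from the `describe()` output after cleaning. One assumption to note: `describe()` reports the sample standard deviation (ddof=1) while `scale` uses the population version (ddof=0), so with roughly 66,000 rows the result is a close approximation rather than an exact inverse.

```python
# Age mean and std after cleaning, taken from the describe() output earlier
AGE_MEAN = 35.758449
AGE_STD = 10.501463  # sample std (ddof=1), used here as an approximation

# Standardized age coordinate of each five-group center (from the table above)
age_centers_z = [1.567477405, -0.073690494, -0.264674063, -0.240134866, -0.450412767]

# Undo the standardization: value = z * std + mean
age_centers_years = [z * AGE_STD + AGE_MEAN for z in age_centers_z]

for label, years in enumerate(age_centers_years):
    print(f"cluster {label}: about {years:.1f} years")
```

Read this way, clusters 0 and 4 share the same channel profile in the table but differ mainly in age.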
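In the five-group solution, the third cluster (label 2) is defined by its iOS value: these users book through the iOS app. iPhone users may be more inclined to book higher-priced listings, so ad placements and push campaigns can be targeted at this segment.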