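|Field|Meaning|Type|
|:--:|:--:|:--:|
|interested_travel |Travel preference|Binary|
|computer_owner |Owns a home computer|Binary|
|age |Estimated age|Continuous|
|home_value |Home value|Continuous|
|loan_ratio|Loan ratio|Continuous|
|risk_score |Risk score|Continuous|
|marital |Estimated marital status|Continuous|
|interested_sport |Sports preference|Continuous|
|HH_grandparent|Estimate of whether the household head's grandparents are living|Continuous|
|HH_dieting |Household head's dieting preference|Continuous|
|HH_head_age|Household head's age|Continuous|
|auto_member |Estimated driving club membership|Continuous|
|interested_golf |Golf preference|Binary|
|interested_gambling |Gambling preference|Binary|
|HH_has_children |Whether the household head has children|Binary|
|HH_adults_num |Number of adults in the household|Continuous|
|interested_reading |Reading preference|Ordinal|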
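1. The dataset contains many variables, and putting all of them into the model would make it hard to interpret. So, on the one hand, we reduce the dimensionality of correlated variables to cut the variable count; on the other hand, based on business understanding, we group the variables in advance so that each group explains one aspect of the business as far as possible. In this example the variables are split into two groups, household basics and user hobbies. We cluster each group separately to obtain a profile sketch of the users, then combine the two clustering results into a fuller user portrait.
2. The data types in this example are mixed: continuous, nominal, and ordinal variables. Because K-means is suited only to continuous variables, the variables need preprocessing. An ordinal variable with many levels can be treated as continuous; otherwise it is handled like a nominal variable before entering the model. When there are only a few nominal variables, they can enter the model as dummy variables. This example has many binary variables, concentrated in the hobby group, so we binarize the ordinal variable interested_reading and sum it with the other binary variables to obtain the user's "hobby breadth", which is then clustered together with the other continuous hobby variables.
3. Discrete variables such as HH_has_children generally do not take part in clustering, because they can themselves be regarded as cluster labels. If it simplifies later interpretation, and there are not many discrete variables, they can also enter the model after dummy encoding.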
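The "hobby breadth" idea from point 2 can be sketched on a few toy rows. This is only an illustration under stated assumptions: the values are made up, the choice of which binary columns to sum (`interested_travel`, `interested_golf`, `interested_gambling`) is a guess, and so is the rule that any reading level above 0 counts as "interested". The helper column names `reading_flag` and `hobby_breadth` are hypothetical.

```python
import pandas as pd

# Toy rows; column names follow the field table, values are made up.
toy = pd.DataFrame({
    'interested_travel':   [0, 1, 1, 0],
    'interested_golf':     [0, 1, 0, 0],
    'interested_gambling': [1, 1, 0, 0],
    'interested_reading':  [0, 3, 1, 2],   # ordinal levels
})

# Assumption: these are the binary hobby flags that get summed.
binary_cols = ['interested_travel', 'interested_golf', 'interested_gambling']

# Assumption: any reading level above 0 is treated as "interested".
toy['reading_flag'] = (toy['interested_reading'] > 0).astype(int)

# Hobby breadth = number of hobbies flagged for each user.
toy['hobby_breadth'] = toy[binary_cols + ['reading_flag']].sum(axis=1)

print(toy[['reading_flag', 'hobby_breadth']])
```

The resulting count is a single numeric column, so it can sit next to the continuous hobby variables when they are clustered, instead of feeding several 0/1 columns to K-means.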
Load the data
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
travel = pd.read_csv('data_travel.csv',skipinitialspace=True)
travel.head()
| | interested_travel | computer_owner | age | home_value | loan_ratio | risk_score | marital | interested_sport | HH_grandparent | HH_dieting | HH_head_age | auto_member | interested_golf | interested_gambling | HH_has_children | HH_adults_num | interested_reading |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | 64 | 124035 | 73 | 932 | 3 | 312 | 420 | 149 | 96 | 626 | 0 | 0 | NaN | NaN | 0 |
1 | 0.0 | 1.0 | 69 | 138574 | 73 | 1000 | 7 | 241 | 711 | 263 | 68 | 658 | 0 | 0 | N | 5.0 | 3 |
2 | 0.0 | 0.0 | 57 | 148136 | 77 | 688 | 1 | 367 | 240 | 240 | 56 | 354 | 0 | 1 | N | 2.0 | 1 |
3 | 1.0 | 1.0 | 80 | 162532 | 74 | 932 | 7 | 291 | 832 | 197 | 86 | 462 | 1 | 1 | Y | 2.0 | 3 |
4 | 1.0 | 1.0 | 48 | 133580 | 77 | 987 | 10 | 137 | 121 | 209 | 42 | 423 | 0 | 1 | Y | 3.0 | 3 |
travel.describe(include='all')
| | interested_travel | computer_owner | age | home_value | loan_ratio | risk_score | marital | interested_sport | HH_grandparent | HH_dieting | HH_head_age | auto_member | interested_golf | interested_gambling | HH_has_children | HH_adults_num | interested_reading |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 149788.000000 | 149788.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 159899 | 145906.000000 | 167177 |
unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | NaN | 5 |
top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N | NaN | 3 |
freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 111462 | NaN | 65096 |
mean | 0.427745 | 0.856571 | 59.507079 | 207621.314798 | 66.762707 | 817.031751 | 6.884015 | 259.431776 | 377.072498 | 204.593341 | 59.368023 | 486.861273 | 0.373012 | 0.357842 | NaN | 2.770832 | NaN |
std | 0.494753 | 0.350511 | 14.311733 | 107822.501900 | 9.751835 | 165.490295 | 2.610552 | 78.867456 | 248.045395 | 78.971038 | 16.712912 | 151.167457 | 0.483607 | 0.479367 | NaN | 1.285417 | NaN |
min | 0.000000 | 0.000000 | 18.000000 | 48910.000000 | 0.000000 | 1.000000 | 1.000000 | 60.000000 | 0.000000 | 47.000000 | 18.000000 | 49.000000 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN |
25% | 0.000000 | 1.000000 | 49.000000 | 135595.000000 | 63.000000 | 748.000000 | 5.000000 | 204.000000 | 182.000000 | 144.000000 | 48.000000 | 377.000000 | 0.000000 | 0.000000 | NaN | 2.000000 | NaN |
50% | 0.000000 | 1.000000 | 59.000000 | 182106.000000 | 69.000000 | 844.000000 | 7.000000 | 251.000000 | 351.000000 | 185.000000 | 60.000000 | 492.000000 | 0.000000 | 0.000000 | NaN | 2.000000 | NaN |
75% | 1.000000 | 1.000000 | 70.000000 | 248277.000000 | 73.000000 | 945.000000 | 9.000000 | 306.000000 | 528.000000 | 252.000000 | 71.000000 | 600.000000 | 1.000000 | 1.000000 | NaN | 4.000000 | NaN |
max | 1.000000 | 1.000000 | 99.000000 | 1000000.000000 | 102.000000 | 1000.000000 | 10.000000 | 920.000000 | 980.000000 | 633.000000 | 99.000000 | 878.000000 | 1.000000 | 1.000000 | NaN | 7.000000 | NaN |
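Data preprocessing
Fill missing values
The variables with missing values are all categorical, and the proportion of missing values is not high, so we fill them with the mode.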
fill_cols = ['interested_travel', 'computer_owner', 'HH_adults_num']
fill_values = {col: travel[col].mode()[0] for col in fill_cols}
travel = travel.fillna(fill_values)
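The missing rates behind this choice can be read off the count row of the `describe` output above. The sketch below recomputes them from those counts and then shows mode filling on a small made-up column; the names `n_rows`, `non_null`, `missing_ratio`, and `toy` are introduced here for illustration only.

```python
import pandas as pd

# Total rows, taken from the fully populated columns in the describe output.
n_rows = 167177

# Non-null counts copied from the same count row.
non_null = pd.Series({
    'interested_travel': 149788,
    'computer_owner': 149788,
    'HH_has_children': 159899,
    'HH_adults_num': 145906,
})

# Share of missing values per column.
missing_ratio = (1 - non_null / n_rows).round(3)
print(missing_ratio)

# Mode filling on a toy column (values made up):
# the NaN is replaced by the most frequent value, 2.0.
toy = pd.Series([2.0, 2.0, 3.0, None])
filled = toy.fillna(toy.mode()[0])
print(filled.tolist())
```

This gives roughly 10.4% missing for interested_travel and computer_owner, 4.4% for HH_has_children, and 12.7% for HH_adults_num. On the loaded data, `travel.isna().mean()` should give the same figures directly if it is run before the fill.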
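Fix erroneous values
The levels of HH_has_children are stored as strings and need to be converted to integers. Its missing values should mean that there are no children, so they are replaced with 0.
The reading preference interested_reading contains the erroneous value ".", which we replace with 0, meaning the user has no interest in reading.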
travel['interested_reading'].value_counts(dropna=False)
3 65096
1 43832
0 32919
2 24488
. 842
Name: interested_reading, dtype: int64
# Map 'N'/'Y' to 0/1 and treat missing values as "no children"
travel['HH_has_children'] = travel['HH_has_children']\
    .replace({'N': 0, 'Y': 1, np.nan: 0})
travel['interested_reading'] = travel['interested_reading']\
.replace({
'.':'0'}