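Case Study 1: Happiness Prediction
The task is to predict a respondent's happiness from 139 dimensions, including individual variables (gender, age, region, occupation, health, marital status, political affiliation, and so on), family variables (parents, spouse, children, family capital), and social attitudes (fairness, trust, public services).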
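1. Basic Information
Data
- Dimensions: 139
- Dataset: 8000 samples
- Target values: (1,2,3,4,5), where 1 is the lowest and 5 the highest
Evaluation metric
The metric is the mean squared error (MSE), that is,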
$$Score = \frac{1}{n}\sum_{i=1}^{n}(y_i - y_i^*)^2$$
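As a quick sanity check of the metric, the following is a minimal sketch that computes the MSE by hand. The arrays `y_true` and `y_pred` are made-up values for illustration, not competition data.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only
y_true = np.array([4, 5, 3, 4])
y_pred = np.array([4.0, 4.0, 3.5, 2.0])

# Mean of the squared differences, matching the Score formula
score = np.mean((y_true - y_pred) ** 2)
print(score)
```

Because the errors are squared, a prediction that is off by 2 costs four times as much as one that is off by 1.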
2. Feature Engineering
2.1 Loading the Dataset
import pandas as pd
import numpy as np
train = pd.read_csv("train.csv", parse_dates=['survey_time'], encoding='latin-1')
# parse_dates: parse the listed columns as dates
# latin-1 is backward compatible with ASCII
test = pd.read_csv("test.csv", parse_dates=['survey_time'], encoding='latin-1')
# happiness contains the out-of-range value -8; those rows must be dropped
# Exercise: how would you handle it if several values had to be dropped?
train = train[train["happiness"]!=-8].reset_index(drop=True)
train_data_copy = train.copy()
target_col = "happiness" # target column
target = train_data_copy[target_col]
del train_data_copy[target_col] # drop the target column, keeping only the features
data = pd.concat([train_data_copy, test], axis=0, ignore_index=True)
# the training and test sets are now merged into one frame
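One possible answer to the exercise above is `Series.isin`, which tests membership in a list of values. This is a minimal sketch on a toy frame; the list `invalid` is a hypothetical set of codes, not something the dataset requires.

```python
import pandas as pd

# Toy data standing in for the training set
toy = pd.DataFrame({"happiness": [4, -8, 5, -1, 3]})

# Hypothetical list of values to drop
invalid = [-8, -1]

# Keep only rows whose happiness is NOT in the invalid list
cleaned = toy[~toy["happiness"].isin(invalid)].reset_index(drop=True)
print(cleaned["happiness"].tolist())
```

With a single invalid value this reduces to the `!= -8` filter used above.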
2.2 Inspecting Basic Data Information
train.happiness.describe()
# summary statistics of the target
count 7988.000000
mean 3.867927
std 0.818717
min 1.000000
25% 4.000000
50% 4.000000
75% 4.000000
max 5.000000
Name: happiness, dtype: float64
# Inspect the fields in detail
pd.set_option("display.max_info_columns", 100) # change 100 to 200 to list every column
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10956 entries, 0 to 10955
Columns: 139 entries, id to public_service_9
dtypes: datetime64[ns](1), float64(26), int64(109), object(3)
memory usage: 11.6+ MB
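After filtering, the happiness values shown above all lie between 1 and 5, so they are valid.
2.3 Data Preprocessing
First, handle the negative values that recur throughout the data. The only negative values are -1, -2, -3, and -8, so each can be handled separately.
1) Checking the data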
# Negative values are treated as problematic entries; the rows are kept, not deleted
def getres1(row):
    # number of negative integer values in the row
    return len([x for x in row.values if type(x)==int and x<0])
def getres2(row):
    # number of -1 values in the row
    return len([x for x in row.values if type(x)==int and x==-1])
def getres3(row):
    # number of -2 values in the row
    return len([x for x in row.values if type(x)==int and x==-2])
def getres4(row):
    # number of -3 values in the row
    return len([x for x in row.values if type(x)==int and x==-3])
def getres5(row):
    # number of -8 values in the row
    return len([x for x in row.values if type(x)==int and x==-8])
# For each row, count how often any negative value, -1, -2, -3, and -8 occur
data['neg1'] = data[data.columns].apply(lambda row:getres1(row), axis=1)
data.loc[data['neg1']>20,'neg1'] = 20 # smoothing: cap the count at 20
data['neg2'] = data[data.columns].apply(lambda row:getres2(row), axis=1)
data['neg3'] = data[data.columns].apply(lambda row:getres3(row), axis=1)
data['neg4'] = data[data.columns].apply(lambda row:getres4(row), axis=1)
data['neg5'] = data[data.columns].apply(lambda row:getres5(row), axis=1)
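Row-wise `apply` can be slow on a frame of this size. As an alternative sketch (not the author's method), the same kind of count can be computed with vectorized comparisons. It assumes that only integer-typed columns should be counted, which mirrors the `type(x)==int` check above; the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two integer columns and one float column
toy = pd.DataFrame({
    "a": [1, -1, -8],
    "b": [-2, -1, 3],
    "c": [0.5, -1.0, 2.0],  # float column, ignored by the integer filter
})

# Restrict to integer columns (assumption: floats are not counted)
int_cols = toy.select_dtypes(include=[np.integer])

# Count negatives per row and cap at 20, like the neg1 feature
neg_any = (int_cols < 0).sum(axis=1).clip(upper=20)

# Count occurrences of one specific code per row, like neg2
neg_minus1 = (int_cols == -1).sum(axis=1)

print(neg_any.tolist(), neg_minus1.tolist())
```

On real data it is worth comparing the two approaches on a few rows first, since column dtypes decide what gets counted.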
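2) Filling missing values
Missing values are filled with fillna(value), where value is chosen case by case. Most missing values are treated as 0, the hukou location (hukou_loc) is set to 1, and family income is filled with its mean, 66365.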
family_income_mean = data['family_income'].mean()
family_income_mean
66365.63760839798
data['work_status'] = data['work_status'].fillna(0)
data['work_yr'] = data['work_yr'].fillna(0)
data['work_manage'] = data['work_manage'].fillna(0)
data['work_type'] = data['work_type'].fillna(0)
data['edu_yr'] = data['edu_yr'].fillna(0)
data['edu_status'] = data['edu_status'].fillna(0)
data['s_work_type'] = data['s_work_type'].fillna(0)
data['s_work_status'] = data['s_work_status'].fillna(0)
data['s_political'] = data['s_political'].fillna(0)
data['s_hukou'] = data['s_hukou'].fillna(0)
data['s_income'] = data['s_income'].fillna(0)
data['s_birth'] = data['s_birth'].fillna(0)
data['s_edu'] = data['s_edu'].fillna(0)
data['s_work_exper'] = data['s_work_exper'].fillna(0)
data['minor_child'] = data['minor_child'].fillna(0)
data['marital_now'] = data['marital_now'].fillna(0)
data['marital_1st'] = data['marital_1st'].fillna(0)
data['social_neighbor']=data['social_neighbor'].fillna(0)
data['social_friend']=data['social_friend'].fillna(0)
data['hukou_loc']=data['hukou_loc'].fillna(1) # the minimum is 1, meaning the respondent has a hukou
data['family_income']=data['family_income'].fillna(66365)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10956 entries, 0 to 10955
Columns: 144 entries, id to neg5
dtypes: datetime64[ns](1), float64(26), int64(114), object(3)
memory usage: 12.0+ MB
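The column-by-column `fillna` calls above can also be expressed with a single dictionary, since `DataFrame.fillna` accepts a mapping from column name to fill value. This is a minimal sketch on a toy frame with three of the columns; the sample values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the columns filled above
toy = pd.DataFrame({
    "work_yr": [3.0, np.nan],
    "hukou_loc": [np.nan, 2.0],
    "family_income": [50000.0, np.nan],
})

# One fill value per column
fill_values = {"work_yr": 0, "hukou_loc": 1, "family_income": 66365}
filled = toy.fillna(value=fill_values)
print(filled)
```

Keeping the fill values in one dictionary makes it easier to review which default each column receives.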
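Fields in special formats are handled separately. First compute each respondent's age from the survey year and the birth year, then split the continuous age into 6 bins.
# 145. survey_time
# errors='coerce' turns unparseable dates into NaT instead of raising an error
data['survey_time'] = pd.to_datetime(data['survey_time'], format='%Y-%m-%d', errors='coerce')
data['survey_time'] = data['survey_time'].dt.year # keep only the survey year
data['age'] = data['survey_time']-data['birth'] # age at survey time
data[['age','survey_time','birth']]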
| | age | survey_time | birth |
|---|---|---|---|
| 0 | 56 | 2015 | 1959 |
| 1 | 23 | 2015 | 1992 |
| 2 | 48 | 2015 | 1967 |
| 3 | 72 | 2015 | 1943 |
| 4 | 21 | 2015 | 1994 |
| ... | ... | ... | ... |
| 10951 | 69 | 2015 | 1946 |
| 10952 | 38 | 2015 | 1977 |
| 10953 | 47 | 2015 | 1968 |
| 10954 | 65 | 2015 | 1950 |
| 10955 | 74 | 2015 | 1941 |
10956 rows × 3 columns
# 146. Bin the age into groups
bins = [0, 17, 26, 34, 50, 63, 100]
data['age_bin'] = pd.cut(data['age'], bins, labels=[0,1,2,3,4,5])
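To see how the bin edges behave, here is a small sketch that applies the same `bins` to a few made-up ages. By default `pd.cut` uses right-closed intervals, so 17 falls in the first bin, (0, 17].

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([16, 17, 23, 48, 72])
bins = [0, 17, 26, 34, 50, 63, 100]

# Intervals are (0,17], (17,26], (26,34], (34,50], (50,63], (63,100]
age_bin = pd.cut(ages, bins, labels=[0, 1, 2, 3, 4, 5])
print(age_bin.tolist())

# Ages outside (0, 100] fall in no interval and become NaN
outside = pd.cut(pd.Series([0, 101]), bins, labels=[0, 1, 2, 3, 4, 5])
print(outside.isna().tolist())
```

If the data may contain ages above 100, the last edge would need to be raised to avoid introducing missing values.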
Handling invalid values in the other fields
# Religion: treat negative values as "no religious belief", and set religion_freq to 1 (never attends religious activities)
data.loc[data['religion']<0, 'religion'] = 1
data.loc[data['religion_freq']<0, 'religion_freq'] = 1
# Education level
data.loc[data['edu']<0, 'edu'] = 4 # junior high school
data.loc[data['edu_status']<0,'edu_status'] = 0
data.loc[data['edu_yr']<0,'edu_yr'] = 0
# Personal income
data.loc[data['income']<0,'income'] = 0 # assume no income
# Political affiliation
data.loc[data['political']<0,'political'] = 1 # assume "the masses" (no party affiliation)
# Weight (weight_jin): double implausibly low values
data.loc[(data['weight_jin']<=80)&(data['height_cm']>=160),'weight_jin']= data['weight_jin']*2
data.loc[data['weight_jin']<=60,'weight_jin']= data['weight_jin']*2 # author's own judgment: an adult is unlikely to weigh 60 jin or less
# Height
data.loc[data['height_cm']<150,'height_cm'] = 150 # based on realistic adult heights
# Health
data.loc[data['health']<0,'health'] = 4 # assume fairly healthy
data.loc[data['health_problem']<0,'health_problem'] = 4
# Depression
data.loc[data['depression']<0,'depression'] = 4 # assume most people are rarely depressed
# Media use
data.loc[data['media_1']<0,'media_1'] = 1 # all set to "never"
data.loc[data['media_2']<0,'media_2'] = 1
data.loc[data['media_3']<0,'media_3'] = 1
data.loc[data['media_4']<0,'media_4'] = 1
data.loc[data['media_5']<0,'media_5'] = 1
data.loc[data['media_6']<0,'media_6'] = 1
# Leisure activities
data.loc[data['leisure_1']<0,'leisure_1'] = 1 # values chosen by the author's own judgment
data.loc[data['leisure_2']<0,'leisure_2'] = 5
data.loc[data['leisure_3']<0,'leisure_3'] = 3
Use the mode (mode()) to correct invalid values
# mode() returns a Series, so take its first element to get a scalar
data.loc[data['leisure_4']<0,'leisure_4'] = data['leisure_4'].mode()[0]
data.loc[data['leisure_5']<0,'leisure_5'] = data['leisure_5'].mode()[0]
data.loc[data['leisure_6']<0,'leisure_6'] = data['leisure_6'].mode()[0]
data.loc[data['leisure_7']<0,'leisure_7'] = data['leisure_7'].mode()[0]
data.loc[data['leisure_8']<0,'leisure_8'] = data['leisure_8'].mode()[0]
data.loc[data['leisure_9']<0,'leisure_9'] = data['leisure_9'].mode()[0]
data.loc[data['leisure_10']<0,'leisure_10'] = data['leisure_10'].mode()[0]
data.loc[data['leisure_11']<0,'leisure_11'] = data['leisure_11'].mode()[0]
data.loc[data['leisure_12']<0,'leisure_12'] = data['leisure_12'].mode()[0]
data.loc[data['socialize']<0,'socialize'] = 2 # seldom
data.loc[data['relax']<0,'relax'] = 4 # often
data.loc[data['learn']<0,'learn'] = 1 # never
# Social life
data.loc[data['social_neighbor']<0,'social_neighbor'] = 0
data.loc[data['social_friend']<0,'social_friend'] = 0
data.loc[data['socia_outing']<0,'socia_outing'] = 1
data.loc[data['neighbor_familiarity']<0,'neighbor_familiarity']= 4
# Social equity
data.loc[data['equity']<0,'equity'] = 4
# Social class
data.loc[data['class_10_before']<0,'class_10_before'] = 3
data.loc[data['class']<0,'class'] = 5
data.loc[data['class_10_after']<0,'class_10_after'] = 5
data.loc[data['class_14']<0,'class_14'] = 2
# Work situation
data.loc[data['work_status']<0,'work_status'] = 0
data.loc[data['work_yr']<0,'work_yr'] = 0
data.loc[data['work_manage']<0,'work_manage'] = 0
data.loc[data['work_type']<0,'work_type'] = 0
# Social insurance
data.loc[data['insur_1']<0,'insur_1'] = 1
data.loc[data['insur_2']<0,'insur_2'] = 1
data.loc[data['insur_3']<0,'insur_3'] = 1
data.loc[data['insur_4']<0,'insur_4'] = 1
# values equal to 0 stay 0 (these four assignments leave the data unchanged)
data.loc[data['insur_1']==0,'insur_1'] = 0
data.loc[data['insur_2']==0,'insur_2'] = 0
data.loc[data['insur_3']==0,'insur_3'] = 0
data.loc[data['insur_4']==0,'insur_4'] = 0
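A note on the mode-based correction used for leisure_4 through leisure_12: `Series.mode()` returns a Series rather than a scalar, because several values can tie for most frequent. A masked assignment with `.loc` aligns a Series on its index, so a scalar is normally extracted first. The following is a minimal sketch on made-up data.

```python
import pandas as pd

# Toy column with two invalid (negative) entries
toy = pd.DataFrame({"leisure_4": [3, -8, 3, 5, -1]})

modes = toy["leisure_4"].mode()  # a Series, even when there is a single mode
most_common = modes.iloc[0]      # scalar: the first (smallest) mode

# Overwrite the negative entries with the scalar mode
toy.loc[toy["leisure_4"] < 0, "leisure_4"] = most_common
print(toy["leisure_4"].tolist())
```

One could also compute the mode over the non-negative values only, so that an invalid code can never be chosen as the replacement; that is a design choice beyond what the original code does.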
Fill with the mean (mean())
# Family situation
data.loc[data['family_income']<0,'family_income'] = family_income_mean
data.loc[data['family_m']<0,'family_m'] = 2
data.loc[data['family_status']<0,'family_status'] = 3
data.loc[data['house']<0,'house'] = 1
data.loc[data['car']<0,'car'] = 0
data.loc[data['car']==2,'car'] = 0
data.loc[data['son']<0,'son'] = 1
data.loc[data['daughter']<0,'daughter'] = 0
data.loc[data['minor_child']<0,'minor_child'] = 0
# Marriage
data.loc[data['marital_1st']<0,'marital_1st'] = 0
data.loc[data['marital_now']<0,'marital_now'] = 0
# Spouse
data.loc[data['s_birth']<0,'s_birth'] = 0
data<