Before training a model, we usually need to do reasonable feature engineering suited to each variable. After reading the literature and experimenting on my own, I have put together some notes on feature engineering for multi-category variables.
Data source (Adult dataset): https://archive.ics.uci.edu/ml/datasets/Adult
You can also download the version I have already tidied up:
Link: https://pan.baidu.com/s/1UhGTfvZqPHUC6jnukfTcRg
Extraction code: j4C9
Python
First, let's take a look at the basic structure of the dataset.
import pandas as pd
import numpy as np
file = 'C:/Varian/Data_of_training_model/adult/train.csv'
data = pd.read_csv(file, sep=',')
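# First, use the function from the previous post to split the columns into continuous and categorical variables
def classify(dataframe):
    continuous_variables = []
    categorical_variables = []
    for i in dataframe.columns:
        # use the function argument rather than the global `data`
        if dataframe[i].dtype == object:
            categorical_variables.append(i)
        else:
            continuous_variables.append(i)
    return continuous_variables, categorical_variables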
continuous_variables, categorical_variables = classify(data)
print(continuous_variables)
'''
['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
'''
print(categorical_variables)
'''
['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country', 'income']
'''
As you can see, this dataset contains a lot of discrete variables.
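The same split can also be obtained without a loop. The following is only a sketch, assuming pandas' `select_dtypes` is acceptable here; the small frame `df` and its values are made up for illustration and merely stand in for the Adult data.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the Adult data
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov", " Private", " Private"],
    "hours-per-week": [40, 13, 40],
    "income": [" <=50K", " <=50K", " >50K"],
})

# Numeric columns are treated as continuous, everything else as categorical
continuous_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)
print(categorical_cols)
```

Selecting by "number" rather than comparing against `object` also keeps working if string columns are stored with a dedicated string dtype.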
# Next, check how many categories each categorical variable has
for i in categorical_variables:
    print('variable_name: {} \n {} \n category_number: {} \n'.format(i, data[i].value_counts(), len(data[i].value_counts())))
variable_name: workclass
Private 33906
Self-emp-not-inc 3862
Local-gov 3136
? 2799
State-gov 1981
Self-emp-inc 1695
Federal-gov 1432
Without-pay 21
Never-worked 10
Name: workclass, dtype: int64
category_number: 9
variable_name: education
HS-grad 15784
Some-college 10878
Bachelors 8025
Masters 2657
Assoc-voc 2061
11th 1812
Assoc-acdm 1601
10th 1389
7th-8th 955
Prof-school 834
9th 756
12th 657
Doctorate 594
5th-6th 509
1st-4th 247
Preschool 83
Name: education, dtype: int64
category_number: 16
variable_name: marital-status
Married-civ-spouse 22379
Never-married 16117
Divorced 6633
Separated 1530
Widowed 1518
Married-spouse-absent 628
Married-AF-spouse 37
Name: marital-status, dtype: int64
category_number: 7
variable_name: occupation
Prof-specialty 6172
Craft-repair 6112
Exec-managerial 6086
Adm-clerical 5611
Sales 5504
Other-service 4923
Machine-op-inspct 3022
? 2809
Transport-moving 2355
Handlers-cleaners 2072
Farming-fishing 1490
Tech-support 1446
Protective-serv 983
Priv-house-serv 242
Armed-Forces 15
Name: occupation, dtype: int64
category_number: 15
variable_name: relationship
Husband 19716
Not-in-family 12583
Own-child 7581
Unmarried 5125
Wife 2331
Other-relative 1506
Name: relationship, dtype: int64
category_number: 6
variable_name: race
White 41762
Black 4685
Asian-Pac-Islander 1519
Amer-Indian-Eskimo 470
Other 406
Name: race, dtype: int64
category_number: 5
variable_name: sex
Male 32650
Female 16192
Name: sex, dtype: int64
category_number: 2
variable_name: native-country
United-States 43832
Mexico 951
? 857
Philippines 295
Germany 206
Puerto-Rico 184
Canada 182
El-Salvador 155
India 151
Cuba 138
England 127
China 122
South 115
Jamaica 106
Italy 105
Dominican-Republic 103
Japan 92
Guatemala 88
Poland 87
Vietnam 86
Columbia 85
Haiti 75
Portugal 67
Taiwan 65
Iran 59
Nicaragua 49
Greece 49
Peru 46
Ecuador 45
France 38
Ireland 37
Hong 30
Thailand 30
Cambodia 28
Trinadad&Tobago 27
Outlying-US(Guam-USVI-etc) 23
Yugoslavia 23
Laos 23
Scotland 21
Honduras 20
Hungary 19
Holand-Netherlands 1
Name: native-country, dtype: int64
category_number: 42
variable_name: income
<=50K 24720
<=50K. 12435
>50K 7841
>50K. 3846
Name: income, dtype: int64
category_number: 4
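When only the number of categories per variable is needed, the counts can be collected into one compact table. This is a rough sketch using `DataFrame.nunique`; the frame `df` below is hypothetical and much smaller than the real data.

```python
import pandas as pd

# Hypothetical miniature frame with two categorical columns
df = pd.DataFrame({
    "sex": [" Male", " Female", " Male", " Male"],
    "race": [" White", " Black", " White", " Other"],
})

# Number of distinct categories per column, largest first
cardinality = df.nunique().sort_values(ascending=False)
print(cardinality)
```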
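Looking at the results: wow, most of the discrete variables have a lot of categories. The dependent variable income also shows four categories instead of the expected two, because I merged the training and validation sets and the labels from one of them carry a trailing period. This can be cleaned up with the replace method (DataFrame.replace(to_replace, value, ...)).
Modifying the values in a column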
# Assign the result back instead of calling replace with inplace=True on a selected column
data['income'] = data['income'].replace({' >50K.': ' >50K', ' <=50K.': ' <=50K'})
print(data['income'].value_counts())
'''
<=50K 37155
>50K 11687
Name: income, dtype: int64
'''
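Since the two unwanted labels differ from the correct ones only by a trailing period, the cleanup can also be written without listing each label. A minimal sketch, assuming the string accessor `Series.str.rstrip` is acceptable; the Series `income` below is a made-up stand-in for the real column.

```python
import pandas as pd

# Hypothetical sample of the four labels seen in the merged data
income = pd.Series([" <=50K", " <=50K.", " >50K", " >50K."])

# Strip only the trailing period; the leading space is kept so the
# values stay identical to the ones used in the rest of the post
cleaned = income.str.rstrip(".")
print(cleaned.value_counts())
```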
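Merging categories with a dictionary
The variable education also has many categories, but some of them can reasonably be grouped together. Series.map(arg, na_action=None) combined with a dictionary does exactly this. Note that any value missing from the dictionary is mapped to NaN.
Original category -> New category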
Preschool -> Dropout
10th -> Dropout
11th -> Dropout
12th -> Dropout
1st-4th -> Dropout
5th-6th -> Dropout
7th-8th -> Dropout
9th -> Dropout
HS-grad -> HighGrad
Some-college -> Community
Assoc-acdm -> Community
Assoc-voc -> Community
Bachelors -> Bachelors
Masters -> Masters
Prof-school -> Masters
Doctorate -> PhD
data['education'] = data['education'].map({
    ' Preschool': ' Dropout',
    ' 10th': ' Dropout',
    ' 11th': ' Dropout',
    ' 12th': ' Dropout',
    ' 1st-4th': ' Dropout',
    ' 5th-6th': ' Dropout',
    ' 7th-8th': ' Dropout',
    ' 9th': ' Dropout',
    ' HS-goad': ' HighGrad',  # deliberately misspelled (should be ' HS-grad') so that NaN values appear for the later demo
    ' Some-college': ' Community',
    ' Assoc-acdm': ' Community',
    ' Assoc-voc': ' Community',
    ' Bachelors': ' Bachelors',
    ' Masters': ' Masters',
    ' Prof-school': ' Masters',
    ' Doctorate': ' PhD',
})
print(data['education'].value_counts())
'''
Community 14540
Bachelors 8025
Dropout 6408
Masters 3491
PhD 594
Name: education, dtype: int64
'''
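The missing HighGrad count above is the effect of the misspelled key: `map` turns every value that has no entry in the dictionary into NaN. The sketch below, on a made-up Series, shows one way to spot such unmapped values and, if wanted, keep the original label instead; the names `mapping`, `unmapped` and `mapped_keep` are only illustrative.

```python
import pandas as pd

# Hypothetical sample; " HS-grad" is left out of the mapping on purpose
education = pd.Series([" HS-grad", " 10th", " Bachelors", " 9th"])
mapping = {" 10th": " Dropout", " 9th": " Dropout", " Bachelors": " Bachelors"}

mapped = education.map(mapping)

# Original labels that the dictionary did not cover
unmapped = education[mapped.isna()].unique()
print(unmapped)

# Optional: fall back to the original label where no mapping exists
mapped_keep = mapped.fillna(education)
print(mapped_keep.tolist())
```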
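Converting to dummy variables
When converting to dummy variables, the thing to check is whether multicollinearity matters. If you are building a regression model, it does: besides the conversion, you also need to drop one of the generated dummy columns (pass drop_first = True). Otherwise, the conversion alone is enough.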
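# For a regression model
# get_dummies returns several columns, so keep the result in its own DataFrame instead of assigning it to data['education']
education_dummies = pd.get_dummies(data['education'], prefix = 'education', drop_first = True)
print(education_dummies.head(10))
'''
# Compared with the full encoding, the column for the category Bachelors has been dropped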
education_ Community education_ Dropout education_ Masters \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 1 0
4 0 0 0
5 0 0 1
6 0 1 0
7 0 0 0
8 0 0 1
9 0 0 0
education_ PhD
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
'''
# For other kinds of models
education_dummies = pd.get_dummies(data['education'], prefix = 'education')
print(education_dummies.head(10))
'''
education_ Bachelors education_ Community education_ Dropout \
0 1 0 0
1 1 0 0
2 0 0 0
3 0 0 1
4 1 0 0
5 0 0 0
6 0 0 1
7 0 0 0
8 0 0 0
9 1 0 0
education_ Masters education_ PhD
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 1 0
6 0 0
7 0 0
8 1 0
9 0 0
'''
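To actually use the dummy columns for modelling, they are usually joined back onto the remaining features. A small sketch of that step, assuming `pd.concat` along the columns; the frame `df` is hypothetical and `dtype=int` is only there to get 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical miniature frame
df = pd.DataFrame({
    "age": [39, 53, 37],
    "education": ["Bachelors", "Dropout", "Masters"],
})

# drop_first=True removes the first category (alphabetically: Bachelors)
dummies = pd.get_dummies(df["education"], prefix="education",
                         drop_first=True, dtype=int)

# Replace the original column by its dummy columns
df_model = pd.concat([df.drop(columns="education"), dummies], axis=1)
print(df_model.columns.tolist())
```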
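Creating a new column from several columns with conditional logic
Here I mean logic that needs several levels of judgment, the kind that normally takes nested conditions to express. Simple logic (something a single if/else can handle) can be written with np.where or np.select. Complex logic can also be written with np.select, but it requires you to list every case in full, which makes the code long and ugly. np.where can be nested, so it can express multi-level logic as well; conditions joined with | (or) and & (and) also work, but only when each comparison is wrapped in parentheses, because | and & bind more tightly than == (an example of the resulting error is shown later).
So how do we handle complex logic that needs several levels of judgment? At the time I was in a hurry and had no time to look things up (that is, to copy code), and I was asked to avoid writing new functions where possible, so I went back to basics: however complex the logic is, creating a new column always comes down to producing its elements one by one, row by row.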
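# Pick three columns for the demonstration
selected_col = ['age', 'race', 'education']
# Keep only the first ten rows; .copy() avoids SettingWithCopyWarning when a column is added later
test_data = data[selected_col].head(10).copy()
test_data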
'''
age race education
0 39 White Bachelors
1 50 White Bachelors
2 38 White NaN
3 53 Black Dropout
4 28 Black Bachelors
5 37 White Masters
6 49 Black Dropout
7 52 White NaN
8 31 White Masters
9 42 White Bachelors
'''
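'''
Generate the new column test_col with the following logic:
1. When age > 37: if race = White, then test_col = Out if education = NaN,
   otherwise test_col = Good;
   if race = Black, then test_col = Out if education = Dropout or NaN,
   otherwise test_col = Good.
2. When age <= 37: if education = Masters, then test_col = Best;
   if education = Bachelors, then test_col = Impressive.
# The rules are arbitrary and for demonstration only; no racial prejudice is intended
'''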
# Start with an empty list to hold the values of test_col
test_col = []
# Building a column from several columns is really a row-by-row logical check
for i in range(len(test_data)):
    if test_data.iloc[i]['age'] > 37:  # older than 37
        if test_data.iloc[i]['race'] == ' White':  # and White
            if pd.isnull(test_data.iloc[i]['education']):  # education is missing
                test_col.append('Out')
            else:  # every other case
                test_col.append('Good')
        else:  # any race other than White
            if pd.isnull(test_data.iloc[i]['education']) or test_data.iloc[i]['education'] == ' Dropout':  # education is missing or Dropout
                test_col.append('Out')
            else:  # everything except missing and Dropout
                test_col.append('Good')
    else:  # age <= 37
        if test_data.iloc[i]['education'] == ' Masters':
            test_col.append('Best')
        else:  # everything except Masters
            test_col.append('Impressive')
# Convert the list to a DataFrame
test_col = pd.DataFrame(test_col, columns=['test_col'])
# Concatenate the two DataFrames column-wise
test_data = pd.concat([test_data, test_col], axis = 1)
print(test_data)
'''
age race education test_col
0 39 White Bachelors Good
1 50 White Bachelors Good
2 38 White NaN Out
3 53 Black Dropout Out
4 28 Black Bachelors Impressive
5 37 White Masters Best
6 49 Black Dropout Out
7 52 White NaN Out
8 31 White Masters Best
9 42 White Bachelors Good
'''
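To summarize:
- Compute the length of the DataFrame and loop over it with an index i
- Use DataFrame.iloc[i]['colname'] to get the element in row i of column colname
- Use nested if/else statements for the logic
🔺 Note: the generated column must have the same length as the original data; otherwise the rows will not line up (and assigning it directly as a column raises an error).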
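If writing a small function is allowed, the row-by-row pattern summarized above can be handed to `DataFrame.apply(axis=1)`, which takes care of the looping and of the length of the result. This is only a sketch: the helper `label_row` and the five-row frame `df` are hypothetical, and the rule follows the loop shown earlier (any non-Masters education at age <= 37 gives Impressive).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same three columns
df = pd.DataFrame({
    "age": [39, 38, 53, 28, 37],
    "race": [" White", " White", " Black", " Black", " White"],
    "education": [" Bachelors", np.nan, " Dropout", " Bachelors", " Masters"],
})

def label_row(row):
    """Hypothetical helper: the nested rule from the loop, applied to one row."""
    edu_missing = pd.isnull(row["education"])
    if row["age"] > 37:
        if row["race"] == " White":
            return "Out" if edu_missing else "Good"
        # any race other than White
        return "Out" if edu_missing or row["education"] == " Dropout" else "Good"
    return "Best" if row["education"] == " Masters" else "Impressive"

# apply(axis=1) calls label_row once per row and returns an aligned Series
df["test_col"] = df.apply(label_row, axis=1)
print(df)
```

Like the explicit loop, this runs Python code once per row, so it is convenient rather than fast.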
Now let's see how to do the same thing with np.where(condition, yes, no):
selected_col = ['age', 'race', 'education']
test_data = data[selected_col].head(10).copy()
test_data['test_col'] = np.where(test_data['age']>37,
    np.where(test_data['race']==' White',
        np.where(pd.isnull(test_data['education']), 'Out', 'Good'),
        # Note: the null check is left out here; the result still matches only because none of these ten rows has a non-White race with a missing education
        np.where(test_data['education']==' Dropout', 'Out', 'Good')),
    np.where(test_data['education']==' Masters', 'Best',
        np.where(test_data['education']==' Bachelors', 'Impressive', 'else')))
# Concise, but less readable
print(test_data)
'''
age race education test_col
0 39 White Bachelors Good
1 50 White Bachelors Good
2 38 White NaN Out
3 53 Black Dropout Out
4 28 Black Bachelors Impressive
5 37 White Masters Best
6 49 Black Dropout Out
7 52 White NaN Out
8 31 White Masters Best
9 42 White Bachelors Good
'''
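For comparison, the same rule can be written with `np.select`, where every case is listed explicitly and the first matching condition wins. This is a sketch on a made-up five-row frame; the variable names are illustrative, and the final default follows the loop version (everything that is not Masters at age <= 37 becomes Impressive).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same three columns
df = pd.DataFrame({
    "age": [39, 38, 53, 28, 37],
    "race": [" White", " White", " Black", " Black", " White"],
    "education": [" Bachelors", np.nan, " Dropout", " Bachelors", " Masters"],
})

older = df["age"] > 37
white = df["race"] == " White"
edu_missing = df["education"].isna()
dropout = df["education"] == " Dropout"
masters = df["education"] == " Masters"

# Conditions are checked in order; the first True one decides the value
conditions = [
    older & white & edu_missing,
    older & ~white & (edu_missing | dropout),
    older,      # remaining rows with age > 37
    masters,    # rows with age <= 37 and a Masters degree
]
choices = ["Out", "Out", "Good", "Best"]

df["test_col"] = np.select(conditions, choices, default="Impressive")
print(df)
```

Naming the individual conditions first keeps the case list readable, at the cost of a few extra lines.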
Next, let's try a condition that combines comparisons with | (or) and & (and):
selected_col = ['age', 'race', 'education']
test_data = data[selected_col]
test_data = test_data.head(10)
print(test_data)
'''
age race education
0 39 White Bachelors
1 50 White Bachelors
2 38 White NaN
3 53 Black Dropout
4 28 Black Bachelors
5 37 White Masters
6 49 Black Dropout
7 52 White NaN
8 31 White Masters
9 42 White Bachelors
'''
# No parentheses around the comparison: | binds more tightly than ==
test_data['new_col'] = np.where(test_data['race']==' White'| pd.isnull(test_data['education']), 'Yes', 'No')
# Error message
TypeError: ufunc 'bitwise_or' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
----------
# The same happens with &
test_data['new_col'] = np.where(test_data['race']==' White'& pd.isnull(test_data['education']), 'Yes', 'No')
# Error message
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
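So the conclusion is: the error does not come from np.where itself. In Python, | and & bind more tightly than ==, so test_data['race']==' White'| pd.isnull(...) is read as test_data['race'] == (' White' | pd.isnull(...)), and a string cannot be combined bitwise with a boolean Series. Once each comparison is wrapped in parentheses, as in (test_data['race']==' White') | pd.isnull(test_data['education']), np.where accepts both | and &.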
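The TypeError shown above is a matter of operator precedence: in Python, | and & bind more tightly than ==, so each comparison has to be wrapped in parentheses before the results are combined. A minimal sketch of the parenthesized form, on a made-up four-row frame standing in for test_data; the column names `either` and `both` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame
df = pd.DataFrame({
    "race": [" White", " Black", " Black", " White"],
    "education": [" Bachelors", " Dropout", np.nan, np.nan],
})

# Parenthesize every comparison, then combine the boolean Series
df["either"] = np.where((df["race"] == " White") | df["education"].isna(), "Yes", "No")
df["both"] = np.where((df["race"] == " White") & df["education"].isna(), "Yes", "No")
print(df)
```

The element-wise operators | and & are required here; the keywords `or` and `and` only work on single truth values, which is why they fit the row-by-row loop but not whole columns.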
The row-by-row approach handles or and and easily, using Python's own keywords:
new_col = []  # collect the new values here
for i in range(len(test_data)):
    if test_data.iloc[i]['race'] == ' White' or pd.isnull(test_data.iloc[i]['education']):
        new_col.append('Yes')
    else:
        new_col.append('No')
new_col = pd.DataFrame(new_col, columns=['new_col'])
test_data = pd.concat([test_data, new_col], axis=1)
print(test_data)
'''
age race education new_col
0 39 White Bachelors Yes
1 50 White Bachelors Yes
2 38 White NaN Yes
3 53 Black Dropout No
4 28 Black Bachelors No
5 37 White Masters Yes
6 49 Black Dropout No
7 52 White NaN Yes
8 31 White Masters Yes
9 42 White Bachelors Yes
'''
PS: With the comparisons parenthesized, np.where in Python plays much the same role as ifelse in R.
R
In R, the dplyr package makes all of the above much simpler.
select_if(data, function) selects the columns that satisfy function.
library(dplyr)
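# stringsAsFactors = TRUE is needed from R 4.0 on, where strings are no longer converted to factors by default
data = read.csv('C:/Varian/Data_of_training_model/adult/train.csv', sep = ',', header = TRUE, stringsAsFactors = TRUE)
# List the continuous variables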
continuous = select_if(data, is.numeric)
colnames(continuous)
[1] "age" "fnlwgt" "education.num" "capital.gain"
[5] "capital.loss" "hours.per.week"
# List the discrete variables
categorical = select_if(data, is.factor)
colnames(categorical)
[1] "workclass" "education" "marital.status" "occupation"
[5] "relationship" "race" "sex" "native.country"
[9] "income"
Modifying the values in a column
Method 1:
Use mutate() to modify a column, with case_when() handling each of the individual category values.
data = data %>%
mutate(income = case_when(
income == ' >50K.'|income == ' >50K' ~ '>50k',
income == ' <=50K.'|income == ' <=50K' ~ '<=50k'
))
table(data$income)
<=50k >50k
37155 11687
Method 2:
Use ifelse(condition, yes, no), which returns yes where condition holds and no otherwise.
data = read.csv('C:/Varian/Data_of_training_model/adult/train.csv', sep = ',', header = TRUE)
data$income = ifelse(data$income==' >50K.'|data$income==' >50K', '>50k',
ifelse(data$income==' <=50K.'|data$income==' <=50K', '<=50k', 'else'))
table(data$income)
<=50k >50k
37155 11687
Merging several categories into one
Original category -> New category
Preschool -> Dropout
10th -> Dropout
11th -> Dropout
12th -> Dropout
1st-4th -> Dropout
5th-6th -> Dropout
7th-8th -> Dropout
9th -> Dropout
HS-grad -> HighGrad
Some-college -> Community
Assoc-acdm -> Community
Assoc-voc -> Community
Bachelors -> Bachelors
Masters -> Masters
Prof-school -> Masters
Doctorate -> PhD
Method 1:
data = data%>%
mutate(education = case_when(
education == ' Masters'|education == ' Prof-school' ~ 'Master',
education == ' Bachelors' ~ 'Bachelors',
education == ' Assoc-voc'|education == ' Assoc-acdm' |education == ' Some-college' ~ 'Community',
education == ' HS-grad' ~ 'HighGrad',
education == ' Preschool'|education == ' 10th'|education ==' 11th'|education == ' 12th'|education == ' 1st-4th'|education ==' 5th-6th'|education == ' 7th-8th'|education == ' 9th' ~ 'dropout',
education == ' Doctorate' ~ 'PHD'
))
table(data$education)
Bachelors Community dropout HighGrad Master PHD
8025 14540 6408 15784 3491 594
Method 2:
Use ifelse(condition, yes, no), which returns yes where condition holds and no otherwise.
data <- data %>%
mutate(education = factor(ifelse(education == " Preschool" | education == " 10th" | education == " 11th" | education == " 12th" | education == " 1st-4th" | education == " 5th-6th" | education == " 7th-8th" | education == " 9th", " dropout", ifelse(education == " HS-grad", " HighGrad", ifelse(education == " Some-college" | education == " Assoc-acdm" | education == " Assoc-voc", "Community",
ifelse(education == " Bachelors", "Bachelors",
ifelse(education == " Masters" | education == " Prof-school", "Master", "PhD")))))))
table(data$education)
dropout HighGrad Bachelors Community Master PhD
6408 15784 8025 14540 3491 594
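Creating a new column from several columns with conditional logic
Here too we can use ifelse(condition, yes, no).
I really can't praise ifelse enough! It reads much like the nested np.where from the Python part, and since == binds more tightly than | in R, no extra parentheses are needed.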
data = read.csv('C:/Varian/Data_of_training_model/adult/train.csv', sep = ',', header = TRUE)
data = data%>%
mutate(education = case_when(
education == ' Masters'|education == ' Prof-school' ~ ' Masters',
education == ' Bachelors' ~ ' Bachelors',
education == ' Assoc-voc'|education == ' Assoc-acdm' |education == ' Some-college' ~ 'Community',
education == ' HS-goad' ~ 'HighGrad', # deliberately misspelled, as in the Python part, to produce NA values for the demo below
education == ' Preschool'|education == ' 10th'|education ==' 11th'|education == ' 12th'|education == ' 1st-4th'|education ==' 5th-6th'|education == ' 7th-8th'|education == ' 9th' ~ ' Dropout',
education == ' Doctorate' ~ 'PHD'
))
test_data = subset(data, select = c('age', 'race', 'education'))
test_data = test_data[1:10,]
test_data
test_data$new_col = ifelse(test_data$age>37,
ifelse(test_data$race==' White',
ifelse(is.na(test_data$education), 'Out', 'Good'),
ifelse(test_data$education==' Dropout'|is.na(test_data$education), 'Out', 'Good')),
ifelse(test_data$education==' Masters', 'Best', 'Impressive'))
print(test_data)
X age race education new_col
1 1 39 White Bachelors Good
2 2 50 White Bachelors Good
3 3 38 White <NA> Out
4 4 53 Black Dropout Out
5 5 28 Black Bachelors Impressive
6 6 37 White Masters Best
7 7 49 Black Dropout Out
8 8 52 White <NA> Out
9 9 31 White Masters Best
10 10 42 White Bachelors Good
# Same result as in the Python part. Perfect!
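Converting to dummy variables
In logistic regression with glm(..., family = 'binomial'), the model automatically converts every variable of type factor into dummy variables and removes the collinearity (by dropping the first level) before fitting. I have not yet looked closely at other regression models.
Of course, you can also generate dummy variables with the dummy function from the dummies package: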
library(dummies)
test = c(1,3,3,1,1,1,2)
dummy(test, sep = '_')
test_1 test_2 test_3
[1,] 1 0 0
[2,] 0 0 1
[3,] 0 0 1
[4,] 1 0 0
[5,] 1 0 0
[6,] 1 0 0
[7,] 0 1 0