Handling nominal attributes with scikit-learn's OneHotEncoder in a machine-learning pipeline (Random Forest example)


Real-world machine-learning data usually contains both nominal and numeric attributes. scikit-learn provides methods for numeric attributes, such as normalization, and also a method for nominal attributes: OneHotEncoder. This post shows how to transform nominal data with OneHotEncoder and feed the result through a machine-learning pipeline.

1. Preparing the data

The data used in this post is in CSV format and contains both numeric and nominal attributes. The attributes are described below (ARFF-style header):

@attribute 'birthday' numeric
@attribute 'astrology' {'1','2','3','4','5','6','7','8','9','10','11','12'}
@attribute 'animalsign' {'0','1','2','3','4','5','6','7','8','9','10','11','12'}
@attribute 'height' numeric
@attribute 'degree' {'0','1','2','3','4','5','6','7','8'}
@attribute 'housing' {'0','1','2','3','4'}
@attribute 'marriage' {'0','1','2','3','4'}
@attribute 'income' {'0','1','2','3','4','5','6','7','8','9','10','11','12'}
@attribute 'haveChildren' {'1','2','3','4'}
@attribute 'hasMainPhoto' {'0','1'}
@attribute 'nationality' {'0','1','2','3','4','5','6','7','8','9','10','11','12'}
@attribute 'religion' {'0','1','2','3','4','5','6','7','8','9','10','11','12'}
@attribute 'bodyType' numeric
@attribute 'physicalLooking' numeric
@attribute 'newNature' {'0','1','2','3','4','5','6','7','8'}
@attribute 'industry' {'0','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30'}
@attribute 'newWorkStatus' {'0','1','2','3','4','5','6','7','8','9'}
@attribute 'newCar' {'0','1','2','3','4'}
@attribute 'isCreditedBySfz' {'0','1'}
@attribute 'cregisterTime' numeric
@attribute 'age' numeric
@attribute 'housestatus' {'0','1','2','3','4','5','6','7','8'}
@attribute 'photonum' numeric
@attribute 'msgcnt' numeric
@attribute 'himsgcnt' numeric
@attribute 'huifumsgcnt' numeric
@attribute 'receivemsg' numeric
@attribute 'viewcnt' numeric
@attribute 'beviewcnt' numeric
@attribute 'focuscnt' numeric
@attribute 'befocuscnt' numeric
@attribute 'class' {'0','1'}

A few sample rows of the data:

@data
1973,6,2,162,6,1,1,6,11,1,1,8613,1,1,3,8,2,23,0,0,1,113,42,1,3,0,2,0,5,0,27,0,0,1
1979,7,8,172,4,4,2,5,11,1,1,8651,3,6,7,6,7,1,7,0,1,113,36,4,2,0,1,8,20,28,98,0,0,1
1980,3,9,175,6,1,1,7,11,1,1,8637,1,1,7,?,0,24,0,0,0,113,35,1,0,0,0,1,3,1,20,0,1,0
1981,7,10,175,6,4,1,7,11,1,0,8623,1,1,4,8,0,5,0,2,1,113,34,4,0,0,0,0,0,0,0,0,0,0
1977,9,6,165,0,1,4,0,11,1,0,8632,1,1,4,7,7,7,8,0,1,113,38,1,0,0,0,0,9,0,20,0,2,0
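The categorical column indices and category counts that OneHotEncoder will need later can be derived from the header itself. A minimal sketch, using a trimmed stand-in for the attribute list above (deriving each count as max value + 1 matches the old-API n_values convention used in step 3):

```python
import re

# trimmed stand-in for the full @attribute header above
header = """@attribute 'birthday' numeric
@attribute 'astrology' {'1','2','3','4','5','6','7','8','9','10','11','12'}
@attribute 'height' numeric
@attribute 'degree' {'0','1','2','3','4','5','6','7','8'}"""

categorical_idx = []
n_values = []
for i, line in enumerate(header.splitlines()):
    m = re.match(r"@attribute\s+'[^']+'\s+\{(.*)\}", line)
    if m:   # a braced value list marks a nominal attribute
        vals = [int(v.strip().strip("'")) for v in m.group(1).split(',')]
        categorical_idx.append(i)
        # old-API n_values expects every code to lie in range(n),
        # hence max value + 1 (astrology runs 1..12, giving 13)
        n_values.append(max(vals) + 1)

print(categorical_idx, n_values)  # [1, 3] [13, 9]
```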

2. Read the data into memory and extract the training features, training targets, test features, and test targets. For this post:

##train data
train_feature = []
train_target = []
with open("../../data/data/train.csv", "r") as train_data:
    for line in train_data:
        temp = line.strip().split(',')
        train_feature.append([int(x) for x in temp[:-1]])   # feature columns
        train_target.append(int(temp[-1]))                  # class label
##test data
test_feature = []
test_target = []
with open("../../data/data/test.csv", "r") as test_data:
    for line in test_data:
        temp = line.strip().split(',')
        test_feature.append([int(x) for x in temp[:-1]])
        test_target.append(int(temp[-1]))
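One caveat the reading loop glosses over: the sample rows contain '?' as a missing-value marker (see the third data row above), and int('?') raises ValueError. A minimal sketch of one way to cope, where the sentinel value 0 is this sketch's assumption, not the article's choice:

```python
def parse_row(line, missing_sentinel=0):
    # replace the ARFF-style '?' missing-value marker before int();
    # the sentinel 0 is this sketch's assumption -- imputation with a
    # mean/mode or a dedicated category may suit the data better
    return [missing_sentinel if tok == '?' else int(tok)
            for tok in line.strip().split(',')]

print(parse_row("1980,3,?,175"))  # [1980, 3, 0, 175]
```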



3. Use OneHotEncoder to transform the categorical features. For this post:

enc = OneHotEncoder(categorical_features=np.array([1,2,4,5,6,7,8,9,10,11,14,15,16,17,18,21]),n_values=[13,13,9,5,5,13,5,2,13,13,9,31,10,5,2,9])

categorical_features holds the column indices of the categorical attributes; n_values gives, for each of those columns, how many category values it can take (every value must lie in range(n_values)). Note that both parameters belong to the older OneHotEncoder API: they were deprecated in scikit-learn 0.20 and removed in 0.22, where the categories parameter and ColumnTransformer replace them.

Note: categorical attributes should be coded as consecutive integers starting from 0, e.g. (0,1,2,3,4,5); codes like (17,19,50,100,1000) should be avoided. This follows from how OneHotEncoder maps category values to output columns: a value v selects output column v, so sparse, large codes either waste columns or fall outside the declared n_values and raise an error.
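To see why small 0-based codes matter, here is what one-hot encoding does mechanically (a hand-rolled sketch, not scikit-learn's implementation): each value picks the column with its own index, so a category coded as 1000 would force at least 1001 mostly-zero columns.

```python
import numpy as np

def one_hot(column, n_values):
    # value v in a row selects column v of the output; hence every
    # value must lie in range(n_values), and a code like 1000 would
    # force n_values >= 1001
    out = np.zeros((len(column), n_values), dtype=int)
    out[np.arange(len(column)), column] = 1
    return out

print(one_hot(np.array([0, 2, 1]), 3))
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]]
```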

4. Use the fitted OneHotEncoder to transform both the training and the test features. For this post:

Note: the training and test features must both be transformed (by the same fitted encoder), because the transformation changes the number of columns; if either set is left untransformed, the dimensions no longer match and the classifier will raise an error.

enc.fit(train_feature)
train_feature = enc.transform(train_feature).toarray()
test_feature = enc.transform(test_feature).toarray()
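The dimensionality change mentioned above can be computed in advance: each categorical column expands into n_values[i] binary columns, while the numeric columns pass through unchanged. A small sketch, assuming 33 feature columns per row (the sample rows hold 34 comma-separated values, the last one being the class):

```python
# n_values as passed to OneHotEncoder in step 3
n_values = [13, 13, 9, 5, 5, 13, 5, 2, 13, 13, 9, 31, 10, 5, 2, 9]

# assumption of this sketch: 33 feature columns per row
n_features = 33

# categorical columns expand to sum(n_values) binary columns;
# the remaining numeric columns pass through unchanged
encoded_width = sum(n_values) + (n_features - len(n_values))
print(encoded_width)  # 174
```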

5. Declare a classifier, set its parameters, and use it for training, prediction, and evaluation.

The full source, using RandomForest as the example:
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np


##train data
train_feature = []
train_target = []
with open("../../data/data/train.csv", "r") as train_data:
    for line in train_data:
        temp = line.strip().split(',')
        train_feature.append([int(x) for x in temp[:-1]])
        train_target.append(int(temp[-1]))
##test data
test_feature = []
test_target = []
with open("../../data/data/test.csv", "r") as test_data:
    for line in test_data:
        temp = line.strip().split(',')
        test_feature.append([int(x) for x in temp[:-1]])
        test_target.append(int(temp[-1]))

train_feature = np.array(train_feature)
test_feature = np.array(test_feature)

##OneHotEncoder: expand the categorical columns
enc = OneHotEncoder(categorical_features=np.array([1,2,4,5,6,7,8,9,10,11,14,15,16,17,18,21]),
                    n_values=[13,13,9,5,5,13,5,2,13,13,9,31,10,5,2,9])
enc.fit(train_feature)

train_feature = enc.transform(train_feature).toarray()
test_feature = enc.transform(test_feature).toarray()

##train a random forest
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(train_feature, train_target)

##result
print(clf.predict(test_feature))
target_names = ['losing', 'active']
print(classification_report(test_target, clf.predict(test_feature), target_names=target_names))

The experiment produced the following results:
              precision    recall  f1-score   support

     losing       0.85      0.91      0.88     31138
     active       0.84      0.75      0.79     19725

avg / total       0.85      0.85      0.84     50863
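As a sanity check, the "avg / total" row is the support-weighted average of the per-class rows; recomputing it from the (rounded) figures above:

```python
# per-class (precision, recall, f1, support) rows from the report above
rows = [(0.85, 0.91, 0.88, 31138),   # losing
        (0.84, 0.75, 0.79, 19725)]   # active

total = sum(r[3] for r in rows)
for i, name in enumerate(('precision', 'recall', 'f1-score')):
    weighted = sum(r[i] * r[3] for r in rows) / total
    print(f"{name}: {weighted:.3f}")
# precision: 0.846, recall: 0.848, f1-score: 0.845 -- matching the
# report's 0.85/0.85/0.84 up to rounding of the per-class inputs
```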




