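Load the data into memory. Scikit-Learn uses NumPy arrays in its implementation, so NumPy is used to load the *.csv file.
The data comes from the UCI Machine Learning Repository, a collection of datasets.
One of them: http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data
A sample of the data follows; each row holds 9 values (8 attributes followed by the class label):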
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
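As a minimal sketch of the loading step, the snippet below parses the first three sample rows from an in-memory string instead of downloading the file. The `sample` variable and the use of `io.StringIO` are assumptions made for illustration only.

```python
import io
import numpy as np

# First three rows of the sample above, held in memory
# (an illustrative stand-in for the downloaded file).
sample = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

# np.loadtxt accepts a file-like object; each text line becomes one row.
dataset = np.loadtxt(sample, delimiter=",")

# The first 8 columns are attributes, the last column is the class label.
X = dataset[:, 0:8]
y = dataset[:, 8]

print(dataset.shape)
print(y)
```

The same slicing applies unchanged once `dataset` comes from the real file.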
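Before running an algorithm, the data should be normalized or standardized.
Normalization rescales the nominal values of every feature so that each value lies between 0 and 1.
Standardization preprocesses the data so that every feature has a mean of 0 and a variance of 1.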
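To make the distinction concrete, here is a small sketch on a made-up 3x2 array (the values are hypothetical). It uses `MinMaxScaler` for the 0-to-1 rescaling and `preprocessing.scale` for standardization. It also shows that `preprocessing.normalize` does something different again: it rescales each row, not each column, to unit length.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 3 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Rescale each column to the 0-1 range.
minmax_X = preprocessing.MinMaxScaler().fit_transform(X)

# Standardize: each column ends up with mean 0 and variance 1.
standardized_X = preprocessing.scale(X)

# preprocessing.normalize works per row: each sample gets unit L2 norm.
unit_rows_X = preprocessing.normalize(X)

print(minmax_X.min(axis=0), minmax_X.max(axis=0))
print(standardized_X.mean(axis=0), standardized_X.std(axis=0))
print(np.linalg.norm(unit_rows_X, axis=1))
```

Which of the three is appropriate depends on the algorithm that follows; the sketch only demonstrates what each call does to the numbers.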
Feature selection and feature engineering. Tree-based algorithms, for example, can compute how informative each feature is.
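A rough sketch of tree-based feature scoring follows. The data is synthetic (an assumption for illustration): only the first column determines the label, so it should receive the largest score.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (an assumption for illustration):
# only column 0 determines the label, the rest is noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# Fit a tree ensemble; random_state is fixed so the run is repeatable.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One non-negative score per feature; the scores sum to 1.
importances = model.feature_importances_
print(importances)
print("most informative column:", int(np.argmax(importances)))
```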
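An efficient search over feature subsets aims to find the best one, meaning the subset on which the fitted model achieves the best quality.
One such search algorithm is recursive feature elimination (RFE).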
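The idea can be sketched on synthetic data before applying it to the real dataset. The data here is an assumption for illustration: the label depends only on the first three of six columns.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration):
# the label depends on columns 0, 1 and 2 only.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# RFE repeatedly fits the estimator and drops the weakest feature
# until the requested number of features remains.
rfe = RFE(LogisticRegression(), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask: True for the kept features
print(rfe.ranking_)   # kept features have rank 1; larger ranks were dropped earlier
```

Passing `n_features_to_select` by keyword avoids relying on positional argument order, which differs between scikit-learn versions.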
# -*- coding: utf-8 -*-
import numpy as np
from urllib.request import urlopen
from sklearn import preprocessing
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Dataset URL
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Download the file
raw_data = urlopen(url)
# Load the file into a 2-D array
dataset = np.loadtxt(raw_data, delimiter=",")
# Separate the attributes from the target class.
# Note: 0:7 keeps only the first 7 of the 8 attribute columns; use 0:8 to include all of them.
X = dataset[:, 0:7]
y = dataset[:, 8]
print(X, y)
# Normalize the attributes (preprocessing.normalize rescales each row to unit norm)
normalized_X = preprocessing.normalize(X)
print(normalized_X)
# Standardize the attributes (each column gets zero mean and unit variance)
standardized_X = preprocessing.scale(X)
print('___________________________')
print(standardized_X)
# Feature selection: fit a tree ensemble
model = ExtraTreesClassifier()
model.fit(X, y)
# Show the relative importance of each attribute
print(model.feature_importances_)
model = LogisticRegression()
# Create the RFE model and select 3 attributes (recursive feature elimination)
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)
# Summarize the selected attributes
print(rfe.support_)
print(rfe.ranking_)
The output is as follows:
================ RESTART: F:/课程资料/神经网络/get_ML_data.py ================
[[ 6. 148. 72. ..., 0. 33.6 0.627]
[ 1. 85. 66. ..., 0. 26.6 0.351]
[ 8. 183. 64. ..., 0. 23.3 0.672]
...,
[ 5. 121. 72. ..., 112. 26.2 0.245]
[ 1. 126. 60. ..., 0. 30.1 0.349]
[ 1. 93. 70. ..., 0. 30.4 0.315]]# raw data
___________________________
[ 1. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1.
...,
0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.
0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1.
0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0.]# class labels
[[ 0.03494617 0.86200564 0.41935409 ..., 0. 0.19569858
0.00365188]
[ 0.00872683 0.74178025 0.57597054 ..., 0. 0.23213358
0.00306312]
[ 0.04093566 0.93640332 0.32748532 ..., 0. 0.11922512
0.0034386 ]
...,
[ 0.02727338 0.66001582 0.39273669 ..., 0.61092373 0.14291252
0.0013364 ]
[ 0.0070043 0.8825414 0.42025781 ..., 0. 0.21082934
0.0024445 ]
[ 0.00804902 0.74855891 0.56343144 ..., 0. 0.24469022
0.00253544]]# normalized attributes
___________________________
[[ 0.63994726 0.84832379 0.14964075 ..., -0.69289057 0.20401277
0.46849198]
[-0.84488505 -1.12339636 -0.16054575 ..., -0.69289057 -0.68442195
-0.36506078]
[ 1.23388019 1.94372388 -0.26394125 ..., -0.69289057 -1.10325546
0.60439732]
...,
[ 0.3429808 0.00330087 0.14964075 ..., 0.27959377 -0.73518964
-0.68519336]
[-0.84488505 0.1597866 -0.47073225 ..., -0.69289057 -0.24020459
-0.37110101]
[-0.84488505 -0.8730192 0.04624525 ..., -0.69289057 -0.20212881
-0.47378505]]# standardized attributes
[ 0.13977213 0.25230235 0.11835827 0.09398053 0.07977048 0.16745061
0.14836562]# relative importance of each attribute
___________________________
[ True False False False False True True]# summary of the selected attributes
___________________________
[1 2 3 5 4 1 1]