Python - Scikit-Learn: Data Loading, Data Normalization, and Feature Selection

First, load the data into memory. Scikit-Learn uses NumPy arrays in its implementation, so we use NumPy to load the *.csv file.

The data comes from the UCI Machine Learning Repository, a collection of datasets.

One of them: http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

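A sample of the data is shown below. Each row holds 9 values: the attribute columns followed by the class label in the last column.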

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
7,100,0,0,0,30.0,0.484,32,1
0,118,84,47,230,45.8,0.551,31,1
7,107,74,0,0,29.6,0.254,31,1
1,103,30,38,83,43.3,0.183,33,0
1,115,70,30,96,34.6,0.529,32,1
3,126,88,41,235,39.3,0.704,27,0
8,99,84,0,0,35.4,0.388,50,0
7,196,90,0,0,39.8,0.451,41,1
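As a minimal sketch of the loading step, the first five sample rows above can be parsed with np.loadtxt from an in-memory buffer instead of a downloaded file. The helper name load_sample is hypothetical and only serves this illustration; it assumes the last column is the class label.

```python
import io

import numpy as np

# First five rows of the sample shown above.
SAMPLE = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""


def load_sample(text):
    """Parse comma-separated text into a feature matrix X and a label vector y.

    Hypothetical helper: assumes the last column holds the class label.
    """
    data = np.loadtxt(io.StringIO(text), delimiter=",")
    X = data[:, :-1]  # every column except the last
    y = data[:, -1]   # last column: class label
    return X, y


X, y = load_sample(SAMPLE)
print(X.shape, y.shape)
```

Reading from a StringIO buffer makes it easy to check the parsing logic without network access; the same call accepts a file name or an open file.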

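Before running an algorithm, the data should be normalized or standardized.

Normalization rescales the values of each feature so that they all lie between 0 and 1.

Standardization preprocesses the data so that each feature has a mean of 0 and a variance of 1. Note that preprocessing.normalize, used in the script below, does something different from the per-feature rescaling described above: it rescales each sample (row) to unit norm.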
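The difference between these rescalings is easiest to see on a tiny array. The sketch below is an illustration on made-up input (three columns taken from the sample rows), not part of the original script. It uses MinMaxScaler for the 0-to-1 feature scaling, preprocessing.scale for standardization, and preprocessing.normalize for per-row unit norm.

```python
import numpy as np
from sklearn import preprocessing

# Small illustrative matrix: 4 samples, 3 features.
X = np.array([[6.0, 148.0, 72.0],
              [1.0, 85.0, 66.0],
              [8.0, 183.0, 64.0],
              [1.0, 89.0, 66.0]])

# Min-max scaling: every column is mapped onto the range [0, 1].
minmax_X = preprocessing.MinMaxScaler().fit_transform(X)

# Standardization: every column gets mean 0 and standard deviation 1.
standardized_X = preprocessing.scale(X)

# normalize: every row is rescaled to unit L2 norm (a per-sample operation).
row_normalized_X = preprocessing.normalize(X)

print(minmax_X.min(axis=0), minmax_X.max(axis=0))
print(standardized_X.mean(axis=0), standardized_X.std(axis=0))
print(np.linalg.norm(row_normalized_X, axis=1))
```

Which rescaling is appropriate depends on the model; distance-based and gradient-based methods are usually the ones that benefit.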

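Next come feature selection and feature engineering. Tree-based algorithms, for example, can compute how informative each feature is.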
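To illustrate how a tree ensemble reports feature informativeness, here is a small sketch on synthetic data (an assumption made for this example, not the diabetes data): the label depends only on column 0, so that column should receive the largest share of feature_importances_.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic data (assumption for illustration): 4 random features,
# with the label determined by the sign of column 0 only.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One non-negative score per feature; the scores sum to 1.
importances = model.feature_importances_
print(importances)
print("most informative column:", int(np.argmax(importances)))
```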

An efficient search over feature subsets looks for the best subset, meaning the one on which the trained model achieves the best quality.

Search algorithm: Recursive Feature Elimination (RFE)
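A minimal sketch of RFE on synthetic data follows. The dataset from make_classification and the choice of keeping 3 of 6 features are assumptions for this example. RFE repeatedly fits the estimator and drops the weakest feature until the requested number remains.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification problem (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# Keep 3 features; one feature is eliminated per iteration (step=1 by default).
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask: True for the features that were kept
print(rfe.ranking_)   # 1 for kept features; larger numbers were dropped earlier

# Reduce X to the selected columns.
X_selected = rfe.transform(X)
print(X_selected.shape)
```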

# -*- coding: utf-8 -*-
import urllib.request

import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Dataset URL
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Download the file
raw_data = urllib.request.urlopen(url)
# Load the file into a 2-D array
dataset = np.loadtxt(raw_data, delimiter=",")
# Separate the attributes from the target class.
# Note: 0:7 keeps only columns 0..6, so column 7 is left out of X;
# use dataset[:, 0:8] to include every attribute column.
X = dataset[:, 0:7]
y = dataset[:, 8]
print(X, y)
# Normalize the data attributes (each sample row is rescaled to unit norm)
normalized_X = preprocessing.normalize(X)
print(normalized_X)
# Standardize the data attributes (each column gets mean 0 and variance 1)
standardized_X = preprocessing.scale(X)
print('___________________________')
print(standardized_X)

# Feature selection with a tree ensemble
model = ExtraTreesClassifier()
model.fit(X, y)
# Show the relative importance of each attribute
print(model.feature_importances_)

model = LogisticRegression()
# Create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)  # Recursive Feature Elimination (RFE)
rfe = rfe.fit(X, y)
# Summarize the selected attributes
print(rfe.support_)
print(rfe.ranking_)
The output is as follows:
================ RESTART: F:/课程资料/神经网络/get_ML_data.py ================
[[   6.     148.      72.    ...,    0.      33.6      0.627]
 [   1.      85.      66.    ...,    0.      26.6      0.351]
 [   8.     183.      64.    ...,    0.      23.3      0.672]
 ..., 
 [   5.     121.      72.    ...,  112.      26.2      0.245]
 [   1.     126.      60.    ...,    0.      30.1      0.349]
 [   1.      93.      70.    ...,    0.      30.4      0.315]]# raw data
___________________________
[ 1.  0.  1.  0.  1.  0.  1.  0.  1.  1.  0.  1.  0.  1.  1.  1.  1.  1.
 ..., 
  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  0.  0.  0.  0.  0.
  0.  1.  1.  0.  0.  1.  0.  0.  1.  0.  1.  1.  1.  0.  0.  1.  1.  1.
  0.  1.  0.  1.  0.  1.  0.  0.  0.  0.  1.  0.]# class labels

[[ 0.03494617  0.86200564  0.41935409 ...,  0.          0.19569858
   0.00365188]
 [ 0.00872683  0.74178025  0.57597054 ...,  0.          0.23213358
   0.00306312]
 [ 0.04093566  0.93640332  0.32748532 ...,  0.          0.11922512
   0.0034386 ]
 ..., 
 [ 0.02727338  0.66001582  0.39273669 ...,  0.61092373  0.14291252
   0.0013364 ]
 [ 0.0070043   0.8825414   0.42025781 ...,  0.          0.21082934
   0.0024445 ]
 [ 0.00804902  0.74855891  0.56343144 ...,  0.          0.24469022
   0.00253544]]# normalized data attributes

___________________________
[[ 0.63994726  0.84832379  0.14964075 ..., -0.69289057  0.20401277
   0.46849198]
 [-0.84488505 -1.12339636 -0.16054575 ..., -0.69289057 -0.68442195
  -0.36506078]
 [ 1.23388019  1.94372388 -0.26394125 ..., -0.69289057 -1.10325546
   0.60439732]
 ..., 
 [ 0.3429808   0.00330087  0.14964075 ...,  0.27959377 -0.73518964
  -0.68519336]
 [-0.84488505  0.1597866  -0.47073225 ..., -0.69289057 -0.24020459
  -0.37110101]
 [-0.84488505 -0.8730192   0.04624525 ..., -0.69289057 -0.20212881
  -0.47378505]]# standardized data attributes
[ 0.13977213  0.25230235  0.11835827  0.09398053  0.07977048  0.16745061
  0.14836562]# relative importance of each attribute

___________________________
[ True False False False False  True  True]# summary of the selected attributes
___________________________
[1 2 3 5 4 1 1]
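The two RFE arrays printed above can be read back as column names. The sketch below copies the printed support_ and ranking_ values. The short column names are an assumption based on the commonly documented attribute order of the Pima Indians Diabetes dataset and are not given in the script.

```python
import numpy as np

# Values copied from the RFE output above.
support = np.array([True, False, False, False, False, True, True])
ranking = np.array([1, 2, 3, 5, 4, 1, 1])

# Assumed names for the first seven columns (not part of the original script).
names = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
         "insulin", "bmi", "pedigree"]

# Features kept by RFE: those whose mask entry is True (ranking == 1).
selected = [name for name, keep in zip(names, support) if keep]

# Eliminated features, ordered from last dropped (rank 2) to first dropped.
order = np.argsort(ranking, kind="stable")
eliminated = [names[i] for i in order if ranking[i] > 1]

print("selected:", selected)
print("eliminated:", eliminated)
```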