scikit-learn Machine Learning Guide
1 Data Acquisition
scikit-learn ships with a few standard datasets, such as the iris and digits datasets for classification and the Boston house prices dataset for regression.
1.1 load
from sklearn.datasets import load_iris
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2

iris = load_iris()
boston = load_boston()  # the () matters: without it, boston is the function itself, not the data
x_iris = iris.data
y_iris = iris.target
print(x_iris.shape)
print(y_iris.shape)
Output:
(150, 4)
(150,)
1.2 make_regression
from sklearn.datasets import make_regression

x, y = make_regression(n_samples=5, n_features=5, n_informative=3, n_targets=1,
                       bias=0.0, effective_rank=None, tail_strength=0.5,
                       noise=0.0, shuffle=True, coef=False, random_state=None)
print(1)
print(x)
print(y)
print(2)
x, y = make_regression(n_samples=5, n_features=3, n_informative=3, n_targets=1,
                       bias=0.0, effective_rank=None, tail_strength=0.5,
                       noise=0.0, shuffle=True, coef=False, random_state=None)
print(x)
print(y)

from sklearn.datasets import make_classification

x, y = make_classification(n_samples=5, n_features=5, n_informative=3)
print(3)
print(x)
print(y)
Output:
1
[[ 1.3887649 -0.67264273 0.43790666 -0.17069749 0.38238322]
[ 0.47251375 -0.75166612 0.23178239 -1.661831 -0.81182998]
[ 0.03739107 -1.34200362 0.24249631 0.43151133 0.26080355]
[ 1.12646446 0.43570731 -0.21344475 0.36708684 -0.87513419]
[ 1.55191935 -0.81177223 -0.03115494 -0.80722552 -0.04601602]]
[ -9.35751369 -158.24803182 42.61616431 21.69803828 -73.08849933]
2
[[ 2.01145171 1.17357249 -1.85035062]
[-0.53409156 0.55599955 0.04779895]
[ 0.43052286 1.60386627 0.48718064]
[-1.93366169 -0.01782231 0.30727286]
[-0.46353664 -0.63285618 -0.09485492]]
[ 76.90215821 5.33297211 186.78542704 -115.69362051 -86.88520421]
3
[[ 1.53228845 -1.5708226 -0.8524258 -1.77329403 1.02937806]
[ 1.04725244 0.10639027 -1.12618235 1.40243761 -0.6630892 ]
[ 0.59248386 -0.08554676 -0.25855043 0.30970562 0.21408284]
[-0.49888926 -0.46972133 1.7796799 -2.13879689 2.21883169]
[-1.78928116 1.32412246 1.39049161 0.85772537 -0.39507264]]
[0 0 1 0 1]
1.3 np.array
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1])
print(accuracy_score(y_true, y_pred))
Output:
0.6
1.4 loadtxt()
import numpy as np

data = np.loadtxt("E:/sklearn/data.txt", delimiter=",")
print(data.shape)
print(data[:9, :])
x = data[:, :7]  # columns 0-6; note this skips column 7 -- use data[:, :8] to keep all 8 features
y = data[:, 8]
Output:
(768, 9)
[[ 6.00000000e+00 1.48000000e+02 7.20000000e+01 3.50000000e+01
0.00000000e+00 3.36000000e+01 6.27000000e-01 5.00000000e+01
1.00000000e+00]
[ 1.00000000e+00 8.50000000e+01 6.60000000e+01 2.90000000e+01
0.00000000e+00 2.66000000e+01 3.51000000e-01 3.10000000e+01
0.00000000e+00]
[ 8.00000000e+00 1.83000000e+02 6.40000000e+01 0.00000000e+00
0.00000000e+00 2.33000000e+01 6.72000000e-01 3.20000000e+01
1.00000000e+00]
[ 1.00000000e+00 8.90000000e+01 6.60000000e+01 2.30000000e+01
9.40000000e+01 2.81000000e+01 1.67000000e-01 2.10000000e+01
0.00000000e+00]
[ 0.00000000e+00 1.37000000e+02 4.00000000e+01 3.50000000e+01
1.68000000e+02 4.31000000e+01 2.28800000e+00 3.30000000e+01
1.00000000e+00]
[ 5.00000000e+00 1.16000000e+02 7.40000000e+01 0.00000000e+00
0.00000000e+00 2.56000000e+01 2.01000000e-01 3.00000000e+01
0.00000000e+00]
[ 3.00000000e+00 7.80000000e+01 5.00000000e+01 3.20000000e+01
8.80000000e+01 3.10000000e+01 2.48000000e-01 2.60000000e+01
1.00000000e+00]
[ 1.00000000e+01 1.15000000e+02 0.00000000e+00 0.00000000e+00
0.00000000e+00 3.53000000e+01 1.34000000e-01 2.90000000e+01
0.00000000e+00]
[ 2.00000000e+00 1.97000000e+02 7.00000000e+01 4.50000000e+01
5.43000000e+02 3.05000000e+01 1.58000000e-01 5.30000000e+01
1.00000000e+00]]
1.5 urlopen
import numpy as np
from urllib.request import urlopen  # Python 2 used urllib.urlopen

# Pima Indians diabetes dataset; it may no longer be hosted at this UCI URL.
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
raw_data = urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=",")
X = dataset[:, 0:7]  # columns 0-6; use 0:8 to keep all 8 features
y = dataset[:, 8]
print(X)
print(y)
Output:
[[ 6. 148. 72. ..., 0. 33.6 0.627]
[ 1. 85. 66. ..., 0. 26.6 0.351]
[ 8. 183. 64. ..., 0. 23.3 0.672]
...,
[ 5. 121. 72. ..., 112. 26.2 0.245]
[ 1. 126. 60. ..., 0. 30.1 0.349]
[ 1. 93. 70. ..., 0. 30.4 0.315]]
[ 1. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1.
0. 1. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0.
0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1.
0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0.
1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0.
0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0.
0. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1.
0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0.
0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1.
0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1.
1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1.
1. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0.
0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0.
0. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1.
1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1.
0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 0.
1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 1. 1. 1.
0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1.
1. 0. 0. 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0.
1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 1. 0.
0. 1. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0.
1. 1. 0. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0.
0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0.
0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1.
0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0.
0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0.
0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1.
1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 0. 1. 0.
0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 1.
1. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1.
1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0.
1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0. 1.
0. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1.
1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1.
0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0.
0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1.
0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0.]
2 Data Preprocessing
Most of these utilities live in the sklearn.preprocessing package.
Scaling:
- MinMaxScaler: min-max scaling to a given range
- Normalizer: rescales each sample individually; with norm='l1' the absolute feature values of each row sum to 1 (the default 'l2' gives unit length)
- StandardScaler: transforms each feature to zero mean and unit variance
Encoding:
- LabelEncoder: converts string labels into integers
- OneHotEncoder: represents each categorical value as a binary indicator vector
- Binarizer: thresholds numeric features into binary values
- MultiLabelBinarizer: binarizes multi-label targets
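A minimal sketch of the first few transformers above (the toy matrix and color labels are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, LabelEncoder

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 10.0]])

# MinMaxScaler: each feature rescaled to [0, 1]
print(MinMaxScaler().fit_transform(X))     # [[0. 0.] [0.5 0.5] [1. 1.]]

# StandardScaler: each feature to zero mean, unit variance
print(StandardScaler().fit_transform(X))

# Normalizer: each *row* rescaled (default l2: unit length)
print(Normalizer().fit_transform(X))

# LabelEncoder: classes are sorted, so blue=0, green=1, red=2
le = LabelEncoder()
print(le.fit_transform(["red", "green", "red", "blue"]))  # [2 1 2 0]
```

Note that the scalers operate column-wise while Normalizer operates row-wise, which is why it appears under a separate bullet.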
Feature extraction yields raw, unprocessed features, which may suffer from the following problems:
Different scales: features measured in different units or ranges cannot be compared directly. Scaling (removing the units) solves this.
Information redundancy: for some quantitative features, the useful information is only an interval split. Take exam scores: if we only care about pass versus fail, the raw score should be converted to 1 (pass) or 0 (fail). Binarization solves this.
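The pass/fail example can be sketched with Binarizer (the scores and the 60-point pass mark are made-up values; Binarizer maps values strictly above the threshold to 1):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# hypothetical exam scores; threshold just below 60 so that 60 counts as a pass
scores = np.array([[45.0], [60.0], [88.0], [59.0]])
passed = Binarizer(threshold=59.9).fit_transform(scores)
print(passed.ravel())  # [0. 1. 1. 0.]
```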
Qualitative features cannot be used directly: some algorithms and models only accept quantitative inputs, so qualitative features must be converted into quantitative ones. The simplest way is to assign a number to each category, but this is too arbitrary and adds tuning work. The usual approach is dummy (one-hot) encoding: given N distinct categories, expand the feature into N features; when the original value is the i-th category, the i-th expanded feature is set to 1 and the others to 0. Compared with assigning numbers directly, dummy encoding needs no extra tuning, and for linear models the dummy-encoded features can even produce nonlinear effects.
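A sketch of the dummy-encoding scheme described above, using OneHotEncoder (the color column is an invented example; categories are sorted, so blue, green, red map to columns 0, 1, 2):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# N = 3 distinct values -> 3 binary indicator columns
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot = OneHotEncoder().fit_transform(colors).toarray()
print(onehot)
# "red" -> [0. 0. 1.]; exactly one 1 per row, as the text describes
```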
Missing values: missing entries need to be filled in.
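A minimal imputation sketch (the matrix is made up; recent scikit-learn provides SimpleImputer in sklearn.impute, while older releases had Imputer in sklearn.preprocessing):

```python
import numpy as np
from sklearn.impute import SimpleImputer  # older sklearn: sklearn.preprocessing.Imputer

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
filled = SimpleImputer(strategy="mean").fit_transform(X)
print(filled)
# NaNs replaced by the column means: (1+3)/2 = 2.0 and (2+4)/2 = 3.0
```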
Low information utilization: different algorithms and models exploit the information in the data differently. As noted above, dummy-encoding qualitative features lets linear models achieve nonlinear effects; similarly, adding polynomial terms of quantitative features, or applying other transformations, can achieve nonlinear effects too.
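The polynomial expansion mentioned above can be sketched with PolynomialFeatures (the input row is an invented example):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])
# degree=2 expands [x1, x2] into [1, x1, x2, x1^2, x1*x2, x2^2]
expanded = PolynomialFeatures(degree=2).fit_transform(X)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```

Feeding these expanded features to a linear model is what gives it the nonlinear fitting power the text refers to.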
The preprocessing module provides solutions to all of the problems above.
2.1 Scaling
Scaling converts data of different magnitudes to a common scale. Common methods are standardization and interval (min-max) scaling. Standardization is usually motivated by assuming the feature follows a normal distribution, which it maps to the standard normal distribution (strictly, it only guarantees zero mean and unit variance). Interval scaling uses the boundary values to rescale a feature into a specific range such as [0, 1].
2.1.1 Standardization
from sklearn.datasets im
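The code above is cut off mid-import; a minimal standardization sketch on the iris data (an assumption about where this section was heading, following the load_iris pattern from section 1.1) might look like:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x_scaled = StandardScaler().fit_transform(iris.data)
# after standardization each feature has zero mean and unit variance
print(x_scaled.mean(axis=0))  # ~[0. 0. 0. 0.]
print(x_scaled.std(axis=0))   # [1. 1. 1. 1.]
```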