sklearn cookbook 总结

Sklearn cookbook总结

1 数据预处理

1.1 获取数据

sklearn自带一些数据集,可以通过datasets模块的load_*方法加载,还有一些数据集比较大,可以通过fetch_*的方式下载。下面的代码示例了加载boston的房价数据和下载california的房价数据的方法。

from sklearn import datasets

boston = datasets.load_boston()
print(boston.DESCR)

california = datasets.fetch_california_housing('./temp')
# print(california.DESCR)
Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

1.2 数据处理

sklearn的preprocess模块提供了若干预处理数据的方法。其功能如下:

类或方法 作用
StandardScaler 数据减去均值后除以方差
MinMaxScaler 减去最小值除以最大最小值的差
normalize 将数据除以所有点的平方和
binary 由设置的阀值s进行二值化,x>s?1:0

其使用示例如下,由于是二维数组,计算在列上进行,即axis为0:

import numpy as np
from sklearn import preprocessing

a = np.array([[4., 2.], [2., 4.], [2, -2]], dtype=np.float)

print(a)

scaler = preprocessing.StandardScaler()
r = scaler.fit_transform(a)
print(r) 

scaler = preprocessing.MinMaxScaler()
r = scaler.fit_transform(a)
print(r)

r = preprocessing.normalize(a)
print(r) 

binary = preprocessing.Binarizer(3.5)
r = binary.fit_transform(a)
print(r) 
[[ 4.  2.]
 [ 2.  4.]
 [ 2. -2.]]
[[ 1.41421356  0.26726124]
 [-0.70710678  1.06904497]
 [-0.70710678 -1.33630621]]
[[1.         0.66666667]
 [0.         1.        ]
 [0.         0.        ]]
[[ 0.89442719  0.4472136 ]
 [ 0.4472136   0.89442719]
 [ 0.70710678 -0.70710678]]
[[1. 0.]
 [0. 1.]
 [0. 0.]]

###1.3 分类编码
对于类别型的数据,需要将其数值化,以支持向量运算。

对于数值型的,可以使用preprocessing包的OneHotEncoder;对于字符串型的需要借助feature_extraction模块来进行。

from sklearn import preprocessing
from sklearn.feature_extraction import DictVectorizer


labels = [[1], [2], [3], [2]]

onehot = preprocessing.OneHotEncoder()
y = onehot.fit_transform(labels)

print(y.toarray())


labels = [{
   'kind':'apple'}, {
   'kind':'orange'}]
dv = DictVectorizer()
y = dv.fit_transform(labels)
print(y.toarray())


labels = [1,2,3,3,2,1]
lb = preprocessing.LabelBinarizer()
vec = lb.fit_transform(labels)
print(vec)


[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
[[1. 0.]
 [0. 1.]]
[[1 0 0]
 [0 1 0]
 [0 0 1]
 [0 0 1]
 [0 1 0]
 [1 0 0]]

1.4 缺失值处理

缺失值可以表示为nan,但在计算中无法使用,因此根据需要可以填充为合适的值。sklearn和pandas都能处理缺失值。

import pandas as pd
from sklearn import preprocessing


data = np.array([[1, 2], [np.nan, 4]])
print('origin:\n', data)

imputer = preprocessing.Imputer(strategy='mean')
r = imputer.fit_transform(data)
print('sklean:\n', r)


data_df = pd.DataFrame(data)
df = data_df.fillna(data_df.mean())
print('pandas\n',df)
origin:
 [[ 1.  2.]
 [nan  4.]]
sklean:
 [[1. 2.]
 [1. 4.]]
pandas
      0    1
0  1.0  2.0
1  1.0  4.0

1.5 去除无用的维度

PCA是sklearn的一个分解模块,可以借助它来完成数据降维。

下面的代码对iris的特征进行PCA降维,通过对各维度的贡献分析,96%的变量可以由前两个主成分表示。因此可以把数据降低到前两维上,通过对PCA的参数n_components指定维度或比例,可以将数据进行降维。在只有两维的数据上通过plot作图以验证数据的可分性。

降维的另一个方法是使用FactorAnalysis类,使用上和PCA类似。其支持的核函数有liner, poly, rbf, sigmoid, cosine。

最后,利用矩阵的SVD也可以实现数据降维。各种降维方法的示例代码及效果如下:

import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.decomposition import TruncatedSVD


iris = datasets.load_iris()

pca = PCA()
dt = pca.fit_transform(iris.data)
print(pca.explained_variance_ratio_)

'''
array([  8.05814643e-01,   1.63050854e-01,   2.13486883e-02,......)
'''

fig, axes = plt.subplots(1,3)

pca = decomposition.PCA(n_components = 2)
dt = pca.fit_transform(iris.data)
axes[0].scatter(dt[:,0], dt[:,1], c=iris.target)


fa = FactorAnalysis(n_components=2)
dt = fa.fit_transform(iris.data)
axes[1].scatter(dt[:,0], dt[:,1], c=iris.target)


svd = TruncatedSVD()
dt = svd.fit_transform(iris.data)
axes[2].scatter(dt[:,0], dt[:,1], c=iris.target)
[0.92461621 0.05301557 0.01718514 0.00518309]





<matplotlib.collections.PathCollection at 0x7f2d406d9ef0>

在这里插入图片描述

1.6 使用pipeline连接多个变换

对于多步处理,pipeline提供了一种便捷的组织代码的方式。如下示例:

from sklearn import pipeline, preprocessing, decomposition, datasets

iris = datasets.load_iris()

imputer = preprocessing.Imputer()
pca = decomposition.PCA(n_components=2)
line = [('imputer', imputer), ('pca', pca)]

pipe = pipeline.Pipeline(line)
dt = pipe.fit_transform(iris.data)
print dt.shape #(150,2)

1.7 利用高斯随机过程处理回归

如果假设变量的分布和自变量符合高斯分布或正态分布,则可以使用高斯过程来进行回归分析。

from sklearn import datasets
from sklearn.gaussian_process import GaussianProcess

boston = datasets.load_boston()

sel = np.random.choice([True, False], len(boston.data), p=[0.75, 0.25])
gp = GaussianProcess()
gp.fit(boston.data[sel], boston.target[sel])

pred = gp.predict(boston.data[~sel])
diff = pred - boston.target[~sel]
xtick = range(len(pred))

fig, axes = plt.subplots(2,1)

axes[0].plot(xtick, pred, c='red',label='predict')
axes[0].plot(xtick, boston.target[~sel], c='blue', label='real')

axes[1].plot(xtick, diff)

plt.show()
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:58: DeprecationWarning: Class GaussianProcess is deprecated; GaussianProcess was deprecated in version 0.18 and will be removed in 0.20. Use the GaussianProcessRegressor instead.
  warnings.warn(msg, category=DeprecationWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:77: DeprecationWarning: Function l1_cross_distances is deprecated; l1_cross_distances was deprecated in version 0.18 and will be removed in 0.20.
  warnings.warn(msg, category=DeprecationWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:77: DeprecationWarning: Function constant is deprecated; The function constant of regression_models is deprecated in version 0.19.1 and will be removed in 0.22.
  warnings.warn(msg, category=DeprecationWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:77: DeprecationWarning: Function squared_exponential is deprecated; The function squared_exponential of correlation_models is deprecated in version 0.19.1 and will be removed in 0.22.
  warnings.warn(msg, category=DeprecationWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:77: DeprecationWarning: Function constant is deprecated; The function constant of regression_models is deprecated in version 0.19.1 and will be removed in 0.22.
  warnings.warn(msg, category=DeprecationWarning)
/usr/local/lib/python3.6/dist-packages/sklearn/utils/deprecation.py:77: DeprecationWarning: Function squared_exponential is deprecated; The function squared_exponential of correlation_models is deprecated in version 0.19.1 and will be removed in 0.22.
  warnings.warn(msg, category=DeprecationWarning)

在这里插入图片描述

1.8 SGD处理回归

from sklearn import datasets
from sklearn.linear_model import SGDRegressor

X, y = datasets.make_regression(1000)
sel = np.random.choice([True, False], len(X), p=[
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值