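Sklearn Cookbook Summary
1 Data preprocessing
1.1 Getting data
sklearn ships with several small datasets that can be loaded with the load_* functions of the datasets module; larger datasets are downloaded on demand with the fetch_* functions. The code below loads the Boston housing data and downloads the California housing data.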
from sklearn import datasets
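boston = datasets.load_boston()  # removed in scikit-learn 1.2; needs an older release
print(boston.DESCR)
california = datasets.fetch_california_housing(data_home='./temp')  # data_home: download/cache directory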
# print(california.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
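1.2 Data processing
The preprocessing module of sklearn provides several ways to preprocess data:
Class or function | Effect |
---|---|
StandardScaler | Subtract the column mean, then divide by the column standard deviation |
MinMaxScaler | Subtract the column minimum, then divide by (maximum - minimum) |
normalize | Divide each sample (row) by its L2 norm, the square root of the sum of its squares |
Binarizer | Binarize against a threshold s: 1 if x > s, else 0 |
Usage is shown below. The input is a 2-D array; StandardScaler and MinMaxScaler work column by column (axis 0), whereas normalize works row by row and Binarizer element by element: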
import numpy as np
from sklearn import preprocessing
a = np.array([[4., 2.], [2., 4.], [2., -2.]], dtype=np.float64)  # np.float was removed in NumPy 1.24
print(a)
scaler = preprocessing.StandardScaler()
r = scaler.fit_transform(a)
print(r)
scaler = preprocessing.MinMaxScaler()
r = scaler.fit_transform(a)
print(r)
r = preprocessing.normalize(a)
print(r)
binary = preprocessing.Binarizer(threshold=3.5)  # threshold must be passed by keyword in recent releases
r = binary.fit_transform(a)
print(r)
[[ 4. 2.]
[ 2. 4.]
[ 2. -2.]]
[[ 1.41421356 0.26726124]
[-0.70710678 1.06904497]
[-0.70710678 -1.33630621]]
[[1. 0.66666667]
[0. 1. ]
[0. 0. ]]
[[ 0.89442719 0.4472136 ]
[ 0.4472136 0.89442719]
[ 0.70710678 -0.70710678]]
[[1. 0.]
[0. 1.]
[0. 0.]]
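To make the arithmetic in the table concrete, here is a minimal plain-Python sketch that recomputes the four transforms on the same matrix without sklearn. The helper names (transpose, standardize, min_max, l2_normalize, binarize) are made up for this illustration and are not sklearn functions; the sketch also assumes no column is constant and no row is all zeros, cases that sklearn handles separately.

```python
import math

a = [[4.0, 2.0], [2.0, 4.0], [2.0, -2.0]]


def transpose(rows):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*rows)]


def standardize(rows):
    """Column-wise (x - mean) / std, using the population std (divide by n)."""
    out_cols = []
    for col in transpose(rows):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / len(col))
        out_cols.append([(x - mean) / std for x in col])
    return transpose(out_cols)


def min_max(rows):
    """Column-wise (x - min) / (max - min)."""
    out_cols = []
    for col in transpose(rows):
        lo, hi = min(col), max(col)
        out_cols.append([(x - lo) / (hi - lo) for x in col])
    return transpose(out_cols)


def l2_normalize(rows):
    """Row-wise x / ||x||_2, so every row ends up with unit length."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row])
    return out


def binarize(rows, threshold):
    """Element-wise: 1.0 if x > threshold, else 0.0."""
    return [[1.0 if x > threshold else 0.0 for x in row] for row in rows]


for result in (standardize(a), min_max(a), l2_normalize(a), binarize(a, 3.5)):
    print(result)
```

The printed values should agree with the sklearn output above up to rounding, which is a quick way to confirm that the scalers work per column while normalize works per row.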
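1.3 Categorical encoding
Categorical data has to be converted to numbers before it can take part in vector operations.
Integer-coded categories can be encoded with OneHotEncoder from the preprocessing package; string-valued features stored in dicts can be handled with DictVectorizer from the feature_extraction module. (Since scikit-learn 0.20, OneHotEncoder also accepts string categories directly.) LabelBinarizer does the same job for a 1-D list of labels.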
from sklearn import preprocessing
from sklearn.feature_extraction import DictVectorizer
labels = [[1], [2], [3], [2]]
onehot = preprocessing.OneHotEncoder()
y = onehot.fit_transform(labels)
print(y.toarray())
labels = [{'kind': 'apple'}, {'kind': 'orange'}]
dv = DictVectorizer()
y = dv.fit_transform(labels)
print(y.toarray())
labels = [1,2,3,3,2,1]
lb = preprocessing.LabelBinarizer()
vec = lb.fit_transform(labels)
print(vec)
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]
[0. 1. 0.]]
[[1. 0.]
[0. 1.]]
[[1 0 0]
[0 1 0]
[0 0 1]
[0 0 1]
[0 1 0]
[1 0 0]]
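The encoders above all follow the same idea: sort the distinct categories, give each one a column, and put a single 1 in the matching column. A minimal plain-Python sketch of that idea follows; one_hot is a hypothetical helper written for this note, not part of sklearn.

```python
def one_hot(values):
    """Return (sorted categories, one-hot rows) for a list of hashable labels."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one column is switched on per sample
        rows.append(row)
    return categories, rows


cats, encoded = one_hot([1, 2, 3, 2])
str_cats, str_encoded = one_hot(["apple", "orange"])
print(cats, encoded)
print(str_cats, str_encoded)
```

Because the categories are sorted before columns are assigned, the column order does not depend on the order in which labels first appear in the data.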
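1.4 Handling missing values
Missing values can be represented as NaN, but NaN cannot be used in computations, so they are usually replaced with a suitable fill value. Both sklearn and pandas can do this.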
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer  # preprocessing.Imputer was removed in scikit-learn 0.22
data = np.array([[1, 2], [np.nan, 4]])
print('origin:\n', data)
imputer = SimpleImputer(strategy='mean')  # fill each column with its mean
r = imputer.fit_transform(data)
print('sklearn:\n', r)
data_df = pd.DataFrame(data)
df = data_df.fillna(data_df.mean())  # same column-mean fill in pandas
print('pandas\n', df)
origin:
[[ 1. 2.]
[nan 4.]]
sklearn:
[[1. 2.]
[1. 4.]]
pandas
0 1
0 1.0 2.0
1 1.0 4.0
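Both fills above use the mean of the observed values in each column. A minimal plain-Python sketch of that rule, assuming every column has at least one observed value (impute_mean is a hypothetical helper, not an sklearn or pandas function):

```python
import math


def impute_mean(rows):
    """Replace NaN entries with the mean of the non-NaN values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        # Assumption: the column is not entirely NaN, otherwise this divides by zero.
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if math.isnan(x) else x for j, x in enumerate(row)]
        for row in rows
    ]


data = [[1.0, 2.0], [float("nan"), 4.0]]
filled = impute_mean(data)
print(filled)
```

The first column has a single observed value, 1.0, so the missing entry is filled with 1.0, matching the sklearn and pandas results.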
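1.5 Removing uninformative dimensions
PCA lives in sklearn's decomposition module and can be used to reduce dimensionality.
The code below applies PCA to the iris features. The explained variance ratios show that the first two principal components account for about 97.8% of the variance (0.925 + 0.053), so the data can be reduced to two dimensions. The n_components parameter of PCA takes either a number of components or a fraction of variance to keep. The two-dimensional result is then plotted to check how separable the classes are.
Another option is the FactorAnalysis class, which is used much like PCA. (The kernels linear, poly, rbf, sigmoid and cosine belong to KernelPCA, not FactorAnalysis.)
Finally, a truncated SVD of the data matrix can also reduce dimensionality. Sample code and results for these methods follow: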
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.decomposition import TruncatedSVD
iris = datasets.load_iris()
pca = PCA()
dt = pca.fit_transform(iris.data)
print(pca.explained_variance_ratio_)
# the ratios are sorted largest first; for iris the first two sum to about 0.978
fig, axes = plt.subplots(1, 3)
pca = PCA(n_components=2)  # PCA is imported directly above; keep only two components
dt = pca.fit_transform(iris.data)
axes[0].scatter(dt[:,0], dt[:,1], c=iris.target)
fa = FactorAnalysis(n_components=2)
dt = fa.fit_transform(iris.data)
axes[1].scatter(dt[:,0], dt[:,1], c=iris.target)
svd = TruncatedSVD()
dt = svd.fit_transform(iris.data)
axes[2].scatter(dt[:,0], dt[:,1], c=iris.target)
[0.92461621 0.05301557 0.01718514 0.00518309]
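The choice of how many components to keep can be read off the cumulative sum of these ratios. A minimal sketch using the four values printed above; components_needed is a hypothetical helper, and the 0.95 and 0.99 targets are arbitrary examples rather than values from the original text.

```python
from itertools import accumulate

# explained_variance_ratio_ as printed above for the iris data
ratios = [0.92461621, 0.05301557, 0.01718514, 0.00518309]


def components_needed(ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= target:
            return k
    return len(ratios)


cumulative = list(accumulate(ratios))
k95 = components_needed(ratios, 0.95)
print(cumulative)
print(k95)
```

As far as I know, passing a float between 0 and 1 as n_components to PCA makes sklearn perform a similar selection internally, so the manual loop is mainly useful for inspecting the trade-off.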
1.6 Chaining transforms with a pipeline
For multi-step processing, a pipeline is a convenient way to organize the code, for example:
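from sklearn import pipeline, decomposition, datasets
from sklearn.impute import SimpleImputer  # preprocessing.Imputer was removed in scikit-learn 0.22
iris = datasets.load_iris()
imputer = SimpleImputer()  # default strategy is 'mean'
pca = decomposition.PCA(n_components=2)
steps = [('imputer', imputer), ('pca', pca)]  # (name, transformer) pairs, applied in order
pipe = pipeline.Pipeline(steps)
dt = pipe.fit_transform(iris.data)
print(dt.shape)  # (150, 2)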
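Conceptually, a pipeline just feeds the output of each step into the next one. The toy sketch below illustrates that chaining in plain Python; ToyPipeline, MeanCenter and KeepFirstColumns are invented for this note and are not how sklearn implements Pipeline.

```python
class MeanCenter:
    """Toy transformer: learn column means in fit, subtract them in transform."""

    def fit(self, rows):
        n = len(rows)
        self.means_ = [sum(col) / n for col in zip(*rows)]
        return self

    def transform(self, rows):
        return [[x - m for x, m in zip(row, self.means_)] for row in rows]


class KeepFirstColumns:
    """Toy transformer: keep only the first n_columns columns (nothing to learn)."""

    def __init__(self, n_columns):
        self.n_columns = n_columns

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [row[: self.n_columns] for row in rows]


class ToyPipeline:
    """Apply (name, step) pairs in order, passing each output to the next step."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, rows):
        for _name, step in self.steps:
            rows = step.fit(rows).transform(rows)
        return rows

    def transform(self, rows):
        # Reuse what was learned in fit_transform; nothing is refitted here.
        for _name, step in self.steps:
            rows = step.transform(rows)
        return rows


pipe = ToyPipeline([("center", MeanCenter()), ("keep2", KeepFirstColumns(2))])
train = [[1.0, 2.0, 3.0], [3.0, 6.0, 9.0]]
out = pipe.fit_transform(train)
new_out = pipe.transform([[2.0, 4.0, 6.0]])
print(out, new_out)
```

The point of the design is that statistics learned on the training data (here the column means) are stored on the steps and reused unchanged when new data is transformed.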
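1.7 Regression with Gaussian processes
Gaussian process regression treats the unknown function as a draw from a Gaussian process, so the target values at any finite set of inputs are assumed to be jointly Gaussian. When that assumption is reasonable, a Gaussian process can be used for regression.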
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# GaussianProcess was deprecated in 0.18 and removed in 0.20; GaussianProcessRegressor replaces it
from sklearn.gaussian_process import GaussianProcessRegressor
boston = datasets.load_boston()  # removed in scikit-learn 1.2; needs an older release
# random boolean mask: roughly 75% of the rows for training, the rest for testing
sel = np.random.choice([True, False], len(boston.data), p=[0.75, 0.25])
gp = GaussianProcessRegressor()
gp.fit(boston.data[sel], boston.target[sel])
pred = gp.predict(boston.data[~sel])
diff = pred - boston.target[~sel]  # residuals on the held-out rows
xtick = range(len(pred))
fig, axes = plt.subplots(2,1)
axes[0].plot(xtick, pred, c='red',label='predict')
axes[0].plot(xtick, boston.target[~sel], c='blue', label='real')
axes[0].legend()
axes[1].plot(xtick, diff)
plt.show()
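To show what a Gaussian process prediction computes, here is a deliberately tiny sketch of the noise-free posterior mean for one-dimensional inputs. It assumes a zero prior mean, a squared-exponential (RBF) kernel with a fixed length scale, and exactly two training points so the 2x2 kernel matrix can be inverted by hand; rbf and gp_posterior_mean are hypothetical helpers, and real libraries additionally tune the kernel and handle noise.

```python
import math


def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential kernel: exp(-(x1 - x2)^2 / (2 * length_scale^2))."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))


def gp_posterior_mean(x_train, y_train, x_new, length_scale=1.0):
    """Posterior mean k(x_new, X) @ K^{-1} @ y for exactly two distinct training points."""
    (xa, xb), (ya, yb) = x_train, y_train
    k_ab = rbf(xa, xb, length_scale)
    # K = [[1, k_ab], [k_ab, 1]], so det(K) = 1 - k_ab^2 (non-zero when xa != xb)
    det = 1.0 - k_ab ** 2
    # alpha = K^{-1} y, written out with the closed-form 2x2 inverse
    alpha_a = (ya - k_ab * yb) / det
    alpha_b = (yb - k_ab * ya) / det
    return rbf(x_new, xa, length_scale) * alpha_a + rbf(x_new, xb, length_scale) * alpha_b


x_train, y_train = (0.0, 1.0), (1.0, 2.0)
for x in (0.0, 0.5, 1.0, 100.0):
    print(x, gp_posterior_mean(x_train, y_train, x))
```

Without noise the mean passes exactly through the training points, and far away from them it falls back to the prior mean of zero, which is why the choice of kernel and length scale matters so much in practice.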
1.8 Regression with SGD
from sklearn import datasets
from sklearn.linear_model import SGDRegressor
X, y = datasets.make_regression(1000)
sel = np.random.choice([True, False], len(X), p=[