Python Basics - Scikit-learn

Rnan-prince

Article link: https://blog.csdn.net/qq_19446965/article/details/106970339

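Scikit-learn is an open-source Python library that implements machine learning, preprocessing, cross-validation, and visualization algorithms behind a unified interface.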

I. Loading the Data

import numpy as np
X = np.random.random((10, 5))
y = np.array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'])
X[X < 0.7] = 0
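
Because X is drawn at random, the printed outputs later in this post change from run to run. A minimal sketch of how the same kind of data could be made repeatable, assuming NumPy's Generator API; the seed 0 and the name rng are arbitrary choices made here, not part of the original:

```python
import numpy as np

rng = np.random.default_rng(0)   # hypothetical fixed seed for repeatable runs
X = rng.random((10, 5))          # 10 samples, 5 features, values in [0, 1)
y = np.array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'])
X[X < 0.7] = 0                   # zero out small values, as in the code above

print(X.shape, y.shape)
print((X == 0).mean())           # fraction of entries that were zeroed
```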

II. Training and Test Sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

With the default test_size of 0.25, the 10 samples are split so that len(X_train) = 7 and len(X_test) = 3.
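
The split sizes can also be set explicitly. The sketch below is illustrative only: the toy X and the choices test_size=0.3 and stratify=y are assumptions added here, showing how to keep both classes represented in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(50, dtype=float).reshape(10, 5)
y = np.array(['M', 'M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'F'])

# test_size=0.3 holds out 3 of the 10 rows; stratify=y keeps both classes in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(len(X_train), len(X_test))
print(sorted(set(y_test)))
```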

III. Data Preprocessing

1. Standardization

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
standardized_X = scaler.transform(X_train)
standardized_X_test = scaler.transform(X_test)
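
To see what the scaler does, here is a small sketch on made-up numbers (the arrays are hypothetical). StandardScaler learns each column's mean and standard deviation from the training data and applies those same statistics to the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_test = np.array([[2.0, 30.0]])

scaler = StandardScaler().fit(X_train)      # learns mean_ and scale_ from training data only
Z_train = scaler.transform(X_train)
Z_test = scaler.transform(X_test)           # reuses the training statistics

print(Z_train.mean(axis=0))                 # close to 0 for each column
print(Z_train.std(axis=0))                  # close to 1 for each column
print(Z_test)
```

Fitting on the training set only keeps information about the test set from leaking into the preprocessing step.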

2. Normalization

from sklearn.preprocessing import Normalizer
scaler = Normalizer().fit(X_train)
normalized_X = scaler.transform(X_train)
normalized_X_test = scaler.transform(X_test)
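
Unlike StandardScaler, Normalizer works row by row: each sample is rescaled to unit norm, and fit() has nothing to learn. A minimal sketch on a hypothetical array, using the L2 norm (the default):

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]])
X_norm = Normalizer(norm='l2').fit_transform(X)   # each row is scaled on its own

print(X_norm)                                     # first row becomes [0.6, 0.8]
print(np.linalg.norm(X_norm, axis=1))             # nonzero rows have length 1
```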

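3. Binarization

from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=0.0).fit(X)
binary_X = binarizer.transform(X)   # entries > 0.0 become 1, all others 0
print(binary_X)                     # sample output; exact values depend on the random X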
[[0. 0. 1. 1. 0.]
 [0. 1. 1. 1. 1.]
 [1. 0. 1. 0. 0.]
 [0. 1. 0. 1. 0.]
 [0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 1.]
 [0. 0. 0. 1. 1.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 1. 0.]
 [0. 1. 0. 0. 0.]]
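
A smaller, deterministic sketch of the same step may be easier to check by eye; the input array here is hypothetical:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X_small = np.array([[0.0, 0.82, 0.0], [0.71, 0.0, 0.95]])
binary_small = Binarizer(threshold=0.0).fit_transform(X_small)  # values > threshold map to 1
print(binary_small)
```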

4. Encoding Categorical Labels

from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y = enc.fit_transform(y)    # 'F' -> 0, 'M' -> 1 (classes are sorted)
print(y)
[1 1 0 0 1 0 1 1 0 0]
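
The mapping the encoder learned can be inspected and reversed. A short sketch, with a shortened hypothetical label array:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['M', 'M', 'F', 'F', 'M'])
enc = LabelEncoder()
encoded = enc.fit_transform(labels)        # classes are sorted, so 'F' -> 0 and 'M' -> 1
decoded = enc.inverse_transform(encoded)   # recovers the original strings

print(enc.classes_)
print(encoded)
print(decoded)
```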

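5. Imputing Missing Values

from sklearn.impute import SimpleImputer    # Imputer was removed from sklearn.preprocessing in 0.22
imp = SimpleImputer(missing_values=0, strategy='mean')   # column-wise mean; zeros count as missing
imp_X = imp.fit_transform(X_train)
print(imp_X)                                # sample output; exact values depend on the random X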
[[0.82728283 0.82831898 0.81366112 0.87132971 0.81187488]
 [0.82728283 0.94785166 0.87362473 0.91823703 0.72561163]
 [0.82728283 0.862883   0.81366112 0.72950674 0.776649  ]
 [0.82728283 0.862883   0.81366112 0.87132971 0.81187488]
 [0.82728283 0.96075915 0.81366112 0.86954818 0.81187488]
 [0.82728283 0.862883   0.75369751 0.96802688 0.81187488]
 [0.82728283 0.7146022  0.81366112 0.87132971 0.93336401]]
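
In scikit-learn 0.22 and later, imputation lives in sklearn.impute.SimpleImputer, which always works column by column. A small sketch on a hypothetical array where 0 marks a missing entry, so the replacement values can be verified by hand:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_small = np.array([[0.0, 2.0], [4.0, 0.0], [8.0, 6.0]])
imp = SimpleImputer(missing_values=0, strategy='mean')  # treat 0 as "missing"
imp_small = imp.fit_transform(X_small)

print(imp.statistics_)   # per-column means of the nonzero entries
print(imp_small)
```

The first column's zero is replaced by the mean of 4 and 8, the second column's zero by the mean of 2 and 6.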

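6. Generating Polynomial Features

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)        # all monomials of the 5 features up to degree 5
poly_X = poly.fit_transform(X)
print(poly_X)                       # sample output; exact values depend on the random X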
[[1.         0.         0.         ... 0.         0.         0.        ]
 [1.         0.         0.94785166 ... 0.32212343 0.25454921 0.20115053]
 [1.         0.70067787 0.         ... 0.         0.         0.        ]
 ...
 [1.         0.82728283 0.         ... 0.         0.         0.        ]
 [1.         0.         0.8635297  ... 0.         0.         0.        ]
 [1.         0.         0.82831898 ... 0.         0.         0.        ]]
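
The truncated output hides how many columns are produced. The sketch below uses a hypothetical two-feature row to show the degree-2 expansion, then counts the columns for degree 5 on 5 features, the setting used above:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2.0, 3.0]])
poly2 = PolynomialFeatures(degree=2)
expanded = poly2.fit_transform(X_small)
print(expanded)                            # columns: 1, a, b, a^2, a*b, b^2

# Degree 5 on 5 input features yields many more columns
poly5 = PolynomialFeatures(degree=5).fit(np.zeros((1, 5)))
print(poly5.n_output_features_)
```

The column count grows combinatorially with the degree, so high degrees are rarely practical on wide data.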

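IV. Creating Models

1. Supervised Learning Estimators

(1) Linear Regression

from sklearn.linear_model import LinearRegression
# The normalize parameter was removed in scikit-learn 1.2; scale features beforehand if needed
lr = LinearRegression()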

(2) Support Vector Machine (SVM)

from sklearn.svm import SVC
svc = SVC(kernel='linear')

(3) Naive Bayes

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

(4) K-Nearest Neighbors (KNN)

from sklearn import neighbors
knn = neighbors.KNeighborsClassifier(n_neighbors=5)

2. Unsupervised Learning Estimators

(1) Principal Component Analysis (PCA)

from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)
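
A float n_components between 0 and 1 is a variance target rather than a component count: PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction. A sketch on synthetic data (the data, seed, and noise level are assumptions made for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
# Three columns that carry almost the same signal, plus a little noise
X_demo = np.hstack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(100, 3))

pca_demo = PCA(n_components=0.95)      # a float in (0, 1) is a variance target, not a count
X_reduced = pca_demo.fit_transform(X_demo)

print(pca_demo.n_components_)                       # number of components actually kept
print(pca_demo.explained_variance_ratio_.sum())     # at least 0.95
```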

(2) K-Means

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)

V. Model Fitting

1. Supervised Learning

lr.fit(X, y)
knn.fit(X_train, y_train)
svc.fit(X_train, y_train)

2. Unsupervised Learning

k_means.fit(X_train)
pca_model = pca.fit_transform(X_train)   # returns the projected data, not the fitted estimator

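VI. Prediction

1. Supervised Estimators

y_pred = svc.predict(np.random.random((3, 5)))   # class labels for three new random samples
y_pred = lr.predict(X_test)                      # continuous regression output
y_proba = knn.predict_proba(X_test)              # per-class probability estimates
y_pred = knn.predict(X_test)                     # class labels, as the classification metrics expect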
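
The difference between predict and predict_proba is easiest to see on a toy problem. In this sketch the one-dimensional training data is hypothetical and chosen so the two classes are clearly separated:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_toy = np.array(['F', 'F', 'F', 'M', 'M', 'M'])
knn_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

X_new = np.array([[0.05], [1.15]])
labels_new = knn_toy.predict(X_new)         # class labels, shape (2,)
proba_new = knn_toy.predict_proba(X_new)    # one column per class in knn_toy.classes_

print(labels_new)
print(proba_new)                            # each row sums to 1
```

Metrics such as accuracy_score expect the label output, while the probability output is what threshold-based metrics consume.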

2. Unsupervised Estimators

y_pred = k_means.predict(X_test)
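
For K-Means, predict assigns each new row to the nearest learned centroid. A sketch with two obvious hypothetical clusters; n_init=10 is set explicitly here because its default has changed across scikit-learn versions:

```python
import numpy as np
from sklearn.cluster import KMeans

X_pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_pts)

print(km.labels_)                          # cluster index assigned to each training row
print(km.cluster_centers_)                 # one centroid per cluster
new_label = km.predict(np.array([[4.9, 5.2]]))
print(new_label)                           # same cluster as the two points near (5, 5)
```

The cluster indices themselves are arbitrary; only the grouping is meaningful.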

VII. Evaluating Model Performance

1. Classification Metrics

(1) Accuracy

print(knn.score(X_test, y_test))            # 1.0

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))       # 1.0

(2) Classification Report

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
               precision    recall  f1-score   support
           F       1.00      1.00      1.00         2
           M       1.00      1.00      1.00         1
    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3

(3) Confusion Matrix

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
[[2 0]
 [0 1]]
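
A perfect score hides how the matrix is laid out. The sketch below uses hypothetical labels with a few mistakes so the rows and columns can be told apart:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

true_labels = ['F', 'F', 'M', 'M', 'M']
pred_labels = ['F', 'M', 'M', 'M', 'F']

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(true_labels, pred_labels, labels=['F', 'M'])
acc = accuracy_score(true_labels, pred_labels)

print(cm)
print(acc)   # diagonal sum divided by the number of samples
```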

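2. Regression Metrics

(1) Mean Absolute Error

from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2]            # illustrative targets, one per row of X_test
y_pred = lr.predict(X_test)      # continuous predictions from the regressor
mean_absolute_error(y_true, y_pred)

(2) Mean Squared Error

from sklearn.metrics import mean_squared_error
mean_squared_error(y_true, y_pred)

(3) R² Score

from sklearn.metrics import r2_score
r2_score(y_true, y_pred)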
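
Each of these metrics is a short formula, which the following sketch computes by hand and compares with scikit-learn. The predictions in y_hat are hypothetical values chosen for easy arithmetic:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_obs = np.array([3, -0.5, 2])
y_hat = np.array([2.5, 0.0, 2])     # hypothetical predictions

mae = np.mean(np.abs(y_obs - y_hat))                      # average absolute residual
mse = np.mean((y_obs - y_hat) ** 2)                       # average squared residual
r2 = 1 - np.sum((y_obs - y_hat) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)

print(mae, mean_absolute_error(y_obs, y_hat))
print(mse, mean_squared_error(y_obs, y_hat))
print(r2, r2_score(y_obs, y_hat))
```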

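3. Clustering Metrics

(1) Adjusted Rand Index

from sklearn.metrics import adjusted_rand_score
y_pred = k_means.predict(X_test)        # cluster labels for the test rows
adjusted_rand_score(y_test, y_pred)     # compare against the true class labels

(2) Homogeneity

from sklearn.metrics import homogeneity_score
homogeneity_score(y_test, y_pred)

(3) V-measure

from sklearn.metrics import v_measure_score
v_measure_score(y_test, y_pred)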
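
These scores compare groupings, not label values, so renaming the clusters does not change them. A sketch with hypothetical label lists:

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score, v_measure_score

labels_true = [0, 0, 1, 1]
labels_swapped = [1, 1, 0, 0]     # same grouping, cluster ids renamed
labels_merged = [0, 0, 0, 0]      # everything in one cluster

ari = adjusted_rand_score(labels_true, labels_swapped)   # identical partition
hom = homogeneity_score(labels_true, labels_merged)      # the single cluster mixes both classes
vm = v_measure_score(labels_true, labels_swapped)

print(ari, hom, vm)
```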

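4. Cross-Validation

from sklearn.model_selection import cross_val_score
# Sample outputs below; exact values depend on the random X
print(cross_val_score(knn, X_train, y_train, cv=4))     # accuracy per fold, e.g. [0.5 0.5 0.5 1. ]
print(cross_val_score(lr, X, y, cv=2))                  # R² per fold, e.g. [-2.1895408  -2.52179574]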
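
The cv argument also accepts a splitter object, which gives control over shuffling. A sketch under assumed settings: the data is random and the KFold parameters are choices made here, so the scores carry no meaning beyond showing the mechanics.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_cv = rng.random((20, 5))
y_cv = np.array([0, 1] * 10)

folds = KFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X_cv, y_cv, cv=folds)

print(scores)          # one accuracy value per fold
print(scores.mean())
```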

VIII. Model Tuning

1. Grid Search

from sklearn.model_selection import GridSearchCV
params = {"n_neighbors": np.arange(1, 3), "metric": ["euclidean", "cityblock"]}
grid = GridSearchCV(estimator=knn, param_grid=params)
grid.fit(X_train, y_train)
print(grid.best_score_)     # 0.5714285714285714
print(grid.best_estimator_.n_neighbors)     # 2
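
Besides best_score_, the fitted search object records every combination it tried. A sketch on random data (the data, the three-class labels, and cv=3 are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_gs = rng.random((30, 5))
y_gs = np.array([0, 1, 2] * 10)

param_grid = {"n_neighbors": [1, 2, 3], "metric": ["euclidean", "cityblock"]}
gs = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=3)
gs.fit(X_gs, y_gs)

print(gs.best_params_)                    # the winning combination
print(len(gs.cv_results_["params"]))      # 3 * 2 = 6 combinations were tried
```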

2. Randomized Parameter Optimization

from sklearn.model_selection import RandomizedSearchCV
params = {"n_neighbors": range(1, 5), "weights": ["uniform", "distance"]}
rsearch = RandomizedSearchCV(estimator=knn,
                             param_distributions=params,
                             cv=4,
                             n_iter=8,
                             random_state=5)
rsearch.fit(X_train, y_train)
print(rsearch.best_score_)  # 0.7142857142857143
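
Randomized search pays off when a parameter is given as a distribution rather than a list, because only n_iter candidates are evaluated. A sketch assuming scipy.stats.randint for the neighbor count; the data and ranges are hypothetical:

```python
import numpy as np
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_rs = rng.random((40, 5))
y_rs = np.array([0, 1] * 20)

param_dist = {"n_neighbors": randint(1, 10),            # integers 1..9, sampled at random
              "weights": ["uniform", "distance"]}
rs = RandomizedSearchCV(KNeighborsClassifier(), param_distributions=param_dist,
                        n_iter=5, cv=4, random_state=5)
rs.fit(X_rs, y_rs)

print(rs.best_params_)
print(len(rs.cv_results_["params"]))                    # n_iter candidates were evaluated
```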

Adapted from DataCamp, Learn Python for Data Science Interactively