一、 Installation
1、Installing Scikit-plot is simple; a single command:
pip install scikit-plot
completes the installation.
2、Repository:
https://github.com/reiinakano/scikit-plot
二、 Usage
All of the plotting functions in Scikit-plot's four modules, collected from the official documentation:
scikitplot.metrics
- plot_confusion_matrix: confusion matrix for classification
- plot_precision_recall: precision-recall curve for classification
- plot_roc: ROC curve for classification
- plot_ks_statistic: KS statistic plot for binary classification
- plot_silhouette: silhouette coefficient, a measure of clustering quality
- plot_calibration_curve: calibration curve for classifier probabilities
- plot_cumulative_gain: cumulative gains curve
- plot_lift_curve: lift curve
scikitplot.estimators
- plot_learning_curve: learning curve
- plot_feature_importances: feature importances
scikitplot.cluster
- plot_elbow_curve: elbow curve for choosing the number of clusters
scikitplot.decomposition
- plot_pca_component_variance: explained variance of PCA components
- plot_pca_2d_projection: projection of high-dimensional data onto 2D
1、ROC curve for classification
Complete code:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
nb = GaussianNB()
nb.fit(X_train, y_train)
predicted_probas = nb.predict_proba(X_test)
# The magic happens here
import matplotlib.pyplot as plt
import scikitplot as skplt
skplt.metrics.plot_roc(y_test, predicted_probas)
plt.show()
Result:
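For reference, each curve that plot_roc draws is a one-vs-rest ROC, and the same numbers can be computed directly with sklearn's roc_curve. A minimal sketch for one class (digit 0 vs. the rest), reusing the setup above; the random_state is an addition here for reproducibility:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, auc

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=0)
probas = GaussianNB().fit(X_train, y_train).predict_proba(X_test)

# One-vs-rest ROC for class 0: positive label is "digit == 0"
fpr, tpr, thresholds = roc_curve(y_test == 0, probas[:, 0])
auc_0 = auc(fpr, tpr)  # area under that single curve
```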
2、P-R curve
The precision vs. recall curve, with recall on the x-axis and precision on the y-axis.
Complete code:
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_digits as load_data
import scikitplot as skplt
# Load dataset
X, y = load_data(return_X_y=True)
# Create classifier instance then fit
nb = GaussianNB()
nb.fit(X,y)
# Get predicted probabilities
y_probas = nb.predict_proba(X)
#skplt.metrics.plot_precision_recall_curve(y, y_probas, cmap='nipy_spectral')
skplt.metrics.plot_precision_recall(y, y_probas, cmap='nipy_spectral')
plt.show()
Version note:
FutureWarning: Function plot_precision_recall_curve is deprecated; This will be removed in v0.5.0.
Please use scikitplot.metrics.plot_precision_recall instead.
# skplt.metrics.plot_precision_recall_curve(y, y_probas, cmap='nipy_spectral')  # deprecated; as of v0.5.0 use the line below
skplt.metrics.plot_precision_recall(y, y_probas, cmap='nipy_spectral')
Result:
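Each curve in the figure corresponds to one class treated one-vs-rest, which is what sklearn's precision_recall_curve computes. A minimal sketch for a single class under the same Gaussian NB setup:

```python
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_recall_curve

X, y = load_digits(return_X_y=True)
y_probas = GaussianNB().fit(X, y).predict_proba(X)

# One-vs-rest precision/recall pairs for class 0 at every probability threshold
precision, recall, thresholds = precision_recall_curve(y == 0, y_probas[:, 0])
```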
3、Confusion matrix
An important evaluation tool for classification. The code below classifies the digits dataset with a random forest and plots a normalized confusion matrix of the cross-validated predictions.
Complete code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits as load_data
from sklearn.model_selection import cross_val_predict
import matplotlib.pyplot as plt
import scikitplot as skplt
X, y = load_data(return_X_y=True)
# Create an instance of the RandomForestClassifier
classifier = RandomForestClassifier()
# Perform predictions
predictions = cross_val_predict(classifier, X, y)
plot = skplt.metrics.plot_confusion_matrix(y, predictions, normalize=True)
plt.show()
Result:
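With normalize=True each row is divided by its row total, so the diagonal entries are per-class recall. The same matrix can be computed directly with sklearn's confusion_matrix (its normalize parameter requires sklearn >= 0.22); the random_state is an addition for reproducibility:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

X, y = load_digits(return_X_y=True)
predictions = cross_val_predict(RandomForestClassifier(random_state=0), X, y)

# normalize='true' divides each row by the number of true samples in that class
cm = confusion_matrix(y, predictions, normalize='true')
```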
4、Calibration curve
Complete code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import scikitplot as skplt
# Set max_iter to a larger value; the default is 100.
X, y = make_classification(n_samples=100000, n_features=20,
                           n_informative=2, n_redundant=2,
                           random_state=20)
X_train, y_train, X_test, y_test = X[:1000], y[:1000], X[1000:], y[1000:]
rf_probas = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
#lr_probas = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)
lr_probas = LogisticRegression(max_iter=7600).fit(X_train, y_train).predict_proba(X_test)
nb_probas = GaussianNB().fit(X_train, y_train).predict_proba(X_test)
sv_scores = LinearSVC().fit(X_train, y_train).decision_function(X_test)
probas_list = [rf_probas, lr_probas, nb_probas, sv_scores]
clf_names = ['Random Forest',
             'Logistic Regression',
             'Gaussian Naive Bayes',
             'Support Vector Machine']
skplt.metrics.plot_calibration_curve(y_test,
                                     probas_list=probas_list,
                                     clf_names=clf_names,
                                     n_bins=10)
plt.show()
Result:
Problem encountered:
C:\Users\wu\AppData\Roaming\Python\Python37\site-packages\sklearn\svm\_base.py:947: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
"the number of iterations.", ConvergenceWarning)
Solution:
Increase max_iter in LogisticRegression(). (Note that the warning above is emitted from sklearn's svm module, i.e. by LinearSVC, which can be given a larger max_iter in the same way.)
# lr_probas = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)
lr_probas = LogisticRegression(max_iter=7600).fit(X_train, y_train).predict_proba(X_test)
Other approaches can be found at:
https://stackoverflow.com/questions/52670012/convergencewarning-liblinear-failed-to-converge-increase-the-number-of-iterati
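Another common fix from that thread is to standardize the features before fitting, which usually lets the solver converge within the default iteration budget without touching max_iter. A sketch using a Pipeline on the same synthetic data:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100000, n_features=20,
                           n_informative=2, n_redundant=2,
                           random_state=20)
X_train, y_train, X_test, y_test = X[:1000], y[:1000], X[1000:], y[1000:]

# Standardizing the inputs typically removes the need for a large max_iter
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
lr_probas = pipe.predict_proba(X_test)
```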
5、plot_cumulative_gain
Complete code:
from __future__ import absolute_import
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer as load_data
import scikitplot as skplt
X, y = load_data(return_X_y=True)
#lr = LogisticRegression()
lr = LogisticRegression(max_iter=7600)
lr.fit(X, y)
probas = lr.predict_proba(X)
skplt.metrics.plot_cumulative_gain(y_true=y, y_probas=probas)
plt.show()
Result:
Problem encountered:
Solution:
Same reason as above; increase max_iter:
#lr = LogisticRegression()
lr = LogisticRegression(max_iter=7600)
6、plot_elbow_curve
Elbow curve for choosing the number of clusters.
Complete code:
from __future__ import absolute_import
import matplotlib.pyplot as plt
import scikitplot as skplt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris as load_data
X, y = load_data(return_X_y=True)
kmeans = KMeans(random_state=1)
skplt.cluster.plot_elbow_curve(kmeans, X, cluster_ranges=range(1, 11))
plt.show()
Result:
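The quantity on the curve's y-axis is the KMeans inertia (within-cluster sum of squared distances); the "elbow" where its drop flattens suggests a good k. A sketch computing the same values directly:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Inertia (sum of squared distances to the nearest centroid) for each k
inertias = [KMeans(n_clusters=k, random_state=1).fit(X).inertia_
            for k in range(1, 11)]
```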
7、 plot_feature_importances
The plot_feature_importances function in Scikit-plot sorts the feature importances and plots them.
The call below passes 2 arguments:
- rf: the fitted random forest classifier
- feature_names: the feature names, 4 in this example
Two more parameters are useful when there are many long feature names: x_tick_rotation (e.g. 90 degrees, without which the x-axis labels become unreadable) and figsize (the figure size).
Complete code:
完整代码:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris as load_data
import matplotlib.pyplot as plt
import scikitplot as skplt
X, y = load_data(return_X_y=True)
rf = RandomForestClassifier()
rf.fit(X, y)
skplt.estimators.plot_feature_importances(rf, feature_names=['sepal length',
                                                             'sepal width',
                                                             'petal length',
                                                             'petal width'])
plt.show()
Result:
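The bar chart is built from the estimator's feature_importances_ attribute; the same ranking can be obtained directly with numpy. A sketch with a fixed seed added for reproducibility:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

data = load_iris()
rf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# Importances sum to 1; argsort in reverse gives the plot's descending order
order = np.argsort(rf.feature_importances_)[::-1]
ranked = [(data.feature_names[i], float(rf.feature_importances_[i]))
          for i in order]
```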
8、plot_ks_statistic
Plots the KS (Kolmogorov-Smirnov) statistic for a binary classifier: the largest gap between the cumulative distributions of the predicted positive-class probability for the two true classes.
Complete code:
from __future__ import absolute_import
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer as load_data
import scikitplot as skplt
X, y = load_data(return_X_y=True)
lr = LogisticRegression(max_iter=7600)
lr.fit(X, y)
probas = lr.predict_proba(X)
skplt.metrics.plot_ks_statistic(y_true=y, y_probas=probas)
plt.show()
Result:
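The statistic marked on the plot can be computed by hand: sort each class's predicted scores, build the two empirical CDFs, and take their maximum gap. The helper below (ks_statistic, an illustrative function, not part of scikit-plot) sketches this:

```python
import numpy as np

def ks_statistic(y_true, scores):
    """Max gap between the two classes' empirical CDFs of `scores`."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    thresholds = np.unique(scores)
    pos = np.sort(scores[y_true == 1])
    neg = np.sort(scores[y_true == 0])
    # Fraction of each class at or below every threshold
    cdf_pos = np.searchsorted(pos, thresholds, side="right") / len(pos)
    cdf_neg = np.searchsorted(neg, thresholds, side="right") / len(neg)
    return float(np.max(np.abs(cdf_pos - cdf_neg)))
```

Perfectly separated scores give KS = 1.0; identically distributed scores give 0.0.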
9、 plot_learning_curve
Complete code:
from __future__ import absolute_import
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer as load_data
import scikitplot as skplt
X, y = load_data(return_X_y=True)
rf = RandomForestClassifier()
skplt.estimators.plot_learning_curve(rf, X, y)
plt.show()
Result:
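The underlying numbers come from sklearn's learning_curve, which refits the model on growing training fractions under cross-validation; plot_learning_curve is essentially a plot of its output. A sketch, with random_state added for reproducibility:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

# Train/validation scores at each training-set size, one column per CV fold
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(random_state=0), X, y, cv=5)
```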
10、plot_lift_curve
Complete code:
from __future__ import absolute_import
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer as load_data
import scikitplot as skplt
X, y = load_data(return_X_y=True)
lr = LogisticRegression(max_iter=7600)
lr.fit(X, y)
probas = lr.predict_proba(X)
skplt.metrics.plot_lift_curve(y_true=y, y_probas=probas)
plt.show()
Result:
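At a given fraction of the population, ranked by predicted probability, lift is the positive rate inside that top fraction divided by the overall positive rate. The helper below (lift_at, an illustrative function, not part of scikit-plot) sketches that computation:

```python
import numpy as np

def lift_at(y_true, scores, fraction):
    """Lift in the top `fraction` of samples ranked by score."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)[::-1]            # highest score first
    n_top = max(1, int(round(fraction * len(y_true))))
    top_rate = y_true[order[:n_top]].mean()     # positive rate in top slice
    base_rate = y_true.mean()                   # overall positive rate
    return float(top_rate / base_rate)
```

With perfect ranking and a 50% base rate, the top half contains only positives, so the lift there is 2.0; at fraction 1.0 the lift is always 1.0.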
11、plot_pca_2d_projection
Complete code:
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits as load_data
import scikitplot as skplt
import matplotlib.pyplot as plt
X, y = load_data(return_X_y=True)
pca = PCA(random_state=1)
pca.fit(X)
skplt.decomposition.plot_pca_2d_projection(pca, X, y)
plt.show()
Result:
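The scatter is essentially the first two principal-component scores of each sample, colored by class; the same coordinates can be obtained directly:

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
pca = PCA(random_state=1).fit(X)

# Coordinates plotted by plot_pca_2d_projection: first two components per sample
projection = pca.transform(X)[:, :2]
```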
12、plot_pca_component_variance
Complete code:
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits as load_data
import scikitplot as skplt
import matplotlib.pyplot as plt
X, y = load_data(return_X_y=True)
pca = PCA(random_state=1)
pca.fit(X)
skplt.decomposition.plot_pca_component_variance(pca)
plt.show()
Result:
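The curve comes from the fitted PCA's explained_variance_ratio_; its cumulative sum tells how many components are needed to reach a target variance level, e.g. 90%:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
pca = PCA(random_state=1).fit(X)

# Cumulative explained variance, and the component count reaching 90%
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components_90 = int(np.searchsorted(cumvar, 0.90)) + 1
```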
13、plot_silhouette
Complete code:
from __future__ import absolute_import
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris as load_data
import scikitplot as skplt
X, y = load_data(return_X_y=True)
kmeans = KMeans(n_clusters=4, random_state=1)
cluster_labels = kmeans.fit_predict(X)
skplt.metrics.plot_silhouette(X, cluster_labels)
plt.show()
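The summary number behind the plot, the average silhouette coefficient over all samples, is available directly from sklearn; values closer to 1 indicate tighter, better-separated clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)
cluster_labels = KMeans(n_clusters=4, random_state=1).fit_predict(X)

# Mean silhouette coefficient over all samples; always in [-1, 1]
score = silhouette_score(X, cluster_labels)
```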