Summary of the sklearn module

sklearn module


Continuously updated…
5.17 update: linear regression & support vector machine

linear_regression

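Loading the dataset

  • datasets.load_diabetes(*, return_X_y=False, as_frame=False)

    Diabetes dataset (regression task)

    • return_X_y : if False, returns a Bunch object (dictionary-like); if True, returns a (data, target) tuple

    • as_frame : if True, the data is returned as a pandas DataFrame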
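To make the two parameters concrete, here is a minimal sketch of the three ways of calling load_diabetes. The variable names (bunch, X_df, y_s) are chosen for illustration only, and the as_frame variant assumes pandas is installed.

```python
from sklearn import datasets

# Default call: a Bunch object, accessed like a dictionary or by attribute
bunch = datasets.load_diabetes()
print(bunch.data.shape, bunch.target.shape)  # (442, 10) (442,)

# return_X_y=True: a (data, target) tuple of NumPy arrays
X, y = datasets.load_diabetes(return_X_y=True)
print(type(X).__name__, X.shape)

# as_frame=True: data becomes a pandas DataFrame, target a pandas Series
X_df, y_s = datasets.load_diabetes(return_X_y=True, as_frame=True)
print(list(X_df.columns))
```

The DataFrame form keeps the feature names, which helps when selecting a single column such as bmi instead of indexing by position.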

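Core functions

  • reg = linear_model.LinearRegression()

    Creates a linear regression object

  • reg.fit(X, y)

    The core step: fits the line to the training data

    X is train_data

    y is train_target

  • y = reg.predict(X)

    Predicts the targets for the test set

    X is test_data

    y is test_pred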
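The three calls above can be tried on a tiny example before moving to a real dataset. This is only a sketch: the toy data (points lying exactly on y = 2x + 1) is made up for illustration and is not part of the original post.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical toy data lying exactly on the line y = 2x + 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
y_train = np.array([1.0, 3.0, 5.0, 7.0])

# Create the estimator and fit it to the training data
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

# Predict targets for unseen inputs
X_test = np.array([[4.0], [5.0]])
y_pred = reg.predict(X_test)

print(reg.coef_, reg.intercept_)  # slope close to 2, intercept close to 1
print(y_pred)                     # close to 9 and 11
```

Note that X must be two-dimensional even with a single feature, which is why the full example later in this post uses np.newaxis when selecting one column.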

Evaluation metrics

  • MSE (mean squared error)

$$MSE=\frac{1}{n}\sum_{i=0}^{n-1}(y_i-\hat y_i)^2$$

A smaller value means the predictions are closer to the true values, i.e. the fit is better.

  • MAE (mean absolute error)

$$MAE=\frac{1}{n}\sum_{i=0}^{n-1}|y_i-\hat y_i|$$

As with MSE, a smaller value means a better fit.

  • $R^2$ (coefficient of determination)

$$R^2=1-\frac{\sum_{i=0}^{n-1}(y_i-\hat y_i)^2}{\sum_{i=0}^{n-1}(y_i-\bar y)^2}$$

Here $\bar y$ is the mean of the true values. $R^2\le 1$, and the closer it is to 1 the better. It equals 0 for a model that always predicts $\bar y$, and it can be negative on test data when the model does worse than that.
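The three formulas can be checked against sklearn.metrics with a few lines of NumPy. The numbers below are made-up illustration values, not results from the diabetes dataset. The last lines also show that $R^2$ drops below 0 when predictions are worse than simply predicting the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Direct implementations of the formulas
mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, mean_squared_error(y_true, y_pred))   # both 0.375
print(mae, mean_absolute_error(y_true, y_pred))  # both 0.5
print(r2, r2_score(y_true, y_pred))              # both about 0.949

# A deliberately bad constant prediction gives a negative R^2
y_bad = np.full_like(y_true, 10.0)
r2_bad = r2_score(y_true, y_bad)
print(r2_bad)  # negative
```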

sklearn implementation

from sklearn import linear_model, datasets
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score


if __name__ == '__main__':
    #load data
    diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

    # Use only one feature
    diabetes_X = diabetes_X[:, np.newaxis, 2]

    # Split the data into training/testing sets
    diabetes_X_train = diabetes_X[:-20]
    diabetes_X_test = diabetes_X[-20:]

    # Split the targets into training/testing sets
    diabetes_y_train = diabetes_y[:-20]
    diabetes_y_test = diabetes_y[-20:]

    # Create linear regression object
    regr = linear_model.LinearRegression()

    # Train the model using the training sets
    regr.fit(diabetes_X_train, diabetes_y_train)

    # Make predictions using the testing set
    diabetes_y_pred = regr.predict(diabetes_X_test)

    # The coefficients
    print('Coefficients: \n', regr.coef_)
    # The mean squared error
    print('Mean squared error: %.2f'
        % mean_squared_error(diabetes_y_test, diabetes_y_pred))
    # The coefficient of determination: 1 is perfect prediction
    print('Coefficient of determination: %.2f'
        % r2_score(diabetes_y_test, diabetes_y_pred))

    # Plot outputs
    plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
    plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

    plt.xticks(())
    plt.yticks(())

    plt.show()

support_vector_machine

  • Core library
# Classification task
from sklearn.svm import SVC

Loading the dataset

Banknote authentication dataset

import pandas as pd
from sklearn.model_selection import train_test_split

# Load data
filename = ''  # path to the banknote CSV file
bankdata = pd.read_csv(filename)
# print(bankdata.head())
# Columns:
# Variance: variance of the image
# Skewness: skewness
# Kurtosis: kurtosis
# Entropy: entropy
# Class: class label

# Preprocessing: separate the features from the label, then hold out 20% for testing
X = bankdata.drop('Class', axis=1)
y = bankdata['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
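Because train_test_split shuffles at random, the split above changes on every run. The sketch below shows two optional parameters that make it reproducible and class-balanced. The stand-in arrays are hypothetical and only replace the CSV so that the snippet runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 4 features, balanced binary labels
X = np.arange(400).reshape(100, 4)
y = np.array([0, 1] * 50)

# random_state fixes the shuffle; stratify=y keeps the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
print(np.bincount(y_test))          # [10 10]
```

Fixing random_state is a choice for reproducibility, not something the original code does; without it the confusion matrix reported later will vary slightly between runs.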

Core functions

  • svclassifier = SVC(kernel='linear')

    Creates an SVC object with a linear kernel

  • svclassifier.fit(X, y)

    Fits the separating hyperplane

    X: train_data

    y: train_target

  • svclassifier.predict(X)

    Predicts the targets for the test set

    X: test_data
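As a quick sanity check of these calls, the following sketch fits a linear SVC on six hand-made 2-D points. The data is hypothetical and chosen to be linearly separable, so the expected labels are easy to verify by eye.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data: class 0 near the origin, class 1 near (3, 3)
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Create the classifier with a linear kernel and fit the hyperplane
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)

# Predict labels for two new points, one on each side of the boundary
X_test = np.array([[0.5, 0.5], [3.5, 3.5]])
y_pred = svclassifier.predict(X_test)
print(y_pred)  # [0 1]

# With a linear kernel, the hyperplane w.x + b = 0 is available directly
print(svclassifier.coef_, svclassifier.intercept_)
```

The coef_ attribute is only defined for the linear kernel; for other kernels the hyperplane lives in an implicit feature space.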

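Evaluation metrics

  • Confusion matrix

Also known as a contingency table or error matrix. In sklearn's confusion_matrix, the entry in row i and column j is the number of samples whose true class is i and whose predicted class is j.

[Figure: illustration of a confusion matrix]

The entries on the diagonal are therefore the correctly classified samples.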

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
  • classification_report

    Includes precision, recall, F1 score, and so on

    • Precision

    Of the samples predicted as positive, the fraction that are actually positive

    • Recall

    Of all samples labeled positive, the fraction that were correctly predicted

    • F1 score

    $$F_1=\frac{2\cdot\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}$$
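The definitions above can be computed by hand from the four cells of a binary confusion matrix and compared with sklearn's own functions. The label lists below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels (1 = positive)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # correct among predicted positives
recall = tp / (tp + fn)     # found among actual positives
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))  # both 0.8
print(recall, recall_score(y_true, y_pred))        # both about 0.667
print(f1, f1_score(y_true, y_pred))                # both about 0.727
```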

sklearn implementation

# -*- coding:utf-8 -*-

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

if __name__ == '__main__':
    # Load data
    filename = r'D:\VS-Code-python\ML_algorithm\support_vector_machine\bill_authentication.csv'
    bankdata = pd.read_csv(filename)
    # print(bankdata.head())
    # Columns:
    # Variance: variance of the image
    # Skewness: skewness
    # Kurtosis: kurtosis
    # Entropy: entropy
    # Class: class label
    
    # Preprocessing
    X = bankdata.drop('Class', axis=1)
    y = bankdata['Class']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

    #training
    svclassifier = SVC(kernel='linear')
    svclassifier.fit(X_train, y_train)

    #prediction
    y_pred = svclassifier.predict(X_test)

    #assessment
    print('confusion_matrix\n',confusion_matrix(y_test, y_pred))
    print('classification_report\n',classification_report(y_test, y_pred))
  • Output:
confusion_matrix:
[[170   0]
 [  2 103]]
classification_report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       170
           1       1.00      0.98      0.99       105

   micro avg       0.99      0.99      0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275
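The per-class rows of the report can be reproduced from the confusion matrix printed above. The sketch below hard-codes that matrix and applies the row and column sums with NumPy. It is a cross-check of the arithmetic, not a replacement for classification_report.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class)
cm = np.array([[170, 0],
               [2, 103]])

correct = np.diag(cm)                # correctly classified samples per class
precision = correct / cm.sum(axis=0)  # divide by column sums (predicted as each class)
recall = correct / cm.sum(axis=1)     # divide by row sums (actually in each class)
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)              # number of true samples per class
accuracy = correct.sum() / cm.sum()

for label in range(len(cm)):
    print(label, round(precision[label], 2), round(recall[label], 2),
          round(f1[label], 2), support[label])
print('accuracy', round(accuracy, 2))
```

Rounded to two decimals, this gives 0.99 / 1.00 / 0.99 for class 0 and 1.00 / 0.98 / 0.99 for class 1, with supports 170 and 105, which agrees with the report.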

Reference:

scikit-learn (sklearn) official documentation, Chinese edition
