sklearn module

Work in progress…

5.17 update: linear regression & support vector machine
linear_regression

Loading the dataset

- datasets.load_diabetes(*, return_X_y=False, as_frame=False)
  loads the diabetes dataset (a regression task)
  - return_X_y: if False, returns a Bunch object (dict-like); if True, returns the tuple (data, target)
  - as_frame: if True, the data is returned as a pandas DataFrame
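The two return modes can be checked with a quick sketch (the shapes below are those of scikit-learn's bundled diabetes data):

```python
from sklearn import datasets

# return_X_y=True gives the (data, target) tuple directly
X, y = datasets.load_diabetes(return_X_y=True)
print(X.shape)  # (442, 10): 442 samples, 10 features
print(y.shape)  # (442,)

# return_X_y=False (the default) gives a Bunch, a dict-like object
bunch = datasets.load_diabetes()
print(bunch.data.shape == X.shape)  # True
```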
Core functions

- reg = linear_model.LinearRegression()
  creates a linear regression object
- reg.fit(X, y)
  the core step: fits the line
  X is train_data, y is train_target
- y = reg.predict(X)
  predicts on the test set
  X is test_data, y is test_pred
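A minimal fit/predict round trip on synthetic data (the exact line y = 2x + 1, chosen here so the fitted parameters are easy to verify) illustrates the three calls above:

```python
import numpy as np
from sklearn import linear_model

# Training data sampled exactly from y = 2x + 1 (no noise)
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2 * X_train.ravel() + 1

reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

X_test = np.array([[4.0]])
y_pred = reg.predict(X_test)
print(reg.coef_, reg.intercept_)  # ≈ [2.] 1.0
print(y_pred)                     # ≈ [9.]
```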
Evaluation metrics

- MSE (mean squared error)
  $MSE=\frac{1}{n}\sum_{i=0}^{n-1}(y_i-\hat y_i)^2$
  The smaller the value, the closer the predictions are to the true values, i.e. the better the fit.
- MAE (mean absolute error)
  $MAE=\frac{1}{n}\sum_{i=0}^{n-1}|y_i-\hat y_i|$
  The smaller the value, the closer the predictions are to the true values.
- $R^2$ (coefficient of determination)
  $R^2=1-\frac{\sum_{i=0}^{n-1}(y_i-\hat y_i)^2}{\sum_{i=0}^{n-1}(y_i-\overline y)^2}$
  $R^2\le 1$, and the closer to 1, the better the fit ($R^2$ can even be negative when the model fits worse than simply predicting the mean).
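As a sanity check, the three formulas above can be computed by hand with NumPy and compared against `sklearn.metrics` (the toy `y_true`/`y_pred` values here are arbitrary):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Each metric computed directly from its formula
mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# They match the library implementations
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
print(mse, mae, r2)  # 0.375 0.5 ≈0.9486
```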
sklearn implementation

```python
from sklearn import linear_model, datasets
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

if __name__ == '__main__':
    # load data
    diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
    # Use only one feature
    diabetes_X = diabetes_X[:, np.newaxis, 2]
    # Split the data into training/testing sets
    diabetes_X_train = diabetes_X[:-20]
    diabetes_X_test = diabetes_X[-20:]
    # Split the targets into training/testing sets
    diabetes_y_train = diabetes_y[:-20]
    diabetes_y_test = diabetes_y[-20:]
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(diabetes_X_train, diabetes_y_train)
    # Make predictions using the testing set
    diabetes_y_pred = regr.predict(diabetes_X_test)
    # The coefficients
    print('Coefficients: \n', regr.coef_)
    # The mean squared error
    print('Mean squared error: %.2f'
          % mean_squared_error(diabetes_y_test, diabetes_y_pred))
    # The coefficient of determination: 1 is perfect prediction
    print('Coefficient of determination: %.2f'
          % r2_score(diabetes_y_test, diabetes_y_pred))
    # Plot outputs
    plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
    plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
    plt.xticks(())
    plt.yticks(())
    plt.show()
```
support_vector_machine

Core library

```python
# classification task
from sklearn.svm import SVC
```

Loading the dataset

```python
from sklearn.model_selection import train_test_split

# load data
filename = ''
bankdata = pd.read_csv(filename)
# print(bankdata.head())
# properties:
#   Variance: variance of the image
#   Skewness: skewness
#   Kurtosis: kurtosis
#   Entropy: entropy
#   Class: label
# pretreatment
X = bankdata.drop('Class', axis=1)
y = bankdata['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
```
Core functions

- svclassifier = SVC(kernel='linear')
  creates an SVC object with a linear kernel
- svclassifier.fit(X, y)
  fits the separating hyperplane
  X: train_data, y: train_target
- svclassifier.predict(X)
  predicts targets for the test set
  X: test_data
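The same three calls can be exercised on a tiny hand-made dataset; the two 1-D clusters below are linearly separable, so a linear kernel classifies them cleanly:

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in one dimension
X_train = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)

# Points near each cluster are assigned to that cluster
pred = svclassifier.predict([[0.2], [5.8]])
print(pred)  # [0 1]
```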
Evaluation metrics

- Confusion matrix
  Also known as a contingency table or error matrix. Each row corresponds to a true class and each column to a predicted class, so the entries on the diagonal are the correctly classified samples.
  from sklearn.metrics import confusion_matrix
  print(confusion_matrix(y_test, y_pred))
- classification_report
  Includes precision, recall, the F1 score, and more:
  - Precision: of the samples predicted positive, the fraction that are actually positive
  - Recall: of all truly positive samples, the fraction that were correctly predicted
  - F1 score: $F1=\frac{2\cdot precision\cdot recall}{precision+recall}$
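The definitions above can be verified by computing precision, recall, and F1 directly from the confusion-matrix cells (TN, FP, FN, TP) and comparing with `sklearn.metrics`; the `y_test`/`y_pred` labels here are an arbitrary toy example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_test = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])

cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()  # row i = true class i, column j = predicted class j

# Metrics from their definitions
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# They match the library implementations
assert np.isclose(precision, precision_score(y_test, y_pred))
assert np.isclose(recall, recall_score(y_test, y_pred))
assert np.isclose(f1, f1_score(y_test, y_pred))
print(cm)  # [[3 1]
           #  [1 3]]
```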
sklearn implementation

```python
# -*- coding:utf-8 -*-
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

if __name__ == '__main__':
    # load data
    filename = r'D:\VS-Code-python\ML_algorithm\support_vector_machine\bill_authentication.csv'
    bankdata = pd.read_csv(filename)
    # print(bankdata.head())
    # properties:
    #   Variance: variance of the image
    #   Skewness: skewness
    #   Kurtosis: kurtosis
    #   Entropy: entropy
    #   Class: label
    # pretreatment
    X = bankdata.drop('Class', axis=1)
    y = bankdata['Class']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
    # training
    svclassifier = SVC(kernel='linear')
    svclassifier.fit(X_train, y_train)
    # prediction
    y_pred = svclassifier.predict(X_test)
    # assessment
    print('confusion_matrix\n', confusion_matrix(y_test, y_pred))
    print('classification_report\n', classification_report(y_test, y_pred))
```
- out:

```text
confusion_matrix
[[170   0]
 [  2 103]]
classification_report
              precision    recall  f1-score   support

           0       0.99      1.00      0.99       170
           1       1.00      0.98      0.99       105

   micro avg       0.99      0.99      0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275
```