Machine Learning Introductory Notes (Part 1)
Basic Categories
- Supervised: both input and output (label) data are used to train the model.
- Unsupervised: only input data is used; there are no output labels.
Example Algorithms
- Regression: finds the mathematical relationship between input-output pairs and uses that fitted relationship to predict output from input; supervised.
- Classification and regression trees (CART): split the inputs into different groups and observe the resulting outputs, thereby exploring the relationship between them; supervised.
- Clustering: partitions the inputs as a whole, grouping similar inputs into clusters; unsupervised.
Regression
- Linear: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$
  We want the line that best represents the relationship between the x's and y, so we look for the coefficients that minimize the error (the deviation of the predictions). Common error measures: $\mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ (smaller is better) and $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ (closer to 1 is better).
- Logistic: used when the dependent variable is categorical (0/1):
  if $\alpha + \beta_1 x_1 + \epsilon > 0$, predict $y = 1$;
  else predict $y = 0$.
  (Equivalently, predict $y = 1$ when the logistic probability $1 / (1 + e^{-(\alpha + \beta_1 x_1)})$ exceeds 0.5.)
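To make these definitions concrete, here is a minimal sketch on synthetic data (all names and numbers are illustrative, not from the notes) that fits a one-variable line by least squares and computes MSE and $R^2$ as defined above:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)   #y = alpha + beta*x + noise
beta, alpha = np.polyfit(x, y, 1)           #least-squares slope and intercept
y_hat = alpha + beta * x
mse = np.mean((y - y_hat) ** 2)                                    #smaller is better
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)  #closer to 1 is better
print("alpha=%.2f beta=%.2f MSE=%.2f R^2=%.2f" % (alpha, beta, mse, r2))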
Example
Distinguishing Rocks vs. Mines with linear regression
For illustration only and not recommended in practice: classification problems are better suited to logistic regression.
step 1: Load the data
The first 60 columns are the independent variables; column 61 is the dependent variable, indicating whether the sample is a Rock or a Mine.
import pandas as pd
import numpy as np  #needed below for np.where and np.zeros_like
url="https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
df = pd.read_csv(url,header=None)
df
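As a quick sanity check, the shape can be inspected (this sonar dataset is expected to have 208 rows and 61 columns: 60 features plus the label):

df.shape  #expect (208, 61)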
We can first look at the correlations between the variables.
import matplotlib.pyplot as plt  #the plotting code further down relies on the plt alias
plt.pcolor(df.corr(), cmap='coolwarm')  #colormap reference: https://matplotlib.org/examples/color/colormaps_reference.html
plt.show()
The correlations here are fairly uniform. The red diagonal through the middle is each variable's correlation with itself, which is always 1, as expected. The plot shows that the first five and the last ten independent variables have a slightly higher correlation with the dependent variable in column 60. (Note: df.corr() only includes numeric columns, so the R/M labels must be converted to numbers, as in the next step, before column 60 appears in the heatmap.)
To run the regression, the letters R and M must be converted to numbers:
df[60] = np.where(df[60]=='R', 0, 1)  #R becomes 0, M (Mine) becomes 1
step 2: Build the train/test sets
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.3)
#Hold out 30% of the data for testing: the rows are split 7:3 first, then each part is separated into x and y (train_test_split accepts random_state for a reproducible split)
x_train = train.iloc[0:,0:60]
y_train = train[60]
x_test = test.iloc[0:,0:60]
y_test = test[60]
step 3: Fit the model
from sklearn import linear_model
model = linear_model.LinearRegression()
model.fit(x_train,y_train)
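Optionally, the fitted parameters can be inspected; they correspond to $\alpha$ and the $\beta_i$ in the linear formula above:

model.intercept_  #the fitted alpha
model.coef_       #the 60 fitted beta coefficients, one per input column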
step 4: Make predictions
testing_predictions = model.predict(x_test)
#Returns an array of predicted values (continuous, since this is linear regression)
def get_classification(predictions, threshold):
    #Convert continuous regression outputs to 0/1 classes at the given threshold
    classes = np.zeros_like(predictions)
    for i in range(len(classes)):
        if predictions[i] > threshold:
            classes[i] = 1
    return classes
get_classification(testing_predictions,0.5)
#Predictions above 0.5 are treated as 1, otherwise 0
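As a side note, the same thresholding can be written in one vectorized line, equivalent to the loop above:

np.where(testing_predictions > 0.5, 1, 0)  #1 where the prediction exceeds 0.5, else 0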
step 5: Evaluate the model
Confusion matrix
| | Predicted − | Predicted + |
|---|---|---|
| Actual − | tn: true negative | fp: false positive |
| Actual + | fn: false negative | tp: true positive |
Metrics:
- true positive rate (sensitivity/recall) = tpr = tp / ( tp + fn )
- true negative rate (specificity) = tnr = tn / ( tn + fp )
- precision = tp / ( tp + fp ): of all cases predicted positive, the proportion that are actually positive
Q: How do we weigh the trade-off between the different metrics?
- F-score: $F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
- Accuracy = ( tp + tn ) / ( tp + tn + fp + fn ): the proportion of all predictions that are correct
- Misclassification rate = (fp + fn ) / ( tp + tn + fp + fn )
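A quick worked example with made-up counts, to make the formulas concrete (tp = 40, tn = 30, fp = 10, fn = 20; 100 samples in total):
- precision = 40 / (40 + 10) = 0.80
- recall = tpr = 40 / (40 + 20) ≈ 0.67
- tnr = 30 / (30 + 10) = 0.75
- F = 2(0.80 × 0.67) / (0.80 + 0.67) ≈ 0.73
- Accuracy = (40 + 30) / 100 = 0.70, Misclassification rate = 0.30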
from sklearn.metrics import confusion_matrix

def c_m_analysis(true, pred, threshold):
    #Report the main confusion-matrix metrics at a given threshold
    tn, fp, fn, tp = confusion_matrix(true, get_classification(pred, threshold)).ravel()
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)  #recall and tpr are the same quantity
    tpr = recall
    fpr = fp/(fp+tn)
    f_score = 2*precision*recall/(precision+recall)
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    print("Precision:\t\t\t%1.2f of samples identified as mines are mines"%(precision))
    print("Recall/TPR:\t\t\t%1.2f proportion of actual mines identified"%(recall))
    print("False Positive Rate:\t\t%1.2f proportion of rocks identified as mines"%fpr)
    print("f-score:\t\t\t%1.2f tradeoff between precision and recall"%(f_score))
    print("Accuracy:\t\t\t%1.2f how well the model has classified"%(accuracy))

c_m_analysis(y_test, testing_predictions, 0.5)
ROC (Receiver Operating Characteristic)
Plots tpr (vertical axis) against fpr (horizontal axis).
A good ROC curve bows toward the upper-left corner: the larger the area between the curve and the y = x diagonal (equivalently, the larger the area under the curve, AUC), the better the model.
from sklearn.metrics import roc_curve, auc
(fpr, tpr, thresholds) = roc_curve(y_test,testing_predictions)
area = auc(fpr,tpr)
plt.clf() #Clear the current figure
plt.plot(fpr,tpr,label="Out-Sample ROC Curve with area = %1.2f"%area)
plt.plot([0, 1], [0, 1], 'k')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Out sample ROC rocks versus mines')
plt.legend(loc="lower right")
plt.show()
Precision & Recall
We want both metrics to be as close to 1 as possible,
so the larger the area between the precision-recall curve and the x-axis, the better.
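No precision-recall plot appears above yet; here is a minimal sketch using sklearn's precision_recall_curve, reusing the same predictions (labels and title are illustrative):

from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, testing_predictions)
plt.clf()
plt.plot(recall, precision, label="Out-Sample Precision-Recall Curve")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Out sample precision versus recall, rocks versus mines')
plt.legend(loc="lower left")
plt.show()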
Both curves above also show how sensitive these metrics are to the threshold.
How do we decide the threshold value?
Based on the real-world costs involved: choose the threshold that minimizes total cost.
For example:
- Everything classified as a rock needs to be checked with a hand scanner at $200/scan.
- Everything classified as a mine needs to be defused at $1000 if it is a real mine or $300 if it turns out to be a rock.
#Compare the three costs: which threshold value minimizes the total cost?
tn, fp, fn, tp = confusion_matrix(y_test,get_classification(testing_predictions,.1)).ravel()
cost1 = (tn+fn) * 200 + 1000 * tp + 300 * fp
tn, fp, fn, tp = confusion_matrix(y_test,get_classification(testing_predictions,.5)).ravel()
cost2 = (tn+fn) * 200 + 1000 * tp + 300 * fp
tn, fp, fn, tp = confusion_matrix(y_test,get_classification(testing_predictions,.9)).ravel()
cost3 = (tn+fn) * 200 + 1000 * tp + 300 * fp
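The three costs above are computed but never printed or compared. A sketch (not in the original notes) that sweeps a grid of thresholds and reports the cheapest; passing labels=[0,1] keeps the matrix 2x2 even if some threshold predicts only one class:

costs = {}
for t in np.arange(0.05, 1.0, 0.05):
    tn, fp, fn, tp = confusion_matrix(y_test, get_classification(testing_predictions, t), labels=[0, 1]).ravel()
    costs[round(t, 2)] = (tn + fn) * 200 + 1000 * tp + 300 * fp  #same cost model as above
best = min(costs, key=costs.get)
print("Cheapest threshold: %.2f at $%d" % (best, costs[best]))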